Challenges in Environmental ’Omics
Hardware
The size and nature of ’omics data mean that analysis often requires high performance computing (HPC) resources. This presents an inherent challenge, as using such resources requires a specific skill set that not all researchers will have (see ‘Skills’ below for more details).
A secondary hardware challenge is the rapidly changing HPC landscape. HPC architectures can vary widely between institutions. Although the basic skills needed to access them remain the same, the setup and execution of jobs may look quite different. Even within institutions, HPC setups mature and are replaced regularly as new technologies develop and the demand for resources grows. For example, the Biology department at the University of York has had access to three different setups (c2d2, YARCC and Viking) in the last nine years, with a new iteration (Viking2) currently in the works. This frequent turnover requires users to continually adapt their workflows to each new system.
Software
There are several issues surrounding the software used to analyse environmental ’omics datasets. Firstly, software tends to have a steep learning curve, requiring a substantial time investment from researchers. This investment does not always pay off: after the effort of learning a tool, the end result may still not be what is required.
Secondly, even if a piece of software does do what is needed, it is not guaranteed that it will be usable on the HPC architecture available. Installation of software is not always straightforward, if it is allowed in the first place. The rapid turnover and replacement of HPC architectures only serves to compound this problem, and the heterogeneity of HPC setups between institutions makes it difficult to find bespoke instructions for software installation.
The final, broader issue is access to learning resources and tutorials. Some popular, non-field-specific tools such as R or Python have countless online tutorials and instructions dedicated to their use, aimed at all levels of understanding. Others, especially more niche software programs, have very few resources. Those that do exist may be out of date, or assume a level of knowledge beyond that of most novices (for example, many documentation pages are entirely inaccessible to a newcomer). As new software emerges and supersedes previously popular programs, the lack of available help only worsens.
Skills
As previously mentioned, environmental ’omics analysis has a steep learning curve. A major challenge for many new researchers is acquiring unfamiliar skills such as using the UNIX command line, navigating file systems, writing shell scripts, managing dependencies and specifying resources for HPC. This is all before any specific pieces of software are involved, each of which will require its own set of skills and understanding.
These skills are required on top of the experimental design and data collection skills needed to generate datasets in the first place. Often those collecting data are the ones best placed to know how to interrogate it, as all experiments are different and bespoke analysis is crucial. This requires researchers to learn and juggle a large collection of skills, not all of which are immediately relevant to their chosen area of study.
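To make "specifying resources for HPC" a little more concrete, here is a minimal sketch of what a resource request can look like. It assumes the cluster uses the Slurm scheduler, which this article does not state, so treat that as an assumption; other schedulers use different directives. The `JobRequest` class, the `render_slurm_script` function, the job name and the resource values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Resources requested for one batch job (all values are illustrative)."""
    name: str        # label shown in the queue
    cpus: int        # CPU cores for the task
    memory_gb: int   # memory in gigabytes
    hours: int       # wall-clock time limit in whole hours
    command: str     # the analysis command to run


def render_slurm_script(job: JobRequest) -> str:
    """Return the text of a Slurm batch script for the given request.

    Assumption: the cluster runs Slurm. The #SBATCH lines tell the
    scheduler what to reserve before the command is allowed to start.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --cpus-per-task={job.cpus}",
        f"#SBATCH --mem={job.memory_gb}G",
        f"#SBATCH --time={job.hours:02d}:00:00",
        "",
        job.command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical request: 4 cores, 16 GB of memory, a 2-hour limit.
    request = JobRequest(
        name="qc_reads",
        cpus=4,
        memory_gb=16,
        hours=2,
        command="echo 'replace with the analysis command'",
    )
    print(render_slurm_script(request))
```

Even in this small sketch the researcher has to estimate cores, memory and run time before the analysis has ever been run, which is part of why this skill takes practice.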
Time
Finally, there are time investments involved in all of the above challenges. There is the ‘brain time’ involved in learning new skills, problem-solving and working with new software. Then, once an analysis is ready to run, it will take time to run. HPC resources are usually shared across many users, with jobs being added to a queue to run when resources are available. Analyses requiring large amounts of compute may be queued for days or weeks waiting for the required resources to become available. In addition, some analyses take a long time to run given the size of the datasets involved and the complexity of the analysis.
Once analysis is completed, time must be invested in interpreting and visualising the results. If parameters need to be adjusted following this, then the whole process must begin again. This makes optimisation of analysis so difficult and time-consuming that it may not happen at all.
At Cloud-SPAN, our goal is to help you overcome these challenges. Read more about the courses we offer, or take a look at our introductory ‘Prenomics’ course materials or specialised Genomics course to see how our training can equip you better!