The vast number of microbial sequences resulting from sequencing efforts using new technologies requires us to re-assess currently available analysis methodologies and tools. Here we describe trends in the development and distribution of software for analyzing microbial sequence data. We then focus on one widely used set of methods, dimensionality reduction techniques, which allow users to summarize and compare these vast datasets. We conclude by emphasizing the utility of formal software engineering methods for development of computational biology tools, and the need for new algorithms for comparing microbial communities. Such large-scale comparisons will allow us to fulfill the dream of rapid integration and comparison of microbial sequence data sets, in a replicable analytical environment, in order to describe the microbial world we inhabit.
Recent innovations in sequencing technologies have allowed microbial ecologists to advance from analyzing a few hundred sequences per study to hundreds of millions (••1, ••2). These quantitative differences in the amount of sequence data produce qualitative differences in the types of studies that can be performed. For example, ten years ago, characterization of a single clone library from a single body site in one subject represented a substantial advance in knowledge about the human body. A few years ago, quantifying interpersonal differences in one body site, e.g. the gut, represented a major advance (3, 4). Three years ago, a multi-site microbial scan of the body showed that the microbial communities living on the same person’s body are clearly separated by body site, primarily skin, mouth and stool (5). Now, with higher-throughput sequencing technologies, we can observe the dynamics of the human microbiota across multiple sites and individuals through time, demonstrating that our microbial guests are highly volatile day-to-day even in healthy adults (••6). These examples also illustrate the daunting analytical challenges that microbial researchers face in handling datasets of ever-increasing size. These challenges range from simply finding the right hypotheses to test, to finding the correct analytical tools and computational power to test them, to finding the methods for visualizing the key results. Here we review computational tools developed in the last three years and algorithms conceived over the last few decades, but only recently applied in microbial ecology; we conclude with suggestions for computational tool developers who wish to help the field continue its rapid pace of development over the next few years.
As 16S rRNA and shotgun metagenomic datasets grow dramatically, the need for easily accessible, well-documented and well-tested tools in the form of a pipeline becomes increasingly critical. In particular, the complexity of what is considered a “standard” analysis has increased rapidly, from small trees and pie charts to advanced analyses incorporating multivariate statistics, machine learning, and, increasingly, explicitly spatial and/or temporal analysis (Figure 1). These new challenges, and especially the need to integrate multiple tools, have forced researchers to move from ad hoc scripts developed in numerical computing environments like R (7) or MATLAB (8) to more general libraries that provide solutions to a specific research niche. Examples include vegan, which provides statistical functions for vegetation (and other) ecologists (9); ade4, which allows exploratory analyses for environmental sciences (10); and ape, which provides methods for phylogenetics and evolution (11); see Table 1. However, developing expertise in each package, formatting data appropriately, loading large datasets and transferring datasets among multiple packages can be time-consuming: for example, see the methods section and reference list of (12).
A more recent approach has been to develop pipelines that provide complete analysis solutions, combining many steps. For example, if a researcher is interested in analyzing microbial community data generated via high-throughput amplicon sequencing (such as SSU rRNA), going from files containing a hundred million sequences to a set of meaningful statistics and visualizations, one tactic is to create a single workflow solution like mothur (13), which provides one program for analysis (for a use case see (•14)); an inherent downside of this approach is the increased development time and support burden of a larger codebase, and the errors that can arise from reimplementing each specialized analysis step in a single tool. Another strategy is to wrap the original applications in a single package; for example, Quantitative Insights Into Microbial Ecology (QIIME) (••15) provides workflows by splitting the steps into fully transparent scripts (for a use case see (6)); the cost is that the user must track down and install the individual tools, but the user has substantially more control over the analysis and knows they are using “name-brand” software. Another solution is to create analytical web servers, like Visualization and Analysis of Microbial Population Structures (VAMPS) (16), which allows researchers to upload their 16S rRNA data for analysis and visualization (for a use case see (••17)), or the Metagenomics RAST (MG-RAST) server (18) for studies based on shotgun metagenomic sequence. However, web servers usually limit the control users have over their analyses, some analysis steps and methods are hidden when source code is not available, and the user must fully commit to these tools rather than inserting data at later stages or retrieving partial results.
A recent comparison of pipelines for metagenomic annotation and analysis can be found in the supplementary material of SmashCommunity (•19), which is an open-source, local solution to some of these problems; see Table 1. Open-source software, where the source code is available for download, is critical for research software in general, as investigators can then check the correctness of the algorithms and make improvements.
The newest approach is to use virtual instances, either by virtualizing in a single computer (e.g. VirtualBox (https://www.virtualbox.org/) or VMWare (http://www.vmware.com/)), where resources are shared within a local machine (which can be a processing bottleneck), or virtualizing in the “cloud” (e.g. EC2 (http://aws.amazon.com/ec2) or Magellan (http://magellan.alcf.anl.gov)), where external resources are used, sometimes at cost. Both virtualization scenarios provide an environment to run virtual machines with preloaded operating systems and programs. For example, CLoVR (••20) can run several metagenomic analysis pipelines, and parallelizes some of these steps across virtual machines to speed up the analysis. Similarly, Galaxy (http://galaxy.psu.edu/) provides a web interface to create analysis pipelines, share them, and share data and results; see Table 1. Both resources are open source.
The QIIME pipeline in particular exemplifies several key software engineering methodologies. First, it is developed using agile software development techniques (21), which require constant interaction with end-users, rapid iterative development and updates, and simplicity of implementations and interfaces. QIIME also relies heavily on test-driven development (22), which is similar to the concept of positive and negative controls in lab research and reduces errors considerably. Furthermore, it is open source and distributes its software dependencies for a range of computational options, such as direct personal computer installation, virtual machine images for single-computer access via VirtualBox, and powerful cloud computing options such as EC2 and Magellan.
The democratization of sequencing technology allows researchers to sequence large numbers of samples from diverse environments (1, 2). Large-scale collaborative projects have taken advantage of this possibility. For example, the Human Microbiome Project (23) sampled 250 individuals 2–3 times, in 5 main sites (the GI tract, the mouth, the vagina, the skin, and the nasal cavity), and the Earth Microbiome Project (24) will sequence up to 200,000 diverse environmental samples. A new challenge generated by these types of projects is to compare not only large numbers of sequences but also large numbers of samples, and to relate the variation in these samples to key clinical or environmental parameters. Although, as outlined above, many ways of examining the data can be valuable, we focus here on dimensionality reduction, an especially useful technique for examining these multidimensional matrices that have more variables than samples. Dimensionality reduction often yields easily interpretable results, while reducing computational costs, relative to trying to understand large taxon tables (25, 26).
Dimensionality reduction techniques help us simplify data represented by a large number of features compared to the number of samples (25, 26). There are two general strategies: feature transformation, which calculates a lower-dimension projection of the original features while retaining as much information as possible, and feature selection, which minimizes the number of variables by locating the “best” minimum subset of the original features (25). The two strategies can also be combined (27). In general, feature transformation has been more widely applied in microbial ecology, even though the transformed features may have no biological meaning (25, •28); feature selection has primarily been applied, often informally, in source tracking and biomarkers (29, 30). Feature transformation can be performed using unsupervised methods (that use only the data matrix itself), including metric and non-metric multidimensional scaling (MDS), or by supervised approaches (that use information about the samples, e.g. clinical or environmental categories) such as Linear Discriminant Analysis (LDA) (25, 31); see Table 2. Both supervised and unsupervised techniques are susceptible to noise in the category labels, e.g. due to mislabeling of samples or contamination. As these issues are a fact of life in projects covering thousands of samples, tools such as SourceTracker (30), which can detect contamination and mislabeling, are increasingly useful.
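The contrast between the two strategies can be illustrated with a minimal sketch. It is not taken from any of the tools cited here: the table is synthetic, variance is used as a stand-in criterion for the “best” features, and the function names `select_features` and `transform_features` are hypothetical.

```python
import numpy as np

def select_features(table, k):
    """Feature selection: keep k of the ORIGINAL columns (here, the most variable ones).

    Variance is only an illustrative criterion; real studies may use other scores.
    """
    variances = table.var(axis=0)
    keep = np.sort(np.argsort(variances)[::-1][:k])
    return table[:, keep], keep

def transform_features(table, k):
    """Feature transformation: build k NEW axes as linear combinations of all columns."""
    centered = table - table.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

# Synthetic taxon table: 6 samples (rows) x 40 taxa (columns), more variables than samples.
rng = np.random.default_rng(0)
table = rng.poisson(lam=5.0, size=(6, 40)).astype(float)

selected, kept_columns = select_features(table, 3)   # columns are still real taxa
projected = transform_features(table, 3)             # columns are abstract axes
```

The selected columns remain interpretable as individual taxa, whereas each projected axis mixes all taxa, which is why transformed features may lack a direct biological meaning.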
One of the most commonly used dimensionality reduction techniques in microbial ecology is principal coordinates analysis (PCoA), also known as metric MDS. Principal components analysis (PCA) is a special case of PCoA that uses Euclidean distance as the dissimilarity measure (32). PCoA takes as input an n × n matrix of distances, generally the results of beta diversity comparisons between n samples in p-dimensional space (traits), although phylogenetic distances such as UniFrac (33) can also be used. It produces a k-dimensional, k ≤ p, representation of the items such that the distances among the points in the new space preserve as closely as possible the distances in the original data (26). In other words, points that are close in the original space are also close in the new space. Results of MDS are indeterminate with respect to translation, rotation, and reflection; in other words, the direction of each axis is arbitrary, although typically the axes are chosen to maximize the variation in the data. PCoA can be used with any dissimilarity metric (beta diversity): for current best practices for non-phylogenetic metrics see (28), and for phylogenetic metrics see (34).
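The PCoA computation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the classical double-centering and eigendecomposition approach, not the implementation used by any particular package; the function name `pcoa` and the toy data are assumptions made for the example.

```python
import numpy as np

def pcoa(distances, k=2):
    """Classical PCoA sketch: embed an n x n distance matrix in k dimensions."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram (inner-product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Solve the eigenvalue problem and keep the k largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:k]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale eigenvectors; negative eigenvalues (non-Euclidean distances) are clipped to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals

# Toy input: Euclidean distances among the corners of a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, eigvals = pcoa(dist, k=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the toy distances are Euclidean and two-dimensional, the recovered distances match the input, while the coordinates themselves may be rotated or reflected relative to the original points, which illustrates the indeterminacy noted above.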
PCA and PCoA rely on solving the eigenvalue equation to find a linear representation of our samples, combining the original variables to generate the k-dimensional representation of the data (32). Another approach that can reduce certain artifacts, such as the horseshoe effect (a pattern in which the two ends of an axis attract each other due to a shared lack of the taxa in the middle, thus obscuring the gradient pattern), is to use nonlinear methods (35), such as non-metric MDS (NMDS). NMDS can better preserve the high-dimensional structure with few axes in some cases, although it cannot fully avoid the arch effect in realistic microbial datasets (28). The main difference between PCoA and NMDS is that the former is based on distances, so the final configuration should match the original distances as closely as possible, whereas the latter is based on ranks, which makes it robust to distribution effects; this is similar to the difference between Pearson and Spearman correlations (36). One drawback of NMDS is that it is not based on an eigenvalue solution but on numerical optimization: for larger datasets, the calculations become time-consuming; see Table 2.
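The analogy with Pearson and Spearman correlations can be made concrete with a small sketch on synthetic values. The helper names are hypothetical, and the `ranks` function assumes there are no tied values. A monotone but strongly nonlinear distortion of the dissimilarities changes the distance-based (Pearson) agreement but leaves the rank-based (Spearman) agreement untouched, which is the property a rank-based method relies on.

```python
import numpy as np

def ranks(values):
    """Rank values from 0 to n-1 (assumes no ties, which holds for the demo data)."""
    order = np.argsort(values)
    out = np.empty(len(values), dtype=float)
    out[order] = np.arange(len(values))
    return out

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

rng = np.random.default_rng(1)
distances = rng.uniform(0.1, 1.0, size=15)   # synthetic pairwise dissimilarities
warped = np.exp(5.0 * distances)             # monotone, strongly nonlinear distortion

pearson_r = pearson(distances, warped)       # reduced by the distortion
spearman_r = spearman(distances, warped)     # unchanged, because rank order is preserved
```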
Because even PCoA is slow on large datasets, integrating new samples rapidly into large existing datasets poses a major algorithmic challenge. Such techniques are critical for integrating results from new studies, e.g. new environments or patient populations, into large-scale datasets such as those provided by the Human Microbiome Project (23) or Earth Microbiome Project (24). There has been substantial recent improvement in the performance of approximate algorithms for PCoA. For example, Nyström techniques such as FastMap, which uses a mapping technique to derive the k-dimensional representation, are linear-time algorithms rather than quadratic like PCoA (i.e. the time increases in proportion to the number of samples rather than to the square of the number of samples) (37). MetricMap expands FastMap to assess many projections at once, whereas FastMap calculates one dimension at a time (38). Landmark MDS (LMDS) uses a small number of landmark points, either manually or randomly selected, to derive new coordinates (39); see Table 2. For a performance comparison of these methods see (40). The accuracy of these techniques has been assessed by methods that determine how much of the variance is explained by the new set of axes (R²) or how much the distances change in the low-dimensional projection (Kruskal stress). The inherent problem with these measures of accuracy, however, is that they do not relate well to clustering quality or to the ability to interpret the patterns in the data (as has been previously observed for different distance metrics, where the metric that explains most of the variance may produce results that have no biological meaning (28)). Thus improved, and biologically informed, evaluations of these methods are a key area of current interest.
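The landmark idea can be sketched as follows, under stated assumptions: landmarks are embedded with classical MDS, and every sample is then placed by distance-based triangulation against the landmarks, so only an n × m block of distances (n samples, m landmarks) is needed. The function names, the random landmark choice, and the synthetic data are illustrative; the stress function is a simplified raw-distance variant that omits the monotone regression step of Kruskal stress.

```python
import numpy as np

def landmark_mds(landmark_dist, to_landmark_dist, k=2):
    """Landmark MDS sketch.

    landmark_dist:    (m, m) distances among the m landmarks.
    to_landmark_dist: (n, m) distances from each of n samples to each landmark.
    Assumes the k leading eigenvalues of the landmark Gram matrix are positive.
    """
    sq = np.asarray(landmark_dist, dtype=float) ** 2
    m = sq.shape[0]
    centering = np.eye(m) - np.ones((m, m)) / m
    gram = -0.5 * centering @ sq @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:k]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Rows are eigenvectors scaled by 1/sqrt(eigenvalue): the triangulation operator.
    pseudo_inverse = eigvecs.T / np.sqrt(eigvals)[:, None]      # shape (k, m)
    mean_sq = sq.mean(axis=0)                                   # column means, shape (m,)
    delta = np.asarray(to_landmark_dist, dtype=float) ** 2      # shape (n, m)
    return -0.5 * (delta - mean_sq) @ pseudo_inverse.T          # shape (n, k)

def raw_stress(original, embedded):
    """Simplified stress: relative change in pairwise distances after embedding."""
    iu = np.triu_indices_from(original, k=1)
    diff = original[iu] - embedded[iu]
    return float(np.sqrt((diff ** 2).sum() / (original[iu] ** 2).sum()))

# Synthetic samples that truly live in 2-D, so a 2-D embedding can be exact.
rng = np.random.default_rng(2)
points = rng.normal(size=(50, 2))
full = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

landmarks = rng.choice(50, size=8, replace=False)               # randomly selected landmarks
coords = landmark_mds(full[np.ix_(landmarks, landmarks)], full[:, landmarks], k=2)
embedded = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
stress = raw_stress(full, embedded)
```

In this idealized case the stress is close to zero; with real beta diversity matrices the result will depend on which landmarks are chosen, and, as noted above, a low stress does not by itself guarantee biologically interpretable patterns.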
We are currently faced with daunting bioinformatics and computational challenges because of the large numbers of sequences and samples now examined in microbial ecology studies, which require the use of defined software engineering methods to create pipelines that are user-driven and well-tested. Although these pipelines integrate many different techniques for visualizing and understanding data, dimensionality reduction techniques such as PCoA have proven especially valuable for understanding patterns in the data. However, these techniques are reaching their limits as very large numbers of samples are analyzed, and large-scale ongoing studies could reach a processing bottleneck because these methods do not scale linearly with the number of samples; approximate algorithms, which can be much faster, provide a way out of this conundrum, but could also create complications if research does not focus on the accuracy of the approximations. Thus, substantial additional work will be required in order to realize the dream of rapid integration of new samples into large existing frameworks that cover our bodies or our planet.
We thank Greg Caporaso, Jesse Stombaugh and Meg Pirrung for assistance in creating Figure 1, and Jessica Metcalf for helpful comments and edits on the manuscript. The work described in this review was supported by the National Institutes of Health, the Bill and Melinda Gates Foundation, the Crohn's and Colitis Foundation of America, the Colorado Center for Biofuels and Biorefining and the Howard Hughes Medical Institute.