Studies of complex microbial communities, including those found on and within humans (the human microbiome1) and those found in both natural and engineered environments, have been constrained by the enormous levels of diversity contained within these communities. The vast majority of this diversity cannot be observed using cultivation-based techniques2. However, recent advances in DNA sequencing technology such as pyrosequencing3 provide the opportunity to survey microbial diversity in unprecedented detail, through direct sequencing of the small-subunit ribosomal RNA (16S rRNA) gene. Hundreds of individual communities can now be analyzed simultaneously by coupling pyrosequencing with the use of error-correcting barcoded primers4, as has been demonstrated in a range of environments including rivers, the mammalian gut, multiple environments in the human body, soil, and the atmosphere1,4–6. Modern datasets from a single study may contain hundreds of thousands to millions of 16S rRNA sequences, drawn from hundreds of environmental samples. Such sequences are obtained without the biases inherent in culture-dependent methods, and typically include many sequences representing undescribed and uncharacterized species. The ability to obtain such extensive data relatively easily and cheaply has revealed important constraints in our ability to detect patterns in these increasingly large and complex datasets, and to relate such patterns to underlying biotic or abiotic variables.
The problems associated with assessing and explaining patterns in complex datasets are not unique to the field of microbiology. For example, plant and animal ecologists have developed a variety of strategies for the analysis of the relationships between individual biological communities17–21. The major goal of many of the techniques for the comparison of biological communities among samples is the identification of an environmental gradient (or gradients) instrumental in structuring community diversity, and/or the identification of factors that contribute to the clustering of compositionally similar communities. Several approaches exist for elucidating diversity relationships among samples, including cluster analyses (where samples are assigned to discrete groups), ordination methods (where samples are arranged in low-dimensional space), and explicit hypothesis testing methods (such as ANOVA and Mantel tests).
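To make concrete what it means for an ordination to arrange samples in a low-dimensional space, the following Python sketch computes pairwise dissimilarities between samples and then runs a classical principal coordinates analysis (PCoA), one common unconstrained ordination method. It is a minimal illustration, not the implementation used in this study: it assumes NumPy is available, Bray-Curtis is chosen only as a familiar example of a dissimilarity metric, and the function names `bray_curtis` and `pcoa` and the toy count table are hypothetical.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarities between rows (samples) of a count table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            total = (counts[i] + counts[j]).sum()
            diff = np.abs(counts[i] - counts[j]).sum()
            # Two empty samples are treated as identical (an arbitrary convention).
            dist[i, j] = dist[j, i] = diff / total if total > 0 else 0.0
    return dist


def pcoa(dist, n_axes=2):
    """Classical PCoA: eigendecomposition of the double-centered squared distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = centering @ (-0.5 * dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]    # keep the largest axes
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean dissimilarities can give negative eigenvalues; clip them to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals


# Hypothetical table: 4 samples (rows) x 5 taxa (columns).
table = [
    [30, 10, 0, 0, 0],
    [20, 15, 5, 0, 0],
    [0, 5, 20, 15, 0],
    [0, 0, 10, 20, 10],
]
distances = bray_curtis(table)
coords, eigvals = pcoa(distances)  # one row of coordinates per sample
```

Each row of `coords` gives the position of one sample on the first two ordination axes. Samples with similar composition should fall close together, which is the property that cluster and gradient detection rely on.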
Humans in particular host a wide variety of microbial communities: microbial cells outnumber human cells by an order of magnitude7, and microbial communities inhabiting different body habitats such as the mouth and the skin differ more from one another than do microbial communities inhabiting non-host-associated environments such as soil and water8. Microbial community composition has been associated with the health of the host, and variations in a host’s microbiome are linked to myriad disorders including obesity, vaginosis, and inflammatory bowel disease (IBD)1.
The interplay between environmental or host factors and microbial communities can be subtle and complex. However, many ecological systems are driven by environmental gradients; for example, pH has a major and consistent influence on soil microbial communities, whether traditional fingerprinting methods, such as denaturing gradient gel electrophoresis (DGGE) and restriction fragment length polymorphism (RFLP), or pyrosequencing analyses are used9. Whether equivalent gradients are found in human-associated body habitats is less clear. Meta-analysis of large numbers of hand and gut samples suggests that they might be, although larger numbers of subjects with more careful phenotypic characterization will be required to define the patterns10. Previous work on the efficacy of different methods for identifying gradients, although useful, has typically relied on simulated datasets that are far smaller in scale than those currently being collected by pyrosequencing11–13. Although environmental gradients in host-associated microbial communities have not been frequently described, datasets that demonstrate clusters or categorical differences between host-associated microbial communities are relatively common. For example, different samples collected along the distal gut in three humans cluster by subject14, mammalian fecal samples cluster by diet15, and fecal pellets of mice cluster by diet and physiological state16. Do the methods that generally work well for gradient analysis in ecological systems also work well for cluster detection?
We consider only ordination analyses here, as they have been most useful for revealing patterns in large-scale surveys (Supplementary Table 1). In addition, we chose to address taxon-based (non-phylogenetic) methods in this paper because modeling phylogenetic approaches requires substantial additional decisions about the phylogenetic tree and the rate of environment switching, which make it more difficult to isolate the effects of ordination methods from the effects of model parameters. Such phylogenetic methods and their utility have been discussed previously10,19,22. We also consider only unconstrained ordination methods. Constrained methods (or direct gradient analysis methods) such as Canonical Correspondence Analysis (CCA) are useful when investigating the effect of measured environmental variables (such as sample pH, host health, or sample location) on microbial species present in a sample; in these methods the ordination axes are constrained to represent linear combinations of the measured environmental variables. However, here we assess techniques based on their ability to correctly reveal the diversity patterns inherent in microbial community sequence data, regardless of whether the researcher measured the underlying environmental variables responsible for shaping the communities. Finally, it is worth noting that although ordination methods allow simultaneous display of samples and species (biplots), we display only the samples here, as identification of the specific taxa responsible for differentiating samples does not affect a method’s usefulness at revealing sample clusters or gradients.
The optimal analysis approach depends on factors such as the size of the expected effect, the number of samples, the number of sequences per sample, the degree of replication, and the environmental data available for the sample set. The analysis techniques we compared were Principal Components Analysis (PCA) on raw abundance data and on data subjected to chi-square, chord, Hellinger, and species profile transforms, together with Principal Coordinates Analysis (PCoA) and Nonmetric Multidimensional Scaling (NMDS) using each of the common dissimilarity metrics listed in Supplementary Table 2.
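For readers unfamiliar with the transforms applied before PCA, the sketch below writes out their usual textbook definitions in plain Python. It is offered as an illustration under stated assumptions rather than as the code used in this study: the function names and the toy table are hypothetical, samples are rows and taxa are columns, and every sample and every taxon is assumed to have a nonzero total.

```python
import math


def species_profile(row):
    """Relative abundances: each count divided by the sample total."""
    total = sum(row)
    return [x / total for x in row]


def hellinger(row):
    """Square root of the relative abundances."""
    return [math.sqrt(p) for p in species_profile(row)]


def chord(row):
    """Scale the sample vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]


def chi_square(table):
    """Chi-square transform; uses row totals, column totals, and the grand total."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    grand = sum(row_tot)
    return [
        [math.sqrt(grand) * x / (rt * math.sqrt(ct)) for x, ct in zip(r, col_tot)]
        for r, rt in zip(table, row_tot)
    ]


def euclidean(a, b):
    """Euclidean distance between two (transformed) sample vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical table: 3 samples (rows) x 4 taxa (columns).
table = [
    [30, 10, 0, 0],
    [20, 15, 5, 0],
    [0, 5, 20, 15],
]
hellinger_table = [hellinger(r) for r in table]
# PCA preserves Euclidean distances between rows, so PCA on the transformed
# table effectively ordinates samples by the distance the transform implies.
d_hellinger = euclidean(hellinger_table[0], hellinger_table[2])
```

The design point is that the transform, not PCA itself, determines which notion of distance between samples is displayed. Running PCA on raw counts preserves plain Euclidean distance, whereas running it on Hellinger-transformed counts preserves the Hellinger distance.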
To assess the performance of these various analysis techniques, we used real and simulated pyrosequencing datasets modeling different microbial communities that we suspect are either shaped by a gradient in environmental conditions or partitioned by environmental factors into distinct groupings, or clusters, of samples. We compared the performance of each analysis technique on real community data to its performance on simulated datasets in which the inherent gradients and clusters of communities are known a priori. By using these simulated datasets we were able to distinguish techniques that accurately reveal gradients and clusters inherent in the data from those that artificially generate patterns where none exist.
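One simple way to build a dataset with a gradient that is known a priori is sketched below in Python. It does not reproduce the simulation procedure used in this study. The Gaussian response-curve model, the function names, and all parameter names and default values (`n_samples`, `n_taxa`, `depth`, `width`, `seed`) are illustrative assumptions.

```python
import math
import random


def simulate_gradient(n_samples=10, n_taxa=20, depth=1000, width=0.15, seed=0):
    """Draw count tables for samples spaced evenly along a gradient on [0, 1]."""
    rng = random.Random(seed)
    optima = [k / (n_taxa - 1) for k in range(n_taxa)]          # taxon optima
    positions = [i / (n_samples - 1) for i in range(n_samples)]  # sample positions
    table = []
    for pos in positions:
        # Gaussian response: a taxon is most abundant near its own optimum.
        weights = [math.exp(-0.5 * ((pos - opt) / width) ** 2) for opt in optima]
        # Draw `depth` sequence reads, each assigned to a taxon by its weight.
        reads = rng.choices(range(n_taxa), weights=weights, k=depth)
        counts = [0] * n_taxa
        for taxon in reads:
            counts[taxon] += 1
        table.append(counts)
    return positions, table


def bray_curtis_pair(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


positions, table = simulate_gradient()
near = bray_curtis_pair(table[0], table[1])   # neighbouring samples
far = bray_curtis_pair(table[0], table[-1])   # opposite ends of the gradient
```

Because the position of every sample is known, an ordination of `table` can be scored by how well its first axis recovers `positions`. Running the same methods on a table with no built-in structure gives a check for spurious patterns.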