HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Trends Microbiol. Author manuscript; available in PMC 2012 October 1.
Published in final edited form as:
PMCID: PMC3184378

Combined phylogenetic and genomic approaches for the high-throughput study of microbial habitat adaptation


High-throughput sequencing technologies provide new opportunities to address longstanding questions about habitat adaptation in microbial organisms. How have microbes managed to adapt to such a wide range of environments, and what genomic features allow for such adaptation? We review recent large-scale studies of habitat adaptation, with emphasis on those that utilize phylogenetic techniques. On the basis of current trends, we summarize methodological challenges faced by investigators, and the tools, techniques, and analytical approaches available to overcome them. Phylogenetic approaches and detailed information about each environmental sample will be critical as the ability to collect genome sequences continues to expand.

Setting the stage for high-throughput studies of habitat adaptation

We live in a world suffused with microbial life. Universal trees of life [1, 2] constructed by a variety of methods unambiguously show that bacteria, archaea, and microbial eukaryotes constitute the vast majority of life’s diversity. These diverse organisms perform many important ecological functions across a wide range of natural and man-made environments: photosynthesis in the world’s oceans [3]; nitrogen fixation and provision of carbohydrates in association with plant roots [4]; even modification of the chemistry of the upper atmosphere by communities in droplets of cloud-water [5]. The bodies of animals are also colonized internally and externally by microorganisms, which play crucial roles in the development [6], homeostasis [7], and behavior [8] of their hosts.

How have bacteria, archaea, and microbial eukaryotes adapted to survive and thrive across such a range of lifestyles and habitats? Understanding the relationship between microbial genome sequence and fitness in a given environment is both a fundamental question in evolutionary biology, and a matter of societal importance. As we seek to gain a predictive understanding of phenomena such as the emergence (or reemergence) of pathogens, the impact of human activities from agriculture to the combustion of fossil fuels on ecosystems, or the effects of dietary or medical interventions on human health (e.g. administration of anti- or probiotics), accurate descriptions of the mechanisms by which microorganisms have adapted to environmental changes in the past will provide critical guidance.

Traditionally, questions of microbial habitat adaptation have been addressed by experimental manipulation of microbes in pure culture, or by comparisons of genome sequences. More recently, however, large decreases in the cost of sequencing have allowed such approaches to be complemented by the collection of unprecedented quantities of 16S rRNA [9], metagenomic [10, 11], transcriptomic [12], and whole-genome data [13]. The ‘microbial data deluge’ has spurred the development of new computational tools, and has also made possible systematic study of large-scale processes such as habitat adaptation in ways that would have been previously intractable. Here we highlight how the increasing availability of sequence data from diverse environments is allowing researchers to systematically explore questions about the evolution of habitat adaptation in microbial genomes. We emphasize current trends in the use of tools and analytical approaches, highlighting those that have recently been applied to yield novel insights into this question (Table 1), as well as the outstanding methodological challenges that remain to be overcome.

Table 1
Links to software and resources discussed in the text

High-throughput studies of microbial habitat adaptation

It is now well established that the distribution of microbial organisms across different habitat types is correlated with their phylogeny, both in terms of the β diversity of microbial communities [14, 15] and the habitat range of individual lineages [16]. For example, 16S rRNA surveys clearly separate bacteria into host and free-living communities; planktonic saline and non-saline communities; and soil and sediment communities [14]. An association between habitat and phylogeny has also been detected in an analysis combining phylogenetically-informative marker genes identified in metagenomic studies, comparison of the isolation environment for cultured organisms, and 16S rRNA gene surveys [16]. These results suggest that microbial habitat preferences are fairly stable over evolutionary time. For example, we would not expect to see such patterns if horizontal gene transfer (HGT) were so rampant that all microorganisms were equally capable of adapting to a given environment (by rapid acquisition of the necessary genes from indigenous microbes). However, the observed correlation between phylogeny and habitat in microbial communities does not imply that habitat range for any individual organism can be perfectly predicted from phylogeny alone (nor does it contradict the observation of long tails of rare microbes in many samples [17]). Instead, this observation demonstrates that phylogenetic information can provide a useful first approximation for habitat range; probabilistic models for determining how accurately phylogeny (or gene repertoire) can predict microbial habitat range remain a topic for future research.

The adaptation of microbial taxa to different habitats or lifestyles is reflected in their genome sequences. Some of the best established examples of habitat adaptation identified in genomic studies include reduced genome size in intracellular endosymbionts [18] (Box 1), increases in genome size and the prevalence of two-component regulators in cosmopolitan organisms [19], increased acidic amino acids as a response to salinity ([20] and references contained therein), and increased rRNA copy number in fast-growing microorganisms ([21–24] and references contained therein). Additionally, numerous comparative genomic analyses have identified genomic changes associated with differences in habitat or lifestyle within specific taxa (see [25] and [26] for recent examples and [27] for a review).

Box 1. Genome reduction

Genome reduction is one of the best-studied examples of genome evolution as a habitat adaptation in microbial organisms. Genomic minimalism is typically associated with organisms living in a host-associated environment, either as endosymbionts or obligate parasites (e.g. [18, 82]), where increasing reliance on the host leads to loss of numerous pathways. The reduced genomes of the insect symbionts Buchnera (450 kb) and Carsonella (160 kb) have lost many biosynthetic pathways, but retain genes for amino acid biosynthesis, which forms the basis for their relationship with the host [18]. The extent of genomic reduction tends to increase as the length of the obligate relationship with the host increases, with the greatest reduction seen in the mitochondria and plastid organelles that have been stably incorporated in eukaryotic cells for more than 1 billion years and contain only a handful of genes [83]. Organelles also provide the most extreme example of eukaryotic genome reduction, in this case in the secondary plastids, which arose through the acquisition of a eukaryotic alga. Two lineages with secondary plastids, cryptophytes and chlorarachniophytes, still retain a relict nucleus of the secondary red or green algal symbiont, called a nucleomorph, that has undergone extreme genome reduction and appears to be on a path to complete loss [84].

The genomic trajectory of obligate intracellular parasites has followed a similar reductive path, with extensive loss and/or reduction in biosynthetic pathways that corresponds to an increased reliance on the host [82]. Many eukaryotic lineages have undergone large-scale genomic streamlining on becoming obligate parasites. The most extreme example is in microsporidia, a lineage of highly reduced fungi that are obligate intracellular parasites of diverse animals. The microsporidian Enterocytozoon bieneusi, an enteric pathogen in humans, has even lost the ability to synthesize its own ATP and instead has transporters to import ATP from its host [85]. Genomic reduction has also occurred in species of the highly abundant, free-living bacteria Pelagibacter and Prochlorococcus, where selection for efficient reproduction and/or reduced cell size is proposed to have driven streamlining of genomic content [86, 87]. In both Pelagibacter ubique and reduced strains of Prochlorococcus, loss of paralogous gene copies has been demonstrated to play a role in genome reduction [86, 87]. In Prochlorococcus strains, loss of entire gene families has also played a role in genome reduction [87], while in Pelagibacter few ancestral pathways have been lost [86]. Genome reduction in Pelagibacter has instead been achieved by a reduction in the length of intergenic regions (these regions have a median length of only 3 nucleotides), the elimination of phage genes and pseudo-genes, and loss of recently duplicated paralogs [86].

Finally, metagenomic surveys have also shed light on many important aspects of habitat adaptation. These include changes in the aggregate functional profiles of microbial communities along gradients of depth [28], across diverse habitat categories [29], or between oligotrophic and copiotrophic communities [21].

Increasingly, research into microbial habitat adaptation is successfully leveraging publicly available genome, marker gene, and metagenome sequence data to contextualize new findings. Specifically, several recent studies of microbial co-occurrence [30], habitat adaptation [31, 32], survival strategy [21], and genome evolution [33, 34] have combined phylogenetic and genomic or metagenomic information to better understand microbial habitat adaptation. Such studies have converged on related strategies, and faced common challenges. Based on these trends, we discuss a generalized workflow for comparative analysis (Figure 1), including the challenges involved in matching sequenced genomes to habitat assignments, determining which environmental parameters are most likely to be relevant for an analysis, separating the effects of habitat adaptation from those of shared evolutionary history, and detecting horizontal gene transfer.

Figure 1
Recurrent themes in the analysis of microbial habitat adaptation

Challenges in defining microbial habitat range

In order to understand how microbial genomes change in response to environmental adaptation, comparative genomics approaches to habitat adaptation require an operational definition for environment, and a way to relate individual microbial genomes and the environments to which they are adapted (Figure 1). In the future this problem may be resolved by single-cell genomics: careful selection of a range of environments, followed by the sequencing of large numbers of phylogenetically representative complete genomes directly from those environments, would provide an unambiguous association between individual sequenced microorganisms and their habitat. In practice, however, this is not yet attainable on a large scale, although substantial progress is being made in techniques for obtaining genome sequences from single cells [35, 36]. Direct assembly of genomes from deep metagenomic sequencing provides a similarly direct connection between genome and environment [37]. However, the assembly of complete genomes from metagenomic data is limited both because it can be difficult to obtain sufficient coverage for complete assembly in many complex communities, and because of the potential for chimeric assemblies. Thus, many comparative genomics approaches currently rely on proxy information about an organism’s habitat range. Common proxy approaches for determining a sequenced microorganism’s habitat range include annotating environment based on the original isolation source (for cultured organisms), the reported collection site for environmental studies, or database annotations based on one of these approaches. Annotating habitat from the source of the isolate is limited both by cultivation bias (the organisms that grow best in culture often represent a non-random subset of environmental diversity [38]), and because many organisms, especially those abundant in individual samples, are ‘cosmopolitan’ and can inhabit a variety of environments [39].
Careful surveys of the literature can be very useful in establishing a broader sense of the set of environments with which a sequenced organism must contend, but such surveys are laborious and are limited to the lineages actually discussed. An emerging alternative approach is to search community (marker gene and/or metagenomic) survey data for close relatives of sequenced genomes. Such an approach has the advantage that it can be conducted in a relatively unbiased manner, and can associate sequenced organisms with the environmental samples in which their close relatives are found. As with annotations based on isolation source, however, care must be taken, as some organisms present in samples might simply be ‘passing through’, or might be contaminants (Box 2). As databases of 16S rRNA and metagenomic community surveys accumulate, automated methods for surveying the habitat range of microbial taxa (see e.g. [32, 40]) using community surveys should become increasingly effective. However, this improvement is entirely dependent on the consistency and quality of the contextual information associated with marker gene and metagenomic community surveys.

Box 2. Source/sink dynamics

A key difficulty in attempts to map the habitat range of an organism using (metagenomic or marker gene) community surveys is that the presence of a microbe in an assemblage is not proof that the organism is adapted for life there. If a productive (source) and an unproductive (sink) environment are linked by high rates of migration, even relatively abundant organisms in the unproductive environment can be maintained primarily by migration from the source, rather than reproduction in the sink [88]. Such source/sink dynamics have been extensively documented in the ecology of micro- and macroscopic organisms [88, 89] and are likely to play important roles in many microbial communities. For example, microbial assemblages from the human gut may contain transient populations of microorganisms associated with ingested food or the mouth community, in addition to the indigenous community. The complexities presented by source/sink dynamics are compounded by the prevalence of dormancy in microbial populations [90], which can increase the ability of microbes to emigrate to, and persist in, marginal habitats. Currently available techniques for minimizing the effect of source/sink dynamics when annotating habitat range from community surveys include requiring the presence of an OTU across multiple samples; considering the relative presence of an organism in a habitat as a proportion of its total abundance across all environments; and experimentally comparing rRNA and rDNA ratios to test for metabolism in the sample, which can indicate the presence of live, actively transcribing organisms as opposed to just their DNA. One additional recent approach to this problem involves new algorithms for tracking recent migration from a source environment [91]. This approach can also detect laboratory contamination, which can lead to inappropriate conclusions about cosmopolitanism (see [39] and references contained therein). However, accurate techniques for inferring microbial habitat adaptation (fitness in a particular habitat, rather than merely presence) from community surveys remain a topic where further development is needed.
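The first two filters listed in Box 2 (presence across multiple samples, and abundance in a habitat as a proportion of total abundance) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the thresholds, the OTU labels and the counts are all hypothetical, and the counts are assumed to have already been normalized to a common sequencing depth.

```python
from collections import defaultdict


def habitat_candidates(counts, habitat_of, min_samples=2, min_fraction=0.5):
    """Assign OTUs to habitats where they are both prevalent and concentrated.

    counts:     {sample: {otu: count}} (assumed depth-normalized)
    habitat_of: {sample: habitat label}
    Returns {otu: sorted list of habitats passing both filters}.
    """
    reads = defaultdict(lambda: defaultdict(int))       # otu -> habitat -> total count
    prevalence = defaultdict(lambda: defaultdict(int))  # otu -> habitat -> samples present
    for sample, otus in counts.items():
        habitat = habitat_of[sample]
        for otu, n in otus.items():
            if n > 0:
                reads[otu][habitat] += n
                prevalence[otu][habitat] += 1

    assigned = {}
    for otu, by_habitat in reads.items():
        total = sum(by_habitat.values())
        passing = sorted(
            habitat for habitat, n in by_habitat.items()
            if prevalence[otu][habitat] >= min_samples and n / total >= min_fraction
        )
        if passing:
            assigned[otu] = passing
    return assigned


# Hypothetical survey: otuA is a gut resident that appears once, at low
# abundance, in soil; otuB shows the reverse pattern.
counts = {
    "gut1": {"otuA": 90, "otuB": 5},
    "gut2": {"otuA": 80, "otuB": 0},
    "soil1": {"otuA": 2, "otuB": 60},
    "soil2": {"otuB": 70},
}
habitat_of = {"gut1": "gut", "gut2": "gut", "soil1": "soil", "soil2": "soil"}
assignments = habitat_candidates(counts, habitat_of)
```

Filters of this kind only reduce the influence of transient populations. They do not measure fitness in a habitat, which is the limitation the box itself points out.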

Metadata annotation

The rapid accumulation of studies encompassing thousands of samples and billions of sequences has the potential to allow myriad new insights through comparative analysis. However, in order to maximize this potential, accurate contextual information about the samples (often called ‘sequence metadata’) is an increasingly important consideration (Figure 1, Topic 1). In addition to the existence of metadata, the form of that metadata is crucially important: if data are not consistently annotated in a standardized machine-readable format, large-scale comparative analyses become difficult or impossible. The utility of datasets for comparative analysis is thus frequently limited by the quality of metadata reported for the sampled environment. Such limitations can be introduced during data collection, data encoding, or data reporting. During data collection, datasets are often limited by reporting only those physical, chemical or geographic parameters relevant to the particular hypothesis at hand (even if other parameters were collected). A lack of widely adopted standards for encoding the metadata that describes samples also presents significant challenges for comparative analyses. Differences in annotation can range from relatively simple (the use of different names or abbreviations to represent the same body site), to very challenging (differing definitions of environment types). Another limitation occurs during publication: although journals require that sequence data be made publicly available, similar requirements have not been enforced for sample metadata.

In order to address these issues, many new sequencing efforts are now adopting the minimal information about any (x) sequence (MIxS) standards, which were proposed by the Genomic Standards Consortium [41]. The MIxS standard encapsulates three metadata-compliant data types: the minimal information about a genome sequence, about a metagenome sequence [42], and about a marker gene sequence [41]. These standards require researchers to supply their metadata using controlled vocabulary terminology and ontological values, which will greatly benefit cross-study comparisons. Due to the adoption of such standards, some databases are also starting to require MIxS-compliance during metadata submission. These include the Metagenomics RAST Server (MG-RAST) [43], the Human Microbiome Project (HMP), the Earth Microbiome Project (EMP) [44] and the QIIME Database.
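To make the value of a controlled vocabulary concrete, the following Python sketch maps free-text body-site labels onto a single term and reports missing fields. It is a toy example rather than an implementation of MIxS: the synonym table, the list of required fields and the sample records are hypothetical placeholders, not entries taken from the MIxS checklists.

```python
# Hypothetical synonym table: free-text labels mapped to one controlled term.
# These terms are placeholders, not entries from the MIxS checklists.
BODY_SITE_SYNONYMS = {
    "stool": "feces",
    "fecal": "feces",
    "faeces": "feces",
    "feces": "feces",
    "mouth": "oral cavity",
    "oral": "oral cavity",
    "oral cavity": "oral cavity",
}

# Hypothetical minimal field list used only for this sketch.
REQUIRED_FIELDS = ("sample_id", "body_site", "collection_date")


def normalize_record(record):
    """Return (clean_record, problems) for one sample's metadata dict."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS
                if not record.get(field)]
    clean = dict(record)
    raw_site = str(record.get("body_site", "")).strip().lower()
    if raw_site in BODY_SITE_SYNONYMS:
        clean["body_site"] = BODY_SITE_SYNONYMS[raw_site]
    elif raw_site:
        # Unknown labels are reported rather than silently guessed.
        problems.append(f"unrecognized body_site: {raw_site}")
    return clean, problems


records = [
    {"sample_id": "S1", "body_site": "Stool", "collection_date": "2010-05-01"},
    {"sample_id": "S2", "body_site": "faeces", "collection_date": "2010-05-02"},
    {"sample_id": "S3", "body_site": "left palm"},
]
results = [normalize_record(r) for r in records]
```

After normalization the first two samples share the same body-site term and can be pooled in a cross-study comparison, while the third is flagged for manual curation.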

Ordination methods

When investigating habitat adaptation in microbes, it is crucial to first have a baseline understanding of how microbial communities vary across environmental samples (microbial β diversity), and the main factors that drive such variation (Figure 1, Topic 2). Ordination methods have been widely and fruitfully applied to address these questions. By assessing the microbial composition of each microbial community, ordination methods allow an assessment of the extent to which communities are partitioned into distinct clusters, or arrayed along a continuous gradient based on environmental factors (see [45] for a survey of ordination methods).
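As a minimal sketch of what an ordination does, the Python below computes Bray-Curtis dissimilarities between samples and extracts the first axis of a principal coordinates analysis (PCoA) by power iteration. The choice of Bray-Curtis and PCoA is an assumption made for illustration (the studies cited here use a variety of distance measures, including phylogenetic ones), and the sample names and counts are invented.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / total if total else 0.0


def first_pcoa_axis(dist, iterations=200):
    """Sample coordinates on the first principal coordinate of a distance matrix."""
    n = len(dist)
    # Gower centring: B = -1/2 * J * D^2 * J
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    # Power iteration converges to the dominant eigenvector of B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return [0.0] * n
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [x * math.sqrt(max(eigenvalue, 0.0)) for x in v]


# Hypothetical OTU counts for two saline and two freshwater samples.
samples = {
    "saline_1": [30, 25, 0, 1],
    "saline_2": [28, 30, 1, 0],
    "fresh_1": [0, 2, 40, 20],
    "fresh_2": [1, 0, 35, 30],
}
names = list(samples)
dist = [[bray_curtis(samples[p], samples[q]) for q in names] for p in names]
axis1 = dict(zip(names, first_pcoa_axis(dist)))
```

In this toy data set the two saline samples fall on one side of the first axis and the two freshwater samples on the other. A full analysis would use an established package, examine several axes, and report the variance explained by each.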

Ordination analyses performed on microbial community composition data acquired via sequencing of the gene encoding the small subunit ribosomal RNA have been used to distinguish microbial communities, and to identify environmental factors that contribute to both large and small-scale differences between communities. For example, Lozupone et al. [46] found a clear split between saline and non-saline environments among non-host associated microbial communities. King et al. combined ordination techniques with biogeography to demonstrate the dominant role of pH, plant abundance and snow depth in shaping the microbial communities found in alpine soil and to build global distribution models for microorganisms in this habitat [47], and Fierer et al. used 16S rRNA composition data to show that microbial communities on individuals’ hands were far more similar to the communities on their computer keyboards than they were to communities from other individuals’ hands [48].

A community-wide perspective on the factors structuring microbial diversity can also be obtained from shotgun metagenomic data. DNA or RNA sequences from random locations on the genomes of many microbes in a community can be assigned to functional (or other) categories, and again ordination methods can be applied to the resulting data. The (dis)agreement between 16S rRNA data and metagenomic data can then be visualized and quantified via Procrustes analysis, which compares the similarity of pairs of ordinations (see [9] for an example of applying this technique to the 5’ and 3’ paired-end reads of the same rRNA molecules in environmental samples). Such comparisons are one method of determining, at the community level, the degree to which the pool of functional genes in a microbial assemblage is predictable from phylogeny (relative to other reference communities). An unusual degree of difference between phylogeny and gene content may be a biologically interesting signature of competition or functional convergence [32].

Finally, ordination methods can help to inform high-throughput studies of microbial habitat adaptation by determining which environmental parameters are most important in structuring community diversity. Objective methods for defining relevant metadata parameters and defining working habitat categories are crucial, because many studies rely heavily on the lifestyle or habitat categories defined in a small number of online databases (primarily NCBI [49] and GOLD [50]) to test comparative genomic hypotheses. Careful refinement of these categories and addition of more detailed subcategories (based in part on the results of ordination techniques) would yield rapid dividends in comparative analysis.

Application of machine learning techniques

Machine learning techniques hold promise for relating gene functions to habitat distributions (Figure 1, Topic 7). These techniques have been used in taxonomic classification of metagenomic data and many other problems in bioinformatics. Although their application to classification and clustering of microbial communities by habitat is relatively new [51], machine learning techniques have been applied extensively to habitat classification in microarray data [52]. This emerging approach has been successful for classifying microbial communities across a number of different habitat types. For example, Muegge et al. [53] used a nearest-neighbor approach to demonstrate that phylogenetic characterizations of microbial communities can be used to predict metagenomic profiles of those communities. Werner et al. used supervised classifiers to identify a small subset of operational taxonomic units (OTUs) that were highly predictive of the type of bioreactor in brewery wastewater-treatment systems [54]. Supervised classifiers have also recently been applied to source tracking of fecal contamination in water supplies [55].

The primary purpose of supervised machine learning in the context of microbial habitat adaptation is to build predictive models of the differences between habitats. A supervised classifier takes as its input a set of biological samples (training data) characterized by, for example, observations of OTUs, or counts of gene categories, along with metadata identifying the source habitats of those communities. The output is a model designed to predict the source habitat for novel biological samples not included in the training data, and an estimate of the expected future accuracy of the model. In many cases the classifier will also report a measure of the predictive capability of each of the predictor variables (e.g. gene categories). One of the main advantages of machine learning techniques is that they are designed to discover general trends present in the training data even when the number of predictor variables is much larger than the number of samples, while avoiding overfitting. This is a challenging task, however, in data as sparse and high-dimensional as microbial community surveys or metagenomic analyses, and exceptional caution must be taken to avoid overestimating the future accuracy of supervised classifiers with small sets of training data [56]. Novel techniques might also need to account for the compositional nature of metagenomics data; for example, changes in a dominant community member could introduce spurious correlations between minor members [57]. Nonetheless, one exciting direction is that once sufficient genomes linked to environmental samples have been collected, machine learning techniques will be ideal for understanding which genes, regulatory structures, or other properties of the genome are specifically associated with presence in an environment, especially when combined with the phylogenetic methods discussed in the next section.
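The input, output and accuracy estimate described above can be illustrated with a deliberately simple classifier. The Python sketch below uses a one-nearest-neighbor rule with leave-one-out cross-validation; both choices are assumptions made for brevity and are not the specific methods of the studies cited here. The feature vectors stand in for relative abundances of three hypothetical OTUs.

```python
def nearest_neighbor_label(query, training):
    """Predict a habitat label from the closest training sample.

    training: list of (feature_vector, habitat_label) pairs.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(training, key=lambda item: squared_distance(query, item[0]))[1]


def leave_one_out_accuracy(samples):
    """Estimate future accuracy by predicting each sample from all the others."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        held_in = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(features, held_in) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical relative abundances of three OTUs in six communities.
samples = [
    ([0.70, 0.20, 0.10], "gut"),
    ([0.65, 0.25, 0.10], "gut"),
    ([0.75, 0.15, 0.10], "gut"),
    ([0.10, 0.30, 0.60], "soil"),
    ([0.15, 0.25, 0.60], "soil"),
    ([0.05, 0.35, 0.60], "soil"),
]
accuracy = leave_one_out_accuracy(samples)
prediction = nearest_neighbor_label([0.60, 0.30, 0.10], samples)
```

Holding each sample out in turn keeps the accuracy estimate from being computed on data the model has already seen. With training sets as small as this one the estimate is still highly uncertain, which is the caution raised in the text.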

Phylogenetic comparative methods

Once habitats have been assigned to organisms, relating genome properties to habitats is still challenging. Because all organisms share a common ancestry, each genome sequence cannot be counted as an independent observation when conducting statistical analyses, including machine learning techniques. Instead, the evolutionary history that relates organisms must be taken into account [58] (Figure 1, Topic 4). The importance of this well established, but often ignored, principle is illustrated in Figure 2. Phylogenetic comparative methods are of particular relevance to microbial ecologists because the organisms selected for genome sequencing are not distributed across the tree of life evenly (although efforts are underway to ameliorate this problem [13]). This sequencing bias exacerbates the problems of interpretation introduced when traits are correlated with phylogeny.

Figure 2
The importance of phylogenetic correction in comparing traits across habitats

Recent investigations of microbial adaptation to the human gut [32], global co-occurrence patterns [30], and genomic changes associated with growth rate [21] have investigated phylogenetic patterns by plotting relevant traits against phylogenetic distance, and found useful information in both trends that can largely be explained by phylogeny (e.g. similarity in GC content [21, 30]) and those that can only be partially explained by phylogeny (e.g. gene content during adaptation to life in the gut [32]; and gene content and genome size in co-occurring organisms [30]). Other studies have employed rarefaction, in which data are evened out across categories by discarding members of overrepresented taxa. Rarefaction can provide a useful check on the effects of oversampled taxa, but suffers from the obvious drawback that it frequently discards a large portion of the data, and is limited by the least sampled taxon. Nonetheless, the utility of relatively unsophisticated methods such as rarefaction and regression against phylogenetic distance suggests that inclusion of more formal analyses of phylogenetic signal (for example, phylogenetic independent contrasts [59] and phylogenetic generalized least squares), along with reconstructions of ancestral states (Box 3), could play an important role in future studies of microbial habitat adaptation. The development [59, 60] and testing [61, 62] of phylogenetic comparative methods for quantitative traits, as well as software packages [63, 64] to make such methods easily accessible, are active areas of research, but many tools already exist for estimating these traits without phylogenetic bias (Table 1), and these should be applied in microbial studies.
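For readers unfamiliar with phylogenetic independent contrasts, the following Python sketch implements the standard recursive calculation on a small, fully bifurcating tree with known branch lengths. The tree, the tuple-based tree encoding, the trait names and the trait values are hypothetical and chosen only to keep the example short; in practice one of the packages listed in Table 1 would be used.

```python
import math


def independent_contrasts(node, trait):
    """Return (node value, adjusted branch length, standardized contrasts).

    A leaf is (name, branch_length); an internal node is
    (left_child, right_child, branch_length).
    """
    if len(node) == 2:
        name, length = node
        return trait[name], length, []
    left, right, length = node
    x1, v1, c1 = independent_contrasts(left, trait)
    x2, v2, c2 = independent_contrasts(right, trait)
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Ancestral value: mean of the daughters weighted by inverse branch length.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # The branch is lengthened to reflect uncertainty in the estimated value.
    adjusted = length + v1 * v2 / (v1 + v2)
    return value, adjusted, c1 + c2 + [contrast]


def correlation_through_origin(xs, ys):
    """Correlation of two sets of contrasts, computed without centring."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))


# Hypothetical tree ((A:1,B:1):10,(C:1,D:1):10) and two hypothetical traits.
tree = ((("A", 1.0), ("B", 1.0), 10.0), (("C", 1.0), ("D", 1.0), 10.0), 0.0)
genome_size = {"A": 1.0, "B": 2.0, "C": 9.0, "D": 10.0}
regulator_count = {"A": 2.0, "B": 1.0, "C": 10.0, "D": 9.0}

_, _, size_contrasts = independent_contrasts(tree, genome_size)
_, _, regulator_contrasts = independent_contrasts(tree, regulator_count)
r_contrasts = correlation_through_origin(size_contrasts, regulator_contrasts)
```

In this toy case the two traits look tightly linked across the four tips because the two clades differ in both, yet within each clade the sister taxa differ in opposite directions. The contrasts therefore give a weaker association than the raw tip values, which is the kind of discrepancy phylogenetic correction is meant to expose.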

Box 3. Ancestral state reconstruction

Reconstruction of ancestral states is a powerful tool to understand molecular and genomic evolution, which is increasingly being applied to the study of microbial habitat adaptation. Ancestral traits for a group of species can be inferred based on a phylogenetic tree, an alignment of the observed states, and a model of evolution of the character under study. By analyzing a character in a group of extant species, the most probable state the character had in the common ancestor of these species can be determined, thus identifying changes that have occurred since divergence. The ancestral sequence can be estimated by one of several methods, such as parsimony [92], maximum likelihood ([93] and references therein), or Bayesian inference ([94] and references therein). Selected tools for performing ancestral state reconstruction are listed in Table 1. For relatively recent evolutionary events, it is sometimes possible to infer probable gene sequences at ancestral nodes. The estimated sequence can then be synthesized, cloned into a vector that is transfected into a cell, and the expressed protein can subsequently be purified in order to study its properties. Based on this process, new insights into the evolution of dim-light vision [95] and steroid receptors [96] have been gained. In addition to inferring ancestral gene sequences, ancestral state reconstruction has also been applied to infer other traits, such as mitochondrial metabolism [97] and the content of genomes [98]. In the future, it seems likely that integrated studies of genomic evolution that include ancestral state reconstruction of genome contents, sequence-based analyses of selective pressure (e.g. via ratios of synonymous to non-synonymous nucleotide substitutions [99]), tests of the order of trait divergence [100] and detection of horizontal gene transfer could yield new insights into the evolution of microbial habitat adaptation.
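Of the methods mentioned in Box 3, parsimony is the simplest to illustrate. The Python sketch below performs the bottom-up pass of Fitch parsimony for a single discrete character, returning the set of equally parsimonious states at the root and the minimum number of state changes. The tree, the taxon names and the habitat labels are hypothetical, and a complete reconstruction would add a top-down pass to resolve the states at internal nodes.

```python
def fitch(node, states):
    """Bottom-up Fitch pass on a bifurcating tree.

    node:   a leaf name (str) or a (left, right) tuple
    states: {leaf name: observed state}
    Returns (set of candidate states at this node, minimum number of changes below it).
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_changes = fitch(node[0], states)
    right_set, right_changes = fitch(node[1], states)
    shared = left_set & right_set
    if shared:
        # The daughters agree on at least one state: no extra change is needed.
        return shared, left_changes + right_changes
    # The daughters disagree: keep both options and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical tree (((A,B),C),(D,E)) with a two-state lifestyle character.
tree = ((("A", "B"), "C"), ("D", "E"))
lifestyle = {"A": "host", "B": "host", "C": "free", "D": "free", "E": "free"}
root_states, changes = fitch(tree, lifestyle)
```

Here the most parsimonious explanation is a free-living ancestor with a single transition to host association on the branch leading to A and B. Likelihood and Bayesian methods differ in that they also use branch lengths and an explicit model of character change.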

Relating co-occurrence patterns to bacterial genomes

One way to understand potential interactions between organisms that might impact environmental distribution is through the application of co-occurrence analysis (Figure 1 Topic 3). For instance, species that support each other’s growth, such as in syntrophic relationships where one organism produces metabolites that are consumed by the other, would be expected to positively co-occur across samples. In contrast, species that competitively exclude each other (e.g. because of similar metabolic requirements) might negatively co-occur. Co-occurrence patterns, however, are confounded because both positive and negative associations can also be driven by environmental preferences [30, 65]. Additionally, differences in the depth of sampling between environmental isolates could obscure co-occurrence patterns, especially for rare taxa.

Combining co-occurrence studies with comparative genomics can clarify the biological properties that drive associations among microbes [30]. As an example, Chaffron et al. performed a global analysis of co-occurrence patterns using 16S rRNA surveys representing 3000 distinct sampling events for which sequence data was deposited in GenBank [30]. They then assessed the genomic properties of the subset of OTUs for which close relatives had genome sequences. Although some of the positive associations in the 16S rRNA OTU network reflected known or suspected syntrophic associations, such as a consortium involved in the anaerobic oxidation of methane, the general trends suggested that the major factor driving positive associations was shared environmental preference. Positively co-occurring OTUs were more phylogenetically related than random OTU pairs, extending to lineages that diverged up to 10% at the 16S rRNA level (these would typically be placed in different taxonomic families). Interestingly, positively co-occurring OTUs had more similar genome size, GC content, and relative coverage of KEGG functional pathways than random OTU pairs. Phylogeny could largely explain the high similarity in GC content, but not similarities in genome size and KEGG functional pathway coverage. Thus inhabiting the same environment could drive convergence of genome size and metabolic potential in divergent microbes [30].
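The notion of positive and negative co-occurrence can be made concrete with a small presence/absence calculation. The Python sketch below scores a pair of OTUs with the phi coefficient of their 2x2 presence table; this statistic is an assumption chosen for simplicity and is not necessarily the test used by Chaffron et al. The OTU names and presence vectors are invented.

```python
import math


def phi_coefficient(a, b):
    """Association between two presence/absence vectors (1 = present).

    Values near +1 indicate positive co-occurrence, values near -1 mutual
    exclusion, and 0 no association (or an OTU that never varies).
    """
    both = sum(1 for x, y in zip(a, b) if x and y)
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    neither = sum(1 for x, y in zip(a, b) if not x and not y)
    denominator = math.sqrt(
        (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    )
    if denominator == 0:
        return 0.0
    return (both * neither - only_a * only_b) / denominator


# Hypothetical presence/absence of three OTUs across six samples.
presence = {
    "methane_oxidizer": [1, 1, 1, 0, 0, 0],
    "syntrophic_partner": [1, 1, 0, 0, 0, 0],
    "competitor": [0, 0, 0, 1, 1, 1],
}
phi_partner = phi_coefficient(presence["methane_oxidizer"], presence["syntrophic_partner"])
phi_competitor = phi_coefficient(presence["methane_oxidizer"], presence["competitor"])
```

A positive score on its own cannot distinguish a direct interaction from a shared environmental preference, which is why the analyses described above go on to compare the genomes and phylogenetic relatedness of the co-occurring OTUs.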

Horizontal gene transfer

Ongoing studies have continued to document the important roles played by horizontal gene transfer (HGT) in microbial habitat adaptation (Figure 1 Topic 5). HGT can be detected by several methods: phylogenetic methods, which typically compare gene trees with a ‘species tree’; compositional methods, which analyze deviations in nucleotide, codon, or amino acid composition; or mobile-element methods, which search for specific genes or sequences associated with DNA mobility (see [66] for a review). Although there is ongoing controversy [67, 68] about the total extent of HGT, and the implications of HGT for microbial (especially bacterial and archaeal) phylogeny [69], it is increasingly clear that (i) HGT has played a major role in bacterial evolution, and (ii) trees of the universal or nearly-universal genes give the same overall phylogenetic pattern on average [68], implying that the extent of HGT is not so great that measures of vertical inheritance, such as 16S rRNA phylogenies, are meaningless. Several recent studies of HGT have therefore focused on separating the relative contribution of HGT (by conjugation, phage transduction, transformation, etc. [66]) and vertical descent (including gene loss, duplication, evolution of new gene families, and sequence divergence) to the evolution of gene content.
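The simplest compositional approach can be sketched as follows: compute the GC content of each gene and flag genes that deviate strongly from the genome-wide distribution. The Python below is a toy illustration; the z-score cutoff, the gene names and the very short sequences are hypothetical, and real analyses use full-length genes and richer signals such as codon usage.

```python
from statistics import mean, pstdev


def gc_content(sequence):
    """Fraction of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def atypical_genes(genes, z_cutoff=2.0):
    """Names of genes whose GC content lies more than z_cutoff SDs from the mean."""
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    values = list(gc.values())
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        return []
    return sorted(name for name, value in gc.items()
                  if abs(value - centre) / spread > z_cutoff)


# Hypothetical genome: six genes of ordinary composition and one GC-rich gene.
genes = {
    "gene1": "ATGCATGCAT",
    "gene2": "ATGCGTGCAT",
    "gene3": "ATGCATGCAA",
    "gene4": "ATGCCTGCAT",
    "gene5": "ATGAATGCAT",
    "gene6": "ATGCATGCGT",
    "gene7": "GCGCGGCCGC",
}
candidates = atypical_genes(genes)
```

An atypical composition is only a hint of foreign origin. Genes transferred long ago tend to drift toward the composition of the host genome, and some native genes are compositionally unusual, which is one reason the phylogenetic and mobile-element methods mentioned above are used alongside compositional ones.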

Schliep and colleagues [34] used information embedded in the set (or ‘forest’) of gene trees from 100 bacteria and archaea to identify sections of gene trees that were not consistent with vertical descent, but did correspond to lifestyle (e.g. ‘anaerobe’) or habitat (e.g. ‘soil’) features as derived from NCBI annotations. This analysis yielded sets of gene families that could be better explained by lifestyle or habitat annotations than by taxonomy (~19% of gene families analyzed for hyperthermophiles) as well as networks of gene exchange amongst taxa and clusters of genes that were gained or lost in association with lifestyle.
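The core idea of looking for parts of gene trees that conflict with vertical descent but agree with habitat can be sketched in a few lines of Python. This is a toy simplification and not the method of Schliep and colleagues [34]: trees are rooted nested tuples rather than the unrooted trees usually analyzed, a single reference ‘species tree’ is assumed, and the taxon names and habitat labels are hypothetical.

```python
def clades(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def habitat_clades(gene_tree, species_tree, habitat):
    """Return gene-tree clades that are absent from the species tree
    and whose members all carry the same habitat annotation."""
    reference = clades(species_tree)
    hits = []
    for clade in clades(gene_tree) - reference:
        labels = {habitat[taxon] for taxon in clade}
        if len(labels) == 1:
            hits.append((sorted(clade), labels.pop()))
    return sorted(hits)


# Hypothetical example: the species tree groups A with B and C with D,
# whereas the gene tree groups taxa by habitat instead.
species_tree = (("A", "B"), ("C", "D"))
gene_tree = (("A", "C"), ("B", "D"))
habitat = {"A": "soil", "B": "gut", "C": "soil", "D": "gut"}

print(habitat_clades(gene_tree, species_tree, habitat))
```

Applied across a whole forest of gene trees, counts of such habitat-consistent but taxonomy-inconsistent groupings would give a crude indication of which gene families track habitat. A careful analysis would also need to account for tree reconstruction error.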

David and Alm [33] used AnGST, a model that tests for gene duplication, gene loss, and HGT within a single framework, to reconstruct the evolutionary history of 3,983 gene families. The results implied an ‘Archaean expansion’ 3.33–2.65 billion years ago in which the number of gene families expanded by ~26% during a period of rapid diversification. By examining the timing of the expansion, and finding that the gene categories increasing during this event were primarily associated with redox and electron transfer (O2 binding, Fe binding, and Fe-S binding were the most enriched categories), David and Alm were able to connect this expansion to the ‘great oxygenation event’: a dramatic biotically-mediated event in Earth’s history, in which the production of oxygen by photosynthesis began to exceed buffering capacity and thus raise O2 levels in the atmosphere and ocean.

Algorithms that include a unified model of gene evolution hold great promise for the study of habitat adaptation in microbial genomes (Table 1). Separating genome evolution into vertical and horizontal components, and relating patterns in each to changes in habitat or lifestyle, are also promising avenues for future research.


The increasing availability of 16S rRNA and metagenomic community surveys, in combination with new genome sequences, provides novel opportunities to conduct large-scale studies relating the survival strategies of microbial organisms to their genomic features (Box 4). Using the structure of the tree of life will be essential in establishing baseline predictions for trait conservation given phylogeny, and thereby distinguishing novel adaptations to a particular habitat from traits preserved solely due to shared evolutionary history. Given this phylogenetic baseline, large collections of community surveys with backing metadata can be used to detect genomic variations associated with life in a range of environmental conditions. Statistical tools are now available for investigating adaptation along ecological gradients, detecting HGT, reconstructing the evolutionary history of genes involved in environmental adaptation, and inferring correlations in species abundance. A major challenge for future studies will be designing accessible, high-throughput pipelines that combine these tools to gain biological insight and generate testable hypotheses from the large-scale sequence collection efforts currently underway.

Box 4. Outstanding questions

  • What is the relative role of sequence change in existing genes vs. transfer of new genes in microbial habitat adaptation?
  • Given that microbes may be detected as present in an assemblage, but not genuinely adapted for life there (due to source/sink dynamics or laboratory contamination), how can we best tell which organisms are adapted?
  • How can we determine the order of adaptive events that permits colonization of a new environment?
  • Do adaptations to environments with similar features (e.g. the guts of various mammals) share similar features?
  • Can adaptation to other members of the community (e.g. syntrophy) be distinguished from shared adaptation to common abiotic factors present in the environment?
  • Given source/sink dynamics, what is the best (experimental or bioinformatic) measure of habitat adaptation?
  • If HGT has played an important role in microbial adaptation to a variety of environments, then what is the timing of specific genomic changes that allow for adaptation to a habitat relative to dispersal into a new habitat?
  • To what extent has HGT affected microbial eukaryotes?
  • What are the relative contributions of different gene transfer mechanisms to adaptive evolution across habitats?
  • Do adaptation to habitat (e.g. the human gut) and adaptation to other organisms co-occurring in a specific assemblage (e.g. syntrophy) show interchangeable genomic signals, or are these patterns distinct?


The authors would like to thank Mike Robeson for useful comments on the draft. The work from our laboratory described in this review was supported in part by the National Institutes of Health, the Crohn's and Colitis Foundation of America, the Bill and Melinda Gates Foundation, and the Howard Hughes Medical Institute.




1. Pace NR. A molecular view of microbial diversity and the biosphere. Science. 1997;276:734–740.
2. Ciccarelli FD, et al. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–1287.
3. Johnson ZI, et al. Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients. Science. 2006;311:1737–1740.
4. Lugtenberg B, Kamilova F. Plant-growth-promoting rhizobacteria. Annu Rev Microbiol. 2009;63:541–556.
5. Womack AM, et al. Biodiversity and biogeography of the atmosphere. Philos Trans R Soc Lond B Biol Sci. 2010;365:3645–3653.
6. Cheesman SE, et al. Epithelial cell proliferation in the developing zebrafish intestine is regulated by the Wnt pathway and microbial signaling via Myd88. Proc Natl Acad Sci U S A. 2011;108 Suppl 1:4570–4577.
7. Samuel BS, et al. Effects of the gut microbiota on host adiposity are modulated by the short-chain fatty-acid binding G protein-coupled receptor, Gpr41. Proc Natl Acad Sci U S A. 2008;105:16767–16772.
8. Sharon G, et al. Commensal bacteria play a role in mating preference of Drosophila melanogaster. Proc Natl Acad Sci U S A. 2010;107:20051–20056.
9. Caporaso JG, et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A. 2011;108 Suppl 1:4516–4522.
10. Qin J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
11. Peterson J, et al. The NIH Human Microbiome Project. Genome Res. 2009;19:2317–2323.
12. Stewart FJ, et al. Community transcriptomics reveals universal patterns of protein sequence conservation in natural microbial communities. Genome Biol. 2011;12:R26.
13. Wu D, et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009;462:1056–1060.
14. Lozupone CA, Knight R. Global patterns in bacterial diversity. Proc Natl Acad Sci U S A. 2007;104:11436–11440.
15. Ley RE, et al. Worlds within worlds: evolution of the vertebrate gut microbiota. Nat Rev Microbiol. 2008;6:776–788.
16. von Mering C, et al. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science. 2007;315:1126–1130.
17. Pedros-Alio C. Ecology. Dipping into the rare biosphere. Science. 2007;315:192–193.
18. Moran NA, et al. Genomics and evolution of heritable bacterial symbionts. Annu Rev Genet. 2008;42:165–190.
19. Konstantinidis KT, Tiedje JM. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc Natl Acad Sci U S A. 2004;101:3160–3165.
20. Rhodes ME, et al. Amino acid signatures of salinity on an environmental scale with a focus on the Dead Sea. Environ Microbiol. 2010;12:2613–2623.
21. Vieira-Silva S, Rocha EP. The systemic imprint of growth and its uses in ecological (meta)genomics. PLoS Genet. 2010;6:e1000808.
22. Klappenbach JA, et al. rRNA operon copy number reflects ecological strategies of bacteria. Appl Environ Microbiol. 2000;66:1328–1333.
23. Klappenbach JA, et al. rrndb: the Ribosomal RNA Operon Copy Number Database. Nucleic Acids Res. 2001;29:181–184.
24. Stevenson BS, Schmidt TM. Life history implications of rRNA gene copy number in Escherichia coli. Appl Environ Microbiol. 2004;70:6670–6677.
25. Cho YJ, et al. Genomic evolution of Vibrio cholerae. Curr Opin Microbiol. 2010;13:646–651.
26. Deng X, et al. Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification. BMC Genomics. 2010;11:500.
27. Binnewies TT, et al. Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics. 2006;6:165–185.
28. DeLong EF, et al. Community genomics among stratified microbial assemblages in the ocean's interior. Science. 2006;311:496–503.
29. Dinsdale EA, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452:629–632.
30. Chaffron S, et al. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Res. 2010;20:947–959.
31. Merhej V, et al. Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biol Direct. 2009;4:13.
32. Zaneveld JR, et al. Ribosomal RNA diversity predicts genome diversity in gut bacteria and their relatives. Nucleic Acids Res. 2010;38:3869–3879.
33. David LA, Alm EJ. Rapid evolutionary innovation during an Archaean genetic expansion. Nature. 2011;469:93–96.
34. Schliep K, et al. Harvesting evolutionary signals in a forest of prokaryotic gene trees. Mol Biol Evol. 2011;28:1393–1405.
35. Ishoey T, et al. Genomic sequencing of single microbial cells from environmental samples. Curr Opin Microbiol. 2008;11:198–204.
36. Coleman ML, Chisholm SW. Ecosystem-specific selection pressures revealed through comparative population genomics. Proc Natl Acad Sci U S A. 2010;107:18634–18639.
37. Narasingarao P, et al. De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities. ISME J. 2011.
38. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005;71:8228–8235.
39. Nemergut DR, et al. Global patterns in the biogeography of bacterial taxa. Environ Microbiol. 2011;13:135–144.
40. Lozupone CA, et al. The convergence of carbohydrate active gene repertoires in human gut microbes. Proc Natl Acad Sci U S A. 2008;105:15076–15081.
41. Yilmaz P, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol. 2011;29:415–420.
42. Kottmann R, et al. A standard MIGS/MIMS compliant XML Schema: toward the development of the Genomic Contextual Data Markup Language (GCDML). OMICS. 2008;12:115–121.
43. Meyer F, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386.
44. Gilbert JA, et al. The Earth Microbiome Project: Meeting report of the "1st EMP meeting on sample selection and acquisition" at Argonne National Laboratory October 6 2010. Stand Genomic Sci. 2010;3:249–253.
45. Ramette A. Multivariate analyses in microbial ecology. FEMS Microbiol Ecol. 2007;62:142–160.
46. Knight R, et al. PyCogent: a toolkit for making sense from sequence. Genome Biol. 2007;8:R171.
47. Rousk J, et al. Soil bacterial and fungal communities across a pH gradient in an arable soil. ISME J. 2010;4:1340–1351.
48. Fierer N, et al. Forensic identification using skin bacterial communities. Proc Natl Acad Sci U S A. 2010;107:6477–6481.
49. Benson DA, et al. GenBank. Nucleic Acids Res. 2011;39:D32–D37.
50. Liolios K, et al. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2010;38:D346–D354.
51. Knights D, et al. Supervised classification of human microbiota. FEMS Microbiol Rev. 2011;35:343–359.
52. Lee JW, et al. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis. 2005;48:869–885.
53. Muegge BD, et al. Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science. 2011;332:970–974.
54. Werner JJ, et al. Bacterial community structures are unique and resilient in full-scale bioenergy systems. Proc Natl Acad Sci U S A. 2011;108:4158–4163.
55. Smith A, et al. Novel application of a statistical technique, Random Forests, in a bacterial source tracking study. Water Res. 2010;44:4067–4076.
56. Efron B, Tibshirani R. Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association. 1997;92:548–560.
57. Aitchison J. The statistical analysis of compositional data. Chapman & Hall, Ltd.; 1986.
58. Harvey PH, Pagel MD. The Comparative Method in Evolutionary Biology. Oxford University Press; 1991.
59. Felsenstein J. Phylogenies and the Comparative Method. The American Naturalist. 1985;125:1–15.
60. Blomberg SP, et al. Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution. 2003;57:717–745.
61. Laurin M. Assessment of the relative merits of a few methods to detect evolutionary trends. Syst Biol. 2010;59:689–704.
62. Freckleton RP, et al. Phylogenetic analysis and comparative data: a test and review of evidence. Am Nat. 2002;160:712–726.
63. Jombart T, et al. adephylo: new tools for investigating the phylogenetic signal in biological traits. Bioinformatics. 2010;26:1907–1909.
64. Kembel SW, et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics. 2010;26:1463–1464.
65. Horner-Devine MC, et al. A comparison of taxon co-occurrence patterns for macro- and microorganisms. Ecology. 2007;88:1345–1353.
66. Zaneveld JR, et al. Are all horizontal gene transfers created equal? Prospects for mechanism-based studies of HGT patterns. Microbiology. 2008;154:1–15.
67. Galtier N, Daubin V. Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci. 2008;363:4023–4029.
68. Koonin EV, et al. Comparison of phylogenetic trees and search for a central trend in the "forest of life". J Comput Biol. 2011;18:917–924.
69. Andam CP, et al. Biased gene transfer mimics patterns created through shared ancestry. Proc Natl Acad Sci U S A. 2010;107:10679–10684.
70. Caporaso JG, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–336.
71. Schloss PD, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–7541.
72. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591.
73. Carmel L, et al. EREM: parameter estimation and ancestral reconstruction by expectation-maximization algorithm for a probabilistic model of genomic binary characters evolution. Adv Bioinformatics. 2010:167408.
74. Paradis E, et al. APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20:289–290.
75. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574.
76. Drummond AJ, Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007;7:214.
77. Dray S, Dufour AB. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software. 2007;22:1–20.
78. Than C, et al. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics. 2008;9:322.
79. Podell S, et al. A database of phylogenetically atypical genes in archaeal and bacterial genomes, identified using the DarkHorse algorithm. BMC Bioinformatics. 2008;9:419.
80. Schliep KP. phangorn: phylogenetic analysis in R. Bioinformatics. 2011;27:592–593.
81. Jombart T, et al. Putting phylogeny into the analysis of biological traits: a methodological approach. J Theor Biol. 2010;264:693–701.
82. Pallen MJ, Wren BW. Bacterial pathogenomics. Nature. 2007;449:835–842.
83. Gray MW. Evolution of organellar genomes. Curr Opin Genet Dev. 1999;9:678–687.
84. Archibald JM, Lane CE. Going, going, not quite gone: nucleomorphs as a case study in nuclear genome reduction. J Hered. 2009;100:582–590.
85. Keeling PJ, et al. The reduced genome of the parasitic microsporidian Enterocytozoon bieneusi lacks genes for core carbon metabolism. Genome Biol Evol. 2010;2:304–309.
86. Giovannoni SJ, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science. 2005;309:1242–1245.
87. Luo H, et al. Genome reduction by deletion of paralogs in the marine cyanobacterium Prochlorococcus. Mol Biol Evol. 2011.
88. Kawecki TJ. Adaptation to marginal habitats: contrasting influence of the dispersal rate on the fate of alleles with small and large effects. Proc Biol Sci. 2000;267:1315–1320.
89. Sokurenko EV, et al. Source-sink dynamics of virulence evolution. Nat Rev Microbiol. 2006;4:548–555.
90. Jones SE, Lennon JT. Dormancy contributes to the maintenance of microbial diversity. Proc Natl Acad Sci U S A. 2010;107:5881–5886.
91. Knights D, et al. Bayesian community-wide culture-independent microbial source tracking. Nat Methods. 2011.
92. Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970;19:99–113.
93. Koshi JM, Goldstein RA. Probabilistic reconstruction of ancestral protein sequences. J Mol Evol. 1996;42:313–320.
94. Pagel M, et al. Bayesian estimation of ancestral character states on phylogenies. Syst Biol. 2004;53:673–684.
95. Yokoyama S, et al. Elucidation of phenotypic adaptations: molecular analyses of dim-light vision proteins in vertebrates. Proc Natl Acad Sci U S A. 2008;105:13480–13485.
96. Thornton JW, et al. Resurrecting the ancestral steroid receptor: ancient origin of estrogen signaling. Science. 2003;301:1714–1717.
97. Gabaldon T, Huynen MA. Reconstruction of the proto-mitochondrial metabolism. Science. 2003;301:609.
98. Paten B, et al. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res. 2008;18:1829–1843.
99. Marri PR, et al. Gene gain and gene loss in Streptococcus: is it driven by habitat? Mol Biol Evol. 2006;23:2379–2391.
100. Ackerly DD, et al. Niche evolution and adaptive radiation: testing the order of trait divergence. Ecology. 2006;87:S50–S61.