High-throughput sequencing technologies provide new opportunities to address longstanding questions about habitat adaptation in microbial organisms. How have microbes managed to adapt to such a wide range of environments, and what genomic features allow for such adaptation? We review recent large-scale studies of habitat adaptation, with emphasis on those that utilize phylogenetic techniques. On the basis of current trends, we summarize methodological challenges faced by investigators, and the tools, techniques, and analytical approaches available to overcome them. Phylogenetic approaches and detailed information about each environmental sample will be critical as the ability to collect genome sequences continues to expand.
We live in a world suffused with microbial life. Universal trees of life [1, 2] constructed by a variety of methods unambiguously show that bacteria, archaea, and microbial eukaryotes constitute the vast majority of life’s diversity. These diverse organisms perform many important ecological functions across a wide range of natural and man-made environments: photosynthesis in the world’s oceans; nitrogen fixation and provision of carbohydrates in association with plant roots; even modification of the chemistry of the upper atmosphere by communities in droplets of cloud-water. The bodies of animals are also colonized internally and externally by microorganisms, which play crucial roles in the development, homeostasis, and behavior of their hosts.
How have bacteria, archaea, and microbial eukaryotes adapted to survive and thrive across such a range of lifestyles and habitats? Understanding the relationship between microbial genome sequence and fitness in a given environment is both a fundamental question in evolutionary biology, and a matter of societal importance. As we seek to gain a predictive understanding of phenomena such as the emergence (or reemergence) of pathogens, the impact of human activities from agriculture to the combustion of fossil fuels on ecosystems, or the effects of dietary or medical interventions on human health (e.g. administration of anti- or probiotics), accurate descriptions of the mechanisms by which microorganisms have adapted to environmental changes in the past will provide critical guidance.
Traditionally, questions of microbial habitat adaptation have been addressed by experimental manipulation of microbes in pure culture, or by comparisons of genome sequences. More recently, however, large decreases in the cost of sequencing have allowed such approaches to be complemented by the collection of unprecedented quantities of 16S rRNA, metagenomic [10, 11], transcriptomic, and whole-genome data. The ‘microbial data deluge’ has spurred the development of new computational tools, and has also made possible systematic study of large-scale processes such as habitat adaptation in ways that would have been previously intractable. Here we highlight how the increasing availability of sequence data from diverse environments is allowing researchers to systematically explore questions about the evolution of habitat adaptation in microbial genomes. We emphasize current trends in the use of tools and analytical approaches, highlighting those that have recently been applied to yield novel insights into this question (Table 1), as well as the outstanding methodological challenges that remain to be overcome.
It is now well established that the distribution of microbial organisms across different habitat types is correlated with their phylogeny, both in terms of the β diversity of microbial communities [14, 15] and the habitat range of individual lineages. For example, 16S rRNA surveys clearly separate bacteria into host and free-living communities; planktonic saline and non-saline communities; and soil and sediment communities. An association between habitat and phylogeny has also been detected in an analysis combining phylogenetically informative marker genes identified in metagenomic studies, comparison of the isolation environment for cultured organisms, and 16S rRNA gene surveys. These results suggest that microbial habitat preferences are fairly stable over evolutionary time. For example, we would not expect to see such patterns if horizontal gene transfer (HGT) were so rampant that all microorganisms were equally capable of adapting to a given environment (by rapid acquisition of the necessary genes from indigenous microbes). However, the observed correlation between phylogeny and habitat in microbial communities does not imply that habitat range for any individual organism can be perfectly predicted from phylogeny alone (nor does it contradict the observation of long tails of rare microbes in many samples). Instead, this observation demonstrates that phylogenetic information can provide a useful first approximation for habitat range; probabilistic models that quantify how accurately phylogeny (or gene repertoire) can predict microbial habitat range remain a topic for future research.
The adaptation of microbial taxa to different habitats or lifestyles is reflected in their genome sequences. Some of the best established examples of habitat adaptation identified in genomic studies include reduced genome size in intracellular endosymbionts  (Box 1), increases in genome size and the prevalence of two-component regulators in cosmopolitan organisms, increased acidic amino acids as a response to salinity ( and references contained therein), and increased rRNA copy number in fast-growing microorganisms ([21–24] and references contained therein). Additionally, numerous comparative genomic analyses have identified genomic changes associated with differences in habitat or lifestyle within specific taxa (see  and  for recent examples and  for a review).
Genome reduction is one of the best-studied examples of genome evolution as a habitat adaptation in microbial organisms. Genomic minimalism is typically associated with organisms living in a host-associated environment, either as endosymbionts or obligate parasites (e.g. [18, 82]), where increasing reliance on the host leads to loss of numerous pathways. The reduced genomes of the insect symbionts Buchnera (450 kb) and Carsonella (160 kb) have lost many biosynthetic pathways, but retain genes for amino acid biosynthesis, which forms the basis for their relationship with the host. The extent of genomic reduction tends to increase as the length of the obligate relationship with the host increases, with the greatest reduction seen in the mitochondrial and plastid organelles that have been stably incorporated in eukaryotic cells for more than 1 billion years and contain only a handful of genes. Organelles also provide the most extreme example of eukaryotic genome reduction, in this case in the secondary plastids, which were acquired through the uptake of a eukaryotic alga. Two lineages with secondary plastids, cryptophytes and chlorarachniophytes, still retain a relict nucleus of the secondary red or green algal symbiont, called a nucleomorph, that has undergone extreme genome reduction and appears to be on a path to complete loss.
The genomic trajectory of obligate intracellular parasites has followed a similar reductive path, with extensive loss and/or reduction in biosynthetic pathways that corresponds to an increased reliance on the host. Many eukaryotic lineages have undergone large-scale genomic streamlining upon becoming obligate parasites. The most extreme example is in microsporidia, a lineage of highly reduced fungi that are obligate intracellular parasites of diverse animals. The microsporidian Enterocytozoon bieneusi, an enteric pathogen in humans, has even lost the ability to synthesize its own ATP and instead has transporters to import ATP from its host. Genomic reduction has also occurred in species of the highly abundant, free-living bacteria Pelagibacter and Prochlorococcus, where selection for efficient reproduction and/or reduced cell size is proposed to have driven streamlining of genomic content [86, 87]. In both Pelagibacter ubique and reduced strains of Prochlorococcus, loss of paralogous gene copies has been demonstrated to play a role in genome reduction [86, 87]. In Prochlorococcus strains, loss of entire gene families has also played a role in genome reduction, while in Pelagibacter few ancestral pathways have been lost. Genome reduction in Pelagibacter has instead been achieved by a reduction in the length of intergenic regions (these regions have a median length of only 3 nucleotides), the elimination of phage genes and pseudogenes, and loss of recently duplicated paralogs.
Finally, metagenomic surveys have also shed light on many important aspects of habitat adaptation. These include changes in the aggregate functional profiles of microbial communities along gradients of depth, across diverse habitat categories, or between oligotrophic and copiotrophic communities.
Increasingly, research into microbial habitat adaptation is successfully leveraging publicly available genome, marker gene, and metagenome sequence data to contextualize new findings. Specifically, several recent studies of microbial co-occurrence, habitat adaptation [31, 32], survival strategy, and genome evolution [33, 34] have combined phylogenetic and genomic or metagenomic information to better understand microbial habitat adaptation. Such studies have converged on related strategies, and faced common challenges. Based on these trends, we discuss a generalized workflow for comparative analysis (Figure 1), including the challenges involved in matching sequenced genomes to habitat assignments, determining which environmental parameters are most likely to be relevant for an analysis, separating the effects of habitat adaptation from those of shared evolutionary history, and detecting horizontal gene transfer.
In order to understand how microbial genomes change in response to environmental adaptation, comparative genomics approaches to habitat adaptation require an operational definition for environment, and a way to relate individual microbial genomes and the environments to which they are adapted (Figure 1). In the future this problem may be resolved by single-cell genomics: careful selection of a range of environments, followed by the sequencing of large numbers of phylogenetically representative complete genomes directly from those environments, would provide an unambiguous association between individual sequenced microorganisms and their habitat. In practice, however, this is not yet attainable on a large scale, although substantial progress is being made in techniques for obtaining genome sequences from single cells [35, 36]. Direct assembly of genomes from deep metagenomic sequencing provides a similarly direct connection between genome and environment. However, the assembly of complete genomes from metagenomic data is limited both because it can be difficult to obtain sufficient coverage for complete assembly in many complex communities, and because of the potential for chimeric assemblies. Thus, many comparative genomics approaches currently rely on proxy information about an organism’s habitat range. Common proxy approaches for determining a sequenced microorganism’s habitat range include annotating environment based on the original isolation source (for cultured organisms), the reported collection site for environmental studies, or database annotations based on one of these approaches. Annotating habitat from the source of the isolate is limited both by cultivation bias (the organisms that grow best in culture often represent a non-random subset of environmental diversity), and because many organisms, especially those abundant in individual samples, are ‘cosmopolitan’ and can inhabit a variety of environments.
Careful surveys of the literature can be very useful in establishing a broader sense of the set of environments with which a sequenced organism must contend, but such surveys are laborious and are limited to the lineages actually discussed. An emerging alternative approach is to search community (marker gene and/or metagenomic) survey data for close relatives of sequenced genomes. Such an approach has the advantage that it can be conducted in a relatively unbiased manner, and can associate sequenced organisms with the environmental samples in which their close relatives are found. As with annotations based on isolation source, however, care must be taken, as some organisms present in samples might simply be ‘passing through’, or might be contaminants (Box 2). As databases of 16S rRNA and metagenomic community surveys accumulate, automated methods for surveying the habitat range of microbial taxa (see e.g. [32, 40]) using community surveys should become increasingly effective. However, this improvement is entirely dependent on the consistency and quality of the contextual information associated with marker gene and metagenomic community surveys.
One difficulty with attempts to map the habitat range of an organism using (metagenomic or marker gene) community surveys is that the presence of a microbe in an assemblage is not proof that the organism is adapted for life there. If a productive (source) and an unproductive (sink) environment are linked by high rates of migration, even relatively abundant organisms in the unproductive environment can be maintained primarily by migration from the source, rather than reproduction in the sink. Such source/sink dynamics have been extensively documented in the ecology of micro- and macroscopic organisms [88, 89] and are likely to play important roles in many microbial communities. For example, microbial assemblages from the human gut may contain transient populations of microorganisms associated with ingested food or the mouth community, in addition to the indigenous community. The complexities presented by source/sink dynamics are compounded by the prevalence of dormancy in microbial populations, which can increase the ability of microbes to migrate to, and persist in, marginal habitats. Currently available techniques for minimizing the effect of source/sink dynamics when annotating habitat range from community surveys include requiring the presence of an OTU across multiple samples; considering the relative presence of an organism in a habitat as a proportion of its total abundance across all environments; and experimentally comparing rRNA and rDNA ratios to test for metabolic activity in the sample, which can indicate the presence of living, actively transcribing organisms as opposed to just their DNA. One additional recent approach to this problem involves new algorithms for tracking recent migration from a source environment. This approach can also detect laboratory contamination, which can lead to inappropriate conclusions about cosmopolitanism (see  and references contained therein).
However, accurate techniques for inferring microbial habitat adaptation (fitness in a particular habitat, rather than merely presence) from community surveys remain a topic where further development is needed.
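The first two filters mentioned above (presence of an OTU in multiple samples, and the share of its total abundance found in a habitat) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `habitat_range`, the thresholds `min_samples` and `min_fraction`, and the toy observations are all assumptions introduced here, and the counts are assumed to have already been normalized to a comparable sequencing depth.

```python
from collections import defaultdict

def habitat_range(observations, min_samples=2, min_fraction=0.10):
    """Return {otu: set of habitats} that pass both filters.

    observations: iterable of (sample_id, habitat, otu, count) tuples.
    min_samples and min_fraction are illustrative thresholds, not
    values recommended by any published study.
    """
    samples_seen = defaultdict(set)   # (otu, habitat) -> samples where detected
    abundance = defaultdict(float)    # (otu, habitat) -> summed count
    total = defaultdict(float)        # otu -> summed count over all habitats

    for sample_id, habitat, otu, count in observations:
        if count <= 0:
            continue
        samples_seen[(otu, habitat)].add(sample_id)
        abundance[(otu, habitat)] += count
        total[otu] += count

    result = defaultdict(set)
    for (otu, habitat), ids in samples_seen.items():
        prevalent = len(ids) >= min_samples             # filter 1: repeated detection
        share = abundance[(otu, habitat)] / total[otu]  # filter 2: share of total
        if prevalent and share >= min_fraction:
            result[otu].add(habitat)
    return dict(result)

# Toy data (hypothetical): OTU_A is a gut organism seen once in soil;
# OTU_B is a soil organism that is present but rare in the gut.
observations = [
    ("gut1", "gut", "OTU_A", 120), ("gut2", "gut", "OTU_A", 95),
    ("soil1", "soil", "OTU_A", 3),
    ("soil1", "soil", "OTU_B", 40), ("soil2", "soil", "OTU_B", 55),
    ("gut1", "gut", "OTU_B", 2), ("gut2", "gut", "OTU_B", 1),
]
ranges = habitat_range(observations)
print(ranges)
```

Neither filter distinguishes an adapted resident from a persistent migrant; they only remove the most obvious stray detections, which is why activity-based evidence such as rRNA:rDNA ratios remains a useful complement.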
The rapid accumulation of studies encompassing thousands of samples and billions of sequences has the potential to allow myriad new insights through comparative analysis. However, in order to maximize this potential, accurate contextual information about the samples (often called ‘sequence metadata’) is an increasingly important consideration (Figure 1, Topic 1). In addition to the existence of metadata, the form of that metadata is crucially important: if data are not consistently annotated in a standardized machine-readable format, large-scale comparative analyses become difficult or impossible. The utility of datasets for comparative analysis is thus frequently limited by the quality of metadata reported for the sampled environment. Such limitations can be introduced during data collection, data encoding, or data reporting. During data collection, datasets are often limited by reporting only those physical, chemical or geographic parameters relevant to the particular hypothesis at hand (even if other parameters were collected). A lack of widely adopted standards for encoding the metadata that describes samples also presents significant challenges for comparative analyses. Differences in annotation can range from relatively simple (the use of different names or abbreviations to represent the same body site), to very challenging (differing definitions of environment types). Another limitation occurs during publication: although journals require that sequence data be made publicly available, similar requirements have not been enforced for sample metadata.
In order to address these issues, many new sequencing efforts are now adopting the minimal information about any (x) sequence (MIxS) standards (http://www.gensc.org/gc_wiki/index.php/MIxS), which were proposed by the Genomic Standards Consortium. The MIxS standard encapsulates three metadata-compliant data types: the minimal information about a genome sequence (MIGS), about a metagenome sequence (MIMS), and about a marker gene sequence (MIMARKS). These standards require researchers to supply their metadata using controlled vocabulary terminology and ontological values, which will greatly benefit cross-study comparisons. Due to the adoption of such standards, some databases are also starting to require MIxS-compliance during metadata submission. These include the Metagenomics RAST Server (MG-RAST; http://metagenomics.anl.gov/), the Human Microbiome Project (HMP; http://www.hmpdacc.org/), the Earth Microbiome Project (EMP; http://www.earthmicrobiome.org/) and the QIIME Database (http://www.microbio.me/qiime).
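To make the value of a controlled vocabulary concrete, the following Python sketch checks one sample's metadata for required fields and maps free-text labels onto a single agreed term. The field names, the synonym table, and the function `normalize_record` are assumptions chosen for illustration; they are loosely modelled on MIxS-style field names but are not a transcription of the standard, which should be consulted directly for the authoritative checklist.

```python
# Illustrative required fields; the real checklist should be taken from
# the standard itself, not from this sketch.
REQUIRED_FIELDS = ("sample_id", "collection_date", "lat_lon", "env_material")

# Hypothetical synonym table. A production version would map to
# ontology term identifiers rather than plain strings.
SYNONYMS = {
    "env_material": {
        "feces": "feces", "faeces": "feces", "stool": "feces",
        "soil": "soil", "topsoil": "soil",
        "seawater": "sea water", "sea water": "sea water",
    }
}

def normalize_record(record):
    """Return (clean_record, problems) for one sample's metadata dict."""
    clean, problems = {}, []
    for field in REQUIRED_FIELDS:
        raw = str(record.get(field, "")).strip()
        if not raw:
            problems.append(f"missing {field}")
            continue
        vocabulary = SYNONYMS.get(field)
        if vocabulary is None:
            clean[field] = raw                      # free-text field, kept as given
        elif raw.lower() in vocabulary:
            clean[field] = vocabulary[raw.lower()]  # mapped to the agreed term
        else:
            problems.append(f"unrecognized {field}: {raw!r}")
    return clean, problems

good = {"sample_id": "S1", "collection_date": "2010-06-01",
        "lat_lon": "40.0 -105.3", "env_material": "Stool"}
bad = {"sample_id": "S2", "env_material": "keyboard"}

clean, problems = normalize_record(good)
_, bad_problems = normalize_record(bad)
print(clean, problems, bad_problems)
```

Rejecting unrecognized terms at submission time, rather than guessing at their meaning later, is what keeps records from different studies comparable.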
When investigating habitat adaptation in microbes, it is crucial to first have a baseline understanding of how microbial communities vary across environmental samples (microbial β diversity), and the main factors that drive such variation (Figure 1, Topic 2). Ordination methods have been widely and fruitfully applied to address these questions. By assessing the microbial composition of each microbial community, ordination methods allow an assessment of the extent to which communities are partitioned into distinct clusters, or arrayed along a continuous gradient based on environmental factors (see  for a survey of ordination methods).
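As a rough illustration of what an ordination computes, the sketch below derives Bray-Curtis dissimilarities between samples and then extracts the leading axis of a principal coordinates analysis (classical multidimensional scaling) by power iteration. It is a teaching sketch under stated assumptions rather than a replacement for an established package: the sample names and counts are invented, only one axis is returned, and power iteration finds the eigenvalue of largest magnitude, which is assumed here to be positive.

```python
import math

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0

def first_pcoa_axis(dist, iterations=500):
    """Coordinates of each sample on the leading principal coordinate."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand = sum(row_mean) / n
    # Gower double-centring; the matrix is symmetric, so row and
    # column means coincide.
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
         for i in range(n)]
    vec = [float(i + 1) for i in range(n)]   # arbitrary non-constant start
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
        # Rayleigh quotient of the current unit vector.
        eigenvalue = sum(vec[i] * sum(b[i][j] * vec[j] for j in range(n))
                         for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [x * scale for x in vec]

# Hypothetical OTU counts for two saline and two non-saline samples.
samples = {
    "saline_1": [50, 30, 0, 0],
    "saline_2": [45, 35, 5, 0],
    "fresh_1": [0, 5, 40, 40],
    "fresh_2": [0, 0, 50, 45],
}
names = list(samples)
dist = [[bray_curtis(samples[p], samples[q]) for q in names] for p in names]
axis = first_pcoa_axis(dist)
for name, coordinate in zip(names, axis):
    print(f"{name}\t{coordinate:+.3f}")
```

In this toy data the two habitat types fall on opposite sides of the first axis; the sign of the axis itself is arbitrary. Real analyses usually inspect several axes and the proportion of variation each explains.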
Ordination analyses performed on microbial community composition data acquired via sequencing of the gene encoding the small subunit ribosomal RNA have been used to distinguish microbial communities, and to identify environmental factors that contribute to both large- and small-scale differences between communities. For example, Lozupone et al. found a clear split between saline and non-saline environments among non-host associated microbial communities. King et al. combined ordination techniques with biogeography to demonstrate the dominant role of pH, plant abundance and snow depth in shaping the microbial communities found in alpine soil and to build global distribution models for microorganisms in this habitat. Fierer et al. used 16S rRNA composition data to show that microbial communities on individuals’ hands were far more similar to the communities on their computer keyboards than they were to communities from other individuals’ hands.
A community-wide perspective on the factors structuring microbial diversity can also be obtained from shotgun metagenomic data. DNA or RNA sequences from random locations on the genomes of many microbes in a community can be assigned to functional (or other) categories, and again ordination methods can be applied to the resulting data. The (dis)agreement between 16S rRNA data and metagenomic data can then be visualized and quantified via Procrustes analysis, which compares the similarity of pairs of ordinations (see  for an example of applying this technique to the 5’ and 3’ paired-end reads of the same rRNA molecules in environmental samples). Such comparisons are one method of determining, at the community level, the degree to which the pool of functional genes in a microbial assemblage is predictable from phylogeny (relative to other reference communities). An unusual degree of difference between phylogeny and gene content may be a biologically interesting signature of competition or functional convergence.
Finally, ordination methods can help to inform high-throughput studies of microbial habitat adaptation by determining which environmental parameters are most important in structuring community diversity. Objective methods for defining relevant metadata parameters and defining working habitat categories are crucial, because many studies rely heavily on the lifestyle or habitat categories defined in a small number of online databases (primarily NCBI and GOLD) to test comparative genomic hypotheses. Careful refinement of these categories and addition of more detailed subcategories (based in part on the results of ordination techniques) would yield rapid dividends in comparative analysis.
Machine learning techniques hold promise for relating gene functions to habitat distributions (Figure 1, Topic 7). These techniques have been used in taxonomic classification of metagenomic data and many other problems in bioinformatics. Although their application to classification and clustering of microbial communities by habitat is relatively new, machine learning techniques have been applied extensively to habitat classification in microarray data. This emerging approach has been successful for classifying microbial communities across a number of different habitat types. For example, Muegge et al. used a nearest-neighbor approach to demonstrate that phylogenetic characterizations of microbial communities can be used to predict metagenomic profiles of those communities. Werner et al. used supervised classifiers to identify a small subset of operational taxonomic units (OTUs) that were highly predictive of the type of bioreactor in brewery wastewater-treatment systems. Supervised classifiers have also recently been applied to source tracking of fecal contamination in water supplies.
The primary purpose of supervised machine learning in the context of microbial habitat adaptation is to build predictive models of the differences between habitats. A supervised classifier takes as its input a set of biological samples (training data) characterized by, for example, observations of OTUs, or counts of gene categories, along with metadata identifying the source habitats of those communities. The output is a model designed to predict the source habitat for novel biological samples not included in the training data, and an estimate of the expected future accuracy of the model. In many cases the classifier will also report a measure of the predictive capability of each of the predictor variables (e.g. gene categories). One of the main advantages of machine learning techniques is that they are designed to discover general trends present in the training data even when the number of predictor variables is much larger than the number of samples, while avoiding overfitting. This is a challenging task, however, in data as sparse and high-dimensional as microbial community surveys or metagenomic analyses, and exceptional caution must be taken to avoid overestimating the future accuracy of supervised classifiers with small sets of training data. Novel techniques might also need to account for the compositional nature of metagenomics data; for example, changes in a dominant community member could introduce spurious correlations between minor members. Nonetheless, one exciting direction is that once sufficient genomes linked to environmental samples have been collected, machine learning techniques will be ideal for understanding which genes, regulatory structures, or other properties of the genome are specifically associated with presence in an environment, especially when combined with the phylogenetic methods discussed in the next section.
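The input/output contract of a supervised classifier described above can be illustrated with a deliberately simple example: a one-nearest-neighbor classifier whose future accuracy is estimated by leave-one-out cross-validation. The sketch is written in plain Python; the function names, the Euclidean distance, and the toy relative-abundance vectors are assumptions made for illustration and are not the methods used in the studies cited.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor_label(query, training):
    """Habitat label of the training sample closest to `query`."""
    return min(training, key=lambda item: euclidean(query, item[0]))[1]

def leave_one_out_accuracy(samples):
    """Estimate accuracy by holding out each sample in turn.

    samples: list of (feature_vector, habitat_label) pairs.
    """
    correct = 0
    for i, (features, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]   # every sample but the i-th
        if nearest_neighbor_label(features, training) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical relative abundances of four OTUs in six communities.
samples = [
    ([0.60, 0.30, 0.05, 0.05], "gut"),
    ([0.55, 0.35, 0.05, 0.05], "gut"),
    ([0.65, 0.25, 0.05, 0.05], "gut"),
    ([0.05, 0.10, 0.50, 0.35], "soil"),
    ([0.10, 0.05, 0.45, 0.40], "soil"),
    ([0.05, 0.05, 0.55, 0.35], "soil"),
]
accuracy = leave_one_out_accuracy(samples)
predicted = nearest_neighbor_label([0.50, 0.40, 0.05, 0.05], samples)
print(accuracy, predicted)
```

The held-out estimate is only meaningful if no information from the held-out sample leaks into training (for example through feature selection performed on the full dataset), and with so few samples it should be read as a rough indication rather than a reliable forecast.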
Once habitats have been assigned to organisms, relating genome properties to habitats is still challenging. Because all organisms share a common ancestry, each genome sequence cannot be counted as an independent observation when conducting statistical analyses, including machine learning techniques. Instead, the evolutionary history that relates organisms must be taken into account (Figure 1, Topic 4). The importance of this well-established but often ignored principle is illustrated in Figure 2. Phylogenetic comparative methods are of particular relevance to microbial ecologists because the organisms selected for genome sequencing are not distributed across the tree of life evenly (although efforts are underway to ameliorate this problem). This sequencing bias exacerbates the problems of interpretation introduced when traits are correlated with phylogeny.
Recent investigations of microbial adaptation to the human gut, global co-occurrence patterns, and genomic changes associated with growth rate have examined phylogenetic patterns by plotting relevant traits against phylogenetic distance, and found useful information in both trends that can largely be explained by phylogeny (e.g. similarity in GC content [21, 30]) and those that can only be partially explained by phylogeny (e.g. gene content during adaptation to life in the gut; and gene content and genome size in co-occurring organisms). Other studies have employed rarefaction, in which data are evened out across categories by discarding members of overrepresented taxa. Rarefaction can provide a useful check on the effects of oversampled taxa, but suffers from the obvious drawback that it frequently discards a large portion of the data, and is limited by the least sampled taxon. Nonetheless, the utility of relatively unsophisticated methods such as rarefaction and regression against phylogenetic distance suggests that inclusion of more formal analyses of phylogenetic signal (for example, phylogenetically independent contrasts and phylogenetic generalized least squares), along with reconstructions of ancestral states (Box 3), could play an important role in future studies of microbial habitat adaptation. The development [59, 60] and testing [61, 62] of phylogenetic comparative methods for quantitative traits, as well as software packages [63, 64] to make such methods easily accessible, are active areas of research, but many tools already exist for analyzing such traits without phylogenetic bias (Table 1), and these should be applied in microbial studies.
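For readers unfamiliar with phylogenetically independent contrasts, the following Python sketch implements the standard post-order calculation on a small, fully bifurcating tree and then correlates the contrasts of two traits through the origin. The tree encoding (nested tuples with branch lengths), the tip names, and the trait values are hypothetical; all branch lengths are assumed to be positive, and polytomies are not handled.

```python
import math

def contrasts(node, traits, out):
    """Post-order pass computing standardized independent contrasts.

    node is (tip_name, branch_length) for a tip, or
    ((left, right), branch_length) for an internal node.
    Appends one contrast per internal node to `out` and returns
    (estimated_trait_value, adjusted_branch_length) for `node`.
    """
    label, length = node
    if isinstance(label, str):
        return traits[label], length
    left, right = label
    x1, v1 = contrasts(left, traits, out)
    x2, v2 = contrasts(right, traits, out)
    out.append((x1 - x2) / math.sqrt(v1 + v2))           # standardized contrast
    estimate = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)   # weighted node value
    return estimate, length + v1 * v2 / (v1 + v2)        # lengthen the stem

def contrast_correlation(tree, trait_x, trait_y):
    """Correlation of two traits' contrasts, computed through the origin."""
    cx, cy = [], []
    contrasts(tree, trait_x, cx)
    contrasts(tree, trait_y, cy)
    sxy = sum(a * b for a, b in zip(cx, cy))
    sxx = sum(a * a for a in cx)
    syy = sum(b * b for b in cy)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical four-taxon tree ((A,B),(C,D)) and two quantitative traits.
tree = ((((("A", 1.0), ("B", 1.0)), 3.0),
         ((("C", 1.0), ("D", 1.0)), 3.0)), 0.0)
genome_size = {"A": 5.0, "B": 5.2, "C": 2.0, "D": 2.2}
rrna_copies = {"A": 7.0, "B": 6.8, "C": 3.0, "D": 2.8}

x_contrasts = []
contrasts(tree, genome_size, x_contrasts)
r = contrast_correlation(tree, genome_size, rrna_copies)
print(x_contrasts, r)
```

Four tips yield only three contrasts, one per internal node, which makes explicit how shared ancestry reduces the number of independent observations available to a statistical test.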
Reconstruction of ancestral states is a powerful tool to understand molecular and genomic evolution, which is increasingly being applied to the study of microbial habitat adaptation. Ancestral traits for a group of species can be inferred based on a phylogenetic tree, an alignment of the observed states, and a model of evolution of the character under study. By analyzing a character in a group of extant species, the most probable state the character had in the common ancestor of these species can be determined, thus identifying changes that have occurred since divergence. The ancestral sequence can be estimated by one of several methods, such as parsimony, maximum likelihood ( and references therein), or Bayesian inference ( and references therein). Selected tools for performing ancestral state reconstruction are listed in Table 1. For relatively recent evolutionary events, it is sometimes possible to infer probable gene sequences at ancestral nodes. The estimated sequence can then be synthesized and cloned into a vector that is transfected into a cell, and the expressed protein can subsequently be purified in order to study its properties. Based on this process, new insights into the evolution of dim-light vision and steroid receptors have been gained. In addition to inferring ancestral gene sequences, ancestral state reconstruction has also been applied to infer other traits, such as mitochondrial metabolism and the content of genomes. In the future, it seems likely that integrated studies of genomic evolution that combine ancestral state reconstruction of genome contents, sequence-based analyses of selective pressure (e.g. via ratios of synonymous to non-synonymous nucleotide substitutions), tests of the order of trait divergence, and detection of horizontal gene transfer could yield new insights into the evolution of microbial habitat adaptation.
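The parsimony approach mentioned above can be illustrated with the bottom-up pass of Fitch's algorithm for a discrete character such as lifestyle. The sketch below is a minimal Python version under stated assumptions: the tree is a hypothetical, fully bifurcating topology encoded as nested tuples, branch lengths are ignored, and only the set of equally parsimonious states at the root is reported (the top-down pass that assigns a state to every internal node is omitted).

```python
def fitch(node, tip_states):
    """Bottom-up Fitch pass on a bifurcating tree.

    node is a tip name (str) or a (left, right) tuple.
    Returns (candidate_state_set, minimum_number_of_changes).
    """
    if isinstance(node, str):
        return {tip_states[node]}, 0
    left_set, left_cost = fitch(node[0], tip_states)
    right_set, right_cost = fitch(node[1], tip_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no change needed here.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical topology and lifestyle annotations.
tree = ((("A", "B"), "C"), ("D", "E"))
lifestyle = {
    "A": "host-associated", "B": "host-associated",
    "C": "free-living", "D": "free-living", "E": "free-living",
}
root_states, changes = fitch(tree, lifestyle)
print(root_states, changes)
```

Here the reconstruction suggests a free-living ancestor with a single transition to host association on the lineage leading to A and B. Parsimony gives no measure of uncertainty, which is one reason likelihood and Bayesian methods are often preferred when branch lengths are informative.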
One way to understand potential interactions between organisms that might impact environmental distribution is through the application of co-occurrence analysis (Figure 1 Topic 3). For instance, species that support each other’s growth, such as in syntrophic relationships where one organism produces metabolites that are consumed by the other, would be expected to positively co-occur across samples. In contrast, species that competitively exclude each other (e.g. because of similar metabolic requirements) might negatively co-occur. Co-occurrence patterns, however, are confounded because both positive and negative associations can also be driven by environmental preferences [30, 65]. Additionally, differences in the depth of sampling between environmental isolates could obscure co-occurrence patterns, especially for rare taxa.
Combining co-occurrence studies with comparative genomics can clarify the biological properties that drive associations among microbes. As an example, Chaffron et al. performed a global analysis of co-occurrence patterns using 16S rRNA surveys representing 3000 distinct sampling events for which sequence data was deposited in GenBank. They then assessed the genomic properties of the subset of OTUs for which close relatives had genome sequences. Although some of the positive associations in the 16S rRNA OTU network reflected known or suspected syntrophic associations, such as a consortium involved in the anaerobic oxidation of methane, the general trends suggested that the major factor driving positive associations was shared environmental preference. Positively co-occurring OTUs were more phylogenetically related than random OTU pairs, extending to lineages that diverged up to 10% at the 16S rRNA level (these would typically be placed in different taxonomic families). Interestingly, positively co-occurring OTUs had more similar genome size, GC content, and relative coverage of KEGG functional pathways than random OTU pairs. Phylogeny could largely explain the high similarity in GC content, but not similarities in genome size and KEGG functional pathway coverage. Thus inhabiting the same environment could drive convergence of genome size and metabolic potential in divergent microbes.
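A co-occurrence network of the general kind described here can be sketched by correlating OTU abundances across samples and keeping strongly associated pairs as edges. The Python example below uses Pearson correlation and an arbitrary threshold purely for illustration; the abundance table is invented, and this is not the statistical procedure used by Chaffron et al. In particular, it applies no correction for compositional effects, unequal sampling depth, or multiple testing.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cooccurrence_edges(table, threshold=0.8):
    """table: {otu: [abundance in each sample]}.

    Returns (otu_1, otu_2, r) for every pair with |r| >= threshold.
    The threshold is an illustrative choice, not a recommended value.
    """
    edges = []
    for a, b in combinations(sorted(table), 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges

# Hypothetical abundances of three OTUs across five samples.
table = {
    "OTU_1": [10, 12, 0, 1, 9],
    "OTU_2": [8, 11, 1, 0, 10],
    "OTU_3": [0, 1, 9, 12, 0],
}
edges = cooccurrence_edges(table)
for a, b, r in edges:
    print(f"{a}\t{b}\t{r:+.2f}")
```

A positive edge between OTU_1 and OTU_2 is equally consistent with syntrophy and with a shared habitat preference, which is why such networks are most informative when combined with sample metadata and genomic information.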
Ongoing studies have continued to document the important roles played by horizontal gene transfer (HGT) in microbial habitat adaptation (Figure 1, Topic 5). HGT can be detected by several methods: phylogenetic methods, which typically compare gene trees with a ‘species tree’; compositional methods, which analyze deviations in nucleotide, codon, or amino acid composition; or mobile-element methods, which search for specific genes or sequences associated with DNA mobility (see  for a review). Although there is ongoing controversy [67, 68] about the total extent of HGT, and the implications of HGT for microbial (especially bacterial and archaeal) phylogeny, it is increasingly clear that (i) HGT has played a major role in bacterial evolution, and (ii) trees of the universal or nearly universal genes give the same overall phylogenetic pattern on average, implying that the extent of HGT is not so great that measures of vertical inheritance, such as 16S rRNA phylogenies, are meaningless. Several recent studies of HGT have therefore focused on separating the relative contribution of HGT (by conjugation, phage transduction, transformation, etc.) and vertical descent (including gene loss, duplication, evolution of new gene families, and sequence divergence) to the evolution of gene content.
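Of the three families of HGT detection methods listed above, compositional methods are the simplest to illustrate. The Python sketch below flags genes whose GC content deviates strongly from the rest of the genome. It is a minimal sketch under stated assumptions: the gene names and sequences are invented, the z-score cutoff is arbitrary, and real compositional methods typically also use codon usage or oligonucleotide frequencies and will miss transfers from donors of similar composition or transfers that have since ameliorated.

```python
import statistics

def gc_content(seq):
    """Fraction of G and C among unambiguous bases in a sequence."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0

def atypical_genes(genes, z_cutoff=2.0):
    """genes: {name: sequence}.

    Returns the sorted names of genes whose GC content lies more than
    z_cutoff standard deviations from the mean over all genes.
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    values = list(gc.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return sorted(name for name, value in gc.items()
                  if abs(value - mean) / sd > z_cutoff)

# Hypothetical genome: seven genes of typical composition and one
# AT-rich gene standing in for a recently acquired region.
genes = {
    "gene_1": "ATGC" * 10,
    "gene_2": "ATGC" * 10,
    "gene_3": "ATGCG" * 8,
    "gene_4": "ATGCA" * 8,
    "gene_5": "ATGC" * 10,
    "gene_6": "ATGC" * 10,
    "gene_7": "ATGC" * 10,
    "island_gene": "ATTA" * 10,
}
atypical = atypical_genes(genes)
print(atypical)
```

Candidates flagged this way are best treated as hypotheses to be checked with phylogenetic or mobile-element evidence, since native genes under unusual selective constraints can also show atypical composition.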
Schliep and colleagues  used information embedded in the set (or ‘forest’) of gene trees from 100 bacteria and archaea to identify sections of gene trees that were not consistent with vertical descent, but did correspond to lifestyle (e.g. ‘anaerobe’) or habitat (e.g. ‘soil’) features as derived from NCBI annotations. This analysis yielded sets of gene families that could be better explained by lifestyle or habitat annotations than by taxonomy (~19% of gene families analyzed for hyperthermophiles) as well as networks of gene exchange amongst taxa and clusters of genes that were gained or lost in association with lifestyle.
David and Alm used AnGST, a model that tests for gene duplication, gene loss, and HGT within a single framework, to reconstruct the evolutionary history of 3,983 gene families. The results implied an ‘Archaean expansion’ 3.33–2.65 billion years ago in which the number of gene families expanded by ~26% during a period of rapid diversification. By examining the timing of the expansion, and finding that the gene categories increasing during this event were primarily associated with redox and electron transfer (O2 binding, Fe binding, and Fe-S binding were the most enriched categories), David and Alm were able to connect this expansion to the ‘great oxygenation event’: a dramatic biotically-mediated event in Earth’s history, in which the production of oxygen by photosynthesis began to exceed buffering capacity and thus raise O2 levels in the atmosphere and ocean.
Algorithms that include a unified model of gene evolution hold great promise for the study of habitat adaptation in microbial genomes (Table 1). The separation of genome evolution into specific vertical or horizontal components, and relating patterns in each to changes in habitat or lifestyle are also promising avenues for future research.
The increasing availability of 16S rRNA and metagenomic community surveys, in combination with new genome sequences, provides novel opportunities to conduct large-scale studies relating the survival strategies of microbial organisms to their genomic features (Box 4). Using the structure of the tree of life will be essential in establishing baseline predictions for trait conservation given phylogeny, and thereby distinguishing novel adaptations to a particular habitat from traits preserved solely due to shared evolutionary history. Given this phylogenetic baseline, large collections of community surveys with backing metadata can be used to detect genomic variations associated with life in a range of environmental conditions. Statistical tools are now available for investigating adaptation along ecological gradients, detecting HGT, reconstructing the evolutionary history of genes involved in environmental adaptation, and inferring correlations in species abundance. A major challenge for future studies will be designing accessible, high-throughput pipelines that combine these tools to gain biological insight and generate testable hypotheses from the large-scale sequence collection efforts currently underway.
The authors would like to thank Mike Robeson for useful comments on the draft. The work from our laboratory described in this review was supported in part by the National Institutes of Health, the Crohn’s and Colitis Foundation of America, the Bill and Melinda Gates Foundation, and the Howard Hughes Medical Institute.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.