Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. By enabling the analysis of populations including many (so far) unculturable and often unknown microbes, metagenomics is revolutionizing the field of microbiology, and has excited researchers in many disciplines that could benefit from the study of environmental microbes, including those in ecology, environmental sciences, and biomedicine. Specific computational and statistical tools have been developed for metagenomic data analysis and comparison. New studies, however, have revealed various kinds of artifacts present in metagenomics data caused by limitations in the experimental protocols and/or inadequate data analysis procedures, which often lead to incorrect conclusions about a microbial community. Here, we review some of the artifacts, such as overestimation of species diversity and incorrect estimation of gene family frequencies, and discuss emerging computational approaches to address them. We also review potential challenges that metagenomics may encounter with the extensive application of next-generation sequencing (NGS) techniques.
Metagenomics, a term first coined in 1998 , is a methodology that applies genome sequencing or assays of functional properties to the culture-independent analysis of complex and diverse (“meta”) populations of microbes. In sequencing studies, unlike traditional microbial genome sequencing projects, metagenomics research attempts to determine directly the whole collection of genes within an environmental sample (i.e., the metagenome), and to analyze their biochemical activities and complex interactions. Current metagenomics projects are facilitated by the rapid development of so-called next-generation sequencing (NGS) techniques , which lower the cost of sequencing and avoid the cloning process inherent in conventional capillary-based methods. Landmark progress in metagenomics occurred in 2004 [3, 4], when two research groups published results from large-scale environmental sequencing projects. Metagenomics projects have very broad applications, from ecology and environmental sciences , to the chemical industry , and human health (e.g., human gut microbiome metagenomics) [7, 8]. Besides large-scale shotgun metagenomics, there are smaller-scale approaches to studying microbes in the environment, such as 16S rRNA-based surveys and targeted metagenomics. 16S rRNA-based surveys of microbial communities were conducted well before large-scale shotgun metagenomics started, and are now boosted by the development of the barcoded pyrosequencing methodology . Targeted metagenomics, as the name indicates, aims at acquiring sequence reads with specific protein functions, such as glycoside hydrolases [10, 11], bile salt hydrolase , and bleomycin resistance genes  (for a review see ).
Metagenomics, in principle, enables the study of any microbial organism, including the large number of microorganisms (more than 99%) that cannot be isolated or are difficult to grow in the lab . More importantly, microbes by nature live in communities, where they interact with each other by exchanging nutrients, metabolites and signaling molecules. Although the conventional pure-culture paradigm remains important for the complete characterization of a species, its traditionally exclusive use limits exploration of the microbial world. Traditional clonal-culture microbiology needs to be complemented by culture-independent microbiology that can directly characterize microbes in natural environments, and can address important biological questions related to those microbial environments, such as the diversity of microbes in different environments , microbial (and microbe-host) interactions , and environmental and evolutionary processes .
Taking advantage of several excellent recent reviews on metagenomics [19–21], we will focus in this review on new computational and statistical tools for metagenomic data analysis, and on the artifacts that could lead to wrong conclusions if not handled appropriately.
Computational analysis has an even greater impact on metagenomic studies than on traditional genomic projects, due not only to the large amount of metagenomic data, but also to the new complexity introduced by metagenomic projects (e.g., assembling multiple genomes simultaneously is more challenging than assembling a single genome), and to the new questions we are asking (e.g., host-microbe interaction).
Once the sequences are collected, the first step in data analysis is to reconstruct the microbial genomes from metagenomic sequence reads using fragment assemblers. Unfortunately, due to the high species complexity and the short length of sequencing reads from NGS sequencers, this reconstruction goal is too difficult, if not impossible, to attain for samples from many microbial environments. As a result, metagenomic sequences are often subjected to further analysis as a collection of short reads.
Early attempts at assembling metagenomic sequences used conventional whole-genome assembly (WGA) pipelines, including whole-genome assemblers and gene-finding programs originally designed for conventional whole-genome shotgun sequencing (WGS) projects, with only small parameter modifications . The development of genome assembly algorithms has been boosted recently by the development of NGS techniques. New genome assemblers, including Velvet (a Eulerian path assembler) , ALLPATHS , and Euler-SR , have been developed that specifically target short and ultra-short reads (for example, reads from the 454 pyrosequencer, and from the Illumina/Solexa and SOLiD sequencers). We note that none of these assemblers is designed for assembling mixtures of genomes (i.e., metagenomes), and refer the reader to a review  on the recent development of genome assemblers and the difficult challenges assembly approaches encounter in the metagenomics field.
For finding genes in metagenomic sequences, one can certainly use full-genome-scale “gene predictors” (i.e., those developed for gene prediction from whole genomes) when long contigs are available; otherwise, one has to deal with short reads (or small contigs). Most metagenomic studies use 6-frame translation when conducting a similarity search on the short reads. There has been very little development of gene prediction methods specifically for metagenomic sequences or for short contigs assembled from the reads. MetaGene  and Orphelia [28, 29] were designed as gene predictors for short reads (~700 bp). MetaGene utilizes dicodon frequencies estimated from the GC content of a given sequencing read, together with other measures such as the length distribution of open reading frames (ORFs). Orphelia uses a two-stage machine learning approach for protein-coding gene prediction from metagenomic reads: it first uses linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences, and then combines these features with ORF length and fragment GC content in an artificial neural network to compute the probability that an ORF encodes a protein [28, 29]. Krause and colleagues  developed a gene finder for small contigs that may have in-frame stop codons as well as frameshifts. The gene finder works by initially conducting a BLAST similarity search for all the contigs and sequences, then calculating the coding potential for each position in a contig based on the search results (informed by the number of synonymous and non-synonymous substitutions), and finally predicting the coding sequences by looking for a chain of nucleotide positions that maximizes the sum of coding potentials using dynamic programming. The ORFome assembly approach  took a different path, starting with open reading frame (ORF) prediction followed by assembly of the predicted peptides.
This approach can produce long peptides even when the sequence coverage of reads is extremely low (and the reads may have synonymous mutations). A potential application of ORFome assembly is to increase the sensitivity of homology searching by using longer peptides, and thus to improve the functional annotation of metagenomic sequences.
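As a concrete illustration of the 6-frame translation step mentioned above, the following is a minimal Python sketch (not any published tool's implementation) that produces the six conceptual translations of a read before a protein-level similarity search:

```python
from itertools import product

# Standard genetic code, enumerated in TCAG order (a common compact encoding).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate a DNA sequence in frame 0, dropping any trailing partial codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frame_translation(read):
    """Return the six conceptual translations of a read (3 forward, 3 reverse)."""
    rc = read.translate(COMPLEMENT)[::-1]  # reverse complement
    return [translate(strand[frame:]) for strand in (read, rc) for frame in range(3)]
```

Each of the six peptides (with `*` marking stop codons) would then be searched against a protein database; hits in any frame suggest a coding region on the read.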
The most extensive development of computational tools for metagenomics might well have occurred in conjunction with taxonomic studies. One of the primary goals of metagenomic projects is to characterize the organisms present in an environmental sample; in turn, the taxonomic composition of environmental communities is an important indicator of their ecology and function. By comparison, the molecular approaches, such as genome assembly and function prediction, are relatively standard or well known, although their usage and extension provide new challenges in the metagenomics field. Many computational tools have been developed to infer species information from raw short reads directly, i.e., without the need for assembly. Here we give only a very brief review of developments in this area, and refer the readers to a review  for more details.
Similarity-based and phylogeny-based phylotyping tools utilize similarity searches of the metagenomic sequences against a database of known genes/proteins. MEGAN applies a simple lowest common ancestor algorithm to assign reads to taxa, based on BLAST similarity search results . Phylogenetic analysis of marker genes, including 16S rRNA genes , DNA polymerase genes , and 31 selected marker genes , has also been applied to determining taxonomic distribution. MLTreeMap  and AMPHORA  are two phylogeny-based phylotyping tools that use the phylogenetic analysis of marker genes for taxonomic distribution estimation. CARMA  searches for conserved Pfam domains and protein families  in the raw metagenomic sequences and classifies them into a higher-order taxonomy, based on the reconstruction of a phylogenetic tree for each matching Pfam family.
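To make the lowest common ancestor idea concrete, here is a minimal sketch of naive LCA assignment over root-to-leaf lineages; the real MEGAN algorithm additionally applies bit-score thresholds and a minimum-support filter, which are omitted here:

```python
def lowest_common_ancestor(lineages):
    """Given the root-to-leaf lineages of a read's BLAST hits, return the
    deepest taxon shared by all of them (naive MEGAN-style LCA).

    Each lineage is a list such as ["root", "Bacteria", "Firmicutes", ...].
    """
    if not lineages:
        return None
    lca = lineages[0]
    for lineage in lineages[1:]:
        # Keep the longest common prefix of the running LCA and this lineage.
        common = []
        for a, b in zip(lca, lineage):
            if a != b:
                break
            common.append(a)
        lca = common
    return lca[-1] if lca else None
```

A read hitting two Firmicutes species from different classes would thus be assigned at the phylum level, which is the conservative behavior that makes LCA assignment robust to spurious hits.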
Other tools have also been developed to address a related problem; namely, the binning problem, which is to cluster metagenomic sequences into different bins (i.e., operational taxonomic units or species). Most existing computational binning tools simply utilize DNA composition. The basis of these approaches is that genome G+C content, dinucleotide frequencies, and synonymous codon usage vary among organisms, and are generally characteristic of evolutionary lineages . TETRA  uses z-scores from tetramer frequencies to classify metagenomic sequences. MetaClust  uses a combination of k-mer frequency metrics to score metagenomic sequences, and was used to classify sequences from the endosymbionts of a gutless worm. CompostBin , a semi-supervised approach, uses a weighted PCA algorithm to project high-dimensional DNA composition data into an informative lower-dimensional space, and then uses the normalized cut clustering algorithm to classify sequences into taxon-specific bins. Zhou et al  developed a tool that generates genomic barcodes from DNA composition, and then uses the barcodes to classify sequences. However, DNA composition-based approaches only work with long reads (~800 bp–1 kbp); they fail with short reads because of the local variation of DNA composition across a genome. (GC content has been observed to be far from uniform even within a genome .)
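The DNA-composition signal used by these binning tools can be illustrated with a simple tetranucleotide profile; this is only a sketch of the general idea (TETRA, for instance, additionally converts such counts into z-scores against a Markov-model expectation before comparing sequences):

```python
from itertools import product
from collections import Counter

def tetramer_profile(seq):
    """Return a normalized 256-dimensional tetranucleotide frequency vector.

    Composition-based binners compare such vectors (or z-scores derived from
    them) to group sequences from the same genome together.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(sum(counts[k] for k in kmers), 1)  # ignore k-mers with ambiguous bases
    return [counts[k] / total for k in kmers]
```

Two long fragments from the same genome tend to have similar profiles (small Euclidean or correlation distance), while fragments from distant lineages tend to differ; with very short reads the counts are too sparse for this signal to be reliable, as noted above.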
More tools are emerging that target ultra-short reads of 100 bp or shorter for taxonomic studies . However, more accurate computational tools for the phylotyping (both quantitative and qualitative) and binning problems, especially for short reads, are still much needed.
There has been relatively little algorithmic and software development for function prediction in metagenomics, though such tools are extremely important and will be needed for understanding the function of a community. (This is especially so given the presence of so many protein families of unknown function in the microbial communities studied .) Function prediction on metagenomic data is generally more difficult than for genomic studies, because short reads are more difficult to annotate, and some information (such as the gene neighborhood in a genome) that is very important for function prediction in sequence data from individual genomes may not be as useful when working with metagenomic data. Existing functional annotation pipelines (developed for annotating genomic sequences derived from traditional genome projects) have been used extensively for the functional annotation of metagenomic sequences. A common practice is to predict COG families, KEGG families , FIG families , and other functional categories for metagenomic short reads based on BLAST search results. Currently, BLAST is the only readily available and established similarity tool that can handle comprehensive analyses of a metagenomic dataset (with the support of a big computer cluster with hundreds or thousands of CPUs). Very often, an E-value cutoff (e.g., 1e-5) has been adopted for deciding whether a read has a certain function [7, 50]. Biological pathways (or subsystems) can also be reconstructed for a metagenome, building on the protein family predictions. (In most studies, the information obtained is on pooled biological pathways from various species in the same microbial community.) MG-RAST is an automatic server for subsystem annotation of metagenomic datasets, based on an extension of the very successful microbial genome annotation server RAST .
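The common BLAST-plus-E-value-cutoff practice described above can be sketched as follows; `blast_hits` here is a hypothetical, already-parsed stand-in for tabular BLAST output, not a real parser for any particular BLAST format:

```python
def annotate_reads(blast_hits, evalue_cutoff=1e-5):
    """Assign each read the function of its best BLAST hit, provided the hit
    passes an E-value cutoff (e.g., 1e-5, as commonly used in the literature).

    blast_hits maps read id -> list of (function_label, evalue) pairs.
    Reads with no hit below the cutoff are left unannotated.
    """
    annotations = {}
    for read, hits in blast_hits.items():
        passing = [(evalue, func) for func, evalue in hits if evalue <= evalue_cutoff]
        if passing:
            annotations[read] = min(passing)[1]  # lowest E-value wins
    return annotations
```

The cutoff choice is a trade-off: a stricter cutoff reduces false annotations but leaves more short reads (whose alignments are inherently weaker) without any functional label.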
Pooled biological pathways provide key insights into the functional/metabolic ability of a microbial community (e.g., the functional profiling by KEGG pathways in ), although in most metagenomic studies, many details, such as the contribution of individual species to the total metabolic ability of a microbial community, are not well defined.
In addition to mapping reads onto known protein families, discovering how many protein families are encoded in a metagenome (including known and unknown families) is important for understanding the functionality and functional diversity of a community, and also for expanding the protein (family) universe. For example, the Sorcerer II Global Ocean Sampling (GOS) expedition expanded the known protein space by more than doubling its size and adding thousands of new families . Li et al  developed an approach for rapid analysis of the sequence diversity of very large metagenomic datasets using a clustering approach powered by the CD-HIT algorithm . Similarity search (by BLAST or by a more sensitive similarity search tool) helps predict the function of some of these protein families; yet the functions of numerous small, rare protein families remain to be discovered. Though we have witnessed the successful application of non-homology-based methods to metagenomic sequences (e.g., gene neighborhood methods customized to metagenomics ), many of the existing non-homology-based function prediction methods  cannot be applied to metagenomic sequences. The community awaits new approaches that take advantage of information provided through metagenomic projects but not available in traditional genomic projects.
Just as with comparative genomics, comparative metagenomics is important for functional and evolutionary studies of microbial communities living in different environments, with the hope that fingerprints reflecting known characteristics of the sampled environments can be extracted for interpreting and diagnosing environments . Comparisons have been made among microbial communities collected from different environments (e.g., terrestrial and marine ) and different hosts (e.g., lean mouse versus obese mouse ). Comparative metagenomics studies have revealed significant variation in microbial composition , sequence composition , genome size , evolutionary rates , and metabolic capabilities [50, 59].
UniFrac has been a very popular tool for comparing communities based on the lineages they contain ever since it was developed in 2005 , partially because of its simple yet useful distance measure between two communities. Given the sets of taxa for the communities to be compared, UniFrac calculates the phylogenetic distance between two communities (i.e., between their corresponding sets of taxa) as the fraction of the branch length of the phylogenetic tree (built from all the taxa of both communities) that leads to taxa from either one set, but not both. The UniFrac metric can be computed for all pairwise combinations of communities from different environments to make a distance matrix, which can then be used with multivariate statistics, such as clustering and principal coordinate analysis, to provide an overview of the overall phylogenetic relationships among the communities. Further analyses often include intuitive graphical representations of the relationships among the communities from different environments based on their taxonomic distributions (such as a tree from clustering analysis, or a 2D graph showing the first two principal components in a PCA analysis). UniFrac was first implemented as a tool considering only the qualitative information of the lineage distribution of a community, and later incorporated a weighting scheme to account for the quantitative information of the taxon distribution (so that more abundant taxa are weighted more).
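The unweighted UniFrac measure described above can be sketched directly from its definition. For illustration only, the phylogeny here is represented as a flat list of branches, each with its length and the set of leaf taxa below it (a real implementation would traverse an actual tree object):

```python
def unifrac(branches, community_a, community_b):
    """Unweighted UniFrac: the fraction of (relevant) tree branch length that
    leads only to taxa from one community but not the other.

    branches: list of (branch_length, set_of_leaves_below_branch) pairs.
    """
    a, b = set(community_a), set(community_b)
    unique = total = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & a)
        in_b = bool(leaves & b)
        if in_a or in_b:                 # branch is relevant to at least one community
            total += length
            if in_a != in_b:             # leads to taxa from exactly one community
                unique += length
    return unique / total if total else 0.0
```

Two identical communities give a distance of 0, while communities sharing no lineages give 1; computing this for all pairs of samples yields the distance matrix fed into clustering or principal coordinate analysis.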
MEGAN is another tool that provides visual and statistical comparison of metagenomes based on the lineages they contain. MEGAN was first developed for studying the taxonomic distribution of single samples based on BLAST search results for metagenomic sequences (metagenomic reads are mapped to the tree of known species based on the BLAST top hits). MEGAN 2.0 further incorporated comparative analysis of functionality, and MEGAN 3.0 also included a statistical analysis component for pairwise comparison, as discussed below [61, 62]. MEGAN can provide a quick glance at the similarity between multiple datasets (if BLAST search results for all the metagenomic datasets are given), and can highlight taxa with statistically different numbers of assigned reads between two datasets.
Both UniFrac and MEGAN enable comparisons of microbial communities based on the lineages they contain. Microbial communities can also be compared based on other types of information, such as the functions encoded by their metagenomes. Functional profiling of metagenomes is often used in metagenomic studies, and metagenomes (and their corresponding environments) can be compared based on their functional profiles [16, 50], or compared by MG-DOTUR as shown in .
There is no doubt that statistics should be heavily involved in analyzing and interpreting metagenomic data, considering that both sampling and sequencing in metagenomic surveys can largely be viewed as, at best, a stochastic process (i.e., assuming no systematic bias is introduced). If one observes some apparent difference between communities, one should determine whether that difference is statistically significant. In addition, many metagenomic projects have adopted NGS techniques (such as the 454 sequencers early in the field's development, and the current Illumina/Solexa sequencers) that produce ultra-short reads. These short reads may introduce additional complexity into the interpretation of metagenomic sequences, because short reads are more difficult to annotate . Another problem that obviously needs statistical attention is the estimation of species diversity based on one or many observations of a microbial community (i.e., a dataset of metagenomic shotgun sequences, or a set of 16S rRNAs). Without rigorous statistical analysis, the significance of many conclusions from metagenomic analysis will be discounted. Even a simple task like estimating the relative abundance of different protein-coding gene families in a community is not so straightforward, as we show in the Artifacts section.
Since exhaustive inventories of microbial communities would be impractical or too expensive, the actual diversity of microbial communities must instead be estimated from metagenomic samples. Nevertheless, it is important to know how well a sample reflects a community’s “true” diversity (“counting the uncountable” ). One class of methods is extrapolation, which fits the observed accumulation curve to an assumed functional form that models the process of observing new species as sampling effort increases. The species richness expected at infinite effort can then be projected (i.e., as the asymptote of the curve). Parametric estimators are another class of estimation methods; they estimate the number of unobserved species in the community by fitting sample data to models of relative species abundance. Thus, in these methods, an assumption about the true abundance distribution of a community is made. Power-law and exponential functions are the basic functional forms of relative abundance distribution curves observed for biological populations, and these two forms have been tested against the observed distribution of q-contigs (i.e., groups of q overlapping sequences) to model viral community structure . Nonparametric estimators have also been used in metagenomics studies, including Chao1 and the abundance-based coverage estimator (ACE), which add a correction factor to the observed number of species  or protein families . (Chao1 uses singletons and doubletons, whereas ACE incorporates data from all species with fewer than 10 individuals.)
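The Chao1 estimator mentioned above is simple enough to state in a few lines. This sketch uses the bias-corrected form (an assumption on our part; the classic form is S_obs + F1^2 / (2 F2)), which also avoids division by zero when no doubletons are observed:

```python
def chao1(abundances):
    """Bias-corrected Chao1 richness estimate from a list of per-species counts.

    S_obs is the number of observed species; F1 and F2 are the numbers of
    species seen exactly once (singletons) and exactly twice (doubletons).
    Estimate: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).
    """
    s_obs = len(abundances)
    f1 = sum(1 for a in abundances if a == 1)
    f2 = sum(1 for a in abundances if a == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
```

Intuitively, many singletons relative to doubletons indicate that sampling is shallow and many species remain unseen, so the correction term grows; with no singletons the estimate reduces to the observed richness.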
PHACCS is an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information . PHACCS builds models of possible community structure using a modified Lander-Waterman algorithm to predict the underlying contig spectrum (a vector containing the number of q-contigs from the assembly of shotgun reads), and finds the most appropriate structure model by optimizing the model parameters until the predicted contig spectrum is as close as possible to the one derived experimentally. The resulting structure model provides the basis for estimating the richness, evenness, diversity index, and abundance of the most abundant genotype of an uncultured viral community.
Statistical techniques are also needed for comparing the structure of microbial communities based on the incomplete observations/characterizations of microbial communities obtained through metagenomic studies. Phylogeny-based statistical tools for comparing community structures (i.e., what the members are and how they are related to each other) include integral-LIBSHUFF, TreeClimber, UniFrac (see Subsection 2.1.4), analysis of molecular variance (AMOVA, which determines whether the genetic diversity within two or more communities is greater than their pooled genetic diversity), and homogeneity of molecular variance (HOMOVA, which determines whether the amount of genetic diversity in each community is significantly different) . Tools have also been developed for community membership and structure comparison based on operational taxonomic units (OTUs). (DOTUR was developed by Schloss et al. for assigning sequences to OTUs based on the genetic distance between sequences .) SONS implements nonparametric estimators for the fraction and richness of OTUs shared between two communities . Many of these tools (including DOTUR, SONS, and TreeClimber) are now all available in a single package called mothur.
Recently, Metastats  was developed for detecting significantly different features (such as taxa, biological pathways, or gene families) between two populations (each population having multiple subjects; e.g., a lean mouse population with multiple individuals), aiming to study how two populations differ from each other. The input to Metastats is a feature (relative) abundance matrix, with rows representing different features and columns representing subjects or replicates (from the two populations). For significance estimation, Metastats uses a nonparametric t-test, implemented so that it does not assume that the underlying distribution is normal. For low-frequency features, Metastats uses Fisher’s exact test, which was demonstrated to outperform other methods for features with sparse counts.
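The spirit of such nonparametric testing can be illustrated with a generic two-sample permutation test on group means. Note this is an illustration of the idea, not the tool's exact statistic (which is a permutation-based t-statistic, and Fisher's exact test for sparse features):

```python
import random

def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    The p-value is the fraction of label shufflings whose absolute mean
    difference is at least as extreme as the observed one; no normality
    assumption is made about the feature abundances.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p > 0
```

Applied per feature (per taxon or gene family), this yields p-values that would then need multiple-testing correction, since a metagenomic comparison typically tests hundreds of features at once.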
Note that different statistical techniques are used for different tests, and all have their limitations. For example, Schloss has conducted a comparison of different statistical techniques and concluded that different techniques need to be employed correctly for different testing purposes .
Metagenomics has enabled us to look into the hidden microbial world, and to ask important questions such as how microbes adapt to the environment they live in, and how microbes co-evolve with their hosts (or reshape the environment). Studying host-bacterial coevolution and microbe-environment interactions has both fundamental and practical value. (For example, these studies help to discover new drug targets  and biosensors .) Gianoulis et al. introduced a new concept, the metabolic footprint, to describe an ensemble of weighted biological pathways that maximally covaries with a combination of environmental variables (e.g., temperature and sample depth), by employing canonical correlation analysis and related techniques (which model many-to-many relationships) . Using available aquatic datasets, they identified footprints predictive of aquatic environments that can potentially be used as biosensors. Turning to host-associated microbes, Ley et al.  applied a network-based analysis to bacterial 16S ribosomal RNA gene sequences from the fecal microbiota of humans and 59 other mammalian species living in two zoos and in the wild. The 16S rRNAs were used to define OTUs of the bacterial species. In the network, the nodes are the OTUs (representing bacterial species) and the 60 mammalian hosts; the edges connect OTUs to the hosts in which their sequences were found. Probing the coevolution of mammals and their indigenous microbial communities, their analysis (supported by Cytoscape ; http://www.cytoscape.org/) indicates that host diet and phylogeny both influence bacterial diversity; that bacterial communities codiversified with their hosts; and that the gut microbiota of humans living a modern lifestyle is typical of omnivorous primates.
Superficially, there is no difference at the nucleotide level between DNA sequence data from conventional genomic projects and from metagenomic sequencing projects: both are just strings of A, T, C, and G. But metagenomic data differ from conventional genomic data in their patterns and long-range sequence organization. In a conventional genome project, whose goal is to recover the complete genome sequence of a single organism, only one genome is represented and the organization of its DNA is known. In contrast, metagenomic sequencing projects randomly sample sequence data from the aggregate members of the community (yielding the sample’s metagenome). We have no clear model for the organization of the DNA in any given organism, nor do we know how many different organisms are contained in the sequenced sample. Even the most extensive sequencing of a specific sample (as in the Global Ocean Sampling [4, 75]) will provide only a partial sampling of the DNA in a given environment. Less abundant, and particularly rare, species will yield only incomplete information from their DNA fragments. (One cannot obtain a complete genome, or even contigs of reasonable size, for those species.) Thus, metagenomic sequence data mining is challenging because of the large scale and complexity of the metagenomic sequence, which means one can easily make mistakes in solving problems that look simple at first glance (e.g., the estimation of gene family frequencies for a microbial community). Below we discuss several artifacts discovered in metagenomic studies that could affect the conclusions drawn from those studies, and the new computational and statistical tools developed to reduce the impact of these artifacts.
The application of barcoded pyrosequencing can produce large 16S rRNA datasets containing hundreds of thousands of 16S rRNAs, enabling deep views into hundreds of microbial communities simultaneously . Chimeras among the sequenced 16S rRNAs can affect estimates of species diversity within a microbial community sample if they are not properly filtered out in advance. (A chimeric sequence usually comprises two phylogenetically distinct parent sequences; in rare cases, it may derive from more than two parents.) Chimeras are well known in community profiling with 16S rRNA: they occur when a prematurely terminated amplicon reanneals to a foreign DNA strand and is copied to completion in the following PCR cycles. Ashelford et al. reported that at least one in twenty 16S rRNA sequences held in public repositories is estimated to contain substantial anomalies . Improved experimental techniques (protocols), e.g., the development of emulsion PCR , serve to minimize the generation of chimeras. Computational techniques have also been developed for removing chimeric 16S rRNA data, including Bellerophon , Pintail , and Mallard ; most such tools are based on comparison with reference 16S rRNAs. The scalability of these tools, though, needs to be improved so that they can handle the large 16S rRNA datasets derived from barcoded pyrosequencing.
A recent communication by Gomez-Alvarez et al  reports a systematic error, introduced by artificial replicates, in metagenomes generated by 454 pyrosequencing, which may cause incorrect estimation of gene and taxon abundance. They discovered that between 11% and 35% of sequences in a typical metagenome are artificial replicates, which need to be removed before further analysis. In most metagenomic studies, only the exact duplicates are removed, which, however, accounts for only 3–18% of the artificial replicates. Gomez-Alvarez et al implemented a pipeline for removing these artifacts by identifying not only exact duplicates, but also other types of artificial replicates, i.e., reads that begin at the same position but vary in length or contain a sequencing discrepancy. (A simple statistical analysis shows that the probability of multiple reads starting at the same position by chance, given independent sampling with replacement, is very low.) They also showed that failure to remove replicated sequences can lead to incorrect conclusions. For example, the fraction of genes identified as being involved in denitrification did not differ significantly between agricultural and forested sites when only exact duplicates were removed, but a statistically significant difference was detected after removing all replicate reads.
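The replicate-removal idea can be sketched as follows. This is a simplification: the `prefix_len` parameter and longest-read tie-breaking are illustrative assumptions, and the published pipeline uses clustering so that replicates containing sequencing discrepancies are also caught:

```python
def remove_replicates(reads, prefix_len=20):
    """Collapse putative artificial replicates from a pyrosequencing run.

    Reads sharing an identical prefix are treated as replicates of the same
    template ("reads beginning at the same position but varying in length"),
    and only the longest representative of each group is kept.
    """
    best = {}
    for read in reads:
        key = read[:prefix_len]
        if key not in best or len(read) > len(best[key]):
            best[key] = read
    return list(best.values())
```

Because independent shotgun reads almost never start at exactly the same template position by chance, shared prefixes are strong evidence of an amplification artifact rather than genuine coverage.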
Computing gene family frequencies is not as straightforward as it seems, as Sharon et al. showed when they addressed such calculations . We use an extreme (yet simple) example here to explain the problem. Suppose functional annotation exists for all of the reads in a metagenomic dataset, and suppose 1000 and 500 reads are found with functions A and B, respectively. It would seem obvious that function A is more abundant than function B in the corresponding microbial community, and based on this simple read counting, one might guess that function A is likely to be more important than function B in this environment, which unfortunately could be wrong. The hidden information that was ignored is that proteins with function A may be significantly longer than proteins with function B. Even if there are exactly the same number of copies of the genes encoding functions A and B in the environment, randomly sampling and sequencing the community DNA will result in a significantly higher number of reads encoding function A (which is longer) than function B. As this example shows, even a very simple analysis, such as a functional distribution study, could go wrong if we are not careful with the interpretation. Sharon et al. proposed a statistical framework for assessing gene family frequencies that accounts for the different lengths of various proteins . Their statistical model is based on the Lander-Waterman model , proposed in 1988, which describes the shotgun sequencing of a genome as a Poisson process. By applying this statistical model, the gene family frequencies (estimated by the read-count approach) are adjusted, with frequencies of short gene families increased and frequencies of long gene families decreased.
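The length correction can be sketched with a simplified model. This is not the exact Sharon et al. estimator; the per-family effective length used here (gene length plus read length minus one, the number of start positions from which a read would overlap the gene) is an illustrative assumption under uniform sampling:

```python
def adjusted_frequencies(read_counts, gene_lengths, read_len=100):
    """Length-aware gene family frequency estimate.

    Under a uniform (Lander-Waterman-style) sampling model, the expected read
    count of a gene grows with its length, so raw counts are divided by the
    effective number of read start positions before normalizing.

    read_counts: family -> number of reads assigned to the family.
    gene_lengths: family -> typical gene length in bp.
    """
    raw = {}
    for family, count in read_counts.items():
        effective_length = gene_lengths[family] + read_len - 1
        raw[family] = count / effective_length
    total = sum(raw.values())
    return {family: value / total for family, value in raw.items()}
```

In the example from the text, if function A's genes are about twice as long as function B's, the 2:1 raw read counts collapse to equal adjusted frequencies, recovering the true equal copy numbers.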
Since biological pathways are important contexts for understanding the functionality of a microbial community, biological-pathway-based analyses are frequently attempted in metagenomic studies. The random sampling of metagenomic sequences from community genomes will certainly leave gaps in the reconstructed pathways for any metagenome. The “optimistic” way to handle this situation is to consider a pathway to be encoded by the metagenome whenever a functional role (such as an enzyme) can be found in the metagenomic sequences, regardless of pathway gaps or holes (missing enzymes). Such an optimistic pathway reconstruction by a naïve mapping strategy, as termed by the authors , may leave us with many artificial pathways because of the redundant nature of the mapping from functional roles to pathways; that is, some functional roles are involved in multiple pathways. The entire cellular network is partitioned into several hundred biological pathway entities (such as those documented in the KEGG database), which is extremely important for an understanding of biological processes, even though all pathways are to some extent connected and, overall, there is a single, integrated network within any cell . Thus, it is not surprising that many pathways defined in the pathway databases overlap. Also, some proteins carry out multiple biological functions  through, for example, different protein domains, active sites, or substrate specificities. Ignoring this pathway redundancy may also leave us with artificial pathways. Examples of potential artificial pathways reconstructed from metagenomes include the inositol metabolism pathway, the androgen and estrogen metabolism pathway, and the caffeine metabolism pathway annotated for a coral biome , all clearly implausible for this particular biome.
A new computational tool, MinPath, was proposed to reduce these artificial pathways based on a parsimony approach . The goal of MinPath is to find the minimal set of pathways that can explain all annotated functions, instead of reporting every pathway with at least one function mapped to the sequences. For all of the datasets we tested, MinPath reduced the total number of annotated pathways (or subsystems) significantly. For example, for the metagenome sampled from a coral microbial community, a total of 232 KEGG biological pathways are annotated in at least one of its 7 sequencing datasets, but according to MinPath only 160 KEGG biological pathways are sufficient to explain all of the functions predicted for these datasets. These results indicate that the naïve mapping of biological pathways from predicted functions may overestimate the pathway repertoire (and, in turn, the functional diversity) of a microbial community, and one needs to be cautious when interpreting the results of such an analysis.
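The minimal-pathway idea is an instance of the classic set-cover problem. A greedy approximation conveys the principle; this is only a sketch, not the MinPath implementation (which solves the parsimony problem with integer programming), and the pathway names and function identifiers below are hypothetical.

```python
def minimal_pathways(pathway_to_functions, observed_functions):
    """Greedy set cover: repeatedly pick the pathway that explains the
    most still-unexplained functions, until every observed function is
    covered (or no pathway can explain the remainder)."""
    uncovered = set(observed_functions)
    chosen = []
    while uncovered:
        best = max(pathway_to_functions,
                   key=lambda p: len(pathway_to_functions[p] & uncovered))
        gained = pathway_to_functions[best] & uncovered
        if not gained:  # remaining functions map to no known pathway
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical mapping: functions K1 and K2 both occur in "glycolysis";
# naive mapping would also report "caffeine metabolism" because it
# shares K2, but parsimony needs only one pathway.
pathways = {"glycolysis": {"K1", "K2"}, "caffeine metabolism": {"K2"}}
print(minimal_pathways(pathways, {"K1", "K2"}))  # ['glycolysis']
```

Greedy set cover is a well-known approximation; an exact solver can find a strictly smaller pathway set on some inputs, which is why MinPath formulates the problem as an optimization rather than a greedy pass.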
The growth of public DNA sequence data over the last two decades has been exponential, with a doubling time of about 14 months; the doubling time appears likely to be substantially shortened due to the addition of metagenomic data , as well as the impact from NGS approaches. The preliminary data from the Global Ocean Survey (GOS), with 6,109,770 ORFs, more than quadrupled the number of known existing (predicted) protein-coding ORFs (data from CAMERA database ; http://camera.calit2.net/). Similarly, as of Aug 25, 2009, the IMG/M system (http://img.jgi.doe.gov/cgi-bin/m/main.cgi) had collected 67 metagenomes, and the GOLD genomes online database (http://www.genomesonline.org/) listed a total of 168 completed or ongoing metagenomic projects. The increasingly longer read lengths from 454 runs and the rapidly falling cost of sequencing will expedite this process. Note that recent second-generation sequencing techniques, such as the Solexa/Illumina sequencer, still generate extremely short reads. The massive metagenomic data poses great challenges in many areas involving data management and data mining.
Faster and more scalable tools have recently begun to appear to meet the requirements of metagenomic projects. The developments include a faster similarity search tool, FastBLAST ; chimera removal tools that can handle 16S rRNA datasets from barcoded pyrosequencing; and a clustering method that can handle large-scale data, ESPRIT . We expect further development of efficient and powerful computational tools that can deal with (or, even better, take advantage of) the huge amount of metagenomic sequence. These tools will also need to cope with the high species complexity of metagenomic datasets.
Several recent studies have targeted the expressed genetic information and proteins of microbial communities, aiming to reveal the regulation and dynamics of genes in an environment. Pyrosequencing of cDNA prepared from ocean water samples revealed a large number of microbial small RNAs, many of which had not been seen before . A shotgun, mass spectrometry-based whole-community proteomics approach yielded a metaproteome of thousands of proteins from human fecal samples, whose distribution was skewed relative to what had earlier been predicted from metagenomics . There is no doubt that metaproteomic and metatranscriptomic studies, together with metagenomic studies, offer an unprecedented opportunity to explore both the organization and function of microbial communities. Integrating the various datasets will be challenging, however, since direct cross-references between them are often lacking. A research group may examine only one functional aspect of a microbial community, such as gene expression, but not all; a direct result is a knowledge gap between different projects. (For example, only ~50% of the transcripts from an ocean sample are highly similar to genes previously detected in ocean metagenomic surveys .) Nevertheless, it is clear that there will be a significant reward for integrating all of these different pieces of information: a more comprehensive picture of the organization and function of microbial communities.
Metagenomics (and related research) provides us with enormous opportunities to dig into the hidden world of microbes and their relationships with biological hosts and/or physical environments. Perhaps the only thing that might slow our scientific march is our imagination, since producing genomic sequences from microbial communities is no longer a stumbling block on our journey to revealing the secrets of our microbial planet.
The authors would like to acknowledge the support from NIH grant 1R01HG004908-01 and NSF grant DBI-0845685 (YY), and from the Gordon and Betty Moore Foundation for the Community Cyberinfrastructure for Marine Microbial Ecological Research and Analysis (CAMERA) project (JW).
John Wooley is Associate Vice Chancellor of Research at UCSD. He now largely focuses on structural genomics (SG) and metagenomics (MG), along with various bioinformatics and computational methods for probing SG and MG data and for community engagement via web services technology.
Yuzhen Ye is an Assistant Professor at the School of Informatics and Computing, Indiana University, Bloomington. Ye received a Ph.D. in computational biology from the Shanghai Institute of Biochemistry, Chinese Academy of Sciences, in 2001. Ye’s research interests are in the areas of bioinformatics (especially structural bioinformatics) and computational metagenomics. Ye received an NIH grant for developing computational tools for the Human Microbiome Project in 2008, and an NSF CAREER Award in 2009.