The processing of genomic and metagenomic sequence data has many features in common, namely, read preprocessing, assembly (including selected instances of finishing for dominant populations), and gene prediction and annotation. As mentioned above, the key difference between genomes and metagenomes is that the latter, with the exception of finishable dominant populations, have no fixed end point, i.e., one or more completed chromosomes as for microbial isolate genomes. This means that metagenomes rarely progress beyond draft assemblies and lack many of the quality assurance procedures associated with producing finished genomes. Greater care therefore needs to be taken when processing metagenomic data sets than when processing genomic data sets.
Sequence Read Preprocessing
Preprocessing of sequence reads prior to assembly, gene prediction, and annotation is a critical and largely overlooked aspect of metagenomic analysis. Preprocessing comprises the base calling of raw data coming off the sequencing machines, vector screening to remove cloning vector sequence, quality trimming to remove low-quality bases (as determined by base calling), and contaminant screening to remove verifiable sequence contaminants. Errors in each of these steps can have greater downstream consequences in metagenomes than in genomes and will be discussed in turn.
Base calling is the procedure of identifying DNA bases from the readout of a sequencing machine. There are surprisingly few choices of base callers, and the differences between them for the purposes of metagenomics are small; therefore, we make no specific recommendation among the ones described below. By far, the dominant base caller used today is phred (41). phred initiated the widespread use of probability-based quality scores, which all later base callers adopted. phred quality scores are estimates of per-base error probabilities. The quality score, q, assigned to a base is related to the estimated probability, p, of erroneously calling that base by the formula q = −10 × log10(p). Thus, a phred quality score of 20 corresponds to an error probability of 1%. Other frequently used base callers are Paracel's TraceTuner (www.paracel.com) and ABI's KB (www.appliedbiosystems.com), which behave very similarly to phred, converting raw data into base calls with associated error probabilities. In general, however, metagenomic assemblies have lower coverage than genomes, and therefore base-calling errors are more likely to propagate to the consensus. For complex communities, the majority of reads will not assemble into contigs, and base-calling errors in these unassembled reads will appear directly in the final data set.
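The conversion between quality scores and error probabilities is straightforward; the following minimal Python sketch (function names are our own, for illustration) implements the formula above:

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Estimated per-base error probability for a phred quality score q."""
    return 10 ** (-q / 10.0)

def error_prob_to_phred(p: float) -> float:
    """Phred quality score corresponding to an error probability p."""
    return -10.0 * math.log10(p)

# A phred score of 20 corresponds to a 1% chance that the base call is wrong.
assert abs(phred_to_error_prob(20) - 0.01) < 1e-12
```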
Vector screening is the process of removing cloning vector sequences from base-called sequence reads. The complete and accurate removal of cloning vector sequence is especially important in metagenomic data sets since these data sets often have large regions of very low coverage in which each read uniquely represents a part of a genome. The assembly of these data without vector trimming can produce chimeric contigs in which the vector sequence, being common to most reads, acts to draw together unrelated sequences (Fig. 3). Also, genes may be predicted on the vector sequence, introducing phantom gene families into downstream analyses (see “Gene-Centric Analysis”).
FIG. 3. Phrap assemblies visualized with the Consed (53) program. The consensus sequence is shown at the top of the display and is derived from the aligned reads shown below it. Note that the Phrap assembler uses the highest-quality base for the consensus.
A number of different tools are available for vector screening, including cross_match (www.phrap.org), LUCY (22), and vector_clip (128). Some assemblers also include vector trimming as part of a preprocessing pipeline, including PGA (http://www.paracel.com) and Arachne (13). The most commonly used tool is cross_match, which uses a modified Smith-Waterman algorithm to identify matches to vector sequences and extends them to produce optimal alignments. However, cross_match requires exact matches to vector sequences and has no expectation for the location of the vector sequence in a read. In our experience, this program frequently fails to remove vector sequences because of base-calling errors at the edges of reads, where the vector sequence is found. Another vector-trimming tool, LUCY, avoids this problem by specifying error rates as a function of sequence position. In every case that we have tested to date, LUCY results are substantially better than those achieved with cross_match. The downstream effects of improved vector screening are fewer spurious protein predictions and fewer errors in predictions of real protein-coding sequences, particularly open reading frames (ORFs) at the ends of reads (see “Gene Prediction and Annotation”).
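To illustrate why exact matching fails at error-prone read ends, the following simplified sketch (our own illustration, not the LUCY or cross_match algorithm) scans the start of a read for an approximate match to the vector/insert junction, tolerating a small number of mismatches:

```python
def trim_vector_prefix(read: str, vector: str, seed_len: int = 20,
                       max_mismatches: int = 2) -> str:
    """Trim an approximate vector match from the start of a read.

    Compares a seed from the vector's end (the vector/insert junction)
    against windows near the read start; allowing a few mismatches mimics
    tools that expect base-calling errors at read edges."""
    seed = vector[-seed_len:]
    for start in range(0, max(len(read) - seed_len, 0)):
        window = read[start:start + seed_len]
        if sum(a != b for a, b in zip(window, seed)) <= max_mismatches:
            return read[start + seed_len:]  # cut everything through the match
    return read  # no vector match found; leave the read untouched

# One base-calling error ("T" for "G") would defeat an exact string match:
vector_tail = "ACGTACGTACGTACGTACGT"
read = "ACGTACGTACTTACGTACGT" + "GGGCCCAAATTT"
print(trim_vector_prefix(read, vector_tail))  # -> GGGCCCAAATTT
```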
Most postprocessing pipelines appear to ignore the base quality scores associated with reads and contigs, and few take positional sequence depth into account as a weighting factor for consensus reliability. Low-quality data will therefore be indistinguishable to the average user from the rest of the data set and should be removed. An extreme example of a poor-quality read that inadvertently passed through to gene prediction is shown in Fig. 4. In the worst-case scenario, such phantom genes called on low-quality sequence may pass unchecked into public repositories. We recommend that quality trimming be performed after vector screening, as described above, because trimming low-quality bases first might truncate the vector sequence and impede the ability of vector-screening programs to recognize the remainder of the vector; in such cases, significant parts of the vector might remain for the next stages of the pipeline. LUCY combines vector and quality trimming in one tool.
FIG. 4. Part of the chromatogram of a low-quality read without quality trimming on which multiple nonexistent genes were predicted (bottom).
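Quality trimming itself can be sketched with a simple sliding-window rule (our own minimal example; production trimmers such as LUCY use more sophisticated, position-aware error models):

```python
def quality_trim(seq: str, quals: list, window: int = 10,
                 min_mean_q: float = 20.0) -> str:
    """Trim a read back until the mean phred quality of the terminal
    window is acceptable; Sanger reads typically degrade toward the 3' end."""
    end = len(seq)
    while end >= window:
        if sum(quals[end - window:end]) / window >= min_mean_q:
            break
        end -= 1
    return seq[:end] if end >= window else ""  # discard hopeless reads
```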
Recognition of sequence contamination of metagenomic data sets, other than vector sequence, is nontrivial. Sanger data sets from clonal organisms are routinely screened for E. coli genomic sequence because E. coli is the host of the cloning vector, and small amounts of its genome may get through plasmid purification. Pyrosequencing, which does not rely on cloning of DNA into E. coli, will not have this problem; however, other types of contamination cannot be excluded. For metagenomic data sets, host contamination screening should be considered carefully because the environment under study may have E. coli or close relatives as bona fide members of the community, and screening would therefore bias the representation of these species in the data set. Occasionally, the mislabeling of sequence plates occurs in production pipelines. Such cross-contamination between two data sets can usually be detected if one of the data sets is from an isolate, by differences in GC content or by BLAST searches. If plates from two metagenomic projects are mixed up, the contamination may be harder to detect, since neither data set is likely to be homogeneous. It is quite common that reads and even contigs are not incorporated into finished microbial genomes, and these are usually dismissed as being either low-quality or contaminant sequences. In contrast, metagenomic projects will retain high-quality contaminating reads and contigs, as they are unlikely to be distinguishable from the rest of the data set, and these may skew downstream analyses such as gene-centric analysis, depending on the degree of contamination. Presently, there is no general solution to this quandary, and suspected contaminant sequences need to be investigated on a case-by-case basis.
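When one of the two mixed data sets is an isolate with a known genomic GC content, a crude first-pass screen is easy to sketch (the 10% tolerance below is an arbitrary assumption, and candidates would still need confirmation by BLAST):

```python
def gc_content(seq: str) -> float:
    """Fraction of G+C among unambiguous bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return sum(seq.count(b) for b in "GC") / acgt if acgt else 0.0

def flag_gc_outliers(reads: dict, expected_gc: float,
                     tolerance: float = 0.10) -> list:
    """Return ids of reads whose GC content deviates from the isolate's
    expected value by more than the tolerance: contamination candidates."""
    return [rid for rid, seq in reads.items()
            if abs(gc_content(seq) - expected_gc) > tolerance]
```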
Assembly is the process of combining sequence reads into contiguous stretches of DNA, called contigs, based on sequence similarity between reads. The consensus sequence for a contig is based either on the highest-quality nucleotide in any read at each position or on majority rule, i.e., the most frequently encountered nucleotide at each position. The number of reads underlying each consensus base is called the depth or coverage. Sequencing is typically performed from both ends of an insert cloned into a plasmid vector, and such pairs of reads are called paired reads or mate pairs. Knowledge of the approximate insert size of the library facilitates the production of a more accurate assembly since mate pairs provide an external constraint to guide assembly. The presence of paired reads in two different contigs allows those contigs to be linked into a larger, noncontiguous DNA sequence called a scaffold, whose intercontig gap sizes can be estimated based on the insert size of the read pairs. For this reason, large-insert clones such as fosmids are particularly useful for improving assemblies.
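The two consensus rules are easy to contrast for a single aligned column (a toy sketch; real assemblers layer many heuristics on top of this):

```python
from collections import Counter

def consensus_majority(column):
    """Majority rule: the most frequent base in the column wins."""
    return Counter(column).most_common(1)[0][0]

def consensus_highest_quality(column, quals):
    """Phrap-style rule: the base with the highest quality score wins."""
    return max(zip(column, quals), key=lambda bq: bq[1])[0]

column, quals = ["A", "A", "G"], [15, 20, 45]
print(consensus_majority(column))          # A (two of three reads)
print(consensus_highest_quality(column))   # G (one high-quality read)
# The depth (coverage) at this consensus position is simply len(column).
```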
The major cause of misassembly in genomic projects is repetitive regions, which can be resolved in the finishing process (79). Metagenomic assemblies will also be confounded by repeats but pose additional challenges in the form of nonuniform read depth, due to nonuniform species abundance distributions, and the potential for coassembly of reads originating from different species. Therefore, not only can misassembled reads be retained in the final published data set due to the absence of finishing, but reads from more than one species can also be assembled together, producing chimeric contigs. Coassembly is more likely to happen with reads from closely related genomes, where sequence similarity is higher (we routinely observe coassembly of homologous regions of two or more strains with up to 4% nucleotide sequence divergence), but has been found between reads originating from phylogenetically distant taxa, with conserved genes serving as the focal point for misassembly. For example, a contig from a surface seawater metagenome comprised reads originating from both bacteria and archaea, as evidenced by gene calls, with the 16S rRNA gene serving as the focal point in this instance (32). A recent simulation study found that chimeras are particularly prevalent among contigs smaller than 10 kbp (94). High-complexity microbial communities lacking dominant populations rarely produced contigs larger than 10 kbp (Fig. ), prompting the recommendation that such data sets should not be assembled at all (94).
A variety of assembly programs are publicly available, including Phrap (www.phrap.org), Arachne (13), the Celera Assembler (97), PGA (http://www.paracel.com/), and CAP3 (60; for a description and history of these assemblers, we refer the reader to reference 79). Most currently available assemblers were designed to assemble individual genomes or, in some cases, genomes of polyploid eukaryotes; they were not designed to assemble metagenomes comprising multiple species with nonuniform sequence coverage, and therefore their performance on metagenomic data sets varies significantly (94). For example, the Celera Assembler does not assemble contigs with atypically high read depths (based on an expected Poisson distribution) because it interprets them as potential assembly artifacts due to the coassembly of repeats, whereas in metagenomic data they may be bona fide contigs arising from dominant populations (140). A second example is that Phrap is optimized to make maximal use of its input data using a “greedy” algorithm and will extend contigs as far as possible. This is a good approach for assembling low-coverage, nonrepetitive regions from low-quality reads, as it makes the most of the available data, particularly if the assembly will be verified by finishing, but it is not desirable for metagenomes since it is more likely to produce chimeras when the data include reads from multiple strains and species. More conservative assembly programs such as Arachne have been shown to produce smaller but more reliable contigs than Phrap (94).
A useful auxiliary approach to de novo assembly is comparative assembly, that is, aligning reads and/or contigs to a reference genome of a closely related organism. The AMOS comparative assembler was developed specifically for this purpose (112). For metagenomic data sets, comparative assembly can improve the assembly of dominant populations, since it provides a mechanism to span hypervariable regions in a composite population genome, and it is computationally much less expensive than de novo assembly (4). A major caveat of the approach, however, is that it will be useful for only a small subset of the average metagenomic data set, since reference genomes cover only a fraction, and a highly biased fraction at that, of microbial diversity (see “Postsequencing Community Composition Estimates”).
One thing is clear: there is no magic bullet for assembling metagenomic data sets, and all assemblers will make numerous errors. Ideally, therefore, every metagenomic assembly should be manually inspected for errors before public release. Assembly errors can be readily identified with visualization tools such as Consed (53) (Fig. 3), which are used to facilitate genome finishing; however, the sheer scale of most metagenomic data sets precludes manual inspection, let alone the correction of all identified assembly errors. One approach that we have taken to address this limitation is to make two or more assemblies of the same data using different assemblers (47), which helps identify misassemblies during the downstream analysis phase following gene calling. It is, however, feasible and worthwhile to resolve misassemblies in the largest contigs of a metagenomic assembly, especially contigs greater than or equal in length to fosmids, using the standard initial steps of the finishing process (79).
The final products of assembly, contigs and scaffolds, are submitted to public databases as flat text files, meaning that all information about the underlying reads is lost, including the sequencing depth and quality score of each base, the lengths of and overlaps between reads, and the quality of vector trimming. This is not ideal for two reasons. First, the quality of the contigs cannot be assessed and is not taken into consideration by tools such as BLAST. Second, meaningful polymorphisms in the data due to coassembled strains (haplotypes) (see “Analyzing Dominant Populations”) are lost because a single consensus sequence is submitted. Methods for weighting consensus accuracy and preserving polymorphism information for subsequent analyses are needed. A first step in this direction has been taken by public databases with the establishment of the Trace and Assembly archives, which archive the raw read files and assemblies associated with submitted genomic and metagenomic data sets, respectively (147). In practice, however, most users will work only with the flat text consensus data and ignore read and consensus quality unless it is presented to them in a more convenient user interface. Such interfaces are beginning to be provided by dedicated comparative genome and metagenome platforms (see Data Management).
Genome closure and finishing are commonplace for microbial isolate projects and are part of the standard processing pipeline at sequencing facilities such as JGI. For most metagenomes, finishing is not possible. However, for dominant populations within metagenome data sets that have draft-level coverage, finishing may be an option. This depends largely on the degree of microheterogeneity within the population: genome rearrangements such as insertions, deletions, and inversions will break assemblies, whereas point mutations usually will not. Even in instances where chromosomal walking along large-insert clones is used instead of shotgun sequencing, microheterogeneity can still complicate assembly (56). However, there are now several examples in the literature of complete or near-complete composite population genomes of uncultivated organisms derived from environmental sources, including Cenarchaeum symbiosum, the sole archaeal symbiont of a marine sponge (56); Kuenenia stuttgartiensis, an anaerobic ammonia-oxidizing planctomycete sequenced from a laboratory-scale bioreactor enrichment (129); a Rice Cluster 1 methanogen from an enrichment culture (40); “Candidatus Cloacamonas acidaminovorans,” the first sequenced representative of the candidate phylum WWE1, from an anaerobic digester (107); and Ferroplasma acidarmanus, one of a few dominant populations in an acid mine drainage biofilm (5). In the last case, the assembly was facilitated by the availability of an isolate genome (fer1) obtained from the same habitat. The Kuenenia, Rice Cluster 1 methanogen, and “Candidatus Cloacamonas” genomes, however, could be assembled without reference to an isolate genome because the populations were nearly clonal. We make the general observation that sequence microheterogeneity within populations often seems to reflect spatial heterogeneity within the ecosystem from which the populations were derived. Homogenized systems such as bioreactors and enrichment cultures have produced composite population genomes with very low levels of polymorphism (40), perhaps due to a higher likelihood of selective sweeps through the population curtailing genomic divergence (24). Therefore, if the goal is to assemble a complete population genome from an environmental sample, we recommend sampling ecosystems with low spatial heterogeneity if at all possible, or sampling at a finer scale to reduce the effect of spatial heterogeneity.
Gene Prediction and Annotation
Gene prediction (or gene calling) is the procedure of identifying protein- and RNA-coding sequences encoded in the sample DNA. Depending on the applicability and success of assembly, gene prediction can be done on postassembly contigs, on reads from the unassembled metagenome, or on a mixture of contigs and individual unassembled reads.
There are two main approaches to gene prediction. “Evidence-based” gene-calling methods use homology searches to identify genes similar to those observed previously. Simple BLAST comparisons against protein databases as well as tools like CRITICA (11) and Orpheus (46) use this approach. Conversely, the second approach, “ab initio” gene calling, relies on intrinsic features of the DNA sequence to discriminate between coding and noncoding regions, allowing the identification of genes without homologs in the available databases. The use of gene training sets, i.e., sets of parameters derived from known genes of the same or related organisms, can enhance the quality of the predicted genes for some of these programs (e.g., fgenesB [http://www.softberry.com]), while others are self-trained on the target sequence (Genemark [16], GLIMMER [31], and MetaGene [100]).
Pipelines that use a combination of evidence-based and “ab initio” gene calling are frequently used for complete genomes. In the first step, genes are identified based on homology searches of the sequence of interest against public databases. Hits to genes in databases are considered real genes and can be used as a training set for the ab initio gene-calling programs. Subsequently, an “ab initio” method fine-tuned for the particular genome is used to identify additional genes missed in the first step. One such pipeline, called mORFind, uses a combination of Orpheus, CRITICA, and GLIMMER.
In metagenomic sequences, genes can originate from many, frequently diverse, organisms. When dominant populations exist, their sequences can be separated from the rest of the data set (see “Binning”) and the pipeline generally used for complete genomes applied to this subset of the data. For communities, or parts of communities, that defy assembly or assemble poorly, no training is possible. In these cases, “generic” gene prediction models or models fine-tuned to the closest phylogenetic group can be used. For example, MetaGene (100) is a gene prediction program developed specifically for metagenomic data sets that uses two generic models, one for archaea and one for bacteria. Due to the fragmented nature of such data sets and the quality of the sequencing, gene prediction is further complicated by the fact that many genes are represented only by fragments, contain frameshifts, or are chimeras resulting from assembly errors. Recently, a tool that allows gene prediction despite these problems, even on short 454 reads, was reported (73), although its performance has yet to be evaluated in real applications. The method is based on similarity comparisons of the metagenomic nucleotide sequences either to the same metagenome or to other external sequences and the subsequent discrimination of conserved coding sequences from conserved noncoding sequences by synonymous substitution rates. BLAST searches are conducted at the amino acid level to provide higher sensitivity than nucleotide searches.
Both evidence-based and “ab initio” methods have been used for gene prediction in metagenomic data sets (Table ). Evidence-based gene calling has been used as the sole method of gene calling in at least one metagenomic study using Sanger reads (140) and in all metagenomic studies using unassembled pyrosequence data, due to short read lengths (Table ). Since this approach relies entirely on comparisons to existing databases, it has two major drawbacks. First, low similarity to known sequences, whether due to evolutionary distance or to the short length of metagenomic coding sequences and the presence of sequencing errors, prevents the identification of homologs. Second, novel genes without similarities are ignored completely. Despite these drawbacks, this approach has been used in several studies and can be useful for gene-centric comparisons of metagenomes, especially in cases where the size of the sequence fragments is not adequate for ab initio gene prediction, such as high-complexity metagenomes and metagenomes sequenced by high-throughput parallel pyrosequencing.
Treating all ORFs as putative genes usually produces prohibitive amounts of data, contains too much noise, and is therefore very hard to use. Methods based on features of the sequences, the sizes of the predicted ORFs, and similarity to known sequences have been used to reduce the total number of candidate coding sequences from a population of ORFs (151).
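To see why unfiltered ORF calling is noisy, consider a minimal six-frame ORF scan with a length cutoff, the most basic of these screening criteria (our own toy example, not a production gene caller; short random DNA is full of spurious ORFs, which the cutoff removes at the cost of missing short genes):

```python
STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq: str, min_len: int = 300) -> list:
    """Return ATG-to-stop ORFs of at least min_len nucleotides
    from all six reading frames of seq."""
    orfs = []
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                     # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:  # apply the length filter
                        orfs.append(strand[start:i + 3])
                    start = None                  # close the ORF
    return orfs
```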
At the JGI, we use two “ab initio” gene prediction pipelines for the analysis of metagenomic data sets. The first uses fgenesB with specific training models for sequences that can be assigned to phylogenetic groups and generic models for the unassigned sequences (Table ). The second uses Genemark, which allows gene prediction without the need for training sets or the classification of sequences. Both pipelines have proved to be quite accurate when used on simulated data sets (http://fames.jgi-psf.org). Other studies have employed GLIMMER, MetaGene, and the mORFind pipeline (Table ).
RNA genes (tRNA and rRNA) are predicted using tools such as tRNAscan (82) for tRNAs and similarity searches for rRNAs. While tRNA predictions are quite reliable, it is not uncommon for rRNA genes to be identified incompletely (with incorrect gene boundary coordinates) or missed entirely. In these instances, it is also not uncommon to see nonexistent hypothetical protein-coding genes called in the place of rRNA genes. Other types of noncoding RNA (ncRNA) genes can be detected by comparison to covariance models (55) and sequence-structure motifs (84). However, searching covariance models and motifs is computationally expensive and prohibitively slow for large metagenomic data sets. Overall, the identification of other ncRNA genes is difficult, since their sequences are not conserved and reliable “ab initio” methods are lacking even for isolate genomes. High-throughput transcript (cDNA) sequencing holds great promise for improving the accuracy of RNA gene prediction. Currently, genes encoding ncRNAs are largely excluded from downstream analyses; however, we may expect this situation to change in the coming years as transcriptomic data enrich our inventories of these genes.
There are several types of errors that a gene-calling pipeline can make. A gene can be missed completely or called on the wrong strand. Less severe mistakes include calling part of a gene correctly but failing to estimate its boundaries, or calling genes that are partly correct and partly wrong due to chimeric assemblies or frameshifts (94). The quality of gene prediction relies on the quality of read preprocessing and assembly. Gene-calling methods used on accurately assembled sequences correctly predict more than 90% of the genes included in the data set, as evidenced by studies of simulated data sets (http://fames.jgi-psf.org). This accuracy was achieved with training on generic models or with self-trained algorithms. Gene prediction on unassembled reads exhibits lower accuracy than that on contigs (~70% versus >80%, respectively) (94), a result attributed to the small size of, and greater chance of sequencing errors in, individual reads.
Often, even in low-complexity communities, a large number of reads belonging to less abundant organisms remain unassembled. Although the genes predicted on assembled sequences allow the metabolic reconstruction of the abundant organisms, a better representation of the metabolic capacity of the community is gained when genes from both contigs and reads are included in subsequent analyses, as the majority of the functionality may in fact be encoded in the unassembled reads (94). Therefore, it is advisable to perform gene calling on both reads and contigs. For high-complexity communities, where assembly is minimal, gene calling on unassembled reads is the only possibility.
Gene prediction is usually followed by functional annotation. Functional annotation of metagenomic data sets is very similar to genomic annotation and relies on comparisons of predicted genes to existing, previously annotated sequences. The goal is to propagate accurate annotations to correctly identified orthologs. However, there are additional complications in metagenomic data, where predicted proteins are often fragmented and lack neighborhood context. The annotation of metagenomic data produced by short-read methods, such as 454, is even more complicated, since most reads contain only fractions of proteins.
At the JGI, we use profile-to-sequence searches to identify functions. Protein sequences are compared to sequence alignment profiles from the protein family collections TIGRFAM (122), PFAM (43), and COGs (131) using RPS-BLAST (91). PFAMs allow the identification and annotation of protein domains. TIGRFAMs include models for both domains and full-length proteins, and COGs also allow the annotation of full-length proteins. Unfortunately, although PFAMs and TIGRFAMs are updated regularly, allowing the annotation of new protein families, COGs still lack such updates. As a rule, the assignment of protein function based solely on BLAST results should be avoided, mainly because of the potential for error propagation through databases (49).
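In practice, such a profile search is run from the command line; a hedged sketch of a typical invocation follows (the database path and E-value cutoff are placeholders, the profile database must first be built with makeprofiledb, and the binary may be named rpsblast or rpsblast+ depending on the BLAST+ installation):

```python
import subprocess

# Assumed inputs: "proteins.faa" holds the predicted protein sequences and
# "profiles/cog_db" is a preformatted RPS-BLAST profile database.
subprocess.run([
    "rpsblast",
    "-query", "proteins.faa",
    "-db", "profiles/cog_db",
    "-evalue", "1e-5",        # a common but arbitrary significance cutoff
    "-outfmt", "6",           # tabular hits: query, profile, identity, ...
    "-out", "profile_hits.tsv",
], check=True)
```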
In addition to annotation by homology, several methods for context-based annotation are available. These include genomic neighborhood (30), gene fusion (38), phylogenetic profiles (106), and coexpression (87). We are aware of one study that performed an adapted neighborhood analysis on metagenomic data, which, combined with homology searches, inferred specific functions for 76% of the metagenomic data sets (83% when nonspecific functions are considered) (59). It is possible that more context information will be used to predict protein function in metagenomic data in the future.
It is common practice that all gene predictions and annotations for microbial genomes are manually checked as part of informatic QC pipelines. Such manual curation is not feasible for metagenomic projects, although, as with assembly, we recommend manual curation of the larger contigs. The quality of gene calling and annotation for the majority of metagenomic data therefore rests solely on automated procedures. A recent benchmarking study using simulated metagenomic data sets suggests that there is significant room for improvement in existing gene prediction and annotation tools (94). One final note of caution: some vector-screening and -trimming programs only mask out, rather than remove, vector and low-quality sequences, resulting in runs of Ns at the ends of reads and contigs. When sequences are submitted to public databases, terminal runs of Ns are removed as part of the submission process, which can introduce systematic errors in the start-stop coordinates of any genes predicted on the untrimmed reads and contigs. Therefore, all reads and contigs should be trimmed of terminal N runs prior to gene prediction and annotation.
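Trimming terminal N runs before gene calling is a one-liner in most languages; a minimal Python sketch (the function name is ours):

```python
def trim_terminal_ns(seq: str) -> str:
    """Strip masked (N) runs from both ends so that predicted gene
    coordinates match the sequence public databases will actually store."""
    return seq.strip("Nn")

assert trim_terminal_ns("NNNATGAAACCCTGANN") == "ATGAAACCCTGA"
```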