Metagenomes are analyzed through simultaneous sequencing of all species in a microbial community without prior cultivation under laboratory conditions. The result is usually a large collection of sequencing reads from many species, and the phylogenetic origin of each read is unknown. A major goal in all metagenomic studies is the identification of potential protein functions and metabolic pathways. Reliable gene predictions are the basis for correct functional annotation and for the discovery of new genes and their functions.
Several gene prediction methods have been developed for the ab initio identification of protein coding genes in complete microbial genomes (e.g. GLIMMER and GeneMark [1]). These methods require an initial training phase on data from the target genome or on the genome of a closely related species. Such conventional gene finders can in principle be applied to metagenomic data, provided that single sequencing reads can be assembled into longer contigs that supply sufficient training data. The applicability of conventional gene finders to metagenomic contigs can be improved by binning contigs and reads into separate phylogenetic scaffolds, e.g. by their oligonucleotide signature [3]. However, the assembly of metagenomic sequencing reads is problematic. Mavromatis et al. (2007) demonstrated on artificial metagenomes that assembly quality depends strongly on the sequencing coverage of single species within the metagenome [4]. They also showed that short contigs are at high risk of chimerism, i.e. a read from species A is joined with a read from species B, which limits the use of contigs for further analysis. Some proportion of most metagenomes remains in single unassigned sequencing reads after assembly and binning, and in some cases metagenome assembly fails completely, e.g. for the hypersaline microbial mat metagenome [5]. For this reason, the ability to predict genes in single, anonymous sequencing reads is essential to fully explore a metagenome.
This problem can be addressed by two strategies. One possibility is the identification of protein coding regions through sequence similarity, for example by conducting a BLAST search [6] with metagenomic sequences against a database of known proteins. Annotation success is then limited to already known genes and their close relatives. This limitation is particularly prominent for viral sequences, which are poorly represented in databases [7]. Clustering of open reading frames (ORFs) in principle enables sequence similarity based methods to identify novel genes that are conserved within the metagenomic sample [10]. Considering the size of most metagenomes, computational cost is a limiting factor for these methods.
A different strategy is based on gene prediction with statistical models. GeneMark with heuristic models [12], MetaGene [13], Orphelia [14] and MetaGeneAnnotator [16] fall into the category of model-based metagenomic gene prediction tools. The common advantage of these tools is the capability to predict known and novel genes at a lower computational cost. Their mostly unexplored disadvantage is their susceptibility to sequencing errors, for which methods based on sequence similarity may automatically compensate to a certain extent.
The possible effect of sequencing errors on model-based metagenomic gene prediction depends on the actual error rates. The two major sequencing techniques commonly used in metagenomics differ in sequencing accuracy. Chain termination (Sanger) sequencing [17] was the first method to be used for metagenome sequencing. It produces an average read length of ~700 nucleotides (nt). The error rates reported for Sanger sequencing vary from 0.001% [13] to more than 1% [19] and seem to depend on the software used for post-processing of reads. Pyrosequencing, also known as "454 sequencing", produces shorter reads [21]. Initially, read length was about 110 nt; it has since increased to more than 450 nt. Huse et al. (2007) reported an error rate of 0.49% for reads of length 100-200 nt [23], whereas the read simulation software MetaSim [20] produces reads with an error rate of 2.8% when its parameters are adjusted according to an original 454 publication [22]. Pyrosequencing is still under active development, and a further increase in read length can be expected in the near future.
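To put the reported per-base error rates in perspective, the following minimal sketch estimates how many errors a single read is expected to carry and how likely it is to contain at least one error. It assumes that errors occur independently and with the same probability at every position, which is a simplification. The function names and the pairing of each error rate with a particular read length are illustrative assumptions, not values taken from the cited studies.

```python
def expected_errors(read_length, error_rate):
    """Expected number of erroneous positions in one read
    (assumes a uniform per-base error rate)."""
    return read_length * error_rate


def prob_at_least_one_error(read_length, error_rate):
    """Probability that a read contains one or more errors
    (assumes independent errors at every position)."""
    return 1.0 - (1.0 - error_rate) ** read_length


# (label, assumed read length in nt, per-base error rate as a fraction)
# The read lengths paired with each rate are assumptions for illustration.
SCENARIOS = [
    ("Sanger, 0.001%", 700, 0.00001),
    ("Sanger, 1%", 700, 0.01),
    ("454, 0.49%", 150, 0.0049),
    ("454, 2.8%", 110, 0.028),
]

if __name__ == "__main__":
    for label, length, rate in SCENARIOS:
        exp = expected_errors(length, rate)
        prob = prob_at_least_one_error(length, rate)
        print(f"{label:>15}: {exp:6.3f} expected errors, "
              f"P(>=1 error) = {prob:.3f}")
```

Under these assumptions, the same tool can face almost error-free reads or reads that nearly always contain an error, depending on which reported rate applies.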
For all techniques, sequencing accuracy is high at the beginning of a read and decreases along the read. Three error types can occur: (1) substitution errors, in which a wrong nucleotide is read out, (2) deletion errors, in which one or more nucleotides are omitted, and (3) insertion errors, in which one or more nucleotides are falsely added to the sequence during the reading process.
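The three error types can be illustrated with a small simulation. This is a hypothetical sketch and not the error model of MetaSim or of any sequencing platform: it uses constant per-position rates, introduces only single-nucleotide events, and the function `introduce_errors` and its parameters are names chosen here for illustration.

```python
import random

BASES = "ACGT"


def introduce_errors(seq, sub_rate, ins_rate, del_rate, rng):
    """Return a copy of seq with randomly introduced single-nucleotide
    substitutions, insertions and deletions (constant rates per position)."""
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            pass  # deletion error: the nucleotide is omitted
        elif r < del_rate + sub_rate:
            # substitution error: a different nucleotide is read out
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)  # nucleotide read correctly
        if rng.random() < ins_rate:
            # insertion error: an extra nucleotide is added after this position
            out.append(rng.choice(BASES))
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducibility
    read = "ATGGCTATAACCGGTCGTAAAGCT"
    noisy = introduce_errors(read, sub_rate=0.05, ins_rate=0.02,
                             del_rate=0.02, rng=rng)
    print(read)
    print(noisy)
```

A more realistic model would let the rates grow with the position in the read, in line with the decreasing accuracy described above.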
All statistical gene prediction tools utilize codon usage as an important feature to identify protein coding genes. If a nucleotide is deleted from or inserted into the sequence, the reading frame is shifted. Methods that do not compensate for frame shifts cannot predict affected genes accurately. A substitution error affects only one codon, so its influence on gene prediction accuracy is generally smaller. All types of errors may also result in additional stop codons. False stop codons may have an even more severe effect on gene prediction than a frame shift because they will definitely terminate a predicted gene.
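The difference between the error types can be made concrete on a toy sequence. The sketch below is an illustration only: the 24 nt sequence is invented for this example, and the helper names (`codons`, `first_stop_index`, `delete_at`, `substitute_at`) are hypothetical. It shows that a single deletion changes the codons downstream of the error and can expose a stop codon, while a substitution changes exactly one codon, which may itself become a stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Invented in-frame fragment without any stop codon in frame 0.
ORIGINAL = "ATG" "GCT" "ATA" "ACC" "GGT" "CGT" "AAA" "GCT"


def codons(seq, frame=0):
    """Split seq into complete codons, starting at the given frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def first_stop_index(seq, frame=0):
    """Index of the first in-frame stop codon, or None if there is none."""
    for idx, codon in enumerate(codons(seq, frame)):
        if codon in STOP_CODONS:
            return idx
    return None


def delete_at(seq, pos):
    """Simulate a deletion error at position pos (0-based)."""
    return seq[:pos] + seq[pos + 1:]


def substitute_at(seq, pos, base):
    """Simulate a substitution error at position pos (0-based)."""
    return seq[:pos] + base + seq[pos + 1:]


if __name__ == "__main__":
    variants = {
        "original": ORIGINAL,
        "deletion at position 3": delete_at(ORIGINAL, 3),
        "substitution C->A at position 4": substitute_at(ORIGINAL, 4, "A"),
        "substitution A->T at position 18": substitute_at(ORIGINAL, 18, "T"),
    }
    for label, seq in variants.items():
        print(f"{label}: {' '.join(codons(seq))} "
              f"(first stop codon index: {first_stop_index(seq)})")
```

In this toy case the deletion shifts the frame so that a stop codon appears early in the shifted reading, and one of the substitutions turns a sense codon directly into a stop codon; real reads will of course behave differently depending on their sequence.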
The robustness with respect to sequencing errors in Sanger reads has been investigated and discussed for MetaGene and Orphelia [13]; other tools have not been evaluated with regard to this property. In particular, no studies of the effect of sequencing errors in 454 reads on metagenomic gene prediction are available. Three benchmark data sets intended to facilitate the accuracy evaluation of metagenome analysis tools on real data were introduced [4], but so far metagenomic gene prediction tools have not been evaluated on these data sets.
In this work, we demonstrate the extent to which typical errors in Sanger and pyrosequencing reads affect metagenomic gene prediction. The effect strongly depends on the actual error rate. For this investigation, we use sequences simulated with MetaSim, a metagenome simulator [20]. Gene prediction quality on the metagenomic benchmark data sets is also shown and discussed. ESTScan [24], a tool for the curation of expressed sequence tags (ESTs), was trained for application to metagenomes. The gene prediction accuracy of ESTScan leads us to conclude that integrating error-compensating methods into metagenomic gene prediction tools might significantly improve their performance and, with it, metagenome annotation quality.