|Home | About | Journals | Submit | Contact Us | Français|
The advances of next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding region in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. But for short reads, FragGeneScan consistently outperformed MetaGene (accuracy improved ~62% for reads of 400 bases with 1% sequencing errors, and ~18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence database.
Microbes are ubiquitous in nature and co-exist with other organisms including humans, and play a critical role in sustaining biological and environmental cycles (1–5). As such, the analysis of microbial genomes is necessary for a better understanding of the functionality of the microbes and their interactions within the microbe community as well as with the environment or the host (6). Most microbes, however, are difficult to culture—previous studies have indicated that <1% of microbes in many environments can be cultivated (4,7,8). Environmental sequencing reads can be assigned into specific genomes and then assembled for further analysis. High-complexity communities that contain diverse species in a metagenomic sequencing project (thus low coverage reads for each composite genome), however, make this problem extremely challenging. In addition, the sequencing reads generated by next-generation sequencing (NGS) techniques have sequencing error rates of up to 3%, some of which can cause frameshifts and thus make the prediction of protein-coding regions even more difficult (9,10).
Identification of genes is one of the most important and challenging problems in whole-microbial genome-sequencing projects (11–13). In metagenomics, gene finding can provide the opportunity to elucidate the activities and interactions of genes within an environmental sample, from which the metabolic and signaling pathways specific to the environment can be reconstructed and identified (14). To date, only a few methods have been developed for gene prediction in metagenomic sequences (15–19). Most commonly, genes encoded by metagenomes have been identified by using homology-based methods such as BLASTX (20,21). Homology searches against known protein databases, however, cannot be used to predict novel genes, although discovering new genes is one of the most important aspects in metagenomics research. Alternatively, sequence conservation information can be utilized for prediction of novel protein-coding genes (17,22); for example, a Ka/Ks value of ~1 for a group of similar sequences indicates that these sequences are under no selective pressure and hence unlikely to code for proteins. This way, novel families that have multiple members in a metagenomic dataset can be identified (22). The other straightforward solution to novel gene prediction in metagenomics is to use feature-based approaches such as probabilistic models to evaluate the probabilities of open reading frames (ORFs) being protein-coding regions (16,18,23), in a manner similar to conventional gene finding methods such as Glimmer and GeneMark (24–26).
Short read length and sequencing errors are two major issues that pose significant challenges to gene prediction: incomplete genes (gene fragments) are difficult to predict, and sequencing errors may cause frameshifts that further complicate gene prediction. The average length of genes in microorganisms is about 950bp (16), which is much longer than the sequencing reads generated by NGS (27,28). Different NGS methods now produce sequencing reads of various length ranging from 35bp (from Illumina/Solexa Genome Analyzers) to 400bp (from Roche/454 sequencers) and have different error profiles (27). Sanger sequencers produce reads with an error rate of up to 1%, whereas 454 sequencers produce reads with an error rate of up to 3% (9,10). Illumina Genome Analyzer may produce reads that have high mismatch rates, especially when relatively long reads are acquired (e.g. G is mistaken as T, and in later cycles A, C and G are mistaken as T) (29). In 454 sequencing reads, sequencing errors tend to occur in the homopolymer regions, resulting in frequent insertions and deletions. It has been shown that ORF-based gene prediction methods are more substantially affected by sequencing errors that cause frameshifts (9). As a consequence, programs that are currently available for gene prediction from short reads show a significant decrease in their performance as the sequencing error rate increases. For example, a low sensitivity of 26–43% was observed with sequencing error rate of 2.8% (9).
We propose a probabilistic model combining sequencing error models and codon usages (Figure 1) to improve the accuracy in predicting protein-coding regions from environmental sequences. In a study of the effect of sequencing errors on metagenomic gene prediction (9), all tested gene predictors showed poor performance on short reads that have sequencing errors. The gene predictors tested include GeneMark (26), MetaGene (16), MetaGeneAnnotator (extended from MetaGene aiming to improve gene prediction for whole genomes; 23) and Orphelia (which uses machine-learning technique with features for codon usage, di-codon usage and translation initiation sites, and shows performance comparable with MetaGene for gene prediction in short reads; 30). Taking advantage of this benchmark study, here we focus on the comparison of our program FragGeneScan with Glimmer (24,25) and MetaGene, since they are commonly used in genomics and metagenomics studies, respectively (31–35).
Compared with classical microbial gene finding methods, FragGeneScan has two unique features. The first feature is finding genes fragmented by the boundary of given input sequences such as sequencing reads. The second feature, which is more important for gene prediction from reads generated by the current NGS methods, is correcting frameshifts caused by indel errors in reads. Even though a few recently developed methods (MetaGene and Orphelia) were designed to predict genes from short reads (thus can predict gene fragments), they do not provide a solution for predicting genes with frameshifts and genes in very short reads. We thus developed FragGeneScan to predict fragmented genes and genes with frameshifts in addition to the complete genes.
FragGeneScan is built on a hidden Markov model (HMM) (36), which incorporates codon usage bias, sequencing error models and start/stop codon patterns in a unified model. Given a short read (or a complete genome), the gene prediction problem is to find the best path of hidden states (see below) that is most likely to generate the observed nucleotide sequence, which can be solved by the Viterbi algorithm. FragGeneScan reports genes if they meet the following three conditions: (i) the length of the genes is longer than 60bp, (ii) the genes start in a start state (start codon) or in a match state (internal region of genes) and (iii) the genes end in a stop state (stop codon) or in a match state (internal region of genes). Therefore, FragGeneScan can predict complete genes as well as partial (fragmented) genes without start and/or stop codons.
FragGeneScan HMM consists of two-level representations based on data abstraction (Figure 1). In order to predict genes simultaneously from both strands, FragGeneScan considers separate states representing the gene regions in the forward strand and the reverse strand of a nucleotide sequence. The model has seven super-states (denoted as shaded boxes in Figure 1) representing gene regions (i), start codons (ii) and stop codons (iii) for the forward (i–iii) and backward (v–vii) strands, and non-coding regions (iv), respectively. The states for gene regions consist of six consecutive sets of a match state, an insertion state and a deletion state, which collectively correspond to a six-periodic inhomogeneous HMM. This representation allows using different parameters for each position in a di-codon (i.e. six nucleotides). Notably, by allowing transitions between the insertion/deletion states and the match states, this model effectively detects frameshifts that are caused by indel errors in sequencing. Considering that complete genomic sequences are unlikely to contain indel errors, the transition probabilities to insertion and deletion states are set to 0 when applying FragGeneScan to gene prediction in complete genomic sequences. Each match state in the gene regions [(i) and (v) in Figure 1] uses a second-order Markov chain to model the codon usage. The state for non-coding regions is based on a first-order Markov chain. Since the probability of gene regions and non-coding regions are calculated solely based on the composition of sequences (which is consistent regardless of the read length and gene length), our method is more robust when input sequences are of different lengths (see ‘Results’ section).
FragGeneScan also incorporates the sequence patterns for each start codon (ATG, GTG and TTG) and stop codon (TAA, TAG and TGA) in the start and stop state, respectively. The stop state is modeled by a probability distribution of the stop codons estimated from the training set: P(TAG|stop)=0.54, P(TAA|stop)=0.30 and P(TGA|stop)=0.16. The model of the start states, on the other hand, takes into consideration the sequence pattern within a window surrounding the start codons. Notably, although the start codon model does not necessarily help the prediction of genes in short reads without a start codon, it can improve the gene prediction for complete genomes, as well as the short reads that contain a start codon.
Start codons in bacterial genomes are relatively difficult to predict because several putative start codons are often present around each of the real ones. In order to achieve accurate prediction of start codons, the probability of a start codon in the start states [(ii) and (v) in Figure 1] is modeled by using a positional weight matrix (PWM) over 63 nucleotides centered on a putative start codon ATG, GTG or TTG. In accordance with previous findings [A/T-rich region, Shine–Dalgarno sequence (AGGAG) (37), and triple-A downstream box] (38–40), the sequence patterns around the real start codons are different from those around false start codons that are present upstream or downstream of the real ones. We thus compute the following score for each putative start codon based on its 63nt window (i.e. 61 overlapping trinucleotides).
where trinucleotidei represents the triplet at position i, and P(trinucleotidei|PWM) is the probability of observing the trinucleotide at position i, given the PWM of triplet frequencies, which was trained by using the same complete genomic sequences for HMM parameter estimation (see below). Two Gaussian distributions, one for real start codons and the other for false start codons, were fitted to the scores computed for a collection of annotated start codons (Supplementary Figure S1). For a putative start codon in a read, the probabilities (likelihood) of observing its 63-bp window given the condition that it is a real or false start codon, denoted as P(score|real) and P(score|false), respectively, can then be estimated from the two fitted Gaussian distributions. We calculate the posterior probability of a start codon being a real one given its surrounding 63-bp window, from P(score|real) and P(score|false) by using a naïve Bayesian classifier (41).
A total of 139 complete genomes (collected from the NCBI website; Supplementary Table S1) were used to estimate parameters of second-order Markov chains for all 12 match states in the forward strand [M1–M6 in Figure 1(i)] and in the reverse strand [M’1–M’6 in Figure 1(vii)] (see Supplementary Table S2 for a summary of the training data for different states). The parameters show linear correlation with GC contents, and therefore a linear regression (Supplementary Figure S2) was applied to give estimations of parameters for various GC contents. Note that FragGeneScan does not need training for gene prediction in individual genomes or datasets of short reads. Given a dataset of short reads, FragGeneScan estimates GC contents independently for each read and uses the corresponding set of pre-computed parameters based on the GC content for gene prediction in that read.
The parameters of emission and transition probabilities for insertion and deletion states were estimated for different sequencing methods with different error rates. The current version of FragGeneScan contains different sets of parameters for Sanger, 454 pyrosequencing and Illumina sequencing. Since the error rates directly affect the transition probabilities from match states to insertion/deletion states, we estimated parameters for four sequencing error rates: 0.5% and 1% for Sanger and Illumina sequencing, and 1% and 3% for 454 pyrosequencing, respectively. The sequencing reads used in the estimation were generated using MetaSim (10). If emission and transition probabilities of HMM are needed for error rates different from what we provided, they can be easily obtained and combined into existing models, which are separate from the gene prediction procedure.
The computational complexity of FragGeneScan is O(n), where n is the total length of the input genomic sequences. FragGeneScan is sufficiently fast for predicting genes for genome-wide annotation and metagenomics studies, achieving gene predictions for ~2Mb/min on an Intel Xeon CPU 2GHz. The running time for all the tests shown in the paper ranges from 2min (simulated reads from the Escherichia coli genome) to 58minutes (the TS50 dataset).
A total of nine complete genomes (with various GC contents) and their annotations were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/) (Table 1). (This set of genomes does not overlap with the genomes we used for training.) To systematically test FragGeneScan, reads of various lengths (100, 200, 400 and 700bp) and with various sequencing error rates (0–3%) were simulated from these genomes using MetaSim (10). For each genome, up to 1-fold coverage of reads was sampled for each read length and sequencing error rate. Based on the current estimation of sequencing error rates (10), Sanger sequencing reads of 700bp were simulated with the error rates ranging from 0% to 1%, and 454 sequencing reads were simulated with the error rates ranging from 0% to 3%.
Three real metagenomes were used for gene prediction in metagenomic sequences (Supplementary Table S3). Two real metagenomes (TS28 and TS50) from the twin obese and lean study (14) were downloaded from the MG-RAST website (http://metagenomics.nmpdr.org). The other real metagenome (SRX007415) from the rumen microbiota response study was downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov). These three metagenomes were BLASTXed against 98% non-redundant protein sequences from prokaryotic genomes, plasmids and phages collected from IMG 3.0 (http://img.jgi.doe.gov) using an E-value cutoff of 1.0e-3 for TS28 and TS50, and 1.0e-1 for SRX007415 (which has shorter reads), respectively. FragGeneScan gene prediction in these metagenomes was compared to the similarity search results.
The performance was measured in terms of sensitivity (the ratio of true positives to all annotated genes) and specificity (the ratio of true positives to all predicted genes). The accuracy was calculated by averaging the sensitivity and specificity. The performance of FragGeneScan was compared to that of Glimmer3 and MetaGene, which were downloaded from http://www.cbcb.umd.edu/software/glimmer/ and http://metagene.cb.k.u-tokyo.ac.jp/metagene/download.html, respectively. For training of Glimmer and MetaGene, we followed the standard procedures provided by the programs. Given a new genome, Glimmer uses its internal long-ORF scanner to predict long genes, which will be used for parameter estimation for the genome. MetaGene has its model parameters pre-trained.
FragGeneScan is implemented in C and Perl. In the training phase, we built Markov chains for the intergenic region and match states, and calculated emission and transition probabilities for the HMM. In the main module, the program predicts genes after identifying the best path of the hidden states by using the Viterbi algorithm implemented in C. The package of FragGeneScan includes all the programs, and does not require any other third party programs. FragGeneScan is available as open source at the FragGeneScan website, http://omics.informatics.indiana.edu/FragGeneScan/. All of our predictions are also available for download at the FragGeneScan website.
We tested and compared FragGeneScan with Glimmer and MetaGene by using the nine complete genomes (Table 1). We also tested FragGeneScan on short reads simulated from the same set of genomes, which allowed us to systematically evaluate the performance of FragGeneScan on the reads with various length and error rates. Finally, we tested and compared the performance of FragGeneScan with those of MetaGene and homology-search approach (BLASTX) on three real metagenomic datasets. The performances were measured in terms of sensitivity (the ratio of true positives to all annotated genes), specificity (the ratio of true positives to all predicted genes) and accuracy (the average of sensitivity and specificity). A gene fragment prediction is considered to be a true positive if it is of at least 60 bases (i.e. encoding 20 amino acids), and overlaps with ≥80% of the true protein-coding region in the read. We also compared the performances using different overlap criteria.
The nine complete genomic sequences of various GC contents (Table 1) we used are also widely used for testing gene predictors in previous studies (16,18). Overall, the accuracy of FragGeneScan is comparable with MetaGene, and slightly higher than Glimmer (Table 2). In particular, FragGeneScan and Glimmer showed higher sensitivity, whereas MetaGene showed higher specificity on average. (The conclusion remains the same when a different overlap threshold of 50% was applied, or a perfect match was required; see Supplementary Table S4.)
All three methods consistently showed the highest accuracy for Buchnera aphidicola and the lowest accuracy for Wolbachia endosymbiont. The low accuracy obtained for W. endosymbiont is due to very low specificities. This might be caused by the fewer number of genes in W. endosymbiont. However, Glimmer showed significantly lower specificity compared with the other two methods. We note that both FragGeneScan and MetaGene use generalized model parameters obtained by regression across genomes of different GC contents. But Glimmer uses specific model parameters trained from the genes in the testing genome. We thus suggest that more generalized parameter estimation may improve the performance in gene prediction for the cases when only insufficient data is available for training a specific model.
Table 3 shows the sensitivity, specificity and accuracy of gene prediction in simulated reads of 100 (Figure 2), 200 and 400bp (Figure 3) with 1% sequencing error rate and 700bp with 0.5% sequencing error rate. Prediction in longer reads shows higher accuracies with few exceptions (predictions in the 200bp reads from B. aphidicola and Perkinsus marinus show higher accuracies than those in the 400bp reads). Overall, FragGeneScan achieved 21–68% higher accuracies as compared to MetaGene. We note that FragGeneScan shows consistently high accuracy ranging from 63% to 89% for all the lengths we tested (100–700bp); MetaGene, on the other hand, shows highly varied and lower accuracy ranging from 16% to 52%. Additional results on simulated reads with different sequencing error rates, and using different overlap criteria for defining true positive gene prediction, are listed in Supplementary Tables S5 and S6 (Supplementary Table S7 lists the proportion of simulated reads that contain annotated genes for each simulated dataset.)
The performance of FragGeneScan as a function of read lengths and sequencing error rates is summarized in Table 4. For the 400 and 700bp reads without sequencing error, FragGeneScan and MetaGene show comparable performance with <1% difference. For the shorter reads of 200 and 100bps that have no sequencing errors, however, FragGeneScan outperforms MetaGene significantly. In particular, the accuracy of FragGeneScan in predicting genes in 100bp reads without sequencing error is only 5% lower than those in longer reads (200 and 400bp); in contrast, MetaGene shows 22% decrease in accuracy under the same condition. For all the cases studied with reads of various lengths and sequencing errors, a consistently better performance (up to 65%) was observed for FragGeneScan over MetaGene.
We also tested FragGeneScan on real metagenomes generated by different sequencing platforms (Supplementary Table S3): two datasets TS28 and TS50 from the twin obese and lean study (sequenced by 454 sequencers) (14), and one dataset SRX007415 from the rumen microbiota response study (sequenced by Illumina sequencers). For these real metagenomes, there are no standard annotations available for comparison. So we compared genes predicted by FragGeneScan and MetaGene with those predicted by a homology-based approach (i.e. a read is considered to contain a protein-coding region if BLASTX finds its homologs in a protein database). Here, we consider a predicted gene as a true positive if the predicted gene overlaps with the entire length of the annotated gene from the BLASTX search (which, however, may not give the precise boundaries of the real gene).
For the TS28 dataset, FragGeneScan successfully predicted 92% of the genes identified by BLASTX search, whereas MetaGene predicted 47% of the genes (Table 5). For the TS50 dataset, FragGeneScan predicted 92% of the genes identified by BLASTX, whereas MetaGene predicted 69% of the genes. Note that MetaGene predicted significantly fewer genes, proportionally, in the TS28 dataset than in the TS50 dataset (with respect to BLASTX). In contrast, these ratios are roughly the same for both datasets for FragGeneScan (~92%). (The comparison based on 50% gene overlap, i.e. the gene predicted by FragGeneScan or MetaGene overlaps with at least half of the BLAST annotated gene, is also shown in Table 5.)
FragGeneScan also predicted potentially novel genes that were missed by homology searches, including 28% (89340 out of 317440) of the putative genes predicted from the TS50 dataset, and 25% (142007 out of 579362) from the TS28 dataset. We note that BLASTX searches discovered protein-coding genes in ~74% of the reads in both datasets (462815 out of 622554 reads in the TS50 dataset, and 231946 out of 312665 reads in TS28). Considering that on average 90% of the 200bp simulated reads (which is similar to the read lengths of TS28 and TS50 data sets) contain annotated protein-coding genes (see Supplementary Table S7), and that ~90% of bacterial genomes encode for proteins (42), the fraction of reads annotated as protein-coding regions by BLASTX searches on these two metagenomes (74%) is rather low. Our observation indicates the potential application of ab initio gene predictors such as FragGeneScan in the discovery of novel genes, which may constitute a significant proportion of protein-coding genes from an environmental sample.
The reads in SRX007415 are much shorter (of 72bp) than those in TS28 and TS50. Considering that a BLASTX search of the original dataset (which has 4.2Gb nucleotides) would require a drastic amount of CPU hours, we only carried out gene prediction for a small subset (2%), which has 1164805 reads, and compared the results with BLASTX results (MetaGene works only with input sequences of at least 100bp, thus cannot be used for comparison). BLASTX search predicted protein-coding genes in 8% of the reads (87431 out of 1164805) with E-values <1.0e-3. When a less stringent E-value cutoff was applied (1.0e-1), the ratio increased to 19%. Both ratios are extremely low, which may not be that surprising—it has been shown that the sensitivity of similarity search drops significantly when the reads become shorter (43). But it also indicates that for short reads, gene prediction based on homology search may severely underestimate the gene content in an environmental sample. FragGeneScan predicted 1099193 gene fragments in total, among which 189875 gene fragments were also predicted by BLASTX (i.e. FragGeneScan predicted 86% of the potential genes obtained by BLASTX using E-value cutoff 1.0e-1). It is slightly lower than TS28 and TS50 datasets, which might be caused by the shorter length of the reads (72bp).
FragGeneScan integrates sequencing error models in its HMM so that it can predict genes broken by frameshift sequencing errors. Here, we show two examples of predicted genes that contain such sequencing errors (Figure 4) to demonstrate the importance of incorporating sequencing error models in gene prediction. Figure 4a shows a gene predicted by FragGeneScan from a simulated read, in which two frameshifts caused by sequencing errors were fixed. The read (r19) was simulated with two insertions of Cs from the E. coli genome (sequence from 4578113 to 4578339bp) (Figure 4a). By adding insertion states near the positions of original sequencing errors—for example, FragGeneScan predicted the nucleotide sequence ACTA in the simulated read as ACA, which encodes Threonine, by annotating the T as an insertion—FragGeneScan predicted a gene (without fragmentation) that is almost identical to the annotated gene (see ‘Discussion’ section). Figure 4b shows another gene predicted in a real metagenomic read by FragGeneScan. From the read (E4LJNJL01APZ27) in the TS28 dataset, FragGeneScan predicted an incomplete gene with an insertion state (of nucleotide A, highlighted in Figure 4b). (MetaGene did not predict any gene from this read.) BLAST search of this predicted gene against the IMG 3.0 database resulted in a significant match (YP_002939026 from Eubacterium rectal ATCC 33656 with an E-value of 2e-28), and the alignment of the predicted protein and the homolog is shown in Figure 4b.
Although FragGeneScan was intentionally developed for gene prediction in short and error-prone reads, it provides a versatile method to predict genes in complete microbial genomes, as well as in short reads with or without sequencing errors. The read lengths and sequencing error rates profoundly affect the performance of gene prediction methods. Sequencing errors that cause frameshifts are difficult to be detected by the ORF-based gene finding approaches such as MetaGene, resulting in cases where only fragments of true genes, if anything, are identified; and it may be difficult to interpret the gene fragments. FragGeneScan was developed to overcome the limitations of existing methods in addressing these two major issues, by incorporating sequencing error models into six-periodic inhomogeneous Markov models. FragGeneScan is robust with consistently high performance of predicting genes in reads with widely ranged sequencing error rates.
FragGeneScan is also less affected by the length of sequencing reads. FragGeneScan achieved consistently high gene prediction accuracies in reads of length 100–700bp, whereas MetaGene showed significant variation in its performance. The robustness of FragGeneScan comes from its design principle: FragGeneScan uses only the probability of emitting a nucleotide at each position throughout the entire sequence (in contrast, existing gene prediction methods use statistical parameters such as the length distribution of genes). This robustness is essential for predicting genes in the reads generated by different NGS methods since each sequencing technique generates reads with different lengths (27).
We used rather stringent criteria that a predicted gene fragment is of at least 60bp (i.e. 20 amino acids) and has 50–100% overlap with the true gene in the read to be considered as a true positive. In contrast, Hoff (9) used a rather loose definition of true positives: predicted genes that have a BLAT (44) alignment of at least 20 amino acids with the annotated gene and at least 80% sequence identity were called true positives. As a result, truncated gene fragments (due to frameshift-causing sequencing errors) may not be considered as true positives according to our criteria (because they are too short), but still may be considered as true positives according to Hoff’s criteria. So the accuracy reported in our article for MetaGene may be lower than the accuracy reported in Hoff’s paper (9). We want to emphasize that it is important to use stringent criteria for measuring the performance of gene predictors in error-prone short reads, as truncated gene fragments are more difficult to interpret and are less informative.
For gene prediction in Illumina reads (e.g. the SRX007415 dataset with reads of 72bp), we used the same parameters as for Sanger reads, considering that Illumina sequencers do not show high sequencing errors in the homopolymer regions as 454 sequencers, and produce reads with overall low sequencing error rate. We can learn emission and transition probabilities of HMM for Illumina sequencing when more, longer Illumina reads (e.g. of 125bp) become available.
The exact amino acids encoded by nucleotide sequences containing frameshift sequencing errors may be difficult to predict (for examples, see Figure 4). But these subtle mistakes in the predicted gene sequences (as long as the overall genes are predicted correctly) will not considerably affect many downstream analyses, such as the similarity search of the predicted genes. We will further explore the possibility of improving the prediction by incorporating the quality score of the reads.
Funding for open access charge: National Institutes of Health (1R01HG004908-02); National Science Foundation (CAREER award DBI-0845685).
Conflict of interest statement. None declared.
Supplementary Data are available at NAR Online.
We thank Chad Burrus for reading the article. We are grateful to the authors of MetaGene and Glimmer for providing their programs for comparison.