|Home | About | Journals | Submit | Contact Us | Français|
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. Original version of the method was proposed in 1999 and has been used since for (i) reconstructing codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes.
A metagenomic sample is a heterogeneous mixture of rather short sequences originated from a shotgun sequencing of a microbial community. A vast majority (99%) of microbial species in a given community are likely to be non-cultivable (1). Many protein-coding regions in a new metagenome are likely to code for barely detectable homologs of already known proteins. Therefore, along with comparative genomic methods that rely on sequence similarity search, ab initio methods able to identify genes having no similarity to ones existing in databases are vitally important tools of metagenomic sequence analysis. Sequence similarity-based methods possess high specificity and ability to characterize function of predicted genes (2–5). Ab initio gene finders exhibit high sensitivity along with sufficiently high specificity. The standard tools for ab initio prokaryotic gene prediction such as EasyGene (6), GeneMarkS (7) or Glimmer (8) were not designed to work with short sequence fragments from unknown genomes. However, a special method for assignment of parameters of a gene finder, the ‘heuristic model’ method, designed for accurate gene finding in short prokaryotic sequences with anonymous origin was proposed 4 years prior to the advent of metagenomics (9).
The idea was to bypass traditional ways of parameter estimation such as supervised training on a set of validated genes or unsupervised training on an anonymous sequence supposed to contain a large enough number of genes. It was proposed to use dependencies, apparently formed in evolution, between codon frequencies and genome nucleotide composition. Therefore, the vector of codon frequencies, critical for the model parameterization, could be derived from frequencies of nucleotides observed in a short sequence. This ‘heuristic model’ method has been used for (i) reconstructing codon frequency vector for gene finding in viral genomes (10) and (ii) initializing the algorithms for iterative parameters estimation for prokaryotic as well as eukaryotic gene finders (7,11–12). Recently, several new methods for ab initio gene finding in metagenomic sequences have been developed (13–15). Particularly, the authors of MetaGene (14) saw a significant potential in the ‘heuristic model’ method (9); they have extended the method to use of di-codon frequencies. The authors of new tools have shown that their performance is comparable to performance of the original ‘heuristic model’ method (Supplementary Table S3 in (14)) (16).
In this article, we describe further improvement of the ‘heuristic model’ method. A key observation made upon analysis of 17 genomes (9) was that frequencies of nucleotides in the three codon positions depend linearly, though with distinctly different slope coefficients, on global nucleotide frequencies. In turn, due to the second Chargaff rule (17), this observation means that nucleotide frequencies in the three codon positions depend linearly on genomic GC content. These linear functions were used to reconstruct codon frequencies in the whole genome using information derived from its short sequence fragment and to derive parameters of the ‘heuristic’ second-order Markov models [the Heuristic ALgorithm (HAL)-99 models] for a gene finding algorithm. Gene finding with ‘heuristic’ models was proved to be effective for viral genomes (10,18) as well as for metagenomic sequences (Nikos Kyrpides, personal communication).
With hundreds of new prokaryotic genomes available, it is now possible to enhance the original approach and to utilize direct polynomial and logistic approximations of oligonucleotide frequencies. Also, the analysis of a larger set of genomic sequences has shown that patterns of dependence of codon frequencies from nucleotide frequencies are distinctly different in the two domains of life, bacteria and archaea. Interestingly, distinctly different patterns of the dependence of codon frequencies from genome nucleotide composition have also been observed in mesophilic and thermophilic species. Thus, for gene finding in a short sequence, it is worthwhile to make a simultaneous use of two models, bacterial and archaeal, or mesophilic and thermophilic.
We have assessed the accuracy of a hidden Markov model (HMM) based gene finder, GeneMark.hmm, using the new models on the sets of short sequences obtained by splitting known genomes into equal length fragments (ranging from 72 to 1100 nt). The results demonstrate a higher accuracy in comparison with several other existing methods as well as with the use of original heuristic models.
Application of whole-genome shotgun sequencing to studies of mixed microbial communities, such as gut microbiota of human and mouse have a potential to reveal details of a large picture of the host metabolism combining microbial and mammalian elements. It is estimated that human intestinal microbiota consists of 1013–1014 microorganisms. This microbiome should contain at least 100 times as many genes as a human genome per se. Still, due to diversity of the microbiome, metagenomic data sets consist mainly of unassembled single-read sequences. We have applied the new method to the sequences of human and mouse gut microbial communities (19–20). We detected a large number of protein-coding regions not yet annotated; for a significant fraction of the protein products of newly predicted genes, we found homologs among known proteins. Notably, identification of incomplete genes carries valuable information for reconstruction of metabolic networks and signaling pathways. Since a number of protein-coding regions in a metagenome may be counted by millions (4), improving accuracy of gene finding by a percentage point would affect accurate prediction of tens of thousands of genes of the organisms constituting microbial communities. Therefore, development of accurate metagenome-specific methods is of critical importance for quality analysis of sequence data produced by the next generation sequencing technologies (21).
Sequence data of 582 complete prokaryotic genomes (534 bacteria and 48 archaea; genetic code 11) were from the NCBI RefSeq database. Length of the shortest genome in the sample, Nanoarchaeum equitans (22), was 490 kb. Genome GC contents varied from 16.6% to 74.9%. The data on optimal growth temperature for 357 prokaryotic species (Supplementary Table S1) was from the NCBI Entrez genome database (23). Metagenomic sequence data and annotation for human and mouse gut microbiomes were from the JGI IMG/M database (24).
For assessment of gene prediction accuracy, we used fragments from whole genomes of 29 bacterial and 15 archaeal species (containing 50 microbial chromosomes) listed in Supplementary Table S2. The genomic sequences were split into equal length non-overlapping fragments, with length ranging from 72 to 1100 nt; fragment annotations were derived from corresponding RefSeq records. To retain genes with most reliable annotation, fragments overlapping annotated hypothetical genes were discarded.
A conventional ab initio gene finding algorithm employs a probabilistic model of genomic sequence containing protein- and non-coding regions. Gene prediction accuracy critically depends on precision of the estimation of model parameters that are genome specific. The number of parameters of the probabilistic model of a protein-coding region, a three-periodic Markov chain model (25) increases exponentially (~4N) with the Markov chain order N. The higher the model order, the larger the size of a set of training sequences required for parameter estimation without over-fitting, e.g. in practice, estimation of parameters of the fifth-order model is made on a set of verified protein-coding sequences with total length of 400 000 nt. Note that in our observations even if a larger training set is available, models with an order higher than five did not make a noticeable difference in power of discrimination between coding and non-coding regions (26).
Metagenomic sequence data, mixtures of shotgun sequences from numerous members of microbial communities, are populated with short sequences (with length ≤400 nt). The task is to identify a complete or incomplete protein-coding region residing in a short fragment. A gene finding algorithm, e.g. GeneMark.hmm, could be applied to solve this task should we know or are able to derive the genome-specific model parameters. However, the fact that the genomic context of the short fragment is missing precludes the use of standard approaches for parameter estimation. In a previous work (9), we proposed a method to infer parameters of the three-periodic second-order Markov model for gene finding in a short (e.g. 400 nt) sequence fragment of unknown origin. First, we have identified dependencies that link the nucleotide composition of a genome with the genome-specific codon frequencies. These dependencies are apparently the strongest determinants of a genome-wide synonymous codon usage pattern (27–28). Second, nucleotide frequencies observed in a short DNA fragment served as estimates of global nucleotide frequencies in the whole genome, the source of the short fragment. Then, starting from estimated values of global nucleotide frequencies we reconstructed the genome-specific codon frequencies.
In more details, in the first step, analysis of genomes with known annotation, by taking one genome at a time, we determined frequencies of occurrence of each of the 61 codons in a genome-wide set of annotated protein-coding regions. The codon frequency data determines 12 genome-specific positional frequencies f1X, f2X, and f3X, where X = A, C, G, T in the three codon positions. For a sample of known genomes, r = 1, 2, … R with observed fkX, k = 1, 2, 3 the fkX values were approximated by linear regression on the global nucleotide frequency fX, X = A, C, G, T. Initially, in 1999, the analysis was done for 17 completely sequenced genomes [Figure 1 in (9), see also (29)].
Now, with many more sequenced genomes available, the linear regression analysis was done for 319 bacterial genomes (Figure 1) as well as for 38 archaeal genomes (Table 1). Graphs in Figure 1 look different from graphs in Figure 1 in (9) for the following reasons. The global nucleotide frequency variable strongly correlates with the genome GC content. The second Chargaff rule states that at a whole-genome level, nucleotide frequencies, fX, X = A, C, G, T in a single DNA strand are such that fA ~ fT and fG ~ fC. Therefore, four nucleotide frequencies observed in whole genomes can be derived from a single parameter, the GC content; if s is a genomic GC content, fG + fC, then frequencies of nucleotides fG = fC = s/2 and fA = fT = (1 − s)/2. Thus, new graphs of positional nucleotide frequencies (Figure1) were plotted as functions of genomic GC content.
Further, the s value determined for a short genomic fragment is used as predictor of positional nucleotide frequencies fkX, where k = 1, 2, 3 and X = A, C, G, T. Assuming that a codon frequency, fXYZ, is proportional to product f1Xf2Yf3Z we could obtain an initial approximation of codon frequency f′XYZ. Additional correction comes from the value of predicted frequency of encoded amino acid α, fα (s) determined by linear regression of frequencies of amino acid α observed in corresponding proteomes with respect to the genomic GC contents. To give an example, for alanine with four synonymous codons, predicted frequency fGCT of codon GCT is:
Note that the left part of the formula does not change in further iterations (i.e. by substituting thus found fGCT into right part of the equation).
Finally, it was shown that all parameters of the three-periodic Markov chain model of a protein-coding region could be determined as functions of the set of predicted codon frequencies (9). A model of non-coding region was defined as the multinomial model, the zero-order Markov model. GC content of non-coding regions was observed to have strong correlation with the genome-wide GC content (Figure 2). Therefore, nucleotide frequencies observed in a relatively short DNA fragment are accepted as estimates of four parameters of the non-coding region model. Thus parameterized models of protein- and non-coding regions are ready for use in a gene finding program such as GeneMark.hmm (7,30).
With hundreds of prokaryotic genomes sequenced and annotated, it is possible to use non-linear (polynomial or logistic) regression to more precisely determine the dependence of codon frequencies on genome GC content. To choose the order of regression polynomial, we recall the observed linearity in dependence of frequencies of nucleotides in the three codon positions on genome GC content; product of three linear functions is natural to approximate by the third-order polynomial A + Bs + Cs2 + Ds3; the least squares method is applied to estimate the four coefficients.
A logistic function could approximate observed codon frequencies scaled with respect to the minimum and maximum values: f scaled = ( f − f min)/( f max − f min); this approach was used earlier (14). A generalized linear regression function glmfit from the MatLab Statistics Toolbox was used to determine β0 and β1 parameters from the equation . For a given s, codon frequency was determined as follows. With , predicted codon frequency was determined as f(s) = f scaled * ( f max − f min) + f min. Frequencies of 64 nucleotide triplets residing in each of two other reading frames could be reconstructed by either one of the two regression approaches outlined above. The three vectors of triplet frequencies thus reconstructed for a short sequence S with respect to its GC content are sufficient for computing parameters of the second-order three-periodic Markov chain model, the model of protein-coding region in an unknown genome sequence S came from.
Summarizing the options described above, parameters of the three-periodic second-order Markov chain could be determined by several alternative techniques: (i) reconstructing codon frequencies from predicted nucleotide frequencies in the three codon positions, with subsequent derivation of triplet frequencies in the second and third frame (9), the technique named above HAL-99; (ii) reconstructing codon frequencies by the third-order polynomial functions, with derivation of triplet frequencies in two other frames as in HAL-99, C-3 technique; (iii) reconstructing frequencies of K-mers, K=3, 4, 5, 6 in the three frames with the K-order polynomial regression, K–K techniques; and (iv) reconstructing frequencies of K-mers, K=3, 4, 5, 6 in the three frames with the logistic regression, K–L techniques.
We show examples of typical regression graphs for codons AAT, GCC, TTG and CGT frequencies observed in bacterial genomes (Figure 3); the regression curves were produced by the HAL-99, C-3 and 3-L techniques. Codon AAT is A and T ‘rich’. As a rule, frequencies of eight out of 64 AT rich codons show monotonous decrease over the whole GC range with a rather small variation in any given GC content (Figure 3a). The codon GCC frequency, as well as frequency of other seven GC rich codons increases as genome GC content grows (Figure 3b). Frequencies of codons with mixed composition, such as TTG and CGT (Figure 3c and d) show more variation particularly in the mid GC range, and the task of approximation of these frequencies by a function of single variable is more challenging. It was reported that in genomes with the same GC content, the differences in codon frequencies correlate with differences in optimal growth temperature, t (31). These observations motivate introduction of yet another technique, designated as the C-M technique using approximation of codon frequencies by a function of two variables , the sequence GC content, s, and the temperature of microbiome habitat, t, with parameters determined by multiple regression (Figure 4).
Linear trends in frequencies of nucleotides in the three codon positions with respect to genome GC content have been observed to be different in bacteria and archaea (Table 1). Therefore, two distinct heuristic models could be built, one for bacterial and another for archaeal sequences. Notably, no pre-processing is needed to identify a domain of life the short sequence fragment represents. The bacterial and archaeal heuristic models can be used in the GeneMark.hmm algorithm simultaneously (Figure 5), similarly to the simultaneous use of typical and atypical gene models (30). A protein-coding region, if present in the sequence, is supposed to be recognized by either bacterial or archaeal model.
Alternatively, all prokaryotic species could be divided into mesophilic and thermophilic (310 mesophilic and 47 thermophilic in our reference set of sequenced genomes). Then, application of regression analysis of nucleotide frequencies in the three codon positions produced once again two distinct sets of 12 linear functions (Table 1). The two heuristic models (built for mesophiles and thermophiles) could also be used simultaneously in GeneMark.hmm. However, such a dual model seems to be less effective for practical use, as the temperature of a microbiome habitat is supposed to be known and one of the models could be chosen a priori.
In the Results section, we designate the model pairs by suffix BA or TM, e.g. 3-3BA stands for use a pair of bacterial and archaeal models derived by the third-order polynomial approximation of triplet frequencies.
An average gene length in a prokaryotic genome is about 900 nt. In a metagenomic sequence shorter than 900 nt, it is more likely to observe a part of a gene than a complete gene. To account for the frequent occurrence of partial coding sequences (CDS), we have to modify a formula for the gene length frequency distribution used in GeneMark.hmm for gene finding in complete genomes. This distribution is approximated by p(d) = Nc(d/dc)2exp(−d/dc), the γ distribution formula with two parameters (30). Also, the length distribution of non-coding regions is approximated by exponential distribution p(d)=Nnexp(−d/dn). Parameters, dc and dn are estimated by fitting to empirical distributions of gene length in known genomes. It was observed that values of dc and dn vary little among different prokaryotic species. Therefore, values of these parameters in the algorithm were given as default values: dc = 300 and dn = 150. The formula for length distribution of protein-coding regions in short metagenomic sequences is p(d) = Np(d2 + dcd +2d2c)exp(−d/dc), with parameters Np and dc. Corresponding graphs of theoretical and observed length distributions are shown in Figure 6. To avoid predicting too short partial genes, we have effectively defined 60 nt as minimum length of a predicted gene by setting p(l ≤ 60) = 0.
To analyze how accuracy of GeneMark.hmm depends on dc and dn values, we used sets of 700-nt long fragments of Escherichia coli and Bacillus subtilis genomes; the model used in the runs was the C-3BA one. Sensitivity (Sn) and specificity (Sp) were determined by comparison of gene predictions with fragments annotation. A prediction was accounted as a true positive if locations of the predicted and annotated 3′-ends matched or for partial genes without 3′-ends there was a match between predicted and annotated reading frames. The values of dc could vary from 100 to 800, while values of dn varied from 100 to 300. Particularly, dependence of Sn and Sp for dc = 800 while dn varied from 100 to 300 as indicated by blue line in Figure 7; similarly, dependence of Sn and Sp for dn = 100 while dc is varied from 100 to 800 as indicated by purple line. The dc, dn setting used for analysis of complete genomes (300, 150) is indicated by red dot. Combining larger dc (800) and smaller dn (100) leads to a substantial increase of Sp and a slight decrease of Sn. This result is due to the decrease in number of predicted short genes, many of them not matching annotation. To facilitate comparison of average values S = (Sn + Sp)/2, produced by the program runs with different dc and dn values, the constant S level lines (with slope −1) were plotted in Figure 7a and b. Performance (Sn, Sp) of MetaGene and MetaGeneAnnotator (with default parameters) was depicted for each of the two genomes as well; one can see that the performance is high, though it can be outperformed, especially in the E. coli, by GeneMark.hmm with a wide range of parameters dc and dn. As the result of modeling, we have used dc = 800 and dn = 100 in further analysis of artificial and real metagenomic sequences.
We used the GeneMark.hmm program with the pairs of heuristic models, bacterial and archaeal (or mesophilic and thermophilic) derived by methods described above to analyze sequence fragments with fixed length, from 50 microbial chromosomes (Supplementary Table S2). All models were tested on sets of fragments with length of 400 and 700 nt; moreover, the models with highest performance were tested on sets of fragments with shorter (down to 72 nt) and longer (up to 1100 nt) lengths. Performance characteristics of different models are shown in Table 2 (with more details provided in Supplementary Tables S3–S6). Observed values of (Sn + Sp)/2 were clustered between 94.5% and 96.5% for 700-nt long fragments and between 93.5% and 96.0% for 400-nt long fragments. Interestingly, among the triplet-based models, C-3BA, C-3MT, 3-3BA and 3-LBA, the codon frequency derived models, C-3BA and C-3MT, demonstrated higher performance than 3-3BA and 3-LBA models, where frequencies of triplets as functions of GC content are independently approximated in each frame. Use of higher order Markov models: the third order, 4-4BA, the fourth order, 5-5BA, and the fifth order, 6-6BA and 6-LBA, resulted in similar performance, with differences in (Sn + Sp)/2 values <0.3%; this performance level is comparable to performance of the second-order models C-3BA and C-3MT. Still, a slightly higher (Sn + Sp)/2 for 700- and 400-nt long fragments was achieved with the use of 6-LBA heuristic model containing a pair of the fifth-order model, bacterial and archaeal, with parameters obtained by logistic regression approximation of hexamer frequencies. Note that the MetaGene authors found performance of MetaGene on 700-nt fragments comparable to performance of GeneMark.hmm with HAL-99 model (Supplementary Table S3 in 14). This result corresponds to our observations as well (Table 2).
The use of models utilizing higher order oligonuclotides brought in a marginal improvement of (Sn + Sp)/2 for gene prediction in 400- and 700-nt fragments in comparison with the codon-based models, e.g. C-3BA and C-3MT (Table 2, Supplementary Tables S3–S6). This observation is in agreement with findings of other authors that use of the fifth-order Markov chains and/or di-codon frequencies leads to a slight increase in gene prediction accuracy (13–15). In order to determine accuracy of gene prediction in fragments with length other than 400 and 700 nt, the particular values used in tests by several authors, we have derived from the 50 microbial chromosomes, 11 additional test sets with fragment lengths varying from 72 to 1100 nt (Table 3). Here, in comparison of MetaGene and MetaGeneAnnotator with GeneMark.hmm using HAL-1999, C-3BA and 6-LBA models, we see that GeneMark.hmm with 6-LBA model performs marginally better in terms of Sn and Sp average. Yet, MetaGene shows higher Sn for all the 13 test sets, while C-3BA model shows higher Sp for fragment length longer than 200 nt. For better visualization, we show the programs’ performance as functions of fragment length for the sequence sets with fragment length ≥100 nt (Figure 8). Notably, since the second-order C-3BA model is very close to the 6-LBA model in terms of performance, we use the C-3BA model in several applications discussed below along with the 6-LBA model (Tables 2–3, Supplementary Tables S3–S6).
Upon analysis of short sequence fragments from 50 microbial chromosomes, a run of GeneMark.hmm with bacterial and archaeal model pair not only produced a list of predicted genes but also an indication of a likely origin of each gene (Supplementary Tables S7 and S8). We have seen that a vast majority of genes in bacterial (archaeal) sequence fragments was predicted by the bacterial (archaeal) model. Similarly, a vast majority of genes in thermophilic (mesophilic) sequence fragments were predicted by thermophilic (mesophilic) model. Interestingly, for the thermophilic bacteria Thermotoga maritima (with optimal growth temperature of 80°C) the archaeal model predicted 3137 out of a total of 3225 fragmented genes, corroborating the findings made in the original T. maritima genome paper (32) of massive horizontal influx of genes transferred from archaeal species (33). On the other hand, a vast majority of genes in Methanosarcina acetivorans, identified in many sources as mesophilic archaea, were predicted by the thermophilic model. This result corresponds to observations that M. acetivorans is able to live in deep sea hydrothermal vents. Similar observations were made for bacteria Aquifex aeolicus (34) living in high temperature, as well as for low temperature archaeal species such as Haloarcula, Halobacterium and Methanosphaera (Supplementary Tables S7 and S8).
In short fragments, one rarely sees more than one gene per fragment; therefore, a gene characterization could normally be extended to the whole sequence fragment. Rare cases, when there are several genes in a metagenomic fragment each predicted by different models are worthwhile to set aside as candidates for case study of horizontal gene transfer. Throughout, in the test set of 700-nt long fragments, with a total of 31 584 archaeal (136 210 bacterial) fragments, GeneMark.hmm with C-3BA model misclassified 2757 fragments as bacterial type (16 284 fragments as archaeal type); thus archaeal fragments were identified correctly in 91.27% of cases and bacterial fragments were identified correctly in 88.04% of cases (Supplementary Table S7, column C-3BA). Similar analysis for a set of 400-nt long fragments resulted in 89.92% correct predictions for archaea and 87.26% for bacteria (Supplementary Table S8, column C-3BA). Note that a life domain classification within a metagenomic gene finder was first proposed by Noguchi et al. (14). The difference with the method they used is rather technical; domain recognition in GeneMark.hmm is embedded in the Viterbi algorithm that assigns the most likely type of a hidden state, bacterial or archaeal (thermophilic or mesophilic), to predicted coding region.
We used GeneMark.hmm with C-3BA model to predict genes in metagenomic sequences from two human and five mouse gut microbiomes (Table 4). In these sequence sets, we have identified 11 865 genes that were not annotated earlier. Protein products of 1984 genes (in human samples) and 3435 genes (in mouse samples) had similarity to known proteins detectable by BLASTP with E-value threshold 10−5. Protein functions that could be assigned to the 50 longest genes predicted in the gut microbiomes derived sequences are listed in Supplementary Table S9. A relative proportion of new genes in the mouse gut metagenomic sequences is about three times higher than that in human; the mere numbers are about or larger 50% of the number of initially annotated genes. Interestingly, 17% (15%) of the metagenomic sequences in human Subject 7 (8) could be mapped to known genomes of bacteria and archaea (Supplementary Tables S10–S12) by the BLASTN search with E-value threshold 10−13. However, in the metagenomic sequences from mice guts, we were not able to identify DNA sequence fragments highly similar to a sequence in already sequenced genomes (with threshold 10−13). Still, for less stringent threshold 10−5, we observed dozens of fragments with similarity to genomes of known species in each mouse gut metagenomic sample. Typical situations that are prone to errors in annotation are illustrated in Figure 9: short genes could be missed (Figure 9a). Some genes could be omitted due to artifacts, such as erroneous extension of the 5′-end of a gene to the longest possible start (Figure 9b); such an extension may overlap a real gene in the opposite strand and this real gene will be missed in annotation.
The whole set of gene predictions is available at (http://exon.gatech.edu/GeneMark/metagenome/database); it was also visualized in a genome browser utilizing the GBrowse program (35).
We have designed a web site providing access to the new program for gene prediction in metagenomic sequences: http://exon.gatech.edu/GeneMark/metagenome. Running time of GeneMark.hmm with the 6-LBA models on the Sargasso Sea environmental sample with size 1.045 GB was 88 s. The program is available for download for academic use. For reference purposes, we have also provided an interface to the database of genome-wide codon frequencies observed in genomes used in the training set.
Back in 1999, upon analysis of 17 prokaryotic genomes, we have determined that genome-wide 61 codon frequencies could be approximated by functions of genome-wide nucleotide frequencies (9), the functions of a single parameter, genomic GC content. This critical observation strongly suggested that genomic GC content is the major factor influencing genome-wide codon usage pattern. This was a conclusion formulated upon introduction of the heuristic models (9). Moreover, it soon received further support by results independently obtained by other authors (27–28). The major focus of the current study is on further developing the heuristic models and on their applications to gene finding in metagenomic sequences. Therefore, we had to leave aside intriguing questions on (i) possible evolutionary mechanisms that formed the dependence of codon usage pattern on genome GC content and (ii) how could this dependence evolve differently in the domains of bacteria and archaea, or in the classes of mesophilic and thermophilic species.
Notably, both divides, either by phylogeny (bacteria versus archaea) or by the optimal growth temperature (mesophiles versus thermophiles), have produced similar results in terms of accuracy of gene finding in short sequences. Use of the bacteria and archaeal model pairs is a natural choice, since the origin of a short sequence is not known a priori. The second pair of models, mesophilic and thermophilic, may have less frequent use since the temperature of microbiome habitat is known and the model can be chosen a priori.
The ability to identify a sequence origin in terms of bacterial or archaeal domain appears to be an added value benefit since the algorithm automatically identifies the model, bacterial or archaeal that best fits the gene sequence and ‘is attached to’ the most likely type of a hidden state. Domain classification was shown to be correct for 88.04% of 700-nt long bacterial fragments and for 91.27% of 700-nt long archaeal fragments (Supplementary Tables S7 and S8). Notably, genes horizontally transferred between the two domains should be responsible for a fraction of misclassification errors.
The results indicate that gene prediction in fragmented sequences of prokaryotic genomes has the same rate of success as in complete prokaryotic genomes. This result is rather surprising as the complete genomes provide a context for each individual sequence fragment and offer much larger sets of sequence data for model training. However, most of prokaryotic genomes are heterogeneous in terms of GC content. Parameters of a conventional model used in a genomic gene finder are defined for the genome as a whole and the accuracy may slightly suffer in regions whose local GC content deviates from the average one. Derivation of model parameters for each short sequence individually, as it is done for metagenomic sequences, tunes up parameters for each sequence with regard to its GC content. Therefore, short sequences as targets for gene prediction have advantages as well.
Existence of a difference between GC content of protein- and non-coding regions is a well-known fact. However, the nearly constant value of this difference among genomes ranging wide in GC content is an interesting observation (Figure 2). Notably, RNA genes have been observed to be uniformly GC rich regardless of genome GC content (Figure 2); hence, tRNA genes could be easily detected in AT-rich genomes as local regions with a sharp GC content elevation. GC content of protein-coding genes does not correlate with temperature of the species habitat. Still, it is the RNA genes that show temperature-dependent composition. RNA genes in genomes of thermophilic species (genomes that could be either AT or GC rich) have a significantly higher GC content than RNA genes in genomes of mesophilic species (Figure 2). Frequencies of nucleotides in the three codon positions in protein coding regions of mesophilic and thermophilic species show difference in patterns of dependence on genome GC content. Similarly to inferring a domain of origin, bacterial or archaeal, for a gene within the gene finding algorithm with bacterial and archaeal model pair, a pair of heuristic models derived for mesophilic and thermophilic species could be used to for inferring mesophilic or thermophilic origin for an individual gene.
We should mention that the sets of bacterial and mesophilic species used in this study well overlap each other; 301 out of 319 species in the bacterial set are mesophilic. Hence, bacterial and mesophilic protein-coding regions exhibit a similar dependence of frequencies of nucleotides in the three codon positions on genome GC content (Table 1). On the other hand, although the set of 38 archaeal species contains 23 thermophiles and overlaps significantly with the set of 47 thermophilic species in this study, most archaeal and thermophilic regression slope coefficients (Table 1) are distinctly different.
We should note that frameshifts in protein-coding regions, caused by sequencing errors, are more frequent in metagenomes than in complete genomes. It was shown (36) that performance of all current methods for metagenome gene finding including GeneMark.hmm with the original HAL-99 models is sensitive to presence of frameshifts. The new heuristic models make no exception, and sensitivity to sequence errors has roughly the same pattern as one already reported for HAL-99 (36). Additionally, as a separate project we have developed a new algorithm and software tool for frameshift identification (37) that could be combined with the heuristic models and used for frameshift detection in metagenomic sequences.
In conclusion, we should say that we have presented here methods of reconstruction of codon and oligomer frequencies that have led to new heuristic models for gene finding in short sequences. We have shown that use of the new models in GeneMark.hmm resulted in more accurate gene predictions than use of heuristic model HAL-99, developed earlier. The gene prediction accuracy was shown to be higher than that of MetaGene and MetaGeneAnnotator (Table 2).
The HAL-99 models have been used in gene prediction and annotation since 1999. They were used in ab initio prokaryotic and eukaryotic gene finders GeneMarkS and GeneMark-ES to initiate unsupervised training for complete and nearly complete genomes (7,11–12). Particularly, HAL-99 models were used in ab initio gene prediction and annotation in viral genomes (10) and in metagenomic sequences within the pipeline of DOE Joint Genome Institute (Nikos Kyrpides, personal communication).
Georgia Tech School of Biology; US National Institute of Health (grant HG00783 to M.B.). Funding for open access charge: National Institute of Health (grant HG00783 to M.B.).
Conflict of interest statement. None declared.
Supplementary Data are available at NAR Online.
We wish to thank Konstantinos Mavromatis, Natalia Ivanova and Nikos Kyrpides for interest to the project and useful discussions.