Back in 1999, upon analysis of 17 prokaryotic genomes, we have determined that genome-wide 61 codon frequencies could be approximated by functions of genome-wide nucleotide frequencies (9
), the functions of a single parameter, genomic GC content. This critical observation strongly suggested that genomic GC content is the major factor influencing genome-wide codon usage pattern. This was a conclusion formulated upon introduction of the heuristic models (9
). Moreover, it soon received further support by results independently obtained by other authors (27–28
). The major focus of the current study is on further developing the heuristic models and on their applications to gene finding in metagenomic sequences. Therefore, we had to leave aside intriguing questions on (i) possible evolutionary mechanisms that formed the dependence of codon usage pattern on genome GC content and (ii) how could this dependence evolve differently in the domains of bacteria and archaea, or in the classes of mesophilic and thermophilic species.
Notably, both divides, either by phylogeny (bacteria versus archaea) or by the optimal growth temperature (mesophiles versus thermophiles), have produced similar results in terms of accuracy of gene finding in short sequences. Use of the bacteria and archaeal model pairs is a natural choice, since the origin of a short sequence is not known a priori. The second pair of models, mesophilic and thermophilic, may have less frequent use since the temperature of microbiome habitat is known and the model can be chosen a priori.
The ability to identify a sequence origin in terms of bacterial or archaeal domain appears to be an added value benefit since the algorithm automatically identifies the model, bacterial or archaeal that best fits the gene sequence and ‘is attached to’ the most likely type of a hidden state. Domain classification was shown to be correct for 88.04% of 700-nt long bacterial fragments and for 91.27% of 700-nt long archaeal fragments (Supplementary Tables S7
). Notably, genes horizontally transferred between the two domains should be responsible for a fraction of misclassification errors.
The results indicate that gene prediction in fragmented sequences of prokaryotic genomes has the same rate of success as in complete prokaryotic genomes. This result is rather surprising as the complete genomes provide a context for each individual sequence fragment and offer much larger sets of sequence data for model training. However, most of prokaryotic genomes are heterogeneous in terms of GC content. Parameters of a conventional model used in a genomic gene finder are defined for the genome as a whole and the accuracy may slightly suffer in regions whose local GC content deviates from the average one. Derivation of model parameters for each short sequence individually, as it is done for metagenomic sequences, tunes up parameters for each sequence with regard to its GC content. Therefore, short sequences as targets for gene prediction have advantages as well.
Existence of a difference between GC content of protein- and non-coding regions is a well-known fact. However, the nearly constant value of this difference among genomes ranging wide in GC content is an interesting observation (). Notably, RNA genes have been observed to be uniformly GC rich regardless of genome GC content (); hence, tRNA genes could be easily detected in AT-rich genomes as local regions with a sharp GC content elevation. GC content of protein-coding genes does not correlate with temperature of the species habitat. Still, it is the RNA genes that show temperature-dependent composition. RNA genes in genomes of thermophilic species (genomes that could be either AT or GC rich) have a significantly higher GC content than RNA genes in genomes of mesophilic species (). Frequencies of nucleotides in the three codon positions in protein coding regions of mesophilic and thermophilic species show difference in patterns of dependence on genome GC content. Similarly to inferring a domain of origin, bacterial or archaeal, for a gene within the gene finding algorithm with bacterial and archaeal model pair, a pair of heuristic models derived for mesophilic and thermophilic species could be used to for inferring mesophilic or thermophilic origin for an individual gene.
We should mention that the sets of bacterial and mesophilic species used in this study well overlap each other; 301 out of 319 species in the bacterial set are mesophilic. Hence, bacterial and mesophilic protein-coding regions exhibit a similar dependence of frequencies of nucleotides in the three codon positions on genome GC content (). On the other hand, although the set of 38 archaeal species contains 23 thermophiles and overlaps significantly with the set of 47 thermophilic species in this study, most archaeal and thermophilic regression slope coefficients () are distinctly different.
We should note that frameshifts in protein-coding regions, caused by sequencing errors, are more frequent in metagenomes than in complete genomes. It was shown (36
) that performance of all current methods for metagenome gene finding including GeneMark.hmm with the original HAL-99 models is sensitive to presence of frameshifts. The new heuristic models make no exception, and sensitivity to sequence errors has roughly the same pattern as one already reported for HAL-99 (36
). Additionally, as a separate project we have developed a new algorithm and software tool for frameshift identification (37
) that could be combined with the heuristic models and used for frameshift detection in metagenomic sequences.
In conclusion, we should say that we have presented here methods of reconstruction of codon and oligomer frequencies that have led to new heuristic models for gene finding in short sequences. We have shown that use of the new models in GeneMark.hmm resulted in more accurate gene predictions than use of heuristic model HAL-99, developed earlier. The gene prediction accuracy was shown to be higher than that of MetaGene and MetaGeneAnnotator ().
The HAL-99 models have been used in gene prediction and annotation since 1999. They were used in ab initio
prokaryotic and eukaryotic gene finders GeneMarkS and GeneMark-ES to initiate unsupervised training for complete and nearly complete genomes (7
). Particularly, HAL-99 models were used in ab initio
gene prediction and annotation in viral genomes (10
) and in metagenomic sequences within the pipeline of DOE Joint Genome Institute (Nikos Kyrpides, personal communication).