Ever since the reading frame was recognized in ribosomal translation of protein-coding sequences, it has been realized that off-frame stop codons (OSCs) play a role in avoiding the production of erroneous protein products. At the very least, erroneous peptides resulting from frameshifts have reduced function or are entirely non-functional, and they consume precious cellular resources; in the worst case, they may be toxic and interfere with normal cellular metabolism. Hence, it is natural to postulate that OSCs would be selected for in the course of genome evolution. An increase in the occurrence of OSCs results in more truncation of the erroneous peptides produced by frameshifts, and leads to less metabolic wastage and potentially fewer toxic products. In agreement with this line of reasoning, there is empirical evidence that protein production increases with the number of OSCs in the coding gene [31].
Our study is divided into two main parts. First, we showed that GC bias in the coding sequences is the primary determinant of OSC frequencies, consistent with the results of a smaller study [12]. Furthermore, with a lone exception, individual OSC biases are also primarily determined by the G+C content of the coding sequences. Hence, these results establish the need to account for the effect of nucleotide compositional bias on OSC frequencies. In the second part of the study, we investigated the effects of higher-order compositional biases, such as the dinucleotide, hexanucleotide and dicodon biases that have previously been recognized in genomes [32-35]. Markov modeling provides a straightforward and natural way of describing these biases and hence allows the effects of the different biases on OSC frequencies to be estimated. Perhaps the biggest advantage of Markov modeling in the context of this study is the ease with which nested models can be developed and compared. These models would have been more complicated to implement using previous approaches such as k-mer shuffling [36] or odds ratios of word counts [22]. The generation of random genomes under different models greatly facilitates the study of a wide range of genomic features in relation to the underlying compositional biases, and the flexibility of the approach is limited only by the computational expense of the associated Monte Carlo method.
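The Monte Carlo comparison of observed and expected OSC counts can be illustrated with a minimal sketch. It is not the implementation used in this study: for brevity it uses the simplest possible model, in which codons are drawn independently from their observed frequencies (a zeroth-order codon-based model), and the function names `count_oscs`, `expected_oscs` and `osc_excess` are hypothetical. Sequences are assumed to be DNA strings in the coding frame.

```python
import random
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}

def count_oscs(cds):
    """Count stop codons in the two off frames (offsets 1 and 2) of a coding sequence."""
    return sum(
        cds[i:i + 3] in STOPS
        for offset in (1, 2)
        for i in range(offset, len(cds) - 2, 3)
    )

def expected_oscs(cds, n_sim=1000, seed=0):
    """Monte Carlo mean OSC count when codons are drawn independently
    from the observed codon frequencies (illustrative null model only)."""
    rng = random.Random(seed)
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    pool, weights = zip(*Counter(codons).items())
    total = 0
    for _ in range(n_sim):
        simulated = "".join(rng.choices(pool, weights=weights, k=len(codons)))
        total += count_oscs(simulated)
    return total / n_sim

def osc_excess(cds, n_sim=1000, seed=0):
    """Observed minus expected OSC count; positive values indicate overrepresentation."""
    return count_oscs(cds) - expected_oscs(cds, n_sim=n_sim, seed=seed)
```

Swapping the sampling step for a richer model changes only `expected_oscs`, which is what makes nested model comparisons convenient in this kind of setup.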
The selection of Markov models examined represents a balance between biological relevance and statistical considerations. Markov models of order six or above were not examined in the present analysis because the limited size of prokaryotic genomes gives insufficient sample sizes for parameter estimation. Furthermore, except for special cases like palindromic sequences, it is uncertain whether any biological mechanism exists to produce such a high-order oligonucleotide bias. The same argument applies to the codon-based Markov models. At the other end of the spectrum, Markov models simpler than second-order nucleotide-based models reflect only simple nucleotide composition or dinucleotide bias and cannot account for the absence of in-frame stop codons. The choice of second- and fifth-order nucleotide-based three-periodic Markov models in the present study is therefore not arbitrary. Previous work applying Markov models to gene prediction has shown them to be the most useful models for describing protein-coding sequences [37, 38], and they are important in the majority of current gene prediction programs.
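To make the idea of a three-periodic model concrete, the following is a minimal sketch of a second-order nucleotide-based version, in which each base depends on the two preceding bases and on its codon position. It is an illustration under stated assumptions, not the implementation used in this study: the names `train_three_periodic` and `sample_sequence` are hypothetical, the training sequence is assumed to exclude its terminal stop codon, and it is treated as circular so that every sampled context has a known continuation.

```python
import random
from collections import defaultdict

def train_three_periodic(cds):
    """Count base occurrences given the two preceding bases and the codon phase."""
    if not cds or len(cds) % 3 != 0:
        raise ValueError("expected a non-empty whole number of codons")
    # counts[phase][context][base]; phase is the codon position (0, 1, 2) of `base`
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for i in range(len(cds)):
        # Negative indices wrap around: the circular-sequence assumption.
        context = cds[i - 2] + cds[i - 1]
        counts[i % 3][context][cds[i]] += 1
    return counts

def sample_sequence(counts, seed_bases, length, rng):
    """Generate a random sequence; `seed_bases` are the first two training bases."""
    seq = list(seed_bases)
    while len(seq) < length:
        options = counts[len(seq) % 3][seq[-2] + seq[-1]]
        bases = list(options)
        weights = [options[b] for b in bases]
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq)

cds = "ATGGCTAAAGGTCTGACCGAT"  # toy coding sequence without its terminal stop codon
model = train_three_periodic(cds)
print(sample_sequence(model, cds[:2], 30, random.Random(0)))
```

Because the third codon position is sampled conditional on the first two, every in-frame codon in the output is one that occurs in frame in the training data. A model trained on sequences without in-frame stop codons therefore cannot generate them, while stop codons in the other two frames remain free to vary.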
The results of the present study support the general presence of selection for OSCs in prokaryotic genomes, with more than 93% of the examined genomes clearly showing OSC overrepresentation under the nucleotide-based Markov models. In further support of the ambush hypothesis, the magnitude of OSC overrepresentation is significantly correlated with G+C content: genomes with higher G+C content tend to have a higher degree of OSC overrepresentation. As the same genomes have fewer OSCs, as shown in the first part of the study, the increased OSC overrepresentation might well be a compensatory mechanism to boost the number of OSCs. This observation highlights a previously overlooked aspect of the ambush hypothesis: selection for OSCs can occur simultaneously at multiple levels, and there exists a complex layer of interaction among them.
On the other hand, the magnitude of OSC overrepresentation was found to be quite modest, not exceeding 6% even in the most pronounced case. However, before dismissing the practical significance of the effect, it should be remembered that the present calculations were done on a per-genome basis. Taking Yersinia enterocolitica as an example, an OSC overrepresentation of around 0.64 per 1000 codons in its 4.6 Mb genome would translate to an excess of over 800 OSCs. Even weak selection for OSCs can sometimes produce unexpected and significant effects on the phenotype, as exemplified by the recent discovery of a positive association between the number of mitochondrial OSCs and the accuracy of vertebrate morphogenetic development [39]. In our results, OSC overrepresentation is, in general, negatively correlated with the optimal growth temperature of the organism. We hypothesize that low temperatures may promote non-specific binding of transcriptional or initiation factors to incorrect sites and thus confer a selective advantage on a greater abundance of OSCs. While there is insufficient data to indicate that translational or transcriptional error rates are elevated at low temperatures, we note that our proposed mechanism shares conceptual and functional similarities with the arrest of initiation factor-dependent translation initiation mediated by the cold shock response [40].
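The Yersinia enterocolitica figure quoted above can be checked with simple arithmetic. The sketch below uses only the two numbers given in the text (a 4.6 Mb genome, read here as 4.6 million base pairs, and an excess of 0.64 OSCs per 1000 codons); the coding fraction of the genome is not stated in this section, so the code computes an upper bound and the coding fraction that an excess of 800 would imply, rather than assuming a value. The function name `excess_oscs` is hypothetical.

```python
def excess_oscs(n_codons, excess_per_1000_codons):
    """Excess OSC count for a given number of codons and a per-1000-codon rate."""
    return n_codons * excess_per_1000_codons / 1000

GENOME_BP = 4.6e6        # genome size from the text, in base pairs
EXCESS_PER_1000 = 0.64   # OSC overrepresentation from the text

# Upper bound: every base pair of the genome counted as coding sequence.
upper_bound = excess_oscs(GENOME_BP / 3, EXCESS_PER_1000)

# Number of codons, and coding fraction of the genome, needed for an excess of 800.
codons_needed = 800 * 1000 / EXCESS_PER_1000
coding_fraction_needed = codons_needed * 3 / GENOME_BP

print(f"upper bound on excess OSCs: {upper_bound:.0f}")
print(f"codons needed for an excess of 800: {codons_needed:.0f}")
print(f"implied coding fraction: {coding_fraction_needed:.2f}")
```

An excess of over 800 OSCs is thus consistent with the quoted rate provided that roughly four fifths or more of the genome is coding sequence.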
Our results provide a picture of OSC selection averaged over the whole genome. As the probability and adverse effects of frameshift occurrences may vary between individual genes, so will the "selection pressure" to incorporate additional OSCs into their coding sequences. Thus, it is possible that excess OSCs are not evenly distributed but are concentrated in a subset of genes, in which they may exert a pronounced effect against the translation of frameshift peptides. Logically, genes with frameshift-prone slippage regions such as homopolymeric tracts [41] would benefit most from excess OSCs. Alternatively, highly expressed genes may also be under selection for more OSCs, as the absolute number of errors would increase with greater transcription and translation activity. While the uneven distribution of OSCs in genes and genomes was not explored in the present study, we calculated the ratio of OSCs in the +2 and +3 frames, which showed significant variation among the different genomes and could not be fully explained by the genomic G+C content, as shown in figure . With respect to the importance of the physical distribution of OSCs, the concept of the "tri-frame model" and its application of the ribosome occupancy distribution may provide a useful framework for understanding the uneven distribution of OSCs with respect to reducing mistranslation and modulating gene expression [42].
The diversity of results from the detailed analysis of selected genomes is useful in showing that codon or dipeptide biases alone cannot explain the near-universal observation of OSC overrepresentation in prokaryotic genomes. We note that the expected OSC frequencies under the dicodon bias model closely match the observed frequencies, suggesting that dicodon bias may play an important part in affecting OSC occurrences. However, there appear to be exceptions, such as Pyrococcus furiosus, for which the dicodon bias model failed to reproduce the observed OSC frequency (Table ). A related observation is that the simpler models appear inadequate for describing the compositional biases in the coding sequences. For example, the zeroth-order codon-based Markov model assumes complete independence of each codon from its neighbors, thus implying the absence of dinucleotide or other compositional biases across codon boundaries. Hence, the presence of biologically inaccurate assumptions renders the model irrelevant for comparison. Since the above results have largely ruled out the role of the lower-order compositional biases, another prime candidate for contributing to OSC overrepresentation is local synonymous codon usage. This possibility could not be confirmed with the current methods and deserves exploration in future studies.
Maintenance of the reading frame of a coding sequence is a complex and error-prone process. During the transfer of genetic information from DNA to protein, errors resulting in frameshifts may occur during DNA replication, mRNA transcription and ribosomal translation. Some errors may also arise from DNA and RNA mutations, which may occur spontaneously or be induced by mutagens. To minimize the metabolic impact of these errors, the cell has several layers of defense. First, the relevant cellular processes have been highly optimized to avoid errors in the first place. For example, higher fidelity of DNA replication can be achieved with the use of proofreading DNA polymerases. Next, if errors nonetheless occur, the appropriate response mechanisms are engaged. Damaged DNA may be recognized and corrected by the cellular DNA repair machinery, while translational frameshifts may be reduced by frameshift suppressor tRNAs. Finally, the cell possesses a certain degree of metabolic robustness to resist the negative effects of these errors, such as the presence of alternative metabolic pathways. In this framework, the selection of OSCs in coding sequences could be considered a passive second layer of defense against frameshift errors. It is uncertain whether there is greater selective pressure against transcriptional than translational frameshifts, given that the effect of OSCs is identical in both cases. A related mechanism identified as playing a similar role in potentially reducing mistranslation errors is selection on codon-pair context during gene evolution, which maximizes mRNA decoding fidelity by optimizing translational efficiency [43]. This effect would be independent of, and additive to, that provided by OSCs.
As a final note, we would like to explore the differences between the present results and those from a previous study [17]. In that study, the authors examined the preferred and avoided dicodons in different genomes and noted that some avoided dicodons allow for out-of-frame UAA/UAG stop codons (but not UGA stop codons) in alternate reading frames. However, to put their findings in perspective, we note that the set of preferred dicodons also included dicodons that encode such OSCs, and no calculations were performed to confirm whether the net effect of the dicodon bias actually decreased OSC frequencies. Thus, there was no direct demonstration of OSC avoidance in the genomes. More importantly, by calculating the odds ratio of dicodon frequencies based on the constituent codon frequencies, they have shown only the effect of dicodon bias on overall OSC frequencies and not the actual difference between observed and expected OSC frequencies. For instance, our analysis of Laribacter hongkongensis strain HLHK9 [44] (Table ) revealed OSC overrepresentation in its genome even though its dicodon bias actually decreased OSC abundance relative to its codon usage and dipeptide biases. Hence, the results from the previous study are not sufficiently informative for the investigation of OSC selection in genomes.