To initiate translation in prokaryotes, a ribosome binds to a specific region of the mRNA and then recognizes a nearby start codon. The position of the first nucleotide base pair (bp) of the start codon is called the translation initiation site (TIS). The sequence upstream of the TIS, the start codon itself, and the sequence downstream of the TIS show specific patterns that differ from genome to genome. In most prokaryotic genes, the region about 20 bp upstream of the TIS contains the purine-rich Shine-Dalgarno sequence [1]. However, increasing numbers of genes without a Shine-Dalgarno sequence, known as leaderless genes if they also lack a 5'-untranslated region, have been reported in archaeal genomes [2]. Genome-wide computational analysis of leaderless genes revealed A/T-rich sequences in a region about 30 bp further upstream [3]. In most cases the start codon is ATG, which is strongly preferred over alternatives such as TTG and GTG [4]. Sequences downstream of the TIS exhibit a periodicity of three in codon usage. Comparative genomic studies show that the sequence patterns around the true TIS may differ significantly between genomes. With the aid of a sequence logo tool, Torarinsson et al. and Zhu et al. reported the variation of sequence patterns among dozens of archaeal genomes, which sheds light on the divergence of translation initiation mechanisms in prokaryotes.
Knowledge of the exact TIS is essential for experiments that identify natively purified proteins by N-terminal amino acid sequencing, as well as for heterologous protein production [6]. However, there are increasing concerns about the quality of TIS annotation in widely used databases such as GenBank and RefSeq [5]. Earlier microbial genome projects tended to annotate the 5'-most candidate start that is in frame with the stop codon [7]. In addition, Poole et al. observed strong discrepancies in TIS annotation between the CMR and RefSeq databases for several genomes. Despite manual corrections and periodic updates, the quality of current TIS annotations remains largely uncertain, and an independent method for assessing the reliability of TIS annotation is therefore desirable. Such a method, if successful, could also provide hints for further improvement. The need for it is becoming more urgent because databases such as RefSeq are so widely used by experimental biologists that annotation errors may have a large impact.
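The "5'-most candidate start" convention mentioned above can be illustrated with a minimal sketch. It assumes the coding strand is given as a plain string, that only the three start codons named in this section (ATG, GTG, TTG) are considered, and that the stop codons are TAA, TAG and TGA; the function names are hypothetical and are not taken from any annotation pipeline.

```python
# Assumption: only the start codons named in the text are considered.
START_CODONS = {"ATG", "GTG", "TTG"}
# Assumption: the three standard stop codons apply to the genome at hand.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, stop_pos):
    """Return 0-based positions of in-frame candidate start codons.

    `seq` is the coding strand and `stop_pos` is the index of the first
    base of the gene's stop codon. The scan walks upstream one codon at a
    time and halts at the previous in-frame stop codon, so every returned
    position lies in the same open reading frame. Positions are ordered
    from 5' to 3'.
    """
    seq = seq.upper()
    starts = []
    pos = stop_pos - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # upstream boundary of the open reading frame
        if codon in START_CODONS:
            starts.append(pos)
        pos -= 3
    return starts[::-1]


def five_prime_most_start(seq, stop_pos):
    """Return the 5'-most candidate start, or None if the frame has none."""
    starts = candidate_starts(seq, stop_pos)
    return starts[0] if starts else None
```

Every additional entry returned by `candidate_starts` is a position at which an annotation can be wrong, which is why always choosing the 5'-most candidate tends to place the TIS too far upstream whenever the true start is an internal one.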
Several attempts have been made to assess the reliability of TIS annotation. Nielsen and Krogh [8] were the first to make a serious large-scale assessment of the reliability of the TIS annotation in RefSeq, but their approach, which takes EasyGene 1.2 as the "gold standard" for comparison, is questionable. As we will see later, EasyGene's own accuracy is not outstanding, so the resulting assessment is biased and of limited value. Frishman et al., using the Orpheus program, showed that the information content of aligned TIS upstream sequences correlates with TIS prediction accuracy. Zhu et al. made a qualitative assessment of the relative TIS annotation quality of two TIS predictors by comparing the sequence logos [11] of aligned TIS upstream sequences. In this assessment, the sequence logo around the aligned TISs of a consensus set predicted by both predictors (the consensus logo) is considered reliable; its difference from the sequence logo of the aligned TISs of a 'specific' set predicted by only one program (the specific logo) then indicates, qualitatively, the TIS accuracy of that program. Taking S. solfataricus as an example, Zhu et al. showed that the specific logo of MED 2.0 is very similar to the consensus logo obtained jointly with the GenBank annotation, whereas the specific logo of GenBank shows almost no sequence pattern. This result suggests that the accuracy of the GenBank TIS annotation for S. solfataricus is lower than that of MED 2.0. In general, however, no systematic method exists for computationally evaluating the accuracy of TIS prediction.
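The quantity underlying both the sequence logo and the information-content comparison can be sketched as follows. This is a minimal illustration, assuming equal-length DNA windows aligned at the annotated TIS, a uniform background over the four bases, and no small-sample correction; the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(aligned, alphabet="ACGT"):
    """Per-column information content, in bits, of aligned sequences.

    For each column this is log2(len(alphabet)) minus the Shannon entropy
    of the observed base frequencies (2 bits at most for DNA). Characters
    outside `alphabet` are ignored; a column with no valid character gets
    an information content of 0.0.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    max_bits = math.log2(len(alphabet))
    result = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in aligned)
        total = sum(counts[base] for base in alphabet)
        if total == 0:
            result.append(0.0)
            continue
        entropy = 0.0
        for base in alphabet:
            p = counts[base] / total
            if p > 0:
                entropy -= p * math.log2(p)
        result.append(max_bits - entropy)
    return result
```

Under these assumptions, a set of correctly annotated TISs should show columns of high information content in the upstream region, whereas a set dominated by misplaced TISs should be close to zero throughout, which is the pattern described for the specific logo of GenBank.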
We propose here a computational method to quantitatively estimate the TIS annotation accuracy of a prokaryotic genome; the annotation can be provided by either a program or a database. The method is based on a homogeneity assumption: the sequence patterns around TISs, as represented by a position weight matrix (PWM), are homogeneous across any generic subset of the genes of a genome. The whole set of TIS predictions is split into two sets: a reference set, constructed to be nearly 100% accurate (see section "Reference set"), and a second set whose predictions are only partially accurate and are to be quantitatively evaluated. We assume that both sets are generic subsets; this assumption is difficult to prove, but it is sound as a first approximation. The PWM around the predicted TISs in the second set is then modelled as a linear combination of three elementary PWMs: one around true TISs, and two around false TISs located upstream and downstream of the true TIS, respectively. All three elementary PWMs are obtained from the sequence patterns of the reference set and therefore naturally carry genome-specific features. A generalized least squares estimator then determines the weight of each of the three PWMs, and the weight of the true-TIS PWM gives the accuracy of the TIS annotation in the second set. From this, the prediction accuracy over the entire genome is derived.
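The estimation step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the estimator used in this work: the PWMs are taken to be column-normalised base-frequency matrices, the error covariance of the generalized least squares fit is assumed diagonal (identity by default, which reduces to ordinary least squares), and the genome-wide accuracy is assumed to be the size-weighted average of the two sets. All function and parameter names are hypothetical.

```python
import numpy as np


def frequency_matrix(aligned, alphabet="ACGT", pseudocount=0.0):
    """Column-normalised base frequencies (rows: bases, columns: positions).

    Assumes equal-length windows aligned at the TIS and at least one valid
    base per column (or a positive pseudocount).
    """
    index = {base: i for i, base in enumerate(alphabet)}
    counts = np.full((len(alphabet), len(aligned[0])), pseudocount, dtype=float)
    for seq in aligned:
        for pos, base in enumerate(seq.upper()):
            if base in index:
                counts[index[base], pos] += 1.0
    return counts / counts.sum(axis=0, keepdims=True)


def mixture_weights(observed, elementary, variances=None):
    """Estimate w such that observed ~ sum_k w[k] * elementary[k].

    Solves w = (X^T O^-1 X)^-1 X^T O^-1 y, where y is the flattened
    observed PWM, the columns of X are the flattened elementary PWMs, and
    O is a diagonal covariance given by `variances` (assumption: identity
    when omitted).
    """
    y = np.asarray(observed, dtype=float).ravel()
    X = np.column_stack([np.asarray(m, dtype=float).ravel() for m in elementary])
    if variances is None:
        omega_inv = np.ones_like(y)
    else:
        omega_inv = 1.0 / np.asarray(variances, dtype=float).ravel()
    XtW = X.T * omega_inv  # scales each observation by its inverse variance
    return np.linalg.solve(XtW @ X, XtW @ y)


def genome_accuracy(n_reference, n_evaluated, evaluated_accuracy,
                    reference_accuracy=1.0):
    """Size-weighted average accuracy of the two sets (an assumption)."""
    total = n_reference + n_evaluated
    return (n_reference * reference_accuracy
            + n_evaluated * evaluated_accuracy) / total
```

In this sketch the first entry of the vector returned by `mixture_weights`, when the true-TIS PWM is passed first, plays the role of the accuracy of the evaluated set. Because every column of a frequency matrix sums to one, the weights of an exact mixture sum to one without an explicit constraint; with noisy data they need not, and how to handle that case is left open here.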
The validity of the method is established with tests on the experimentally verified TIS set EcoGene [12]. The method is then applied to estimate the TIS annotation accuracy for 532 genomes in public databases and publicly available programs such as RefSeq [13], ProTISA [14], EasyGene [8], GeneMarkS [7], Glimmer 3 [16] and TiCo [17]. Finally, this analysis has led to the construction of a new TIS database, SupTISA, whose TIS annotation is substantially more accurate than that of RefSeq.