Six-fold cross-validation tests
To test the new algorithm, six-fold cross-validation tests are performed. In the version of MIPS database, Release September 27, 1999, the ORFs were classified into six classes, in which the first class consists of 3199 entries corresponding to the known proteins. Excluding the protein coding genes from the mitochondria and those containing introns, 2958 protein coding genes of the first class residing at the 16 yeast chromosomes remain. The number of the mitochondrial genes available at present is too limited to perform a statistical study. They are thus excluded from the present study. Randomly divide the 2958 genes into two unequal parts, in which the larger part consists of 1958 genes, and the smaller consists of 1000 genes. The former serves as a training set used to find the Fisher coefficients; whereas the latter serves as a test set used to test the accuracy of the algorithm.
As mentioned above, both the training and test sets should be accompanied by the counterparts of negative samples. We have randomly selected about 6000 intergenic sequences with length longer than 300 bp from the 16 yeast chromosomes, and each of them starts with ATG and ends with one of the stop codons. The detailed procedure to select the intergenic sequences is described as follows. For each of the 16 yeast chromosomes:
(i) Find the number and locations of the ORFs annotated in the MIPS database and denote the number of ORFs by K.
(ii) Calculate the length for each of the (K–1) DNA sequences between any two adjoining ORFs. Ignore sequences where the length is <300 bp.
(iii) For all sequences ≥300 bp, starting from the first base, search for the first ‘ATG’ codon encountered along the sequence. In the downstream direction, starting from the 101st codon beginning from ATG, search for the first stop codon encountered. Then the DNA sequence starting from ATG and ending with one of the stop codons is regarded as one candidate for the intergenic sequences. Note that this is not an ORF because there often may be several stop codons within it. Continue to search for more intergenic sequences in the downsteam direction until no more can be found in the remaining sequence.
(iv) Repeat step (iii) for each of the six phases in the sequence. The possible numbers of such sequences are quite large. Randomly select about 6000 such sequences from the 16 yeast chromosomes as the intergenic sequences used for complementing the Fisher algorithm. A computer program has been written to do this job. We should point out that the lengths of the intergenic sequences thus obtained are roughly equal to the ORF lengths, but not identical. Because the present algorithm is based on the difference of the base composition between coding and non-coding sequences, the non-identity of the lengths between the two kinds of sequences does not seem to be a major problem. When the lengths of both kinds of sequences are >300 bp, the calculated results of base composition are not usually sensitive to small variations in sequence length.
Randomly select 1958 and 1000 intergenic sequences from the 6000 sequences, which form the training and test sets of negative samples, respectively. In summary, the training set consists of 1958 positive samples (true genes) and 1958 negative samples (intergenic sequences). The test set consists of 1000 positive samples (true genes) and 1000 negative samples (intergenic sequences). Using the sequences in the training sets, the Fisher coefficients c0, c1, c2, … and c10 are determined. Using the Fisher coefficients just obtained, the accuracy of the gene-finding algorithm is calculated based on the test set.
Repeating the above procedure three times, we have performed 3-fold cross-validation tests. The sensitivity, specificity and accuracy of each test are listed in Table . As can be seen, all three quantities obtained are >95%.
The accuracy of the gene-recognition algorithm for three different test sets
There are 223 intron-containing genes of the 1st class in the MIPS database. These ORFs are used as an independent test set to perform another 3-fold cross-validation test. Consequently, the accuracy (defined as the sensitivity) is always >95% for each of the above three tests.
We now discuss the definitions of accuracy, sensitivity and specificity, which are used to evaluate the performance of the algorithm. The notations used here are the same as those used by Burset and Guigo (15
). Using TP and FN to denote the number of coding ORFs that have been predicted as coding and non-coding, respectively, we define the sensitivity sn
That is, sn is the proportion of coding ORFs that have been correctly predicted as coding. Similarly, using TN and FP to denote the number of intergenic sequences that have been predicted as non-coding and coding, respectively, we define the specificity sp as
That is, sp is the proportion of intergenic sequences that have been correctly predicted as non-coding. The accuracy is defined as the average of sn and sp.
The definition of sp
in equation 13
may cause problems in recognizing genes along the genomic DNA sequence. Because the frequency of non-coding nucleotides is generally much larger than that of coding ones, TN >> FP, and therefore sp
tends towards 1. To solve this problem, instead of using the definition of sp
in equation 13
, one used the refined definition (15
However, in the present study, the test set consists of 1000 coding ORFs and 1000 intergenic sequences, respectively, and it is therefore appropriate to use sp as defined in equation 13, rather than in equation 14.
The final Fisher coefficients
The 2958 positive samples (true genes) are merged together as a new training set. The 2958 negative samples are selected randomly from the 6000 intergenic sequences mentioned above. The random selection is repeated three times. Consequently, we have three experiments. For each experiment the positive samples are identical, whereas the negative samples are different each time. Calculating the Fisher coefficients for each experiment, the results are listed in Table . The final Fisher coefficients are obtained by simply averaging the corresponding values for the three experiments, which are listed in the last column of Table . The Fisher coefficients c0 ~ c10 make an internally consistent set. Averaging with coefficients from several experiments may break the internal consistency. However, since the variations of coefficients for different experiments are considerably small, as shown in Table , the problem is not severe. On the other hand, the Fisher super-plane in the 10-D space is described by the equation c·u – c0 = 0. To take advantage of each experiment, averaging the coefficients allows to adjust the position and orientation of the super-plane slightly.
Fisher coefficients for three different training sets and their averages
Apply the algorithm to recognize yeast genes
As mentioned above, in the version of the MIPS database, Release September 27, 1999, the ORFs were classified into six classes, which consist of 3199, 248, 869, 789, 805 and 447 entries, respectively. They correspond to known proteins (1st class), strong similarity to known proteins (2nd class), similarity or weak similarity to known proteins (3rd class), similarity to unknown proteins (4th class), no similarity (5th class) and questionable ORFs (6th class), respectively. Using the final Fisher coefficients and the criterion of c·u > c0 / c·u < c0 for making the decision of coding/non-coding, we re-recognize the nuclear genes from the ORFs in the 2nd ~ 6th classes in the MIPS database. The detailed results are listed in Tables and , for the non-coding ORFs in the 2nd ~ 5th classes and the 6th class, respectively, in which the names of non-coding ORFs are clearly indicated. As shown in Table , 434 ORFs of the 2nd ~ 5th classes in the MIPS database are recognized as non-coding. Similarly in Table , 340 ORFs of the 6th class are recognized as non-coding. However, due to the limited sensitivity (95%) and specificity (95%) achieved, statistically, 119 of the 434 ORFs listed in Table and four of the 340 ORFs listed in Table (see calculations below), are actually coding genes. We cannot identify which 119 of the 434 or which four of the 340 ORFs are coding genes at present, unless the sensitivity and specificity are further increased.
The 434 ORFs of the 2nd ~ 5th classes in the MIPS database, which are recognized as non-coding
The 340 ORFs of the 6th class in the MIPS database, which are recognized as non-coding
The four quantities TP, TN, FP and FN mentioned above can be calculated, based on the sensitivity, specificity and the gene-recognition result obtained. The calculation for recognizing genes of the 2nd ~ 5th class ORFs in the MIPS database should be performed first. The total number of ORFs to be recognized is 2710, of which 2276 and 434 are recognized as coding and non-coding, respectively. We have a set of four equations as follows: TP/(TP + FN) = 0.95; TN/(TN + FP) = 0.95; TP + FP = 2276 and TN + FN = 434. Solving the above set of equations, we find TP ≈ 2259; TN ≈ 315; FP ≈ 17 and FN ≈ 119. The number of real coding ORFs should be equal to TP + FN ≈ 2378. Of the 434 ORFs recognized as non-coding, statistically, 119 (FN) are actually coding. Next, the calculation for the 6th class ORFs in the MIPS database should be performed. The total number of ORFs to be recognized is 439, of which 99 and 340 are recognized as coding and non-coding, respectively. In this case, the set of four equations consists of: TP/(TP + FN) = 0.95; TN/(TN + FP) = 0.95; TP + FP = 99 and TN + FN = 340. Solving this set of equations, we find TP ≈ 81; TN ≈ 336; FP ≈ 18 and FN ≈ 4. The number of real coding ORFs should be equal to TP + FN ≈ 86. Of the 340 ORFs recognized as non-coding, statistically, four (FN) are actually coding.
Based on the above results, we re-estimate the number of protein coding genes in the 16 yeast chromosomes. The total number should be equal to the number of intronless genes in the 1st class (2958) + the number of intron-containing genes in the 1st class (223) + the number of coding ORFs in the 2nd ~ 5th classes (including intronless and intron-containing genes) recognized by the present algorithm (2378) + the number of coding ORFs in the 6th class (including intronless and intron-containing genes) recognized by the present algorithm (86). The final result is 5645. Considering the fact that the actually sensitivity and specificity are >95% (see Table ), the above estimate should be considered as an upper limit. Note that the above number (5645) does not include the mitochondrial genes. The estimate that the total number of the nuclear protein coding genes in the yeast genome is ≤5645 conflicts with the previous estimate of 5800–6000 genes (10
The YZ score for each ORF annotated in the MIPS database is calculated. The distribution of the YZ scores for the 2958 genes classified as ORFs of the 1st class in the MIPS database is shown in Figure . Here the y-axis indicates the YZ scores, whereas the x-axis indicates the rank number of ORFs, arranged according to the increasing order of the YZ scores. For comparison, the YZ scores for 2958 negative samples (intergenic sequences) are also calculated. The corresponding plot is also shown in Figure . As can be seen, for most genes the points are situated above the threshold 0.5, denoted by a horizontal line, whereas for most intergenic sequences the points are situated below the threshold 0.5. This fact demonstrates the accuracy of the new algorithm in distinguishing between the two kinds of DNA sequences. Furthermore, the curves clearly displaying the above two distributions are shown in Figure . Both distribution curves are well fitted by normal distributions with a small overlapping area between them. For comparison, the curve displaying the distribution of YZ scores calculated for the 2669 ORFs of the 2nd ~ 5th classes in the MIPS database is also shown. This curve is also well fitted by a normal distribution. As can be seen, the third normal distribution curve is in between the former two, indicating that a fraction of the ORFs of the 2nd ~ 5th classes are actually non-coding. This observation is in agreement with the data listed in Table .
Figure 1 Distribution of the YZ scores for the 2958 protein coding genes of the 1st class in the MIPS database. Here the y-axis indicates the YZ scores, whereas the x-axis indicates the rank number of ORFs, arranged according to the increasing order of the YZ (more ...)
Figure 2 Distribution curves showing the YZ score distributions for 2958 genes and 2958 intergenic sequences in the yeast genome, respectively. Here the x-axis indicates the YZ score, whereas the y-axis indicates the probability of the genes or intergenic sequences (more ...)
On the mystery of orphan ORFs
There are more than 7000 ORFs longer than 300 bp in the yeast genome (4
). For some of them, known as orphan ORFs (17
), neither their function nor homology is known. With the increase in known genes, more orphans should be found to have homologous relationships with the known genes and, as a result, the number of orphans should decrease. In fact, this is not the case. This paradox was deemed as a mystery of orphans (17
). However, the results presented in this paper give some insight into the problem. According to the classification of ORFs in the MIPS database, orphans are mainly assigned to the 5th class (no similarity) and the 6th class (questionable, including no similarity to other ORFs). As can be seen from Table , of the 805 ORFs in the 5th class, 193 (24%) are non-coding. Furthermore, of the 439 ORFs in the 6th class, 340 (77%) are non-coding. In other words, more than 500 orphans or partially overlapping ORFs are actually not protein-coding genes. After removing these ORFs from the list of orphans in the MIPS database, there remain some real orphans which may be true protein-coding genes whose functions and homology need to be explored.
The percentages of non-coding ORFs of the 2nd ~ 6th classes recognized by the present algorithm, over the total numbers of ORFs in the classes
We should emphasise that the present work is based on an assumption that has not been clearly elucidated previously. The claim of <5% error is valid only when all of the unknown genes have the same statistical properties as the known genes. As pointed out by an anonymous referee, this obviously will not be the case, especially for the ORFs in the 6th class. Some genes tend to have low expression levels and many of them will only express in extreme conditions. Since they are under-represented in the training sample (i.e., in the ORFs of the 1st class), they would mostly be predicted to be non-coding. One would not be surprised if many of the ‘non-coding’ ORFs predicted in Table later turn out to be coding. Therefore, based on this consideration, the predictive error for the ORFs in the 6th class would be >5%. We remind readers that the results listed in Table should be referred to with caution.
It will be very interesting to see if most or many ORFs listed in Table will be experimentally verified to be functional genes in the future. If the answer is yes, we have to say that the DNA sequences coding for these genes have different statistical properties with those coding for genes of the 1st class in the MIPS database. Alternatively, if the answer is no, the statistical properties for both the 1st and 6th class ORFs should be similar. To avoid the inherently circular argument, we have compared the distributions of bases at the first and second codon positions for the 1st and 6th class ORFs in the MIPS database with those of other species, specifically human, Escherichia coli
, etc. One cannot simply compare the base distributions at the third codon position between different species, because the distributions are species-dependent (19
). Consequently, we have found that the distributions of bases at the first and second codon positions for the 1st class ORFs in the MIPS database of the yeast genome show considerable similarity to those of genes for other species. In contrast, the distributions of bases at the first and second codon positions for the 6th class ORFs are not only remarkably different from those of the 1st class ORFs, but are also remarkably different from those of genes from other species. It is thought that the distributions of DNA bases at the first and second codon positions reflect the need for native folding of proteins (19
). Based on this consideration, it is unlikely that most or many ORFs listed in Table code for proteins.