The application of gene prediction algorithms to whole-genome sequencing data produces hundreds of ‘hypothetical’ genes with low similarity to well-known genes and poor EST coverage. The quantities and boundaries of hypothetical genes annotated using different prediction algorithms have little overlap. Moreover, updated versions of the same gene-finder often produce drastically different results. For example, human annotated genes for the Build 35 (1
) Build 36 (2007) and Build 37 (2009) genomic assemblies have little overlap in their hypothetical gene subsets. Therefore, there is a continuing need for improvement in gene prediction. For eukaryotes, a primary component of gene prediction is the discrimination of exonic and intronic sequences, which is the focal point of this article. While other factors such as identifying splicing junctions and promoters are important, they are beyond the scope of the current study.
Markov models (MMs) are widely used in bioinformatics, particularly for gene prediction (2
). The order
of a MMs is the number of contiguous nucleotides used to obtain the probability of occurrence of the next nucleotide in a sequence. Generally, the order of the MM ranges from order 0 (single nucleotide analysis with no memory) to about 5 (3
). Larger order MMs tend to be more accurate. In order to train a MM of order 5, the frequencies of all possible 6-mer oligonucleotides within the training sequences must be determined. This is feasible with a training set comprising several hundreds of genes. For higher order MMs the size of the training set must be drastically increased. For example, MMs of order 9 (10-mer oligonucleotides) may require more than 10
000 genes in training set. Due to such an exponential growth in the training set, the limit of the order of MM usable for gene prediction is near 7. Thus, contemporary MMs are limited to analyzing short-range (1–8
bp) DNA patterns for sequence classification. However, longer mid-range (10–44
bp) sequence patterns are abundant and functionally important in the genomes of higher eukaryotes (5
). These mid-range patterns are not adequately leveraged for sequence classification if one analyzes only the composition of 1–8
nt long oligonucleotides using conventional MM approaches.
In this study, we tailor homogeneous Markov chain algorithms to use mid-range genomic patterns for sequence classification. Specifically, our approach utilizes longer oligonucleotides (from 10 to 44
bp) within Markov model exon/intron classifiers by first abstracting the nucleotide information content into representative binary strings. explains various levels of nucleotide to binary conversions. The process for conversion of nucleotide letters into 0 or 1 digits is specified by an ‘abstraction scheme.’ The smallest level of abstraction (reducing the information present by about a half) occurs when each nucleotide is converted into a binary digit (zero or one). Abstraction schemes for single nucleotides are said to be at binary abstraction level 1
or BA1 (second row in ). An example BA1 abstraction scheme would be to convert all purines to 0s and pyrimidines to 1s. After the conversion of exons and introns into binary strings, they are processed with the usual homogenous MM training algorithm and tested for the exon/intron discrimination ability. We refer to such Markov models trained with binary representations of nucleotide sequences as Binary Abstraction Markov Models (BAMM).
Figure 1. The abstraction of nucleotides into binary sequences starting at abstraction level 0 (no abstraction) up to binary abstraction level 4 (BA4). A sliding window for the homogeneous MM algorithm shows the information being processed on the binary level as (more ...)
The highest abstraction level investigated in this study, BA4, shown in , reduces each of 256 4-mer oligonucleotide words to a single 0 or 1. For example, the sequence fragment ‘ATGCGGATAACC’ with a specific abstraction scheme (ATGC
0; etc) can be converted into a three-digit-long ‘010’ binary string. When using binary sequences as opposed to nucleotide sequences, the alphabet is reduced to from four characters to two. Therefore, longer MM orders are permissible, given the reduced number of possible words for a given order. A training set of several hundred genes is sufficient for use of 11-bit long binary strings (e.g. 11000101100) required for MM order 10. Such 11-bit strings can correspond to 11-mer oligonucleotides if converted at the BA1 abstraction level, and oligonucleotides up to length 44 at BA4.
There are an astronomical number of possible abstraction schemes that can convert nucleotide sequences into binary ones (for the BA4 abstraction level, 2256 ~ 1077 abstraction schemes are possible.) Using optimization algorithms on a supercomputer, we found optimal schemes for exon/intron discrimination at abstraction levels BA1 through BA4. We show that the best BA4 abstraction scheme (BA4-best) is more informative in terms of discrimination (87% accuracy) than a conventional order 5 homogenous MM that uses nucleotide sequences (84% accuracy).
We last demonstrate that it is possible to effectively combine several abstraction schemes into a very powerful exon/intron discriminator using a support vector machine (SVM) approach. This SVM-unified method achieves ~95% accuracy in exon/intron classification. Furthermore, this is achieved without the reading-frame information required by an inhomogenous MM. After describing the development and use of BAMM in more detail, we will discuss the biological meaning of mid-range genomic patterns and approaches to utilize them in gene finders.