Current methods for gene prediction perform well on genes that are “typical” in several respects, including number and length of exons, length of introns, quality of splice sites, and conservation (similarity to known genes). It is significantly more difficult to identify and produce correct models of genes with extremely long introns or short exons or that have diverged extensively from other genes; this is particularly true for genes that do not code for proteins. Such genes would be composed almost entirely of intronic sequence and could be practically “invisible” to current computational gene prediction methods. Some divergent genes may also be difficult to discover by experimental observation of transcripts if their range of expression is restricted to one or a few cell types or if they are expressed at very low levels.
We found that transcribed sequences hold significant information about the direction of transcription, in the form of significant orientation biases of (1) nucleotide composition, (2) mutations within interspersed repeats, (3) the interspersed repeats themselves, and (4) PASs. We implemented and integrated four algorithms. GREENS and CHOWDER rely on biases introduced by transcription in the germline, but the generality of the skews suggests that this includes a large fraction of the genes [32]. ROAST and PASTA reflect functional transcription in both somatic and germline tissues. The observed skews are evidence for sustained transcription over evolutionary time and are not caused by “transcriptional noise,” i.e., indiscriminate transcription of random regions of the genome, which complicates the interpretation of most experimentally based transcript identification methods.
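The idea of a strand-dependent nucleotide composition bias can be illustrated with a minimal sketch. The exact statistic used by the nucleotide-composition method is not given in this section, so the code below assumes the commonly used skews (G−C)/(G+C) and (T−A)/(T+A), computed in fixed-size windows; the function names and the window parameters are hypothetical. The property that matters for orientation calls is that both skews change sign when the sequence is reverse-complemented.

```python
from typing import Iterator, Tuple

# Translation table for reverse-complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def skews(seq: str) -> Tuple[float, float]:
    """Return (GC skew, TA skew) for one sequence.

    ASSUMPTION: skew is defined as (G-C)/(G+C) and (T-A)/(T+A).
    A skew is reported as 0.0 when its denominator is zero.
    """
    s = seq.upper()
    g, c, a, t = s.count("G"), s.count("C"), s.count("A"), s.count("T")
    gc_skew = (g - c) / (g + c) if g + c else 0.0
    ta_skew = (t - a) / (t + a) if t + a else 0.0
    return gc_skew, ta_skew


def windowed_skews(seq: str, window: int, step: int) -> Iterator[Tuple[int, float, float]]:
    """Yield (start, GC skew, TA skew) for successive windows.

    If the sequence is shorter than `window`, a single window covering
    the whole sequence is produced.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        gc_skew, ta_skew = skews(seq[start:start + window])
        yield start, gc_skew, ta_skew
```

Because the skews are antisymmetric under reverse complementation, a consistent sign across many windows is what would suggest an orientation; a real implementation would also need to mask repeats and assess statistical significance.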
For the purpose of the current work, interspersed repeats have two interesting characteristics. First, since copies generally do not adopt a function within the genome, they accumulate substitutions in a neutral fashion. The availability of sufficient copies allows for a relatively accurate reconstruction of the element sequence at the time of integration, while comparison of the extant copies against these “consensus sequences” gives an accurate account of the frequency spectrum of neutral substitutions. These data have, for example, been used to derive log-odds matrices for comparison of interspersed repeats to a consensus database in the program RepeatMasker as well as for the alignment of genomic sequences of different mammals [33]. We exploit this aspect of interspersed repeats by measuring the strand-specific substitution biases in repeats (CHOWDER) and the changes in PAS strength (PASTA) to predict the presence and orientation of a transcribed region. Second, while decayed interspersed repeats are generally relatively inert, except for promoting homologous recombination, at the time of integration they contain functional transcription regulatory signals that can affect nearby gene transcription, as exemplified by the discovery of oncogenes constitutively expressed from nearby retroviral LTRs [35]. Probably mostly because of transcriptional disruption by their PASs, LINE and LTR elements are underrepresented in the forward orientation of genes [13]. Although the nature of the interaction of other interspersed repeats with genes is less clear, their distribution is nonrandom with respect to the location of genes as well. Notably, the location of lineage-specific (and therefore independently accumulated) SINEs in different mammals is remarkably similar [37]. Thus, the distribution pattern of repeats harbors significant information concerning the location of genes. We have utilized this aspect to infer transcription unit locations by quantifying the abundance of each type of repeat on the forward and reverse strands of genes. The data were stratified by GC level to accommodate the large-scale correlation of repeat densities with isochores.
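The tally of repeat abundance by orientation, stratified by GC level, can be sketched as below. This is an illustration under stated assumptions, not the published procedure: the GC bin edges, the pseudocount, and the input format (one tuple per repeat copy found inside an annotated gene) are all hypothetical, and the log2 ratio of forward to reverse counts is one plausible way to summarize the bias.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# ASSUMPTION: illustrative GC bin edges; the section does not state the strata used.
GC_BIN_EDGES = (0.38, 0.42, 0.46, 0.50)

# One repeat copy: (repeat class, same strand as the gene?, local GC fraction).
RepeatHit = Tuple[str, bool, float]


def gc_bin(gc_fraction: float) -> int:
    """Map a GC fraction to a bin index in 0..len(GC_BIN_EDGES)."""
    return sum(gc_fraction >= edge for edge in GC_BIN_EDGES)


def strand_log_odds(
    repeats: Iterable[RepeatHit], pseudocount: float = 0.5
) -> Dict[Tuple[str, int], float]:
    """Return log2(forward / reverse) counts per (repeat class, GC bin).

    Negative values mean the class is underrepresented in the forward
    (gene) orientation within that GC stratum. The pseudocount
    (hypothetical value) avoids division by zero for sparse strata.
    """
    counts = defaultdict(lambda: [0, 0])  # [forward, reverse]
    for repeat_class, same_strand, gc_fraction in repeats:
        key = (repeat_class, gc_bin(gc_fraction))
        counts[key][0 if same_strand else 1] += 1
    return {
        key: math.log2((fwd + pseudocount) / (rev + pseudocount))
        for key, (fwd, rev) in counts.items()
    }
```

A table of this kind, built from annotated genes, could then be used to score an unannotated region by summing the log-odds of the repeats it contains under each candidate orientation.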
The “transcriptional footprints” described here have some conceptual similarity to “content” methods like coding potential and coding sequence compositional biases. While those are limited to coding exons, the signal of transcriptional footprints can be observed throughout the length of the transcript, the vast majority of which is usually intronic. Furthermore, while the “content” methods detect deviations from the sequence composition expected under a random model, the FEAST methods detect significant strand biases of selected signals, regardless of their absolute frequency. A generalized linear model for transcript detection has been published [39], integrating nucleotide skews and some repeat densities (not strand biases). Sémon and Duret used a 20-kb sliding window approach to identify putatively transcribed regions but found their method to be insufficiently accurate for automatic gene prediction.
The basic model underlying the four FEAST methods assumes the accumulation in introns and UTRs of strand-biased signals arising as side effects of transcription. Several deviations from this model can be postulated: (1) a significant proportion of the genomic region included in a gene might not be transcribed, as is the case for somatically rearranging immune loci; (2) antisense overlapping transcripts may lead to partial signal cancellation; and (3) a gene may have long coding exons, or a large number of exons separated by short introns. Furthermore, the statistical model used assumes independence between the observed signals, which may not be true for (4) arrays of tandemly duplicated sequences or (5) interspersed repeats “homing” into similar repeats, e.g., Alu [40]. Finally, sensitivity may suffer if there is insufficient signal, e.g., for (6) short genes or (7) evolutionarily new transcribed regions, or if the signal was lost by (8) inversions within the introns or other genomic rearrangements. Conversely, (9) decaying pseudogenes derived from genomic duplications may yield spurious signals. Most of these model deviations are expected to lead to false negatives, suggesting that FEAST may be underpredicting the number and/or the extent of genes.
In the current implementation of FEAST, the four algorithms are combined with equal weights, except that GREENS and CHOWDER are weighted according to the repeat fraction. This will be improved in future versions by identifying context-dependent optimal weights for the different algorithms. Since the mutation-based methods refer to germline expression but the selection-based methods reflect functional importance at any developmental stage, additional functional information could be obtained by using different combinations of the four algorithms. Finally, a promising area for future development is joint gene prediction on orthologous regions, by collating biases accrued independently in different species lineages.
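One way to read the weighting scheme described above is sketched below. The section does not give the formula, so this is an assumption throughout: each method is taken to yield a signed per-region score (positive favoring the forward strand), the nucleotide-composition score is weighted by the non-repeat fraction and the repeat-substitution score by the repeat fraction, and the other two scores enter with weight 1. The class, field, and function names and the threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegionScores:
    """Signed per-region scores from the four methods (hypothetical layout)."""
    greens: float           # nucleotide composition skew score
    chowder: float          # substitution bias within repeats
    roast: float            # repeat orientation/abundance score
    pasta: float            # polyadenylation signal score
    repeat_fraction: float  # fraction of the region covered by repeats, 0..1


def combined_score(s: RegionScores) -> float:
    """Combine the four scores.

    ASSUMPTION: weight (1 - repeat_fraction) for the composition score,
    weight repeat_fraction for the repeat-substitution score, and equal
    unit weights for the remaining two scores.
    """
    if not 0.0 <= s.repeat_fraction <= 1.0:
        raise ValueError("repeat_fraction must lie in [0, 1]")
    return (
        (1.0 - s.repeat_fraction) * s.greens
        + s.repeat_fraction * s.chowder
        + s.roast
        + s.pasta
    )


def call_orientation(score: float, threshold: float = 1.0) -> Optional[str]:
    """Return '+' or '-' when |score| reaches the (hypothetical) threshold, else None."""
    if score >= threshold:
        return "+"
    if score <= -threshold:
        return "-"
    return None
```

Context-dependent weights of the kind proposed for future versions would replace the fixed coefficients in `combined_score` with values fitted per GC stratum or per repeat content class.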
Sequence-based gene prediction has long been dominated by methods based on modeling gene structure and sequence comparisons, followed by extensive expert curation. It might have appeared impossible to detect genes from genomic sequence without identifying splicing signals or sequence conservation, and without even relying on the genomic localization of experimentally observed expressed sequences. We have presented here a third basic concept, i.e., the genomic effects of sustained transcription, and four transcript prediction algorithms based on it. It is important to stress that the methods described here differ from conventional gene prediction methods in that they do not lead to detailed prediction of the intron-exon structure of the predicted genes but rather identify the overall extent and orientation of the transcribed regions. To achieve a complete gene model, further analyses are required.
In addition to yielding hypotheses for correcting 5′ incomplete gene annotations and novel independent predictions, many of which cannot be detected by gene structure or similarity, the new algorithms are complementary to existing methods. We therefore expect these tools to add valuable information when integrated with the algorithms based on gene structure and sequence similarity, as a further step toward achieving the sensitivity and specificity required for fully automated whole-genome annotation.