The unfolding of the instructions encoded in the DNA sequence is initiated by the transcription of DNA to RNA, and the subsequent processing of the primary transcript to functional RNA sequences. According to the central dogma, most of these processed RNAs correspond to mRNAs that are eventually translated to proteins. Although the identification of the protein-coding mRNAs (or genes) is essential for our understanding of how the genome sequence translates into biological phenomena, uncertainty remains with respect to the set of human genes. The lack of an accurate and complete gene catalogue undermines the impact of the genome sequence on human biology and biomedical research. Experimental determination of expressed mRNA sequences and computational mapping of these sequences onto the sequence of the genome constitute the most reliable approach to identify the exonic structure, and chromosomal location, of protein-coding genes. However, this approach has limitations. First, it is unclear what fraction of transcripts with low or highly specific expression can be effectively sequenced, and high throughput mRNA sequencing often yields only partial sequences. Second, computational mapping of mRNA to genomic sequences is not trivial, and it is complicated by fragmentary mRNA sequences, sequencing errors, sequence polymorphism, and the highly repetitive nature of the human genome. Moreover, the high pseudogene content of the human genome, and the presence of small exons, lead to uncertain or incorrect mapping of exon boundaries. Therefore, substantial manual intervention is required to delineate an accurate protein coding gene map from the available mRNA sequence data.
We organized EGASP as a community experiment with the goal of assessing the ability of computational methods to automatically reproduce the accurate protein-coding gene map produced by a team of expert human curators. Such a map [33], subsequently verified experimentally, has been obtained for only 1% of the human genome selected by the ENCODE project [30]. Scaling the map to the entire human genome will require substantial additional resources, and it will enormously benefit from improved computational strategies for gene finding. With its focus on this 1% of the human genome, EGASP has indeed demonstrated progress in the performance of newly developed computational gene finding pipelines, with accuracies of about 80% at the coding exon level for both sensitivity and specificity, and of nearly 90% at the coding nucleotide level (Table ). However, the success of these metrics is significantly tempered by the relatively low numbers of coding transcripts that are predicted correctly. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotation. This is not unexpected, and, to some extent, circular, since the manually curated annotation relies on mRNA and protein sequences as well. Notably, however, programs based on sequence comparisons across two or more genomes - which do not use information from known mRNA or protein sequences - also exhibited impressive accuracy at the nucleotide and exon levels (Table ). Dual genome prediction programs, however, were significantly less accurate at finding complete genes than the expressed sequence based methods. Finally, with few exceptions, all of the methods struggled to predict correctly the non-coding exons of transcripts. Indeed, UTRs are often predicted as mere extensions of first and terminal exons, if predicted at all. Thus, while the computational methods are quite reliable in predicting the protein coding components of transcripts, they have difficulties in linking them into transcript structures. Indeed, the most accurate programs were only able to correctly predict about 40% of the annotated transcripts, meaning the correct prediction of all of the exons constituting a transcript (Table ). The results of coding gene predictions were more encouraging. For up to 80% of human genes the exact structure of the coding part, including all the splice junctions and start/stop codons, could be predicted correctly in at least one transcript.
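The gap between exon-level and transcript-level accuracy described above can be illustrated with a minimal sketch. It assumes the convention common in gene-finding evaluations, in which sensitivity is the fraction of annotated features that are predicted exactly and specificity is the fraction of predicted features that match the annotation exactly. The tuple representation of an exon and the function names are illustrative assumptions, not the evaluation code used in EGASP.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical exon representation: (chromosome, start, end, strand).
Exon = Tuple[str, int, int, str]
# A transcript is modelled here as the set of exons that constitute it.
Transcript = FrozenSet[Exon]


def exon_accuracy(annotated: Iterable[Exon],
                  predicted: Iterable[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    An exon counts as correct only if both of its boundaries match exactly.
    Specificity is used in the gene-finding sense (assumed here):
    correct predictions divided by all predictions.
    """
    annotated_set, predicted_set = set(annotated), set(predicted)
    correct = annotated_set & predicted_set
    sensitivity = len(correct) / len(annotated_set) if annotated_set else 0.0
    specificity = len(correct) / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


def transcript_sensitivity(annotated: Iterable[Transcript],
                           predicted: Iterable[Transcript]) -> float:
    """Fraction of annotated transcripts whose full exon set is predicted.

    A transcript is correct only if every one of its exons is predicted
    exactly and no extra exon is attached to it.
    """
    annotated_set, predicted_set = set(annotated), set(predicted)
    if not annotated_set:
        return 0.0
    return len(annotated_set & predicted_set) / len(annotated_set)
```

Under this definition a single wrong exon boundary lowers exon-level accuracy only slightly but invalidates the whole transcript, which is one way to see why transcript-level figures can be much lower than exon-level ones.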
Contributing to the difficulty is the unexpected complexity of the protein coding loci in higher eukaryotic genomes. Indeed, as revealed in the GENCODE annotation, most protein coding loci appear to encode a mixture of coding and non-coding transcripts, sharing part of their sequence. Additional transcriptional activity, including chimeric, overlapping and antisense transcripts, transcripts within introns, and other transcriptional phenomena, appears to be less exceptional than had been previously suspected. Thus, the model of a eukaryotic gene currently implicit in most computational methods is too simple to capture this complexity, leading to relatively poor prediction performance.
The second goal of EGASP was to assess the completeness of the manual/computational/experimental GENCODE annotation. This annotation is based on available evidence, and thus may miss some protein coding genes and exons. Indeed, in EGASP, computational methods predict many exons and transcripts that are not included in the GENCODE annotation (Table ), a trend accentuated in ab initio and comparative gene finders, which do not rely on available evidence from transcript sequences. While we were not able to confirm experimentally the bulk of these predictions and they are likely to be false positives, some might be real.
To assess what fraction of the predicted exons unannotated in GENCODE could correspond to novel genes, we prioritized - based on the reliability of the programs predicting them - a subset of intergenic predicted exon pairs, and attempted to experimentally verify them by RT-PCR in 24 human tissues. Only 3.2% of these pairs tested positive, a result consistent with most of the computational predictions outside of GENCODE being false positives. All verified cases tested positive in only one tissue among the 24 tested, emphasizing the extremely restricted expression patterns of these novel, unannotated exons. Since many more tissues and cell lines exist, it cannot be ruled out that some other predictions could also be positive in other tissues. Support for a larger fraction of predictions corresponding to real exons comes from the observation that 13% of these predictions overlap sites of transcription (or TARs/transfrags) as detected by genome tiling experiments. Interestingly, the success rate of RT-PCR was much higher (at least 40%) for those few tested exon pairs that both overlapped TARs/transfrags and were detected in the same cell line and condition. Thus, consistent TAR/transfrag support is strongly indicative of an underlying transcript, including exons predicted to be connected. In total, about 100 unannotated predicted exons in EGASP are consistently supported by TARs/transfrags, and are, therefore, likely to belong to transcribed RNAs. In summary, a non-negligible fraction of unannotated exons predicted in EGASP have some evidence of transcription (not necessarily associated with protein coding), but only a small fraction of the predicted structures connecting exons could be verified experimentally here.
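The two kinds of TAR/transfrag support discussed above (any overlap for a single exon, and consistent support for an exon pair within the same cell line and condition) can be sketched as follows. This is only an illustration under stated assumptions: the `Interval` and `Transfrag` types, the `sample` label standing for a cell line and condition, and the use of closed coordinates are all hypothetical and do not reproduce the procedure actually used in EGASP.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive (assumed coordinate convention)
    end: int    # inclusive


class Transfrag(NamedTuple):
    interval: Interval
    sample: str  # hypothetical label for cell line plus condition


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def supporting_samples(exon: Interval,
                       transfrags: Iterable[Transfrag]) -> Set[str]:
    """Samples in which at least one transfrag overlaps the exon."""
    return {t.sample for t in transfrags if overlaps(exon, t.interval)}


def fraction_supported(exons: Iterable[Interval],
                       transfrags: List[Transfrag]) -> float:
    """Fraction of exons overlapped by a transfrag from any sample."""
    exon_list = list(exons)
    if not exon_list:
        return 0.0
    supported = sum(1 for e in exon_list if supporting_samples(e, transfrags))
    return supported / len(exon_list)


def consistently_supported(pair: Tuple[Interval, Interval],
                           transfrags: List[Transfrag]) -> bool:
    """True if both exons of a pair are overlapped in a shared sample."""
    first, second = pair
    shared = (supporting_samples(first, transfrags)
              & supporting_samples(second, transfrags))
    return bool(shared)
```

The linear scan over transfrags keeps the sketch short; for genome-scale data an interval index would be the more practical choice.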
In this regard, the EGASP experiment seems to indicate that the GENCODE annotation of protein coding genes is quite complete, although it is still unclear what fraction of all the alternative transcript diversity of gene loci is captured by GENCODE. EGASP was also useful in helping to identify the software tools that can help to reduce the amount of human intervention required to delineate the GENCODE annotation. Programs accelerating and improving the mapping of cDNA sequences (partial or complete) onto the genome sequence could be particularly useful towards that end.
Overall, we believe that the EGASP project has given a fair assessment of the state of the art of gene prediction in human DNA. This will allow biologists to interpret better the annotations presented to them in public genome databases such as GenBank, the UCSC browser, ENSEMBL and others. It has also clearly shown that we are still far from being able to computationally predict human gene structures with total accuracy from the DNA sequence alone. Furthermore, while we believe the experiment has shown that only very few protein-coding human genes seem to be missing from the annotations, the exact protein sequences are annotated for only a little over 50% of the sequences. Getting a complete protein sequence correct is also made difficult by the existence of many splice forms, mis-assembled cDNAs and additional contamination in cDNA/EST sequences in the public databases. Each of these can lead to spurious protein sequence annotations. Unfortunately, there are very few processes in place to remove erroneous sequences and annotations from the public databases, so it will still take some time to get a better picture of exact gene structures. It has to be noted that the human genome and its annotation for protein coding genes are still works in progress.
Another class of genes, those producing non-protein-coding transcripts such as miRNAs and snoRNAs, was not generally considered by EGASP and is thought to be especially difficult to predict. Nevertheless, these genes seem to play a very important role in physiological processes such as development and disease.
One of the most difficult problems in gene prediction accuracy assessment is the definition of a reference set against which to evaluate. Ultimately, this reference set should be 'unknown' to the prediction teams. In EGASP, the delayed publication of the GENCODE annotations partially achieved this goal, although a significant amount of the annotation information was known from cDNA and EST sequences previously submitted to public databases such as ENSEMBL or GenBank. This is slightly different to GASP1 [27], where novel cDNA sequences had been withheld before the experiment. Additionally, it may be optimal if each group used the same auxiliary data for their predictions. One suggestion would be to 'freeze' databases of auxiliary data and allow predictions to draw only on these frozen databases, so that progress in these assessment experiments can be measured independently of growing experimental data.
Furthermore, while our assessments have started to evaluate gene annotations on the transcript level, better and additional methods for evaluating UTRs are needed. One suggestion would be to evaluate the transcript performance at the intron level (similar to the exon evaluation above). This measure would exclude the beginning and end of a gene, two coordinates that are considered the most difficult to obtain experimentally, but would include non-coding introns that are determined by their splice sites.
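A minimal sketch of the suggested intron-level evaluation is given below. It derives introns from the gaps between consecutive exons of each transcript, so the start of the first exon and the end of the last exon never enter the comparison. The transcript representation, the closed coordinates, and the use of specificity in the gene-finding sense (correct predictions over all predictions) are assumptions made for illustration only.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical representations with closed (inclusive) coordinates.
Exon = Tuple[int, int]                     # (start, end)
Transcript = Tuple[str, str, List[Exon]]   # (chromosome, strand, exons)
Intron = Tuple[str, str, int, int]         # (chromosome, strand, start, end)


def introns_of(transcript: Transcript) -> Set[Intron]:
    """Introns implied by a transcript: the gaps between adjacent exons.

    Transcript start and end coordinates are deliberately ignored; only
    the splice-site positions flanking each intron are kept.
    """
    chrom, strand, exons = transcript
    ordered = sorted(exons)
    return {
        (chrom, strand, left[1] + 1, right[0] - 1)
        for left, right in zip(ordered, ordered[1:])
    }


def intron_accuracy(annotated: Iterable[Transcript],
                    predicted: Iterable[Transcript]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) over the sets of distinct introns."""
    annotated_introns: Set[Intron] = set()
    for t in annotated:
        annotated_introns |= introns_of(t)
    predicted_introns: Set[Intron] = set()
    for t in predicted:
        predicted_introns |= introns_of(t)
    correct = annotated_introns & predicted_introns
    sensitivity = (len(correct) / len(annotated_introns)
                   if annotated_introns else 0.0)
    specificity = (len(correct) / len(predicted_introns)
                   if predicted_introns else 0.0)
    return sensitivity, specificity
```

With this measure, a prediction that misplaces only the transcript start or end still scores as fully correct, whereas an exact exon-level comparison would penalize its first and terminal exons.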
One of the major benefits of this kind of experiment is that it allows prediction teams to measure their programs and methods against each other, to learn from their failures, and, as a community, to identify the open and difficult questions in this area of research.