RefSeq and MGC cDNA sequences mapped to the ENCODE regions were downloaded from the UCSC Genome Browser and alignments were generated using the Stepping Stone implementation of Pairagon v0.5 as described in Materials and methods. GenBank's coding sequence (CDS) annotations of these cDNA sequences were used to produce 451 aligned transcripts annotated with GenBank ORFs (141 from MGC sequences and 310 from RefSeq sequences). Merging identical gene structures and removing inconsistent structures (for example, gap in the coding region leading to a frame shift in the genome) yielded 413 unique gene structures. N-SCAN_EST predictions were generated as described in Materials and methods. The 94 N-SCAN_EST predictions that did not overlap the 413 Pairagon gene structures were added to the gene set. We obtained seven gene structures by aligning sequences from our RT-PCR experiments. Two of these did not overlap the existing set and were included in our submission to the 'any evidence' category. We do not discuss this set in detail because it is almost identical to the submission to the 'mRNA/EST evidence' category. The accuracy statistics for this set can be found in the EGASP assessment report [14
The official assessment of Pairagon+N-SCAN_EST shows high accuracy
Table compares the coding region prediction accuracy measures of three submissions to the EGASP 'mRNA/EST evidence' category at the gene, transcript, exon and nucleotide levels. Pairagon+N-SCAN_EST (Pairagon+N) is optimized for high accuracy in predicting exact exons and transcripts, so we will focus our analysis on those columns of Table . By both measures, ExoGean is the most sensitive of the three programs and Pairagon+N is the most specific; ENSEMBL is intermediate except in exact exon specificity, where it falls below the other two. None of the programs completely dominates any other, although one might argue that Pairagon+N has a slight edge, since the margin by which its specificity exceeds that of the second best program is substantially larger than the margin by which its sensitivity falls below the others. In absolute numbers, our pipeline identifies almost the same number of correct Gencode transcript structures as ENSEMBL (255 versus 258, respectively), and 21 fewer than ExoGean, but we have many fewer incorrect transcripts (149 versus 205 from ENSEMBL and 237 from ExoGean). Their gene accuracy measures are slightly better than ours because ENSEMBL and ExoGean predict more transcripts per gene locus on average. Predicting more transcripts at a locus increases the chance that at least one of them is correct, yielding a true positive by the gene measure, while no penalty in false positives is incurred for the additional incorrect transcripts. This is arguably a flaw in the gene level measure when applied to systems that can predict more than one transcript per locus.
Prediction accuracy measures of mRNA/EST evidence based gene prediction methods
Pairagon's cDNA alignments are highly accurate
The individual accuracies of Pairagon and N-SCAN_EST gene structures in the submission are given in Table . Pairagon's nucleotide and exon specificities are 98.8% and 96.1%, respectively. Pairagon is also very accurate in identifying splice sites - we estimated that 98.3% of the introns that Pairagon identified have supporting evidence in the Gencode reference genes. When there is high quality mRNA evidence, more than three-fourths of transcript structures predicted by Pairagon are correct.
Individual prediction accuracies of Pairagon alignments and N-SCAN_EST predictions in the submission
Identifying the correct splice boundaries is the crucial step in cDNA-to-genome alignment, and here Pairagon proves to be extremely accurate. Out of the 1,834 introns Pairagon predicted (both within and outside coding regions), only 22 introns from 15 transcript structures were not supported by HAVANA annotation. Three of them (from a single transcript) matched the introns of a Gencode gene labeled 'putative' and eight of them were a result of using incorrect seed exons from BLASTN (discussed in detail below). The remaining 11 were from Refseq cDNAs that have no evidence in HAVANA annotation. Two of the eleven aligned to the reference genome with numerous mismatches.
There are 22 unique GC-AG introns in the protein coding part of the HAVANA annotation. Pairagon correctly identifies 12 of these. The remaining 10 are missed because they did not have supporting Refseq or MGC cDNA sequence. When other systems prefer a GT dinucleotide, especially if it occurs close to the actual GC donor site, Pairagon gets the GC splice boundaries correct. Figure shows one such example where ENSEMBL, Augustus and ExonHunter choose an incorrect GT donor site that is four nucleotides downstream of the correct GC donor, which Pairagon chooses. There are two unique AT-AC splice sites in the annotation and Pairagon correctly identifies both of them. Among the methods that use mRNA/EST evidence, AceView identifies the two introns and ENSEMBL identifies one of them. There are also two AT-AG introns with one supporting Gencode annotation each, and only AceView predicts them. Pairagon's splice boundary model prevents it from identifying these introns.
Figure 3 An annotated GC donor site that ENSEMBL misses. There is a GT dinucleotide four nucleotides downstream of the GC donor site (both dinucleotides are marked brown in the sequence). Pairagon identifies the correct donor site. (Screen shot obtained from UCSC (more ...)
In the Stepping Stone implementation of Pairagon, the accuracy of the final alignment depends on how well the seed exons are mapped in the genome (see Materials and methods and Figure for details). Figure shows an example where the first 112 bases (forming an exon) of the cDNA can be mapped to either of two tandem duplicates that are identical in those 112 bases. Because we chose to use BLASTN parameter topComboN = 1, which does not return alignments of a query segment to more than one location in the genome, BLASTN aligned the exon arbitrarily to the locus farther from the rest of the alignment. As a result, Pairagon placed the exon in the same general region, while the annotation maps it to the nearer locus. One possible way to address this problem would be to follow Zhang and Gish [18
], who report using topComboN = 4 to generate multiple combinations of high-scoring segment pairs (HSPs) as seed alignments for their cDNA-to-genome alignment program, EXALIN. We can then superimpose the search subspaces obtained from the possible HSP combinations. Using this approach, Pairagon would choose the correct alignment for the example in Figure because, all other things being equal, it favors shorter introns over longer ones.
Figure 4 Generating the search subspace given three high-scoring segment pairs (HSPs) in the Stepping Stone algorithm. The three diagonal lines represent the three HSPs. The stars represent alignment pins. The lighter blue areas represent the search subspaces (more ...)
Figure 5 An incorrect alignment from Pairagon. The seed alignment from BLASTN aligned the 112-base exon at a location about 30 kb upstream (arrow in Pairagon gene prediction) instead of the annotated location (arrows in Gencode reference genes). Both alignments (more ...)
Pairagon's accuracy has improved since the official evaluation
Since the EGASP assessment, we have made several improvements to both Pairagon's probability model and its implementation. We have retrained Pairagon using its own alignments of 20,594 MGC cDNA sequences to 21,249 loci on the human genome. Several bug-fixes and optimizations have resulted in a faster and more robust program with lower memory requirements. Table lists the accuracy measures of the current version of Pairagon (v0.95) when aligning the same cDNA sequences used for the assessment. Pairagon v0.95 shows improvement in all accuracy measures. It now identifies 22 more correct Gencode transcripts and 162 more correct exons with a small improvement in specificity as well. Thus, the accuracy of our pipeline using Pairagon v0.95 is substantially better than that of the version submitted for the assessment, which was already as good as, or slightly better than, that of the other entrants. Of course, other systems have likely improved as a result of this exercise, too.
Prediction accuracies of improved Pairagon alignments and Pairagon+N-SCAN_EST gene structures
A lack of biological evidence raises questions about ORF annotation
Identifying the coding region in (even) a full-length mRNA is an extremely difficult problem. NCBI and HAVANA do not always agree in their CDS annotations of mRNA sequences, even if they agree on the exon-intron structures. Because we relied on the CDS annotations from NCBI, a few of our gene predictions are incorrect according to HAVANA, although the underlying alignment is correct. For example, GenBank's annotated translation start sites for cDNA sequences BC001940 and NM_001004759.1 are 798 bases downstream and 81 bases upstream of HAVANA's annotated translation start sites in Gencode genes AC005538.1-001 and AC011711.3-001, respectively. A few more of our ORF predictions obtained from correct alignments are labeled incorrect because HAVANA has not made any CDS annotations on the exon-intron structures yet. For example, the exon-intron structure of our gene NM_181879.1 from aligning a reviewed RefSeq mRNA NM_181879.1 matches that of Gencode reference gene AC008984.1-003, which does not have a CDS annotation. Since the biological evidence supporting the GenBank ORF annotations, if any, is not available for evaluation, we might do better by using a modified version of N-SCAN to predict ORFs on aligned cDNA sequences.
N-SCAN_EST performs well on complete GENCODE test regions
After the release of the HAVANA annotations, we found that N-SCAN_EST predictions used to fill the gaps between Pairagon alignments had a very high proportion of incorrect genes - the gene/transcript specificity of the original N-SCAN_EST predictions was 8.5% in regions that did not overlap Pairagon alignments (gene and transcript specificity are the same for programs that predict only one transcript per locus). However, this is due largely to the fact that there are high quality cDNA sequences covering most of the real genes in the ENCODE regions. When these are not used and N-SCAN_EST's predictions on the complete GENCODE test regions are evaluated, their specificity is 38.7% (Table ).
In the ENCODE regions, the accuracy of N-SCAN_EST is due in large part to the accuracy of N-SCAN itself (this may not hold in less gene-dense regions). Table compares the five submissions to the Dual or Multiple Genome category of EGASP that score the highest on exons, transcripts, and genes. N-SCAN scores the highest in all categories except for nucleotide sensitivity. In terms of exon specificity, N-SCAN is 4.8% better than the next best system (Dogfish) and in transcript specificity 18% better than the next best system (Augustus-dual). For transcript and exon sensitivity, N-SCAN is 4.7% and 4.6% better, respectively, than any other system except TWINSCAN-MARS. N-SCAN outperforms TWINSCAN-MARS by about 1% transcript sensitivity and 2% exon sensitivity. TWINSCAN-MARS has relatively high sensitivity in part because it predicts several transcripts per gene, for which it pays a price in specificity. Even with the hit it takes in specificity, TWINSCAN-MARS is among the top three performers, especially at the transcript level. This may be explained, in part, by the fact that N-SCAN and TWINSCAN-MARS share nearly identical models for DNA sequence [16
], although their conservation models are quite different.
Prediction accuracy measures of multiple-genome based gene prediction methods
N-SCAN's ability to explicitly model untranslated regions (UTRs) [12
] facilitates the distinction between coding and non-coding exons. Figure illustrates this advantage of N-SCAN when compared to other dual- or multiple-genome gene predictors on Gencode reference gene AC009404.6. Only the N-SCAN prediction agrees with the Gencode reference gene; N-SCAN's ability to model 5' UTR content is the key. The 168 base-pair (bp) region upstream of the annotated start codon lies within a 1,012 bp CpG island (annotated on the UCSC Genome Browser CpG-island track). The 67% G+C content of this 168 bp region is very high compared to typical intronic and intergenic regions and even high compared to most exonic regions. However, this is not unusual for a region of this size within a CpG island. Without explicit 5' UTR-content modeling, however, it is more likely to be predicted as a coding region rather than as a 5' UTR, intronic, or intergenic region. For example, Augustus + Mouse Homology and TWINSCAN-MARS annotate this region as coding. N-SCAN's modeling of DNA content and conserved sequence for 5' UTR regions facilitates the correct categorization of this region.
Figure 6 Initial exon of a gene where N-SCAN correctly discriminates coding region from the 5' UTR. Other gene prediction systems predict longer coding regions due to the high G+C content of the region. (Screen shot obtained from UCSC Genome Browser web site .) (more ...)
When the genome sequence and conservation do not provide sufficient information about the coding potential of a gene locus, EST evidence can be very useful in gene prediction. Figure shows a gene where N-SCAN_EST predicts three out of four exons correctly while both ENSEMBL and N-SCAN do not predict any gene in the region. In fact, N-SCAN_EST is one of only two gene predictors that predict any gene in this locus. There are high quality EST alignments supporting this gene, such as BX116511 with a 100% identical alignment of 583 bases, which aid N-SCAN_EST in predicting this gene even though the conservation rate of the coding regions is low. This low conservation may explain why N-SCAN failed to predict a gene; likewise, the extremely low genomic conservation in Exon 3 may explain why even N-SCAN_EST missed this exon.
Figure 7 A gene where N-SCAN_EST predicts three out of the four exons right. All other programs except AceView do not predict anything in that locus. N-SCAN_EST missed an exon even though there is EST evidence for it. We believe that lack of conservation overwhelmed (more ...)