We compared Pairagon's accuracy to that of the 12 other aligners listed in . The focus of our study is alignment, not mapping. Thus, each transcript we tested was aligned to a short region of the genome which contains 10 Kb before and after the coding region for the particular transcript.
Programs and parameters used for evaluation
Generally, the default alignment parameters were chosen. If there was a specific flag for either high-identity alignment or cross-species alignment, those parameters were tested separately, as denoted in the naming scheme with either a + or an X, respectively. Although the aligners could theoretically provide better alignments with more parameter customization, our sense is that even sophisticated users rarely tune alignment parameters. Tests were performed without any prior indexing of the genome sequence as a whole, which some aligners use for speed-up. The aligners were run on a cluster of eight-processor 64 bit Linux boxes with 15.7 GB of memory available.
In testing all the aligners, it became a practical necessity to develop a common format for the alignments. We chose the Verbose Useful Labeled Gapped Alignment Report, or vulgar format, as used by Exonerate (Slater and Birney, 2005
). This allowed us to capture all of the vital features of the alignments, including the locations of the exon-intron boundaries, insertions and deletions. We developed a perl library for converting particular output formats for each aligner to vulgar, and from there to various other commonly used formats, including GTF, GFF and PSL. It can also produce visualizations of the alignments and generate statistics for each alignment.
We gauge the accuracy of the aligners with exon sensitivity (the number of correct exons divided by the number of reference exons) and specificity (the number of correct exons divided by the number of aligned exons). When we refer to accuracy, we are referring to the product of the sensitivity and specificity. We chose this metric to emphasize the importance of discerning exon-intron structure, as opposed to nucleotide sensitivity and specificity, which are typically quite high for all aligners.
The perennial problem with benchmarking aligners is the absence of a gold standard against which to compare. To address this, we aligned artificial cDNA sequences created by splicing the coding regions of the reference genome sequence together, resulting in a cDNA sequence that is 100% identical to the genome (Florea et al.
; Wheelan et al.
). To our surprise, there were considerable differences between the aligner output and the original exon–intron structure even at 100% identity. (Similar tests were performed with full cDNA sequences instead of just the coding sequences, and the results showed no major differences. See Section 5
However, 100% identity is a best-case scenario. To make the tests more realistic, we needed discrepancies between the reference genome sequence and the cDNA sequence. Unless the cDNAs and the genome are obtained from a strain in which all heterozygosity has been bred out, there will always be differences between some cDNAs and the genome. In many cases, the cDNAs and genome sequences come from different individuals. There are also many genome sequences for which the available cDNA sequences are insufficient or non-existent, so cDNAs from related species must be aligned. Whatever the source, the discrepancies reduce the overall identity between the sequences, necessitating that the aligners be tested at a wide range of identity levels. We chose two methods to introduce possible discrepancies. The first was to modify the genome by using a mutation simulator. The second method was to use the cDNA sequences of one species to align to the orthologous genome sequence of another species.
3.1 Experiment 1: simulated fly genomes
We used a detailed genome evolution simulator (developed in-house) that mustates different regions of the genome at different rates (R. Brown and M. Brent, In preparation). This gave us finer control over the cDNA-genome identity than we could obtain by aligning to naturally occurring genome sequences. The adjustable parameter used to control the simulated evolutionary distance between the original and mutated genomes is D, the expected number of substitutions per 4-fold degenerate site in protein coding sequence. Other parameters, which control the relative rates of silent and non-silent mutations, mutations in splice sites and so on, are all estimated from real genome alignments of Drosophila melanogaster to its close relatives and then scaled by D. This model yields a mutated genome that is more biologically realistic than one with more uniformly distributed mutations. Mutating the genome rather than the cDNA makes the alignments more realistic by giving changes in the splice sites a non-zero probability. In this experiment, we mutated the D.melanogaster genome multiple times, varying D between 0 and 0.7. The coding regions of the mutated genome sequences are between 75% and 100% identical to the original sequence. We then aligned 1000 transcripts randomly selected from the reference annotation to the mutated genomes.
The output of the mutation simulator gives us a base by base mapping between the original genome and the mutated genome, allowing us to ascertain the locations of matches, substitutions, insertions and deletions. Using this information, we create an annotation for the mutated genome, which we use to judge the accuracy of the predicted alignments.
We found that aligner accuracy greatly depends on sequence identity, as seen in , which shows the accuracy of the top aligners. (The results of this experiment for all aligners is provided in Section 1
of the Supplementary Material
.) As expected, all aligners decrease in accuracy as the identity level decreases, but the rate at which the accuracy declines varies. The drop off is slower for aligners tuned for cross-species alignment, such as Xat and the aligners with cross-species parameter sets. However, they generally start with lower accuracy than their high-identity counterparts.
Fig. 3. Experiment 1 results—the accuracy of the top aligners, judged by the accuracy in aligning exons (the product of exon sensitivity and specificity), over a range of sequence identities, with the original genome on the left side of the graph. The (more ...)
For alignments involving the unmutated genome, the high-identity parameter set for Pairagon ranks first with 99.6% accuracy, followed closely by PairagonX and Palma, both with 98% accuracy (, inset). Blat, Exalin, ExonerateCG, Sim4+, both versions of Spaln and both versions of Splign all also score above 97% (see Section 1
of the Supplementary Material
). As the identity level drops, Pairagon remains the most accurate aligner. However, the ranking of the other aligners does shift. While Blat, Exalin and Sim4+ were among the top aligners for the unmutated genome, their accuracy falls off quickly, so that Spaln and Splign outperform all of them when the identity drops to 99%. Below 78% identity, the sequences have enough mutations that PairagonX outperforms Pairagon. PairagonX remains the best aligner for the rest of the low-identity trials. The cross-species version of Spaln (SpalnX) is the next most accurate at the lowest identity levels.
3.2 Experiment 2: natural mammalian genomes
For this experiment, we aligned perfect cDNAs (created as in Experiment 1) to the unmutated genomes of other species. We selected 1782 proteins from the Swissprot database that were present in humans, mice and rats. We then found the corresponding gene annotations from the UCSC Genome Browser and extracted cDNA from those annotations as we did in our previous experiment. This allowed us to test the aligners on a set of homologous sequences, which had been subjected to real mutations, and therefore validate the results of the simulator.
First, we ran the 1782 transcripts through our mutation simulator, using the mouse genome as the base. This showed similar trends to the fly data, with Pairagon achieving the highest accuracy levels for high-identity alignments, and PairagonX achieving best for low-identity alignments. Spaln, SpalnX and Exalin were the next most accurate (Section 2
in Supplementary Material
Then we aligned the mouse cDNAs to the human and rat genomes (). For the mouse-to-mouse alignments (100% identity), Pairagon performed best, followed by PairagonX, Spaln and Splign. For the mouse-to-rat alignments, PairagonX performed best, followed by Pairagon, Splign and Spaln. Finally, for the mouse-to-human alignments PairagonX aligns best, followed by SpalnX and SplignX. Pairagon's relatively lower performance in the mouse-to-human alignments results from the evolutionary distance between mouse and human generating more substitutions and indels than Pairagon's same species model expects.
Cross species ranks —this table shows the results of aligning the cDNAs of mouse, rat and humans to the mouse genome