All EST and assembly data were publicly available and downloaded from the Broad Institute [58
], GenBank, J Craig Venter Institute/The Institute for Genomic Research [59
], Joint Genome Institute [60
] or VectorBase [61
] (Tables and ). We used BLAT [62
] version 33 to align the ESTs to genomic sequence using the following parameters: --minIdentity = 95 --minScore = 50 --queryType = rna. We set the --maxIntron parameter to the longest annotated intron in each species.
We filtered the resulting alignments using the following criteria: each alignment must contain at least one canonical splice site (GT:AG, CG:AG, AT:AC); must have no non-canonical splice sites; must have ≥ 95% nucleotide identity; must not have more than nine consecutive insertions; and must not have more than nine consecutive deletions outside of an intron. We also required three or more exact matches at every intron-exon boundary [12
]. If a single EST aligned to more than one location on the genome, we only considered the alignment with the highest score. To guard against redundant input data, if multiple alignments in the same locus had the same sequence after trimming, we disregarded all but one of them.
We discarded all unspliced alignments to prevent labeling pre-spliced transcripts as splice variants. Unspliced ESTs show no evidence of having been processed by the spliceosome, and may represent pre-spliced transcript fragments. It is difficult to distinguish between an unspliced EST that has been processed and an unspliced EST that has not, and thus we cannot report how many processed ESTs we discarded. However, if most unspliced ESTs have already been processed by the spliceosome, we would expect to find more of them in organisms with very few introns and/or very long exons. We found no such correlation in either case (Additional data file 5). We conclude that a substantial fraction of the unspliced ESTs represent pre-spliced ESTs, and, therefore, that unspliced ESTs are not a reliable indicator of splice variation.
The number of RIs changed substantially depending on whether we included or excluded unspliced ESTs, while the numbers of CEs and competing 3' and 5' splice sites changed very little. We believe that previous reports that do not exclude unspliced ESTs overestimate the frequency of intron retention.
After aligning the ESTs and filtering them, we built transcripts and transcript fragments using CallReferenceGenes, an unpublished tool used in the Broad Institute's genome annotation pipeline since 2005. Several previous papers provide an in-depth discussion of the problem [12
]. We briefly sketch our algorithm here. The source code for CallReferenceGenes, as well as the code that labels alternative splice forms, is included in Additional data file 6.
First, we partition the alignments into clusters, so that every alignment has exon-exon overlap with at least one other alignment in the cluster, and no alignment has exon-exon overlap with any alignment outside its cluster. Alignments that overlap no other alignments are ignored.
Next, for each cluster, we compare every alignment to every other alignment that overlaps it. (Here, as opposed to the previous step, we also consider exon-intron overlaps.) Relationships are directional, and are given one of three labels: 'conflicts', 'extended-by', or 'includes'. Two alignments conflict if any base in the region of overlap is exonic in one alignment and intronic in the other. If they do not conflict, one alignment extends another if it ends after the other and does not begin before it. Lastly, one alignment includes another if it does not conflict with it, starts before it, and ends after it.
At this stage the cluster is represented by a graph of nodes (representing alignments) and labeled edges (representing their relationships). To turn this into a more tree-like representation that is easier to traverse, we build an ordered list of alignments. We sort them by 3' coordinate, in ascending order. In the case where we have two alignments with identical 3' coordinates, we sort by 5' coordinate, again in ascending order. In this way, no alignment is extended by any element that sorts before it. An alignment usually, though not always, sorts after the alignments it includes.
To improve performance, we prune redundant relationships from the graph. We apply a series of heuristics to reduce the number of paths from an alignment to its descendants. These 'trimming rules' include: first, if A is extended by B, and C extends both A and B, and there is no element D such that B conflicts with D and C does not, then any path containing A and B will also contain C. The extended-by relationship from A to C is therefore redundant. Second, if A is included by B, and there is no element D such that B conflicts with D and A does not, then any path containing A will also include B. All extended-by relationships terminating in A are redundant. Third, if A is extended by B and both are included by C, and every element D that conflicts with C also conflicts with either A or B, then any path containing both A and B will also contain C. Thus, the extended-by relationship from A to B is redundant.
We traverse the list bottom-up so that every element is pruned before any element it extends. There will be multiple paths from a parent to a given child if a splice variation lies between them. If no splice variation occurs between them, generally, there will be one path, but the above rules are not exhaustive and some duplicate paths may remain after pruning. As we traverse the list and prune it, we track, per alignment, all alignments that can be reached through it (for example, all its extenders and includees, as well as their extenders and includees, and so on). We call these sets of alignments the 'descendants' of each alignment. (Note that pruning never reduces the membership of these sets, only the number of edges between them.)
After ordering and pruning, we can traverse the list of alignments left-to-right, following 'extended-by' and 'includes' links to build paths linking splice-compatible alignments. To do this, we do the following. First, find starts - a start is any element that extends no other element and has at least one descendant that cannot be reached through any previously discovered start. Second, walk the tree - treating each start, in turn, as a root of a subtree, traverse extension and inclusion links as edges in a graph; each unique path represents (a fragment of) a transcript. Third, remove sub-paths - if all the alignments in one path are present in another, we discard the one containing fewer alignments. Fourth, overlapping paths with distinct alignments represent splice variants of the same gene.
Given two overlapping transcripts A and B, all bases in the region of overlap will be in one of four states: I, A and B are exonic; II, A is exonic; III, B is exonic; IV, neither is exonic. We group all adjacent states into a column with a single label, then use a sliding-window approach, across three such columns, to compare overlapping transcripts. CEs, RIs, and competing 5' and 3' splice sites all have a different signature appearance. CEs appear as IV:II:IV and IV:III:IV; RIs appear as I:II:I and I:III:I; competing 5' and 3' splice sites appear as I:II:IV, I:III:IV, IV:II:I, and IV:III:I.
We filter the set of splice variants by requiring that every base in the variant region be exonic in at least one alignment and intronic in at least one alignment. Because EST data are fragmentary we cannot be sure that initial or terminal exons are complete. To ensure accurate labeling as well as reliable length statistics, we require that any exon in the alternatively spliced region not be an initial or terminal exon. Finally, we require that every alignment spanning the region conform to either the major or minor variant. The reported length of each splice variant is simply the width of the center column in the windows described above.
We also generated a list of constitutive introns and exons as a control. These are introns and exons predicted from loci where no ESTs conflict. Furthermore, every base within these constitutive elements must have ten or more ESTs supporting it. Note that there is no way to tell for sure that any exon or intron never exhibits splice variation; this method simply identifies loci where splice variation has never been seen, and if present, is presumably rare.
Testing retained introns for evidence of selection at the codon level
We used a comparative approach to test retained introns for purifying selection at the codon level using two groups of fungi, where genome sequences at suitable evolutionary distances were available. We examined all four sequenced serotypes of C. neoformans
(JEC21, H99, R265 and WM276) in one group, while the second group consisted of C. immitis
and the C735 strain of Coccidioides posadasii
. Orthology of genes was determined using a reciprocal-best-BLAST criterion, while the alignment of orthologs was performed using ClustalW [64
]. Alignments of retained orthologous introns were concatenated to enhance power for detecting selection. Prior to concatenation, splice donor and acceptor sites were removed, and the 5' and 3' ends of each intron alignment were padded with gaps in order to preserve the native reading frames of introns. We used a 100 bp cutoff for retained introns in C. neoformans
and 150 bp in C. immitis
. Since 95% of the introns we identified in each organism were less than these cutoffs, we used a cutoff to eliminate unusually long introns. We analyzed 477 and 389 orthologous intron alignments in C. neoformans
and C. immitis
, respectively. The dN/dS ratio of concatenated intron alignments was calculated using the codeml program (model M0) in the PAML 3.15 software package [65
]. The probability that the resulting dN/dS ratios were significantly less than one, indicating purifying selection at the codon level, was calculated using a bootstrapping approach with a set of control introns. We identified 672 and 662 control introns in C. neoformans
and C. immitis
, respectively. We randomly resampled, with replacement, from the control introns to create 1,000 concatenated control alignments approximately equal in length to the concatenated retained intron alignments in each taxonomic group. Then, we used codeml to calculate the dN/dS ratio exhibited by each resampled control alignment to determine the probability of observing dN/dS ratios as low as or lower than those exhibited by the retained introns in each taxonomic group, under a null hypothesis that they are non-coding.