The idea driving the collection of data was to generate a large, reliable dataset that would allow statistical analyses. For this reason, sequences from RefSeq (35
) were used instead of ESTs and cDNAs and other sources of noisy data. The RefSeq collection contains sequences with different levels of confidence (including categories, such as reviewed, provisional, predicted etc.).
RefSeq sequence data from NCBI (from June 2005) for five reference genomes was collected, including the primate H.sapiens, the rodent M.musculus, the arthropode D.melanogaster, the nematode C.elegans and the land plant A.thaliana. If a splice site occurred more than once in different alternatively spliced isoforms, it was considered only once in the database, as part of one of the variants, in order to avoid counting splice sites multiple times.
The flanking sequences at both exon–intron junctions were extracted, 38 nt long at the 5′ss (8 nt within the exon and 30 nt within the intron), and 43 nt long at the 3′ss (35 nt within the intron and 8 nt within the exon). The splice sites were identified using annotations of the genome from GenBank. C.elegans
3′ss that are trans
-spliced to spliced leader (SL) RNAs (36
) were not included in the analysis. The data, along with lengths of introns and exons, are stored in a relational database.
Position weight matrices (PWMs), information content and phylogeny
Sequences at the 5′ and 3′ junctions for each intron were separately aligned using the intron–exon junction as the anchor. The alignment was used to calculate the frequencies of nucleotides at each position. This results in a PWM that has four rows (one for each of A, C, G and T) and the number of columns is equal to the length of the splice-site motif. We chose a 9 nt long motif for the 5′ss (3 in the exon and 6 in the intron) and a 14 nt long motif for the 3′ss (13 in the intron and 1 in the exon). The consensus sequence for a PWM is constructed by choosing the nucleotide with the highest frequency of occurrence at each position.
A log-odds (LOD) matrix was constructed from the PWM by the usual method of taking a logarithm to base 2 of the ratio of frequencies at each position of the matrix against a background frequency of 0.25 for each nucleotide. Using a different background will not change the results, but will change the scores and thresholds. A position on a sequence was scored by aligning it with the matrix and adding up the entries in the matrix that correspond to the nucleotides at each position on the sequence.
Each position on the splice site has an associated information content that can be derived from the frequencies of the 4 nt, using information theory (37
). For a random variable that can take on four values with four probabilities, pi
, the information content is
. Information content can be used to study the variability (or level of conservation) at each position, varying from 0 (one possibility is certain) to 2 (all possibilities are likely equal). Some papers use
as a measure of conservation, which just inverts the scale (39
A distance, d
, can be derived between two PWMs, p
, of equal length, using a measure called the Kullback–Leibler distance (40
), a symmetrized form of which is defined as
is the index of position on the PWM. This distance was used to derive phylogenetic trees for the 5′ and 3′ss separately, using the program Phylip (41
Scoring a putative splice site
Given a PWM of a given splice-site type, a log-odds scoring matrix was constructed to score the splice sites. A pseudocount of 0.0001 was added to avoid logarithms of zero. The small pseudocount ensures a strong penalty for mismatches at the exon–intron junctions.
The scores obtained by each LOD score matrix were rescaled to fall in the 0–100 range. All positive scores, from 0 to the maximum, were rescaled to lie between 50 and 100, and all negative scores, from the minimum score to 0, were rescaled to lie between 0 and 50. A score close to 50 means the background model is favored, a score close to 100 favors the splice-site motif in question, and a score close to 0 implies that the splice site is different from both the background and the splice-site distribution. Finally, strongly conserved splice sites have a bigger spread of scores, so a single mismatch can drop the score dramatically, whereas less conserved splice sites are more tolerant of mismatches compared to the highest scoring sequences.
Classification of introns
PWMs were first generated for each splice-site type and BPS using human curation. The U12-type BPS consists of an 8 nt pyrimidine-rich sequence (approximately TTTTCCTT), followed by two positions enriched in As and two positions enriched in pyrimidines (C or T). The U12-type BPS requires an A, either at positions 9 or 10 (42
). In order to take this into account, we constructed two U12-type BPS PWMs, one with a highly conserved A at position 9 (U12-BPS-A9), and another with a highly conserved A at position 10 (U12-BPS-A10). A sequence was considered a good BPS if it had a score greater than 65 for either one of the two BPS PWMs and if it was located between 8 and 30 nt upstream of the 3′ss.
The short fragments around the exon–intron boundaries were scored using the LOD matrices, and each intron received a score for 5′ss, 3′ss and BPS for U12-type introns. The best examples of each type, those with high 5′ss scores and consistent dinucleotide boundaries, were selected. For U12-type GT–AG introns, an additional criterion for the presence of a BPS between 8 and 38 nt upstream of the 3′ end was used to eliminate possible contaminants. The best set of introns for each subtype was used to generate new species and splice-site type specific PWMs, which were then used to rescore the splice sites and reclassify them.
Intron classification was unambiguous in almost all cases, either due to their distinctive boundaries or their 5′ss scores, with the exception of a few introns with similar scores for U2- and U12-type GT–AG 5′ss matrices. The introns that remained unclassified were subsequently classified using criteria described below.
Previous classification schemes (6
) assumed that U12-dependent splicing required a good BPS within 10–20 nt from the 3′ end of the intron (22
), but recent data has shown this is not always necessary (7
). Thus, we used the presence of an optimal U12-type BPS only to discriminate between introns that could not be unambiguously classified by their 5′ss scores alone. If the U12-type 5′ss score was greater than the U2-type 5′ss score by more than 25 then it was assigned the U12-type GT–AG. If the U12-type 5′ss score was greater than the U2-type 5′ss score by more than 10 and the intron had a good U12-type BPS, then again it was assigned the U12-type GT–AG. In all other cases it was assigned the U2-type GT–AG.
The thresholds of 25 and 10 are arbitrary, but were optimized by human curators to minimize the error rate, which was estimated at <1% by manual scrutiny. The borderline cases could be classified in either group, and their unambiguous assignment will require experimental verification.
Finally, introns with a minimum 5′ss score less than 50 are kept in the database, and can be accessed from the SpliceRack website, but are not used in our analysis.
Introns with non-canonical boundaries
A few of the introns with dinucleotide boundaries that do not fit into the canonical categories are real, and we delineate the strategy to identify them. In general, current datasets are biased against non-canonical splice sites, but it is possible to gather some information on them from the RefSeq sequences, as we show below.
To identify non-canonical U12-type introns from this unclassified set, we chose those with good matches to the U12-type 5′ss, with a non-canonical dinucleotide at the intron's 3′ end, and a significant number of these introns were found (). Many of these introns have a U12-BPS. Analogous to the scheme outlined above, those introns without a match to the U12-type BPS were only classified as U12-type when the score for the U12-type 5′ss was >25 score units higher than that for the U2-type 5′ss. These introns were included in the U12-type RT–RN category (where R is a purine, and N is any nucleotide).
A synopsis of the collection of splice sites in SpliceRack
Some GC–AG introns revealed a strong match to the PWM for U12-type GT–AG 5′ss, except for the mismatch at position +2. We searched for GC–AG introns with the GCATCC motif from positions +1 to +6 of the 5′ss, and found that most of these introns had a match to the BPS at an optimal distance from the 3′ss. These introns were named U12-type GC–AG.
U2-type AT–AC introns (or AT–AC type II introns) were identified by searching for unclassified AT–AC introns that bear a good match to the U2-type 5′ss, except for the intron's terminal dinucleotides. A PWM for U2-type AT–AC 5′ss was generated from the PWM for U2-type GT–AG 5′ ss, by changing the full conservation from G to A at position +1. Only those introns with a minimal 5′ss score of 75 were reclassified as U2-type AT–AC.
Putative examples of other classes of non-canonical U2-type introns were identified by searching for all exon–intron boundaries with good 5′ GT–AG U2-type splice sites (score greater than 70), with a non-canonical dinucleotide boundary at the 3′ terminus of the intron and a good PPT near the 3′ terminus in the intron.
Introns that could not be classified into any of the seven categories either lack a canonical exon–intron boundary dinucleotide, and/or have all 5′ss scores below 40. These can be accessed through SpliceRack. Most of these unknown splice sites are probably due to sequencing or annotation errors in the genome (44
), as shown in previous examinations of non-canonical splice sites (45
). The unclassified introns in our dataset represent a small fraction (<1.5%) of the total number of introns.