To gain insight into the nature of primordial introns, we analyzed putative protosplice sites of ancient introns that have retained their positions throughout the evolution of eukaryotes. The aim was to determine whether the primordial protosplice sites correspond to those of U12 or U2 introns. Protosplice sites [8] are thought to be specific targets for intron insertion into the coding sequences of eukaryotic genes. Their existence is indicated by the conservation of nucleotides flanking the splice junctions (Figure 1). In principle, these consensus nucleotides could be remnants of the original protosplice sites or could have evolved convergently after intron insertion. The existence of protosplice sites has been addressed directly by examining the context of introns inserted within codons encoding amino acids that are conserved in all eukaryotes and that, accordingly, are not subject to selection for splicing efficiency [10]. Evidence has been presented that introns are either predominantly inserted into specific protosplice sites, which have the consensus sequence (A/C)AG||Gt, or are inserted randomly but preferentially fixed at such sites [10]. The U12 protosplice sites are distinct from the U2 protosplice sites and have the consensus sequence CT||ATA (Figure 1). This sequence is conserved in human and Arabidopsis thaliana, indicating that it has not changed since the divergence of plants and animals from their last common ancestor.
Figure 1. Protosplice sites of U2 and U12 introns. Negative numbers indicate nucleotide positions in the exon immediately preceding the splice junction; positive numbers indicate nucleotide positions in the exon immediately after the splice junction.
We analyzed the distributions of amino acids in intron-containing sites in which the amino acid is conserved in the sequences of orthologous proteins from eight eukaryotes and five prokaryotes (Supplementary Table S1 and Supplementary Materials and Methods in the supplementary material online), that is, sites that are subject to extreme evolutionary constraints (hereafter called invariant sites). Such constraints operating at the level of amino acids imply that selection for splicing efficiency had no substantial impact on the intron insertion signal. Thus, this signal, at least to the extent that it covers conserved nucleotides within the respective codon, must have remained intact since the time of intron insertion at an early stage of eukaryotic evolution. Ancient introns, in this case, were defined as those whose positions are conserved in at least two of three major eukaryotic lineages (plants, animals plus fungi, and apicomplexa). All 197 ancient introns found at the invariant sites (53% of the intron positions that are conserved between animals and plants) were U2-type.
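The working definition of an ancient intron given above (a position shared by at least two of the three major lineages) can be illustrated with a minimal sketch. The lineage labels, the position identifiers and the function name `is_ancient` are hypothetical and the toy data are invented; this is not the pipeline described in the Supplementary Materials and Methods.

```python
# Hypothetical labels for the three major lineages named in the text.
MAJOR_LINEAGES = ("plants", "animals_fungi", "apicomplexa")


def is_ancient(lineages_with_intron, min_lineages=2):
    """Return True if an intron position occurs in at least
    `min_lineages` of the three major eukaryotic lineages."""
    present = set(lineages_with_intron) & set(MAJOR_LINEAGES)
    return len(present) >= min_lineages


# Toy data (invented): intron position id -> lineages in which it is found.
positions = {
    "pos_a": {"plants", "animals_fungi"},
    "pos_b": {"apicomplexa"},
    "pos_c": {"plants", "animals_fungi", "apicomplexa"},
}

# Keep only the positions that satisfy the two-of-three criterion.
ancient = sorted(p for p, lineages in positions.items() if is_ancient(lineages))
print(ancient)
```

In this sketch a position seen in only one lineage is treated as lineage-specific and excluded, which mirrors the two-of-three criterion stated in the text.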
Putative protosplice sites can be inferred by analyzing amino acid frequency distributions at intron-containing sites [10]. We compared the amino acid distributions at the putative ancient protosplice sites that are derived from the invariant site analysis with the distributions at the sites containing U2 and U12 introns in human and Arabidopsis genes. The distributions of amino acids at intron-containing invariant sites were highly non-uniform (Figure 2). Introns occur in three phases, defined by their location relative to codon boundaries: phase 0 introns lie between two codons, phase 1 introns after the first position of a codon and phase 2 introns after the second position. Each phase has a distinct set of over-represented conserved amino acids (Figure 2 and Supplementary Table S2). This effect is especially pronounced for phase 1, in which 71% of primordial introns are located within glycine codons (G|GN) (Figure 2). This pattern is similar to that seen for U2 introns in phase 1, in which 37% of human introns and 40% of Arabidopsis introns are located within glycine codons (Figure 2), in agreement with the inference that at least a substantial fraction of ancient introns were U2-type. The excess of glycine in the case of ancient introns is a straightforward consequence of the over-representation of glycine in invariant positions (Supplementary Figure S1).
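The phase definition and the G|GN case can be made concrete with a short sketch. It assumes that an intron position is given as the number of coding nucleotides that precede it in the spliced coding sequence; the function names and the example sequence are invented for illustration.

```python
def intron_phase(coding_nt_before):
    """Phase 0: between codons; phase 1: after the first codon
    position; phase 2: after the second codon position."""
    return coding_nt_before % 3


def interrupted_codon(cds, coding_nt_before):
    """Return the codon split by the intron, or None for phase 0."""
    phase = intron_phase(coding_nt_before)
    if phase == 0:
        return None
    start = coding_nt_before - phase  # first nucleotide of the split codon
    return cds[start:start + 3]


def is_glycine_codon(codon):
    """Glycine is encoded by GGN, so a phase 1 intron in it reads G|GN."""
    return codon is not None and codon.upper().startswith("GG")


# Invented spliced coding sequence: ATG GGT CTT
cds = "ATGGGTCTT"
for n_before in (3, 4, 8):
    codon = interrupted_codon(cds, n_before)
    print(n_before, intron_phase(n_before), codon, is_glycine_codon(codon))
```

Here an intron after four coding nucleotides falls in phase 1 and splits the codon GGT as G|GT, the configuration reported as most common for primordial phase 1 introns.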
Figure 2. Comparison of amino acid frequencies at protosplice sites of primordial, U2 and U12 introns. The distributions are shown separately for each intron phase. ‘Phase aa0’ and ‘Phase 0aa’ denote phase 0 introns located immediately [...]
Comparison of the distributions of amino acids that harbor human and Arabidopsis U2 and U12 introns revealed a negative correlation that was not statistically significant (Supplementary Table S3). This is not unexpected given the difference between the inferred U2 and U12 protosplice sites (Figure 1). To compare the protosplice sites of primordial introns with U2 and U12 protosplice sites, we employed multiple regression analyses using frequencies of invariant amino acids containing ancient introns as the dependent variable and frequencies of amino acids containing human and Arabidopsis U2 or U12 introns as independent variables.
Table. Multiple regression analyses of the protosplice sites of the primordial introns with U2 and U12 introns
A strong and statistically significant positive correlation between the putative ancient protosplice sites and U2 protosplice sites from human and Arabidopsis was found both for the raw numbers of amino acids and for normalized values (see the regression table above and Supplementary Table S4), explaining a substantial fraction (>0.64) of the variance of the ancient protosplice sites. This result indicates that most, if not all, of the analyzed primordial introns were U2-type at the time of their insertion at an early stage of eukaryotic evolution rather than being the result of U12 to U2 conversion. It should be noted that this finding in itself does not depend on the excess of U2 introns in extant genes or even in conserved intron positions but, rather, comes from an unbiased analysis of invariant intron-containing sites.
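The multiple regression described above can be sketched as an ordinary least-squares fit with two independent variables. This is a minimal illustration under stated assumptions: the ancient-intron amino acid frequencies are taken as the dependent variable and the human and Arabidopsis frequencies as the two predictors, the function name `fit_two_predictors` is hypothetical, and all numbers are invented. It is not the statistical procedure used to produce the reported values.

```python
def fit_two_predictors(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2.
    Returns (b0, b1, b2, r_squared)."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    dy = [v - my for v in y]
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    # Centered sums of squares and cross-products.
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear or constant")
    # Solve the 2x2 normal equations by Cramer's rule.
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    ss_tot = sum(c * c for c in dy)
    ss_res = sum((yv - (b0 + b1 * a + b2 * b)) ** 2
                 for yv, a, b in zip(y, x1, x2))
    return b0, b1, b2, 1.0 - ss_res / ss_tot


# Invented frequencies for five amino acids (not data from this study).
ancient = [0.40, 0.15, 0.20, 0.15, 0.10]
human_u2 = [0.35, 0.20, 0.20, 0.15, 0.10]
arabidopsis_u2 = [0.38, 0.18, 0.22, 0.12, 0.10]

b0, b1, b2, r2 = fit_two_predictors(ancient, human_u2, arabidopsis_u2)
print(f"R^2 = {r2:.3f}")
```

The returned coefficient of determination is the fraction of variance in the dependent variable explained by the two predictors, which is the type of quantity reported as >0.64 in the text.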
With respect to the possibility of massive loss of U12 introns in early eukaryotes, we have shown in a separate recent study that positions of U12 introns are even more strongly conserved between humans and Arabidopsis than positions of U2 introns [11]. Therefore, it seems unlikely that all primordial U12 introns have been lost, so at least a substantial majority of the primordial introns probably were of the U2-type.
We cannot rule out the (formal) possibility that a minor fraction of U12 introns was present during the early stages of eukaryotic evolution, although there was no correlation between ancient protosplice sites and U12 protosplice sites (see the regression table above). We attempted to estimate the sensitivity of the multiple regression analysis using a sampling procedure. Mixtures of U2 and U12 protosplice sites with different proportions of each type (e.g. 10% U12 protosplice sites and 90% U2 protosplice sites) were generated and used as pseudo-ancestral protosplice sites. The results of this simulation (Supplementary Figure S2) show that even a 10% admixture of U12 introns yielded a correlation coefficient that was significantly lower than the value observed with the real data (Supplementary Table S5). The number of known U12 introns is too small to enable a more precise estimate, but the results strongly indicate that, if U12 introns were present among the primordial introns, their fraction was, at most, similar to that in modern genomes. The conclusions of this analysis should be interpreted with caution, considering that the invariant sites that are informative for inferring the features of primordial introns comprise only a small fraction of the conserved intron positions and that the statistical discrimination between U2 and U12 protosplice sites is weak. Nevertheless, as shown earlier, we currently have no indication of the existence of primordial U12 introns, whereas the evidence in support of primordial U2 introns is clear.
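The sampling procedure can be sketched as follows. This is a simplified illustration, not the procedure behind Supplementary Figure S2: it uses a plain Pearson correlation in place of the multiple regression, the amino acid frequencies are invented, and the function names are hypothetical. The sample size of 197 is borrowed from the number of ancient introns at invariant sites quoted earlier.

```python
import random


def sample_mixture(u2_freq, u12_freq, n_sites, u12_fraction, rng):
    """Draw n_sites amino acids, each from the U12 distribution with
    probability u12_fraction and from the U2 distribution otherwise."""
    counts = {aa: 0 for aa in sorted(set(u2_freq) | set(u12_freq))}
    for _ in range(n_sites):
        source = u12_freq if rng.random() < u12_fraction else u2_freq
        residues = list(source)
        weights = [source[aa] for aa in residues]
        counts[rng.choices(residues, weights=weights)[0]] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


# Invented toy distributions (not data from this study).
U2_TOY = {"G": 0.4, "K": 0.2, "E": 0.2, "R": 0.2}
U12_TOY = {"L": 0.5, "S": 0.3, "G": 0.2}

rng = random.Random(0)  # fixed seed so that runs are repeatable
for fraction in (0.0, 0.1, 0.5):
    pseudo = sample_mixture(U2_TOY, U12_TOY, 197, fraction, rng)
    residues = sorted(pseudo)
    r = pearson([pseudo[aa] for aa in residues],
                [U2_TOY.get(aa, 0.0) for aa in residues])
    print(f"U12 fraction {fraction:.1f}: r with U2 = {r:.3f}")
```

Repeating the draw many times for each mixing proportion would give a distribution of correlation coefficients against which the value observed for the real data could be compared.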