To assess the relationship between the locations of common genetic variation in human genes and ESEs, we screened the public SNP database (dbSNP; build 112) for SNPs that mapped to human exons. A set of biallelic reference SNPs was used, excluding entries that (1) mapped to multiple regions in the human genome, (2) mapped to repetitive elements, or (3) were derived from transcript sequence data, e.g., through comparison of expressed sequence tags (ESTs) (see Materials and Methods
for details). The remaining SNPs were searched against a large database of human genes containing approximately 121,000 internal exons annotated by aligning available human cDNAs to the assembled genome using the GENOA genome annotation system (see Materials and Methods
). This search identified 9,862 SNPs that were localized to an internal exon (aligned perfectly to the genomic sequence over a 33-base segment centered on the polymorphic position).
ESE Density Is Highest and SNP Density Lowest near Splice Sites
Recording the position of each SNP within the corresponding exon revealed that SNP density is not uniform along exons (). Consistent with previous observations (Majewski and Ott 2002
), SNP density was approximately 20–30% lower near both the 3′ splice site (3′ss) and the 5′ splice site (5′ss) of human exons than in the interior of exons, and reached a plateau at about 25–30 bases from the splice sites. The distribution of RESCUE-predicted ESE hexamers along exons had roughly an inverse relationship to the SNP density, with the highest density of ESEs observed near the 5′ and 3′ splice junctions and a lower density in the interior of exons (). Previously, ESE activity has been observed to vary as a function of the distance between the ESE and the adjacent splice sites, with the highest activity in vitro and in vivo for ESEs positioned closest to splice sites (Nelson and Green 1988
; Lavigueur et al. 1993
; Graveley et al. 1998
). Thus, selective pressure is likely to be higher on ESEs located near splice junctions relative to ESEs in the interior of exons, which could explain the trend in ESE density shown in . As a consequence of the increased density of ESEs near splice sites, mutations that occur in exons near splice sites should have a higher likelihood of disrupting ESEs and therefore be more likely to be eliminated by purifying selection. Thus, selection on ESEs could potentially explain the trend in SNP density seen in . In order to more directly test the hypothesis that ESE disruption mutations are subject to negative selection, we conducted a large-scale analysis of sequence variation in human exons.
Density of Predicted ESEs and SNPs along Human Exons
Estimating the Frequency of ESE Disruption in Human SNPs
The frequency of ESE disruption for simulated (randomly generated, unselected) mutations was compared to the frequency of ESE disruption observed in SNPs, which represent mutations that have survived selection to become reasonably frequent in the human population. SNPs are shaped by the interplay between the mutation process and the process of natural selection. Comparing simulated mutations to natural variations is an effective way to decouple these processes, and this approach has been used by others to study selective pressure in protein-coding genes (Gojobori 1983
; Nei and Gojobori 1986
; Kowalczuk et al. 2001
). Here, we describe the analysis of simulated and natural mutations using a variation of the McDonald–Kreitman test, a widely used statistical test for detecting selection in genes, that we have adapted to measure the strength of selection acting on ESEs using data from several thousand exons (McDonald and Kreitman 1991
; Jenkins et al. 1995
When considering SNPs, in order to distinguish mutations that disrupt predicted ESEs from those that create predicted ESEs, the identity of the ancestral allele must be established. Since the mutations that created most human polymorphisms occurred less than 1 million years ago (Slatkin and Rannala 2000
; Miller and Kwok 2001
), long after the human–chimpanzee divergence of 5 million years ago (Stauffer et al. 2001
), the orthologous chimpanzee exon will almost always represent the sequence of the ancestral allele.
Each of the 9,862 mapped human SNPs described previously was aligned (using a 33-nucleotide sequence window centered around the polymorphic position) to unassembled reads from the genome of the chimpanzee Pan troglodytes,
accessed through the NCBI trace archives (ftp://ftp.ncbi.nih.gov/pub/TraceDB/pan_troglodytes/
). As the trace archives represented several-fold coverage of the chimp genomic sequence, most SNPs matched to several sequence reads. Whenever one allele of a human SNP consistently matched the chimpanzee sequence in all high-quality alignments, that allele was designated as the ancestral allele, and the other allele at that position was designated as a variant allele. The 8,408 SNPs that satisfied this criterion were then annotated for predicted ESEs in both the ancestral and variant sequence, simply by comparing the six overlapping hexanucleotides that differed between the two alleles to the set of RESCUE-ESE hexamers and recording the number of matching hexamers.
It is well known that current SNP databases contain a certain rate of error. The SNP consortium estimated that about five percent of their submissions were false positives attributed to base calling errors (Altshuler et al. 2000
). Incorrect mapping can also result in the misclassification of nearly identical paralogous regions as SNPs (Bailey et al. 2002
; Cheung et al. 2003
). The rate of false positives in dbSNP can be conservatively estimated by resequencing DNA that has been collected from many individuals. SNPs that cannot be validated in such a manner are either rare SNPs that were not present in the sample or are false positives of the SNP discovery method. Recent resequencing studies validate 60–86% of the entries in dbSNP (Carlson et al. 2003
; Reich et al. 2003
). We anticipated a lower rate of false positives in the 8,408 SNPs that were used in this analysis because we removed error-prone categories of SNPs (such as those derived from ESTs or duplicated regions) from our data set. Despite this expected improvement, our initial analysis focused on the subset of 2,561 SNPs that had been validated by resequencing and were thus assumed to be free of errors (see Materials and Methods
). This precaution was taken because the measurement of ESE disruption is particularly sensitive to artifacts (unpublished data).
The annotation of RESCUE-ESE hexamers in a biallelic SNP results in one of four possible outcomes: no ESE hexamers in either allele (ESE neutrality, − −), one or more ESE hexamers only in the ancestral allele (ESE disruption, + −), one or more ESE hexamers only in the variant allele (ESE creation, − +), or one or more ESE hexamers in both alleles (ESE alteration, + +). The latter category is referred to as ESE alteration because the sets of RESCUE-ESE hexamers in the ancestral and variant alleles are, of course, different, and therefore may not necessarily be recognized with the same affinity by the same trans-factor(s). The relative frequencies of these four outcomes are listed in A in the row labeled “Selected (SNP).” Since the 238 RESCUE-ESE hexamers represent only a small fraction (approximately 6%) of the 4,096 possible hexanucleotides, it was not surprising that a large majority of SNPs fell into the ESE-neutral category. To determine whether the rate of ESE disruption in SNPs was higher or lower than what would be expected from unselected mutations, we performed a Monte Carlo (random) simulation of point mutations in human exons.
Analysis of the Effects of SNPs and Unselected Mutations on Predicted ESEs
Reduced Frequency of ESE Disruption in SNPs versus Unselected Changes
Mutations were simulated at random in human exons using nucleotide substitution frequencies that reflect the mutational biases observed in unselected regions of the genome with similar nucleotide composition. In vertebrates, the greatest biases in nucleotide-to-nucleotide substitution frequencies are related to the hypermutability of C residues (Duncan and Miller 1980
) and the higher rate of transitions (C ↔ T, A ↔ G) relative to transversions (purine ↔ pyrimidine). Rates for different base changes have been estimated from analysis of aligned sequences which are assumed to be under no selective pressure. For example, transitions make up 67–70% of all substitutions in human pseudogenes and 65.5% of all substitutions in genomic repeat sequences (Graur and Li 2000
; Hardison et al. 2003
; Zhang and Gerstein 2003
Using the substitution frequencies derived from a recent large-scale study of nucleotide substitution patterns in processed pseudogenes (Zhang and Gerstein 2003
), mutations were simulated in the set of approximately 121,000 GENOA-annotated internal human exons and the effects on predicted ESEs were analyzed as described above for SNPs. The simulation captured the influence of nearest-neighbor bases on the pattern and rate of nucleotide substitution. These nearest-neighbor effects are particularly pronounced for CpG dinucleotides, where an elevated C-to-T mutation rate is observed as an indirect consequence of cytosine hypermethylation. Comparing the results of this simulation to those observed for SNPs (listed as “Selected (SNPs)” in A), the most striking difference was observed for the category of ESE disruption: 13.6% of simulated (unselected) mutations caused ESE disruption compared to only 10.9% of SNP mutations. This difference implies a significant selective disadvantage for mutations that disrupt RESCUE-ESE hexamers.
To assess the degree of selection on ESEs, one standard measure is the relative risk (RR), defined in this instance as the ratio of the frequency of ESE disruption in SNPs to the frequency of ESE disruption for unselected mutations (e.g., RR = 10.9%/13.6% = 0.80, using the pooled data from A). In this instance, we preferred to use the slightly more complex odds ratio (OR) measure to quantify this effect (defined in Materials and Methods
) because of its better statistical properties (Pagano and Gauvreau 2000
As expected, the SNP dataset was greatly enriched for synonymous variation relative to the simulated mutation dataset. There is a 1.3:1 ratio of synonymous:nonsynonymous changes in the SNP dataset compared to a 0.5:1 ratio of synonymous:nonsynonymous changes in the simulated dataset. This difference, which has been observed many times, suggests that more than 60% of mutations that change the amino acid sequence are eliminated by natural selection (Graur and Li 2000
). In order to account for the potentially confounding effect of selection occurring at the protein level, SNPs and simulated mutations were divided into synonymous and nonsynonymous groups and analyzed separately. After controlling for the higher frequency of synonymous mutations in dbSNP, the selective pressure to avoid disrupting ESEs was approximately equal for the synonymous and nonsynonymous classes of mutations (the Mantel–Haenzel [MH] test of homogeneity indicated no significant differences in the magnitude of the effect across all comparisons; χ2
< 0.5). This observation confirms our previous result and alleviates the concern that the analysis might have been confounded by the effects of selection acting at the protein level. The summary OR (the weighted combination of the separate synonymous and nonsynonymous analysis) for ESE disruption was 0.82 ± 0.05 (B), which implies that natural selection has eliminated approximately 18% of arising point mutations that disrupt RESCUE-ESE hexamers (p
Base changes that alter one or more predicted ESE hexamers but result in creation of other RESCUE-ESE hexamers (ESE alteration) are also selected against, but to a somewhat lesser degree (B, “+ +” category). This observation is not surprising given that these changes may alter the specific combination of SR proteins which interact with the exon and consequently alter ESE activity. In our previous study, we found that ESE alteration mutations would often cause an increase or decrease in enhancer activity, as determined in vivo using a splicing reporter construct (Fairbrother et al. 2002
). SR proteins generally have distinct, though sometimes overlapping, RNA binding specificities and vary in their ability to activate splicing (Graveley et al. 1998
; Liu et al. 1998
). Therefore, some ESE alteration mutations that result in one SR protein replacing another SR protein may weaken the ESE and disrupt splicing. In addition, there are situations where the simultaneous binding of multiple activator proteins on a substrate is critical for correct processing of that pre-mRNA (Tian and Maniatis 1993
). In such a case, it is unlikely that one ESE could be exchanged for another without deleterious consequences.
At first glance, the overrepresentation of some categories in this analysis may seem surprising. However, this is simply a consequence of the underrepresentation of disruption and alteration mutations in the SNP pool causing the remaining two categories of variation, ESE neutrality and ESE creation, to appear slightly overrepresented in SNPs relative to unselected mutations (B). As mutations that result in a selective disadvantage are rapidly eliminated from the population, an increasing fraction of the mutations that persist as SNPs will be selectively neutral (Graur and Li 2000
). In other words a neutral variant will, on average, be eliminated less rapidly than a disadvantageous variant and so the set of neutral variations will come to represent an increasing fraction of the total (diminishing) pool of variants. The analysis presented here divides SNPs into four categories based on ESE annotation. As variations from two of these categories (ESE disruption and ESE alteration) are shown to be preferentially eliminated by natural selection, the remaining two categories (ESE neutrality and ESE creation) will represent a larger fraction of a diminished total pool and, therefore, appear enriched (B).
Selective Pressure Is Strongest for ESEs Located near Splice Sites
Experiments measuring the splicing activity of substrates with variations in the distance between a well-characterized ESE and the 3′ss have demonstrated a strong proximity effect, with ESE activity decreasing as the distance from the splice site increases (Lavigueur et al. 1993
; Graveley et al. 1998
). Although the closest ESEs tested in the distance studies were 70 nucleotides away from the splice sites, other studies have demonstrated that ESEs can function at much closer distances to splice sites (Nelson and Green 1988
; Coulter et al. 1997
). In order to test the generality of this distance effect, we quantitated the selective pressure on ESEs in distal and proximal windows at both the 3′ss and the 5′ss (). Validated SNPs that fell within a particular exon region (e.g., the first 20 nucleotides of the 3′ss proximal window is defined as region A in ) were compared to unselected (simulated) mutations that fell in the same region, and summary ORs for ESE disruption were calculated for the four exon regions shown. Consistent with higher ESE activity for ESEs located near splice sites, we observed a pronounced increase in the conservation of RESCUE-ESE hexamers located within 20 bases of either the 5′ss or 3′ss (p
< 0.05) relative to predicted ESEs located further from splice sites ().
Selection against Disruption of Predicted ESEs in Different Exon Regions
SNP Analysis Identifies Conserved ESEs
Our observations that SNPs tend to avoid disrupting RESCUE-ESE hexamers (see ) and that the magnitude of this selection increases near splice sites () indicate that this set of hexamers represents a physiologically important collection of sequences across many human exons and genes. Although all RESCUE-ESE hexamers tested to date have ESE activity in cell culture assays, we have tested only a small fraction of the 238 individual hexamers, and presumably some members of this set may be false positives of the RESCUE-ESE method.
In order to better define functional hexamers in the RESCUE-ESE set, we repeated the selected versus unselected comparisons for each hexamer individually using the larger set of 8,408 exonic SNPs described previously. For each RESCUE-ESE hexamer, we counted cases where a hexamer was interrupted by a mutation that has survived selection (SNP mutations) and compared this frequency to the value we would expect in the absence of selection (simulated mutations). For this analysis we used a simple RR measure, defined as the frequency with which a hexamer overlaps with SNPs relative to the frequency with which the same hexamer overlaps with simulated mutations. This ratio (frequencySNP/frequencysimulated) was calculated for each hexamer and provided a means of assessing the selective pressure on each hexamer. The RR for a hexamer that was under no additional selective pressure would therefore be equal to 1.0, and a hexamer under increased selection would have an RR of less than 1. Consistent with our previous analysis (see ), the majority of RESCUE-ESE hexamers (162 hexamers) had an RR of less than 1 (A), suggesting that many RESCUE-ESE hexamers are subject to purifying selection.
Measuring Selective Pressure on Each RESCUE-ESE Hexamer
While the somewhat limited size of the currently available SNP databases limits our power to detect selection acting on individual hexamers, it was possible to detect a subset of hexamers that displayed a statistically significant level of conservation. A bootstrap sampling strategy identified a total of 57 hexamers with RRs of significantly less than 1 (A, pink bars) compared to only six hexamers with RRs significantly greater than 1 (A, blue bars). By comparison, testing a set of 238 arbitrary hexamers would be expected to yield approximately 12 significant hexamers in each of these categories at a p
value cutoff of 0.05. Included within the set of 57 conserved ESEs are hexamers corresponding to the well-characterized “GAR” and AC-rich ESE classes (Coulter et al. 1997
) and several other types of ESEs (see Supporting Information
A comparison of RESCUE-ESE predictions for four different vertebrates (G. Yeo, S. Hoon, B. Venketesh, and C. B. B., unpublished data) revealed a strong correspondence between within-species conservation (SNP analysis) and cross-species conservation (B). In other words, the ESE hexamers that appeared to be under the greatest selective pressure within the time frame of the SNP analysis (the last hundreds of thousands of years) were more likely to retain the characteristics used by the RESCUE approach to identify ESEs over the time frame of vertebrate speciation (the last tens to hundreds of millions of years). While only a minor fraction (11%) of the hexamers that appear exclusively in the human lineage were significantly conserved in the SNP analysis (RR < 1; p < 0.05), about half of the hexamers predicted to be ESEs in all four vertebrates examined (human, mouse, zebrafish, and pufferfish) were significantly conserved in the SNP analysis. In the future, with a larger SNP database and more genomes available, it should be possible to use these methods to analyze more individual hexamers for evidence of selective pressure.
As mentioned previously, the public SNP database used in this study contains a significant fraction of entries that could not be validated by resequencing. If these unvalidated entries in dbSNP were errors, they would not be expected to specifically avoid ESEs, or particular exon positions. Therefore, SNP artifacts are likely to reduce, rather than increase, the apparent significance of the biases in the nature and distribution of SNPs that we have observed.
Here we have shown, using polymorphism data, that RESCUE-ESE hexamers have been preferentially conserved in the recent evolutionary history of the human lineage and that the strength of this conservation increases with increasing proximity to the splice sites. These results imply that splicing imposes important constraints on the evolution of human exons.
As the size and quality of the SNP database is rapidly increasing, measures of selection deriving from SNP data are likely to become increasingly useful for evaluating the function of short, degenerate sequence elements like ESEs. For example, it should be possible to use the VERIFY method to analyze selection on motifs that are postulated to control transcription, polyadenylation, messenger RNA (mRNA) stability, or translation. The current build of dbSNP contains 360,000 SNPs that are located in a “locus region” within 2 kb upstream of the transcript start site or 500 bp downstream of the polyadenylation site (Sherry et al. 2001
). There are also more than 500,000 SNPs localized to the untranslated regions of mRNAs where elements that modulate translation, polyadenylation, and mRNA stability are thought to be located. The analysis presented here used 8,400 SNPs to study selection on ESE hexamers. Making the simplifying assumptions that ESE hexamer frequencies are both uniform and slightly elevated in exons and are mutated without bias, we can roughly estimate the statistical power of the VERIFY method to detect selection (see Materials and Methods
). For example, the set of ESE hexamers, as a whole, is under selective pressure (OR = 0.82), but this selective pressure will vary according to hexamer, with some hexamers being highly conserved and others less so (). Our ability to detect conservation at the level of the individual hexamers depends upon (1) the degree of hexamer conservation (large differences are easier to detect than small differences) and (2) the number of times a SNP interrupts an ESE hexamer (larger data size increases our confidence in individual measurements and, therefore, increases our ability to measure differences). The VERIFY method could detect about 70% of strongly conserved hexamers (OR > 0.5) in the set of 238 ESE hexamers using 8,400 SNPs. A similar type of analysis performed with 500,000 SNPs would not only detect almost all the strongly conserved motifs but would be able to detect motifs under a weaker degree of selection (OR > 0.77) with a similar power (approximately 70%). As an alternative to using polymorphisms within a species, VERIFY could also utilize variations between closely related species to study selection on gene control elements.
In addition, this work identifies two new categories of mutations that are under selective pressure: mutations that disrupt or alter RESCUE-ESE hexamers. This increased selective pressure presumably reflects the increased likelihood that such variations will alter splicing and, therefore, gene activity. Several cases of polymorphisms which result in allele-specific differences in splicing have been reported (Betticher et al. 1995
; Stallings-Mann et al. 1996
; Stanton et al. 2003
). Here, we identify features such as proximity to splice sites and ESE disruption and alteration that should prove useful in discovering additional polymorphisms that affect splicing. As splicing mutations constitute at least 14% of disease-causing mutations (Stenson et al. 2003
), polymorphisms that affect splicing would be good candidates for association studies intended to identify genetic contributors to quantitative traits or diseases. An additional benefit of using variations that are predicted to result in altered splicing relates to the feasibility of validating an RNA phenotype. While variations predicted to alter protein activity may require gene-specific activity assays or antibodies, RNA phenotypes such as exon skipping can be readily screened by RT-PCR, thus enabling large scale phenotyping of genotyped cell lines.