This study shows that the (CAG)8CAA(CAG)4CAA(CAG)8 allele (a-5) of the CAG repeat in exon 1 of SCA2 is significantly associated with a haplotype (CH-1) that was detected as being under recent positive selection in CEU. As a result, a region of nearly one megabase of Chromosome 12 around this locus shows extensive LD. This is a dramatic, recent evolutionary pattern that appears to be restricted to Europeans.
Other alleles tested by the EHH approach served as internal controls, ruling out the possibility that a reduced mutation or recombination rate in this physical region of chromatin, or differences among the alleles themselves, accounts for the long LD. For example, the TCAGGAT allele differs from CH-1 by only one base in the core region and by only a few bases across the entire ~1 Mb region (A). However, its EHH decayed very quickly (A), and its REHH values showed no significance when plotted against allele frequency (Figure S2). These results suggest that local variation in recombination rate has not led to the predominance of the CH-1 haplotype. Additionally, the recombination rates (sex-average = 0.52, female = 0.87, and male = −0.11) estimated for this region as the natural logarithm of the ratio of the map distances (cM × 10^6/Mb), based on deCODE and Marshfield markers, do not show unusual statistical distributions and would not account for the extremity of LD (J. Belmont et al., unpublished data).
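To make the control comparison concrete, the following is a minimal sketch of how EHH and REHH can be computed from phased haplotypes. It assumes the commonly used formulation in which EHH is the probability that two randomly chosen chromosomes carrying a given core haplotype are identical from the core out to a given marker, and REHH is that value divided by the pooled EHH of all other core haplotypes. The function names, the string encoding of haplotypes, and the toy data are hypothetical and are not the data or software used in this study.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_start, core_end, core, extend_to):
    """EHH of `core`: chance that two carriers match from the core out to `extend_to`."""
    carriers = [h for h in haplotypes if h[core_start:core_end] == core]
    n = len(carriers)
    if n < 2:
        return float("nan")  # undefined with fewer than two carrier chromosomes
    # Group carriers by their extended haplotype (core plus flanking markers).
    groups = Counter(h[core_start:extend_to] for h in carriers)
    return sum(comb(c, 2) for c in groups.values()) / comb(n, 2)


def rehh(haplotypes, core_start, core_end, core, extend_to):
    """REHH: EHH of `core` divided by the pooled EHH of all other core haplotypes."""
    others = Counter(
        h[core_start:core_end]
        for h in haplotypes
        if h[core_start:core_end] != core
    )
    num = den = 0
    for other, n in others.items():
        groups = Counter(
            h[core_start:extend_to]
            for h in haplotypes
            if h[core_start:core_end] == other
        )
        num += sum(comb(c, 2) for c in groups.values())
        den += comb(n, 2)
    if den == 0 or num == 0:
        return float("nan")  # pooled EHH is undefined or zero
    return ehh(haplotypes, core_start, core_end, core, extend_to) / (num / den)


# Hypothetical toy data: the first two characters are the core, the rest are flanking SNPs.
toy = [
    "11000", "11000", "11000", "11001",  # core "11"
    "10010", "10100", "10001",           # core "10"
    "01000", "01000",                    # core "01"
]

print(ehh(toy, 0, 2, "11", 4))   # EHH of core "11" two markers out
print(ehh(toy, 0, 2, "11", 5))   # EHH of core "11" three markers out
print(rehh(toy, 0, 2, "11", 5))  # REHH of core "11" three markers out
```

In this toy set, a core haplotype whose carriers stay identical far from the core keeps a high EHH, whereas a control haplotype that breaks down quickly does not, which is the contrast described above for CH-1 and the TCAGGAT allele.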
The SCA2 gene has an unusually low repeat variance relative to the other disease-associated coding triplet repeats [11]. The allele distribution is highly skewed towards a-4 and a-5 in CEU. Although the mutation rate at this locus has been suggested to be relatively low because the repeat is stabilized by CAA interruptions [10], this cannot fully explain the low level of variance, because a comparable number of rare alleles were found in SCA2 as in SCA1]. In addition, the rare alleles could easily arise from the common alleles, which implies that selective pressure could act to maintain the predominance of only a small number of alleles.
The population-specific allele spectra imply the action of other driving forces. In our study, we found that a-4 and a-5 accounted for 100% of chromosomes in CHB and JPT samples. This suggests an historical population bottleneck that could also have contributed to the formation of the 1 Mb LD observed in CEU. The chromosome-wide distribution pattern demonstrated, however, that the LD is an “outlier”, and thus it is unlikely that such factors alone can account for it. For CEU, we propose that a-4 and a-5 migrated from Africa to Europe with a-5 at a much lower frequency than that found in modern Europeans, and that a recent selective advantage of a-5 then enriched it in the population. The adjacent region hitchhiked with this allele and reached high frequency quickly; the long-range LD was therefore preserved in CEU.
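The proposed rise of a-5 from low to high frequency can be illustrated with a textbook deterministic model of genic selection, in which the allele frequency p changes each generation as p' = p(1 + s) / (1 + ps). This is only a sketch of the general principle: the starting frequency, target frequency, and selection coefficient below are hypothetical values chosen for illustration and are not estimates from this study, and the model ignores drift, dominance, and demography.

```python
def next_frequency(p, s):
    """One generation of genic selection on an allele with relative fitness 1 + s."""
    return p * (1 + s) / (1 + p * s)


def generations_to_reach(p0, p_target, s):
    """Count generations for the allele to rise from p0 to at least p_target."""
    if s <= 0:
        raise ValueError("s must be positive for the allele to increase")
    if not 0 < p0 < 1 or not 0 < p_target < 1:
        raise ValueError("frequencies must lie strictly between 0 and 1")
    p, t = p0, 0
    while p < p_target:
        p = next_frequency(p, s)
        t += 1
    return t


# Hypothetical parameters: a rare allele (1%) with a 1% selective advantage.
print(generations_to_reach(p0=0.01, p_target=0.5, s=0.01))
```

The faster the favored allele rises, the fewer generations recombination has to break down the flanking haplotype, which is why a recent sweep is expected to leave long-range LD around the selected site.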
The possible functional mechanisms whereby positive selection acted on the a-5 allele of the SCA2 gene in recent human population history are unclear. It seems unlikely that the total number of glutamine residues plays a role (a-4 and a-5 each encode 22 Gln residues); however, differences in the number of uninterrupted CAG repeats at the mRNA level could alter normal function through changes in mRNA folding and stability [20] or association with RNA binding factors. The allele a-5 shows a very low likelihood of expansion to the disease state [21], but given the late age of onset and low prevalence of the disease, it seems unlikely that disease predisposition could be directly related to the selection at this locus.
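The glutamine count for a-5 can be checked directly from its repeat structure, (CAG)8CAA(CAG)4CAA(CAG)8, because CAG and CAA both encode glutamine in the standard genetic code. The short sketch below expands the allele notation into codons, counts the Gln residues, and reports the longest uninterrupted CAG run. The parser and its function names are illustrative assumptions, and only the a-5 structure stated in this paper is used.

```python
import re

GLN_CODONS = {"CAG", "CAA"}  # both codons encode glutamine in the standard code


def expand(notation):
    """Expand repeat notation such as '(CAG)8CAA(CAG)4' into a list of codons."""
    codons = []
    pattern = r"\(([ACGT]{3})\)(\d+)|([ACGT]{3})"
    for unit, count, single in re.findall(pattern, notation):
        if unit:
            codons.extend([unit] * int(count))  # repeated unit, e.g. (CAG)8
        else:
            codons.append(single)               # single interrupting codon, e.g. CAA
    return codons


def longest_run(codons, target="CAG"):
    """Length of the longest uninterrupted run of `target` codons."""
    best = current = 0
    for codon in codons:
        current = current + 1 if codon == target else 0
        best = max(best, current)
    return best


a5 = expand("(CAG)8CAA(CAG)4CAA(CAG)8")
gln_count = sum(codon in GLN_CODONS for codon in a5)

print(gln_count)        # 22 glutamine residues
print(longest_run(a5))  # longest pure CAG tract is 8 repeats
```

Two alleles can therefore encode the same polyglutamine length while differing in the length of their longest pure CAG tract, which is the property suggested above to matter at the mRNA level.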
The SCA2 gene product's normal function is unknown, although it may play a role in regulated cell death [22], and changes in this function could clearly be under selective constraint. Recent analysis of a C. elegans homologue suggests a role in translational control in the germline, another potential function under strong selective constraints [24]. Recognition of the role of selection at this locus should stimulate further investigation of the mechanism through functional studies.
It remains possible that other linked functional alterations in SCA2 or in nearby genes on the CH-1 haplotype background were necessary for selection on a-5, or were even the primary target of natural selection, with the coding triplet repeat instead hitchhiking to high frequency.
Based on our phased results, a-5 also associates with core haplotypes other than CH-1, which do not show significantly high REHH. Other polymorphisms specific to the CH-1 allele are possibly important elements for selection. One model is that a-5 and other unknown genetic variants on the CH-1 background each contribute modestly to the unidentified biological function, and must occur in a specific combination in order to confer a selective advantage. Indeed, two coding SNPs (rs695871 and rs695872) are found within 200 bp of the CAG repeat in exon 1. Our data demonstrated that a-5 is significantly associated with the rs695871 G allele, which codes for Val versus Leu. In addition, as CH-1 spans a large region (~70 kb), including the intergenic sequences between SCA2, and the 5′-UTR and the intronic sequences of SCA2, polymorphisms that might induce alternative splice sites or regulate differential expression levels of SCA2 or the adjacent BRAP gene are potential candidates as well and need to be investigated further. Other genes in this ~1.2 Mb interval cannot be completely excluded as targets of selection even though they were not detected by the REHH analysis. For example, the ALDH2 gene has been suspected to be selected for its hypothetical functions in resistance to endemic disease in east Asia [25]. Nevertheless, SCA2 is in the center of the mapped window and remains the strongest candidate gene for selection.
The uncertainty of the precise biochemical mechanism for the selection illustrates the power of the statistical genetic methods used for the identification of a “biological signal” from this locus. We expect other genomic regions to be identified in this way, and eventually to correlate the results of this kind of study with our growing understanding of biological processes.