Mutations in the human gene ALMS1 result in Alström Syndrome, which presents with early childhood obesity and insulin resistance leading to Type 2 diabetes. Previous genomewide scans for selection in the HapMap data based on linkage disequilibrium and population structure suggest that ALMS1 was subject to recent positive selection. Through a detailed population genomic analysis of existing genomewide data sets and new resequencing data obtained in geographically diverse populations, we find that the signature of selection at ALMS1 is considerably more complex than what would be expected for an idealized model of a selective sweep acting on a newly arisen advantageous mutation. Specifically, we observed three highly divergent and globally dispersed haplogroups, two of which carry a set of seven derived nonsynonymous single nucleotide polymorphisms that are nearly fixed in Asian populations. Our data suggest that the interaction of human demographic history and positive selection on standing variation in Eurasian populations approximately 15 thousand years ago parsimoniously explains the spectrum of extant ALMS1 variation. These results provide new insights into the evolutionary history of ALMS1 in humans and suggest that selective events identified in genomewide scans may be more complex than currently appreciated.
The recent availability of dense catalogs of human genetic variation such as the HapMap (International HapMap Consortium 2005) and Perlegen (Hinds et al. 2005) data sets has facilitated global inferences of positive selection. Numerous genomewide scans have identified putative targets of positive selection with patterns of variation that show significant deviations from neutral expectations (Sabeti et al. 2002; Kelley et al. 2006; Voight et al. 2006; Wang et al. 2006; Zhang et al. 2006; Kimura et al. 2007; Tang et al. 2007). Although these analyses have provided considerable insight into how often and where in the genome positive selection has shaped extant patterns of human genetic variation, a deeper understanding of human evolutionary history will require in-depth follow-up studies of “outlier loci” identified in genomewide scans (Biswas and Akey 2006).
To this end, we have performed a detailed population genomic analysis of ALMS1, which has been identified as a putative target of recent adaptive evolution in several genomewide scans for selection (International HapMap Consortium 2005; Wang et al. 2006; Kimura et al. 2007; Tang et al. 2007). Mutations in ALMS1 can lead to Alström Syndrome, a rare autosomal recessive disorder with a spectrum of phenotypes including early onset obesity, metabolic disorders, and sensory impairment (Collin et al. 2005; Li et al. 2007). Recent in vitro work demonstrates that ALMS1 is widely expressed and localizes to centrosomes and the base of cilia (Hearn et al. 2005; Arsov et al. 2006), and studies in mice confirm that ALMS1 is involved in cilia formation and function (Li et al. 2007). Alström Syndrome belongs to a growing class of human diseases, referred to as ciliopathies, that includes disorders such as nephronophthisis, Bardet–Biedl syndrome (BBS), and Meckel–Gruber syndrome (MKS) (Badano et al. 2006). Interestingly, several phenotypes, such as childhood obesity and insulin resistance, overlap between Alström Syndrome and BBS (Hildebrandt and Otto 2005), and hypomorphic mutations in MKS causing genes are associated with BBS (Leitch et al. 2008). Thus, distinct genetic perturbations to the network of proteins involved in cilia formation and function can result in overlapping and pleiotropic phenotypic anomalies.
To better understand the evolutionary history of ALMS1, we analyzed ALMS1 genotype, sequence, and haplotype data. These analyses show that ALMS1 has been subjected to recent positive selection in Eurasian populations approximately 15 thousand years ago (kya). However, unexpectedly, the signature of selection at ALMS1 is considerably more complex than what would be expected for an idealized model of selection acting on a newly arisen advantageous mutation. Rather, the interaction of human demography and positive selection on standing variation in Eurasians parsimoniously explains the spectrum of extant ALMS1 variation. In addition, by reanalyzing previously published genomewide association data, we provide evidence that ALMS1 genetic variation contributes to interindividual variation in metabolic phenotypes such as insulin and glucose levels. In summary, our results provide new insights into the evolutionary history of ALMS1 in humans, highlight the need for careful follow-up studies of candidate selection genes identified in genomewide analyses, and suggest that selective events in human populations may be more complex than currently appreciated.
We sequenced approximately 6 kb of ALMS1 in DNA samples from 91 individuals representing seven populations that were obtained from the Coriell Institute for Medical Research Cell Repositories (Camden, NJ). Coriell repository numbers for these samples are as follows: CEPH (n = 21: NA06990, NA07019, NA07348–9, NA10830–1, NA10842–5, NA10848, NA10850–4, NA10857–8, NA10860–1, and NA17201), Han Chinese of L.A. (n = 21: NA17733–NA17749, NA17752–56), Middle East (n = 10: NA17041–50), Pygmy (n = 10: NA10469–73, NA10492–96), South Africa (n = 9: NA17341–49), South America (n = 10: NA17301–10), and South East Asia (n = 10: NA17081–90). In addition, we sequenced the same regions in four nonhuman primate DNA samples from the Coriell Institute for Medical Research Cell Repositories with the following repository numbers: gorilla (Gorilla gorilla; AG05251), bonobo (Pan paniscus; AG05253), chimpanzee (Pan troglodytes; AG06939), and orangutan (Pongo pygmaeus; AG12256).
Sequencing primers were designed from published human sequence (NM_015120) with primer3 (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi) for coding and noncoding regions of ALMS1: upstream, intron 2, exon 5, intron 7, exon 8, intron 8, exon 10, and downstream (primer sequences are available upon request). We used standard polymerase chain reaction–based sequencing reactions using Applied Biosystem's Big Dye sequencing protocol on an ABI 3130xl. Sequence data were assembled using Phred/Phrap (Ewing and Green 1998; Ewing et al. 1998), and the alignments were inspected for accuracy with Consed (Gordon et al. 1998, 2001). Polymorphisms were identified with PolyPhred 4.0 (Bhangale et al. 2006). All polymorphic sites were manually verified and confirmed by sequencing the opposite strand. Genotype data from 210 unrelated individuals were obtained from the HapMap project (Release 22 NCBI Build 36) (International HapMap Consortium 2005).
We calculated r² between all pairwise combinations (Hill 1968) of markers in ALMS1 and approximately 1 Mb of flanking sequence (both 5′ and 3′) using HapMap genotype data. Estimates of r² were obtained from Haploview (Barrett et al. 2005) for all markers with a minor allele frequency ≥5% and used in subsequent analyses. To evaluate and compare the distribution of LD within and between the HapMap CEU, YRI, and ASN samples, and how LD decays as a function of distance from ALMS1, we calculated a statistic related to ZnS (Kelly 1997). Specifically, we calculated the average r² between all pairwise comparisons of single nucleotide polymorphisms (SNPs) in bin 1 and bin 2:
$$\overline{r^2} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} r^2_{ij},$$
where $r^2_{ij}$ is the r² between the ith SNP in bin 1 and the jth SNP in bin 2, $n_1$ is the number of SNPs in bin 1, and $n_2$ is the number of SNPs in bin 2. Here, $n_2$ represents the number of SNPs in ALMS1 and $n_1$ the number of SNPs in nonoverlapping 50-kb windows up and downstream of ALMS1.
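As a minimal sketch of this bin-averaged LD statistic, the following Python function averages precomputed r² values over every pair made of one SNP from a flanking window and one SNP from the gene. The function name, the dictionary layout, and the toy SNP labels and r² values are assumptions made for illustration; they are not taken from Haploview output or from the data analyzed here.

```python
from itertools import product

def mean_pairwise_r2(r2, bin1, bin2):
    """Average r^2 over all pairs (i, j) with i in bin1 and j in bin2.

    r2   : dict mapping frozenset({snp_a, snp_b}) -> r^2 (hypothetical layout)
    bin1 : SNP identifiers in the flanking window (n1 SNPs)
    bin2 : SNP identifiers in the gene (n2 SNPs)
    """
    total = sum(r2[frozenset((i, j))] for i, j in product(bin1, bin2))
    return total / (len(bin1) * len(bin2))

# Toy, invented values: two flanking SNPs and three genic SNPs.
flank = ["f1", "f2"]
gene = ["g1", "g2", "g3"]
toy_r2 = {
    frozenset(("f1", "g1")): 0.9, frozenset(("f1", "g2")): 0.8,
    frozenset(("f1", "g3")): 0.7, frozenset(("f2", "g1")): 0.3,
    frozenset(("f2", "g2")): 0.2, frozenset(("f2", "g3")): 0.1,
}
window_ld = mean_pairwise_r2(toy_r2, flank, gene)
```

Repeating the call for each 50-kb window and plotting the result against distance from the gene would give one LD decay curve per population sample.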
Haplotypes were reconstructed in the HapMap and sequence data with Phase 2.1.1 (Stephens et al. 2001; Stephens and Scheet 2005) using 10 iterations to confirm consistency among runs, and the run with the best average goodness-of-fit was used. We defined Haplogroup A (ancestral) and Haplogroup D (derived) based on the allelic state of seven nonsynonymous SNPs (nsSNPs) (rs3813227, rs6546837, rs6546838, rs6724782, rs6546839, rs2056486, and rs10193972) and Haplogroup D1 and Haplogroup D2 based on the allelic state of two additional SNPs (rs6730785 and rs7598901). The ancestral allele was determined by the chimpanzee sequence.
We used Neighbor from the software package PHYLIP 3.6 (Felsenstein 1989, 2005) to construct unrooted phylogenetic trees on phased sequence (Dnadist was used to calculate the pairwise distance matrix) and HapMap data (average pairwise distances were calculated for the distance matrix). In both cases, we removed recombinant haplotypes occurring among the seven aforementioned nsSNPs (three unique haplotypes/four total haplotypes from the sequence data and two unique haplotypes/three total haplotypes from the HapMap data). We visualized the Neighbor-Joining trees with the APE package in R (http://cran.r-project.org/web/packages/ape/).
We used the method described by Thomson et al. (2000) to estimate the TMRCA on our phased sequence data, as this method does not utilize any particular population model. Analyses were performed both on all haplotypes and on only haplotypes with no recombination among the seven nsSNPs, and we found minimal effects on the estimated TMRCA (data not shown). We used the average divergence between chimpanzee and human sequences divided by two times the estimated divergence time of 6 million years, which we calculated to be 36/(2 × 6,000,000), or 3 × 10⁻⁶, for our sequence mutation rate. Briefly, to estimate the TMRCA, we used the simple estimate of T, the time since the MRCA (Thomson et al. 2000; Mekel-Bobrov et al. 2005):
$$\hat{T} = \frac{\sum_{i=1}^{n} x_i}{n\mu},$$
where $\hat{T}$ is the unbiased estimator of T, $x_i$ is the number of mutational differences between the ith sequence and the MRCA, n is the total number of sequences in the sample, and μ is the mutation rate. In addition, we used three additional methods to estimate the ALMS1 TMRCA (McPeek and Strahs 1999; Bahlo and Griffiths 2000; Templeton 2002), all of which yielded similarly old dates and were not significantly different from one another (data not shown).
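The two calculations in this paragraph can be sketched in a few lines of Python. The mutation rate uses the values stated above (an average of 36 human–chimpanzee differences and a 6-million-year divergence time); the per-haplotype difference counts are invented for illustration, and the function names are hypothetical.

```python
def mutation_rate(divergence, split_time_years):
    """Sequence mutation rate: divergence divided by twice the split time."""
    return divergence / (2 * split_time_years)

def tmrca_thomson(diffs_from_mrca, mu):
    """Thomson et al. (2000) estimator: mean number of differences from the MRCA, divided by mu."""
    n = len(diffs_from_mrca)
    return sum(diffs_from_mrca) / (n * mu)

mu = mutation_rate(36, 6_000_000)      # 3e-6, as stated in the text

# Invented counts of mutational differences from the inferred MRCA
# for three haplotypes (not the ALMS1 data).
toy_diffs = [3, 4, 5]
toy_tmrca_years = tmrca_thomson(toy_diffs, mu)
```

Because the estimator only averages mutation counts along each lineage back to the MRCA, it needs no demographic model, which is the property the text relies on.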
We calculated three standard neutrality tests of the site frequency spectrum: Tajima's D (Tajima 1989), Fu and Li's F test (Fu and Li 1993), and Fay and Wu's H test (Fay and Wu 2000). We used the nonhuman primate sequence to establish the ancestral allele for Fay and Wu's H test. To interpret summary statistics derived from the resequencing data, we performed additional coalescent simulations with the program ms (Hudson 2002) using previously inferred demographic parameters that were found to best fit genomic patterns of variation in the HapMap YRI, CEU, and ASN samples (Schaffner et al. 2005). The exact parameters can be found in table 1 of Schaffner et al. (2005), and involve multiple bottlenecks, population expansions, population splitting, recombination, and gene conversion. The only exception is that we did not include migration following population splitting as Schaffner et al. (2005) found these parameters resulted in only slightly worse fitting models, but the modest increase in levels of population differentiation resulted in more accepted simulation replicates to analyze. The ms command line argument for this model is available upon request. We used a rejection sampling method (Beaumont et al. 2002) to account for the a priori observation of ALMS1 population structure and a total of 1 × 10⁷ simulations were performed. Initially, we attempted to accept data sets if they matched observed levels of differentiation in our resequencing data (five or more SNPs with an FST ≥ 0.80 between African and Han Chinese samples and two or more SNPs with an FST ≥ 0.52 between African and CEPH samples). However, none of the 10 million simulation replicates met these criteria, indicating that such levels of structure are incompatible with a neutral demographic model that is consistent with major features of human genomic variation (Schaffner et al. 2005).
Thus, for computational tractability, we relaxed the acceptance criteria to one or more SNPs with a pairwise FST ≥ 0.80 and 0.52 between African and Han Chinese samples and African and CEPH samples, respectively. Using these thresholds, 1,405 data sets of the 10 million simulations were accepted and analyzed further. In particular, we evaluated the probability of observing divergent haplotype lineages, TMRCA, and Tajima's D as or more extreme than that observed for ALMS1. In accepted data sets, we calculated TMRCA as described above, Tajima's D (Tajima 1989), and the average number of nucleotide differences between haplogroups carrying the derived allele at the highly differentiated SNP:
$$d_{xy} = \frac{1}{|D_1|\,|D_2|} \sum_{i \in D_1} \sum_{j \in D_2} k_{ij},$$
where $k_{ij}$ is the number of nucleotide differences between haplotypes i and j, and $D_1$ and $D_2$ denote the sets of haplotypes belonging to derived haplogroup lineages 1 and 2, respectively. In the simulated data sets, $D_1$ and $D_2$ were chosen so as to maximize $d_{xy}$.
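The between-haplogroup divergence described here can be illustrated with a short Python sketch that counts nucleotide differences for every pair of haplotypes drawn from two groups and averages them. The haplotype strings are invented toy data, and the function names are hypothetical; how the two groups are chosen in simulated data (to maximize the statistic) is not shown.

```python
def n_differences(hap_a, hap_b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(a != b for a, b in zip(hap_a, hap_b))

def dxy(group1, group2):
    """Average number of differences over all pairs, one haplotype from each group."""
    total = sum(n_differences(x, y) for x in group1 for y in group2)
    return total / (len(group1) * len(group2))

# Invented four-site haplotypes standing in for two derived lineages.
toy_d1 = ["AAGT", "AAGC"]
toy_d2 = ["TTGT", "TTCT"]
toy_dxy = dxy(toy_d1, toy_d2)
```

Calling `dxy` with the same group as both arguments counts each haplotype against itself, so within-group diversity needs a separate function that loops over distinct pairs only.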
We defined haplogroups in the HGDP–CEPH data set (Li et al. 2008) with six SNPs: Haplogroups A and D were defined based on alleles of four genotyped nsSNPs (rs3813227, rs6546838, rs2056486, and rs10193972), and Haplogroups D1 and D2 were further defined by two additional genotyped SNPs (rs2037814 and rs3820700). Recombinant haplotypes among Haplogroups A and D were excluded from the haplotype frequency map, whereas recombinants between Haplogroups D1 and D2 were included and defined by the allelic status of rs10193972. In order to avoid any single population sample falling below a sample size of 10, we combined the Bantu SE and SW individuals into one Bantu South population.
We developed a simple heuristic statistic to determine how unusual the geographic distribution of ALMS1 genetic variation is relative to the rest of the genome using all autosomal HGDP–CEPH data that had less than 10% missing data. Specifically, for the ith SNP, we define the global deviance score, GDi, as follows:
where FSTi12, FSTi13, and FSTi23 are the unbiased pairwise FST values (Weir 1996) between East Asian and African samples, East Asian and American samples, and African and American samples, respectively, for the ith SNP, and the remaining term is the average allele frequency across samples weighted by sample size. In words, the global deviance score is large when levels of differentiation between Asian and African samples and between Asian and American samples are greater than the genomewide average and levels of differentiation between African and American samples are less than the genomewide average. We included the Bantu (North and South), Biaka, Mbuti, Mandenka, Yoruba, and San in the African sample; the Colombian, Karitiana, Maya, Pima, and Surui in the American sample; and the Cambodian, Dai, Daur, Han (North and South), Hezhen, Japanese, Lahu, Miaozu, Mongola, Naxi, Oroqen, She, Tu, Tujia, Xibo, Yakut, and Yizu in the East Asian sample.
We used the expression analysis tool (Thomas et al. 2006) to identify enriched PANTHER Pathways, Biological Processes, and Molecular Functions (Thomas et al. 2003) among genes in the top 0.1% of the distribution of GD scores. Pathways and terms with fewer than five genes were excluded from further analysis, and Bonferroni corrections were used to correct for multiple testing.
We estimated the time since the selective sweep for the derived class of ALMS1 lineages by analyzing the amount of nucleotide diversity that has accumulated on the selected haplotypes as described in Akey et al. (2004), where the time back to the selective sweep, t, can be estimated by S/(nμ), where S is the number of segregating sites, n is the number of haplotypes included, and μ is the neutral mutation rate of the locus. For ALMS1 derived haplogroups, n = 120, S = 13, and μ = 1.75 × 10⁻⁴. Note that this calculation should be treated as a rough approximation because it assumes a starlike phylogeny, which ALMS1 violates.
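A short worked example of this estimate, using the values of S, n, and μ given above, is sketched below in Python. The conversion to calendar years assumes a generation time of 25 years, which is an assumption of this sketch rather than a value stated in this section; the function name is hypothetical.

```python
def sweep_age_generations(segregating_sites, n_haplotypes, mu):
    """Time since the sweep, t = S / (n * mu), under a starlike genealogy."""
    return segregating_sites / (n_haplotypes * mu)

S = 13            # segregating sites on the derived haplotypes
n = 120           # number of derived haplotypes
mu = 1.75e-4      # neutral mutation rate of the locus, as given in the text

t_generations = sweep_age_generations(S, n, mu)   # about 619 generations

YEARS_PER_GENERATION = 25   # assumed generation time, not from the text
t_years = t_generations * YEARS_PER_GENERATION    # about 15,500 years
```

Under that assumed generation time, the result is about 15.5 thousand years.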
We used the following simple deterministic formula to estimate the selection coefficient, s (Gillespie 1998):
where p is the frequency of the selected allele, q is the frequency of the nonselected allele, and h is the heterozygous effect. We assumed an initial frequency of 10% (a conservatively high estimate based on current frequencies in African samples) and a final frequency of 95% (a conservatively low estimate based on current frequencies in East Asian samples) for the putatively selected allele. The range of s reported in the main text is based on varying the age of the selective event (from 500 to 1,000 generations) and heterozygous effects (h = 0, 0.5, and 1).
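To illustrate how a selection coefficient can be backed out of start and end frequencies, the Python sketch below iterates a standard one-locus, two-allele deterministic recursion and finds s by bisection. It is a sketch under stated assumptions, not the Gillespie (1998) formula used here: it assumes relative fitnesses of 1 + s, 1 + hs, and 1 for the three genotypes, and the function names are hypothetical.

```python
def final_frequency(p0, s, h, generations):
    """Iterate the deterministic recursion for the selected allele frequency.

    Assumed fitnesses: 1 + s (selected homozygote), 1 + h*s (heterozygote),
    1 (nonselected homozygote).
    """
    p = p0
    for _ in range(generations):
        q = 1.0 - p
        w_bar = p * p * (1 + s) + 2 * p * q * (1 + h * s) + q * q
        p = p * (p * (1 + s) + q * (1 + h * s)) / w_bar
    return p

def solve_s(p0, p_final, h, generations, lo=0.0, hi=1.0, tol=1e-8):
    """Bisection on s: the final frequency increases with s on [0, 1]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if final_frequency(p0, mid, h, generations) < p_final:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Frequencies and parameter grid taken from the text: 10% -> 95%,
# 500 or 1,000 generations, h = 0, 0.5, or 1.
estimates = {
    (h, t): solve_s(0.10, 0.95, h, t)
    for h in (0.0, 0.5, 1.0)
    for t in (500, 1000)
}
```

The dictionary `estimates` holds one value of s per combination of h and sweep age, so the spread across that grid can be read off directly. Different fitness parameterizations would shift the individual values.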
ALMS1 was initially identified as a potential target of positive selection based on large allele frequency differences among populations for six nonsynonymous SNPs (nsSNPs) (International HapMap Consortium 2005). In order to better understand how unusual levels of population structure are in the ALMS1 region relative to the rest of the genome and fine-scale map the signature of selection, we first performed a genomewide analysis of allele frequency differences in the HapMap data. Specifically, we calculated the average pairwise FST (Weir 1996) in nonoverlapping 100-kb windows using HapMap Phase II data (autosomal regions only) among the Yoruba (YRI) individuals from Ibadan, Nigeria (n = 60), CEPH (CEU) individuals with ancestry from northern and western Europe (n = 60), Japanese (JPT) individuals from Tokyo, Japan (n = 45), and Han Chinese (CHB) individuals from Beijing, China (n = 45). In all of the analyses, we combined the JPT and CHB individuals into a single Asian sample (ASN). Windows containing ALMS1 were in the extreme 99th percentile of the empirical distribution for both the ASN and YRI and CEU and YRI comparisons, and only 12 of the 28,652 windows were more differentiated than ALMS1 in the ASN and YRI comparisons.
We performed three additional analyses to determine how robust the signature of strong population structure at ALMS1 is to potential confounding variables. First, we repeated the genomewide analysis of FST on the HapMap data with window sizes in units of genetic distance (0.1 cM) estimated from fine-scale recombination rates. Second, we adjusted each window specific estimate of FST for a larger set of potential confounding variables (number of SNPs, recombination rate, GC%, and heterozygosity per 100-kb bin) by multiple regression. Finally, we performed a genomewide analysis of FST on Class A Perlegen SNPs as described above, which were discovered more uniformly and manifest less ascertainment bias relative to the HapMap SNPs (Hinds et al. 2005; Kelley et al. 2006). In all three cases, ALMS1 remained one of the most differentiated regions in the genome (results not shown), indicating that our results are robust to ascertainment bias, recombination rate heterogeneity, and additional confounding variables.
The distribution of pairwise FST among the ASN, CEU, and YRI samples for all SNPs across an approximately 800-kb region centered on ALMS1 is shown in figure 1. The largest values of FST across the region are coincident with the location of ALMS1 (fig. 1), and SNPs located immediately up and downstream of ALMS1 show markedly lower FST values. Extreme levels of population structure are found throughout ALMS1; specifically, 45 and 35 SNPs have FST values greater than the 99th percentile for the ASN and YRI and CEU and YRI samples, respectively. These highly differentiated SNPs include seven nsSNPs (fig. 2), six of which have been previously described (International HapMap Consortium 2005), and are located in exons 5, 8, and 10 of ALMS1. The derived alleles at each of the seven nsSNPs are found at 99%, 80%, and 8–9% frequency in the ASN, CEU, and YRI samples, respectively. Levels of differentiation between the ASN and CEU samples are not unusual compared with the genome at large (fig. 1).
In summary, the patterns of FST in the HapMap and Perlegen data suggest ALMS1 was a target of recent positive selection in East Asian and European populations, and indicate several plausible sites (i.e., one or more of the seven highly differentiated nsSNPs) conferring a fitness advantage. Consistent with this interpretation, previous genomewide scans of selection based on LD have also identified ALMS1 as an outlier in the ASN and CEU samples (Wang et al. 2006; Kimura et al. 2007; Tang et al. 2007). A graphical summary of the distribution of LD across the ALMS1 region in the HapMap samples is shown in figure 2.
Under simple models of genetic hitchhiking (Maynard-Smith and Haigh 1974), we would expect to find a single haplotype carrying an advantageous allele at high frequency. To test this prediction, we reconstructed haplotypes (Stephens et al. 2001; Stephens and Scheet 2005) in the HapMap samples. A visual representation of haplotypes shows a striking departure from predictions of a simple hitchhiking model, where haplotypes carrying the derived alleles at the seven highly differentiated nsSNPs exist on two distinct backgrounds (fig. 3). In addition, haplotypes carrying the ancestral allele at each of the seven highly differentiated nsSNPs exist on a background that is distinct relative to the two derived classes (fig. 3). Similar results were observed in visual representations of genotypes (results not shown), demonstrating these patterns are not an artifact of haplotype inference.
To more quantitatively assess haplotype structure at ALMS1, we constructed a Neighbor-Joining tree based on pairwise distances among haplotypes. The Neighbor-Joining tree shows three distinct haplogroups (fig. 3), formed by an initial split between haplotypes carrying the seven ancestral nsSNPs from those containing the derived nsSNPs (which we will refer to as Haplogroup A and Haplogroup D, respectively). Furthermore, there is an additional deep split among the derived haplotypes forming two distinct haplogroups (which we will refer to as Haplogroup D1 and Haplogroup D2).
The average pairwise difference (based on all 242 HapMap SNPs spanning ALMS1) between Haplogroups A and D and D1 and D2 are 104.2 and 57.6, respectively. Thus, on average, Haplogroups A and D possess alternative alleles at approximately 43% of sites and Haplogroups D1 and D2 differ at approximately 24% of sites. In contrast, the average number of pairwise differences within Haplogroups A, D, D1, and D2 are 24.0, 15.5, 7.3, and 1.5, respectively. Although marked differences exist in Haplogroup D frequency between African and non-African HapMap samples (0.99, 0.80, and 0.08 in the ASN, CEU, and YRI samples, respectively) both derived lineages are found in Africa (Haplogroups D1 and D2 exist in the YRI sample at a frequency of 6% and 2%, respectively; fig. 3). Furthermore, haplotype heterozygosities of the YRI, ASN, and CEU samples are 0.953, 0.837, and 0.700, respectively.
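The percentages quoted for between-haplogroup divergence follow from dividing the average pairwise differences by the 242 HapMap SNPs spanning ALMS1; the arithmetic, written out here only as a check on the stated values, is:

```latex
\[
\frac{104.2}{242} \approx 0.431 \quad (\text{Haplogroups A and D}),
\qquad
\frac{57.6}{242} \approx 0.238 \quad (\text{Haplogroups D1 and D2}).
\]
```

These proportions refer to the genotyped HapMap SNPs, which are all polymorphic by design, and so should not be read as per-nucleotide divergence.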
In order to examine a data set not limited by the ascertainment biases inherent in the HapMap data set, we sequenced approximately 6 kb of coding and noncoding ALMS1 in 91 globally dispersed individuals (see Methods). As shown in supplementary figure S1, Supplementary Material online, the Neighbor-Joining tree of ALMS1 sequence variation recapitulates the topology of the three divergent haplogroups consisting of Haplogroups A, D1, and D2, and derived haplogroups are present in low frequencies in African samples (0.13). Consistent with the patterns of divergence among haplogroups in the HapMap data, our estimate of the TMRCA of ALMS1 is 2,158 ± 848 kya (see Methods), which is among the oldest reported autosomal TMRCAs (Kreitman and Di Rienzo 2004).
Thus, both the HapMap and resequencing data demonstrate that the origins of Haplogroups D1 and D2 can be traced back to Africa, and these haplogroups have dramatically increased in frequency in Eurasian populations sometime after the dispersal of humans out of Africa. The large divergence among haplogroups and their global occurrence strongly argue for a model of selection acting upon standing variation, rather than on a newly arisen advantageous mutation.
Summary and standard neutrality test statistics for the resequencing data are shown in supplementary table S1, Supplementary Material online. Typically, patterns of DNA sequence variation are evaluated by determining how unusual observed values are under neutral expectations. However, this canonical approach fails to properly account for the fact that ALMS1 was not chosen at random, but rather was ascertained based on its high level of population structure. Such ascertainment biases need to be taken into account when interpreting patterns of DNA sequence variation in subsequent analyses of outlier loci (Kreitman and Di Rienzo 2004; Thornton and Jensen 2007). To this end, we used a rejection sampling approach (Beaumont et al. 2002) to explicitly control for the a priori observation of strong population structure when evaluating the probability of observing additional aspects of ALMS1 genetic variation under neutrality as described in the Methods section. For simplicity, we will focus on results from the Han Chinese and CEPH, as the calibrated model of Schaffner et al. (2005) is most appropriate for these samples.
Two interesting points emerge from the simulations. First, when ascertainment is taken into account, values of haplogroup divergence, Tajima's D, and TMRCA are either marginally significant or not significant at all in both the Han Chinese and CEPH samples. At least for Tajima's D, this result is unsurprising, as theoretical analyses have shown that tests of the site frequency spectrum have low power to detect deviations from neutrality under models of selection from standing variation (Hermisson and Pennings 2005; Przeworski et al. 2005; Barrett and Schluter 2008).
Second, table 1 illustrates the contrasting patterns of sequence characteristics between the CEPH and the Han Chinese samples. Specifically, the average pairwise difference between derived lineages is marginally significant in the Han Chinese, but not in the CEPH (table 1). In contrast, Tajima's D and the TMRCA are marginally significant in the CEPH, but not the Han Chinese. This result is due to the fact that the ancestral haplogroup is absent in the Han Chinese, but its frequency is 26% in the CEPH, raising the TMRCA of the latter. Furthermore, the presence of three common and divergent haplogroups in the CEPH leads to a modestly positive Tajima's D.
Recently, over 650,000 SNPs were genotyped in the HGDP–CEPH samples (Li et al. 2008), which consist of over 1,000 individuals from 52 populations (see Methods). In the HGDP–CEPH data, four of the ALMS1 nsSNPs described above (rs3813227, rs6546838, rs2056486, and rs10193972) were genotyped, and we used two additional genotyped SNPs (rs3820700 and rs1052161) to distinguish between Haplogroups D1 and D2. The worldwide distribution of ALMS1 haplogroups (fig. 4) reveals a particularly interesting pattern where Haplogroup D is nearly fixed in East Asian samples (98.9%), but is at considerably lower frequency in the American samples (43.0%). Similarly, the frequency of Haplogroup D1 in the American samples is extremely low (0.8%) compared with East Asian samples (24.6%). Conversely, Haplogroup A is common in the Americas (57.03%) but nearly absent in East Asia (0.01%). This geographic distribution is peculiar given that Asia was the likely source population of the Americas (Karafet et al. 1997; Mulligan et al. 2004; Goebel et al. 2008; Volodko et al. 2008). The simplest explanation for these data is that Haplogroups A and D were both present in Asia before the founding of the Americas, but Haplogroup D dramatically increased in frequency in East Asia sometime after the colonization of the Americas 15–20 kya (Karafet et al. 1997; Mulligan et al. 2004; Goebel et al. 2008; Volodko et al. 2008). The caveats to this interpretation are that the HGDP–CEPH samples are not ideally suited to test models for the peopling of the Americas, and that the SNPs typed in these samples are subject to ascertainment bias that is difficult to account for.
To evaluate how unusual the worldwide distribution of ALMS1 allele frequency variation is relative to the rest of the genome, and thereby gain insight into whether purely neutral processes such as genetic drift and serial founder effects (Edmonds et al. 2004; Klopfstein et al. 2006; Hallatschek and Nelson 2008) can account for patterns of ALMS1 variation, we analyzed 643,884 SNPs (see Methods) genotyped in the HGDP–CEPH panel (Li et al. 2008). Specifically, we defined a simple heuristic statistic, which we refer to as the global deviance score (see Methods), to capture the worldwide frequency distribution of ALMS1. Seven ALMS1 SNPs rank in the top 50 SNPs (99.99th percentile). Interestingly, 27 of the top 50 SNPs are located in regions of the genome that have previously been implicated as targets of adaptive evolution (supplementary table S2, Supplementary Material online; see also Wang et al. 2006; Frazer et al. 2007; Kimura et al. 2007; Tang et al. 2007). In addition, genes in the 99.9th percentile of the empirical distribution of global deviance scores are significantly enriched (Bonferroni corrected P < 0.05) for particular PANTHER Pathways, Biological Processes, and Molecular Functions (supplementary table S3, Supplementary Material online). Of particular interest is the observation that genes involved in carbohydrate metabolism (including ALMS1) are significantly enriched among the top 0.1% of loci (supplementary table S3, Supplementary Material online), consistent with previous genomewide scans for selection (Kelley and Swanson 2008), indicating this class of genes has been particularly important in the recent evolutionary history of East Asian populations.
In short, the geographic distribution of ALMS1 haplogroup frequencies in the HGDP–CEPH samples further supports a model of selection from standing variation. In particular, the presence of Haplogroup A at high frequency in the American samples combined with its extremely low frequency in East Asia, suggests that the ancestral haplogroup was present at an appreciable frequency in Asia prior to the colonization of the Americas, and subsequently driven to near extinction as selection promoted the rapid increase in Haplogroup D frequency.
ALMS1 possesses many anomalous patterns of genetic variation such as extensive population structure, including a cadre of seven nsSNPs, three divergent haplogroup lineages, and a peculiar spatial distribution in geographically diverse populations. We have shown that these characteristics are inconsistent with purely neutral explanations. However, our data are equally inconsistent with simple models of positive selection acting on a newly arisen advantageous mutation (Maynard-Smith and Haigh 1974). Rather, our results support a model of positive selection acting on standing variation in Eurasian populations. In this model, one or more polymorphisms on Haplogroups D1 and D2, which are found at low frequency in the African samples we analyzed, became adaptive following their dispersal out of Africa and rapidly increased in frequency in Eurasians. Furthermore, by considering the geographic distribution of haplogroup frequencies, we are able to narrow down the likely time frame of selection to be either concurrent with or subsequent to the colonization of the Americas 15–20 kya. This interpretation is consistent with our estimate of the time since the selective sweep on the derived lineages of 15.5 kya (see Methods).
A particularly interesting feature of ALMS1 is its old and divergent haplogroup lineages. After taking into account levels of population structure, the estimated TMRCA and average pairwise divergence among haplogroups are not unusual (table 1), suggesting these characteristics occur with appreciable frequency in highly structured regions of the genome (see also Cornejo and Escalante 2006; Garrigan and Hammer 2006). We note, however, that our analyses have primarily focused on elucidating the recent evolutionary history of ALMS1, which shows compelling evidence for recent directional selection acting on preexisting variation in Eurasian populations; additional studies will be necessary to better delimit the contribution of additional models, such as balancing selection, to the long-term evolutionary history of ALMS1. Indeed, ALMS1 possesses higher levels of LD in the YRI relative to the genome-at-large (data not shown), a finding that is surprising given its ancient TMRCA, suggesting some form of nonneutral evolution, such as balancing or frequency-dependent selection, in Africa.
We estimate the selection coefficient, s, for ALMS1 to be approximately 0.01–0.05, which is commensurate with magnitudes of selection observed for genes underlying lactase persistence (LCT, s = 0.01–0.05; Bersaglieri et al. 2004; Enattah et al. 2007; Tishkoff et al. 2007) and resistance to malaria (G6PD, s = 0.02–0.05; Tishkoff et al. 2001). Thus, the estimated strength of selection for ALMS1 is among the strongest identified in humans, which raises the question of what historical selective pressure acted on ALMS1 genetic variation. Although it is clear from Alström Syndrome patients that ALMS1 mutations can influence a spectrum of phenotypes, including obesity, type 2 diabetes, and metabolic disorders, the phenotypic consequences of nonsyndromic variation are unknown.
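To give a feel for what selection coefficients in this range imply, the following minimal sketch computes how many generations a favored allele would need to rise between two frequencies under a simple deterministic model of genic selection, in which the odds p/(1 − p) grow by a factor of (1 + s) each generation. This is not the estimation procedure described in the Methods. The starting and ending frequencies (0.05 and 0.95) and the generation time of 25 years are hypothetical values chosen only for illustration, and the model ignores drift, migration, and dominance.

```python
import math


def generations_to_reach(p_start: float, p_end: float, s: float) -> float:
    """Generations for an allele to rise from p_start to p_end under genic selection.

    Assumes a deterministic model in which the allele's odds p / (1 - p) are
    multiplied by (1 + s) every generation (no drift, no migration).
    """
    if not (0.0 < p_start < p_end < 1.0):
        raise ValueError("require 0 < p_start < p_end < 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    log_odds_gain = (math.log(p_end / (1.0 - p_end))
                     - math.log(p_start / (1.0 - p_start)))
    return log_odds_gain / math.log(1.0 + s)


YEARS_PER_GENERATION = 25     # assumption, not taken from the text
P_START, P_END = 0.05, 0.95   # hypothetical frequencies, illustration only

for s in (0.01, 0.05):        # bounds of the range of s quoted in the text
    gens = generations_to_reach(P_START, P_END, s)
    kyears = gens * YEARS_PER_GENERATION / 1000
    print(f"s = {s:.2f}: {gens:.0f} generations, ~{kyears:.1f} ky")
```

Under these assumed inputs the rise takes roughly 3 ky for s = 0.05 and roughly 15 ky for s = 0.01. The numbers shift with the assumed frequencies and generation time, so they indicate only the order of magnitude.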
To explore the role of ALMS1 in metabolic phenotypes further, we reanalyzed the results from a number of genomewide association studies for type 2 diabetes ('t Hart et al. 2003; Patel et al. 2006; Wellcome Trust Case Control Consortium 2007; Saxena et al. 2007), none of which implicate the ALMS1 region. However, 18 metabolic traits were measured in Saxena et al. (2007) that were not extensively discussed in the original publication. We obtained association data from this study to test the hypothesis that ALMS1 genetic variation is associated with insulin- or glucose-related phenotypes, given the observed clinical manifestations of individuals with Alström Syndrome. Interestingly, in nondiabetic controls, ALMS1 SNPs show nominal levels of association to five insulin- and glucose-related phenotypes (supplementary table S3, Supplementary Material online). The strongest association was observed between rs7598660 and 2-h insulin levels (P = 1.38 × 10⁻⁴; supplementary fig. S3, Supplementary Material online), which ranked as the 43rd most significant association among the approximately 380,000 genotyped SNPs. Although these results should be interpreted with caution because of the modest statistical evidence supporting them, which does not attain genomewide significance, they suggest that nonsyndromic variation in ALMS1 may contribute to interindividual variation in the same metabolic phenotypes that are perturbed in Alström Syndrome patients. Additional studies will ultimately be necessary to more clearly define the functional and phenotypic consequences of ALMS1 genetic variation, which in turn will inform inferences about the historical selective pressures acting on this genomic region.
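The caution about genomewide significance can be made concrete with a short calculation. The sketch below compares the reported P value with a Bonferroni threshold for roughly 380,000 tests, and computes how many SNPs would be expected to reach that P value for a single trait if every test were null. The Bonferroni correction and the 0.05 error rate are assumptions made here for illustration; the text does not state which significance criterion was applied, and the expected count treats SNPs as independent, which linkage disequilibrium violates.

```python
# Values taken from the text.
N_SNPS = 380_000       # approximate number of genotyped SNPs
P_OBSERVED = 1.38e-4   # rs7598660 versus 2-h insulin levels

# Assumption for illustration: a conventional family-wise error rate.
ALPHA = 0.05


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P value cutoff that controls the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_null_hits(p_value: float, n_tests: int) -> float:
    """Expected number of tests with P <= p_value if all tests are null.

    Assumes independent tests with uniformly distributed null P values.
    """
    return p_value * n_tests


threshold = bonferroni_threshold(ALPHA, N_SNPS)
expected = expected_null_hits(P_OBSERVED, N_SNPS)
attains = P_OBSERVED <= threshold

print(f"Bonferroni threshold: {threshold:.2e}")
print(f"Attains genomewide significance: {attains}")
print(f"Expected null SNPs with P <= {P_OBSERVED:.2e}: {expected:.1f}")
```

With these inputs the threshold is about 1.3 × 10⁻⁷, so the observed association falls short by roughly three orders of magnitude, and about 52 SNPs would be expected to do as well by chance under the stated assumptions.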
A closer inspection of the genomewide association results for ALMS1 also provides insight into how past adaptive evolution may influence the present-day distribution of, and susceptibility to, disease. Specifically, the strongest association between ALMS1 genetic variation and metabolic phenotypes was observed between rs7598660 and 2-h insulin levels (supplementary fig. S3, Supplementary Material online). The ancestral allele of rs7598660 is associated with higher 2-h insulin levels (i.e., greater insulin resistance), whereas the derived allele is associated with lower 2-h insulin levels (i.e., less insulin resistance; supplementary fig. S3, Supplementary Material online). As the derived allele is only present on a subset of Haplogroup D2 chromosomes (supplementary table S5, Supplementary Material online), it is unlikely that the rs7598660 polymorphism (or linked variation) was the direct target of selection; rather, it likely increased in frequency in non-African populations by hitchhiking. Therefore, geographically varying selective pressures on ALMS1 resulted in large allele frequency differences of a putative polymorphism (rs7598660 or linked variant) that influences insulin resistance, which is tangential to the primary selective force. Interestingly, the frequency of the rs7598660 derived allele among the HapMap YRI, CEU, and ASN samples is 0.008, 0.669, and 0.367, respectively, which is consistent with a higher prevalence of insulin resistance in African-Americans relative to European-Americans (Haffner et al. 1996; Reiner et al. 2007). Thus, models that attempt to place human disease into an evolutionary context (Di Rienzo and Hudson 2005; Biswas and Akey 2006) may also need to account for indirect selective effects, where susceptibility alleles are not causally related to historical selective pressures but merely go along for the ride on a selected haplotype.
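The size of these allele frequency differences can be summarized with a simple differentiation statistic. The sketch below computes Wright's F_ST for a biallelic locus as the variance in allele frequency among populations divided by p̄(1 − p̄), using the three derived allele frequencies quoted above. It weights the populations equally and applies no sample-size correction, which is a simplifying assumption and not necessarily the estimator used elsewhere in this study.

```python
from statistics import fmean


def wright_fst(freqs):
    """Wright's F_ST for one biallelic locus: Var(p) / (p_bar * (1 - p_bar)).

    Populations are weighted equally and no sample-size correction is applied.
    """
    freqs = list(freqs)
    p_bar = fmean(freqs)
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("locus is monomorphic across populations")
    var_p = fmean([(p - p_bar) ** 2 for p in freqs])
    return var_p / (p_bar * (1.0 - p_bar))


# Derived allele frequencies of rs7598660 as quoted in the text.
derived_freq = {"YRI": 0.008, "CEU": 0.669, "ASN": 0.367}

fst_all = wright_fst(derived_freq.values())
fst_yri_ceu = wright_fst([derived_freq["YRI"], derived_freq["CEU"]])

print(f"F_ST across YRI, CEU, ASN: {fst_all:.3f}")
print(f"F_ST between YRI and CEU:  {fst_yri_ceu:.3f}")
```

Across the three samples this gives a value of roughly 0.32, with a larger value for the YRI and CEU pair alone, in line with the large frequency differences described in the text.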
In summary, we have shown that the evolutionary history of ALMS1 is considerably more complex than might have been expected based on its identification as an outlier locus in genomewide scans for selection, involving the interaction of demographic history, geographically restricted selection, and selection from standing variation. An emerging question in the evolution of natural populations is to what extent selection acts on new or preexisting mutations (Orr and Betancourt 2001; Hermisson and Pennings 2005; Przeworski et al. 2005; Barrett and Schluter 2008). This issue has important implications for the evolutionary trajectory of populations (Hermisson and Pennings 2005) and, more practically, for the types of signatures to pursue in the search for selected loci (Przeworski et al. 2005). A number of examples in humans have been described that are consistent with selection from standing variation, such as FY (Hamblin et al. 2002), LCT (Tishkoff et al. 2007; Enattah et al. 2008), and NAT2 (Magalon et al. 2008). Thus, we suspect that when additional candidate selection genes are examined with more scrutiny, selection from standing variation will be found to be a common mechanism of adaptation, driven by the rapid dispersal of humans into new environments during the last 60 ky.
We thank members of the Akey Lab and Willie Swanson for helpful discussions and comments on the manuscript. This work was supported by a research grant (1R01GM076036-01A1) from the NIH and a Sloan Fellowship in Computational Biology to J.M.A. and by an NHGRI Interdisciplinary Training in Genomic Sciences grant (HG00035) to L.B.S.