Balancing selection is potentially an important biological force for maintaining advantageous genetic diversity in populations, including variation that is responsible for long-term adaptation to the environment. By serving as a means to maintain genetic variation, it may be particularly relevant to maintaining phenotypic variation in natural populations. Nevertheless, its prevalence and specific targets in the human genome remain largely unknown. We have analyzed the patterns of diversity and divergence of 13,400 genes in two human populations using an unbiased single-nucleotide polymorphism data set, a genome-wide approach, and a method that incorporates demography in neutrality tests. We identified an unbiased catalog of genes with signatures of long-term balancing selection, which includes immunity genes as well as genes encoding keratins and membrane channels; the catalog also shows enrichment in functional categories involved in cellular structure. Patterns are mostly concordant in the two populations, with a small fraction of genes showing population-specific signatures of selection. Power considerations indicate that our findings represent a subset of all targets in the genome, suggesting that although balancing selection may not have an obvious impact on a large proportion of human genes, it is a key force affecting the evolution of a number of genes in humans.
Balancing selection maintains favorable genetic diversity in populations by a variety of mechanisms, including overdominance and fluctuating selection (e.g., frequency-dependent selection). In the case where one locus with two alleles displays overdominance, the higher fitness of heterozygotes maintains both alleles in the population, eventually leading to an equilibrium allele frequency that maximizes the mean fitness of the population. Under frequency-dependent selection, the fitness associated with an allele varies with its frequency, giving rise to an equilibrium with an enhanced number of alleles at intermediate frequencies (when selection favors intermediate alleles) or low frequencies (in cases of rare allele advantage) (see Richman 2000). Classical examples of balancing selection include the β-globin gene in humans (Pasvol et al. 1978), the major histocompatibility complex (MHC) system in mammals (Hughes and Nei 1988; Takahata and Nei 1990), the disease-response genes (R-genes) in plants (Stahl et al. 1999), the self-incompatibility system in plants (Wright 1939), and the complementary sex determination of haplodiploid species (Yokoyama and Nei 1979; Cho et al. 2006).
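The overdominance equilibrium described above can be illustrated with the textbook one-locus, two-allele model. The sketch below is only an illustration: the fitness parameterization (1 - s, 1, 1 - t for the three genotypes), the values of s and t, and the function names are assumptions introduced here and are not taken from the analyses in this study.

```python
def mean_fitness(p, s, t):
    """Mean population fitness when allele A has frequency p.

    Assumed genotype fitnesses: AA = 1 - s, Aa = 1, aa = 1 - t.
    """
    q = 1.0 - p
    return p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)


def next_freq(p, s, t):
    """Frequency of allele A after one generation of viability selection."""
    q = 1.0 - p
    return (p * p * (1.0 - s) + p * q) / mean_fitness(p, s, t)


def iterate(p0, s, t, generations=2000):
    """Apply the selection recursion repeatedly, starting from p0."""
    p = p0
    for _ in range(generations):
        p = next_freq(p, s, t)
    return p


# Hypothetical selection coefficients against the two homozygotes.
s, t = 0.1, 0.3
p_hat = t / (s + t)  # analytical equilibrium frequency of A

# Both a rare and a nearly fixed allele converge to the same equilibrium.
print(p_hat, iterate(0.01, s, t), iterate(0.99, s, t))
```

Under these assumptions both starting points approach t / (s + t), and the mean fitness evaluated at that frequency is higher than at neighboring frequencies, which is the sense in which the equilibrium maximizes mean fitness.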
By maintaining functional genetic variation in populations, balancing selection is medically relevant. Association between balanced polymorphisms and pathology has been proposed for several human diseases, including the β-globin gene and sickle cell anemia (Pasvol et al. 1978), CFTR and cystic fibrosis (Gabriel et al. 1994; Pier et al. 1998), and PAH and phenylketonuria (Woolf et al. 1967). This is not surprising because, at equilibrium frequencies, a substantial portion of the population is homozygous and carries a deleterious genotype. Balanced polymorphisms are primary candidates for the common disease–common variant hypothesis because the pattern of natural selection results in elevated frequencies of alleles that, in the homozygous state, may reduce fitness and contribute to disease.
The influence of balancing selection in shaping the levels of diversity in natural populations has long been a subject of debate. Once thought to be the primary driver that maintains the substantial genetic variability observed in populations (Lewontin and Hubby 1966), balancing selection came to be considered rare when polymorphism levels could be explained, without the need of selection, by the neutral theory of evolution (Kimura 1968). It has been proposed that balancing selection cannot be common due to the associated genetic load (the population burden that derives from the reduced fitness of less-favorable homozygotes, maintained by selection that favors advantageous heterozygotes); but the relevance of such arguments in predicting the prevalence of selection has been debated (see Gillespie 1991), and today the debate over the role of selection (and balancing selection) in maintaining polymorphism remains open (Gillespie 1991).
Recent genome-wide scans of selection have dramatically improved our understanding of the influence of purifying and directional selection in shaping the evolution of genes, particularly in humans (Clark et al. 2003; Akey et al. 2004; Bustamante et al. 2005; Chimpanzee Genome and Analysis Consortium 2005; Nielsen, Bustamante, et al. 2005; Sabeti et al. 2006; Voight et al. 2006; Williamson et al. 2007; Barreiro et al. 2008). Such advances have not been applied in a systematic genome-wide fashion to balancing selection, and current biological understanding of balancing selection is mostly limited to a few loci localized by candidate gene approaches (e.g., Hughes and Nei 1988, 1989; Bamshad et al. 2002; Baum et al. 2002; Wooding et al. 2004; Cork and Purugganan 2005; Kroymann and Mitchell-Olds 2005; Tan et al. 2005; Cho et al. 2006; reviewed in Bamshad and Wooding 2003; Mitchell-Olds et al. 2007). This is mainly due to the difficulties associated with the detection of this type of selection at a whole-genome level. The genomic signal of recent balancing selection (extended linkage disequilibrium [LD]) is detectable by LD-based methods (Voight et al. 2006; Wang et al. 2006), but it is indistinguishable from incomplete sweeps of positive selection. The signal of long-term balancing selection is specific (excess of polymorphism) but narrow due to the long-term effects of recombination (Hudson and Kaplan 1988; Charlesworth et al. 1997). Therefore, most available data sets (with low and constant single-nucleotide polymorphism [SNP] density) have little power to detect the localized signals of long-term balancing selection. As a consequence, previous efforts have failed to detect convincing targets in the human genome (Asthana et al. 2005; Bubb et al. 2006).
Here, we present the first concerted effort to detect genes undergoing balancing selection across the genome in human populations. We use a data set of unascertained SNPs and apply a method that contrasts patterns of polymorphism in each gene to the rest of the genome as well as to neutral expectations. Because the timing and type of selection affect its genomic signature, we focus on the identification of genes with strong signals of long-term balancing selection maintaining an excess of intermediate-frequency variants. We find a small but strongly supported set of genes with signatures of selection, providing an unbiased catalog of candidate targets of balancing selection in the human genome.
Analyses were performed using polymorphism and divergence data obtained from a complete survey of coding variability in 13,400 human RefSeq genes by direct sequencing of all their well-annotated exons in 39 human subjects (19 African Americans [AA] and 20 European Americans [EA]). The data are described in Bustamante et al. (2005), and a strict bioinformatics pipeline ensured true homology and the use of only well-supported SNPs, as described in Boyko et al. (2008). Substantial effort was taken to avoid biological and technical confounding factors. The original bioinformatics pipeline involved reciprocal Blast searches to avoid misalignments and required a unique high-quality match to the public human chimpanzee sequence PanTro2 (Chimpanzee Sequence and Analysis Consortium 2005) (supplementary Methods, Supplementary Material online). Also, only genes with unique products with in silico polymerase chain reaction (http://genome.ucsc.edu) were used. This process checks for multiple genomic matches of the amplification primers and detects cases of putative nonspecific amplification. Finally, extreme (significant) genes were extensively checked for the presence of close paralogs, including segmental duplications, through BLAT searches of the March 2006 Human Genome Sequence Assembly and test of involvement in segmental duplications (Human Segmental Duplication Database [Cheung et al. 2003]). Fixed differences with respect to chimpanzee and ancestral state of human SNPs were assessed by comparison with PanTro2 chimpanzee reference sequence (Chimpanzee Sequence and Analysis Consortium 2005).
A total of 4,877 genes had at least ten informative sites (polymorphic or fixed relative to chimpanzee) and were further considered. This condition filtered out genes that lacked sufficient information for a valid test without biasing the data set. Also, all genes had to contain at least one polymorphic site for neutrality tests to be performed. These data do not suffer from ascertainment bias, have power to detect the localized signals of long-term balancing selection, and are expected to contain the majority of common variants in these populations. In short, this is a particularly well-suited data set for the detection of balancing selection.
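The two inclusion conditions above (at least ten informative sites, and at least one polymorphic site) can be written as a simple filter. This is a minimal sketch; the `GeneCounts` record, its field names, and the example genes are hypothetical and are not part of the original pipeline.

```python
from dataclasses import dataclass


@dataclass
class GeneCounts:
    """Per-gene site counts (hypothetical record, for illustration only)."""
    name: str
    polymorphic: int  # segregating sites in the human sample
    fixed: int        # fixed differences relative to chimpanzee


def is_testable(gene, min_informative=10):
    """Apply the two conditions described in the text."""
    informative = gene.polymorphic + gene.fixed
    return gene.polymorphic >= 1 and informative >= min_informative


genes = [
    GeneCounts("g1", polymorphic=4, fixed=8),   # 12 informative sites: kept
    GeneCounts("g2", polymorphic=0, fixed=15),  # no polymorphism: dropped
    GeneCounts("g3", polymorphic=2, fixed=3),   # too few sites: dropped
]
testable = [g.name for g in genes if is_testable(g)]
print(testable)
```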
The choice of an adequate null model is crucial for detection of selection because some demographic scenarios can mimic the effects of selection on diversity. To avoid such confounding effects, we applied a method designed to minimize the effects of demography in neutrality tests (Nielsen et al. 2009). In essence, the method uses the complete data set to estimate parameters of the past demographic history that best fit the data and considers such estimates as the null (neutral) demographic model against which neutrality is tested.
All demographic inferences were based on the complete data set (13,400 genes). Briefly, the method infers admixture proportions of individuals using a maximum likelihood (ML) method. The demographic parameters that best fit the data are estimated using an (composite) ML approach through coalescent simulations and considering the estimated admixture proportions. Demography was inferred separately for the X and autosomes due to the possibility of sex-specific differential migration. The best-fit demographic model allows for a bottleneck in Europeans upon emergence from Africa and exponential growth in both populations (fig. 2 legend and supplementary Methods, Supplementary Material online). It provides a very good fit of the data, indicating that the demographic scenario explains most of the patterns observed in the data (Nielsen et al. 2009).
For each gene, neutrality tests are then performed and their statistical significance is assessed by extensive neutral coalescent simulations under the inferred demographic scenario, with the number of segregating sites and missing data of the gene, and a recombination rate of 7.5 × 10^−4 per base pair (Nielsen, Williamson, et al. 2005). Further details can be found in supplementary Methods (Supplementary Material online); for a formal description of the method, statistical details, and discussion of the demographic inference, readers are referred to Nielsen et al. (2009). Genes showing the most unusual patterns of variability considering the demographic history of the populations are identified based on the P values from these neutrality tests. Although the demographic model inferred does not necessarily represent the exact demographic history of the populations, its application as the null model in neutrality tests represents a conservative approach: The tests will only identify genes with a sufficiently extreme departure from the overall patterns observed in the genome, according to the demographic history of the sample (assessed by neutral simulations). The influence of the demographic model was assessed by comparing the probability of the tests under the original and two alternative demographic scenarios (described in fig. 2 legend and supplementary Methods, Supplementary Material online).
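As an illustration of how a simulation-based P value of this kind can be obtained, the sketch below compares an observed test statistic with statistics computed on neutral simulations. It is not the implementation used in this study: the function name is hypothetical, the simulated values would in practice come from a coalescent simulator run under the inferred demography, and the "+1" correction is a common convention assumed here rather than a detail given in the text.

```python
def empirical_p_value(observed, simulated):
    """One-sided empirical P value.

    observed:  test statistic for the gene (larger = more extreme).
    simulated: statistics from neutral simulations under the null model.
    The +1 in numerator and denominator (an assumed convention) keeps the
    estimate away from exactly zero when no simulation is as extreme.
    """
    n_extreme = sum(1 for x in simulated if x >= observed)
    return (n_extreme + 1) / (len(simulated) + 1)


# Toy values standing in for simulated statistics.
null_stats = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(empirical_p_value(5, null_stats))
print(empirical_p_value(10, null_stats))
```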
Balancing selection may vary in timescale, strength, type (e.g., overdominance vs. frequency-dependent selection), and target (e.g., single locus vs. multiple loci). Such parameters influence the expected effect of selection in linked variation and therefore the strategies for their detection. We aim at detecting long-term balancing selection toward intermediate-frequency alleles, either due to overdominance or frequency-dependent selection, and either targeting single sites or combinations of variants in an epistatic way.
Signatures of balancing selection were detected based on two different properties of sequence variation, as the use of different attributes of the data can be more powerful than the consideration of single neutrality tests (Innan 2006). The main effect in genealogies of long-term balancing selection is an increased coalescence time when compared with neutral expectations. This leads to an excess of polymorphism in the genomic region linked to the selected variant(s) (Hudson and Kaplan 1988; Takahata and Nei 1990; Nordborg 1997; Barton and Etheridge 2004; Williamson et al. 2004). A modified HKA test (Hudson et al. 1987) was applied to detect such excess of diversity. Whereas the original HKA test rejects neutrality with both excess of polymorphism and divergence, our “HKAlow” test is a one-sided HKA test that rejects neutrality only with excess polymorphism. Besides affecting the time to coalescence, balancing selection also affects allele frequencies. Both overdominance (with similar fitness of both homozygotes) and frequency-dependent selection (with optimum at frequencies ~0.5) can produce an excess of intermediate-frequency alleles. This yields a local site frequency spectrum (SFS) skewed toward intermediate-frequency alleles with respect to the genome as a whole (global SFS). Such a difference between the local and global SFS was tested with a one-sided Mann–Whitney U (MWU) test on the “folded” SFSs. This test, which we call “MWUhigh,” rejects neutrality only in the presence of excess of intermediate-frequency alleles.
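The MWUhigh comparison of a local and a global folded SFS can be sketched as follows. This is an illustration under stated assumptions, not the procedure used here: significance in this study is assessed by coalescent simulations, whereas the sketch uses the standard normal approximation with tie and continuity corrections. The function names, the sample size of 40 chromosomes, and the toy allele counts are hypothetical.

```python
import math


def folded_counts(derived_counts, n_chromosomes):
    """Fold derived-allele counts into minor-allele counts."""
    return [min(k, n_chromosomes - k) for k in derived_counts]


def mwu_greater(x, y):
    """One-sided Mann-Whitney U test of 'x tends to be larger than y'.

    Returns (U, p) using the normal approximation with tie correction
    and continuity correction.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1..j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2.0 - 0.5) / math.sqrt(var_u)
    return u, 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy data: minor-allele counts for one gene versus the rest of the genome.
local_sfs = folded_counts([18, 21, 30, 19, 32], 40)
global_sfs = folded_counts([1, 1, 2, 39, 3, 1, 2, 1], 40)
u_stat, p_value = mwu_greater(local_sfs, global_sfs)
print(local_sfs, global_sfs, u_stat, p_value)
```

Folding is used because it does not depend on the inferred ancestral state; the test is one-sided so that only a shift of the local SFS toward intermediate frequencies counts as evidence.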
The signature of balancing selection is defined by the intersection of the two tests. Genes with signatures of balancing selection (here referred to as extreme genes) are selected as those with significant departures from the neutral model both for HKAlow and MWUhigh tests (5% significance level). The intersection defines genes with both a significant excess of polymorphism and a significant excess of intermediate-frequency alleles. The two tests are sensitive to additional selective forces, but their combination is expected to specifically detect the effects of long-term balancing selection maintaining intermediate frequencies.
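The intersection criterion can be expressed in a few lines. In this minimal sketch the function name, the gene labels, and the P values are hypothetical; the strict inequality at the 5% level is also an assumption, since the text does not say how a P value of exactly 0.05 is treated.

```python
def extreme_genes(hka_low_p, mwu_high_p, alpha=0.05):
    """Genes significant in both tests.

    hka_low_p, mwu_high_p: dicts mapping gene name -> P value.
    """
    return sorted(
        gene
        for gene, p_hka in hka_low_p.items()
        if gene in mwu_high_p and p_hka < alpha and mwu_high_p[gene] < alpha
    )


# Hypothetical P values for four genes.
hka = {"geneA": 0.001, "geneB": 0.030, "geneC": 0.200, "geneD": 0.010}
mwu = {"geneA": 0.020, "geneB": 0.400, "geneC": 0.010, "geneD": 0.049}
print(extreme_genes(hka, mwu))
```

Requiring both tests to be significant trades power for specificity: a gene with excess polymorphism alone, or with a frequency skew alone, is not reported.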
The limited number of variants per gene prevents the separate analysis of synonymous and nonsynonymous sites, as well as sliding window type of approaches. For test of gene categories, all genes (irrespective of the number of informative sites) were divided into biological process and molecular function categories according to Panther (http://www.pantherdb.org/), and the distribution of P values of each category was compared with the rest of the data set with a Mann–Whitney U test.
Because linkage phase of haplotypes in these data is unknown, LD was measured by the composite LD (Weir 1996), which does not require phase information and avoids introducing uncertainty during haplotype inference (Andrés et al. 2007). We used composite_LD, a Bioperl package from Matthew Hahn and Jason Stajich (http://www.bioperl.org). LD was computed for all SNP pairs in the gene, and the percentage of unmatched, frequency-matched, or distance-matched SNP pairs showing significant LD was compared between extreme and nonextreme genes through 10,000 permutations. Complete haplotypes were inferred with PHASE 2.0 (Stephens et al. 2001), and haplotype networks were constructed using Network (Bandelt and Dress 1992). When necessary for comparison, HapMap SNP frequency and LD information were obtained from the HapMap database (http://www.hapmap.org) for the Yoruba from Nigeria (YRI) and western Europeans (CEU). The potential functional consequences of nonsynonymous SNPs were predicted with PolyPhen (Sunyaev et al. 2001) as described in Lohmueller et al. (2008).
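The permutation comparison of SNP pairs in significant LD can be sketched as below. The exact permutation scheme is not specified in the text, so this is one possible design and not necessarily the one used: group labels are shuffled across SNP pairs, and the test is one-tailed. The function name, the seed, and the toy flags are hypothetical.

```python
import random


def permutation_p(extreme, nonextreme, n_perm=10_000, seed=0):
    """One-tailed permutation test on the fraction of SNP pairs in LD.

    extreme, nonextreme: lists of 0/1 flags, one per SNP pair
    (1 = pair shows significant LD), for the two gene groups.
    """
    rng = random.Random(seed)
    k = len(extreme)
    pooled = list(extreme) + list(nonextreme)
    m = len(pooled) - k
    observed = sum(extreme) / k - sum(nonextreme) / m
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign pairs to groups at random
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / m
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy flags: 90% of pairs in LD in one group versus 20% in the other.
toy_extreme = [1] * 18 + [0] * 2
toy_nonextreme = [1] * 20 + [0] * 80
print(permutation_p(toy_extreme, toy_nonextreme, n_perm=2000))
```

Frequency or distance matching, as described above, would correspond to restricting the pooled pairs to matched subsets before shuffling; that step is omitted here.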
We detect 60 genes with significant signatures of long-term balancing selection (table 1) as shown by their excess of polymorphism (significant HKAlow test) and excess of intermediate-frequency alleles (significant MWUhigh test). We refer to these genes as extreme genes. The average ratio of counts of polymorphic to divergent sites in nonextreme genes is 0.6, whereas the ratio is 1.9 for extreme genes in both populations. This represents a 3-fold increase in the number of polymorphic nucleotides in extreme genes. Allele frequencies also show substantial differences between extreme and nonextreme genes (fig. 1A). The SFS of nonextreme genes has the expected skew toward low-frequency alleles, slight differences between populations due to demographic differences, and a relative enrichment of replacement sites at very low frequencies due to purifying selection against deleterious alleles. As expected by their significant MWUhigh test, extreme genes have a considerable skew toward intermediate-frequency alleles (fig. 1A). The bimodal SFS may reflect a combination of selective forces, with purifying selection keeping deleterious variants at low frequencies and balancing selection maintaining alleles at intermediate frequencies. Note that the contribution of both synonymous and replacement sites is similar at intermediate frequencies, indicating that the excess is not only due to silent (neutral) alleles but also to putatively functional replacement variants. The highly similar SFS of synonymous and replacement sites can be explained simply by linkage.
The effect on fitness of replacement mutations ranges from mild to severe. Although the phenotypic consequences of most mutations are unknown, inferences can be made based on the physicochemical properties of the change and the evolutionary conservation at the site. For example, because most mutations are expected to be deleterious, PolyPhen (Sunyaev et al. 2001) classifies mutations in increasing order of expected phenotypic effect as benign, possibly damaging, or probably damaging. In nonextreme genes, the majority of mutations with possible and probable phenotypic effect are likely deleterious and maintained at low frequencies (fig. 1B). In extreme genes, many such variants are also at low frequencies, probably due to purifying selection against deleterious alleles. Nevertheless, a considerable proportion of possibly and probably functional variants are present at intermediate frequencies in extreme genes (fig. 1B), presumably maintained by balancing selection. This proportional enrichment for likely functional variants at very low frequencies and intermediate frequencies in extreme genes (supplementary fig. 2, Supplementary Material online) is not significant. Still, the trend again illustrates the combination of purifying selection (maintaining the functionality of genes) and balancing selection (maintaining functional variants in the population) in extreme genes.
Our multistep process involves the inference of the demographic scenario that best fits the data and the use of this model as the null against which neutrality is tested. Although the demographic inference is not intended to disentangle the exact demographic history of human populations (no genetic inference can), this strategy is a conservative one (supplementary Methods, Supplementary Material online). Other demographic scenarios are compatible with the data, though, and their use could, in theory, affect neutrality tests. We investigated the influence of the underlying demographic scenario by assessing the probability of the two tests under two additional demographic scenarios in extreme genes. The P values show a high correlation between the original and the two alternative demographic models (fig. 2). Only six genes in AA and one gene in EA found to be significant in the original analysis do not reach significance under the alternative scenarios, in all cases with P<0.07 (fig. 2). These results show a modest influence of the demographic scenario and suggest that our results will be largely robust regardless of the demographic model assumed, as long as the model is a realistic one.
An advantage of the gene-centric nature of the data set is that, rather than detecting long genomic regions containing several genes, we identify the specific gene under selection. This makes the interpretation of selective signatures considerably easier and more precise than other genome-scan methods. A total of 28 genes show signals of selection in AA and 45 genes show signals of selection in EA, with 13 showing consistent signatures in both populations (table 1). Although selective differences between the two populations cannot be discarded, the asymmetry is likely due to differences in power between the two populations because of their dissimilar neutral and genomic distributions. Assessing the false discovery rate is not trivial because the criterion to select extreme genes integrates information from two nonindependent tests. If the tests were independent, we would expect 12 extreme genes in each population just by chance (at the 5% significance level for each test). We observe an excess of 16 extreme genes for AA and 33 extreme genes for EA, indicating that there are real signatures of selection in the data.
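The chance expectation quoted above follows from simple arithmetic, reproduced here as a check. The counts are taken from the text; the independence of the two tests is the simplifying assumption stated there, and the variable names are introduced only for this sketch.

```python
n_genes = 4877   # genes with at least ten informative sites
alpha = 0.05     # significance level applied to each test

# Expected number of genes significant in both tests by chance,
# if the two tests were independent.
expected = n_genes * alpha * alpha

observed = {"AA": 28, "EA": 45}
shared = 13      # genes extreme in both populations

excess = {pop: n - round(expected) for pop, n in observed.items()}
total_extreme = observed["AA"] + observed["EA"] - shared

print(expected, excess, total_extreme)
```

The total of distinct extreme genes obtained this way agrees with the 60 genes reported in table 1.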
Most genes with significant signals of selection in only one population show similar patterns in the second population (although not reaching statistical significance, table 1) and cannot be considered population specific. This is consistent with selection predating the relatively recent separation of the two populations, as expected with long-term balancing selection. Some genes, though, show unexpected population-specific patterns. Specifically, four genes show AA-only signatures (with P values >0.2 in EA) and nine genes show EA-only signatures (table 1). Those patterns likely result from recent demographic or selective population-specific factors contributing to the loss of the balanced equilibrium in one of the populations and may represent interesting cases of population-specific loss of an advantageous functional variant.
Whenever possible, results of large scans should be compared with examples of genes known to be undergoing balancing selection, which serve as internal positive controls. In the case of long-term balancing selection, this represents a challenge due to the scarcity of known examples. The best-characterized case in humans is the MHC, with several human leukocyte antigen (HLA) loci showing excess of polymorphism, complex haplotype structures, and trans-specific polymorphism (Hughes and Yeager 1998). Of the five HLA genes analyzed, the only one for which signatures of balancing selection have been previously reported is HLA-B (Hedrick et al. 1991; Sánchez-Mazas 2007), which shows signatures of selection in our data set. We also detect FUT2/Secretor factor (Se), an ABO-secretor gene considered an “honorary blood group” and associated with signatures of balancing selection in humans (Koda et al. 2000; Soejima et al. 2007; Ferrer-Admetlla et al. 2009) (supplementary table 2A, Supplementary Material online). Other historically proposed targets of balancing selection are either cases of recent selection (β-globin, CFTR, G6PD)—not targeted or detected by our method—or genes (like ABO) that show incomplete signatures of selection according to our strict criterion (see supplementary table 2B, Supplementary Material online). Overall, the comparison of extreme genes with previously reported targets confirms the detection of strong signatures of selection (like those in HLA-B and FUT2) and the specificity of the method to detect only genes under strong, long-term selection maintaining intermediate-frequency alleles.
An excess of heterozygotes (one of the signatures of present-day overdominance) has been reported for olfactory receptors (Alonso et al. 2008), but we find no evidence of increased selection in this functional category. The molecular function categories showing the strongest excess of low P values for the two tests and in the two populations are extracellular matrix, extracellular matrix structural protein, structural protein, intermediate filament, and serine protease inhibitor. The extracellular matrix comprises a large variety of proteins, including diverse structural molecules; high genetic variability in these proteins might contribute to the diversity and complexity of the matrix. We observe fewer signals in biological process categories, with no category showing consistent excess of low P values in both populations (supplementary table 3, Supplementary Material online).
The genome-wide scale of the project allowed us to analyze the specific characteristics of the selection targets, rather than focusing on the particulars of one or two candidate genes. For example, balancing selection can alter the haplotype and LD structure of a gene, either by reducing the association between sites due to increased coalescent time and recombination (Charlesworth et al. 1997) or some types of epistasis (Navarro and Barton 2002), or by raising it due to positive epistasis between selected sites. Extreme genes show significantly higher LD (MWU P [AA] = 4.16 × 10^−6, P [EA] = 6.01 × 10^−10). They also show an excess of SNP pairs in significant LD (in both populations, one-tailed permutation test P < 10^−4). This is true even after correcting for the intrinsically higher SNP density (which may increase LD) or higher average allele frequency (which may increase the power to detect significance) of extreme genes (in all cases, one-tailed permutation test P < 10^−4). Unfortunately, simultaneously controlling for these two factors is not feasible because the combination of high SNP density and allele frequency is an intrinsic trait of extreme genes. The elevated LD within genes cannot be explained by reduced recombination rate in extreme genes (data not shown), and the increased LD does not extend over the limits of the genes: The average LD (r^2) in HapMap SNPs (CEU and YRI) for regions of 20 kb centered on every gene is not unusual in extreme genes (t-test P [AA] = 0.1497, P [EA] = 0.1604). Similar results were obtained for regions of 50 kb (t-test P [AA] = 0.9942, P [EA] = 0.3751). This confirms that the signal is specific to extreme genes and not to the genomic regions in which they reside, and that balancing selection may favor mostly specific haplotypes, rather than individual SNPs, in this set of genes.
In any case, the pattern is by no means universal, with LD and haplotype structure varying substantially among genes, from genes with two distant and rarely recombining haplogroups to genes with pervasive signals of recombination and/or gene conversion (fig. 3).
Nonhomologous gene conversion has a recognized influence in the high levels of variability of the MHC complex by introducing new variants from paralogous sequences. In the absence of selection, such variants, at low frequency, mimic the patterns of purifying selection rather than those of balancing selection. Still, nonhomologous gene conversion could not be completely discounted for 13 extreme genes (including HLA-B and FUT2), where more than one SNP could be mapped to a paralogous sequence on the same chromosome (table 1). Most of these SNPs are transitions, and therefore, independent mutations in the two copies can account for some of the cases. Nevertheless, our results suggest that gene conversion may be a mechanism for introducing variants to genes evolving under balancing selection outside the HLA complex.
Here, we report the results of a systematic genome-wide scan of balancing selection that, in contrast to previous studies, reveals a number of candidate targets of balancing selection in the human genome. Asthana et al. (2005) reported that transspecific polymorphism between humans and chimpanzees is rare, suggesting a limited role of long-term balancing selection in the two species. The power to detect events of transspecific polymorphism is small (Clark 1997; Wiuf et al. 2004) and still limited by the data, but those results suggest that transspecific patterns like those found in the MHC locus are most likely an exception. Focusing on variants recovered from genomic sequence reads, Bubb et al. (2006) also failed to detect convincing targets of long-term balancing selection in humans. They focused on large genomic regions with high SNP density and high LD. Although some targets of balancing selection fulfill those requirements (HLA is the most prominent example), neither high LD nor extension of the highly polymorphic region are necessary predictions of the action of balancing selection. In nonselfing species and in the absence of epistasis, the signal of balancing selection is expected to be broken by recombination and affect only narrow genomic regions (Charlesworth et al. 1997). Bubb et al. (2006) clearly established that the complex patterns seen in the MHC locus are unusual, possibly resulting from a combination of directional and balancing selection, recombination/gene conversion, and epistasis between distant sites (Hughes and Yeager 1998). But until now, it had remained unclear whether other, more typical cases of balancing selection exist in the human genome.
New genomic data sets allow us to tackle this question now. We have identified genes experiencing effects of balancing selection, showing that these cases do exist. The double signature of excess of polymorphism and intermediate-frequency alleles is difficult to reconcile with forces other than balancing selection, including purifying selection and positive selection, from new or standing variation (supplementary Discussion, Supplementary Material online). Because weak overdominance does not increase polymorphism (Williamson et al. 2004), selection must be strong to lead to the patterns that we observe here. Likewise, the genetic signals observed do not agree with the expected patterns produced by other possible causes of deep coalescence, like ancestral admixture, ancient population structure, or putative hybridization between ancestral humans and chimpanzees (supplementary Discussion, Supplementary Material online).
Other demographic factors are also an unlikely cause for the patterns observed. First, populations were analyzed separately, and population history (including admixture) was intrinsically accounted for in the statistical test. Second, demographic effects are expected to affect the whole genome, not a small number of genes, although some demographic models may increase the variance among genes. Third, we have shown that our results are largely robust to the demographic model used. Fourth, admixture is not expected to increase the coalescence time of the gene, and in humans, with little population stratification, population structure increases the proportion of low-frequency alleles, and not intermediate-frequency alleles, with increasing numbers of populations (Ptak and Przeworski 2002). This is the opposite pattern to balancing selection. Finally, to ensure that the use of potentially admixed American populations does not affect our results, we compared the frequency in our samples (AA and EA) and potentially nonadmixed HapMap populations (Yoruba and CEU) for SNPs present in both data sets. Only three genes (LRRN6A [also known as LINGO], LINS1, and TARBP1) had two or more SNPs with more intermediate frequency in this data set than in HapMap samples (allele frequency difference ≥0.2). This is a very small proportion of extreme genes, and some variance in allele frequencies between data sets is expected. So, even if the influence of potential admixture cannot be completely discarded for these three genes, it should not be a concern to the overall results.
An additional element that requires attention is gene duplication because inadvertent confounding of the sequences of two distinct copies of a gene would alter the patterns of variation. Nevertheless, this is an unlikely source of error in our analysis. Only old events (with fixed differences among copies) would produce false high-frequency variants; such events are likely well annotated in genome assemblies and would be detected by our strict bioinformatics pipeline, which was designed to remove such regions from the analysis (see Materials and Methods). To further rule out the influence of copy number variation (CNV), we calculated the fraction of extreme genes overlapping CNVs according to Redon et al. (2006). Extreme genes are no more likely to overlap CNVs than other genes in our data set (P = 0.209), and only two genes (PCDHB16 and ZNF512b) are present in CNVs reported in more than one study. Identifying CNVs is an error-prone task and their annotation at the genome scale might still be incomplete, but this analysis suggests that duplications do not significantly impact our results.
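The text does not state which test produced the P value for CNV overlap. As an illustration only, one common way to ask whether a gene subset overlaps CNVs more often than expected is a one-sided hypergeometric test, sketched below. The function name and all counts are hypothetical and do not reproduce the reported value.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric.

    N: total genes in the data set
    K: genes overlapping a CNV
    n: genes in the subset of interest (e.g., extreme genes)
    k: genes in the subset that overlap a CNV
    """
    total = comb(N, n)
    upper = min(K, n)
    # Sum exact integer counts first, then divide once.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Hypothetical counts, for illustration only (not values from the study).
p_value = hypergeom_upper_tail(k=2, K=4, n=3, N=10)
```

A large P value from a test of this kind would indicate that the subset overlaps CNVs about as often as expected by chance, which is the interpretation given in the text.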
The signature of balancing selection affects extremely narrow genomic regions, as predicted by theory (Hudson and Kaplan 1988; Charlesworth et al. 1997). Note, for example, that the signal in one gene does not extend to neighboring genes (supplementary fig. 3, Supplementary Material online). A single large cluster of genes is evident on chromosome 19 (a gene-rich chromosome), with extreme genes separated by many nonextreme genes, indicating that their signals are most likely independent. Only the signals of VARSL and CDSN should be interpreted with caution due to their close proximity to the HLA loci, and the double signature in KRT84 and KRT6B (adjacent in the genome) in AA could be caused by strong selection in one of the two genes or an intermediate region. The remarkably tight localization of signals confirms that only data sets with a high density of SNPs (i.e., resequencing data) will have the power to detect balancing selection.
The 60 candidate targets of balancing selection were identified after careful efforts to remove or account for possible confounding factors and using very stringent criteria. Still, confirmation of selective signatures detected by genome scans, such as this one, will best be performed on a gene-by-gene basis. Nevertheless, an inspection of genes in table 1 already reveals interesting patterns. For example, a disproportionate number of extreme genes are involved in immunology and response to pathogens. In addition to HLA-B, LRAP and LILRB4 are directly involved in MHC function; BTN1A1 is a member of the immunoglobulin superfamily; LRRN6A (LINGO1) contains an immunoglobulin domain; C20orf186 codes for the antimicrobial peptide RY2G5; CD200R1 is an important immune regulator; TARBP1 and TRIM22 are involved in HIV infection; and FUT2 determines blood group and its variants modulate susceptibility to Norwalk virus and HIV-1 infection (supplementary table 4, Supplementary Material online). This is expected if, as predicted based on the MHC loci, the maintenance of genetic diversity is selectively advantageous in response to pathogens (Hughes 2002). The signatures of balancing selection in a variety of immune genes illustrate the beneficial role of genetic variability in diverse steps of the immunological process (e.g., Ferrer-Admetlla et al. 2008).
A number of keratin genes (KRT14, KRT6B, KRT6C [KRT6E], KRT84) show signals of balancing selection, as does CDSN, which encodes a protein expressed specifically in corneocytes (keratinocyte-derived cells). Interestingly, other identified genes include those encoding a glucose solute carrier (SLC2A9) and three ion channels (CLCNKB, GRIN3A, and TRPV6). The pattern of variation in the gene encoding a prominent chloride channel, CFTR, has been proposed to reflect recent balancing selection in humans (Quinton 1994), consistent with reports suggesting improved survival of heterozygotes to certain infections (Gabriel et al. 1994). Like CFTR, the genes encoding other membrane channels may work as a gateway for entrance of pathogens into cells or may be important in controlling the response to infection, in addition to their primary role in cellular transport.
Many extreme genes cause disease or are associated with disease (supplementary table 5, Supplementary Material online). Still, we find no overrepresentation of targets of balancing selection in OMIM. If SNPs in genes under balancing selection are associated with disease (and the disease is not the direct selective force), their common variants will most likely have a role in common, complex diseases, where identification of causal mutations is a challenge (Reich and Lander 2001). For example, immune-related genes, as well as those involved in inflammation (ADAMTS7, NALP13, and PPP1R15A), are candidates of this class. The possibility remains that balancing selection has a modest influence on human disease, but we believe that these genes should be considered candidates for the genetic basis of common, complex diseases of unclear etiology, due to their excess of putatively functional common variants.
To our knowledge, this study provides the first unbiased set of candidate targets of balancing selection in humans. Our method was designed for the detection of a specific type of selection and, consequently, has little or no power to detect other classes of selection. We will not detect selection if the bouts are particularly recent or if the form of balancing selection yields no excess of intermediate-frequency alleles at equilibrium. Likewise, we have reduced power to identify very short genes and no coverage in nongenic genomic regions. Because, in addition, our data set contains about one-third of the estimated number of genes in the genome, it is likely that we have identified only a portion of the genes that may have evolved under balancing selection in the human genome. The identification of such elements is important not only for understanding their evolutionary history but also for finding functional variants of potential phenotypic and medical relevance. In this respect, this study represents a step forward in the evolutionary annotation of the human genome.
The authors would like to thank Scott Williamson for his constant support and for always-insightful comments, discussions, and suggestions. We thank Sergio Castellano for helpful discussions and Kirk Lohmueller for useful suggestions and help with PolyPhen analysis. We thank John Sninsky (Celera Diagnostics) for his active role in the Applera Genome Initiative project, which generated the data for this study, and for stimulating discussion and support. This work was supported by the National Institutes of Health (grants HL072904 and GM065509 to A.M.A. and A.G.C.). It was also supported in part by the Intramural Program of the National Human Genome Research Institute, National Institutes of Health.