When comparing rare variant density between populations, investigators want to know whether the differences are due to drift or selection. For instance, values of Tajima’s D that are less than −2 may indicate positive selection, whereas values greater than 2 may provide evidence of balancing selection. Under neutral theory, however, selection can be inferred from Tajima’s D only if a balance between the mutation rate and the rate of allelic fixation is assumed. This balance does not exist if the population is going through a period of expansion or contraction. These demographic phenomena can also produce negative and positive values of Tajima’s D, respectively, in the absence of selection. One way to distinguish between demography and selection is that demography affects the genome as a whole, whereas selection has a differential effect on functional regions within the genome.
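As a point of reference for the sign conventions above, the following is a minimal sketch of the standard Tajima (1989) statistic computed from per-site minor allele counts. The function name `tajimas_d` and the toy inputs are illustrative assumptions and are not taken from the analysis pipeline used in this study.

```python
import math

def tajimas_d(n, allele_counts):
    """Tajima's D for n sampled chromosomes.

    allele_counts holds one entry per segregating site: the number of
    chromosomes (between 1 and n - 1) that carry one of the two alleles.
    Hypothetical helper for illustration only.
    """
    if n < 4:
        # The variance term is zero for n = 2 and n = 3.
        raise ValueError("need at least 4 chromosomes")
    s = len(allele_counts)
    if s == 0:
        raise ValueError("need at least one segregating site")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    # Watterson's estimator based on the number of segregating sites.
    theta_w = s / a1

    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Toy data (made up): five singletons versus five intermediate-frequency sites.
print(tajimas_d(10, [1, 1, 1, 1, 1]))  # excess of rare variants
print(tajimas_d(10, [5, 5, 5, 5, 5]))  # excess of common variants
```

An excess of rare variants pulls the pairwise diversity below Watterson's estimate and gives a negative value, and an excess of intermediate-frequency variants does the opposite, regardless of whether selection or demography produced the excess.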
Rare variant to total variant regression across the 3,205 genes distributed throughout the genome averages out the differential effects of selection. Slope differences between populations are therefore a better reflection of differences in the demographic phenomena that affect all genes in each population. Populations group by continent for this parameter, with the African, European, and Asian populations showing slopes that are more similar within continents. The Japanese population is significantly different from both the European and the Chinese populations. Mean values of Tajima’s D for each population for both synonymous and nonsynonymous SNPs group the populations in the same manner, with the African populations showing the greatest density of rare variants. Note that this does not simply indicate higher and lower levels of overall variation but rather higher and lower levels of rare variants given the amount of total variation. Although the African populations have been known to harbor greater overall variation, they may also present greater levels of rare variants, even after taking their overall level of variation into account. In a study of one of the most sequenced genes in humans, BRCA1, individuals with African ancestry were shown to have a significantly higher proportion of deleterious mutations compared to European Americans [7]. It would be of interest to see whether this result is specific to this particular gene or whether it is a product of the demography of the African population, as suggested by this study, and thereby reflected genome-wide. With the advent of more affordable sequencing technologies, the answer to this question can be pursued and may aid in the study of the genetic epidemiology of cancer and health disparities.
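The per-population slope comparison described above can be sketched as follows. This is an illustration only: it assumes an ordinary least-squares fit with an intercept, which may differ from the exact regression model used in this study, and the function name `fit_line` and the gene counts are made up.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up per-gene counts: total variants and rare variants in two populations.
total_variants = [10, 20, 30, 40]
rare_pop_a = [3, 6, 9, 12]
rare_pop_b = [5, 10, 15, 20]

slope_a, _ = fit_line(total_variants, rare_pop_a)
slope_b, _ = fit_line(total_variants, rare_pop_b)
print(slope_a, slope_b)
```

A larger slope means more rare variants per unit of total variation, which is the quantity compared across populations here, as opposed to the overall level of variation.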
The outlier genes highlighted in Figure are likely to represent instances of selection because their rare variant loads are distinct from those of other genes and are unlikely to be due to demographic parameters. AHNAK, which lies above the regression line, presents an excess of rare variants compared to other genes and is likely under positive selection; CDC27 and HLA-A lie below the regression line, which means that they present an excess of common variants and are likely under balancing selection. The AHNAK gene encodes an unusually large protein that is typically repressed in cell lines derived from human neuroblastomas and in several other types of tumors. AHNAK is known to be composed of highly conserved repeated elements [8]. Regions in the genome that are highly conserved are generally so because of the selective pressure acting on them. At the other end of the spectrum lies the human leukocyte antigen (HLA) system, the major histocompatibility complex in humans. HLA loci are known to be exceptionally diverse. As Apanius et al. [9] describe, only natural selection for heterozygosity can account for such a level of diversity.
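One simple way to pick out genes that lie well above or below the regression line is to standardize the residuals of the fit. The sketch below assumes an ordinary least-squares line and an arbitrary cutoff of 1.5 residual standard deviations; neither the cutoff nor the placeholder gene names and counts come from this study.

```python
import math

def flag_outliers(genes, total, rare, z_cutoff=1.5):
    """Label genes whose rare variant count lies far from the fitted line.

    Returns a dict mapping gene name to "above" or "below". The cutoff is an
    illustrative assumption, not a value used in the original analysis.
    """
    n = len(total)
    mean_x = sum(total) / n
    mean_y = sum(rare) / n
    sxx = sum((x - mean_x) ** 2 for x in total)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(total, rare))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(total, rare)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))

    flagged = {}
    for gene, r in zip(genes, residuals):
        if sd > 0 and abs(r) / sd > z_cutoff:
            flagged[gene] = "above" if r > 0 else "below"
    return flagged

# Placeholder genes: six follow the trend, one has excess rare variants,
# one has excess common variants.
genes = ["G1", "G2", "G3", "G4", "G5", "G6", "GENE_HIGH", "GENE_LOW"]
total = [10, 20, 30, 40, 60, 70, 50, 50]
rare = [4, 8, 12, 16, 24, 28, 35, 5]
print(flag_outliers(genes, total, rare))
```

Genes flagged as "above" carry more rare variants than their total variation predicts, and genes flagged as "below" carry fewer, which corresponds to the two kinds of outliers discussed above.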
Although the number of rare variants in relation to the number of total variants or segregating sites is not a measure that is generally used for case-control rare variant association analysis, Bansal et al. [1] pointed out that summary group-level statistics such as this can potentially be used along with individual-based measures. As they pointed out, the use and power of such statistics have not yet been assessed [1]. One possible drawback of using group diversity measures in an association test is that population substructure within the group can inflate the statistic, as can be seen by the higher slope obtained for the pooled populations (see Figure ). Controlling for this level of substructure would have to be considered in any method that relies on such measures.
The group-level rare to total variant ratio, which is the rare variant density measure investigated here for each population, can translate into differences in the number of individuals who carry a rare variant and/or the number of rare variants per individual, measures more commonly used in association analyses of rare variants [2]. Two assumptions would have to hold for this translation: (1) the rare variant distribution across individuals does not differ between populations, and (2) the allele frequency spectrum for the rare SNPs does not differ between populations. Even without testing these assumptions, we see that the difference between the Yoruba and CEPH populations is also present for these measures (see Figure ). Ancestry can therefore be an important confounder for rare variant association analysis in the same way it has proven to be for common SNP associations in genome-wide association studies. Simulations that study the best way to control for ancestry in rare variant association analysis still have to be performed. Admixed populations in particular may require a local level of ancestry control in order to account for differential rare variant densities throughout the genome.
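The two individual-level measures mentioned above can be derived from a genotype matrix as in the following sketch. It assumes genotypes coded as minor allele counts (0, 1, or 2) and a hypothetical frequency threshold for calling a SNP rare; the function name, the threshold, and the toy matrix are assumptions made for illustration.

```python
def rare_variant_burden(genotypes, rare_threshold=0.05):
    """Summarize rare variant carriage in one population.

    genotypes: list of per-individual lists of minor allele counts (0/1/2).
    A SNP is treated as rare if its allele frequency in this sample is above
    zero and below rare_threshold (an assumed cutoff).
    Returns (number of carriers, mean rare variants per individual).
    """
    n_ind = len(genotypes)
    n_snps = len(genotypes[0])

    rare_sites = []
    for j in range(n_snps):
        freq = sum(row[j] for row in genotypes) / (2.0 * n_ind)
        if 0 < freq < rare_threshold:
            rare_sites.append(j)

    # Count, for each individual, the rare sites at which they carry the allele.
    per_individual = [sum(1 for j in rare_sites if row[j] > 0) for row in genotypes]
    carriers = sum(1 for count in per_individual if count > 0)
    return carriers, sum(per_individual) / n_ind

# Toy matrix: 10 individuals, 3 SNPs. SNPs 0 and 2 are singletons; SNP 1 is common.
toy = [[0, 0, 0] for _ in range(10)]
toy[0][0] = 1
toy[3][2] = 1
for i in range(5):
    toy[i][1] = 1
print(rare_variant_burden(toy, rare_threshold=0.1))
```

Running the same summary separately for each population gives the carrier counts and per-individual loads that can then be compared between groups.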
This data set does not afford much power for detecting significant values of Tajima’s D for individual genes, especially after accounting for multiple comparisons. Despite this, carrying out separate analyses on synonymous and nonsynonymous SNPs reveals a result that is best explained by selection. If random noise explained the rare variant distribution across the genome seen in this data set, then synonymous SNPs would present the same allele frequency spectrum as the nonsynonymous SNPs, and values of Tajima’s D for the two sets of SNPs would not differ systematically. Here we see that they do; the nonsynonymous SNPs result in more negative values of Tajima’s D for all the populations (see Table ). This may indicate an increase in power for detecting instances of selection across the genome when using the nonsynonymous SNPs, which are more likely to have a functional effect than the synonymous SNPs. If overall there are more instances of positive selection (negative Tajima’s D) than of balancing selection (positive Tajima’s D), then the increase in power would result in a more negative mean value of Tajima’s D for the population. This result is therefore a good indication that the nonsynonymous SNP analysis of Tajima’s D is picking up real signals of selection that are present in these data.
The contrast between the rare to total variant regressions for synonymous and nonsynonymous SNPs (see Figure ) also helps to distinguish whether selection or drift accounts for the results. The observed population slope differences are likely caused by phenomena that affect the genome as a whole, such as genetic drift. It is to be expected, therefore, that because synonymous SNPs are less sensitive to functional variation across the genome, they can provide greater power for detecting these genome-wide differences caused by drift.
In this study we focus on the results for positive selection (excess of rare variants, negative values of Tajima’s D) rather than for balancing selection (excess of common variants, positive values of Tajima’s D) because of the ascertainment bias that is present in the data. The SNP discovery process presents a bias toward common SNPs that has not been corrected for [10], making any inference on positive values of Tajima’s D extremely difficult. In addition, positive selection is likely to play a larger role across the genome than balancing selection. Despite this, the rare variant to total variant regressions present an alternative to biased values of Tajima’s D for inferring balancing selection. The outliers under the regression line clearly show an excess of common variants that cannot be explained by ascertainment bias.
Finally, values of Tajima’s D from the nonsynonymous SNP analysis should show relatively low variation across populations if this analysis is indeed improving the detection of functionally important genes. The underlying biology is likely to translate across different populations, and therefore the same genes should present the more extreme values from population to population. For the synonymous SNPs, on the other hand, genes are more likely to present extreme values by chance alone, and more variation across populations should be expected. This is in fact what we see here, and it serves as another indication that the nonsynonymous SNP analysis is picking up a real selection signal rather than random noise.
If this is the case, we can use our analysis to compare selection signals, and possibly functionality, across populations. Table shows that most genes that may be presenting a selection signal have consistently highly negative values of Tajima’s D across all populations. One exception to this rule is RANBP2, which, like AHNAK, is a highly conserved gene. Its insufficiency has been linked to autosomal dominant necrotizing encephalopathy, among other things [11]. Not only is there evidence for selection on the gene itself, but the gene has also been the object of extensive duplication in the human lineage, with the resulting region occupying approximately 10% of chromosome 2 [12]. There are eight partial copies of RANBP2 within this region. This gene family has been named RGP (RanBP2-like, GRIP domain containing proteins) [12]. Evidence suggests that all eight copies are evolutionarily active and are expressed [11]. This includes RGPD4, the other gene in this study that showed the same pattern as RANBP2.
Several possibilities can account for the striking difference between the rare variant load for the Luhya population and that of the rest of the populations for RGPD4 and RANBP2. Environmental differences can certainly account for differences in selection pressure between populations even when the biology and functionality of the gene remain unchanged from population to population. On the other hand, the gene’s functionality itself may be different. Genomic context, which plays a role in the functionality of individual genes, can be greatly divergent between populations because of drift and other historical events, such as deletions and duplications. This result points out a specific instance in which it may be important to take into account the ancestral composition of the study population when attempting to narrow down genomic searches according to functionality. It also underlines the potential for disease heterogeneity across different populations.