Current estimates of the number of deleterious mutations per diploid human genome vary by several orders of magnitude. Using the correlation between inbreeding rates within consanguineous marriages and mortality, Morton, Crow, and Muller [5] estimated that each of us carries 3–5 lethal equivalents (i.e., an allele or combination of alleles that would be lethal if made homozygous), whereas Kondrashov [6] has predicted that the number may be as high as 100 lethal equivalents. Comparative genomic methods suggest that approximately 38% of amino-acid changing polymorphisms are deleterious, with 1.6 new deleterious mutations arising per individual per generation [7], while studies based on segregating polymorphisms estimate that each person carries between 500 and 1,200 deleterious mutations [3,8]. These estimates are very difficult to reconcile because each study used different methods and data. Furthermore, the studies that used DNA sequences included data from only several hundred genes. Thus, there is a critical need for an unbiased genome-wide estimate of the number of damaging mutations carried by individuals in different populations.
We quantify the number of damaging mutations per diploid human genome by combining the Applera genome-wide survey of SNPs, found by resequencing 20 European Americans (EAs) and 15 African Americans (AAs) [9], with comparative genomic data, including the PanTro2 build of the chimpanzee genome, and protein structure prediction data. After applying strict quality control criteria, the data set we analyzed contains 39,440 autosomal SNPs free of ascertainment bias, located in 10,150 unique transcripts in the human genome (see Methods). Of these SNPs, 20,893 were synonymous (nucleotide changes that do not change the amino acid) and 18,547 were nonsynonymous (nucleotide changes that change the amino acid).
At each SNP, an individual can be homozygous for the ancestral allele (carry zero copies of the mutant allele), heterozygous (carry one copy of the mutant allele), or homozygous for the derived allele (carry two copies of the mutant allele). We find that an individual is heterozygous, on average, for 1,962.4 nonsynonymous SNPs (SD: 275.1; Supplementary Table 1). These numbers are underestimates since only SNPs with good quality sequence and a matching chimp base are considered. Perhaps for these reasons, our estimate is slightly smaller than that of Cargill et al., even after decreasing their estimate to account for the current estimated number of genes in the genome. For both synonymous and nonsynonymous SNPs, AA individuals are heterozygous at a greater number of SNPs than are EA individuals (P < 6.2 × 10^-10, Mann-Whitney U-test (MWU) for synonymous SNPs; P < 6.2 × 10^-10, MWU for nonsynonymous SNPs), consistent with previous studies finding higher levels of genetic variability in Africa [4]. Interestingly, for both types of SNPs, we find that EA individuals are homozygous for the derived allele at a greater number of SNPs than AA individuals (P < 6.2 × 10^-10, MWU). These patterns are largely due to an elevated number of SNPs fixed for the derived allele in the EA sample while segregating for two alleles in the AA sample. Excluding SNPs that are not segregating in the particular subpopulation, we observe that AAs have more homozygous derived genotypes per individual at synonymous SNPs and EAs have slightly more homozygous derived genotypes per individual at nonsynonymous SNPs.
Distribution of the number of heterozygous and homozygous genotypes per individual
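The recurring bound of 6.2 × 10^-10 on the Mann-Whitney P-values has a simple combinatorial reading. Assuming the test was computed exactly and without ties (an assumption; the text does not state how the P-values were obtained), the smallest two-sided P-value attainable with 20 EA and 15 AA individuals arises when the two groups do not overlap at all. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
from math import comb

def min_exact_mwu_p(n1: int, n2: int) -> float:
    """Smallest two-sided exact Mann-Whitney P-value for sample sizes n1 and n2.

    Under the null hypothesis every assignment of ranks to the two groups is
    equally likely, so complete separation (in either direction) has
    probability 2 / C(n1 + n2, n1). Assumes no ties and an exact test rather
    than a normal approximation.
    """
    return 2 / comb(n1 + n2, n1)

# Sample sizes from the text: 20 European Americans and 15 African Americans.
p_floor = min_exact_mwu_p(20, 15)
print(f"{p_floor:.3e}")
```

Under these assumptions the floor lies just under 6.2 × 10^-10, so a reported P < 6.2 × 10^-10 would be consistent with every individual in one sample exceeding every individual in the other.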
To estimate the number of damaging alleles carried by each individual in our sample, we used the PolyPhen algorithm [8,11] to predict which nonsynonymous SNPs might disrupt protein function. PolyPhen predicts whether a SNP is “benign”, “possibly damaging”, or “probably damaging” based on evolutionary conservation and structural data. To assess whether “damaging” SNPs were more likely to be deleterious, we compared the allele frequency distributions of SNPs predicted to be “benign”, “possibly damaging”, and “probably damaging” for each population. We find that the three distributions are significantly different from each other, with more low frequency SNPs in the “probably damaging” category (P < 5.9 × 10^-81 for AA; P < 2.3 × 10^-101 for EA, Kruskal-Wallis test), suggesting that the majority of SNPs classified as damaging are also evolutionarily deleterious.
Distribution of Applera SNPs by population and functional class
For each individual, we tabulated the number of SNPs predicted to be “possibly damaging” or “probably damaging” at which the individual was heterozygous or homozygous for the damaging allele. We find that an individual carries, on average, 426.1 damaging (here defined as possibly or probably damaging) SNPs in the heterozygous state (SD: 65.4, range: 340–534) and 91.7 in the homozygous state (SD: 8.6, range: 77–113). Since we surveyed just over 10,000 genes, the actual number of damaging mutations in a person’s genome may be as much as twice that given here. Every individual in our sample is heterozygous at fewer “probably damaging” SNPs than synonymous SNPs, consistent with purifying selection eliminating damaging SNPs from the population. AAs have significantly more heterozygous genotypes than do EAs for all three PolyPhen categories (P < 6.2 × 10^-10 for “possibly damaging” SNPs; P < 3.7 × 10^-8 for “probably damaging” SNPs). The two populations differ significantly in the distribution of homozygous genotypes for the damaging allele at “probably damaging” SNPs (P < 2.7 × 10^-6), with EAs having approximately 26% more homozygous damaging genotypes than AAs. The lack of a statistical difference at “possibly damaging” SNPs (P = 0.17) is likely due to a lack of power since, overall, all other categories of SNPs (synonymous, nonsynonymous, “benign”, and “probably damaging”) follow the same pattern of excess homozygosity for the derived/damaging allele in EAs relative to AAs.
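The per-individual tally described above can be illustrated with a short Python sketch. The SNP identifiers, genotypes, and labels below are invented toy values, the function name is hypothetical, and the sketch assumes that the derived allele is the damaging one, which is a simplification rather than a statement of how the analysis was carried out.

```python
from collections import Counter

# Labels treated as "damaging" in the text (possibly or probably damaging).
DAMAGING = {"possibly damaging", "probably damaging"}

def tally_damaging(genotypes, predictions):
    """Count heterozygous and homozygous-damaging genotypes for one individual.

    genotypes:   dict mapping SNP id -> copies of the derived allele (0, 1 or 2)
    predictions: dict mapping SNP id -> PolyPhen-style label
    Returns (n_heterozygous, n_homozygous) over SNPs labeled as damaging.
    """
    counts = Counter()
    for snp, copies in genotypes.items():
        if predictions.get(snp) not in DAMAGING:
            continue  # skip benign or unclassified SNPs
        if copies == 1:
            counts["heterozygous"] += 1
        elif copies == 2:
            counts["homozygous"] += 1
    return counts["heterozygous"], counts["homozygous"]

# Toy data (hypothetical identifiers and genotypes, for illustration only).
toy_predictions = {
    "snpA": "benign",
    "snpB": "possibly damaging",
    "snpC": "probably damaging",
    "snpD": "probably damaging",
}
toy_genotypes = {"snpA": 2, "snpB": 1, "snpC": 2, "snpD": 0}
n_het, n_hom = tally_damaging(toy_genotypes, toy_predictions)
print(n_het, n_hom)
```

Repeating such a tally over all individuals gives the per-individual counts whose means, standard deviations, and ranges are reported in the text.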
Classical analyses of human inbreeding suggest that each individual carries 1.44–5 lethal equivalents [5,12]. However, inbreeding studies cannot determine whether a single lethal equivalent is due to one lethal allele, two alleles each with a 50% chance of lethality, 10 alleles each with a 10% chance of lethality, or other combinations. Since we find that individuals carry hundreds of damaging alleles, it is likely that each lethal equivalent consists of many weakly deleterious alleles. Our finding that each person carries several hundred potentially damaging SNPs suggests that large-scale medical re-sequencing will be useful to find common and rare SNPs of medical consequence [2].
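The ambiguity of a lethal equivalent can be made concrete with a few lines of Python. The sketch below treats the number of lethal equivalents as the sum, over alleles, of each allele's probability of causing death when homozygous; the function name is hypothetical and the three scenarios are the ones listed in the text.

```python
def lethal_equivalents(allele_classes):
    """Expected number of lethal equivalents.

    allele_classes: iterable of (number_of_alleles, probability_of_lethality)
    pairs, where the probability applies when the allele is made homozygous.
    Each allele contributes its own probability to the total.
    """
    return sum(n * p for n, p in allele_classes)

scenarios = {
    "one fully lethal allele": [(1, 1.0)],
    "two alleles at 50% each": [(2, 0.5)],
    "ten alleles at 10% each": [(10, 0.1)],
}
for label, classes in scenarios.items():
    print(label, lethal_equivalents(classes))
```

All three scenarios sum to one lethal equivalent, which is why mortality data from inbreeding studies alone cannot distinguish between them.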
We next examined the distribution of synonymous and nonsynonymous SNPs between the AA and EA population samples. As expected [4], there are more of both types of SNPs in the AA sample than in the EA sample. However, when classifying synonymous and nonsynonymous SNPs as being shared, private to AA, or private to EA, we strongly reject homogeneity (P < 3.0 × 10^-88). We find that the proportion of private SNPs that are nonsynonymous (49.9%) is higher than the proportion of shared SNPs that are nonsynonymous (41.7%; P < 4.3 × 10^-54), which is not surprising since nonsynonymous SNPs are more likely to be at lower frequency and thus be population specific. However, considering only the private SNPs, we find that the EA sample has a higher proportion of nonsynonymous SNPs (55.4%) than the AA sample (47.0%; P < 2.3 × 10^-37). We observed a similar significant proportional excess of private nonsynonymous SNPs in an independent data set collected by the SeattleSNPs project (Supplementary Table 3; Supplementary Note 1). The SeattleSNPs data, additional quality control analyses (Supplementary Note 2 and Supplementary Table 4), and a similar finding reported for the ANGPTL4 gene indicate that this pattern is not an artefact of the Applera data. Our further analyses using Yoruba individuals from Nigeria, collected by the International HapMap Consortium [14], support this result, indicating that it is robust to admixture (Supplementary Note 3).
Results of G-tests of homogeneity.
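For readers unfamiliar with the G-test of homogeneity, the following is a minimal Python sketch for a 2 × 2 table (population by SNP class), using only the standard library. The counts are placeholders chosen solely to mirror the reported percentages of 55.4% and 47.0% with an invented total of 1,000 private SNPs per population; they are not the Applera counts, so the resulting G and P will not match the reported values. The function name is hypothetical.

```python
from math import erfc, log, sqrt

def g_test_2x2(table):
    """G-test of homogeneity for a 2x2 table of counts.

    table: ((a, b), (c, d)) with rows = populations, columns = SNP classes.
    Returns (G, P), where G = 2 * sum(O * ln(O / E)) and P is the upper tail
    of a chi-square distribution with 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:  # a zero cell contributes nothing to G
                g += 2 * observed * log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative values from rounding
    # For 1 degree of freedom, P(chi-square > g) = erfc(sqrt(g / 2)).
    return g, erfc(sqrt(g / 2))

# Placeholder counts: (nonsynonymous, synonymous) private SNPs per population.
hypothetical_private = ((554, 446),   # EA, 55.4% nonsynonymous
                        (470, 530))   # AA, 47.0% nonsynonymous
g_stat, p_value = g_test_2x2(hypothetical_private)
print(f"G = {g_stat:.2f}, P = {p_value:.2e}")
```

Because G grows with the total count for fixed proportions, the much larger real data set would be expected to give far smaller P-values than this toy table.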
We hypothesized that the proportional excess of nonsynonymous polymorphism in the EA sample could be due to varying efficacy of purifying selection due to differences in demographic histories between the two populations. Our hypothesis has two testable predictions: 1) if this proportional excess of nonsynonymous polymorphisms in EAs is due to an excess of damaging alleles, we would also expect to find a proportional increase of “probably damaging” SNPs as predicted by PolyPhen in the EA sample, and 2) we should be able to recapitulate this pattern using simulations with reasonable demographic parameters. When dividing nonsynonymous SNPs into the three PolyPhen categories, we find a significant excess of “probably damaging” SNPs among private SNPs compared to shared SNPs. When considering only the private SNPs, we find a significantly higher proportion of “probably damaging” SNPs in the EA sample relative to the AA sample (P < 3.3 × 10^-11), supporting our hypothesis that the excess proportion of nonsynonymous SNPs in the EA sample is due to a higher proportion of damaging SNPs.
To assess whether these observations are consistent with plausible demographic histories of the two populations, we developed a large-scale forward simulation program that includes non-stationary demography and a negative log-normal distribution of selective effects for deleterious mutations. Our program used demographic parameters estimated from the data and the literature [15] for each population (Supplementary Table 2). For example, in one set of simulations we used a population expansion model for the AAs and a bottleneck model for the EAs (Supplementary Fig. 1). We sampled from these simulated populations and found that the proportion of nonsynonymous SNPs is greater in the bottlenecked population than in the population that has expanded (Supplementary Table 2; Supplementary Fig. 2a). Furthermore, the simulated proportions agree with the observed proportions for the Applera dataset (here the proportion includes all SNPs, not just private ones). For all demographic models considered, we observed a higher proportion of nonsynonymous SNPs in the population that underwent a bottleneck than in a population of constant size or one that has expanded; the degree to which these other models fit the observed data is variable, however (Supplementary Table 2; Supplementary Fig. 2a). For all models tested, we find that a higher proportion of SNPs in the simulated EA sample are weakly or strongly deleterious (−0.5 < s < −0.001) than in the simulated AA sample (Supplementary Table 2; Supplementary Fig. 2b), which supports our hypothesis that a higher proportion of deleterious alleles have accumulated in the bottlenecked population. Our analysis illustrates that plausible models of human demography and purifying selection are sufficient to account for the observed increase in the proportion of nonsynonymous SNPs in the EA sample relative to the AA sample.
Demography and selection can cause a proportional excess of nonsynonymous SNPs in Europeans
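To give a sense of what a forward simulation with non-stationary demography and log-normally distributed selective effects involves, the following is a minimal single-locus Wright-Fisher sketch in Python. It is not the program used in the study: the function names, the log-normal parameters, the population sizes, and the choice of genic selection are all placeholder assumptions made for illustration.

```python
import random

def draw_s(rng, mu=-7.0, sigma=2.0):
    """Draw a selection coefficient from a negative log-normal distribution.

    s = -exp(Normal(mu, sigma)), floored at -1. The values of mu and sigma
    are placeholders, not fitted estimates.
    """
    return -min(rng.lognormvariate(mu, sigma), 1.0)

def wright_fisher_step(count, n_now, n_next, s, rng):
    """Advance one generation: genic selection, then binomial drift.

    count is the number of derived copies among 2 * n_now chromosomes; the
    return value is the number among 2 * n_next chromosomes.
    """
    p = count / (2 * n_now)
    if p in (0.0, 1.0):  # loss and fixation are absorbing states
        return int(p) * 2 * n_next
    weighted = p * (1 + s)
    p_selected = weighted / (weighted + (1 - p))
    # Binomial sampling written as repeated Bernoulli draws for portability.
    return sum(rng.random() < p_selected for _ in range(2 * n_next))

def trajectory(sizes, s, start_count, rng):
    """Allele frequency per generation under a schedule of population sizes."""
    count, n_now = start_count, sizes[0]
    freqs = [count / (2 * n_now)]
    for n_next in sizes[1:]:
        count = wright_fisher_step(count, n_now, n_next, s, rng)
        n_now = n_next
        freqs.append(count / (2 * n_now))
    return freqs

rng = random.Random(1)
# Toy demography: constant size, then a bottleneck, then expansion.
sizes = [200] * 20 + [20] * 10 + [400] * 20
s = draw_s(rng)
freqs = trajectory(sizes, s, start_count=40, rng=rng)
print(f"s = {s:.5f}, final frequency = {freqs[-1]:.3f}")
```

A genome-scale program would additionally introduce new mutations each generation, follow many sites at once, and sample individuals at the end to compare with observed data; those parts are omitted here.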
To determine how the bottleneck contributed to the increased proportion of nonsynonymous SNPs in the EA sample, we recorded the number of SNPs at different time points throughout our forward simulations (see Supplementary Methods). These records show how the number of synonymous SNPs, the number of nonsynonymous SNPs, and the proportion of nonsynonymous SNPs change over time for the EA and AA models described above, as well as for a second bottleneck model with a shorter but more severe reduction in population size. At the start of the bottleneck, the proportion of nonsynonymous SNPs drops below the pre-bottleneck value (due to the preferential loss of low frequency nonsynonymous SNPs). Then, the proportion increases during the bottleneck due to the accumulation of slightly deleterious SNPs that behave almost neutrally in the small population but are eliminated efficiently from larger populations [16]. Once the population expands, the proportion of nonsynonymous SNPs increases dramatically since the increase in population size results in many more mutations (most of which are nonsynonymous, due to the genetic code) entering the population. Since growth was recent, purifying selection has not had sufficient time to decrease the proportion of nonsynonymous SNPs to the equilibrium value for the larger population. A related effect has been noted in spatial expansion models, where deleterious mutations can “surf” to high frequency on the edge of the expansion [17]. Our simulations for African demography suggest that once the African population expanded, the proportion of nonsynonymous SNPs also increased initially. However, because the African expansion occurred further back in time than the most recent European expansion, the proportion of nonsynonymous SNPs has had more time to decrease closer to the equilibrium value in the AA sample. At the present time, the absolute numbers of SNPs are higher in the non-bottleneck model (AA 2) than in the bottleneck models (EA 1 and EA 6). The bottleneck dynamics were robust to the distribution of selective effects used in our simulations (Supplementary Fig. 3).
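The statement that slightly deleterious alleles behave almost neutrally in a small population but are removed efficiently from a large one can be illustrated with Kimura's diffusion approximation for the fixation probability of a new mutation. This is a textbook result used here as a proxy for the strength of selection relative to drift; it is not the authors' simulation, which tracks segregating SNPs rather than fixations. The function name, population sizes, and selection coefficient below are illustrative assumptions.

```python
from math import expm1

def fixation_probability(n: int, s: float) -> float:
    """Kimura's diffusion approximation for a new mutation.

    Assumes a diploid population of constant size n (effective size equal to
    census size), genic selection with coefficient s, and an initial
    frequency of 1 / (2n):
        u = (1 - exp(-2s)) / (1 - exp(-4ns))
    """
    if s == 0:
        return 1 / (2 * n)      # neutral expectation
    if -4 * n * s > 700:        # exp() would overflow; probability is ~0
        return 0.0
    return expm1(-2 * s) / expm1(-4 * n * s)

def relative_to_neutral(n: int, s: float) -> float:
    """Fixation probability divided by the neutral value 1 / (2n)."""
    return fixation_probability(n, s) * 2 * n

s_weak = -0.0005                 # illustrative slightly deleterious allele
small = relative_to_neutral(1_000, s_weak)    # bottleneck-like size
large = relative_to_neutral(10_000, s_weak)   # larger population
print(f"N = 1,000:  {small:.3f} of the neutral rate")
print(f"N = 10,000: {large:.2e} of the neutral rate")
```

With these illustrative values the same allele retains a substantial fraction of the neutral fixation probability in the small population and essentially none in the larger one, in line with the qualitative argument in the text.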
Thus, both the PolyPhen analysis and the forward simulations suggest that, given their lower levels of genetic diversity compared to Africans, EAs have a higher proportion of deleterious alleles, which can be explained by the Out-of-Africa bottleneck and subsequent expansion that outbred European populations endured. This result is important for two reasons. First, while previous work has highlighted examples of European-specific positive selection [14,18–21], the importance of adaptations for the evolution of European populations needs to be tempered by our finding that negative selection is less effective at removing slightly deleterious alleles from European populations. Second, the idea that bottlenecks and founder effects could lead to an increase of damaging alleles in human populations was historically reserved for isolated populations that experienced severe founder effects. Our work suggests that the interaction of demographic processes and purifying selection can have an important impact on the distribution of deleterious variation, even in populations that did not undergo a severe founder effect.