Both anatomically modern humans and the gastric pathogen Helicobacter pylori originated in Africa, and both species have been associated for at least 100,000 years. Seven geographically distinct H. pylori populations exist, three of which are indigenous to Africa: hpAfrica1, hpAfrica2, and hpNEAfrica. The oldest and most divergent population, hpAfrica2, evolved within San hunter-gatherers, who represent one of the deepest branches of the human population tree. Anticipating the presence of ancient H. pylori lineages within all hunter-gatherer populations, we investigated the prevalence and population structure of H. pylori within Baka Pygmies in Cameroon. Gastric biopsies were obtained by esophagogastroduodenoscopy from 77 Baka from two geographically separated populations, and from 101 non-Baka individuals from neighboring agriculturalist populations, and subsequently cultured for H. pylori. Unexpectedly, Baka Pygmies showed a significantly lower H. pylori infection rate (20.8%) than non-Baka (80.2%). We generated multilocus haplotypes for each H. pylori isolate by DNA sequencing, but were not able to identify Baka-specific lineages, and most isolates in our sample were assigned to hpNEAfrica or hpAfrica1. The population hpNEAfrica, a marker for the expansion of the Nilo-Saharan language family, was divided into East African and Central West African subpopulations. Similarly, a new hpAfrica1 subpopulation, identified mainly among Cameroonians, supports eastern and western expansions of Bantu languages. An age-structured transmission model shows that the low H. pylori prevalence among Baka Pygmies is achievable within the timeframe of a few hundred years and suggests that demographic factors such as small population size and unusually low life expectancy can lead to the eradication of H. pylori from individual human populations. The Baka were thus either H. pylori-free or lost their ancient lineages during past demographic fluctuations. 
Using coalescent simulations and phylogenetic inference, we show that Baka almost certainly acquired their extant H. pylori through secondary contact with their agriculturalist neighbors.
Genetic analyses of Helicobacter pylori have illuminated human migrations and the history of human infection by these bacteria. Both humans and H. pylori originated in Africa, and have been intimately associated for at least 100,000 years. We hypothesized that communities who still live in relative isolation might provide further details about the evolutionary history of H. pylori in Africa. We therefore investigated H. pylori within Baka Pygmies of southeast Cameroon, who live as hunter-gatherers in the tropical rainforest, and compared those bacteria to H. pylori from neighboring farming populations of non-Baka ethnicities. Unexpectedly, Baka Pygmies were much less commonly infected (20.8%) than the non-Baka (80.2%). H. pylori from hunter-gatherers and agriculturalists were genetically very similar and ancient H. pylori lineages were not identified in Baka. We used an epidemiological model to show that demographic factors including small population size and low life expectancy can account for the low infection rate among Baka Pygmies, and that this low rate could have been attained within a few hundred years of secondary contact with their neighbors. We also suggest that the ancestors of the Baka Pygmies were initially H. pylori-free or that their ancestral bacteria have been lost through past demographic fluctuations.
The rates of escape and reversion in response to selection pressure arising from the host immune system, notably the cytotoxic T-lymphocyte (CTL) response, are key factors determining the evolution of HIV. Existing methods for estimating these parameters from cross-sectional population data using ordinary differential equations (ODEs) ignore information about the genealogy of sampled HIV sequences, which has the potential to cause systematic bias and overestimate certainty. Here, we describe an integrated approach, validated through extensive simulations, which combines genealogical inference and epidemiological modelling, to estimate rates of CTL escape and reversion in HIV epitopes. We show that there is substantial uncertainty about rates of viral escape and reversion from cross-sectional data, which arises from the inherent stochasticity in the evolutionary process. By application to empirical data, we find that point estimates of rates from a previously published ODE model and the integrated approach presented here are often similar, but can also differ several-fold depending on the structure of the genealogy. The model-based approach we apply provides a framework for the statistical analysis and hypothesis testing of escape and reversion in population data and highlights the need for longitudinal and denser cross-sectional sampling to enable accurate estimation of these key parameters.
phylodynamics; HIV; escape; genealogy; peeling; cytotoxic T-lymphocyte
Adaptor protein-2 (AP2), a central component of clathrin-coated vesicles (CCVs), is pivotal in clathrin-mediated endocytosis, which internalises plasma membrane constituents such as G protein-coupled receptors (GPCRs) [1-3]. AP2, a heterotetramer of alpha, beta, mu and sigma subunits, links clathrin to vesicle membranes and binds to tyrosine-based and dileucine-based motifs of membrane-associated cargo proteins [1,4]. Here, we show that AP2 sigma subunit (AP2S1) missense mutations, which all involved the Arg15 residue (Arg15Cys, Arg15His and Arg15Leu) that forms key contacts with dileucine-based motifs of CCV cargo proteins [4], result in familial hypocalciuric hypercalcemia type 3 (FHH3), an extracellular-calcium homeostasis disorder affecting parathyroids, kidneys and bone [5-7]. These AP2S1 mutations occurred in >20% of FHH patients without calcium-sensing GPCR (CaSR) mutations, which cause FHH1 [8-12]. AP2S1 mutations decreased the sensitivity of CaSR-expressing cells to extracellular-calcium and reduced CaSR endocytosis, likely through a loss of interaction with a C-terminus CaSR dileucine-based motif whose disruption also decreased intracellular signalling. Thus, our results reveal a new role for AP2 in extracellular-calcium homeostasis.
Efforts to identify the genetic basis of human adaptations from polymorphism data have sought footprints of “classic selective sweeps”. Yet it remains unknown whether this form of natural selection was common in our evolution. We examined the evidence for classic sweeps in resequencing data from 179 human genomes. As expected under a recurrent sweep model, diversity levels decrease near exons and conserved non-coding regions. In contrast to expectation, however, the trough in diversity around human-specific amino acid substitutions is no more pronounced than around synonymous substitutions. Moreover, relative to the genome background, amino acid and putative regulatory sites are not significantly enriched for alleles that are highly differentiated between populations. These findings indicate that classic sweeps were not a dominant mode of adaptation over the past ~250,000 years.
The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection from the host’s immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history.
Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we are able to reveal elements of both the early evolution of the var genes and recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3), and a closely related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum, with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment and the maintenance of diversity in the sequences.
Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence between the two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.
Non-allelic homologous recombination; Hidden Markov-model; var genes; Malaria; PfEMP1; Gene family evolution; Balancing selection
Inferring the nature and magnitude of selection is an important problem in many biological contexts. Typically when estimating a selection coefficient for an allele, it is assumed that samples are drawn from a panmictic population and that selection acts uniformly across the population. However, these assumptions are rarely satisfied. Natural populations are almost always structured, and selective pressures are likely to act differentially. Inference about selection ought therefore to take account of structure. We do this by considering evolution in a simple lattice model of spatial population structure. We develop a hidden Markov model based maximum-likelihood approach for estimating the selection coefficient in a single population from time series data of allele frequencies. We then develop an approximate extension of this to the structured case to provide a joint estimate of migration rate and spatially varying selection coefficients. We illustrate our method using classical data sets of moth pigmentation morph frequencies, but it has wide applications in settings ranging from ecology to human evolution.
selection; spatial structure; population structure; allele frequencies; time series
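The single-population step described in this abstract (a hidden Markov model fitted by maximum likelihood to a time series of allele frequencies) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a haploid Wright-Fisher population of known size, binomial sampling at each time point, a flat prior on the initial allele count, and a simple grid search in place of a numerical optimiser. All function names and the example counts are hypothetical, and the spatially structured extension is not attempted.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability P(X = k) for X ~ Binomial(n, p)."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def transition_matrix(pop_size, s):
    """Wright-Fisher transitions between allele counts 0..pop_size (assumed model)."""
    rows = []
    for i in range(pop_size + 1):
        p = i / pop_size
        p_sel = p * (1.0 + s) / (1.0 + s * p)  # frequency after haploid selection
        rows.append([binom_pmf(j, pop_size, p_sel) for j in range(pop_size + 1)])
    return rows


def log_likelihood(s, samples, pop_size):
    """Forward algorithm; samples is a list of (generation, count, sample_size)."""
    trans = transition_matrix(pop_size, s)
    states = range(pop_size + 1)
    alpha = [1.0 / (pop_size + 1)] * (pop_size + 1)  # flat prior on the initial count
    loglik, prev_gen = 0.0, samples[0][0]
    for gen, count, size in samples:
        for _ in range(gen - prev_gen):  # propagate hidden state one generation at a time
            alpha = [sum(alpha[i] * trans[i][j] for i in states) for j in states]
        prev_gen = gen
        # weight each hidden count by the probability of the observed sample
        alpha = [a * binom_pmf(count, size, k / pop_size) for k, a in zip(states, alpha)]
        norm = sum(alpha)
        loglik += math.log(norm)
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
    return loglik


def estimate_s(samples, pop_size, grid):
    """Grid-search maximum-likelihood estimate of the selection coefficient."""
    return max(grid, key=lambda s: log_likelihood(s, samples, pop_size))


# Made-up counts for illustration only (not the moth data): (generation, count, sample size)
EXAMPLE_SAMPLES = [(0, 10, 100), (5, 22, 100), (10, 41, 100), (15, 63, 100), (20, 81, 100)]
GRID = [x / 10 for x in range(-3, 4)]  # candidate values of s, all greater than -1
print(estimate_s(EXAMPLE_SAMPLES, pop_size=50, grid=GRID))
```

The grid must stay above s = -1 so that the post-selection frequency remains a valid probability. A finer grid or a one-dimensional optimiser could replace the search without changing the likelihood code.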
Statistical imputation of classical HLA alleles in case-control studies has become established as a valuable tool for identifying and fine-mapping signals of disease association in the MHC. Imputation into diverse populations has, however, remained challenging, mainly because of the additional haplotypic heterogeneity introduced by combining reference panels of different sources. We present an HLA type imputation model, HLA*IMP:02, designed to operate on a multi-population reference panel. HLA*IMP:02 is based on a graphical representation of haplotype structure. We present a probabilistic algorithm to build such models for the HLA region, accommodating genotyping error, haplotypic heterogeneity and the need for maximum accuracy at the HLA loci, generalizing the work of Browning and Browning (2007) and Ron et al. (1998). HLA*IMP:02 achieves an average 4-digit imputation accuracy on diverse European panels of 97% (call rate 97%). On non-European samples, 2-digit performance is over 90% for most loci and ethnicities where data are available. HLA*IMP:02 supports imputation of HLA-DPB1 and HLA-DRB3-5, is highly tolerant of missing data in the imputation panel and works on standard genotype data from popular genotyping chips. It is publicly available in source code and as a user-friendly web service framework.
The human leukocyte antigen (HLA) proteins influence how pathogens and components of body cells are presented to immune cells. It has long been known that they are highly variable and that this variation is associated with differential risk for autoimmune and infectious diseases. Variant frequencies differ substantially between and even within continents. Determining HLA genotypes is thus an important part of many studies to understand the genetic basis of disease risk. However, conventional methods for HLA typing (e.g. targeted sequencing, hybridisation, amplification) are typically laborious and expensive. We have developed a method for inferring an individual's HLA genotype based on evaluating genetic information from nearby variable sites that are more easily assayed, which aims to integrate heterogeneous data. We introduce two key innovations: we allow for single HLA types to appear on heterogeneous backgrounds of genetic information and we take into account the possibility of genotyping error, which is common within the HLA region. We show that the method is well-suited to deal with multi-population datasets: it enables integrated HLA type inference for individuals of differing ancestry and ethnicity. It will therefore prove useful particularly in international collaborations to better understand disease risks, where samples are drawn from multiple countries.
To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from ten Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.
Estimating fine-scale recombination maps of Drosophila from population genomic data is a challenging problem, in particular because of the high background recombination rate. In this paper, a new computational method is developed to address this challenge. Through an extensive simulation study, it is demonstrated that the method allows more accurate inference, and exhibits greater robustness to the effects of natural selection and noise, compared to a well-used previous method developed for studying fine-scale recombination rate variation in the human genome. As an application, a genome-wide analysis of genetic variation data is performed for two Drosophila melanogaster populations, one from North America (Raleigh, USA) and the other from Africa (Gikongoro, Rwanda). It is shown that fine-scale recombination rate variation is widespread throughout the D. melanogaster genome, across all chromosomes and in both populations. At the fine-scale, a conservative, systematic search for evidence of recombination hotspots suggests the existence of a handful of putative hotspots each with at least a tenfold increase in intensity over the background rate. A wavelet analysis is carried out to compare the estimated recombination maps in the two populations and to quantify the extent to which recombination rates are conserved. In general, similarity is observed at very broad scales, but substantial differences are seen at fine scales. The average recombination rate of the X chromosome appears to be higher than that of the autosomes in both populations, and this pattern is much more pronounced in the African population than the North American population. The correlation between various genomic features—including recombination rates, diversity, divergence, GC content, gene content, and sequence quality—is examined using the wavelet analysis, and it is shown that the most notable difference between D. melanogaster and humans is in the correlation between recombination and diversity.
Recombination is a process by which chromosomes exchange genetic material during meiosis. It is important in evolution because it provides offspring with new combinations of genes, and so estimating the rate of recombination is of fundamental importance in various population genomic inference problems. In this paper, we develop a new statistical method to enable robust estimation of fine-scale recombination maps of Drosophila, a genus of common fruit flies, in which the background recombination rate is high and natural selection has been prevalent. We apply our method to produce fine-scale recombination maps for a North American population and an African population of D. melanogaster. For both populations, we find extensive fine-scale variation in recombination rate throughout the genome. We provide a quantitative characterization of the similarities and differences between the recombination maps of the two populations; our study reveals high correlation at broad scales and low correlation at fine scales, as has been documented among human populations. We also examine the correlation between various genomic features. Furthermore, using a conservative approach, we find a handful of putative recombination “hotspot” regions with solid statistical support for a local elevation of at least 10 times the background recombination rate.
Well-powered genome-wide association studies, now possible through advances in technology and large-scale collaborative projects, promise to reveal the contribution of rare variants to complex traits and disease. However, while population structure is a known confounder of association studies, it is unknown whether methods developed to control stratification are equally effective for rare variants. Here we demonstrate that rare variants can show a systematically different and typically stronger stratification than common variants, and that this is not necessarily corrected by existing methods. We show that the same process leads to inflation for load-based tests and can obscure signals at truly associated variants. We show that populations can display spatial structure in rare variants even when FST is low, but that allele-frequency dependent metrics of allele sharing can reveal localized stratification. These results underscore the importance of collecting and integrating spatial information in the genetic analysis of complex traits.
Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryote genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variation in a high coverage human genome. Second, we identify over 3Mb of novel sequence in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from 10 chimpanzees enables accurate variant calls without a reference sequence. Finally, we estimate classical HLA genotypes at HLA-B, the most variable gene in the human genome.
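The idea of a colored de Bruijn graph can be illustrated with a toy sketch: each k-mer is a node labelled with the set of samples (colors) in which it occurs, and nodes with more than one outgoing edge mark places where sample paths split, which is where variants show up. This is a hedged illustration only and does not reflect Cortex's data structures or algorithms; it ignores reverse complements, sequencing errors and coverage, and the function names and sequences are hypothetical.

```python
from collections import defaultdict


def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names (colors) that contain it."""
    colors = defaultdict(set)
    for name, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
    return dict(colors)


def successors(graph, kmer):
    """K-mers in the graph that overlap kmer by k-1 bases (outgoing edges)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in graph]


def divergence_points(graph):
    """Nodes with more than one outgoing edge, where paths split."""
    result = {}
    for kmer in graph:
        nxt = successors(graph, kmer)
        if len(nxt) > 1:
            result[kmer] = nxt
    return result


def private_kmers(graph, color):
    """K-mers seen only in the given color."""
    return sorted(kmer for kmer, seen in graph.items() if seen == {color})


# Toy input: two hypothetical samples differing by a single base (G vs C)
toy_samples = {"ref": ["AACGTTG"], "alt": ["AACCTTG"]}
toy_graph = build_colored_graph(toy_samples, k=3)
print(divergence_points(toy_graph))
print(private_kmers(toy_graph, "alt"))
```

In this toy case the two samples share the flanking k-mers and follow different colored paths in between, forming the kind of bubble that a variant caller would examine.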
In large populations, many beneficial mutations may be simultaneously available and may compete with one another, slowing adaptation. By finding the probability of fixation of a favorable allele in a simple model of a haploid sexual population, we find limits to the rate of adaptive substitution, $\Lambda$, that depend on simple parameter combinations. When variance in fitness is low and linkage is loose, the baseline rate of substitution is $\Lambda_0 = 2NUs$, where $N$ is the population size, $U$ is the rate of beneficial mutations per genome, and $s$ is their mean selective advantage. Heritable variance $v$ in log fitness due to unlinked loci reduces $\Lambda$ by $e^{-4v}$ under polygamy and $e^{-8v}$ under monogamy. With a linear genetic map of length $R$ Morgans, interference is yet stronger. We use a scaling argument to show that the density of adaptive substitutions depends on $s$, $N$, $U$, and $R$ only through the baseline density: $\Lambda/R = F(\Lambda_0/R)$. Under the approximation that the interference due to different sweeps adds up, we show that $\Lambda/R \sim (\Lambda_0/R)/(1 + 2\Lambda_0/R)$, implying that interference prevents the rate of adaptive substitution from exceeding one per centimorgan per 200 generations. Simulations and numerical calculations confirm the scaling argument and confirm the additive approximation for $\Lambda_0/R \sim 1$; for higher $\Lambda_0/R$, the rate of adaptation grows above $R/2$, but only very slowly. We also consider the effect of sweeps on neutral diversity and show that, while even occasional sweeps can greatly reduce neutral diversity, this effect saturates as sweeps become more common: diversity can be maintained even in populations experiencing very strong interference. Our results indicate that for some organisms the rate of adaptive substitution may be primarily recombination-limited, depending only weakly on the mutation supply and the strength of selection.
In small populations, adaptation may be limited by a lack of beneficial alleles on which selection can act; in such populations, increasing the supply of mutations (by increasing the population size or the rate of beneficial mutation per individual) proportionally increases the rate of adaptation. However, when multiple beneficial mutations arise simultaneously, they will typically occur in different individuals and will compete against each other, slowing adaptation. Recombination (sex) alleviates this interference among mutations by bringing them together in the same individuals. By analyzing and simulating a simple model of an adapting sexual population, we find that interference prevents the rate of adaptive substitutions from greatly exceeding one substitution per centimorgan in every 200 generations. Populations with infrequent outcrossing, such as many microbes and plants, may approach this limit. In these populations, the rate of adaptive substitutions is hardly affected by increasing the mutation supply or the strength of selection, but grows proportionally (up to very high rates) as recombination increases.
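The recombination limit described above can be made concrete with a short numerical sketch. It rests on two assumptions that should be read as illustrative rather than as the paper's exact results: that the baseline (interference-free) substitution rate is twice the product of population size, beneficial mutation rate per genome and mean selective advantage (2NUs), and that additive interference takes the saturating form Lambda = Lambda0 / (1 + 2 * Lambda0 / R) for a map of R Morgans. This form reproduces the stated cap of roughly one substitution per centimorgan per 200 generations. Function and parameter names are hypothetical.

```python
def baseline_rate(pop_size, mutation_rate, selection):
    """Assumed baseline substitution rate per generation, 2*N*U*s (no interference)."""
    return 2.0 * pop_size * mutation_rate * selection


def interfered_rate(baseline, map_length):
    """Assumed additive-interference form: Lambda = Lambda0 / (1 + 2*Lambda0/R).

    map_length is the genetic map length R in Morgans; the rate saturates at R/2.
    """
    return baseline / (1.0 + 2.0 * baseline / map_length)


def per_centimorgan_per_200_generations(rate, map_length):
    """Convert a per-generation rate into substitutions per cM per 200 generations."""
    centimorgans = 100.0 * map_length
    return rate * 200.0 / centimorgans


# Hypothetical parameter values chosen only to show the saturation
map_length = 10.0  # Morgans
for mutation_rate in (1e-6, 1e-4, 1e-2, 1.0):
    lam0 = baseline_rate(pop_size=1e6, mutation_rate=mutation_rate, selection=0.01)
    lam = interfered_rate(lam0, map_length)
    print(mutation_rate, lam0, lam, per_centimorgan_per_200_generations(lam, map_length))
```

When the baseline rate is small relative to the map length the two rates nearly coincide; as the mutation supply grows the interfered rate flattens towards R/2 while the baseline keeps increasing, which is the sense in which adaptation becomes recombination-limited.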
Motivation: Genetic variation at classical HLA alleles influences many phenotypes, including susceptibility to autoimmune disease, resistance to pathogens and the risk of adverse drug reactions. However, classical HLA typing methods are often prohibitively expensive for large-scale studies. We previously described a method for imputing classical alleles from linked SNP genotype data. Here, we present a modification of the original algorithm implemented in a freely available software suite that combines local data preparation and QC with probabilistic imputation through a remote server.
Results: We introduce two modifications to the original algorithm. First, we present a novel SNP selection function that leads to pronounced increases (up by 40% in some scenarios) in call rate. Second, we develop a parallelized model building algorithm that allows us to process a reference set of over 2500 individuals. In a validation experiment, we show that our framework produces highly accurate HLA type imputations at class I and class II loci for independent datasets: at call rates of 95–99%, imputation accuracy is between 92% and 98% at the four-digit level and over 97% at the two-digit level. We demonstrate utility of the method through analysis of a genome-wide association study for psoriasis where there is a known classical HLA risk allele (HLA-C*06:02). We show that the imputed allele shows stronger association with disease than any single SNP within the region. The imputation framework, HLA*IMP, provides a powerful tool for dissecting the architecture of genetic risk within the HLA.
Availability: HLA*IMP, implemented in C++ and Perl, is available from http://oxfordhla.well.ox.ac.uk and is free for academic use.
Supplementary information: Supplementary data are available at Bioinformatics online.
An important paradigm in evolutionary genetics is that of a delicate balance between genetic variants that favorably boost host control of infection but which may unfavorably increase susceptibility to autoimmune disease. Here, we investigated whether patients with psoriasis, a common immune-mediated disease of the skin, are enriched for genetic variants that limit the ability of HIV-1 virus to replicate after infection. We analyzed the HLA class I and class II alleles of 1,727 Caucasian psoriasis cases and 3,581 controls and found that psoriasis patients are significantly more likely than controls to have gene variants that are protective against HIV-1 disease. This includes several HLA class I alleles associated with HIV-1 control; amino acid residues at HLA-B positions 67, 70, and 97 that mediate HIV-1 peptide binding; and the deletion polymorphism rs67384697 associated with high surface expression of HLA-C. We also found that the compound genotype KIR3DS1 plus HLA-B Bw4-80I, which respectively encode a natural killer cell activating receptor and its putative ligand, significantly increased psoriasis susceptibility. This compound genotype has also been associated with delay of progression to AIDS. Together, our results suggest that genetic variants that contribute to anti-viral immunity may predispose to the development of psoriasis.
Individuals with autoimmune disease generally demonstrate excessive immune system activation, leading to inflammation and damage of specific target organs. However, in some cases the detrimental effects of an overactive immune system might be counterbalanced by a beneficial effect in protecting against certain infections. In this study, we investigated whether patients with psoriasis, a common autoimmune disease of the skin, harbor genetic variants that are associated with an enhanced ability to limit replication of the HIV-1 virus. We profiled the HLA (human leukocyte antigen) immune genes located on chromosome 6 in 1,727 Caucasian psoriasis cases and 3,581 healthy controls and found that psoriasis patients are significantly more likely than controls to have gene variants that are protective against HIV-1 disease. We found that this enrichment for HIV-1 protective variants was unique to psoriasis and largely absent in patients with other autoimmune or inflammatory diseases such as rheumatoid arthritis, Crohn's disease, type 1 diabetes, type 2 diabetes, and coronary artery disease. Our results suggest the possibility that the excessive skin inflammation in psoriasis may be associated with activation of anti-viral immune pathways that were important to human ancestors who encountered viruses similar to HIV-1.
In humans, chromosome-number abnormalities have been associated with altered recombination and increased maternal age. Therefore, age-related effects on recombination are of major importance, especially in relation to the mechanisms involved in human trisomies. Here, we examine the relationship between maternal age and recombination rate in humans. We localized crossovers at high resolution by using over 600,000 markers genotyped in a panel of 69 French-Canadian pedigrees, revealing recombination events in 195 maternal meioses. Overall, we observed the general patterns of variation in fine-scale recombination rates previously reported in humans. However, we make the first observation of a significant decrease in recombination rates with advancing maternal age in humans, likely driven by chromosome-specific effects. The effect appears to be localized in the middle section of chromosomal arms and near subtelomeric regions. We postulate that, for some chromosomes, protection against non-disjunction provided by recombination becomes less efficient with advancing maternal age, which can be partly responsible for the higher rates of aneuploidy in older women. We propose a model that reconciles our findings with reported associations between maternal age and recombination in cases of trisomies.
Aging is a genetically and environmentally modulated process. One particular manifestation of aging in humans is the age-related changes that affect the female reproductive system. It is well established that chromosome-number abnormalities in offspring occur more frequently as maternal age advances, but the meiotic mechanisms involved remain unclear. Meiotic recombination has been associated with maternal age in different species but contrasting effects of maternal age on recombination rates have been reported among mammals. In this study, we found a decrease of recombination rates with increasing maternal age in a French-Canadian cohort, with the most pronounced decline possibly occurring before 32 years of age. We observed chromosome-specific age effects, and in older women recombination frequencies are notably reduced in the middle portion of chromosomal arms and near subtelomeric regions. No paternal age effect on recombination was found, highlighting differences in patterns of variation among sexes. Many studies have shown significant inter-individual variation in genome-wide recombination rates, and our results point to an additional, intra-individual source of variation in recombination rates among transmissions from the same mother.
Salmonella enterica is a bacterial pathogen that causes enteric fever and gastroenteritis in humans and animals. Although its population structure was long described as clonal, based on high linkage disequilibrium between loci typed by enzyme electrophoresis, recent examination of gene sequences has revealed that recombination plays an important evolutionary role. We sequenced around 10% of the core genome of 114 isolates of enterica using a resequencing microarray. Application of two different analysis methods (Structure and ClonalFrame) to our genomic data allowed us to define five clear lineages within S. enterica subspecies enterica, one of which is five times older than the other four and two thirds of the age of the whole subspecies. We show that some of these lineages display more evidence of recombination than others. We also demonstrate that some level of sexual isolation exists between the lineages, so that recombination has occurred predominantly between members of the same lineage. This pattern of recombination is compatible with expectations from the previously described ecological structuring of the enterica population as well as mechanistic barriers to recombination observed in laboratory experiments. In spite of their relatively low level of genetic differentiation, these lineages might therefore represent incipient species.
Salmonella enterica is a species of bacteria that causes severe diseases in humans and animals. We sequenced about a tenth of the genome from a broadly sampled collection of S. enterica. By comparing these genetic sequences, we were able to partially reconstruct the ancestry of this sample. We identified five lineages within S. enterica, one of which is almost as old as the common ancestor of our sample. We also found evidence for frequent homologous recombination in the ancestry of S. enterica, where fragments of genes from one individual bacterium are acquired by a distinct individual. These recombination events make the ancestry harder to reconstruct in its entirety, but also contain interesting information. We found in particular that recombination had happened more often between strains belonging to the same lineage than across lineage boundaries. This observation is compatible with the lineages of S. enterica becoming progressively isolated from each other, which could lead to their gradual splintering into new species.
Recombination between homologous, but non-allelic, stretches of DNA such as gene families, segmental duplications and repeat elements is an important source of mutation. In humans, recent studies have identified short DNA motifs that both determine the location of 40 per cent of meiotic cross-over hotspots and are significantly enriched at the breakpoints of recurrent non-allelic homologous recombination (NAHR) syndromes. Unexpectedly, the most highly penetrant form of the motif occurs on the background of an inactive repeat element family (THE1 elements) and the motif also has strong recombinogenic activity on currently active element families including Alu and LINE2 elements. Analysis of genetic variation among members of these repeat families indicates an important role for NAHR in their evolution. Given the potential for double-strand breaks within repeat DNA to cause pathological rearrangement, the association between repeats and hotspots is surprising. Here we consider possible explanations for why selection acting against NAHR has not eliminated hotspots from repeat DNA including mechanistic constraints, possible benefits to repeat DNA from recruiting hotspots and rapid evolution of the recombination machinery. I suggest that rapid evolution of hotspot motifs may, surprisingly, tend to favour sequences present in repeat DNA and outline the data required to differentiate between hypotheses.
recombination hotspot; mutation; repeat element
Previous genetic studies have suggested a history of sub-Saharan African gene flow into some West Eurasian populations after the initial dispersal out of Africa that occurred at least 45,000 years ago. However, there has been no accurate characterization of the proportion of mixture, or of its date. We analyze genome-wide polymorphism data from about 40 West Eurasian groups to show that almost all Southern Europeans have inherited 1%–3% African ancestry with an average mixture date of around 55 generations ago, consistent with North African gene flow at the end of the Roman Empire and subsequent Arab migrations. Levantine groups harbor 4%–15% African ancestry with an average mixture date of about 32 generations ago, consistent with close political, economic, and cultural links with Egypt in the late middle ages. We also detect 3%–5% sub-Saharan African ancestry in all eight of the diverse Jewish populations that we analyzed. For the Jewish admixture, we obtain an average estimated date of about 72 generations. This may reflect descent of these groups from a common ancestral population that already had some African ancestry prior to the Jewish Diasporas.
Southern Europeans and Middle Eastern populations are known to have inherited a small percentage of their genetic material from recent sub-Saharan African migrations, but there has been no estimate of the exact proportion of this gene flow, or of its date. Here, we apply genomic methods to show that the proportion of African ancestry in many Southern European groups is 1%–3%, in Middle Eastern groups is 4%–15%, and in Jewish groups is 3%–5%. To estimate the dates when the mixture occurred, we develop a novel method that estimates the size of chromosomal segments of distinct ancestry in individuals of mixed ancestry. We verify using computer simulations that the method produces useful estimates of population mixture dates up to 300 generations in the past. By applying the method to West Eurasians, we show that the dates in Southern Europeans are consistent with events during the Roman Empire and subsequent Arab migrations. The dates in the Jewish groups are older, consistent with events in classical or biblical times that may have occurred in the shared history of Jewish populations.
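As a rough illustration of how a mixture date can be read from the scale of ancestry segments, the sketch below assumes a single pulse of admixture, under which the correlation in ancestry between two markers is expected to decay approximately as exp(-n d), where n is the number of generations since mixture and d is the genetic distance in Morgans. The function names, the synthetic data, and the 29-year generation time are illustrative assumptions, not the procedure or parameters of the study.

```python
import math

def fit_admixture_generations(distances_morgans, correlations):
    """Estimate generations since admixture from a log-linear fit.

    Assumes correlation(d) = A * exp(-n * d), so log(correlation) is
    linear in d with slope -n. Returns the least-squares estimate of n.
    """
    xs = list(distances_morgans)
    ys = [math.log(c) for c in correlations]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

def generations_to_years(generations, years_per_generation=29.0):
    """Convert generations to years; the generation time is an assumption."""
    return generations * years_per_generation

# Synthetic, noise-free decay curve generated with n = 55 (hypothetical data).
distances = [0.005 * k for k in range(1, 21)]
true_generations = 55
synthetic = [0.1 * math.exp(-true_generations * d) for d in distances]

estimated_generations = fit_admixture_generations(distances, synthetic)
estimated_years = generations_to_years(estimated_generations)
```

Real data would be noisy and would call for a weighted or nonlinear fit; changing the assumed generation time rescales the date in years proportionally.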
Here we investigate the correlations between coding sequence substitutions as a function of their separation along the protein sequence. We consider both substitutions between the reference genomes of several Drosophilids as well as polymorphisms in a population sample of Zimbabwean Drosophila melanogaster. We find that amino acid substitutions are “clustered” along the protein sequence, that is, the frequency of additional substitutions is strongly enhanced within ≈10 residues of a first such substitution. No such clustering is observed for synonymous substitutions, supporting a “correlation length” associated with selection on proteins as the causative mechanism. Clustering is stronger between substitutions that arose in the same lineage than it is between substitutions that arose in different lineages. We consider several possible origins of clustering, concluding that epistasis (interactions between amino acids within a protein that affect function) and positional heterogeneity in the strength of purifying selection are primarily responsible. The role of epistasis is directly supported by the tendency of nearby substitutions that arose on the same lineage to preserve the total charge of the residues within the correlation length and by the preferential cosegregation of neighboring derived alleles in our population sample. We interpret the observed length scale of clustering as a statistical reflection of the functional locality (or modularity) of proteins: amino acids that are near each other on the protein backbone are more likely to contribute to, and collaborate toward, a common subfunction.
Genes are templates for proteins, yet evolutionary studies of genes and proteins often bear little resemblance. Analyses of gene evolution typically treat each codon independently, quantifying gene evolution by summing over the constituent codons. In contrast, studies of protein evolution generally incorporate protein structure and interactions between amino acids explicitly. We investigate correlations in the evolution of codons as a function of their distance from each other along the protein coding sequence. This approach is motivated by the expectation that codons near each other in sequence often encode amino acids belonging to the same functional unit. Consequently, these amino acids are more likely to interact and/or experience similar selective regimes, introducing correlation between the evolution of the underlying codons. We find codon evolution in Drosophilids to be correlated over a characteristic length scale of ≈10 codons. Specifically, the presence of a non-synonymous substitution substantially increases the probability of further such substitutions nearby, particularly within that lineage. Further analysis suggests both functional interactions between amino acids and correlation in the strength of selection contribute to this effect. These findings are relevant for understanding the relative importance of different modes of selection, and particularly the role of epistasis, in gene and protein evolution.
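One simple way to quantify clustering of this kind is to count pairs of substitutions at each separation and compare the count with what uniform, independent placement along the protein would give. The sketch below does this for a toy input; the function name, the input layout, and the uniform null model are illustrative assumptions and are not taken from the study. For a protein of length L carrying k substitutions, the expected number of pairs at separation d under the null is C(k,2) (L - d) / C(L,2).

```python
from collections import Counter
from math import comb

def separation_enrichment(proteins, max_separation):
    """Observed / expected pair counts at each separation (in residues).

    proteins: iterable of (length, positions), with positions given as
    residue indices of substitutions in that protein.
    Expected counts assume substitutions fall uniformly at random.
    """
    observed = Counter()
    expected = Counter()
    for length, positions in proteins:
        sites = sorted(set(positions))
        n_pairs = comb(len(sites), 2)
        # Tally observed separations between every pair of substitutions.
        for a in range(len(sites)):
            for b in range(a + 1, len(sites)):
                sep = sites[b] - sites[a]
                if sep <= max_separation:
                    observed[sep] += 1
        # A length-L sequence has L - sep position pairs at separation sep.
        total_position_pairs = comb(length, 2)
        for sep in range(1, min(max_separation, length - 1) + 1):
            expected[sep] += n_pairs * (length - sep) / total_position_pairs
    return {sep: observed[sep] / expected[sep]
            for sep in range(1, max_separation + 1) if expected[sep] > 0}

# Toy input: one 100-residue protein with substitutions at 10, 12 and 50.
toy_proteins = [(100, [10, 12, 50])]
enrichment = separation_enrichment(toy_proteins, max_separation=5)
```

Values well above 1 at short separations would indicate clustering; running the same calculation on synonymous substitutions gives a natural control.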
In Drosophila, multiple lines of evidence converge in suggesting that beneficial substitutions to the genome may be common. All suffer from confounding factors, however, such that the interpretation of the evidence—in particular, conclusions about the rate and strength of beneficial substitutions—remains tentative. Here, we use genome-wide polymorphism data in D. simulans and sequenced genomes of its close relatives to construct a readily interpretable characterization of the effects of positive selection: the shape of average neutral diversity around amino acid substitutions. As expected under recurrent selective sweeps, we find a trough in diversity levels around amino acid but not around synonymous substitutions, a distinctive pattern that is not expected under alternative models. This characterization is richer than previous approaches, which relied on limited summaries of the data (e.g., the slope of a scatter plot), and relates to underlying selection parameters in a straightforward way, allowing us to make more reliable inferences about the prevalence and strength of adaptation. Specifically, we develop a coalescent-based model for the shape of the entire curve and use it to infer adaptive parameters by maximum likelihood. Our inference suggests that ∼13% of amino acid substitutions cause selective sweeps. Interestingly, it reveals two classes of beneficial fixations: a minority (approximately 3%) that appears to have had large selective effects and accounts for most of the reduction in diversity, and the remaining 10%, which seem to have had very weak selective effects. These estimates therefore help to reconcile the apparent conflict among previously published estimates of the strength of selection. More generally, our findings provide unequivocal evidence for strongly beneficial substitutions in Drosophila and illustrate how the rapidly accumulating genome-wide data can be leveraged to address enduring questions about the genetic basis of adaptation.
Characterizing the nature of beneficial changes to the genome is essential to our understanding of adaptation. To do so, researchers identify and analyze footprints that beneficial changes leave in patterns of genetic variation within and between species. In order to teach us about adaptive evolution, these footprints need to be specific to positive selection as well as rich enough to allow for reliable inferences. Here, we identify such a footprint: a pronounced trough in the average levels of genetic diversity surrounding amino acid substitutions throughout the D. simulans genome. Based on this pattern, we infer that approximately 13% of amino acid substitutions were beneficial, a minority of which (3%) conferred a large selective advantage of nearly 0.5% and the majority of which (10%) conferred a much smaller advantage of about 0.01%. These findings offer insights into the distribution of selection effects driving beneficial changes to the D. simulans genome and suggest how the widely varying estimates obtained in previous studies of Drosophila may be reconciled. Moreover, the approach that we introduce is readily applicable to other taxa and thus should help to gain important insights into how the rate and strength of adaptive evolution vary depending on life-history, population size, and ecology.
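The footprint described here is an average of diversity as a function of distance from substitutions. A minimal sketch of that averaging step is given below; it covers only the binning, not the coalescent model or the likelihood inference, and the function name, input layout, and toy values are hypothetical.

```python
def average_diversity_by_distance(focal_positions, site_diversity,
                                  bin_size, max_distance):
    """Mean diversity in distance bins around focal substitutions.

    focal_positions: genomic coordinates of substitutions.
    site_diversity: dict mapping coordinate -> diversity estimate.
    Returns one mean per bin; bins with no data are None.
    """
    n_bins = max_distance // bin_size
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for focal in focal_positions:
        for position, diversity in site_diversity.items():
            distance = abs(position - focal)
            if distance < n_bins * bin_size:
                index = distance // bin_size
                totals[index] += diversity
                counts[index] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]

# Toy data: diversity is lowest next to the focal substitution at 100.
toy_diversity = {95: 0.001, 105: 0.001, 80: 0.004, 120: 0.004, 150: 0.010}
profile = average_diversity_by_distance([100], toy_diversity,
                                        bin_size=10, max_distance=60)
```

Computing the same profile around synonymous substitutions provides the comparison against which a trough around amino acid substitutions can be judged.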
The last decade has witnessed important advances in our understanding of the genetics of pigmentation in European populations, but very little is known about the genes involved in skin pigmentation variation in East Asian populations. Here, we present the results of a study evaluating the association of 10 single nucleotide polymorphisms (SNPs) located within 5 pigmentation candidate genes (OCA2, DCT, ADAM17, ADAMTS20, and TYRP1) with skin pigmentation measured quantitatively in a sample of individuals of East Asian ancestry living in Canada. We show that the non-synonymous polymorphism rs1800414 (His615Arg) located within the OCA2 gene is significantly associated with skin pigmentation in this sample. We replicated this result in an independent sample of Chinese individuals of Han ancestry. This polymorphism is characterized by a derived allele that is present at a high frequency in East Asian populations, but is absent in other population groups. In both samples, individuals with the derived G allele, which codes for the amino acid arginine, show lower melanin levels than those with the ancestral A allele, which codes for the amino acid histidine. An analysis of this non-synonymous polymorphism using several programs to predict potential functional effects provides additional support for the role of this SNP in skin pigmentation variation in East Asian populations. Our results are consistent with previous research indicating that evolution to lightly pigmented skin occurred, at least in part, independently in Europe and East Asia.

Our knowledge of the genetic basis of normal pigmentation variation in human populations is quite incomplete. Recent studies have identified some of the genes responsible for the reduction in melanin content in European populations, but this is not the case for other population groups, such as East Asians. Here, we report that a genetic variant located within the gene OCA2 (rs1800414) is associated with skin pigmentation in two samples of East Asian ancestry. The allele associated with lower melanin levels is found at high frequencies in East Asian populations, but is absent or at very low frequencies in other population groups. This is one of the first reports of association of genetic markers with quantitative measures of pigmentation in East Asian populations and it confirms previous evidence indicating that evolution towards light skin occurred, at least in part, independently in Europe and East Asia. The OCA2 gene has been under positive selection in Europe and East Asia, but different alleles have been selected in each region.
Hotspots of meiotic recombination can change rapidly over time. This instability, together with the reported high level of inter-individual variation in meiotic recombination, calls into question the accuracy of the calculated hotspot map, which is based on the summation of past genetic crossovers. To estimate the accuracy of the computed recombination rate map, we have mapped genetic crossovers to a median resolution of 70 kb in 10 CEPH pedigrees. We then compared the positions of crossovers with the hotspots computed from HapMap data and performed extensive computer simulations to compare the observed distributions of crossovers with the distributions expected from the calculated recombination rate maps. Here we show that a population-averaged hotspot map computed from linkage disequilibrium data predicts present-day genetic crossovers well. We find that computed hotspot maps accurately estimate both the strength and the position of meiotic hotspots. An in-depth examination of crossovers that were not predicted shows that they are preferentially located in regions where hotspots are found in other populations. In summary, we find that by combining several computed population-specific maps we can capture the variation in individual hotspots to generate a hotspot map that can predict almost all present-day genetic crossovers.
In eukaryotes genetic crossovers are responsible for generating genetic diversity and ensuring the proper segregation of chromosomes. Genetic crossovers are tightly clustered in hotspots. Although the existence of hotspots in humans is clearly proven, mechanisms of their formation and the regulation of meiotic recombination in general remain poorly understood. An additional complication in studies of meiotic recombination is the fact that the direct experimental mapping of human hotspots on a genome-wide scale is not feasible with current methods. The best available indirect methods compute the position of hotspots from patterns of historic associations between genetic markers in population samples. In this study we determined the positions of genetic crossovers in ten pedigrees of European origin and then compared the positions of crossovers with the hotspots computed from HapMap data. Importantly, we find that the population-averaged computed map is in close agreement with the observed distribution of genetic crossovers. We also find that cryptic hotspots that are not easily detected in the computed European map can be more effectively identified if other populations are included in the analysis. Our analysis shows that high-resolution recombination profiles are highly similar between distantly related populations and that by including computed hotspots from several populations we can predict nearly all crossovers.
Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40–270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17–43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3–26.9 kya), and our analysis yields no evidence for subsequent migration. Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).
The demographic history of our species is reflected in patterns of genetic variation within and among populations. We developed an efficient method for calculating the expected distribution of genetic variation, given a demographic model including such events as population size changes, population splits and joins, and migration. We applied our approach to publicly available human sequencing data, searching for models that best reproduce the observed patterns. Our joint analysis of data from African, European, and Asian populations yielded new dates for when these populations diverged. In particular, we found that African and Eurasian populations diverged around 100,000 years ago. This is earlier than other genetic studies suggest, because our model includes the effects of migration, which we found to be important for reproducing observed patterns of variation in the data. We also analyzed data from European, Asian, and Mexican populations to model the peopling of the Americas. Here, we find no evidence for recurrent migration after East Asian and Native American populations diverged. Our methods are not limited to studying humans, and we hope that future sequencing projects will offer more insights into the history of both our own species and others.
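To make the central data structure concrete, the sketch below tabulates a two-population joint frequency spectrum from per-site derived-allele counts and scores it against a model's expected spectrum. The scoring treats each entry as an independent Poisson count, which is one common form of composite likelihood and is assumed here rather than taken from the text; computing the expected spectrum by diffusion approximation is not attempted. All names and the toy counts are hypothetical.

```python
import math

def joint_sfs(derived_counts_pop1, derived_counts_pop2, n1, n2):
    """Tabulate the joint frequency spectrum for two populations.

    Entry [i][j] is the number of sites where the derived allele is seen
    i times among n1 chromosomes of population 1 and j times among n2
    chromosomes of population 2.
    """
    spectrum = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i, j in zip(derived_counts_pop1, derived_counts_pop2):
        spectrum[i][j] += 1
    return spectrum

def poisson_composite_loglik(observed, expected):
    """Sum of Poisson log-probabilities over all spectrum entries.

    Assumes every expected entry is strictly positive. No entries are
    masked in this sketch.
    """
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for obs, exp in zip(obs_row, exp_row):
            total += obs * math.log(exp) - exp - math.lgamma(obs + 1)
    return total

# Toy data: five sites, two chromosomes sampled per population.
counts_pop1 = [1, 1, 2, 0, 1]
counts_pop2 = [0, 1, 2, 1, 0]
observed_sfs = joint_sfs(counts_pop1, counts_pop2, n1=2, n2=2)
```

Because linked sites are not independent, a likelihood of this form is only a composite one, which is why uncertainties are estimated by bootstrapping rather than read from the likelihood surface.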
Principal components analysis (PCA) is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's FST and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.
Genetic variation in natural populations typically demonstrates structure arising from diverse processes including geographical isolation, founder events, migration, and admixture. One technique commonly used to uncover such structure is principal components analysis, which identifies the primary axes of variation in data and projects the samples onto these axes in a graphically appealing and intuitive manner. However, as the method is non-parametric, it can be hard to relate PCA to underlying process. Here, I show that the underlying genealogical history of the samples can be related directly to the PC projection. The result is useful because it is straightforward to predict the effects of different demographic processes on the sample genealogy. However, the result also reveals the limitations of PCA, in that multiple processes can give the same projections, it is strongly influenced by uneven sampling, and it discards important information in the spatial structure of genetic variation along chromosomes.
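The link between genealogy and projection can be illustrated with a small numerical sketch. It assumes that the expected sample covariance is proportional to the double-centred matrix of pairwise coalescence times, M_ij = t_i. + t_.j - t_.. - t_ij, where dots denote row, column, and grand means. That form is stated here as an assumption, and the toy times, function names, and use of power iteration are illustrative choices.

```python
def expected_covariance(coal_times):
    """Double-centre a symmetric matrix of pairwise coalescence times."""
    n = len(coal_times)
    row_mean = [sum(row) / n for row in coal_times]
    grand_mean = sum(row_mean) / n
    return [[row_mean[i] + row_mean[j] - grand_mean - coal_times[i][j]
             for j in range(n)] for i in range(n)]

def leading_eigenvector(matrix, iterations=500):
    """Leading eigenvector by power iteration (unit length)."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    return vec

# Toy genealogy: samples 0-1 from one population, 2-3 from another.
# Within-population pairs coalesce at time 1, between-population pairs at 3.
toy_times = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]
pc1 = leading_eigenvector(expected_covariance(toy_times))
```

In this toy case the first axis places the two populations on opposite sides of the origin, with samples from the same population sharing a coordinate, which is the kind of behaviour the coalescent interpretation predicts for a simple two-population split.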
Infectious diseases have been paramount among the threats to health and survival throughout human evolutionary history. Natural selection is therefore expected to act strongly on host defense genes, particularly on innate immunity genes whose products mediate the direct interaction between the host and the microbial environment. In insects and mammals, the Toll-like receptors (TLRs) appear to play a major role in initiating innate immune responses against microbes. In humans, however, it has been speculated that the set of TLRs could be redundant for protective immunity. We investigated how natural selection has acted upon human TLRs, as an approach to assess their level of biological redundancy. We sequenced the ten human TLRs in a panel of 158 individuals from various populations worldwide and found that the intracellular TLRs—activated by nucleic acids and particularly specialized in viral recognition—have evolved under strong purifying selection, indicating their essential non-redundant role in host survival. Conversely, the selective constraints on the TLRs expressed on the cell surface—activated by compounds other than nucleic acids—have been much more relaxed, with higher rates of damaging nonsynonymous and stop mutations tolerated, suggesting their higher redundancy. Finally, we tested whether TLRs have experienced spatially-varying selection in human populations and found that the region encompassing TLR10-TLR1-TLR6 has been the target of recent positive selection among non-Africans. Our findings indicate that the different TLRs differ in their immunological redundancy, reflecting their distinct contributions to host defense. The insights gained in this study foster new hypotheses to be tested in clinical and epidemiological genetics of infectious disease.
The detrimental effects of microbial infections have led to the evolution of a variety of host defense mechanisms. A vast array of host innate immunity receptors, critical sensors of viruses, bacteria, and fungi, exist to achieve permanent surveillance of intruding pathogens. The best characterized class of microbial sensors is the Toll-like receptor (TLR) family, which elicits inflammatory and antimicrobial responses after activation by microbial products. Here we investigated how microbes have exerted selective pressure on the human TLR family to gain insights on the extent to which they are functionally important in the immune system. By resequencing the ten TLRs in different worldwide populations, we show that intracellular TLRs—principally specialized in viral recognition—evolve under strong purifying selection, indicating their essential role in host survival, while the remaining TLRs display higher levels of immunological redundancy. However, for this latter group of genes, we also show that mutations altering immune responses have been in some cases beneficial for host survival, as attested by the signature of positive selection favoring a reduced TLR1-mediated response in Europeans. Our findings taken together indicate that the different human TLRs differ in their biological relevance and provide clues to be experimentally tested in clinical, immunological, and epidemiological studies.