In the last few years, two paradigms underlying human evolution have crumbled. Modern humans have not totally replaced previous hominins without any admixture, and the expected signatures of adaptations to new environments are surprisingly lacking at the genomic level. Here we review current evidence about archaic admixture and lack of strong selective sweeps in humans. We underline the need to properly model differential admixture in various populations to correctly reconstruct past demography. We also stress the importance of taking into account the spatial dimension of human evolution, which proceeded by a series of range expansions that could have promoted both the introgression of archaic genes and background selection.
The evolution of sociality in spiders involves a transition from an outcrossing to a highly inbreeding mating system, a shift to a female biased sex ratio, and an increase in the reproductive skew among individuals. Taken together, these features are expected to result in a strong reduction in the effective population size. Such a decline in effective population size is expected to affect population genetic and molecular evolutionary processes, resulting in reduced genetic diversity and relaxed selective constraint across the genome. In the genus Stegodyphus, permanent sociality and regular inbreeding have evolved independently three times from periodic-social (outcrossing) ancestors. This genus is therefore an ideal model for comparative studies of the molecular evolutionary and population genetic consequences of the transition to a regularly inbreeding mating system. However, no genetic resources are available for this genus.
We present the analysis of high throughput transcriptome sequencing of three Stegodyphus species. Two of these are periodic-social (Stegodyphus lineatus and S. tentoriicola) and one is permanently social (S. mimosarum). From non-normalized cDNA libraries, we obtained on average 7,000 putative uni-genes for each species. Three-way orthology, as predicted from reciprocal BLAST, identified 1,792 genes that could be used for cross-species comparison. Open reading frames (ORFs) could be deduced from 1,345 of the three-way alignments. Preliminary molecular analyses suggest a five- to ten-fold reduction in heterozygosity in the social S. mimosarum compared with the periodic-social species. Furthermore, an increased ratio of non-synonymous to synonymous polymorphisms in the social species indicated relaxed efficiency of selection. However, there was no sign of relaxed selection on the phylogenetic branch leading to S. mimosarum.
The 1,792 three-way ortholog genes identified in this study provide a unique resource for comparative studies of the eco-genomics, population genetics and molecular evolution of repeated evolution of inbreeding sociality within the Stegodyphus genus. Preliminary analyses support theoretical expectations of depleted heterozygosity and relaxed selection in the social inbreeding species. Relaxed selection could not be detected in the S. mimosarum lineage, suggesting that there has been a recent transition to sociality in this species.
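The three-way orthology step described above can be illustrated with a minimal sketch. It assumes that orthologs are called as reciprocal best hits that agree across all three species pairs; the abstract only states that orthology was "predicted from reciprocal BLAST", so the exact criteria, the function names and the score dictionaries below are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score (hypothetical input format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


def three_way_orthologs(ab, ba, bc, cb, ac, ca):
    """Return triples (a, b, c) that are reciprocal best hits in all three pairs."""
    rbh_ab = reciprocal_best_hits(ab, ba)
    rbh_bc = reciprocal_best_hits(bc, cb)
    rbh_ac = reciprocal_best_hits(ac, ca)
    triples = []
    for a, b in rbh_ab.items():
        c = rbh_bc.get(b)
        # Require the A-C pairing to close the triangle consistently.
        if c is not None and rbh_ac.get(a) == c:
            triples.append((a, b, c))
    return sorted(triples)
```

Requiring all three pairwise relationships to agree is a conservative choice: it discards genes with ambiguous paralogs in any one species, which may be one reason why far fewer genes than the roughly 7,000 uni-genes per species survive to the three-way set.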
Recent results from Drosophila suggest that positive selection has a substantial impact on genomic patterns of polymorphism and divergence. However, species with smaller population sizes and/or stronger population structure may not be expected to exhibit Drosophila-like patterns of sequence variation. We test this prediction and identify determinants of levels of polymorphism and rates of protein evolution using genomic data from Arabidopsis thaliana and the recently sequenced Arabidopsis lyrata genome. We find that, in contrast to Drosophila, there is no negative relationship between nonsynonymous divergence and silent polymorphism at any spatial scale examined. Instead, synonymous divergence is a major predictor of silent polymorphism, which suggests variation in mutation rate as the main determinant of silent variation. Variation in rates of protein divergence is mainly correlated with gene expression level and breadth, consistent with results for a broad range of taxa, and map-based estimates of recombination rate are only weakly correlated with nonsynonymous divergence. Variation in mutation rates and the strength of purifying selection seem to be major drivers of patterns of polymorphism and divergence in Arabidopsis. Nevertheless, a model allowing for varying negative and positive selection by functional gene category explains the data better than a homogeneous model, implying the action of positive selection on a subset of genes. Genes involved in disease resistance and abiotic stress display high proportions of adaptive substitution. Our results are important for a general understanding of the determinants of rates of protein evolution and the impact of selection on patterns of polymorphism and divergence.
dN/dS; neutral theory; purifying selection; translational selection; recurrent hitchhiking
Due to genetic variation in the ancestor of two populations or two species, the divergence time for DNA sequences from two populations is variable along the genome. Within genomic segments all bases will share the same divergence—because they share a most recent common ancestor—when no recombination event has occurred to split them apart. The size of these segments of constant divergence depends on the recombination rate, but also on the speciation time, the effective population size of the ancestral population, as well as demographic effects and selection. Thus, inference of these parameters may be possible if we can decode the divergence times along a genomic alignment. Here, we present a new hidden Markov model that infers the changing divergence (coalescence) times along the genome alignment using a coalescent framework, in order to estimate the speciation time, the recombination rate, and the ancestral effective population size. The model is efficient enough to allow inference on whole-genome data sets. We first investigate the power and consistency of the model with coalescent simulations and then apply it to the whole-genome sequences of the two orangutan sub-species, Bornean (P. p. pygmaeus) and Sumatran (P. p. abelii) orangutans from the Orangutan Genome Project. We estimate the speciation time between the two sub-species to be thousand years ago and the effective population size of the ancestral orangutan species to be , consistent with recent results based on smaller data sets. We also report a negative correlation between chromosome size and ancestral effective population size, which we interpret as a signature of recombination increasing the efficacy of selection.
We present a hidden Markov model that uses variation in coalescence times between two distantly related populations, or closely related species, to infer population genetics parameters in the ancestral population or species. The model infers the divergence times in segments along the alignment. Using coalescent simulations, we show that the model accurately estimates the divergence time between the two populations and the effective population size of the ancestral population. We apply the model to the recently sequenced orangutan sub-species and estimate their divergence time and the effective population size of their ancestor population.
The fungus Mycosphaerella graminicola has been a pathogen of wheat since host domestication 10,000–12,000 years ago in the Fertile Crescent. The wheat-infecting lineage emerged from closely related Mycosphaerella pathogens infecting wild grasses. We use a comparative genomics approach to assess how the process of host specialization affected the genome structure of M. graminicola since divergence from the closest known progenitor species named M. graminicola S1. The genome of S1 was obtained by Illumina sequencing resulting in a 35 Mb draft genome sequence at 32X coverage. Assembled contigs were aligned to the previously sequenced M. graminicola genome. The alignment covered >90% of the non-repetitive portion of the M. graminicola genome with an average divergence of 7%. The sequenced M. graminicola strain is known to harbor thirteen essential chromosomes plus eight dispensable chromosomes. We found evidence that structural rearrangements significantly affected the dispensable chromosomes while the essential chromosomes were syntenic. At the nucleotide level, the essential and dispensable chromosomes have evolved differently. The average synonymous substitution rate in dispensable chromosomes is considerably lower than in essential chromosomes, whereas the average non-synonymous substitution rate is three times higher. Differences in molecular evolution can be related to different transmission and recombination patterns, as well as to differences in effective population sizes of essential and dispensable chromosomes. In order to identify genes potentially involved in host specialization or speciation, we calculated ratios of synonymous and non-synonymous substitution rates in the >9,500 aligned protein coding genes. The genes are generally under strong purifying selection. We identified 43 candidate genes showing evidence of positive selection, one encoding a potential pathogen effector protein.
We conclude that divergence of these pathogens was accompanied by structural rearrangements in the small dispensable chromosomes, while footprints of positive selection were present in only a small number of protein coding genes.
The fungal wheat pathogen Mycosphaerella graminicola emerged in the Middle East 11,000 years ago, coinciding with host domestication. We sequenced the genome of the closest known endemic relative of M. graminicola infecting wild grass hosts. A comparative genome analysis allowed us to infer how speciation and host specialization processes have influenced pathogen evolution. The wild grass-adapted pathogen can infect wheat, but M. graminicola shows a significantly higher degree of host specificity and virulence in a detached leaf assay. The genomes of the pathogens are 7% divergent with a high degree of synteny in the 13 essential core chromosomes. However, structural rearrangements have strongly affected eight small dispensable chromosomes. These chromosomes also show altered rates of non-synonymous and synonymous substitutions. We found 43 genes showing evidence of positive selection. As the divergence of species occurred very recently, these genes are likely involved in host specialization or speciation. None of the genes have a known function, although one encodes a signal peptide and is a potential pathogen effector. We conclude that the genomic basis of the rapid emergence of the wheat-specialized pathogen M. graminicola has involved structural changes in the eight dispensable chromosomes and positive selection in a small number of genes.
Although insertions and deletions (indels) account for a sizable portion of genetic changes within and among species, they have received little attention because they are difficult to type, are alignment dependent and their underlying mutational process is poorly understood. A fundamental question in this respect is whether insertions and deletions are governed by similar or different processes and, if so, what these differences are.
We use published resequencing data from Seattle SNPs and NIEHS human polymorphism databases to construct a genomewide data set of short polymorphic insertions and deletions in the human genome (n = 6228). We contrast these patterns of polymorphism with insertions and deletions fixed in the same regions since the divergence of human and chimpanzee (n = 10546). The macaque genome is used to resolve all indels into insertions and deletions. We find that the ratio of deletions to insertions is greater within humans than between human and chimpanzee. Deletions segregate at lower frequency in humans, providing evidence for deletions being under stronger purifying selection than insertions. The insertion and deletion rates correlate with several genomic features and we find evidence that both insertions and deletions are associated with point mutations. Finally, we find no evidence for a direct effect of the local recombination rate on the insertion and deletion rate.
Our data strongly suggest that deletions are more deleterious than insertions but that insertions and deletions are otherwise generally governed by the same genomic factors.
A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes of which 3, ERCC1, ERCC2, and PPP1R13L (aka RAI) are related to DNA repair and cell survival, and one, CD3EAP, aka ASE1, may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents and markers in it are associated with risk of several cancers.
We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk.
We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the Fst value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping.
The combination of a common haplotype causing an increased risk of cancer in Europeans and high differentiation between human populations is highly unusual and suggests a causal relationship with a recent increase in Europeans, caused either by genetic drift overruling selection against the susceptibility variant or by positive selection for the same haplotype. The data do not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.
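The Fst value referred to above, the proportion of overall genetic variation due to differences between groups, can be illustrated for a single biallelic marker and two equally weighted populations. This is a minimal sketch using the textbook heterozygosity-based definition; the abstract does not state which estimator was used, and the function names and example frequencies are hypothetical.

```python
def fst_biallelic(p1, p2):
    """Fst for one biallelic marker from allele frequencies in two populations.

    Uses Fst = (H_T - H_S) / H_T, with both populations weighted equally.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity of the pooled sample
    if h_total == 0.0:
        return 0.0  # marker is monomorphic in both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def position_of_max_fst(markers):
    """Return the position with the highest Fst.

    `markers` is a list of (position, p1, p2) tuples (hypothetical layout).
    """
    return max(markers, key=lambda m: fst_biallelic(m[1], m[2]))[0]
```

For example, `fst_biallelic(0.02, 0.50)` gives roughly 0.30, whereas two populations with equal frequencies give 0. Scanning such values along the region and locating the maximum is one simple way to compare the peak of differentiation with the location suggested by association mapping.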
Using the genomic sequences of the Drosophila melanogaster subgroup, the pattern of gene duplications was investigated with special attention to interlocus gene conversion. Our fine-scale analysis with careful visual inspections enabled accurate identification of a number of duplicated blocks (genomic regions). The orthologous parts of those duplicated blocks were also identified in the D. simulans and D. sechellia genomes, by which we were able to clearly classify the duplicated blocks into post- and pre-speciation blocks. We found 31 post-speciation duplicated genes, from which the rate of gene duplication (from one copy to two copies) is estimated to be 1.0×10⁻⁹ per single-copy gene per year. The role of interlocus gene conversion was observed in several respects in our data: (1) synonymous divergence between a duplicated pair is overall very low. Consequently, the gene duplication rate would be seriously overestimated by counting duplicated genes with low divergence; (2) the sizes of young duplicated blocks are generally large. We postulate that the degeneration of gene conversion around the edges could explain the shrinkage of “identifiable” duplicated regions; and (3) elevated paralogous divergence is observed around the edges in many duplicated blocks, supporting our gene conversion–degeneration model. Our analysis demonstrated that gene conversion between duplicated regions is a common and genome-wide phenomenon in the Drosophila genomes, and that its role should be especially significant in the early stages of duplicated genes. Based on a population genetic prediction, we applied a new genome-scan method to test for signatures of selection for neofunctionalization and found a strong signature in a pair of transporter genes.
Eukaryote genomes have a number of duplicated genes, which could potentially coevolve by exchanging DNA sequences by interlocus gene conversion. However, the extent of gene conversion on a genomic scale is not well understood, except that an extensive role of gene conversion was reported in yeast. Here, we show a second evaluation of the role of gene conversion by analyzing multiple genomes in the D. melanogaster subgroup. We found that most young duplicated genes have experienced gene conversion, although not as extensively as in yeast. We further performed fine-scale analysis of duplicated DNA sequences and estimated the gene duplication rate. Our estimate turned out to be much smaller than that of a commonly used method, which usually causes an overestimation when gene conversion is active. The role of positive selection for neofunctionalization was inferred by applying a novel test. Our results suggest that interlocus gene conversion could be a crucial mutational mechanism in the evolution of duplicated genes in eukaryote genomes and that the effect of gene conversion should be taken into account when analyzing molecular evolution of duplicated genes.
Recently diverged species typically have incomplete reproductive barriers, allowing introgression of genetic material from one species into the genomic background of the other. The role of natural selection in preventing or promoting introgression remains contentious. Because of genomic co-adaptation, some chromosomal fragments are expected to be selected against in the new background and resist introgression. In contrast, natural selection should favor introgression for alleles at genes evolving under multi-allelic balancing selection, such as the MHC in vertebrates, disease resistance, or self-incompatibility genes in plants. Here, we test the prediction that negative, frequency-dependent selection on alleles at the multi-allelic gene controlling pistil self-incompatibility specificity in two closely related species, Arabidopsis halleri and A. lyrata, caused introgression at this locus at a higher rate than the genomic background. Polymorphism at this gene is largely shared, and we have identified 18 pairs of S-alleles that are only slightly divergent between the two species. For these pairs of S-alleles, divergence at four-fold degenerate sites (K = 0.0193) is about four times lower than the genomic background (K = 0.0743). We demonstrate that this difference cannot be explained by differences in effective population size between the two types of loci. Rather, our data are most consistent with a five-fold increase of introgression rates for S-alleles as compared to the genomic background, making this study the first documented example of adaptive introgression facilitated by balancing selection. We suggest that this process plays an important role in the maintenance of high allelic diversity and divergence at the S-locus in flowering plant families. 
Because genes under balancing selection are expected to be among the last to stop introgressing, their comparison in closely related species provides a lower-bound estimate of the time since the species stopped forming fertile hybrids, thereby complementing the average portrait of divergence between species provided by genomic data.
The role of natural selection in promoting or preventing genomic divergence between nascent species remains highly debated. As long as reproductive barriers remain incomplete, genetic material from one species is indeed exposed to natural selection into the genomic background of the other species. In some cases, genomic co-adaptations developing independently in each species are believed to select against such transfers. Yet, theory predicts that the transfer of some chromosomal fragments may be favored by natural selection. In particular, this should occur for alleles at genes evolving under a particular form of natural selection, i.e., multi-allelic balancing selection. We test this prediction using two closely related Arabidopsis species, and find a four-fold lower divergence at alleles at the gene controlling pistil self-incompatibility specificity than at the genomic background. We conclude that alleles at this gene have been transferred more readily between the two species than the genomic background. We suggest that natural selection may efficiently allow the maintenance of high allelic diversity and divergence across many species at S-loci as well as at all other loci under multi-allelic balancing selection, such as the MHC in vertebrates or disease resistance genes in plants.
A newly arisen Y-chromosome can become established in one part of a species range by genetic drift or through the effects of selection on sexually antagonistic alleles. However, it is difficult to explain why it should then spread throughout the species range after this initial episode. As it spreads into new populations, it will actually enter females. It would then be expected to perform poorly since it will have been shaped by the selective regime of the male-only environment from which it came. We address this problem using computer models of hybrid zone dynamics where a neo-XY chromosomal race meets the ancestral karyotype. Our models consider that the neo-Y was established by the fusion of an autosome with the ancestral X-chromosome (thereby creating the Y and the ‘fused X’). Our principal finding is that sexually antagonistic effects of the Y induce indirect selection in favour of the fused X-chromosomes, causing their spread. The Y-chromosome can then spread, protected behind the advancing shield of the fused X distribution. This mode of spread provides a robust explanation of how newly arisen Y-chromosomes can spread. A Y-chromosome would be expected to accumulate mutations that would cause it to be selected against when it is a rare newly arrived migrant. The Y can spread, nevertheless, because of the indirect selection induced by gene flow (which can only be observed in models comprising multiple populations). These results suggest a fundamental re-evaluation of sex-chromosome hybrid zones. The well-understood evolutionary events that initiate the Y-chromosome's degeneration will actually fuel its range expansion.
Comparisons between related species have shown that, over evolutionary time scales, Y-chromosomes tend to degenerate and can be completely lost. How then can we explain the persistence of Y-chromosomes to the present? One possibility is that losses are counter-balanced by the origin of new Y chromosomes, which then spread throughout the species in which they have arisen. The first of these two processes, the generation of new Y chromosomes, is more readily understood: it can occur if an autosome (a non-sex chromosome) fuses with an X chromosome. This form might become established in one locality. However, its subsequent geographic spread has been more challenging to explain. Problems arise if gene flow carries the new sex chromosomes to another part of the species range. Crosses can then occur which introduce the new Y chromosome into females, who are expected to suffer reduced fitness. The new sex chromosomes are therefore selected against when they are in the minority. We use simulations to show that they can nevertheless spread, if they meet the ancestral forms at a front so the chromosomes intermingle in a hybrid zone. Paradoxically, the degeneration of the Y will actually intensify selection, thereby speeding its spread.
The genealogical relationship of human, chimpanzee, and gorilla varies along the genome. We develop a hidden Markov model (HMM) that incorporates this variation and relate the model parameters to population genetics quantities such as speciation times and ancestral population sizes. Our HMM is an analytically tractable approximation to the coalescent process with recombination, and in simulations we see no apparent bias in the HMM estimates. We apply the HMM to four autosomal contiguous human–chimp–gorilla–orangutan alignments comprising a total of 1.9 million base pairs. We find a very recent speciation time of human–chimp (4.1 ± 0.4 million years), and fairly large ancestral effective population sizes (65,000 ± 30,000 for the human–chimp ancestor and 45,000 ± 10,000 for the human–chimp–gorilla ancestor). Furthermore, around 50% of the human genome coalesces with chimpanzee after speciation with gorilla. We also consider 250,000 base pairs of X-chromosome alignments and find an effective population size much smaller than 75% of the autosomal effective population sizes. Finally, we find that the rate of transitions between different genealogies correlates well with the region-wide present-day human recombination rate, but does not correlate with the fine-scale recombination rates and recombination hot spots, suggesting that the latter are evolutionarily transient.
Primate evolution is a central topic in biology and much information can be obtained from DNA sequence data. A key parameter is the time “when we became human,” i.e., the time in the past when descendants of the human–chimp ancestor split into human and chimpanzee. Other important parameters are the time in the past when descendants of the human–chimp–gorilla ancestor split into descendants of the human–chimp ancestor and the gorilla ancestor, and population sizes of the human–chimp and human–chimp–gorilla ancestors. To estimate speciation times and ancestral population sizes we have developed a new methodology that explicitly utilizes the spatial information in contiguous genome alignments. Furthermore, we have applied this methodology to four long autosomal human–chimp–gorilla–orangutan alignments and estimated a very recent speciation time of human and chimp (around 4 million years) and ancestral population sizes much larger than the present-day human effective population size. We also analyzed X-chromosome sequence data and found that the X chromosome has experienced a different history from that of autosomes, possibly because of selection.
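The link between ancestral population size, the interval between the two speciation events and the share of the genome with a deep coalescence can be made concrete with the standard coalescent result that two lineages fail to coalesce during t generations in a diploid population of effective size N with probability exp(-t / 2N). The sketch below rests on that textbook formula only; it is not the HMM itself, the function names are hypothetical, and reading the reported figure of around 50% as this probability is an assumption.

```python
import math


def prob_no_coalescence(generations, ne):
    """Probability that two lineages do NOT coalesce within `generations`
    in a diploid population of constant effective size `ne`."""
    return math.exp(-generations / (2.0 * ne))


def prob_discordant_genealogy(generations, ne):
    """Probability that the gene tree differs from the species tree.

    If human and chimp lineages enter the human-chimp-gorilla ancestor
    uncoalesced, the first of the three possible pairs to coalesce is
    random, so two out of three outcomes are discordant.
    """
    return (2.0 / 3.0) * prob_no_coalescence(generations, ne)


def interval_for_deep_coalescence(fraction, ne):
    """Interval (in generations) between speciation events that leaves
    `fraction` of the genome uncoalesced in the more recent ancestor."""
    return -2.0 * ne * math.log(fraction)
```

Under these assumptions, an ancestral size of 65,000 and a deep-coalescence fraction of 0.5 correspond to an interval of 2N ln 2, about 90,000 generations, and to roughly one third of the genome carrying a genealogy that disagrees with the species tree. Converting generations to years would require a generation time, which is not given here.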
With current technology, vast amounts of data can be cheaply and efficiently produced in association studies. To prevent data analysis from becoming the bottleneck of such studies, fast and efficient analysis methods that scale to these data set sizes must be developed.
We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatible markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing other methods, in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate. However, in more complex scenarios of mutation heterogeneity and more complex haplotype structure, such as found in the HapMap data, our method outperforms SMA as well as other fast data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method.
The method was also found to accurately localise a known susceptibility variant in an empirical data set (the ΔF508 mutation for cystic fibrosis) and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene.
Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
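The notion of a region "compatible with a single phylogenetic tree" can be illustrated with the four-gamete test: for binary markers, a set of sites admits a perfect phylogeny exactly when no pair of sites shows all four haplotypes 00, 01, 10 and 11. The sketch below is an assumption-laden illustration rather than the Blossoc implementation; the function names are hypothetical, and the window is grown greedily (left before right), which need not find the largest compatible region.

```python
def compatible(site_a, site_b):
    """Four-gamete test for two binary sites.

    Each site is a sequence of 0/1 alleles, one per haplotype.
    Two sites are compatible if fewer than four distinct gametes occur.
    """
    return len(set(zip(site_a, site_b))) < 4


def compatible_region(sites, centre):
    """Greedily grow a window around sites[centre] in which all pairs of
    sites pass the four-gamete test. Returns inclusive (left, right) indices."""
    left = right = centre

    def fits(new):
        # A new site may join only if it is compatible with every site in the window.
        return all(compatible(sites[new], sites[k]) for k in range(left, right + 1))

    grew = True
    while grew:
        grew = False
        if left > 0 and fits(left - 1):
            left -= 1
            grew = True
        if right < len(sites) - 1 and fits(right + 1):
            right += 1
            grew = True
    return left, right
```

Once such a region is found, a tree can be built from its sites and scored by how well its clades separate case from control chromosomes, which is the decision-tree view described above.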
Coalescent simulations are playing a large role in interpreting large scale intra-specific sequence or polymorphism surveys and for planning and evaluating association studies. Coalescent simulations of data sets under different models can be compared to the actual data to test the importance of different evolutionary factors and thus get insight into these.
We have created the CoaSim application as a flexible environment for Monte Carlo simulation of various types of genetic data under equilibrium and non-equilibrium coalescent processes for a variety of applications. Interaction with the tool is through the Guile version of the Scheme scripting language. Scheme scripts for many standard and advanced applications are provided and these can easily be modified by the user for a much wider range of applications. A graphical user interface with less functionality and flexibility is also included. It is primarily intended as an exploratory and educational tool.
CoaSim is a powerful tool because of its flexibility and ease of use. This is illustrated through very varied uses of the application, e.g. evaluation of association mapping methods, parametric bootstrapping, and design and choice of markers for specific questions.
The advent of live-attenuated vaccines against measles virus during the 1960s changed the circulation dynamics of the virus. Earlier the virus was indigenous to countries worldwide, but now it is represented by a limited number of evolutionary lineages causing sporadic outbreaks/epidemics of measles or circulating in geographically restricted endemic areas of Africa, Asia and Europe. We expect that the evolutionary dynamics of measles virus have changed from a situation where a variety of genomic variants co-circulate in an epidemic, with relatively high probabilities of co-infection of the individual, to a situation where co-infection with strains from evolutionarily different lineages is unlikely.
We performed an analysis of the partial sequences of the hemagglutinin gene of 18 measles virus strains collected between 1965 and 1983 in Denmark, where vaccination was first initiated in 1987. The results were compared with those obtained with strains collected from other parts of the world after the initiation of vaccination in the given place. Intergenomic recombination among pre-/early-vaccination strains is suggested by 1) estimations of linkage disequilibrium between informative sites, 2) the decay of linkage disequilibrium with distance between informative sites and 3) a comparison of the expected number of homoplasies to the number of apparent homoplasies in the most parsimonious tree. No significant evidence of recombination could be demonstrated among strains circulating at present.
We provide evidence that recombination can occur in measles virus and that it has had a detectable impact on sequence evolution of pre-vaccination samples. We were not able to detect recombination from present-day sequence surveys. We believe that the decreased rate of visible recombination may be explained by changed dynamics, since divergent strains do not meet very often in current epidemics that are often spawned by a single sequence type. Signs of pre-vaccination recombination events in the present-day sequences are not strong enough to be detectable.
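The first two lines of evidence, linkage disequilibrium between informative sites and its decay with distance, can be sketched as follows. The abstract does not say which linkage disequilibrium statistic was used; the squared correlation r² between biallelic sites is assumed here, and the function names and input layout are hypothetical.

```python
def r_squared(site_a, site_b):
    """Squared correlation (r^2) between two biallelic sites.

    Each site is a sequence of 0/1 alleles, one per sampled sequence.
    Returns None if either site is monomorphic.
    """
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b  # classical disequilibrium coefficient D
    return d * d / denom


def ld_by_distance(positions, sites):
    """Return (distance, r^2) for every pair of sites, sorted by distance."""
    pairs = []
    for i in range(len(sites)):
        for j in range(i + 1, len(sites)):
            r2 = r_squared(sites[i], sites[j])
            if r2 is not None:
                pairs.append((positions[j] - positions[i], r2))
    return sorted(pairs)
```

In the absence of recombination, r² is not expected to depend on the distance between sites, so a negative trend in the output of `ld_by_distance` is the kind of pattern that points towards recombination. Assessing its significance would require a permutation or similar test, which is not shown.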
Comparative whole genome analysis of Mammalia can benefit from the addition of more species. The pig is an obvious choice due to its economic and medical importance as well as its evolutionary position in the artiodactyls.
We have generated ~3.84 million shotgun sequences (0.66X coverage) from the pig genome. The data are hereby released (NCBI Trace repository with center name "SDJVP", and project name "Sino-Danish Pig Genome Project") together with an initial evolutionary analysis.
The non-repetitive fraction of the sequences was aligned to the UCSC human-mouse alignment and the resulting three-species alignments were annotated using the human genome annotation. Ultra-conserved elements and miRNAs were identified. The results show that for each of these types of orthologous data, pig is much closer to human than mouse is. Purifying selection has been more efficient in pig compared to human, but not as efficient as in mouse, and pig seems to have an isochore structure most similar to the structure in human.
The addition of the pig to the set of species sequenced at low coverage adds to the understanding of selective pressures that have acted on the human genome by bisecting the evolutionary branch between human and mouse with the mouse branch being approximately 3 times as long as the human branch. Additionally, the joint alignment of the shotgun sequences to the human-mouse alignment offers the investigator a rapid way to define specific regions for analysis and resequencing.
Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours [1–4], and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.
We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousand years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.
Next-generation sequencing technology has enabled the generation of whole-genome data for many closely related species. For population genetic inference we have sequenced many loci, but only in a few individuals. We present a new method that allows inference of the divergence process based on two closely related genomes, modelled as gradual isolation in an isolation with migration model. This allows estimation of the initial time of restricted gene flow, the cessation of gene flow, as well as the population sizes, migration rates, and recombination rates. We show by simulations that the parameter estimation is accurate with genome-wide data and use the model to disentangle the divergence processes among three sets of closely related great ape species: bonobo/chimpanzee, eastern/western gorillas, and Sumatran/Bornean orang-utans. We find allopatric speciation for bonobo and chimpanzee and non-allopatric speciation for the gorillas and orang-utans. We also consider the split between humans and chimpanzees/bonobos and find evidence for non-allopatric speciation, similar to that within gorillas and orang-utans.
An outstanding question in human genetics has been the degree to which adaptation occurs from standing genetic variation or from de novo mutations. Here, we combine several common statistics used to detect selection in an Approximate Bayesian Computation (ABC) framework, with the goal of discriminating between models of selection and providing estimates of the age of selected alleles and the selection coefficients acting on them. We use simulations to assess the power and accuracy of our method and apply it to seven of the strongest sweeps currently known in humans. We identify two genes, ASPM and PSCA, that are most likely affected by selection on standing variation; and we find three genes, ADH1B, LCT, and EDAR, in which the adaptive alleles seem to have swept from a new mutation. We also confirm evidence of selection for one further gene, TRPV6. In one gene, G6PD, neither neutral models nor models of selective sweeps fit the data, presumably because this locus has been subject to balancing selection.
Considerable effort has been devoted to detecting genes that are under natural selection, and hundreds of such genes have been identified in previous studies. Here, we present a method for extending these studies by inferring parameters, such as selection coefficients and the time when a selected variant arose. Of particular interest is the question whether the selective pressure was already present when the selected variant was first introduced into a population. In this case, the variant would be selected right after it originated in the population, a process we call selection from a de novo mutation. We contrast this with selection from standing variation, where the selected variant predates the selective pressure. We present a method to distinguish these two scenarios, test its accuracy, and apply it to seven human genes. We find three genes, ADH1B, EDAR, and LCT, that were presumably selected from a de novo mutation and two other genes, ASPM and PSCA, which we infer to be under selection from standing variation.
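To make the general idea of Approximate Bayesian Computation concrete, the sketch below runs a plain rejection sampler that estimates a selection coefficient from a single summary statistic (the sampled frequency of the selected allele). It is a toy under strong assumptions and not the method described above: the deterministic logistic trajectory, the uniform prior on s between 0 and 0.1, the starting frequency, the number of generations, and all function names are hypothetical choices, and the actual study combines several selection statistics and compares competing models.

```python
import math
import random


def simulate_frequency(s, generations, p0, n_sample, rng):
    """Simulate the sampled frequency of a selected allele.

    Assumption: the population frequency follows a deterministic logistic
    trajectory (logit grows by roughly s per generation); only the final
    sampling of n_sample chromosomes is random.
    """
    logit = math.log(p0 / (1 - p0)) + s * generations
    p = 1 / (1 + math.exp(-logit))
    count = sum(1 for _ in range(n_sample) if rng.random() < p)
    return count / n_sample


def abc_rejection(observed, n_draws, tolerance, rng,
                  generations=200, p0=0.01, n_sample=100):
    """Keep prior draws of s whose simulated summary is close to the data."""
    accepted = []
    for _ in range(n_draws):
        s = rng.uniform(0.0, 0.1)              # draw from the (assumed) prior
        simulated = simulate_frequency(s, generations, p0, n_sample, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(s)                 # approximate posterior sample
    return accepted


if __name__ == "__main__":
    rng = random.Random(1)
    posterior = abc_rejection(observed=0.5, n_draws=2000, tolerance=0.05, rng=rng)
    print(len(posterior), sum(posterior) / len(posterior))
```

The accepted values approximate the posterior distribution of s; narrowing the tolerance improves the approximation at the cost of fewer accepted draws.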
Non-human primates have emerged as an important resource for the study of human disease and evolution. The characterization of genomic variation between and within non-human primate species could advance the development of genetically defined non-human primate disease models. However, non-human primate specific reagents that would expedite such research, such as exon-capture tools, are lacking. We evaluated the efficiency of using a human exome capture design for the selective enrichment of exonic regions of non-human primates. We compared the exon sequence recovery in nine chimpanzees, two crab-eating macaques and eight Japanese macaques. Over 91% of the target regions were captured in the non-human primate samples, although the specificity of the capture decreased as evolutionary divergence from humans increased. Both intra-specific and inter-specific DNA variants were identified; Sanger-based resequencing validated 85.4% of 41 randomly selected SNPs. Among the short indels identified, a majority (54.6%–77.3%) of the variants resulted in a change of 3 base pairs, consistent with expectations for a selection against frame shift mutations. Taken together, these findings indicate that use of a human design exon-capture array can provide efficient enrichment of non-human primate gene regions. Accordingly, use of the human exon-capture methods provides an attractive, cost-effective approach for the comparative analysis of non-human primate genomes, including gene-based DNA variant discovery.
Gene conversion is the unidirectional transfer of genetic information between orthologous (allelic) or paralogous (nonallelic) genomic segments. Though a number of studies have examined nucleotide replacements, little is known about length difference mutations produced by gene conversion. Here, we investigate insertions and deletions produced by nonallelic gene conversion in 338 Drosophila and 10,149 primate paralogs. Using a direct phylogenetic approach, we identify 179 insertions and 614 deletions in Drosophila paralogs, and 132 insertions and 455 deletions in primate paralogs. Thus, nonallelic gene conversion is strongly deletion-biased in both lineages, with almost 3.5 times as many conversion-induced deletions as insertions. In primates, the deletion bias is considerably stronger for long indels and, in both lineages, the per-site rate of gene conversion is orders of magnitude higher than that of ordinary mutation. Due to this high rate, deletion-biased nonallelic gene conversion plays a key role in genome size evolution, leading to the cooperative shrinkage and eventual disappearance of selectively neutral paralogs.
Gene conversion is a process whereby a DNA sequence is copied from one segment of the genome (donor) to another (recipient), resulting in the replacement, insertion, or deletion of a DNA sequence in the recipient. This exchange is facilitated by the high sequence similarity of the two segments, which is due to their evolutionary relationship. Here, we study insertions and deletions produced by gene conversion between paralogs, segments related by DNA duplication events. By comparing paralog sequences in multiple species of fruit flies and primates, we find that deletions occur more than three times as frequently as insertions. We also discover that the rate of gene conversion between paralogs is quite high. The deletion bias and high rate of this process cause paralogs to shrink cooperatively and eventually be eliminated from the genome. Because of the abundance of paralogs in animal genomes, this phenomenon can lead to a significant reduction in genome size. Therefore, our finding enhances our understanding of the forces that lead to changes in genome size during evolution.
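The deletion bias quoted above follows directly from the reported indel counts. The short check below recomputes the deletion-to-insertion ratio for each lineage; the counts are taken from the text, while the dictionary layout and function name are only illustrative.

```python
# Conversion-induced indel counts as reported in the text: (insertions, deletions).
INDEL_COUNTS = {
    "Drosophila": (179, 614),
    "primates": (132, 455),
}


def deletion_bias(insertions, deletions):
    """Number of deletions observed per insertion."""
    return deletions / insertions


if __name__ == "__main__":
    for lineage, (ins, dels) in INDEL_COUNTS.items():
        print(f"{lineage}: {deletion_bias(ins, dels):.2f} deletions per insertion")
```

Both ratios come out slightly below 3.5, in line with the "almost 3.5 times" and "more than three times" statements.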
North African populations are distinct from sub-Saharan Africans based on cultural, linguistic, and phenotypic attributes; however, the time and the extent of genetic divergence between populations north and south of the Sahara remain poorly understood. Here, we interrogate the multilayered history of North Africa by characterizing the effect of hypothesized migrations from the Near East, Europe, and sub-Saharan Africa on current genetic diversity. We present dense, genome-wide SNP genotyping array data (730,000 sites) from seven North African populations, spanning from Egypt to Morocco, and one Spanish population. We identify a gradient of likely autochthonous Maghrebi ancestry that increases from east to west across northern Africa; this ancestry is likely derived from “back-to-Africa” gene flow more than 12,000 years ago (ya), prior to the Holocene. The indigenous North African ancestry is more frequent in populations with historical Berber ethnicity. In most North African populations we also see substantial shared ancestry with the Near East, and to a lesser extent sub-Saharan Africa and Europe. To estimate the time of migration from sub-Saharan populations into North Africa, we implement a maximum likelihood dating method based on the distribution of migrant tracts. In order to first identify migrant tracts, we assign local ancestry to haplotypes using a novel, principal component-based analysis of three ancestral populations. We estimate that a migration of western African origin into Morocco began about 40 generations ago (approximately 1,200 ya); a migration of individuals with Nilotic ancestry into Egypt occurred about 25 generations ago (approximately 750 ya). Our genomic data reveal an extraordinarily complex history of migrations, involving at least five ancestral populations, into North Africa.
Proposed migrations between North Africa and neighboring regions have included Paleolithic gene flow from the Near East, an Arabic migration across the whole of North Africa 1,400 years ago (ya), and trans-Saharan transport of slaves from sub-Saharan Africa. Historical records, archaeology, and mitochondrial and Y-chromosome DNA have been marshaled in support of one theory or another, but there is little consensus regarding the overall genetic background of North African populations or their origin and expansion. We characterize the patterns of genetic variation in North Africa using ∼730,000 single nucleotide polymorphisms from across the genome for seven populations. We observe two distinct, opposite gradients of ancestry: an east-to-west increase in likely autochthonous North African ancestry and an east-to-west decrease in likely Near Eastern Arabic ancestry. The indigenous North African ancestry may have been more common in Berber populations and appears most closely related to populations outside of Africa, but divergence between Maghrebi peoples and Near Eastern/Europeans likely precedes the Holocene (>12,000 ya). We also find significant signatures of sub-Saharan African ancestry that vary substantially among populations. These sub-Saharan ancestries appear to be a recent introduction into North African populations, dating to about 1,200 years ago in southern Morocco and about 750 years ago into Egypt, possibly reflecting the patterns of the trans-Saharan slave trade that occurred during this period.
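The link between migrant tract lengths and a migration date can be sketched as follows. This is a deliberately simplified illustration and not the authors' dating method: it assumes a single pulse of migration, tract lengths measured in Morgans that are exponentially distributed with a rate approximately equal to the number of generations since migration, and a generation time of 30 years, which is the value implied by the pairing of 40 generations with about 1,200 years and 25 generations with about 750 years. The function names and the toy tract lengths are hypothetical.

```python
YEARS_PER_GENERATION = 30  # implied by 40 generations ~ 1,200 years in the text


def generations_since_migration(tract_lengths_morgans):
    """Maximum likelihood estimate under an exponential tract-length model.

    Assumption: lengths ~ Exponential(rate = T), with T in generations, so the
    MLE of T is the reciprocal of the mean tract length.
    """
    mean_length = sum(tract_lengths_morgans) / len(tract_lengths_morgans)
    return 1.0 / mean_length


def generations_to_years(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert a time in generations to calendar years."""
    return generations * years_per_generation


if __name__ == "__main__":
    toy_tracts = [0.02, 0.03, 0.025]           # made-up lengths in Morgans
    t_hat = generations_since_migration(toy_tracts)
    print(t_hat, generations_to_years(t_hat))
    print(generations_to_years(40), generations_to_years(25))
```

Older migrations leave shorter tracts because recombination has had more generations to break them up, which is why the mean tract length carries information about timing.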
In human cells, DNA double-strand breaks are repaired primarily by the non-homologous end joining (NHEJ) pathway. Given their critical nature, we expected NHEJ proteins to be evolutionarily conserved, with relatively little sequence change over time. Here, we report that while critical domains of these proteins are conserved as expected, the sequence of NHEJ proteins has also been shaped by recurrent positive selection, leading to rapid sequence evolution in other protein domains. In order to characterize the molecular evolution of the human NHEJ pathway, we generated large simian primate sequence datasets for NHEJ genes. Codon-based models of gene evolution yielded statistical support for the recurrent positive selection of five NHEJ genes during primate evolution: XRCC4, NBS1, Artemis, POLλ, and CtIP. Analysis of human polymorphism data using the composite of multiple signals (CMS) test revealed that XRCC4 has also been subjected to positive selection in modern humans. Crystal structures are available for XRCC4, Nbs1, and Polλ; and residues under positive selection fall exclusively on the surfaces of these proteins. Despite the positive selection of such residues, biochemical experiments with variants of one positively selected site in Nbs1 confirm that functions necessary for DNA repair and checkpoint signaling have been conserved. However, many viruses interact with the proteins of the NHEJ pathway as part of their infectious lifecycle. We propose that an ongoing evolutionary arms race between viruses and NHEJ genes may be driving the surprisingly rapid evolution of these critical genes.
Because all cells experience DNA damage, they must also have mechanisms for repairing DNA. When the proteins that repair DNA malfunction, mutation and disease often result. Based on their fundamental importance, DNA repair proteins would be expected to be well preserved over evolutionary time in order to ensure optimal DNA repair function. However, a previous genome-wide study of molecular evolution in Saccharomyces yeast identified the non-homologous end joining (NHEJ) DNA repair pathway as one of the two most rapidly evolving pathways in the yeast genome. In order to analyze the evolution of this pathway in humans, we have generated large evolutionary sequence sets of NHEJ genes from our primate relatives. Similar to the scenario in yeast, several genes in this pathway are evolving rapidly in primate genomes and in modern human populations. Thus, complex and seemingly opposite selective forces are shaping the evolution of these important DNA repair genes. The finding that NHEJ genes are rapidly evolving in species groups as diverse as yeasts and primates indicates a systematic perturbation of the NHEJ pathway, one that is potentially important to human health.
Haploinsufficiency, wherein a single functional copy of a gene is insufficient to maintain normal function, is a major cause of dominant disease. Human disease studies have identified several hundred haploinsufficient (HI) genes. We have compiled a map of 1,079 haplosufficient (HS) genes by systematic identification of genes unambiguously and repeatedly compromised by copy number variation among 8,458 apparently healthy individuals and contrasted the genomic, evolutionary, functional, and network properties between these HS genes and known HI genes. We found that HI genes are typically longer and have more conserved coding sequences and promoters than HS genes. HI genes exhibit higher levels of expression during early development and greater tissue specificity. Moreover, within a probabilistic human functional interaction network HI genes have more interaction partners and greater network proximity to other known HI genes. We built a predictive model on the basis of these differences and annotated 12,443 genes with their predicted probability of being haploinsufficient. We validated these predictions of haploinsufficiency by demonstrating that genes with a high predicted probability of exhibiting haploinsufficiency are enriched among genes implicated in human dominant diseases and among genes causing abnormal phenotypes in heterozygous knockout mice. We have transformed these gene-based haploinsufficiency predictions into haploinsufficiency scores for genic deletions, which we demonstrate to better discriminate between pathogenic and benign deletions than consideration of the deletion size or numbers of genes deleted. These robust predictions of haploinsufficiency support clinical interpretation of novel loss-of-function variants and prioritization of variants and genes for follow-up studies.
Humans, like most complex organisms, have two copies of most genes in their genome, one from the mother and one from the father. This redundancy provides a back-up copy for most genes, should one copy be lost through mutation. For a minority of genes, one functional copy is not enough to sustain normal human function, and mutations causing the loss of function of one of the copies of such genes are a major cause of childhood developmental diseases. Over the past 20 years medical geneticists have identified over 300 such genes, but it is not known how many of the 22,000 genes in our genome may also be sensitive to gene loss. By comparing these ∼300 genes known to be sensitive to gene loss with over 1,000 genes where loss of a single copy does not result in disease, we have identified some key evolutionary and functional similarities between genes sensitive to loss of a single copy. We have used these similarities to predict for most genes in the genome, whether loss of a single copy is likely to result in disease. These predictions will help in the interpretation of mutations seen in patients.
Population genetic theory predicts discordance in the true phylogeny of different genomic regions when studying recently diverged species. Despite this expectation, genome-wide discordance in young species groups has rarely been statistically quantified. The house mouse subspecies group provides a model system for examining phylogenetic discordance. House mouse subspecies are recently derived, suggesting that even if there has been a simple tree-like population history, gene trees could disagree with the population history due to incomplete lineage sorting. Subspecies of house mice also hybridize in nature, raising the possibility that recent introgression might lead to additional phylogenetic discordance. Single-locus approaches have revealed support for conflicting topologies, resulting in a subspecies tree often summarized as a polytomy. To analyze phylogenetic histories on a genomic scale, we applied a recently developed method, Bayesian concordance analysis, to dense SNP data from three closely related subspecies of house mice: Mus musculus musculus, M. m. castaneus, and M. m. domesticus. We documented substantial variation in phylogenetic history across the genome. Although each of the three possible topologies was strongly supported by a large number of loci, there was statistical evidence for a primary phylogenetic history in which M. m. musculus and M. m. castaneus are sister subspecies. These results underscore the importance of measuring phylogenetic discordance in other recently diverged groups using methods such as Bayesian concordance analysis, which are designed for this purpose.
The phylogenetic history of individual genes can differ strongly from the species history if taxa are recently derived, making inferences of a species history from only a handful of genes especially difficult in these cases. Genome-scale data sets now allow phylogenetic histories to be reconstructed from a large number of genes. Although data sets of this size are becoming more common, few studies have characterized variation in phylogenetic history across whole genomes. We summarize fine scale variation in phylogenetic history across the genome of house mice, a recently derived group of subspecies, using a method that combines phylogenetic uncertainty among gene trees. We document substantial variation in phylogenetic history among 14,081 loci and describe a primary history in the face of this variation. These results support the use of genome-scale datasets and methods that accommodate phylogenetic discordance in attempts to reconstruct the history of closely related groups.
Reproductive proteins are among the fastest evolving in the proteome, often due to the consequences of positive selection, and their rapid evolution is frequently attributed to a coevolutionary process between interacting female and male proteins. Such a process could leave characteristic signatures at coevolving genes. One signature of coevolution, predicted by sexual selection theory, is an association of alleles between the two genes. Another predicted signature is a correlation of evolutionary rates during divergence due to compensatory evolution. We studied female–male coevolution in the abalone by resequencing sperm lysin and its interacting egg coat protein, VERL, in populations of two species. As predicted, we found intergenic linkage disequilibrium between lysin and VERL, despite our demonstration that they are not physically linked. This finding supports a central prediction of sexual selection using actual genotypes, that of an association between a male trait and its female preference locus. We also created a novel likelihood method to show that lysin and VERL have experienced correlated rates of evolution. These two signatures of coevolution can provide statistical rigor to hypotheses of coevolution and could be exploited for identifying coevolving proteins a priori. We also present polymorphism-based evidence for positive selection and implicate recent selective events at the specific structural regions of lysin and VERL responsible for their species-specific interaction. Finally, we observed deep subdivision between VERL alleles in one species, which matches a theoretical prediction of sexual conflict. Thus, abalone fertilization proteins illustrate how coevolution can lead to reproductive barriers and potentially drive speciation.
When a sperm meets an egg, it must display the correct recognition proteins to achieve fertilization. Given the importance of fertilization one would think these proteins are perfected and do not change over time; however, recent studies show that they do change and quite rapidly. Thus, the sperm and egg must change together in harmony, through a process called coevolution, so the species can successfully reproduce. We followed the sperm–egg coevolutionary process at the level of genes: one that makes the protective egg coat and a sperm gene which opens that coat for fertilization. By examining their DNA sequences in two abalone species, we revealed two coevolutionary signatures. In one case, we discovered an association of variants between the egg and sperm genes, the origin of which could be strong preference for compatible variants. In the second case, we demonstrated that both genes changed at correlated rates over millions of years of evolution. Whenever one gene had accelerated in one species, the other showed a parallel acceleration in that same species. These unique signatures help us to understand coevolution by revealing its strength within natural populations and by showing that it has acted consistently over long time periods.