Population genetics theory supplies powerful predictions about how natural selection interacts with genetic linkage to sculpt the genomic landscape of nucleotide polymorphism. Both the spread of beneficial mutations and the removal of deleterious mutations act to depress polymorphism levels, especially in low-recombination regions. However, empiricists have documented extreme disparities among species. Here we characterize the dominant features that could drive variation in linked selection among species, including whether selective sweeps are ‘hard’ or ‘soft’ and the masking of linked selection by demography and genomic confounds. We advocate targeted studies of close relatives to unify our understanding of how selection and linkage interact to shape genome evolution.
The genome-wide scan for selection is an important method for identifying loci involved in adaptive evolution. However, theory that underlies standard scans for selection assumes a simple mutation model. In particular, recurrent mutation of the selective target is not considered. Although this assumption is reasonable for single-nucleotide variants (SNVs), a microsatellite targeted by selection will reliably violate this assumption due to high mutation rate. Moreover, the mutation rate of microsatellites is generally high enough to ensure that recurrent mutation is pervasive rather than occasional. It is therefore unclear if positive selection targeting microsatellites can be detected using standard scanning statistics. Examples of functional variation at microsatellites underscore the significance of understanding the genomic effects of microsatellite selection. Here, we investigate the joint effects of selection and complex mutation on linked sequence diversity, comparing simulations of microsatellite selection and SNV-based selective sweeps. We find that selection on microsatellites is generally difficult to detect using popular summaries of the site frequency spectrum, and, under certain conditions, using popular methods such as the integrated haplotype statistic and SweepFinder. However, comparisons of the number of haplotypes (K) and segregating sites (S) often provide considerable power to detect selection on microsatellites. We apply this knowledge to a scan of autosomes in the human CEU population (CEPH population sampled from Utah). In addition to the most commonly reported targets of selection in European populations, we identify numerous novel genomic regions that bear highly anomalous haplotype configurations. Using one of these regions, intron 1 of MAGI2, as an example, we show that the anomalous configuration is coincident with a perfect CA repeat of length 22. We conclude that standard genome-wide scans will commonly fail to detect mutationally complex targets of selection but that comparisons of K and S will, in many cases, facilitate their identification.
microsatellites; natural selection; genomic scans for selection; mutation; short tandem repeats
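The two summaries compared in the abstract above, the number of distinct haplotypes (K) and the number of segregating sites (S), can be computed from a window of phased haplotypes with a few lines of code. The following is a minimal sketch, not the authors' pipeline: the function name, the string encoding of haplotypes, and the toy window are assumptions made for illustration, and the abstract does not specify how K and S are combined into a test.

```python
from typing import Sequence, Tuple


def count_haplotypes_and_segregating_sites(haplotypes: Sequence[str]) -> Tuple[int, int]:
    """Return (K, S) for a window of aligned, phased haplotypes.

    Each haplotype is a string with one character per site
    (hypothetical encoding: '0' = ancestral, '1' = derived).
    """
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")

    # K: number of distinct haplotype strings in the sample.
    k = len(set(haplotypes))
    # S: number of sites (columns) at which more than one allele is observed.
    s = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    return k, s


# Toy window of five haplotypes over five sites (made-up data).
window = ["00000", "00000", "01001", "01001", "11001"]
k, s = count_haplotypes_and_segregating_sites(window)
print(k, s)
```

In a genome scan, a function like this would be applied to sliding windows, and windows whose K is unusual given their S would be flagged for follow-up.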
We use genotype data from the Marshfield Clinical Research Foundation Personalized Medicine Research Project to investigate genetic similarity and divergence between Europeans and the sampled population of European Americans in Central Wisconsin, USA. To infer recent genetic ancestry of the sampled Wisconsinites, we train support vector machines (SVMs) on the positions of Europeans along top principal components (PCs). Our SVM models partition continent-wide European genetic variance into eight regional classes, which is an improvement over the geographically broader categories of recent ancestry reported by personal genomics companies. After correcting for misclassification error associated with the SVMs (<10%, in all cases), we observe a >14% discrepancy between insular ancestries reported by Wisconsinites and those inferred by SVM. Values of FST as well as Mantel tests for correlation between genetic and European geographic distances indicate minimal divergence between Europe and the local Wisconsin population. However, we find that individuals from the Wisconsin sample show greater dispersion along higher-order PCs than individuals from Europe. Hypothesizing that this pattern is characteristic of nascent divergence, we run computer simulations that mimic the recent peopling of Wisconsin. Simulations corroborate the pattern in higher-order PCs, demonstrate its transient nature, and show that admixture accelerates the rate of divergence between the admixed population and its parental sources relative to drift alone. Together, empirical and simulation results suggest that genetic divergence between European source populations and European Americans in Central Wisconsin is subtle but already under way.
population structure; genetic ancestry; admixture; support vector machine; principal component analysis
Hybrid dysfunction, a common feature of reproductive barriers between species, is often caused by negative epistasis between loci (“Dobzhansky-Muller incompatibilities”). The nature and complexity of hybrid incompatibilities remain poorly understood because identifying interacting loci that affect complex phenotypes is difficult. With subspecies in the early stages of speciation, an array of genetic tools, and detailed knowledge of reproductive biology, house mice (Mus musculus) provide a model system for dissecting hybrid incompatibilities. Male hybrids between M. musculus subspecies often show reduced fertility. Previous studies identified loci and several X chromosome-autosome interactions that contribute to sterility. To characterize the genetic basis of hybrid sterility in detail, we used a systems genetics approach, integrating mapping of gene expression traits with sterility phenotypes and QTL. We measured genome-wide testis expression in 305 male F2s from a cross between wild-derived inbred strains of M. musculus musculus and M. m. domesticus. We identified several thousand cis- and trans-acting QTL contributing to expression variation (eQTL). Many trans eQTL cluster into eleven ‘hotspots,’ seven of which co-localize with QTL for sterility phenotypes identified in the cross. The number and clustering of trans eQTL, but not cis eQTL, were substantially lower when mapping was restricted to a ‘fertile’ subset of mice, providing evidence that trans eQTL hotspots are related to sterility. Functional annotation of transcripts with eQTL provides insights into the biological processes disrupted by sterility loci and guides prioritization of candidate genes. Using a conditional mapping approach, we identified eQTL dependent on interactions between loci, revealing a complex system of epistasis. Our results illuminate established patterns, including the role of the X chromosome in hybrid sterility. The integrated mapping approach we employed is applicable in a broad range of organisms and we advocate for widespread adoption of a network-centered approach in speciation genetics.
New species are created when barriers to reproduction form between groups of organisms that formerly interbred freely. Reduced fertility or viability of hybrid offspring is a common form of reproductive isolation. Hybrid defects are caused by negative interactions between genes that have undergone evolutionary change within each subgroup. Identifying genetic interactions causing disease or trait variation is very difficult; consequently, there are few known hybrid incompatibility genes and even fewer cases where both interacting genes are known. Here, we combined mapping of gene expression levels in testis with previous results mapping male sterility traits in hybrid house mice. This new approach to finding genetic causes of reproductive barriers enabled us to identify a large number of hybrid incompatibilities, involving genomic regions with known roles in hybrid sterility and previously unknown regions. Understanding the number and type of genetic interactions is important for developing accurate models used to reconstruct speciation events. The genetics of hybrid sterility in mice may also contribute to understanding basic processes involved in male reproduction and causes of human infertility.
The ability to survey polymorphism on a genomic scale has enabled genome-wide scans for the targets of natural selection. Theory that connects patterns of genetic variation to evidence of natural selection most often assumes a diallelic locus and no recurrent mutation. Although these assumptions are suitable to selection that targets single nucleotide variants, fundamentally different types of mutation generate abundant polymorphism in genomes. Moreover, recent empirical results suggest that mutationally complex, multiallelic loci including microsatellites and copy number variants are sometimes targeted by natural selection. Given their abundance, the lack of inference methods tailored to the mutational peculiarities of these types of loci represents a notable gap in our ability to interrogate genomes for signatures of natural selection. Previous theoretical investigations of mutation-selection balance at multiallelic loci include assumptions that limit their application to inference from empirical data. Focusing on microsatellites, we assess the dynamics and population-level consequences of selection targeting mutationally complex variants. We develop general models of a multiallelic fitness surface, a realistic model of microsatellite mutation, and an efficient simulation algorithm. Using these tools, we explore mutation-selection-drift equilibrium at microsatellites and investigate the mutational history and selective regime of the microsatellite that causes Friedreich’s ataxia. We characterize microsatellite selective events by their duration and cost, note similarities to sweeps from standing point variation, and conclude that it is premature to label microsatellites as ubiquitous agents of efficient adaptive change. Together, our models and simulation algorithm provide a powerful framework for statistical inference, which can be used to test the neutrality of microsatellites and other multiallelic variants.
microsatellites; fitness landscape; natural selection; population genetic inference; Friedreich’s ataxia; tandem repeats
The Dobzhansky-Muller model of speciation posits that defects in hybrids between species are the result of negative epistatic interactions between alleles that arose in independent genetic backgrounds. Tests of one important prediction from this model, that incompatibilities “snowball”, have relied on comparisons of the number of incompatibilities between closely related pairs of species separated by different divergence times. How incompatibilities accumulate along phylogenies, however, remains poorly understood. We extend the Dobzhansky-Muller model to multi-species clades to describe the mathematical relationship between tree topology and the number of shared incompatibilities among related pairs of species. We use these results to develop a statistical test that distinguishes between the snowball and alternative incompatibility accumulation models, including non-epistatic and multi-locus incompatibility models, in a phylogenetic context. We further demonstrate that patterns of incompatibility sharing across species pairs can be used to estimate the relative frequencies of different types of incompatibilities, including derived-derived vs. derived-ancestral incompatibilities. Our results and statistical methods should motivate comparative genetic mapping of hybrid incompatibilities to evaluate competing models of speciation.
Dobzhansky-Muller incompatibilities; phylogenetic comparison; reproductive isolation; speciation
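The "snowball" prediction mentioned in the abstract above can be made concrete with the classical two-locus Dobzhansky-Muller calculation: if K substitutions separate two lineages and each pair of diverged loci is incompatible with a small probability p, the expected number of incompatibilities is p·K(K−1)/2, which grows faster than linearly in K. The sketch below illustrates only this textbook pairwise case; the function name and the value of p are hypothetical, and it does not implement the multi-species extension or the statistical test described in the abstract.

```python
from math import comb


def expected_pairwise_incompatibilities(k: int, p: float) -> float:
    """Expected number of two-locus incompatibilities after k substitutions.

    p is the (assumed, hypothetical) probability that any given pair of
    diverged loci is incompatible.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Every pair of substitutions is a potential incompatibility.
    return p * comb(k, 2)


# Doubling divergence more than doubles the expected count (the snowball).
for k in (10, 20, 40):
    print(k, expected_pairwise_incompatibilities(k, p=0.001))
```

Under a non-epistatic alternative the expected count would instead grow in proportion to K, which is the kind of contrast a test between accumulation models has to detect.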
The pseudoautosomal region (PAR) is essential for the accurate pairing and segregation of the X and Y chromosomes during meiosis. Despite its functional significance, the PAR shows substantial evolutionary divergence in structure and sequence between mammalian species. An instructive example of PAR evolution is the house mouse Mus musculus domesticus (represented by the C57BL/6J strain), which has the smallest PAR among those that have been mapped. In C57BL/6J, the PAR boundary is located just ~700 kb from the distal end of the X chromosome, whereas the boundary is found at a more proximal position in Mus spretus, a species that diverged from house mice 2–4 million years ago. Here, we use a combination of genetic and physical mapping to document a pronounced shift in the PAR boundary in a second house mouse subspecies, Mus musculus castaneus (represented by the CAST/EiJ strain), ~430 kb proximal of the M. m. domesticus boundary. We demonstrate molecular evolutionary consequences of this shift, including a marked lineage-specific increase in sequence divergence within Mid1, a gene that resides entirely within the M. m. castaneus PAR but straddles the boundary in other subspecies. Our results extend observations of structural divergence in the PAR to closely related subspecies, pointing to major evolutionary changes in this functionally important genomic region over a short time period.
pseudoautosomal region; house mouse; Mid1
Despite advances in genetic mapping of quantitative traits and in phylogenetic comparative approaches, these two perspectives are rarely combined. The joint consideration of multiple crosses among related taxa (whether species or strains) not only allows more precise mapping of the genetic loci (called quantitative trait loci, QTL) that contribute to important quantitative traits, but also offers the opportunity to identify the origin of a QTL allele on the phylogenetic tree that relates the taxa. We describe a formal method for combining multiple crosses to infer the location of a QTL on a tree. We further discuss experimental design issues for such endeavors, such as how many crosses are required and which sets of crosses are best. Finally, we explore the method’s performance in computer simulations, and we illustrate its use through application to a set of four mouse intercrosses among five inbred strains, with data on HDL cholesterol.
quantitative trait loci (QTL); phylogenetic tree; evolution; multiple crosses; combining crosses
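The logic of placing a QTL on a tree, as described in the abstract above, rests on a simple observation: if a QTL allele arose on one edge of the tree, that edge splits the taxa into two groups, and only crosses between taxa from opposite groups should segregate the QTL. A minimal sketch of that bookkeeping follows. The strain labels, the set of crosses, and the function name are hypothetical, and the formal inference (how evidence is weighed across candidate edges) is not reproduced here.

```python
from typing import FrozenSet, Iterable, List, Tuple

Cross = Tuple[str, str]


def segregating_crosses(partition: FrozenSet[str], crosses: Iterable[Cross]) -> List[Cross]:
    """Return the crosses expected to segregate a QTL for a given tree edge.

    `partition` holds the taxa on one side of the edge; a cross segregates
    the QTL only if its two parents fall on opposite sides.
    """
    return [(a, b) for a, b in crosses if (a in partition) != (b in partition)]


# Four hypothetical intercrosses among five hypothetical strains A-E.
crosses = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]

# Candidate edge separating {A, B} from {C, D, E}.
print(segregating_crosses(frozenset({"A", "B"}), crosses))
```

Comparing the predicted pattern for each candidate edge with the crosses that actually show the QTL is what allows the allele's origin to be placed on the tree, and it also shows why the choice of crosses matters: two edges that predict the same pattern cannot be told apart.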
We report genome sequences of 17 inbred strains of laboratory mice and identify almost ten times more variants than previously known. We use these genomes to explore the phylogenetic history of the laboratory mouse and to examine the functional consequences of allele-specific variation on transcript abundance, revealing that at least 12% of transcripts show a significant tissue-specific expression bias. By identifying candidate functional variants at 718 quantitative trait loci we show that the molecular nature of functional variants and their position relative to genes vary according to the effect size of the locus. These sequences provide a starting point for a new era in the functional analysis of a key model organism.
Recently diverged taxa may continue to exchange genes. A number of models of speciation with gene flow propose that the frequency of gene exchange will be lower in genomic regions of low recombination and that these regions will therefore be more differentiated. However, several population-genetic models that focus on selection at linked sites also predict greater differentiation in regions of low recombination simply as a result of faster sorting of ancestral alleles even in the absence of gene flow. Moreover, identifying the actual amount of gene flow from patterns of genetic variation is tricky, because both ancestral polymorphism and migration lead to shared variation between recently diverged taxa. New analytic methods have been developed to help distinguish ancestral polymorphism from migration. Along with a growing number of datasets of multi-locus DNA sequence variation, these methods have spawned a renewed interest in speciation models with gene flow. Here, we review both speciation and population-genetic models that make explicit predictions about how the rate of recombination influences patterns of genetic variation within and between species. We then compare those predictions with empirical data of DNA sequence variation in rabbits and mice. We find strong support for the prediction that genomic regions experiencing low levels of recombination are more differentiated. In most cases, reduced gene flow appears to contribute to the pattern, although disentangling the relative contribution of reduced gene flow and selection at linked sites remains a challenge. We suggest fruitful areas of research that might help distinguish between different models.
genetic hitchhiking; background selection; gene flow; Mus musculus; Oryctolagus cuniculus
Rapid advances in DNA sequencing and genotyping technologies are beginning to reveal the scope and pattern of human genomic variation. Although single nucleotide polymorphisms (SNPs) have been intensively studied, the extent and form of variation at other types of molecular variants remain poorly understood. Polymorphism at the most variable loci in the human genome, microsatellites, has rarely been examined on a genomic scale without the ascertainment biases that attend typical genotyping studies. We conducted a genomic survey of variation at microsatellites with at least three perfect repeats by comparing two complete genome sequences, the Human Genome Reference sequence and the sequence of J. Craig Venter. The genomic proportion of polymorphic loci was 2.7%, much higher than the rate of SNP variation, with marked heterogeneity among classes of loci. The proportion of variable loci increased substantially with repeat number. Repeat lengths differed in levels of variation, with longer repeat lengths generally showing higher polymorphism at the same repeat number. Microsatellite variation was weakly correlated with regional SNP number, indicating modest effects of shared genealogical history. Reductions in variation were detected at microsatellites located in introns, in untranslated regions, in coding exons, and just upstream of transcription start sites, suggesting the presence of selective constraints. Our results provide new insights into microsatellite mutational processes and yield a preview of patterns of variation that will be obtained in genomic surveys of larger numbers of individuals.
microsatellites; tandem repeats; population genomics; mutation; human genome
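A survey like the one in the abstract above starts by locating perfect tandem repeats in a sequence. The following is a minimal sketch of such a scan using a regular expression with a backreference. The threshold of three repeats comes from the abstract; the motif size range (1 to 6 bp), the function name, and the toy sequence are assumptions, and the authors' actual repeat-finding software is not specified here.

```python
import re
from typing import List, Tuple


def find_perfect_repeats(sequence: str, min_repeats: int = 3,
                         min_motif: int = 1, max_motif: int = 6) -> List[Tuple[int, str, int]]:
    """Find perfect tandem repeats; return (start, motif, repeat_count) tuples.

    The lazy quantifier makes the regex try the shortest motif first, so a
    run such as 'CACACACA' is reported as motif 'CA' rather than 'CACA'.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Made-up sequence containing a (CA)5 and an (AAT)3 repeat.
print(find_perfect_repeats("GGTCACACACACATTGAATAATAATC"))
```

Running the same scan on two aligned genome sequences and comparing repeat counts at matching loci would give the proportion of polymorphic loci by motif length and repeat number.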
Theoretical work focused on microsatellite variation has produced a number of important results, including the expected distribution of repeat sizes and the expected squared difference in repeat size between two randomly selected samples. However, closed-form expressions for the sampling distribution and frequency spectrum of microsatellite variation have not been identified. Here, we use coalescent simulations of the stepwise mutation model to develop gamma and exponential approximations of the microsatellite allele frequency spectrum, a distribution central to the description of microsatellite variation across the genome. For both approximations, the parameter of biological relevance is the number of alleles at a locus, which we express as a function of θ, the population-scaled mutation rate, based on simulated data. Discovered relationships between θ, the number of alleles, and the frequency spectrum support the development of three new estimators of microsatellite θ. The three estimators exhibit roughly similar mean squared errors (MSEs) and all are biased. However, across a broad range of sample sizes and θ values, the MSEs of these estimators are frequently lower than all other estimators tested. The new estimators are also reasonably robust to mutation that includes step sizes greater than one. Finally, our approximation to the microsatellite allele frequency spectrum provides a null distribution of microsatellite variation. In this context, a preliminary analysis of the effects of demographic change on the frequency spectrum is performed. We suggest that simulations of the microsatellite frequency spectrum under evolutionary scenarios of interest may guide investigators to the use of relevant and sometimes novel summary statistics.
microsatellite; allele frequency spectrum; θ (theta); stepwise mutation model
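For context on what an estimator of microsatellite θ looks like, here is a sketch of two classical moment estimators under the strict single-step, symmetric stepwise mutation model, taking θ = 4Nμ: one from the variance in repeat number (expected variance θ/2) and one from homozygosity (expected homozygosity 1/√(1 + 2θ)). These are not the three new estimators developed in the abstract above, whose formulas are not given there; the function names and the toy sample are assumptions.

```python
from collections import Counter
from statistics import variance
from typing import Sequence


def theta_from_variance(repeat_counts: Sequence[int]) -> float:
    """Moment estimator from E[V] = theta / 2 under the single-step SMM."""
    return 2.0 * variance(repeat_counts)  # sample variance, n - 1 denominator


def theta_from_homozygosity(repeat_counts: Sequence[int]) -> float:
    """Moment estimator from E[F] = 1 / sqrt(1 + 2 * theta)."""
    n = len(repeat_counts)
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    sum_p2 = sum((c / n) ** 2 for c in Counter(repeat_counts).values())
    # Unbiased gene diversity, then homozygosity F = 1 - H.
    f = 1.0 - (n / (n - 1)) * (1.0 - sum_p2)
    if f <= 0.0:
        raise ValueError("every allele is distinct; estimator is undefined")
    return 0.5 * (1.0 / f ** 2 - 1.0)


# Toy sample of repeat numbers at one locus (made-up data).
sample = [10, 11, 10, 11]
print(theta_from_variance(sample), theta_from_homozygosity(sample))
```

The two estimates can differ sharply on the same data, as they do for this toy sample, which is one reason comparisons of mean squared error across estimators are informative.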
Although growing numbers of single nucleotide polymorphisms (SNPs) and microsatellites (short tandem repeat polymorphisms or STRPs) are used to infer population structure, their relative properties in this context remain poorly understood. SNPs and STRPs mutate differently, suggesting multi-locus genotypes at these loci might differ in ability to detect population structure. Here, we use coalescent simulations to measure the power of sets of SNPs and STRPs to identify population structure. To maximize applicability of our results to empirical studies, we focus on the popular STRUCTURE analysis and evaluate the role of several biological and practical factors in the detection of population structure. We find that: (1) fewer unlinked STRPs than SNPs are needed to detect structure at divergence times <0.3 Nₑ generations; (2) accurate estimation of the number of populations requires many fewer STRPs than SNPs; (3) for both marker types, declines in power due to modest gene flow (Nₑm = 1.0) are largely negated by increasing marker number; (4) variation in the STRP mutational model affects power modestly; (5) SNP haplotypes (θ = 1, no recombination) provide power comparable to STRP loci (θ = 10); (6) ascertainment schemes that select highly variable STRP or SNP loci increase power to detect structure, though ascertained data may not be suitable to other inference; and (7) when samples are drawn from an admixed population and one parent population, the reduction in power to detect two populations is greater for STRPs than SNPs. These results should assist the design of multi-locus studies to detect population structure in nature.
population structure; microsatellite; single nucleotide polymorphism; ascertainment bias; statistical power; short tandem repeat
The rate of meiotic recombination varies markedly between species and among individuals. Classical genetic experiments demonstrated a heritable component to population variation in recombination rate, and specific sequence variants that contribute to recombination rate differences between individuals have recently been identified. Despite these advances, the genetic basis of species divergence in recombination rate remains unexplored. Using a cytological assay that allows direct in situ imaging of recombination events in spermatocytes, we report a large (∼30%) difference in global recombination rate between males of two closely related house mouse subspecies (Mus musculus musculus and M. m. castaneus). To characterize the genetic basis of this recombination rate divergence, we generated an F2 panel of inter-subspecific hybrid males (n = 276) from an intercross between wild-derived inbred strains CAST/EiJ (M. m. castaneus) and PWD/PhJ (M. m. musculus). We uncover considerable heritable variation for recombination rate among males from this mapping population. Much of the F2 variance for recombination rate and a substantial portion of the difference in recombination rate between the parental strains is explained by eight moderate- to large-effect quantitative trait loci, including two transgressive loci on the X chromosome. In contrast to the rapid evolution observed in males, female CAST/EiJ and PWD/PhJ animals show minimal divergence in recombination rate (∼5%). The existence of loci on the X chromosome suggests a genetic mechanism to explain this male-biased evolution. Our results provide an initial map of the genetic changes underlying subspecies differences in genome-scale recombination rate and underscore the power of the house mouse system for understanding the evolution of this trait.
Homologous recombination is an indispensable feature of the mammalian meiotic program and an important mechanism for creating genetic diversity. Despite its central significance, recombination rates vary markedly between species and among individuals. Although recent studies have begun to unravel the genetic basis of recombination rate variation within populations, the genetic mechanisms of species divergence in recombination rate remain poorly characterized. In this study, we show that two closely related house mouse subspecies differ in their genomic recombination rates by ∼30%, providing an excellent model system for studying evolutionary divergence in this trait. Using quantitative genetic methods, we identify eight genomic regions that contribute to divergence in global recombination rate between these subspecies, including large effect loci and multiple loci on the X-chromosome. Our study uncovers novel genomic loci contributing to species divergence in global recombination rate and offers simple genetic explanations for rapid phenotypic divergence in this trait.
Patterns of population structure provide insights into evolutionary processes and help identify groups of individuals for genotype–phenotype association studies. With increasing availability of polymorphic molecular markers across genomes, the examination of population structure using large numbers of unlinked loci has become a common practice in evolutionary biology and human genetics. The two classes of molecular variation most widely used for this purpose, short tandem repeat polymorphisms (STRPs) and single-nucleotide polymorphisms (SNPs), differ in mutational properties expected to affect population structure. To measure the relative ability of these loci to describe population structure, we compared diversity at neighboring STRPs and SNPs from 720 genomic regions in the four populations that comprise the Human HapMap. Comparing loci from the same genomic regions allowed us to focus on the contribution of mutational differences (rather than variation in genealogical history) to disparities in population structure between STRPs and SNPs. Relative to average values for SNPs from the same regions, STRPs had lower Fst, but higher Gst′ and In values. STRP–SNP correlations in population structure across genomic regions were statistically significant but weak in magnitude. Separate analyses by repeat type showed that these correlations were driven primarily by tetranucleotide and trinucleotide STRPs; measures of population structure at dinucleotides and SNPs were not significantly correlated. Pairwise comparisons among populations revealed effects of divergence time on differences in population structure between STRPs and SNPs. Collectively, these results confirm that individual STRPs can provide more information about population structure than individual SNPs, but suggest that the difference in structure at STRPs and SNPs depends on local genealogical history. Our study motivates theoretical comparisons of population structure at loci with different mutational properties.
SNP; microsatellite; recurrent mutation; population structure; marker informativeness; human genome
The high genomic density of the single-nucleotide polymorphism (SNP) sets that are typically surveyed in genome-wide association studies (GWAS) now allows the application of haplotype-based methods. Although the choice of haplotype-based vs. individual-SNP approaches is expected to affect the results of association studies, few empirical comparisons of method performance have been reported on the genome-wide scale in the same set of individuals. To measure the relative ability of the two strategies to detect associations, we used a large dataset from the North American Rheumatoid Arthritis Consortium to: 1) partition the genome into haplotype blocks, 2) associate haplotypes with disease, and 3) compare the results with individual-SNP association mapping. Although some associations were shared across methods, each approach uniquely identified several strong candidate regions. Our results suggest that the application of both haplotype-based and individual-SNP testing to GWAS should be adopted as a routine procedure.
Population genetic theory predicts discordance in the true phylogeny of different genomic regions when studying recently diverged species. Despite this expectation, genome-wide discordance in young species groups has rarely been statistically quantified. The house mouse subspecies group provides a model system for examining phylogenetic discordance. House mouse subspecies are recently derived, suggesting that even if there has been a simple tree-like population history, gene trees could disagree with the population history due to incomplete lineage sorting. Subspecies of house mice also hybridize in nature, raising the possibility that recent introgression might lead to additional phylogenetic discordance. Single-locus approaches have revealed support for conflicting topologies, resulting in a subspecies tree often summarized as a polytomy. To analyze phylogenetic histories on a genomic scale, we applied a recently developed method, Bayesian concordance analysis, to dense SNP data from three closely related subspecies of house mice: Mus musculus musculus, M. m. castaneus, and M. m. domesticus. We documented substantial variation in phylogenetic history across the genome. Although each of the three possible topologies was strongly supported by a large number of loci, there was statistical evidence for a primary phylogenetic history in which M. m. musculus and M. m. castaneus are sister subspecies. These results underscore the importance of measuring phylogenetic discordance in other recently diverged groups using methods such as Bayesian concordance analysis, which are designed for this purpose.
The phylogenetic history of individual genes can differ strongly from the species history if taxa are recently derived, making inferences of a species history from only a handful of genes especially difficult in these cases. Genome-scale data sets now allow phylogenetic histories to be reconstructed from a large number of genes. Although data sets of this size are becoming more common, few studies have characterized variation in phylogenetic history across whole genomes. We summarize fine scale variation in phylogenetic history across the genome of house mice, a recently derived group of subspecies, using a method that combines phylogenetic uncertainty among gene trees. We document substantial variation in phylogenetic history among 14,081 loci and describe a primary history in the face of this variation. These results support the use of genome-scale datasets and methods that accommodate phylogenetic discordance in attempts to reconstruct the history of closely related groups.
Identifying genomic locations that have experienced selective sweeps is an important first step toward understanding the molecular basis of adaptive evolution. Using statistical methods that account for the confounding effects of population demography, recombination rate variation, and single-nucleotide polymorphism ascertainment, while also providing fine-scale estimates of the position of the selected site, we analyzed a genomic dataset of 1.2 million human single-nucleotide polymorphisms genotyped in African-American, European-American, and Chinese samples. We identify 101 regions of the human genome with very strong evidence (p < 10⁻⁵) of a recent selective sweep and where our estimate of the position of the selective sweep falls within 100 kb of a known gene. Within these regions, genes of biological interest include genes in pigmentation pathways, components of the dystrophin protein complex, clusters of olfactory receptors, genes involved in nervous system development and function, immune system genes, and heat shock genes. We also observe consistent evidence of selective sweeps in centromeric regions. In general, we find that recent adaptation is strikingly pervasive in the human genome, with as much as 10% of the genome affected by linkage to a selective sweep.
A selective sweep is a single realization of adaptive evolution at the molecular level. When a selective sweep occurs, it leaves a characteristic signal in patterns of variation in genomic regions linked to the selected site; therefore, recently released population genomic datasets can be used to search for instances of molecular adaptation. Here, we present a comprehensive scan for complete selective sweeps in the human genome. Our analysis is complementary to several recent analyses that focused on partial selective sweeps, in which the adaptive mutation still segregates at intermediate frequency in the population. Consequently, our analysis identifies many genomic regions that were not previously known to have experienced natural selection, including consistent evidence of selection in centromeric regions, which is possibly the result of meiotic drive. Genes within selected regions include pigmentation candidate genes, genes of the dystrophin protein complex, and olfactory receptors. Extensive testing demonstrates that the method we use to detect selective sweeps is strikingly robust to both alternative demographic scenarios and recombination rate variation. Furthermore, the method we use provides precise estimates of the genomic position of the selected site, which greatly facilitates the fine-scale mapping of functionally significant variation in human populations.