The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. While broad-scale properties of recombination appear similar to other mammalian species, our fine-scale estimates indicate that highly elevated canine recombination rates occur in the vicinity of CpG-rich regions, including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. By comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred.
Recombination in mammalian genomes tends to occur within highly localized regions known as recombination hotspots. These hotspots appear to be a ubiquitous feature of mammalian genomes, but tend not to be shared between closely related species despite high levels of DNA sequence similarity. This disparity has been largely explained by the discovery of PRDM9 as the gene responsible for localizing recombination hotspots via recognition and binding to specific DNA motifs. Variation within PRDM9 can lead to changes to the recognized motif, and hence changes to the location of recombination hotspots throughout the genome. Multiple studies have shown that PRDM9 is under strong selective pressure, apparently leading to a rapid turnover of hotspot locations between species. However, uniquely amongst mammals, PRDM9 appears to be dysfunctional in dogs and other canids. In this paper, we investigate how the loss of PRDM9 has affected the fine-scale recombination landscape in dogs and contrast this with patterns seen in other species.
How organisms adapt to the range of environments they encounter is a fundamental question in biology. Elucidating the genetic basis of adaptation is a difficult task, especially when the targets of selection are not known. Emerging sequencing technologies and assembly algorithms facilitate the genomic dissection of adaptation and population differentiation in a vast array of organisms. Here we describe the attributes of Kryptolebias marmoratus, one of two known self-fertilizing hermaphroditic vertebrates that make this fish an attractive genetic system and a model for understanding the genomics of adaptation. Long periods of selfing have resulted in populations composed of many distinct naturally homozygous strains with a variety of identifiable, and apparently heritable, phenotypes. There also is strong population genetic structure across a diverse range of mangrove habitats, making this a tractable system in which to study differentiation both within and among populations. The ability to rear K. marmoratus in the laboratory contributes further to its value as a model for understanding the genetic drivers for adaptation. To date, microsatellite markers distinguish wild isogenic strains but the naturally high homozygosity improves the quality of de novo assembly of the genome and facilitates the identification of genetic variants associated with phenotypes. Gene annotation can be accomplished with RNA-sequencing data in combination with de novo genome assembly. By combining genomic information with extensive laboratory-based phenotyping, it becomes possible to map genetic variants underlying differences in behavioral, life-history, and other potentially adaptive traits. Emerging genomic technologies provide the required resources for establishing K. marmoratus as a new model organism for behavioral genetics and evolutionary genetics research.
To gain insights into evolutionary forces that have shaped the history of Bornean and Sumatran populations of orang-utans, we compare patterns of variation across more than 11 million single nucleotide polymorphisms found by previous mitochondrial and autosomal genome sequencing of 10 wild-caught orang-utans. Our analysis of the mitochondrial data yields a far more ancient split time between the two populations (∼3.4 million years ago) than estimates based on autosomal data (0.4 million years ago), suggesting a complex speciation process with moderate levels of primarily male migration. We find that the distribution of selection coefficients consistent with the observed frequency spectrum of autosomal non-synonymous polymorphisms in orang-utans is similar to the distribution in humans. Our analysis indicates that 35% of genes have evolved under detectable negative selection. Overall, our findings suggest that purifying natural selection, genetic drift, and a complex demographic history are the dominant drivers of genome evolution for the two orang-utan populations.
Identifying ancestry along each chromosome in admixed individuals provides a wealth of information for understanding the population genetic history of admixture events and is valuable for admixture mapping and identifying recent targets of selection. We present PCAdmix (available at https://sites.google.com/site/pcadmix/home), a Principal Components-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased single-nucleotide polymorphism (SNP) genotypes of admixed individuals. We compare our method to HAPMIX on simulated data from two ancestral populations, and we find high concordance between the methods. Our method also has better accuracy than LAMP when applied to three-population admixture, a situation as yet unaddressed by HAPMIX. Finally, we apply our method to a data set of four Latino populations with European, African, and Native American ancestry. We find evidence of assortative mating in each of the four populations, and we identify regions of shared ancestry that may be recent targets of selection and could serve as candidate regions for admixture-based association mapping.
Admixture; Principal Components Analysis (PCA); Local Ancestry Deconvolution; Haplotype-Based; Forward-Backward Algorithm
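The forward-backward step named in the keywords above can be illustrated with a minimal sketch. It is not the PCAdmix implementation: the function name, the per-window emission likelihoods, and the single switch probability are all assumptions made for illustration. Each window of SNPs is assumed to already have a likelihood under each ancestry, and the sketch smooths those likelihoods along the chromosome into posterior ancestry probabilities.

```python
def forward_backward(emissions, switch_prob, prior=None):
    """Posterior ancestry probabilities per window (hypothetical helper).

    emissions[t][k] is the assumed likelihood of window t under ancestry k.
    switch_prob is the assumed probability of changing ancestry between
    adjacent windows, split evenly among the other ancestries.
    """
    n = len(emissions)
    n_states = len(emissions[0])
    if prior is None:
        prior = [1.0 / n_states] * n_states

    def trans(i, j):
        # Stay in the same ancestry, or switch to one of the others.
        return 1.0 - switch_prob if i == j else switch_prob / (n_states - 1)

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at every window to avoid numerical underflow.
    fwd = [normalize([prior[k] * emissions[0][k] for k in range(n_states)])]
    for t in range(1, n):
        prev = fwd[-1]
        fwd.append(normalize([
            emissions[t][j] * sum(prev[i] * trans(i, j) for i in range(n_states))
            for j in range(n_states)
        ]))

    # Backward pass, rescaled in the same way.
    bwd = [[1.0] * n_states for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalize([
            sum(trans(i, j) * emissions[t + 1][j] * bwd[t + 1][j]
                for j in range(n_states))
            for i in range(n_states)
        ])

    # Combine both passes into a per-window posterior.
    return [normalize([fwd[t][k] * bwd[t][k] for k in range(n_states)])
            for t in range(n)]


# Illustrative numbers only: five windows, two ancestries.
example_emissions = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]
posteriors = forward_backward(example_emissions, switch_prob=0.1)
```

A small switch probability encodes the expectation that ancestry tracts span many windows, so an ambiguous window tends to be assigned the ancestry of its neighbours.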
Evolutionary theory assumes that mutations occur randomly in the genome; however, studies performed in a variety of organisms indicate the existence of context-dependent mutation biases. Sources of mutagenesis variation across large genomic contexts (e.g. hundreds of bases) have not been identified. Here, we use high-coverage whole genome sequencing of a conditional mismatch repair mutant line of diploid yeast to identify mutations that accumulated after 160 generations of growth. The vast majority of the mutations accumulated as insertion/deletions (in-dels) in homopolymeric (poly(dA:dT)) and repetitive DNA tracts. Surprisingly, the likelihood of an in-del mutation in a given poly(dA:dT) tract is increased by the presence of nearby poly(dA:dT) tracts in up to a 1000 bp region centered on the given tract. Our work suggests that specific mutation hotspots can contribute disproportionately to the genetic variation that is introduced into populations, and provides the first long-range genomic sequence context that contributes to mutagenesis.
DNA Mismatch Repair; homopolymeric tracts; mutation hotspot
Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.
The environment humans inhabit has changed many times in the last 100,000 years. Migration and dynamic local environments can lead to genetic adaptations favoring beneficial traits. Many genes responsible for these adaptations can alter disease susceptibility. Genes can also affect disease susceptibility by varying randomly across different populations. We have studied genetic variants that are known to modify disease susceptibility in the context of worldwide migration. We found that variants associated with 11 diseases have been affected to an extent that is not explained by random variation. We also found that the genetic risk of type 2 diabetes has steadily decreased along the worldwide human migration trajectory from Africa to America.
We present an approach for identifying genes under natural selection using polymorphism and divergence data from synonymous and non-synonymous sites within genes. A generalized linear mixed model is used to model the genome-wide variability among categories of mutations and estimate its functional consequence. We demonstrate how the model's estimated fixed and random effects can be used to identify genes under selection. The parameter estimates from our generalized linear model can be transformed to yield population genetic parameter estimates for quantities including the average selection coefficient for new mutations at a locus, the synonymous and non-synonymous mutation rates, and species divergence times. Furthermore, our approach incorporates stochastic variation due to the evolutionary process and can be fit using standard statistical software. The model is fit in both the empirical Bayes and Bayesian settings using the lme4 package in R, and Markov chain Monte Carlo methods in WinBUGS. Using simulated data we compare our method to existing approaches for detecting genes under selection: the McDonald-Kreitman test, and two versions of the Poisson random field based method MKprf. Overall, we find our method universally outperforms existing methods for detecting genes subject to selection using polymorphism and divergence data.
We present a new methodology, SnIPRE, for identifying genes under natural selection. SnIPRE is a “McDonald-Kreitman” type of analysis, in that it is based on MK table data and has an advantage over other types of statistics because it is robust to demography. Similar to the MKprf method, SnIPRE makes use of genome-wide information to increase power, but is non-parametric in the sense that it makes no assumptions about (and does not require estimation of) parameters such as mutation rate and species divergence time in order to identify genes under selection. In simulations SnIPRE outperforms both the MK statistic and the two versions of MKprf considered. We then apply our method to Drosophila and human-chimp data.
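For readers unfamiliar with the MK table data that these methods build on, the following is a minimal sketch of the classical single-gene McDonald-Kreitman comparison, not of SnIPRE itself. The function names and the example counts are illustrative assumptions. It computes the neutrality index, the derived quantity alpha = 1 - NI, and a two-sided Fisher exact p-value for the 2x2 table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def mk_summary(pn, ps, dn, ds):
    """Summarize an MK table (hypothetical helper).

    pn, ps: non-synonymous and synonymous polymorphisms within a species.
    dn, ds: non-synonymous and synonymous fixed differences between species.
    Assumes ps and dn are non-zero; real data need a correction otherwise.
    """
    neutrality_index = (pn * ds) / (ps * dn)
    alpha = 1.0 - neutrality_index
    p_value = fisher_exact_two_sided(pn, ps, dn, ds)
    return neutrality_index, alpha, p_value


# Illustrative counts only.
ni, alpha, p = mk_summary(pn=2, ps=42, dn=7, ds=17)
```

A neutrality index below 1 indicates an excess of non-synonymous divergence relative to polymorphism, and a value above 1 indicates the reverse. A per-gene test like this uses only that gene's four counts, which is the limitation that genome-wide approaches aim to address.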
Blond hair is a rare human phenotype found almost exclusively in Europe and Oceania. Here, we identify a cysteine-to-arginine change at a highly conserved residue in tyrosinase-related protein 1 (TYRP1) as the single source of blond hair in Solomon Islanders. This missense mutation is predicted to impact catalytic activity of the protein and causes blond hair through a recessive mode of inheritance. The novel mutation is at a frequency of 26% in the Solomon Islands but is absent outside of Oceania and represents the largest genetic effect on a visible human phenotype reported to date. Our findings demonstrate that alleles of large effect reach appreciable frequencies in geographically isolated populations and underscore the importance of extending medical genomics to humans worldwide.
One of the main findings derived from the analysis of the Neandertal genome was the evidence for admixture between Neandertals and non-African modern humans. An alternative scenario is that the ancestral population of non-Africans was closer to Neandertals than to Africans because of ancient population substructure. Thus, the study of North African populations is crucial for testing both hypotheses. We analyzed a total of 780,000 SNPs in 125 individuals representing seven different North African locations and searched for their ancestral/derived state in comparison to different human populations and Neandertals. We found that North African populations have a significant excess of derived alleles shared with Neandertals, when compared to sub-Saharan Africans. This excess is similar to that found in non-African humans, a fact that can be interpreted as a sign of Neandertal admixture. Furthermore, the Neandertal's genetic signal is higher in populations with a local, pre-Neolithic North African ancestry. Therefore, the detected ancient admixture is not due to recent Near Eastern or European migrations. Sub-Saharan populations are the only ones not affected by the admixture event with Neandertals.
Asian rice, Oryza sativa is a cultivated, inbreeding species that feeds over half of the world's population. Understanding the genetic basis of diverse physiological, developmental, and morphological traits provides the basis for improving yield, quality and sustainability of rice. Here we show the results of a genome-wide association study based on genotyping 44,100 SNP variants across 413 diverse accessions of O. sativa collected from 82 countries that were systematically phenotyped for 34 traits. Using cross-population-based mapping strategies, we identified dozens of common variants influencing numerous complex traits. Significant heterogeneity was observed in the genetic architecture associated with subpopulation structure and response to environment. This work establishes an open-source translational research platform for genome-wide association studies in rice that directly links molecular variation in genes and metabolic pathways with the germplasm resources needed to accelerate varietal development and crop improvement.
Understanding the genetics and physiology of domesticated species is important for crop improvement. By studying natural variation and the phenotypic traits of 413 diverse accessions of rice, Zhao et al. identify many common genetic variants that influence quantitative traits such as seed size and flowering time.
Fine-scale population structure characterizes most continents and is especially pronounced in non-cosmopolitan populations. Roughly half of the world's population remains non-cosmopolitan and even populations within cities often assort along ethnic and linguistic categories. Barriers to random mating can be ecologically extreme, such as the Sahara Desert, or cultural, such as the Indian caste system. In either case, subpopulations accumulate genetic differences if the barrier is maintained over multiple generations. Genome-wide polymorphism data, initially with only a few hundred autosomal microsatellites, have clearly established differences in allele frequency not only among continental regions, but also within continents and within countries. We review recent evidence from the analysis of genome-wide polymorphism data for genetic boundaries delineating human population structure and the main demographic and genomic processes shaping variation, and discuss the implications of population structure for the distribution and discovery of disease-causing genetic variants, in the light of the imminent availability of sequencing data for a multitude of diverse human genomes.
Inferring population structure using Bayesian clustering programs often requires a priori specification of the number of subpopulations, K, from which the sample has been drawn. Here, we explore the utility of a common Bayesian model selection criterion, the Deviance Information Criterion (DIC), for estimating K. We evaluate the accuracy of DIC, as well as other popular approaches, on datasets generated by coalescent simulations under various demographic scenarios. We find that DIC outperforms competing methods in many genetic contexts, validating its application in assessing population structure.
Oryza sativa or Asian cultivated rice is one of the major cereal grass species domesticated for human food use during the Neolithic. Domestication of this species from the wild grass Oryza rufipogon was accompanied by changes in several traits, including seed shattering, percent seed set, tillering, grain weight, and flowering time. Quantitative trait locus (QTL) mapping has identified three genomic regions on chromosome 3 that appear to be associated with these traits. We examine whether these regions show signatures of selection and whether the same genetic basis underlies the domestication of different rice varieties. Fragments of 88 genes spanning these three genomic regions were sequenced from multiple accessions of two major varietal groups in O. sativa (indica and tropical japonica) as well as the ancestral wild rice species O. rufipogon. In tropical japonica, the levels of nucleotide variation in these three QTL regions are significantly lower compared to genome-wide levels, and coalescent simulations based on a complex demographic model of rice domestication indicate that these patterns are consistent with selection. In contrast, there is no significant reduction in nucleotide diversity in the homologous regions in indica rice. These results suggest that there are differences in the genetic and selective basis for domestication between these two Asian rice varietal groups.
Deep resequencing of functional regions in human genomes is key to identifying potentially causal rare variants for complex disorders. Here, we present the results from a large-sample resequencing (n = 285 patients) study of candidate genes coupled with population genetics and statistical methods to identify rare variants associated with Autism Spectrum Disorder and Schizophrenia. Three genes, MAP1A, GRIN2B, and CACNA1F, were consistently identified by different methods as having significant excess of rare missense mutations in either one or both disease cohorts. In a broader context, we also found that the overall site frequency spectrum of variation in these cases is best explained by population models of both selection and complex demography rather than neutral models or models accounting for complex demography alone. Mutations in the three disease-associated genes explained much of the difference in the overall site frequency spectrum among the cases versus controls. This study demonstrates that genes associated with complex disorders can be mapped using resequencing and analytical methods with sample sizes far smaller than those required by genome-wide association studies. Additionally, our findings support the hypothesis that rare mutations account for a proportion of the phenotypic variance of these complex disorders.
It is widely accepted that genetic factors play important roles in the etiology of neurological diseases. However, the nature of the underlying genetic variation remains unclear. Critical questions in the field of human genetics relate to the frequency and size effects of genetic variants associated with disease. For instance, the common disease–common variant model is based on the idea that sets of common variants explain a significant fraction of the variance found in common disease phenotypes. On the other hand, rare variants may have strong effects and therefore largely contribute to disease phenotypes. Due to their high penetrance and reduced fitness, such variants are maintained in the population at low frequencies, thus limiting their detection in genome-wide association studies. Here, we use a resequencing approach on a cohort of 285 Autism Spectrum Disorder and Schizophrenia patients and perform several analyses, enhanced with population genetic approaches, to identify variants associated with both diseases. Our results demonstrate an excess of rare variants in these disease cohorts and identify genes with negative (deleterious) selection coefficients, suggesting an accumulation of variants of detrimental effects. Our results present further evidence for rare variants explaining a component of the genetic etiology of autism and schizophrenia.
Approximately 100 mouse genes undergo genomic imprinting, whereby one of the two parental alleles is epigenetically silenced. Imprinted genes influence processes including development, X chromosome inactivation, obesity, schizophrenia, and diabetes, motivating the identification of all imprinted loci. Local sequence features have been used to predict candidate imprinted genes, but rigorous testing using reciprocal crosses validated only three, one of which resided in previously identified imprinting clusters. Here we show that specific epigenetic features in mouse cells correlate with imprinting status in mice, and we identify hundreds of additional genes predicted to be imprinted in the mouse. We used a multitiered approach to validate imprinted expression, including use of a custom single nucleotide polymorphism array and traditional molecular methods. Of 65 candidates subjected to molecular assays for allele-specific expression, we found 10 novel imprinted genes that were maternally expressed in the placenta.
Balancing selection is potentially an important biological force for maintaining advantageous genetic diversity in populations, including variation that is responsible for long-term adaptation to the environment. By serving as a means to maintain genetic variation, it may be particularly relevant to maintaining phenotypic variation in natural populations. Nevertheless, its prevalence and specific targets in the human genome remain largely unknown. We have analyzed the patterns of diversity and divergence of 13,400 genes in two human populations using an unbiased single-nucleotide polymorphism data set, a genome-wide approach, and a method that incorporates demography in neutrality tests. We identified an unbiased catalog of genes with signatures of long-term balancing selection, which includes immunity genes as well as genes encoding keratins and membrane channels; the catalog also shows enrichment in functional categories involved in cellular structure. Patterns are mostly concordant in the two populations, with a small fraction of genes showing population-specific signatures of selection. Power considerations indicate that our findings represent a subset of all targets in the genome, suggesting that although balancing selection may not have an obvious impact on a large proportion of human genes, it is a key force affecting the evolution of a number of genes in humans.
overdominance; frequency-dependent selection; heterosis; human evolution; population genetics; human diversity
Motivation: The development of new high-throughput genotyping products requires a significant investment in testing and training samples to evaluate and optimize the product before it can be used reliably on new samples. One reason for this is that current methods for automated calling of genotypes are based on clustering approaches, which require a large number of samples to be analyzed simultaneously or an extensive training dataset to seed clusters. In systems where inbred samples are of primary interest, current clustering approaches perform poorly due to the inability to clearly identify a heterozygote cluster.
Results: As part of the development of two custom single nucleotide polymorphism genotyping products for Oryza sativa (domestic rice), we have developed a new genotype calling algorithm called ‘ALCHEMY’ based on statistical modeling of the raw intensity data rather than modelless clustering. A novel feature of the model is the ability to estimate and incorporate inbreeding information on a per sample basis allowing accurate genotyping of both inbred and heterozygous samples even when analyzed simultaneously. Since clustering is not used explicitly, ALCHEMY performs well on small sample sizes with accuracy exceeding 99% with as few as 18 samples.
Availability: ALCHEMY is available for both commercial and academic use free of charge and distributed under the GNU General Public License at http://alchemy.sourceforge.net/
Supplementary information: Supplementary data are available at Bioinformatics online.
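To illustrate why per-sample inbreeding information matters for genotype calling, the sketch below uses the textbook relationship between an inbreeding coefficient F and expected genotype frequencies. This is a standard population genetic formula, not ALCHEMY's actual model, and the function name and inputs are assumptions made for illustration.

```python
def genotype_priors(allele_freq, inbreeding):
    """Expected genotype frequencies under inbreeding (hypothetical helper).

    allele_freq: frequency p of allele A at the marker.
    inbreeding: inbreeding coefficient F of the sample, from 0 (outbred)
    to 1 (fully inbred).
    """
    p = allele_freq
    q = 1.0 - p
    return {
        "AA": p * p + inbreeding * p * q,
        "AB": 2.0 * p * q * (1.0 - inbreeding),  # heterozygotes shrink with F
        "BB": q * q + inbreeding * p * q,
    }


outbred = genotype_priors(0.5, 0.0)
inbred = genotype_priors(0.5, 1.0)
```

For a fully inbred sample the heterozygote prior drops to zero, so an ambiguous intensity signal would be resolved toward one of the homozygous calls, whereas an outbred sample keeps substantial prior weight on the heterozygote.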
Pedigree studies of complex heritable diseases often feature nominal or ordinal phenotypic measurements and missing genetic marker or phenotype data.
We have developed a Bayesian method for Linkage analysis of Ordinal and Categorical traits (LOCate) that can analyze complex genealogical structure for family groups and incorporate missing data. LOCate uses a Gibbs sampling approach to assess linkage, incorporating a simulated tempering algorithm for fast mixing. While our treatment is Bayesian, we develop a LOD (log of odds) score estimator for assessing linkage from Gibbs sampling that is highly accurate for simulated data. LOCate is applicable to linkage analysis for ordinal or nominal traits, a versatility which we demonstrate by analyzing simulated data with a nominal trait, on which LOCate outperforms LOT, an existing method which is designed for ordinal traits. We additionally demonstrate our method's versatility by analyzing a candidate locus (D2S1788) for panic disorder in humans, in a dataset with a large amount of missing data, which LOT was unable to handle.
LOCate's accuracy and applicability to both ordinal and nominal traits will prove useful to researchers interested in mapping loci for categorical traits.
Quantifying the number of deleterious mutations per diploid human genome is of critical concern to both evolutionary and medical geneticists [1–3]. Here, we combine genome-wide polymorphism data from PCR-based exon re-sequencing, comparative genomic data across mammalian species, and protein structure predictions to estimate the number of functionally consequential mutations carried by each of 15 African American (AA) and 20 European American (EA) individuals. We find that AAs show significantly higher levels of nucleotide heterozygosity than do EAs for all categories of functional mutations considered including synonymous, nonsynonymous, predicted “benign”, predicted “possibly damaging” and predicted “probably damaging” mutations. This result is wholly consistent with previous work showing higher overall levels of nucleotide variation in African populations as compared to Europeans [4]. EA individuals, on the other hand, have significantly more genotypes homozygous for the derived allele at synonymous and nonsynonymous SNPs and for the damaging allele at “probably damaging” SNPs than AAs do. Surprisingly, for SNPs segregating only in one population or the other, the proportion of nonsynonymous SNPs is significantly higher in the EA sample (55.4%) than in the AA sample (47.0%; P < 2.3 × 10^-37). We observe a similar proportional excess of SNPs that are inferred to be “probably damaging” (15.9% EA; 12.1% AA; P < 3.3 × 10^-11). Using extensive simulations, we show that this excess proportion of segregating damaging alleles in Europeans is likely a consequence of a bottleneck that Europeans experienced around the time of the migration out of Africa.
Morphological diversity within closely related species is an essential aspect of evolution and adaptation. Mutations in the Melanocortin 1 receptor (Mc1r) gene contribute to pigmentary diversity in natural populations of fish, birds, and many mammals. However, melanism in the gray wolf, Canis lupus, is caused by a different melanocortin pathway component, the K locus, that encodes a beta-defensin protein that acts as an alternative ligand for Mc1r. We show that the melanistic K locus mutation in North American wolves derives from past hybridization with domestic dogs, has risen to high frequency in forested habitats, and exhibits a molecular signature of positive selection. The same mutation also causes melanism in the coyote, Canis latrans, and in Italian gray wolves, and hence our results demonstrate how traits selected in domesticated species can influence the morphological diversity of their wild relatives.
Coat color and type are essential characteristics of domestic dog breeds. Although the genetic basis of coat color has been well characterized, relatively little is known about the genes influencing coat growth pattern, length, and curl. We performed genome-wide association studies of more than 1000 dogs from 80 domestic breeds to identify genes associated with canine fur phenotypes. Taking advantage of both inter- and intrabreed variability, we identified distinct mutations in three genes, RSPO2, FGF5, and KRT71 (encoding R-spondin–2, fibroblast growth factor–5, and keratin-71, respectively), that together account for most coat phenotypes in purebred dogs in the United States. Thus, an array of varied and seemingly complex phenotypes can be reduced to the combinatorial effects of only a few genes.
The domestication of Asian rice (Oryza sativa) was a complex process punctuated by episodes of introgressive hybridization among and between subpopulations. Deep genetic divergence between the two main varietal groups (Indica and Japonica) suggests domestication from at least two distinct wild populations. However, genetic uniformity surrounding key domestication genes across divergent subpopulations suggests cultural exchange of genetic material among ancient farmers.
In this study, we utilize a novel 1,536 SNP panel genotyped across 395 diverse accessions of O. sativa to study genome-wide patterns of polymorphism, to characterize population structure, and to infer the introgression history of domesticated Asian rice. Our population structure analyses support the existence of five major subpopulations (indica, aus, tropical japonica, temperate japonica and GroupV) consistent with previous analyses. Our introgression analysis shows that most accessions exhibit some degree of admixture, with many individuals within a population sharing the same introgressed segment due to artificial selection. Admixture mapping and association analysis of amylose content and grain length illustrate the potential for dissecting the genetic basis of complex traits in domesticated plant populations.
Genes in these regions control a myriad of traits including plant stature, blast resistance, and amylose content. These analyses highlight the power of population genomics in agricultural systems to identify functionally important regions of the genome and to decipher the role of human-directed breeding in refashioning the genomes of a domesticated species.