The Y chromosome and the mitochondrial genome (mtDNA) have been used to estimate when the common patrilineal and matrilineal ancestors of humans lived. We sequenced the genomes of 69 males from nine populations, including two in which we find basal branches of the Y chromosome tree. We identify ancient phylogenetic structure within African haplogroups and resolve a long-standing ambiguity deep within the tree. Applying equivalent methodologies to the Y and mtDNA, we estimate the time to the most recent common ancestor (TMRCA) of the Y chromosome to be 120–156 thousand years and the mtDNA TMRCA to be 99–148 ky. Our findings suggest that, contrary to prior claims, male lineages do not coalesce significantly more recently than female lineages.
Streptococcus mutans is widely recognized as one of the key etiological agents of human dental caries. Despite its role in this important disease, our present knowledge of gene content variability across the species and its relationship to adaptation is minimal. Estimates of its demographic history are not available. In this study, we generated genome sequences of 57 S. mutans isolates, as well as representative strains of the most closely related species to S. mutans (S. ratti, S. macaccae, and S. criceti), to identify the overall structure and potential adaptive features of the dispensable and core components of the genome. We also performed population genetic analyses on the core genome of the species aimed at understanding its demographic history and the impact of selection in shaping its genetic variation. The maximum gene content divergence among strains was approximately 23%, with the majority of strains diverging by 5–15%. The core genome consisted of 1,490 genes and the pan-genome approximately 3,296. Maximum likelihood analysis of the synonymous site frequency spectrum (SFS) suggested that the S. mutans population started expanding exponentially approximately 10,000 years ago (95% confidence interval [CI]: 3,268–14,344 years ago), coincident with the onset of human agriculture. Analysis of the replacement SFS indicated that a majority of these substitutions are under strong negative selection, and the remainder evolved neutrally. A set of 14 genes was identified as being under positive selection, most of which were involved in either sugar metabolism or acid tolerance. Analysis of the core genome suggested that among 73 genes present in all isolates of S. mutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of S. mutans to its new niche, the human mouth, and with the dietary changes that accompanied the origin of agriculture.
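As a rough illustration of the summary statistic used above, the following sketch tabulates an unfolded and a folded site frequency spectrum from per-site derived-allele counts. It covers only the tabulation step, not the maximum likelihood demographic inference; the function names and the input counts are hypothetical and are not taken from the study.

```python
from collections import Counter


def unfolded_sfs(derived_counts, n):
    """Return [xi_1, ..., xi_(n-1)], where xi_i is the number of sites at
    which the derived allele is seen exactly i times among n sequences."""
    tally = Counter(derived_counts)
    return [tally.get(i, 0) for i in range(1, n)]


def fold_sfs(sfs):
    """Fold an unfolded SFS by minor-allele count (used when the ancestral
    state is unknown): classes i and n - i are merged."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        folded.append(sfs[i - 1] if i == j else sfs[i - 1] + sfs[j - 1])
    return folded


# Hypothetical derived-allele counts at 8 segregating sites in 6 sequences.
counts = [1, 1, 2, 5, 1, 3, 4, 1]
sfs = unfolded_sfs(counts, n=6)
folded = fold_sfs(sfs)
print(sfs)
print(folded)
```

An excess of low-frequency classes relative to the neutral expectation is the kind of signal that a model of population expansion can be fitted to.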
Streptococcus mutans; demographic inference; cavities; bacterial evolution; pan and core genome; infectious disease
To identify the causative mutations in two early-onset canine retinal degenerations, crd1 and crd2, segregating in the American Staffordshire terrier and the Pit Bull Terrier breeds, respectively.
Retinal morphology of crd1- and crd2-affected dogs was evaluated by light microscopy. DNA was extracted from affected and related unaffected controls. Association analysis was undertaken using the Illumina Canine SNP array and PLINK (crd1 study), or the Affymetrix Version 2 Canine array, the “MAGIC” genotype algorithm, and Fisher's Exact test for association (crd2 study). Positional candidate genes were evaluated for each disease.
Structural photoreceptor abnormalities were observed in crd1-affected dogs as young as 11 weeks old. Rod and cone inner segments (IS) and outer segments (OS) were abnormal in size, shape, and number. In crd2-affected dogs, rod and cone IS and OS were abnormal as early as 3 weeks of age, progressing with age to severe loss of the OS and thinning of the outer nuclear layer (ONL) by 12 weeks of age. Genome-wide association studies (GWAS) identified association at the telomeric end of CFA3 in crd1-affected dogs and on CFA33 in crd2-affected dogs. Candidate gene evaluation identified a three-base deletion in exon 21 of PDE6B in crd1-affected dogs, and a cytosine insertion in exon 10 of IQCB1 in crd2-affected dogs.
Identification of the mutations responsible for these two early-onset retinal degenerations provides new large animal models for comparative disease studies and evaluation of potential therapeutic approaches for the homologous human diseases.
We describe two genome-wide association studies in two closely related dog breeds affected with retinal degeneration, the pathology of the diseases, and the discovery of a novel deletion mutation in PDE6B and an insertion mutation in IQCB1 as the causes of these diseases.
retina; mutation; GWAS
The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. While broad-scale properties of recombination appear similar to other mammalian species, our fine-scale estimates indicate that in dogs highly elevated recombination rates are observed in the vicinity of CpG-rich regions, including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. By comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred.
Recombination in mammalian genomes tends to occur within highly localized regions known as recombination hotspots. These hotspots appear to be a ubiquitous feature of mammalian genomes, but tend not to be shared between closely related species despite high levels of DNA sequence similarity. This disparity has been largely explained by the discovery of PRDM9 as the gene responsible for localizing recombination hotspots via recognition and binding to specific DNA motifs. Variation within PRDM9 can lead to changes to the recognized motif, and hence changes to the location of recombination hotspots throughout the genome. Multiple studies have shown that PRDM9 is under strong selective pressure, apparently leading to a rapid turnover of hotspot locations between species. However, uniquely amongst mammals, PRDM9 appears to be dysfunctional in dogs and other canids. In this paper, we investigate how the loss of PRDM9 has affected the fine-scale recombination landscape in dogs and contrast this with patterns seen in other species.
Evolutionary theory assumes that mutations occur randomly in the genome; however, studies performed in a variety of organisms indicate the existence of context-dependent mutation biases. Sources of mutagenesis variation across large genomic contexts (e.g. hundreds of bases) have not been identified. Here, we use high-coverage whole genome sequencing of a conditional mismatch repair mutant line of diploid yeast to identify mutations that accumulated after 160 generations of growth. The vast majority of the mutations accumulated as insertion/deletions (in-dels) in homopolymeric (poly(dA:dT)) and repetitive DNA tracts. Surprisingly, the likelihood of an in-del mutation in a given poly(dA:dT) tract is increased by the presence of nearby poly(dA:dT) tracts in up to a 1000 bp region centered on the given tract. Our work suggests that specific mutation hotspots can contribute disproportionately to the genetic variation that is introduced into populations, and provides the first long-range genomic sequence context that contributes to mutagenesis.
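The sequence context described above can be made concrete with a small sketch that locates poly(dA:dT) tracts and counts, for each tract, how many other tracts fall inside a 1000 bp window centered on it. This is an illustration only, not the authors' analysis pipeline: the minimum tract length of 5, the center-to-center distance rule, the function names, and the toy sequence are all assumptions.

```python
import re


def find_poly_at_tracts(seq, min_len=5):
    """Return (start, end) pairs for runs of A or runs of T of at least
    min_len bases; on double-stranded DNA both are poly(dA:dT) tracts.
    min_len=5 is an assumed threshold."""
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


def neighbors_in_window(tracts, window=1000):
    """For each tract, count the other tracts whose center lies within
    window / 2 bases of its own center."""
    half = window / 2
    centers = [(start + end) / 2 for start, end in tracts]
    return [
        sum(1 for j, other in enumerate(centers)
            if j != i and abs(other - center) <= half)
        for i, center in enumerate(centers)
    ]


# Toy sequence: two nearby tracts, then a third more than 1 kb downstream.
seq = "AAAAAA" + "GC" * 10 + "TTTTT" + "GC" * 600 + "AAAAAAA"
tracts = find_poly_at_tracts(seq)
neighbor_counts = neighbors_in_window(tracts)
print(tracts)
print(neighbor_counts)
```

Per-tract neighbor counts of this kind could then be compared between tracts that did and did not acquire an in-del.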
DNA Mismatch Repair; homopolymeric tracts; mutation hotspot
How organisms adapt to the range of environments they encounter is a fundamental question in biology. Elucidating the genetic basis of adaptation is a difficult task, especially when the targets of selection are not known. Emerging sequencing technologies and assembly algorithms facilitate the genomic dissection of adaptation and population differentiation in a vast array of organisms. Here we describe the attributes of Kryptolebias marmoratus, one of two known self-fertilizing hermaphroditic vertebrates that make this fish an attractive genetic system and a model for understanding the genomics of adaptation. Long periods of selfing have resulted in populations composed of many distinct naturally homozygous strains with a variety of identifiable, and apparently heritable, phenotypes. There also is strong population genetic structure across a diverse range of mangrove habitats, making this a tractable system in which to study differentiation both within and among populations. The ability to rear K. marmoratus in the laboratory contributes further to its value as a model for understanding the genetic drivers for adaptation. To date, microsatellite markers distinguish wild isogenic strains but the naturally high homozygosity improves the quality of de novo assembly of the genome and facilitates the identification of genetic variants associated with phenotypes. Gene annotation can be accomplished with RNA-sequencing data in combination with de novo genome assembly. By combining genomic information with extensive laboratory-based phenotyping, it becomes possible to map genetic variants underlying differences in behavioral, life-history, and other potentially adaptive traits. Emerging genomic technologies provide the required resources for establishing K. marmoratus as a new model organism for behavioral genetics and evolutionary genetics research.
To gain insights into evolutionary forces that have shaped the history of Bornean and Sumatran populations of orang-utans, we compare patterns of variation across more than 11 million single nucleotide polymorphisms found by previous mitochondrial and autosomal genome sequencing of 10 wild-caught orang-utans. Our analysis of the mitochondrial data yields a far more ancient split time between the two populations (∼3.4 million years ago) than estimates based on autosomal data (0.4 million years ago), suggesting a complex speciation process with moderate levels of primarily male migration. We find that the distribution of selection coefficients consistent with the observed frequency spectrum of autosomal non-synonymous polymorphisms in orang-utans is similar to the distribution in humans. Our analysis indicates that 35% of genes have evolved under detectable negative selection. Overall, our findings suggest that purifying natural selection, genetic drift, and a complex demographic history are the dominant drivers of genome evolution for the two orang-utan populations.
Identifying ancestry along each chromosome in admixed individuals provides a wealth of information for understanding the population genetic history of admixture events and is valuable for admixture mapping and identifying recent targets of selection. We present PCAdmix (available at https://sites.google.com/site/pcadmix/home), a Principal Components-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased single-nucleotide polymorphism (SNP) genotypes of admixed individuals. We compare our method to HAPMIX on simulated data from two ancestral populations, and we find high concordance between the methods. Our method also has better accuracy than LAMP when applied to three-population admixture, a situation as yet unaddressed by HAPMIX. Finally, we apply our method to a data set of four Latino populations with European, African, and Native American ancestry. We find evidence of assortative mating in each of the four populations, and we identify regions of shared ancestry that may be recent targets of selection and could serve as candidate regions for admixture-based association mapping.
Admixture; Principal Components Analysis (PCA); Local Ancestry Deconvolution; Haplotype-Based; Forward-Backward Algorithm
Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.
The environment humans inhabit has changed many times in the last 100,000 years. Migration and dynamic local environments can lead to genetic adaptations favoring beneficial traits. Many genes responsible for these adaptations can alter disease susceptibility. Genes can also affect disease susceptibility by varying randomly across different populations. We have studied genetic variants that are known to modify disease susceptibility in the context of worldwide migration. We found that variants associated with 11 diseases have been affected to an extent that is not explained by random variation. We also found that the genetic risk of type 2 diabetes has steadily decreased along the worldwide human migration trajectory from Africa to America.
We present an approach for identifying genes under natural selection using polymorphism and divergence data from synonymous and non-synonymous sites within genes. A generalized linear mixed model is used to model the genome-wide variability among categories of mutations and estimate its functional consequence. We demonstrate how the model's estimated fixed and random effects can be used to identify genes under selection. The parameter estimates from our generalized linear model can be transformed to yield population genetic parameter estimates for quantities including the average selection coefficient for new mutations at a locus, the synonymous and non-synonymous mutation rates, and species divergence times. Furthermore, our approach incorporates stochastic variation due to the evolutionary process and can be fit using standard statistical software. The model is fit in both the empirical Bayes and Bayesian settings using the lme4 package in R, and Markov chain Monte Carlo methods in WinBUGS. Using simulated data we compare our method to existing approaches for detecting genes under selection: the McDonald-Kreitman test, and two versions of the Poisson random field based method MKprf. Overall, we find our method universally outperforms existing methods for detecting genes subject to selection using polymorphism and divergence data.
We present a new methodology, SnIPRE, for identifying genes under natural selection. SnIPRE is a “McDonald-Kreitman” type of analysis, in that it is based on MK table data and has an advantage over other types of statistics because it is robust to demography. Similar to the MKprf method, SnIPRE makes use of genome-wide information to increase power, but is non-parametric in the sense that it makes no assumptions (and does not require estimation) of parameters such as mutation rate and species divergence time in order to identify genes under selection. In simulations SnIPRE outperforms both the MK statistic and the two versions of MKprf considered. We then apply our method to Drosophila and human-chimp data.
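For readers unfamiliar with MK table data, the sketch below runs the classical single-gene McDonald-Kreitman analysis that serves as the baseline above. It is not SnIPRE, which pools information across genes in a mixed model. The neutrality index is the odds ratio (Pn/Ps)/(Dn/Ds), and significance is assessed with a two-sided Fisher's exact test written out with the standard library. The counts are illustrative only and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table that is no more
    likely than the observed one, with margins held fixed."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def pmf(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = pmf(a)
    total = sum(pmf(x) for x in range(lo, hi + 1)
                if pmf(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, p-value) for one gene's MK table.
    dn, ds: fixed non-synonymous / synonymous differences;
    pn, ps: polymorphic non-synonymous / synonymous sites."""
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(dn, ds, pn, ps)


# Illustrative counts only: an excess of fixed non-synonymous differences.
ni, p_value = mk_test(dn=7, ds=17, pn=2, ps=42)
print(round(ni, 3), p_value)
```

A neutrality index below 1 points to an excess of non-synonymous divergence, as expected under positive selection, while a value above 1 points to an excess of non-synonymous polymorphism.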
Blond hair is a rare human phenotype found almost exclusively in Europe and Oceania. Here, we identify a cysteine-to-arginine change at a highly conserved residue in tyrosinase-related protein 1 (TYRP1) as the single source of blond hair in Solomon Islanders. This missense mutation is predicted to impact catalytic activity of the protein and causes blond hair through a recessive mode of inheritance. The novel mutation is at a frequency of 26% in the Solomon Islands but is absent outside of Oceania and represents the largest genetic effect on a visible human phenotype reported to date. Our findings demonstrate that alleles of large effect reach appreciable frequencies in geographically isolated populations and underscore the importance of extending medical genomics to humans worldwide.
One of the main findings derived from the analysis of the Neandertal genome was the evidence for admixture between Neandertals and non-African modern humans. An alternative scenario is that the ancestral population of non-Africans was closer to Neandertals than to Africans because of ancient population substructure. Thus, the study of North African populations is crucial for testing both hypotheses. We analyzed a total of 780,000 SNPs in 125 individuals representing seven different North African locations and searched for their ancestral/derived state in comparison to different human populations and Neandertals. We found that North African populations have a significant excess of derived alleles shared with Neandertals, when compared to sub-Saharan Africans. This excess is similar to that found in non-African humans, a fact that can be interpreted as a sign of Neandertal admixture. Furthermore, the Neandertal's genetic signal is higher in populations with a local, pre-Neolithic North African ancestry. Therefore, the detected ancient admixture is not due to recent Near Eastern or European migrations. Sub-Saharan populations are the only ones not affected by the admixture event with Neandertals.
Asian rice, Oryza sativa, is a cultivated, inbreeding species that feeds over half of the world's population. Understanding the genetic basis of diverse physiological, developmental, and morphological traits provides the basis for improving yield, quality, and sustainability of rice. Here we show the results of a genome-wide association study based on genotyping 44,100 SNP variants across 413 diverse accessions of O. sativa collected from 82 countries that were systematically phenotyped for 34 traits. Using cross-population-based mapping strategies, we identified dozens of common variants influencing numerous complex traits. Significant heterogeneity was observed in the genetic architecture associated with subpopulation structure and response to environment. This work establishes an open-source translational research platform for genome-wide association studies in rice that directly links molecular variation in genes and metabolic pathways with the germplasm resources needed to accelerate varietal development and crop improvement.
Understanding the genetics and physiology of domesticated species is important for crop improvement. By studying natural variation and the phenotypic traits of 413 diverse accessions of rice, Zhao et al. identify many common genetic variants that influence quantitative traits such as seed size and flowering time.
Fine-scale population structure characterizes most continents and is especially pronounced in non-cosmopolitan populations. Roughly half of the world's population remains non-cosmopolitan and even populations within cities often assort along ethnic and linguistic categories. Barriers to random mating can be ecologically extreme, such as the Sahara Desert, or cultural, such as the Indian caste system. In either case, subpopulations accumulate genetic differences if the barrier is maintained over multiple generations. Genome-wide polymorphism data, initially with only a few hundred autosomal microsatellites, have clearly established differences in allele frequency not only among continental regions, but also within continents and within countries. We review recent evidence from the analysis of genome-wide polymorphism data for genetic boundaries delineating human population structure and the main demographic and genomic processes shaping variation, and discuss the implications of population structure for the distribution and discovery of disease-causing genetic variants, in the light of the imminent availability of sequencing data for a multitude of diverse human genomes.
Inferring population structure using Bayesian clustering programs often requires a priori specification of the number of subpopulations, K, from which the sample has been drawn. Here, we explore the utility of a common Bayesian model selection criterion, the Deviance Information Criterion (DIC), for estimating K. We evaluate the accuracy of DIC, as well as other popular approaches, on datasets generated by coalescent simulations under various demographic scenarios. We find that DIC outperforms competing methods in many genetic contexts, validating its application in assessing population structure.
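As a minimal sketch of how DIC can be used for model choice, the code below applies the textbook definition, DIC = D_bar + p_D with p_D = D_bar - D(theta_bar), to MCMC deviance samples and picks the candidate number of subpopulations (called `k` in the code) with the lowest score. Whether the study used exactly this variant is not stated here, and the deviance values, the `runs` layout, and the function names are hypothetical.

```python
from statistics import mean


def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, p_D). D_bar is the posterior mean deviance and
    p_D = D_bar - D(theta_bar) is the effective number of parameters."""
    d_bar = mean(deviance_samples)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d, p_d


def choose_k(runs):
    """runs maps each candidate k to (deviance_samples, D(theta_bar)).
    Return the k with the smallest DIC together with all scores."""
    scores = {k: dic(samples, d_hat)[0] for k, (samples, d_hat) in runs.items()}
    return min(scores, key=scores.get), scores


# Hypothetical deviance output from three clustering runs.
runs = {
    1: ([1000, 1010, 990], 995),
    2: ([900, 910, 890], 880),
    3: ([895, 905, 900], 860),
}
best_k, scores = choose_k(runs)
print(best_k, scores)
```

In this toy example the runs with two and three clusters fit equally well on average, but the larger model pays a higher complexity penalty, so the smaller one is preferred.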
Oryza sativa or Asian cultivated rice is one of the major cereal grass species domesticated for human food use during the Neolithic. Domestication of this species from the wild grass Oryza rufipogon was accompanied by changes in several traits, including seed shattering, percent seed set, tillering, grain weight, and flowering time. Quantitative trait locus (QTL) mapping has identified three genomic regions in chromosome 3 that appear to be associated with these traits. We would like to study whether these regions show signatures of selection and whether the same genetic basis underlies the domestication of different rice varieties. Fragments of 88 genes spanning these three genomic regions were sequenced from multiple accessions of two major varietal groups in O. sativa—indica and tropical japonica—as well as the ancestral wild rice species O. rufipogon. In tropical japonica, the levels of nucleotide variation in these three QTL regions are significantly lower compared to genome-wide levels, and coalescent simulations based on a complex demographic model of rice domestication indicate that these patterns are consistent with selection. In contrast, there is no significant reduction in nucleotide diversity in the homologous regions in indica rice. These results suggest that there are differences in the genetic and selective basis for domestication between these two Asian rice varietal groups.
Deep resequencing of functional regions in human genomes is key to identifying potentially causal rare variants for complex disorders. Here, we present the results from a large-sample resequencing (n = 285 patients) study of candidate genes coupled with population genetics and statistical methods to identify rare variants associated with Autism Spectrum Disorder and Schizophrenia. Three genes, MAP1A, GRIN2B, and CACNA1F, were consistently identified by different methods as having significant excess of rare missense mutations in either one or both disease cohorts. In a broader context, we also found that the overall site frequency spectrum of variation in these cases is best explained by population models of both selection and complex demography rather than neutral models or models accounting for complex demography alone. Mutations in the three disease-associated genes explained much of the difference in the overall site frequency spectrum among the cases versus controls. This study demonstrates that genes associated with complex disorders can be mapped using resequencing and analytical methods with sample sizes far smaller than those required by genome-wide association studies. Additionally, our findings support the hypothesis that rare mutations account for a proportion of the phenotypic variance of these complex disorders.
It is widely accepted that genetic factors play important roles in the etiology of neurological diseases. However, the nature of the underlying genetic variation remains unclear. Critical questions in the field of human genetics relate to the frequency and size effects of genetic variants associated with disease. For instance, the common disease–common variant model is based on the idea that sets of common variants explain a significant fraction of the variance found in common disease phenotypes. On the other hand, rare variants may have strong effects and therefore largely contribute to disease phenotypes. Due to their high penetrance and reduced fitness, such variants are maintained in the population at low frequencies, thus limiting their detection in genome-wide association studies. Here, we use a resequencing approach on a cohort of 285 Autism Spectrum Disorder and Schizophrenia patients and perform several analyses, enhanced with population genetic approaches, to identify variants associated with both diseases. Our results demonstrate an excess of rare variants in these disease cohorts and identify genes with negative (deleterious) selection coefficients, suggesting an accumulation of variants of detrimental effects. Our results present further evidence for rare variants explaining a component of the genetic etiology of autism and schizophrenia.
Approximately 100 mouse genes undergo genomic imprinting, whereby one of the two parental alleles is epigenetically silenced. Imprinted genes influence processes including development, X chromosome inactivation, obesity, schizophrenia, and diabetes, motivating the identification of all imprinted loci. Local sequence features have been used to predict candidate imprinted genes, but rigorous testing using reciprocal crosses validated only three, one of which resided in previously identified imprinting clusters. Here we show that specific epigenetic features in mouse cells correlate with imprinting status in mice, and we identify hundreds of additional genes predicted to be imprinted in the mouse. We used a multitiered approach to validate imprinted expression, including use of a custom single nucleotide polymorphism array and traditional molecular methods. Of 65 candidates subjected to molecular assays for allele-specific expression, we found 10 novel imprinted genes that were maternally expressed in the placenta.
Balancing selection is potentially an important biological force for maintaining advantageous genetic diversity in populations, including variation that is responsible for long-term adaptation to the environment. By serving as a means to maintain genetic variation, it may be particularly relevant to maintaining phenotypic variation in natural populations. Nevertheless, its prevalence and specific targets in the human genome remain largely unknown. We have analyzed the patterns of diversity and divergence of 13,400 genes in two human populations using an unbiased single-nucleotide polymorphism data set, a genome-wide approach, and a method that incorporates demography in neutrality tests. We identified an unbiased catalog of genes with signatures of long-term balancing selection, which includes immunity genes as well as genes encoding keratins and membrane channels; the catalog also shows enrichment in functional categories involved in cellular structure. Patterns are mostly concordant in the two populations, with a small fraction of genes showing population-specific signatures of selection. Power considerations indicate that our findings represent a subset of all targets in the genome, suggesting that although balancing selection may not have an obvious impact on a large proportion of human genes, it is a key force affecting the evolution of a number of genes in humans.
overdominance; frequency-dependent selection; heterosis; human evolution; population genetics; human diversity
Motivation: The development of new high-throughput genotyping products requires a significant investment in testing and training samples to evaluate and optimize the product before it can be used reliably on new samples. One reason for this is current methods for automated calling of genotypes are based on clustering approaches which require a large number of samples to be analyzed simultaneously, or an extensive training dataset to seed clusters. In systems where inbred samples are of primary interest, current clustering approaches perform poorly due to the inability to clearly identify a heterozygote cluster.
Results: As part of the development of two custom single nucleotide polymorphism genotyping products for Oryza sativa (domestic rice), we have developed a new genotype calling algorithm called ‘ALCHEMY’ based on statistical modeling of the raw intensity data rather than modelless clustering. A novel feature of the model is the ability to estimate and incorporate inbreeding information on a per sample basis allowing accurate genotyping of both inbred and heterozygous samples even when analyzed simultaneously. Since clustering is not used explicitly, ALCHEMY performs well on small sample sizes with accuracy exceeding 99% with as few as 18 samples.
Availability: ALCHEMY is available for both commercial and academic use free of charge and distributed under the GNU General Public License at http://alchemy.sourceforge.net/
Supplementary information: Supplementary data are available at Bioinformatics online.
Pedigree studies of complex heritable diseases often feature nominal or ordinal phenotypic measurements and missing genetic marker or phenotype data.
We have developed a Bayesian method for Linkage analysis of Ordinal and Categorical traits (LOCate) that can analyze complex genealogical structure for family groups and incorporate missing data. LOCate uses a Gibbs sampling approach to assess linkage, incorporating a simulated tempering algorithm for fast mixing. While our treatment is Bayesian, we develop a LOD (log of odds) score estimator for assessing linkage from Gibbs sampling that is highly accurate for simulated data. LOCate is applicable to linkage analysis for ordinal or nominal traits, a versatility which we demonstrate by analyzing simulated data with a nominal trait, on which LOCate outperforms LOT, an existing method which is designed for ordinal traits. We additionally demonstrate our method's versatility by analyzing a candidate locus (D2S1788) for panic disorder in humans, in a dataset with a large amount of missing data, which LOT was unable to handle.
LOCate's accuracy and applicability to both ordinal and nominal traits will prove useful to researchers interested in mapping loci for categorical traits.
Quantifying the number of deleterious mutations per diploid human genome is of critical concern to both evolutionary and medical geneticists [1–3]. Here, we combine genome-wide polymorphism data from PCR-based exon re-sequencing, comparative genomic data across mammalian species, and protein structure predictions to estimate the number of functionally consequential mutations carried by each of 15 African American (AA) and 20 European American (EA) individuals. We find that AAs show significantly higher levels of nucleotide heterozygosity than do EAs for all categories of functional mutations considered including synonymous, nonsynonymous, predicted “benign”, predicted “possibly damaging” and predicted “probably damaging” mutations. This result is wholly consistent with previous work showing higher overall levels of nucleotide variation in African populations as compared to Europeans [4]. EA individuals, on the other hand, have significantly more genotypes homozygous for the derived allele at synonymous and nonsynonymous SNPs and for the damaging allele at “probably damaging” SNPs than AAs do. Surprisingly, for SNPs segregating only in one population or the other, the proportion of nonsynonymous SNPs is significantly higher in the EA sample (55.4%) than in the AA sample (47.0%; P < 2.3 × 10^−37). We observe a similar proportional excess of SNPs that are inferred to be “probably damaging” (15.9% EA; 12.1% AA; P < 3.3 × 10^−11). Using extensive simulations, we show that this excess proportion of segregating damaging alleles in Europeans is likely a consequence of a bottleneck that Europeans experienced around the time of the migration out of Africa.