The midge, Belgica antarctica, is the only insect endemic to Antarctica, and thus it offers a powerful model for probing responses to extreme temperatures, freeze tolerance, dehydration, osmotic stress, ultraviolet radiation, and other forms of environmental stress. Here we present the first genome assembly of an extremophile, the first dipteran in the family Chironomidae, and the first Antarctic eukaryote to be sequenced. At 99 megabases, B. antarctica has the smallest insect genome sequenced thus far. Though it has a similar number of genes as other Diptera, the midge genome has very low repeat density and a reduction in intron length. Environmental extremes appear to constrain genome architecture, not gene content. The few transposable elements present are mainly ancient, inactive retroelements. An abundance of genes associated with development, regulation of metabolism, and responses to external stimuli may reflect adaptations for surviving in this harsh environment.
Evolutionary theory assumes that mutations occur randomly in the genome; however, studies performed in a variety of organisms indicate the existence of context-dependent mutation biases. Sources of mutagenesis variation across large genomic contexts (e.g. hundreds of bases) have not been identified. Here, we use high-coverage whole genome sequencing of a conditional mismatch repair mutant line of diploid yeast to identify mutations that accumulated after 160 generations of growth. The vast majority of the mutations accumulated as insertion/deletions (in-dels) in homopolymeric (poly(dA:dT)) and repetitive DNA tracts. Surprisingly, the likelihood of an in-del mutation in a given poly(dA:dT) tract is increased by the presence of nearby poly(dA:dT) tracts in up to a 1000 bp region centered on the given tract. Our work suggests that specific mutation hotspots can contribute disproportionately to the genetic variation that is introduced into populations, and provides the first long-range genomic sequence context that contributes to mutagenesis.
DNA Mismatch Repair; homopolymeric tracts; mutation hotspot
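The long-range context described above (other poly(dA:dT) tracts within a 1000 bp region centered on a given tract) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy sequence, and the minimum tract length of 5 bases are assumptions made only for illustration.

```python
import re

def find_poly_at_tracts(seq, min_len=5):
    """Return (start, end) spans of A-only or T-only runs of at least min_len bases.

    min_len=5 is an illustrative threshold, not a value taken from the study.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [m.span() for m in pattern.finditer(seq.upper())]

def neighbours_in_window(tracts, window=1000):
    """For each tract, count the other tracts whose midpoint lies inside a
    window of `window` bp centered on that tract's midpoint."""
    half = window / 2
    mids = [(start + end) / 2 for start, end in tracts]
    return [
        sum(1 for j, other in enumerate(mids) if j != i and abs(other - mid) <= half)
        for i, mid in enumerate(mids)
    ]

# Toy sequence (hypothetical): two nearby tracts and one isolated tract.
seq = "GC" * 20 + "A" * 8 + "GC" * 50 + "T" * 6 + "GC" * 600 + "A" * 5 + "GC" * 10
tracts = find_poly_at_tracts(seq)
counts = neighbours_in_window(tracts)
for span, n in zip(tracts, counts):
    print(span, n)
```

In this toy case the first two tracts each have one neighbour in their window and the third has none; a per-tract count of this kind could then be compared against observed in-del events.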
A striking finding from recent large-scale sequencing efforts is that the vast majority of variants in the human genome are rare and found within single populations or lineages. These observations hold important implications for the design of the next round of disease variant discovery efforts: if genetic variants that influence disease risk follow the same trend, then we expect to see population-specific disease associations that require large sample sizes for detection. To address this challenge, and due to the still prohibitive cost of sequencing large cohorts, researchers have developed a new generation of low-cost genotyping arrays that assay rare variation previously identified from large exome sequencing studies. Genotyping approaches rely not only on directly observing variants, but also on phasing and imputation methods that use publicly available reference panels to infer unobserved variants in a study cohort. Rare variant exome arrays are intentionally enriched for variants likely to be disease causing, and here we assay the ability of the first commercially available rare exome variant array (the Illumina Infinium HumanExome BeadChip) to also tag other potentially damaging variants not molecularly assayed. Using full sequence data from chromosome 22 from the phase I 1000 Genomes Project, we evaluate three methods for imputation (BEAGLE, MaCH-Admix, and SHAPEIT2/IMPUTE2) with the rare exome variant array under varied study panel sizes, reference panel sizes, and LD structures via population differences. We find that imputation is more accurate across both the genome and exome for common variant arrays than the next generation array for all allele frequencies, including rare alleles. We also find that imputation is the least accurate in African populations, and accuracy is substantially improved for rare variants when the same population is included in the reference panel.
Depending on the goals of GWAS researchers, our results will aid budget decisions by helping determine whether money is best spent sequencing the genomes of smaller sample sizes, genotyping larger sample sizes with rare and/or common variant arrays and imputing SNPs, or some combination of the two.
The American College of Medical Genetics and Genomics (ACMG) recently released guidelines regarding the reporting of incidental findings in sequencing data. Given the availability of Direct to Consumer (DTC) genetic testing and the falling cost of whole exome and genome sequencing, individuals will increasingly have the opportunity to analyze their own genomic data. We have developed a web-based tool, PATH-SCAN, which annotates individual genomes and exomes for ClinVar designated pathogenic variants found within the genes from the ACMG guidelines. Because mutations in these genes predispose individuals to conditions with actionable outcomes, our tool will allow individuals or researchers to identify potential risk variants in order to consult physicians or genetic counselors for further evaluation. Moreover, our tool allows individuals to anonymously submit their pathogenic burden, so that we can crowd source the collection of quantitative information regarding the frequency of these variants. We tested our tool on 1092 publicly available genomes from the 1000 Genomes project, 163 genomes from the Personal Genome Project, and 15 genomes from a clinical genome sequencing research project. Excluding the most commonly seen variant in 1000 Genomes, about 20% of all genomes analyzed had a ClinVar designated pathogenic variant that required further evaluation.
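The annotation step described above, keeping only variants that fall in genes from the ACMG guidelines and carry a pathogenic ClinVar designation, might be sketched as follows. This is a hypothetical illustration, not PATH-SCAN's code: the record layout, the field names, the variant identifiers, and the three-gene subset are all assumptions.

```python
# Illustrative subset only; the ACMG guidelines list many more genes.
ACMG_GENES_SUBSET = {"BRCA1", "BRCA2", "TP53"}

def flag_pathogenic(variants, gene_list):
    """Keep variants located in a listed gene whose clinical significance
    is recorded as pathogenic (field names are hypothetical)."""
    return [
        v for v in variants
        if v["gene"] in gene_list
        and v["clinical_significance"].strip().lower() == "pathogenic"
    ]

# Made-up records for one individual; the ids are placeholders, not real variants.
variants = [
    {"id": "v1", "gene": "BRCA2", "clinical_significance": "Pathogenic"},
    {"id": "v2", "gene": "BRCA2", "clinical_significance": "Benign"},
    {"id": "v3", "gene": "GENE_X", "clinical_significance": "Pathogenic"},
]

flagged = flag_pathogenic(variants, ACMG_GENES_SUBSET)
print([v["id"] for v in flagged])
```

Only the first record passes both filters. The length of `flagged` is one simple way to express a per-individual pathogenic burden of the kind that could be submitted anonymously.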
Mexico harbors great cultural and ethnic diversity, yet fine-scale patterns of human genome-wide variation from this region remain largely uncharacterized. We studied genomic variation within Mexico from over 1,000 individuals representing 20 indigenous and 11 mestizo populations. We found striking genetic stratification among indigenous populations within Mexico at varying degrees of geographic isolation. Some groups were as differentiated as Europeans are from East Asians. Pre-Columbian genetic substructure is recapitulated in the indigenous ancestry of admixed mestizo individuals across the country. Furthermore, two independently phenotyped cohorts of Mexicans and Mexican Americans showed a significant association between sub-continental ancestry and lung function. Thus, accounting for fine-scale ancestry patterns is critical for medical and population genetic studies within Mexico, in Mexican-descent populations, and likely in many other populations worldwide.
Atopy varies by ethnicity even within Latino groups. This variation may be due to environmental, socio-cultural or genetic factors.
To examine risk factors for atopy within a nationwide study of U.S. Latino children with and without asthma.
Aeroallergen skin test response was analyzed in 1830 U.S. Latino subjects. Key determinants of atopy included country/region of origin, generation in the U.S., acculturation, genetic ancestry, and site to which individuals migrated. Serial multivariate zero-inflated negative binomial regressions, stratified by asthma status, examined the association of each key determinant variable with the number of positive skin tests. In addition, the independent effect of each key variable was determined by including all key variables in the final models.
In baseline analyses, African ancestry was associated with 3 times as many positive skin tests in participants with asthma (95% CI:1.62–5.57) and 3.26 times as many positive skin tests in control participants (95% CI: 1.02–10.39). Generation and recruitment site were also associated with atopy in crude models. In final models adjusted for key variables, Puerto Rican [exp(β) (95%CI): 1.31(1.02–1.69)] and mixed ethnicity [exp(β) (95%CI):1.27(1.03–1.56)] asthmatics had a greater probability of positive skin tests compared to Mexican asthmatics. Ancestry associations were abrogated by recruitment site, but not region of origin.
Puerto Rican ethnicity and mixed origin were associated with degree of atopy within U.S. Latino children with asthma. African ancestry was not associated with degree of atopy after adjusting for recruitment site. Local environment variation, represented by site, was associated with degree of sensitization.
Latino; atopy; region of origin; genetic ancestry; immigration; skin test; aeroallergen
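In a count regression such as the one above, exp(β) is read as a ratio of expected counts. As a rough consistency check on the reported results, the sketch below backs out the log-scale coefficient and its standard error from a reported 95% confidence interval. It assumes the intervals are Wald intervals symmetric on the log scale, which the text does not state; the function name is hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def ratio_from_ci(lower, upper):
    """Recover (exp(beta), beta, se) from a 95% CI on exp(beta),
    assuming the interval is exp(beta +/- Z_95 * se)."""
    log_lo, log_hi = math.log(lower), math.log(upper)
    beta = (log_lo + log_hi) / 2          # midpoint on the log scale
    se = (log_hi - log_lo) / (2 * Z_95)   # half-width divided by the quantile
    return math.exp(beta), beta, se

# Intervals reported for African ancestry in the baseline analyses.
ratio_asthma, beta_asthma, se_asthma = ratio_from_ci(1.62, 5.57)
ratio_control, beta_control, se_control = ratio_from_ci(1.02, 10.39)
print(round(ratio_asthma, 2), round(ratio_control, 2))
```

Under that assumption the recovered ratios land close to the reported 3 and 3.26, and the much wider interval in control participants corresponds to a larger standard error.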
Large-scale sequencing efforts have documented extensive genetic variation within the human genome. However, our understanding of the origins, global distribution, and functional consequences of this variation is far from complete. While regulatory variation influencing gene expression has been studied within a handful of populations, the breadth of transcriptome differences across diverse human populations has not been systematically analyzed. To better understand the spectrum of gene expression variation, alternative splicing, and the population genetics of regulatory variation in humans, we have sequenced the genomes, exomes, and transcriptomes of EBV transformed lymphoblastoid cell lines derived from 45 individuals in the Human Genome Diversity Panel (HGDP). The populations sampled span the geographic breadth of human migration history and include Namibian San, Mbuti Pygmies of the Democratic Republic of Congo, Algerian Mozabites, Pathan of Pakistan, Cambodians of East Asia, Yakut of Siberia, and Mayans of Mexico. We discover that approximately 25.0% of the variation in gene expression found amongst individuals can be attributed to population differences. However, we find few genes that are systematically differentially expressed among populations. Of this population-specific variation, 75.5% is due to expression rather than splicing variability, and we find few genes with strong evidence for differential splicing across populations. Allelic expression analyses indicate that previously mapped common regulatory variants identified in eight populations from the International Haplotype Map Phase 3 project have similar effects in our seven sampled HGDP populations, suggesting that the cellular effects of common variants are shared across diverse populations. 
Together, these results provide a resource for studies analyzing functional differences across populations by estimating the degree of shared gene expression, alternative splicing, and regulatory genetics across populations from the broadest points of human migration history yet sampled.
Previous gene expression studies have identified factors influencing population-level variation in gene regulation. However, these efforts have been limited to a small set of well-studied populations. By leveraging the high resolution of RNA sequencing and broad population sampling, we survey the landscape of transcriptome variation across a globally distributed set of seven populations that span a breadth of human genetic variation and major dispersal events. We assess differences in gene expression, transcript structure, and regulatory variation. We find only 44 transcripts that show significant differences in expression, likely as a result of the small sample size, but we find that 25% of the variance in gene expression is due to population differences. This is a larger fraction than previously observed, and it is likely due to the greater breadth of human diversity assayed in this study. We also find that population-specific variance is mostly due to transcription variability rather than the configuration of expressed gene products. Additionally, known common regulatory variants have similar effects across populations including those we study here. These data and results serve as a resource cataloging the wide array of gene expression regulation affecting population variation among diverse groups, improving our understanding of transcriptional diversity.
The Y chromosome and the mitochondrial genome (mtDNA) have been used to estimate when the common patrilineal and matrilineal ancestors of humans lived. We sequenced the genomes of 69 males from nine populations, including two in which we find basal branches of the Y chromosome tree. We identify ancient phylogenetic structure within African haplogroups and resolve a long-standing ambiguity deep within the tree. Applying equivalent methodologies to the Y and mtDNA, we estimate the time to the most recent common ancestor (TMRCA) of the Y chromosome to be 120–156 thousand years and the mtDNA TMRCA to be 99–148 ky. Our findings suggest that, contrary to prior claims, male lineages do not coalesce significantly more recently than female lineages.
Genome sequencing of the 5,300-year-old mummy of the Tyrolean Iceman, found in 1991 on a glacier near the border of Italy and Austria, has yielded new insights into his origin and relationship to modern European populations. A key finding of that study was an apparent recent common ancestry with individuals from Sardinia, based largely on the Y chromosome haplogroup and common autosomal SNP variation. Here, we compiled and analyzed genomic datasets from both modern and ancient Europeans, including genome sequence data from over 400 Sardinians and two ancient Thracians from Bulgaria, to investigate this result in greater detail and determine its implications for the genetic structure of Neolithic Europe. Using whole-genome sequencing data, we confirm that the Iceman is, indeed, most closely related to Sardinians. Furthermore, we show that this relationship extends to other individuals from cultural contexts associated with the spread of agriculture during the Neolithic transition, in contrast to individuals from a hunter-gatherer context. We hypothesize that this genetic affinity of ancient samples from different parts of Europe with Sardinians represents a common genetic component that was geographically widespread across Europe during the Neolithic, likely related to migrations and population expansions associated with the spread of agriculture.
The analysis of the genome of the Tyrolean Iceman, a 5,300 year old mummy from Central Europe, revealed a surprising recent common ancestry with modern Sardinians for this ancient genome. However, this study was limited both by the availability of data from Sardinians and by a lack of genomic data from other ancient European samples. Here, we use genomic data from modern Sardinians and from ancient European individuals from different geographic regions and cultural contexts, to demonstrate that this ancestry component is shared among individuals associated with the onset of agriculture in Europe. Our results thus suggest that the Iceman's Sardinian ancestry actually reflects a more widespread genetic component related to the migration of people during the Neolithic transition in Central Europe.
We present an Aboriginal Australian genomic sequence obtained from a 100-year-old lock of hair donated by an Aboriginal man from southern Western Australia in the early 20th century. We detect no evidence of European admixture and estimate contamination levels to be below 0.5%. We show that Aboriginal Australians are descendants of an early human dispersal into eastern Asia, possibly 62,000 to 75,000 years ago. This dispersal is separate from the one that gave rise to modern Asians 25,000 to 38,000 years ago. We also find evidence of gene flow between populations of the two dispersal waves prior to the divergence of Native Americans from modern Asian ancestors. Our findings support the hypothesis that present-day Aboriginal Australians descend from the earliest humans to occupy Australia, likely representing one of the oldest continuous populations outside Africa.
Targeted capture of genomic regions reduces sequencing cost while generating
higher coverage by allowing biomedical researchers to focus on specific loci
of interest, such as exons. Targeted capture also has the potential to
facilitate the generation of genomic data from DNA collected via saliva or
buccal cells. DNA samples derived from these cell types tend to have a lower
human DNA yield, may be degraded from age and/or have contamination from
bacteria or other ambient oral microbiota. However, thousands of samples
have been previously collected from these cell types, and saliva collection
has the advantage that it is non-invasive and appropriate for a wide
variety of research.
We demonstrate successful enrichment and sequencing of 15 South African
KhoeSan exomes and 2 full genomes with samples initially derived from
saliva. The expanded exome dataset enables us to characterize genetic
diversity free from ascertainment bias for multiple KhoeSan populations,
including new exome data from six HGDP Namibian San, revealing substantial
population structure across the Kalahari Desert region. Additionally, we
discover and independently verify thirty-one previously unknown KIR
alleles using methods we developed to accurately map and call the highly
polymorphic HLA and KIR loci from exome capture data.
Finally, we show that exome capture of saliva-derived DNA yields sufficient
non-human sequences to characterize oral microbial communities, including
detection of bacteria linked to oral disease (e.g. Prevotella
melaninogenica). For comparison, two samples were sequenced using
standard full genome library preparation without exome capture and we found
no systematic bias of metagenomic information between exome-captured and
non-captured data.
DNA from human saliva samples, collected and extracted using standard
procedures, can be used to successfully sequence high quality human exomes,
and metagenomic data can be derived from non-human reads. We find that
individuals from the Kalahari carry a higher oral pathogenic microbial load
than samples surveyed in the Human Microbiome Project. Additionally, rare
variants present in the exomes suggest strong population structure across
different KhoeSan populations.
Exomes; KhoeSan; Genetic diversity; Metagenomics; Microbiome
Streptococcus mutans is widely recognized as one of the key etiological agents of human dental caries. Despite its role in this important disease, our present knowledge of gene content variability across the species and its relationship to adaptation is minimal. Estimates of its demographic history are not available. In this study, we generated genome sequences of 57 S. mutans isolates, as well as representative strains of the most closely related species to S. mutans (S. ratti, S. macaccae, and S. criceti), to identify the overall structure and potential adaptive features of the dispensable and core components of the genome. We also performed population genetic analyses on the core genome of the species aimed at understanding the demographic history, and impact of selection shaping its genetic variation. The maximum gene content divergence among strains was approximately 23%, with the majority of strains diverging by 5–15%. The core genome consisted of 1,490 genes and the pan-genome approximately 3,296. Maximum likelihood analysis of the synonymous site frequency spectrum (SFS) suggested that the S. mutans population started expanding exponentially approximately 10,000 years ago (95% confidence interval [CI]: 3,268–14,344 years ago), coincidental with the onset of human agriculture. Analysis of the replacement SFS indicated that a majority of these substitutions are under strong negative selection, and the remainder evolved neutrally. A set of 14 genes was identified as being under positive selection, most of which were involved in either sugar metabolism or acid tolerance. Analysis of the core genome suggested that among 73 genes present in all isolates of S. mutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of S. mutans to its new niche, the human mouth, and with the dietary changes that accompanied the origin of agriculture.
Streptococcus mutans; demographic inference; cavities; bacterial evolution; pan and core genome; infectious disease
To identify the causative mutations in two early-onset canine retinal degenerations, crd1 and crd2, segregating in the American Staffordshire Terrier and the Pit Bull Terrier breeds, respectively.
Retinal morphology of crd1- and crd2-affected dogs was evaluated by light microscopy. DNA was extracted from affected and related unaffected controls. Association analysis was undertaken using the Illumina Canine SNP array and PLINK (crd1 study), or the Affymetrix Version 2 Canine array, the “MAGIC” genotype algorithm, and Fisher's Exact test for association (crd2 study). Positional candidate genes were evaluated for each disease.
Structural photoreceptor abnormalities were observed in crd1-affected dogs as young as 11 weeks old. Rod and cone inner segments (IS) and outer segments (OS) were abnormal in size, shape, and number. In crd2-affected dogs, rod and cone IS and OS were abnormal as early as 3 weeks of age, progressing with age to severe loss of the OS and thinning of the outer nuclear layer (ONL) by 12 weeks of age. Genome-wide association studies (GWAS) identified association at the telomeric end of CFA3 in crd1-affected dogs and on CFA33 in crd2-affected dogs. Candidate gene evaluation identified a three-base deletion in exon 21 of PDE6B in crd1-affected dogs, and a cytosine insertion in exon 10 of IQCB1 in crd2-affected dogs.
Identification of the mutations responsible for these two early-onset retinal degenerations provides new large animal models for comparative disease studies and evaluation of potential therapeutic approaches for the homologous human diseases.
We describe two genome-wide association studies in two closely related dog breeds affected with retinal degeneration, the pathology of the diseases, and the discovery of a novel deletion mutation in PDE6B and an insertion mutation in IQCB1 as the causes of these diseases.
retina; mutation; GWAS
To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we generated high-quality genome sequences from three gray wolves, one from each of the three putative centers of dog domestication, two basal dog lineages (Basenji and Dingo) and a golden jackal as an outgroup. Analysis of these sequences supports a demographic model in which dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow. In dogs, the domestication bottleneck involved at least a 16-fold reduction in population size, a much more severe bottleneck than estimated previously. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was substantially larger than represented by modern wolf populations. We narrow the plausible range for the date of initial dog domestication to an interval spanning 11–16 thousand years ago, predating the rise of agriculture. In light of this finding, we expand upon previous work regarding the increase in copy number of the amylase gene (AMY2B) in dogs, which is believed to have aided digestion of starch in agricultural refuse. We find standing variation in amylase copy number in wolves and little or no copy number increase in the Dingo and Husky lineages. In conjunction with the estimated timing of dog origins, these results provide additional support to archaeological finds, suggesting the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers is more closely related to dogs, and, instead, the sampled wolves form a sister monophyletic clade.
This result, in combination with dog-wolf admixture during the process of domestication, suggests that a re-evaluation of past hypotheses regarding dog origins is necessary.
The process of dog domestication is still poorly understood, largely because no studies thus far have leveraged deeply sequenced whole genomes from wolves and dogs to simultaneously evaluate support for the proposed source regions: East Asia, the Middle East, and Europe. To investigate dog origins, we sequence three wolf genomes from the putative centers of origin, two basal dog breeds (Basenji and Dingo), and a golden jackal as an outgroup. We find that none of the wolf lineages from the hypothesized domestication centers is supported as the source lineage for dogs, and that dogs and wolves diverged 11,000–16,000 years ago in a process involving extensive admixture and that was followed by a bottleneck in wolves. In addition, we investigate the amylase (AMY2B) gene family expansion in dogs, which has recently been suggested as being critical to domestication in response to increased dietary starch. We find standing variation in AMY2B copy number in wolves and show that some breeds, such as Dingo and Husky, lack the AMY2B expansion. This suggests that, at the beginning of the domestication process, dogs may have been characterized by a more carnivorous diet than their modern day counterparts, a diet held in common with early hunter-gatherers.
There is great scientific and popular interest in understanding the genetic history of populations in the Americas. We wish to understand when different regions of the continent were inhabited, where settlers came from, and how current inhabitants relate genetically to earlier populations. Recent studies unraveled parts of the genetic history of the continent using genotyping arrays and uniparental markers. The 1000 Genomes Project provides a unique opportunity for improving our understanding of population genetic history by providing over a hundred sequenced low coverage genomes and exomes from Colombian (CLM), Mexican-American (MXL), and Puerto Rican (PUR) populations. Here, we explore the genomic contributions of African, European, and especially Native American ancestry to these populations. Estimated Native American ancestry is in MXL, in CLM, and in PUR. Native American ancestry in PUR is most closely related to populations surrounding the Orinoco River basin, confirming the Southern America ancestry of the Taíno people of the Caribbean. We present new methods to estimate the allele frequencies in the Native American fraction of the populations, and model their distribution using a demographic model for three ancestral Native American populations. These ancestral populations likely split in close succession: the most likely scenario, based on a peopling of the Americas thousand years ago (kya), supports that the MXL Ancestors split kya, with a subsequent split of the ancestors to CLM and PUR kya. The model also features effective populations of in Mexico, in Colombia, and in Puerto Rico. Modeling Identity-by-descent (IBD) and ancestry tract length, we show that post-contact populations also differ markedly in their effective sizes and migration patterns, with Puerto Rico showing the smallest effective size and the earlier migration from Europe. 
Finally, we compare IBD and ancestry assignments to find evidence for relatedness among European founders to the three populations.
Populations of the Americas have a rich and heterogeneous genetic and cultural heritage that draws from a diversity of pre-Columbian Native American, European, and African populations. Characterizing this diversity facilitates the development of medical genetics research in diverse populations and the transfer of medical knowledge across populations. It also represents an opportunity to better understand the peopling of the Americas, from the crossing of Beringia to the post-Columbian era. Here, we take advantage of the sequencing of individuals of Colombian (CLM), Mexican (MXL), and Puerto Rican (PUR) origin by the 1000 Genomes Project to improve our demographic models for the peopling of the Americas. The divergence among African, European, and Native American ancestors to these populations enables us to infer the continent of origin at each locus in the sampled genomes. The resulting patterns of ancestry suggest complex post-Columbian migration histories, starting later in CLM than in MXL and PUR. Whereas European ancestral segments show evidence of relatedness, a demographic model of synonymous variation suggests that the Native American ancestors to MXL, PUR, and CLM panels split within a few hundred years over 12 thousand years ago. Together with early archeological sites in South America, these results support rapid divergence during the initial peopling of the Americas.
The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. While broad-scale properties of recombination appear similar to other mammalian species, our fine-scale estimates indicate that, in dogs, highly elevated recombination rates are observed in the vicinity of CpG-rich regions, including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. By comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred.
Recombination in mammalian genomes tends to occur within highly localized regions known as recombination hotspots. These hotspots appear to be a ubiquitous feature of mammalian genomes, but tend not to be shared between closely related species despite high levels of DNA sequence similarity. This disparity has been largely explained by the discovery of PRDM9 as the gene responsible for localizing recombination hotspots via recognition and binding to specific DNA motifs. Variation within PRDM9 can lead to changes to the recognized motif, and hence changes to the location of recombination hotspots throughout the genome. Multiple studies have shown that PRDM9 is under strong selective pressure, apparently leading to a rapid turnover of hotspot locations between species. However, uniquely amongst mammals, PRDM9 appears to be dysfunctional in dogs and other canids. In this paper, we investigate how the loss of PRDM9 has affected the fine-scale recombination landscape in dogs and contrast this with patterns seen in other species.
How organisms adapt to the range of environments they encounter is a fundamental question in biology. Elucidating the genetic basis of adaptation is a difficult task, especially when the targets of selection are not known. Emerging sequencing technologies and assembly algorithms facilitate the genomic dissection of adaptation and population differentiation in a vast array of organisms. Here we describe the attributes of Kryptolebias marmoratus, one of two known self-fertilizing hermaphroditic vertebrates, that make this fish an attractive genetic system and a model for understanding the genomics of adaptation. Long periods of selfing have resulted in populations composed of many distinct naturally homozygous strains with a variety of identifiable, and apparently heritable, phenotypes. There is also strong population genetic structure across a diverse range of mangrove habitats, making this a tractable system in which to study differentiation both within and among populations. The ability to rear K. marmoratus in the laboratory contributes further to its value as a model for understanding the genetic drivers of adaptation. To date, microsatellite markers distinguish wild isogenic strains, but the naturally high homozygosity improves the quality of de novo assembly of the genome and facilitates the identification of genetic variants associated with phenotypes. Gene annotation can be accomplished with RNA-sequencing data in combination with de novo genome assembly. By combining genomic information with extensive laboratory-based phenotyping, it becomes possible to map genetic variants underlying differences in behavioral, life-history, and other potentially adaptive traits. Emerging genomic technologies provide the required resources for establishing K. marmoratus as a new model organism for behavioral genetics and evolutionary genetics research.
The Caribbean basin is home to some of the most complex interactions in recent history among previously diverged human populations. Here, we investigate the population genetic history of this region by characterizing patterns of genome-wide variation among 330 individuals from three of the Greater Antilles (Cuba, Puerto Rico, Hispaniola), two mainland (Honduras, Colombia), and three Native South American (Yukpa, Bari, and Warao) populations. We combine these data with a unique database of genomic variation in over 3,000 individuals from diverse European, African, and Native American populations. We use local ancestry inference and tract length distributions to test different demographic scenarios for the pre- and post-colonial history of the region. We develop a novel ancestry-specific PCA (ASPCA) method to reconstruct the sub-continental origin of Native American, European, and African haplotypes from admixed genomes. We find that the most likely source of the indigenous ancestry in Caribbean islanders is a Native South American component shared among inland Amazonian tribes, Central America, and the Yucatan peninsula, suggesting extensive gene flow across the Caribbean in pre-Columbian times. We find evidence of two pulses of African migration. The first pulse—which today is reflected by shorter, older ancestry tracts—consists of a genetic component more similar to coastal West African regions involved in early stages of the trans-Atlantic slave trade. The second pulse—reflected by longer, younger tracts—is more similar to present-day West-Central African populations, supporting historical records of later transatlantic deportation. Surprisingly, we also identify a Latino-specific European component that has significantly diverged from its parental Iberian source populations, presumably as a result of small European founder population size. 
We demonstrate that the ancestral components in admixed genomes can be traced back to distinct sub-continental source populations with far greater resolution than previously thought, even when limited pre-Columbian Caribbean haplotypes have survived.
Latinos are often regarded as a single heterogeneous group, whose complex variation is not fully appreciated in several social, demographic, and biomedical contexts. By making use of genomic data, we characterize ancestral components of Caribbean populations on a sub-continental level and unveil fine-scale patterns of population structure distinguishing insular from mainland Caribbean populations as well as from other Hispanic/Latino groups. We provide genetic evidence for an inland South American origin of the Native American component in island populations and for extensive pre-Columbian gene flow across the Caribbean basin. The Caribbean-derived European component shows significant differentiation from parental Iberian populations, presumably as a result of founder effects during the colonization of the New World. Based on demographic models, we reconstruct the complex population history of the Caribbean since the onset of continental admixture. We find that insular populations are best modeled as mixtures absorbing two pulses of African migrants, coinciding with the early and maximum activity stages of the transatlantic slave trade. These two pulses appear to have originated in different regions within West Africa, imprinting two distinguishable signatures on present-day Afro-Caribbean genomes and shedding light on the genetic impact of the slave trade in the Caribbean.
To gain insights into evolutionary forces that have shaped the history of Bornean and Sumatran populations of orang-utans, we compare patterns of variation across more than 11 million single nucleotide polymorphisms found by previous mitochondrial and autosomal genome sequencing of 10 wild-caught orang-utans. Our analysis of the mitochondrial data yields a far more ancient split time between the two populations (∼3.4 million years ago) than estimates based on autosomal data (0.4 million years ago), suggesting a complex speciation process with moderate levels of primarily male migration. We find that the distribution of selection coefficients consistent with the observed frequency spectrum of autosomal non-synonymous polymorphisms in orang-utans is similar to the distribution in humans. Our analysis indicates that 35% of genes have evolved under detectable negative selection. Overall, our findings suggest that purifying natural selection, genetic drift, and a complex demographic history are the dominant drivers of genome evolution for the two orang-utan populations.
Identifying ancestry along each chromosome in admixed individuals provides a wealth of information for understanding the population genetic history of admixture events and is valuable for admixture mapping and identifying recent targets of selection. We present PCAdmix (available at https://sites.google.com/site/pcadmix/home), a Principal Components-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased single-nucleotide polymorphism (SNP) genotypes of admixed individuals. We compare our method to HAPMIX on simulated data from two ancestral populations, and we find high concordance between the methods. Our method also has better accuracy than LAMP when applied to three-population admixture, a situation as yet unaddressed by HAPMIX. Finally, we apply our method to a data set of four Latino populations with European, African, and Native American ancestry. We find evidence of assortative mating in each of the four populations, and we identify regions of shared ancestry that may be recent targets of selection and could serve as candidate regions for admixture-based association mapping.
Admixture; Principal Components Analysis (PCA); Local Ancestry Deconvolution; Haplotype-Based; Forward-Backward Algorithm
As a first step toward understanding how rare variants contribute to risk for complex diseases, we sequenced 15,585 human protein-coding genes to an average median depth of 111× in 2440 individuals of European (n = 1351) and African (n = 1088) ancestry. We identified over 500,000 single-nucleotide variants (SNVs), the majority of which were rare (86% with a minor allele frequency less than 0.5%), previously unknown (82%), and population-specific (82%). On average, 2.3% of the 13,595 SNVs each person carried were predicted to affect protein function of ∼313 genes per genome, and ∼95.7% of SNVs predicted to be functionally important were rare. This excess of rare functional variants is due to the combined effects of explosive, recent accelerated population growth and weak purifying selection. Furthermore, we show that large sample sizes will be required to associate rare variants with complex traits.
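The "rare" classification above rests on the minor allele frequency (MAF) of each variant. As a minimal sketch of that arithmetic, assuming diploid samples and a simple count of alternate alleles per variant, the fraction of variants below the 0.5% threshold could be computed as follows. The function names and the toy counts are hypothetical and are not taken from the study.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def fraction_rare(alt_counts, total_alleles, threshold=0.005):
    """Fraction of variants whose MAF is strictly below the threshold."""
    mafs = [minor_allele_frequency(c, total_alleles) for c in alt_counts]
    return sum(maf < threshold for maf in mafs) / len(mafs)


# Hypothetical example: 2440 diploid individuals give 4880 sampled alleles.
total_alleles = 2 * 2440
toy_alt_counts = [1, 2, 10, 30, 2440, 4870]  # made-up alternate-allele counts
print(fraction_rare(toy_alt_counts, total_alleles))
```

Note that a variant whose alternate allele is nearly fixed (4870 of 4880 in the toy data) is still rare by this definition, because the MAF tracks whichever allele is less common.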
Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders.
We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 as Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation.
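For readers unfamiliar with the contig and scaffold N50 statistics quoted above, the following is a minimal sketch of the standard definition: the length L such that sequences of length L or longer together contain at least half of the total assembled bases. The function name and the toy lengths are hypothetical, and this is not the pipeline used to produce the reported values.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in kbp), not taken from the cacao assembly.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: the two longest contigs cover 150 of 300 kbp
```

Because N50 depends only on the length distribution, the same function applies to contigs and to scaffolds; a higher value indicates a more contiguous assembly, but says nothing about its correctness.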
We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits.
Theobroma cacao L.; genome; Matina 1-6; haplotype phasing; genetic mapping; pod color; MYB113
Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.
The environment humans inhabit has changed many times in the last 100,000 years. Migration and dynamic local environments can lead to genetic adaptations favoring beneficial traits. Many genes responsible for these adaptations can alter disease susceptibility. Genes can also affect disease susceptibility by varying randomly across different populations. We have studied genetic variants that are known to modify disease susceptibility in the context of worldwide migration. We found that variants associated with 11 diseases have been affected to an extent that is not explained by random variation. We also found that the genetic risk of type 2 diabetes has steadily decreased along the worldwide human migration trajectory from Africa to America.