We report the sequences of 1,244 human Y chromosomes randomly ascertained from 26 worldwide populations by the 1000 Genomes Project. We discovered more than 65,000 variants, including SNVs, MNVs, indels, STRs, and CNVs. Of these, CNVs contribute the greatest predicted functional impact. We constructed a calibrated phylogenetic tree based on binary SNVs and projected the more complex variants onto it, estimating the numbers of mutations for each class. Our phylogeny reveals bursts of extreme expansions in male numbers that have occurred independently among each of the five continental superpopulations examined, at times of known migrations and technological innovations.
Rapidly Mutating Y-STRs (RM Y-STRs) were recently introduced in forensics in order to increase the differentiation of Y-chromosomal profiles even in the case of close relatives. We estimate RM Y-STR mutation rates and their power to discriminate between related individuals by using samples extracted from a wide set of paternal pedigrees and by comparing RM Y-STR results with those obtained from the Y-filer set. In addition, we tested the ability of RM Y-STRs to discriminate between unrelated individuals carrying the same Y-filer haplotype, using the haplogroup R-M269 (reportedly characterised by a strong resemblance in Y-STR profiles) as a case study. Our results, despite confirming the high mutability of RM Y-STRs, show significantly lower mutation rates than reference germline ones. Consequently, their power to discriminate between related individuals, despite being higher than that of Y-filer, does not seem to improve significantly on the performance of the latter. In contrast, when considering unrelated R-M269 individuals, RM Y-STRs reveal significant discriminatory power and retain some phylogenetic signal, allowing the correct classification of individuals for some R-M269-derived sub-lineages. These results have important implications not only for forensics, but also for molecular anthropology, suggesting that RM Y-STRs are useful tools for exploring subtle genetic variability within Y-chromosomal haplogroups.
Australia was one of the earliest regions outside Africa to be colonized by fully modern humans, with archaeological evidence for human presence by 47,000 years ago (47 kya) widely accepted [1, 2]. However, the extent of subsequent human entry before the European colonial age is less clear. The dingo reached Australia about 4 kya, indirectly implying human contact, which some have linked to changes in language and stone tool technology to suggest substantial cultural changes at the same time. Genetic data of two kinds have been proposed to support gene flow from the Indian subcontinent to Australia at this time, as well: first, signs of South Asian admixture in Aboriginal Australian genomes have been reported on the basis of genome-wide SNP data; and second, a Y chromosome lineage designated haplogroup C∗, present in both India and Australia, was estimated to have a most recent common ancestor around 5 kya and to have entered Australia from India. Here, we sequence 13 Aboriginal Australian Y chromosomes to re-investigate their divergence times from Y chromosomes in other continents, including a comparison of Aboriginal Australian and South Asian haplogroup C chromosomes. We find divergence times dating back to ∼50 kya, thus excluding the Y chromosome as providing evidence for recent gene flow from India into Australia.
• We have sequenced 13 Aboriginal Australian Y chromosomes
• These diverged from Y chromosomes in other continents around 50,000 years ago
• They diverged from Papua New Guinean Y chromosomes soon after this
• We find no evidence for Holocene male gene flow to Australia from South Asia
Bergström et al. show that Aboriginal Australian Y chromosomes diverged from Eurasian, including South Asian, Y chromosomes ∼50,000 years ago. This is around the time that Australia was first populated and thus disproves the previous hypothesis of prehistoric Y chromosome gene flow from India ∼5,000 years ago.
High-altitude adaptation in Tibetans is influenced by introgression of a 32.7-kb haplotype from the Denisovans, an extinct branch of archaic humans, lying within the endothelial PAS domain protein 1 (EPAS1), and has also been reported in Sherpa. We genotyped 19 variants in this genomic region in 1507 Eurasian individuals, including 1188 from Bhutan and Nepal residing at altitudes between 86 and 4550 m above sea level. Derived alleles for five SNPs characterizing the core Denisovan haplotype (AGGAA) were present at high frequency not only in Tibetans and Sherpa, but also among many populations from the Himalayas, showing a significant correlation with altitude (Spearman’s correlation coefficient = 0.75, p value = 3.9 × 10⁻¹¹). Seven East- and South-Asian 1000 Genomes Project individuals shared the Denisovan haplotype extending beyond the 32-kb region, enabling us to refine the haplotype structure and identify a candidate regulatory variant (rs370299814) that might be interacting in an additive manner with the derived G allele of rs150877473, the variant previously associated with high-altitude adaptation in Tibetans. Denisovan-derived alleles were also observed at frequencies of 3–14% in the 1000 Genomes Project African samples. The closest African haplotype is, however, separated from the Asian high-altitude haplotype by 22 mutations whereas only three mutations, including rs150877473, separate the Asians from the Denisovan, consistent with distant shared ancestry for African and Asian haplotypes and Denisovan adaptive introgression.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-016-1641-2) contains supplementary material, which is available to authorized users.
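The altitude correlation reported in the abstract above is a Spearman rank correlation. As a minimal sketch of how such a coefficient can be computed, the Python below ranks both variables (averaging tied ranks) and takes the Pearson correlation of the ranks. The altitudes and haplotype frequencies are hypothetical illustration values, not data from the study, and the function names are our own.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical population altitudes (m) and core-haplotype frequencies.
altitude_m = [86, 500, 1500, 2500, 3500, 4550]
haplotype_freq = [0.00, 0.02, 0.01, 0.20, 0.45, 0.60]
rho = spearman_rho(altitude_m, haplotype_freq)
```

Because the statistic depends only on ranks, it measures monotonic association and is insensitive to the strongly non-linear altitude scale.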
Vitamin D and folate are activated and degraded by sunlight, respectively, and the physiological processes they control are likely to have been targets of selection as humans expanded from Africa into Eurasia. We investigated signals of positive selection in gene sets involved in the metabolism, regulation and action of these two vitamins in worldwide populations sequenced by Phase I of the 1000 Genomes Project. Comparing allele frequency-spectrum-based summary statistics between these gene sets and matched control genes, we observed a selection signal specific to East Asians for a gene set associated with vitamin D action in bones. The selection signal was mainly driven by three genes: CXXC finger protein 1 (CXXC1), low density lipoprotein receptor-related protein 5 (LRP5) and runt-related transcription factor 2 (RUNX2). Examination of population differentiation and haplotypes allowed us to identify several candidate causal regulatory variants in each gene. Four of these candidate variants (one each in CXXC1 and RUNX2 and two in LRP5) had a >70% derived allele frequency in East Asians, but were present at lower (20–60%) frequency in Europeans as well, suggesting that the adaptation might have been part of a common response to climatic and dietary changes as humans expanded out of Africa, with implications for their role in vitamin D-dependent bone mineralization and the onset of osteoporosis. We also observed haplotype sharing between East Asians, Finns and an extinct archaic human (Denisovan) sample at the CXXC1 locus, which is best explained by incomplete lineage sorting.
Mountain gorillas are an endangered great ape subspecies and a prominent focus for conservation, yet we know little about their genomic diversity and evolutionary past. We sequenced whole genomes from multiple wild individuals and compared the genomes of all four Gorilla subspecies. We found that the two eastern subspecies have experienced a prolonged population decline over the past 100,000 years, resulting in very low genetic diversity and an increased overall burden of deleterious variation. A further recent decline in the mountain gorilla population has led to extensive inbreeding, such that individuals are typically homozygous at 34% of their sequence, leading to the purging of severely deleterious recessive mutations from the population. We discuss the causes of their decline and the consequences for their future survival.
To investigate the information about Y-structural variants (SVs) in the general population that could be obtained by low-coverage whole-genome sequencing.
We investigated SVs on the male-specific portion of the Y chromosome in the 70 individuals from Africa, Europe, or East Asia sequenced as part of the 1000 Genomes Pilot project, using data from this project and from additional studies on the same samples. We applied a combination of read-depth and read-pair methods to discover candidate Y-SVs, followed by validation using information from the literature, independent sequence and single nucleotide polymorphism-chip data sets, and polymerase chain reaction experiments.
We validated 19 Y-SVs, 2 of which were novel. Non-reference allele counts ranged from 1 to 64. The regions richest in variation were the heterochromatic segments near the centromere or the DYZ19 locus, followed by the ampliconic regions, but some Y-SVs were also present in the X-transposed and X-degenerate regions. In all, 5 of the 27 protein-coding gene families on the Y chromosome varied in copy number.
We confirmed that Y-SVs were readily detected from low-coverage sequence data and were abundant on the chromosome. We also reported both common and rare Y-SVs that are novel.
We have assessed copy number variation (CNV) in the male-specific part of the human Y chromosome discovered by array comparative genomic hybridization (array-CGH) in 411 apparently healthy UK males, and validated the findings using SNP genotype intensity data available for 149 of them. After manual curation taking account of the complex duplicated structure of Y-chromosomal sequences, we discovered 22 curated CNV events considered validated or likely, mean 0.93 (range 0–4) per individual. 16 of these were novel. Curated CNV events ranged in size from <1 kb to >3 Mb, and in frequency from 1/411 to 107/411. Of the 24 protein-coding genes or gene families tested, nine showed CNV. These included a large duplication encompassing the AMELY and TBL1Y genes that probably has no phenotypic effect, partial deletions of the TSPY cluster and AZFc region that may influence spermatogenesis, and other variants with unknown functional implications, including abundant variation in the number of RBMY genes and/or pseudogenes, and a novel complex duplication of two segments overlapping the AZFa region and including the 3′ end of the UTY gene.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-015-1562-5) contains supplementary material, which is available to authorized users.
In a worldwide collaborative effort, 19,630 Y-chromosomes were sampled from 129 different populations in 51 countries. These chromosomes were typed for 23 short-tandem repeat (STR) loci (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385ab, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4, DYS481, DYS533, DYS549, DYS570, DYS576, and DYS643) using the PowerPlex Y23 System (PPY23, Promega Corporation, Madison, WI). Locus-specific allelic spectra of these markers were determined and a consistently high level of allelic diversity was observed. A considerable number of null, duplicate and off-ladder alleles were revealed. Standard single-locus and haplotype-based parameters were calculated and compared between subsets of Y-STR markers established for forensic casework. The PPY23 marker set provides substantially stronger discriminatory power than other available kits but at the same time reveals the same general patterns of population structure as other marker sets. A strong correlation was observed between the number of Y-STRs included in a marker set and some of the forensic parameters under study. Interestingly, a weak but consistent trend toward smaller genetic distances resulting from larger numbers of markers became apparent.
Gene diversity; Discriminatory power; AMOVA; Population structure; Database
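Two of the haplotype-based parameters named in the keywords can be illustrated with a short sketch. It assumes the commonly used definitions: Nei's unbiased haplotype (gene) diversity, h = n/(n−1) · (1 − Σ pᵢ²), and discrimination capacity as the proportion of distinct haplotypes in the sample. The abstract does not spell out which estimators were used, so these definitions and the toy three-locus haplotypes below are assumptions for illustration only.

```python
from collections import Counter

def haplotype_diversity(haplotypes):
    """Nei's unbiased diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)

def discrimination_capacity(haplotypes):
    """Number of distinct haplotypes divided by the sample size."""
    return len(set(haplotypes)) / len(haplotypes)

# Hypothetical haplotypes: tuples of repeat counts at three Y-STR loci.
sample = [(14, 13, 29), (14, 13, 29), (15, 13, 30), (16, 14, 31)]
h = haplotype_diversity(sample)        # 4/3 * (1 - 0.375)
dc = discrimination_capacity(sample)   # 3 distinct haplotypes out of 4
```

Adding loci to each tuple can only split haplotype classes, never merge them, which is why both parameters tend to rise with the number of markers in a set.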
Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes.
We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively.
We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.
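As a rough illustration of how per-site population differentiation can be quantified, the sketch below computes the absolute allele-frequency difference between two populations and a simple heterozygosity-based F_ST, (H_T − H_S)/H_T, with equal population weights. This is an assumed, textbook-style estimator chosen for brevity; the abstracts above do not specify the statistic used, and the frequencies are hypothetical.

```python
def frequency_difference(p1, p2):
    """Absolute difference in allele frequency between two populations."""
    return abs(p1 - p2)

def fst_two_populations(p1, p2):
    """Heterozygosity-based F_ST for one biallelic site, equal weights.

    H_T is the expected heterozygosity at the pooled mean frequency;
    H_S is the mean of the within-population expected heterozygosities.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # site is monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

# Hypothetical sites as (frequency in population A, frequency in population B).
sites = {"site_low": (0.50, 0.50), "site_high": (0.10, 0.90), "site_fixed": (0.0, 1.0)}
fst_by_site = {name: fst_two_populations(a, b) for name, (a, b) in sites.items()}
```

Sites in the upper tail of such a distribution are candidates for localized positive selection, whereas, as the results above caution, the lower tail can simply reflect sampling effects.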
A report on the 'Genomic Disorders 2013: from 60 years of DNA to human genomes in the clinic' meeting, held at Homerton College, Cambridge, UK, April 10-12, 2013.
We have compared phylogenies and time estimates for Y-chromosomal lineages based on resequencing ∼9 Mb of DNA and applying the program GENETREE to similar analyses based on the more standard approach of genotyping 26 Y-SNPs plus 21 Y-STRs and applying the programs NETWORK and BATWING. We find that deep phylogenetic structure is not adequately reconstructed after Y-SNP plus Y-STR genotyping, and that times estimated using observed Y-STR mutation rates are several-fold too recent. In contrast, an evolutionary mutation rate gives times that are more similar to the resequencing data. In principle, systematic comparisons of this kind can in future studies be used to identify the combinations of Y-SNP and Y-STR markers, and time estimation methodologies, that correspond best to resequencing data.
Human Y chromosome; Male history; Time estimation; Networks; BATWING
All non-human great apes are endangered in the wild, and it is therefore important to gain an understanding of their demography and genetic diversity. Whole genome assembly projects have provided an invaluable foundation for understanding genetics in all four genera, but to date genetic studies of multiple individuals within great ape species have largely been confined to mitochondrial DNA and a small number of other loci. Here, we present a genome-wide survey of genetic variation in gorillas using a reduced representation sequencing approach, focusing on the two lowland subspecies. We identify 3,006,670 polymorphic sites in 14 individuals: 12 western lowland gorillas (Gorilla gorilla gorilla) and 2 eastern lowland gorillas (Gorilla beringei graueri). We find that the two species are genetically distinct, based on levels of heterozygosity and patterns of allele sharing. Focusing on the western lowland population, we observe evidence for population substructure, and a deficit of rare genetic variants suggesting a recent episode of population contraction. In western lowland gorillas, there is an elevation of variation towards telomeres and centromeres on the chromosomal scale. On a finer scale, we find substantial variation in genetic diversity, including a marked reduction close to the major histocompatibility locus, perhaps indicative of recent strong selection there. These findings suggest that despite their maintaining an overall level of genetic diversity equal to or greater than that of humans, population decline, perhaps associated with disease, has been a significant factor in recent and long-term pressures on wild gorilla populations.
Gorillas are humans’ closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago (Mya). In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
Genome sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2,951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in non-essential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes, and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
Numerous genome-wide scans conducted by genotyping previously ascertained single-nucleotide polymorphisms (SNPs) have provided candidate signatures for positive selection in various regions of the human genome, including in genes involved in pigmentation traits. However, it is unclear how well the signatures discovered by such haplotype-based test statistics can be reproduced in tests based on full resequencing data. Four genes (oculocutaneous albinism II (OCA2), tyrosinase-related protein 1 (TYRP1), dopachrome tautomerase (DCT), and KIT ligand (KITLG)) implicated in human skin-color variation, have shown evidence for positive selection in Europeans and East Asians in previous SNP-scan data. In the current study, we resequenced 4.7 to 6.7 kb of DNA from each of these genes in Africans, Europeans, East Asians, and South Asians.
Applying all commonly used neutrality-test statistics for allele frequency distribution to the newly generated sequence data provided conflicting results regarding evidence for positive selection. Previous haplotype-based findings could not be clearly confirmed. Although some tests were marginally significant for some populations and genes, none of them were significant after multiple-testing correction. Combined P values for each gene-population pair did not improve these results. Application of an Approximate Bayesian Computation Markov chain Monte Carlo-based approach to these sequence data, using a simple forward simulator, revealed broad posterior distributions of the selective parameters for all four genes, providing no support for positive selection. However, when we applied this approach to published sequence data on SLC45A2, another human pigmentation candidate gene, we could readily confirm evidence for positive selection, as previously detected with sequence-based and some haplotype-based tests.
Overall, our data indicate that even genes that are strong biological candidates for positive selection and show reproducible signatures of positive selection in SNP scans do not always show the same replicability of selection signals in other tests, which should be considered in future studies on detecting positive selection in genetic data.
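The two statistical steps mentioned in the results above, multiple-testing correction and combining P values per gene-population pair, can be sketched as follows. The abstract does not name the methods used, so Bonferroni correction and Fisher's method are assumptions here, and the P values are hypothetical. Fisher's statistic X = −2 Σ ln pᵢ follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for even degrees of freedom its tail probability has the closed form used below.

```python
import math

def bonferroni(p_values):
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def fisher_combined_p(p_values):
    """Fisher's method for k independent P values.

    With half = X/2 = -sum(ln p), the chi-square(2k) tail probability is
    exp(-half) * sum_{i<k} half**i / i!.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Hypothetical P values from several neutrality tests for one gene-population pair.
test_p = [0.04, 0.20, 0.35, 0.60]
adjusted = bonferroni(test_p)
combined = fisher_combined_p(test_p)
```

Fisher's method assumes the combined tests are independent. Neutrality statistics computed on the same sequence data are correlated, so a combined value obtained this way would be anti-conservative and should be read with caution.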
Autism is a common, severe and highly heritable neurodevelopmental disorder in children, affecting up to 100 children per 10,000. The MET gene has been regarded as a promising candidate gene for this disorder because it is located within a replicated linkage interval, is involved in pathways affecting the development of the cerebral cortex and cerebellum in ways relevant to autism patients, and has shown significant association signals in previous studies.
Here, we present new ASD patient and control samples from Heilongjiang, China and use them in a case-control and family-based replication study of two MET variants. One SNP, rs38845, was successfully replicated in a case-control association study, but failed to replicate in a family-based study, possibly due to small sample size. The other SNP, rs1858830, failed to replicate in both case-control and family-based studies.
This is the first attempt to replicate associations in Chinese autism samples, and our result provides evidence that MET variants may be relevant to autism susceptibility in the Chinese Han population.
We have investigated whether regions of the genome showing signs of positive selection in scans based on haplotype structure also show evidence of positive selection when sequence-based tests are applied, whether the target of selection can be localized more precisely, and whether such extra evidence can lead to increased biological insights. We used two tools: simulations under neutrality or selection, and experimental investigation of two regions identified by the HapMap2 project as putatively selected in human populations. Simulations suggested that neutral and selected regions should be readily distinguished and that it should be possible to localize the selected variant to within 40 kb at least half of the time. Re-sequencing of two ~300 kb regions (chr4:158Mb and chr10:22Mb) lacking known targets of selection in HapMap CHB individuals provided strong evidence for positive selection within each and suggested the micro-RNA gene hsa-miR-548c as the best candidate target in one region, and changes in regulation of the sperm protein gene SPAG6 in the other.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1111-9) contains supplementary material, which is available to authorized users.
We have surveyed 15 high-altitude adaptation candidate genes for signals of positive selection in North Caucasian highlanders using targeted re-sequencing. A total of 49 unrelated Daghestani from three ethnic groups (Avars, Kubachians, and Laks) living in ancient villages located at around 2,000 m above sea level were chosen as the study population. Caucasian (Adygei living at sea level, N = 20) and CEU (CEPH Utah residents with ancestry from northern and western Europe; N = 20) were used as controls. Candidate genes were compared with 20 putatively neutral control regions resequenced in the same individuals. The regions of interest were amplified by long-PCR, pooled according to individual, indexed by adding an eight-nucleotide tag, and sequenced using the Illumina GAII platform. 1,066 SNPs were called using false discovery and false negative thresholds of ~6%. The neutral regions provided an empirical null distribution to compare with the candidate genes for signals of selection. Two genes stood out. In Laks, a non-synonymous variant within HIF1A already known to be associated with improvement in oxygen metabolism was rediscovered, and in Kubachians a cluster of 13 SNPs located in a conserved intronic region within EGLN1 showing high population differentiation was found. These variants illustrate both the common pathways of adaptation to high altitude in different populations and features specific to the Daghestani populations, showing how even a mildly hypoxic environment can lead to genetic adaptation.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1084-8) contains supplementary material, which is available to authorized users.
Human Y-chromosome haplogroup structure is largely circumscribed by continental boundaries. One notable exception to this general pattern is the young haplogroup R1a that exhibits post-Glacial coalescent times and relates the paternal ancestry of more than 10% of men in a wide geographic area extending from South Asia to Central East Europe and South Siberia. Its origin and dispersal patterns are poorly understood as no marker has yet been described that would distinguish European R1a chromosomes from Asian. Here we present frequency and haplotype diversity estimates for more than 2000 R1a chromosomes assessed for several newly discovered SNP markers that introduce the onset of informative R1a subdivisions by geography. Marker M434 has a low frequency and a late origin in West Asia bearing witness to recent gene flow over the Arabian Sea. Conversely, marker M458 has a significant frequency in Europe, exceeding 30% in its core area in Eastern Europe and comprising up to 70% of all M17 chromosomes present there. The diversity and frequency profiles of M458 suggest its origin during the early Holocene and a subsequent expansion likely related to a number of prehistoric cultural developments in the region. Its primary frequency and diversity distribution correlates well with some of the major Central and East European river basins where settled farming was established before its spread further eastward. Importantly, the virtual absence of M458 chromosomes outside Europe speaks against substantial patrilineal gene flow from East Europe to Asia, including to India, at least since the mid-Holocene.
Y chromosome; haplogroup R1a; human evolution; population genetics
We have investigated human male demographic history using 590 males from 51 populations in the Human Genome Diversity Project - Centre d’Étude du Polymorphisme Humain worldwide panel, typed with 37 Y-chromosomal Single Nucleotide Polymorphisms and 65 Y-chromosomal Short Tandem Repeats and analyzed with the program Bayesian Analysis of Trees With Internal Node Generation. The general patterns we observe show a gradient from the oldest population times to the most recent common ancestor (TMRCAs) and expansion times, together with the largest effective population sizes, in Africa, to the youngest times and smallest effective population sizes in the Americas. These parameters are significantly negatively correlated with distance from East Africa, and the patterns are consistent with most other studies of human variation and history. In contrast, growth rate showed a weaker correlation in the opposite direction. Y-lineage diversity and TMRCA also decrease with distance from East Africa, supporting a model of expansion with serial founder events starting from this source. A number of individual populations diverge from these general patterns, including previously documented examples such as recent expansions of the Yoruba in Africa, Basques in Europe, and Yakut in Northern Asia. However, some unexpected demographic histories were also found, including low growth rates in the Hazara and Kalash from Pakistan and recent expansion of the Mozabites in North Africa.
Y-STR; Y-SNP; HGDP–CEPH; male demographic history; BATWING; serial founder model
Heart failure is a leading cause of mortality in South Asians. However, its genetic etiology remains largely unknown [1]. Cardiomyopathies due to sarcomeric mutations are a major monogenic cause for heart failure (MIM600958). Here, we describe a deletion of 25 bp in the gene encoding cardiac myosin binding protein C (MYBPC3) that is associated with heritable cardiomyopathies and an increased risk of heart failure in Indian populations (initial study OR = 5.3 (95% CI = 2.3–13), P = 2 × 10⁻⁶; replication study OR = 8.59 (3.19–25.05), P = 3 × 10⁻⁸; combined OR = 6.99 (3.68–13.57), P = 4 × 10⁻¹¹) and that disrupts cardiomyocyte structure in vitro. Its prevalence was found to be high (~4%) in populations of Indian subcontinental ancestry. The finding of a common risk factor implicated in South Asian subjects with cardiomyopathy will help in identifying and counseling individuals predisposed to cardiac diseases in this region.
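For readers unfamiliar with the odds ratios quoted above, the following sketch shows one standard way to obtain an odds ratio and an approximate 95% confidence interval from a 2×2 case-control table, using the log (Woolf) method. The counts are hypothetical and are not taken from the study, and the study itself may have used a different interval method.

```python
import math

def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.96):
    """Odds ratio with a Woolf (log-scale) confidence interval.

    All four cell counts must be positive. z = 1.96 gives an
    approximate 95% interval.
    """
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts: 20 of 200 cases and 5 of 200 controls carry the deletion.
or_est, ci_low, ci_high = odds_ratio_ci(20, 180, 5, 195)
```

With small carrier counts the interval is wide and asymmetric on the natural scale, which matches the shape of the intervals reported in the abstract.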