Isolated populations are emerging as a powerful study design in the search for low-frequency and rare variant associations with complex phenotypes. Here we genotype 2,296 samples from two isolated Greek populations, the Pomak villages (HELIC-Pomak) in the North of Greece and the Mylopotamos villages (HELIC-MANOLIS) in Crete. We compare their genomic characteristics to the general Greek population and establish them as genetic isolates. In the MANOLIS cohort, we observe an enrichment of missense variants among the variants that have drifted up in frequency by more than fivefold. In the Pomak cohort, we find novel associations at variants on chr11p15.4 showing large allele frequency increases (from 0.2% in the general Greek population to 4.6% in the isolate) with haematological traits, for example, with mean corpuscular volume (rs7116019, P=2.3 × 10⁻²⁶). We replicate this association in a second set of Pomak samples (combined P=2.0 × 10⁻³⁶). We demonstrate significant power gains in detecting medical trait associations.
Isolated populations can increase power to detect low frequency and rare risk variants associated with complex phenotypes. Here, the authors identify variants associated with haematological traits in two isolated Greek populations that would be difficult to detect in the general population, due to their low frequency.
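As a rough illustration of why such a frequency increase matters for power: under an additive model with a fixed per-allele effect, the information each sample contributes to an association test scales with the genotype variance 2p(1−p), where p is the allele frequency. The sketch below is not taken from the study. It assumes Hardy-Weinberg proportions and identical effect sizes and sample sizes in both populations, and simply compares the two frequencies quoted in the abstract.

```python
def genotype_variance(p):
    """Additive genotype variance 2p(1-p), assuming Hardy-Weinberg proportions."""
    return 2.0 * p * (1.0 - p)

general_freq = 0.002  # 0.2% in the general Greek population (from the abstract)
isolate_freq = 0.046  # 4.6% in the Pomak isolate (from the abstract)

# Fold increase in allele frequency between the two populations.
fold_increase = isolate_freq / general_freq

# Ratio of per-sample information; equivalently, how many times more samples
# the general population would need for the same non-centrality parameter
# (assumption: same effect size and same trait variance in both populations).
relative_information = genotype_variance(isolate_freq) / genotype_variance(general_freq)

print(f"allele frequency fold increase: {fold_increase:.1f}")
print(f"relative information per sample: {relative_information:.1f}")
```

Under these simplifying assumptions, a sample from the isolate is about 22 times as informative at this variant as a sample from the general population.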
The somatic mutations in a cancer genome are the aggregate outcome of one or more mutational processes operative through the life of the cancer patient [1-3]. Each mutational process leaves a characteristic mutational signature determined by the mechanisms of DNA damage and repair that constitute it. A role was recently proposed for the APOBEC family of cytidine deaminases in generating particular genome-wide mutational signatures [1,4] and a signature of localized hypermutation called kataegis [1,4]. A germline copy number polymorphism involving APOBEC3A and APOBEC3B, which effectively deletes APOBEC3B [5], has been associated with a modest increased risk of breast cancer [6-8]. Here, we show that breast cancers in carriers of the deletion show more mutations of the putative APOBEC-dependent genome-wide signatures than cancers in non-carriers. The results suggest that the APOBEC3A/3B germline deletion allele confers cancer susceptibility through increased activity of APOBEC-dependent mutational processes, although the mechanism by which this occurs remains unknown.
The search for a method that utilizes biological information to predict humans’ place of origin has occupied scientists for millennia. Over the past four decades, scientists have employed genetic data in an effort to achieve this goal but with limited success. While biogeographical algorithms using next-generation sequencing data have achieved an accuracy of 700 km in Europe, they were inaccurate elsewhere. Here we describe the Geographic Population Structure (GPS) algorithm and demonstrate its accuracy with three datasets using 40,000-130,000 SNPs. GPS placed 83% of worldwide individuals in their country of origin. Applied to over 200 Sardinian villagers, GPS placed a quarter of them in their villages and most of the rest within 50 km of their villages. GPS’s accuracy and power to infer the biogeography of worldwide individuals down to their country or, in some cases, village of origin underscore the promise of admixture-based methods for biogeography and have ramifications for genetic ancestry testing.
In a worldwide collaborative effort, 19,630 Y-chromosomes were sampled from 129 different populations in 51 countries. These chromosomes were typed for 23 short-tandem repeat (STR) loci (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385ab, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4, DYS481, DYS533, DYS549, DYS570, DYS576, and DYS643) using the PowerPlex Y23 System (PPY23, Promega Corporation, Madison, WI). Locus-specific allelic spectra of these markers were determined and a consistently high level of allelic diversity was observed. A considerable number of null, duplicate and off-ladder alleles were revealed. Standard single-locus and haplotype-based parameters were calculated and compared between subsets of Y-STR markers established for forensic casework. The PPY23 marker set provides substantially stronger discriminatory power than other available kits but at the same time reveals the same general patterns of population structure as other marker sets. A strong correlation was observed between the number of Y-STRs included in a marker set and some of the forensic parameters under study. Interestingly, a weak but consistent trend toward smaller genetic distances resulting from larger numbers of markers became apparent.
Gene diversity; Discriminatory power; AMOVA; Population structure; Database
Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes.
We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively.
We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.
A report on the 'Genomic Disorders 2013: from 60 years of DNA to human genomes in the clinic' meeting, held at Homerton College, Cambridge, UK, April 10-12, 2013.
The greater Himalayan region demarcates two of the most prominent linguistic phyla in Asia: Tibeto-Burman and Indo-European. Previous genetic surveys, mainly using Y-chromosome and/or mitochondrial DNA polymorphisms, suggested substantially reduced gene flow between populations belonging to these two phyla. These studies, however, have mainly focussed on populations residing far to the north and/or south of this mountain range, and have not been able to study gene flow patterns within the greater Himalayan region itself. We now report a detailed, linguistically informed, genetic survey of Tibeto-Burman and Indo-European speakers from the Himalayan countries Nepal and Bhutan based on autosomal microsatellite markers and compare these populations with surrounding regions. The genetic differentiation between populations within the Himalayas seems to be much higher than between populations in the neighbouring countries. We also observe a remarkable genetic differentiation between the Tibeto-Burman speaking populations on the one hand and Indo-European speaking populations on the other, suggesting that language and geography have played an equally large role in defining the genetic composition of present-day populations within the Himalayas.
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
We have compared phylogenies and time estimates for Y-chromosomal lineages based on resequencing ∼9 Mb of DNA and applying the program GENETREE to similar analyses based on the more standard approach of genotyping 26 Y-SNPs plus 21 Y-STRs and applying the programs NETWORK and BATWING. We find that deep phylogenetic structure is not adequately reconstructed after Y-SNP plus Y-STR genotyping, and that times estimated using observed Y-STR mutation rates are several-fold too recent. In contrast, an evolutionary mutation rate gives times that are more similar to the resequencing data. In principle, systematic comparisons of this kind can in future studies be used to identify the combinations of Y-SNP and Y-STR markers, and time estimation methodologies, that correspond best to resequencing data.
Human Y chromosome; Male history; Time estimation; Networks; BATWING
Patterns of genetic variation in a population carry information about the prehistory of the population, and for the human Y chromosome an especially informative phylogenetic tree has previously been constructed from fully-sequenced chromosomes. This revealed contrasting bifurcating and starlike phylogenies for the major lineages associated with the Neolithic expansions in sub-Saharan Africa and Western Europe, respectively.
We used coalescent simulations to investigate the range of demographic models most likely to produce the phylogenetic structures observed in Africa and Europe, assessing the starting and ending genetic effective population sizes, duration of the expansion, and time when expansion ended. The best-fitting models in Africa and Europe are very different. In Africa, the expansion took about 12 thousand years, ending very recently; it started from approximately 40 men and numbers expanded approximately 50-fold. In Europe, the expansion was much more rapid, taking only a few generations and occurring as soon as the major R1b lineage entered Europe; it started from just one to three men, whose numbers expanded more than a thousandfold.
Although highly simplified, the demographic model we have used captures key elements of the differences between the male Neolithic expansions in Africa and Europe, and is consistent with archaeological findings.
Human Y chromosome; Neolithic transition; Population expansion; Demographic modeling; Coalescent simulations; Haplogroup; R1b; E1b1a
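To give a feel for how different the two best-fitting expansions are, the fold changes and durations can be converted into an average per-generation growth rate, assuming constant exponential growth. The sketch below is illustrative only and is not the authors' coalescent model. The 30-year generation time and the figure of five generations for "a few generations" in Europe are assumptions introduced here, not values from the abstract.

```python
import math

def per_generation_growth(fold_change, generations):
    """Average growth rate per generation under constant exponential growth."""
    return math.exp(math.log(fold_change) / generations) - 1.0

GENERATION_YEARS = 30.0  # assumption: generation time, not stated in the abstract

# Africa: about 50-fold expansion over about 12 thousand years (from the abstract).
africa_generations = 12_000 / GENERATION_YEARS
africa_rate = per_generation_growth(50, africa_generations)

# Europe: more than a thousandfold in "a few generations" (from the abstract).
# Assumption: five generations, chosen purely for illustration.
europe_generations = 5
europe_rate = per_generation_growth(1000, europe_generations)

print(f"Africa: {africa_rate:.2%} growth per generation over {africa_generations:.0f} generations")
print(f"Europe: {europe_rate:.0%} growth per generation over {europe_generations} generations")
```

With these assumed inputs the African expansion corresponds to roughly 1% growth per generation, whereas the European expansion implies the male population multiplying several-fold each generation.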
All non-human great apes are endangered in the wild, and it is therefore important to gain an understanding of their demography and genetic diversity. Whole genome assembly projects have provided an invaluable foundation for understanding genetics in all four genera, but to date genetic studies of multiple individuals within great ape species have largely been confined to mitochondrial DNA and a small number of other loci. Here, we present a genome-wide survey of genetic variation in gorillas using a reduced representation sequencing approach, focusing on the two lowland subspecies. We identify 3,006,670 polymorphic sites in 14 individuals: 12 western lowland gorillas (Gorilla gorilla gorilla) and 2 eastern lowland gorillas (Gorilla beringei graueri). We find that the two species are genetically distinct, based on levels of heterozygosity and patterns of allele sharing. Focusing on the western lowland population, we observe evidence for population substructure, and a deficit of rare genetic variants suggesting a recent episode of population contraction. In western lowland gorillas, there is an elevation of variation towards telomeres and centromeres on the chromosomal scale. On a finer scale, we find substantial variation in genetic diversity, including a marked reduction close to the major histocompatibility locus, perhaps indicative of recent strong selection there. These findings suggest that despite their maintaining an overall level of genetic diversity equal to or greater than that of humans, population decline, perhaps associated with disease, has been a significant factor in recent and long-term pressures on wild gorilla populations.
The Genographic Project is an international effort aimed at charting human migratory history. The project is nonprofit and nonmedical, and, through its Legacy Fund, supports locally led efforts to preserve indigenous and traditional cultures. Although the first phase of the project was focused on uniparentally inherited markers on the Y-chromosome and mitochondrial DNA (mtDNA), the current phase focuses on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide single-nucleotide polymorphism (SNP) genotyping, they were designed for medical genetic studies and contain medically related markers that are inappropriate for global population genetic studies. GenoChip, the Genographic Project’s new genotyping array, was designed to resolve these issues and enable higher resolution research into outstanding questions in genetic anthropology. The GenoChip includes ancestry informative markers obtained for over 450 human populations, an ancient human (Saqqaq), and two archaic hominins (Neanderthal and Denisovan) and was designed to identify all known Y-chromosome and mtDNA haplogroups. The chip was carefully vetted to avoid inclusion of medically relevant markers. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays. Although all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The chip performances are illustrated in a principal component analysis for 14 worldwide populations. In summary, the GenoChip is a dedicated genotyping platform for genetic anthropology. 
With an unprecedented number of approximately 12,000 Y-chromosomal and approximately 3,300 mtDNA SNPs and over 130,000 autosomal and X-chromosomal SNPs without any known health, medical, or phenotypic relevance, the GenoChip is a useful tool for genetic anthropology and population genetics.
genetic anthropology; GenoChip; Genographic Project; population genetics; AimsFinder; haplogroups
The evolutionary history of variation in the human Rh blood group system, determined by variants in the RHD and RHCE genes, has long been an unresolved puzzle in human genetics. Prior to medical treatments and interventions developed in the last century, the D-positive children of D-negative women were at risk for hemolytic disease of the newborn, if the mother produced anti-D antibodies following sensitization to the blood of a previous D-positive child. Given the deleterious fitness consequences of this disease, the appreciable frequencies in European populations of the responsible RHD gene deletion variant (for example, 0.43 in our study) seem surprising. In this study, we used new molecular and genomic data generated from four HapMap population samples to test the idea that positive selection for an as-yet-unknown fitness benefit of the RHD deletion may have offset the otherwise negative fitness effects of hemolytic disease of the newborn. We found no evidence that positive natural selection affected the frequency of the RHD deletion. Thus, the initial rise to intermediate frequency of the RHD deletion in European populations may simply be explained by genetic drift/founder effect, or by an older or more complex sweep that we are insufficiently powered to detect. However, our simulations recapitulate previous findings that selection on the RHD deletion is frequency dependent, and weak or absent near 0.5. Therefore, once such a frequency was achieved, it could have been maintained by a relatively small amount of genetic drift. We unexpectedly observed evidence for positive selection on the C allele of RHCE in non-African populations (on chromosomes with intact copies of the RHD gene) in the form of an unusually high FST value and the high frequency of a single haplotype carrying the C allele.
RhCE function is not well understood, but the C/c antigenic variant is clinically relevant and can result in hemolytic disease of the newborn, albeit much less commonly and severely than that related to the D-negative blood type. Therefore, the potential fitness benefits of the RHCE C allele are currently unknown but merit further exploration.
Blood group polymorphism; copy number variation; human evolution; balancing selection
Genome-wide genotypes and sequences are enriching our understanding of the past 50,000 years of human history and providing insights into earlier periods largely inaccessible to mitochondrial DNA and Y-chromosomal studies.
To see a world in a grain of sand ...
William Blake, Auguries of Innocence
Geneticists have long sought to identify the genetic changes that made us human, but pinpointing the functionally relevant changes has been challenging. Two papers in this issue suggest that partial duplication of SRGAP2, producing an incomplete protein that antagonizes the original, contributed to human brain evolution.
Gorillas are humans’ closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago (Mya). In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
Genome sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2,951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in non-essential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes, and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
The male-to-female sex ratio at birth is constant across world populations with an average of 1.06 (106 male to 100 female live births) for populations of European descent. The sex ratio is considered to be affected by numerous biological and environmental factors and to have a heritable component. The aim of this study was to investigate the presence of common allele modest effects at autosomal and chromosome X variants that could explain the observed sex ratio at birth. We conducted a large-scale genome-wide association scan (GWAS) meta-analysis across 51 studies, comprising overall 114 863 individuals (61 094 women and 53 769 men) of European ancestry and 2 623 828 common (minor allele frequency >0.05) single-nucleotide polymorphisms (SNPs). Allele frequencies were compared between men and women for directly-typed and imputed variants within each study. Forward-time simulations for unlinked, neutral, autosomal, common loci were performed under the demographic model for European populations with a fixed sex ratio and a random mating scheme to assess the probability of detecting significant allele frequency differences. We do not detect any genome-wide significant (P < 5 × 10⁻⁸) common SNP differences between men and women in this well-powered meta-analysis. The simulated data provided results entirely consistent with these findings. This large-scale investigation across ∼115 000 individuals shows no detectable contribution from common genetic variants to the observed skew in the sex ratio. The absence of sex-specific differences is useful in guiding genetic association study design, for example when using mixed controls for sex-biased traits.
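The core comparison described above, allele frequencies in men versus women at a single autosomal SNP, can be sketched as a Pearson chi-square test on a 2x2 table of allele counts. This is a minimal illustration and not the study's pipeline, which compared frequencies within each of 51 studies and then meta-analysed them. The allele counts below are hypothetical; only the sample sizes and the genome-wide threshold come from the abstract. The sketch applies to autosomal variants only, since men carry a single X chromosome.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # significance threshold quoted in the abstract

def allele_count_chi2(alt_a, total_a, alt_b, total_b):
    """Pearson chi-square (1 df) comparing alternate-allele counts in two groups.

    total_a and total_b are numbers of alleles (2 x individuals for autosomes).
    Returns (chi2, p_value).
    """
    ref_a, ref_b = total_a - alt_a, total_b - alt_b
    n = total_a + total_b
    alt, ref = alt_a + alt_b, ref_a + ref_b
    chi2 = 0.0
    for observed, row_total, col_total in (
        (alt_a, total_a, alt), (ref_a, total_a, ref),
        (alt_b, total_b, alt), (ref_b, total_b, ref),
    ):
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Sample sizes from the abstract, doubled to give autosomal allele counts.
alleles_women = 2 * 61_094
alleles_men = 2 * 53_769

# Hypothetical alternate-allele counts (frequencies of about 0.300 and 0.302).
chi2, p_value = allele_count_chi2(36_656, alleles_women, 32_500, alleles_men)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, "
      f"genome-wide significant: {p_value < GENOME_WIDE_THRESHOLD}")
```

A frequency difference of this size would fall far short of the genome-wide threshold even in a sample of roughly 115 000 individuals.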
The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in south and southeast Asia, remains disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first of these places the origin of Austroasiatic speakers in southeast Asia with a later dispersal to south Asia during the Neolithic, whereas the second hypothesis advocates pre-Neolithic origins and dispersal of this language family from south Asia. To test the two alternative models, this study combines the analysis of uniparentally inherited markers with 610,000 common single nucleotide polymorphism loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17–28 thousand years ago) in southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and “structure-like” analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterized by two ancestral components—one represented in the pattern of Y chromosomal and EDAR results and the other by mitochondrial DNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from southeast Asia, followed by extensive sex-specific admixture with local Indian populations.
Austroasiatic; mtDNA; Y chromosome; autosomes; admixture
TSPY1 is a tandemly-repeated gene on the human Y chromosome forming an array of approximately 21–35 copies. The testicular expression pattern and the inferred function of the TSPY1 protein suggest possible involvement in spermatogenesis. However, data are scarce on TSPY1 copy number variation in different Y lineages and its role in spermatogenesis.
We sought to define: 1) the extent of TSPY1 copy number variation within and among Y chromosome haplogroups; and 2) the role of TSPY1 dosage in spermatogenic efficiency.
Materials and Methods
A total of 154 idiopathic infertile men and 130 normozoospermic controls from Central Italy were analyzed. We used a quantitative PCR assay to measure TSPY1 copy number and also defined Y haplogroups in all subjects.
We provide evidence that TSPY1 copy number shows substantial variation among Y haplogroups and thus that population stratification represents a potential bias in case-control association studies. We also found: 1) a significant positive correlation between TSPY1 copy number and sperm count (P < 0.001); 2) a significant difference in mean TSPY1 copy number between patients and controls (28.4 ± 8.3 vs. 33.9 ± 10.7; P < 0.001); and 3) a 1.5-fold increased risk of abnormal sperm parameters in men with fewer than 33 copies (P < 0.001).
TSPY1 copy number variation significantly influences spermatogenic efficiency. Low TSPY1 copy number is a new risk factor for male infertility with potential clinical consequences.
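The reported difference in mean TSPY1 copy number can be checked approximately from the summary statistics alone. The sketch below computes a Welch t statistic under two assumptions that the abstract does not confirm: that the ± values are standard deviations, and that all 154 patients and 130 controls contributed to the means. The abstract does not state which test the authors used, so this is a plausibility check rather than a reproduction.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, standard deviations and sample sizes."""
    var_a = sd_a ** 2 / n_a
    var_b = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Controls vs. patients, using the figures quoted in the abstract.
# Assumption: 10.7 and 8.3 are standard deviations, not standard errors.
t_stat, dof = welch_t(33.9, 10.7, 130, 28.4, 8.3, 154)
print(f"t = {t_stat:.2f} on about {dof:.0f} degrees of freedom")
```

Under these assumptions the statistic is about 4.8 on roughly 240 degrees of freedom, which is consistent with the reported P < 0.001.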
We analysed 67 short tandem repeat polymorphisms from the non-recombining part of the Y-chromosome (Y-STRs), including 49 rarely-studied simple single-copy (ss)Y-STRs and 18 widely-used Y-STRs, in 590 males from 51 populations belonging to 8 worldwide regions (HGDP-CEPH panel). Although autosomal DNA profiling provided no evidence for close relationship, we found 18 Y-STR haplotypes (defined by 67 Y-STRs) that were shared by two to five men in 13 worldwide populations, revealing high and widespread levels of cryptic male relatedness. Maximal (95.9%) haplotype resolution was achieved with the best 25 out of 67 Y-STRs in the global dataset, and with the best 3-16 markers in regional datasets (89.6-100% resolution). From the 49 rarely-studied ssY-STRs, the 25 most informative markers were sufficient to reach the highest possible male lineage differentiation in the global (92.2% resolution), and 3-15 markers in the regional datasets (85.4-100%). Considerably lower haplotype resolutions were obtained with the three commonly-used Y-STR sets (Minimal Haplotype, PowerPlex Y®, and AmpFlSTR® Yfiler®). Six ssY-STRs (DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643) were most informative to supplement the existing Y-STR kits for increasing haplotype resolution, or, together with additional ssY-STRs, as a new set for maximizing male lineage differentiation. Mutation rates of the 49 ssY-STRs were estimated from 403 meiotic transfers in deep-rooted pedigrees, and ranged from ~4.8×10⁻⁴ for 31 ssY-STRs with no mutations observed to 1.3×10⁻² and 1.5×10⁻² for DYS570 and DYS576, respectively, the latter representing the highest mutation rates reported for human Y-STRs so far. Our findings thus demonstrate that ssY-STRs are useful for maximizing global and regional resolution of male lineages, either as a new set, or when added to commonly-used Y-STR sets, and support their application to forensic, genealogical and anthropological studies.
Y-STR; microsatellites; Y-chromosome; haplotype resolution; lineage differentiation; HGDP-CEPH; mutation rates
A recently-published study has used next-gen sequencing technology to resequence two Y chromosomes separated by 13 generations and discovered four single-base differences in ~10 Mb DNA, suggesting that the Y chromosome euchromatin accumulates around one mutation per generation. Y-SNPs therefore now offer the best resolution of Y haplotypes and promise to distinguish almost every Y chromosome. This work illustrates the promise of current sequencing technology for forensically-relevant applications.
Next-gen sequencing; Y-SNP; Y-STR; Haplotype resolution; forensic applications
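The arithmetic behind the "around one mutation per generation" statement can be sketched from the figures in the abstract: four single-base differences in about 10 Mb across 13 generations. Scaling the per-base rate to the whole Y euchromatin requires a size for that region, which the abstract does not give. The value of 23 Mb used below is an assumption introduced here for illustration.

```python
differences = 4                   # single-base differences observed (from the abstract)
sequenced_bp = 10_000_000         # ~10 Mb resequenced (from the abstract)
generations = 13                  # generations separating the two chromosomes (from the abstract)

# Assumption: approximate size of the Y-chromosomal euchromatin, not given in the abstract.
ASSUMED_EUCHROMATIN_BP = 23_000_000

# Point estimate of the mutation rate per base pair per generation.
rate_per_bp_per_generation = differences / (sequenced_bp * generations)

# Expected number of new mutations per generation across the assumed euchromatin.
mutations_per_generation = rate_per_bp_per_generation * ASSUMED_EUCHROMATIN_BP

print(f"rate: {rate_per_bp_per_generation:.2e} per bp per generation")
print(f"expected euchromatic mutations per generation: {mutations_per_generation:.2f}")
```

With only four observed differences the estimate is very uncertain, but it is of the order of 3 × 10⁻⁸ per base pair per generation, or a little under one new mutation per generation for a region of the assumed size.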