We have whole-exome sequenced 176 individuals from the isolated population of the island of Vis in Croatia in order to describe exonic variation architecture. We found 290 577 single nucleotide variants (SNVs), 65% of which are singletons, low-frequency or rare variants. A total of 25 430 (9%) SNVs are novel, previously not catalogued in the NHLBI GO Exome Sequencing Project, UK10K-Generation Scotland, 1000 Genomes Project, ExAC or NCBI Reference Assembly dbSNP. The majority of these variants (76%) are singletons. Comparable to data obtained from UK10K-Generation Scotland that were sequenced and analysed using the same protocols, we detected an enrichment of potentially damaging variants (non-synonymous and loss-of-function) in the low-frequency and common variant categories. On average 115 (range 93–140) genotypes with loss-of-function variants, 23 (15–34) of which were homozygous, were identified per person. The landscape of loss-of-function variants across an exome revealed that variants mainly accumulated in genes on the xenobiotic-related pathways, the majority of which coded for enzymes. The frequency of loss-of-function variants was additionally increased in Vis runs of homozygosity regions, where variants mainly affected signalling pathways. This work confirms the isolate status of the Vis population by means of whole-exome sequencing and reveals the pattern of loss-of-function mutations, which resembles the trails of adaptive evolution that were found in other species. By cataloguing the exomic variants and describing the allelic structure of the Vis population, this study will serve as a valuable resource for future genetic studies of human diseases, population genetics and evolution in this population.
single nucleotide variants; exome sequencing; loss of function; isolates; xenobiotics
Examining complete gene knockouts within a viable organism can inform on gene function. We sequenced the exomes of 3,222 British Pakistani-heritage adults with high parental relatedness, discovering 1,111 rare-variant homozygous genotypes with predicted loss of gene function (knockouts) in 781 genes. We observed 13.7% fewer than expected homozygous knockout genotypes, implying an average load of 1.6 recessive-lethal-equivalent LOF variants per adult. Linking genetic data to lifelong health records, knockouts were not associated with clinical consultation or prescription rate. In this dataset we identified a healthy PRDM9 knockout mother, and performed phased genome sequencing on her, her child and controls, which showed meiotic recombination sites localised away from PRDM9-dependent hotspots. Thus, natural LOF variants inform upon essential genetic loci, and demonstrate PRDM9 redundancy in humans.
Epitranscriptome modifications are required for structure and function of RNA and defects in these pathways have been associated with human disease. Here we identify the RNA target for the previously uncharacterized 5-methylcytosine (m5C) methyltransferase NSun3 and link m5C RNA modifications with energy metabolism. Using whole-exome sequencing, we identified loss-of-function mutations in NSUN3 in a patient presenting with combined mitochondrial respiratory chain complex deficiency. Patient-derived fibroblasts exhibit severe defects in mitochondrial translation that can be rescued by exogenous expression of NSun3. We show that NSun3 is required for deposition of m5C at the anticodon loop in the mitochondrially encoded transfer RNA methionine (mt-tRNAMet). Further, we demonstrate that m5C deficiency in mt-tRNAMet results in the lack of 5-formylcytosine (f5C) at the same tRNA position. Our findings demonstrate that NSUN3 is necessary for efficient mitochondrial translation and reveal that f5C in human mitochondrial RNA is generated by oxidative processing of m5C.
The post-transcriptional 5-methylcytosine (m5C) modification occurs in a wide range of nuclear-encoded RNAs. Here the authors identify the mitochondrial tRNA-Met as a target for the m5C methyltransferase NSun3—found mutated in a mitochondrial disease patient—and link mitochondrial tRNA modifications with energy metabolism.
The genomic causes and effects of divergent ecological selection during speciation are still poorly understood. Here, we report the discovery and detailed characterization of early-stage adaptive divergence of two cichlid fish ecomorphs in a small (700 m diameter) isolated crater lake in Tanzania. The ecomorphs differ in depth preference, male breeding color, body shape, diet and trophic morphology. With whole genome sequences of 146 fish, we identify 98 clearly demarcated genomic ‘islands’ of high differentiation and demonstrate association of genotypes across these islands to divergent mate preferences. The islands contain candidate adaptive genes enriched for functions in sensory perception (including rhodopsin and other twilight vision associated genes), hormone signaling and morphogenesis. Our study suggests mechanisms and genomic regions that may play a role in the closely related mega-radiation of Lake Malawi.
Genomic screening for chromosomal abnormalities is an important part of quality control when establishing and maintaining stem cell lines. We present a new method for sensitive detection of copy number alterations, aneuploidy, and contamination in cell lines using genome-wide SNP genotyping data. In contrast to other methods designed for identifying copy number variations in a single sample or in a sample composed of a mixture of normal and tumor cells, this new method is tailored for determining differences between cell lines and the starting material from which they were derived, which allows us to distinguish between normal and novel copy number variation. We implemented the method in the freely available BCFtools package and present results based on induced pluripotent stem cell lines obtained in the HipSci project.
The extent to which low-frequency (minor allele frequency [MAF] between 1–5%) and rare (MAF ≤ 1%) variants contribute to complex traits and disease in the general population is largely unknown. Bone mineral density (BMD) is highly heritable, is a major predictor of osteoporotic fractures and has been previously associated with common genetic variants [1–8] and rare, population-specific coding variants [9]. Here we identify novel non-coding genetic variants with large effects on BMD (n_total = 53,236) and fracture (n_total = 508,253) in individuals of European ancestry from the general population. Associations for BMD were derived from whole-genome sequencing (n = 2,882 from UK10K), whole-exome sequencing (n = 3,549), deep imputation of genotyped samples using a combined UK10K/1000 Genomes reference panel (n = 26,534), and de-novo replication genotyping (n = 20,271). We identified a low-frequency non-coding variant near a novel locus, EN1, with an effect size 4-fold larger than the mean of previously reported common variants for lumbar spine BMD [8] (rs11692564[T], MAF = 1.7%, replication effect size = +0.20 standard deviations [SD], P_meta = 2×10^−14), which was also associated with a decreased risk of fracture (OR = 0.85; P = 2×10^−11; n_cases = 98,742 and n_controls = 409,511). Using an En1^Cre/flox mouse model, we observed that conditional loss of En1 results in low bone mass, likely as a consequence of high bone turn-over. We also identified a novel low-frequency non-coding variant with large effects on BMD near WNT16 (rs148771817[T], MAF = 1.1%, replication effect size = +0.39 SD, P_meta = 1×10^−11). In general, there was an excess of association signals arising from deleterious coding and conserved non-coding variants.
These findings provide evidence that low-frequency non-coding variants have large effects on BMD and fracture, thereby providing rationale for whole-genome sequencing and improved imputation reference panels to study the genetic architecture of complex traits and disease in the general population.
PMID: 26367794 CAMSID: cams5439
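The frequency classes defined in the abstract above can be written as a small helper. This is only an illustrative sketch: the function name `maf_category` is hypothetical, and treating everything above 5% as "common" (and a MAF of exactly 5% as low-frequency) is an assumption, since the abstract only defines the rare and low-frequency bands.

```python
def maf_category(maf: float) -> str:
    """Classify a minor allele frequency given as a fraction (0.017 means 1.7%)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie between 0 and 0.5")
    if maf <= 0.01:   # rare: MAF <= 1%, as defined in the abstract
        return "rare"
    if maf <= 0.05:   # low-frequency: MAF between 1% and 5%
        return "low-frequency"
    return "common"   # assumption: anything above 5% is called common


# The two variants reported in the abstract, with their stated MAFs.
variants = {"rs11692564[T]": 0.017, "rs148771817[T]": 0.011}
categories = {rsid: maf_category(maf) for rsid, maf in variants.items()}

for rsid, category in categories.items():
    print(rsid, category)
```

Both reported variants fall in the low-frequency band under these cut-offs, which matches how the abstract describes them.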
• iPSCs show inter/intra-line/donor-variability hampering characterisation.
• HipSci generates, banks and provides iPSCs from hundreds of individual donors.
• iPSCs respond to different human plasma fibronectin concentrations on 96-well assays.
• Phenotypic features: cell number, proliferation, morphology and intercellular adhesion.
• The methodologies described can be tailored for disease-modelling and other cell types.
Induced pluripotent stem cells (iPSCs) provide invaluable opportunities for future cell therapies as well as for studying human development, modelling diseases and discovering therapeutics. In order to realise the potential of iPSCs, it is crucial to comprehensively characterise cells generated from large cohorts of healthy and diseased individuals. The human iPSC initiative (HipSci) is assessing a large panel of cell lines to define cell phenotypes, dissect inter- and intra-line and donor variability and identify its key determinant components. Here we report the establishment of a high-content platform for phenotypic analysis of human iPSC lines. In the described assay, cells are dissociated and seeded as single cells onto 96-well plates coated with fibronectin at three different concentrations. This method allows assessment of cell number, proliferation, morphology and intercellular adhesion. Altogether, our strategy delivers robust quantification of phenotypic diversity within complex cell populations facilitating future identification of the genetic, biological and technical determinants of variance. Approaches such as the one described can be used to benchmark iPSCs from multiple donors and create novel platforms that can readily be tailored for disease modelling and drug discovery.
Cell based assays; High content; Phenotype screening; iPSCs; Induced pluripotent stem cells; Human pluripotent stem cells
How and when the Americas were populated remains contentious. Using ancient and modern genome-wide data, we find that the ancestors of all present-day Native Americans, including Athabascans and Amerindians, entered the Americas as a single migration wave from Siberia no earlier than 23 thousand years ago (KYA), and after no more than an 8,000-year isolation period in Beringia. Following their arrival in the Americas, ancestral Native Americans diversified into two basal genetic branches around 13 KYA, one that is now dispersed across North and South America and the other restricted to North America. Subsequent gene flow resulted in some Native Americans sharing ancestry with present-day East Asians (including Siberians) and, more distantly, Australo-Melanesians. Putative ‘Paleoamerican’ relict populations, including the historical Mexican Pericúes and South American Fuego-Patagonians, are not directly related to modern Australo-Melanesians as suggested by the Paleoamerican Model.
Summary: Runs of homozygosity (RoHs) are genomic stretches of a diploid genome that show identical alleles on both chromosomes. Longer RoHs are unlikely to have arisen by chance but are likely to denote autozygosity, whereby both copies of the genome descend from the same recent ancestor. Early tools to detect RoH used genotype array data, but substantially more information is available from sequencing data. Here, we present and evaluate BCFtools/RoH, an extension to the BCFtools software package, that detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model. By applying it to simulated data and real data from the 1000 Genomes Project we estimate its accuracy and show that it has higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity.
Availability and implementation: BCFtools/RoH and its associated binary/source files are freely available from https://github.com/samtools/BCFtools.
Supplementary data are available at Bioinformatics online.
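The idea of detecting autozygosity with a hidden Markov model can be illustrated with a toy two-state Viterbi decoder. This is a sketch under simplifying assumptions and not the BCFtools/RoH implementation: the function name `viterbi_roh`, the fixed per-site heterozygosity rate, the error rate and the switch probability are all hypothetical choices, and real tools use allele frequencies, genotype likelihoods and recombination distances instead.

```python
import math


def viterbi_roh(is_het, het_rate=0.3, error_rate=0.01, switch=0.05):
    """Most likely state per site: 0 = non-autozygous, 1 = autozygous.

    is_het holds one value per site: 1 for a heterozygous call, 0 for homozygous.
    All default parameter values are illustrative assumptions.
    """
    if not is_het:
        return []
    # emit[state][observation] = P(observation | state)
    emit = [
        [1.0 - het_rate, het_rate],      # non-autozygous: hets are common
        [1.0 - error_rate, error_rate],  # autozygous: hets only through errors
    ]
    log_stay = math.log(1.0 - switch)
    log_switch = math.log(switch)

    # Start with equal prior weight on both states.
    score = [math.log(0.5) + math.log(emit[s][is_het[0]]) for s in (0, 1)]
    back = []
    for obs in is_het[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            move = score[1 - s] + log_switch
            pointers.append(s if stay >= move else 1 - s)
            new_score.append(max(stay, move) + math.log(emit[s][obs]))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy data: a stretch of 30 homozygous sites flanked by mostly heterozygous sites.
genotypes = [1, 0, 1, 1, 0, 1] + [0] * 30 + [1, 1, 0, 1, 0, 1]
states = viterbi_roh(genotypes)
print("".join(str(s) for s in states))
```

With these toy parameters, an isolated homozygous site is not enough to pay the cost of switching state, whereas a long homozygous stretch is, which is the intuition behind the statement that longer RoHs are unlikely to have arisen by chance.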
British population history has been shaped by a series of immigrations, including the early Anglo-Saxon migrations after 400 CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences from 10 individuals excavated close to Cambridge in the East of England, ranging from the late Iron Age to the middle Anglo-Saxon period. By analysing shared rare variants with hundreds of modern samples from Britain and Europe, we estimate that on average the contemporary East English population derives 38% of its ancestry from Anglo-Saxon migrations. We gain further insight with a new method, rarecoal, which infers population history and identifies fine-scale genetic ancestry from rare variants. Using rarecoal we find that the Anglo-Saxon samples are closely related to modern Dutch and Danish populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.
This study examines ancient genomes of individuals from the late Iron Age to the middle Anglo-Saxon period in the East of England. Using a newly devised analytic algorithm, the authors also estimate how much of the East English genome derives from Anglo-Saxon migrations and how it relates to the rest of Europe.
Natural variation within species reveals aspects of genome evolution and function. The fission yeast Schizosaccharomyces pombe is an important model for eukaryotic biology, but researchers typically use one standard laboratory strain. To extend the utility of this model, we surveyed the genomic and phenotypic variation in 161 natural isolates. We sequenced the genomes of all strains, revealing moderate genetic diversity (π = 3 × 10^−3) and weak global population structure. We estimate that dispersal of S. pombe began within human antiquity (~340 BCE), and ancestors of these strains reached the Americas at ~1623 CE. We quantified 74 traits, revealing substantial heritable phenotypic diversity. We conducted 223 genome-wide association studies, with 89 traits showing at least one association. The most significant variant for each trait explained 22% of variance on average, with indels having higher effects than SNPs. This analysis presents a rich resource to examine genotype-phenotype relationships in a tractable model.
Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.
Imputation uses genotype information from SNP arrays to infer the genotypes of missing markers. Here, the authors show that an imputation reference panel derived from whole-genome sequencing of 3,781 samples from the UK10K project improves the imputation accuracy and coverage of low frequency variants compared to existing methods.
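The haplotype counts quoted above follow from simple diploid arithmetic, sketched below. The variable names are illustrative, and the number of 1000 Genomes individuals is inferred here on the assumption that every sample contributes exactly two haplotypes; it is not stated in the abstract.

```python
# Each diploid genome contributes two haplotypes to a reference panel.
uk10k_genomes = 3781
uk10k_haplotypes = 2 * uk10k_genomes       # matches the 7,562 quoted above

kg_haplotypes = 2184                       # 1000 Genomes Project, as quoted
kg_individuals = kg_haplotypes // 2        # inferred, assuming diploid samples

combined_haplotypes = uk10k_haplotypes + kg_haplotypes

print(uk10k_haplotypes, kg_individuals, combined_haplotypes)
```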
Homozygous loss of function (HLOF) variants provide a valuable window on gene function in humans, as well as an inventory of the human genes that are not essential for survival and reproduction. All humans carry at least a few HLOF variants, but the exact number of inactivated genes that can be tolerated is currently unknown—as are the phenotypic effects of losing function for most human genes. Here, we make use of 1432 whole exome sequences from five European populations to expand the catalogue of known human HLOF mutations; after stringent filtering of variants in our dataset, we identify a total of 173 HLOF mutations, 76 (44%) of which have not been observed previously. We find that population isolates are particularly well suited to surveys of novel HLOF genes because individuals in such populations carry extensive runs of homozygosity, which we show are enriched for novel, rare HLOF variants. Further, we make use of extensive phenotypic data to show that most HLOFs, ascertained in population-based samples, appear to have little detectable effect on the phenotype. On the contrary, we document several genes directly implicated in disease that seem to tolerate HLOF variants. Overall HLOF genes are enriched for olfactory receptor function and are expressed in testes more often than expected, consistent with reduced purifying selection and incipient pseudogenisation.
Normal thyroid function is essential for health, but its genetic architecture remains poorly understood. Here, for the heritable thyroid traits thyrotropin (TSH) and free thyroxine (FT4), we analyse whole-genome sequence data from the UK10K project (N=2,287). Using additional whole-genome sequence and deeply imputed data sets, we report meta-analysis results for common variants (MAF≥1%) associated with TSH and FT4 (N=16,335). For TSH, we identify a novel variant in SYN2 (MAF=23.5%, P=6.15 × 10^−9) and a new independent variant in PDE8B (MAF=10.4%, P=5.94 × 10^−14). For FT4, we report a low-frequency variant near B4GALT6/SLC25A52 (MAF=3.2%, P=1.27 × 10^−9) tagging a rare TTR variant (MAF=0.4%, P=2.14 × 10^−11). All common variants explain ≥20% of the variance in TSH and FT4. Analysis of rare variants (MAF<1%) using sequence kernel association testing reveals a novel association with FT4 in NRG1. Our results demonstrate that increased coverage in whole-genome sequence association studies identifies novel variants associated with thyroid function.
Levels of circulating thyrotropin and free thyroxine reflect thyroid function; however, their genetic underpinnings remain poorly understood. Taylor et al. take advantage of whole-genome sequence data from cohorts within the UK10K project to identify novel variants associated with these traits.
The analysis of rich catalogues of genetic variation from population-based sequencing provides an opportunity to screen for functional effects. Here we report a rare variant in APOC3 (rs138326449-A, minor allele frequency ~0.25% (UK)) associated with plasma triglyceride (TG) levels (−1.43 standard deviations per minor allele, standard error (s.e.) = 0.27, p-value = 8.0×10^−8), discovered in 3,202 individuals with low read-depth whole-genome sequence. We replicate this in 12,831 participants from five additional samples of Northern and Southern European origin (−1.0 standard deviations, s.e. = 0.173, p-value = 7.32×10^−9). This is consistent with an effect between 0.5 and 1.5 mmol/L dependent on population. We show that a single predicted splice donor variant is responsible for association signals and is independent of known common variants. Analyses suggest an independent relationship between rs138326449 and high-density lipoprotein (HDL) levels. This represents one of the first examples of a rare, large effect variant identified from whole-genome sequencing at a population scale.
Whole genome sequence; triglycerides; APOC3
Short-chain enoyl-CoA hydratase (ECHS1) is a multifunctional mitochondrial matrix enzyme that is involved in the oxidation of fatty acids and essential amino acids such as valine. Here, we describe the broad phenotypic spectrum and pathobiochemistry of individuals with autosomal-recessive ECHS1 deficiency.
Using exome sequencing, we identified ten unrelated individuals carrying compound heterozygous or homozygous mutations in ECHS1. Functional investigations in patient-derived fibroblast cell lines included immunoblotting, enzyme activity measurement, and a palmitate loading assay.
Patients showed a heterogeneous phenotype with disease onset in the first year of life and course ranging from neonatal death to survival into adulthood. The most prominent clinical features were encephalopathy (10/10), deafness (9/9), epilepsy (6/9), optic atrophy (6/10), and cardiomyopathy (4/10). Serum lactate was elevated and brain magnetic resonance imaging showed white matter changes or a Leigh-like pattern resembling disorders of mitochondrial energy metabolism. Analysis of patients’ fibroblast cell lines (6/10) provided further evidence for the pathogenicity of the respective mutations by showing reduced ECHS1 protein levels and reduced 2-enoyl-CoA hydratase activity. While serum acylcarnitine profiles were largely normal, in vitro palmitate loading of patient fibroblasts revealed increased butyrylcarnitine, unmasking the functional defect in mitochondrial β-oxidation of short-chain fatty acids. Urinary excretion of 2-methyl-2,3-dihydroxybutyrate – a potential derivative of acryloyl-CoA in the valine catabolic pathway – was significantly increased, indicating impaired valine oxidation.
In conclusion, we define the phenotypic spectrum of a new syndrome caused by ECHS1 deficiency. We speculate that both the β-oxidation defect and the block in l-valine metabolism, with accumulation of toxic methacrylyl-CoA and acryloyl-CoA, contribute to the disorder that may be amenable to metabolic treatment approaches.
Statistical factor analysis methods have previously been used to remove noise components from high-dimensional data prior to genetic association mapping and, in a guided fashion, to summarize biologically relevant sources of variation. Here, we show how the derived factors summarizing pathway expression can be used to analyze the relationships between expression, heritability, and aging. We used skin gene expression data from 647 twins from the MuTHER Consortium and applied factor analysis to concisely summarize patterns of gene expression to remove broad confounding influences and to produce concise pathway-level phenotypes. We derived 930 “pathway phenotypes” that summarized patterns of variation across 186 KEGG pathways (five phenotypes per pathway). We identified 69 significant associations of age with phenotype from 57 distinct KEGG pathways at a stringent Bonferroni threshold (P < 5.38×10^−5). These phenotypes are more heritable (h^2 = 0.32) than gene expression levels. On average, expression levels of 16% of genes within these pathways are associated with age. Several significant pathways relate to metabolizing sugars and fatty acids; others relate to insulin signaling. We have demonstrated that factor analysis methods combined with biological knowledge can produce more reliable phenotypes with less stochastic noise than the individual gene expression levels, which increases our power to discover biologically relevant associations. These phenotypes could also be applied to discover associations with other environmental factors.
aging; factor analysis; gene expression; heritability; linear mixed models
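The Bonferroni threshold quoted in the abstract above can be reproduced from the number of pathway phenotypes. The sketch assumes a family-wise significance level of 0.05, which the abstract does not state explicitly; the variable names are illustrative.

```python
n_pathways = 186
phenotypes_per_pathway = 5
alpha = 0.05  # assumed family-wise error rate; not given in the abstract

# One test per pathway phenotype: 186 pathways x 5 phenotypes each.
n_tests = n_pathways * phenotypes_per_pathway

# Bonferroni correction divides the significance level by the number of tests.
threshold = alpha / n_tests

print(n_tests, f"{threshold:.2e}")
```

Under that assumption the result agrees with the 930 pathway phenotypes and the threshold of about 5.38×10^−5 reported in the abstract.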
The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.
The human genome reference assembly is crucial for aligning and analyzing sequence data, and for genome annotation, among other roles. However, the models and analysis assumptions that underlie the current assembly need revising to fully represent human sequence diversity. Improved analysis tools and updated data reporting formats are also required.
Associating genetic variation with quantitative measures of gene regulation offers a way to bridge the gap between genotype and complex phenotypes. In order to identify quantitative trait loci (QTLs) that influence the binding of a transcription factor in humans, we measured binding of the multifunctional transcription and chromatin factor CTCF in 51 HapMap cell lines. We identified thousands of QTLs in which genotype differences were associated with differences in CTCF binding strength, hundreds of them confirmed by directly observable allele-specific binding bias. The majority of QTLs were either within 1 kb of the CTCF binding motif, or in linkage disequilibrium with a variant within 1 kb of the motif. On the X chromosome we observed three classes of binding sites: a minority class bound only to the active copy of the X chromosome, the majority class bound to both the active and inactive X, and a small set of female-specific CTCF sites associated with two non-coding RNA genes. In sum, our data reveal extensive genetic effects on CTCF binding, both direct and indirect, and identify a diversity of patterns of CTCF binding on the X chromosome.
We have systematically measured the effect of normal genetic variation present in a human population on the binding of a specific chromatin protein (CTCF) to DNA by measuring its binding in 51 human cell lines. We observed a large number of changes in protein binding that we can confidently attribute to genetic effects. The corresponding genetic changes are often clustered around the binding motif for CTCF, but only a minority are actually within the motif. Unexpectedly, we also find that at most binding sites on the X chromosome, CTCF binding occurs equally on both the X chromosomes in females at the same level as on the single X chromosome in males. This finding suggests that in general, CTCF binding is not subject to global dosage compensation, the process which equalizes gene expression levels from the two female X chromosomes and the single male X.