DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally employed long (400–800 bp) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intra-species genetic variation. We report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterise four million SNPs and four hundred thousand structural variants, many of which are previously unknown. Our approach is effective for accurate, rapid and economical whole genome re-sequencing and many other biomedical applications.
StellaBase, the Nematostella vectensis Genomics Database, is a web-based resource that will facilitate desktop and bench-top studies of the starlet sea anemone. Nematostella is an emerging model organism that has already proven useful for addressing fundamental questions in developmental evolution and evolutionary genomics. StellaBase allows users to query the assembled Nematostella genome, a confirmed gene library, and a predicted genome using both keyword and homology based search functions. Data provided by these searches will elucidate gene family evolution in early animals. Unique research tools, including a Nematostella genetic stock library, a primer library, a literature repository and a gene expression library will provide support to the burgeoning Nematostella research community. The development of StellaBase accompanies significant upgrades to CnidBase, the Cnidarian Evolutionary Genomics Database. With the completion of the first sequenced cnidarian genome, genome comparison tools have been added to CnidBase. In addition, StellaBase provides a framework for the integration of additional species-specific databases into CnidBase. StellaBase is available at .
This report describes the NIH Undiagnosed Diseases Program (UDP), details the Program's application of genomic technology to establish diagnoses, and reports the Program's success rate over its first two years.
Each accepted study participant was extensively phenotyped. A subset of participants and selected family members (29 patients and 78 unaffected family members) was subjected to an integrated set of genomic analyses including high-density SNP arrays and whole exome or genome analysis.
Of 1191 medical records reviewed, 326 patients were accepted and 160 were admitted directly to the NIH Clinical Center on the UDP service. Of those, 47% were children, 55% were females, and 53% had neurological disorders. Diagnoses were reached on 39 participants (24%) on clinical, biochemical, pathological, or molecular grounds; 21 diagnoses involved rare or ultra-rare diseases. Three disorders were diagnosed based upon SNP array analysis and three others using WES and filtering of variants. Two new disorders were discovered. Analysis of the SNP-array study cohort revealed that large stretches of homozygosity were more common in affected participants relative to controls.
The NIH UDP addresses an unmet need, i.e., the diagnosis of patients with complex, multisystem disorders. It may serve as a model for the clinical application of emerging genomic technologies, and is providing insights into the characteristics of diseases that remain undiagnosed after extensive clinical workup.
rare disease; undiagnosed disease; SNP arrays; whole exome sequencing; neurological disorders
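The homozygosity finding above rests on locating long runs of consecutive homozygous SNP-array genotypes. The following is a minimal sketch of that idea, not the Program's actual pipeline: the function name, the `AA`/`AB`/`BB` genotype encoding, the thresholds, and the toy positions are all assumptions made for illustration, and any non-homozygous call (including a no-call) simply ends a run.

```python
def runs_of_homozygosity(positions, genotypes, min_length_bp, min_snps):
    """Return (start_bp, end_bp) for each run of consecutive homozygous calls.

    positions: sorted SNP coordinates on one chromosome.
    genotypes: calls in the same order, encoded "AA", "AB" or "BB" (assumed encoding).
    A run is reported only if it spans >= min_length_bp and holds >= min_snps SNPs.
    """
    runs = []
    start = None
    last = len(genotypes) - 1
    for i, gt in enumerate(genotypes):
        hom = gt in ("AA", "BB")
        if hom and start is None:
            start = i  # open a new run
        if start is not None and (not hom or i == last):
            end = i if hom else i - 1  # last homozygous index in the run
            long_enough = positions[end] - positions[start] >= min_length_bp
            dense_enough = end - start + 1 >= min_snps
            if long_enough and dense_enough:
                runs.append((positions[start], positions[end]))
            start = None
    return runs


# Toy data (hypothetical coordinates and calls, far smaller than a real array).
toy_positions = [100, 200, 300, 400, 500, 600, 700, 800]
toy_genotypes = ["AB", "AA", "BB", "AA", "AA", "AB", "BB", "AA"]
toy_runs = runs_of_homozygosity(toy_positions, toy_genotypes,
                                min_length_bp=250, min_snps=3)
print(toy_runs)
```

On real data the length threshold would be on the order of megabases, and comparing the total run length per individual between affected participants and controls would be a separate statistical step.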
We present a high-quality genome sequence of a Neandertal woman from Siberia. We show that her parents were related at the level of half siblings and that mating among close relatives was common among her recent ancestors. We also sequenced the genome of a Neandertal from the Caucasus to low coverage. An analysis of the relationships and population history of available archaic genomes and 25 present-day human genomes shows that several gene flow events occurred among Neandertals, Denisovans and early modern humans, possibly including gene flow into Denisovans from an unknown archaic group. Thus, interbreeding, albeit of low magnitude, occurred among many hominin groups in the Late Pleistocene. In addition, the high quality Neandertal genome allows us to establish a definitive list of substitutions that became fixed in modern humans after their separation from the ancestors of Neandertals and Denisovans.
Motivation: Extensive DNA sequencing of tumor and matched normal samples using exome and whole-genome sequencing technologies has enabled the discovery of recurrent genetic alterations in cancer cells, but variability in stromal contamination and subclonal heterogeneity still present a severe challenge to available detection algorithms.
Results: Here, we describe publicly available software, Shimmer, which accurately detects somatic single-nucleotide variants using statistical hypothesis testing with multiple testing correction. This program produces somatic single-nucleotide variant predictions with significantly higher sensitivity and accuracy than other available software when run on highly contaminated or heterogeneous samples, and it gives comparable sensitivity and accuracy when run on samples of high purity.
Supplementary data are available at Bioinformatics online.
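The abstract above describes somatic variant calling as statistical hypothesis testing followed by multiple testing correction, without naming the specific procedures. The sketch below illustrates one way such a pipeline can be arranged, assuming a one-sided Fisher's exact test on tumor versus normal read counts and Benjamini-Hochberg adjustment. Both choices, the function names, the 0.05 cutoff, and the read counts are assumptions for illustration and should not be read as the program's implementation.

```python
from math import comb


def fisher_one_sided(tumor_alt, tumor_ref, normal_alt, normal_ref):
    """Upper-tail hypergeometric p-value: the chance of seeing at least
    tumor_alt alternate reads in the tumor if alternate reads were spread
    between tumor and normal at random."""
    total_alt = tumor_alt + normal_alt
    tumor_n = tumor_alt + tumor_ref
    total = tumor_n + normal_alt + normal_ref
    upper = min(total_alt, tumor_n)
    numer = sum(
        comb(total_alt, k) * comb(total - total_alt, tumor_n - k)
        for k in range(tumor_alt, upper + 1)
    )
    return numer / comb(total, tumor_n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical read counts: (tumor_alt, tumor_ref, normal_alt, normal_ref).
sites = {
    "site_A": (12, 48, 0, 60),   # low-fraction variant seen only in the tumor
    "site_B": (2, 58, 1, 59),    # a few stray reads in both samples
    "site_C": (30, 30, 29, 31),  # heterozygous in both samples (germline-like)
}
names = list(sites)
pvals = [fisher_one_sided(*sites[n]) for n in names]
qvals = benjamini_hochberg(pvals)
somatic_calls = [n for n, q in zip(names, qvals) if q < 0.05]
print(somatic_calls)
```

Testing each site against the matched normal, rather than applying a fixed allele-fraction cutoff, is what lets a low-fraction variant such as the hypothetical `site_A` be called in a contaminated or heterogeneous sample.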
An understanding of ctenophore biology is critical for reconstructing events that occurred early in animal evolution. Towards this goal, we have sequenced, assembled, and annotated the genome of the ctenophore Mnemiopsis leidyi. Our phylogenomic analyses of both amino acid positions and gene content suggest that ctenophores rather than sponges are the sister lineage to all other animals. Mnemiopsis lacks many of the genes found in bilaterian mesodermal cell types, suggesting that these cell types evolved independently. The set of neural genes in Mnemiopsis is similar to that of sponges, indicating that sponges may have lost a nervous system. These results present a new view of early animal evolution that accounts for major losses and/or gains of sophisticated cell types, including nerve and muscle cells.
Early-onset myopathy, areflexia, respiratory distress and dysphagia (EMARDD) is a myopathic disorder associated with mutations in MEGF10. By novel analysis of SNP array hybridization and exome sequence coverage, we diagnosed a 10-year-old girl with EMARDD following identification of a novel homozygous deletion of exon 7 in MEGF10. In contrast to previously reported EMARDD patients, her weakness was more prominent proximally than distally, and involved her legs more than her arms. MRI of her pelvis and thighs showed muscle atrophy and fatty replacement. Ultrasound of several muscle groups revealed dense homogeneous increases in echogenicity. Cloning and sequencing of the deletion breakpoint identified features suggesting the mutation arose by fork stalling and template switching. These findings constitute the first genomic deletion causing EMARDD, expand the clinical phenotype, and provide new insight into the pattern and histology of its muscular pathology.
EMARDD; MEGF10; SNP array; exome sequencing; deletion analysis; myopathy
Antibodies of the VRC01 class neutralize HIV-1, arise in diverse HIV-1-infected donors, and are potential templates for an effective HIV-1 vaccine. However, the stochastic processes that generate repertoires in each individual of >10^12 antibodies make elicitation of specific antibodies uncertain. Here we determine the ontogeny of the VRC01 class by crystallography and next-generation sequencing. Despite antibody-sequence differences exceeding 50%, antibody-gp120 cocrystal structures reveal VRC01-class recognition to be remarkably similar. B cell transcripts indicate that VRC01-class antibodies require few specific genetic elements, suggesting that naive-B cells with VRC01-class features are generated regularly by recombination. Virtually all of these fail to mature, however, with only a few (likely one) ancestor B cell expanding to form a VRC01-class lineage in each donor. Developmental similarities in multiple donors thus reveal the generation of VRC01-class antibodies to be reproducible in principle, thereby providing a framework for attempts to elicit similar antibodies in the general population.
The whey acidic protein (WAP) four-disulfide core domain (WFDC) locus located on human chromosome 20q13 spans 19 genes with WAP and/or Kunitz domains. These genes participate in antimicrobial, immune, and tissue homoeostasis activities. Neighboring SEMG genes encode seminal proteins Semenogelin 1 and 2 (SEMG1 and SEMG2). WFDC and SEMG genes have a strikingly high rate of amino acid replacement (dN/dS), indicative of responses to adaptive pressures during vertebrate evolution. To better understand the selection pressures acting on WFDC genes in human populations, we resequenced 18 genes and 54 noncoding segments in 71 European (CEU), African (YRI), and Asian (CHB + JPT) individuals. Overall, we identified 484 single-nucleotide polymorphisms (SNPs), including 65 coding variants (of which 49 are nonsynonymous differences). Using classic neutrality tests, we confirmed the signature of short-term balancing selection on WFDC8 in Europeans and a signature of positive selection spanning genes PI3, SEMG1, SEMG2, and SLPI. Associated with the latter signal, we identified an unusually homogeneous-derived 100-kb haplotype with a frequency of 88% in Asian populations. A putative candidate variant targeted by selection is Thr56Ser in SEMG1, which may alter the proteolytic profile of SEMG1 and antimicrobial activities of semen. All the well-characterized genes residing in the WFDC locus encode proteins that appear to have a role in immunity and/or fertility, two processes that are often associated with adaptive evolution. This study provides further evidence that the WFDC and SEMG loci have been under strong adaptive pressure within the short timescale of modern humans.
WFDC; semenogelins; natural selection; innate immunity; serine protease inhibitors; reproduction
The Undiagnosed Diseases Program at the National Institutes of Health uses High Throughput Sequencing (HTS) to diagnose rare and novel diseases. HTS techniques generate large numbers of DNA sequence variants, which must be analyzed and filtered to find candidates for disease causation. Despite the publication of an increasing number of successful exome-based projects, there has been little formal discussion of the analytic steps applied to HTS variant lists. We present the results of our experience with over 30 families for whom HTS was used in an attempt to find clinical diagnoses. For each family, exome sequence was augmented with high-density SNP-array data. We present a discussion of the theory and practical application of each analytic step and provide example data to illustrate our approach. The paper is designed to provide an analytic roadmap for variant analysis, thereby enabling a wide range of researchers and clinical genetics practitioners to perform direct analysis of HTS data for their patients and projects.
genomics; next generation sequencing; exome; molecular diagnosis
Massively-parallel cDNA sequencing (RNA-Seq) is a new technique that holds great promise for cardiovascular genomics. Here, we used RNA-Seq to study the transcriptomes of matched coronary artery disease cases and controls in the ClinSeq® study, using cell lines as tissue surrogates.
Lymphoblastoid cell lines (LCLs) from 16 cases and controls representing phenotypic extremes for coronary calcification were cultured and analyzed using RNA-Seq. All cell lines were then independently re-cultured and, along with another set of 16 independent cases and controls, were profiled with Affymetrix microarrays to perform a technical validation of the RNA-Seq results. Statistically significant changes (p < 0.05) were detected in 186 transcripts, many of which are expressed at extremely low levels (5–10 copies/cell), which we confirmed through a separate spike-in control RNA-Seq experiment. Next, by fitting a linear model to exon-level RNA-Seq read counts, we detected signals of alternative splicing in 18 transcripts. Finally, we used the RNA-Seq data to identify differential expression (p < 0.0001) in eight previously unannotated regions that may represent novel transcripts. Overall, differentially expressed genes showed strong enrichment (p = 0.0002) for prior association with cardiovascular disease. At the network level, we found evidence for perturbation in pathways involving both cardiovascular system development and function as well as lipid metabolism.
We present a pilot study for transcriptome involvement in coronary artery calcification and demonstrate how RNA-Seq analyses using LCLs as a tissue surrogate may yield fruitful results in a clinical sequencing project. In addition to canonical gene expression, we present candidate variants from alternative splicing and novel transcript detection, which have been unexplored in the context of this disease.
Coronary artery calcification; RNA-Seq; Lymphoblastoid cell lines; Transcriptome profiling
Although a considerable proportion of serum lipids loci identified in European ancestry individuals (EA) replicate in African Americans (AA), interethnic differences in the distribution of serum lipids suggest that some genetic determinants differ by ethnicity. We conducted a comprehensive evaluation of five lipid candidate genes to identify variants with ethnicity-specific effects. We sequenced ABCA1, LCAT, LPL, PON1, and SERPINE1 in 48 AA individuals with extreme serum lipid concentrations (high HDLC/low TG or low HDLC/high TG). Identified variants were genotyped in the full population-based sample of AA (n = 1694) and tested for an association with serum lipids. rs328 (LPL) and correlated variants were associated with higher HDLC and lower TG. Interestingly, a stronger effect was observed on a “European” vs. “African” genetic background at this locus. To investigate this effect, we evaluated the region among West Africans (WA). For TG, the effect size among WA was the same as in AA with only African local ancestry (2–3% lower TG), while the larger association among AA with local European ancestry matched previous reports in EA (10%). For HDLC, there was no association with rs328 in AA with only African local ancestry or in WA, while the association among AA with European local ancestry was much greater than what has been observed for EA (15 vs. ∼5 mg/dl), suggesting an interaction with an environmental or genetic factor that differs by ethnicity. Beyond this ancestry effect, the importance of African ancestry-focused, sequence-based work was also highlighted by serum lipid associations of variants that were in higher frequency (or present only) among those of African ancestry. By beginning our study with the sequence variation present in AA individuals, investigating local ancestry effects, and seeking replication in WA, we were able to comprehensively evaluate the role of a set of candidate genes in serum lipids in AA.
Most of the work on the genetic epidemiology of serum lipids in African Americans (AA) has focused on replicating findings that were identified in European ancestry individuals. While this can be very informative about the generalizability of lipids loci across populations, African ancestry-specific variation will be missed using this approach. Our aim was to comprehensively evaluate five lipid candidate genes in an AA population, from the identification of variants of interest to population-level analysis of high-density lipoprotein cholesterol (HDLC) and triglycerides (TG). We sequenced five genes in individuals with extreme lipids (n = 48) drawn from a population-based study of AA. The variants identified were genotyped in 1,694 AA and analyzed. Notable among the findings was the observation of ancestry-specific effects for several variants in the LPL gene among these admixed individuals, with a greater effect observed among those with European ancestry in this region. These associations were further elucidated by replication in West Africans. By beginning with the sequence variation present among AA, investigating ancestry effects, and seeking replication in West Africans, we were able to comprehensively evaluate these candidate genes with a focus on African ancestry individuals.
The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern human, is important for understanding human diversity. Studies based on mitochondrial [1] and small sets of nuclear markers [2] have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans [1,3]. However, until now, fully sequenced human genomes have been limited to recently diverged populations [4–8]. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.
Current HIV-1 vaccines elicit strain-specific neutralizing antibodies. However, cross-reactive neutralizing antibodies arise in ~20% of HIV-1-infected individuals, and details of their generation could provide a roadmap for effective vaccination. Here we report the isolation, evolution and structure of a broadly neutralizing antibody from an African donor followed from time of infection. The mature antibody, CH103, neutralized ~55% of HIV-1 isolates, and its co-crystal structure with gp120 revealed a novel loop-based mechanism of CD4-binding site recognition. Virus and antibody gene sequencing revealed concomitant virus evolution and antibody maturation. Notably, the CH103-lineage unmutated common ancestor avidly bound the transmitted/founder HIV-1 envelope glycoprotein, and evolution of antibody neutralization breadth was preceded by extensive viral diversification in and near the CH103 epitope. These data elucidate the viral and antibody evolution leading to induction of a lineage of HIV-1 broadly neutralizing antibodies and provide insights into strategies to elicit similar antibodies via vaccination.
Color markings among felid species display both a remarkable diversity and a common underlying periodicity. A similar range of patterns in domestic cats suggests a conserved mechanism whose appearance can be altered by selection. We identified the gene responsible for tabby pattern variation in domestic cats as Transmembrane aminopeptidase Q (Taqpep), which encodes a membrane-bound metalloprotease. Analyzing 31 other felid species, we identified Taqpep as the cause of the rare king cheetah phenotype, in which spots coalesce into blotches and stripes. Histologic, genomic expression, and transgenic mouse studies indicate that paracrine expression of Endothelin3 (Edn3) coordinates localized color differences. We propose a two-stage model in which Taqpep helps to establish a periodic pre-pattern during skin development that is later implemented by differential expression of Edn3.
The current genetic and recombination maps of the cat have fewer than 3,000 markers and a resolution limit greater than 1 Mb. To complement the first generation domestic cat maps, support higher resolution mapping studies, and aid genome assembly in specific areas as well as in the whole genome, a 15,000Rad radiation hybrid (RH) panel for the domestic cat was generated. Fibroblasts from the female Abyssinian cat that was used to generate the cat genomic sequence were fused to a Chinese hamster cell line (A23), producing 150 hybrid lines. The clones were initially characterized using 39 STR and 1536 SNP markers. The utility of whole genome amplification (WGA) in preserving and extending RH panel DNA was also tested using ten STR markers; no significant difference in retention was observed. The resolution of the 15,000Rad RH panel was established by constructing framework maps across ten different 1 Mb regions on different feline chromosomes. In these regions, two-point analysis was used to estimate RH distances, which compared favorably with the estimation of physical distances. The study demonstrates that the 15,000Rad RH panel constitutes a powerful tool for constructing high-resolution maps, having an average resolution of 40.1 kb per marker across the ten 1 Mb regions. In addition, the RH panel will complement existing genomic resources for the domestic cat, aid in the accurate reassemblies of the forthcoming cat genomic sequence, and support cross-species genomic comparisons.
The cat (Felis silvestris catus) shows significant variation in pelage, morphological, and behavioral phenotypes amongst its over 40 domesticated breeds. The majority of the breed specific phenotypic presentations originated through artificial selection, especially on desired novel phenotypic characteristics that arose only a few hundred years ago. Variations in coat texture and color of hair often delineate breeds amongst domestic animals. Although the genetic basis of several feline coat colors and hair lengths is characterized, less is known about the genes influencing variation in coat growth and texture, especially rexoid (curly-coated) types. Cornish Rex is a cat breed defined by a fixed recessive curly coat trait. Genome-wide analyses for selection (di, Tajima’s D and nucleotide diversity) were performed in the Cornish Rex breed and in 11 phenotypically diverse breeds and two random bred populations. Approximately 63K SNPs were used in the analysis that aimed to localize the locus controlling the rexoid hair texture. A region with a strong signature of recent selective sweep was identified in the Cornish Rex breed on chromosome A1, as well as a consensus block of homozygosity that spans approximately 3 Mb. Inspection of the region for candidate genes led to the identification of the lysophosphatidic acid receptor 6 (LPAR6). A 4 bp deletion in exon 5, c.250_253_delTTTG, which induces a premature stop codon in the receptor, was identified via Sanger sequencing. The mutation is fixed in Cornish Rex, absent in all straight haired cats analyzed, and is also segregating in the German Rex breed. LPAR6 encodes a G protein-coupled receptor essential for maintaining the structural integrity of the hair shaft, and mutations in the gene result in a woolly hair phenotype in humans.
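Two of the selection statistics named above, nucleotide diversity and Tajima's D, can be computed directly from a set of aligned haplotypes. The sketch below uses the standard Tajima (1989) constants. The toy sequences are hypothetical, and the windowed, SNP-array-based analysis used in the study is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences (pi), not divided by sequence length."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns carrying more than one allele."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences; None if no site varies."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if s == 0:
        return None  # the statistic is undefined without segregating sites
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = nucleotide_diversity(seqs)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical haplotypes: every variant is a singleton (excess of rare alleles).
singletons = ["TAAA", "ATAA", "AATA", "AAAT"]
# Hypothetical haplotypes: every variant is at intermediate frequency.
balanced = ["TTTT", "TTTT", "AAAA", "AAAA"]

d_singletons = tajimas_d(singletons)
d_balanced = tajimas_d(balanced)
print(nucleotide_diversity(singletons), d_singletons, d_balanced)
```

An excess of rare variants gives a negative D, the direction expected after a recent selective sweep, while an excess of intermediate-frequency variants gives a positive D.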
Most endometrial cancers can be classified histologically as endometrioid, serous, or clear cell. Non-endometrioid endometrial cancers (NEECs; serous and clear cell) are the most clinically aggressive of the three major histotypes and are characterized by aneuploidy, a feature of chromosome instability. The genetic alterations that underlie chromosome instability in endometrial cancer are poorly understood. In the present study, we used Sanger sequencing to search for nucleotide variants in the coding exons and splice junctions of 21 candidate chromosome instability genes, including 19 genes implicated in sister chromatid cohesion, from 24 primary, microsatellite-stable NEECs. Somatic mutations were verified by sequencing matched normal DNAs. We subsequently resequenced mutated genes from 41 additional NEECs as well as 42 endometrioid ECs (EECs). We uncovered nonsynonymous somatic mutations in ESCO1, CHTF18, and MRE11A in, respectively, 3.7% (4 of 107), 1.9% (2 of 107), and 1.9% (2 of 107) of endometrial tumors. Overall, 7.7% (5 of 65) of NEECs and 2.4% (1 of 42) of EECs had somatically mutated one or more of the three genes. A subset of mutations are predicted to impact protein function. The co-occurrence of somatic mutations in ESCO1 and CHTF18 was statistically significant (P = 0.0011, two-tailed Fisher's exact test). This is the first report of somatic mutations within ESCO1 and CHTF18 in endometrial tumors and of MRE11A mutations in microsatellite-stable endometrial tumors. Our findings warrant future studies to determine whether these mutations are driver events that contribute to the pathogenesis of endometrial cancer.
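The co-occurrence test reported above is a two-tailed Fisher's exact test on a 2x2 table of tumors classified by ESCO1 and CHTF18 mutation status. The abstract gives the marginal counts (4 and 2 of 107 tumors) but not the table itself, so the sketch below assumes that both CHTF18-mutated tumors also carried an ESCO1 mutation. That layout is an assumption made for illustration, as is the function name.

```python
from math import comb


def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    p_observed = prob(a)
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )


# Assumed table (rows: ESCO1 mutated / not; columns: CHTF18 mutated / not).
both_mutated = 2        # assumption: both CHTF18-mutated tumors carry ESCO1 mutations
esco1_only = 2          # 4 ESCO1-mutated tumors in total
chtf18_only = 0         # 2 CHTF18-mutated tumors in total
neither = 103           # remainder of the 107 tumors

p_value = fisher_two_tailed(both_mutated, esco1_only, chtf18_only, neither)
print(round(p_value, 4))
```

Under this assumed table the p-value is 6/5671, which rounds to 0.0011 and is consistent with the reported value.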
Genomic technologies, such as whole-exome sequencing, are a powerful tool in genetic research. Such testing yields a great deal of incidental medical information, or medical information not related to the primary research target. We describe the management of incidental medical information derived from whole-exome sequencing in the research context. We performed whole-exome sequencing on a monozygotic twin pair in which only 1 child was affected with congenital anomalies and applied an institutional review board–approved algorithm to determine what genetic information would be returned. Whole-exome sequencing identified 79 525 genetic variants in the twins. Here, we focus on novel variants. After filtering artifacts and excluding known single nucleotide polymorphisms and variants not predicted to be pathogenic, the twins had 32 novel variants in 32 genes that were felt to be likely to be associated with human disease. Eighteen of these novel variants were associated with recessive disease and 18 were associated with dominantly manifesting conditions (variants in some genes were potentially associated with both recessive and dominant conditions), but only 1 variant ultimately met our institutional review board–approved criteria for return of information to the research participants.
whole-exome sequencing; incidental medical information
Endometrial cancer is the 6th most commonly diagnosed cancer among women worldwide, causing ~74,000 deaths annually [1]. Serous endometrial cancers are a clinically aggressive subtype with a poorly defined genetic etiology [2-4]. We used whole exome sequencing (WES) to comprehensively search for somatic mutations within ~22,000 protein-encoding genes among 13 primary serous endometrial tumors. We subsequently resequenced 18 genes that were mutated in more than one tumor, and/or were genes that formed an enriched functional grouping, from 40 additional serous tumors. We identified high frequencies of somatic mutations in CHD4 (17%), EP300 (8%), ARID1A (6%), TSPYL2 (6%), FBXW7 (29%), SPOP (8%), MAP3K4 (6%) and ABCC9 (6%). Overall, 36.5% of serous tumors had mutated a chromatin-remodeling gene and 35% had mutated a ubiquitin ligase complex gene, implicating the frequent mutational disruption of these processes in the molecular pathogenesis of one of the deadliest forms of endometrial cancer.
In this study we assess exome sequencing (ES) as a diagnostic alternative for genetically heterogeneous disorders. Since ES readily identified a previously reported homozygous mutation in the CAPN3 gene for an individual with an undiagnosed limb girdle muscular dystrophy, we evaluated ES as a generalizable clinical diagnostic tool by assessing the targeting efficiency and sequencing-coverage of 88 genes associated with muscle disease (MD) and spastic paraplegia (SPG). We used three exome-capture kits on 125 individuals. Exons constituting each gene were defined using the UCSC and CCDS databases. The three exome-capture kits targeted 47–92% of bases within the UCSC-defined exons, and 97%–99% of bases within the CCDS-defined exons. An average of 61.2–99.5% and 19.1–99.5% of targeted bases per gene were sequenced to 20X coverage within the CCDS-defined MD and SPG coding exons, respectively. Greater than 95–99% of targeted known mutation positions were sequenced to ≥1X coverage and 55–87% to ≥20X coverage in every exome. We conclude therefore that ES is a rapid and efficient first tier method to screen for mutations, particularly within the CCDS annotated exons, although its application requires disclosure of the extent of coverage for each targeted gene and supplementation with second tier Sanger sequencing for full coverage.
CAPN3; exome; LGMD; HSP; neuromuscular disorders; clinical genetic testing
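The coverage figures in the abstract above are per-gene fractions of targeted bases sequenced to at least a given depth (1X or 20X). A minimal sketch of that calculation follows. The per-base depths, the placeholder gene `GENE_B`, and the 95% follow-up cutoff are hypothetical and chosen only to illustrate the bookkeeping.

```python
def fraction_covered(depths, min_depth=20):
    """Fraction of targeted bases with read depth >= min_depth."""
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / len(depths)


# Hypothetical per-base depths over the targeted coding bases of two genes.
per_base_depth = {
    "CAPN3": [35, 42, 28, 21, 50, 61, 22, 30, 25, 40],
    "GENE_B": [5, 0, 21, 18, 33, 12, 20, 7, 0, 25],  # placeholder gene name
}

coverage_20x = {gene: fraction_covered(d, 20) for gene, d in per_base_depth.items()}
coverage_1x = {gene: fraction_covered(d, 1) for gene, d in per_base_depth.items()}

# Genes falling below an assumed 95% cutoff at 20X would be flagged for
# second-tier Sanger sequencing and reported with their coverage.
needs_followup = [gene for gene, frac in coverage_20x.items() if frac < 0.95]
print(coverage_20x, coverage_1x, needs_followup)
```

Reporting these fractions per gene is one way to provide the disclosure of coverage that the abstract calls for.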
Large data sets on human genetic variation have been collected recently, but their usefulness for learning about history and natural selection has been limited by biases in the ways polymorphisms were chosen. We report large subsets of SNPs from the International HapMap Project [1,2] that allow us to overcome these biases and to provide accurate measurement of a quantity of crucial importance for understanding genetic variation: the allele frequency spectrum. Our analysis shows that East Asian and northern European ancestors shared the same population bottleneck expanding out of Africa but that both also experienced more recent genetic drift, which was greater in East Asians.
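The allele frequency spectrum mentioned above counts, for a sample of n chromosomes, how many polymorphic sites carry the derived allele in exactly 1, 2, ..., n-1 copies. The sketch below tabulates it from haplotypes, assuming the ancestral state at each site is known and encoded as `0` (derived as `1`). The encoding, function name, and toy haplotypes are hypothetical, and no ascertainment correction is attempted.

```python
from collections import Counter


def allele_frequency_spectrum(haplotypes):
    """Unfolded spectrum: entry k-1 is the number of sites at which exactly
    k of the n sampled chromosomes carry the derived allele ('1')."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    derived_counts = Counter(
        sum(h[j] == "1" for h in haplotypes) for j in range(n_sites)
    )
    # Sites with 0 or n derived copies are monomorphic in the sample and dropped.
    return [derived_counts.get(k, 0) for k in range(1, n)]


# Hypothetical sample of four chromosomes typed at five sites.
toy_haplotypes = ["10010", "10000", "01010", "00000"]
spectrum = allele_frequency_spectrum(toy_haplotypes)
print(spectrum)
```

Because the way SNPs are chosen for genotyping changes which sites enter this tally, the shape of the spectrum is sensitive to the ascertainment biases the abstract describes.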