Although acute lymphocytic leukemia (ALL) is the most common childhood cancer, genetic predisposition to ALL remains poorly understood. Whole-exome sequencing was performed in an extended kindred in which five individuals had been diagnosed with leukemia. Analysis revealed a nonsense variant of TP53 which has been previously reported in families with sarcomas and other typical Li-Fraumeni syndrome-associated cancers but never in a familial leukemia kindred. This unexpected finding enabled identification of an appropriate sibling bone marrow donor and illustrates that exome sequencing will reveal atypical clinical presentations of even well-studied genes.
exome sequencing; acute lymphocytic leukemia; genetic predisposition to disease; genetic testing
Approximately 15% of colorectal carcinomas (CRC) exhibit a hypermutated genotype accompanied by high levels of microsatellite instability (MSI-H) and defects in DNA mismatch repair. These tumors, unlike the majority of colorectal carcinomas, are often diploid, exhibit frequent epigenetic silencing of the MLH1 DNA mismatch repair gene, and have a better clinical prognosis. As an adjunct study to The Cancer Genome Atlas consortium that recently analyzed 224 colorectal cancers by whole exome sequencing, we compared the 35 CRC (15.6%) with a hypermutated genotype to those with a non-hypermutated genotype. We found that 22 (63%) of hypermutated CRC exhibited transcriptional silencing of the MLH1 gene, a high frequency of BRAF V600E gene mutations and infrequent APC and KRAS mutations, a mutational pattern significantly different from their non-hypermutated counterparts. However, the remaining 13 (37%) hypermutated CRC lacked MLH1 silencing, contained tumors with the highest mutation rates (“ultramutated” CRC), and exhibited higher incidences of APC and KRAS mutations, but infrequent BRAF mutations. These patterns were confirmed in an independent validation set of 250 exome-sequenced CRC. Analysis of mRNA and microRNA expression signatures revealed that hypermutated CRC with MLH1 silencing had greatly reduced levels of WNT signaling and increased BRAF signaling relative to non-hypermutated CRC. Our findings suggest that hypermutated CRC include one subgroup with fundamentally different pathways to malignancy than the majority of CRC. Examination of MLH1 expression status and frequencies of APC, KRAS, and BRAF mutation in CRC may provide a useful diagnostic tool that could supplement the standard microsatellite instability assays and influence therapeutic decisions.
colorectal cancer; microsatellite instability; MLH1; APC; KRAS; BRAF; WNT signaling; mutation rate
The advent of whole-exome next-generation sequencing (WES) has been pivotal for the molecular characterization of Mendelian disease; however, the clinical application of WES has remained relatively unexplored. We describe our experience with WES as a diagnostic tool in a three-year-old female patient with a two-year history of episodic muscle weakness and paroxysmal dystonia who presented following a previous extensive but unrevealing diagnostic work-up. WES was performed on the proband and her two parents. Parental exome data were used to filter de novo genomic events in the proband, and suspected mutations were confirmed using di-deoxy sequencing. WES revealed a de novo non-synonymous mutation in exon 21 of the calcium channel gene CACNA1S that has been previously reported in a single patient as a rare cause of atypical hypokalemic periodic paralysis. This was unexpected, as the proband’s original differential diagnosis had included hypokalemic periodic paralysis, but clinical and laboratory features were equivocal, and standard clinical molecular testing for hypokalemic periodic paralysis and related disorders was negative. This report highlights the potential diagnostic utility of WES in clinical practice, with implications for the approach to similar diagnostic dilemmas in the future.
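The trio filtering step described above can be illustrated with a minimal sketch. It assumes variant calls are already available as simple (chromosome, position, ref, alt) records; the `Variant` type, the function name, and the coordinates are hypothetical and are not taken from the study. A real pipeline would also account for genotype quality and read depth in the parents before calling a variant de novo.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record used only for this illustration."""
    chrom: str
    pos: int
    ref: str
    alt: str


def de_novo_candidates(proband, mother, father):
    """Return proband variants that are absent from both parental call sets."""
    parental = set(mother) | set(father)
    return sorted(v for v in set(proband) if v not in parental)


# Made-up example calls (coordinates are illustrative, not real findings).
proband_calls = [Variant("1", 201000000, "G", "A"), Variant("2", 5000, "C", "T")]
mother_calls = [Variant("2", 5000, "C", "T")]
father_calls = [Variant("3", 700, "A", "G")]

candidates = de_novo_candidates(proband_calls, mother_calls, father_calls)
```

Candidates that survive this filter would still need orthogonal confirmation, as the report did with di-deoxy sequencing.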
Hypokalemic periodic paralysis; CACNA1S; next generation sequencing; hypotonia
Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results.
To address these challenges, we have developed the Mercury analysis pipeline and deployed it on local hardware and in the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts.
By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.
NGS data; Variant calling; Annotation; Clinical sequencing; Cloud computing
To characterize the role of rare complete human knockouts in autism spectrum disorders (ASD), we identify genes with homozygous or compound heterozygous loss-of-function (LoF) variants (defined as nonsense and essential splice sites) from exome sequencing of 933 cases and 869 controls. We identify a two-fold increase in complete knockouts of autosomal genes with low rates of LoF variation (≤5% frequency) in cases and estimate a 3% contribution to ASD risk by these events, confirming this observation in an independent set of 563 probands and 4,605 controls. Outside the pseudo-autosomal regions on the X-chromosome, we similarly observe a significant 1.5-fold increase in rare hemizygous knockouts in males, contributing to another 2% of ASDs in males. Taken together these results provide compelling evidence that rare autosomal and X-chromosome complete gene knockouts are important inherited risk factors for ASD.
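As a rough illustration of how complete knockouts might be flagged per individual, the sketch below marks a gene when it carries a homozygous LoF call or at least two heterozygous LoF calls. The function name, the input layout, and the gene labels are assumptions for this example only. The sketch does not check phase, so two heterozygous variants on the same haplotype would be wrongly counted; a real analysis would need phasing or parental data to confirm compound heterozygosity.

```python
from collections import defaultdict


def complete_knockout_genes(lof_calls):
    """Flag genes with a homozygous LoF call or >= 2 heterozygous LoF calls.

    lof_calls: iterable of (gene, genotype) pairs for ONE individual, where
    genotype is "hom" or "het". Phase is NOT checked (assumption of the sketch).
    """
    het_counts = defaultdict(int)
    knocked_out = set()
    for gene, genotype in lof_calls:
        if genotype == "hom":
            knocked_out.add(gene)
        elif genotype == "het":
            het_counts[gene] += 1
    # Two or more heterozygous LoF variants: candidate compound heterozygote.
    knocked_out.update(g for g, n in het_counts.items() if n >= 2)
    return knocked_out


# Hypothetical gene labels, for illustration only.
example_calls = [
    ("GENE_A", "hom"),
    ("GENE_B", "het"),
    ("GENE_B", "het"),
    ("GENE_C", "het"),
]
knockouts = complete_knockout_genes(example_calls)
```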
Deafness, onychodystrophy, osteodystrophy, mental retardation, and seizures (DOORS) syndrome is a rare autosomal recessive disorder of unknown cause. We aimed to identify the genetic basis of this syndrome by sequencing most coding exons in affected individuals.
Through a search of available case studies and communication with collaborators, we identified families that included at least one individual with at least three of the five main features of the DOORS syndrome: deafness, onychodystrophy, osteodystrophy, intellectual disability, and seizures. Participants were recruited from 26 centres in 17 countries. Families described in this study were enrolled between Dec 1, 2010, and March 1, 2013. Collaborating physicians enrolling participants obtained clinical information and DNA samples from the affected child and both parents if possible. We did whole-exome sequencing in affected individuals as they were enrolled, until we identified a candidate gene, and Sanger sequencing to confirm mutations. We did expression studies in human fibroblasts from one individual by real-time PCR and western blot analysis, and in mouse tissues by immunohistochemistry and real-time PCR.
26 families were included in the study. We did exome sequencing in the first 17 enrolled families; we screened for TBC1D24 by Sanger sequencing in subsequent families. We identified TBC1D24 mutations in 11 individuals from nine families (by exome sequencing in seven families, and Sanger sequencing in two families). 18 families had individuals with all five main features of DOORS syndrome, and TBC1D24 mutations were identified in half of these families. The seizure types in individuals with TBC1D24 mutations included generalised tonic-clonic, complex partial, focal clonic, and infantile spasms. Of the 18 individuals with DOORS syndrome from 17 families without TBC1D24 mutations, eight did not have seizures and three did not have deafness. In expression studies, some mutations abrogated TBC1D24 mRNA stability. We also detected Tbc1d24 expression in mouse phalangeal chondrocytes and calvaria, which suggests a role of TBC1D24 in skeletogenesis.
Our findings suggest that mutations in TBC1D24 are an important cause of DOORS syndrome and can cause diverse phenotypes. Thus, individuals with DOORS syndrome without deafness and seizures but with the other features should still be screened for TBC1D24 mutations. More information is needed to understand the cellular roles of TBC1D24 and identify the genes responsible for DOORS phenotypes in individuals who do not have a mutation in TBC1D24.
US National Institutes of Health, the CIHR (Canada), the NIHR (UK), the Wellcome Trust, the Henry Smith Charity, and Action Medical Research.
In this study, combinatorial libraries were used in conjunction with ultra-high throughput sequencing to comprehensively determine the impact of each of the 19 possible amino acid substitutions at each residue position in the TEM-1 β-lactamase enzyme. The libraries were introduced into E. coli and mutants were selected for ampicillin resistance. The selected colonies were pooled and subjected to ultra-high throughput sequencing to reveal the sequence preferences at each position. The depth of sequencing provided a clear, statistically significant picture of what amino acids are favored for ampicillin hydrolysis for all 263 positions of the enzyme in one experiment. Although the enzyme is generally tolerant of amino acid substitutions, several surface positions far from the active site are sensitive to substitutions suggesting a role for these residues in enzyme stability, solubility or catalysis. In addition, information on the frequency of substitutions was used to identify mutations that increase enzyme thermodynamic stability. Finally, a comparison of sequence requirements based on the mutagenesis results versus those inferred from sequence conservation in an alignment of 156 class A β-lactamases reveals significant differences in that several residues in TEM-1 do not tolerate substitutions and yet extensive variation is observed in the alignment, and vice versa. An analysis of the TEM-1 and other class A structures suggests residues that vary in the alignment may nevertheless make unique, but important, interactions within individual enzymes.
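To make the idea of per-position sequence preferences concrete, the following sketch tallies amino acid frequencies at each position across a set of selected sequences. It assumes the reads have already been translated and trimmed to equal-length protein sequences; the function name and the toy sequences are hypothetical, and the study's actual statistical treatment is not reproduced here.

```python
from collections import Counter


def position_frequencies(sequences):
    """Return, for each position, a dict of amino acid -> observed frequency."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        total = sum(counts.values())
        freqs.append({aa: n / total for aa, n in counts.items()})
    return freqs


# Toy selected sequences (three positions only, made up for illustration).
selected = ["MSI", "MSV", "MAI", "MSI"]
freqs = position_frequencies(selected)
```

A position where one residue dominates after selection would be read as intolerant of substitution, while a flat distribution would suggest tolerance.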
Dysosteosclerosis (DSS) is the form of osteopetrosis distinguished by the presence of skin findings such as red-violet macular atrophy, platyspondyly and metaphyseal osteosclerosis with relative radiolucency of widened diaphyses. At the histopathological level, there is a paucity of osteoclasts when the disease presents. In two patients with DSS, we identified homozygous or compound heterozygous missense mutations in SLC29A3 by whole-exome sequencing. This gene encodes a nucleoside transporter, mutations in which cause histiocytosis–lymphadenopathy plus syndrome, a group of conditions with little or no skeletal involvement. This transporter is essential for lysosomal function in mice. We demonstrate the expression of Slc29a3 in mouse osteoclasts in vivo. In monocytes from patients with DSS, we observed reduced osteoclast differentiation and function (demineralization of calcium surface). Our report highlights the pleomorphic consequences of dysfunction of this nucleoside transporter, and importantly suggests a new mechanism for the control of osteoclast differentiation and function.
This report identifies human skeletal diseases associated with mutations in WNT1. In ten family members with dominantly inherited early-onset osteoporosis, a heterozygous missense variation c.652T>G (p.Cys218Gly) in WNT1 segregated with the disease, and a homozygous nonsense mutation (c.884C>A, p.Ser295*) was identified in two siblings with recessive osteogenesis imperfecta. In vitro, aberrant forms of WNT1 protein showed impaired capacity to induce canonical WNT signaling, their target genes, and mineralization. Wnt1 was clearly expressed in bone marrow, especially in B cell lineage and hematopoietic progenitors; lineage tracing identified expression in a subset of osteocytes, suggesting altered cross-talk of WNT signaling between hematopoietic and osteoblastic lineage cells in these diseases.
Recent advances in human genomics and biotechnologies have profound impacts on medical research and clinical practice. Individual genomic information, including DNA sequences and gene expression profiles, can be used for prediction, prevention, diagnosis, and treatment for many complex diseases. Personalized medicine attempts to tailor medical care to individual patients by incorporating their genomic information. In a case of pancreatic cancer, the fourth leading cause of cancer death in the United States, alteration in many genes as well as molecular profiles in blood, pancreas tissue, and pancreas juice has recently been discovered to be closely associated with tumorigenesis or prognosis of the cancer. This review aims to summarize recent advances of important genes, proteins, and microRNAs that play a critical role in the pathogenesis of pancreatic cancer, and to provide implications for personalized medicine in pancreatic cancer.
pancreatic cancer; genomics; genetics; biomarker; molecular target; personalized medicine
Neurofibromatosis Type 1 (NF1) is a genetic disorder that is driven by the loss of neurofibromin (Nf) protein function. Nf contains a Ras GTPase activating domain (Ras-GAP), which directly regulates Ras signaling. Numerous clinical manifestations are associated with the loss of Nf and increased Ras activity. Ras proteins must be prenylated in order to traffic and functionally localize with target membranes. Hence, Ras is a potential therapeutic target for treating NF1. We have tested the efficacy of two novel farnesyl transferase inhibitors (FTI), 1 and 2, alone or in combination with lovastatin, on two NF1 malignant peripheral nerve sheath tumor (MPNST) cell lines, NF90-8 and ST88-14. Single treatments of 1, 2, or lovastatin had no effect on MPNST cell proliferation. However, low micromolar combinations of 1 or 2 with lovastatin (FTI/lovastatin) reduced Ras prenylation in both MPNST cell lines. Further, this FTI/lovastatin combination treatment reduced cell proliferation and induced an apoptotic response as shown by morphological analysis, pro-caspase-3/-7 activation, loss of mitochondrial membrane potential, and accumulation of cells with sub G1 DNA content. Little to no detectable toxicity was observed in normal rat Schwann cells following FTI/lovastatin combination treatment. These data support the hypothesis that combination FTI plus lovastatin therapy may be a potential treatment for NF1 MPNSTs.
De novo mutations affect risk for many diseases and disorders, especially those with early-onset. An example is autism spectrum disorders (ASD). Four recent whole-exome sequencing (WES) studies of ASD families revealed a handful of novel risk genes, based on independent de novo loss-of-function (LoF) mutations falling in the same gene, and found that de novo LoF mutations occurred at a twofold higher rate than expected by chance. However successful these studies were, they used only a small fraction of the data, excluding other types of de novo mutations and inherited rare variants. Moreover, such analyses cannot readily incorporate data from case-control studies. An important research challenge in gene discovery, therefore, is to develop statistical methods that accommodate a broader class of rare variation. We develop methods that can incorporate WES data regarding de novo mutations, inherited variants present, and variants identified within cases and controls. TADA, for Transmission And De novo Association, integrates these data by a gene-based likelihood model involving parameters for allele frequencies and gene-specific penetrances. Inference is based on a Hierarchical Bayes strategy that borrows information across all genes to infer parameters that would be difficult to estimate for individual genes. In addition to theoretical development we validated TADA using realistic simulations mimicking rare, large-effect mutations affecting risk for ASD and show it has dramatically better power than other common methods of analysis. Thus TADA's integration of various kinds of WES data can be a highly effective means of identifying novel risk genes. Indeed, application of TADA to WES data from subjects with ASD and their families, as well as from a study of ASD subjects and controls, revealed several novel and promising ASD candidate genes with strong statistical support.
The genetic underpinnings of autism spectrum disorder (ASD) have proven difficult to determine, despite a wealth of evidence for genetic causes and ongoing effort to identify genes. Recently investigators sequenced the coding regions of the genomes from ASD children along with their unaffected parents (ASD trios) and identified numerous new candidate genes by pinpointing spontaneously occurring (de novo) mutations in the affected offspring. A gene with a severe (de novo) mutation observed in more than one individual is immediately implicated in ASD; however, the majority of severe mutations are observed only once per gene. These genes create a short list of candidates, and our results suggest about 50% are true risk genes. To strengthen our inferences, we develop a novel statistical method (TADA) that utilizes inherited variation transmitted to affected offspring in conjunction with (de novo) mutations to identify risk genes. Through simulations we show that TADA dramatically increases power. We apply this approach to nearly 1000 ASD trios and 2000 subjects from a case-control study and identify several promising genes. Through simulations and application we show that TADA's integration of sequencing data can be a highly effective means of identifying risk genes.
Osteogenesis imperfecta (OI), Ehlers-Danlos syndrome (EDS), and osteopetrosis (OPT) are collectively common inherited skeletal diseases. Evaluation of subjects with these conditions often includes molecular testing, which has important counseling, therapeutic and sometimes legal implications. Since several different genes have been implicated in these conditions, Sanger sequencing of each gene can be a prohibitively expensive and time-consuming way to reach a molecular diagnosis.
In order to circumvent these problems, we have designed and tested a NGS platform that would allow simultaneous sequencing on a single diagnostic platform of different genes implicated in OI, OPT, EDS, and other inherited conditions leading to low or high bone mineral density. We used a liquid-phase probe library that captures 602 exons (~100 kb) of 34 selected genes and have applied it to test clinical samples from patients with bone disorders.
NGS of the captured exons by Illumina HiSeq2000 resulted in an average coverage of over 900X. The platform was successfully validated by identifying mutations in 6 patients with known mutations. Moreover, in 4 patients with OI or OPT without a prior molecular diagnosis, the assay was able to detect the causative mutations.
In conclusion, our NGS panel provides a fast and accurate method to arrive at a molecular diagnosis in most patients with inherited high or low bone mineral density disorders.
Isoprenylcysteine carboxyl methyltransferases (Icmts) are a class of integral membrane protein methyltransferases localized to the endoplasmic reticulum (ER) membrane in eukaryotes. The Icmts from human (hIcmt) and S. cerevisiae (Ste14p) catalyze the α-carboxyl methyl esterification step in the post-translational processing of CaaX proteins, including the yeast a-factor mating pheromones and both human and yeast Ras proteins. Herein, we evaluated synthetic analogs of two well-characterized Icmt substrates, N-acetyl-S-farnesyl-L-cysteine (AFC) and the yeast a-factor peptide mating pheromone, that contain photoactive benzophenone moieties in either the lipid or peptide portion of the molecule. The AFC-based compounds were substrates for both hIcmt and Ste14p, whereas the a-factor analogs were only substrates for Ste14p. However, the a-factor analogs were found to be micromolar inhibitors of hIcmt. Together, these data suggest that the Icmt substrate binding site is dependent upon features in both the isoprenyl moiety and upstream amino acid composition and that hIcmt and Ste14p have overlapping, yet distinct, substrate specificities. Photocrosslinking and neutravidin-agarose capture experiments with these analogs revealed that both hIcmt and Ste14p were specifically photolabeled to varying degrees with all of the compounds tested. These data suggest that these analogs will be useful for the future identification of the Icmt substrate binding sites.
Icmt; Ste14p; a-factor; photocrosslinking; benzophenone; methyltransferase
The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly.
In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies.
Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
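One widely used contiguity statistic for comparing assemblies is N50: the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least half of the total assembled length. The sketch below is a generic illustration of that definition with made-up lengths; the function name is hypothetical, and nothing here reproduces the specific ten measures chosen in the evaluation.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths: total is 290, so half is 145; 80 + 70 reaches it.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

N50 rewards contiguity only and says nothing about correctness, which is one reason an assessment would combine it with independent evidence such as optical maps.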
Genome assembly; N50; Scaffolds; Assessment; Heterozygosity; COMPASS
Next generation exome sequencing (ES) and whole genome sequencing (WGS) are powerful new tools for discovering the gene(s) that underlie Mendelian disorders. To accelerate these discoveries, the National Institutes of Health has established three Centers for Mendelian Genomics (CMGs): the Center for Mendelian Genomics at the University of Washington; the Center for Mendelian Disorders at Yale University; and the Baylor-Johns Hopkins Center for Mendelian Genomics at Baylor College of Medicine and Johns Hopkins University. The CMGs will provide ES/WGS and extensive analysis expertise at no cost to collaborating investigators where the causal gene(s) for a Mendelian phenotype has yet to be uncovered. Over the next few years and in collaboration with the global human genetics community, the CMGs hope to facilitate the identification of the genes underlying a very large fraction of all Mendelian disorders (see http://mendelian.org).
mendelian; exome sequencing; commentary
Genitopatellar syndrome (GPS) and Say-Barber-Biesecker-Young-Simpson syndrome (SBBYSS or Ohdo syndrome) have both recently been shown to be caused by distinct mutations in the histone acetyltransferase KAT6B (a.k.a. MYST4/MORF). All variants are de novo dominant mutations that lead to protein truncation. Mutations leading to GPS occur in the proximal portion of the last exon and lead to the expression of a protein without an activation domain. Mutations leading to SBBYSS occur either throughout the gene, leading to nonsense-mediated decay, or more distally in the last exon. Features present only in GPS are contractures, anomalies of the spine, ribs and pelvis, renal cysts, hydronephrosis and agenesis of the corpus callosum. Features present only in SBBYSS include long thumbs and long great toes and lacrimal duct abnormalities. Several features occur in both, such as intellectual disability, congenital heart defects, genital and patellar anomalies. We propose that haploinsufficiency or loss of a function mediated by the C-terminal domain causes the common features, whereas gain-of-function activities would explain the features unique to GPS. Further molecular studies and the compilation of mutations in a database for genotype-phenotype correlations (www.LOVD.nl/KAT6B) might help tease out answers to these questions and understand the developmental programs dysregulated by the different truncations.
KAT6B; MYST4; mutation database; Genitopatellar syndrome; Ohdo Syndrome
The debate regarding the relative merits of whole genome sequencing (WGS) versus exome sequencing (ES) centers around comparative cost, average depth of coverage for each interrogated base, and their relative efficiency in the identification of medically actionable variants from the myriad of variants identified by each approach. Nevertheless, few genomes have been subjected to both WGS and ES, using multiple next generation sequencing platforms. In addition, no personal genome has been so extensively analyzed using DNA derived from peripheral blood as opposed to DNA from transformed cell lines that may either accumulate mutations during propagation or clonally expand mosaic variants during cell transformation and propagation.
We investigated a genome that was studied previously by SOLiD chemistry using both ES and WGS, and now perform six independent ES assays (Illumina GAII (x2), Illumina HiSeq (x2), Life Technologies' Personal Genome Machine (PGM) and Proton), and one additional WGS (Illumina HiSeq).
We compared the variants identified by the different methods and provide insights into the differences among variants identified between ES runs in the same technology platform and among different sequencing technologies. We resolved the true genotypes of medically actionable variants identified in the proband through orthogonal experimental approaches. Furthermore, ES identified an additional SH3TC2 variant (p.M1?) that likely contributes to the phenotype in the proband.
ES identified additional medically actionable variant calls and helped resolve ambiguous single nucleotide variants (SNV), documenting the power of increased depth of coverage of the captured targeted regions. Comparative analyses of WGS and ES reveal that pseudogenes and segmental duplications may explain some instances of apparent disease mutations in unaffected individuals.
Exome sequencing; Whole-genome sequencing; Incidental findings; SH3TC2; Personal genomes; Precision medicine
Czech dysplasia, metatarsal type is an autosomal dominant skeletal disorder that is characterized by early-onset, progressive arthritis, brachydactyly of the 3rd and 4th toes, and characteristic radiographic findings in patients of normal stature. Patients with Czech dysplasia typically present in late childhood or later. In the present report, whole exome sequencing identified a mutation in COL2A1 (c.823C>T, p.R275C) known to be associated with Czech dysplasia in a 3.5-year-old female who had a family history of early-onset arthritis and who was asymptomatic except for prominent knees. The use of whole exome sequencing facilitated diagnosis of this rare disease (fewer than 15 families in the literature) in the presymptomatic period and thus enabled us to provide early anticipatory guidance and genetic counseling for the family.
Czech dysplasia; skeletal dysplasia; prominent knees; early-onset osteoarthritis; Depressed nasal bridge; Brachydactyly of 3rd and 4th toes; Normal stature; Early-onset arthritis
Transposable elements (TEs) are abundant in the human genome, and some are capable of generating new insertions through RNA intermediates. In cancer, the disruption of cellular mechanisms that normally suppress TE activity may facilitate mutagenic retrotranspositions. We performed single-nucleotide resolution analysis of TE insertions in 43 high-coverage whole-genome sequencing data sets from five cancer types. We identified 194 high-confidence somatic TE insertions, as well as thousands of polymorphic TE insertions in matched normal genomes. Somatic insertions were present in epithelial tumors but not in blood or brain cancers. Somatic L1 insertions tend to occur in genes that are commonly mutated in cancer, disrupt the expression of the target genes, and are biased toward regions of cancer-specific DNA hypomethylation, highlighting their potential impact in tumorigenesis.
Human diseases are caused by alleles that encompass the full range of variant types, from single-nucleotide changes to copy-number variants, and these variations span a broad frequency spectrum, from the very rare to the common. The picture emerging from analysis of whole-genome sequences, the 1000 Genomes Project pilot studies, and targeted genomic sequencing derived from very large sample sizes reveals an abundance of rare and private variants. One implication of this realization is that recent mutation may have a greater influence on disease susceptibility or protection than is conferred by variations that arose in distant ancestors.
Following the “finished,” euchromatic, haploid human reference genome sequence, the rapid development of novel, faster, and cheaper sequencing technologies is making possible the era of personalized human genomics. Personal diploid human genome sequences have been generated, and each has contributed to our better understanding of variation in the human genome. We have consequently begun to appreciate the vastness of individual genetic variation from single nucleotide to structural variants. Translation of genome-scale variation into medically useful information is, however, in its infancy. This review summarizes the initial steps undertaken in clinical implementation of personal genome information, and describes the application of whole-genome and exome sequencing to identify the cause of genetic diseases and to suggest adjuvant therapies. Better analysis tools and a deeper understanding of the biology of our genome are necessary in order to decipher, interpret, and optimize clinical utility of what the variation in the human genome can teach us. Personal genome sequencing may eventually become an instrument of common medical practice, providing information that assists in the formulation of a differential diagnosis. We outline herein some of the remaining challenges.
whole-genome sequencing (WGS); exome sequencing; simple nucleotide variation (SNV); structural variation; personal genomics
Since the initial report of targeted enrichment (Albert et al., 2007) we have been evolving the design and utility of capture reagents and methods, while taking advantage of the parallel advances in sequencing platforms. New exome designs target a comprehensive set of coding exons from 6 different gene databases, as well as computationally predicted coding and non-coding elements: regulatory regions and conserved UTRs. Library automation, reduction of DNA input samples, capture hybridization multiplexing and application of faster read mapping tools such as BWA together allow a rate of >4,300 libraries/captures per month, with >40,000 exome and regional capture libraries completed to date. In addition, a fully integrated informatics and analysis pipeline (Mercury) supports all aspects of data flow and analysis from the initial data production on the sequencing instrument to annotated variant calls (SNPs and small Indels). These laboratory methods and analysis pipelines have been production hardened at the Human Genome Sequencing Center (HGSC) and have now been applied toward clinical exome sequencing. Through a joint collaboration between the Human Genome Sequencing Center and the Medical Genetics Laboratories (MGL) of the Department of Molecular and Human Genetics, clinical exome sequencing and interpretation are now provided through the CAP/CLIA certified Whole Genome Laboratory (WGL). To date, the WGL has completed exome sequencing of 650 patient samples and final interpretation for over 450 patients, with causative deleterious mutations identified in 25% of cases. Performance has been maintained to a high standard of 95% of the exome target bases represented at 20X coverage. Overall exome performance metrics, LIMS support, variant analysis and validation of the clinical pipeline for a CAP/CLIA environment will be presented.
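The coverage standard quoted above (a given percentage of target bases represented at 20X) can be illustrated with a short calculation. The sketch assumes per-base depths over the target region have already been extracted into a list; the function name and the depth values are made up for this example and do not describe the Mercury pipeline's own implementation.

```python
def fraction_at_depth(depths, min_depth=20):
    """Return the fraction of target bases covered at >= min_depth."""
    if not depths:
        raise ValueError("no target bases supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Made-up per-base depths for ten target bases; eight of them reach 20X.
example_depths = [25, 30, 19, 40, 20, 5, 22, 21, 50, 35]
example_fraction = fraction_at_depth(example_depths)
```

Under this definition, a sample would meet the stated standard when the returned fraction is at least 0.95.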
Next generation sequencing platforms have greatly reduced sequencing costs, leading to the production of unprecedented amounts of sequence data. BWA is one of the most popular alignment tools due to its relatively high accuracy. However, mapping reads using BWA is still the most time consuming step in sequence analysis. Increasing mapping efficiency would allow the community to better cope with ever expanding volumes of sequence data.
We designed a new program, CGAP-align, that achieves a performance improvement over BWA without sacrificing recall or precision. This is accomplished through the use of Suffix Tarray, a novel data structure combining elements of Suffix Array and Suffix Tree. We also utilize a tighter lower bound estimation for the number of mismatches in a read, allowing for more effective pruning during inexact mapping. Evaluation of both simulated and real data suggests that CGAP-align consistently outperforms the current version of BWA and can achieve over twice its speed under certain conditions, all while obtaining nearly identical results.
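To show what a mismatch lower bound for pruning looks like, the sketch below follows the general idea used in BWA-style inexact matching: scan the read left to right and count one required difference each time the current segment no longer occurs exactly in the reference. This is a baseline illustration only. The tighter estimate used by CGAP-align is not specified in this abstract and is not reproduced here, and the naive substring search stands in for the index an aligner would actually use. The function name is hypothetical.

```python
def mismatch_lower_bounds(read, reference):
    """Return bounds[i], a lower bound on differences needed to align read[:i+1].

    Each time the current segment of the read stops occurring exactly in the
    reference, at least one difference must fall inside that segment, so the
    counter is incremented and a new segment starts at the next base.
    """
    bounds = []
    differences = 0
    start = 0
    for i in range(len(read)):
        # Naive substring test; an aligner would query an FM-index instead.
        if read[start:i + 1] not in reference:
            differences += 1
            start = i + 1
        bounds.append(differences)
    return bounds


# Toy example: the read differs from the reference at one position.
example_bounds = mismatch_lower_bounds("ACGGACGT", "ACGTACGT")
```

During a search that allows at most k differences, any partial alignment whose remaining portion has a bound greater than the remaining budget can be abandoned early, so a tighter bound prunes more of the search space.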
CGAP-align is a new time efficient read alignment tool that extends and improves BWA. The increase in alignment speed will be of critical assistance to all sequence-based research and medicine. CGAP-align is freely available to the academic community at http://sourceforge.net/p/cgap-align under the GNU General Public License (GPL).
Elephant endotheliotropic herpesvirus 1A is a member of the Proboscivirus genus and is a major cause of fatal hemorrhagic disease in endangered juvenile Asian elephants worldwide. Here, we report the first complete genome sequence from this genus, obtained directly from necropsy DNA, in which 60 of the 115 predicted genes are not found in any known herpesvirus.