The acute myeloid leukemia (AML) genome has been the subject of intensive research over the past four decades. New technologies, enabling characterization of the AML genome at increased resolution, have revealed deeper layers of complexity that have provided insights into the biological basis of this disease, nominated targets for therapy, and identified biomarkers predictive of response to therapy or long-term prognosis. Still, our understanding of AML genomics is incomplete. Recent publications have demonstrated that whole genome sequencing (WGS) of primary AML samples is feasible and can detect novel, clinically relevant mutations. New insights are emerging from this work, including the clonal heterogeneity of this disease and clonal evolution that occurs over time. Some of the novel mutations are highly recurrent (>20% of patients), but there appears to be a continuum of mutation frequency down to rare (<5%) or even singleton mutations that may be relevant for the biology of this disease. Large cohorts of well-annotated samples are needed to establish mutation frequencies, implicate biological pathways, and demonstrate genotype:phenotype correlations. Although many technical and logistical challenges must be overcome, the capacity of WGS to detect all classes of inherited and acquired genetic abnormalities makes it an attractive candidate for development as a clinical diagnostic test.
acute myeloid leukemia; genomics; next generation sequencing
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for the transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
massively parallel sequencing; next generation sequencing; human genome; variant detection; short read alignment; whole genome sequencing
Summary: Despite recent progress, computational tools that identify gene fusions from next-generation whole transcriptome sequencing data are often limited in accuracy and scalability. Here, we present a software package, BreakFusion, that combines the strengths of reference alignment followed by read-pair analysis and de novo assembly to achieve a good balance of sensitivity, specificity and computational efficiency.
Supplementary data are available at Bioinformatics online.
As part of the molecular revolution sweeping medicine, comprehensive genomic studies are adding powerful dimensions to medical research. However, their power exposes new regulatory, strategic, and quality assurance challenges for biorepositories. A key issue is that unlike other research techniques commonly applied to banked specimens, nucleic acid sequencing, if sufficiently extensive, yields data that could identify a patient. This evolving paradigm renders the concepts of anonymized and anonymous specimens increasingly outdated. The challenges for biorepositories in this new era include refined consent processes and wording, selection and use of legacy specimens, quality assurance procedures, institutional documentation, data sharing, and interaction with institutional review boards. Given current trends, biorepositories should consider these issues now, even if they are not currently experiencing sample requests for genomic analysis. We summarize our current experiences and best practices at Washington University Medical School, St Louis, MO, our perceptions of emerging trends, and recommendations.
Genomic studies; Biorepositories; Biobanks; Quality assurance; Regulatory standards
Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods under which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including indels, inversions, and translocations. We examined BreakDancer's performance in simulation, in comparison with other methods, in the analysis of an acute myeloid leukemia sample, and in the 1000 Genomes trio individuals. We found that it substantially improved the detection of small and intermediate size indels from 10 bp to 1 Mbp that are difficult to detect via a single conventional approach.
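The general idea behind read-pair detection of indels can be illustrated with a small sketch. This is not BreakDancer's implementation; it only assumes that pairs whose mapped insert size deviates strongly from the library's typical insert size are candidate evidence for a deletion (pairs map too far apart) or an insertion (pairs map too close together). The function name `classify_pairs`, the robust median/MAD estimate, and the `cutoff` of 3 are all hypothetical choices made for illustration.

```python
from statistics import median

def classify_pairs(insert_sizes, cutoff=3.0):
    """Label each read pair by comparing its mapped insert size to the library.

    The library centre is the median insert size; the spread is a robust
    standard deviation estimated from the median absolute deviation (MAD).
    Both the estimator and the cutoff are illustrative assumptions.
    """
    med = median(insert_sizes)
    mad = median([abs(x - med) for x in insert_sizes])
    sd = max(1.4826 * mad, 1.0)  # guard against a zero spread
    labels = []
    for x in insert_sizes:
        if x > med + cutoff * sd:
            labels.append("deletion")    # mates map farther apart than expected
        elif x < med - cutoff * sd:
            labels.append("insertion")   # mates map closer together than expected
        else:
            labels.append("normal")
    return labels

if __name__ == "__main__":
    sizes = [300, 310, 290, 305, 295, 300, 1200, 120]
    for size, label in zip(sizes, classify_pairs(sizes)):
        print(size, label)
```

A real caller would additionally cluster anomalous pairs by genomic position and use mate orientation to recognise inversions and translocations; none of that is shown here.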
The human Y chromosome began to evolve from an autosome hundreds of millions of years ago, acquiring a sex-determining function and undergoing a series of inversions that suppressed crossing over with the X chromosome [1,2]. Little is known about the Y chromosome’s recent evolution because only the human Y chromosome has been fully sequenced. Prevailing theories hold that Y chromosomes evolve by gene loss, the pace of which slows over time, eventually leading to a paucity of genes, and stasis [3,4]. These theories have been buttressed by partial sequence data from newly emergent plant and animal Y chromosomes [5-8], but they have not been tested in older, highly evolved Y chromosomes like that of humans. We therefore finished sequencing the male-specific region of the Y chromosome (MSY) in our closest living relative, the chimpanzee, achieving levels of accuracy and completion previously reached for the human MSY. We then compared the MSYs of the two species and found that they differ radically in sequence structure and gene content, implying rapid evolution during the past 6 million years. The chimpanzee MSY harbors twice as many massive palindromes as the human MSY, yet it has lost large fractions of the MSY protein-coding genes and gene families present in the last common ancestor. We suggest that the extraordinary divergence of the chimpanzee and human MSYs was driven by four synergistic factors: the MSY’s prominent role in sperm production, genetic hitchhiking effects in the absence of meiotic crossing over, frequent ectopic recombination within the MSY, and species differences in mating behavior. While genetic decay may be the principal dynamic in the evolution of newly emergent Y chromosomes, wholesale renovation is the paramount theme in the ongoing evolution of chimpanzee, human, and perhaps other older MSYs.
The St. Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project (PCGP) is participating in the international effort to identify somatic mutations that drive cancer. These cancer genome sequencing efforts will not only yield an unparalleled view of the altered signaling pathways in cancer but should also identify new targets against which novel therapeutics can be developed. Although these projects are still deep in the phase of generating primary DNA sequence data, important results are emerging and valuable community resources are being generated that should catalyze future cancer research. We describe here the rationale for conducting the PCGP, present some of the early results of this project and discuss the major lessons learned and how these will affect the application of genomic sequencing in the clinic.
Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/, implemented in C and supported on Linux and Mac OS X.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
We developed CREST (Clipping REveals STructure), an algorithm that uses next-generation sequencing reads with partial alignments to a reference genome to directly map structural variations at the nucleotide level of resolution. Application of CREST to whole-genome sequencing data from five pediatric T-lineage acute lymphoblastic leukemias (T-ALLs) and a human melanoma cell line, COLO-829, identified 160 somatic structural variations. The experimental validation rate exceeded 80%, demonstrating that CREST has high predictive accuracy.
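Reads that align only partially to the reference appear in SAM-format alignments as soft-clipped CIGAR segments, and the position where the clip begins marks a candidate breakpoint at single-base resolution. The following is a minimal sketch of that first step only, not CREST itself. It assumes SAM conventions (1-based leftmost `POS`, `S` for soft clip, `H` for hard clip); the function names and the `min_clip` and `min_support` thresholds are hypothetical.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

def clip_breakpoints(pos, cigar, min_clip=5):
    """Return (side, reference_coordinate) for each soft clip of a read.

    pos is the 1-based coordinate of the leftmost aligned base.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    core = [o for o in ops if o[1] != "H"]  # hard clips carry no sequence
    points = []
    if not core:
        return points
    if core[0][1] == "S" and core[0][0] >= min_clip:
        points.append(("left", pos))  # first aligned base
    if len(core) > 1 and core[-1][1] == "S" and core[-1][0] >= min_clip:
        ref_len = sum(n for n, op in core if op in REF_CONSUMING)
        points.append(("right", pos + ref_len - 1))  # last aligned base
    return points

def candidate_breakpoints(reads, min_support=2):
    """Tally clip positions over (chrom, pos, cigar) records.

    Positions supported by at least min_support reads are returned sorted.
    """
    counts = Counter()
    for chrom, pos, cigar in reads:
        for side, coord in clip_breakpoints(pos, cigar):
            counts[(chrom, side, coord)] += 1
    return sorted(key for key, n in counts.items() if n >= min_support)

if __name__ == "__main__":
    reads = [("chr1", 100, "60M40S"), ("chr1", 110, "50M50S"), ("chr1", 300, "100M")]
    print(candidate_breakpoints(reads))
```

Turning such candidates into validated structural variations requires further steps (for example assembling the clipped sequence and realigning it), which this sketch does not attempt.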
To identify somatic mutations in paediatric diffuse intrinsic pontine gliomas (DIPGs), we performed whole genome sequencing of 7 DIPGs and matched germline DNA, and targeted sequencing of an additional 43 DIPGs and 36 non-brainstem paediatric glioblastomas (non-BS-PGs). 78% of DIPGs and 22% of non-BS-PGs contained a p.K27M mutation in H3F3A, encoding histone H3.3, or in the related HIST1H3B, encoding histone H3.1. An additional 14% of non-BS-PGs had somatic p.G34R H3F3A mutations.
Motivation: The expansion of cancer genome sequencing continues to stimulate development of analytical tools for inferring relationships between somatic changes and tumor development. Pathway associations are especially consequential, but existing algorithms are demonstrably inadequate.
Methods: Here, we propose the PathScan significance test for the scenario where pathway mutations collectively contribute to tumor development. Its design addresses two aspects that established methods neglect. First, we account for variations in gene length and the consequent differences in their mutation probabilities under the standard null hypothesis of random mutation. The associated spike in computational effort is mitigated by accurate convolution-based approximation. Second, we combine individual probabilities into a multiple-sample value using Fisher–Lancaster theory, thereby improving differentiation between a few highly mutated genes and many genes having only a few mutations apiece. We investigate accuracy, computational effort and power, reporting acceptable performance for each.
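The two ideas described above can be illustrated with a short sketch. It is not the PathScan implementation: it assumes each gene is mutated at least once with probability 1 - (1 - rate)^length under a uniform background rate, computes the per-sample pathway tail probability by exact convolution rather than the approximation mentioned in the text, and combines samples with plain Fisher's method, omitting Lancaster's correction for discrete data. All function names and the background rate are hypothetical.

```python
from math import exp, log

def gene_mutation_prob(length_bp, rate):
    """Probability that a gene of the given length has at least one mutation."""
    return 1.0 - (1.0 - rate) ** length_bp

def pathway_tail_prob(probs, k):
    """P(at least k genes mutated) for independent genes with unequal probabilities.

    The distribution of the mutated-gene count is built by convolving one
    Bernoulli distribution per gene.
    """
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1.0 - p)   # this gene is not mutated
            new[j + 1] += q * p       # this gene is mutated
        dist = new
    return sum(dist[k:])

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p) is chi-square with 2m degrees of freedom,
    whose upper tail has a closed form for even degrees of freedom.
    """
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half = -sum(log(p) for p in pvalues)  # half of the chi-square statistic
    term, total = 1.0, 0.0
    for i in range(len(pvalues)):
        total += term
        term *= half / (i + 1)
    return exp(-half) * total

if __name__ == "__main__":
    lengths = [1500, 30000, 4200]  # hypothetical gene lengths in bp
    probs = [gene_mutation_prob(n, 1e-6) for n in lengths]
    per_sample = [pathway_tail_prob(probs, k) for k in (1, 0, 2)]
    print(fisher_combine(per_sample))
```

Because longer genes receive larger mutation probabilities, a hit in a very long gene contributes less evidence than a hit in a short one, which is the length effect the text describes.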
Results: As an example calculation, we re-analyze KEGG-based lung adenocarcinoma pathway mutations from the Tumor Sequencing Project. Our test recapitulates the most significant pathways and finds that others for which the original test battery was inconclusive are not actually significant. It also identifies the focal adhesion pathway as being significantly mutated, a finding consistent with earlier studies. We also expand this analysis to other databases: Reactome, BioCarta, Pfam, PID and SMART, finding additional hits in ErbB and EPHA signaling pathways and regulation of telomerase. All have implications and plausible mechanistic roles in cancer. Finally, we discuss aspects of extending the method to integrate gene-specific background rates and other types of genetic anomalies.
Availability: PathScan is implemented in Perl and is available from the Genome Institute at: http://genome.wustl.edu/software/pathscan.
Supplementary information: Supplementary data are available at Bioinformatics online.
Alterations in DNA methylation have been implicated in the pathogenesis of myelodysplastic syndromes (MDS), although the underlying mechanism remains largely unknown. Methylation of CpG dinucleotides is mediated by DNA methyltransferases, including DNMT1, DNMT3A, and DNMT3B. DNMT3A mutations have recently been reported in patients with de novo acute myeloid leukemia (AML), providing a rationale for examining the status of DNMT3A in MDS samples. Here, we report the frequency of DNMT3A mutations in patients with de novo MDS, and their association with secondary AML. We sequenced all coding exons of DNMT3A using DNA from bone marrow and paired normal cells from 150 patients with MDS and identified 13 heterozygous mutations with predicted translational consequences in 12/150 patients (8.0%). Amino acid R882, located in the methyltransferase domain of DNMT3A, was the most common mutation site, accounting for 4/13 mutations. DNMT3A mutations were expressed in the majority of cells in all tested mutant samples regardless of blast counts, suggesting that DNMT3A mutations occur early in the course of MDS. Patients with DNMT3A mutations had worse overall survival compared to patients without DNMT3A mutations (p=0.005) and more rapid progression to AML (p=0.007), suggesting that DNMT3A mutation status may have prognostic value in de novo MDS.
myelodysplastic syndrome; DNMT3A; mutation
The Oxytricha trifallax mitochondrial genome contains the largest sequenced ciliate mitochondrial chromosome (∼70 kb) plus a ∼5-kb linear plasmid bearing mitochondrial telomeres. We identify two new ciliate split genes (rps3 and nad2) as well as four new mitochondrial genes (ribosomal small subunit protein genes: rps- 2, 7, 8, 10), previously undetected in ciliates due to their extreme divergence. The increased size of the Oxytricha mitochondrial genome relative to other ciliates is primarily a consequence of terminal expansions, rather than the retention of ancestral mitochondrial genes. Successive segmental duplications, visible in one of the two Oxytricha mitochondrial subterminal regions, appear to have contributed to the genome expansion. Consistent with pseudogene formation and decay, the subtermini possess shorter, more loosely packed open reading frames than the remainder of the genome. The mitochondrial plasmid shares a 251-bp region with 82% identity to the mitochondrial chromosome, suggesting that it most likely integrated into the chromosome at least once. This region on the chromosome is also close to the end of the most terminal member of a series of duplications, hinting at a possible association between the plasmid and the duplications. The presence of mitochondrial telomeres on the mitochondrial plasmid suggests that such plasmids may be a vehicle for lateral transfer of telomeric sequences between mitochondrial genomes. We conjecture that the extreme divergence observed in ciliate mitochondrial genomes may be due, in part, to repeated invasions by relatively error-prone DNA polymerase-bearing mobile elements.
split genes; segmental duplication; genome expansion; linear mitochondrial plasmid; mobile elements; extreme mitochondrial divergences
New DNA sequencing platforms have revolutionized human genome sequencing. The dramatic advances in genome sequencing technologies predict that the $1,000 genome will become a reality within the next few years. Applied to cancer, the availability of cancer genome sequences permits real-time decision-making with the potential to affect diagnosis, prognosis, and treatment, and has opened the door towards personalized medicine. A promising strategy is the identification of mutated tumor antigens, and the design of personalized cancer vaccines. Supporting this notion are preliminary analyses of the epitope landscape in breast cancer suggesting that individual tumors express significant numbers of novel antigens to the immune system that can be specifically targeted through cancer vaccines.
cancer genome sequencing; unique tumor antigen; DNA vaccine
To compare clinical, immunohistochemical and gene expression models of prognosis applicable to formalin-fixed, paraffin-embedded blocks in a large series of estrogen receptor positive breast cancers, from patients uniformly treated with adjuvant tamoxifen.
qRT-PCR assays for 50 genes identifying intrinsic breast cancer subtypes were completed on 786 specimens linked to clinical (median follow-up 11.7 years) and immunohistochemical (ER, PR, HER2, Ki67) data. Performance of predefined intrinsic subtype and Risk-Of-Relapse scores was assessed using multivariable Cox models and Kaplan-Meier analysis. Harrell’s C index was used to compare fixed models trained in independent data sets, including proliferation signatures.
Despite clinical ER positivity, 10% of cases were assigned to non-Luminal subtypes. qRT-PCR signatures for proliferation genes gave more prognostic information than clinical assays for hormone receptors or Ki67. In Cox models incorporating standard prognostic variables, hazard ratios for breast cancer disease-specific survival over the first 5 years of follow-up, relative to the most common Luminal A subtype, are 1.99 (95% CI: 1.09–3.64) for Luminal B, 3.65 (1.64–8.16) for HER2-enriched and 17.71 (1.71–183.33) for the basal-like subtype. For node-negative disease, PAM50 qRT-PCR based risk assignment weighted for tumor size and proliferation identifies a group with >95% 10-year survival without chemotherapy. In node-positive disease, PAM50-based prognostic models were also superior.
The PAM50 gene expression test for intrinsic biological subtype can be applied to large series of formalin-fixed paraffin-embedded breast cancers, and gives more prognostic information than clinical factors and immunohistochemistry using standard cutpoints.
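Harrell's C index, used above to compare prognostic models, is the fraction of comparable patient pairs in which the patient with the higher predicted risk has the shorter observed survival. The sketch below is a minimal illustration rather than the study's analysis code. It assumes a pair is comparable only when the patient with the shorter follow-up time had an event, skips pairs with tied times for simplicity, and counts tied risk scores as half concordant; the function name and the toy data are hypothetical.

```python
def harrell_c(times, events, risks):
    """Concordance index for right-censored survival data.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher means worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # tied times are skipped in this simplified version
            first, second = (i, j) if times[i] < times[j] else (j, i)
            if not events[first]:
                continue  # the earlier patient was censored, so the order is unknown
            comparable += 1
            if risks[first] > risks[second]:
                concordant += 1.0
            elif risks[first] == risks[second]:
                concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

if __name__ == "__main__":
    times = [2.0, 5.5, 7.1, 11.7]      # years of follow-up (toy values)
    events = [1, 1, 0, 0]              # 1 = event observed, 0 = censored
    risks = [0.9, 0.4, 0.6, 0.1]       # toy model scores
    print(harrell_c(times, events, risks))
```

A value of 0.5 corresponds to a model with no discriminating ability and 1.0 to perfect ranking, so two fixed models can be compared by evaluating both on the same patients.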
The full complement of DNA mutations that are responsible for the pathogenesis of acute myeloid leukemia (AML) is not yet known.
We used massively parallel DNA sequencing to obtain a very high level of coverage (approximately 98%) of a primary, cytogenetically normal, de novo genome for AML with minimal maturation (AML-M1) and a matched normal skin genome.
We identified 12 acquired (somatic) mutations within the coding sequences of genes and 52 somatic point mutations in conserved or regulatory portions of the genome. All mutations appeared to be heterozygous and present in nearly all cells in the tumor sample. Four of the 64 mutations occurred in at least 1 additional AML sample in 188 samples that were tested. Mutations in NRAS and NPM1 had been identified previously in patients with AML, but two other mutations had not been identified. One of these mutations, in the IDH1 gene, was present in 15 of 187 additional AML genomes tested and was strongly associated with normal cytogenetic status; it was present in 13 of 80 cytogenetically normal samples (16%). The other was a nongenic mutation in a genomic region with regulatory potential and conservation in higher mammals; we detected it in one additional AML tumor. The AML genome that we sequenced contains approximately 750 point mutations, of which only a small fraction are likely to be relevant to pathogenesis.
By comparing the sequences of tumor and skin genomes of a patient with AML-M1, we have identified recurring mutations that may be relevant for pathogenesis.
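The reported association between the IDH1 mutation and normal cytogenetics can be examined with a one-sided Fisher exact test. The sketch below is illustrative only and does not reproduce the authors' statistics. It assumes the 80 cytogenetically normal samples are a subset of the 187 additional genomes, which would leave 2 mutated samples among the remaining 107; that 2x2 table is an inference from the counts in the text, not a figure stated in the source. The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are 'successes' (the one-sided Fisher exact test p-value)."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / total

if __name__ == "__main__":
    # Assumed table: 15 IDH1-mutated samples among 187 genomes,
    # 13 of them among the 80 cytogenetically normal samples.
    p = hypergeom_tail(k=13, K=15, n=80, N=187)
    print(f"one-sided p = {p:.2e}")
```

Under the null hypothesis of no association, roughly 6 of the 15 mutated samples would be expected in the cytogenetically normal group, so observing 13 is what drives the small tail probability.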
The application of next-generation sequencing technology has produced a transformation in cancer genomics, generating large data sets that can be analyzed in different ways to answer a multitude of questions about the genomic alterations associated with the disease. Analytical approaches can discover focused mutations such as substitutions and small insertions/deletions, large structural alterations and copy number events. As our capacity to produce such data for multiple cancers of the same type improves, the demand to analyze multiple tumor genomes simultaneously is growing as well. For example, pathway-based analyses that provide the full mutational impact on cellular protein networks and correlation analyses aimed at revealing causal relationships between genomic alterations and clinical presentations are both enabled. As the repertoire of data grows to include mRNA-seq, non-coding RNA-seq and methylation for multiple genomes, our challenge will be to intelligently integrate data types and genomes to produce a coherent picture of the genetic basis of cancer.
Genome-based studies of metazoan evolution are most informative when phylogenetically diverse species are incorporated in the analysis. As such, comparisons that focus only on Caenorhabditis elegans have revealed less about evolutionary trends within and outside the phylum Nematoda. Herein, we present a draft of the 64 megabase nuclear genome of Trichinella spiralis, containing 15,808 protein coding genes. This parasitic nematode is an extant member of a clade that diverged early in the evolution of the phylum, enabling identification of archetypical genes and molecular signatures exclusive to nematodes. Comparative analyses support intrachromosomal rearrangements across the phylum, disproportionate numbers of protein family deaths over births in parasitic vs. a non-parasitic nematode, and a preponderance of gene loss and gain events in nematodes relative to Drosophila melanogaster. This sequence and the panphylum characteristics identified herein will advance evolutionary studies and strategies to combat global parasites of humans, food animals and crops.
We took advantage of the unusual genomic organization of the ciliate Oxytricha trifallax to screen for eukaryotic non-coding RNA (ncRNA) genes. Ciliates have two types of nuclei: a germ line micronucleus that is usually transcriptionally inactive, and a somatic macronucleus that contains a reduced, fragmented and rearranged genome that expresses all genes required for growth and asexual reproduction. In some ciliates including Oxytricha, the macronuclear genome is particularly extreme, consisting of thousands of tiny ‘nanochromosomes’, each of which usually contains only a single gene. Because the organism itself identifies and isolates most of its genes on single-gene nanochromosomes, nanochromosome structure could facilitate the discovery of unusual genes or gene classes, such as ncRNA genes. Using a draft Oxytricha genome assembly and a custom-written protein-coding genefinding program, we identified a subset of nanochromosomes that lack any detectable protein-coding gene, thereby strongly enriching for nanochromosomes that carry ncRNA genes. We found only a small proportion of non-coding nanochromosomes, suggesting that Oxytricha has few independent ncRNA genes besides homologs of already known RNAs. Other than new members of known ncRNA classes including C/D and H/ACA snoRNAs, our screen identified one new family of small RNA genes, named the Arisong RNAs, which share some of the features of small nuclear RNAs.
Clostridium difficile is a common cause of infectious diarrhea in hospitalized patients. A severe and increased incidence of C. difficile infection (CDI) is associated predominantly with the NAP1 strain; however, the existence of other severe-disease-associated (SDA) strains and the extensive genetic diversity across C. difficile complicate reliable detection and diagnosis. Comparative genome analysis of 14 sequenced genomes, including those of a subset of NAP1 isolates, allowed the assessment of genetic diversity within and between strain types to identify DNA markers that are associated with severe disease. Comparative genome analysis of 14 isolates, including five publicly available strains, revealed that C. difficile has a core genome of 3.4 Mb, comprising ∼3,000 genes. Analysis of the core genome identified candidate DNA markers that were subsequently evaluated using a multistrain panel of 177 isolates, representing more than 50 pulsovars and 8 toxinotypes. A subset of 117 isolates from the panel had associated patient data that allowed assessment of an association between the DNA markers and severe CDI. We identified 20 candidate DNA markers for species-wide detection and 10,683 single nucleotide polymorphisms (SNPs) associated with the predominant SDA strain (NAP1). A species-wide detection candidate marker, the sspA gene, was found to be the same across 177 sequenced isolates and lacked significant similarity to those of other species. Candidate SNPs in genes CD1269 and CD1265 were found to associate more closely with disease severity than currently used diagnostic markers, as they were also present in the toxin A-negative and B-positive (A-B+) strain types. The genetic markers identified illustrate the potential of comparative genomics for the discovery of diagnostic DNA-based targets that are species specific or associated with multiple SDA strains.
Whole-genome analysis of human tumors has identified some unsuspected tumor-associated genes.
Unbiased sequencing and analysis of human tumors is revealing unsuspected somatic changes that, upon further study, are elucidating aspects of tumor biology and identifying new biomarkers.
Zinc is an essential trace element involved in a wide range of biological processes and human diseases. Zinc excess is deleterious, and animals require mechanisms to protect against zinc toxicity. To identify genes that modulate zinc tolerance, we performed a forward genetic screen for Caenorhabditis elegans mutants that were resistant to zinc toxicity. Here we demonstrate that mutations of the C. elegans histidine ammonia lyase (haly-1) gene promote zinc tolerance. C. elegans haly-1 encodes a protein that is homologous to vertebrate HAL, an enzyme that converts histidine to urocanic acid. haly-1 mutant animals displayed elevated levels of histidine, indicating that C. elegans HALY-1 protein is an enzyme involved in histidine catabolism. These results suggest the model that elevated histidine chelates zinc and thereby reduces zinc toxicity. Supporting this hypothesis, we demonstrated that dietary histidine promotes zinc tolerance. Nickel is another metal that binds histidine with high affinity. We demonstrated that haly-1 mutant animals are resistant to nickel toxicity and dietary histidine promotes nickel tolerance in wild-type animals. These studies identify a novel role for haly-1 and histidine in zinc metabolism and may be relevant for other animals.
Zinc is an essential nutrient that is critical for human health. However, excess zinc can cause toxicity, indicating that regulatory mechanisms are necessary to maintain homeostasis. The analysis of mechanisms that promote zinc homeostasis can elucidate fundamental regulatory processes and suggest new approaches for treating disorders of zinc metabolism. To discover genes that modulate zinc tolerance, we screened for C. elegans mutants that were resistant to zinc toxicity. Here we demonstrate that mutations of the histidine ammonia lyase (haly-1) gene promote zinc tolerance. haly-1 encodes a protein that is similar to vertebrate HAL, an enzyme that converts histidine to urocanic acid. Mutations in the human HAL gene cause elevated levels of serum histidine and abnormal zinc metabolism. Mutations in C. elegans haly-1 cause elevated levels of histidine, suggesting that histidine causes resistance to excess zinc. Consistent with this hypothesis, we demonstrated that dietary histidine promoted tolerance to excess zinc in wild-type worms. Mutations in haly-1 and supplemental dietary histidine also caused resistance to nickel, another metal that can bind histidine. A likely mechanism of protection is chelation of zinc and nickel by histidine. These studies suggest that histidine plays a physiological role in zinc metabolism.
Motivation: DNA copy number aberration (CNA) is a hallmark of genomic abnormality in tumor cells. Recurrent CNA (RCNA) occurs in multiple cancer samples across the same chromosomal region and has greater implication in tumorigenesis. Current commonly used methods for RCNA identification require CNA calling for individual samples before cross-sample analysis. This two-step strategy may result in a heavy computational burden, as well as a loss of the overall statistical power due to segmentation and discretization of individual sample's data. We propose a population-based approach for RCNA detection with no need of single-sample analysis, which is statistically powerful, computationally efficient and particularly suitable for high-resolution and large-population studies.
Results: Our approach, correlation matrix diagonal segmentation (CMDS), identifies RCNAs based on a between-chromosomal-site correlation analysis. Directly using the raw intensity ratio data from all samples and adopting a diagonal transformation strategy, CMDS substantially reduces computational burden and can obtain results very quickly from large datasets. Our simulation indicates that the statistical power of CMDS is higher than that of single-sample CNA calling based two-step approaches. We applied CMDS to two real datasets of lung cancer and brain cancer from Affymetrix and Illumina array platforms, respectively, and successfully identified known regions of CNA associated with EGFR, KRAS and other important oncogenes. CMDS provides a fast, powerful and easily implemented tool for the RCNA analysis of large-scale data from cancer genomes.
Availability: The R and C programs implementing our method are available at https://dsgweb.wustl.edu/qunyuan/software/cmds.
Supplementary information: Supplementary data are available at Bioinformatics online.
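The intuition behind a between-site correlation approach can be sketched as follows. This is a simplified illustration and not the published CMDS algorithm: it assumes that when a subset of samples shares an aberration spanning neighbouring sites, the raw intensity ratios at those sites are correlated across samples, so averaging the correlations in a band around the diagonal of the site-by-site correlation matrix highlights recurrent regions without calling CNAs per sample. The function names, the band `width`, and the simulated data are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def band_scores(data, width=2):
    """Score each site by its mean correlation with neighbouring sites.

    data[s][i] is the intensity ratio of sample s at site i. Only entries
    within `width` of the correlation matrix diagonal are computed.
    At least two sites are required.
    """
    n_sites = len(data[0])
    cols = [[row[i] for row in data] for i in range(n_sites)]
    scores = []
    for i in range(n_sites):
        nbrs = [j for j in range(max(0, i - width), min(n_sites, i + width + 1)) if j != i]
        scores.append(sum(pearson(cols[i], cols[j]) for j in nbrs) / len(nbrs))
    return scores

def simulate(n_samples=40, n_sites=12, region=range(4, 8), seed=0):
    """Toy data: every second sample carries a gain across `region`; the rest is noise."""
    rng = random.Random(seed)
    data = []
    for s in range(n_samples):
        gain = 1.0 if s % 2 == 0 else 0.0
        data.append([(gain if i in region else 0.0) + rng.gauss(0.0, 0.1)
                     for i in range(n_sites)])
    return data

if __name__ == "__main__":
    for site, score in enumerate(band_scores(simulate())):
        print(site, round(score, 2))
```

Restricting the computation to a diagonal band keeps the cost linear in the number of sites for a fixed width, which is the kind of saving that makes a population-level approach attractive for high-resolution arrays.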