The acute myeloid leukemia (AML) genome has been the subject of intensive research over the past four decades. New technologies, enabling characterization of the AML genome at increased resolution, have revealed deeper layers of complexity that have provided insights into the biological basis of this disease, nominated targets for therapy, and identified biomarkers predictive of response to therapy or long-term prognosis. Still, our understanding of AML genomics is incomplete. Recent publications have demonstrated that whole genome sequencing (WGS) of primary AML samples is feasible and can detect novel, clinically relevant mutations. New insights are emerging from this work, including the clonal heterogeneity of this disease and clonal evolution that occurs over time. Some of the novel mutations are highly recurrent (>20% of patients), but there appears to be a continuum of mutation frequency down to rare (<5%) or even singleton mutations that may be relevant for the biology of this disease. Large cohorts of well-annotated samples are needed to establish mutation frequencies, implicate biological pathways, and demonstrate genotype:phenotype correlations. Although many technical and logistical challenges must be overcome, the capacity of WGS to detect all classes of inherited and acquired genetic abnormalities makes it an attractive candidate for development as a clinical diagnostic test.
acute myeloid leukemia; genomics; next generation sequencing
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for the transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
massively parallel sequencing; next generation sequencing; human genome; variant detection; short read alignment; whole genome sequencing
Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/, implemented in C and supported on Linux and Mac OS X.
Supplementary information: Supplementary data are available at Bioinformatics online.
We developed CREST (Clipping REveals STructure), an algorithm that uses next-generation sequencing reads with partial alignments to a reference genome to directly map structural variations at the nucleotide level of resolution. Application of CREST to whole-genome sequencing data from five pediatric T-lineage acute lymphoblastic leukemias (T-ALLs) and a human melanoma cell line, COLO-829, identified 160 somatic structural variations. The experimental validation rate exceeded 80%, demonstrating that CREST has high predictive accuracy.
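The idea of using partially aligned reads to locate breakpoints can be illustrated with a small sketch. This is not the CREST algorithm itself; it only shows, under simplifying assumptions, how soft-clipped alignments (CIGAR operation `S` in the SAM format) point at candidate breakpoint positions. The function names, the `(pos, cigar)` input tuples and the `min_support` threshold are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def soft_clip_breakpoints(pos, cigar):
    """Return (side, reference_position) pairs implied by soft clips.

    pos is the 1-based leftmost aligned reference position (SAM POS).
    A left clip reports the first aligned base; a right clip reports
    the last aligned base.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    core = [(n, op) for n, op in ops if op != "H"]  # ignore hard clips
    found = []
    if core and core[0][1] == "S":
        found.append(("left", pos))
    ref_len = sum(n for n, op in core if op in REF_CONSUMING)
    if len(core) > 1 and core[-1][1] == "S":
        found.append(("right", pos + ref_len - 1))
    return found


def candidate_breakpoints(reads, min_support=2):
    """Keep breakpoints supported by at least min_support clipped reads."""
    counts = Counter()
    for pos, cigar in reads:
        counts.update(soft_clip_breakpoints(pos, cigar))
    return sorted(bp for bp, n in counts.items() if n >= min_support)


if __name__ == "__main__":
    reads = [(100, "30S70M"), (100, "25S75M"), (31, "70M30S"), (500, "100M")]
    print(candidate_breakpoints(reads))
```

A real caller would additionally assemble the clipped sequence and realign it to find the partner breakpoint; that step is omitted here.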
To identify somatic mutations in paediatric diffuse intrinsic pontine gliomas (DIPGs), we performed whole genome sequencing of 7 DIPGs and matched germline DNA, and targeted sequencing of an additional 43 DIPGs and 36 non-brainstem paediatric glioblastomas (non-BS-PGs). 78% of DIPGs and 22% of non-BS-PGs contained a p.K27M mutation in H3F3A, encoding histone H3.3, or in the related HIST1H3B, encoding histone H3.1. An additional 14% of non-BS-PGs had somatic p.G34R H3F3A mutations.
Motivation: The expansion of cancer genome sequencing continues to stimulate development of analytical tools for inferring relationships between somatic changes and tumor development. Pathway associations are especially consequential, but existing algorithms are demonstrably inadequate.
Methods: Here, we propose the PathScan significance test for the scenario where pathway mutations collectively contribute to tumor development. Its design addresses two aspects that established methods neglect. First, we account for variations in gene length and the consequent differences in their mutation probabilities under the standard null hypothesis of random mutation. The associated spike in computational effort is mitigated by accurate convolution-based approximation. Second, we combine individual probabilities into a multiple-sample value using Fisher–Lancaster theory, thereby improving differentiation between a few highly mutated genes and many genes having only a few mutations apiece. We investigate accuracy, computational effort and power, reporting acceptable performance for each.
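The two ingredients described above can be sketched in a few lines. This is a minimal illustration, not the PathScan implementation: it assumes a single per-base background rate, computes the exact distribution of the number of mutated genes by direct convolution (the paper describes an approximation for speed), and combines per-sample p-values with the plain Fisher chi-square statistic rather than the Lancaster extension for discrete data. All function names are hypothetical.

```python
import math


def gene_mutation_prob(length_bp, rate_per_bp):
    """Probability that a gene of the given length carries >= 1 mutation,
    assuming independent mutation at each base with the same rate."""
    return 1.0 - (1.0 - rate_per_bp) ** length_bp


def mutated_gene_count_pmf(probs):
    """Distribution of the number of mutated genes in a pathway, built by
    convolving one Bernoulli variable per gene (unequal probabilities)."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this gene is not mutated
            nxt[k + 1] += mass * p       # this gene is mutated
        pmf = nxt
    return pmf


def sample_p_value(probs, observed):
    """P(at least `observed` genes mutated) under random mutation."""
    return sum(mutated_gene_count_pmf(probs)[observed:])


def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2n degrees
    of freedom; for even degrees of freedom the tail has a closed form."""
    half = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for j in range(1, len(p_values)):
        term *= half / j
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    lengths = [1200, 3500, 800, 15000]           # hypothetical gene lengths
    probs = [gene_mutation_prob(n, 1e-5) for n in lengths]
    per_sample = [sample_p_value(probs, k) for k in (1, 2, 0)]
    print(fisher_combine(per_sample))
```

Because longer genes receive larger mutation probabilities, a hit in a very long gene contributes less evidence than a hit in a short one, which is the gene-length effect the test is designed to capture.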
Results: As an example calculation, we re-analyze KEGG-based lung adenocarcinoma pathway mutations from the Tumor Sequencing Project. Our test recapitulates the most significant pathways and finds that others for which the original test battery was inconclusive are not actually significant. It also identifies the focal adhesion pathway as being significantly mutated, a finding consistent with earlier studies. We also expand this analysis to other databases: Reactome, BioCarta, Pfam, PID and SMART, finding additional hits in ErbB and EPHA signaling pathways and regulation of telomerase. All have implications and plausible mechanistic roles in cancer. Finally, we discuss aspects of extending the method to integrate gene-specific background rates and other types of genetic anomalies.
Availability: PathScan is implemented in Perl and is available from the Genome Institute at: http://genome.wustl.edu/software/pathscan.
Supplementary information: Supplementary data are available at Bioinformatics online.
Alterations in DNA methylation have been implicated in the pathogenesis of myelodysplastic syndromes (MDS), although the underlying mechanism remains largely unknown. Methylation of CpG dinucleotides is mediated by DNA methyltransferases, including DNMT1, DNMT3A, and DNMT3B. DNMT3A mutations have recently been reported in patients with de novo acute myeloid leukemia (AML), providing a rationale for examining the status of DNMT3A in MDS samples. Here, we report the frequency of DNMT3A mutations in patients with de novo MDS, and their association with secondary AML. We sequenced all coding exons of DNMT3A using DNA from bone marrow and paired normal cells from 150 patients with MDS and identified 13 heterozygous mutations with predicted translational consequences in 12/150 patients (8.0%). Amino acid R882, located in the methyltransferase domain of DNMT3A, was the most common mutation site, accounting for 4/13 mutations. DNMT3A mutations were expressed in the majority of cells in all tested mutant samples regardless of blast counts, suggesting that DNMT3A mutations occur early in the course of MDS. Patients with DNMT3A mutations had worse overall survival compared to patients without DNMT3A mutations (p=0.005) and more rapid progression to AML (p=0.007), suggesting that DNMT3A mutation status may have prognostic value in de novo MDS.
myelodysplastic syndrome; DNMT3A; mutation
The Oxytricha trifallax mitochondrial genome contains the largest sequenced ciliate mitochondrial chromosome (∼70 kb) plus a ∼5-kb linear plasmid bearing mitochondrial telomeres. We identify two new ciliate split genes (rps3 and nad2) as well as four new mitochondrial genes (ribosomal small subunit protein genes rps2, rps7, rps8 and rps10), previously undetected in ciliates due to their extreme divergence. The increased size of the Oxytricha mitochondrial genome relative to other ciliates is primarily a consequence of terminal expansions, rather than the retention of ancestral mitochondrial genes. Successive segmental duplications, visible in one of the two Oxytricha mitochondrial subterminal regions, appear to have contributed to the genome expansion. Consistent with pseudogene formation and decay, the subtermini possess shorter, more loosely packed open reading frames than the remainder of the genome. The mitochondrial plasmid shares a 251-bp region with 82% identity to the mitochondrial chromosome, suggesting that it most likely integrated into the chromosome at least once. This region on the chromosome is also close to the end of the most terminal member of a series of duplications, hinting at a possible association between the plasmid and the duplications. The presence of mitochondrial telomeres on the mitochondrial plasmid suggests that such plasmids may be a vehicle for lateral transfer of telomeric sequences between mitochondrial genomes. We conjecture that the extreme divergence observed in ciliate mitochondrial genomes may be due, in part, to repeated invasions by relatively error-prone DNA polymerase-bearing mobile elements.
split genes; segmental duplication; genome expansion; linear mitochondrial plasmid; mobile elements; extreme mitochondrial divergences
To compare clinical, immunohistochemical and gene expression models of prognosis applicable to formalin-fixed, paraffin-embedded blocks in a large series of estrogen receptor positive breast cancers, from patients uniformly treated with adjuvant tamoxifen.
qRT-PCR assays for 50 genes identifying intrinsic breast cancer subtypes were completed on 786 specimens linked to clinical (median follow-up 11.7 years) and immunohistochemical (ER, PR, HER2, Ki67) data. Performance of predefined intrinsic subtype and Risk-Of-Relapse scores was assessed using multivariable Cox models and Kaplan-Meier analysis. Harrell’s C index was used to compare fixed models trained in independent data sets, including proliferation signatures.
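For readers unfamiliar with the concordance measure mentioned above, the following is a minimal sketch of Harrell's C index for right-censored survival data. It is not the study's analysis code. It assumes that a higher risk score should correspond to earlier failure, counts a pair as comparable only when the subject with the shorter time had an observed event, gives tied scores half credit, and skips pairs with tied times. The function name and example values are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the subject who failed
    earlier also had the higher predicted risk.

    times:       follow-up time for each subject
    events:      1 (or True) if the event was observed, 0 if censored
    risk_scores: model output, higher meaning higher risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [2.0, 4.0, 6.0, 8.0]      # hypothetical follow-up in years
    events = [1, 1, 0, 1]             # third subject is censored
    scores = [0.9, 0.7, 0.5, 0.1]     # hypothetical risk scores
    print(harrell_c_index(times, events, scores))
```

A value of 0.5 corresponds to a model with no discriminating ability and 1.0 to perfect ranking, which is what makes the index convenient for comparing fixed models on the same cohort.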
Despite clinical ER positivity, 10% of cases were assigned to non-Luminal subtypes. qRT-PCR signatures for proliferation genes gave more prognostic information than clinical assays for hormone receptors or Ki67. In Cox models incorporating standard prognostic variables, hazard ratios for breast cancer disease-specific survival over the first 5 years of follow-up, relative to the most common Luminal A subtype, were 1.99 (95% CI: 1.09–3.64) for Luminal B, 3.65 (1.64–8.16) for HER2-enriched and 17.71 (1.71–183.33) for the basal-like subtype. For node-negative disease, PAM50 qRT-PCR-based risk assignment weighted for tumor size and proliferation identified a group with >95% 10-year survival without chemotherapy. In node-positive disease, PAM50-based prognostic models were also superior.
The PAM50 gene expression test for intrinsic biological subtype can be applied to large series of formalin-fixed paraffin-embedded breast cancers, and gives more prognostic information than clinical factors and immunohistochemistry using standard cutpoints.
The full complement of DNA mutations that are responsible for the pathogenesis of acute myeloid leukemia (AML) is not yet known.
We used massively parallel DNA sequencing to obtain a very high level of coverage (approximately 98%) of a primary, cytogenetically normal, de novo genome for AML with minimal maturation (AML-M1) and a matched normal skin genome.
We identified 12 acquired (somatic) mutations within the coding sequences of genes and 52 somatic point mutations in conserved or regulatory portions of the genome. All mutations appeared to be heterozygous and present in nearly all cells in the tumor sample. Four of the 64 mutations occurred in at least 1 additional AML sample in 188 samples that were tested. Mutations in NRAS and NPM1 had been identified previously in patients with AML, but two other mutations had not been identified. One of these mutations, in the IDH1 gene, was present in 15 of 187 additional AML genomes tested and was strongly associated with normal cytogenetic status; it was present in 13 of 80 cytogenetically normal samples (16%). The other was a nongenic mutation in a genomic region with regulatory potential and conservation in higher mammals; we detected it in one additional AML tumor. The AML genome that we sequenced contains approximately 750 point mutations, of which only a small fraction are likely to be relevant to pathogenesis.
By comparing the sequences of tumor and skin genomes of a patient with AML-M1, we have identified recurring mutations that may be relevant for pathogenesis.
The application of next-generation sequencing technology has produced a transformation in cancer genomics, generating large data sets that can be analyzed in different ways to answer a multitude of questions about the genomic alterations associated with the disease. Analytical approaches can discover focused mutations such as substitutions and small insertions/deletions, large structural alterations and copy number events. As our capacity to produce such data for multiple cancers of the same type improves, so does the demand to analyze multiple tumor genomes simultaneously. For example, this enables both pathway-based analyses that capture the full mutational impact on cellular protein networks and correlation analyses aimed at revealing causal relationships between genomic alterations and clinical presentations. As the repertoire of data grows to include mRNA-seq, non-coding RNA-seq and methylation for multiple genomes, our challenge will be to intelligently integrate data types and genomes to produce a coherent picture of the genetic basis of cancer.
Genome-based studies of metazoan evolution are most informative when phylogenetically diverse species are incorporated in the analysis. Consequently, analyses of evolutionary trends within and outside the phylum Nematoda have been less revealing when they focus only on comparisons involving Caenorhabditis elegans. Herein, we present a draft of the 64 megabase nuclear genome of Trichinella spiralis, containing 15,808 protein coding genes. This parasitic nematode is an extant member of a clade that diverged early in the evolution of the phylum, enabling identification of archetypical genes and molecular signatures exclusive to nematodes. Comparative analyses support intrachromosomal rearrangements across the phylum, disproportionate numbers of protein family deaths over births in parasitic nematodes vs. a non-parasitic nematode, and a preponderance of gene loss and gain events in nematodes relative to Drosophila melanogaster. This sequence and the panphylum characteristics identified herein will advance evolutionary studies and strategies to combat global parasites of humans, food animals and crops.
We took advantage of the unusual genomic organization of the ciliate Oxytricha trifallax to screen for eukaryotic non-coding RNA (ncRNA) genes. Ciliates have two types of nuclei: a germ line micronucleus that is usually transcriptionally inactive, and a somatic macronucleus that contains a reduced, fragmented and rearranged genome that expresses all genes required for growth and asexual reproduction. In some ciliates including Oxytricha, the macronuclear genome is particularly extreme, consisting of thousands of tiny ‘nanochromosomes’, each of which usually contains only a single gene. Because the organism itself identifies and isolates most of its genes on single-gene nanochromosomes, nanochromosome structure could facilitate the discovery of unusual genes or gene classes, such as ncRNA genes. Using a draft Oxytricha genome assembly and a custom-written protein-coding genefinding program, we identified a subset of nanochromosomes that lack any detectable protein-coding gene, thereby strongly enriching for nanochromosomes that carry ncRNA genes. We found only a small proportion of non-coding nanochromosomes, suggesting that Oxytricha has few independent ncRNA genes besides homologs of already known RNAs. Other than new members of known ncRNA classes including C/D and H/ACA snoRNAs, our screen identified one new family of small RNA genes, named the Arisong RNAs, which share some of the features of small nuclear RNAs.
Clostridium difficile is a common cause of infectious diarrhea in hospitalized patients. A severe and increased incidence of C. difficile infection (CDI) is associated predominantly with the NAP1 strain; however, the existence of other severe-disease-associated (SDA) strains and the extensive genetic diversity across C. difficile complicate reliable detection and diagnosis. Comparative genome analysis of 14 sequenced genomes, including those of a subset of NAP1 isolates, allowed the assessment of genetic diversity within and between strain types to identify DNA markers that are associated with severe disease. Comparative genome analysis of 14 isolates, including five publicly available strains, revealed that C. difficile has a core genome of 3.4 Mb, comprising ∼3,000 genes. Analysis of the core genome identified candidate DNA markers that were subsequently evaluated using a multistrain panel of 177 isolates, representing more than 50 pulsovars and 8 toxinotypes. A subset of 117 isolates from the panel had associated patient data that allowed assessment of an association between the DNA markers and severe CDI. We identified 20 candidate DNA markers for species-wide detection and 10,683 single nucleotide polymorphisms (SNPs) associated with the predominant SDA strain (NAP1). A species-wide detection candidate marker, the sspA gene, was found to be the same across 177 sequenced isolates and lacked significant similarity to those of other species. Candidate SNPs in genes CD1269 and CD1265 were found to associate more closely with disease severity than currently used diagnostic markers, as they were also present in the toxin A-negative and B-positive (A-B+) strain types. The genetic markers identified illustrate the potential of comparative genomics for the discovery of diagnostic DNA-based targets that are species specific or associated with multiple SDA strains.
Whole-genome analysis of human tumors has identified some unsuspected tumor-associated genes.
Unbiased sequencing and analysis of human tumors is revealing unsuspected somatic changes that, upon further study, are elucidating aspects of tumor biology and identifying new biomarkers.
Zinc is an essential trace element involved in a wide range of biological processes and human diseases. Zinc excess is deleterious, and animals require mechanisms to protect against zinc toxicity. To identify genes that modulate zinc tolerance, we performed a forward genetic screen for Caenorhabditis elegans mutants that were resistant to zinc toxicity. Here we demonstrate that mutations of the C. elegans histidine ammonia lyase (haly-1) gene promote zinc tolerance. C. elegans haly-1 encodes a protein that is homologous to vertebrate HAL, an enzyme that converts histidine to urocanic acid. haly-1 mutant animals displayed elevated levels of histidine, indicating that C. elegans HALY-1 protein is an enzyme involved in histidine catabolism. These results suggest the model that elevated histidine chelates zinc and thereby reduces zinc toxicity. Supporting this hypothesis, we demonstrated that dietary histidine promotes zinc tolerance. Nickel is another metal that binds histidine with high affinity. We demonstrated that haly-1 mutant animals are resistant to nickel toxicity and dietary histidine promotes nickel tolerance in wild-type animals. These studies identify a novel role for haly-1 and histidine in zinc metabolism and may be relevant for other animals.
Zinc is an essential nutrient that is critical for human health. However, excess zinc can cause toxicity, indicating that regulatory mechanisms are necessary to maintain homeostasis. The analysis of mechanisms that promote zinc homeostasis can elucidate fundamental regulatory processes and suggest new approaches for treating disorders of zinc metabolism. To discover genes that modulate zinc tolerance, we screened for C. elegans mutants that were resistant to zinc toxicity. Here we demonstrate that mutations of the histidine ammonia lyase (haly-1) gene promote zinc tolerance. haly-1 encodes a protein that is similar to vertebrate HAL, an enzyme that converts histidine to urocanic acid. Mutations in the human HAL gene cause elevated levels of serum histidine and abnormal zinc metabolism. Mutations in C. elegans haly-1 cause elevated levels of histidine, suggesting that histidine causes resistance to excess zinc. Consistent with this hypothesis, we demonstrated that dietary histidine promoted tolerance to excess zinc in wild-type worms. Mutations in haly-1 and supplemental dietary histidine also caused resistance to nickel, another metal that can bind histidine. A likely mechanism of protection is chelation of zinc and nickel by histidine. These studies suggest that histidine plays a physiological role in zinc metabolism.
Motivation: DNA copy number aberration (CNA) is a hallmark of genomic abnormality in tumor cells. Recurrent CNA (RCNA) occurs in multiple cancer samples across the same chromosomal region and has greater implications for tumorigenesis. Current commonly used methods for RCNA identification require CNA calling for individual samples before cross-sample analysis. This two-step strategy may result in a heavy computational burden, as well as a loss of overall statistical power due to segmentation and discretization of each individual sample's data. We propose a population-based approach for RCNA detection that requires no single-sample analysis and is statistically powerful, computationally efficient and particularly suitable for high-resolution and large-population studies.
Results: Our approach, correlation matrix diagonal segmentation (CMDS), identifies RCNAs based on a between-chromosomal-site correlation analysis. Directly using the raw intensity ratio data from all samples and adopting a diagonal transformation strategy, CMDS substantially reduces computational burden and can obtain results very quickly from large datasets. Our simulation indicates that the statistical power of CMDS is higher than that of two-step approaches based on single-sample CNA calling. We applied CMDS to two real datasets of lung cancer and brain cancer from Affymetrix and Illumina array platforms, respectively, and successfully identified known regions of CNA associated with EGFR, KRAS and other important oncogenes. CMDS provides a fast, powerful and easily implemented tool for the RCNA analysis of large-scale data from cancer genomes.
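The intuition behind scoring sites by between-site correlation can be shown with a toy example. This is a loose illustration under stated assumptions, not the published CMDS procedure: it assumes that neighbouring sites inside a recurrent aberration co-vary across samples, and it scores each site by its mean Pearson correlation with neighbours inside a narrow band around the diagonal of the site-by-site correlation matrix. The function names, the `window` parameter and the data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either
    sequence has zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def banded_correlation_scores(matrix, window=1):
    """Score each site by its mean correlation with nearby sites.

    matrix: list of samples, each a list of per-site intensity ratios.
    Only site pairs within `window` positions of each other are used,
    so the full site-by-site correlation matrix is never built.
    """
    n_sites = len(matrix[0])
    columns = [[row[s] for row in matrix] for s in range(n_sites)]
    scores = []
    for i in range(n_sites):
        lo, hi = max(0, i - window), min(n_sites - 1, i + window)
        neighbours = [j for j in range(lo, hi + 1) if j != i]
        scores.append(
            sum(pearson(columns[i], columns[j]) for j in neighbours)
            / len(neighbours)
        )
    return scores


if __name__ == "__main__":
    # Six samples by six sites; sites 2-4 share a pattern across samples.
    data = [
        [1, 1, 2, 4, 1, 0],
        [1, -1, 0, 0, 0, 0],
        [-1, -1, 2, 4, 1, 1],
        [-1, 1, 0, 0, 0, 1],
        [0, 0, 2, 4, 1, -1],
        [0, 0, 0, 0, 0, -1],
    ]
    print(banded_correlation_scores(data, window=1))
```

Restricting the computation to a band keeps the cost roughly linear in the number of sites for a fixed window, which is one way to see why working near the diagonal is cheap for high-resolution arrays.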
Availability: The R and C programs implementing our method are available at https://dsgweb.wustl.edu/qunyuan/software/cmds.
Supplementary information: Supplementary data are available at Bioinformatics online.
In birds, as in mammals, one pair of chromosomes differs between the sexes. In birds, males are ZZ and females ZW. In mammals, males are XY and females XX. Like the mammalian XY pair, the avian ZW pair is believed to have evolved from autosomes, with most change occurring in the chromosomes found in only one sex – the W and Y chromosomes [1–5]. By contrast, the sex chromosomes found in both sexes – the Z and X chromosomes – are assumed to have diverged little from their autosomal progenitors [2]. Here we report findings that overturn this assumption for both the chicken Z and human X chromosomes. The chicken Z chromosome, which we sequenced essentially to completion, is less gene-dense than chicken autosomes but contains a massive tandem array containing hundreds of duplicated genes expressed in testes. A comprehensive comparison of the chicken Z chromosome to the finished sequence of the human X chromosome demonstrates that each evolved independently from different portions of the ancestral genome. Despite this independence, the chicken Z and human X chromosomes share features that distinguish them from autosomes: the acquisition and amplification of testis-expressed genes, as well as a low gene density resulting from an expansion of intergenic regions. These features were not present on the autosomes from which the Z and X chromosomes originated but were instead acquired during the evolution of the Z and X as sex chromosomes. We conclude that the avian Z and mammalian X chromosomes followed convergent evolutionary trajectories, despite their evolving with opposite (female vs. male) systems of heterogamety. More broadly, in birds and mammals, sex chromosome evolution involved not only gene loss in sex-specific chromosomes, but also marked expansion and gene acquisition in sex chromosomes common to males and females.
Within the current worldwide epidemic of community-acquired Staphylococcus aureus infections, attention has focused on the role of methicillin-resistant strains. We characterized methicillin-susceptible strains that also contribute.
We tracked cultures from abscesses submitted to the microbiology laboratory at St. Louis Children’s Hospital. We also sought Panton-Valentine leukocidin (PVL) genes in methicillin-susceptible Staphylococcus aureus (MSSA) isolates, and we further characterized some isolates by multilocus sequence typing (MLST), pulsed-field gel electrophoresis (PFGE), antibiotic susceptibility, accessory gene regulator (agr) allele, and presence of the arcA gene of the arginine catabolic mobile element (ACME).
From 1999 to 2007, we detected a 250-fold increase in cultures of abscesses yielding methicillin-resistant Staphylococcus aureus (MRSA) and a 5-fold increase in abscess cultures yielding MSSA. MSSA isolates from abscesses and wounds were more likely to encode PVL than isolates from other sources. In contrast to PVL-negative isolates of MSSA which were genetically diverse, PVL-positive isolates were predominantly MLST 8, Agr type 1. More than half of PVL-positive MSSA isolates were resistant to erythromycin and susceptible to clindamycin with absence of inducible resistance, a pattern uncommon in PVL-negative MSSA but frequent in the USA300 clone of MRSA. In addition, PFGE of PVL-positive MSSA strains revealed the USA300 pattern.
In addition to methicillin-resistant strains, the current epidemic of Staphylococcus aureus infections includes infections caused by methicillin-susceptible strains that are closely related genetically and share phenotypic characteristics other than susceptibility to methicillin. These findings suggest that factors other than methicillin resistance are driving the epidemic.
Staphylococcus aureus; Panton-Valentine leukocidin; methicillin resistance
A genomic era of cancer studies is developing rapidly, fueled by the emergence of next-generation sequencing technologies that provide exquisite sensitivity and resolution. This article discusses several areas within cancer genomics that are being transformed by the application of new technology, and in the process are dramatically expanding our understanding of this disease. Although we anticipate that there will be many exciting discoveries in the near future, the ultimate success of these endeavors rests on our ability to translate what is learned into better diagnosis, treatment and prevention of cancer.
To date, few peptides in the complex mixture of platypus venom have been identified and sequenced, in part due to the limited amounts of platypus venom available to study. We have constructed and sequenced a cDNA library from an active platypus venom gland to identify the remaining components.
We identified 83 novel putative platypus venom genes from 13 toxin families, which are homologous to known toxins from a wide range of vertebrates (fish, reptiles, insectivores) and invertebrates (spiders, sea anemones, starfish). A number of these are expressed in tissues other than the venom gland, and at least three of these families (those with homology to toxins from distant invertebrates) may play non-toxin roles. Thus, further functional testing is required to confirm venom activity. However, the presence of similar putative toxins in such widely divergent species provides further evidence for the hypothesis that there are certain protein families that are selected preferentially during evolution to become venom peptides. We have also used homology with known proteins to speculate on the contributions of each venom component to the symptoms of platypus envenomation.
This study represents a step towards fully characterizing the first mammal venom transcriptome. We have found similarities between putative platypus toxins and those of a number of unrelated species, providing insight into the evolution of mammalian venom.
Summary: Massively parallel sequencing technologies hold incredible promise for the study of DNA sequence variation, particularly the identification of variants affecting human disease. The unprecedented throughput and relatively short read lengths of Roche/454, Illumina/Solexa, and other platforms have spurred development of a new generation of sequence alignment algorithms. Yet detection of sequence variants based on short read alignments remains challenging, and most currently available tools are limited to a single platform or aligner type. We present VarScan, an open source tool for variant detection that is compatible with several short read aligners. We demonstrate VarScan's ability to detect SNPs and indels with high sensitivity and specificity, in both Roche/454 sequencing of individuals and deep Illumina/Solexa sequencing of pooled samples.
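Variant detection from aligned short reads often reduces, at its simplest, to counting the bases observed at each position and applying thresholds. The sketch below illustrates that idea only; it is not VarScan's code, and the function name, parameter names and default thresholds are hypothetical values chosen for illustration.

```python
def call_variant(ref_base, base_counts, min_coverage=8, min_reads=2,
                 min_freq=0.2):
    """Return (alt_base, supporting_reads, allele_frequency) or None.

    ref_base:    reference base at this position
    base_counts: mapping of observed base -> number of reads showing it
    A variant is reported only if the position is covered deeply enough
    and the most common non-reference base passes both the read-count
    and the frequency threshold.
    """
    coverage = sum(base_counts.values())
    if coverage < min_coverage:
        return None
    alt_counts = {b: n for b, n in base_counts.items() if b != ref_base}
    if not alt_counts:
        return None
    alt_base, reads = max(alt_counts.items(), key=lambda kv: kv[1])
    freq = reads / coverage
    if reads >= min_reads and freq >= min_freq:
        return (alt_base, reads, freq)
    return None


if __name__ == "__main__":
    # Heterozygous-looking site in a single individual.
    print(call_variant("A", {"A": 12, "G": 8}))
    # Pooled sample: a rare allele needs a lower frequency threshold.
    print(call_variant("C", {"C": 980, "T": 20}, min_freq=0.01))
```

Lowering `min_freq` while requiring more supporting reads is one simple way to adapt the same rule from individual genotyping to deep sequencing of pooled samples, where true variants may be present at low frequency.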
Availability and Implementation: Source code and documentation are freely available at http://genome.wustl.edu/tools/cancer-genomics. VarScan is implemented as a Perl package and supported on Linux/UNIX, MS Windows and Mac OS X.
Supplementary information: Supplementary data are available at Bioinformatics online.