High-throughput sequencing allows the detection and quantification of frequencies of somatic single nucleotide variants (SNV) in heterogeneous tumor cell populations. In some cases, the evolutionary history and population frequency of the subclonal lineages of tumor cells present in the sample can be reconstructed from these SNV frequency measurements. But automated methods to do this reconstruction are not available and the conditions under which reconstruction is possible have not been described.
We describe the conditions under which the evolutionary history can be uniquely reconstructed from SNV frequencies from single or multiple samples from the tumor population and we introduce a new statistical model, PhyloSub, that infers the phylogeny and genotype of the major subclonal lineages represented in the population of cancer cells. It uses a Bayesian nonparametric prior over trees that groups SNVs into major subclonal lineages and automatically estimates the number of lineages and their ancestry. We sample from the joint posterior distribution over trees to identify evolutionary histories and cell population frequencies that have the highest probability of generating the observed SNV frequency data. When multiple phylogenies are consistent with a given set of SNV frequencies, PhyloSub represents the uncertainty in the tumor phylogeny using a “partial order plot”. Experiments on a simulated dataset and two real datasets comprising tumor samples from acute myeloid leukemia and chronic lymphocytic leukemia patients demonstrate that PhyloSub can infer both linear (or chain) and branching lineages and its inferences are in good agreement with ground truth, where it is available.
PhyloSub can be applied to frequencies of any “binary” somatic mutation, including SNVs as well as small insertions and deletions. The PhyloSub and partial order plot software is available from https://github.com/morrislab/phylosub/.
Reactome (http://www.reactome.org) is a manually curated open-source open-data resource of human pathways and reactions. The current version 46 describes 7088 human proteins (34% of the predicted human proteome), participating in 6744 reactions based on data extracted from 15 107 research publications with PubMed links. The Reactome Web site and analysis tool set have been completely redesigned to increase speed, flexibility and user friendliness. The data model has been extended to support annotation of disease processes due to infectious agents and to mutation.
Gramene (http://www.gramene.org) is a curated online resource for comparative functional genomics in crops and model plant species, currently hosting 27 fully and 10 partially sequenced reference genomes in its build number 38. Its strength derives from the application of a phylogenetic framework for genome comparison and the use of ontologies to integrate structural and functional annotation data. Whole-genome alignments complemented by phylogenetic gene family trees help infer syntenic and orthologous relationships. Genetic variation data, sequences and genome mappings available for 10 species, including Arabidopsis, rice and maize, help infer putative variant effects on genes and transcripts. The pathways section also hosts 10 species-specific metabolic pathways databases developed in-house or by our collaborators using Pathway Tools software, which facilitates searches for pathway, reaction and metabolite annotations, and allows analyses of user-defined expression datasets. Recently, we released a Plant Reactome portal featuring 133 curated rice pathways. This portal will be expanded for Arabidopsis, maize and other plant species. We continue to provide genetic and QTL maps and marker datasets developed by crop researchers. The project provides a unique community platform to support scientific research in plant genomics including studies in evolution, genetics, plant breeding, molecular biology, biochemistry and systems biology.
WormBase (http://www.wormbase.org/) is a highly curated resource dedicated to supporting research using the model organism Caenorhabditis elegans. With an electronic history predating the World Wide Web, WormBase contains information ranging from the sequence and phenotype of individual alleles to genome-wide studies generated using next-generation sequencing technologies. In recent years, we have expanded the contents to include data on additional nematodes of agricultural and medical significance, bringing the knowledge of C. elegans to bear on these systems and providing support for underserved research communities. Manual curation of the primary literature remains a central focus of the WormBase project, providing users with reliable, up-to-date and highly cross-linked information. In this update, we describe efforts to organize the original atomized and highly contextualized curated data into integrated syntheses of discrete biological topics. Next, we discuss our experiences coping with the vast increase in available genome sequences made possible through next-generation sequencing platforms. Finally, we describe some of the features and tools of the new WormBase Web site that help users better find and explore data of interest.
New strategies to combat complex human disease require systems approaches to biology that integrate experiments from cell lines, primary tissues and model organisms. We have developed Pathprint, a functional approach that compares gene expression profiles in a set of pathways, networks and transcriptionally regulated targets. It can be applied universally to gene expression profiles across species. Integration of large-scale profiling methods and curation of the public repository overcomes platform, species and batch effects to yield a standard measure of functional distance between experiments. We show that pathprints combine mouse and human blood developmental lineage, and can be used to identify new prognostic indicators in acute myeloid leukemia. The code and resources are available at http://compbio.sph.harvard.edu/hidelab/pathprint
Funded by the National Institutes of Health (NIH), the aim of the Model Organism ENCyclopedia of DNA Elements (modENCODE) project is to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for both model organisms C. elegans (worm) and D. melanogaster (fly). With a total size of just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data has already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or combine modENCODE data with other data sets. Unfortunately this can be a time consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
Using these resources provides a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to use more of their time using modENCODE data, and less time moving it around.
Model organisms are widely used for understanding basic biology, and have significantly contributed to the study of human disease. In recent years, genomic analysis has provided extensive evidence of widespread conservation of gene sequence and function amongst eukaryotes, allowing insights from model organisms to help decipher gene function in a wider range of species. The InterMOD consortium is developing an infrastructure based around the InterMine data warehouse system to integrate genomic and functional data from a number of key model organisms, leading the way to improved cross-species research. So far including budding yeast, nematode worm, fruit fly, zebrafish, rat and mouse, the project has set up data warehouses, synchronized data models, and created analysis tools and links between data from different species. The project unites a number of major model organism databases, improving both the consistency and accessibility of comparative research, to the benefit of the wider scientific community.
Most eukaryotic genes are interrupted by spliceosomal introns. The evolution of exon-intron structure remains mysterious despite rapid advance in genome sequencing technique. In this work, a novel approach is taken based on the assumptions that the evolution of exon-intron structure is a stochastic process, and that the characteristics of this process can be understood by examining its historical outcome, the present-day size distribution of internal translated exons (exon). Through the combination of simulation and modeling the size distribution of exons in different species, we propose a general random fragmentation process (GRFP) to characterize the evolution dynamics of exon-intron structure. This model accurately predicts the probability that an exon will be split by a new intron and the distribution of novel insertions along the length of the exon.
As the first observation from this model, we show that the chance for an exon to obtain an intron is proportional to its size to the 3rd power. We also show that such size dependence is nearly constant across gene, with the exception of the exons adjacent to the 5′ UTR. As the second conclusion from the model, we show that intron insertion loci follow a normal distribution with a mean of 0.5 (center of the exon) and a standard deviation of 0.11. Finally, we show that intron insertions within a gene are independent of each other for vertebrates, but are more negatively correlated for non-vertebrate. We use simulation to demonstrate that the negative correlation might result from significant intron loss during evolution, which could be explained by selection against multi-intron genes in these organisms.
The GRFP model suggests that intron gain is dynamic with a higher chance for longer exons; introns are inserted into exons randomly with the highest probability at the center of the exon. GRFP estimates that there are 78 introns in every 10 kb coding sequences for vertebrate genomes, agreeing with empirical observations. GRFP also estimates that there are significant intron losses in the evolution of non-vertebrate genomes, with extreme cases of around 57% intron loss in Drosophila melanogaster, 28% in Caenorhabditis elegans, and 24% in Oryza sativa.
Evolution of exon-intron structure; General random fragmentation process; Simulation
GBrowse is a mature web-based genome browser that is suitable for deployment on both public and private web sites. It supports most of genome browser features, including qualitative and quantitative (wiggle) tracks, track uploading, track sharing, interactive track configuration, semantic zooming and limited smooth track panning. As of version 2.0, GBrowse supports next-generation sequencing (NGS) data by providing for the direct display of SAM and BAM sequence alignment files. SAM/BAM tracks provide semantic zooming and support both local and remote data sources. This article provides step-by-step instructions for configuring GBrowse to display NGS data.
bioinformatics; genomics; DNA sequencing; genome browser; data visualization; data sharing
Discovering robust prognostic gene signatures as biomarkers using genomics data can be challenging. We have developed a simple but efficient method for discovering prognostic biomarkers in cancer gene expression data sets using modules derived from a highly reliable gene functional interaction network. When applied to breast cancer, we discover a novel 31-gene signature associated with patient survival. The signature replicates across 5 independent gene expression studies, and outperforms 48 published gene signatures. When applied to ovarian cancer, the algorithm identifies a 75-gene signature associated with patient survival. A Cytoscape plugin implementation of the signature discovery method is available at http://wiki.reactome.org/index.php/Reactome_FI_Cytoscape_Plugin
Reactome describes biological pathways as chemical reactions that closely mirror the actual physical interactions that occur in the cell. Recent extensions of our data model accommodate the annotation of cancer and other disease processes. First, we have extended our class of protein modifications to accommodate annotation of changes in amino acid sequence and the formation of fusion proteins to describe the proteins involved in disease processes. Second, we have added a disease attribute to reaction, pathway, and physical entity classes that uses disease ontology terms. To support the graphical representation of “cancer” pathways, we have adapted our Pathway Browser to display disease variants and events in a way that allows comparison with the wild type pathway, and shows connections between perturbations in cancer and other biological pathways. The curation of pathways associated with cancer, coupled with our efforts to create other disease-specific pathways, will interoperate with our existing pathway and network analysis tools. Using the Epidermal Growth Factor Receptor (EGFR) signaling pathway as an example, we show how Reactome annotates and presents the altered biological behavior of EGFR variants due to their altered kinase and ligand-binding properties, and the mode of action and specificity of anti-cancer therapeutics.
pathway database; pathway visualization; network visualization; cancer annotation; EGFR signaling
Chromosomal gain at 7q21 is a frequent event in esophageal adenocarcinoma (EAC). However, this event has not been mapped with fine resolution in a large EAC cohort and its association with clinical endpoints and functional relevance are unclear.
We used a cohort of 116 patients to fine map the 7q21 amplification using SNP microarrays. Prognostic significance and functional role of 7q21 amplification and its gene expression were explored.
Amplification of the 7q21 region was observed in 35% of tumors with a focal, minimal amplicon containing 6 genes. 7q21 amplification was associated with poor survival and analysis of gene expression identified CDK6 as the only gene in the minimal amplicon whose expression was also associated with poor survival. A low level amplification (10%) was observed at the 12q13 region containing the CDK6 homolog, CDK4. Both amplification and expression of CDK4 correlated with poor survival. A combined model of both CDK6 and CDK4 expression is a superior predictor of survival than either alone. Specific knockdown of CDK4 and/or CDK6 by siRNAs shows that they are required for proliferation of EAC cells and that their function is additive. PD-0332991 targets the kinase activity of both molecules and suppresses proliferation and anchorage-independence of EAC cells through activation of the pRB pathway.
We suggest that CDK6 is the driver of 7q21 amplification and that both CDK4 and CDK6 are prognostic markers and bona fide oncogenes in EAC. Targeting these molecules may constitute a viable new therapy for this disease.
Esophageal adenocarcinoma; CDK6; CDK4; PD-0332991
We screened 124 genes that are amplified in human HCC using a mouse hepatoblast model and identified 18 tumor-promoting genes, including CCND1 and its neighbor on 11q13.3, FGF19. Although it is widely assumed that CCND1 is the main driving oncogene of this common amplicon (15% frequency in HCC), both forward-transformation assays and RNAi-mediated inhibition in human HCC cells established that FGF19 is an equally important driver gene in HCC. Furthermore, clonal growth and tumorigenicity of HCC cells harboring the 11q13.3 amplicon were selectively inhibited by RNAi-mediated knockdown of CCND1 or FGF19, as well as by an anti-FGF19 antibody. These results show that 11q13.3 amplification could be an effective biomarker for patients most likely to respond to anti-FGF19 therapy.
The taxanes paclitaxel and docetaxel are widely used in the treatment of breast, ovarian, and other cancers. Although their cytotoxicity has been attributed to cell-cycle arrest through stabilization of microtubules, the mechanisms by which tumor cells die remains unclear. Paclitaxel has been shown to induce soluble tumor necrosis factor alpha (sTNF-α) production in macrophages, but the involvement of TNF production in taxane cytotoxicity or resistance in tumor cells has not been established. Our study aimed to correlate alterations in the TNF pathway with taxane cytotoxicity and the acquisition of taxane resistance.
MCF-7 cells or isogenic drug-resistant variants (developed by selection for surviving cells in increasing concentrations of paclitaxel or docetaxel) were assessed for sTNF-α production in the absence or presence of taxanes by enzyme-linked immunosorbent assay (ELISA) and for sensitivity to docetaxel or sTNF-α by using a clonogenic assay (in the absence or presence of TNFR1 or TNFR2 neutralizing antibodies). Nuclear factor (NF)-κB activity was also measured with ELISA, whereas gene-expression changes associated with docetaxel resistance in MCF-7 and A2780 cells were determined with microarray analysis and quantitative reverse transcription polymerase chain reaction (RTqPCR).
MCF-7 and A2780 cells increased production of sTNF-α in the presence of taxanes, whereas docetaxel-resistant variants of MCF-7 produced high levels of sTNF-α, although only within a particular drug-concentration threshold (between 3 and 45 nM). Increased production of sTNF-α was NF-κB dependent and correlated with decreased sensitivity to sTNF-α, decreased levels of TNFR1, and increased survival through TNFR2 and NF-κB activation. The NF-κB inhibitor SN-50 reestablished sensitivity to docetaxel in docetaxel-resistant MCF-7 cells. Gene-expression analysis of wild-type and docetaxel-resistant MCF-7, MDA-MB-231, and A2780 cells identified changes in the expression of TNF-α-related genes consistent with reduced TNF-induced cytotoxicity and activation of NF-κB survival pathways.
We report for the first time that taxanes can promote dose-dependent sTNF-α production in tumor cells at clinically relevant concentrations, which can contribute to their cytotoxicity. Defects in the TNF cytotoxicity pathway or activation of TNF-dependent NF-κB survival genes may, in contrast, contribute to taxane resistance in tumor cells. These findings may be of strong clinical significance.
In an effort to comprehensively characterize the functional elements within the genomes of the important model organisms Drosophila melanogaster and Caenorhabditis elegans, the NHGRI model organism Encyclopaedia of DNA Elements (modENCODE) consortium has generated an enormous library of genomic data along with detailed, structured information on all aspects of the experiments. The modMine database (http://intermine.modencode.org) described here has been built by the modENCODE Data Coordination Center to allow the broader research community to (i) search for and download data sets of interest among the thousands generated by modENCODE; (ii) access the data in an integrated form together with non-modENCODE data sets; and (iii) facilitate fine-grained analysis of the above data. The sophisticated search features are possible because of the collection of extensive experimental metadata by the consortium. Interfaces are provided to allow both biologists and bioinformaticians to exploit these rich modENCODE data sets now available via modMine.
Since its release in 2000, WormBase (http://www.wormbase.org) has grown from a small resource focusing on a single species and serving a dedicated research community, to one now spanning 15 species essential to the broader biomedical and agricultural research fields. To enhance the rate of curation, we have automated the identification of key data in the scientific literature and use similar methodology for data extraction. To ease access to the data, we are collaborating with journals to link entities in research publications to their report pages at WormBase. To facilitate discovery, we have added new views of the data, integrated large-scale datasets and expanded descriptions of models for human disease. Finally, we have introduced a dramatic overhaul of the WormBase website for public beta testing. Designed to balance complexity and usability, the new site is species-agnostic, highly customizable, and interactive. Casual users and developers alike will be able to leverage the public RESTful application programming interface (API) to generate custom data mining solutions and extensions to the site. We report on the growth of our database and on our work in keeping pace with the growing demand for data, efforts to anticipate the requirements of users and new collaborations with the larger science community.
Reactome is an open source, expert-authored, manually curated and peer-reviewed database of reactions, pathways and biological processes. We provide an intuitive web-based user interface to pathway knowledge and a suite of data analysis tools. The Reactome BioMart provides biologists and bioinformaticians with a single web interface for performing simple or elaborate queries of the Reactome database, aggregating data from different sources and providing an opportunity to integrate experimental and computational results with information relating to biological pathways.
Database URL: http://www.reactome.org
The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at http://www.modencode.org.
Database URL: http://www.modencode.org.
Chromatin immunoprecipitation (ChIP), coupled with massively parallel short-read sequencing (seq) is used to probe chromatin dynamics. Although there are many algorithms to call peaks from ChIP-seq datasets, most are tuned either to handle punctate sites, such as transcriptional factor binding sites, or broad regions, such as histone modification marks; few can do both. Other algorithms are limited in their configurability, performance on large data sets, and ability to distinguish closely-spaced peaks.
In this paper, we introduce PeakRanger, a peak caller software package that works equally well on punctate and broad sites, can resolve closely-spaced peaks, has excellent performance, and is easily customized. In addition, PeakRanger can be run in a parallel cloud computing environment to obtain extremely high performance on very large data sets. We present a series of benchmarks to evaluate PeakRanger against 10 other peak callers, and demonstrate the performance of PeakRanger on both real and synthetic data sets. We also present real world usages of PeakRanger, including peak-calling in the modENCODE project.
Compared to other peak callers tested, PeakRanger offers improved resolution in distinguishing extremely closely-spaced peaks. PeakRanger has above-average spatial accuracy in terms of identifying the precise location of binding events. PeakRanger also has excellent sensitivity and specificity in all benchmarks evaluated. In addition, PeakRanger offers significant improvements in run time when running on a single processor system, and very marked improvements when allowed to take advantage of the MapReduce parallel environment offered by a cloud computing resource. PeakRanger can be downloaded at the official site of modENCODE project: http://www.modencode.org/software/ranger/
With DNA sequencing now getting cheaper more quickly than data storage, the time may have come to use cloud computing for genome informatics.
With DNA sequencing now getting cheaper more quickly than data storage or computation, the time may have come for genome informatics to migrate to the cloud.
The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to fact that coding exons contain a series of three nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied on each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.
Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that, (1) Modified Wavelet Transform (MWT) can better define the boundary of TP region than the conventional Short Time Fourier Transform (STFT); (2) The scale parameter (a) of MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal to noise ratio; (3) RNA splicing sites have weaker TP signals than coding region; (4) TP signals in coding region can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic region can generate false positive signals and it can be removed with 6 bp MWT.
MWT can provide more precise TP boundaries than STFT and the boundaries can be further refined by bigger scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally-introduced frame-shift mutations help recover TP signal that have been lost by possible ancient frame-shifts. More importantly, TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
Reactome (http://www.reactome.org) is a collaboration among groups at the Ontario Institute for Cancer Research, Cold Spring Harbor Laboratory, New York University School of Medicine and The European Bioinformatics Institute, to develop an open source curated bioinformatics database of human pathways and reactions. Recently, we developed a new web site with improved tools for pathway browsing and data analysis. The Pathway Browser is an Systems Biology Graphical Notation (SBGN)-based visualization system that supports zooming, scrolling and event highlighting. It exploits PSIQUIC web services to overlay our curated pathways with molecular interaction data from the Reactome Functional Interaction Network and external interaction databases such as IntAct, BioGRID, ChEMBL, iRefIndex, MINT and STRING. Our Pathway and Expression Analysis tools enable ID mapping, pathway assignment and overrepresentation analysis of user-supplied data sets. To support pathway annotation and analysis in other species, we continue to make orthology-based inferences of pathways in non-human species, applying Ensembl Compara to identify orthologs of curated human proteins in each of 20 other species. The resulting inferred pathway sets can be browsed and analyzed with our Species Comparison tool. Collaborations are also underway to create manually curated data sets on the Reactome framework for chicken, Drosophila and rice.
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
Reactome is an open-source, freely available database of human biological pathways and processes. A major goal of our work is to provide an integrated view of cellular signalling processes that spans from ligand–receptor interactions to molecular readouts at the level of metabolic and transcriptional events. To this end, we have built the first catalogue of all human G protein-coupled receptors (GPCRs) known to bind endogenous or natural ligands. The UniProt database has records for 797 proteins classified as GPCRs and sorted into families A/1, B/2 and C/3 on the basis of amino accid sequence. To these records we have added details from the IUPHAR database and our own manual curation of relevant literature to create reactions in which 563 GPCRs bind ligands and also interact with specific G-proteins to initiate signalling cascades. We believe the remaining 234 GPCRs are true orphans. The Reactome GPCR pathway can be viewed as a detailed interactive diagram and can be exported in many forms. It provides a template for the orthology-based inference of GPCR reactions for diverse model organism species, and can be overlaid with protein–protein interaction and gene expression datasets to facilitate overrepresentation studies and other forms of pathway analysis.
Database URL: http://www.reactome.org