GBrowse is a mature web-based genome browser that is suitable for deployment on both public and private web sites. It supports most genome browser features, including qualitative and quantitative (wiggle) tracks, track uploading, track sharing, interactive track configuration, semantic zooming and limited smooth track panning. As of version 2.0, GBrowse supports next-generation sequencing (NGS) data by providing for the direct display of SAM and BAM sequence alignment files. SAM/BAM tracks provide semantic zooming and support both local and remote data sources. This article provides step-by-step instructions for configuring GBrowse to display NGS data.
bioinformatics; genomics; DNA sequencing; genome browser; data visualization; data sharing
Discovering robust prognostic gene signatures as biomarkers using genomics data can be challenging. We have developed a simple but efficient method for discovering prognostic biomarkers in cancer gene expression data sets using modules derived from a highly reliable gene functional interaction network. When applied to breast cancer, we discover a novel 31-gene signature associated with patient survival. The signature replicates across 5 independent gene expression studies, and outperforms 48 published gene signatures. When applied to ovarian cancer, the algorithm identifies a 75-gene signature associated with patient survival. A Cytoscape plugin implementation of the signature discovery method is available at http://wiki.reactome.org/index.php/Reactome_FI_Cytoscape_Plugin
Chromosomal gain at 7q21 is a frequent event in esophageal adenocarcinoma (EAC). However, this event has not been mapped with fine resolution in a large EAC cohort and its association with clinical endpoints and functional relevance are unclear.
We used a cohort of 116 patients to fine map the 7q21 amplification using SNP microarrays. Prognostic significance and functional role of 7q21 amplification and its gene expression were explored.
Amplification of the 7q21 region was observed in 35% of tumors with a focal, minimal amplicon containing 6 genes. 7q21 amplification was associated with poor survival and analysis of gene expression identified CDK6 as the only gene in the minimal amplicon whose expression was also associated with poor survival. A low level amplification (10%) was observed at the 12q13 region containing the CDK6 homolog, CDK4. Both amplification and expression of CDK4 correlated with poor survival. A combined model of CDK6 and CDK4 expression is a better predictor of survival than either alone. Specific knockdown of CDK4 and/or CDK6 by siRNAs shows that they are required for proliferation of EAC cells and that their function is additive. PD-0332991 targets the kinase activity of both molecules and suppresses proliferation and anchorage-independence of EAC cells through activation of the pRB pathway.
We suggest that CDK6 is the driver of 7q21 amplification and that both CDK4 and CDK6 are prognostic markers and bona fide oncogenes in EAC. Targeting these molecules may constitute a viable new therapy for this disease.
Esophageal adenocarcinoma; CDK6; CDK4; PD-0332991
We screened 124 genes that are amplified in human HCC using a mouse hepatoblast model and identified 18 tumor-promoting genes, including CCND1 and its neighbor on 11q13.3, FGF19. Although it is widely assumed that CCND1 is the main driving oncogene of this common amplicon (15% frequency in HCC), both forward-transformation assays and RNAi-mediated inhibition in human HCC cells established that FGF19 is an equally important driver gene in HCC. Furthermore, clonal growth and tumorigenicity of HCC cells harboring the 11q13.3 amplicon were selectively inhibited by RNAi-mediated knockdown of CCND1 or FGF19, as well as by an anti-FGF19 antibody. These results show that 11q13.3 amplification could be an effective biomarker for patients most likely to respond to anti-FGF19 therapy.
The taxanes paclitaxel and docetaxel are widely used in the treatment of breast, ovarian, and other cancers. Although their cytotoxicity has been attributed to cell-cycle arrest through stabilization of microtubules, the mechanisms by which tumor cells die remain unclear. Paclitaxel has been shown to induce soluble tumor necrosis factor alpha (sTNF-α) production in macrophages, but the involvement of TNF production in taxane cytotoxicity or resistance in tumor cells has not been established. Our study aimed to correlate alterations in the TNF pathway with taxane cytotoxicity and the acquisition of taxane resistance.
MCF-7 cells or isogenic drug-resistant variants (developed by selection for surviving cells in increasing concentrations of paclitaxel or docetaxel) were assessed for sTNF-α production in the absence or presence of taxanes by enzyme-linked immunosorbent assay (ELISA) and for sensitivity to docetaxel or sTNF-α by using a clonogenic assay (in the absence or presence of TNFR1 or TNFR2 neutralizing antibodies). Nuclear factor (NF)-κB activity was also measured with ELISA, whereas gene-expression changes associated with docetaxel resistance in MCF-7 and A2780 cells were determined with microarray analysis and quantitative reverse transcription polymerase chain reaction (RTqPCR).
MCF-7 and A2780 cells increased production of sTNF-α in the presence of taxanes, whereas docetaxel-resistant variants of MCF-7 produced high levels of sTNF-α, although only within a particular drug-concentration threshold (between 3 and 45 nM). Increased production of sTNF-α was NF-κB dependent and correlated with decreased sensitivity to sTNF-α, decreased levels of TNFR1, and increased survival through TNFR2 and NF-κB activation. The NF-κB inhibitor SN-50 reestablished sensitivity to docetaxel in docetaxel-resistant MCF-7 cells. Gene-expression analysis of wild-type and docetaxel-resistant MCF-7, MDA-MB-231, and A2780 cells identified changes in the expression of TNF-α-related genes consistent with reduced TNF-induced cytotoxicity and activation of NF-κB survival pathways.
We report for the first time that taxanes can promote dose-dependent sTNF-α production in tumor cells at clinically relevant concentrations, which can contribute to their cytotoxicity. Defects in the TNF cytotoxicity pathway or activation of TNF-dependent NF-κB survival genes may, in contrast, contribute to taxane resistance in tumor cells. These findings may be of strong clinical significance.
In an effort to comprehensively characterize the functional elements within the genomes of the important model organisms Drosophila melanogaster and Caenorhabditis elegans, the NHGRI model organism Encyclopaedia of DNA Elements (modENCODE) consortium has generated an enormous library of genomic data along with detailed, structured information on all aspects of the experiments. The modMine database (http://intermine.modencode.org) described here has been built by the modENCODE Data Coordination Center to allow the broader research community to (i) search for and download data sets of interest among the thousands generated by modENCODE; (ii) access the data in an integrated form together with non-modENCODE data sets; and (iii) facilitate fine-grained analysis of the above data. The sophisticated search features are possible because of the collection of extensive experimental metadata by the consortium. Interfaces are provided to allow both biologists and bioinformaticians to exploit these rich modENCODE data sets now available via modMine.
Since its release in 2000, WormBase (http://www.wormbase.org) has grown from a small resource focusing on a single species and serving a dedicated research community, to one now spanning 15 species essential to the broader biomedical and agricultural research fields. To enhance the rate of curation, we have automated the identification of key data in the scientific literature and use similar methodology for data extraction. To ease access to the data, we are collaborating with journals to link entities in research publications to their report pages at WormBase. To facilitate discovery, we have added new views of the data, integrated large-scale datasets and expanded descriptions of models for human disease. Finally, we have introduced a dramatic overhaul of the WormBase website for public beta testing. Designed to balance complexity and usability, the new site is species-agnostic, highly customizable, and interactive. Casual users and developers alike will be able to leverage the public RESTful application programming interface (API) to generate custom data mining solutions and extensions to the site. We report on the growth of our database and on our work in keeping pace with the growing demand for data, efforts to anticipate the requirements of users and new collaborations with the larger science community.
Reactome is an open source, expert-authored, manually curated and peer-reviewed database of reactions, pathways and biological processes. We provide an intuitive web-based user interface to pathway knowledge and a suite of data analysis tools. The Reactome BioMart provides biologists and bioinformaticians with a single web interface for performing simple or elaborate queries of the Reactome database, aggregating data from different sources and providing an opportunity to integrate experimental and computational results with information relating to biological pathways.
Database URL: http://www.reactome.org
The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at http://www.modencode.org.
Database URL: http://www.modencode.org.
Chromatin immunoprecipitation (ChIP) coupled with massively parallel short-read sequencing (seq) is used to probe chromatin dynamics. Although there are many algorithms to call peaks from ChIP-seq datasets, most are tuned either to handle punctate sites, such as transcription factor binding sites, or broad regions, such as histone modification marks; few can do both. Other algorithms are limited in their configurability, performance on large data sets, and ability to distinguish closely-spaced peaks.
In this paper, we introduce PeakRanger, a peak caller software package that works equally well on punctate and broad sites, can resolve closely-spaced peaks, has excellent performance, and is easily customized. In addition, PeakRanger can be run in a parallel cloud computing environment to obtain extremely high performance on very large data sets. We present a series of benchmarks to evaluate PeakRanger against 10 other peak callers, and demonstrate the performance of PeakRanger on both real and synthetic data sets. We also present real-world uses of PeakRanger, including peak-calling in the modENCODE project.
Compared to other peak callers tested, PeakRanger offers improved resolution in distinguishing extremely closely-spaced peaks. PeakRanger has above-average spatial accuracy in terms of identifying the precise location of binding events. PeakRanger also has excellent sensitivity and specificity in all benchmarks evaluated. In addition, PeakRanger offers significant improvements in run time when running on a single processor system, and very marked improvements when allowed to take advantage of the MapReduce parallel environment offered by a cloud computing resource. PeakRanger can be downloaded at the official site of the modENCODE project: http://www.modencode.org/software/ranger/
With DNA sequencing now getting cheaper more quickly than data storage, the time may have come to use cloud computing for genome informatics.
With DNA sequencing now getting cheaper more quickly than data storage or computation, the time may have come for genome informatics to migrate to the cloud.
The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.
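The windowed Fourier approach described above can be sketched in a few lines. This is a minimal illustration, not the method reported in this abstract: it assumes the common convention of converting the sequence into four base-indicator sequences and summing their spectral power at frequency 1/3. The function names, window size, step size and toy sequences are all hypothetical choices made for the example.

```python
import cmath

def tp_power(segment):
    """Power at frequency 1/3, summed over the A/C/G/T indicator
    sequences and normalized by the squared segment length."""
    n = len(segment)
    total = 0.0
    for base in "ACGT":
        # Discrete Fourier component at frequency 1/3 for this base's
        # 0/1 indicator sequence.
        component = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, ch in enumerate(segment)
            if ch == base
        )
        total += abs(component) ** 2
    return total / (n * n)

def sliding_tp(seq, window=120, step=30):
    """Return (window start, TP power) pairs along the sequence."""
    return [
        (start, tp_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Toy data (hypothetical): a perfectly 3-periodic block flanked by
# 4-periodic blocks that carry no period-3 signal.
flank = "ACGT" * 30      # 120 bp, period 4
coding = "GCT" * 40      # 120 bp, period 3
sequence = flank + coding + flank

for start, power in sliding_tp(sequence):
    print(start, round(power, 3))
```

In this toy sequence the window starting at position 120 covers the 3-periodic block exactly and scores about 1/3, while windows that straddle a boundary score lower. That smearing across the window is the localization limit the abstract refers to.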
Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that (1) the Modified Wavelet Transform (MWT) can better define the boundary of a TP region than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; and (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, which can be removed with a 6 bp MWT.
MWT can provide more precise TP boundaries than STFT, and the boundaries can be further refined by a larger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect splice junctions in fully spliced mRNA sequences.
Reactome (http://www.reactome.org) is a collaboration among groups at the Ontario Institute for Cancer Research, Cold Spring Harbor Laboratory, New York University School of Medicine and The European Bioinformatics Institute, to develop an open source curated bioinformatics database of human pathways and reactions. Recently, we developed a new web site with improved tools for pathway browsing and data analysis. The Pathway Browser is a Systems Biology Graphical Notation (SBGN)-based visualization system that supports zooming, scrolling and event highlighting. It exploits PSICQUIC web services to overlay our curated pathways with molecular interaction data from the Reactome Functional Interaction Network and external interaction databases such as IntAct, BioGRID, ChEMBL, iRefIndex, MINT and STRING. Our Pathway and Expression Analysis tools enable ID mapping, pathway assignment and overrepresentation analysis of user-supplied data sets. To support pathway annotation and analysis in other species, we continue to make orthology-based inferences of pathways in non-human species, applying Ensembl Compara to identify orthologs of curated human proteins in each of 20 other species. The resulting inferred pathway sets can be browsed and analyzed with our Species Comparison tool. Collaborations are also underway to create manually curated data sets on the Reactome framework for chicken, Drosophila and rice.
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
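Because GVF inherits the nine tab-delimited columns of GFF3, a record can be read with a small parser. The sketch below assumes the standard GFF3 column order and `key=value;` attribute syntax, and uses the `Variant_seq` and `Reference_seq` attribute tags from the GVF specification. The sample line is invented for illustration and is not taken from the 10Gen dataset.

```python
from urllib.parse import unquote

# The first eight GFF3/GVF columns; the ninth holds the attributes.
GVF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase")

def parse_gvf_line(line):
    """Parse one tab-delimited GVF feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GVF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multi-valued attributes are comma-separated; values may be
        # percent-encoded, as in GFF3.
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record

# Hypothetical example record: a heterozygous single nucleotide variant.
example = ("chr16\tsamtools\tSNV\t49291141\t49291141\t.\t+\t.\t"
           "ID=ID_1;Variant_seq=A,G;Reference_seq=G")
variant = parse_gvf_line(example)
print(variant["type"], variant["start"], variant["attributes"]["Variant_seq"])
```

The `type` column carries a Sequence Ontology term (here `SNV`), which is what lets tools interpret variants from different sources in a uniform way.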
Reactome is an open-source, freely available database of human biological pathways and processes. A major goal of our work is to provide an integrated view of cellular signalling processes that spans from ligand–receptor interactions to molecular readouts at the level of metabolic and transcriptional events. To this end, we have built the first catalogue of all human G protein-coupled receptors (GPCRs) known to bind endogenous or natural ligands. The UniProt database has records for 797 proteins classified as GPCRs and sorted into families A/1, B/2 and C/3 on the basis of amino acid sequence. To these records we have added details from the IUPHAR database and our own manual curation of relevant literature to create reactions in which 563 GPCRs bind ligands and also interact with specific G-proteins to initiate signalling cascades. We believe the remaining 234 GPCRs are true orphans. The Reactome GPCR pathway can be viewed as a detailed interactive diagram and can be exported in many forms. It provides a template for the orthology-based inference of GPCR reactions for diverse model organism species, and can be overlaid with protein–protein interaction and gene expression datasets to facilitate overrepresentation studies and other forms of pathway analysis.
Database URL: http://www.reactome.org
Linkage of the chromosome 1q21–25 region to type 2 diabetes has been demonstrated in multiple ethnic groups. We performed common variant fine-mapping across a 23-Mb interval in a multiethnic sample to search for variants responsible for this linkage signal.
RESEARCH DESIGN AND METHODS
In all, 5,290 single nucleotide polymorphisms (SNPs) were successfully genotyped in 3,179 type 2 diabetes case and control subjects from eight populations with evidence of 1q linkage. Samples were ascertained using strategies designed to enhance power to detect variants causal for 1q linkage. After imputation, we estimate ∼80% coverage of common variation across the region (r² > 0.8, Europeans). Association signals of interest were evaluated through in silico replication and de novo genotyping in ∼8,500 case subjects and 12,400 control subjects.
Association mapping of the 23-Mb region identified two strong signals, both of which were restricted to the subset of European-descent samples. The first mapped to the NOS1AP (CAPON) gene region (lead SNP: rs7538490, odds ratio 1.38 [95% CI 1.21–1.57], P = 1.4 × 10⁻⁶, in 999 case subjects and 1,190 control subjects); the second mapped within an extensive region of linkage disequilibrium that includes the ASH1L and PKLR genes (lead SNP: rs11264371, odds ratio 1.48 [1.18–1.76], P = 1.0 × 10⁻⁵, under a dominant model). However, there was no evidence for association at either signal on replication, and, across all data (>24,000 subjects), there was no indication that these variants were causally related to type 2 diabetes status.
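As a rough consistency check on the first signal, the reported confidence interval can be converted back into an approximate P value. The sketch below assumes a Wald-type interval that is symmetric on the log odds scale and a two-sided normal test. This is a common convention, but the abstract does not state which test the authors used, so the result is only expected to match the reported value in order of magnitude.

```python
import math

def se_from_ci(lower, upper, z_crit=1.959964):
    """Standard error of log(OR), recovered from a 95% confidence interval
    assumed to be symmetric on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)

def wald_p_value(odds_ratio, lower, upper):
    """Two-sided P value for log(OR) = 0 under a normal approximation."""
    z = math.log(odds_ratio) / se_from_ci(lower, upper)
    return math.erfc(abs(z) / math.sqrt(2))

# Values reported for rs7538490: odds ratio 1.38 [95% CI 1.21-1.57].
p_first = wald_p_value(1.38, 1.21, 1.57)
print("approximate P = %.2e" % p_first)
```

The approximation gives a P value on the order of 10⁻⁶, in line with the reported 1.4 × 10⁻⁶. A small difference is expected because the published odds ratio and interval are rounded to two decimal places.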
Detailed fine-mapping of the 23-Mb region of replicated linkage has failed to identify common variant signals contributing to the observed signal. Future studies should focus on identification of causal alleles of lower frequency and higher penetrance.
A high-quality human functional protein interaction network is constructed. Its utility is demonstrated in the identification of cancer candidate genes.
One challenge facing biologists is to tease out useful information from massive data sets for further analysis. A pathway-based analysis may shed light by projecting candidate genes onto protein functional relationship networks. We are building such a pathway-based analysis system.
We have constructed a protein functional interaction network by extending curated pathways with non-curated sources of information, including protein-protein interactions, gene coexpression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions, which cover close to 50% of the human proteome. By applying this network to two glioblastoma multiforme (GBM) data sets and projecting cancer candidate genes onto the network, we found that the majority of GBM candidate genes form a cluster and are closer than expected by chance, and the majority of GBM samples have sequence-altered genes in two network modules, one mainly comprising genes whose products are localized in the cytoplasm and plasma membrane, and another comprising gene products in the nucleus. Both modules are highly enriched in known oncogenes, tumor suppressors and genes involved in signal transduction. Similar network patterns were also found in breast, colorectal and pancreatic cancers.
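The claim that candidate genes are "closer than expected by chance" can be illustrated with a permutation test on network distance. The sketch below is one simple way to frame such a test, not the authors' procedure: it compares the mean pairwise shortest-path length among a gene set with that of randomly drawn node sets of the same size. The toy chain network, gene labels and function names are all hypothetical.

```python
import random
from collections import deque
from itertools import combinations

def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def mean_pairwise_distance(graph, genes):
    """Average shortest-path length over all gene pairs.
    Assumes the graph is connected."""
    dist_from = {g: bfs_distances(graph, g) for g in genes}
    pairs = list(combinations(genes, 2))
    return sum(dist_from[a][b] for a, b in pairs) / len(pairs)

def permutation_p_value(graph, genes, n_perm=1000, seed=0):
    """Fraction of random same-size node sets that are at least as
    tightly clustered as the observed gene set."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    observed = mean_pairwise_distance(graph, genes)
    hits = sum(
        mean_pairwise_distance(graph, rng.sample(nodes, len(genes))) <= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy network (hypothetical): a 20-node chain, with four adjacent
# nodes standing in for a cluster of candidate genes.
toy_graph = {i: set() for i in range(20)}
for i in range(19):
    toy_graph[i].add(i + 1)
    toy_graph[i + 1].add(i)

candidate_genes = [0, 1, 2, 3]
observed, p_value = permutation_p_value(toy_graph, candidate_genes)
print(observed, p_value)
```

On a real interaction network the null model would usually need more care, for example matching random gene sets on node degree, because well-studied genes tend to have many annotated interactions.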
We have built a highly reliable functional interaction network upon expert-curated pathways and applied this network to the analysis of two genome-wide GBM and several other cancer data sets. The network patterns revealed by our results suggest common mechanisms in cancer biology. Our system should provide a foundation for a network or pathway-based analysis platform for cancer and other diseases.
Despite the successes of genomics, little is known about how genetic information produces complex organisms. A look at the crucial functional elements of fly and worm genomes could change that.
Insulators are DNA sequences that control the interactions among genomic regulatory elements and act as chromatin boundaries. A thorough understanding of their location and function is necessary to address the complexities of metazoan gene regulation. We studied by ChIP–chip the genome-wide binding sites of 6 insulator-associated proteins—dCTCF, CP190, BEAF-32, Su(Hw), Mod(mdg4), and GAF—to obtain the first comprehensive map of insulator elements in Drosophila embryos. We identify over 14,000 putative insulators, including all classically defined insulators. We find two major classes of insulators defined by dCTCF/CP190/BEAF-32 and Su(Hw), respectively. Distributional analyses of insulators revealed that particular sub-classes of insulator elements are excluded between cis-regulatory elements and their target promoters; divide differentially expressed, alternative, and divergent promoters; act as chromatin boundaries; are associated with chromosomal breakpoints among species; and are embedded within active chromatin domains. Together, these results provide a map demarcating the boundaries of gene regulatory units and a framework for understanding insulator function during the development and evolution of Drosophila.
The spatiotemporal specificity of gene expression is controlled by interactions among regulatory proteins, cis-regulatory elements, chromatin modifications, and genes. These interactions can occur over large distances, and the mechanisms by which they are controlled are poorly understood. Insulators are DNA sequences that can both block the interaction between regulatory elements and genes and block the spread of regions of modified chromatin. To date, relatively few insulators have been identified in developing Drosophila embryos. We here present the genome-wide identification of over 14,000 binding sites for 6 insulator-associated proteins. We demonstrate the existence of two broad classes of insulators. Insulators of both classes are enriched at the boundaries of a particular chromatin modification. However, only insulators bound by BEAF-32, CP190, and dCTCF are enriched in regions of open chromatin or demarcate gene boundaries, with a particular enrichment between differentially expressed promoters. Furthermore, insulators of this class are enriched at points of chromosomal rearrangement among the 12 species of sequenced Drosophila, suggesting that insulator-defined regulatory boundaries are evolutionarily conserved.
Linkage of the chromosome 1q21-25 region to type 2 diabetes has been demonstrated in multiple ethnic groups. We performed common variant fine-mapping across a 23Mb interval in a multiethnic sample to search for variants responsible for this linkage signal.
Research Design and Methods
In all, 5,290 SNPs were successfully genotyped in 3,179 T2D cases and controls from eight populations with evidence of 1q linkage. Samples were ascertained using strategies designed to enhance power to detect variants causal for 1q-linkage. Following imputation, we estimate ~80% coverage of common variation across the region (r² > 0.8, Europeans). Association signals of interest were evaluated through in silico replication and de novo genotyping in approximately 8,500 cases and 12,400 controls.
Association mapping of the 23Mb region identified two strong signals, both restricted to the subset of European-descent samples. The first mapped to the NOS1AP (CAPON) gene region (lead SNP: rs7538490, OR 1.38 [95% CI, 1.21-1.57], p = 1.4 × 10⁻⁶ in 999 cases and 1,190 controls); the second mapped within an extensive region of linkage disequilibrium that includes the ASH1L and PKLR genes (lead SNP: rs11264371, OR 1.48 [1.18-1.76], p = 1.0 × 10⁻⁵, under a dominant model). However, there was no evidence for association at either signal on replication, and, across all data (>24,000 subjects), no indication that these variants were causally related to T2D status.
Detailed fine-mapping of the 23Mb region of replicated linkage has failed to identify common variant signals contributing to the observed signal. Future studies should focus on identification of causal alleles of lower frequency and higher penetrance.
chromosome 1q; linkage; association
Bioinformatics is alive and well in 2008, concludes Lincoln Stein, despite his earlier prediction of its imminent demise.
Bioinformatics has become too central to biology to be left to specialist bioinformaticians. Biologists are all bioinformaticians now.
WormBase (http://www.wormbase.org) is a central data repository for nematode biology. Initially created as a service to the Caenorhabditis elegans research field, WormBase has evolved into a powerful research tool in its own right. In the past 2 years, we expanded WormBase to include the complete genomic sequence, gene predictions and orthology assignments from a range of related nematodes. These comparative data enrich the C. elegans data with improved gene predictions and a better understanding of gene function. In turn, they bring the wealth of experimental knowledge of C. elegans to other systems of medical and agricultural importance. Here, we describe new species and data types now available at WormBase. In addition, we detail enhancements to our curatorial pipeline and website infrastructure to accommodate new genomes and an extensive user base.
Summary: CMap is a web-based tool for displaying and comparing maps of any type and from any species. A user can compare an unlimited number of maps, view pair-wise comparisons of known correspondences, and search for maps or for features by name, species, type and accession. CMap is freely available, can run on a variety of database engines and uses only free and open software components.
Gramene is a comparative information resource for plants that integrates data across diverse data domains. In this article, we describe the development of a quantitative trait loci (QTL) database and illustrate how it can be used to facilitate both forward and reverse genetics research. The QTL database contains the largest online collection of rice QTL data in the world. Using flanking markers as anchors, QTLs originally reported on individual genetic maps have been systematically aligned to the rice sequence where they can be searched as standard genomic features. Researchers can determine whether a QTL co-localizes with other QTLs detected in independent experiments and can combine data from multiple studies to improve the resolution of a QTL position. Candidate genes falling within a QTL interval can be identified and their relationship to particular phenotypes can be inferred based on functional annotations provided by ontology terms. Mutations identified in functional genomics populations and association mapping panels can be aligned with QTL regions to facilitate fine mapping and validation of gene–phenotype associations. By assembling and integrating diverse types of data and information across species and levels of biological complexity, the QTL database enhances the potential to understand and utilize QTL information in biological research.