We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.
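As a rough illustration of two of the range operations named above (coverage calculation and overlap detection), the following Python sketch uses plain tuples for 1-based closed ranges. The function names `coverage` and `find_overlaps` are chosen for this sketch only, and the simple sweep shown here is an assumption for illustration, not the algorithm used by the packages.

```python
def coverage(ranges, length):
    """Per-position depth for 1-based closed ranges on a sequence of `length`."""
    # Difference array: +1 where a range starts, -1 just past where it ends.
    diff = [0] * (length + 2)
    for start, end in ranges:
        diff[start] += 1
        diff[end + 1] -= 1
    cov, depth = [], 0
    for pos in range(1, length + 1):
        depth += diff[pos]
        cov.append(depth)
    return cov


def find_overlaps(query, subject):
    """Return (query_index, subject_index) pairs of overlapping closed ranges."""
    # Visit subject ranges in order of start so the inner loop can stop early.
    order = sorted(range(len(subject)), key=lambda i: subject[i][0])
    hits = []
    for qi, (qs, qe) in enumerate(query):
        for si in order:
            ss, se = subject[si]
            if ss > qe:
                break  # every later subject range starts after the query ends
            if se >= qs:
                hits.append((qi, si))
    return hits


reads = [(1, 4), (3, 6), (8, 9)]      # hypothetical read alignments
features = [(2, 3), (7, 8)]           # hypothetical annotated ranges
print(coverage(reads, 10))
print(find_overlaps(features, reads))
```

The difference-array form keeps coverage linear in the sequence length plus the number of ranges; a production implementation would typically use a compressed (run-length) representation instead of one value per position.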
Identifying and understanding changes in cancer genomes is essential for the development of targeted therapeutics [1]. Here we analyse systematically more than 70 pairs of primary human colon tumours by applying next-generation sequencing to characterize their exomes, transcriptomes and copy-number alterations. We have identified 36,303 protein-altering somatic changes that include several new recurrent mutations in the Wnt pathway gene TCF7L2, chromatin-remodelling genes such as TET2 and TET3 and receptor tyrosine kinases including ERBB3. Our analysis for significantly mutated cancer genes identified 23 candidates, including the cell cycle checkpoint kinase ATM. Copy-number and RNA-seq data analysis identified amplifications and corresponding overexpression of IGF2 in a subset of colon tumours. Furthermore, using RNA-seq data we identified multiple fusion transcripts including recurrent gene fusions involving R-spondin family members RSPO2 and RSPO3 that together occur in 10% of colon tumours. The RSPO fusions were mutually exclusive with APC mutations, indicating that they probably have a role in the activation of Wnt signalling and tumorigenesis. Consistent with this we show that the RSPO fusion proteins were capable of potentiating Wnt signalling. The R-spondin gene fusions and several other gene mutations identified in this study provide new potential opportunities for therapeutic intervention in colon cancer.
The regulatory networks of differentiation programs have been partly characterized; however, the molecular mechanisms of lineage-specific gene regulation by highly similar transcription factors remain largely unknown. Here we compare the genome-wide binding and transcription profiles of NEUROD2-mediated neurogenesis with MYOD-mediated myogenesis. We demonstrate that NEUROD2 and MYOD bind a shared CAGCTG E-box motif and E-box motifs specific for each factor: CAGGTG for MYOD and CAGATG for NEUROD2. Binding at factor-specific motifs is associated with gene transcription, whereas binding at shared sites is associated with regional epigenetic modifications but not as strongly associated with gene transcription. Binding is largely constrained to E-boxes pre-set in an accessible chromatin context that determines the set of target genes activated in each cell type. These findings demonstrate that the differentiation program is genetically determined by E-box sequence whereas cell lineage epigenetically determines the availability of E-boxes for each differentiation program.
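The motif classes described above can be illustrated with a short Python sketch that scans a sequence for CANNTG E-boxes and labels each one as shared, MYOD-specific or NEUROD2-specific. Treating the reverse complements (CACCTG and CATCTG) as equivalent sites is an assumption of this sketch, as are the function name `classify_eboxes` and the example sequence.

```python
import re

# E-box classes named in the abstract, plus reverse complements of the two
# non-palindromic motifs (an assumption made for this sketch).
EBOX_CLASSES = {
    "CAGCTG": "shared",             # palindromic, same on both strands
    "CAGGTG": "MYOD-specific",
    "CACCTG": "MYOD-specific",      # reverse complement of CAGGTG
    "CAGATG": "NEUROD2-specific",
    "CATCTG": "NEUROD2-specific",   # reverse complement of CAGATG
}


def classify_eboxes(seq):
    """Return (1-based position, site, class) for every CANNTG in `seq`."""
    hits = []
    # The lookahead lets overlapping sites be reported as well.
    for m in re.finditer(r"(?=(CA[ACGT]{2}TG))", seq.upper()):
        site = m.group(1)
        hits.append((m.start() + 1, site, EBOX_CLASSES.get(site, "other")))
    return hits


example = "TTCAGCTGAACAGGTGTTCATCTGAACATATG"   # made-up sequence
for pos, site, label in classify_eboxes(example):
    print(pos, site, label)
```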
Transcription factor overexpression is common in biological experiments and transcription factor amplification is associated with many cancers, yet few studies have directly compared the DNA-binding profiles of endogenous versus overexpressed transcription factors.
We analyzed MyoD ChIP-seq data from C2C12 mouse myotubes, primary mouse myotubes, and mouse fibroblasts differentiated into muscle cells by overexpression of MyoD and compared the genome-wide binding profiles and binding site characteristics of endogenous and overexpressed MyoD.
Overexpressed MyoD bound to the same sites occupied by endogenous MyoD and possessed the same E-box sequence preference and co-factor site enrichments, and did not bind to new sites with distinct characteristics.
Our data demonstrate a robust fidelity of transcription factor binding sites over a range of expression levels and that increased amounts of transcription factor increase the binding at physiologically bound sites.
Transcription factor; Overexpressed; MyoD; c-Myc; ChIP-seq
Small-cell lung cancer (SCLC) is an exceptionally aggressive disease with poor prognosis. Here, we obtained exome, transcriptome and copy-number alteration data from 53 samples consisting of 36 primary human SCLC and normal tissue pairs and 17 matched SCLC and lymphoblastoid cell lines. We also obtained data for 4 primary tumors and 23 SCLC cell lines. We identified 22 significantly mutated genes in SCLC, including genes encoding kinases, G protein–coupled receptors and chromatin-modifying proteins. We found that several members of the SOX family of genes were mutated in SCLC. We also found SOX2 amplification in ~27% of the samples. Suppression of SOX2 using shRNAs blocked proliferation of SOX2-amplified SCLC lines. RNA sequencing identified multiple fusion transcripts and a recurrent RLF-MYCL1 fusion. Silencing of MYCL1 in SCLC cell lines that had the RLF-MYCL1 fusion decreased cell proliferation. These data provide an in-depth view of the spectrum of genomic alterations in SCLC and identify several potential targets for therapeutic intervention.
Facioscapulohumeral dystrophy (FSHD) is one of the most common inherited muscular dystrophies. The causative gene remains controversial and the mechanism of pathophysiology unknown. Here we identify genes associated with germline and early stem cell development as targets of the DUX4 transcription factor, a leading candidate gene for FSHD. The genes regulated by DUX4 are reliably detected in FSHD muscle but not in controls, providing direct support for the model that misexpression of DUX4 is a causal factor for FSHD. Additionally, we show that DUX4 binds and activates LTR elements from a class of MaLR endogenous primate retrotransposons and suppresses the innate immune response to viral infection, at least in part through the activation of DEFB103, a human defensin that can inhibit muscle differentiation. These findings suggest specific mechanisms of FSHD pathology and identify candidate biomarkers for disease diagnosis and progression.
Several malignancies are known to exhibit a “field-effect” whereby regions beyond tumor boundaries harbor histological or molecular changes that are associated with cancer. We sought to determine if histologically benign prostate epithelium collected from men with prostate cancer exhibits features indicative of pre-malignancy or field effect.
Prostate needle biopsies from 15 men with high grade (Gleason 8–10) prostate cancer and 15 age- and BMI-matched controls were identified from a biospecimen repository. Benign epithelia from each patient were isolated by laser capture microdissection. RNA was isolated, amplified, and used for microarray hybridization. Quantitative PCR (qPCR) was used to determine the expression of specific genes of interest. Alterations in protein expression were analyzed through immunohistochemistry.
Overall patterns of gene expression in microdissected benign-associated benign epithelium (BABE) and cancer-associated benign epithelium (CABE) were similar. Two genes previously associated with prostate cancer, PSMA and SSTR1, were significantly upregulated in the CABE group (FDR <1%). Expression of other prostate cancer-associated genes, including ERG, HOXC4, HOXC5 and MME, was also increased in CABE by qRT-PCR, although other genes commonly altered in prostate cancer were not different between the BABE and CABE samples. The expression of MME and PSMA proteins on IHC coincided with their mRNA alterations.
Gene expression profiles between benign epithelia of patients with and without prostate cancer are very similar. However, these tissues exhibit differences in the expression levels of several genes previously associated with prostate cancer development or progression. These differences may comprise a field effect and represent early events in carcinogenesis.
Prostate cancer; gene regulation; carcinogenesis
Although microRNAs (miRNAs) are important regulators of gene expression, the transcriptional regulation of miRNAs themselves is not well understood. We employed an integrative computational pipeline to dissect the transcription factors (TFs) responsible for altered miRNA expression in ovarian carcinoma. Using experimental data and computational predictions to define miRNA promoters across the human genome, we identified TFs with binding sites significantly overrepresented among miRNA genes overexpressed in ovarian carcinoma. This pipeline nominated TFs of the p53/p63/p73 family as candidate drivers of miRNA overexpression. Analysis of data from an independent set of 253 ovarian carcinomas in The Cancer Genome Atlas showed that p73 and p63 expression is significantly correlated with expression of miRNAs whose promoters contain p53/p63/p73 family binding sites. In experimental validation of specific miRNAs predicted by the analysis to be regulated by p73 and p63, we found that p53/p63/p73 family binding sites modulate promoter activity of miRNAs of the miR-200 family, which are known regulators of cancer stem cells and epithelial–mesenchymal transitions. Furthermore, in chromatin immunoprecipitation studies both p73 and p63 directly associated with the miR-200b/a/429 promoter. This study delineates an integrative approach that can be applied to discover transcriptional regulatory mechanisms in other biological settings where analogous genomic data are available.
Recent studies have demonstrated that MyoD initiates a feed-forward regulation of skeletal muscle gene expression, predicting that MyoD binds directly to many genes expressed during differentiation. We have used chromatin immunoprecipitation and high throughput sequencing to identify genome-wide binding of MyoD in several skeletal muscle cell types. As anticipated, MyoD preferentially binds to a VCASCTG sequence that resembles the in vitro selected site for a MyoD:E-protein heterodimer, and MyoD binding increases during differentiation at many of the regulatory regions of genes expressed in skeletal muscle. Unanticipated findings were that MyoD was constitutively bound to thousands of additional sites in both myoblasts and myotubes, and that the genome-wide binding of MyoD was associated with regional histone acetylation. Therefore, in addition to regulating muscle gene expression, MyoD binds genome-wide and has the ability to broadly alter the epigenome in myoblasts and myotubes.
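The VCASCTG preference above is written in IUPAC nucleotide codes (V is A, C or G; S is C or G). As a minimal sketch, the code below expands such a motif into a regular expression and reports forward-strand matches only. The helper names and the example sequence are hypothetical and are not taken from the study.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'VCASCTG' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(seq, motif="VCASCTG"):
    """Return (1-based start, matched site) pairs on the forward strand."""
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq.upper())]


print(iupac_to_regex("VCASCTG"))
print(find_motif("TTGCAGCTGAATCAGCTGACCACCTG"))   # made-up sequence
```

In the example, the site beginning with T (TCAGCTG) is not reported because T is excluded by V at the first position.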
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
Differential genomic targeting of the transcription factor TAL1 in alternate haematopoietic lineages
Expression of the basic helix-loop-helix transcription factor TAL1/SCL is required for erythrocyte differentiation; aberrant expression in lymphoid cells leads to oncogenic transformation. Here, global analysis of TAL1 binding in erythroid and malignant T cells identifies cell type specific functional interaction with the transcription factors RUNX and ETS1.
TAL1/SCL is a master regulator of haematopoiesis whose expression promotes opposite outcomes depending on the cell type: differentiation in the erythroid lineage or oncogenesis in the T-cell lineage. Here, we used a combination of ChIP sequencing and gene expression profiling to compare the function of TAL1 in normal erythroid and leukaemic T cells. Analysis of the genome-wide binding properties of TAL1 in these two haematopoietic lineages revealed new insight into the mechanism by which transcription factors select their binding sites in alternate lineages. Our study shows limited overlap in the TAL1-binding profile between the two cell types with an unexpected preference for ETS and RUNX motifs adjacent to E-boxes in the T-cell lineage. Furthermore, we show that TAL1 interacts with RUNX1 and ETS1, and that these transcription factors are critically required for TAL1 binding to genes that modulate T-cell differentiation. Thus, our findings highlight a critical role of the cellular environment in modulating transcription factor binding, and provide insight into the mechanism by which TAL1 inhibits differentiation leading to oncogenesis in the T-cell lineage.
erythroid cells; ETS1; RUNX1; SCL/TAL1; T-cell acute lymphoblastic leukaemia (T-ALL)
Summary: Associations between DNA polymorphisms and mRNA abundance are a natural target of genetic investigations, and microarrays facilitate genome-wide and transcriptome-wide surveys of these associations. This work is motivated by emerging requirements for data architectures and algorithm interfaces to allow flexible exploration of public and private archives of genotyping and expression arrays. Using R/Bioconductor facilities, Phase II HapMap genotypes and Illumina 47K expression assay results archived on multiple populations may be interactively explored and analyzed using commodity hardware.
Availability and Implementation: Open Source. Bioconductor 2.3 packages GGtools, GGBase, GGdata, hmyriB36. Freely available on the web at http://www.bioconductor.org
We used massively parallel pyrosequencing to discover and characterize microRNAs (miRNAs) expressed in human embryonic stem cells (hESC). Sequencing of small RNA cDNA libraries derived from undifferentiated hESC and from isogenic differentiating cultures yielded a total of 425,505 high-quality sequence reads. A custom data analysis pipeline delineated expression profiles for 191 previously annotated miRNAs, 13 novel miRNAs and 56 candidate miRNAs. Further characterization of a subset of the novel miRNAs in Dicer-knockdown hESC demonstrated Dicer-dependent expression, providing additional validation of our results. A set of 14 miRNAs (9 known and 5 novel) was noted to be expressed in undifferentiated hESC and then strongly down-regulated with differentiation. Functional annotation analysis of predicted targets of these miRNAs and comparison to a null model using non-hESC-expressed miRNAs identified statistically enriched functional categories, including chromatin remodeling and lineage-specific differentiation annotations. Finally, integration of our data with genome-wide chromatin immunoprecipitation data on OCT4, SOX2 and NANOG binding sites implicates these transcription factors in the regulation of nine of the novel/candidate miRNAs identified here. Comparison of our results to those of recent deep sequencing studies in mouse ESC and human ESC shows that most of the novel/candidate miRNAs found here were not identified in the other studies. The data indicate that hESC express a larger complement of miRNAs than previously appreciated, and provide a resource for further studies of miRNA regulation of hESC physiology.
microRNA; embryonic stem cells; deep sequencing; pyrosequencing
Motivation: Gene-set enrichment analysis (GSEA) can be greatly enhanced by linear model (regression) diagnostic techniques. Diagnostics can be used to identify outlying or influential samples, and also to evaluate model fit and explore model expansion.
Results: We demonstrate this methodology on an adult acute lymphoblastic leukemia (ALL) dataset, using GSEA based on chromosome-band mapping of genes. Individual residuals, grouped or aggregated by chromosomal loci, indicate problematic samples and potential data-entry errors, and help identify hyperdiploidy as a factor playing a key role in expression for this dataset. Subsequent analysis pinpoints suspected DNA copy number abnormalities of specific samples and chromosomes (most prevalent are chromosomes X, 21 and 14), and also reveals significant expression differences between the hyperdiploid and diploid groups on other chromosomes (most prominently 19, 22, 3 and 13)—differences which are apparently not associated with copy number.
Availability: Software for the statistical tools demonstrated in this article is available as Bioconductor package GSEAlm.
Supplementary information: Supplementary data are available at Bioinformatics online.
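The idea of grouping per-sample residuals by chromosomal locus can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the linear model is reduced to one group-mean term per gene, residuals are summed within a gene set and divided by the square root of the set size, and all names and data are hypothetical. It is not the GSEAlm implementation.

```python
from math import sqrt
from statistics import mean


def group_mean_residuals(expr, groups):
    """Residuals from a one-factor model: value minus its group mean.

    expr   -- {gene: [expression value per sample]}
    groups -- one group label per sample, in the same order
    """
    resid = {}
    for gene, values in expr.items():
        by_group = {}
        for value, group in zip(values, groups):
            by_group.setdefault(group, []).append(value)
        means = {g: mean(vs) for g, vs in by_group.items()}
        resid[gene] = [v - means[g] for v, g in zip(values, groups)]
    return resid


def aggregate_by_set(resid, gene_sets, n_samples):
    """Per-sample sum of residuals in each gene set, scaled by sqrt(set size)."""
    out = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in resid]
        if not present:
            continue  # skip sets with no measured genes
        scale = sqrt(len(present))
        out[name] = [sum(resid[g][j] for g in present) / scale
                     for j in range(n_samples)]
    return out


# Made-up data: sample 3 is elevated for both genes mapped to "bandX".
expr = {"g1": [1.0, 1.0, 3.0, 1.0],
        "g2": [2.0, 2.0, 4.0, 2.0],
        "g3": [5.0, 5.0, 5.0, 5.0]}
groups = ["a", "a", "b", "b"]
bands = {"bandX": ["g1", "g2"], "bandY": ["g3"]}
scores = aggregate_by_set(group_mean_residuals(expr, groups), bands, 4)
print(scores)
```

A sample whose aggregated residual is large for many bands on one chromosome would be a candidate for the kind of copy-number abnormality or data-entry problem described above.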
Flow cytometry (FCM) has become an important analysis technology in health care and medical research, but the large volume of data produced by modern high-throughput experiments has presented significant new challenges for computational analysis tools. The development of an FCM software suite in Bioconductor represents one approach to overcome these challenges. In the spirit of the R programming language (Tree Star Inc., “FlowJo,” http://www.flowjo.com), these tools are predominantly console-driven, allowing for programmatic access and rapid development of novel algorithms. Using this software requires a solid understanding of programming concepts and of the R language. However, some of these tools, in particular the statistical graphics and novel analytical methods, are also useful for nonprogrammers. To this end, we have developed an open source, extensible graphical user interface (GUI) iFlow, which sits on top of the Bioconductor backbone, enabling basic analyses by means of convenient graphical menus and wizards. We envision iFlow to be easily extensible in order to quickly integrate novel methodological developments.
Flow cytometry (FCM) is an analytical tool widely used for cancer and HIV/AIDS research and treatment, stem cell manipulation and detecting microorganisms in environmental samples. Current data standards do not capture the full scope of FCM experiments and there is a demand for software tools that can assist in the exploration and analysis of large FCM datasets. We are implementing a standardized approach to capturing, analyzing, and disseminating FCM data that will facilitate both more complex analyses and analysis of datasets that could not previously be efficiently studied. Initial work has focused on developing a community-based guideline for recording and reporting the details of FCM experiments. Open source software tools that implement this standard are being created, with an emphasis on facilitating reproducible and extensible data analyses. As well, tools for electronic collaboration will assist the integrated access and comprehension of experiments to empower users to collaborate on FCM analyses. This coordinated, joint development of bioinformatics standards and software tools for FCM data analysis has the potential to greatly facilitate both basic and clinical research—impacting a notably diverse range of medical and environmental research areas.
The recent development of semiautomated techniques for staining and analyzing flow cytometry samples has presented new challenges. Quality control and quality assessment are critical when developing new high throughput technologies and their associated information services. Our experience suggests that significant bottlenecks remain in the development of high throughput flow cytometry methods for data analysis and display. In particular, data quality control and quality assessment are crucial steps in processing and analyzing high throughput flow cytometry data.
We propose a variety of graphical exploratory data analytic tools for exploring ungated flow cytometry data. We have implemented a number of specialized functions and methods in the Bioconductor package rflowcyt. We demonstrate the use of these approaches by investigating two independent sets of high throughput flow cytometry data.
We found that graphical representations can reveal substantial nonbiological differences in samples. Empirical Cumulative Distribution Function and summary scatterplots were especially useful in the rapid identification of problems not identified by manual review.
Graphical exploratory data analytic tools are quick and useful means of assessing data quality. We propose that the described visualizations should be used as quality assessment tools and where possible, be used for quality control.
flow cytometry; high throughput; quality assessment; visualization; exploratory data analysis; statistics; software
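The empirical cumulative distribution function (ECDF) mentioned above is simple to compute directly. The Python sketch below builds an ECDF for one channel of one sample and, as an added assumption not described in the abstract, summarizes the difference between two samples by the largest vertical gap between their ECDFs. The intensity values are made up.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F where F(x) is the fraction of observations <= x."""
    xs = sorted(sample)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return F


def max_ecdf_distance(a, b):
    """Largest vertical gap between the ECDFs of samples a and b."""
    Fa, Fb = ecdf(a), ecdf(b)
    # The gap can only change at observed values, so checking those suffices.
    return max(abs(Fa(x) - Fb(x)) for x in set(a) | set(b))


well_1 = [1, 2, 3, 4]    # hypothetical intensities from one sample
well_2 = [3, 4, 5, 6]    # hypothetical intensities from a shifted sample
print(ecdf(well_1)(2.5))
print(max_ecdf_distance(well_1, well_2))
```

Plotting the ECDFs of all samples on one panel, or ranking samples by such a distance from a reference, is one way a shifted or otherwise atypical sample could be flagged for review.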
A systems biology interpretation of genome-scale RNA interference (RNAi) experiments is complicated by scope, experimental variability and network signaling robustness. Over-representation approaches (ORA), such as the Hypergeometric or z-score, are an established statistical framework used to associate RNA interference effectors to biologically annotated gene sets or pathways. These methods, however, do not directly take advantage of our growing understanding of the interactome. Furthermore, these methods can miss partial pathway activation and may be biased by protein complexes. Here we present a novel ORA, protein interaction permutation analysis (PIPA), that takes advantage of canonical pathways and established protein interactions to identify pathways enriched for protein interactions connecting RNAi hits.
We use PIPA to analyze genome-scale siRNA cell growth screens performed in HeLa and TOV cell lines. First we show that interacting gene pair siRNA hits are more reproducible than single gene hits. Using protein interactions, PIPA identifies enriched pathways not found using the standard Hypergeometric analysis including the FAK cytoskeletal remodeling pathway. Different branches of the FAK pathway are distinctly essential in HeLa versus TOV cell lines while other portions are unaffected by siRNA perturbations. Enriched hits belong to protein interactions associated with cell cycle regulation, anti-apoptosis, and signal transduction.
PIPA provides an analytical framework to interpret siRNA screen data by merging biologically annotated gene sets with the human interactome. As a result we identify pathways and signaling hypotheses that are statistically enriched to affect cell growth in human cell lines. This method provides a complementary approach to standard gene set enrichment that utilizes the additional knowledge of specific interactions within biological gene sets.
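For reference, the standard hypergeometric over-representation test that PIPA is compared against can be written as follows. This sketch covers only that baseline, not PIPA itself, and the function name, argument names and counts are assumptions chosen for the example.

```python
from math import comb


def hypergeom_upper_tail(population, in_pathway, n_hits, hits_in_pathway):
    """P(X >= hits_in_pathway) when n_hits genes are drawn without replacement
    from `population` genes, of which `in_pathway` belong to the pathway."""
    total = comb(population, n_hits)
    top = min(in_pathway, n_hits)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible terms contribute nothing to the sum.
    favourable = sum(
        comb(in_pathway, i) * comb(population - in_pathway, n_hits - i)
        for i in range(hits_in_pathway, top + 1)
    )
    return favourable / total


# Toy numbers: 10 screened genes, 4 in the pathway, 3 hits, 2 of them in the pathway.
print(hypergeom_upper_tail(10, 4, 3, 2))
```

Because the test counts only pathway membership, two hits that interact directly are scored the same as two unrelated hits, which is the limitation the interaction-based approach is meant to address.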
Summary: ShortRead is a package for input, quality assessment, manipulation and output of high-throughput sequencing data. ShortRead is provided in the R and Bioconductor environments, allowing ready access to additional facilities for advanced statistical analysis, data transformation, visualization and integration with diverse genomic resources.
Availability and Implementation: This package is implemented in R and available at the Bioconductor web site; the package contains a ‘vignette’ outlining typical work flows.
Summary: The rtracklayer package supports the integration of existing genome browsers with experimental data analyses performed in R. The user may (i) transfer annotation tracks to and from a genome browser and (ii) create and manipulate browser views to focus on a particular set of annotations in a specific genomic region. Currently, the UCSC genome browser is supported.
Availability: The package is freely available from http://www.bioconductor.org/. A quick-start vignette is included with the package.
Recent advances in automation technologies have enabled the use of flow cytometry for high throughput screening, generating large complex data sets often in clinical trials or drug discovery settings. However, data management and data analysis methods have not advanced sufficiently far from the initial small-scale studies to support modeling in the presence of multiple covariates.
We developed a set of flexible open source computational tools in the R package flowCore to facilitate the analysis of these complex data. A key component is a set of suitable data structures that support the application of similar operations to a collection of samples or a clinical cohort. In addition, our software constitutes a shared and extensible research platform that enables collaboration between bioinformaticians, computer scientists, statisticians, biologists and clinicians. This platform will foster the development of novel analytic methods for flow cytometry.
The software has been applied in the analysis of various data sets and its data structures have proven to be highly efficient in capturing and organizing the analytic work flow. Finally, a number of additional Bioconductor packages successfully build on the infrastructure provided by flowCore, opening new avenues for flow data analysis.
Summary: The assessment of data quality is a major concern in microarray analysis. arrayQualityMetrics is a Bioconductor package that provides a report with diagnostic plots for one- or two-colour microarray data. The quality metrics assess reproducibility, identify apparent outlier arrays and compute measures of signal-to-noise ratio. The tool handles most current microarray technologies and is amenable to use in automated analysis pipelines or for automatic report generation, as well as for use by individuals. The diagnosis of quality remains, in principle, a context-dependent judgement, but our tool provides powerful, automated, objective and comprehensive instruments on which to base a decision.
Availability: arrayQualityMetrics is a free and open source package, under LGPL license, available from the Bioconductor project at www.bioconductor.org. A user's guide and examples are provided with the package. Some examples of HTML reports generated by arrayQualityMetrics can be found at http://www.microarray-quality.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Using new computational tools in yeast, multi-protein complexes were identified that share an unusually high number of synthetic genetic interactions.
Synthetic lethality defines a genetic interaction where the combination of mutations in two or more genes leads to cell death. The implications of synthetic lethal screens have been discussed in the context of drug development as synthetic lethal pairs could be used to selectively kill cancer cells, but leave normal cells relatively unharmed. A challenge is to assess genome-wide experimental data and integrate the results to better understand the underlying biological processes. We propose statistical and computational tools that can be used to find relationships between synthetic lethality and cellular organizational units.
In Saccharomyces cerevisiae, we identified multi-protein complexes and pairs of multi-protein complexes that share an unusually high number of synthetic genetic interactions. As previously predicted, we found that synthetic lethality can arise from subunits of an essential multi-protein complex or between pairs of multi-protein complexes. Finally, using multi-protein complexes allowed us to take into account the pleiotropic nature of the gene products.
Modeling synthetic lethality using current estimates of the yeast interactome is an efficient approach to disentangle some of the complex molecular interactions that drive a cell. Our model in conjunction with applied statistical methods and computational methods provides new tools to better characterize synthetic genetic interactions.