Funded by the National Institutes of Health (NIH), the aim of the Model Organism ENCyclopedia of DNA Elements (modENCODE) project is to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for both model organisms C. elegans (worm) and D. melanogaster (fly). With a total size of just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data has already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or combine modENCODE data with other data sets. Unfortunately this can be a time consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
Using these resources provides a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to use more of their time using modENCODE data, and less time moving it around.
In an effort to comprehensively characterize the functional elements within the genomes of the important model organisms Drosophila melanogaster and Caenorhabditis elegans, the NHGRI model organism Encyclopaedia of DNA Elements (modENCODE) consortium has generated an enormous library of genomic data along with detailed, structured information on all aspects of the experiments. The modMine database (http://intermine.modencode.org) described here has been built by the modENCODE Data Coordination Center to allow the broader research community to (i) search for and download data sets of interest among the thousands generated by modENCODE; (ii) access the data in an integrated form together with non-modENCODE data sets; and (iii) facilitate fine-grained analysis of the above data. The sophisticated search features are possible because of the collection of extensive experimental metadata by the consortium. Interfaces are provided to allow both biologists and bioinformaticians to exploit these rich modENCODE data sets now available via modMine.
FlyBase (http://flybase.org) is the leading database and web portal for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. Whether you use the fruit fly as an experimental system or want to apply Drosophila biological knowledge to another field of study, FlyBase can help you successfully navigate the wealth of available Drosophila data. Here, we review the FlyBase web site with novice and less-experienced users of FlyBase in mind and point out recent developments stemming from the availability of genome-wide data from the modENCODE project. The first section of this paper explains the organization of the web site and describes the report pages available on FlyBase, focusing on the most popular, the Gene Report. The next section introduces some of the search tools available on FlyBase, in particular, our heavily used and recently redesigned search tool QuickSearch, found on the FlyBase homepage. The final section concerns genomic data, including recent modENCODE (http://www.modencode.org) data, available through our Genome Browser, GBrowse.
Chromatin immunoprecipitation (ChIP), coupled with massively parallel short-read sequencing (seq) is used to probe chromatin dynamics. Although there are many algorithms to call peaks from ChIP-seq datasets, most are tuned either to handle punctate sites, such as transcriptional factor binding sites, or broad regions, such as histone modification marks; few can do both. Other algorithms are limited in their configurability, performance on large data sets, and ability to distinguish closely-spaced peaks.
In this paper, we introduce PeakRanger, a peak caller software package that works equally well on punctate and broad sites, can resolve closely-spaced peaks, has excellent performance, and is easily customized. In addition, PeakRanger can be run in a parallel cloud computing environment to obtain extremely high performance on very large data sets. We present a series of benchmarks to evaluate PeakRanger against 10 other peak callers, and demonstrate the performance of PeakRanger on both real and synthetic data sets. We also present real world usages of PeakRanger, including peak-calling in the modENCODE project.
Compared to other peak callers tested, PeakRanger offers improved resolution in distinguishing extremely closely-spaced peaks. PeakRanger has above-average spatial accuracy in terms of identifying the precise location of binding events. PeakRanger also has excellent sensitivity and specificity in all benchmarks evaluated. In addition, PeakRanger offers significant improvements in run time when running on a single processor system, and very marked improvements when allowed to take advantage of the MapReduce parallel environment offered by a cloud computing resource. PeakRanger can be downloaded at the official site of modENCODE project: http://www.modencode.org/software/ranger/
Motivation: The highly coordinated expression of thousands of genes in an organism is regulated by the concerted action of transcription factors, chromatin proteins and epigenetic mechanisms. High-throughput experimental data for genome wide in vivo protein–DNA interactions and epigenetic marks are becoming available from large projects, such as the model organism ENCyclopedia Of DNA Elements (modENCODE) and from individual labs. Dissemination and visualization of these datasets in an explorable form is an important challenge.
Results: To support research on Drosophila melanogaster transcription regulation and make the genome wide in vivo protein–DNA interactions data available to the scientific community as a whole, we have developed a system called Flynet. Currently, Flynet contains 101 datasets for 38 transcription factors and chromatin regulator proteins in different experimental conditions. These factors exhibit different types of binding profiles ranging from sharp localized peaks to broad binding regions. The protein–DNA interaction data in Flynet was obtained from the analysis of chromatin immunoprecipitation experiments on one color and two color genomic tiling arrays as well as chromatin immunoprecipitation followed by massively parallel sequencing. A web-based interface, integrated with an AJAX based genome browser, has been built for queries and presenting analysis results. Flynet also makes available the cis-regulatory modules reported in literature, known and de novo identified sequence motifs across the genome, and other resources to study gene regulation.
Availability: Flynet is available at https://www.cistrack.org/flynet/.
Supplementary information: Supplementary data are available at Bioinformatics online.
WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase’s role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.
Caenorhabditis elegans; annotation; community resource; genome; model organism database; nematode; parasitic nematode; sequence curation
Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.
Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction.
Availability: MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.
Jonas_Behr@web.de and email@example.com
Supplementary data are available at Bioinformatics online.
The Drosophila MSL complex mediates dosage compensation by increasing transcription of the single X chromosome in males approximately two-fold. This is accomplished through recognition of the X chromosome and subsequent acetylation of histone H4K16 on X-linked genes. Initial binding to the X is thought to occur at “entry sites” that contain a consensus sequence motif (“MSL recognition element” or MRE). However, this motif is only ∼2 fold enriched on X, and only a fraction of the motifs on X are initially targeted. Here we ask whether chromatin context could distinguish between utilized and non-utilized copies of the motif, by comparing their relative enrichment for histone modifications and chromosomal proteins mapped in the modENCODE project. Through a comparative analysis of the chromatin features in male S2 cells (which contain MSL complex) and female Kc cells (which lack the complex), we find that the presence of active chromatin modifications, together with an elevated local GC content in the surrounding sequences, has strong predictive value for functional MSL entry sites, independent of MSL binding. We tested these sites for function in Kc cells by RNAi knockdown of Sxl, resulting in induction of MSL complex. We show that ectopic MSL expression in Kc cells leads to H4K16 acetylation around these sites and a relative increase in X chromosome transcription. Collectively, our results support a model in which a pre-existing active chromatin environment, coincident with H3K36me3, contributes to MSL entry site selection. The consequences of MSL targeting of the male X chromosome include increase in nucleosome lability, enrichment for H4K16 acetylation and JIL-1 kinase, and depletion of linker histone H1 on active X-linked genes. Our analysis can serve as a model for identifying chromatin and local sequence features that may contribute to selection of functional protein binding sites in the genome.
The genomes of complex organisms encompass hundreds of millions of base pairs of DNA, and regulatory molecules must distinguish specific targets within this vast landscape. In general, regulatory factors find target genes through sequence-specific interactions with the underlying DNA. However, sequence-specific factors typically bind only a fraction of the candidate genomic regions containing their specific target sequence motif. Here we identify potential roles for chromatin environment and flanking sequence composition in helping regulatory factors find their appropriate binding sites, using targeting of the Drosophila dosage compensation complex as a model. The initial stage of dosage compensation involves binding of the Male Specific Lethal (MSL) complex to a sequence motif called the MSL recognition element . Using data from a large chromatin mapping effort (the modENCODE project), we successfully identify an active chromatin environment as predictive of selective MRE binding by the MSL complex. Our study provides a framework for using genome-wide datasets to analyze and predict functional protein–DNA binding site selection.
We have implemented aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 genomes project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation–i.e. how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts.
Availability: ACT is available at http://act.gersteinlab.org
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
Chromatin is the driver of gene regulation, yet understanding the molecular interactions underlying chromatin factor combinatorial patterns (or the “chromatin codes”) remains a fundamental challenge in chromatin biology. Here we developed a global modeling framework that leverages chromatin profiling data to produce a systems-level view of the macromolecular complex of chromatin. Our model ultilizes maximum entropy modeling with regularization-based structure learning to statistically dissect dependencies between chromatin factors and produce an accurate probability distribution of chromatin code. Our unsupervised quantitative model, trained on genome-wide chromatin profiles of 73 histone marks and chromatin proteins from modENCODE, enabled making various data-driven inferences about chromatin profiles and interactions. We provided a highly accurate predictor of chromatin factor pairwise interactions validated by known experimental evidence, and for the first time enabled higher-order interaction prediction. Our predictions can thus help guide future experimental studies. The model can also serve as an inference engine for predicting unknown chromatin profiles — we demonstrated that with this approach we can leverage data from well-characterized cell types to help understand less-studied cell type or conditions.
Chromatin, like many other molecular biological systems, is composed of multiple interacting factors. Our knowledge about chromatin factors is mostly qualitative, and such qualitative knowledge can be insufficient for predicting collective behaviors. It's also extremely challenging to study collective behaviors involving multiple interacting factors through genetic and biochemical experiments. An alternative approach is to leverage large-scale genome-wide chromatin profiles and statistical modeling to create predictive models and infer underlying interaction mechanisms based on these observed high-throughput data. In this study, we developed a novel maximum entropy-based modeling approach to quantitatively capture interactions between chromatin factors at the same genomic location, which we see as a step toward quantitative understanding of chromatin organization that involves a system of multiple interacting factors. We applied this quantitative model to successfully infer functional properties of chromatin including interactions between chromatin factors. Furthermore, the model predicts unmeasured chromatin profiles with high accuracy based on its inferred dependencies with other factors within and across cell-types. Thus our modeling approach effectively ultilizes large-scale chromatin profiles to dissect chromatin factor interactions and to make data-driven inferences about chromatin regulation.
The Myc family of transcription factors regulates a variety of biological processes, including the cell cycle, growth, proliferation, metabolism, and apoptosis. In Caenorhabditis elegans, the “Myc interaction network” consists of two opposing heterodimeric complexes with antagonistic functions in transcriptional control: the Myc-Mondo:Mlx transcriptional activation complex and the Mad:Max transcriptional repression complex. In C. elegans, Mondo, Mlx, Mad, and Max are encoded by mml-1, mxl-2, mdl-1, and mxl-1, respectively. Here we show a similar antagonistic role for the C. elegans Myc-Mondo and Mad complexes in longevity control. Loss of mml-1 or mxl-2 shortens C. elegans lifespan. In contrast, loss of mdl-1 or mxl-1 increases longevity, dependent upon MML-1:MXL-2. The MML-1:MXL-2 and MDL-1:MXL-1 complexes function in both the insulin signaling and dietary restriction pathways. Furthermore, decreased insulin-like/IGF-1 signaling (ILS) or conditions of dietary restriction increase the accumulation of MML-1, consistent with the notion that the Myc family members function as sensors of metabolic status. Additionally, we find that Myc family members are regulated by distinct mechanisms, which would allow for integrated control of gene expression from diverse signals of metabolic status. We compared putative target genes based on ChIP-sequencing data in the modENCODE project and found significant overlap in genomic DNA binding between the major effectors of ILS (DAF-16/FoxO), DR (PHA-4/FoxA), and Myc family (MDL-1/Mad/Mxd) at common target genes, which suggests that diverse signals of metabolic status converge on overlapping transcriptional programs that influence aging. Consistent with this, there is over-enrichment at these common targets for genes that function in lifespan, stress response, and carbohydrate metabolism. Additionally, we find that Myc family members are also involved in stress response and the maintenance of protein homeostasis. Collectively, these findings indicate that Myc family members integrate diverse signals of metabolic status, to coordinate overlapping metabolic and cytoprotective transcriptional programs that determine the progression of aging.
Transcription factors are essential proteins that regulate the expression of genes and play an important role in most biological processes. The results of our study presented here demonstrate for the first time a role in aging for a small family of transcription factors in the nematode worm Caenorhabditis elegans. Importantly, these proteins have close relatives in higher organisms, including humans that influence metabolism, cell replication, and have been implicated in the development of cancer. Moreover, the loss of one homologue has also been implicated in Williams-Beuren syndrome, a disease characterized in part by signs of premature aging. Our data demonstrate that these transcription factors function within insulin/IGF-1 signaling and dietary restriction, two highly conserved pathways that link nutrient sensing to longevity. Taken together, our findings provide exciting new insight into a family of proteins that may be essential for linking nutrient sensing to longevity and have implications for the improvement of human healthspan.
The functional repertoire of long intergenic noncoding RNA (lincRNA) molecules has begun to be elucidated in mammals. Determining the biological relevance and potential gene regulatory mechanisms of these enigmatic molecules would be expedited in a more tractable model organism, such as Drosophila melanogaster. To this end, we defined a set of 1,119 putative lincRNA genes in D. melanogaster using modENCODE whole transcriptome (RNA-seq) data. A large majority (1.1 of 1.3 Mb; 85%) of these bases were not previously reported by modENCODE as being transcribed. Significant selective constraint on the sequences of these loci predicts that virtually all have sustained functionality across the Drosophila clade. We observe biases in lincRNA genomic locations and expression profiles that are consistent with some of these lincRNAs being involved in the regulation of neighboring protein-coding genes with developmental functions. We identify lincRNAs that may be important in the developing nervous system and in male-specific organs, such as the testes. LincRNA loci were also identified whose positions, relative to nearby protein-coding loci, are equivalent between D. melanogaster and mouse. This study predicts that the genomes of not only vertebrates, such as mammals, but also an invertebrate (fruit fly) harbor large numbers of lincRNA loci. Our findings now permit exploitation of Drosophila genetics for the investigation of lincRNA mechanisms, including lincRNAs with potential functional analogues in mammals.
long intergenic noncoding RNAs; modENCODE; transcriptional regulation; evolution; development
Tiling arrays have been the tool of choice for probing an organism's transcriptome without prior assumptions about the transcribed regions, but RNA-Seq is becoming a viable alternative as the costs of sequencing continue to decrease. Understanding the relative merits of these technologies will help researchers select the appropriate technology for their needs.
Here, we compare these two platforms using a matched sample of poly(A)-enriched RNA isolated from the second larval stage of C. elegans. We find that the raw signals from these two technologies are reasonably well correlated but that RNA-Seq outperforms tiling arrays in several respects, notably in exon boundary detection and dynamic range of expression. By exploring the accuracy of sequencing as a function of depth of coverage, we found that about 4 million reads are required to match the sensitivity of two tiling array replicates. The effects of cross-hybridization were analyzed using a "nearest neighbor" classifier applied to array probes; we describe a method for determining potential "black list" regions whose signals are unreliable. Finally, we propose a strategy for using RNA-Seq data as a gold standard set to calibrate tiling array data. All tiling array and RNA-Seq data sets have been submitted to the modENCODE Data Coordinating Center.
Tiling arrays effectively detect transcript expression levels at a low cost for many species while RNA-Seq provides greater accuracy in several regards. Researchers will need to carefully select the technology appropriate to the biological investigations they are undertaking. It will also be important to reconsider a comparison such as ours as sequencing technologies continue to evolve.
Development of tools to jointly visualize the genome and the epigenome remains a challenge. chroGPS is a computational approach that addresses this question. chroGPS uses multidimensional scaling techniques to represent similarity between epigenetic factors, or between genetic elements on the basis of their epigenetic state, in 2D/3D reference maps. We emphasize biological interpretability, statistical robustness, integration of genetic and epigenetic data from heterogeneous sources, and computational feasibility. Although chroGPS is a general methodology to create reference maps and study the epigenetic state of any class of genetic element or genomic region, we focus on two specific kinds of maps: chroGPSfactors, which visualizes functional similarities between epigenetic factors, and chroGPSgenes, which describes the epigenetic state of genes and integrates gene expression and other functional data. We use data from the modENCODE project on the genomic distribution of a large collection of epigenetic factors in Drosophila, a model system extensively used to study genome organization and function. Our results show that the maps allow straightforward visualization of relationships between factors and elements, capturing relevant information about their functional properties that helps to interpret epigenetic information in a functional context and derive testable hypotheses.
The E2F transcription factors are important regulators of the cell cycle whose function is commonly misregulated in cancer. To identify novel regulators of E2F1 activity in vivo, we used Drosophila to conduct genetic screens. For this, we generated transgenic lines that allow the tissue-specific depletion of dE2F1 by RNAi. Expression of these transgenes using Gal4 drivers in the eyes and wings generated reliable and modifiable phenotypes. We then conducted genetic screens testing the capacity of Exelixis deficiencies to modify these E2F1-RNAi phenotypes. From these screens, we identified mutant alleles of Suppressor of zeste 2 [Su(z)2] and multiple Polycomb group genes as strong suppressors of the E2F1-RNA interference phenotypes. In validation of our genetic data, we find that depleting Su(z)2 in cultured Drosophila cells restores the cell-proliferation defects caused by reduction of dE2F1 by elevating the level of dE2f1. Furthermore, analyses of methylation status of histone H3 lysine 27 (H3K27me) from the published modENCODE data sets suggest that the genomic regions harboring dE2f1 gene and certain dE2f1 target genes display H3K27me during development and in several Drosophila cell lines. These in vivo observations suggest that the Polycomb group may regulate cell proliferation by repressing the transcription of dE2f1 and certain dE2F1 target genes. This mechanism may play an important role in coordinating cellular differentiation and proliferation during Drosophila development.
cell proliferation; E2F1; Su(z)2; PcG; Drosophila
Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) allows genome-wide discovery of protein-DNA interactions such as transcription factor bindings and histone modifications. Previous reports only compared a small number of profiles, and little has been done to compare histone modification profiles generated by the two technologies or to assess the impact of input DNA libraries in ChIP-seq analysis. Here, we performed a systematic analysis of a modENCODE dataset consisting of 31 pairs of ChIP-chip/ChIP-seq profiles of the coactivator CBP, RNA polymerase II (RNA PolII), and six histone modifications across four developmental stages of Drosophila melanogaster.
Both technologies produce highly reproducible profiles within each platform, ChIP-seq generally produces profiles with a better signal-to-noise ratio, and allows detection of more peaks and narrower peaks. The set of peaks identified by the two technologies can be significantly different, but the extent to which they differ varies depending on the factor and the analysis algorithm. Importantly, we found that there is a significant variation among multiple sequencing profiles of input DNA libraries and that this variation most likely arises from both differences in experimental condition and sequencing depth. We further show that using an inappropriate input DNA profile can impact the average signal profiles around genomic features and peak calling results, highlighting the importance of having high quality input DNA data for normalization in ChIP-seq analysis.
Our findings highlight the biases present in each of the platforms, show the variability that can arise from both technology and analysis methods, and emphasize the importance of obtaining high quality and deeply sequenced input DNA libraries for ChIP-seq analysis.
The ENCODE project is an international consortium with a goal of cataloguing all the functional elements in the human genome. The ENCODE Data Coordination Center (DCC) at the University of California, Santa Cruz serves as the central repository for ENCODE data. In this role, the DCC offers a collection of high-throughput, genome-wide data generated with technologies such as ChIP-Seq, RNA-Seq, DNA digestion and others. This data helps illuminate transcription factor-binding sites, histone marks, chromatin accessibility, DNA methylation, RNA expression, RNA binding and other cell-state indicators. It includes sequences with quality scores, alignments, signals calculated from the alignments, and in most cases, element or peak calls calculated from the signal data. Each data set is available for visualization and download via the UCSC Genome Browser (http://genome.ucsc.edu/). ENCODE data can also be retrieved using a metadata system that captures the experimental parameters of each assay. The ENCODE web portal at UCSC (http://encodeproject.org/) provides information about the ENCODE data and links for access.
Mapping the chromosomal locations of transcription factors, nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, and polymerases is one of the key tasks of modern biology, as evidenced by the Encyclopedia of DNA Elements (ENCODE) Project. To this end, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the standard methodology. Mapping such protein-DNA interactions in vivo using ChIP-seq presents multiple challenges not only in sample preparation and sequencing but also for computational analysis. Here, we present step-by-step guidelines for the computational analysis of ChIP-seq data. We address all the major steps in the analysis of ChIP-seq data: sequencing depth selection, quality checking, mapping, data normalization, assessment of reproducibility, peak calling, differential binding analysis, controlling the false discovery rate, peak annotation, visualization, and motif analysis. At each step in our guidelines we discuss some of the software tools most frequently used. We also highlight the challenges and problems associated with each step in ChIP-seq data analysis. We present a concise workflow for the analysis of ChIP-seq data in Figure 1 that complements and expands on the recommendations of the ENCODE and modENCODE projects. Each step in the workflow is described in detail in the following sections.
Mitochondria contain their own circular genome, with mitochondria-specific transcription and replication systems and corresponding regulatory proteins. All of these proteins are encoded in the nuclear genome and are post-translationally imported into mitochondria. In addition, several nuclear transcription factors have been reported to act in mitochondria, but there has been no comprehensive mapping of their occupancy patterns and it is not clear how many other factors may also be found in mitochondria. Here we address these questions by using ChIP-seq data from the ENCODE, mouseENCODE and modENCODE consortia for 151 human, 31 mouse and 35 C. elegans factors. We identified 8 human and 3 mouse transcription factors with strong localized enrichment over the mitochondrial genome that was usually associated with the corresponding recognition sequence motif. Notably, these sites of occupancy are often the sites with highest ChIP-seq signal intensity within both the nuclear and mitochondrial genomes and are thus best explained as true binding events to mitochondrial DNA, which exist in high copy number in each cell. We corroborated these findings by immunocytochemical staining evidence for mitochondrial localization. However, we were unable to find clear evidence for mitochondrial binding in ENCODE and other publicly available ChIP-seq data for most factors previously reported to localize there. As the first global analysis of nuclear transcription factors binding in mitochondria, this work opens the door to future studies that probe the functional significance of the phenomenon.
Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.
Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.
Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF
Contact: firstname.lastname@example.org; email@example.com
Identifying a bicluster, or submatrix of a gene expression dataset wherein the genes express similar behavior over the columns, is useful for discovering novel functional gene interactions. In this article, we introduce a new algorithm for finding biClusters with Linear Patterns (CLiP). Instead of solely maximizing Pearson correlation, we introduce a fitness function that also considers the correlation of complementary genes and conditions. This eliminates the need for a priori determination of the bicluster size. We employ both greedy search and the genetic algorithm in optimization, incorporating resampling for more robust discovery. When applied to both real and simulation datasets, our results show that CLiP is superior to existing methods. In analyzing RNA-seq fly and worm time-course data from modENCODE, we uncover a set of similarly expressed genes suggesting maternal dependence. Supplementary Material is available online (at www.liebertonline.com/cmb).
algorithms; gene clusters; probability
Advances in sequencing technology have boosted population genomics and made it possible to map the positions of transcription factor binding sites (TFBSs) with high precision. Here we investigate TFBS variability by combining transcription factor binding maps generated by ENCODE, modENCODE, our previously published data and other sources with genomic variation data for human individuals and Drosophila isogenic lines.
We introduce a metric of TFBS variability that takes into account changes in motif match associated with mutation and makes it possible to investigate TFBS functional constraints instance-by-instance as well as in sets that share common biological properties. We also take advantage of the emerging per-individual transcription factor binding data to show evidence that TFBS mutations, particularly at evolutionarily conserved sites, can be efficiently buffered to ensure coherent levels of transcription factor binding.
Our analyses provide insights into the relationship between individual and interspecies variation and show evidence for the functional buffering of TFBS mutations in both humans and flies. In a broad perspective, these results demonstrate the potential of combining functional genomics and population genetics approaches for understanding gene regulation.
We present a network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets. The network, namely the integrated regulatory network, consists of three major types of regulation: TF→gene, TF→miRNA and miRNA→gene. We identified the target genes and target miRNAs for a set of TFs based on the ChIP-Seq binding profiles, the predicted targets of miRNAs using annotated 3′UTR sequences and conservation information. Making use of the system-wide RNA-Seq profiles, we classified transcription factors into positive and negative regulators and assigned a sign for each regulatory interaction. Other types of edges such as protein-protein interactions and potential intra-regulations between miRNAs based on the embedding of miRNAs in their host genes were further incorporated. We examined the topological structures of the network, including its hierarchical organization and motif enrichment. We found that transcription factors downstream of the hierarchy distinguish themselves by expressing more uniformly at various tissues, have more interacting partners, and are more likely to be essential. We found an over-representation of notable network motifs, including a FFL in which a miRNA cost-effectively shuts down a transcription factor and its target. We used data of C. elegans from the modENCODE project as a primary model to illustrate our framework, but further verified the results using other two data sets. As more and more genome-wide ChIP-Seq and RNA-Seq data becomes available in the near future, our methods of data integration have various potential applications.
The precise control of gene expression lies at the heart of many biological processes. In eukaryotes, the regulation is performed at multiple levels, mediated by different regulators such as transcription factors and miRNAs, each distinguished by different spatial and temporal characteristics. These regulators are further integrated to form a complex regulatory network responsible for the orchestration. The construction and analysis of such networks is essential for understanding the general design principles. Recent advances in high-throughput techniques like ChIP-Seq and RNA-Seq provide an opportunity by offering a huge amount of binding and expression data. We present a general framework to combine these types of data into an integrated network and perform various topological analyses, including its hierarchical organization and motif enrichment. We find that the integrated network possesses an intrinsic hierarchical organization and is enriched in several network motifs that include both transcription factors and miRNAs. We further demonstrate that the framework can be easily applied to other species like human and mouse. As more and more genome-wide ChIP-Seq and RNA-Seq data are going to be generated in the near future, our methods of data integration have various potential applications.
ChIP-seq experiments identify genome-wide profiles of DNA-binding molecules including transcription factors, enzymes and epigenetic marks. Biological replicates are critical for reliable site discovery and are required for the deposition of data in the ENCODE and modENCODE projects. While early reports suggested two replicates were sufficient, the widespread application of the technique has led to emerging consensus that the technique is noisy and that increasing replication may be worthwhile. Additional biological replicates also allow for quantitative assessment of differences between conditions. To date it has remained controversial about how to confirm peak identification and to determine signal strength across biological replicates, particularly when the number of replicates is greater than two. Using objective metrics, we evaluate the consistency of biological replicates in ChIP-seq experiments with more than two replicates. We compare several approaches for binding site determination, including two popular but disparate peak callers, CisGenome and MACS2. Here we propose read coverage as a quantitative measurement of signal strength for estimating sample concordance. Determining binding based on genomic features, such as promoters, is also examined. We find that increasing the number of biological replicates increases the reliability of peak identification. Critically, binding sites with strong biological evidence may be missed if researchers rely on only two biological replicates. When more than two replicates are performed, a simple majority rule (>50% of samples identify a peak) identifies peaks more reliably in all biological replicates than the absolute concordance of peak identification between any two replicates, further demonstrating the utility of increasing replicate numbers in ChIP-seq experiments.
ChIP-seq; peak identification; biological replicates