The Model Organism ENCyclopedia of DNA Elements (modENCODE) project, funded by the National Institutes of Health (NIH), aims to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for the model organisms C. elegans (worm) and D. melanogaster (fly). With just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data have already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or to combine modENCODE data with other data sets. Unfortunately, this can be a time-consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
These resources provide a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to spend more of their time analyzing modENCODE data and less time moving it around.
The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at http://www.modencode.org.
Database URL: http://www.modencode.org.
Chromatin immunoprecipitation (ChIP), coupled with massively parallel short-read sequencing (seq), is used to probe chromatin dynamics. Although there are many algorithms to call peaks from ChIP-seq datasets, most are tuned either to handle punctate sites, such as transcription factor binding sites, or broad regions, such as histone modification marks; few can do both. Other algorithms are limited in their configurability, performance on large data sets, and ability to distinguish closely-spaced peaks.
In this paper, we introduce PeakRanger, a peak caller software package that works equally well on punctate and broad sites, can resolve closely-spaced peaks, has excellent performance, and is easily customized. In addition, PeakRanger can be run in a parallel cloud computing environment to obtain extremely high performance on very large data sets. We present a series of benchmarks to evaluate PeakRanger against 10 other peak callers, and demonstrate the performance of PeakRanger on both real and synthetic data sets. We also present real world usages of PeakRanger, including peak-calling in the modENCODE project.
Compared to other peak callers tested, PeakRanger offers improved resolution in distinguishing extremely closely-spaced peaks. PeakRanger has above-average spatial accuracy in terms of identifying the precise location of binding events. PeakRanger also has excellent sensitivity and specificity in all benchmarks evaluated. In addition, PeakRanger offers significant improvements in run time when running on a single processor system, and very marked improvements when allowed to take advantage of the MapReduce parallel environment offered by a cloud computing resource. PeakRanger can be downloaded at the official site of modENCODE project: http://www.modencode.org/software/ranger/
FlyBase (http://flybase.org) is the leading database and web portal for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. Whether you use the fruit fly as an experimental system or want to apply Drosophila biological knowledge to another field of study, FlyBase can help you successfully navigate the wealth of available Drosophila data. Here, we review the FlyBase web site with novice and less-experienced users of FlyBase in mind and point out recent developments stemming from the availability of genome-wide data from the modENCODE project. The first section of this paper explains the organization of the web site and describes the report pages available on FlyBase, focusing on the most popular, the Gene Report. The next section introduces some of the search tools available on FlyBase, in particular, our heavily used and recently redesigned search tool QuickSearch, found on the FlyBase homepage. The final section concerns genomic data, including recent modENCODE (http://www.modencode.org) data, available through our Genome Browser, GBrowse.
The Drosophila MSL complex mediates dosage compensation by increasing transcription of the single X chromosome in males approximately two-fold. This is accomplished through recognition of the X chromosome and subsequent acetylation of histone H4K16 on X-linked genes. Initial binding to the X is thought to occur at “entry sites” that contain a consensus sequence motif (“MSL recognition element” or MRE). However, this motif is only ∼2 fold enriched on X, and only a fraction of the motifs on X are initially targeted. Here we ask whether chromatin context could distinguish between utilized and non-utilized copies of the motif, by comparing their relative enrichment for histone modifications and chromosomal proteins mapped in the modENCODE project. Through a comparative analysis of the chromatin features in male S2 cells (which contain MSL complex) and female Kc cells (which lack the complex), we find that the presence of active chromatin modifications, together with an elevated local GC content in the surrounding sequences, has strong predictive value for functional MSL entry sites, independent of MSL binding. We tested these sites for function in Kc cells by RNAi knockdown of Sxl, resulting in induction of MSL complex. We show that ectopic MSL expression in Kc cells leads to H4K16 acetylation around these sites and a relative increase in X chromosome transcription. Collectively, our results support a model in which a pre-existing active chromatin environment, coincident with H3K36me3, contributes to MSL entry site selection. The consequences of MSL targeting of the male X chromosome include increase in nucleosome lability, enrichment for H4K16 acetylation and JIL-1 kinase, and depletion of linker histone H1 on active X-linked genes. Our analysis can serve as a model for identifying chromatin and local sequence features that may contribute to selection of functional protein binding sites in the genome.
The genomes of complex organisms encompass hundreds of millions of base pairs of DNA, and regulatory molecules must distinguish specific targets within this vast landscape. In general, regulatory factors find target genes through sequence-specific interactions with the underlying DNA. However, sequence-specific factors typically bind only a fraction of the candidate genomic regions containing their specific target sequence motif. Here we identify potential roles for chromatin environment and flanking sequence composition in helping regulatory factors find their appropriate binding sites, using targeting of the Drosophila dosage compensation complex as a model. The initial stage of dosage compensation involves binding of the Male Specific Lethal (MSL) complex to a sequence motif called the MSL recognition element (MRE). Using data from a large chromatin mapping effort (the modENCODE project), we successfully identify an active chromatin environment as predictive of selective MRE binding by the MSL complex. Our study provides a framework for using genome-wide datasets to analyze and predict functional protein–DNA binding site selection.
The functional repertoire of long intergenic noncoding RNA (lincRNA) molecules has begun to be elucidated in mammals. Determining the biological relevance and potential gene regulatory mechanisms of these enigmatic molecules would be expedited in a more tractable model organism, such as Drosophila melanogaster. To this end, we defined a set of 1,119 putative lincRNA genes in D. melanogaster using modENCODE whole transcriptome (RNA-seq) data. A large majority (1.1 of 1.3 Mb; 85%) of these bases were not previously reported by modENCODE as being transcribed. Significant selective constraint on the sequences of these loci predicts that virtually all have sustained functionality across the Drosophila clade. We observe biases in lincRNA genomic locations and expression profiles that are consistent with some of these lincRNAs being involved in the regulation of neighboring protein-coding genes with developmental functions. We identify lincRNAs that may be important in the developing nervous system and in male-specific organs, such as the testes. LincRNA loci were also identified whose positions, relative to nearby protein-coding loci, are equivalent between D. melanogaster and mouse. This study predicts that the genomes of not only vertebrates, such as mammals, but also an invertebrate (fruit fly) harbor large numbers of lincRNA loci. Our findings now permit exploitation of Drosophila genetics for the investigation of lincRNA mechanisms, including lincRNAs with potential functional analogues in mammals.
long intergenic noncoding RNAs; modENCODE; transcriptional regulation; evolution; development
Increasingly, high-dimensional genomics data are becoming available for many organisms. Here, we develop OrthoClust for simultaneously clustering data across multiple species. OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species. It outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific. We demonstrate the application of OrthoClust using the RNA-Seq expression profiles of Caenorhabditis elegans and Drosophila melanogaster from the modENCODE consortium. A potential application of cross-species modules is to infer putative analogous functions of uncharacterized elements like non-coding RNAs based on guilt-by-association.
Electronic supplementary material
The online version of this article (doi:10.1186/gb-2014-15-8-r100) contains supplementary material, which is available to authorized users.
Short-read high-throughput DNA sequencing technologies provide new tools to answer biological questions. However, high cost and low throughput limit their widespread use, particularly in organisms with smaller genomes such as S. cerevisiae. Although ChIP-Seq in mammalian cell lines is replacing array-based ChIP-chip as the standard for transcription factor binding studies, ChIP-Seq in yeast is still underutilized compared to ChIP-chip. We developed a multiplex barcoding system that allows simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and RNA PolII) and a reference sample (input DNA) in a single experiment and demonstrate its utility for rapid and accurate results at reduced costs.
We developed a barcoding ChIP-Seq method for the concurrent analysis of transcription factor binding sites in yeast. Our multiplex strategy generated high quality data that was indistinguishable from data obtained with non-barcoded libraries. None of the barcoded adapters induced differences relative to a non-barcoded adapter when applied to the same DNA sample. We used this method to map the binding sites for Cse4, Ste12 and Pol II throughout the yeast genome and we found 148 binding targets for Cse4, 823 targets for Ste12 and 2508 targets for PolII. Cse4 was strongly bound to all yeast centromeres as expected and the remaining non-centromeric targets correspond to highly expressed genes in rich media. The presence of Cse4 non-centromeric binding sites was not reported previously.
We designed a multiplex short-read DNA sequencing method to perform efficient ChIP-Seq in yeast and other small genome model organisms. This method produces accurate results with higher throughput and reduced cost. Given constant improvements in high-throughput sequencing technologies, increasing multiplexing will be possible to further decrease costs per sample and to accelerate the completion of large consortium projects such as modENCODE.
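The multiplexing idea described above can be illustrated with a minimal demultiplexing sketch: each read is assigned to a sample according to a short leading barcode, which is then trimmed. The barcode sequences, the sample mapping, and the function name `demultiplex` are hypothetical placeholders; the published adapter design is not reproduced here, and exact-match, fixed-length barcodes are an assumption.

```python
from collections import defaultdict

# Hypothetical barcodes; the real adapter sequences are not given in the text.
BARCODES = {
    "ACGT": "Ste12",
    "TGCA": "Cse4",
    "GATC": "PolII",
    "CTAG": "input",
}

def demultiplex(reads, barcodes=BARCODES):
    """Group reads by their leading barcode and strip the barcode.

    Reads whose prefix matches no barcode go to the 'unassigned' bin
    untouched. Assumes all barcodes share one length and match exactly.
    """
    length = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:length], "unassigned")
        # Trim the barcode only when it was recognized.
        bins[sample].append(read[length:] if sample != "unassigned" else read)
    return dict(bins)

reads = ["ACGTTTGACC", "TGCAGGATTA", "ACGTCCATGA", "NNNNACGTAC"]
result = demultiplex(reads)
```

A production pipeline would normally also tolerate sequencing errors in the barcode; that is deliberately left out of this sketch.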
Drosophila melanogaster polytene chromosomes display a specific banding pattern; the underlying genetic organization of this pattern has remained elusive for many years. In the present paper, we analyze 32 cytology-mapped polytene chromosome interbands. We estimated molecular locations of these interbands, described their molecular and genetic organization and demonstrated that polytene chromosome interbands contain the 5′ ends of housekeeping genes. As a rule, interbands display preferential “head-to-head” orientation of genes. They are enriched for “broad” class promoters characteristic of housekeeping genes and associate with open chromatin proteins and Origin Recognition Complex (ORC) components. In two regions, 10A and 100B, coding sequences of genes whose 5′-ends reside in interbands map to constantly loosely compacted, early-replicating, so-called “grey” bands. Comparison of expression patterns of genes mapping to late-replicating dense bands vs genes whose promoter regions map to interbands shows that the former are generally tissue-specific, whereas the latter are represented by ubiquitously active genes. Analysis of RNA-seq data (modENCODE-FlyBase) indicates that transcripts from interband-mapping genes are present in most tissues and cell lines studied, across most developmental stages and upon various treatment conditions. We developed a special algorithm to computationally process protein localization data generated by the modENCODE project and show that the Drosophila genome has about 5700 sites that demonstrate all the features shared by the interbands cytologically mapped to date.
The Myc family of transcription factors regulates a variety of biological processes, including the cell cycle, growth, proliferation, metabolism, and apoptosis. In Caenorhabditis elegans, the “Myc interaction network” consists of two opposing heterodimeric complexes with antagonistic functions in transcriptional control: the Myc-Mondo:Mlx transcriptional activation complex and the Mad:Max transcriptional repression complex. In C. elegans, Mondo, Mlx, Mad, and Max are encoded by mml-1, mxl-2, mdl-1, and mxl-1, respectively. Here we show a similar antagonistic role for the C. elegans Myc-Mondo and Mad complexes in longevity control. Loss of mml-1 or mxl-2 shortens C. elegans lifespan. In contrast, loss of mdl-1 or mxl-1 increases longevity, dependent upon MML-1:MXL-2. The MML-1:MXL-2 and MDL-1:MXL-1 complexes function in both the insulin signaling and dietary restriction pathways. Furthermore, decreased insulin-like/IGF-1 signaling (ILS) or conditions of dietary restriction increase the accumulation of MML-1, consistent with the notion that the Myc family members function as sensors of metabolic status. Additionally, we find that Myc family members are regulated by distinct mechanisms, which would allow for integrated control of gene expression from diverse signals of metabolic status. We compared putative target genes based on ChIP-sequencing data in the modENCODE project and found significant overlap in genomic DNA binding between the major effectors of ILS (DAF-16/FoxO), DR (PHA-4/FoxA), and Myc family (MDL-1/Mad/Mxd) at common target genes, which suggests that diverse signals of metabolic status converge on overlapping transcriptional programs that influence aging. Consistent with this, there is over-enrichment at these common targets for genes that function in lifespan, stress response, and carbohydrate metabolism. Additionally, we find that Myc family members are also involved in stress response and the maintenance of protein homeostasis. 
Collectively, these findings indicate that Myc family members integrate diverse signals of metabolic status, to coordinate overlapping metabolic and cytoprotective transcriptional programs that determine the progression of aging.
Transcription factors are essential proteins that regulate the expression of genes and play an important role in most biological processes. The results of our study presented here demonstrate for the first time a role in aging for a small family of transcription factors in the nematode worm Caenorhabditis elegans. Importantly, these proteins have close relatives in higher organisms, including humans, that influence metabolism and cell replication and have been implicated in the development of cancer. Moreover, the loss of one homologue has also been implicated in Williams-Beuren syndrome, a disease characterized in part by signs of premature aging. Our data demonstrate that these transcription factors function within insulin/IGF-1 signaling and dietary restriction, two highly conserved pathways that link nutrient sensing to longevity. Taken together, our findings provide exciting new insight into a family of proteins that may be essential for linking nutrient sensing to longevity and have implications for the improvement of human healthspan.
Development of tools to jointly visualize the genome and the epigenome remains a challenge. chroGPS is a computational approach that addresses this question. chroGPS uses multidimensional scaling techniques to represent similarity between epigenetic factors, or between genetic elements on the basis of their epigenetic state, in 2D/3D reference maps. We emphasize biological interpretability, statistical robustness, integration of genetic and epigenetic data from heterogeneous sources, and computational feasibility. Although chroGPS is a general methodology to create reference maps and study the epigenetic state of any class of genetic element or genomic region, we focus on two specific kinds of maps: chroGPSfactors, which visualizes functional similarities between epigenetic factors, and chroGPSgenes, which describes the epigenetic state of genes and integrates gene expression and other functional data. We use data from the modENCODE project on the genomic distribution of a large collection of epigenetic factors in Drosophila, a model system extensively used to study genome organization and function. Our results show that the maps allow straightforward visualization of relationships between factors and elements, capturing relevant information about their functional properties that helps to interpret epigenetic information in a functional context and derive testable hypotheses.
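To make the multidimensional scaling step concrete, the following is a minimal, dependency-free sketch that embeds items in 2D from a pairwise distance matrix. It uses classical (Torgerson) scaling as a stand-in, since the text does not say which MDS variant chroGPS uses; the function name `classical_mds`, the power-iteration eigen-solver, and the toy distances are all assumptions made for illustration.

```python
import math

def classical_mds(dist, dims=2, iterations=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix.

    Classical scaling: double-centre the squared distances, then take the
    leading eigenvectors, found here by power iteration with deflation.
    Assumes roughly Euclidean distances (non-negative leading eigenvalues).
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # b is the Gram matrix of the centred coordinates.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Non-constant start vector: a constant vector lies in the null
        # space of b and would never pick up an eigenvector.
        v = [float(i + 1) for i in range(n)]
        eigenvalue = 0.0
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                eigenvalue = 0.0
                break
            v = [x / norm for x in w]
            eigenvalue = norm
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords

# Toy input: four items whose distances come from a 4 x 1 rectangle.
points = [(0, 0), (4, 0), (0, 1), (4, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist)
```

The embedding is only defined up to rotation and reflection, so it is the pairwise distances between embedded items, not the coordinates themselves, that should be compared with the input.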
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
Target-specific antibodies are pivotal for the design of vaccines, immunodiagnostic tests, studies on proteomics for cancer biomarker discovery, identification of protein-DNA and other interactions, and small and large biochemical assays. Therefore, it is important to understand the properties of protein sequences that are important for antigenicity and to identify small peptide epitopes and large regions in the linear sequence of the proteins whose utilization results in specific antibodies.
Our analysis using protein properties suggested that sequence composition combined with evolutionary information and predicted secondary structure, as well as solvent accessibility is sufficient to predict successful peptide epitopes. The antigenicity and the specificity in immune response were also found to depend on the epitope length. We trained the B-Cell Epitope Oracle (BEOracle), a support vector machine (SVM) classifier, for the identification of continuous B-Cell epitopes with these protein properties as learning features. The BEOracle achieved an F1-measure of 81.37% on a large validation set. The BEOracle classifier outperformed the classical methods based on propensity and sophisticated methods like BCPred and Bepipred for B-Cell epitope prediction. The BEOracle classifier also identified peptides for the ChIP-grade antibodies from the modENCODE/ENCODE projects with 96.88% accuracy. High BEOracle score for peptides showed some correlation with the antibody intensity on Immunofluorescence studies done on fly embryos. Finally, a second SVM classifier, the B-Cell Region Oracle (BROracle) was trained with the BEOracle scores as features to predict the performance of antibodies generated with large protein regions with high accuracy. The BROracle classifier achieved accuracies of 75.26-63.88% on a validation set with immunofluorescence, immunohistochemistry, protein arrays and western blot results from Protein Atlas database.
Together our results suggest that antigenicity is a local property of the protein sequences and that protein sequence properties of composition, secondary structure, solvent accessibility and evolutionary conservation are the determinants of antigenicity and specificity in immune response. Moreover, specificity in immune response could also be accurately predicted for large protein regions without the knowledge of the protein tertiary structure or the presence of discontinuous epitopes. The dataset prepared in this work and the classifier models are available for download at https://sites.google.com/site/oracleclassifiers/.
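For readers unfamiliar with the F1-measure quoted above, it is the harmonic mean of precision and recall. The short sketch below computes it from confusion-matrix counts; the function name `f1_score` and the example counts are hypothetical and are not taken from the BEOracle validation set.

```python
def f1_score(tp, fp, fn):
    """Return the F1-measure from true positive, false positive
    and false negative counts (0.0 when there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts chosen so that precision and recall are both 0.8.
score = f1_score(tp=80, fp=20, fn=20)
```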
Chromatin immunoprecipitation (ChIP) is widely used to identify chromosomal binding sites. Chromatin proteins are cross-linked to their target sequences in living cells. The purified chromatin is sheared and the relevant protein is enriched by immunoprecipitation with specific antibodies. The co-purifying genomic DNA is then identified by massively parallel sequencing (ChIP-seq).
We applied ChIP-seq to map the chromosomal binding sites for two ISWI-containing nucleosome remodeling factors, ACF and RSF, in Drosophila embryos. Employing several polyclonal and monoclonal antibodies directed against their signature subunits, ACF1 and RSF-1, robust profiles were obtained indicating that both remodelers co-occupied a large set of active promoters.
Further validation included controls using chromatin of mutant embryos that do not express ACF1 or RSF-1. Surprisingly, the ChIP-seq profiles were unchanged, suggesting that they were not due to specific immunoprecipitation. Conservative analysis lists about 3000 chromosomal loci, mostly active promoters that are prone to non-specific enrichment in ChIP and appear as ‘Phantom Peaks’. These peaks are not obtained with pre-immune serum and are not prominent in input chromatin.
Mining the modENCODE ChIP-seq profiles identifies potential Phantom Peaks in many profiles of epigenetic regulators. These profiles and other ChIP-seq data featuring prominent Phantom Peaks must be validated with chromatin from cells in which the protein of interest has been depleted.
Motivation: Genome-wide mapping of chromatin states is essential for defining regulatory elements and inferring their activities in eukaryotic genomes. A number of hidden Markov model (HMM)-based methods have been developed to infer chromatin state maps from genome-wide histone modification data for an individual genome. To perform a principled comparison of evolutionarily distant epigenomes, we must consider species-specific biases such as differences in genome size, strength of signal enrichment and co-occurrence patterns of histone modifications.
Results: Here, we present a new Bayesian non-parametric method called hierarchically linked infinite HMM (hiHMM) to jointly infer chromatin state maps in multiple genomes (different species, cell types and developmental stages) using genome-wide histone modification data. This flexible framework provides a new way to learn a consistent definition of chromatin states across multiple genomes, thus facilitating a direct comparison among them. We demonstrate the utility of this method using synthetic data as well as multiple modENCODE ChIP-seq datasets.
Conclusion: The hierarchical and Bayesian non-parametric formulation in our approach is an important extension to the current set of methodologies for comparative chromatin landscape analysis.
Availability and implementation: Source codes are available at https://github.com/kasohn/hiHMM. Chromatin data are available at http://encode-x.med.harvard.edu/data_sets/chromatin/.
Supplementary data are available at Bioinformatics online.
Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide [1,2] has successfully identified specific subtypes of regulatory elements [3]. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb-Response Elements [4], chromatin states [5], transcription factor binding sites (TFBS) [6–9], PolII regulation [8], and insulator elements [10]; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome based on more than 300 chromatin immuno-precipitation (ChIP) datasets for eight chromatin features, five histone deacetylases (HDACs) and thirty-eight site-specific transcription factors (TFs) at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and we validated a subset of predictions for promoters, enhancers, and insulators in vivo. We also identified nearly 2,000 genomic regions of dense TF binding associated with chromatin activity and accessibility. We discovered hundreds of new TF co-binding relationships and defined a TF network with over 800 potential regulatory relationships.
We have implemented aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 genomes project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation–i.e. how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts.
Availability: ACT is available at http://act.gersteinlab.org
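The aggregation step described above, averaging a signal track in windows around anchor points, can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (a single chromosome held as a Python list, no strand handling, anchors too close to an end are skipped); the function name `aggregate_profile` and the toy signal are hypothetical and do not reflect ACT's actual interface.

```python
def aggregate_profile(signal, anchors, flank):
    """Average a signal track in windows of +/- flank positions around anchors.

    signal  : list of per-base values for one chromosome
    anchors : list of 0-based positions (e.g. transcription start sites)
    Anchors whose window would run off either end of the track are skipped.
    Returns a list of length 2 * flank + 1, or None if no anchor is usable.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for pos in anchors:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(signal):
            continue
        for offset, value in enumerate(signal[start:end]):
            totals[offset] += value
        used += 1
    if used == 0:
        return None
    return [total / used for total in totals]

# Toy track with two small peaks; the anchor at 9 is too close to the end.
signal = [0, 1, 4, 1, 0, 0, 2, 6, 2, 0]
profile = aggregate_profile(signal, anchors=[2, 7, 9], flank=1)
```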
We present a network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets. The network, namely the integrated regulatory network, consists of three major types of regulation: TF→gene, TF→miRNA and miRNA→gene. We identified the target genes and target miRNAs for a set of TFs based on the ChIP-Seq binding profiles, and the predicted targets of miRNAs using annotated 3′UTR sequences and conservation information. Making use of the system-wide RNA-Seq profiles, we classified transcription factors into positive and negative regulators and assigned a sign for each regulatory interaction. Other types of edges such as protein-protein interactions and potential intra-regulations between miRNAs based on the embedding of miRNAs in their host genes were further incorporated. We examined the topological structures of the network, including its hierarchical organization and motif enrichment. We found that transcription factors downstream in the hierarchy distinguish themselves by being expressed more uniformly across various tissues, having more interacting partners, and being more likely to be essential. We found an over-representation of notable network motifs, including a feed-forward loop (FFL) in which a miRNA cost-effectively shuts down a transcription factor and its target. We used data of C. elegans from the modENCODE project as a primary model to illustrate our framework, but further verified the results using two other data sets. As more and more genome-wide ChIP-Seq and RNA-Seq data become available in the near future, our methods of data integration have various potential applications.
The precise control of gene expression lies at the heart of many biological processes. In eukaryotes, the regulation is performed at multiple levels, mediated by different regulators such as transcription factors and miRNAs, each distinguished by different spatial and temporal characteristics. These regulators are further integrated to form a complex regulatory network responsible for the orchestration. The construction and analysis of such networks is essential for understanding the general design principles. Recent advances in high-throughput techniques like ChIP-Seq and RNA-Seq provide an opportunity by offering a huge amount of binding and expression data. We present a general framework to combine these types of data into an integrated network and perform various topological analyses, including its hierarchical organization and motif enrichment. We find that the integrated network possesses an intrinsic hierarchical organization and is enriched in several network motifs that include both transcription factors and miRNAs. We further demonstrate that the framework can be easily applied to other species like human and mouse. As more and more genome-wide ChIP-Seq and RNA-Seq data are going to be generated in the near future, our methods of data integration have various potential applications.
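One of the network motifs discussed above is the feed-forward loop, in which a regulator targets both a second regulator and that regulator's target. As a rough sketch of how such motifs can be enumerated in a directed network, the code below lists every triple (a, b, c) with edges a→b, b→c and a→c. The function name and the toy edge list are hypothetical, and the brute-force search is cubic in the number of nodes, so it is meant for illustration rather than genome-scale networks; assessing enrichment would additionally require comparison against randomized networks, which is not shown.

```python
from itertools import permutations

def feed_forward_loops(edges):
    """Return all (a, b, c) triples with directed edges a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    loops = []
    for a, b, c in permutations(sorted(nodes), 3):
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set:
            loops.append((a, b, c))
    return loops

# Hypothetical toy network: node names are placeholders, not modENCODE results.
edges = [
    ("miR-x", "TF1"),    # miRNA -> TF
    ("TF1", "geneA"),    # TF -> gene
    ("miR-x", "geneA"),  # miRNA -> gene, closing the loop
    ("TF1", "geneB"),    # no miRNA edge to geneB, so no loop here
]
loops = feed_forward_loops(edges)
```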
Motivation: The highly coordinated expression of thousands of genes in an organism is regulated by the concerted action of transcription factors, chromatin proteins and epigenetic mechanisms. High-throughput experimental data for genome wide in vivo protein–DNA interactions and epigenetic marks are becoming available from large projects, such as the model organism ENCyclopedia Of DNA Elements (modENCODE) and from individual labs. Dissemination and visualization of these datasets in an explorable form is an important challenge.
Results: To support research on Drosophila melanogaster transcription regulation and make the genome wide in vivo protein–DNA interactions data available to the scientific community as a whole, we have developed a system called Flynet. Currently, Flynet contains 101 datasets for 38 transcription factors and chromatin regulator proteins in different experimental conditions. These factors exhibit different types of binding profiles ranging from sharp localized peaks to broad binding regions. The protein–DNA interaction data in Flynet was obtained from the analysis of chromatin immunoprecipitation experiments on one color and two color genomic tiling arrays as well as chromatin immunoprecipitation followed by massively parallel sequencing. A web-based interface, integrated with an AJAX based genome browser, has been built for queries and presenting analysis results. Flynet also makes available the cis-regulatory modules reported in literature, known and de novo identified sequence motifs across the genome, and other resources to study gene regulation.
Availability: Flynet is available at https://www.cistrack.org/flynet/.
Supplementary information: Supplementary data are available at Bioinformatics online.
The Motif Enrichment Tool (MET) provides an online interface that enables users to find major transcriptional regulators of their gene sets of interest. MET searches the appropriate regulatory region around each gene and identifies which transcription factor DNA-binding specificities (motifs) are statistically overrepresented. Motif enrichment analysis is currently available for many metazoan species including human, mouse, fruit fly, planaria and flowering plants. MET also leverages high-throughput experimental data such as ChIP-seq and DNase-seq from ENCODE and ModENCODE to identify the regulatory targets of a transcription factor with greater precision. The results from MET are produced in real time and are linked to a genome browser for easy follow-up analysis. Use of the web tool is free and open to all, and there is no login requirement. Address: http://veda.cs.uiuc.edu/MET/.
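The text does not state which statistic MET uses to call a motif overrepresented. One common choice for this kind of gene-set question is the one-sided hypergeometric test, sketched below under that assumption. The function name `hypergeom_pvalue` and the example counts are hypothetical and are not taken from MET.

```python
from math import comb


def hypergeom_pvalue(population, motif_genes, set_size, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    population  : total number of genes considered
    motif_genes : genes in the population whose regulatory region
                  contains the motif
    set_size    : size of the user's gene set
    overlap     : genes in the user's set that contain the motif
    """
    upper = min(motif_genes, set_size)
    total = comb(population, set_size)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(motif_genes, i) * comb(population - motif_genes, set_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 genes in total, 4 carry the motif,
# and 2 of the 3 genes in the user's set carry it.
p = hypergeom_pvalue(population=10, motif_genes=4, set_size=3, overlap=2)
print(p)
```

A small p-value means that the observed overlap would be unlikely if the gene set had been drawn at random from the population. When many motifs are tested, a multiple-testing correction would normally follow.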
We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3′ UTRs (up to 15–18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.
transcriptome; alternative splice; lncRNA; transcription start site; exon junction
Sequence-specific transcription factors (TFs) are critical for specifying patterns and levels of gene expression, but target DNA elements are not sufficient to specify TF binding in vivo. In eukaryotes, the binding of a TF is in competition with a constellation of other proteins, including histones, which package DNA into nucleosomes. We used the ChIP-seq assay to examine the genome-wide distribution of Drosophila Heat Shock Factor (HSF), a TF whose binding activity is mediated by heat shock-induced trimerization. HSF binds to 464 sites after heat shock, the vast majority of which contain HSF Sequence-binding Elements (HSEs). HSF-bound sequence motifs represent only a small fraction of the total HSEs present in the genome. ModENCODE ChIP-chip datasets, generated during non-heat shock conditions, were used to show that inducibly bound HSE motifs are associated with histone acetylation, H3K4 trimethylation, RNA Polymerase II, and coactivators, compared to HSE motifs that remain HSF-free. Furthermore, directly changing the chromatin landscape, from an inactive to an active state, permits inducible HSF binding. There is a strong correlation of bound HSEs to active chromatin marks present prior to induced HSF binding, indicating that an HSE's residence in “active” chromatin is a primary determinant of whether HSF can bind following heat shock.
Many Transcription Factors (TFs) have been shown to bind DNA in a sequence-specific manner. However, only a subset of possible binding sites are occupied in vivo, and it remains unclear how TFs discriminate between sequences of equal predicted binding affinity. We set out to determine how a specific TF, Heat Shock Factor (HSF), distinguishes between utilized and unused potential binding sites. HSF is uniquely suited to studying this problem, because HSF is inactive and weakly bound to DNA in unstressed cells, and upon stress HSF becomes active and binds strongly to DNA. We compared the properties of the unstressed chromatin between the sites that become HSF-bound and those that remain HSF-free following stress activation. We find that sites that are destined to be bound strongly by HSF after stress are associated with distinct chromatin marks compared to sites that are unoccupied by HSF after heat shock. Furthermore, the chromatin landscape can be changed from a restrictive to a permissive state, allowing inducible HSF binding. These findings suggest that TF binding sites can be predicted based on the chromatin signatures present prior to induced TF recruitment.
Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) allows genome-wide discovery of protein-DNA interactions such as transcription factor binding and histone modifications. Previous reports compared only a small number of profiles, and little has been done to compare histone modification profiles generated by the two technologies or to assess the impact of input DNA libraries in ChIP-seq analysis. Here, we performed a systematic analysis of a modENCODE dataset consisting of 31 pairs of ChIP-chip/ChIP-seq profiles of the coactivator CBP, RNA polymerase II (RNA PolII), and six histone modifications across four developmental stages of Drosophila melanogaster.
Both technologies produce highly reproducible profiles within each platform. ChIP-seq generally produces profiles with a better signal-to-noise ratio and allows detection of more peaks and narrower peaks. The sets of peaks identified by the two technologies can be significantly different, but the extent to which they differ varies depending on the factor and the analysis algorithm. Importantly, we found that there is significant variation among multiple sequencing profiles of input DNA libraries and that this variation most likely arises from differences in both experimental conditions and sequencing depth. We further show that using an inappropriate input DNA profile can impact the average signal profiles around genomic features and peak calling results, highlighting the importance of having high quality input DNA data for normalization in ChIP-seq analysis.
Our findings highlight the biases present in each of the platforms, show the variability that can arise from both technology and analysis methods, and emphasize the importance of obtaining high quality and deeply sequenced input DNA libraries for ChIP-seq analysis.
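To make concrete why the input DNA library matters for normalization, the sketch below computes a depth-scaled log2 enrichment of ChIP over input in fixed genomic bins. It is a simplified illustration, not the normalization procedure used in the study: the function name `log2_enrichment`, the pseudocount, and the bin counts are assumptions chosen for this example.

```python
import math


def log2_enrichment(chip_counts, input_counts, pseudocount=1.0):
    """Per-bin log2(ChIP / input) after scaling input to the ChIP depth.

    Both arguments are equal-length sequences of read counts per bin.
    The input library is rescaled so that its total matches the ChIP
    total, and a pseudocount avoids division by zero in empty bins.
    """
    if len(chip_counts) != len(input_counts):
        raise ValueError("ChIP and input must cover the same bins")
    input_total = sum(input_counts)
    if input_total == 0:
        raise ValueError("input library has no reads")
    scale = sum(chip_counts) / input_total
    return [
        math.log2((c + pseudocount) / (i * scale + pseudocount))
        for c, i in zip(chip_counts, input_counts)
    ]


# Hypothetical read counts in three bins.
chip = [10, 40, 10]
shallow_input = [20, 20, 20]
deep_input = [40, 40, 40]  # same profile, sequenced twice as deeply

ratios_shallow = log2_enrichment(chip, shallow_input)
ratios_deep = log2_enrichment(chip, deep_input)
print(ratios_shallow)
```

Scaling by total depth removes the difference between the two input libraries here because they have the same shape. It cannot correct an input profile whose shape differs, for example because of different experimental conditions, which is the situation the findings above warn about.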
The E2F transcription factors are important regulators of the cell cycle whose function is commonly misregulated in cancer. To identify novel regulators of E2F1 activity in vivo, we used Drosophila to conduct genetic screens. For this, we generated transgenic lines that allow the tissue-specific depletion of dE2F1 by RNAi. Expression of these transgenes using Gal4 drivers in the eyes and wings generated reliable and modifiable phenotypes. We then conducted genetic screens testing the capacity of Exelixis deficiencies to modify these E2F1-RNAi phenotypes. From these screens, we identified mutant alleles of Suppressor of zeste 2 [Su(z)2] and multiple Polycomb group genes as strong suppressors of the E2F1-RNA interference phenotypes. In validation of our genetic data, we find that depleting Su(z)2 in cultured Drosophila cells rescues the cell-proliferation defects caused by reduction of dE2F1, and does so by elevating the level of dE2f1. Furthermore, analyses of the methylation status of histone H3 lysine 27 (H3K27me) from the published modENCODE data sets suggest that the genomic regions harboring the dE2f1 gene and certain dE2f1 target genes display H3K27me during development and in several Drosophila cell lines. These in vivo observations suggest that the Polycomb group may regulate cell proliferation by repressing the transcription of dE2f1 and certain dE2F1 target genes. This mechanism may play an important role in coordinating cellular differentiation and proliferation during Drosophila development.
cell proliferation; E2F1; Su(z)2; PcG; Drosophila
Many host-adapted bacterial pathogens contain DNA methyltransferases (mod genes) that are subject to phase-variable expression (high-frequency reversible ON/OFF switching of gene expression). In Haemophilus influenzae, the random switching of the modA gene controls expression of a phase-variable regulon of genes (a “phasevarion”), via differential methylation of the genome in the modA ON and OFF states. Phase-variable mod genes are also present in Neisseria meningitidis and Neisseria gonorrhoeae, suggesting that phasevarions may occur in these important human pathogens. Phylogenetic studies on phase-variable mod genes associated with type III restriction modification (R-M) systems revealed that these organisms have two distinct mod genes: modA and modB. There are also distinct alleles of modA (abundant: modA11, 12, 13; minor: modA4, 15, 18) and modB (modB1, 2). These alleles differ only in their DNA recognition domain. ModA11 was only found in N. meningitidis and modA13 only in N. gonorrhoeae. The recognition site for the modA13 methyltransferase in N. gonorrhoeae strain FA1090 was identified as 5′-AGAAA-3′. Mutant strains lacking the modA11, 12 or 13 genes were made in N. meningitidis and N. gonorrhoeae and their phenotype analyzed in comparison to a corresponding mod ON wild-type strain. Microarray analysis revealed that in all three modA alleles multiple genes were either upregulated or downregulated, some of which were virulence-associated. For example, in N. meningitidis MC58 (modA11), differentially expressed genes included those encoding the candidate vaccine antigens lactoferrin binding proteins A and B. Functional studies using N. gonorrhoeae FA1090 and the clinical isolate O1G1370 confirmed that modA13 ON and OFF strains have distinct phenotypes in antimicrobial resistance, in a primary human cervical epithelial cell model of infection, and in biofilm formation. This study, in conjunction with our previous work in H. influenzae, indicates that phasevarions may be a common strategy used by host-adapted bacterial pathogens to randomly switch between “differentiated” cell types.
The pathogenic Neisseria are bacterial pathogens that cause meningitis and gonorrhoea. They have adapted to life exclusively in humans and have developed unique strategies to colonize the host and to evade the immune response. Central among these strategies are genetic switches that randomly turn genes on and off. In most cases, the genes controlled by these switches, contingency genes, are required for making bacterial surface structures. Recently we described a new class of contingency gene that methylates DNA. Rather than affecting the synthesis of a single surface structure, on/off switching of this DNA-methyltransferase gene leads to random switching of multiple genes. In this study, we have shown that this mechanism exists in all pathogenic Neisseria, and alters expression of multiple genes in all cases we examined. The two distinct populations of bacteria generated by this process had different behavior in model systems of colonization and infection. Understanding this process is key to understanding these human pathogens, and to developing strategies for treatment and prevention of the diseases they cause.