As the cost of sequencing continues to decrease and the amount of sequence data generated grows, new paradigms for data storage and analysis are increasingly important. The relative scaling behavior of these evolving technologies will impact genomics research moving forward.
The immense inter- and intra-cellular heterogeneity of the central nervous system (CNS) presents major challenges for high-throughput *omic analyses. Transcriptional, translational, and post-translational regulatory events are localised to specific neuronal cell-types, or sub-cellular compartments, resulting in discrete patterns of protein expression and activity. A spatial and quantitative knowledge of the “neuroproteome” is therefore critical to understanding normal and pathological aspects of functional genomics and anatomy of the CNS. Improvements in mass-spectrometry allow profiling of proteins at sufficient depth to complement results from high-throughput genomic and transcriptomic assays. However, there are challenges in integrating proteomic data with other data modalities and even greater challenges obtaining comprehensive neuroproteomic data with cell-type specificity. Here we discuss how proteomics should be exploited to enhance high-throughput functional genomic analysis by tighter integration of data analyses. We also discuss experimental strategies to achieve finer cellular and sub-cellular resolution in transcriptomic and proteomic studies of neural tissues.
The large diversity and volume of extracellular RNA (exRNA) data that will form the basis of the exRNA Atlas generated by the Extracellular RNA Communication Consortium pose a substantial data integration challenge. Here we present the strategy that is being implemented by the exRNA Data Management and Resource Repository, which employs metadata, biomedical ontologies and Linked Data technologies, such as Resource Description Framework, to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis. We focus on the following three specific data integration tasks: (a) selection of samples from a virtual biorepository for exRNA profiling and for inclusion in the exRNA Atlas; (b) retrieval of a data slice from the exRNA Atlas for integrative analysis; and (c) interpretation of exRNA analysis results in the context of pathways and networks. As exRNA profiling gains wide adoption in the research community, we anticipate that the strategies discussed here will increasingly be required to enable data reuse and to facilitate integrative analysis of exRNA data.
ERC Consortium; DMRR; exRNA; exRNA Atlas; exRNA Portal
The topology of the gene-regulatory network has been extensively analyzed. Now, given the large amount of available functional genomic data, it is possible to go beyond this and systematically study regulatory circuits in terms of logic elements. To this end, we present Loregic, a computational method integrating gene expression and regulatory network data, to characterize the cooperativity of regulatory factors. Loregic uses all 16 possible two-input-one-output logic gates (e.g. AND or XOR) to describe triplets of two factors regulating a common target. We attempt to find the gate that best matches each triplet’s observed gene expression pattern across many conditions. We make Loregic available as a general-purpose tool (github.com/gersteinlab/loregic). We validate it with known yeast transcription-factor knockout experiments. Next, using human ENCODE ChIP-Seq and TCGA RNA-Seq data, we are able to demonstrate how Loregic characterizes complex circuits involving both proximally and distally regulating transcription factors (TFs) and also miRNAs. Furthermore, we show that MYC, a well-known oncogenic driving TF, can be modeled as acting independently from other TFs (e.g., using OR gates) but antagonistically with repressing miRNAs. Finally, we inter-relate Loregic’s gate logic with other aspects of regulation, such as indirect binding via protein-protein interactions, feed-forward loop motifs and global regulatory hierarchy.
Gene expression is controlled by various gene regulatory factors. These factors work cooperatively, forming a complex genome-wide regulatory circuit. Disruption of regulatory cooperativity may lead to abnormal gene expression, as seen in diseases such as cancer. Traditional experimental methods, however, can only identify small-scale regulatory activity. Thus, to systematically understand the cooperativity between and among different types of regulatory factors, we need efficient and systematic computational methods. Regulatory circuits have been suggested to behave analogously to electronic circuits, in which a wide variety of electronic elements work in coordination to function correctly. Recently, the increasing amount of next-generation sequencing data has provided a great resource for studying regulatory activity. Thus, we developed a general-purpose computational method using logic-circuit models from electronics and applied it to a human leukemia dataset, identifying the genome-wide cooperativity of transcription factors and microRNAs.
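The gate-matching idea described above can be illustrated with a small sketch. This is not Loregic's implementation: it assumes expression has already been binarized to 0/1 per sample, and it scores each gate simply by the fraction of samples whose target value agrees with the gate's truth table. The function names (`all_gates`, `best_gate`) and the scoring rule are assumptions made for illustration.

```python
from itertools import product

# The four possible (factor1, factor2) input pairs, in a fixed order.
INPUTS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

# Human-readable names for a few of the 16 truth tables (illustrative only).
NAMED = {(0, 0, 0, 1): "AND", (0, 1, 1, 1): "OR", (0, 1, 1, 0): "XOR"}


def all_gates():
    """Enumerate all 16 two-input-one-output gates as truth tables.

    Each gate is keyed by its output tuple over INPUTS and maps an
    input pair to the gate's output.
    """
    return {outputs: dict(zip(INPUTS, outputs))
            for outputs in product((0, 1), repeat=4)}


def best_gate(rf1, rf2, target):
    """Return (gate_key, score) for the gate that matches the most samples.

    rf1, rf2, target are equal-length sequences of 0/1 values, one per
    sample (assumed to be binarized expression; hypothetical input format).
    """
    samples = list(zip(rf1, rf2, target))
    best = None
    for key, table in all_gates().items():
        matched = sum(table[(a, b)] == t for a, b, t in samples)
        score = matched / len(samples)
        if best is None or score > best[1]:
            best = (key, score)
    return best


if __name__ == "__main__":
    rf1 = [0, 0, 1, 1, 1]
    rf2 = [0, 1, 0, 1, 1]
    target = [0, 0, 0, 1, 1]  # target is on only when both factors are on
    key, score = best_gate(rf1, rf2, target)
    print(NAMED.get(key, key), score)
```

In practice a triplet whose best score is low, or whose best gate is not unique, would presumably be treated as not matching any single gate; that decision rule is left out of this sketch.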
Despite the large evolutionary distances, metazoan species show remarkable commonalities, which has helped establish fly and worm as model organisms for human biology [1,2]. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. We mapped the genome-wide binding locations of 165 human, 93 worm, and 52 fly transcription-regulatory factors (RFs) generating a total of 1,019 data sets from diverse cell-types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous RF families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development, and disease.
Transcription Factor; Regulatory Information; Gene Regulation; Single Nucleotide Polymorphisms; ChIP-seq
We present MUSIC, a signal processing approach for identification of enriched regions in ChIP-Seq data, available at music.gersteinlab.org. MUSIC first filters the ChIP-Seq read-depth signal for systematic noise from non-uniform mappability, which fragments enriched regions. Then it performs a multiscale decomposition, using median filtering, identifying enriched regions at multiple length scales. This is useful given the wide range of scales probed in ChIP-Seq assays. MUSIC performs favorably in terms of accuracy and reproducibility compared with other methods. In particular, analysis of RNA polymerase II data reveals a clear distinction between the stalled and elongating forms of the polymerase.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0474-3) contains supplementary material, which is available to authorized users.
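The multiscale median-filtering step in the MUSIC abstract above can be illustrated with a toy sketch. It is not the MUSIC implementation: the mappability correction is omitted, the signal is a plain Python list, and the function names, the `half_window` parameter and the fixed `threshold` are all assumptions made for illustration. The point it shows is that a wider median window bridges short dips that would otherwise fragment an enriched region.

```python
from statistics import median


def median_filter(signal, half_window):
    """Median-smooth a signal; windows are truncated at the edges."""
    n = len(signal)
    return [median(signal[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]


def enriched_regions(signal, half_window, threshold):
    """Return half-open (start, end) intervals where the smoothed signal
    is at or above the threshold."""
    smoothed = median_filter(signal, half_window)
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions


def multiscale_regions(signal, half_windows, threshold):
    """Call enriched regions separately at each smoothing scale."""
    return {w: enriched_regions(signal, w, threshold) for w in half_windows}


if __name__ == "__main__":
    depth = [0, 0, 5, 5, 0, 5, 5, 0, 0, 0]  # one-position dip at index 4
    print(multiscale_regions(depth, half_windows=[0, 1], threshold=3))
```

With no smoothing (`half_window=0`) the dip splits the signal into two regions; with a three-position window the two are merged into one.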
Transcription factors (TFs) bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 TFs in 458 ChIP-Seq experiments. We found the combinatorial co-association of TFs to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the TF binding into a hierarchy and integrated it with other genomic information (e.g. miRNA regulation), forming a dense meta-network. Factors at different levels have different properties: for instance, top-level TFs more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs -- e.g. noise-buffering feed-forward loops. Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (i.e., differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
Increasingly, high-dimensional genomics data are becoming available for many organisms. Here, we develop OrthoClust for simultaneously clustering data across multiple species. OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species. It outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific. We demonstrate the application of OrthoClust using the RNA-Seq expression profiles of Caenorhabditis elegans and Drosophila melanogaster from the modENCODE consortium. A potential application of cross-species modules is to infer putative analogous functions of uncharacterized elements like non-coding RNAs based on guilt-by-association.
Electronic supplementary material
The online version of this article (doi:10.1186/gb-2014-15-8-r100) contains supplementary material, which is available to authorized users.
Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment.
Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific sub-cellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic sub-cellular localizations are also poorly understood. Since RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell’s regulatory capabilities are focused on its synthesis, processing, transport, modifications and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.
Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.
As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions.
Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
We present a network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets. The network, namely the integrated regulatory network, consists of three major types of regulation: TF→gene, TF→miRNA and miRNA→gene. We identified the target genes and target miRNAs for a set of TFs based on the ChIP-Seq binding profiles, and the predicted targets of miRNAs using annotated 3′UTR sequences and conservation information. Making use of the system-wide RNA-Seq profiles, we classified transcription factors into positive and negative regulators and assigned a sign for each regulatory interaction. Other types of edges, such as protein-protein interactions and potential intra-regulations between miRNAs based on the embedding of miRNAs in their host genes, were further incorporated. We examined the topological structures of the network, including its hierarchical organization and motif enrichment. We found that transcription factors downstream in the hierarchy distinguish themselves by being expressed more uniformly across tissues, having more interacting partners, and being more likely to be essential. We found an over-representation of notable network motifs, including a feed-forward loop (FFL) in which a miRNA cost-effectively shuts down a transcription factor and its target. We used data of C. elegans from the modENCODE project as a primary model to illustrate our framework, but further verified the results using two other data sets. As more and more genome-wide ChIP-Seq and RNA-Seq data become available in the near future, our methods of data integration have various potential applications.
The precise control of gene expression lies at the heart of many biological processes. In eukaryotes, the regulation is performed at multiple levels, mediated by different regulators such as transcription factors and miRNAs, each distinguished by different spatial and temporal characteristics. These regulators are further integrated to form a complex regulatory network responsible for the orchestration. The construction and analysis of such networks is essential for understanding the general design principles. Recent advances in high-throughput techniques like ChIP-Seq and RNA-Seq provide an opportunity by offering a huge amount of binding and expression data. We present a general framework to combine these types of data into an integrated network and perform various topological analyses, including its hierarchical organization and motif enrichment. We find that the integrated network possesses an intrinsic hierarchical organization and is enriched in several network motifs that include both transcription factors and miRNAs. We further demonstrate that the framework can be easily applied to other species like human and mouse. As more and more genome-wide ChIP-Seq and RNA-Seq data are going to be generated in the near future, our methods of data integration have various potential applications.
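To make the feed-forward loop (FFL) motif mentioned above concrete, the following sketch enumerates FFLs in a small directed network that mixes TF and miRNA regulators. It only finds the motif instances; assessing enrichment would additionally require comparison against randomized networks, which is not shown. The edge-list input format, the function name and the node names are hypothetical and chosen for illustration.

```python
def feed_forward_loops(edges):
    """Return the set of (x, y, z) triples with edges x->y, y->z and x->z.

    edges is an iterable of (regulator, target) pairs; regulators may be
    TFs or miRNAs. Self-loops are ignored.
    """
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)

    loops = set()
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue
            for z in targets.get(y, ()):
                # z must be a third, distinct node that x also regulates.
                if z != x and z != y and z in x_targets:
                    loops.add((x, y, z))
    return loops


if __name__ == "__main__":
    # Hypothetical network: a miRNA represses a TF and that TF's target.
    network = [
        ("miR-x", "TF1"),
        ("TF1", "geneA"),
        ("miR-x", "geneA"),
        ("TF1", "geneB"),
    ]
    print(feed_forward_loops(network))
```

Here the only FFL is miR-x → TF1 → geneA with the direct edge miR-x → geneA, which has the same shape as the miRNA-driven motif described in the abstract.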
A computational pipeline for constructing a personal diploid genome and determining sites of allele-specific activity is developed. Using a regulatory network framework, allele-specific binding and expression are found to be significantly coordinated across the genome.
Software was developed for building a personal diploid genome sequence, and determining sites of allele-specific binding and expression (AlleleSeq). This computational pipeline was used to analyze variation data, and deeply sequenced RNA-Seq and ChIP-Seq datasets, for individual NA12878 from the 1000 Genomes Project. The interaction between allele-specific binding and allele-specific expression is investigated, revealing clear coordination.
To study allele-specific expression (ASE) and binding (ASB), that is, differences between the maternally and paternally derived alleles, we have developed a computational pipeline (AlleleSeq). Our pipeline initially constructs a diploid personal genome sequence (and corresponding personalized gene annotation) using genomic sequence variants (SNPs, indels, and structural variants), and then identifies allele-specific events with significant differences in the number of mapped reads between maternal and paternal alleles. We address many technical challenges in the construction of a personal diploid genome sequence and the alignment of reads to it, for example, the bias of reads mapping to the reference allele. We have applied AlleleSeq to variation data for NA12878 from the 1000 Genomes Project as well as matched, deeply sequenced RNA-Seq and ChIP-Seq data sets generated for this purpose. In addition to observing fairly widespread allele-specific behavior within individual functional genomic data sets (including results consistent with X-chromosome inactivation), we can study the interaction between ASE and ASB. Furthermore, we investigate the coordination between ASE and ASB events for multiple transcription factors using a regulatory network framework. Correlation analyses and network motifs show mostly coordinated ASB and ASE.
allele-specific; ChIP-Seq; networks; RNA-Seq
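The abstract above identifies allele-specific events from "significant differences in the number of mapped reads between maternal and paternal alleles" without naming a test. One common way to assess such a difference at a heterozygous site is an exact two-sided binomial test against an expected 50:50 ratio; the sketch below uses that as an assumption, not as a statement of what the pipeline does. The function name and its parameters are hypothetical, and multiple-testing correction is not shown.

```python
from math import comb


def allelic_imbalance_p(maternal, paternal, expected=0.5):
    """Exact two-sided binomial p-value for allelic read counts.

    maternal, paternal: reads supporting each allele at one site.
    expected: null probability that a read comes from the maternal allele.
    The p-value sums the probabilities of all outcomes that are no more
    likely than the observed one.
    """
    n = maternal + paternal
    if n == 0:
        return 1.0
    probs = [comb(n, k) * expected**k * (1 - expected)**(n - k)
             for k in range(n + 1)]
    observed = probs[maternal]
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    print(allelic_imbalance_p(10, 0))  # strongly skewed site
    print(allelic_imbalance_p(5, 5))   # balanced site
```

A site with all ten reads on one allele gets a small p-value, while a perfectly balanced site gets a p-value of 1.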
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
Despite recent controversies, the evidence that the majority of the human genome is transcribed into RNA remains strong.
A systems understanding of nuclear organization and events is critical for determining how cells divide, differentiate, and respond to stimuli and for identifying the causes of diseases. Chromatin remodeling complexes such as SWI/SNF have been implicated in a wide variety of cellular processes including gene expression, nuclear organization, centromere function, and chromosomal stability, and mutations in SWI/SNF components have been linked to several types of cancer. To better understand the biological processes in which chromatin remodeling proteins participate, we globally mapped binding regions for several components of the SWI/SNF complex throughout the human genome using ChIP-Seq. SWI/SNF components were found to lie near regulatory elements integral to transcription (e.g. 5′ ends, RNA Polymerases II and III, and enhancers) as well as regions critical for chromosome organization (e.g. CTCF, lamins, and DNA replication origins). Interestingly, we also find that certain configurations of SWI/SNF subunits are associated with transcripts that have higher levels of expression, whereas other configurations of SWI/SNF factors are associated with transcripts that have lower levels of expression. To further elucidate the association of SWI/SNF subunits with each other as well as with other nuclear proteins, we also analyzed SWI/SNF immunoprecipitated complexes by mass spectrometry. Individual SWI/SNF factors are associated with their own family members, as well as with cellular constituents such as nuclear matrix proteins, key transcription factors, and centromere components, implying a ubiquitous role in gene regulation and nuclear function. We find an overrepresentation of both SWI/SNF-associated regions and proteins in cell cycle and chromosome organization.
Taken together, the results from our ChIP and immunoprecipitation experiments suggest that SWI/SNF facilitates gene regulation and genome function more broadly and through a greater diversity of interactions than previously appreciated.
Genetic information and programming are not entirely contained in DNA sequence but are also governed by chromatin structure. Gaining a greater understanding of chromatin remodeling complexes can bridge gaps between processes in the genome and the epigenome and can offer insights into diseases such as cancer. We identified targets of the chromatin remodeling complex, SWI/SNF, on a genome-wide scale using ChIP-Seq. We also identify proteins that co-purify with its various components via immunoprecipitation combined with mass spectrometry. By integrating these newly-identified regions with a combination of novel and published data sources, we identify pathways and cellular compartments in which SWI/SNF plays a major role as well as discern general characteristics of SWI/SNF target sites. Our parallel evaluations of multiple SWI/SNF factors indicate that these subunits are found in highly dynamic and combinatorial assemblies. Our study presents the first genome-wide and unified view of multiple SWI/SNF components and also provides a valuable resource to the scientific community as an important data source to be integrated with future genomic and epigenomic studies.
We have implemented the aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 Genomes Project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation, i.e., how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts.
Availability: ACT is available at http://act.gersteinlab.org
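The aggregation step described for ACT, averaging a signal track around a set of anchor points, can be sketched as follows. This is an illustration of the general idea only and does not use ACT's code or file formats: the track is a plain list indexed by position, strand is ignored, and the function name and the `flank` parameter are assumptions.

```python
def aggregate_profile(signal, anchors, flank):
    """Average a signal track at each offset in [-flank, +flank] around anchors.

    signal: list of per-position values.
    anchors: positions (list indices) such as transcription start sites.
    Positions that fall outside the track are skipped; an offset with no
    contributing anchors yields None.
    """
    width = 2 * flank + 1
    sums = [0.0] * width
    counts = [0] * width
    for anchor in anchors:
        for offset in range(-flank, flank + 1):
            pos = anchor + offset
            if 0 <= pos < len(signal):
                sums[offset + flank] += signal[pos]
                counts[offset + flank] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    track = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(aggregate_profile(track, anchors=[2, 7], flank=1))
```

Each element of the result corresponds to one offset relative to the anchors, so the list can be plotted directly as an aggregate profile.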
Tiling array data is hard to interpret due to noise. The wavelet transformation is a widely used technique in signal processing for elucidating the true signal from noisy data. Consequently, we attempted to denoise representative tiling array datasets for ChIP-chip experiments using wavelets. In doing this, we used specific wavelet basis functions, Coiflets, since their triangular shape closely resembles the expected profiles of true ChIP-chip peaks.
In our wavelet-transformed data, we observed that noise tends to be confined to small scales while the useful signal-of-interest spans multiple large scales. We were also able to show that wavelet coefficients due to non-specific cross-hybridization follow a log-normal distribution, and we used this fact in developing a thresholding procedure. In particular, wavelets allow one to set an unambiguous, absolute threshold, which has been hard to define in ChIP-chip experiments. One can set this threshold by requiring a similar confidence level at different length-scales of the transformed signal. We applied our algorithm to a number of representative ChIP-chip data sets, including those of Pol II and histone modifications, which have a diverse distribution of length-scales of biochemical activity, including some broad peaks.
Finally, we benchmarked our method in comparison to other approaches for scoring ChIP-chip data using spike-ins on the ENCODE Nimblegen tiling array. This comparison demonstrated excellent performance, with wavelets getting the best overall score.
We develop a statistical framework to study the relationship between chromatin features and gene expression. This can be used to predict gene expression of protein coding genes, as well as microRNAs. We demonstrate the prediction in a variety of contexts, focusing particularly on the modENCODE worm datasets. Moreover, our framework reveals the positional contribution around genes (upstream or downstream) of distinct chromatin features to the overall prediction of expression levels.
Summary: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools (RSEQtools) that use this format for the analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by MRF, this format also facilitates the decoupling of the alignment of reads from downstream analyses.
Availability and implementation: RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/.
Supplementary information: Supplementary data are available at Bioinformatics online.
We have developed FusionSeq to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements.
Differences in gene expression may play a major role in speciation and phenotypic diversity. We examined genome-wide differences in transcription factor (TF) binding in several humans and a single chimpanzee using chromatin immunoprecipitation followed by sequencing (ChIP-Seq). The binding sites of RNA Polymerase II (PolII) and a key regulator of immune responses, NFκB (p65), were mapped in ten lymphoblastoid cell lines and 25% and 7.5% of the respective binding regions were found to differ between individuals. Binding differences were frequently associated with SNPs and genomic structural variants (SVs) and were often correlated with differences in gene expression, suggesting functional consequences of binding variation. Furthermore, comparing PolII binding between human and chimpanzee suggests extensive divergence in TF binding. Our results indicate that many differences in individuals and species occur at the level of TF binding and provide insight into the genetic events responsible for these differences.
Chromatin immunoprecipitation followed by tag sequencing (ChIP-Seq) using high-throughput next-generation instrumentation is replacing ChIP-chip for mapping of sites of transcription-factor binding and chromatin modification. To develop a scoring approach for this new technique, we produce two deeply sequenced datasets for human RNA polymerase II and STAT1 with matching input-DNA controls. In these, we observe that signal peaks corresponding to sites of potential binding are strongly correlated with peaks in the control, likely revealing features of open chromatin. Based on these observations, we develop a two-pass approach for scoring ChIP-Seq relative to controls. The first pass identifies putative binding sites and compensates for genomic variation in the mappability of sequences. The second pass filters sites not significantly enriched compared to the normalized control, computing precise enrichments and significances. Using our scoring, we investigate optimal experimental design – i.e. depth of sequencing and value of replicates (showing marginal information gain beyond two).
Tiling arrays have been the tool of choice for probing an organism's transcriptome without prior assumptions about the transcribed regions, but RNA-Seq is becoming a viable alternative as the costs of sequencing continue to decrease. Understanding the relative merits of these technologies will help researchers select the appropriate technology for their needs.
Here, we compare these two platforms using a matched sample of poly(A)-enriched RNA isolated from the second larval stage of C. elegans. We find that the raw signals from these two technologies are reasonably well correlated but that RNA-Seq outperforms tiling arrays in several respects, notably in exon boundary detection and dynamic range of expression. By exploring the accuracy of sequencing as a function of depth of coverage, we found that about 4 million reads are required to match the sensitivity of two tiling array replicates. The effects of cross-hybridization were analyzed using a "nearest neighbor" classifier applied to array probes; we describe a method for determining potential "black list" regions whose signals are unreliable. Finally, we propose a strategy for using RNA-Seq data as a gold standard set to calibrate tiling array data. All tiling array and RNA-Seq data sets have been submitted to the modENCODE Data Coordinating Center.
Tiling arrays effectively detect transcript expression levels at a low cost for many species while RNA-Seq provides greater accuracy in several regards. Researchers will need to carefully select the technology appropriate to the biological investigations they are undertaking. It will also be important to reconsider a comparison such as ours as sequencing technologies continue to evolve.