Several consortia are systematically interrogating genetic variation, the transcriptome, the epigenome and the interactome on a genomic scale. Each experiment adds another dimension of data to the genome, so there are now hundreds of dimensions of experimental data tethered to the human genome (and other genomes), and this number is growing rapidly. The key to exploiting these data is integrating them. There are many ways to approach the challenge of data integration and we discuss three important – though not mutually exclusive – approaches below.
Data complexity reduction
For a growing number of sequencing-based assays such as ChIP-Seq, DNase-Seq, FAIRE-Seq, RNA-Seq, or Hi-C, the result of each experiment is millions of short sequence reads, which essentially give a continuous signal of enrichment across the genome. A simple approach to reducing the complexity of such a dataset from millions of data points to a more manageable hundreds or thousands of sites is to summarize each experiment as a collection of genomic regions with strong enrichment of signal. For ChIP-Seq, peak finders discretize the genome-wide profiles into regions with enrichment and those without. Therefore, a commonly used method of data integration is to perform intersection analysis on enriched regions from different experiments. For example, Chen
et al mapped a collection of 13 TFs using ChIP-Seq in mouse ES cells, used a custom peak-finder to call regions of enrichment, and observed significant co-binding of TFs
97.
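The intersection analysis described above can be illustrated with a minimal sketch. The peak coordinates and the names `tf1` and `tf2` are invented for illustration, and the function is a simple quadratic-time overlap test rather than the procedure used in any cited study; real analyses would normally rely on a dedicated interval tool.

```python
from collections import defaultdict


def intersect_peaks(peaks_a, peaks_b):
    """Return the peaks in peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    shared = []
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(s < end and start < e for s, e in by_chrom[chrom]):
            shared.append((chrom, start, end))
    return shared


# Hypothetical enriched regions called for two transcription factors.
tf1 = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
tf2 = [("chr1", 250, 400), ("chr2", 500, 600)]

co_bound = intersect_peaks(tf1, tf2)
print(co_bound)
```

Only the first `tf1` region overlaps a `tf2` region here. Whether an observed number of overlaps is significant still has to be judged against a background expectation, which this sketch does not attempt.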
Although intersection analysis on discretized datasets is straightforward to perform, special attention must be paid to the underlying assumptions of data discretization. For example, blanket application of a single peak-finding method and set of parameters to different types of data – such as histone modifications, TF binding and open chromatin – is often ill-advised, for several reasons. Firstly, the type of experiment usually dictates a specific kind of data analysis. For instance, TFs often bind discrete, specific sites and so ChIP-Seq tags at the point of binding have a biased distribution between positive and negative strands, which can be used by peak finders to obtain excellent precision
75, 98. However, this assumption is less suitable when binding or enrichment occurs contiguously across large stretches of DNA or in clusters, as is the case for certain chromatin modifications
31, 99. Therefore, one must be mindful of the underlying assumptions and limitations of peak finders before applying them. Secondly, even among the same type of data, variability in data quality may necessitate calling peaks with different thresholds and/or data normalization methods. This is especially true for ChIP-Seq experiments, where variable quality of antibodies or sub-optimal ChIP conditions can lead to variable ChIP enrichment, which will require adjusting significance thresholds individually to achieve both high sensitivity and specificity.
It is important to note that the inherently noisy nature of genome-wide data means that a perfect peak finder cannot exist: in calling regions of enrichment, one can only hope to minimize, but not eliminate, false positives and false negatives. Realizing this, it is evident that we cannot simply trust peak finders blindly and that it is especially important to inspect at least some of the results by eye. Thus, if we are to perform meaningful analysis, we cannot be far removed from the original data and should follow the analysis with validation experiments.
Unsupervised integration
A more scalable method for integrating data is unsupervised learning, which approaches the data with no prior biases, knowledge, or hypotheses. To summarize a large dataset into smaller groups that can be more easily conceptualized, an unsupervised approach simply asks the question: what kinds of patterns exist in a dataset? One common assumption made by unsupervised approaches is that the interesting features of the data are the ones that occur frequently, and therefore the goal is to find common patterns. As diverse experimental methods equate frequency of genomic mapping with activity, an unsupervised analysis can treat these datasets equally and need not know the nature of the measurement. For example, Zhao and colleagues profiled 37 histone modifications in human CD4+ T cells
31, 32. While the number of different possible combinations of modifications is a staggering 2^37 ≈ 137.4 billion, it is likely that most combinations do not exist, or occur very infrequently. To enumerate commonly occurring chromatin signatures, or other patterns, clustering can be applied. Clustering approaches are introduced in
Box 2.
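As a minimal sketch of the arithmetic above, the following computes the size of the combinatorial space for 37 binary (present/absent) marks and tallies which combinations actually occur in a toy dataset. The three-mark loci are invented for illustration and do not come from the cited study.

```python
from collections import Counter

N_MARKS = 37
# Each mark is either present or absent, so there are 2**37 possible combinations.
n_possible = 2 ** N_MARKS
print(n_possible)  # 137438953472, i.e. about 137.4 billion


def count_signatures(loci):
    """Count how often each presence/absence combination occurs."""
    return Counter(tuple(marks) for marks in loci)


# Toy example with three hypothetical marks per locus (1 = present, 0 = absent).
loci = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0)]
counts = count_signatures(loci)
most_common_signature, frequency = counts.most_common(1)[0]
print(most_common_signature, frequency)
```

Of the eight combinations possible with three marks, only three are observed in this toy dataset, which mirrors the expectation that most of the theoretical space is empty.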
Box 2. Clustering
Clustering is an integral bioinformatics tool to partition a large dataset into more easily digestible, conceptual pieces. It can be applied to a wide variety of data, but traditionally has been applied to gene expression profiles. Here, each gene is represented by a list of expression values in various cell types or conditions, and clustering identifies sets of co-expressed genes. In general, conventional clustering works well when the experimental values can be easily discretized into the clustered entities, for example RPKM-normalized expression to an associated gene.
However, for other applications, this discretization is not possible or not desired. One example is for histone modification data derived from ChIP-Seq, where the profile of experimental values over a contiguous region is informative. Conventional clustering can be applied to this data, provided that the profiles are well aligned. For example, to enumerate commonly occurring chromatin signatures in an unbiased way, conventional clustering can be applied to a subset of genomic regions such as promoters. If a pre-defined number of clusters
k is expected, then k-means clustering can be applied; otherwise, hierarchical clustering can be used to offer more flexibility. Clearly, conventional clustering can be applied to a wide variety of genomic datasets, spanning genomes, epigenomes
102, transcriptomes
16, and interactomes
120. But this method gives the best results when the set of loci examined are well-aligned, which is the case for gene definitions where excellent annotations exist. To cluster loci with poorly aligned or asymmetric chromatin signatures, or for poorly annotated loci such as gene-distal regulatory elements, our laboratory has developed an approach called ChromaSig
90, 101. Given a set of genomic loci, ChromaSig aligns and orients the epigenetic profiles around the loci, outputting clusters of loci that share similar profiles. Alternatively, given the genome-wide nature of epigenetic data, another clustering approach is to assign a cluster to every part of the genome. To accomplish this task, Jaschek
et al (ref. 121) employ a hidden Markov model approach to learn the most likely epigenetic states given the data.
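To make the conventional case in Box 2 concrete, the following is a minimal sketch of k-means clustering applied to well-aligned profiles, for example binned histone modification signal around promoters. The profile values are invented, the centroids are initialized from the first k profiles purely to keep the example deterministic, and this plain implementation is not ChromaSig or any other published method.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, n_iter=20):
    """Plain Lloyd's algorithm; centroids start at the first k profiles."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical signal in four bins around five aligned promoters.
profiles = [
    [0.1, 0.2, 5.0, 4.8],  # signal downstream of the alignment point
    [4.9, 5.1, 0.2, 0.1],  # signal upstream of the alignment point
    [0.0, 0.3, 4.7, 5.2],
    [5.2, 4.8, 0.1, 0.3],
    [0.2, 0.1, 5.1, 4.9],
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The two groups separate cleanly only because the profiles share a common alignment point. If the same signal were shifted or flipped between loci, a bin-by-bin distance of this kind would no longer group them, which is the situation that motivates alignment-aware approaches.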
The genome serves as a scaffold upon which high-throughput data are assembled and from a genome-centric perspective, clustering can be seen as a way of classifying genomic loci into conceptual groups with shared attributes. Clustering data from different experiments gives distinct types of conceptual groups and the first phase of data integration can be seen as enumerating the conceptual modules of each dataset. For example, clustering of RNA expression reveals co-expressed genes
100, clustering of histone modifications gives loci that share similar chromatin structure
90, 101, 102, protein-protein interaction clustering finds proteins in the same complex
103, and genetic interaction clustering reveals members of the same or similar pathways
56.
Although all modules are tethered to the genome, modules from one experiment are not linked to those from others. Thus, the next task in data integration is to connect these modules. One approach is to examine a module from one data type, for example chromatin signatures, in the context of another data type, for example DNA methylation
25, 104, 105. Alignment of data sets on a browser such as the UCSC Genome Browser
106 might be useful in this regard. Furthermore, the Genome Browser also contains annotations such as gene definitions, evolutionary conservation, and disease associations
107. Therefore, co-clustering of new experimental data with known annotations can provide an easy bridge to hypothesis generation. In the past, when genomics consisted only of global gene expression analysis, annotation libraries such as Gene Ontology
108 and the more sophisticated Gene Set Enrichment Analysis
109 were developed to provide an easy way to assess the biological significance of gene hits. As datasets are now extending to include non-coding RNAs, disease-associated SNPs and regions of TF binding, it appears that “Locus Set Enrichment Analysis” will be an important part of genomics. Sets of loci that share factor binding, epigenetic modifications or disease association will provide efficient ways to form hypotheses regarding function outside of coding regions.
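One simple way to score the enrichment of an annotation within a set of loci is a one-sided hypergeometric test, sketched below. This is offered as an assumption about how a "Locus Set Enrichment Analysis" could be scored, not as a method defined in the text; the function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected loci are drawn without replacement
    from n_universe loci, of which n_annotated carry the annotation."""
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, i) * comb(n_universe - n_annotated, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 1000 candidate loci, 50 overlap disease-associated SNPs,
# and a cluster of 20 loci contains 8 of those 50.
p_value = hypergeom_enrichment_p(1000, 50, 20, 8)
print(p_value)
```

With only one overlap expected by chance, eight overlaps give a very small p-value. In practice the choice of the background universe and correction for testing many annotations matter at least as much as the test itself.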
Another approach to connecting conceptual modules involves network biology, which leverages high-throughput techniques to find relationships that connect genomic loci and conceptual groups. For example: methods to map chromosomal interactions, such as Hi-C, connect genomic loci to each other; genetic interactions from E-MAPs connect proteins to pathways; and ChIP-Seq links transcription factors to regulated genes. This second level of integration – linking different kinds of experiments – can form a knowledge base from which to extract biological insights or suggest hypotheses for further study.
As a hypothetical example, suppose we used ChIP-Seq to map a novel TF genome-wide and wanted to know the significance of its binding profile. Complicating matters, most of the binding sites are distal to promoters. Clustering reveals that a subset of binding sites share a similar chromatin environment, which suggests these sites may function similarly. Hi-C data then link this subset of binding sites with their target genes and RNA-Seq data reveal that these genes are highly expressed. Finally, protein-protein and genetic interaction data reveal that some of these expressed genes belong to related but distinct protein complexes that regulate RNA splicing. Thus, data integration would allow us to efficiently propose the hypothesis that the binding of this new factor to DNA regulates the process of RNA splicing.
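The chain of reasoning in this hypothetical example amounts to a series of lookups across datasets, which can be sketched as follows. Every identifier, value and threshold here is invented for illustration.

```python
# Hypothetical outputs of four separate experiments.
site_cluster = {"site1": "A", "site2": "A", "site3": "B"}  # chromatin clusters
hic_targets = {  # binding site -> genes contacted in Hi-C
    "site1": ["geneX"],
    "site2": ["geneY", "geneZ"],
    "site3": ["geneW"],
}
expression = {"geneX": 120.0, "geneY": 3.0, "geneZ": 85.0, "geneW": 40.0}
complexes = {  # gene -> complex from interaction data
    "geneX": "splicing complex 1",
    "geneZ": "splicing complex 2",
    "geneW": "ribosome",
}


def candidate_functions(cluster, min_expression=50.0):
    """Follow one chromatin cluster through Hi-C, expression and complex data."""
    sites = [s for s, c in site_cluster.items() if c == cluster]
    genes = {g for s in sites for g in hic_targets.get(s, [])}
    expressed = sorted(g for g in genes if expression.get(g, 0.0) >= min_expression)
    return {g: complexes.get(g, "unannotated") for g in expressed}


result = candidate_functions("A")
print(result)
```

For cluster "A" the sketch returns the two highly expressed target genes, both annotated to splicing complexes, which is the kind of convergence that would suggest the hypothesis described above.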
Often, the scope of genomic experiments performed is so diverse that it is not immediately clear how, or even if, one experiment relates to another. It is in such cases that unsupervised, data-driven approaches to integration are most useful. Unsupervised integration is a discovery tool to find correlations between two or more experiments. Novel associations lead to hypotheses of function, which can be followed up by supervised integration and by direct experimental validation (see below). In this way, high-throughput experiments are screens to identify interesting, unexpected associations. Because of the power of the approach and because the inputs required are minimal, unsupervised integration is arguably the first tool that should be applied to a new dataset, and it should be constantly run as new experiments are added to an existing dataset to find additional associations.
Supervised integration
The discovery of patterns is one output of unsupervised integration, but the patterns alone do not advance our understanding of biology or disease. Like most systems biology approaches, unsupervised integration excels at generating hypotheses. Therefore, a novel pattern is simply an observation, from which we must make and test predictions of function, often by incorporating external datasets or new experiments. This is the realm of supervised integration. Supervised integration is driven by testable hypotheses and so often relies on only a few dimensions of a full dataset.
It is important to note that the choice of data to include in supervised integration and the specific method used depend crucially on the question posed. For example, using an unsupervised clustering approach we recently observed a set of distinct histone modifications at exons, which led to the hypothesis that these modifications mark alternatively expressed exons
90. To test this hypothesis, we needed to examine these chromatin modifications in the context of expression at the exonic level and we were able to use previously published exon expression array data from the same cell type
110.
However, in most instances the impetus for supervised integration is anecdotal evidence, either through observations obtained by simply viewing genome-scale data on a browser or from previously published studies. For example, Guttman et al took advantage of previous observations that RNAPII-transcribed genes are marked by H3K4me3 at promoters and have H3K36me3 spreading into the transcribed region and searched for this chromatin signature to identify RNAPII-transcribed lincRNAs
16. Thus, supervised integration starts with a prediction based on an observation and ends with a test of this prediction. This is arguably how our biological understanding is advanced most: the more predictive the hypothesis, the more biological insight gained. Therefore, observation and data integration cannot be independent of each other and there is no substitute for seeing the data with one’s own eyes. Our opinion that it is necessary to see raw data using a browser, for example, is consistent with the current trend in data visualization towards replacing traditional averaged plots with more information-rich heatmaps that simultaneously illustrate experimental profiles for thousands of loci (e.g., genome-wide heatmaps of ChIP-chip data
59).
As there are now tens of thousands of high-throughput experiments linked to the human genome, finding dependence relationships among the many dimensions of experimental data is essential to increasing our knowledge. In the simplest case, relationships can be discovered by correlation analysis. For example, a strong, positive correlation between the binding profiles of two transcription factors indicates that one may be dependent on the other. Additionally, for genetic interactions, finding positive and negative correlations for a mutant under different conditions can allow systematic discovery of condition-dependent relationships (S. Bandyopadhyay - UCSD, personal communication).
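A minimal sketch of pairwise correlation analysis follows, using Pearson correlation on binned binding profiles. The factor names and signal values are invented, and the choice of Pearson correlation is an assumption; rank-based measures may suit noisy enrichment data better.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def n_pairs(n_datasets):
    """Number of distinct dataset pairs that must be compared."""
    return n_datasets * (n_datasets - 1) // 2


# Hypothetical enrichment of three factors over the same five genomic bins.
profiles = {
    "TF_A": [1.0, 5.0, 2.0, 8.0, 1.0],
    "TF_B": [2.0, 10.0, 4.0, 16.0, 2.0],  # tracks TF_A exactly (2x signal)
    "TF_C": [8.0, 1.0, 7.0, 1.0, 9.0],    # high where TF_A is low
}
pairwise = {
    (a, b): pearson(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
for pair, r in pairwise.items():
    print(pair, round(r, 3))

print(n_pairs(10), n_pairs(20))  # 45 and 190 pairwise comparisons
```

The number of comparisons grows as n(n-1)/2, so going from 10 to 20 datasets raises the count from 45 to 190, roughly a fourfold increase.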
Although informative, correlation analysis can become unwieldy as the number of datasets grows – doubling the number of datasets roughly quadruples the number of pairwise comparisons to compute and visualize. Fortunately, machine learning techniques, notably Bayesian networks (for a primer see Needham et al
111), offer a supervised approach to discover relationships among data entities. Using a probabilistic framework, Bayesian networks can find dependence relationships, for example as van Steensel
et al did for a panel of chromatin modifications and chromatin-associated proteins and modifiers
112. Bayesian networks can also readily integrate data from different kinds of experiments. For example, Yu
et al modeled the interdependence of histone modification profiles with the binding of transcription factors, together with their relationship to gene expression
113. However, it is important to note that the types of prediction output by a Bayesian network depend critically on how the network is designed, which in turn depends on the question asked. For example, Jansen
et al designed a Bayesian network to predict protein complexes by integrating diverse data sources including protein-protein interactions, expression and gene annotation
114. In summary, Bayesian networks can find relationships among diverse kinds of data and thereby create hypotheses that can be tested experimentally.
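As a minimal sketch of how a probabilistic framework can combine evidence from different experiments, the following uses a naive Bayes model, the simplest Bayesian network structure, in which each data source is assumed to be conditionally independent given the hypothesis. The prior and likelihood ratios are invented, and this is not presented as the network design used in any of the studies cited above.

```python
from math import prod


def posterior_probability(prior, likelihood_ratios):
    """Combine independent evidence by multiplying likelihood ratios.

    prior: prior probability that the hypothesis is true (0 < prior < 1).
    likelihood_ratios: one ratio P(evidence | true) / P(evidence | false)
    per data source, assumed conditionally independent.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical question: are two proteins in the same complex?
prior = 0.01  # assumed prior that a random pair shares a complex
evidence = [
    20.0,  # assumed ratio for an observed protein-protein interaction
    5.0,   # assumed ratio for strongly correlated expression
    2.0,   # assumed ratio for a shared functional annotation
]
posterior = posterior_probability(prior, evidence)
print(round(posterior, 3))
```

Under these invented numbers, three individually modest lines of evidence raise a 1% prior to a posterior of roughly two in three. A full Bayesian network would relax the independence assumption by modeling dependencies among the data sources explicitly.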