Recently, the genomics community has seen an explosion in the availability of large-scale functional genomics data. Researchers have produced genome-wide data sets on the locations of transcription factor binding and histone modifications via chromatin immunoprecipitation (ChIP), open chromatin, and RNA transcription using sequence census assays. Consequently, our representation of the whole human genome has expanded from a sequence of nucleotides, occasionally annotated with discrete features, to a collection of numerical data tracks defined at almost every part of every chromosome. Simultaneously, we have moved beyond treating the cell state determined by functional genomics experiments as constitutive and universal. We will soon have access to data from dozens of cell types1
. How can we make sense out of this multitude of data, unprecedented in experimental biology?
A segmentation procedure provides a conceptually simple approach to finding patterns in this kind of genomic data2,3,4,5,6,7
. In this procedure, one partitions the genome into non-overlapping contiguous segments, assigning to each segment a label, such as “promoter” or “enhancer,” drawn from a finite set. One assigns the labels such that regions sharing the same segment label share certain properties in the observed data. Occam’s Razor implies the desirability of making the set of segment labels as small as possible, while still retaining their capability of accurately modeling the observations. The segmentation task involves finding segment boundaries while simultaneously assigning labels to the segments.
Approaches to segmentation using a hidden Markov model (HMM) suffer from some limitations. For example, HMMs handle missing data poorly, requiring interpolation and smoothing to process regions where data is not or cannot be reported. Similarly, hard or soft constraints on segment lengths prove complex and difficult to implement with a simple HMM.
Dynamic Bayesian networks (DBNs) provide a powerful framework for modeling the complex hidden relationships that explain observed data defined along an axis of arbitrary length. Automatic speech recognition researchers have long used DBNs8
, and now scientists have started using them to solve biological problems9
. A DBN is a graphical structure that depicts the conditional independence properties of multiple random variables. We can represent a standard HMM by a DBN with a hidden random variable for the HMM’s hidden state, and an observed random variable for the observations. With a DBN, however, we can easily incorporate arbitrarily complex relationships among variables at the current or nearby positions in the genome. The DBN allows us to include multiple hidden variables and model their interrelationships without flattening them into a single state variable and thereby ignoring the properties implied by these interrelationships. For example, the DBN can incorporate a structured state space, in which superlabels and sublabels augment a simple label set, where the superlabel might represent something like a region of active euchromatin, and the sublabel the 3′ end of an active gene within active euchromatin. In general, incorporating various types of prior knowledge, and interpreting a trained model, proves much easier with a DBN than with an HMM. For example, we have extended the model to include a principled view of heterogeneous missing data (Supplementary Discussion
, Supplementary Fig. 1
, and Supplementary Table 1
In this communication, we describe a new segmentation method, called Segway, which uses DBN techniques to perform simultaneous segmentation and clustering of genomic data. We describe the application of Segway to human ChIP-seq, DNase-seq, and FAIRE-seq data from the ENCODE Project1
We performed unsupervised training on 1% of the human genome using the Segway model and 31 ENCODE signal tracks. We arbitrarily fixed the number of labels at 25 so that the set of labels would be sufficiently small to remain interpretable by biologists. The use of 25 labels provides human convenience in the interpretation of results. Greater numbers of labels may have more statistical support, but yield difficult-to-interpret results. Our method aims to reduce complex data for easier interpretation, and using an arbitrarily small number of labels served this purpose. The discovered parameters characterize a diverse set of biological patterns (). We then Viterbi decoded the genome, identifying the most probable path of segment labels given the trained parameters and observed signal.
Fig. 1 Heat map of discovered Gaussian parameters in an unsupervised 25-label segmentation of 31 ENCODE signal tracks. Each row contains parameters for one signal track, and each column contains parameters for one segment label. Within each row, we performed (more ...)
We refer to the resulting assignment of labels to non-overlapping genomic regions as the segmentation (Supplementary Results
, Supplementary Fig. 2
, and Supplementary Table 2
). The labels allow one to easily identify other regions that exhibit similar signal patterns in the observation tracks. By examining the learned parameters directly ( and Supplementary Fig. 3
) and comparing the segmentation to public annotations ( and Supplementary Figs. 3–5
), we assigned functional categories to groups of segment labels based on notable features. Many of these labels recapitulated known patterns in the chromatin literature, and some represented novel patterns.
Fig. 2 Gene structure in Segway labels. (a) Segmentation of the BRD2 locus, as shown in the University of California, Santa Cruz (UCSC) Genome Browser. The first three rows show the position along chromosome 6 and a scale bar. The next 25 rows correspond to (more ...)
Notably, Segway “rediscovered” protein-coding genes, as the unsupervised segmentation included chromatin states associated with the starts, middles, and ends of genes () found without direct recourse to information traditionally used to find genes (primary sequence, similarity to mRNAs or genome sequences of other species). Several labels tended to reside in particular locations within protein-coding genes, and the discovered transition parameters of the model increased the probability of moving from the labels that tended to reside in 5′ ends of genes to the labels found more often in 3′ ends of genes. We also found expected patterns of transcription factor binding near the transcription start site (TSS), histone H2A.Z associated with the TSS10
, and histone mark H3K79me2 associated with the gene start10
(Supplementary Fig. 6
). The patterns of chromatin structure in protein-coding genes fit well with existing knowledge (Supplementary Results
, and Supplementary Figs. 7 and 3
Many of the chromatin states which, at first glance, seemed to associate with protein-coding genes, turned out to be associated only with genes active in the tissue type of the assays incorporated into the segmentation (Supplementary Results
). Segway also discovered patterns associated with specific functional elements, such as enhancers, insulators, regions of repressed gene expression, and “dead” regions (Supplementary Results
and Supplementary Figs. 5 and 8
We used Segway to generate hypotheses about the role of individual sequences in promoting transcription, which we then tested experimentally. We began by identifying expressed and unexpressed genes using RNA data. To determine whether specific sequences identified by segmentation labels sufficed for these genes’ expression or non-expression, we tested small ~1-kbp constructs that overlapped the gene’s TSS and either a segmentation TSS label (24 positive predictions) or an R label (67 negative predictions) using transient transfection luciferase assays. Every positive prediction resulted in substantial expression of a reporter, a majority of the negative predictions resulted in no substantial expression, and a larger majority of negative predictions not overlapping CpG islands resulted in no substantial expression (Supplementary Fig. 9a
). Positive predictions showed higher median expression activity than negative predictions.
Segway differs substantially from several previously described methods for jointly analyzing chromatin data (Supplementary Discussion
). For example, Segway solves a fundamentally different problem than ChromaSig11
, in the sense that ChromaSig does not attempt to fully partition the genome or integrate arbitrary combinations of functional genomics data. Segway is most similar to a hidden Markov model-based method called ChromHMM5
. However, Segway and ChromHMM differ in several respects, chiefly the relative resolution of the two methods: Segway operates on the full data set, whereas ChromHMM works at 200 bp resolution and condenses each data track to a single binary datum within each 200-bp window. Consequently, Segway provides a finer-grained segmentation. Whereas the segments in a published ChromHMM segmentation5
had a minimum length of 200 bp, a mean of 4,862 bp and a median of 800 bp, Segway segments had a mean length of 168 bp and a median of 124 bp.
We directly compared the behavior of Segway and ChromHMM in three separate loci ( and Supplementary Fig. 10
). Segway successfully identified the TSS of GATA1
, followed by a series of gene-related labels that generally progress from “gene start” through “gene middle” labels (). ChromHMM, in contrast, identified a large region that it identifies as “active promoter,” spanning thousands of base pairs. It labeled the rest of the gene “strong enhancer.” This example suggests that the 200 bp window size makes precise identification of fine-scale elements like promoters and internal gene structure difficult. We saw a similar effect at the HBG1
locus (). Here, a known enhancer12
appeared in the midst of a large (~10 kbp) ChromHMM segment labeled “active promoter,” whereas Segway placed a small enhancer label in the middle of the known enhancer region. We saw this behavior genome-wide (Supplementary Results
Fig. 3 Comparison of Segway and ChromHMM segmentations in several locations. (a) The GATA1 locus. (b) The HBG1 locus and an upstream enhancer12.
We have shown that Segway successfully reduces heterogeneous datasets into understandable patterns with clear biological implications. In the future, the flexibility of the DBN suggests a solution to the problem of learning large numbers of segment labels while retaining comprehensibility. Extension to a hierarchical segmentation would allow the learning of many sublabel patterns, while keeping a higher-level yet smaller-order structure that a researcher could analyze and understand more readily. While effective hyperparameter setting for hierarchical segmentation requires further research, we have already implemented the capability for hierarchical training and identification in Segway.