Recently, the genomics community has seen an explosion in the availability of large-scale functional genomics data. Researchers have produced genome-wide data sets on the locations of transcription factor binding and histone modifications via chromatin immunoprecipitation (ChIP), open chromatin, and RNA transcription using sequence census assays. Consequently, our representation of the whole human genome has expanded from a sequence of nucleotides, occasionally annotated with discrete features, to a collection of numerical data tracks defined at almost every part of every chromosome. Simultaneously, we have moved beyond treating the cell state determined by functional genomics experiments as constitutive and universal. We will soon have access to data from dozens of cell types
1. How can we make sense out of this multitude of data, unprecedented in experimental biology?
A segmentation procedure provides a conceptually simple approach to finding patterns in this kind of genomic data
2,3,4,5,6,7. In this procedure, one partitions the genome into non-overlapping contiguous segments, assigning to each segment a label, such as “promoter” or “enhancer,” drawn from a finite set. One assigns the labels such that regions sharing the same segment label share certain properties in the observed data. Occam’s Razor implies the desirability of making the set of segment labels as small as possible, while still retaining their capability of accurately modeling the observations. The segmentation task involves finding segment boundaries while simultaneously assigning labels to the segments.
Approaches to segmentation using a hidden Markov model (HMM) suffer from some limitations. For example, HMMs handle missing data poorly, requiring interpolation and smoothing to process regions where data is not or cannot be reported. Similarly, hard or soft constraints on segment lengths prove complex and difficult to implement with a simple HMM.
Dynamic Bayesian networks (DBNs) provide a powerful framework for modeling the complex hidden relationships that explain observed data defined along an axis of arbitrary length. Automatic speech recognition researchers have long used DBNs
8, and now scientists have started using them to solve biological problems
9. A DBN is a graphical structure that depicts the conditional independence properties of multiple random variables. We can represent a standard HMM by a DBN with a hidden random variable for the HMM’s hidden state, and an observed random variable for the observations. With a DBN, however, we can easily incorporate arbitrarily complex relationships among variables at the current or nearby positions in the genome. The DBN allows us to include multiple hidden variables and model their interrelationships without flattening them into a single state variable and thereby ignoring the properties implied by these interrelationships. For example, the DBN can incorporate a structured state space, in which superlabels and sublabels augment a simple label set, where the superlabel might represent something like a region of active euchromatin, and the sublabel the 3′ end of an active gene within active euchromatin. In general, incorporating various types of prior knowledge, and interpreting a trained model, proves much easier with a DBN than with an HMM. For example, we have extended the model to include a principled view of heterogeneous missing data (
Supplementary Discussion,
Supplementary Fig. 1, and
Supplementary Table 1).
In this communication, we describe a new segmentation method, called Segway, which uses DBN techniques to perform simultaneous segmentation and clustering of genomic data. We describe the application of Segway to human ChIP-seq, DNase-seq, and FAIRE-seq data from the ENCODE Project
1.
We performed unsupervised training on 1% of the human genome using the Segway model and 31 ENCODE signal tracks. We arbitrarily fixed the number of labels at 25 so that the set of labels would be sufficiently small to remain interpretable by biologists. The use of 25 labels provides human convenience in the interpretation of results. Greater numbers of labels may have more statistical support, but yield difficult-to-interpret results. Our method aims to reduce complex data for easier interpretation, and using an arbitrarily small number of labels served this purpose. The discovered parameters characterize a diverse set of biological patterns (). We then Viterbi decoded the genome, identifying the most probable path of segment labels given the trained parameters and observed signal.
We refer to the resulting assignment of labels to non-overlapping genomic regions as the segmentation (
Supplementary Results,
Supplementary Fig. 2, and
Supplementary Table 2). The labels allow one to easily identify other regions that exhibit similar signal patterns in the observation tracks. By examining the learned parameters directly ( and
Supplementary Fig. 3) and comparing the segmentation to public annotations ( and
Supplementary Figs. 3–5), we assigned functional categories to groups of segment labels based on notable features. Many of these labels recapitulated known patterns in the chromatin literature, and some represented novel patterns.
Notably, Segway “rediscovered” protein-coding genes, as the unsupervised segmentation included chromatin states associated with the starts, middles, and ends of genes () found without direct recourse to information traditionally used to find genes (primary sequence, similarity to mRNAs or genome sequences of other species). Several labels tended to reside in particular locations within protein-coding genes, and the discovered transition parameters of the model increased the probability of moving from the labels that tended to reside in 5′ ends of genes to the labels found more often in 3′ ends of genes. We also found expected patterns of transcription factor binding near the transcription start site (TSS), histone H2A.Z associated with the TSS
10, and histone mark H3K79me2 associated with the gene start
10 (
Supplementary Fig. 6). The patterns of chromatin structure in protein-coding genes fit well with existing knowledge (
Supplementary Results, and
Supplementary Figs. 7 and 3).
Many of the chromatin states which, at first glance, seemed to associate with protein-coding genes, turned out to be associated only with genes active in the tissue type of the assays incorporated into the segmentation (
Supplementary Results). Segway also discovered patterns associated with specific functional elements, such as enhancers, insulators, regions of repressed gene expression, and “dead” regions (
Supplementary Results and
Supplementary Figs. 5 and 8).
We used Segway to generate hypotheses about the role of individual sequences in promoting transcription, which we then tested experimentally. We began by identifying expressed and unexpressed genes using RNA data. To determine whether specific sequences identified by segmentation labels sufficed for these genes’ expression or non-expression, we tested small ~1-kbp constructs that overlapped the gene’s TSS and either a segmentation TSS label (24 positive predictions) or an R label (67 negative predictions) using transient transfection luciferase assays. Every positive prediction resulted in substantial expression of a reporter, a majority of the negative predictions resulted in no substantial expression, and a larger majority of negative predictions not overlapping CpG islands resulted in no substantial expression (
Supplementary Fig. 9a). Positive predictions showed higher median expression activity than negative predictions.
Segway differs substantially from several previously described methods for jointly analyzing chromatin data (
Supplementary Discussion). For example, Segway solves a fundamentally different problem than ChromaSig
11, in the sense that ChromaSig does not attempt to fully partition the genome or integrate arbitrary combinations of functional genomics data. Segway is most similar to a hidden Markov model-based method called ChromHMM
5. However, Segway and ChromHMM differ in several respects, chiefly the relative resolution of the two methods: Segway operates on the full data set, whereas ChromHMM works at 200 bp resolution and condenses each data track to a single binary datum within each 200-bp window. Consequently, Segway provides a finer-grained segmentation. Whereas the segments in a published ChromHMM segmentation
5 had a minimum length of 200 bp, a mean of 4,862 bp and a median of 800 bp, Segway segments had a mean length of 168 bp and a median of 124 bp.
We directly compared the behavior of Segway and ChromHMM in three separate loci ( and
Supplementary Fig. 10). Segway successfully identified the TSS of
GATA1, followed by a series of gene-related labels that generally progress from “gene start” through “gene middle” labels (). ChromHMM, in contrast, identified a large region that it identifies as “active promoter,” spanning thousands of base pairs. It labeled the rest of the gene “strong enhancer.” This example suggests that the 200 bp window size makes precise identification of fine-scale elements like promoters and internal gene structure difficult. We saw a similar effect at the
HBG1 locus (). Here, a known enhancer
12 appeared in the midst of a large (~10 kbp) ChromHMM segment labeled “active promoter,” whereas Segway placed a small enhancer label in the middle of the known enhancer region. We saw this behavior genome-wide (
Supplementary Results).
We have shown that Segway successfully reduces heterogeneous datasets into understandable patterns with clear biological implications. In the future, the flexibility of the DBN suggests a solution to the problem of learning large numbers of segment labels while retaining comprehensibility. Extension to a hierarchical segmentation would allow the learning of many sublabel patterns, while keeping a higher-level yet smaller-order structure that a researcher could analyze and understand more readily. While effective hyperparameter setting for hierarchical segmentation requires further research, we have already implemented the capability for hierarchical training and identification in Segway.