The sequencing of the human genome produced the complete recipe for a human being encoded in digital form, and much of the past decade of molecular biology has been devoted to deciphering the meaning of this code. On this premise, the ENCODE Project Consortium sought to discover a complete catalog of all functional elements in the human genome (1), analogous to delineating the sentences and words that make up the human genome and understanding the function each element performs. Such a catalog will undoubtedly never be complete, given the diversity of cell types where elements may activate, the diversity of experimental assays needed to probe them, and the specific conditions and stimuli to which they may respond. The scale-up phase of the ENCODE project, however, has made substantial advances toward the goal of a comprehensive catalog. It has carried out a daunting 1640 total experiments in 147 cell types, using multiple distinct biochemical assays, including ChIP-seq, DNase-seq, FAIRE-seq and RNA-seq. Interpreting the resulting information is arguably more complex than interpreting the primary sequence of the human genome. The four nucleotides of the sequence have been replaced by a large vector of numerical values, each representing the result of a different biochemical assay in a given condition and a given cell type, at each position of each chromosome. The challenge at hand is thus to turn these vectors of numerical values into an interpretable annotation, namely, the list of functional elements that the ENCODE project set out to annotate.
To address this challenge, we and others have developed a variety of computational techniques that seek to identify functional elements from high-throughput genomic datasets. These techniques fall into two groups: supervised learning methods that attempt to find instances of one or more pre-determined classes of elements, and unsupervised learning methods that seek to simultaneously discover functional classes and annotate their instances de novo. Supervised learning methods have been widely used in automatic gene-finding methods that recognize protein-coding transcripts using sequence features, cDNA sequence and evolutionary conservation of known examples (2). Supervised models have also been successfully used to recognize promoters (4), enhancers (5) and microRNAs (6), based on known examples. Because supervised learning methods require a training set of known examples, they are incapable of discovering novel types of functional elements. Unsupervised methods, in contrast, identify candidate functional elements without the need for previously defined classes or known examples, thereby avoiding biases toward well-understood phenomena. Despite their generality, peak-finding algorithms for ChIP-seq analysis can be seen as supervised learning methods, seeking to recognize ‘peak-like’ behavior. Moreover, peak-finding methods have difficulty generalizing to the joint analysis of dozens of tracks of functional genomics data, where the diversity of possibly interesting patterns is very high. Such integrative analyses are central to the mission of ENCODE, which aims not only to produce such data but also to make sense of the resulting collection of datasets.
In this work, we apply unsupervised chromatin state annotation methods that simultaneously discover the locations of functional elements in the human genome and assign to each element one of a small number of labels, which can be interpreted as functional annotations. As input, our methods receive a collection of functional genomics datasets and a user-specified parameter for the number of distinct labels that the method should discover. The input datasets consist of ChIP-seq assays for multiple histone modifications and general transcription factors, together with chromatin accessibility assays. We restricted ourselves to chromatin-level information in the initial annotation stage, and did not use RNA-seq information as an input to our models, instead reserving it for later validation. Our computational analysis provides as output an annotation of the human genome. This annotation consists of a segmentation into non-overlapping segments, and a labeling of each segment using one of a small set of labels, which we refer to as chromatin states. The goal of the chromatin state annotation is to capture the similarities of segments that show the same patterns across many experiments by assigning them the same label, thus summarizing a very large collection of data into a more meaningful form. The resulting segment labels typically correspond to an intuitive, human-interpretable biological function, which we use to summarize them, even though we recognize that the underlying biology is usually more complex. In other cases, a segment label may remain uninterpreted: its specific combination of chromatin marks may distinguish a function whose biological role will not be understood until additional biological processes are elucidated or additional datasets become available. Because they are unsupervised, these chromatin state annotations may thus identify novel instances of known classes of functional elements, suggest novel subdivisions of classes into subclasses, or point to the existence of entirely new types of functional elements.
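To make the form of the output concrete, the following minimal sketch collapses a sequence of per-bin state labels into non-overlapping, labeled segments. It is an illustration only, not the output code of either method: the function name `labels_to_segments`, the state names, the 200-bp bin size used as a default and the half-open coordinate convention are all assumptions chosen for the example.

```python
from itertools import groupby


def labels_to_segments(labels, bin_size=200, start=0):
    """Collapse per-bin labels into (start, end, label) segments.

    Coordinates are half-open, [start, end). The bin size and the start
    coordinate are illustrative assumptions, not values taken from the paper.
    """
    segments = []
    pos = start
    for label, run in groupby(labels):
        # Each run of identical consecutive labels becomes one segment.
        length = sum(1 for _ in run) * bin_size
        segments.append((pos, pos + length, label))
        pos += length
    return segments


# Hypothetical state names, used only to show the shape of the annotation.
bins = ["Quiescent", "Quiescent", "Enhancer",
        "Promoter", "Promoter", "Promoter", "Transcribed"]
segments = labels_to_segments(bins)
for seg_start, seg_end, state in segments:
    print(seg_start, seg_end, state, sep="\t")
```

Because adjacent bins with the same label are merged, the segments tile the region without gaps or overlaps, which is the property the annotation described above relies on.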
During the pilot phase of ENCODE, Thurman et al. combined a hidden Markov model (HMM) with wavelet smoothing to produce a two-label segmentation of the ENCODE pilot regions into ‘active’ and ‘repressed’ regions (7). A variety of segmentation models have been described subsequently, employing HMMs with flat (9) or hierarchical (11) structures, or generalizing the HMM to a hierarchical change-point model (12). For the second phase of ENCODE, two research groups within the consortium independently developed chromatin state annotation algorithms, ChromHMM (13) and Segway (15). Although the methods were designed and initially implemented independently of one another, they share many key features. Most significantly, the methods employ closely related probabilistic models. ChromHMM is implemented as an HMM, in which the ‘time’ axis is the chromosomal coordinate and the various ENCODE datasets are the observed variables. Similarly, Segway employs a dynamic Bayesian network (DBN) approach, which is a generalization of the HMM framework. The HMM/DBN approach offers multiple important advantages, including efficient algorithms for carrying out inference and a modeling paradigm in which the model’s internal variables have well-defined semantics.
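As an illustration of the efficient inference mentioned above, the sketch below implements the standard scaled forward algorithm for an HMM whose ‘time’ axis is the chromosomal coordinate. It is a generic textbook formulation, not the ChromHMM or Segway implementation; the two-state toy model, its probabilities and all variable names are assumptions made up for the example.

```python
import math


def forward_log_likelihood(init, trans, emit_lik):
    """Scaled forward algorithm for a discrete-state HMM.

    init[k]        : probability of state k at the first position
    trans[j][k]    : probability of moving from state j to state k
    emit_lik[t][k] : likelihood of the data at position t under state k
    Returns the log-probability of all observations under the model.
    """
    n_states = len(init)
    alpha = [init[k] * emit_lik[0][k] for k in range(n_states)]
    log_lik = 0.0
    for t in range(len(emit_lik)):
        if t > 0:
            # Propagate one position along the chromosome, then weight by
            # the likelihood of the data observed at the new position.
            alpha = [
                sum(alpha[j] * trans[j][k] for j in range(n_states))
                * emit_lik[t][k]
                for k in range(n_states)
            ]
        # Rescale to avoid numerical underflow on long sequences.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Toy two-state model; every number here is an illustrative assumption.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit_lik = [[0.8, 0.1], [0.7, 0.3], [0.2, 0.9]]
log_lik = forward_log_likelihood(init, trans, emit_lik)
print(log_lik)
```

The running time grows linearly with the number of positions and quadratically with the number of states, which is what makes this family of models practical at genome scale.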
Key differences between the two chromatin state annotation methods are summarized in the accompanying table. Broadly speaking, ChromHMM aims to take more of a bird's-eye view of the data, opting to compress each data track to a single Boolean value for each 200-bp segment of the genome. This approach makes ChromHMM computationally efficient, enabling training on the entire genome, and reduces the chances that artifacts related to scaling of the data or local patterns of missing data due to mapping problems will mislead ChromHMM. The 200-bp resolution also ensures that ChromHMM produces reasonably large nucleosome-sized segments without having to implement complex constraints on segment length distributions. Segway, in contrast, operates on the full data matrix at 1-bp resolution. Segway handles missing data in a principled fashion by marking each missing data point as a hidden variable and marginalizing over all possible values. To ensure that Segway produces segments of a reasonable size, we employ both hard constraints (for example, enforcing a minimum length of 100 bp per segment) and soft constraints (length priors). For efficiency, we trained the Segway models described here only on 1% of the human genome, but training on the entire genome is possible. Finally, ChromHMM and Segway differ in the choice of algorithm used to assign the final labeling: ChromHMM assigns to each segment the label with maximum posterior probability, whereas Segway selects the series of labels that jointly achieves the highest probability over the entire segmentation path.
Table. Major differences between ChromHMM and Segway as applied to the ENCODE data.
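The difference between the two labeling criteria described above can be sketched on a toy HMM. The first function picks, at each position, the label with maximum posterior probability (the criterion described for ChromHMM); the second finds the single jointly most probable label sequence, commonly known as Viterbi decoding (the criterion described for Segway). This is a generic illustration rather than either tool's code: the model, its numbers and the function names are assumptions, and the unscaled arithmetic is suitable only for very short sequences.

```python
def posterior_decode(init, trans, emit_lik):
    """Per-position label with maximum posterior probability."""
    n, T = len(init), len(emit_lik)
    fwd = [[init[k] * emit_lik[0][k] for k in range(n)]]
    for t in range(1, T):
        fwd.append([sum(fwd[-1][j] * trans[j][k] for j in range(n))
                    * emit_lik[t][k] for k in range(n)])
    bwd = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        bwd[t] = [sum(trans[j][k] * emit_lik[t + 1][k] * bwd[t + 1][k]
                      for k in range(n)) for j in range(n)]
    # fwd * bwd is proportional to the posterior of each state at position t.
    return [max(range(n), key=lambda k: fwd[t][k] * bwd[t][k])
            for t in range(T)]


def viterbi(init, trans, emit_lik):
    """Single label sequence with the highest joint probability."""
    n = len(init)
    score = [init[k] * emit_lik[0][k] for k in range(n)]
    back = []
    for t in range(1, len(emit_lik)):
        # Best predecessor state for each current state k.
        prev = [max(range(n), key=lambda j: score[j] * trans[j][k])
                for k in range(n)]
        score = [score[prev[k]] * trans[prev[k]][k] * emit_lik[t][k]
                 for k in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]


def path_probability(path, init, trans, emit_lik):
    """Joint probability of a state path and the observed data."""
    p = init[path[0]] * emit_lik[0][path[0]]
    for t in range(1, len(path)):
        p *= trans[path[t - 1]][path[t]] * emit_lik[t][path[t]]
    return p


# Toy two-state model; every number here is an illustrative assumption.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit_lik = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
p_path = posterior_decode(init, trans, emit_lik)
v_path = viterbi(init, trans, emit_lik)
print("posterior decoding:", p_path)
print("Viterbi decoding:  ", v_path)
```

The two criteria often agree, but they need not: posterior decoding optimizes each position separately, so its labeling can have a lower joint probability than the Viterbi path, whereas the Viterbi path is guaranteed to be a single coherent sequence.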
The work presented here constitutes the first systematic integration of chromatin elements across the entire ENCODE project. The two methods used reveal functional chromatin elements at different levels of resolution, making it possible both to study the transitions between different types of chromatin states at single-nucleotide resolution and to obtain a robust annotation that can tolerate small variations in large chromatin domains. These two annotations form the basis of the integrative analysis for the ENCODE project, and they provide a systematic view of the chromatin landscape. This integrated viewpoint will be of great value to epigenomics research. In addition, this work describes a manually curated chromatin annotation that synthesizes the two complementary methodologies.
The resulting annotations capture the remarkable diversity of genomic functions encoded by distinct chromatin states, are robust across different cell types, and are reliably recovered by the two methods used here. We also created a combined segmentation that contains features of both. Our systematic annotation of chromatin elements has several important implications for the study of the human genome:
- The annotation successfully and automatically recovers much of what is already known about genome organization, including transcript-associated chromatin states and diverse classes of regulatory elements, based solely on an unsupervised analysis of chromatin data.
- The annotation reveals the important relationship between biochemical activity for chromatin functions and RNA transcription, and shows important differences between the two.
- The annotation points to the surprising finding that a large portion of the human genome exists in a quiescent state, which holds across multiple cell types.
- The annotation provides an unbiased view of functional non-coding regulatory elements, enabling us to evaluate different metrics and methods for measuring human evolutionary constraint. In particular, we report the first genome-wide experimental demonstration of the functional relevance of evolution-based inference of constraint for pairs of nucleotides, rather than individual nucleotides at each position (16).
- The annotation enables us to revisit disease-associated regions, identified via genome-wide association studies (GWAS), that previously lacked any functional annotations, providing focused, testable hypotheses revealed through the lens of the chromatin landscape.