Transcription factors (TFs) are proteins that bind specific DNA elements and regulate gene transcription. There are approximately 1,700 to 1,900 TFs in human, including about 1,400 manually curated sequence-specific TFs [1]. They bind different types of DNA elements, including promoters, enhancers, silencers, insulators and locus control regions [2]. While promoters are close to transcription start sites (TSSs), the other types of elements can be far away from the genes that they regulate, and no simple rules are known that define their exact locations. For instance, enhancers can be as far as one megabase pair (1 Mbp) from the target gene in eukaryotes [3], and can be either upstream or downstream of the promoter of the target gene [4].
One important step towards a thorough understanding of transcriptional regulation is to catalog all regulatory elements in a genome. There are databases for regulatory elements with experimental data [5]. The completeness of these databases has been limited by the small number of validation experiments performed relative to the expected number of regulatory elements, and by the small amount of TF binding data available relative to the total number of TFs. There are also many computational methods for predicting cis-regulatory modules, many of which are based on evolutionary conservation and on the density and distribution of binding motifs [8]. Since these features are static and do not take into account the dynamic environment of DNA, such as DNA methylation, nucleosome occupancy and histone modifications, these predictions usually have high false positive rates.
To systematically identify TF binding sites on a large scale, high-throughput methods such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) [10] have been developed. With the goal of identifying all functional elements in the human genome, the Encyclopedia of DNA Elements (ENCODE) project [12] has used high-throughput methods to produce a large amount of experimental data for studying TF binding sites. In the pilot phase, which studied 44 regions that together cover about 1% of the human genome [13], the binding sites of 18 sequence-specific TFs and components of the general transcription machinery were identified using chromatin immunoprecipitation followed by microarray (ChIP-chip) [14], paired-end tag sequencing (ChIP-PET) [16], and sequence tag analysis of genomic enrichment (STAGE) [17]. Analysis of a subset of these data revealed a non-uniform distribution of TF binding sites in the surveyed regions, a statistical association of the binding sites with both TSSs and transcription end sites of known genes, and clustering of binding sites of different TFs [18].
Following the success of the pilot phase, ENCODE entered its production phase in 2007 to study DNA elements in the whole human genome. Both the scale and the variety of experiments have greatly increased [19]. In terms of protein-DNA binding, many ChIP-seq experiments have been performed to identify the binding sites of sequence-specific TFs, general TFs, and chromatin-related factors, which we collectively call transcription-related factors (TRFs). About 500 ChIP-seq datasets have been produced, involving more than 100 different TRFs in more than 70 cell lines [20]. There are also matched expression data and chromatin features, such as histone modifications from ChIP-seq experiments, and DNA accessibility from DNase I hypersensitivity analysis [21] and formaldehyde-assisted isolation of regulatory elements (FAIRE) [23], making the dataset a valuable resource for studying transcriptional regulation.
Despite the large amount of data available, it is still non-trivial to identify all regulatory elements and provide useful annotations for them, for two major reasons. First, the fraction of TRFs included in the experiments is still small compared to the total number of TRFs in human. As a result, a regulatory element that is bound only by TRFs not covered by these experiments cannot be identified simply by cataloging all the observed TRF binding sites. Instead, it is necessary to model each type of regulatory element by some general features that are available for the whole genome, and to use these features to extend the search for the elements to regions not covered by the experiments.
Second, the overwhelming amount of data makes it difficult to extract useful information. Processing hundreds of genome-scale data files requires substantial computational resources even for simple analysis tasks, not to mention the complexity of cross-referencing other types of related data, such as gene expression and histone modifications. The statistical significance of observations is also difficult to evaluate because of the non-uniform distribution of genomic elements and the complex dependency structures within a single dataset and between different datasets.
Here we report our work on using statistical methods to learn general properties of different types of genomic regions defined by TRF binding. We also describe the application of the learned models to locating all occurrences of these types of regions in the whole human genome in different cell types, including locations with no direct experimental binding data. Our main goal is to provide a concise and accessible summary of the large amount of data in the form of several types of regions with clear interpretations, to facilitate various kinds of downstream analyses.
Specifically, we report our identification of six different types of genomic regions that can be grouped into three pairs: regions with active/inactive binding; regulatory modules proximal to promoters/distal to genes; and regions with extremely high/low degrees of co-occurrence of binding by factors that do not usually co-associate. We discuss the chromosomal locations of these regions, their cell-type specificity, chromatin features and different sets of TRFs that bind them, and show that a variety of properties of our called regions are in strong agreement with prior knowledge of TRF binding.
To further explore functional aspects of the identified regions, we report our work on predicting enhancers from the distal regulatory modules and validating their activities by reporter assays. We also link distal regulatory modules to potential target genes and identify the TRFs involved. Finally, we suggest a potential relationship between non-sequence-specific TRF binding and DNase hypersensitivity at regions with high co-occurrence of TRF binding. All these whole-genome analyses would have been difficult to carry out without the large cohort of data produced by ENCODE.
Related ideas for identifying different types of regions in the whole genome have been proposed, both by groups within ENCODE and by other groups. One approach is to use one or a few previously known features to define particular region types, such as using DNase I hypersensitivity and some specific histone marks to identify enhancers. In comparison, our approach identifies feature patterns directly from data using a machine learning framework, which has the potential to discover novel features for specific region types. Another related idea is to segment the genome in an 'unsupervised' fashion, that is, to group regions based on observed data alone without any predefined region types. This approach is most suitable for exploring new region types, but a major challenge is interpreting the resulting segments. In the current work we focus on the six types of regions described, and take a 'supervised' approach whenever possible, that is, we learn general properties of a region type from known examples. When there are sufficient examples, the supervised approach is usually preferred for identifying members of well-defined classes.
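As a minimal illustration of the supervised idea, the following Python sketch learns one feature profile per region type from labeled examples and then assigns a type to a region that has no direct binding data. It is not the method used in this work: the nearest-centroid rule, the three features (DNase signal, H3K4me3, H3K27ac), the region labels and all numeric values are assumptions chosen only to make the example concrete.

```python
from statistics import mean

# Hypothetical feature vectors for genomic bins: (DNase signal, H3K4me3, H3K27ac).
# All values are invented for illustration only.
train = {
    "active":   [(0.9, 0.8, 0.7), (0.8, 0.6, 0.9), (0.7, 0.9, 0.8)],
    "inactive": [(0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.3, 0.2, 0.2)],
}

def centroid(vectors):
    """Average each feature across the known examples of one region type."""
    return tuple(mean(column) for column in zip(*vectors))

def fit(examples):
    """Learn one centroid per region type from labeled examples."""
    return {label: centroid(vectors) for label, vectors in examples.items()}

def predict(model, features):
    """Assign the region type whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, features))
    return min(model, key=lambda label: dist(model[label]))

model = fit(train)

# A bin with no direct binding data but with genome-wide chromatin features.
unlabeled_bin = (0.75, 0.7, 0.85)
print(predict(model, unlabeled_bin))  # prints "active"
```

An unsupervised segmentation would instead group the bins without the labels in `train`, leaving the interpretation of each resulting group to a later step.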