The identification of DNA regulatory regions is one of the most important and challenging problems in the functional annotation of genomes. In higher eukaryotes, transcription factor (TF) binding sites are often organized in clusters called cis-regulatory modules (CRMs), which consist of DNA regions of up to a few hundred bases located in the (extended) neighborhood of the gene being regulated (1). While the prediction of individual TF-binding sites is a notoriously difficult problem, CRM predictions have proven to be more reliable, and several algorithms have been developed in the last few years.
Most predictive methods rely on prior knowledge that has to be provided by the user. For instance, some methods analyze the promoters of a set of (presumably) co-regulated genes obtained from prior experiments in order to identify over-represented motif combinations (2). Other methods require a small set of TF position-weight matrices (PWMs) that are expected to co-occur in modules, and identify genomic regions densely populated with putative sites for these TFs (11). Because of the prior knowledge they require, none of these approaches is able to produce an unbiased, genome-wide survey of mammalian CRMs. Indeed, the only database of predicted cis-regulatory regions currently available for mammals, CisRed (17), is restricted to promoter regions.
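To make the PWM-based strategy mentioned above concrete, the following is a minimal sketch of how a region "densely populated" with putative sites might be located. It is not the procedure of any cited method: the toy matrix `TOY_PFM`, the uniform background, the score threshold, the window size and all function names are assumptions introduced for illustration, and the input is assumed to contain only uppercase A, C, G and T.

```python
import math

# Hypothetical position frequency matrix for a toy TF (rows = positions,
# columns = A, C, G, T). Illustrative values only, not taken from Transfac.
TOY_PFM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def log_odds_score(site, pfm, background=0.25):
    """Sum of per-position log2(frequency / background) for one candidate site."""
    return sum(
        math.log2(pfm[i][BASE_INDEX[base]] / background)
        for i, base in enumerate(site)
    )


def putative_sites(seq, pfm, threshold):
    """Start positions whose log-odds score reaches the (assumed) threshold."""
    width = len(pfm)
    return [
        i
        for i in range(len(seq) - width + 1)
        if log_odds_score(seq[i:i + width], pfm) >= threshold
    ]


def densest_window(seq, pfm, threshold, window):
    """Return (start, count) for the window holding the most putative site starts."""
    hits = putative_sites(seq, pfm, threshold)
    best_start, best_count = 0, 0
    for start in range(max(1, len(seq) - window + 1)):
        count = sum(1 for h in hits if start <= h < start + window)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count
```

In practice the threshold and window size strongly affect which regions are reported, which is one way in which such methods depend on user-supplied choices.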
In Blanchette et al., we described a new sequence-based, genome-wide CRM identification method that exploits the observation that CRMs often contain several phylogenetically conserved binding sites for a few different TFs [see also a related approach by Philippakis and Bulyk (19)]. Applying this algorithm to the human and mouse genomes, we built the PReMod database, which contains the complete set of predicted CRMs (pCRMs) for those two genomes. Together with the recently published regulatory potential estimation from the Hardison group (20), our method represents the only computational approach that has been used for de novo, genome-wide prediction of CRMs.
PReMod will be useful for several types of investigations. First, researchers interested in the regulation of a specific gene can use PReMod to identify putative CRMs in the vicinity of that gene. The PReMod information is complementary to other types of data such as inter-species conservation, CpG islands and regulatory potential. However, it provides a richer annotation, as it predicts the TFs likely to be involved. Second, researchers interested in identifying the targets of a particular TF or TF family will find PReMod useful, as it provides a ranked list of putative targets for all TFs for which PWMs are available in Transfac. Modules are ranked by their total binding site concentration for that factor. The list of pCRMs associated with a particular TF can then be used to experimentally validate some of the predictions. For example, Blanchette et al. used the modules predicted to be bound by E2F4 and estrogen receptor (ER) to build a DNA microarray for chromatin immunoprecipitation on chip (ChIP-chip). A total of 55 and 433 modules were thus validated for ER and E2F4, respectively. While this corresponds to a relatively low fraction of the total number of modules tested (17% for E2F4 and 3% for ER), it is expected that testing binding under different experimental conditions will validate a much larger number of pCRMs, since TFs (and in particular ER) are known to regulate different genes in different cellular contexts (22). Predicted CRMs can also be tested for function using lower-throughput approaches, such as reporter assays [e.g. Woolfe et al. and the Vista Enhancer Database (http://enhancer.lbl.gov)], and their predicted binding sites can be confirmed via gel shifts or mutagenesis. Finally, PReMod can be used as a data source for data mining efforts to understand the relationship between TFs (e.g. through co-occurrence of binding sites) or between TFs and genes of a particular function or expression pattern [e.g. see Ref. (18)]. By providing TF target predictions that are more accurate than individual binding site predictions, PReMod affords researchers a better dataset from which subtle patterns can emerge. For example, using PReMod, Blanchette et al. highlighted a surprising enrichment of pCRMs near the 3′ end of genes, a result that is corroborated by a growing body of experimental evidence (25
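The ranking of pCRMs for a given TF described above can be illustrated with a short sketch. The text does not define "total binding site concentration" precisely, so the code assumes it is the sum of the predicted site scores for that TF divided by the module length; the `Module` record, its fields and the identifiers used below are hypothetical and do not reflect the PReMod schema.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical record for one predicted module (not the PReMod schema)."""
    module_id: str
    length: int  # module length in base pairs
    # TF name -> scores of the predicted sites for that TF inside the module
    site_scores: dict = field(default_factory=dict)


def concentration(module, tf):
    """Assumed definition: summed site scores for `tf` per base pair of module."""
    return sum(module.site_scores.get(tf, [])) / module.length


def rank_modules(modules, tf):
    """Modules with at least one predicted site for `tf`, highest concentration first."""
    candidates = [m for m in modules if m.site_scores.get(tf)]
    return sorted(candidates, key=lambda m: concentration(m, tf), reverse=True)
```

For instance, given two modules with predicted E2F4 sites, `rank_modules(modules, "E2F4")` would place first the one whose summed site scores are highest relative to its length, and the top of such a list is what one might select for experimental follow-up.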
Users need to keep in mind that the different types of predictions contained within PReMod are associated with different levels of expected specificity. We first clarify that PReMod is not meant to be an exhaustive list of CRMs, and that CRMs that do not fit the signature described above will go undetected. Among all the predictions contained in PReMod, those of individual TF-binding sites have the lowest expected accuracy. More accurate are the predictions of the interaction between a TF (or a family of TFs) and a particular module (without specifying the exact position of the binding sites). Finally, the most accurate predictions are those of the location of the pCRMs themselves, although the precise boundaries of the modules remain difficult to establish.