We describe PReMod, a new database of genome-wide cis-regulatory module (CRM) predictions for both the human and the mouse genomes. The prediction algorithm, described previously in Blanchette et al. (2006) Genome Res., 16, 656–668, exploits the fact that many known CRMs are made of clusters of phylogenetically conserved and repeated transcription factor (TF) binding sites. Unlike other existing databases, PReMod is not restricted to modules located proximal to genes, but in fact mostly contains distal predicted CRMs (pCRMs). Through its web interface, PReMod allows users to (i) identify pCRMs around a gene of interest; (ii) identify pCRMs that have binding sites for a given TF (or a set of TFs) or (iii) download the entire dataset for local analyses. Queries can also be refined by filtering for specific chromosomal regions, for specific regions relative to genes or for the presence of CpG islands. The output includes information about the binding sites predicted within the selected pCRMs, and a graphical display of their distribution within the pCRMs. It also provides a visual depiction of the chromosomal context of the selected pCRMs in terms of neighboring pCRMs and genes, all of which are linked to the UCSC Genome Browser and the NCBI. PReMod: http://genomequebec.mcgill.ca/PReMod.
The identification of DNA regulatory regions is one of the most important and challenging problems in the functional annotation of genomes. In higher eukaryotes, transcription factor (TF) binding sites are often organized in clusters called cis-regulatory modules (CRMs), which consist of DNA regions of up to a few hundred bases located in the (extended) neighborhood of the gene being regulated (1). While the prediction of individual TF-binding sites is a notoriously difficult problem, CRM predictions have proven to be more reliable and several algorithms have been developed in the last few years.
Most predictive methods rely on prior knowledge that has to be provided by the user. For instance, some methods will analyze the promoters of a set of (presumably) co-regulated genes obtained from some prior experiments in order to identify over-represented motif combinations (2–10). Other methods require a small set of TF position-weight matrices (PWMs) that are expected to co-occur in modules, and identify genomic regions densely populated in putative sites for these TFs (11–16). Because of the prior knowledge they require, none of these approaches are able to produce an unbiased, genome-wide survey of mammalian CRMs. Indeed, the only database of predicted cis-regulatory regions currently available for mammals, CisRed (17), is restricted to promoter regions.
In Blanchette et al. (18), we described a new sequence-based, genome-wide CRM identification method that exploits the observation that CRMs often contain several phylogenetically conserved binding sites for a few different TFs [see also a related approach by Philippakis and Bulyk (19)]. Applying this algorithm to the human and mouse genomes, we built the PReMod database, which contains the complete set of predicted CRMs (pCRMs) for those two genomes. Together with the recently published regulatory potential estimation from the Hardison group (20,21), our method represents the only computational approach that has been used for de novo, genome-wide prediction of CRMs.
PReMod will be useful for several types of investigations. First, researchers interested in the regulation of a specific gene can use PReMod to identify putative CRMs in the vicinity of that gene. The PReMod information is complementary to other types of data like inter-species conservation, CpG islands, regulatory potential, etc. However, it provides a richer annotation, as it predicts the TFs likely to be involved. Second, researchers interested in identifying the targets of a particular TF or TF family will find PReMod useful as it provides a ranked list of putative targets for all TFs for which PWMs are available in Transfac. Modules are ranked by their total binding site concentration for that factor. The list of pCRMs associated with a particular TF can then be used to experimentally validate some of the predictions. For example, Blanchette et al. (18) used the modules predicted to be bound by E2F4 and estrogen receptor (ER) to build a DNA microarray for chromatin immunoprecipitation (ChIP)-chip. A total of 55 and 433 modules were thus validated for ER and E2F4, respectively. While this corresponds to a relatively low fraction of the total number of modules tested (17% for E2F4 and 3% for ER), it is expected that testing binding under different experimental conditions will validate a much larger number of pCRMs since TFs (and in particular ER) are known to regulate different genes in different cellular contexts (22,23). Predicted CRMs can also be tested for function using lower-throughput approaches, such as reporter assays [e.g. Woofle et al. (24) and the Vista Enhancer Database (http://enhancer.lbl.gov)], and their predicted binding sites can be confirmed via gel shifts or mutagenesis. Finally, PReMod can be used as a data source for data mining efforts to understand the relationship between TFs (e.g. through co-occurrence of binding sites) or between TFs and genes of a particular function or expression pattern [e.g. see Ref. (18)].
By providing TF target predictions that are more accurate than individual binding site predictions, PReMod affords researchers a better dataset from which subtle patterns can emerge. For example, using PReMod, Blanchette et al. (18) highlighted a surprising enrichment of pCRMs near the 3′ end of genes, a result that is corroborated by a growing body of experimental evidence (25,26).
Users need to keep in mind that the different types of predictions contained within PReMod are associated with different expected specificities. We first clarify that PReMod is not meant to be an exhaustive list of CRMs, and that CRMs that do not fit the signature described above will go undetected. Among all the predictions contained in PReMod, those of individual TF-binding sites have the lowest expected accuracy. More accurate are the predictions of the interaction between a TF (or a family of TFs) and a particular module (but without specifying the exact position of the binding sites). Finally, the most accurate predictions are those of the location of the pCRMs themselves, although the precise boundaries of the modules remain difficult to establish.
The pCRMs contained in PReMod were computed using the method described by Blanchette et al. (18). We only provide a short overview of the method, and refer the interested reader to that article for more details. At the base of PReMod is a set of individual binding site predictions for TFs whose binding preferences are described by PWMs from the Transfac 7.2 database (27). Putative human binding sites are scored based on how well the human site and its orthologs in mouse and rat match the matrix [orthology is based on Multiz genome-wide alignments (28)]. Putative mouse sites are computed based on an alignment to the human and dog genomes. More precisely, a binding site's score is a weighted sum of the log-likelihood ratio scores in the three species. The score of the modules reported in PReMod reflects the presence, in a region of 100–1000 bp, of a surprisingly large number of binding sites (or, more precisely, a surprisingly large sum of their individual scores), for a few different PWMs. Specifically, to assign a score to a given genomic region, each PWM is first assigned a ‘matrixScore’, which reflects the surprise associated with the density and quality of predicted sites in that region. This surprise (P-value) depends, among other things, on the length and GC-content of the region and the genome-wide number and scores of predicted sites for the same PWM. The PWM with the highest matrixScore is chosen as first ‘tag’ for the region. Its occurrences are then masked, and the process is repeated, selecting a second tag. Up to five tags can be selected for a given module. In the end, the region is assigned a ‘moduleScore’, which reflects the surprise associated with the combined scores of the tags. Depending on which number of tags gives the most significant result, the lower-scoring tags may be rejected.
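The greedy tag selection outlined above can be illustrated with a minimal Python sketch. It is not the published implementation: the per-species weights are hypothetical, the real matrixScore is a P-value that accounts for region length, GC-content and genome-wide site statistics, and here a plain sum of site scores stands in for it. Masking is assumed to remove every site that overlaps an occurrence of the chosen PWM, and the final step that rejects lower-scoring tags is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    pwm: str      # PWM identifier (hypothetical labels in this sketch)
    start: int    # 0-based start, inclusive
    end: int      # end, exclusive
    score: float  # per-site score, e.g. from site_score() below


def site_score(llr_by_species, weights):
    """Weighted sum of per-species log-likelihood ratio scores.

    The weights are an assumption of this sketch; the source does not
    state the values used by PReMod.
    """
    return sum(weights[sp] * llr for sp, llr in llr_by_species.items())


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def select_tags(sites, max_tags=5):
    """Greedily pick up to max_tags PWMs for one candidate region.

    Stand-in for matrixScore: the sum of site scores per PWM.
    After each pick, the chosen PWM's sites and any sites overlapping
    them are masked (removed) before the next round.
    """
    remaining = list(sites)
    tags = []
    while remaining and len(tags) < max_tags:
        totals = {}
        for s in remaining:
            totals[s.pwm] = totals.get(s.pwm, 0.0) + s.score
        # Highest total wins; the PWM name breaks ties deterministically.
        best = max(totals, key=lambda p: (totals[p], p))
        tags.append((best, totals[best]))
        chosen = [s for s in remaining if s.pwm == best]
        remaining = [
            s for s in remaining
            if s.pwm != best and not any(overlaps(s, c) for c in chosen)
        ]
    return tags
```

Because overlapping sites are masked together, a PWM whose sites coincide with those of an already selected tag drops out of later rounds, which is the behavior that keeps only one member of a TF family among the tags.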
It is important to mention here that although PWMs chosen as tags for a module are likely to be of interest, other PWMs that were not selected could also correspond to factors binding the module. This is particularly true in the case where two or more different PWMs represent binding sites for factors of the same family (e.g. STAT1 and STAT3). Because factors from the same family tend to have similar PWMs, it is very difficult to distinguish between their binding sites. Since their predicted sites will heavily overlap, only one member of the family will be reported as tag. However, this should not be interpreted as an indication that this member is significantly more likely than its homologs to bind the module. Instead, the user should refer to the ‘matrixScore’ to assess the binding potential of a particular TF.
Genomic regions obtaining significant moduleScores (P-value below e−10) are reported in PReMod. We should however emphasize that the prediction algorithm is not very good at identifying the correct boundaries of the CRMs, and that one pCRM may sometimes actually contain two functionally distinct modules, or one module may be split between two pCRMs. We encourage the user to consider all types of evidence (e.g. regions of inter-species conservation) to decide on the correct CRM boundaries.
Table 1 reports the key statistics for the human and mouse versions of PReMod. The human version contains more than 123000 predicted modules, slightly more than the 91000 modules of the mouse version. The difference is largely due to the fact that the mouse binding site predictions use the dog genome for comparison, resulting in more stringent predictions. Approximately 1.9% of the human genome (and 1.7% of the mouse genome) is covered by pCRMs, consistent with the hypothesis that a large fraction of the non-coding functional regions has a regulatory function (29). Note that the set of human modules in PReMod is based on a newer assembly than the dataset originally reported in (18). The large number of predicted sites per module is due in part to the fact that sites are predicted separately for each PWM, even though several matrices often represent the same or related TFs. Thus, a single DNA location can be predicted as a binding site for more than one PWM.
A Java web-based application allows users to browse, query or download the database. To address the various needs of the users, PReMod can be queried in a number of ways using an advanced search form (Figure 1A). First, users can request regulatory modules related to a given gene. Although there is currently no way to confidently assign pCRMs to the gene they regulate, PReMod assumes that the gene whose transcription start site (TSS) is the closest to the module is the most likely target. However, this association is likely to be incorrect in many cases, in particular for very-long-range regulators.
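The closest-TSS heuristic can be sketched in a few lines of Python. This is an illustration only, not PReMod's code: the function name and the input layout (a position-sorted list of TSSs for one chromosome) are hypothetical, and measuring the distance from the module midpoint is an assumption, since the text does not say which point of the module is used.

```python
import bisect


def nearest_tss_gene(module_start, module_end, tss_list):
    """Return the gene whose TSS is closest to the module midpoint.

    tss_list: list of (position, gene_name) tuples for one chromosome,
    sorted by position. Returns None if the list is empty.
    Using the midpoint as the reference point is an assumption.
    """
    if not tss_list:
        return None
    mid = (module_start + module_end) // 2
    positions = [pos for pos, _ in tss_list]
    i = bisect.bisect_left(positions, mid)
    # Only the TSS just before and just after the midpoint can be closest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda t: abs(t[0] - mid))[1]
```

As the text cautions, the nearest gene is only the most likely target under this heuristic; a distal module may regulate a gene much further away.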
The second type of PReMod query is TF-centric. Specifically, the user can request to see all the pCRMs containing predicted binding sites for one or more TFs (only TFs with Transfac matrices can be sought). By default, all modules containing predicted sites for the specified TF will be reported, although the search can also be restricted to only the PWMs used as tags for the pCRMs. An example of this type of query is shown in Figure 1A. Here, the user wants to identify the pCRMs containing tags for two nuclear receptor TFs, ER (M00191) and androgen receptor (M00447). All queries can be refined further by restricting the search to some chromosomal regions, to pCRMs that have a particular moduleScore, to modules located around specific genes, or to modules overlapping CpG islands.
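For users who download the dataset and want to reproduce this kind of TF-centric query locally, the filtering logic might look like the following Python sketch. The record layout, field names and function name are hypothetical and do not reflect the actual PReMod download format. The sketch also assumes that a query naming several matrices requires all of them to be present in a module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    module_id: str        # hypothetical identifier
    chrom: str
    start: int            # 0-based start, inclusive
    end: int              # end, exclusive
    score: float          # moduleScore
    tags: frozenset       # matrices selected as tags
    matrices: frozenset   # all matrices with predicted sites
    overlaps_cpg: bool = False


def query_modules(modules, matrices, tags_only=False, chrom=None,
                  region=None, min_score=None, cpg_only=False):
    """Return modules matching all given constraints, best score first.

    matrices: matrix accessions that must all be present (assumption).
    region: optional (start, end) half-open interval on chrom.
    """
    wanted = set(matrices)
    hits = []
    for m in modules:
        pool = m.tags if tags_only else m.matrices
        if not wanted <= pool:
            continue
        if chrom is not None and m.chrom != chrom:
            continue
        if region is not None and not (m.start < region[1] and region[0] < m.end):
            continue
        if min_score is not None and m.score < min_score:
            continue
        if cpg_only and not m.overlaps_cpg:
            continue
        hits.append(m)
    return sorted(hits, key=lambda m: m.score, reverse=True)
```

Under these assumptions, the query of Figure 1A would correspond to a call such as `query_modules(modules, ["M00191", "M00447"], tags_only=True)`.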
Upon submitting a query, the user receives the list of modules satisfying the given constraints. All outputs can be viewed as HTML files or exported to an Excel spreadsheet. For each module reported, the module identifier, genomic position, length and score are given. Also given are the genes with the closest TSS upstream or downstream of the pCRM. Finally, the list of Transfac matrices selected as tags for the module is shown.
For example, given the query described above, the list of four modules produced as output is shown in Figure 1B. The second is a module located next to the progesterone receptor gene, which we showed to be bound by ER (18). However, we focus here on the fourth module reported, which is interesting for a number of reasons. First, not only are the estrogen and androgen receptors predicted to bind this module, but a third nuclear receptor, RORalpha1, which was not included in our query, is selected as first tag for this module. Given that different nuclear receptors are known to cooperatively (or antagonistically) bind regulatory regions (30), this association is promising. Second, this region is located within an intron of the ERBB4 gene (v-erb-a erythroblastic leukemia viral oncogene), a key growth factor receptor tyrosine kinase inducing cell differentiation (31). The ERBB family is a key player in hormone-dependent breast (and other types of) cancer.
For each module, a details page is obtained by clicking the module name, as is exemplified in Figure 1C–E, for the module described above. This page contains all the binding site information about the selected module, starting with a visual representation of the position of the predicted binding sites for the TFs selected as tags (Figure 1D). The page gives the complete list of matrices with predicted sites in the module, and the position of these predicted sites can be visualized in the graphical display (Figure 1C). We emphasize again that, although the selection of TF tags for the modules is a necessary step algorithmically, the fact that a TF was not selected as a tag should not necessarily be interpreted as the TF being of less interest. Therefore, we recommend that users consider the matrix's total score as an indication of the binding potential.
Each module can be visualized in its genomic context, together with the genes nearby and the other surrounding modules (Figure 1E). By clicking the other modules in the image, one can explore their binding site content and properties. Quite often, interesting patterns will emerge when several neighboring modules are considered together. For example, the module located upstream of the module described above is also predicted to be bound by several nuclear receptors. Finally, to explore the selected module in the context of other types of annotations, a link to the UCSC genome browser is provided, where the pCRMs are displayed using a custom track.
Several features will be added to PReMod in the near future. Predicted CRMs will soon be made available for other mammalian species (rat, dog, etc.), and orthologous pCRMs from different species will be linked to each other, allowing easy jumps from one species to the other. As new genome assemblies come out, new versions of the database will be released. To simplify querying the database, a BLAST server will be made available, allowing the quick identification of pCRMs homologous to a given query sequence. Finally, the module prediction algorithm is under constant refinement and new releases of PReMod will likely improve the specificity of the predictions.
The authors thank Diane Bourque, Johanne Duhaime, Nathalie Edmond and Martin Leboeuf for their technical support, as well as the UCSC genome browser group for their support. The authors also thank Vincent Giguère for his advice about the nuclear receptor analysis. This work was funded by grants from Génome Québec and Génome Canada (M.B., B.C. and F.R.). F.R. holds a new investigator award from the CIHR. Funding to pay the Open Access publication charges for this article was provided by NSERC and CIHR.
Conflict of interest statement. None declared.