Correct interactions between transcription factors (TFs) and their binding sites (TFBSs) are of central importance to gene regulation. Recently developed chromatin-immunoprecipitation DNA chip (ChIP-chip) techniques and the phylogenetic footprinting method provide ways to identify TFBSs with high precision. In this study, we constructed a user-friendly interactive platform for dynamic binding site mapping using ChIP-chip data and phylogenetic footprinting as two filters. MYBS (Mining Yeast Binding Sites) is a comprehensive web server that integrates an array of both experimentally verified and predicted position weight matrixes (PWMs) from eleven databases, including 481 binding motif consensus sequences and 71 PWMs that correspond to 183 TFs. MYBS users can search within this platform for motif occurrences (possible binding sites) in the promoters of genes of interest via simple motif or gene queries in conjunction with the above two filters. In addition, MYBS enables users to visualize in parallel the potential regulators for a given set of genes, a feature useful for finding potential regulatory associations between TFs. MYBS also allows users to identify target gene sets of each TF pair, which could be used as a starting point for further explorations of TF combinatorial regulation. MYBS is available at http://cg1.iis.sinica.edu.tw/~mybs/.
Eukaryotic gene expression is achieved by multiple layers of regulation, including transcription regulation, which requires transcription factors (TFs) to bind to their respective DNA binding sites (TFBSs) in a correct spatial and temporal manner (1). Identifying and characterizing the binding sites of TFs can permit a more comprehensive and quantitative mapping of the regulatory mechanisms within cells. Unfortunately, TFBSs are usually short (~5–15bp) and degenerate (2), making it difficult to define TFBSs experimentally or computationally.
In Saccharomyces cerevisiae, only a limited number of functional TFBSs have been experimentally verified (3). Inference of TFBSs has thus relied heavily on computational approaches. A number of plausible motif consensus sequences have been deduced by different bioinformatics methods that exploit sequence information. However, there have been reservations about using those consensus sequences to annotate the genome because they yield excessive false positives. Fortunately, the chromatin-immunoprecipitation DNA chip (ChIP-chip) technique (4,5) provides a powerful way to verify the DNA-binding affinity of TFs. In addition, phylogenetic footprinting methods, which assume that functional elements are conserved during evolution, have been used to reveal TFBSs that are conserved across species (6,7).
A fair amount of reliable TFBS information has accumulated in various databases during the last few years. For example, SCPD (3), SGD (8), TRANSFAC (9), YPD (10) and YEASTRACT (11) contain an array of TF motif consensus sequences derived from the literature and experimental data. Some of them annotate the genome by simple sequence matching, which produces many spurious matches. SGD in particular remaps the TFBSs inferred by Harbison et al. (5), who took advantage of ChIP-chip data and phylogenetic information. However, SGD makes a priori assumptions about the degree of conservation across species and the binding affinities of TFs. SwissRegulon (12,13) is another database, in which the site annotations were produced by applying several algorithms to related genomes in combination with known sites from the literature and ChIP-chip binding data. SwissRegulon contains a variety of experimentally verified or computationally predicted TFBSs for the entire genomes of 18 organisms. However, SwissRegulon currently lacks information about the degree of conservation across species and about the related condition-specific ChIP-chip experiments for TFBSs.
Since the degree of conservation across species and the binding affinity both vary among TFs, we constructed a comprehensive web server, mining yeast binding sites (MYBS), which integrates several types of data related to transcriptional regulation in S. cerevisiae. Via simple motif or gene queries, MYBS allows users to apply ChIP-chip data and phylogenetic footprinting filters to genomic data to perform dynamic binding site mapping.
MYBS integrates three main types of data: related yeast genomic sequences, ChIP-chip data and motif information. Currently, the genomic sequences of eight yeast species (S. cerevisiae, S. paradoxus, S. kudriavzevii, S. mikatae, S. bayanus, S. castellii, S. kluyveri and Candida glabrata) are included. For each gene in S. cerevisiae, we downloaded the promoter sequences (~1000 bp, intergenic regions only) of its orthologous genes in the other six Saccharomyces species from SGD (8). Since the genome of C. glabrata is not annotated in SGD, for each gene in S. cerevisiae we found its C. glabrata orthologue from http://cbi.labri.fr/Genolevures/download/CAGL_annot.php (14) and then downloaded its promoter sequence from the RSAT website (http://rsat.ulb.ac.be/rsat/) (15). For each gene in S. cerevisiae, we performed multiple sequence alignments of the orthologous promoter sequences using ClustalW (16). Currently we have integrated the ChIP-chip experiments under various conditions from Harbison et al. (5). Other ChIP-chip data sets can readily be added to MYBS.
We collected consensus sequences or position weight matrixes (PWMs) of TFs in S. cerevisiae from reliable literature sources (5,6,17,18) and a variety of motif databases, including SCPD (3), SGD (8), TRANSFAC (9), YPD (10), YEASTRACT (11), SwissRegulon (12,13) and YTFD (http://biochemie.web.med.uni-muenchen.de/YTFD/). As a result, MYBS contains a collection of 481 binding motif consensus sequences and 71 known PWMs that correspond to 183 TFs. For motifs in the consensus form, we generate the corresponding substitution-derived position frequency matrixes (PFMs) according to the model of Doniger et al. (19), in which the PFM is constructed from all occurrences of the consensus sequence in S. cerevisiae with 0 or 1 difference in the orthologous positions in the other four species. For each PFM, we generate its PWM and calculate its cutoff using PATSER (20) with the S. cerevisiae background frequencies (A,T: 0.31 and C,G: 0.19). PATSER was developed to search for TFBSs in promoter sequences. For each weight matrix of length w, PATSER scores each w-mer by comparing the motif model with the background nucleotide frequencies. PATSER also calculates a P-value threshold using the information content. This P-value threshold is then used to filter out low-scoring sites. In MYBS we set the threshold to 0.01 in order to balance sensitivity and specificity while speeding up query processing. We then use these PWMs and their cutoffs to scan the multiple alignments of orthologous intergenic promoter sequences for matched occurrences.
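The PFM-to-PWM conversion and the w-mer scan described above can be sketched as follows. This is a generic log-odds illustration, not PATSER's actual implementation: the pseudocount, the function names `pfm_to_pwm` and `scan`, the toy PFM and the cutoff of 0 are all assumptions made for the example, and the P-value-derived cutoff that PATSER computes is not reproduced. Only the background frequencies are taken from the text.

```python
import math

# Background frequencies for S. cerevisiae quoted in the text.
BACKGROUND = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}


def pfm_to_pwm(pfm, background=BACKGROUND, pseudocount=1.0):
    """Convert per-position base counts into log2-odds weights.

    The pseudocount (an assumption of this sketch) is spread over the
    bases in proportion to the background so that no weight is -inf.
    """
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount
        weights = {}
        for base, bg in background.items():
            freq = (column.get(base, 0) + pseudocount * bg) / total
            weights[base] = math.log2(freq / bg)
        pwm.append(weights)
    return pwm


def scan(sequence, pwm, cutoff):
    """Return (offset, w-mer, score) for every window scoring >= cutoff."""
    w = len(pwm)
    hits = []
    for i in range(len(sequence) - w + 1):
        window = sequence[i:i + w]
        if any(base not in BACKGROUND for base in window):
            continue  # skip windows containing gaps or ambiguous bases
        score = sum(pwm[j][base] for j, base in enumerate(window))
        if score >= cutoff:
            hits.append((i, window, score))
    return hits


# Hypothetical 4-bp motif (ACGT) built from 10 made-up sites.
toy_pfm = [{"A": 10}, {"C": 10}, {"G": 10}, {"T": 10}]
toy_pwm = pfm_to_pwm(toy_pfm)
for offset, wmer, score in scan("TTACGTAA", toy_pwm, cutoff=0.0):
    print(offset, wmer, round(score, 2))
```

In a real setting the cutoff would come from the score distribution of the matrix rather than a fixed value, and both strands of the promoter would normally be scanned.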
MYBS provides a web-based interface with three main features: binding sites mining, regulatory association searching and target gene selection for each TF pair. MYBS allows users to search for occurrences of a motif in the promoters of a gene, or potential binding sites for a TF. For binding motifs and TFs, their target genes are also reported. MYBS also enables users to visualize in parallel the potential regulators for a given set of genes and allows users to obtain target/non-target gene sets of a pair of TFs in different combinations.
For each function, MYBS allows users to search for occurrences of possible binding sites computationally, either without any filter or with two filters (phylogenetic information and ChIP-chip data) that improve the accuracy of the binding site search. The user may request that a TFBS be conserved across a user-defined number of species (ranging from zero to seven) within a neighboring region of 25 bp upstream and downstream of the binding site occurrence in S. cerevisiae. In addition, the user can alter the degree of experimental support required for TF-DNA binding by setting the P-value cutoff for the ChIP-chip experiment.
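The two filters can be pictured as simple predicates applied to each candidate site. The sketch below assumes a hypothetical `Site` record and an `apply_filters` helper; neither is part of MYBS, and the gene and TF labels and all values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Site:
    gene: str
    tf: str
    conserved_species: int  # number of other species (0-7) with the site conserved
    chip_pvalue: float      # ChIP-chip P-value for this TF at this promoter


def apply_filters(sites, min_species=0, max_chip_pvalue=None):
    """Keep sites that pass the conservation and ChIP-chip filters.

    min_species=0 and max_chip_pvalue=None correspond to using no filter.
    """
    kept = []
    for site in sites:
        if site.conserved_species < min_species:
            continue
        if max_chip_pvalue is not None and site.chip_pvalue > max_chip_pvalue:
            continue
        kept.append(site)
    return kept


# Hypothetical candidate sites.
candidates = [
    Site("geneA", "TF1", conserved_species=5, chip_pvalue=0.0005),
    Site("geneB", "TF1", conserved_species=1, chip_pvalue=0.0005),
    Site("geneC", "TF1", conserved_species=6, chip_pvalue=0.2),
]
for site in apply_filters(candidates, min_species=3, max_chip_pvalue=0.01):
    print(site.gene)
```

Keeping the two thresholds as independent parameters mirrors the point made in the text: the appropriate stringency differs from TF to TF, so neither cutoff is fixed in advance.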
The underlying core of MYBS is the integration of motif information. Each motif is linked to one or more TFs, and points to a set of genes whose promoter sequences contain occurrences of the motif. Similarly, there may be multiple consensus sequences, collected from various sources, listed for a given TF (Figure 1). The bi-directional search can start from a TF, a motif or a gene, and allows for easy identification of regulatory associations between TFs and between motifs. For example, the user can query a short sequence pattern (I.U.B. codes allowed) to acquire a list of matching binding motif consensus sequences. One can choose a motif from the list for detailed information, including its corresponding TFs, the sequence logo, the PWMs and the cutoff thresholds of the PWMs. In addition, MYBS allows users to scan any given sequences for binding occurrences of the selected motif. The user can further select the TF of interest. With either or both of the two user-defined filters, MYBS provides a list of potential target genes of the selected motif and allows the user to inspect visualized sequence information for one or multiple genes simultaneously. The user can include or exclude certain databases in the process, and can also discover other potential regulators of the selected genes. All related information can be downloaded as plain text files or image files.
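As an illustration of how a pattern written with I.U.B. ambiguity codes can be matched against a sequence, the sketch below expands each code into a regular-expression character class. The helper name `find_occurrences` and the example pattern and sequence are assumptions made for the example; only the forward strand is searched, and this is not how MYBS itself performs the lookup.

```python
import re

# Standard I.U.B./IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_occurrences(pattern, sequence):
    """Return (offset, matched text) for every match, overlaps included."""
    body = "".join(IUPAC[ch] for ch in pattern.upper())
    # A lookahead wrapper lets finditer report overlapping matches.
    regex = re.compile("(?=(" + body + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


# Hypothetical pattern and sequence.
print(find_occurrences("WGATAR", "CCAGATAAGGTGATAGC"))
```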
To indicate how significantly a TF is enriched in a given group of target genes, we calculate an enrichment P-value for each TF in ‘Search regulatory association’. This is the probability of finding x or more promoters in a user-input gene set that can be bound by the specified TF while also fulfilling the ChIP-chip and conservation requirements set by the user:
$$P = \sum_{i=x}^{\min(N,K)} \frac{\binom{K}{i}\binom{M-K}{N-i}}{\binom{M}{N}}$$
where M is the overall number of genes examined, K is the number of those genes that are bound by the TF, N is the size of the user-input gene set, and x is the number of promoters within the user-input set that are bound by the TF.
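The enrichment probability defined by M, K, N and x is the upper tail of a hypergeometric distribution. A minimal sketch using only the Python standard library is given below; the function name `enrichment_pvalue` and the numbers in the usage line are illustrative assumptions, not values from this study.

```python
from math import comb


def enrichment_pvalue(M, K, N, x):
    """Probability of x or more TF-bound promoters in a gene set of size N.

    M: number of genes examined
    K: number of those genes bound by the TF
    N: size of the user-input gene set
    x: number of genes in the user-input set bound by the TF
    """
    upper = min(N, K)
    # Exact integer arithmetic; a single division at the end.
    tail = sum(comb(K, i) * comb(M - K, N - i) for i in range(x, upper + 1))
    return tail / comb(M, N)


# Hypothetical example: 6000 genes, 200 bound by the TF,
# a 50-gene input set of which 10 are bound.
print(enrichment_pvalue(6000, 200, 50, 10))
```

Because the sum is evaluated with exact integers, very small P-values are not lost to rounding, at the cost of some speed when the calculation is repeated for every TF.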
Since the calculation is done for every single TF and could be computationally intensive, the P-value calculation is optional. If the button ‘Calculate enrichment P-value’ is clicked, an enrichment P-value is shown for each TF in both the text and the graphical output.
We also provide P-values for the ‘Find target genes for TF pairs’ function. For any given pair of TFs, we construct a 3 × 3 contingency table and perform the chi-square goodness of fit test:
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where O_ij and E_ij are the observed and expected numbers of genes in cell (i, j) of the table.
The χ2 statistic follows a chi-square distribution with four degrees of freedom (3 − 1) × (3 − 1). A small P-value suggests that the association between the two TFs is unlikely to have arisen by chance. Note that we assign a default P-value of 1 if the expected number of genes E11 simultaneously bound by TF1 and TF2 exceeds the observed number of genes O11.
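The test for a TF pair can be sketched as a Pearson chi-square computation on a table of gene counts, with expected counts derived from the row and column totals. The code below is an illustration under stated assumptions: the row and column categories are not specified in the text, so the tables are hypothetical counts; the function names are invented; all expected counts are assumed to be non-zero; and the tail probability uses the closed form that holds only for an even number of degrees of freedom, which covers the four degrees of freedom quoted above.

```python
import math


def chi_square(observed):
    """Return (statistic, degrees of freedom, expected counts) for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


def chi2_upper_tail_even_df(stat, df):
    """P(X >= stat) for chi-square with even df: exp(-h) * sum_{k<df/2} h^k/k!, h = stat/2."""
    if df <= 0 or df % 2:
        raise ValueError("closed form used here requires a positive even df")
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def pair_pvalue(observed):
    """P-value for a TF pair, set to 1 when E11 exceeds O11 as stated in the text."""
    stat, df, expected = chi_square(observed)
    if expected[0][0] > observed[0][0]:
        return 1.0
    return chi2_upper_tail_even_df(stat, df)


# Hypothetical 3 x 3 table of gene counts.
table = [[20, 5, 5], [5, 20, 5], [5, 5, 20]]
print(chi_square(table)[:2], pair_pvalue(table))
```

The rule that returns 1 when E11 exceeds O11 makes the reported P-value one-sided: only an excess of genes bound by both TFs counts as evidence of association.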
Since MYBS allows users to dynamically select different criteria for desired TFBSs, it is not easy to assess the reliability of the MYBS predictions. To address this issue, for 101 experimentally verified TFBSs of 12 TFs (21) we analyzed their corresponding ChIP-chip P-values and the degree of conservation. Overall, 12 sites failed to be recognized by the PWMs of the corresponding TF in the MYBS database. Figure 2 shows the range of ChIP-chip P-values of these target genes and their degree of conservation across species. As shown, ~65% of the promoters in which the experimentally verified TFBSs reside have ChIP-chip P-values < 0.01, and more than 70% of the experimentally verified TFBSs are conserved in at least three species.
MYBS enables users to visualize in parallel the potential regulators for a given set of genes, providing scientists with an efficient way to glance at potential underlying transcription mechanisms. Here we present an example of how this feature can be used toward potential regulatory association discovery. Burckin et al. (22) used splicing-sensitive microarrays to investigate the impact of perturbations on the steady-state levels of mRNAs and pre-mRNAs. Among these perturbations was one that used a conditional-lethal ded1 allele to inactivate Ded1p, a translation initiation factor (23) that is also known to be functionally involved in splicing (24). According to their results, a subset of intron-containing genes is sensitive to the loss of Ded1p. It is interesting to ask why Ded1p preferentially affects these intron-containing genes and whether these genes have anything in common in their promoter regions, since transcription and splicing are known to be coupled (25,26). To do this, we used the function ‘Search regulatory association’ to identify which TFs potentially regulate these genes. As shown in Figure 3, a contact map of the genes against all TFs is presented in the image format, and sorted according to the number of regulatory interactions. We found that 69 of the 111 Ded1p-sensitive intron-containing genes contain both FHL1 (Fork Head-Like) and RAP1 (Repressor Activator Protein) binding sites in their promoter regions (indicated by a red block). RAP1 encodes an essential protein involved in many processes in S. cerevisiae, including telomere maintenance, transcriptional silencing and high level transcriptional activation of genes encoding ribosomal proteins (RP) (27). FHL1 is a putative transcriptional regulator with similarity to the DNA-binding domain of Drosophila forkhead and is required for rRNA processing (28). Martin et al. (29) showed that FHL1 is also involved in the regulation of RP gene transcription in yeast.
In contrast, only five of the 143 Ded1p-insensitive intron-containing genes harbor FHL1 or RAP1 binding sites in their promoter regions. These observations raise the possibility that Ded1p's influence on splicing can be exerted, either directly or indirectly, via promoter regions that contain both FHL1 and RAP1 binding sites.
MYBS is an interactive web-based service that integrates an array of predicted and known TFBS PWMs, DNA-binding affinity data from ChIP-chip and phylogenetic footprinting data of TFBSs in eight related yeast species. An important feature of MYBS is its versatility and flexibility in binding site annotation. In the process of binding site annotation, two filters can be customized according to the user's prior knowledge and confidence in the DNA-binding affinity data and phylogenetic information, and MYBS reports the binding sites accordingly. Since the binding affinities and degree of conservation vary from TF to TF, the service gives scientists an opportunity to incorporate their own knowledge and preferences in the process of data retrieval. The motif information is also compiled and organized in a way that is easy to query from any direction: by partial motif, by TF or by gene. As exemplified by the case study mentioned above, the regulatory associations feature could initiate and facilitate investigations by providing an intuitive look at the relationships between genes and TFs. Similarly, the identification of target genes for TF pairs could serve as a starting point for analysis of combinatorial regulation of TFs. Through the user-friendly interface, MYBS allows for dynamic binding site mapping, in addition to visualization and elucidation of potential regulatory relationships.
This research was supported by the grants from the Institute of Information Science and the Genomics Research Center, Academia Sinica and the National Science Council, Taiwan. Funding to pay the Open Access Publication charges for this article was provided by NIH GM 30998.
Conflict of interest statement. None declared.