The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact firstname.lastname@example.org
The AthaMap database generates a map of cis-regulatory elements for the Arabidopsis thaliana genome. AthaMap contains more than 7.4 × 10^6 putative binding sites for 36 transcription factors (TFs) from 16 different TF families. A newly implemented functionality allows the display of subsets of higher conserved transcription factor binding sites (TFBSs). Furthermore, a web tool was developed that permits a user-defined search for co-localizing cis-regulatory elements. The user can specify individually the level of conservation for each TFBS and a spacer range between them. This web tool was employed for the identification of co-localizing sites of known interacting TFs and TFs containing two DNA-binding domains. More than 1.8 × 10^5 combinatorial elements were annotated in the AthaMap database. These elements can also be used to identify more complex co-localizing elements consisting of up to four TFBSs. The AthaMap database and the connected web tools are a valuable resource for the analysis and the prediction of gene expression regulation at http://www.athamap.de.
The regulation of gene expression is mainly conferred by transcription factors (TFs) that bind to cis-regulatory sequences. These sequences can be used to generate hypotheses about TFs that may be involved in the regulation of nearby genes (1,2). In Arabidopsis thaliana, more than 1500 TFs corresponding to ~5% of the total genes have been identified (3). The largest families are MYB and MYB-related (190 members), AP2/EREBP (144), bHLH (139), NAC (109), C2H2(Zn) (105), HD (89), MADS (82), bZIP (81) and WRKY (72).
Since the complete sequence of the A.thaliana genome has been published (4), it was desirable to have a map of transcription factor binding sites (TFBSs) for the whole genome. The non-restrictive nature of such a map permits the identification of regulatory sequences within transcribed and coding regions as well. To construct such a map, the pattern search program Patser (5) and publicly available alignment matrices were used to generate the AthaMap database, the first TFBS map for the whole A.thaliana genome (6). The second release of the AthaMap database presented here has increased the data content from ~2.4 × 10^6 to >7.4 × 10^6 putative sites. Specific care has been taken in the annotation of CAT- and TATA-boxes, which were predicted using alignment matrices from the PlantProm database (7) together with the positional information relative to transcription start sites (TSSs) or translation start sites. Because each TFBS is associated with a particular score that represents the similarity of the site to the underlying alignment matrix, a new functionality was implemented that allows the identification of highly conserved binding sites.
It is well known that the composition of binding sites in the regulatory region of a gene confers its specific expression profile (8). For example, two G-box like sequences constitute the as-1 element that is bound by bZIP TFs (9). Another example is the ocs element that occurs in certain glutathione S-transferase genes of Arabidopsis, which harbour a bZIP and DOF factor binding site in close vicinity (10–12). A wide variety of expression specificity is associated with the co-localization of MYB- and MYC-binding sites (13–16). Other examples are MADS/MADS TFBSs and those TFs that harbour two DNA-binding domains, such as AP2 (17,18).
For the identification of such co-localizing elements, a new web tool was implemented that permits a user-defined identification of pairs of TFBSs in the genome of Arabidopsis by providing distance and quality parameters. This web tool was used to identify the co-localizing sites for known interacting factors. Such combinatorial elements were annotated to the AthaMap database and can also be used for the identification of more complex elements consisting of, for example, two combinatorial elements harbouring four TFBSs.
As summarized in Table 1, the genomic positions of more than 7.4 × 10^6 putative TFBSs were determined in the A.thaliana genome. These positions were identified with 42 alignment matrices for 36 TFs. For the factors bZIP910, bZIP911, PIF3, ABI4, RAV1 and MYB.PH3, two different alignment matrices were employed and they are identified by numbers in brackets behind the factor name (Table 1). The binding sites were taken directly from the published literature, which is regularly screened in the process of updating the TRANSFAC® database with plant transcription factor data (2).
The screens were performed on the most recent version of the A.thaliana genome sequence (TIGR release 5.0, January 21, 2004). The pattern search program Patser (5) was used for the identification of binding sites as described previously (6). The following command line was used to run Patser: ‘patser-v3d -A a:t 0.320 c:g 0.180 -m matrixfile -f sequencefile -c -li -d2’. For all screens, the default threshold calculated by Patser from the adjusted information content of the matrix was employed. This criterion was chosen as an objective cut-off threshold value applicable for all the matrices as it represents a measure of how far the nucleotide frequency distribution in the alignment matrix diverges from the a priori probability for the occurrence of the nucleotides in the genome (5). In the case of CAT- and TATA-boxes (CBF and TBP), only those elements that occur upstream of known TSSs or predicted translation start sites were imported into the AthaMap database. TSSs and translation start sites were annotated to the AthaMap database as provided by the TIGR.
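The idea of scoring a candidate site against an alignment matrix, relative to the a priori nucleotide frequencies given on the command line above (a:t 0.320, c:g 0.180), can be illustrated with a simplified sketch. This is not Patser's implementation: the log-odds weight formula with a pseudocount, the toy count matrix, the forward-strand-only scan and all function names are assumptions made for illustration.

```python
import math

# A priori nucleotide frequencies taken from the command line quoted above.
BACKGROUND = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}


def weight_matrix(counts):
    """Convert per-position base counts into log-odds weights.

    Assumption: weight = ln(((n + p) / (N + 1)) / p), where n is the count
    of the base in the column, N the column total and p the a priori
    frequency. This is a common textbook form, not necessarily what Patser does.
    """
    weights = []
    for column in counts:
        total = sum(column.values())
        weights.append({
            base: math.log(((column.get(base, 0) + p) / (total + 1)) / p)
            for base, p in BACKGROUND.items()
        })
    return weights


def score_site(weights, site):
    """Sum the position-specific weights for one candidate site."""
    return sum(column[base] for column, base in zip(weights, site.upper()))


def scan(weights, sequence, threshold):
    """Return (position, score) for forward-strand windows scoring >= threshold."""
    width = len(weights)
    sequence = sequence.upper()
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if set(window) <= set(BACKGROUND):  # skip windows with ambiguous bases
            score = score_site(weights, window)
            if score >= threshold:
                hits.append((i, score))
    return hits


# Hypothetical toy matrix built from ten aligned copies of "TATA".
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
toy_weights = weight_matrix(toy_counts)
toy_hits = scan(toy_weights, "GGTATAGG", threshold=0.0)
```

In this sketch a site that matches the matrix consensus receives a positive score, and every mismatch lowers it, which is the property the threshold and the conservation-based restriction described below rely on.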
The AthaMap database is based on the in silico determination of binding sites and does not distinguish between experimentally verified and predicted sites. Therefore, it is desirable to discriminate between higher and lower conserved binding sites. A criterion for the conservation of a site is the individual score of a TFBS determined by using Patser (5). In general, only TFBSs with a specific score above a threshold score determined for each matrix were imported into the AthaMap database and are displayed as putative binding sites. A high score close to the possible maximum score represents a highly conserved binding site whereas a low score close to the threshold stands for a less conserved site. Maximum score, threshold score and specific score of a site are identified in a tool tip box in the AthaMap database to evaluate individual TFBSs (6).
To permit the exclusive display of higher conserved TFBSs, a new function was implemented in the AthaMap database that allows the user to restrict the number of sites shown by the quality of their scores. With the new ‘Restriction’ function on the ‘Search’ page of AthaMap, the user is able to restrict the sites displayed to those that are closer to the maximum score. This requires an input value as a percentage, which is then applied to the difference between maximum score and threshold score. For example, if the restrictive value is set to 20% then only sites with a score of at least 6 will be displayed for a matrix with a maximum score of 10 and a threshold score of 5, while normally all sites with a score of at least 5 would be shown. A user-defined increase in the threshold score of TFBSs displayed in the AthaMap database may eliminate putative false positive TFBSs.
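The arithmetic behind the restriction can be written out as a short sketch. The function and parameter names are hypothetical and do not come from the AthaMap code base; only the rule itself (a percentage applied to the difference between maximum and threshold score) is taken from the text.

```python
def restricted_threshold(max_score, threshold_score, restriction_pct):
    """Raise the threshold by a percentage of (max_score - threshold_score)."""
    if not 0 <= restriction_pct <= 100:
        raise ValueError("restriction_pct must be between 0 and 100")
    return threshold_score + (max_score - threshold_score) * restriction_pct / 100.0


def filter_sites(site_scores, max_score, threshold_score, restriction_pct):
    """Keep only the site scores that reach the restricted threshold."""
    cutoff = restricted_threshold(max_score, threshold_score, restriction_pct)
    return [score for score in site_scores if score >= cutoff]


# Example from the text: maximum score 10, threshold score 5, restriction 20%.
cutoff = restricted_threshold(10, 5, 20)          # 6.0
kept = filter_sites([5.0, 5.9, 6.0, 9.5], 10, 5, 20)
```

A restriction of 0% leaves the matrix threshold unchanged, while 100% keeps only sites that reach the maximum score.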
Gene expression specificity is often mediated by the interaction between TFs that recognize closely spaced binding sites (8). The importance of combinatorial control for gene expression makes it desirable to identify co-localizing TFBSs in the genome based on user provided parameters.
For this, a new ‘co-localization’ web tool was implemented on the AthaMap website that permits the selection of two TFs and the designation of a specific minimum and maximum spacer of up to 50 bp between two TFBSs. The user may select two different TFs or two identical TFs. Furthermore, one can increase the threshold score of the TFBSs individually to obtain combinatorial elements that show a higher conservation of underlying binding sites. The result of the co-localization analysis is shown on the same page and gives the total number of co-localizing TFBSs detected, the chosen parameters for the co-localization analysis and the number of sites used in the analysis. The spacer between two binding sites is defined by the distance between the most 5′ positions of both TFBSs. This permits the identification of overlapping sites that may be relevant for longer matrices with non-overlapping core sequences. To avoid identical hits at the same chromosomal position when using TFs of the same family, it is suggested to select a minimum spacer length that is as long as the matrix of one of the two factors. In addition, even known TSSs can be selected to identify TFBSs in close vicinity to the TSSs.
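A minimal sketch of such a pairwise search is given below, using the spacer definition from the text (distance between the most 5′ positions of both TFBSs). The data layout, the names and the decision to accept the second site on either side of the first are assumptions for illustration and do not describe the AthaMap implementation.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Site(NamedTuple):
    chromosome: int
    start: int      # most 5' position of the binding site
    score: float


def colocalize(sites_a, sites_b, min_spacer, max_spacer):
    """Return (site_a, site_b, spacer) for pairs on the same chromosome whose
    5' positions are between min_spacer and max_spacer apart (inclusive)."""
    # Index the second factor's sites by chromosome, sorted by position.
    by_chrom = {}
    for site in sites_b:
        by_chrom.setdefault(site.chromosome, []).append(site)
    starts = {}
    for chrom, chrom_sites in by_chrom.items():
        chrom_sites.sort(key=lambda s: s.start)
        starts[chrom] = [s.start for s in chrom_sites]

    hits = []
    for a in sites_a:
        candidates = by_chrom.get(a.chromosome, [])
        positions = starts.get(a.chromosome, [])
        # Binary search for the window within max_spacer on either side.
        lo = bisect_left(positions, a.start - max_spacer)
        hi = bisect_right(positions, a.start + max_spacer)
        for b in candidates[lo:hi]:
            spacer = abs(b.start - a.start)
            if spacer >= min_spacer:
                hits.append((a, b, spacer))
    return hits


# Hypothetical toy data, not taken from the database.
factor_a = [Site(1, 100, 12.0)]
factor_b = [Site(1, 100, 6.0), Site(1, 115, 6.0), Site(1, 130, 6.0), Site(2, 110, 6.0)]
pairs = colocalize(factor_a, factor_b, min_spacer=0, max_spacer=20)
```

With a minimum spacer of 0 the pair at the identical position is reported, which illustrates why a minimum spacer as long as one matrix is suggested when both factors belong to the same family.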
Owing to the large number of putative binding sites for some factors, the co-localization analysis had to be limited to ~200000 TFBSs for each factor to permit a co-localization analysis in a reasonable time. The number of TFBSs of 10 matrices was limited to higher conserved sites by increasing their threshold scores in the co-localization analysis. This applies to the matrices of factors AGL15, ALFIN1, DOF2, GAMYB, HVH21, P, RAV1, TEIL and GT1. The applied parameters can be found on the AthaMap website. With these restrictions, co-localization analyses are generally executed in <1 min.
Figure 1 shows a modified screenshot of a result page for a co-localization analysis with AtMYB15 and TGA1, which are both factors from A.thaliana (Table 1). As user-defined parameters, a minimum spacer of 0 nt and a maximum spacer of 20 nt between the binding sites and the default threshold of the alignment matrices (11.85 and 5.81, respectively) were selected. The total number of co-localizing sites detected is nine (Figure 1, combinatorial elements). A result table shows the positions of the co-localizing binding sites, the chromosome and the orientation of the respective site with an arrow. Furthermore, the spacer length of the individual co-localizing element is shown. Each position is linked to an AthaMap sequence window that opens and shows the co-localizing sites highlighted within their genomic context (data not shown).
On the result page, when selected, a feature ‘Show overview’ displays a table with a summary of the co-localization analysis (Figure 1, arrow). The inserted table displays the total number of sites that were obtained with all spacer lengths between the selected minimum and the maximum spacer. Here, the user can readily see if a preferred spacer length is detected for binding sites of two TFs. This new tool will be very helpful to identify co-localizing binding sites for TFs that were shown experimentally to interact with each other. Furthermore, genes harbouring a similar architecture of cis-regulatory elements may be identified.
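The kind of summary shown in such an overview can be sketched as a count of co-localizing pairs per spacer length. The function name and the input format (a plain list of spacer lengths) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def spacer_overview(spacers, min_spacer, max_spacer):
    """Count co-localizing pairs for every spacer length in the selected range.

    Spacer lengths without any hit are reported with a count of 0, so a
    preferred spacing stands out against the empty positions.
    """
    counts = Counter(s for s in spacers if min_spacer <= s <= max_spacer)
    return {length: counts.get(length, 0)
            for length in range(min_spacer, max_spacer + 1)}


# Hypothetical spacer lengths from a co-localization result.
overview = spacer_overview([0, 15, 15, 3], min_spacer=0, max_spacer=20)
preferred = max(overview, key=overview.get)
```

In this toy example the spacer length 15 occurs most often and would be the candidate for a preferred spacing.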
Well-known examples of combinatorial elements in plants are the as-1 element that is bound by two dimers of bZIP transcription factors, the endosperm or ocs element that is recognized by a member of the bZIP and DOF TF family, and promoters that harbour MYC/MYB or MADS/MADS TF binding sites (9,12,16,17). Based on the approximate spacing between these elements, co-localizing sites were determined with the above described web tool and annotated as bZIP/bZIP, bZIP/DOF, MYC/MYB and MADS/MADS combinatorial elements. A second class of co-localizing TFBSs consists of sites for factors that harbour two DNA-binding domains, such as RAV1 (18). RAV1 belongs to the AP2/EREBP superfamily of TFs that comprises the subfamilies AP2, EREBP and RAV-like (3). RAV1 has two different DNA-binding domains; the binding specificity of each was identified (18) and annotated as RAV1(1) and RAV1(2) in the AthaMap database. All the putative RAV combinatorial elements were derived from a co-localization of RAV1(1) and RAV1(2), so that each combinatorial RAV element consists of two TFBSs, one corresponding to each of the two alignment matrices. Table 2 lists the total number of combinatorial elements identified in the A.thaliana genome and annotated in the AthaMap database, together with the factors used for the determination of combinatorial sites and the distances between putative binding sites. A total of 183159 combinatorial elements were annotated in the AthaMap database. These elements are identified in the AthaMap database by the factor family names and are displayed with a double line in the sequence window.
MYC (bHLH) TFs apparently recognize binding sites that are identical or are very closely related to bZIP-binding sites (19–21). Hence, annotated bZIP sites were employed for the identification of MYC-binding sites in combinatorial elements. The identification of functional MYC/MYB-binding sites by employing bZIP sites can be shown for the gene encoding BANYULS that is induced by the interacting TFs TT8 (MYC) and TT2 (MYB) (16,22,23). When the Arabidopsis genome identification number of the Banyuls gene (AT1G61720.1) is used for a search in the AthaMap database, a putative MYC/MYB combinatorial element is detected upstream of the TATA-box (data not shown). This combinatorial element corresponds to the previously determined MYC and MYB regulatory sites in the Banyuls promoter (24). Table 3 summarizes several known or experimentally predicted combinatorial elements detected in the AthaMap database.
As a further asset of the AthaMap database, these annotated combinatorial elements can be included in the user-defined identification of co-localizing TFBSs as well. Therefore, more complex arrangements of regulatory elements consisting of up to four individual binding sites can be detected.
The AthaMap resources are freely available for non-commercial users at http://www.athamap.de.
This work was carried out in the Intergenomics Braunschweig Bioinformatics Competence Center and was supported by the German Ministry of Education and Research (BMBF grant no. 031U110C/031U210C). Funding to pay the Open Access publication charges for this article was provided by BMBF.
Conflict of interest statement. None declared.