The AthaMap database generates a map of cis-regulatory elements for the whole Arabidopsis thaliana genome. This database has been extended by new tools to identify common cis-regulatory elements in specific regions of user-provided gene sets. A resulting table displays all cis-regulatory elements annotated in AthaMap, including positional information relative to the respective gene. Further tables show overviews with the number of individual transcription factor binding sites (TFBS) present and the TFBS common to the whole set of genes. Overrepresented cis-elements are easily identified. These features were used to detect specific enrichment of drought-responsive elements in cold-induced genes. For identification of co-regulated genes, the output table of the colocalization function was extended to show the closest genes and their relative distances to the colocalizing TFBS. Gene sets determined by this function can be used for a co-regulation analysis in microarray gene expression databases such as Genevestigator or PathoPlant. Additional improvements of AthaMap include display of the gene structure in the sequence window and a significant data increase. AthaMap is freely available at http://www.athamap.de/.
Bioinformatic tools in molecular biology can easily establish hypotheses for a directed design of experimental set-ups. Bioinformatic gene expression analysis is supported by increasing data on spatial and temporal gene expression and transcription factors (TFs). Gene transcription is mainly regulated by the binding of TFs to cis-regulatory sequences. The occurrence of a cis-sequence is the prerequisite for direct DNA binding that promotes or represses transcription of the gene. Eukaryotic regulation of gene expression is complex and involves synchronized binding of TFs to adjacent cis-regulatory sequences (1). A colocalization analysis of TF binding sites (TFBS) is useful to predict such combinatorial effects on gene expression. Furthermore, binding of TFs can coordinately regulate whole sets of genes.
Bioinformatic methods have been established to predict putative binding sites of TFs in DNA sequences. Web-based resources for detecting TF binding sites or cis-regulatory sequences in plant genes not restricted to Arabidopsis thaliana are Place, PlantCare, and TRANSFAC® (2–4). Genome-wide detection of binding sites can be performed online with the regulatory sequence analysis (RSA) tools (5). A similar genomic sequence search in Arabidopsis can be performed using Patmatch at TAIR (6,7). Pattern recognition programs such as MatInspector, Match or Patser utilize alignment matrices which are derived for example from random binding site selection experiments that determine a set of DNA sequences that can be bound by the same factor (8–10).
Using Patser, the AthaMap database was established for A.thaliana. This database generates a genome-wide map of putative TF binding sites determined from alignment matrices (11). Web tools have been implemented for the detection of colocalizing cis-regulatory elements in the genome (12). Combinatorial elements based on known TF interactions have been identified. In addition to positional weight matrix-based detection of binding sites, experimentally verified binding sites were annotated as well (13). The last version of AthaMap contained the genomic positions of more than 8 × 10⁶ putative TFBS for 88 TFs from 21 different families. Another resource for cis-regulatory sequences in A.thaliana is AGRIS (14,15). In contrast to AGRIS, AthaMap covers the whole A.thaliana genome and is mainly based on binding site detection by positional weight matrices.
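To make the matrix-based detection step concrete, the following is a minimal Python sketch of scanning a sequence with a log-odds weight matrix and keeping windows above a score threshold. It is not Patser's actual scoring scheme: the count matrix, the background frequencies, the pseudocount and the function names are all invented for illustration.

```python
import math

# Hypothetical toy count matrix (one dict per motif position).
# These counts are invented; they are not an AthaMap or Patser matrix.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 1, "G": 0, "T": 8},
]
# Assumed AT-rich background frequencies (illustrative values only).
BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}


def log_odds_matrix(counts, background, pseudocount=1.0):
    """Convert per-position base counts into log-odds weights."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log(((column[base] + pseudocount) / total) / background[base])
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Yield (start, score) for each window scoring at or above threshold."""
    width = len(matrix)
    sequence = sequence.upper()
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            yield start, score


MATRIX = log_odds_matrix(COUNTS, BACKGROUND)
# Highest score any sequence can reach with this matrix.
MAX_SCORE = sum(max(column.values()) for column in MATRIX)
```

A call such as `list(scan("TTACGTAA", MATRIX, 0.8 * MAX_SCORE))` keeps only windows scoring within 20% of the maximum. Only one strand is scanned here; a genome-wide screen would also need the reverse complement.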
Co-regulation of genes may be directed by similar combinations of cis-regulatory elements. For A.thaliana, several web-based services harbour gene expression data from microarray experiments and allow recovery of co-regulated genes. Such web-based services are for example TAIR, NASCArrays tools, Stanford Microarray Database, Botany Array Resource, GEO, and Genevestigator (6,16–21). For the detection of gene clusters with similar expression patterns, ACT, Botany Array Resource, CSB.DB, and Genevestigator can be used (19,21–24).
To enable discovery and analysis of common cis-regulatory elements annotated in AthaMap, a new Gene Analysis feature has been developed to allow comparative analysis of cis-elements in sets of co-transcribed genes. Similar expression patterns can also be determined by colocalizing TFBS. For this, the colocalization function has been improved for identification of gene sets harbouring similar combinations of TFBS. Furthermore, the data content in AthaMap has increased significantly and the gene structure is shown in the sequence display window.
To identify and analyze co-regulated genes for common TFBS, the Gene Analysis web tool has been implemented. On the Gene Analysis page at AthaMap, a gene list can be entered by providing the locus identifier (AGI) of each gene separated by carriage returns. In addition to the gene list, the region of the genes to be analyzed needs to be specified as well. Therefore, the upstream and downstream borders of the analyzed regions relative to the annotated translation start point have to be entered. Because all matrix-based TF binding sites have a specific score between the threshold and maximum score defined by Patser, a restriction to higher conserved TFBS can be applied as well (12).
As an example for co-regulation, the Demo button displays three genes in the input area. By default, the genomic region for analysis ranges from −500 bp upstream to 50 bp downstream relative to the translation start point. No restriction to higher conserved TFBS is set. A search result table lists the TF binding sites in the analyzed genomic region in detail. It displays the gene, the name and the family of the transcription factor and the chromosomal position of its TFBS. In addition, the distance of the binding site relative to the translation start point and the orientation of the binding site relative to the gene are specified. A plus means that the TF binding site and the gene are in the same orientation. Furthermore, for matrix-based TFBS also the maximum score and threshold score of the screening matrix as well as the individual score of the TFBS as a measure for sequence conservation are given. All listed genes and positions of the TFBS are linked to the sequence window for single gene display in the genomic context of surrounding binding sites.
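The positional columns of the result table can be illustrated with a short Python sketch. It assumes that chromosomal coordinates are integers on the displayed strand, that negative distances denote upstream positions, and that a gene on the reverse strand has its distances mirrored. The `Site` record and the function names are hypothetical and not part of AthaMap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    factor: str    # transcription factor name
    position: int  # chromosomal position of the binding site
    strand: str    # "+" or "-"


def relative_distance(site_pos, start_codon_pos, gene_strand):
    """Signed distance to the translation start; negative means upstream."""
    offset = site_pos - start_codon_pos
    return offset if gene_strand == "+" else -offset


def relative_orientation(site_strand, gene_strand):
    """Return "+" when site and gene share a strand, "-" otherwise."""
    return "+" if site_strand == gene_strand else "-"


def sites_in_region(sites, start_codon_pos, gene_strand,
                    upstream=500, downstream=50):
    """Keep sites lying between -upstream and +downstream of the start codon.

    The defaults mirror the default region described in the text.
    """
    rows = []
    for site in sites:
        distance = relative_distance(site.position, start_codon_pos, gene_strand)
        if -upstream <= distance <= downstream:
            rows.append((site.factor, distance,
                         relative_orientation(site.strand, gene_strand)))
    return rows
```

For a gene on the reverse strand, a site at a higher chromosomal coordinate than the start codon is reported as upstream, which is why the offset is negated for that case.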
Because a gene may harbour more than one binding site for a specific TF, an overview table can be selected by using the ‘Show overview’ link. This results in a list of all gene-factor combinations having at least one binding site. The list shows the gene, the TF, the TF family, and the number of TFBS detected. The number of sites located upstream and downstream as well as their relative orientation to the start point of translation are given. This and all other tables can easily be copied and exported to a spreadsheet program for additional data processing.
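The aggregation behind such an overview table might look like the following Python sketch. The row layout, the dictionary keys and the convention that a distance of 0 counts as downstream are assumptions made for this example; the distances in the usage note are invented.

```python
from collections import defaultdict


def overview(rows):
    """Summarize detailed hits per gene-factor combination.

    rows: iterable of (gene, factor, family, distance, orientation), where
    distance is signed (negative = upstream) and orientation is "+" or "-".
    """
    table = defaultdict(lambda: {"sites": 0, "upstream": 0, "downstream": 0,
                                 "same": 0, "opposite": 0})
    for gene, factor, family, distance, orientation in rows:
        entry = table[(gene, factor, family)]
        entry["sites"] += 1
        # Assumption: a site exactly at the start codon counts as downstream.
        entry["upstream" if distance < 0 else "downstream"] += 1
        entry["same" if orientation == "+" else "opposite"] += 1
    return dict(table)
```

Passing rows such as `("At2g42540", "DREB1A", "AP2/EREBP", -120, "+")` yields one entry per gene-factor pair, and only pairs with at least one site appear, matching the description above.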
Orchestrated regulation of genes involves binding of specific TFs to sets of genes. By selecting ‘Show factors that are common in genes’, the occurrence of binding sites among the whole set of genes from a Gene Analysis search is displayed. In this table, all TFs with identified TF binding sites in the gene set are shown. The table is sorted hierarchically by the total number of genes per TF. Further information given is the total number of sites detected among the set of genes. To estimate TFBS frequencies, the theoretical number of TF binding sites in the genomic regions analyzed is also shown. This number is based on a theoretical random distribution of the total annotated TF binding sites of the respective TF. The ratio between real occurrence of TFBS and theoretical occurrence shows whether particular TF binding sites are over- or underrepresented. Further valuable information can be extracted by selecting ‘Show all factors’. This extends the table by showing also all TF with binding sites that are absent in the analyzed gene regions.
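The comparison of observed and theoretical occurrence can be sketched in a few lines of Python. The sketch assumes that the theoretical number is simply the genome-wide site count scaled by the fraction of the genome that was analyzed; the function names are hypothetical, and the site counts in the usage note are invented. The genome length is the figure quoted elsewhere in this article.

```python
GENOME_BP = 119_186_497  # total Arabidopsis genome length quoted in the text


def expected_sites(total_sites_in_genome, analysed_bp, genome_bp=GENOME_BP):
    """Expected site count if sites were spread uniformly over the genome."""
    return total_sites_in_genome * analysed_bp / genome_bp


def enrichment_ratio(observed, total_sites_in_genome, analysed_bp,
                     genome_bp=GENOME_BP):
    """Observed count divided by the expected count; above 1 means enriched."""
    return observed / expected_sites(total_sites_in_genome, analysed_bp, genome_bp)


def representation(ratio):
    """Label a ratio as over- or underrepresented (ties count as neither)."""
    if ratio > 1:
        return "over"
    if ratio < 1:
        return "under"
    return "as expected"
```

For instance, with an invented factor that has 10 000 annotated sites and three genes analyzed over 550 bp each, `enrichment_ratio(2, 10_000, 3 * 550)` is well above 1, so two observed sites would already count as overrepresented.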
A similar resource to inspect Arabidopsis promoter sets for cis-regulatory sequences is Athena (25). Important differences between AthaMap and Athena are the fixed promoter region of 3 kb in Athena and the flexible gene region selection in AthaMap. Furthermore, the data content is different. Athena binding sites are based on 105 TF consensus sequences from PLACE and AGRIS (2,14). In contrast, AthaMap is mainly based on alignment matrices of TFBS (11). This leads to a much higher TF binding site density in AthaMap. Athena has ~30 TFBS in each promoter region of 3 kb (25). In comparison, AthaMap has a TF binding site density of ~260 TFBS in a 3 kb region including the data update presented here.
To demonstrate the functionality of the Gene Analysis web tool, three cold-inducible genes (cor15a: At2g42540, cor15b: At2g42530, and rd29A: At5g52310) were used as an example (26). The genomic region analyzed was first restricted to the upstream regions (−500 bp upstream, 0 bp downstream). The output of this Gene Analysis search is displayed in a detailed result list. The distribution of binding sites among the whole set of genes can be analyzed by selecting ‘Show factors that are common in genes’. Figure 1 shows that all three genes harbour DREB1A (CBF3), DREB1B (CBF1) and DREB1C (CBF2) binding sites in the upstream region. A P-value of 4.36 × 10⁻³² was determined for the occurrence of 11 or more DREB1A binding sites within 500 bp of the upstream region of the three selected genes. This value was determined from the total number of 552 DREB1A TFBS identified in the genome (AthaMap documentation page), the total Arabidopsis genome sequence length of 119 186 497 bp, and the total 1500 bp analysed for DREB1A binding sites. For the six DREB1B and DREB1C TFBS, the P-value is 6.21 × 10⁻²⁰. It has been demonstrated that these TFs, which are members of the AP2/EREBP family, can activate cold-induced genes by binding to the DRE/CRT cis-acting elements present in their promoter regions (27–29). The three sample genes are regulated by members of the AP2/EREBP transcription factor family, namely CBF/DREBs (26).
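The text does not state which statistical test produced the DREB1A P-value. As one plausible reading, the Python sketch below treats each of the 1500 analyzed positions as an independent trial with a per-position site probability of 552 divided by the genome length, and sums the upper binomial tail. This model and the function name are assumptions, not the authors' documented method.

```python
import math

GENOME_BP = 119_186_497  # genome length quoted in the text
DREB1A_SITES = 552       # genome-wide DREB1A sites quoted in the text
ANALYSED_BP = 1500       # 3 genes x 500 bp upstream
OBSERVED = 11            # DREB1A sites found in the analyzed regions


def binomial_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p), with terms built in log space."""
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for k in range(k_min, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + k * log_p + (n - k) * log_q)
        total += math.exp(log_term)  # negligible terms underflow to 0.0
    return total


# Assumed per-position probability of a DREB1A site starting at a position.
P_SITE = DREB1A_SITES / GENOME_BP
P_VALUE = binomial_tail(OBSERVED, ANALYSED_BP, P_SITE)
```

Under these assumptions `P_VALUE` comes out on the order of 10⁻³², in line with the value reported above. A Poisson approximation with the same expected count would give a similar but not identical figure.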
In a further analysis of these genes, the genomic region was restricted to 0 bp upstream and 500 bp downstream to determine whether AP2/EREBP binding site overrepresentation is specific to the upstream regions. In this analysis, no DREB1A (CBF3), DREB1B (CBF1) and DREB1C (CBF2) binding sites were identified (data not shown). This indicates specific accumulation of these binding sites in the upstream region of the three genes. This example demonstrates the application of the Gene Analysis function for a set of co-regulated genes.
The AthaMap colocalization web tool permits the positional identification of putative combinatorial elements (12). In the earlier version of AthaMap, only positional information of a predicted combinatorial element on a chromosome was shown. Now, the locus IDs of the closest genes of all colocalizing TF binding sites and the relative distances to the translation start sites are identified. For colocalization analysis, either a TF can be selected by factor name from the complete list of TFs, or a factor family can be chosen and one member of this family selected. A symbol in front of the factor name indicates how the TF binding sites were identified. A bar (–) precedes all TFs that were annotated by matrix-based searches (11). A double bar (=) is assigned to combinatorial elements (12). TFs with binding sites derived from experimentally verified single sites based on consensus sequences are preceded by an arrow (>) (13).
After a colocalization analysis, the resulting table specifies the chromosomal positions of the colocalizing binding sites and the locus ID of the nearest annotated gene. Furthermore, the distance relative to the start point of translation is given. Upstream positions are preceded by a minus. The locus IDs in this table are linked to the sequence display window showing the genomic context of the gene. Furthermore, an entire result list can easily be submitted to Gene Analysis for determination of further common cis-regulatory elements by using the respective link. Another link directly exports the gene list to the microarray gene expression analysis form in PathoPlant for the identification of co-expression during plant-pathogen interaction (30). Such a list of genes can also be used with other programs for co-expression analysis in microarray databases such as Genevestigator (21).
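Assigning a colocalizing site to its nearest gene can be sketched with a binary search over translation start positions. The Python example below assumes that "nearest" means the smallest absolute distance to a translation start on the same chromosome and that upstream distances are negative. The locus names are placeholders, and the function is illustrative rather than AthaMap's implementation.

```python
import bisect


def nearest_gene(site_pos, genes):
    """Return (locus_id, signed_distance) for the gene closest to site_pos.

    genes: non-empty list of (start_codon_pos, strand, locus_id) tuples on
    one chromosome, sorted by start_codon_pos.
    """
    positions = [gene[0] for gene in genes]
    i = bisect.bisect_left(positions, site_pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = genes[max(i - 1, 0):i + 1]
    start, strand, locus = min(candidates, key=lambda g: abs(g[0] - site_pos))
    offset = site_pos - start
    # Mirror the sign for reverse-strand genes so that upstream stays negative.
    return locus, (offset if strand == "+" else -offset)
```

With `genes = [(1000, "+", "LOCUS_A"), (5000, "-", "LOCUS_B")]`, a site at position 5200 is assigned to `LOCUS_B` at a distance of −200, that is, 200 bp upstream of a gene encoded on the reverse strand.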
Since the last update of AthaMap, a significant change in data display was implemented. In the earlier version, whole genes were displayed as underlined sequence stretches (11). Now, also gene structure elements, i.e. untranslated regions (UTRs), exons and introns, are identified. The annotation of gene structure is based on XML flatfiles downloaded from the TIGR web site (release 5.0) (31). These flatfiles were parsed using a Perl script and positional information for 5′- and 3′-UTRs, exons and introns were annotated to AthaMap. These regions are displayed in AthaMap with a colour code similar to the one used by TAIR (6). The colour code is explained on the sequence display window in AthaMap. The orientation of each gene is indicated below the sequence display window. Forward means that the gene is encoded on the annotated and displayed DNA strand, reverse means that the gene is encoded on the reverse complement strand. For further information on the respective gene, a short description is provided and direct links to TAIR, TIGR, and MIPS records are given (6,31,32).
The data content of AthaMap was increased with nine new alignment matrices derived from eight TFs, another eight TFs with single site-based binding sites and one combinatorial element. These putative binding sites were determined as reported earlier (11–13). Table 1 lists the number of new TF binding sites detected with each matrix and the reference for the alignment matrix. In the case of SPL3 and SPL8, the sequences for alignment matrix generation were obtained directly from the authors of the respective publication (Table 1). Sequences for new single site-based screenings are shown in Table 2. One new combinatorial element was annotated as well. Binding sites derived from both HOX2a matrices were used for determination of combinatorial HOX2a elements (33). AthaMap now contains 9 872 372 TF binding sites detected with alignment matrices, 94 963 TF binding sites detected with experimentally verified TFBS, and 359 867 combinatorial elements based on known TF interactions. The TFs annotated in AthaMap cover most plant TF families (34). Table 3 summarizes the TF families and the number of different TFs represented in AthaMap.
The authors would like to thank Peter Huijser for providing binding sequences of SPL3 and SPL8. This work was carried out in the Intergenomics Center Braunschweig (http://www.intergenomics.de) and was supported by the German Federal Ministry for Education and Research (BMBF grant No. 031U110C/031U210C). Funding to pay the Open Access publication charges for this article was provided by BMBF.
Conflict of interest statement. None declared.