Gene expression is controlled mainly by the binding of transcription factors to regulatory sequences. To generate a genomic map for regulatory sequences, the Arabidopsis thaliana genome was screened for putative transcription factor binding sites. Using publicly available data from the TRANSFAC database and from publications, alignment matrices for 23 transcription factors of 13 different factor families were used with the pattern search program Patser to determine the genomic positions of more than 2.4 × 10^6 putative binding sites. Due to the dense clustering of genes and the observation that regulatory sequences are not restricted to upstream regions, the prediction of binding sites was performed for the whole genome. The genomic positions and the underlying data were imported into the newly developed AthaMap database. These data can be accessed by positional information or by the Arabidopsis Genome Initiative identification number. Putative binding sites are displayed in the defined region. Data on the matrices used and on the thresholds applied in these screens are given in the database. Considering the high density of sites, AthaMap will be a valuable resource for generating models of gene expression regulation. The data are available at http://www.athamap.de.
The identification of gene regulatory elements is still a major challenge for molecular biologists. Experimental methods can be complemented with bioinformatic approaches to identify transcription factors (TFs) or factor families responsible for gene expression regulation (1,2). Once a regulatory region is delineated experimentally, a bioinformatic approach may involve the use of pattern recognition programs such as MatInspector, Match or Patser to identify functional or putative TF binding sites in this region (3–5). The bases for these pattern recognition programs are alignment matrices that can be derived from random binding site selection experiments. Such experiments often determine a large array of different DNA sequences that can be bound by the factor. These data are used by pattern search programs to generate positional weight matrices that also predict novel binding sites solely on the basis of nucleotide frequencies at single matrix positions.
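As an illustration of how an alignment matrix can predict novel sites from single-position nucleotide frequencies, the following minimal sketch builds a log-odds weight matrix from a handful of aligned sites and scores candidate sequences. The example sites, the uniform background and the pseudocount are assumptions chosen for illustration; this is a common textbook formulation and is not claimed to reproduce the exact computation performed by MatInspector, Match or Patser.

```python
import math

def build_log_odds_matrix(sites, background=None, pseudocount=1.0):
    """Turn aligned binding sites of equal length into a log-odds weight matrix."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    matrix = []
    for i in range(length):
        column = [site[i] for site in sites]
        total = len(column) + pseudocount
        weights = {}
        for base in "ACGT":
            # The pseudocount keeps unobserved bases from producing log(0).
            freq = (column.count(base) + pseudocount * background[base]) / total
            weights[base] = math.log(freq / background[base])
        matrix.append(weights)
    return matrix

def score_site(matrix, site):
    """Sum the per-position weights; positions are treated as independent."""
    return sum(column[base] for column, base in zip(matrix, site))

# Hypothetical aligned sites, e.g. from a random binding site selection experiment.
sites = ["CACGTG", "CACGTG", "CACGTC", "TACGTG"]
matrix = build_log_odds_matrix(sites)

consensus_score = score_site(matrix, "CACGTG")  # most frequent base at every position
novel_score = score_site(matrix, "TACGTC")      # never observed as a whole site
unrelated_score = score_site(matrix, "GGGGGG")
```

In this sketch the sequence TACGTC still receives a positive score although it is absent from the input sites, because each of its bases was observed at the corresponding matrix position. This is the sense in which a matrix can predict binding sites that were not part of the original alignment.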
The recent completion of the Arabidopsis thaliana genome sequence offered a good opportunity to determine putative TF binding sites in the whole genome (6). Although most of the regulatory sequences in genes occur upstream of the transcription and translation start site, many exceptions are known (7–9). Furthermore, Arabidopsis has a high density of genes and the average length of the intergenic region is only ~2–2.5 kb (6). Therefore, the determination of factor binding sites should not be restricted to upstream regions alone.
Resources for binding site identification used here are mostly alignment matrices derived for individual TFs and annotated in the TRANSFAC database. During the past few years the TRANSFAC database has been significantly enhanced with plant-specific data. For example, the number of plant transcription factors in the database rose from 266 to 489 between the years 2000 and 2001 and currently stands at 644 in the TRANSFAC 6.0 public database (10–12). A similar increase was achieved with the annotation of alignment matrices. This shows that a critical amount of data is available for the prediction of genomic positions of TF binding sites.
Here, we have employed alignment matrices from the TRANSFAC database and from further publications for the prediction of TF binding sites in the most recent version of the Arabidopsis genome sequence by using the matrix screening program Patser (4,12,13). The results of these screens were integrated into a genome-wide binding site map and are now available at http://www.athamap.de. This report describes the content and use of the AthaMap database resource. Tools have been developed for easy display of binding sites and for the display of the underlying data. The database will be complemented in the future with newly published binding sites and with combinatorial elements of interacting factors.
The AthaMap database structure was developed for storing positional information on putative TF binding sites, underlying data for binding site prediction, alignment matrices and additional data on TFs. The database was designed with a high degree of flexibility to facilitate future upgrades and was implemented on an MS-SQL-Server. Genomic screenings were performed using Patser (4). Software tools were programmed to import putative TF binding sites predicted by Patser into the database. This toolbox was also employed to analyse the redundancy of matches in the database (see below). Selection of TF matrices was performed in order to ensure minimal redundancy. An interactive web server interface was designed for public accessibility of the database content.
Alignment matrices from 23 TFs corresponding to 13 different TF families were used for the genomic screen for TF binding sites. Many of the factors employed for the genomic screen originate from A.thaliana. However, genomic screens were also performed with alignment matrices for factors from other plant species. The rationale behind this is the observation that binding site recognition is generally not species specific. For example, bZIP or MYB factors from different plant species recognize similar target sequences with a high conservation of a core sequence (14–16). Because Arabidopsis is frequently used to dissect the function of heterologous TFs, information on binding site locations in Arabidopsis may also be valuable for heterologous TFs (17). Furthermore, all binding sites that are displayed in the desired genomic region can be easily associated with a homologous or heterologous TF (see below).
The pattern search program Patser was employed for the identification of binding sites (4). Patser is available as a UNIX/Linux stand-alone program on the author’s web site (4) and online as part of the Regulatory Sequence Analysis Tools (18). Here, a locally installed version of Patser was used. The following command line was used to run Patser: patser-v3d -A a:t 0.325 c:g 0.175 -m matrixfile -f sequencefile -c -li -d2. In most cases, the default threshold derived from the adjusted information content of the matrix was employed. In seven of 23 cases, the matrix information content was insufficient to yield specific matrix matches using the default threshold. In these cases, indicated in Table 1, a higher threshold was applied. Table 1 summarizes the number of genomic matches identified for 23 alignment matrices with Patser. Column 1 shows the designation of the factor for which an alignment matrix was available; column 2 gives the TF family; column 3 displays the total number of genomic positions identified and column 4 shows the TRANSFAC accession number and the reference.
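For readers who wish to script such a screen, the command line given above can be assembled programmatically. The sketch below only rebuilds the argument list exactly as quoted in the text; the function name and the placeholder file names are hypothetical, and no behaviour of the individual options is assumed beyond what is stated above.

```python
import shlex

def build_patser_command(matrix_file, sequence_file, at_freq=0.325, cg_freq=0.175):
    """Return the Patser invocation quoted in the text as an argument list."""
    return [
        "patser-v3d",
        "-A", "a:t", str(at_freq), "c:g", str(cg_freq),
        "-m", matrix_file,
        "-f", sequence_file,
        "-c", "-li", "-d2",
    ]

# "matrixfile" and "sequencefile" are the placeholders used in the text.
command = build_patser_command("matrixfile", "sequencefile")
command_line = shlex.join(command)
print(command_line)
```

Assuming a local Patser installation, such a list could be handed to `subprocess.run(command, capture_output=True, text=True)`, once per matrix and chromosome sequence file.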
The identification of TATA box binding protein (TBP) binding sites with the available matrix was a particular challenge (19). More than 200 000 putative TBP binding sites were detected by using the default threshold with Patser. Upon closer inspection it was clear that AT-rich regions are ‘hot spots’ of putative TBP binding sites. Therefore, we restricted the putative TBP binding sites in AthaMap to those sites that were detected in a region up to 400 bp upstream of the translation start point. Because the number of putative TBP binding sites in AthaMap (30 878) still exceeds the total number of genes, some sites may still cluster in AT-rich regions.
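The positional restriction applied to the TBP sites can be sketched as a strand-aware filter. The record types, field names and example coordinates below are hypothetical, and the sketch makes the simplifying assumption that a site is represented by a single coordinate; it is not the import tool used for AthaMap.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    chromosome: int
    position: int  # assumed: one representative coordinate per site

@dataclass(frozen=True)
class Gene:
    chromosome: int
    translation_start: int  # coordinate of the first base of the start codon
    strand: str             # "+" or "-"

def is_within_upstream_window(site, gene, window=400):
    """True if the site lies 1..window bp upstream of the translation start."""
    if site.chromosome != gene.chromosome:
        return False
    if gene.strand == "+":
        distance = gene.translation_start - site.position
    else:
        # On the reverse strand, upstream means higher coordinates.
        distance = site.position - gene.translation_start
    return 1 <= distance <= window

def filter_upstream_sites(sites, genes, window=400):
    """Keep only sites that are upstream of at least one gene."""
    return [
        site for site in sites
        if any(is_within_upstream_window(site, gene, window) for gene in genes)
    ]

# Made-up example data.
genes = [Gene(1, 1000, "+"), Gene(1, 5000, "-")]
sites = [Site(1, 700), Site(1, 500), Site(1, 5200), Site(1, 4900), Site(2, 700)]
kept = filter_upstream_sites(sites, genes)
```

For a genome-wide screen the nested loop would be too slow, and an index of gene start positions per chromosome would be a more practical design.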
In summary, more than 2.4 × 10^6 putative TF binding sites were predicted in the Arabidopsis genome.
In several cases alignment matrices from different TFs of the same TF family were used in the screens. It is therefore useful to estimate the level of redundancy within the genomic positions. This is of particular importance for members of the MYB factor family, which is represented by six different matrices (Table 1). To determine the level of redundancy, the genomic positions detected by all MYB matrices were investigated for colocalization in the genome using a software tool developed in the lab (L. Bülow and R. Hehl, unpublished). This tool detects identical matches for two different matrices. MYB.PH3 recognizes two different sets of binding sites (MYB.PH3 and MYB.PH3, Table 1) that were both annotated to the TRANSFAC database (12,20). Based on the matrix consensus sequence it was expected that CDC5 would not detect the same binding sites as GAMYB, MYB.PH3, MYB.PH3 and P (data not shown). GAMYB, MYB.PH3, MYB.PH3 and P recognize a consensus sequence with a characteristic AAC trinucleotide sequence, which is not conserved in the matrix for CDC5. In accordance with this expectation, mostly GAMYB, MYB.PH3 and P show a certain level of redundancy in binding site identification. Table 2 shows that 34 863 of the genomic matches detected with the GAMYB (319 996) and NTMYBAS (183 549) matrices are identical. Similarly, of the 8554 matches detected with MYB.PH3, 2166, 2929 and 215 are also identified with GAMYB, NTMYBAS and P, respectively. CDC5 does not detect the same positions as were identified with GAMYB, MYB.PH3, MYB.PH3 and P (Table 2). From these values we deduce a high number of unique positions and estimate that the level of redundancy in AthaMap is relatively low. Only those matrices that harbour a certain degree of sequence similarity match identical positions in the genome.
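The colocalization analysis described above amounts to counting identical matches for each pair of matrices. The following minimal sketch shows one way to do this with set intersections. It is not the unpublished tool used in the lab; the representation of a match as a (chromosome, position, strand) tuple and the example data are assumptions.

```python
from itertools import combinations

def pairwise_shared_matches(matches_by_matrix):
    """Count identical genomic matches for every pair of matrices."""
    as_sets = {name: set(matches) for name, matches in matches_by_matrix.items()}
    return {
        (a, b): len(as_sets[a] & as_sets[b])
        for a, b in combinations(sorted(as_sets), 2)
    }

# Made-up matches: (chromosome, position, strand).
matches_by_matrix = {
    "GAMYB": [(1, 100, "+"), (1, 250, "-"), (2, 40, "+")],
    "P": [(1, 100, "+"), (2, 40, "+"), (3, 7, "+")],
    "CDC5": [(1, 900, "+")],
}
shared = pairwise_shared_matches(matches_by_matrix)
for pair, count in shared.items():
    print(pair, count)
```

Whether two matches count as identical depends on the chosen key. Matrices of different lengths may report the same core sequence at offset positions, so a real comparison would need to define how such cases are treated.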
Furthermore, it is advantageous to represent TF binding sites for different matrices within the same TF family because the specificity of DNA binding is then determined mainly by the nucleotides flanking the core region.
AthaMap is accessible at http://www.athamap.de. AthaMap provides two different search modes for the user. The chromosomal regions of interest can be retrieved either by submitting the Arabidopsis Genome Initiative identification or TAIR accession number (13) or by entering the genomic position. AthaMap displays the DNA sequence 500 bp upstream and downstream of the putative gene start or submitted position, respectively, with all identified TF binding sites highlighted in red. The arrow above the sequence adjacent to the factor’s name indicates the length of the putative binding site and its orientation. To facilitate navigation within the sequence, buttons were placed in the lower corners of the sequence display window which can be used to scroll 500 bp in either direction. Transcribed and translated gene regions are underlined and the respective genes are identified below the sequence window with gene name and gene features like start, stop and orientation (Fig. 1). The designations were taken from TAIR annotation tables (13).
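The display logic described above (a window of 500 bp on either side of a position, scrolled in 500 bp steps) can be sketched as follows. The function names and the tuple layout of a site are hypothetical, and the sketch makes no claim about how the AthaMap web interface or its database queries are actually implemented.

```python
def display_window(center, flank=500, chromosome_length=None):
    """Return 1-based inclusive start and end coordinates around a position."""
    start = max(1, center - flank)
    end = center + flank
    if chromosome_length is not None:
        end = min(end, chromosome_length)
    return start, end

def sites_in_window(sites, start, end):
    """Select sites lying completely inside the window.

    Each site is assumed to be (position, length, strand, factor).
    """
    return [s for s in sites if s[0] >= start and s[0] + s[1] - 1 <= end]

def scroll(center, direction, step=500):
    """Move the window centre by one step; direction is +1 or -1."""
    return center + direction * step

# Made-up example sites.
sites = [(120, 10, "+", "AG"), (795, 10, "-", "TBP"), (790, 10, "+", "P")]
window = display_window(300)
visible = sites_in_window(sites, *window)
```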
The AthaMap documentation contains a complete list of the matrices used for the genomic screenings, the number of matches detected with Patser, the score thresholds for each matrix used by Patser, the maximum score and the sequences from which the matrix was derived, including the corresponding publication. Furthermore, the documentation of the database provides the TRANSFAC accession number of the matrix. This will enable the user of AthaMap to find more recent information about the respective factor in the regularly updated TRANSFAC database (12). All of the matrices used will eventually be annotated in TRANSFAC.
Figure 1 shows a screen shot of AthaMap displaying the upstream and part of the transcribed region (underlined) of locus At1g06180.1 on chromosome 1. The information provided for the putative MADS box binding site identified with the Arabidopsis AG matrix (Table 1) is shown. The name of the factor, factor family, species, matrix, maximum score of the matrix and the threshold used by Patser are displayed in a pop-up window by clicking on the factor’s name. The tool tip box appears by moving the mouse over the arrow and identifies the position of the site, the score of this particular site and the maximum score and threshold used by Patser. These values enable the user to judge the quality of a match or putative binding site by comparing its score with the threshold and the maximum score determined by Patser. A score close to the maximum score represents a high-quality binding site, whereas a score close to the threshold represents a low-quality binding site.
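One simple way to make this comparison explicit is to place the score of a match on a scale from 0 (at the threshold) to 1 (at the maximum score). This normalization is an assumption introduced here for illustration; AthaMap itself reports the score, the threshold and the maximum score, not this derived value. The example numbers are made up.

```python
def relative_match_quality(score, threshold, max_score):
    """Scale a match score to 0..1 between the threshold and the maximum score."""
    if max_score <= threshold:
        raise ValueError("max_score must be greater than threshold")
    if not threshold <= score <= max_score:
        raise ValueError("score must lie between threshold and max_score")
    return (score - threshold) / (max_score - threshold)

# Hypothetical values as they might be read from the tool tip box.
quality = relative_match_quality(score=9.0, threshold=6.0, max_score=12.0)
print(quality)
```

Because thresholds and maximum scores differ between matrices, such a relative value is easier to compare across factors than the raw score, although it says nothing about whether a site is functional in vivo.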
An online resource designated Regulatory Sequence Analysis (RSA) tools is available for the detection of TF binding sites in the Arabidopsis and in other genomes (18). For detection of TF binding sites or cis regulatory sequences in plant genes, the databases Place, PlantCare and TRANSFAC can be employed (12,21,22). These three databases require an input sequence in which putative binding sites are to be detected, whereas the genomic detection of binding sites with the RSA tools requires that a matrix or a binding site consensus sequence be provided.
AthaMap is one of only two databases that display putative binding sites directly in the genome of A.thaliana and the only database that displays TF binding sites in the whole genomic sequence including coding and complete intergenic regions. The other Arabidopsis genome database on cis regulatory sequences, AtcisDB (A.thaliana cis-regulatory database), contains the 5′ regulatory sequences of 29 388 annotated Arabidopsis genes (23). The main differences between AtcisDB and AthaMap are the restriction to 5′ sequences (AtcisDB) versus the complete genome (AthaMap) and the accessibility of the underlying data. AtcisDB contains only known cis-acting elements while AthaMap also identifies novel putative cis-acting TF binding sites. This demonstrates that AthaMap and AtcisDB are complementary. Furthermore, AthaMap provides a high level of transparency, which means that the process of binding site detection in AthaMap can be reproduced by the user. All parameters for binding site detection are identified and each site is associated with a TF.
The AthaMap resources are freely available for non-commercial users at http://www.athamap.de. The database will be updated on a regular basis. All updates and changes will be announced on the AthaMap home page.
We would like to thank the members of the Intergenomics network at the Technical University, Braunschweig, for many helpful discussions. This work was supported by the German Ministry of Education and Research (BMBF grant no. 031U110C/031U210C) and was carried out as part of the Intergenomics network, Braunschweig. Further support was provided by the Forschungsschwerpunkt Agrarbiotechnologie des Landes Niedersachsen (VW-Vorab).