The identification of gene regulatory elements remains a major challenge for molecular biologists. Experimental methods can be complemented with bioinformatic approaches to identify the transcription factors (TFs) or factor families responsible for regulating gene expression (1). Once a regulatory region has been delineated experimentally, a bioinformatic approach may involve the use of pattern recognition programs such as MatInspector, Match or Patser to identify functional or putative TF binding sites in this region (3). These pattern recognition programs are based on alignment matrices that can be derived from random binding site selection experiments. Such experiments often yield a large array of different DNA sequences that can be bound by the factor. Pattern search programs use these data to generate positional weight matrices, which can also predict novel binding sites solely on the basis of nucleotide frequencies at single matrix positions.
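As a minimal sketch of how a positional weight matrix can be derived from aligned binding sites and used to score candidate sites, consider the following Python example. It is an illustration under stated assumptions and not the scoring scheme of Patser or any other program named above: the function names, the pseudocount of 1, the uniform background frequency of 0.25, the log2-odds weights, the threshold and the four example sites are all hypothetical choices made for this sketch. Only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    counts = [{base: 0 for base in BASES} for _ in range(length)]
    for site in sites:
        for pos, base in enumerate(site.upper()):
            counts[pos][base] += 1
    return counts


def weight_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights (assumed scheme, uniform background)."""
    n_sites = sum(counts[0].values())
    total = n_sites + len(BASES) * pseudocount
    return [
        {base: math.log2(((col[base] + pseudocount) / total) / background)
         for base in BASES}
        for col in counts
    ]


def score_window(pwm, window):
    """Sum the position-specific weights; each position contributes independently."""
    return sum(col[base] for col, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for forward-strand windows scoring >= threshold."""
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous characters such as N
        score = score_window(pwm, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical aligned sites, invented for illustration only.
example_sites = ["CACGTG", "CACGTG", "CACGTT", "TACGTG"]
example_pwm = weight_matrix(count_matrix(example_sites))
example_hits = scan(example_pwm, "AAACACGTGAAA", threshold=5.0)
for start, window, score in example_hits:
    print(start, window, round(score, 2))
```

With these invented sites, the only window that clears the threshold is CACGTG at 0-based offset 3. Because the score is a plain sum over positions, a sequence that was never observed in the selection experiment can still score highly if each of its bases is frequent at the corresponding matrix position.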
The recent completion of the Arabidopsis thaliana genome sequence offered a good opportunity to determine putative TF binding sites in the whole genome (6). Although most regulatory sequences in genes occur upstream of the transcription and translation start sites, many exceptions are known (7). Furthermore, Arabidopsis has a high density of genes, and the average length of the intergenic regions is only ~2–2.5 kb (6). Therefore, the determination of factor binding sites should not be restricted to upstream regions alone.
The resources used here for binding site identification are mostly alignment matrices derived for individual TFs and annotated in the TRANSFAC database. During the past few years the TRANSFAC database has been significantly enhanced with plant-specific data. For example, the number of plant transcription factors in the database rose from 266 to 489 between 2000 and 2001 and currently stands at 644 in the TRANSFAC 6.0 public database (10). A similar increase was achieved in the annotation of alignment matrices. This suggests that a critical amount of data is now available for the prediction of genomic positions of TF binding sites.
Here, we have employed alignment matrices from the TRANSFAC database and from further publications to predict TF binding sites in the most recent version of the Arabidopsis genome sequence using the matrix screening program Patser (4). The results of these screens were integrated into a genome-wide binding site map and are now available at http://www.athamap.de. This report describes the content and use of the AthaMap database resource. Tools have been developed for easy display of binding sites and of the underlying data. In the future, the database will be complemented with newly published binding sites and with combinatorial elements of interacting factors.