The goal of the AthaMap ‘Gene Identification’ function is to identify all binding sites of pre-selected TFs in all A. thaliana genes. The tool can be accessed by selecting ‘Gene Identification’ at http://www.athamap.de. The accompanying figure shows a schematic overview of the new tool with the parameters that the user can select (red), the results obtained (yellow) and some further options for analysing the obtained data (green). A specific TF can be selected from a list of all annotated TFs. To facilitate selection, the TF family can be chosen first, which restricts the selectable factors to the members of that family. The user can also define specific search parameters. By default, the region searched around each gene extends from 500 bp upstream (−500) to 50 bp downstream (+50). Positions are relative to either the transcription start site or the translation start site, depending on the annotation. The default upstream limit of −500 bp already covers the area in which most of the regulatory sequences are found within the upstream region of
A. thaliana genes. A recent study on the distribution of sequences corresponding to known regulatory elements revealed a localized distribution pattern upstream of the transcription start site (16). For example, the G-box, CACGTG, shows a peak position at −80 and a peak width of 273 bp. Hexamer sequences corresponding to regulatory sequences show peak positions between −62 and −138 and peak widths between 182 and 366 bp. Based on this study, a default region of −500 to +50 bp seems to cover the promoter region most likely to harbour the TFBS relevant for gene expression regulation. Nevertheless, these values can be changed, and a maximum window of 6000 bp (2000 bp upstream and 4000 bp downstream) can be selected around either start site.

For TFs with binding sites determined with PWMs, the minimal threshold can be increased to detect only genes with highly conserved TFBS (12). Furthermore, it is possible to exclude genes regulated by small RNAs. This may be useful to exclude genes that are potentially post-transcriptionally regulated. The results can be displayed in two different sort modes: ‘Gene’ lists the results according to the genome identifier (AGI), whereas ‘Distance’ sorts the results according to the distance of the TFBS to the start site of the gene.

Results comprise a set of non-redundant genes (gene IDs) harbouring a potential TFBS of the selected TF, including positional information and the orientation of the TFBS relative to the putative target gene (shown in yellow in the overview). Genes putatively regulated by small RNAs are also identified. Additional information that can be obtained with the data is indicated in green. For example, each result can be viewed in a sequence display window to analyse the genomic context of the identified TFBS. The gene set can also be submitted to the Gene Analysis function of AthaMap to detect other TFs regulating these genes. Furthermore, the gene IDs can be used for analysis in microarray expression databases to determine whether the genes are coregulated.

As an example of a result display, the partial screen shot shows a search with ABF1 and the default parameters. A total of 821 different genes (gene IDs) harbouring TFBS for ABF1 in the selected region were identified. If a gene harbours two TFBS within the selected region or if the TFBS is palindromic, the gene ID is shown twice. Palindromic sites can occur on both the upper and the lower strand (relative orientation). A non-redundant gene list can be displayed by selecting the underlined number of genes detected. The result table also shows the relative distance to the start site and the score of the particular binding site detected. Gene names and positions are linked to the respective AthaMap sequence display window to explore the genomic context of the binding site. For some TFs, the number of sites to be searched had to be restricted.
This applies to 13 TFs with putative binding site numbers of more than 200 000. In these cases, the threshold score used is displayed in a ‘table of restriction scores’, which can be accessed on the web interface. For further data processing of results, binding sites detected around annotated genes can be downloaded as a file containing all sites detected for the selected TF between 2000 bp upstream and 2000 bp downstream of each gene (download). On special request, the complete unrestricted positional information of TFBS in the A. thaliana genome will be provided.
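
The search described in this section can be illustrated with a minimal Python sketch. It is not the AthaMap implementation, which is not given here. The record types `Gene`, `Site` and `Hit`, the functions `identify_genes` and `non_redundant_genes`, the gene IDs and all coordinates are hypothetical and serve only to show how a window around the start site, an optional score threshold and the two sort modes could interact. Sorting ‘Distance’ by absolute distance to the start site is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_UPSTREAM = 2000    # largest upstream window stated in the text (bp)
MAX_DOWNSTREAM = 4000  # largest downstream window stated in the text (bp)


@dataclass(frozen=True)
class Gene:            # hypothetical record, not an AthaMap data structure
    agi: str           # gene identifier
    chrom: int
    start: int         # annotated start site (genomic coordinate)
    strand: str        # "+" or "-"


@dataclass(frozen=True)
class Site:            # hypothetical record for one predicted TFBS
    chrom: int
    pos: int
    strand: str
    score: float


@dataclass(frozen=True)
class Hit:
    agi: str
    rel_pos: int       # position relative to the start site (negative = upstream)
    orientation: str   # "+" if the site lies on the same strand as the gene
    score: float


def identify_genes(genes: List[Gene], sites: List[Site],
                   upstream: int = 500, downstream: int = 50,
                   min_score: Optional[float] = None,
                   sort: str = "gene") -> List[Hit]:
    """Return all sites inside the window [-upstream, +downstream] of each gene."""
    if not (0 <= upstream <= MAX_UPSTREAM and 0 <= downstream <= MAX_DOWNSTREAM):
        raise ValueError("window exceeds 2000 bp upstream / 4000 bp downstream")
    hits = []
    for gene in genes:
        for site in sites:
            if site.chrom != gene.chrom:
                continue
            if min_score is not None and site.score < min_score:
                continue
            offset = site.pos - gene.start
            # For genes on the lower strand, upstream lies at higher coordinates.
            rel = offset if gene.strand == "+" else -offset
            if -upstream <= rel <= downstream:
                orientation = "+" if site.strand == gene.strand else "-"
                hits.append(Hit(gene.agi, rel, orientation, site.score))
    if sort == "gene":
        hits.sort(key=lambda h: (h.agi, h.rel_pos))
    elif sort == "distance":
        # Assumption: 'Distance' means absolute distance to the start site.
        hits.sort(key=lambda h: (abs(h.rel_pos), h.agi))
    else:
        raise ValueError("sort must be 'gene' or 'distance'")
    return hits


def non_redundant_genes(hits: List[Hit]) -> List[str]:
    """Collapse repeated gene IDs (several sites, or a palindromic site)."""
    return sorted({h.agi for h in hits})


# Toy data: made-up gene IDs and coordinates, for illustration only.
genes = [Gene("AT1G00010", 1, 1000, "+"), Gene("AT1G00020", 1, 5000, "-")]
sites = [
    Site(1, 920, "+", 9.5),   # palindromic site reported on the upper strand ...
    Site(1, 920, "-", 9.5),   # ... and on the lower strand
    Site(1, 5100, "+", 7.0),  # 100 bp upstream of the lower-strand gene
    Site(1, 3000, "+", 8.0),  # outside the default window of both genes
]

for hit in identify_genes(genes, sites):
    print(hit.agi, hit.rel_pos, hit.orientation, hit.score)
print(non_redundant_genes(identify_genes(genes, sites)))
```

In this toy example the first gene is listed twice because its palindromic site is reported on both strands, which mirrors the behaviour described for the result table; `non_redundant_genes` then collapses the list to one entry per gene ID. Raising `min_score` removes the lower-scoring site and with it the second gene.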