Since Release 3 of DBTBS in 2004 (4), the number of referenced publications has increased significantly, from 378 to 947. This increase resulted in the inclusion of six new transcription factors, bringing their number to 120. At the same time, the number of promoters rose from 633 to 1475 and the number of regulated operons from 525 to 736. Indeed, the regulated genes were reorganized into regulated operons, and all the regulated genes are now reported, rather than only the first gene of each operon. In addition to the extension of the existing data, 463 experimentally validated B. subtilis operons and their terminators have been included as well (6). As previously, researchers worldwide are encouraged to report outdated, incorrect or missing information in order to make DBTBS as complete and accurate as possible.
To facilitate the identification of regulatory elements, two new tools have been added. First, a matrix search function lets users identify which transcription factors correspond to a position-specific weighted matrix they submit, by querying DBTBS for the 10 weight matrices most similar to it. Second, following a user request, a B. subtilis motif location search tool was added to replace the GRASP-DNA tool developed by Schilling et al., which is no longer available. This tool takes a list of binding site sequences as input and returns the B. subtilis genome locations matched by the position-specific weighted matrix calculated from them. For each location, the two nearest upstream and downstream open reading frames are also reported. Furthermore, the generated position-specific weighted matrix can be used directly to search for the 10 most similar DBTBS weight matrices by using the provided link.
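The motif location search described above can be illustrated with a minimal Python sketch: a log-odds matrix is built from a list of equal-length binding sites and then slid along a sequence. The function names, the pseudocount, the uniform background and the score cutoff are all assumptions made for illustration; the actual scoring scheme and threshold used by DBTBS are not stated here, and this sketch scans only the given strand.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix from equal-length sites (hypothetical helper)."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all binding sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        # Log2 ratio of the smoothed base frequency to a uniform background.
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm, min_score):
    """Return (start, score) for each window scoring at least min_score."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= min_score:
            hits.append((start, score))
    return hits

# Toy input: four made-up sites and a short made-up sequence.
sites = ["TGTAAG", "TGTAAG", "TGAAAG", "TGTCAG"]
pwm = build_pwm(sites)
hits = scan("CCCTGTAAGCCC", pwm, min_score=5.0)  # cutoff is an assumption
```

In a real search, the reverse complement would also be scanned, and each reported position would then be mapped to its neighbouring open reading frames.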
Upstream intergenic region conservation
In order to provide upstream intergenic region conservation information, groups of homologous proteins from 40 Gram-positive bacteria (Supplementary Table 1) were built. Homologies between the proteins were determined by all-against-all protein BLAST (8) searches, where a protein A was considered homologous to a protein B if an identity higher than 40% over more than 50% of the length of A was found. Each group was then divided into subgroups based on genus, and each subgroup was further divided based on the lengths of the upstream intergenic regions of its members. Although orthologous and paralogous genes are first grouped together, subdividing the genus-specific groups based on the length of the upstream intergenic regions is expected to separate paralogous genes that are regulated differently.
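The homology criterion can be written down as a small filter over BLAST hits. The sketch below is only an illustration: the `BlastHit` record and its field names are hypothetical, and it assumes that "length of A" is measured as the aligned span on the query protein.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlastHit:
    """One pairwise hit; field names are hypothetical, not a BLAST format."""
    query: str               # protein A
    subject: str             # protein B
    percent_identity: float
    query_start: int         # 1-based, inclusive
    query_end: int           # 1-based, inclusive
    query_length: int        # full length of protein A

def is_homologous(hit, min_identity=40.0, min_coverage=0.5):
    """Apply the criterion: identity > 40% over > 50% of the length of A."""
    aligned = hit.query_end - hit.query_start + 1
    coverage = aligned / hit.query_length
    return hit.percent_identity > min_identity and coverage > min_coverage

# Made-up hits for illustration only.
hits = [
    BlastHit("A", "B", 62.5, 1, 180, 300),    # passes both cutoffs
    BlastHit("A", "C", 38.0, 1, 290, 300),    # identity too low
    BlastHit("A", "D", 90.0, 10, 120, 300),   # covers too little of A
]
homologs = [(h.query, h.subject) for h in hits if is_homologous(h)]
```

Because coverage is measured on A only, the relation is not necessarily symmetric: A may be homologous to B under this rule while B is not homologous to A.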
The upstream intergenic regions of each of the subgroups containing more than two members were aligned with ClustalW (9), and the last 300 positions of the alignment, representing the nucleotides directly upstream of the gene starts, were kept for further analysis.
For each subgroup, a conservation profile was calculated based on information content, thus giving the degree of conservation of each position and allowing the determination of conserved regions. In our analysis, conserved regions were determined by setting the threshold for the degree of conservation to 75%, while allowing at most three consecutive positions to have lower values. All possible 6-base-long position-specific weighted matrices were then created from the determined conserved regions and clustered using the quality cluster algorithm (10) with a Kullback–Leibler divergence (11) of 0.3 as the maximum cluster diameter. Matrices clustering together were merged to yield the hexameric motif matrices available from DBTBS.
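The profile and region steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used to build DBTBS: the degree of conservation is taken to be the column information content divided by its 2-bit maximum, gap characters are ignored, and all function names are hypothetical. The clustering step is not shown.

```python
import math
from collections import Counter

def tail_columns(alignment, n=300):
    """Keep the last n alignment columns, closest to the gene start."""
    return [row[-n:] for row in alignment]

def conservation_profile(alignment):
    """Per-column conservation in [0, 1]: (2 - entropy) / 2 for DNA.
    Assumption: gaps and ambiguous symbols are ignored."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(base for base in column if base in "ACGT")
        total = sum(counts.values())
        if total == 0:
            profile.append(0.0)
            continue
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        profile.append((2.0 - entropy) / 2.0)
    return profile

def conserved_regions(profile, threshold=0.75, max_low_run=3):
    """Half-open (start, end) regions at or above threshold, tolerating
    runs of at most max_low_run consecutive lower positions."""
    regions = []
    start = last_high = None
    for i, value in enumerate(profile):
        if value >= threshold:
            if start is None:
                start = i
            last_high = i
        elif start is not None and i - last_high > max_low_run:
            regions.append((start, last_high + 1))
            start = None
    if start is not None:
        regions.append((start, last_high + 1))
    return regions

def hexamer_windows(region, width=6):
    """All width-long windows inside one conserved region."""
    start, end = region
    return [(s, s + width) for s in range(start, end - width + 1)]

# Toy alignment: eight identical columns followed by two variable ones.
alignment = tail_columns(["ACGTACGTAA", "ACGTACGTCC", "ACGTACGTGG"])
profile = conservation_profile(alignment)
regions = conserved_regions(profile)
windows = [w for r in regions for w in hexamer_windows(r)]
```

Each window would then be turned into a 6-column matrix from the aligned sequences before clustering.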
Through this process, 29 520 hexameric motif matrices were created; 5652 of them were specific to Bacillus, 1516 to Staphylococcus and 184 to Streptococcus (Supplementary Table 2). These numbers are largely influenced by the grouping method, which yields only a few groups for genera with few members and hence lowers the potential number of motifs specific to those genera.
Repartition of the clusters and motifs
The Gram-positive bacteria upstream intergenic conservation information can be accessed from the ‘Motif conservation’ link on the main page of DBTBS. Users can then search the data by submitting a gene name, a genus-subgroup name or a motif number. Also, because of the large number of hexameric motif matrices available, the desired ranges for the information content and the number of occurrences of a motif can be selected in order to filter the displayed motifs.
Submitting a gene name will return a table indicating which organisms contain a gene labeled with the given name, as well as the genus-subgroup in which this gene is included and the motifs found in that subgroup. Genus-subgroups and motifs are linked to the same pages as those obtained by directly searching with a genus-subgroup name or a motif number.
The result of a genus-subgroup name search is a page presenting the conservation profile of the subgroup, with the positions of the conserved regions and motifs. The upstream intergenic sequence alignment used to calculate the conservation profile is shown under it (Figure 1). Following the graphical display of the subgroup is a list of the genes included in the subgroup, and a list of the motifs present. This last list shows the motif logo (12) and indicates in which other groups the same motif is found. Again, each genus-subgroup name and motif number in this list is linked to the same page as the one obtained by a direct search.
Figure 1. Hexameric motif conservation in an upstream intergenic region. The upper part shows the search entry box, with the criteria selected for filtering of the displayed hexameric motifs. The lower part is the resulting figure, showing the conservation profile ...
A motif number search first shows the motif logo, and then a list of the genus-subgroups in which the motif is found. In this list, the position of the motif in each subgroup is shown graphically, and for each subgroup, the lists of the included genes and of the other motifs found in it are given, once again linked to the relevant pages.