Sequence-specific DNA-binding transcription factors (TFs) each recognize a family of cis-regulatory DNA sequences described by a consensus motif (1) or position-specific weight matrix (2). They regulate spatial and temporal gene expression by binding to DNA and either activating or repressing the action of RNA polymerase. Like other proteins, TFs are composed of evolutionary units called domains, which belong to families that can occur in many different proteins and in various domain combinations. In the DBD database, we define TFs as proteins containing a sequence-specific DNA-binding domain (DBD). Other databases, such as TrSDB (3), or data sets, such as that of Messina et al. (4), include both specific and general TFs. Our strict definition of TFs as sequence-specific DNA-binding proteins is useful in a wide variety of studies. Examples include improving genome annotation; high-throughput experiments such as ChIP–chip, protein chip or yeast one-hybrid (5); and studies of the evolution of gene regulation that compare multiple genomes (6) or gene regulation networks (7). The DBD database has been used as an annotation tool in the context of the InterPro (8) and FlyTF (http://FlyTF.org) (9) databases.
Access to the DBD database is via http://transcriptionfactor.org, where all data are available for viewing and immediate download. Users can browse predictions by species (over 700, from Arabidopsis thaliana to Zymomonas mobilis) or by DBD family (including helix–turn–helix, zinc-finger, homeobox and many others); search predictions by sequence identifier or domain family; receive classifications for submitted protein sequences; and download our domain assignments, as well as our manually curated list of DBDs.
The prediction method in the DBD database (10) uses hidden Markov models (HMMs) from two databases, SUPERFAMILY (11) and Pfam (12), to identify domains in proteins. From DBD release 2.0 onwards, updated annotation resulted in 303 HMMs from SUPERFAMILY and 145 from Pfam, compared to a total of 251 HMMs in the first version of DBD. The HMMs from SUPERFAMILY represent 37 superfamilies and 87 families according to the definitions in the SCOP database (13). The update includes 98 new models representing 37 sequence-specific DBD families. For the 150 organisms in the original version of DBD, this increased the number of TF predictions by 4.7%.
The pipeline used to predict TFs begins with a domain annotation of all proteins from completely sequenced genomes with all HMMs from the SUPERFAMILY and Pfam databases (Supplementary Figure 1). A protein is classified as a TF if it has a significant match to a model that we annotated as a DBD, with the significance thresholds for HMM matches taken from the Pfam and SUPERFAMILY databases. This results in an estimated 1–5% of false-positive annotations. The TF predictions are limited to the families in our annotated collection, which means that the coverage is about two-thirds of known TFs. At the same time, the method predicts up to 50% additional TFs among proteins that carry only annotations such as ‘hypothetical protein’, particularly in metazoan genomes. For details of benchmarking, see (10). The prediction method is general and applicable to any proteome or sequence set. In fact, the database has grown to encompass the TF repertoires of over 700 publicly available genomes. Predictions for newly sequenced genomes are continuously added to the database.
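The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the DBD pipeline itself: the `DomainHit` record, the `predict_tfs` function and all identifiers in the toy data are hypothetical, and the `significant` flag is assumed to have been set upstream using the threshold of whichever library (Pfam or SUPERFAMILY) the model came from.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class DomainHit(NamedTuple):
    """One HMM match against a protein (hypothetical record layout)."""
    protein_id: str
    model_id: str      # identifier of the matching HMM
    significant: bool  # assumed: True if the hit passed the source library's threshold


def predict_tfs(hits: Iterable[DomainHit], dbd_models: Set[str]) -> Dict[str, Set[str]]:
    """Return predicted TFs mapped to the DBD models they matched.

    A protein is called a TF if at least one significant hit is to a
    model in the curated set of DBD models.
    """
    predicted: Dict[str, Set[str]] = defaultdict(set)
    for hit in hits:
        if hit.significant and hit.model_id in dbd_models:
            predicted[hit.protein_id].add(hit.model_id)
    return dict(predicted)


# Toy data; every identifier below is made up for illustration.
example_hits = [
    DomainHit("protA", "model_hth", True),     # significant DBD hit -> TF
    DomainHit("protA", "model_kinase", True),  # significant, but not a DBD model
    DomainHit("protB", "model_hth", False),    # DBD model, but below threshold
    DomainHit("protC", "model_kinase", True),  # no DBD hit at all
]
example_dbd_models = {"model_hth", "model_znf"}
example_tfs = predict_tfs(example_hits, example_dbd_models)
```

Because the rule depends only on the hit list and the curated model set, the same function could in principle be applied to any proteome or sequence set for which domain hits are available.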
The current DBD database contains information on over 200 000 predicted TFs. These TFs are distributed across the tree of life. Not surprisingly, we find a greater number of TFs in larger genomes. To investigate the relationship between TF abundance and proteome size in different lineages, we graph these variables on a log–log plot as in Kummerfeld and Teichmann (10) (Supplementary Figure 2 in this paper). To illustrate the difference between the eukaryotic and prokaryotic superkingdoms, we fit a separate model to each of these lineages. From the linear relationship on the log–log scale, a power law can be inferred. This power law could be due to the underlying distribution of DBDs: a small number of DBDs (such as the helix–turn–helix and zinc-finger families) occur in the majority of TFs, whereas most DBDs occur in only a small number of TFs. In agreement with van Nimwegen (14) and Ranea et al. (15), we find that a higher proportion of TFs is required to regulate larger proteomes. We also find that TF abundance in archaea and bacteria expands more rapidly with proteome size than in eukaryotes. Thus, in general, the same number of TFs regulates fewer prokaryotic genes than eukaryotic genes. The higher degree of combinatorial control in eukaryotes, where gene expression is regulated not by one TF but by a group of TFs, may also contribute to the lower eukaryotic TF requirements. Because different combinations of TFs can be used, the number of gene regulation modes can increase with a smaller increase in the number of TFs. Bacteria and archaea obey the same power law in terms of number of TFs and number of proteins. This is in accordance with their shared repertoire of DBD families, which we will return to below.
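To make the inference from a straight line on a log–log plot to a power law concrete: if the number of TFs T and the proteome size N satisfy T = a·N^b, then log T = log a + b·log N, so the slope of a linear fit on the log–log scale estimates the exponent b. The sketch below fits this by ordinary least squares using only the Python standard library. The function name `fit_power_law` and the data are illustrative assumptions; the numbers are synthetic, not values from the DBD database, and the fitting procedure used for Supplementary Figure 2 may differ.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(proteome_sizes: Sequence[float],
                  tf_counts: Sequence[float]) -> Tuple[float, float]:
    """Fit tf_count = a * proteome_size**b; return (a, b).

    Uses ordinary least squares on log10-transformed values, so all
    inputs must be strictly positive.
    """
    if len(proteome_sizes) != len(tf_counts) or len(proteome_sizes) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    xs = [math.log10(n) for n in proteome_sizes]
    ys = [math.log10(t) for t in tf_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope on the log-log scale = exponent
    a = 10 ** (mean_y - b * mean_x)    # intercept on the log-log scale = log10(a)
    return a, b


# Synthetic data generated from an exact power law (a = 0.001, b = 1.5);
# these values are made up and do not come from the database.
sizes = [1000, 2000, 4000, 8000]
counts = [0.001 * size ** 1.5 for size in sizes]
a, b = fit_power_law(sizes, counts)
```

Fitting each lineage separately, as described above, would amount to calling such a function once per group of genomes and comparing the resulting exponents.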
Apicomplexa appear not to follow either the prokaryotic or the typical eukaryotic trend, perhaps because they are obligate parasites that survive only in the nutrient-rich environment of their hosts. Thus, a different mode of gene regulation may be used by this lineage, or it is possible that their TFs are not well characterized by the current model libraries. Below, we will illustrate in more detail how the DBD database provides a consistent framework for comparison of the distribution of DBDs across the tree of life.