Trends Genet. 2010 September; 26(9-3): 388–393.
PMCID: PMC2937223

Lineage-specific expansion of DNA-binding transcription factor families


DNA-binding domains (DBDs) are essential components of sequence-specific transcription factors (TFs). We have investigated the distribution of all known DBDs in more than 500 completely sequenced genomes from the three major superkingdoms (Bacteria, Archaea and Eukaryota) and documented conserved and specific DBD occurrence in diverse taxonomic lineages. By combining DBD occurrence in different species with taxonomic information, we have developed an automatic method for inferring the origins of DBD families and their specific combinations with other protein families in TFs. We found only three out of 131 (2%) DBD families shared by the three superkingdoms.

Phylogenetic analysis of DNA-binding transcription factor families

All sequence-specific transcription factors (TFs) contain DNA-binding domains (DBDs), evolutionary units that mediate the specificity of the TF–DNA interaction. Domain-based analysis of TFs is thus effective functionally as well as phylogenetically. TFs and their binding targets have been under intensive study, and previous key publications on TFs and DBDs tended to focus on specific phylogenetic groups [1–4]. Here, we analyze the distribution of all known DBDs in 538 organisms from superkingdoms Bacteria, Archaea and Eukaryota. TF and DBD classifications were obtained from the DBD database [5], a transcription factor resource that annotates TFs based on the presence of DBDs from a manually curated list. The DBD database predicts TFs in all publicly available genomes from diverse phylogenetic lineages using a single platform, and is thus an ideal resource for exploring the phylogenetic distribution of TF families across the tree of life. We provide an overview of conserved and lineage-specific DBD families, using 131 Pfam domains [6] classified as DBDs to illustrate our findings. Note that what we discuss here for Pfam DBDs applies also to 87 SCOP families [7] classified manually as DBDs by the DBD database (see the supplementary material online for a complete list of genomes and DBD families).

TF DBD families are highly lineage-specific

We previously introduced a heatmap representation to visualise the expansion and contraction of DBD families across different lineages [5] (Figure 1a). Each column of the heatmap corresponds to a DBD family and each row represents a species. Species are ordered according to the NCBI taxonomic tree, an expertly curated taxonomic hierarchy [8]. For each family, the Z-score of the number of TFs containing that DBD family was calculated across species to highlight the organisms in which the family is expanded relative to other species. Orange indicates positive Z-scores, and thus a relative expansion of the DBD in that particular lineage, and blue indicates negative Z-scores, or a contraction. The distinct expansion patterns of different groups of DBDs in prokaryotes and eukaryotes imply that the DBD families are highly specific to the two types of cells: nucleate and anucleate.
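The per-family Z-score behind the heatmap colours can be illustrated with a minimal Python sketch. It assumes raw TF counts per species and a population standard deviation; the text does not state whether counts were normalised first or which standard deviation was used, so both choices, as well as the function name and the counts, are illustrative only.

```python
from statistics import mean, pstdev


def family_z_scores(tf_counts):
    """Z-score each species' TF count within one DBD family.

    tf_counts maps species name -> number of TFs containing the family.
    Returns species -> Z-score (all zeros if the counts do not vary).
    """
    values = list(tf_counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # population SD: an assumption, not from the text
    if sigma == 0:
        return {species: 0.0 for species in tf_counts}
    return {species: (n - mu) / sigma for species, n in tf_counts.items()}


# Hypothetical counts for one DBD family in three species.
counts = {"species_a": 2, "species_b": 4, "species_c": 12}
z = family_z_scores(counts)
```

Here species_c receives a positive Z-score (it would be drawn in orange) and species_a a negative one (blue).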

Figure 1
Lineage-specific expansion patterns of DBD families. (a) The heatmap demonstrates the specific expansion patterns of DBD families between eukaryotic and prokaryotic genomes. Columns correspond to DBD families hierarchically clustered by their occurrence ...

In addition to the heatmap, we have developed a new, simple method for inferring the origin of protein domains. The method combines DBD family occurrence with taxonomic information from the NCBI taxonomy tree to estimate when each DBD family emerged; we term the taxonomic node at which a family is inferred to have emerged its taxonomic limit. The same method is used to estimate when the combinations of DBDs and other protein families in TFs emerged. We also provide the taxonomic conservation density, which is the fraction of species containing the DBD out of the total number of species within the taxonomic clade (see Box 1 for an example of the calculation steps and see the supplementary material online for a complete list of taxonomic limits and conservation densities).

Box 1

Taxonomic limits of DBD families

We have developed an automatic method for inferring the origins of DBD families by combining DBD occurrence in different species with taxonomic information. Although there are similar methods (e.g. Refs [31–33]) that use protein content profiles and species trees to reconstruct evolutionary scenarios, they are not identical with our method and are not used for the same purpose (see the supplementary material online for a detailed discussion).

To obtain a taxonomic limit for a particular DBD family D, we first collected all species that have the DBD family detected in their genomes and computed $T_{D,i}$, the number of TFs containing the family D ($t_{D,i}$) normalised by the number of genes ($G_i$) in species i (Equation I). On the basis of the NCBI taxonomic tree, the last common ancestor (LCA) between each species and all other species that share the DBD of interest was derived. This step was repeated for all possible pairs of species, in an all-against-all fashion (all possible pairs of species i,j that contain family D). For each pair of species i,j, the average number of TFs containing family D ($T_{D,i,j}$) was computed (Equation II).


We defined the frequency fraction of a taxonomic node X ($F_{D,X}$) as the ratio of the sum of normalised TFs containing family D and sharing the LCA at node X ($\sum_{i,j} T_{D,X,i,j}$) over the sum of all normalised TFs containing family D in all taxonomic nodes ($\sum_{i,j} T_{D,i,j}$) (Equation III). We identified the most frequent LCA (the node with the highest frequency fraction) as the taxonomic limit of this DBD family. However, bias due to different numbers of genomes in different branches (e.g. Proteobacteria dominate Bacteria) might decrease the accuracy of taxonomic limit estimation. We corrected the estimation by shifting the taxonomic limit to the parental node if the ratio of the parental node's frequency fraction to the highest frequency fraction was greater than a cut-off of 0.2 (see the supplementary material online for the calibration of the method and cut-off threshold).
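Equations I to III are referenced above but are not reproduced in this text. From the verbal definitions they can be written as below. This is a reconstruction: Equation II in particular assumes that the "average" is the arithmetic mean of the two normalised counts, and $T_{D,X,i,j}$ denotes $T_{D,i,j}$ restricted to the pairs i,j whose LCA is node X.

```latex
\begin{align}
T_{D,i}   &= \frac{t_{D,i}}{G_i} \tag{I} \\
T_{D,i,j} &= \frac{T_{D,i} + T_{D,j}}{2} \tag{II} \\
F_{D,X}   &= \frac{\sum_{i,j} T_{D,X,i,j}}{\sum_{i,j} T_{D,i,j}} \tag{III}
\end{align}
```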

In addition to the taxonomic limit, we calculated the fraction of species containing the DBD over the total number of species under the taxonomic limit. We termed this the taxonomic conservation density. DBD families that emerged from the same speciation event should be detectable in most of the descendant species (taxonomic conservation density close to 1). In contrast, DBDs that are observed sporadically in taxonomically distant lineages (small conservation density) are likely to have been disseminated through horizontal gene transfer or to have gone through multiple gene loss events. Figure I demonstrates how the method operates using a simplified taxonomy tree.
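The steps in this box can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' implementation: the function names and data layout are hypothetical, the pairwise average is taken as an arithmetic mean, and the parental shift is applied once. The toy data follow the scenario in the Figure I caption (one TF per genome in one of 20 eukaryotic and 19 of 20 bacterial genomes), with the added assumption that every genome has the same gene count.

```python
from collections import defaultdict
from itertools import combinations


def lca(lineage_a, lineage_b):
    """Deepest taxon shared by two root-to-leaf lineages (None if none)."""
    shared = None
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        shared = a
    return shared


def taxonomic_limit(lineages, t_norm, cutoff=0.2):
    """Return (limit node, frequency fraction per node) for one DBD family.

    lineages: species -> root-to-leaf list of taxon names (all species).
    t_norm:   species -> normalised TF count, for carrier species only.
    """
    if len(t_norm) < 2:
        raise ValueError("need at least two species carrying the family")
    weight = defaultdict(float)
    for i, j in combinations(sorted(t_norm), 2):
        # Pairwise average of normalised counts, credited to the pair's LCA.
        weight[lca(lineages[i], lineages[j])] += (t_norm[i] + t_norm[j]) / 2
    total = sum(weight.values())
    fraction = {node: w / total for node, w in weight.items()}
    limit = max(fraction, key=fraction.get)
    # Shift to the parent once if its fraction is large relative to the top one.
    parent = {c: p for lin in lineages.values() for p, c in zip(lin, lin[1:])}
    up = parent.get(limit)
    if up is not None and fraction.get(up, 0.0) / fraction[limit] > cutoff:
        limit = up
    return limit, fraction


def conservation_density(lineages, carriers, limit):
    """Fraction of the species under `limit` that carry the family."""
    under = [sp for sp, lin in lineages.items() if limit in lin]
    return sum(sp in carriers for sp in under) / len(under)


# Toy tree: 20 bacterial and 20 eukaryotic genomes under one root.
lineages = {}
for k in range(20):
    lineages[f"bact_{k}"] = ["cellular organisms", "Bacteria", f"bact_{k}"]
    lineages[f"euk_{k}"] = ["cellular organisms", "Eukaryota", f"euk_{k}"]

# Carriers: 19 bacteria and 1 eukaryote, equal normalised counts (assumed).
t_norm = {f"bact_{k}": 1.0 for k in range(19)}
t_norm["euk_0"] = 1.0

limit, fraction = taxonomic_limit(lineages, t_norm)
density = conservation_density(lineages, t_norm, limit)
```

Under these assumptions, 171 of the 190 species pairs share Bacteria as their LCA, so the sketch assigns Bacteria as the taxonomic limit (frequency fraction 0.9) with a conservation density of 19/20; the single eukaryotic occurrence is too weak (0.1/0.9 is below the 0.2 cut-off) to shift the limit up to cellular organisms.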

Using our method, we found that 19 out of 131 (15%) DBDs have cellular organisms as their taxonomic limits (shared by more than one superkingdom). Eleven of these DBDs are shared by Archaea and Bacteria but not Eukaryota, and only three (2%) are shared by all three superkingdoms (Figure 1b). When we applied the same method to all Pfam domains, we observed that 33% have cellular organisms as their taxonomic limit, suggesting that the repertoires of DBD families are more lineage-specific than those of proteins with other functions. This conclusion is in line with the results of an earlier study that used a different method [9].

Uniform expansion pattern of DBD families in prokaryotes

In prokaryotic genomes, helix-turn-helix is by far the commonest DBD structure [1]. The majority of prokaryotic DBDs belong to the winged helix structural class, which might explain the uniform expansion of DBD occurrence observed here. Archaea are thought to be phylogenetically closer to Eukaryota and have more closely related core components of the transcription machinery, such as RNA polymerases and basal TFs [1,4]. Interestingly, our heatmap and taxonomic limit assignments suggest that a greater number of archaeal DBDs are shared with Bacteria than with Eukaryota. Examples of DBDs shared by the two prokaryotic superkingdoms Archaea and Bacteria are Fe_dep_repress (iron-dependent repressor), MarR (antibiotic resistance) and NikR (nickel-responsive regulator). These DBD families regulate specific genes required for adaptation to environmental stress, and might have been established and maintained through multiple horizontal gene transfers [1,10].

The heatmap shown in Figure 1a suggests that the prokaryotic DBD distribution is widespread among the prokaryotic species and there is no clearly distinguishable expansion scheme within the three major bacterial phyla in our dataset: Actinobacteria, Firmicutes and Proteobacteria. Indeed, we find that 30 out of 61 (49%) bacterial DBDs have Bacteria as their taxonomic limit (shared by more than one phylum). These shared DBDs participate in basic carbon source metabolism, e.g. HTH_AraC, LacI and Gnt, as well as in more specific processes, such as FUR (ferric uptake regulator), MerR (mercury resistance) and HTH_8 (virulence gene expression).

Examples of prokaryotic phylum-specific DBDs include WhiB, a DBD specific to Actinobacteria that regulates mycelium formation. The FlhC and FlhD TFs, with Proteobacteria (Gram-negative) as their taxonomic limit, have been shown to be global regulators involved in many cellular processes, including transcriptional activation of the flagellar genes [11]. On the basis of their restricted phylogenetic distribution and their role in flagella regulation, they might be linked to the four-support-ring flagella of Gram-negative bacteria, as opposed to the two-support-ring flagella of Gram-positive bacteria. Additional discussions on lineage-specific DBD families and the biological processes they are implicated in are available in the supplementary material online.

The eukaryotic DBD repertoire is highly specific at the kingdom level

In contrast to the uniform DBD occurrence in Bacteria, Figure 1a shows more distinct expansion patterns among the three main eukaryotic kingdoms: Metazoa (animals), Fungi and Viridiplantae (plants). Indeed, a relatively small proportion (29%) of eukaryotic DBD families have Eukaryota as their taxonomic limit. These eukaryotic families include the zinc finger families, HLH (helix-loop-helix) and bZIPs (basic leucine zippers). In addition, the homeobox family, well known for its role in morphogenesis and animal body development [12], is found throughout eukaryotic organisms, including fungi and plants.

The most notable difference in the Metazoa is between vertebrates and invertebrates. Although the majority of DBDs found in animals are present in both groups, the expansion tends to be less pronounced in the invertebrates. The DBDs with particularly extensive expansion in vertebrates include STAT (signal transduction), T-box (body plan and organogenesis) and p53 (cell cycle arrest and apoptosis). DBDs such as IRF (interferon regulatory factor) and Churchill (neural development) are absent from invertebrates, which might reflect the more elaborate immune and nervous systems in vertebrates. In contrast, the Runt and GCM families regulate fundamental developmental processes in both vertebrates and invertebrates, and are equally expanded in both groups.

Metazoa and Fungi are phylogenetically closer to each other and share more DBD families with each other than with Viridiplantae (see the supplementary material online). In accordance with earlier work [13], we observed a number of fungal-specific DBDs, including Zn2/Cys6 (Zn cluster) and Copper-fist (copper utilisation). Interestingly, the HTH_AraC (arabinose operon regulatory) and FMN (flavin mononucleotide) binding domains are exceptional cases of bacterial DBDs broadly found across Fungi. These families have been shown experimentally to be involved in sugar uptake [14] and sporulation regulation [15] in Bacteria. Their functionality in Fungi has yet to be investigated. Plants possess a number of mainly plant-specific DBDs, such as AP2 (activation of defence genes) and SBP (flowering development).

Apart from the three major kingdoms, we observe an interesting DBD occurrence in the unicellular eukaryote Monosiga brevicollis, a marine choanoflagellate that is thought to be the closest sequenced unicellular relative of animals [16]. Earlier studies showed that the species has a considerable number of signalling components in common with animals [17]. Besides more elaborate signalling machinery, uni- to multicellular transitions also require a greater number of components that contribute to the more complex genetic regulatory networks in functionally diverse cell types [18]. One possible way to enhance regulatory capacity is to recruit novel sets of TFs. We observed DBDs common to the fungi/animal group in M. brevicollis, and many DBDs specific to animals (MB, Figure 1a). Among these DBDs there are families that regulate animal-specific processes such as STAT (signal transduction), p53 (apoptosis) and Tub (nervous system development), as well as those involved in more general pathways like E2F/DP (cell cycle).

In addition, we observed several interesting DBD occurrences in rare protist genomes (Figure 1a). For example, STAT and WRKY, previously reported in dictyostelids [19,20], are also detected in our dataset. We note the occurrence in protists of two DBDs thought to be plant-specific. Apart from AP2, which was detected in Apicomplexa [21], we report for the first time the presence of the zinc finger LSD1 in many Euglenozoa. Our understanding of transcriptional regulation in these protists and the number of their sequenced genomes are, however, still very limited.

Variety in domain architecture adds complexity to TF structures

TFs have DBDs as core components and often contain other protein domains of different functions, which we term partner domains. In Figure 2, we use a network-style representation to provide an overview of the most commonly occurring TF architectures (those occurring in >5% of TFs in each family). Using our taxonomic limit method to infer the origins of the DBD–partner domain combinations, we observed many lineage-specific TF architectures on top of lineage-specific DBDs (different coloured arrows connecting domain nodes in Figure 2).
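The >5% filter on TF architectures can be sketched in Python as follows. The function name, the representation of a TF as an ordered tuple of domain names, and the example counts are all assumptions made for illustration; "Hypothetical_partner" is a made-up domain name.

```python
from collections import Counter, defaultdict


def common_architectures(tfs, dbd_families, min_fraction=0.05):
    """Architectures found in more than `min_fraction` of each family's TFs.

    tfs:          iterable of domain-name tuples in N- to C-terminal order.
    dbd_families: set of domain names classified as DBDs.
    Returns family -> {architecture: fraction of that family's TFs}.
    """
    per_family = defaultdict(Counter)
    for domains in tfs:
        arch = tuple(domains)
        for family in set(arch) & dbd_families:
            per_family[family][arch] += 1
    result = {}
    for family, counter in per_family.items():
        total = sum(counter.values())
        result[family] = {
            arch: n / total
            for arch, n in counter.items()
            if n / total > min_fraction
        }
    return result


# Hypothetical data: 39 TFs with the HTH_1 + LysR substrate-binding
# architecture and one TF with a rare partner domain.
tfs = [("HTH_1", "LysR_substrate")] * 39 + [("HTH_1", "Hypothetical_partner")]
architectures = common_architectures(tfs, {"HTH_1"})
```

In this toy input the rare architecture accounts for 1 of 40 TFs (2.5%) and is dropped, leaving only the HTH_1 plus substrate-binding domain combination.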

Figure 2
Network representation of DBD families and partner domains. (a) Examples of network representation of bacterial TF architectures. DBDs are shown as oblongs in protein chains and as circular nodes in our network representation. Partner domains are shown ...

The combinations of DBDs (circular nodes) and their partner domains (square nodes) in bacterial TFs are mostly (31 out of 44) shared by more than one bacterial phylum (see supplementary material online for a complete bacterial TF architectural network). For instance, HTH_1 (LysR family), the most abundant DBD in prokaryotes, is always located upstream of the LysR substrate-binding domain (Figure 2a). The blue arrow linking the two domains indicates that this architecture is broadly conserved in all Bacteria. A few TF architectures, such as in Fe_dep_repress, are conserved in all Bacteria as well as in Archaea. In agreement with earlier observations [22], we note that bacterial partner domains function predominantly in small molecule binding or two-component signal transduction. Interestingly, we observed that 16 out of 19 phylum-specific DBDs occur in single-domain TFs without a partner, e.g. FlhC. This is possibly because they emerged relatively recently and have not had sufficient time to combine with other domains to form more elaborate architectures.

Specific DBD–partner domain combinations are observed in animals, fungi and plants. The eukaryotic-specific Tub family, for instance, occurs in a single-domain TF in more than 25% of eukaryotic TFs (green border node). It occurs also downstream of the SOCS_box domain only in animals, and co-occurs exclusively with F-box in plants (Figure 2b). This family is absent from Fungi. These findings suggest that some eukaryotic DBDs have gained new regulatory modes by combining with different partner domains in different kingdoms.

Another distinctive feature of eukaryotic TF architectures not found in prokaryotes is the repetition of the same DBD family within a single TF chain (self-looping arrows in Figure 2c). DBD repeats are found in 22 out of 77 (29%) eukaryotic DBDs, mostly in the zinc fingers. Other DBDs in this category include CUT, E2F/DP and Tea. Additionally, AP2, B3 and WRKY are families that exhibit repeats exclusively in plants (yellow self-looping arrows). The function of repeated DBDs in eukaryotic TFs is most likely to boost the specificity and diversity of motif recognition at TF–DNA interfaces by increasing the number of possible DNA-binding sequences from a limited number of DBD families [23].
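Detecting DBD repeats within a single TF chain amounts to counting how often each DBD family appears in the chain's domain list. A minimal Python sketch, in which the function name and the example domain order are hypothetical:

```python
from collections import Counter


def repeated_dbds(domains, dbd_families):
    """Return the DBD families that occur more than once in one TF chain.

    domains:      domain names along the chain, N- to C-terminal.
    dbd_families: set of domain names classified as DBDs.
    """
    counts = Counter(d for d in domains if d in dbd_families)
    return {family for family, n in counts.items() if n > 1}


# Hypothetical chain with three zinc fingers and one partner domain.
chain = ["BTB", "zf-C2H2", "zf-C2H2", "zf-C2H2"]
repeats = repeated_dbds(chain, {"zf-C2H2", "AP2"})
```

A family with at least one such chain would be drawn with a self-looping arrow in a network of the kind shown in Figure 2c.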

The partner domains in eukaryotic TFs have more diverse functions than those in Bacteria, and the commonest function is to mediate protein–protein interaction and dimerisation. This is thought to be important to the formation of composite protein modules, a crucial step towards combinatorial regulation. Examples of these families include BTB, Bromodomain, SAM, ANK and hATC.

Concluding remarks

DBDs are essential to all sequence-specific TFs because they determine the specificity of TF–DNA binding, which in turn governs differential expression and determines physiological diversity in different species across the tree of life. With this analysis of conserved and lineage-specific DBDs and TF architectures, using our new method for inferring taxonomic limits, we contribute new insights into the global picture of the TF repertoire and its evolution. Our findings can facilitate the experimental design of high-throughput studies on transcriptional regulators, e.g. Refs [24–28]. In addition to providing an improved understanding of how different DBD families are related, our taxonomic inference methods can be applied to protein domains other than DBD families.

We demonstrate a limited conservation of DBD families between prokaryotes and eukaryotes. Only 15% of known DBDs have cellular organisms as their taxonomic limit, as opposed to 33% of all Pfam domains. Lineage-specific DBD repertoires can be seen at the eukaryotic kingdom level: only 29% of eukaryotic families have Eukaryota as their taxonomic limit (shared by more than one kingdom). Prokaryotic DBDs are less specific to the major bacterial phyla, with 49% of families being shared by more than one phylum. In addition to the DBDs themselves, the variety of DBD and partner domain combinations adds another level of complexity to TF structures. The specific DBD families and TF architectures in different lineages can be used as signatures for the genetic regulatory circuits in diverse phylogenetic groups. Knowledge of the phylogenetic distribution of DBD families and their domain combinations can improve methods for remote homology detection [29,30] and advance the discovery of new TFs in genomes.

Figure I
Examples of the taxonomic limit and conservation density calculations for a DBD family using a simplified tree of life. Suppose a DBD family is detected in one TF per genome in one out of 20 eukaryotic genomes and 19 out of 20 bacterial genomes. As there ...


We thank Cyrus Chothia, Daniel Hebenstreit, Joseph Marsh, Madan Mohan Babu and Benjamin Lang for critical commentary on the manuscript. This work was funded by the Medical Research Council and a Royal Thai Government Scholarship to V.C.


Appendix A. Supplementary data

Supplementary data associated with this article can be found in the attached Supplementary Material document and from the authors’ project website. doi:10.1016/j.tig.2010.06.004.


1. Aravind L., Koonin E.V. DNA-binding proteins and evolution of transcription regulation in the archaea. Nucleic Acids Res. 1999;27:4658–4670.
2. Riechmann J.L. Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science. 2000;290:2105–2110.
3. Coulson R.M., Ouzounis C.A. The phylogenetic diversity of eukaryotic transcription. Nucleic Acids Res. 2003;31:653–660.
4. Perez-Rueda E. Phylogenetic distribution of DNA-binding transcription factors in bacteria and archaea. Comput. Biol. Chem. 2004;28:341–350.
5. Wilson D. DBD–taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res. 2008;36:D88–D92.
6. Finn R.D. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222.
7. Wilson D. SUPERFAMILY–sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 2009;37:D380–D386.
8. Benson D.A. GenBank. Nucleic Acids Res. 2009;37:D26–D31.
9. Wunderlich Z., Mirny L.A. Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet. 2009;25:434–440.
10. Kunin V. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 2005;15:954–959.
11. Pruss B.M. FlhD/FlhC-regulated promoters analyzed by gene array and lacZ gene fusions. FEMS Microbiol. Lett. 2001;197:91–97.
12. Pavlopoulos A., Akam M. Hox go omics: insights from Drosophila into Hox gene targets. Genome Biol. 2007;8:208.
13. Shelest E. Transcription factors in fungi. FEMS Microbiol. Lett. 2008;286:145–151.
14. Saviola B. Arm-domain interactions in AraC. J. Mol. Biol. 1998;278:539–548.
15. Honjo M. A novel Bacillus subtilis gene involved in negative control of sporulation and degradative-enzyme production. J. Bacteriol. 1990;172:1783–1790.
16. King N. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature. 2008;451:783–788.
17. Pincus D. Evolution of the phospho-tyrosine signaling machinery in premetazoan lineages. Proc. Natl. Acad. Sci. U. S. A. 2008;105:9680–9684.
18. Rokas A. The molecular origins of multicellular transitions. Curr. Opin. Genet. Dev. 2008;18:472–478.
19. Brown J.M., Firtel R.A. Regulation of cell-fate determination in Dictyostelium. Dev. Biol. 1999;216:426–441.
20. Babu M.M. The natural history of the WRKY-GCM1 zinc fingers and the relationship between transcription factors and transposons. Nucleic Acids Res. 2006;34:6505–6520.
21. Balaji S. Discovery of the principal specific transcription factors of Apicomplexa and their implication for the evolution of the AP2-integrase DNA binding domains. Nucleic Acids Res. 2005;33:3994–4006.
22. Martinez-Antonio A. Internal-sensing machinery directs the activity of the regulatory network in Escherichia coli. Trends Microbiol. 2006;14:22–27.
23. Itzkovitz S. Coding limits on the number of transcription factors. BMC Genomics. 2006;7:239.
24. Mukherjee S. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat. Genet. 2004;36:1331–1339.
25. Hallikas O., Taipale J. High-throughput assay for determining specificity and affinity of protein-DNA binding interactions. Nat. Protoc. 2006;1:215–222.
26. Gilad Y. Expression profiling in primates reveals a rapid evolution of human transcription factors. Nature. 2006;440:242–245.
27. Meng X., Wolfe S.A. Identifying DNA sequences recognized by a transcription factor using a bacterial one-hybrid system. Nat. Protoc. 2006;1:30–45.
28. Deplancke B. A gateway-compatible yeast one-hybrid system. Genome Res. 2004;14:2093–2101.
29. Coin L. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc. Natl. Acad. Sci. U. S. A. 2003;100:4516–4520.
30. Coin L. Enhanced protein domain discovery using taxonomy. BMC Bioinformatics. 2004;5:56.
31. Snel B. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Res. 2002;12:17–25.
32. Kunin V., Ouzounis C.A. GeneTRACE-reconstruction of gene content of ancestral species. Bioinformatics. 2003;19:1412–1416.
33. Mirkin B.G. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 2003;3:2.