DNA-binding domains (DBDs) are essential components of sequence-specific transcription factors (TFs). We have investigated the distribution of all known DBDs in more than 500 completely sequenced genomes from the three major superkingdoms (Bacteria, Archaea and Eukaryota) and documented conserved and specific DBD occurrence in diverse taxonomic lineages. By combining DBD occurrence in different species with taxonomic information, we have developed an automatic method for inferring the origins of DBD families and their specific combinations with other protein families in TFs. We found only three out of 131 (2%) DBD families shared by the three superkingdoms.
All sequence-specific transcription factors (TFs) contain DNA-binding domains (DBDs), evolutionary units that mediate the specificity of the TF–DNA interaction. Domain-based analysis of TFs is thus informative both functionally and phylogenetically. TFs and their binding targets have been under intensive study, but previous key publications on TFs and DBDs tended to focus on specific phylogenetic groups [1–4]. Here, we analyse the distribution of all known DBDs in 538 organisms from the superkingdoms Bacteria, Archaea and Eukaryota. TF and DBD classifications were obtained from the DBD database, a transcription factor resource that annotates TFs based on the presence of DBDs from a manually curated list. The DBD database predicts TFs in all publicly available genomes from diverse phylogenetic lineages using a single platform, and is thus an ideal resource for exploring the phylogenetic distribution of TF families across the tree of life. We provide an overview of conserved and lineage-specific DBD families, using 131 Pfam domains classified as DBDs to illustrate our findings. Note that what we discuss here for Pfam DBDs also applies to 87 SCOP families classified manually as DBDs by the DBD database (see the supplementary material online for a complete list of genomes and DBD families).
We previously introduced a heatmap representation to visualise the expansion and contraction of DBD families across different lineages (Figure 1a). Each column of the heatmap corresponds to a DBD family and each row represents a species. Species are ordered according to the NCBI taxonomic tree, an expertly curated taxonomic hierarchy. For each family, the Z-score of the number of TFs containing that DBD family was calculated to highlight the organisms in which the family is expanded relative to other species. Orange indicates positive Z-scores, and thus a relative expansion of the DBD in that particular lineage, and blue indicates negative Z-scores, or a contraction. The distinct expansion patterns of different groups of DBDs in prokaryotes and eukaryotes imply that DBD families are highly specific to the two types of cells: nucleate and anucleate.
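The Z-score step behind the heatmap colours can be sketched as follows. This is a minimal illustration, not the pipeline used for Figure 1a: the function name, the species labels and the counts are hypothetical, and the use of the population standard deviation on raw TF counts is an assumption, since the text does not state which variant was applied.

```python
from statistics import mean, pstdev


def family_z_scores(tf_counts):
    """Z-score of TF counts for one DBD family across species.

    tf_counts maps species name -> number of TFs containing the family.
    """
    values = list(tf_counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # No variation across species: nothing is expanded or contracted.
        return {species: 0.0 for species in tf_counts}
    return {species: (n - mu) / sigma for species, n in tf_counts.items()}


# Made-up counts for a single DBD family in three hypothetical species.
counts = {"species_a": 2, "species_b": 4, "species_c": 6}
z = family_z_scores(counts)

# Positive scores would be drawn in orange (expansion), negative in blue.
for species, score in z.items():
    print(species, round(score, 3))
```

One column of the heatmap corresponds to one call of this function; repeating it for every family yields the full matrix.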
In addition to the heatmap, we have developed a new simple method for inferring the origin of protein domains. By combining DBD family occurrence with taxonomic information from the NCBI taxonomy tree, we demonstrate that the method is able to estimate when each DBD family emerged. We term this the taxonomic limit. The same method is used to estimate when the combinations of DBDs and other protein families in TFs emerged. We provide the taxonomic conservation density, which is the fraction of species containing the DBD out of the total number of species within taxonomic clades (see Box 1 for an example of the calculation steps and see the supplementary material online for a complete list of taxonomic limits and conservation densities).
We have developed an automatic method for inferring the origins of DBD families by combining DBD occurrence in different species with taxonomic information. Although there are similar methods (e.g. Refs [31–33]) that use protein content profiles and species trees to reconstruct evolutionary scenarios, they are not identical with our method and are not used for the same purpose (see the supplementary material online for a detailed discussion).
To obtain a taxonomic limit for a particular DBD family D, we first collected all species that have the DBD family detected in their genomes and computed $T_{D,i}$, the number of TFs containing the family D ($t_{D,i}$) normalised by the number of genes ($G_i$) in species i (Equation I). On the basis of the NCBI taxonomic tree, the last common ancestor (LCA) between each species and all other species that share the DBD of interest was derived. This step was repeated for all possible pairs of species, in an all-against-all fashion (all possible pairs of species i,j that contain family D). For each pair of species i,j, the average number of TFs containing family D ($T_{D,i,j}$) was computed (Equation II).
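Read from the verbal definitions above, Equations I and II presumably take the following form. This is a reconstruction from the prose; in particular, the arithmetic mean in Equation II is an assumption based on the word "average".

```latex
% Equation I: TF count for family D in species i, normalised by genome size
T_{D,i} = \frac{t_{D,i}}{G_i} \tag{I}

% Equation II: pairwise average for species i and j (arithmetic mean assumed)
T_{D,i,j} = \frac{T_{D,i} + T_{D,j}}{2} \tag{II}
```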
We defined the frequency fraction of a taxonomic node X ($F_{D,X}$) as the ratio of the sum of normalised TFs containing family D over all species pairs sharing the LCA at node X ($\sum_{\mathrm{LCA}(i,j)=X} T_{D,i,j}$) to the sum of all normalised TFs containing family D over all taxonomic nodes ($\sum_{i,j} T_{D,i,j}$) (Equation III). We identified the most frequent LCA (the node with the highest frequency fraction) as the taxonomic limit of this DBD family. However, the bias due to different numbers of genomes in different branches (e.g. Proteobacteria dominate Bacteria) might decrease the accuracy of taxonomic limit estimation. We corrected the estimation by shifting the taxonomic limit to the parental node if the ratio of the parental node's frequency fraction to the highest frequency fraction was greater than a cut-off of 0.2 (see the supplementary material online for the calibration of the method and cut-off threshold).
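The steps described in this box can be sketched in a few lines of Python. This is an illustrative reading of the prose, not the original implementation. The function names, species labels and counts are hypothetical. It assumes that each species is described by a root-to-leaf list of taxonomic nodes sharing a common root, that the pairwise value is the arithmetic mean, and that the correction shifts the limit by a single level.

```python
from collections import defaultdict
from itertools import combinations


def last_common_ancestor(lineage_a, lineage_b):
    """Deepest node shared by two root-to-leaf lineages (None if no shared root)."""
    lca = None
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        lca = a
    return lca


def parent_of(node, lineages):
    """Parent of a node, read from any lineage that contains it."""
    for lineage in lineages.values():
        if node in lineage:
            idx = lineage.index(node)
            return lineage[idx - 1] if idx > 0 else None
    return None


def taxonomic_limit(tf_counts, gene_counts, lineages, cutoff=0.2):
    """Return (limit_node, frequency_fractions) for one DBD family."""
    # Normalise TF counts by the number of genes in each genome.
    normalised = {s: tf_counts[s] / gene_counts[s] for s in tf_counts}

    # Accumulate the pairwise averages at the LCA of every species pair.
    weight_at = defaultdict(float)
    for i, j in combinations(sorted(normalised), 2):
        lca = last_common_ancestor(lineages[i], lineages[j])
        weight_at[lca] += (normalised[i] + normalised[j]) / 2

    total = sum(weight_at.values())
    if total == 0:
        return None, {}  # assumption: undefined with fewer than two species

    fraction_at = {node: w / total for node, w in weight_at.items()}
    best = max(fraction_at, key=fraction_at.get)

    # Correction for uneven sampling: shift one level up if the parent's
    # fraction is large enough relative to the highest fraction.
    parent = parent_of(best, lineages)
    if parent is not None and fraction_at.get(parent, 0.0) / fraction_at[best] > cutoff:
        best = parent
    return best, fraction_at


# Toy data: three species from one phylum and one from another.
lineages = {
    "proteo_1": ["cellular organisms", "Bacteria", "Proteobacteria"],
    "proteo_2": ["cellular organisms", "Bacteria", "Proteobacteria"],
    "proteo_3": ["cellular organisms", "Bacteria", "Proteobacteria"],
    "firmi_1": ["cellular organisms", "Bacteria", "Firmicutes"],
}
tf_counts = {"proteo_1": 10, "proteo_2": 10, "proteo_3": 10, "firmi_1": 2}
gene_counts = {s: 1000 for s in lineages}

limit, fraction_at = taxonomic_limit(tf_counts, gene_counts, lineages)
print(limit, fraction_at)
```

In this toy case the over-sampled phylum receives the highest frequency fraction (0.625), but the parental node still holds 0.6 of that value, which exceeds the 0.2 cut-off, so the limit moves up to Bacteria.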
In addition to the taxonomic limit, we calculated the fraction of species containing the DBD over the total number of species under the taxonomic limit. We termed this the taxonomic conservation density. DBD families that emerged from the same speciation event should be detectable in most of the children species (taxonomic conservation density close to 1). In contrast, the DBDs that are observed sporadically in taxonomically distant lineages (small conservation density), are likely to have been disseminated through horizontal gene transfer or have gone through multiple gene loss events. Figure I demonstrates how the method operates using a simplified taxonomy tree.
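The conservation density itself is a simple ratio. The sketch below assumes the same hypothetical lineage representation (one root-to-leaf list of taxonomic nodes per species); the species labels and the presence set are invented for illustration.

```python
def conservation_density(limit_node, lineages, species_with_dbd):
    """Fraction of species under limit_node whose genome contains the DBD."""
    under_limit = [s for s, lineage in lineages.items() if limit_node in lineage]
    if not under_limit:
        raise ValueError(f"no species found under {limit_node!r}")
    present = sum(1 for s in under_limit if s in species_with_dbd)
    return present / len(under_limit)


# Toy taxonomy: four bacterial species and one archaeal species.
lineages = {
    "bact_1": ["cellular organisms", "Bacteria", "Proteobacteria"],
    "bact_2": ["cellular organisms", "Bacteria", "Proteobacteria"],
    "bact_3": ["cellular organisms", "Bacteria", "Firmicutes"],
    "bact_4": ["cellular organisms", "Bacteria", "Actinobacteria"],
    "arch_1": ["cellular organisms", "Archaea", "Euryarchaeota"],
}
species_with_dbd = {"bact_1", "bact_2", "bact_3"}

density = conservation_density("Bacteria", lineages, species_with_dbd)
print(density)  # 3 of the 4 species under Bacteria carry the DBD
```

A value near 1 is what one would expect for a family inherited from a single ancestral event, whereas a low value would point to sporadic occurrence.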
Using our method, we found that 19 out of 131 (15%) DBDs have cellular organisms as their taxonomic limits (shared by more than one superkingdom). Eleven of these DBDs are shared by Archaea and Bacteria but not Eukaryota, and only three (2%) are shared by all three superkingdoms (Figure 1b). When we applied the same method to all Pfam domains, we observed that 33% have cellular organisms as taxonomic limits, suggesting that the repertoires of DBD families are more lineage-specific than proteins with other functions. This conclusion is in line with the results of an earlier study that used a different method.
In prokaryotic genomes, helix-turn-helix is by far the most common DBD structure. The majority of prokaryotic DBDs belong to the winged helix structural class, which might explain the uniform expansion of DBD occurrence observed here. Archaea are thought to be phylogenetically closer to Eukaryota and have more closely related core components of transcription machinery, such as RNA polymerases and basal TFs [1,4]. Interestingly, our heatmap and taxonomic limit assignments suggest that a greater number of archaeal DBDs are shared with Bacteria than with Eukaryota. Examples of DBDs shared by the two prokaryotic superkingdoms Archaea and Bacteria are Fe_dep_repress (iron-dependent repressor), MarR (antibiotic resistance) and NikR (nickel-responsive regulator). These DBD families regulate specific genes required for adaptation to environmental stress, and might have been established and maintained through multiple horizontal gene transfers [1,10].
The heatmap shown in Figure 1a suggests that the prokaryotic DBD distribution is widespread among the prokaryotic species and there is no clearly distinguishable expansion scheme within the three major bacterial phyla in our dataset: Actinobacteria, Firmicutes and Proteobacteria. Indeed, we find that 30 out of 61 (49%) bacterial DBDs have Bacteria as their taxonomic limit (shared by more than one phylum). These shared DBDs participate in basic carbon source metabolism, e.g. HTH_AraC, LacI and Gnt, as well as in more specific processes, such as FUR (ferric uptake regulator), MerR (mercury resistance) and HTH_8 (virulence gene expression).
Examples of prokaryotic phylum-specific DBDs include WhiB, a DBD specific to Actinobacteria that regulates mycelium formation. The FlhC and FlhD TFs, with Proteobacteria (Gram-negative) as their taxonomic limit, have been shown to be global regulators involved in many cellular processes, including transcriptional activation of flagellar genes. On the basis of their restricted phylogenetic distribution and flagella regulation, they might be linked to the four-support-ring flagella of Gram-negative bacteria, as opposed to the two-support-ring flagella of Gram-positive bacteria. Additional discussions on lineage-specific DBD families and the biological processes they are implicated in are available in the supplementary material online.
In contrast to the uniform DBD occurrence in Bacteria, Figure 1a shows more distinct expansion patterns among the three main eukaryotic kingdoms: Metazoa (animals), Fungi and Viridiplantae (plants). Indeed, a relatively small proportion (29%) of eukaryotic DBD families have Eukaryota as their taxonomic limit. These eukaryotic families include the zinc finger families, HLH (helix-loop-helix) and bZIPs (basic leucine zippers). In addition, the homeobox family, well known for its role in morphogenesis and animal body development, is found throughout eukaryotic organisms, including fungi and plants.
The most notable difference in the Metazoa is between vertebrates and invertebrates. Although the majority of DBDs found in animals are present in both groups, the expansion tends to be less pronounced in the invertebrates. The DBDs with particularly extensive expansion in vertebrates include STAT (signal transduction), T-box (body plan and organogenesis) and p53 (cell cycle arrest and apoptosis). DBDs such as IRF (interferon regulator factor) and Churchill (neural development) are absent from invertebrates, which might reflect the more elaborate immune and nervous systems in vertebrates. In contrast, the Runt and GCM families regulate fundamental developmental processes in both vertebrates and invertebrates, and are equally expanded in both groups.
Metazoa and Fungi are phylogenetically closer to each other than to Viridiplantae, and share more DBD families with each other than with Viridiplantae (see the supplementary material online). In accordance with earlier work, we observed a number of fungal-specific DBDs, including Zn2/Cys6 (Zn cluster) and Copper-fist (copper utilisation). Interestingly, HTH_AraC (arabinose operon regulatory) and FMN (flavin mononucleotide) binding domains are exceptional cases of bacterial DBDs broadly found across Fungi. These families have been shown experimentally to be involved in sugar uptake and sporulation regulation in Bacteria. Their functionality in Fungi has yet to be investigated. Plants possess a number of mainly plant-specific DBDs, such as AP2 (activation of defence genes) and SBP (flowering development).
Apart from the three major kingdoms, we observe an interesting DBD occurrence in the unicellular eukaryote Monosiga brevicollis, a marine choanoflagellate that is thought to be the closest sequenced unicellular relative of animals. Earlier studies showed that the species contains a considerable number of signalling components in common with animals. Besides the more elaborate signalling machineries, uni- to multicellular transitions also require a greater number of components that contribute to the more complex genetic regulatory networks in functionally diverse cell types. One possible way to enhance the regulation capacity is by recruiting novel sets of TFs. We observed DBDs common to the fungi/animal group in M. brevicollis, and many DBDs specific to animals (MB, Figure 1a). Among these DBDs there are families that regulate animal-specific processes such as STAT (signal transduction), p53 (apoptosis) and Tub (nervous system development), as well as those involved in more general pathways like E2F/DP (cell cycle).
In addition, we observed several interesting DBD occurrences in rare protist genomes (Figure 1a). For example, STAT and WRKY were previously detected in Dictyostelid [19,20] and are also detected in our dataset. We note the occurrence in protists of two DBDs thought to be plant-specific. Apart from AP2, which was detected in apicomplexa, we discovered a rare presence of the zinc finger LSD1 in many euglenozoa for the first time. Our understanding of transcriptional regulation and the number of sequenced genomes in these protists are, however, still very limited.
TFs have DBDs as core components and often contain other protein domains of different functions, which we term partner domains. In Figure 2, we use a network-style representation to provide an overview of the most commonly occurring TF architectures (those occurring in >5% of TFs in each family). Using our taxonomic limit method to infer the origins of the DBD–partner domain combinations, we observed many lineage-specific TF architectures on top of lineage-specific DBDs (different coloured arrows connecting domain nodes in Figure 2).
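The 5% occurrence filter can be illustrated with a short sketch. It assumes each TF is represented by its DBD family and an ordered tuple of domain names; the function name, the domain labels and the counts below are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def common_architectures(tfs, min_fraction=0.05):
    """Keep, per DBD family, architectures found in more than min_fraction of its TFs.

    tfs is an iterable of (dbd_family, architecture) pairs, where architecture
    is a sequence of domain names in N- to C-terminal order.
    """
    per_family = defaultdict(Counter)
    for family, architecture in tfs:
        per_family[family][tuple(architecture)] += 1

    kept = {}
    for family, counts in per_family.items():
        total = sum(counts.values())
        kept[family] = {
            arch: n / total for arch, n in counts.items() if n / total > min_fraction
        }
    return kept


# Invented example: 100 TFs of one family with three distinct architectures.
tfs = (
    [("HTH_1", ("HTH_1", "LysR_substrate"))] * 95
    + [("HTH_1", ("HTH_1",))] * 4
    + [("HTH_1", ("HTH_1", "hypothetical_partner"))] * 1
)
result = common_architectures(tfs)
print(result)
```

Only the first architecture passes the threshold here; the two rare ones (4% and 1%) would not appear as edges in a network drawn from `result`.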
The combinations between DBDs (circular nodes) and their partner domains (square nodes) in bacterial TFs are mostly (31 out of 44) shared by more than one bacterial phylum (see supplementary material online for a complete bacterial TF architectural network). For instance, HTH_1 (lysR family), the most abundant DBD in prokaryotes, is always located upstream of the LysR substrate-binding domain (Figure 2a). The blue arrow linking the two domains indicates that this architecture is broadly conserved in all Bacteria. A few TF architectures, such as in Fe_dep_repress, are conserved in all Bacteria as well as in Archaea. In agreement with earlier observations, we note that bacterial partner domains function predominantly in small molecule binding or two-component signal transduction. Interestingly, we observed that 16 out of 19 phylum-specific DBDs occur in single-domain TFs without a partner, e.g. FlhC. This is possibly because they emerged relatively recently and have not had sufficient time to combine with other domains to form more elaborate architectures.
Specific DBD–partner domain combinations are observed in animals, fungi and plants. The eukaryotic-specific Tub family, for instance, occurs in a single-domain TF in more than 25% of eukaryotic TFs (green border node). It occurs also downstream of the SOCS_box domain only in animals, and co-occurs exclusively with F-box in plants (Figure 2b). This family is absent from Fungi. These findings suggest that some eukaryotic DBDs have gained new regulatory modes by combining with different partner domains in different kingdoms.
Another distinctive feature of eukaryotic TF architectures not found in prokaryotes is the repetition of the same DBD family within a single TF chain (self-looping arrows in Figure 2c). DBD repeats are found in 22 out of 77 (29%) eukaryotic DBDs, mostly in the zinc fingers. Other DBDs in this category include CUT, E2F/DP and Tea. Additionally, AP2, B3 and WRKY are families that exhibit repeats exclusively in plants (yellow self-looping arrows). The function of repeated DBDs in eukaryotic TFs is most likely to boost the specificity and diversity of motif recognition at TF–DNA interfaces by increasing the number of possible DNA-binding sequences from a limited number of DBD families.
The partner domains in eukaryotic TFs have more diverse functions than those in Bacteria, and the commonest function is to mediate protein–protein interaction and dimerisation. This is thought to be important to the formation of composite protein modules, a crucial step towards combinatorial regulation. Examples of these families include BTB, Bromodomain, SAM, ANK and hATC.
DBDs are essential to all sequence-specific TFs because they regulate the specificity of TF–DNA binding, which in turn governs differential expression and determines physiological diversity in different species across the tree of life. With this analysis of conserved and lineage-specific DBDs, and TF architectures using our new method for inferring taxonomic limits, we contribute new insights into the global picture of the TF repertoire and its evolution. Our findings can facilitate the experimental design of high-throughput studies on transcriptional regulators, e.g. Refs [24–28]. In addition to providing an improved understanding of how different DBD families are related, our taxonomic inference methods can be applied to other protein domains apart from DBD families.
We demonstrate a limited conservation of DBD families between prokaryotes and eukaryotes. Only 15% of known DBDs have cellular organisms as their taxonomic limit, as opposed to 33% of all Pfam domains. Lineage-specific DBD repertoires can be seen at the eukaryotic kingdom level: only 29% of eukaryotic families are shared by more than one kingdom. Prokaryotic DBDs are less specific to the major bacterial phyla, with 49% of families being shared. Beyond the DBDs themselves, the variety of DBD and partner domain combinations adds another level of complexity to TF structures. The specific DBD families and TF architectures in different lineages can be used as signatures for the genetic regulatory circuits in diverse phylogenetic groups. Knowledge of the phylogenetic distribution of DBD families and their domain combinations can improve methods for remote homology detection [29,30] and advance the discovery of new TFs in genomes.
We thank Cyrus Chothia, Daniel Hebenstreit, Joseph Marsh, Madan Mohan Babu and Benjamin Lang for critical commentary on the manuscript. This work was funded by the Medical Research Council and a Royal Thai Government Scholarship to V.C.