Since the first whole-genome sequence, that of the pathogenic bacterium Haemophilus influenzae, was completed in 1995, the number of completely sequenced prokaryotic genomes has been increasing rapidly, with a doubling time of ~20 months for bacteria and ~34 months for archaea (15). Owing to the abundance of publicly available prokaryotic genomes, a large number of whole-genome TF studies have focused on these organisms. Aravind and Koonin (16) published one of the earlier analyses of the repertoire of TF families in four complete archaeal genomes. Using sequence profile methods in conjunction with protein structure information, they presented the intriguing finding that the majority of archaeal DBDs had helix-turn-helix (HTH) structures similar to bacterial HTH domains. This contrasts with the core components of the archaeal transcriptional machinery, such as basal TFs and RNA polymerases, which are more closely related to eukaryotic systems. A more recent study by Coulson and coworkers confirmed this finding (17). Since then, similar analyses have been conducted by different groups on larger sets of prokaryotic species.
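As a rough illustration of what the quoted doubling times imply, the growth can be sketched under an idealized exponential model. The model and the symbols N(t) (number of sequenced genomes after t months), N_0 (starting number) and T_d (doubling time) are assumptions introduced here for illustration; they are not taken from the cited study.

```latex
N(t) = N_0 \cdot 2^{\,t/T_d}
\qquad\Rightarrow\qquad
\frac{N(60)}{N_0} = 2^{60/20} = 8 \ \text{(bacteria)}, \qquad
\frac{N(60)}{N_0} = 2^{60/34} \approx 3.4 \ \text{(archaea)}
```

Under this assumption, five years (60 months) would bring roughly an eightfold increase in bacterial genomes but only about a 3.4-fold increase in archaeal genomes.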
Perez-Rueda et al. addressed the distribution of 75 TF families across 90 prokaryotes based on the well-characterized set of TFs in E. coli K12. Because the reference TFs were taken from one bacterial species, the predicted TFs were restricted to close homologues of TFs found in E. coli. Similarly, Minezaki et al. classified TFs from 154 complete prokaryotic genomes into 52 TF families. Their TF families were collected from TFs found in eight different archaea and bacteria, with additional DBDs documented in Pfam (20). This reference TF set was therefore likely to detect a wider variety of TF homologues across prokaryotic proteins. Despite the different criteria used to construct the reference TF sets, both studies confirmed the predominance of HTH DBDs in prokaryotes, especially the winged HTHs. They also demonstrated a significant depletion of TF families in intracellular pathogenic and endosymbiotic bacteria, including Mycoplasma. These pathogenic life forms normally inhabit hosts whose environment exerts little selective pressure to maintain the specific genes needed to respond to environmental stress. Other groups considered more restricted bacterial lineages, including Moreno-Campuzano et al. and Brune et al., whose studies provided comprehensive lists of TF repertoires in firmicutes and corynebacteria, respectively.
Baker’s yeast Saccharomyces cerevisiae was the first eukaryotic species to have its genome completely sequenced. The paper describing the whole-genome sequencing of baker’s yeast (23) was published in 1996, only shortly after the first prokaryotic genome, that of H. influenzae. The number of completely sequenced eukaryotic genomes, however, has been increasing markedly more slowly than that of prokaryotic genomes. This is likely due to the combination of the larger average size of eukaryotic genomes and the difficulty of assembling and annotating genomes that contain large amounts of repetitive and non-coding elements (24). Nonetheless, an increasing number of studies of genomic TF repertoires are being conducted using complete eukaryotic genomes.
Riechmann et al. surveyed specific TF families occurring in four eukaryotic genomes: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster and S. cerevisiae. They demonstrated that a number of DBD families are shared across all three major eukaryotic kingdoms, i.e. Metazoa (animals), Fungi and Viridiplantae (plants), but that the combinations of DBDs with other domains in TFs are highly kingdom-specific. According to Coulson and Ouzounis (26), each eukaryotic kingdom possesses not only the families common to all eukaryotes, but also a number of kingdom-specific transcriptional regulators, which possibly participate in kingdom-specific processes. Other studies focused on particular eukaryotic kingdoms. In plants, Shiu et al. pointed out that not only were the TF families more diverse than in fungi and animals, but their expansion and duplication rates were also considerably greater. This suggests a more frequent adaptive response to selection pressure among plants, since they cannot move to avoid stress stimuli as other eukaryotes can. More recent work by Shelest (28) concentrated on TFs in fungi, reporting 37 TF families in 62 fungal species, of which only six families were fungal-specific. Parasitic protists such as apicomplexans and ciliates are phylogenetically distant from animals, fungi and plants; their genomes are known to be substantially divergent from those of current model eukaryotes and are thus less well understood. Iyer and coworkers (29) were the first to investigate the repertoires of TFs and chromatin proteins in these parasitic unicellular eukaryotes.
In the Metazoa (animal) kingdom, TFs are particularly essential to the morphological development of animals’ organ systems. Messina et al. compiled one of the first lists of metazoan TFs by focusing on human, aiming to produce a starting point for array experiments across species. Taking known TFs from TRANSFAC (31), InterPro (32) and FlyBase (33) as seed sequences, they discovered additional human TFs using hidden Markov model (HMM) searches, followed by manual curation. As part of the initiative to characterize the transcription regulatory network in mammalian cells, the International Regulome Consortium (IRC) has put together a comprehensive list of mouse TFs by mapping cDNA sequences from several libraries to the NCBI mouse genome. More recently, Vaquerizas et al. manually compiled a human TF repertoire and analyzed the expression patterns and evolutionary conservation of its members. These studies on mammalian TFs will contribute to a better understanding of gene expression control in higher organisms.
In summary, several key publications mentioned here highlight the importance of TFs in the development and maintenance of cellular phenotypes in different kinds of organisms. These genome-wide studies provide a starting point for a systematic comparative analysis of genomic TF repertoires in both closely and distantly related genomes.