Viral DNA-binding proteins have served as good models to study the biochemistry of transcription regulation and chromatin dynamics. Computational analysis of viral DNA-binding regulatory proteins and identification of their previously undetected homologs encoded by cellular genomes might lead to a better understanding of their function and evolution in both viral and cellular systems.
The phyletic range and the conserved DNA-binding domains of the viral regulatory proteins of the poxvirus D6R/N1R and baculoviral Bro protein families have not been previously defined. Using computational analysis, we show that the amino-terminal module of the D6R/N1R proteins defines a novel, conserved DNA-binding domain (the KilA-N domain) that is found in a wide range of proteins of large bacterial and eukaryotic DNA viruses. The KilA-N domain is suggested to be homologous to the fungal DNA-binding APSES domain. We provide evidence for the KilA-N and APSES domains sharing a common fold with the nucleic acid-binding modules of the LAGLIDADG nucleases and the amino-terminal domains of the tRNA endonuclease. The amino-terminal module of the Bro proteins is another, distinct DNA-binding domain (the Bro-N domain) that is present in proteins whose domain architectures parallel those of the KilA-N domain-containing proteins. A detailed analysis of the KilA-N and Bro-N domains and the associated domains points to extensive domain shuffling and lineage-specific gene family expansion within DNA virus genomes.
We define a large class of novel viral DNA-binding proteins and their cellular homologs and identify their domain architectures. On the basis of phyletic pattern analysis we present evidence for a probable viral origin of the fungus-specific cell-cycle regulatory transcription factors containing the APSES DNA-binding domain. We also demonstrate the extensive role of lineage-specific gene expansion and domain shuffling, within a limited set of approximately 24 domains, in the generation of the diversity of virus-specific regulatory proteins.
Large DNA viruses of bacteria and eukaryotes have complex life cycles with several distinct phases that involve diverse virus-host interactions. An array of regulatory systems mediates activation or repression of expression of specific batteries of viral genes that are required at different phases of the life cycle. Other sets of regulatory genes directly interact with components of the host cell and modulate its response to the virus [1,2,3]. Studies on these viral regulatory systems have revealed a pivotal role of transcription and chromatin organization in the control of gene expression and have contributed to the basic understanding of these processes in various model systems [1,3]. Several viral DNA-binding regulators have conserved domains that are shared with cellular transcription factors. The classic helix-turn-helix (HTH)-domain proteins, which govern the switch between the lysogenic and lytic pathways of temperate bacteriophages, and repressors containing the MetJ/Arc domain are well-known examples of such regulatory DNA-binding proteins in prokaryotic virus-host systems [4,5,6,7,8]. Large eukaryotic DNA viruses, such as poxviruses, phycodnaviruses, phaeoviruses, asfarviruses, iridoviruses (all of which form the recently identified monophyletic clade of nucleo-cytoplasmic large DNA viruses (NCLDVs)), baculoviruses and herpesviruses, also encode a number of transcription factors. Some of these share domains with regulatory proteins of their eukaryotic hosts, for example the FCS zinc finger and the TFIIS-like zinc ribbon.
Much less is known of the domain architecture and evolutionary history of those viral DNA-binding regulatory proteins whose cellular homologs have not (yet) been identified. For example, the baculovirus repeat ORFs (Bro) proteins are a family of DNA-binding proteins that may regulate both viral and host transcription or chromatin structure, but no homologs, cellular or otherwise, have been identified for these proteins [10,11]. The regulatory proteins of the poxviral variola D6R/Shope fibroma virus N1R family have been shown to bind DNA and probably regulate apoptosis of the host cell. All these proteins contain a conserved amino-terminal domain, for which no homologs outside this family have as yet been reported. These proteins additionally contain a carboxy-terminal RING finger domain, and some of them also contain a single-stranded nucleic-acid-binding CCCH domain between the amino-terminal domain and the RING finger.
During the comparative analysis of DNA viruses of the NCLDV class, we observed that the amino-terminal domains of the D6R/N1R-family and baculoviral Bro-family proteins have a wide range of homologs in both DNA viruses and cellular genomes. Typically, these domains were identified in multidomain proteins that additionally contained several previously undetected, evolutionarily mobile domains occurring in different contexts. These observations suggested the existence of a common, previously uncharacterized set of regulatory proteins (domains) encoded by numerous eukaryotic and bacterial DNA viruses, as well as some of their host genomes.
To gain a better understanding of the functions and evolution of these regulatory proteins, we initiated a detailed analysis of their sequences using position-specific-score matrix searches, sequence-structure threading, secondary-structure prediction and structure comparisons. Here we describe the functional predictions and evolutionary history of these viral regulatory proteins and their cellular homologs that were detected as a result of these analyses. The comparison of the domain architectures of these proteins points to a major general role for combinatorial shuffling of a small set of domains in the evolution of transcriptional regulators of eukaryotic and bacterial DNA viruses.
To determine the evolutionary affinities of the D6R/N1R amino-terminal regions, we initiated a PSI-BLAST search of the non-redundant (NR) protein database (National Center for Biotechnology Information), which was seeded with the sequence of the corresponding region from the variola virus D6R protein. This search not only recovered homologous proteins from almost all poxviral proteomes, but also previously undetected homologs from the Chilo iridescent virus, a variety of γ-proteobacterial temperate phages (such as KilA of the phage BPP1), and chromosome-encoded proteins from Neisseria meningitidis, Xylella fastidiosa, Salmonella paratyphi and Clostridium difficile (Table 1, Figure 1a). The presence of this conserved region in the amino terminus of the BPP1 KilA protein, which is involved in killing the host cells, with another distinct conserved carboxy-terminal region (Figure 2), suggests that this region is a mobile domain that is present in different proteins in independent contexts. Accordingly, this domain was named the KilA-N (terminal) domain (Table 2). In all proteins shown to contain the KilA-N domain, it occurs at the extreme amino terminus accompanied by a wide range of distinct carboxy-terminal domains other than the RING finger or CCCH domains that are seen in the poxviruses (Figure 2 and see below).
PSI-BLAST searches seeded with the KilA-N domains from a diverse set of viral and bacterial proteins also consistently recovered fungal transcription factors that are involved in cell-cycle-specific gene expression and filamentation, with E-values of borderline statistical significance (0.05-0.15). These hits precisely mapped to the APSES DNA-binding domain, which is shared by a range of fungus-specific transcription factors, such as MBP1p, SWI4p, PHD1p and SOK2p from Saccharomyces cerevisiae and Stunted A from Aspergillus nidulans. For example, a search initiated with the KilA-N domain of Pseudomonas phage D3 Orf11 (gi:9635595) recovered the APSES domain of Saccharomyces MBP1p with an E-value of 0.05 in iteration 3. Reciprocal searches with the profiles of APSES domains similarly recovered the KilA-N domain with borderline E-values. As an example, a profile seeded with the MBP1p APSES domain and including all APSES domains in the NR database detects the N. meningitidis protein NMA1544 (gi:11290039) with an E-value of 0.1 in iteration 3. Given the availability of the three-dimensional structure of the APSES domain of the MBP1p protein [16,17], we investigated this potential relationship using the KilA-N domain sequences for sequence-structure threading of the PDB database with the 3D-PSSM, PSIPRED and combined-fold prediction algorithms. Both 3D-PSSM and PSIPRED gave hits (E-value approximately 0.05 for 3D-PSSM and probability of matching approximately 0.8 for PSIPRED) that implied a 90% certainty of the KilA-N domain adopting the same fold as the APSES domain (PDB 1bm8/1bm1). Additionally, the 3D-PSSM threading suggested that the LAGLIDADG endonuclease domain (E-value 0.2; approximately 80% certainty) shared a common fold with the KilA-N proteins. 
Threading with the combined-fold prediction algorithm also gave the MBP1p APSES domain as the best hit with a very high Z-score (Z = 60), suggesting that the KilA-N domain was highly likely to adopt the same fold as the APSES domain.
Secondary-structure prediction using the Jpred method, with a multiple alignment of the KilA-N domain used as the input, pointed to an α + β fold with four conserved strands and at least two conserved helices. This predicted secondary structure of the KilA-N domain, with a head-to-tail dyad of a 2-β-strand-α-helix unit (Figure 1a), is identical to the secondary structure seen in the conserved core of the APSES domains. Structural comparisons using the DALI program indicated that the core fold shared by the KilA-N and APSES domains is more distantly related to the LAGLIDADG site-specific DNA endonucleases and the amino-terminal domain of the tRNA splicing endonucleases (TEN domain) (Figure 1b). Three-dimensional superpositions of the three domains for which structures are available aligned the C-α atoms with a root-mean-square deviation of 3.2 Å or less for approximately 60 residues. A search with the APSES domain (PDB 1bm8) using the VAST program detected significant structural alignments with both the LAGLIDADG and the TEN domains (p approximately 10⁻⁴). The structural similarity between the TEN domain and LAGLIDADG nuclease domain has been previously noted, but the connection of both of these with the APSES domains has not been reported before to our knowledge. We noticed, however, that, although the TEN domain is related to the LAGLIDADG endonucleases, the active-site residues are not preserved in the former, and the TEN domain probably functions as an RNA-binding domain rather than a nuclease. These observations suggest that the APSES, LAGLIDADG and TEN domains, together with the KilA-N domain, define a novel nucleic-acid-binding fold with a conserved (β2α)2 core (Figure 1b). Previously, it had been proposed that the APSES domains were related either to the winged HTH or to the basic helix-loop-helix (bHLH) domains [16,17,24]. 
These connections are not, however, recovered in any sensitive sequence-profile or structural similarity searches. A direct comparison of the structures also showed that the only domains that share a similar topology and conformation with the APSES domain are the LAGLIDADG and TEN domains. This implies that the previously proposed relationships for the APSES domain are unlikely to represent the true evolutionary connections.
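The root-mean-square deviation reported above for the superposed C-α atoms can be illustrated with a minimal Python sketch. It assumes that the two structures have already been superposed and that equivalent residues have been paired; the function name and the toy coordinates are hypothetical and are not taken from the DALI or VAST output.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD between paired, pre-superposed C-alpha coordinates.

    Each argument is a list of (x, y, z) tuples; the result is in the
    same units as the inputs (angstroms for PDB coordinates).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one pair of equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical toy coordinates for two aligned residues.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_b = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
toy_rmsd = ca_rmsd(toy_a, toy_b)
```

In practice the superposition itself (for example by a least-squares fit) precedes this calculation; the sketch only shows how the reported deviation is defined once the pairing is fixed.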
A comparison of the multiple alignment of the KilA-N and APSES domains showed that several characteristic hydrophobic/aromatic residues were conserved between the two families (Figure 1a), suggesting that they have similar functional properties. The APSES domains principally bind specific DNA sequences associated with regulatory regions of numerous genes expressed in the G1 to S transition of the cell cycle in yeast [25,26]. The sequence relationship with the APSES domain, taken together with the evidence of DNA binding by the D6R/N1R protein, suggests that the KilA-N domain is a previously undetected DNA-binding domain that is prevalent in the viral world. This prediction was also supported by the evidence from mutations to the conserved KilA-N domain of N1R that affected the localization of this protein to viral DNA-containing cytoplasmic virus maturation complexes (virus factories).
The phyletic patterns of the KilA-N and APSES domains have interesting implications for the origin of the fungal transcription factors. Unlike several DNA-binding domains conserved throughout eukaryotes, such as the homeodomain, the bHLH, bZip, C2H2 zinc finger and the RFX domain, APSES domains are restricted to fungi [27,28]. Their closest relatives are the KilA-N domains, which are widespread in diverse DNA viruses and prophage derivatives from bacterial genomes. This suggests that APSES domains probably emerged early in fungal evolution from a viral KilA-N-like precursor that was acquired by the host cell. An alternative scenario, that the viral KilA-N domains were acquired from the fungal APSES domain, is also imaginable. This appears less likely, however, because the APSES domains show limited sequence diversity in the fungi, compared with the much greater sequence diversity of the KilA-N domain in a wide range of DNA viruses that had probably already diverged by the time fungi emerged (Figure 1a). Furthermore, the viral provenance of the APSES-like domains is analogous to the recruitment of the transposon-derived BED-finger domain in cell-cycle-specific transcription factors such as BEAF-1 and DREF in the arthropods. Both viruses and transposons are likely to derive selective advantages by evolving transcription factors that regulate their genes in response to the host cell cycle. Hence, from the above observations, it is not unlikely that the host cells co-opt the transcription factors of their genomic parasites for their own cell-cycle-specific gene expression.
The potential higher-order structural relationship between the APSES and KilA-N domains with the LAGLIDADG site-specific DNA endonucleases and tRNA endonuclease amino-terminal domains has interesting implications for their evolutionary affinities and origin. A structural comparison of these proteins indicates that the nuclease active site of the LAGLIDADG domains is contained in a specific amino-terminal α-helical extension packed against the core (β2α)2 domain common to all these proteins (Figure 1b). The KilA-N, APSES and TEN domains lack the equivalent residues of the specific active-site extension of the LAGLIDADG nucleases, suggesting that the former domains do not possess nuclease activity. Thus, it appears plausible that these domains evolved from an ancestral nucleic-acid-binding module that, on one hand, gave rise to the nucleases through the acquisition of an amino-terminal helical extension that provided the active site, and on the other hand, diversified into distinct nucleic-acid-binding domains. Like the KilA-N domains that are mainly associated with DNA viruses, the LAGLIDADG endonucleases are predominantly encoded by mobile genetic elements. Just as the APSES domain appears to be a derivative of a KilA-N domain captured by the cellular genome, the TEN domains appear to have been derived through the ancient cellular capture of an inactive LAGLIDADG-like domain. The common ancestor of this fold might have emerged in a mobile genomic symbiont or parasite and subsequently spread widely across viral and transposon genomes.
The baculovirus Bro proteins are encoded by a multigene family and represent another class of virus-specific DNA-binding regulators whose evolutionary affinities are not yet understood. The typical Bro proteins that have been experimentally investigated are BroA, BroC and BroD from Bombyx mori nuclear polyhedrosis virus (BmNV). In addition to baculoviruses, we observed that the NCLDV class members, such as poxviruses and iridoviruses, also encoded homologs of the Bro proteins. Proteins such as FPV124 from fowlpox virus and three distinct proteins encoded by the entomopoxvirus AMV showed similarity only to the carboxy-terminal part of the baculovirus Bro proteins, whereas, on their amino terminus, they contain a KilA-N domain (Figure 2). In contrast, another group of entomopoxvirus proteins, such as MSV226 from MSV and AMV262 from AMV, showed similarity only to the amino-terminal part of the baculovirus Bro proteins. Yet another set of viral proteins, such as MSV194 from MSV, ORF117 from the phaeovirus ESV, and baculovirus proteins, such as BroE of BmNV, combine the region homologous to the amino-terminal segment of the typical Bro proteins with another distinct domain that occurs in a stand-alone form in the phage T5 ORF172 protein. This suggests that the typical Bro proteins contain distinct amino- and carboxy-terminal domains (Bro-N and Bro-C, respectively) that are present independently of each other and in distinct contexts in a variety of other viral proteins. To uncover their entire range of diversity, we initiated PSI-BLAST sequence-profile searches, seeded separately with the sequences of the Bro-N and Bro-C domains. The Bro-C domain was essentially restricted to the eukaryotic viruses of the baculovirus and NCLDV classes (Table 1). 
In contrast, the Bro-N domain was more widely distributed, occurring in a stand-alone form or combined with other domains in proteins from temperate phages that infect Gram-positive bacteria and Myxococcus xanthus, and proteins encoded in the genomes of proteobacteria and Gram-positive bacteria. The P22 antirepressor protein, which regulates phage transcription [33,34], is one of the previously characterized bacteriophage proteins in which the Bro-N domain was observed.
Studies on the BmNV BroA protein have shown that it binds DNA with high affinity and associates with the chromatin in the BmNV-infected cells. Additionally, it has been shown that the DNA-binding determinants of the BroA protein map to the amino-terminal 80 amino acids that correspond to the Bro-N domain defined above. Thus, the Bro-N domain appears to define a distinct superfamily of widespread viral DNA-binding domains. Multiple alignment-based secondary-structure prediction of the Bro-N domain reveals a core with two head-to-tail units of a β-hairpin followed by an α-helix (Figure 3). Thus, the Bro-N domain adopts an α + β fold; furthermore, the pattern of predicted secondary-structure elements was similar to that seen in the KilA-N domain and its relatives (Figure 3). The multiple alignment shows that the Bro-N domains contain two highly conserved aromatic or hydrophobic residues at the end of the second and fourth conserved strands, a pattern that is reminiscent of similarly conserved residues in the KilA-N and APSES domains (Figure 3). However, sequence profile searches do not detect any significant similarity between the Bro-N and KilA-N domains. The sequence-structure threading with the 3D-PSSM method recovered the APSES domain structure (PDB 1bm8/1bm1) as the best hit for the Bro-N domain, albeit with statistically insignificant E-values. Thus, given that the Bro-N domain is a DNA-binding domain, it appears plausible that it adopts a fold similar to that of the KilA-N domain, although sequence analysis and threading methods failed to provide strong support for this.
There are several analogies between the KilA-N and Bro-N domains in terms of phyletic patterns, intragenomic distribution and domain architectures. Both these domains are widely prevalent in bacteriophages, bacteria and large eukaryotic viruses of the NCLDV and baculovirus classes. Both of them show expansions in particular viral genomes, for example KilA-N in FPV and Bro-N in certain baculoviruses, and the entomopoxvirus MSV. Phylogenetic tree construction for the KilA-N and Bro-N domains using the least-squares and maximum-likelihood methods showed that multiple versions of these domains from a given genome typically grouped together, to the exclusion of members from other genomes (data not shown). Thus, KilA-N and Bro-N domains probably have undergone lineage-specific expansions through amplification of a single ancestral gene. The genes encoding multiple copies of these domains do not necessarily occur in close proximity in the corresponding genomes, indicating that duplications were accompanied by extensive genome rearrangement resulting in dispersion of the paralogous genes.
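The criterion used above for a lineage-specific expansion, namely that copies from one genome group together to the exclusion of copies from other genomes, can be sketched in Python as a test for an exclusive clade. This is only an illustration under stated assumptions: the tree is assumed to be rooted and is represented as nested tuples, and the leaf labels are hypothetical rather than taken from the trees that were actually constructed.

```python
def leaves(tree):
    """Return the set of leaf labels in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    found = set()
    for child in tree:
        found |= leaves(child)
    return found

def forms_exclusive_clade(tree, members):
    """True if some subtree contains exactly the given leaves and no others."""
    members = set(members)
    if leaves(tree) == members:
        return True
    if isinstance(tree, str):
        return False
    return any(forms_exclusive_clade(child, members) for child in tree)

# Hypothetical trees: in the first, the three FPV copies cluster together;
# in the second, copies from the two genomes are interleaved.
expanded = ((("FPV_a", "FPV_b"), "FPV_c"), ("MSV_a", "MSV_b"))
interleaved = (("FPV_a", "MSV_a"), ("FPV_b", "MSV_b"))
```

With these toy inputs, `forms_exclusive_clade(expanded, {"FPV_a", "FPV_b", "FPV_c"})` holds, whereas the same query on `interleaved` with `{"FPV_a", "FPV_b"}` does not, which is the pattern expected if the copies predate the divergence of the genomes.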
The KilA-N and the Bro-N domains show additional parallels in the domain architectures of the corresponding proteins. Both domains almost always are located at the amino termini of these proteins and most often are fused to another distinct domain at the carboxyl terminus (Figure 2). On many occasions, the Bro-N and KilA-N domains are fused to the same carboxy-terminal domains (Table 1, Figure 2), such as the Bro-C, T5ORF172, KilA-C, and Mx8P63-C domains. Additionally, Bro-N domains show specific combinations with certain domains, such as HTH and the Vsr-superfamily endonuclease. These architectures suggest that both the KilA-N and Bro-N modules are, to a large extent, functionally equivalent and probably act as the principal DNA-binding moiety that recruits a specific activity purveyed by their carboxy-terminal domains to the target DNA sequences. These carboxy-terminal modules may be enzymes, such as the nuclease domains, or might mediate additional, specific interactions with nucleic acids or proteins. Thus, the principal function of these proteins appears to be transcriptional regulation of viral or host genes. Consistent with this hypothesis, these proteins include antirepressors from phages such as BPP22, BPVT2 and BP933W.
The majority of the domains with which the KilA-N and Bro-N domains combine in multidomain proteins are restricted to proteins encoded by temperate bacteriophages and large eukaryotic DNA viruses (the exceptions are a few domains that are common in cellular proteomes, such as HTH, CCCH and RING finger). In order to identify the entire repertoire of domains involved in these domain-shuffling events, we systematically explored all domains that combined with the KilA-N and Bro-N domains and compiled a list of domains with which the latter combined in other multidomain proteins (Table 2, Figure 2a). This information is represented as a directed graph in Figure 2b, in which each vertex corresponds to a particular domain and the edges connect domains that combine to form multidomain proteins. The entire network consists of 24 domains, of which Bro-N and KilA-N domains show by far the greatest number of connections (12 and 7, respectively) compared to other domains. The RHA, 31ORF238-N and MSV199 domains, while less versatile in their connections, tend to combine with the same domains, and in the same orientation, as Bro-N and KilA-N (Figure 2b), which suggests an analogous function, such as DNA binding.
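The graph representation described above can be sketched in Python as follows. This is a minimal illustration, not the procedure used to draw the figure: it assumes that an edge runs from each domain to the next domain toward the carboxyl terminus, that a domain's number of connections is its number of distinct neighbors, and that the toy architectures listed are illustrative combinations built from domain names in the text rather than the full table.

```python
from collections import defaultdict

def build_domain_graph(architectures):
    """Return the set of directed edges (upstream, downstream) between adjacent domains."""
    edges = set()
    for arch in architectures:
        for upstream, downstream in zip(arch, arch[1:]):
            if upstream != downstream:  # ignore tandem repeats of one domain
                edges.add((upstream, downstream))
    return edges

def connection_counts(edges):
    """Count the distinct neighbors of each domain, ignoring edge direction."""
    neighbours = defaultdict(set)
    for upstream, downstream in edges:
        neighbours[upstream].add(downstream)
        neighbours[downstream].add(upstream)
    return {domain: len(found) for domain, found in neighbours.items()}

# Toy architectures, listed from amino to carboxyl terminus (illustrative only).
toy_architectures = [
    ["KilA-N", "CCCH", "RING"],
    ["KilA-N", "RING"],
    ["KilA-N", "Bro-C"],
    ["KilA-N", "KilA-C"],
    ["Bro-N", "Bro-C"],
    ["Bro-N", "T5ORF172"],
    ["Bro-N", "KilA-C"],
]
toy_edges = build_domain_graph(toy_architectures)
toy_counts = connection_counts(toy_edges)
```

Keeping the edges directed preserves the orientation of each fusion, which is what allows the observation that some domains combine with the same partners in the same order as Bro-N and KilA-N.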
We show here that KilA-N and Bro-N domains are two DNA-binding domains that are widespread in large DNA viruses infecting bacteria and eukaryotes. At least the former, and perhaps even the latter, appear to belong to a large class of nucleic-acid-binding domains that includes the APSES, LAGLIDADG endonuclease and TEN domains. The fungus-specific transcription factors containing the APSES domain appear to have evolved through capture of a viral KilA-N-like precursor early in fungal evolution. KilA-N and Bro-N domains combine with overlapping sets of carboxy-terminal domains, which, in turn, combine with several additional domains, resulting in an extensive network of 18 domains that are predominantly specific to large DNA viruses and six domains acquired by these viruses from the host genomes (18 + 6 = 24). These observations establish a major role for shuffling within a limited set of domains during evolution of viral DNA-binding regulatory proteins.
The non-redundant (NR) database of protein sequences was searched using the BLASTP program. Profile searches were conducted using the PSI-BLAST program with either a single sequence or an alignment used as the query, with a profile-inclusion expectation (E) value threshold of 0.01, and were iterated until convergence [35,36]. If whole-length proteins were used, the searches were carried out using composition-based statistics in order to prevent the detection of spurious matches that could arise from low-complexity segments in the query or target proteins. In searches carried out with just the globular domain of a particular protein, composition-based statistics were not used because this helps in improving the sensitivity of searches without a major risk of corruption of the profile with false positives. Additional searches with hidden Markov models were performed using the HMMER package [38,39]. Previously known conserved protein domains were detected using the corresponding position-specific scoring matrices (PSSMs) constructed using PSI-BLAST.
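The search settings described above can be written down as a short Python sketch that assembles a command line for the `psiblast` program of the NCBI BLAST+ suite. This is an assumption for illustration only: the text does not state which command-line tool or options were used, the query file name is hypothetical, and the command is built but not executed.

```python
def psiblast_args(query_fasta, whole_protein, db="nr"):
    """Build an argument list for BLAST+ psiblast mirroring the described settings.

    whole_protein=True turns on composition-based statistics (full-length
    queries); False turns them off (isolated globular domains).
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-inclusion_ethresh", "0.01",  # profile-inclusion E-value threshold
        "-num_iterations", "0",        # 0 means iterate until convergence
        "-comp_based_stats", "1" if whole_protein else "0",
    ]

# Hypothetical query file holding an isolated domain sequence.
domain_search = psiblast_args("kila_n_domain.fasta", whole_protein=False)
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and a local copy of the database are installed; both are assumptions about the environment.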
Multiple alignments of protein sequences were constructed using the T-Coffee program, followed by manual correction based on the PSI-BLAST results. Protein secondary structure was predicted using a multiple alignment as the input for the JPRED program. Sequence-structure threading was performed using the hybrid fold-prediction method, which combines multiple-alignment information with secondary-structure prediction, and the 3D-PSSM method. Phylogenetic trees were constructed using neighbor-joining, least-squares and maximum-likelihood methods [44,45]. Species abbreviations used in this paper are as follows: AcNV, Autographa californica nucleopolyhedrosis virus; AgNV, Anticarsia gemmatalis nucleopolyhedrosis virus; AMV, Amsacta moorei entomopoxvirus; BPBK5T, Lactococcus phage BK5-T; BmNV, Bombyx mori nuclear polyhedrosis virus; BP933W, bacteriophage 933W; BPA118, bacteriophage A118; BPAPSE1, bacteriophage APSE1; BPbIL285, bacteriophage bIL285; BPbIL286, bacteriophage bIL286; BPbIL309, bacteriophage bIL309; BPD3, bacteriophage D3; BP80, bacteriophage 80; BPHK620, bacteriophage HK620; BPHK97, bacteriophage HK97; BPLLH, bacteriophage LLH; BPMx8, bacteriophage Mx8; BPN15, bacteriophage N15; BPP1, bacteriophage P1; BPP22, bacteriophage P22; BPP4, bacteriophage P4; BPP27, bacteriophage P27; BPPV, bacteriophage PV; BPR1T, bacteriophage R1T; BPSpβC2, bacteriophage SpβC2; BPT5, bacteriophage T5; BPVT2-Sa, bacteriophage VT2; Cab, Clostridium acetobutylicum; Cal, Candida albicans; CIV, Chilo iridescent virus; CnBV, Culex nigripalpus baculovirus; CpGV, Cydia pomonella granulovirus; DpAV4, Diadromus pulchellus ascovirus; Ec, Escherichia coli; EpNV, Epiphyas postvittana nucleopolyhedrovirus; EV, ectromelia virus; Eni, Emericella nidulans; ESV, Ectocarpus siliculosus virus; FPV, fowlpox virus; HaEPV, Heliothis armigera entomopoxvirus; HaNV, Helicoverpa armigera nucleopolyhedrovirus G4; HK97, bacteriophage HK97; HzNV, Helicoverpa zea single nucleocapsid nucleopolyhedrovirus; 
Kla, Kluyveromyces lactis; LcBPA2, Lactobacillus casei bacteriophage A2; LdNV, Lymantria dispar nucleopolyhedrovirus; LsNV, Leucania separata nuclear polyhedrosis virus; MbNV, Mamestra brassicae nucleopolyhedrovirus; MSV, Melanoplus sanguinipes entomopoxvirus; Nc, Neurospora crassa; Nm, Neisseria meningitidis; OpNV, Orgyia pseudotsugata single capsid nuclear polyhedrosis virus; Pa, Pseudomonas aeruginosa; PxGV, Plutella xylostella granulovirus; Sa, Staphylococcus aureus; Sc, Saccharomyces cerevisiae; Scoe, Streptomyces coelicolor; SeNV, Spodoptera exigua nucleopolyhedrovirus; Sf, Shigella flexneri; Sp, Schizosaccharomyces pombe; SpLNV, Spodoptera litura nucleopolyhedrovirus; Spy, Streptococcus pyogenes; VAR, variola virus; Xf, Xylella fastidiosa; XnGV, Xestia c-nigrum granulovirus; Yaly, Yarrowia lipolytica.
Alignments of all viral specific domains