Identification of a potential DNA repair system that is largely specific to thermophiles
An exhaustive search for conserved gene neighborhoods in the available complete bacterial and archaeal genomes (I.B.Rogozin, K.S.Makarova and E.V.Koonin, unpublished data) revealed only three large (more than 10 genes) neighborhoods that were represented predominantly in Archaea, along with a few bacterial species. Of these, the partially conserved superoperons that encode ribosomal proteins and subunits of the proton-transporting ATPase complex have been identified and analyzed in detail previously (16 and references therein). The third neighborhood, which is chiefly represented in Archaea and hyperthermophilic bacteria, is even larger, with genes that belong to over 20 COGs, but shows greater diversity than the ribosomal and ATPase neighborhoods, in terms of gene order. This neighborhood includes mostly genes without known or predicted functions. At the time of the genome comparison that resulted in the identification of this neighborhood, clear functional assignments were available for only two of its constituent genes. These genes encode a predicted DNA helicase and a predicted RecB family nuclease, which, by extension, could suggest a role in DNA repair for the entire neighborhood. Prompted by these observations, we undertook a detailed comparison of the potential operons comprising this neighborhood in different genomes and an in-depth analysis of the conserved domains that could be identified in the proteins encoded by uncharacterized genes.
Diverse versions of this neighborhood were detected in all completely sequenced archaeal species, with the exception of Thermoplasma acidophilum and Halobacterium sp. NRC-1, both available genomes of bacterial hyperthermophiles, T.maritima and some bacterial mesophiles, namely B.halodurans. The corresponding genome region from the bacterial hyperthermophile A.aeolicus was chosen as a template to produce a template-anchored multiple alignment (16) of the analyzed neighborhood because it had the longest potential superoperon comprised of 18 genes (Fig. A). Although not a single gene is present in all genomes that have the analyzed neighborhood, a distinct group of five core genes that are conserved in the great majority of these genomes, often in the same order, was identified (Fig. A and Table ). This conserved core of the putative new repair system shows the following predominant gene order: COG1857-COG1688-COG1203-COG1468-COG1518 (Fig. A). The sixth gene, which is not a part of this array, but is present within the neighborhood in most genomes, is COG1353, which typically is found in close proximity with one or more genes of COGs 1336, 1367, 1604, 1337 and 1332 (Fig. A).
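As an illustration of how conservation of the core array could be scored across genomes, the following minimal Python sketch finds the longest run of consecutive core genes, in the predominant order given above, reading a gene list in either direction. The gene orders in `toy_genomes`, the placeholder label `other` and the function name are hypothetical and are not taken from Fig. A; this is not the template-anchored alignment procedure used in the analysis.

```python
# Predominant core gene order quoted in the text.
CORE = ["COG1857", "COG1688", "COG1203", "COG1468", "COG1518"]


def longest_core_run(genes, core=CORE):
    """Return the length of the longest run of consecutive core genes,
    in core order, found when reading `genes` in either direction."""
    best = 0
    for seq in (genes, genes[::-1]):  # both orientations of the region
        for start, gene in enumerate(seq):
            if gene not in core:
                continue
            k = core.index(gene)
            run = 0
            while (start + run < len(seq)
                   and k + run < len(core)
                   and seq[start + run] == core[k + run]):
                run += 1
            best = max(best, run)
    return best


# Hypothetical gene orders, for illustration only.
toy_genomes = {
    "genome_1": ["COG1353", "COG1336", "COG1857", "COG1688",
                 "COG1203", "COG1468", "COG1518"],
    "genome_2": ["COG1518", "COG1468", "COG1203", "other", "COG1688"],
    "genome_3": ["COG1203", "COG2254", "COG1337"],
}

runs = {name: longest_core_run(order) for name, order in toy_genomes.items()}
```

In this toy input, the first genome carries the complete array, the second carries a three-gene segment on the opposite strand, and the third retains only the helicase gene.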
Figure 1 Organization of genes and potential operons in the genomic regions coding for protein components of the predicted novel DNA repair system. (A) The core (helicase-nuclease) and polymerase modules. Genes are shown not to scale.
The genes comprising the predicted thermophile-specific DNA repair system
The core gene array includes those components of the putative repair system for which straightforward functional prediction was possible. All proteins of COG1203 contain a typical superfamily II helicase domain. In most of these proteins (with the exception of MJ0383, PAB1689, PH0917, APE1232 and AF1874), the helicase domain is fused to a predicted HD-nuclease domain (52). Fusion of helicase and nuclease domains is characteristic of many repair systems. For example, the bacterial RecB protein contains a fusion of a Superfamily I helicase and the eponymous nuclease, whereas the eukaryotic RAD1 protein is a Superfamily II helicase fused to a predicted ERCC4 family nuclease and the Werner syndrome protein displays a fusion of the 3′→5′ exonuclease and a SF-II helicase module (14). However, the specific combination of helicase and nuclease domains seen in the COG1203 proteins has not been described so far. Those species that have a stand-alone helicase in the core of the putative repair system (e.g. P.abyssi) possess either an extra copy of the fusion gene or a stand-alone predicted HD-nuclease (COG2254) that is typically adjacent to the core gene array (Fig. A).
The proteins of COG1468 belong to the RecB nuclease family and contain all the conserved catalytic residues characteristic of these nucleases. A distinctive feature of these proteins is the presence of a C-terminal module that contains three conserved cysteines and might mediate metal-dependent DNA binding. RecB protein, which is sporadically distributed in bacteria, contains helicase and nuclease domains and is a subunit of the RecBCD recombinase complex, one of the major systems of recombinational repair in E.coli. In addition to the RecB protein, RecB family nucleases are often fused to SF-I helicases of other subfamilies (e.g. in Synechocystis protein sll1582 to DNAI/HCS1-like helicase and in MTH472 from M.thermoautotrophicum to PCRA/UvrD-like helicase) (53).
COG1518 consists of proteins that do not have detectable homologs with known functions. However, examination of the multiple alignment of these proteins revealed a pattern of conserved acidic residues (Fig. ) that are often present in catalytic sites of various families of nucleases (53). Considering the strongly conserved association of the gene coding for this protein with the genes coding for two other nucleases (COG1203/2254 and COG1468) and a helicase (COG1203), we consider it likely that this protein is a previously undetected nuclease that functions within the putative novel DNA repair system. In contrast to most known nuclease families, which include α/β proteins, but similarly to the HD-superfamily (52), secondary structure prediction indicates that COG1518 proteins have an all-α structure that probably represents a novel nuclease fold. In addition to their position within the core of the putative repair system, genes for the COG1518 proteins were found in an alternative gene array that is conserved in several bacterial species (Fig. B). In particular, COG3513 consists of large, probably multidomain proteins that contain a diverged McrA/T4-Endo-VII nuclease domain (53).
Figure 2 Multiple alignment of the predicted novel nuclease family (COG1518). The proteins are denoted by their systematic gene numbers, Gene Identification (GI) numbers from the GenBank database and abbreviated species names (see Materials and Methods for abbreviations).
Two uncharacterized proteins that belong to the core of the putative repair system, COG1857 and COG1688, are predicted to possess α/β folds, as suggested by secondary structure prediction, but do not contain any conserved motifs with potential catalytic amino acids (data not shown). These proteins are homologous to the DevR and DevS gene products from Myxococcus xanthus, respectively. In M.xanthus, the corresponding locus is autoregulated and essential for fruiting body development, but this connection provides no clues as to the possible biochemical functions of the proteins in question (57). No further information was obtained on COG1857 despite extensive sequence searches. However, COG1688 turned out to be distantly related to several other COGs associated with this system as described below.
Another part of the analyzed gene neighborhood centers on the gene coding for multidomain proteins of COG1353. Many of the proteins in this COG contain an N-terminal domain that is a distinct version of an HD-superfamily hydrolase (52) with a circular permutation, in which the N-terminal metal-binding histidine is displaced to the extreme C-terminus of the HD domain (Fig. E). However, in some of these proteins, the HD domain is disrupted, whereas others, such as aq_357 from A.aeolicus, lack the HD domain altogether. The conserved C-terminus shared by almost all these proteins has three distinct domains (Fig. ); the first of these is a distinct globular α + β domain that was not detected in any other proteins (Fig. D), whereas the second one is a Zn ribbon (Fig. C) that is seen in numerous contexts including nucleic acid interaction (58). The C-terminal domain of these proteins, which is present in a stand-alone form in SSO1429, is homologous to the catalytic domain of diverse DNA and RNA polymerases. Early sequence and structure comparisons showed that reverse transcriptases, viral RNA-dependent RNA polymerases and DNA polymerases of superfamilies A and B share a common catalytic core domain (59). This core domain was also detected in signal-transducing adenylyl cyclases and bacterial nucleotide cyclases typified by the GGDEF domain (62), some of which may possess diguanylate cyclase activity (65). The core palm-domain of these proteins contains an RNA recognition motif (RRM)-like fold with a β-α-β2-α-β topology; in nucleic acid polymerases, predominantly α-helical structures (the polymerase ‘fingers’) are inserted into the RRM-like domain upstream of helix-1 (Fig. A).
Figure 3 The predicted novel DNA polymerase. (A) Topology of the conserved core of the polymerase-cyclase palm domain. The catalytic metal-coordinating residues and the variable inserted finger module in the polymerases are indicated.
Figure 4 The domain architecture of the predicted novel DNA polymerases compared with domain architectures of other nucleic acid polymerases that are associated with different phosphoesterase domains. The polymerase catalytic domains are abbreviated as ‘Poly’.
PSI-BLAST searches with the C-terminal domains of the COG1353 proteins detected the GGDEF domains with statistically significant E-values (in 3rd–4th iterations). Furthermore, GGDEF domains and the putative polymerase domains of the COG1353 proteins shared an extended region of similarity beyond the core palm domain (62). Alignment-based secondary structure predictions (66) for these domains were compatible with the palm-domain structures of nucleic acid polymerases and nucleotide cyclases. Sequence–structure threading through the PDB database with both the combined fold recognition method (Z-score = 12.1) and 3D-PSSM method (E-value 0.02; E-values up to 0.8 are normally considered significant for threading through PDB database with this method) gave the adenylyl cyclases and DNA polymerases as the best hits. This strongly suggests that COG1353 proteins contain the same core fold as the palm domain of polymerases and cyclases.
A multiple alignment of the conserved portion of the C-terminal domain of COG1353 proteins with several families of polymerases, including family B DNA-directed DNA polymerases, reverse transcriptases, RNA-directed RNA polymerases of positive-strand RNA viruses and the two families of nucleotide cyclases, revealed the conservation of two distinct motifs (Fig. B) in this entire diverse array of proteins (Fig. A and B). These motifs contain the conserved negatively-charged residues (most often aspartates), which function as divalent metal ligands in the catalytic centers of these polymerases and are readily identifiable in the COG1353 proteins, suggesting similar catalytic activities (Fig. A and B). Unlike the related GGDEF and adenylyl cyclase domains, the polymerase/cyclase-related domains of COG1353 proteins are never fused to the classic signaling domains such as PAS, GAF, HAMP or CACHE (64). Instead, they are fused to Zn ribbons, which are present in a variety of DNA polymerases, such as DNA polymerase of the eukaryotes and the euryarchaeal DNA polymerase II (58). In one of the COG1353 members, MTH1082, a Zn ribbon is inserted directly within the polymerase/cyclase domain after helix 1 (Fig. ). The presence of the predicted HD-hydrolase domain fused to the N-terminus of this novel polymerase-related domain develops the theme of functional and, in most cases, physical association of various phosphohydrolase domains and nucleic acid polymerases (68) (Fig. ). It has been hypothesized that these domains or subunits function as pyrophosphatases that cleave the pyrophosphate formed during nucleotide polymerization and thus drive forward the polymerase reaction (68). The same function is most likely for the HD-domain of the COG1353 proteins, although it is also possible that these predicted phosphoesterase domains function as uncharacterized nucleases in conjunction with the DNA polymerases. Taken together, these observations strongly suggest that the C-terminal domain of the COG1353 proteins is a previously undetected DNA polymerase distantly related to all of the above polymerase families. Given their degree of conservation and widespread presence in the Archaea, it appears most likely that this predicted DNA polymerase evolved from a common ancestor with other polymerases at an early stage of archaeal evolution. The GGDEF cyclases and adenylyl cyclase, which are more closely related to these predicted DNA polymerases, might have been derived from them and subsequently disseminated by horizontal gene transfer (HGT).
Genes of COGs 1336, 1367, 1604, 1337, 1332 and 1583 are always seen in close proximity to the predicted COG1353 polymerase, often in tandem (Fig. A). PSI-BLAST searches revealed relatively weak but, in many cases, statistically significant similarity between proteins from these COGs. For example, in a search starting with the sequence of the Rv2821c protein (COG1337), with the profile-inclusion cut-off set at E = 0.01, some members of COG1332 are detected in the second iteration, members of COG1604 and COG1336 appear in the fourth iteration and members of COG1367 in the fifth iteration. In the same search, a member of COG1567 (PH0166) passes the cut-off in the fourth iteration. A reverse search started with the PH0166 sequence as the query detects all members of COG1567 and includes the first protein of COG1337 (MTH1080) on the fifth iteration. In the latter search, some proteins from other COGs represented in the neighborhood also appear, albeit with E-values that do not pass the cut-off. In particular, in the sixth iteration, proteins SSO1437 (COG1583) and AF0067 (COG1688) were detected with E-values of 0.13 and 0.29, respectively.
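The bookkeeping behind such iterative searches can be sketched in a few lines of Python: a hit enters the profile only if its E-value is at or below the inclusion cut-off (0.01 here, as quoted above), and for each COG one records the first iteration in which that happens. The function name and the tuple layout are assumptions made for this example, and the E-value assigned to COG1337 is hypothetical; only the values 0.13 and 0.29 come from the text. This is not a PSI-BLAST implementation.

```python
INCLUSION_E = 0.01  # profile-inclusion cut-off quoted in the text


def first_inclusion(hits, cutoff=INCLUSION_E):
    """Map each COG to the first iteration in which one of its members
    has an E-value at or below `cutoff`.

    `hits` is an iterable of (iteration, cog, e_value) tuples.
    """
    first = {}
    for iteration, cog, e_value in sorted(hits):
        if e_value <= cutoff and cog not in first:
            first[cog] = iteration
    return first


# Reverse search started with PH0166 (toy record of reported hits).
hits = [
    (5, "COG1337", 1e-3),  # hypothetical E-value; the text gives only the iteration
    (6, "COG1583", 0.13),  # SSO1437, E-value from the text
    (6, "COG1688", 0.29),  # AF0067, E-value from the text
]

included = first_inclusion(hits)
not_included = sorted({cog for _, cog, _ in hits} - set(included))
```

With these inputs, COG1337 is included at the fifth iteration, whereas COG1583 and COG1688 are reported but stay outside the profile, which is why their membership rests on additional evidence such as shared motifs.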
Proteins from COGs 1336, 1367, 1604, 1337 and 1332 produced a multiple alignment with many conserved positions and five common motifs (Fig. ), which supports the notion that these COGs belong to the same protein family. The remaining COGs detected in BLAST searches failed to unequivocally align with the above five COGs, but shared at least some of the same conserved motifs and appeared to be compatible with the same fold as judged by similar patterns of predicted secondary structure elements (Fig. , alignments are available upon request). Thus, we believe that all these COGs comprise a previously undetected superfamily of repair-associated mysterious proteins (RAMPs). To identify all potential members of the RAMP superfamily and improve multiple alignments, we used all identified proteins as query sequences for exhaustive PSI-BLAST searches. This resulted in the identification of over 90 RAMPs, mostly in Archaea. All families of the RAMP superfamily share at least one C-terminal motif, which contains a glycine-rich loop (motif V, Fig. ). Two other conserved motifs in the N-terminal portion of RAMPs show distinct structural features. Motif II also consists of a loop followed by an α-helix (Fig. ). Motif I is a β-strand followed by a conserved glycine. However, none of these motifs was detectable in COG1583 members; this COG remains a provisional member of the RAMP superfamily (Fig. ).
Figure 5 The RAMP superfamily. The top part of the figure shows a multiple alignment of the major family of the RAMP superfamily. The designations are as in Figure 2. The bottom part shows a comparison of motifs derived from multiple alignments and secondary ...
Genes coding for several other uncharacterized proteins tend to be associated with the core genes of the putative repair system. One of these (COG1343) shows a patchy distribution among bacteria and Archaea, and others are seen specifically in a subset of bacteria (e.g. COG3649 in T.maritima, B.halodurans and S.pyogenes) or Archaea (COG3574 in Pyrococci, A.fulgidus and S.solfataricus). COG3578, which is represented in A.fulgidus, S.solfataricus and A.pernix, includes additional members of the RecB nuclease family.
Several other proteins are typically encoded in the vicinity of the predicted new polymerase gene. Two of these (COG1517 and COG3574) are specific for Archaea and two others were detected in both Archaea and bacteria (COG3337 and COG1421). Members of COG1517 contain a distinct motif with a ‘hhDhoH’ signature and several other conserved polar amino acids, which could contribute to a potential catalytic center of an enzyme, perhaps yet another nuclease (alignment is available upon request). Members of the remaining polymerase-associated COGs are small proteins, for which no functional prediction is currently possible. At least one more functional prediction can be made on the basis of operon organization conservation. A gene for a predicted helix–turn–helix transcriptional regulator (COG2462) is located within the neighborhood in most archaeal genomes. This Archaea-specific protein is likely to regulate the expression of one or more of the operons in the analyzed neighborhood.
Although none of the genes in the analyzed neighborhood has been functionally characterized, the repertoire of predicted functions, which include a DNA helicase, several DNases and a polymerase (Table ), strongly suggests that this neighborhood consists of genes together comprising a previously undetected DNA repair system. The uncharacterized proteins encoded within this neighborhood might function as accessory, DNA-binding or regulatory subunits of these repair complexes and perhaps as a sliding clamp for the predicted DNA polymerase. Of particular interest in this latter context are the RAMP proteins, which, given their remarkable proliferation, could form large hetero-oligomeric complexes.
A more specific functional role for the predicted repair system could be that of a functional equivalent of the mutagenic repair systems of bacteria and eukaryotes, which center around the translesion DNA polymerases of the UmuC-DinB-Rad30-Rev1 superfamily (69). A priori, such a system should be considered important in thermophiles, to counteract DNA damage caused by exposure to high temperatures. Among thermophilic organisms whose genomes have been completely or partially sequenced, only Sulfolobus encodes a predicted UmuC-DinB-Rad30-Rev1 superfamily polymerase (71). In particular, although this type of polymerase is present in most free-living bacteria, it is conspicuously missing in the hyperthermophiles T.maritima a
. In contrast, the only sequenced genome of a mesophilic archaeon, that of Halobacterium NRC-1, does have a DinB ortholog, but completely lacks the predicted new repair system. Thus, it seems plausible that the predicted novel repair system is the thermophilic counterpart of the mesophilic translesion repair system that also mediates adaptive mutagenesis (72). Technically, it is possible that the functional system encoded by the gene neighborhood described here is involved in RNA metabolism rather than in DNA repair. This, however, appears unlikely given that the most characteristic genes of this neighborhood encode a RecB family nuclease, which is specifically associated with repair systems (73) and a predicted polymerase fused to a phosphoesterase domain, an architecture typical of DNA polymerases (74).
Genomic diversity, modular organization, horizontal gene transfer and evolution of the predicted repair system
The predicted novel repair system shows conspicuous evolutionary plasticity. This becomes particularly obvious when genomes of closely related species, such as two Thermoplasmas, two Bacilli and three Pyrococci, are compared (Fig. A). Thermoplasma acidophilum does not encode any members of this system, whereas T.volcanium has what seems to be a rudimentary form, with only the predicted polymerase, the putative nuclease of COG1518, and several RAMPs and uncharacterized proteins. The disruption of the repair system probably started already in the common ancestor of the two Thermoplasma species, with T.acidophilum subsequently losing it completely. Bacillus halodurans has two large putative operons with genes for components of the predicted repair system, but B.subtilis has no counterpart to any of these genes. Among the three Pyrococcus species, P.abyssi resembles T.volcanium in having only remnants of the system, represented by the helicase, a stand-alone HD-nuclease and three RAMPs. Pyrococcus furiosus has a more complete system, which also includes the RecB family nuclease, the predicted polymerase and extra RAMPs. Finally, P.horikoshii has a complete, complex system, with triplication of the helicase and surrounding genes. Thus, substantial changes in this system tend to occur relatively rapidly, on a time-scale commensurate with the evolution of individual species. Gene loss is prominent among these evolutionary modifications, but amplification of parts of the system, and possibly acquisition of additional members through HGT (see below) also take place.
Cross-genome comparison of the repertoires and organization of the genes encoding components of the predicted novel repair system supports the notion that the core (helicase-nuclease) and the polymerase-RAMP modules of the superoperon have a degree of independence. The examples discussed above indicate that one of the modules can be independently lost. They also undergo rearrangement that includes reversal of the relative orientation of the modules in a superoperon (compare the gene organization in A.aeolicus and T.maritima in Figure A) and probable operon disruption (compare the gene organization in A.aeolicus and A.fulgidus). Furthermore, on some occasions, the modules have undergone independent amplification, such as the duplication of the polymerase module in T.maritima and triplication of the helicase-nuclease module in P.horikoshii.
The modular gene/operon organization of the predicted repair system probably also entails functional modularity. It seems likely that the stand-alone helicase-nuclease module that is present, for example, in P.abyssi (Fig. A) retains some limited functionality in repair, but probably functions on its own or in conjunction with a different polymerase. Conversely, in T.volcanium, the predicted DNA polymerase might interact with a distinct helicase. Furthermore, the deletion of some of the predicted nucleases in P.abyssi suggests a degree of redundancy among these enzymes. Such partial functional redundancy is typical of other repair pathways (75).
Overall, the number of genes coding for components of the predicted repair system varies to a great degree between genomes (Figs and ), with about 90 genes in S.solfataricus (>3% of all genes in this genome) and the minimal set of three genes in E.coli. The prevalence of this system in thermophiles is obvious. Bacillus halodurans is the only mesophile that has the principal genes of both the helicase-nuclease and the polymerase-RAMP modules and, even in this case, the system is less elaborate than it is in most thermophiles (Figs A and ). Most mesophiles have no trace of this system, and several species, in which it is represented, have only remnants of one or both modules (Fig. A). Search of unfinished prokaryotic genomes detected homologs of different proteins from the new repair system, particularly of the helicase-nuclease module, in many diverse bacteria (Table ). Again, the three thermophiles, for which large amounts of genome sequence were available, Chlorobium tepidum, Carboxydothermus hydrogenoformans and Bacillus stearothermophilus, showed a greater representation of this system than mesophiles (Table ).
Representation of the predicted novel repair system in different genomes. Pink rectangles, RAMP proteins; blue rectangles, other components of the system.
Traces of the predicted new repair system in unfinished bacterial genomes
The obvious plasticity of the new repair system raises the issue of a possible role of HGT in its evolution (as already alluded to above). The notion that HGT occurred more than once during evolution of this system is supported by notable conservation of certain gene arrays in phylogenetically distant genomes (Fig. A). The strongest case in point is the conservation of the gene order in the polymerase module between the archaeon A.fulgidus and the bacteria A.aeolicus (Fig. A). These observations suggest that these gene cassettes (probable operons) disseminated via HGT as a single entity. In fact, in an early comparison, the apparent superoperon that comprises the predicted repair system in A.aeolicus has been noticed as the largest constellation of ‘archaeal’ genes in the genome of this hyperthermophilic bacterium, and its presence was one of the arguments supporting massive HGT between bacterial and archaeal hyperthermophiles (76).
To examine further the contribution of HGT to the evolution of the predicted new repair system, phylogenetic trees were constructed for the four genes that are most common in the conserved neighborhood (COGs 1203, 1518, 1468 and 1353). All four trees showed clear indications of multiple HGT events (Fig. ). In particular, each tree supports HGT from Archaea to A.aeolicus, in agreement with the conservation of gene order between this bacterium and some Archaea (see above). The tree topologies suggest independent HGT events between Archaea and different bacterial groups as well as between different bacterial lineages. For example, in the tree for the putative novel nuclease (COG1518) and RecB-type nuclease (COG1468), the proteins from B.halodurans and B.stearothermophilus occupy very different positions instead of being adjacent as expected from the phylogeny of the corresponding species (Fig. A and C). The former belongs to a cluster of several bacterial species, which is located between two archaeal clusters, whereas the latter is part of another, smaller group of diverse bacterial species, which lies within one of the archaeal clusters (Fig. A and C). Furthermore, in the tree for COG1203 helicases, a third bacterial cluster, which combines proteins from proteobacteria, the Bacillus-Clostridium group of Gram-positive bacteria and a spirochete, joins the second archaeal cluster (Fig. B). Thus, the topology of this tree can be explained through three independent HGT events between Archaea and bacteria, followed by some additional HGT within the bacterial and possibly archaeal domains. Alternatively, it could be postulated that the existence of the third bacterial cluster, which is separated from the Archaea by a long branch, reflects vertical inheritance from the last common ancestor of Archaea and bacteria, with subsequent multiple gene losses resulting in the extant patchy phyletic distribution. 
Unlike the trees for the other three analyzed proteins, the tree for predicted polymerases has B.halodurans and B.stearothermophilus proteins in the same cluster (Fig. D); this emphasizes distinct evolutionary fates of different genes within the predicted new repair system.
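The reasoning used here, that two proteins from closely related species should be adjacent in a tree unless one of the genes was acquired horizontally, can be expressed as a simple check on a rooted tree: the smallest clade containing both leaves should contain nothing else. The Python sketch below uses nested tuples as a toy tree representation. Both toy topologies and all leaf labels are hypothetical and only mimic the patterns described in the text; they are not the trees in the figure, and treating the trees as rooted is a simplifying assumption.

```python
def leaves(node):
    """Return the set of leaf labels under `node` (a str or a tuple of nodes)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def smallest_clade(node, targets):
    """Leaf set of the smallest subtree containing all `targets`, or None."""
    here = leaves(node)
    if not targets <= here:
        return None
    if not isinstance(node, str):
        for child in node:
            sub = smallest_clade(child, targets)
            if sub is not None:
                return sub
    return here


def group_together(tree, targets):
    """True if `targets` form a clade with no other members."""
    return smallest_clade(tree, set(targets)) == set(targets)


# Toy topologies (hypothetical), echoing the two patterns described above.
nuclease_like = ((("Bh", "bacterium_x"), "archaeon_1"),
                 (("Bs", "bacterium_y"), "archaeon_2"))
polymerase_like = ((("Bh", "Bs"), "archaeon_1"), "archaeon_2")

separated = not group_together(nuclease_like, ["Bh", "Bs"])
adjacent = group_together(polymerase_like, ["Bh", "Bs"])
```

A failed check is only a hint of HGT; as noted above, ancient vertical inheritance followed by multiple gene losses can produce the same pattern.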
Figure 7 Phylogenetic trees for the most common components of the predicted novel repair system. (A) Putative novel nuclease (COG1518). (B) The helicase domain (COG1203). (C) The RecB family nuclease (COG1468). (D) The predicted novel polymerase (COG1353).
The apparent multiple HGT and gene loss events preclude a definitive conclusion as to the origin of the predicted repair system described here. One scenario would posit that this system originally evolved in hyperthermophilic Archaea and subsequently was disseminated through the prokaryotic world via multiple HGTs. Under this scenario, many mesophilic bacteria acquired (parts of) this system from thermophiles and subsequently lost some of the acquired genes. An alternative possibility, which is best compatible with the hypothesis that the last universal common ancestor of modern life forms was a hyperthermophile (77), is that the core of this system already existed in this hypothetical ancestral organism, with numerous coordinated gene losses occurring in various lineages that became mesophilic. One such lineage is the eukaryotes whose common ancestor might have originally inherited this repair system. A variant of this hypothesis is that the helicase-nuclease and polymerase-RAMP modules evolved independently at a very early stage of evolution. Subsequently, they might have been brought together to form a single repair system in Archaea, and this system was acquired by some, primarily thermophilic bacteria via HGT. At a mechanistic level, the association of the predicted repair system with thermophily and the apparent near incompatibility of this system with the translesion repair pathway based on UmuC-DinB-Rad30-Rev1 superfamily polymerases remain mysterious and, hopefully, will be targets for future experimental studies.