During a systematic analysis of conserved gene context in prokaryotic genomes, a previously undetected, complex, partially conserved neighborhood consisting of more than 20 genes was discovered in most Archaea (with the exception of Thermoplasma acidophilum and Halobacterium NRC-1) and some bacteria, including the hyperthermophiles Thermotoga maritima and Aquifex aeolicus. The gene composition and gene order in this neighborhood vary greatly between species, but all versions have a stable, conserved core that consists of five genes. One of the core genes encodes a predicted DNA helicase, often fused to a predicted HD-superfamily hydrolase, and another encodes a RecB family exonuclease; three core genes remain uncharacterized, but one of these might encode a nuclease of a new family. Two more genes that belong to this neighborhood and are present in most of the genomes in which the neighborhood was detected encode, respectively, a predicted HD-superfamily hydrolase (possibly a nuclease) of a distinct family and a predicted, novel DNA polymerase. Another characteristic feature of this neighborhood is the expansion of a superfamily of paralogous, uncharacterized proteins, which are encoded by at least 20–30% of the genes in the neighborhood. The functional features of the proteins encoded in this neighborhood suggest that they comprise a previously undetected DNA repair system, which, to our knowledge, is the first repair system largely specific for thermophiles to be identified. This hypothetical repair system might be functionally analogous to the bacterial–eukaryotic system of translesion, mutagenic repair whose central components are DNA polymerases of the UmuC-DinB-Rad30-Rev1 superfamily, which typically are missing in thermophiles.
Most of the presently known archaeal species and many bacteria are thermophiles or hyperthermophiles whose optimal growth temperatures reach 115°C (1–3). The molecular basis of maintenance of genome stability in (hyper)thermophiles is, arguably, one of the most intriguing problems of modern biology (1,4). Thermophiles typically are resistant not only to high temperatures, but also to other damaging factors, such as ionizing and ultraviolet radiation and chemical agents; furthermore, spontaneous mutagenesis is accelerated at elevated temperatures (5–7). However, the genomic mutation rate in the thermophilic archaeon Sulfolobus acidocaldarius was shown to be about the same as in mesophiles (8). Thus, thermophiles must have highly efficient and probably specialized DNA repair systems. Experimental studies of repair in thermophiles so far have been scant, although several repair enzymes have been identified, including thermostable O-6-methylguanine-DNA methyltransferase (9), uracil-DNA-glycosylase (10–12), RecA/Rad51 family protein with unique DNase activity (13) and others reviewed in Grogan (4). Attempts to delineate repair systems of bacterial and particularly archaeal thermophiles from genome sequences by identification of homologs of well-characterized components of repair pathways from Escherichia coli and Saccharomyces cerevisiae have been only partially successful (4,14,15). This led to the hypothesis that a distinct DNA repair system might exist in thermophiles (4).
Currently, 12 complete genome sequences of thermophilic Archaea and two genome sequences of bacterial hyperthermophiles are available. Like prokaryotic genomes in general, the genomes in thermophiles show little conservation of gene order over long evolutionary distances (16). Nevertheless, comparative analysis of genomic context, i.e. organization of genes into partially conserved clusters that are likely to represent operons, has proved a powerful method for prediction of the functions of uncharacterized bacterial and archaeal genes (16–20). The central premise of genomic context analysis is that genes that belong to the same operon are almost certainly functionally connected. By inference, if a predicted operon contains one or more genes with a known function, functions can be predicted for other, uncharacterized members of the same operon, especially when context analysis is complemented by prediction of biochemical activity of the proteins in question by means of comparative sequence and structure analysis. Straightforward identification of conserved gene strings that are likely to represent operons is the principal approach that so far has been employed in genome context analysis (16,17,19). However, because of the extensive rearrangements of local gene order, even within operons, that is characteristic of prokaryotic evolution, this method is insufficient to extract all context information that potentially exists in bacterial and archaeal genomes. Several attempts have been made to identify partially conserved gene neighborhoods that may show little direct conservation of gene order, but consist of identical or substantially overlapping gene sets in different genomes. Gene neighborhoods typically are not present, in their entirety, in any single genome, but are held together by overlaps between partially conserved gene sets.
It has been noticed previously that orthologs of a relatively small fraction of bacterial and eukaryotic repair proteins are detectable in Archaea, although many proteins containing helicase, nuclease and DNA-binding domains were identified and, in principle, could be candidates for roles in repair (14,15). Thus, sequence analysis alone seems to be insufficient for confidently predicting archaeal repair systems (21). Recently, we utilized a combination of the analysis of conserved gene neighborhoods/gene fusions with sensitive sequence profile searches and structural comparisons to predict a novel prokaryotic DNA repair system that seems to be the counterpart of the eukaryotic Ku-dependent double strand break system (22). Here, by using a combination of gene neighborhood analysis and detailed sequence and structure analysis of protein domains, we predict another previously undetected DNA repair system in archaeal and bacterial genomes. To our knowledge, this is the first DNA repair system that appears to be largely confined to thermophiles in its phyletic distribution and could potentially fill a significant void in terms of archaeal DNA repair systems.
The genome sequences and the encoded protein sequences of the Archaea Archaeoglobus fulgidus (Aful) (23), Methanobacterium thermoautotrophicum (Mthe) (24), Methanococcus jannaschii (Mjan) (25), Pyrococcus horikoshii (Phor) (26), Pyrococcus abyssi (Paby) (R. Heilig, Genoscope; GenBank NC_000868), Thermoplasma volcanium (Euryarchaeota) (Tvol) and Aeropyrum pernix (Aper) (27) and Sulfolobus solfataricus (Ssol) (28) (Crenarchaeota), as well as the bacteria Thermotoga maritima (Tmar) (29), Aquifex aeolicus (Aaeo) (30), Bacillus halodurans (Bhal) (31), Mycobacterium tuberculosis (Mtub) (32), Streptococcus pyogenes (Spyo) (33) (bacteria) were retrieved from the Genomes division of the Entrez system (34). The preliminary genome sequence of the Euryarchaeon Pyrococcus furiosus was downloaded from http://comb5-156.umbi.umd.edu/genemate/pfu-info.html.
The non-redundant database of protein sequences at the National Center for Biotechnology Information (NIH, Bethesda) was iteratively searched using the PSI-BLAST program (35,36). The cut-off of E < 0.01 was normally employed for inclusion of sequences in the position-specific weight matrices. For detecting subtle sequence conservation, the PSI-BLAST search results were visually examined and sequences with greater E-values, but containing signature motifs of a given protein family, were included into profiles on a case by case basis (35–37). Nucleotide sequences translated in six reading frames were searched for protein sequence similarity using the TBLASTN program (35). Unfinished microbial genome sequences (http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html) were searched using TBLASTN. Conserved domains in protein sequences were identified by searching the NCBI’s CD collection of domain-specific, position-dependent weight matrices using the Reverse PSI-BLAST program (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and by searching the SMART collection of domain-specific hidden Markov models (38). Multiple alignments of protein sequences were constructed using the T-coffee program (39) and corrected on the basis of PSI-BLAST results. Protein secondary structure was predicted using the PHD program, with a multiple alignment submitted as the query (40). Protein sequence–structure threading was performed by using the hybrid fold-recognition method (41) and the 3D-PSSM method (42).
Distance trees were constructed from multiple protein sequence alignments after excluding positions containing >70% gaps, by using the least-square method as implemented in the FITCH program of the PHYLIP package (43,44). Maximum likelihood trees were constructed using the ProtML program of the MOLPHY package, with the JTT-F model of amino acid substitutions, by optimizing the least-square trees with local rearrangements (45,46). Bootstrap analysis was performed for each maximum likelihood tree as implemented in MOLPHY using the resampling of estimated log-likelihoods (RELL) method (45–47).
The procedure for reconstructing conserved gene neighborhoods will be described in detail elsewhere (I.B.Rogozin, K.S.Makarova and E.V.Koonin, unpublished data). Briefly, the following steps were implemented. The collection of clusters of orthologous groups (COGs) of proteins from complete genomes (48,49) was used as the source of information on orthologous relationships for detecting conserved gene pairs. A pair of genes from two COGs was considered ‘conserved’ if the corresponding genes were separated by none, one or two genes in at least three of the compared genomes. At the next step, overlapping gene pairs were joined in triplets, each of which was required to exist in at least one genome. Overlapping triplets were used to construct gene arrays by walk search in an oriented graph; a gene array may or may not be found in its entirety in any available genome (I.B.Rogozin, K.S.Makarova and E.V.Koonin, unpublished data). Finally, gene arrays that shared at least three COGs were clustered into neighborhoods by using a single-linkage clustering algorithm. All these steps were implemented in the program GENE_NEIGHBOR, which ran automatically without human control at intermediate stages (I.B.Rogozin, K.S.Makarova and E.V.Koonin, unpublished data; available upon request). The resulting gene neighborhoods were amended manually by adding genes that were located next to the genes of an automatically delineated neighborhood in at least one genome, but did not fit the criteria outlined above.
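Two of the steps outlined above, detection of conserved gene pairs and single-linkage clustering of gene arrays, can be sketched as follows. This is a simplified illustration under stated assumptions, not the GENE_NEIGHBOR program: each genome is assumed to be a linear list of COG labels in chromosomal order (`None` for genes without a COG), pairs are treated as unordered, strand and replicon circularity are ignored, and the triplet and graph-walk steps are omitted. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def conserved_pairs(genomes, max_gap=2, min_genomes=3):
    """Find COG pairs whose genes lie within `max_gap` intervening genes
    of each other in at least `min_genomes` genomes."""
    support = defaultdict(set)
    for name, cogs in genomes.items():
        for i, a in enumerate(cogs):
            if a is None:
                continue
            # Neighbors separated from gene i by 0..max_gap other genes.
            for b in cogs[i + 1 : i + 2 + max_gap]:
                if b is not None and b != a:
                    support[frozenset((a, b))].add(name)
    return {pair for pair, names in support.items() if len(names) >= min_genomes}


def cluster_arrays(arrays, min_shared=3):
    """Single-linkage clustering: two arrays are linked when they share at
    least `min_shared` COGs; linked arrays are merged transitively."""
    sets = [set(a) for a in arrays]
    parent = list(range(len(sets)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if len(sets[i] & sets[j]) >= min_shared:
                parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, s in enumerate(sets):
        merged[find(i)] |= s
    return sorted(sorted(group) for group in merged.values())


# Toy input (hypothetical labels, not real COG data).
toy_genomes = {
    "g1": ["A", "B", "C", "D"],
    "g2": ["A", "x", "B", "C"],
    "g3": ["B", "A", "y", "z", "C"],
    "g4": ["D", "q1", "q2", "q3", "A"],
}
pairs = conserved_pairs(toy_genomes)

toy_arrays = [
    ["A", "B", "C", "D"],
    ["B", "C", "D", "E", "F", "G"],
    ["E", "F", "G", "H"],
    ["X", "Y", "Z"],
]
neighborhoods = cluster_arrays(toy_arrays)
```

In the toy clustering, the first and third arrays share no COGs but end up in one neighborhood because each overlaps the second array. This mirrors the point made in the text that a neighborhood need not be present in its entirety in any single genome.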
An exhaustive search for conserved gene neighborhoods in the available complete bacterial and archaeal genomes (I.B.Rogozin, K.S.Makarova and E.V.Koonin, unpublished data) revealed only three large (more than 10 genes) neighborhoods that were represented predominantly in Archaea, along with a few bacterial species. Of these, the partially conserved superoperons that encode ribosomal proteins and subunits of the proton-transporting ATPase complex have been identified and analyzed in detail previously (16 and references therein). The third neighborhood, which is chiefly represented in Archaea and hyperthermophilic bacteria, is even larger, with genes that belong to over 20 COGs, but shows greater diversity than the ribosomal and ATPase neighborhoods, in terms of gene order. This neighborhood includes mostly genes without known or predicted functions. At the time of the genome comparison that resulted in the identification of this neighborhood, clear functional assignments were available for only two of its constituent genes. These genes encode a predicted DNA helicase and a predicted RecB family nuclease, which, by extension, could suggest a role in DNA repair for the entire neighborhood. Prompted by these observations, we undertook a detailed comparison of the potential operons comprising this neighborhood in different genomes and an in-depth analysis of the conserved domains that could be identified in the proteins encoded by uncharacterized genes.
Diverse versions of this neighborhood were detected in all completely sequenced archaeal species, with the exception of Thermoplasma acidophilum (50) and Halobacterium sp. NRC-1 (51), in both available genomes of bacterial hyperthermophiles, T.maritima and A.aeolicus, and in some bacterial mesophiles, namely B.halodurans, M.tuberculosis and S.pyogenes. The corresponding genome region from the bacterial hyperthermophile A.aeolicus was chosen as a template to produce a template-anchored multiple alignment (16) of the analyzed neighborhood because it had the longest potential superoperon, comprising 18 genes (Fig. 1A). Although not a single gene is present in all genomes that have the analyzed neighborhood, a distinct group of five core genes that are conserved in the great majority of these genomes, often in the same order, was identified (Fig. 1A and Table 1). This conserved core of the putative new repair system shows the following predominant gene order: COG1857-COG1688-COG1203-COG1468-COG1518 (Fig. 1A). The sixth gene, which is not a part of this array, but is present within the neighborhood in most genomes, is COG1353, which typically is found in close proximity with one or more genes of COGs 1336, 1367, 1604, 1337 and 1332 (Fig. 1A).
The core gene array includes those components of the putative repair system for which straightforward functional prediction was possible. All proteins of COG1203 contain a typical superfamily II helicase domain. In most of these proteins (with the exception of MJ0383, PAB1689, PH0917, APE1232 and AF1874), the helicase domain is fused to a predicted HD-nuclease domain (52). Fusion of helicase and nuclease domains is characteristic of many repair systems. For example, the bacterial RecB protein contains a fusion of a Superfamily I helicase and the eponymous nuclease, whereas the eukaryotic RAD1 protein is a Superfamily II helicase fused to a predicted ERCC4 family nuclease and the Werner syndrome protein displays a fusion of the 3′→5′ exonuclease and a SF-II helicase module (14,15,53). However, the specific combination of helicase and nuclease domains seen in the COG1203 proteins has not been described so far. Those species that have a stand-alone helicase in the core of the putative repair system (e.g. P.abyssi and A.pernix) possess either an extra copy of the fusion gene or a stand-alone predicted HD-nuclease (COG2254) that is typically adjacent to the core gene array (Fig. 1A).
The proteins of COG1468 belong to the RecB nuclease family and contain all the conserved catalytic residues characteristic of these nucleases. A distinctive feature of these proteins is the presence of a C-terminal module that contains three conserved cysteines and might mediate metal-dependent DNA binding. RecB protein, which is sporadically distributed in bacteria, contains helicase and nuclease domains and is a subunit of the RecBCD recombinase complex, one of the major systems of recombinational repair in E.coli (54–56). In addition to the RecB protein, RecB family nucleases are often fused to SF-I helicases of other subfamilies (e.g. in Synechocystis protein sll1582 to DNAI/HCS1-like helicase and in MTH472 from M.thermoautotrophicum to PCRA/UvrD-like helicase) (53).
COG1518 consists of proteins that do not have detectable homologs with known functions. However, examination of the multiple alignment of these proteins revealed a pattern of conserved acidic residues (Fig. 2) that are often present in catalytic sites of various families of nucleases (53). Considering the strongly conserved association of the gene coding for this protein with the genes coding for two other nucleases (COG1203/2254 and COG1468) and a helicase (COG1203), we consider it likely that this protein is a previously undetected nuclease that functions within the putative novel DNA repair system. In contrast to most known nuclease families, which include α/β proteins, but similarly to the HD-superfamily (52), secondary structure prediction indicates that COG1518 proteins have an all-α structure that probably represents a novel nuclease fold. In addition to their position within the core of the putative repair system, genes for the COG1518 proteins were found in an alternative gene array that is conserved in several bacterial species (Fig. 1B). In particular, COG3513 consists of large, probably multidomain proteins that contain a diverged McrA/T4-Endo-VII nuclease domain (53).
Two uncharacterized proteins that belong to the core of the putative repair system, COG1857 and COG1688, are predicted to possess α/β folds, as suggested by secondary structure prediction, but do not contain any conserved motifs with potential catalytic amino acids (data not shown). These proteins are homologous to the DevR and DevS gene products from Myxococcus xanthus, respectively. In M.xanthus, devRS is an autoregulated gene locus that is essential for fruiting body development, but this connection provides no clues as to the possible biochemical functions of the proteins in question (57). No further information was obtained on COG1857 despite extensive sequence searches. However, COG1688 turned out to be distantly related to several other COGs associated with this system as described below.
Another part of the analyzed gene neighborhood centers on the gene coding for multidomain proteins of COG1353. Many of the proteins in this COG contain an N-terminal domain that is a distinct version of an HD-superfamily hydrolase (52) with a circular permutation, in which the N-terminal metal-binding histidine is displaced to the extreme C-terminus of the HD domain (Fig. 3E). However, in some of these proteins, the HD domain is disrupted, whereas others, such as aq_357 from A.aeolicus, lack the HD domain altogether. The conserved C-terminus shared by almost all these proteins has three distinct domains (Fig. 4); the first of these is a distinct globular α + β domain that was not detected in any other proteins (Fig. 3D), whereas the second one is a Zn ribbon (Fig. 3C) that is seen in numerous contexts including nucleic acid interaction (58). The C-terminal domain of these proteins, which is present in a stand-alone form in SSO1429, is homologous to the catalytic domain of diverse DNA and RNA polymerases. Early sequence and structure comparisons showed that reverse transcriptases, viral RNA-dependent RNA polymerases and DNA polymerases of superfamilies A and B share a common catalytic core domain (59–61). This core domain was also detected in signal-transducing adenylyl cyclases and bacterial nucleotide cyclases typified by the GGDEF domain (62–64), some of which may possess diguanylate cyclase activity (65). The core palm-domain of these proteins contains an RNA recognition motif (RRM)-like fold with a β-α-β2-α-β topology; in nucleic acid polymerases, predominantly α-helical structures (the polymerase ‘fingers’) are inserted into the RRM-like domain upstream of helix-1 (Fig. 3A).
PSI-BLAST searches with the C-terminal domains of the COG1353 proteins detected the GGDEF domains with statistically significant E-values (10⁻⁴–10⁻⁵ in the 3rd–4th iterations). Furthermore, GGDEF domains and the putative polymerase domains of the COG1353 proteins shared an extended region of similarity beyond the core palm domain (62). Alignment-based secondary structure predictions (66) for these domains were compatible with the palm-domain structures of nucleic acid polymerases and nucleotide cyclases. Sequence–structure threading through the PDB database with both the combined fold recognition method (Z-score = 12.1) and the 3D-PSSM method (E-value 0.02; E-values up to 0.8 are normally considered significant for threading through the PDB database with this method) gave the adenylyl cyclases and DNA polymerases as the best hits. This strongly suggests that COG1353 proteins contain the same core fold as the palm domain of polymerases and cyclases.
A multiple alignment of the conserved portion of the C-terminal domain of COG1353 proteins with several families of polymerases, including family B DNA-directed DNA polymerases, reverse transcriptases, RNA-directed RNA polymerases of positive-strand RNA viruses and the two families of nucleotide cyclases, revealed the conservation of two distinct motifs (Fig. 3B) in this entire diverse array of proteins (Fig. 3A and B). These motifs contain the conserved negatively charged residues (most often aspartates), which function as divalent metal ligands in the catalytic centers of these polymerases and are readily identifiable in the COG1353 proteins, suggesting similar catalytic activities (Fig. 3A and B). Unlike the related GGDEF and adenylyl cyclase domains, the polymerase/cyclase-related domains of COG1353 proteins are never fused to the classic signaling domains such as PAS, GAF, HAMP or CACHE (64,67). Instead, they are fused to Zn ribbons, which are present in a variety of DNA polymerases, such as DNA polymerase of the eukaryotes and the euryarchaeal DNA polymerase II (58). In one of the COG1353 members, MTH1082, a Zn ribbon is inserted directly within the polymerase/cyclase domain after helix 1 (Fig. 4). The presence of the predicted HD-hydrolase domain fused to the N-terminus of this novel polymerase-related domain develops the theme of functional and, in most cases, physical association of various phosphohydrolase domains and nucleic acid polymerases (68) (Fig. 4). It has been hypothesized that these domains or subunits function as pyrophosphatases that cleave the pyrophosphate formed during nucleotide polymerization and thus drive forward the polymerase reaction (68). The same function is most likely for the HD-domain of the COG1353 proteins, although it is also possible that these predicted phosphoesterase domains function as uncharacterized nucleases in conjunction with the DNA polymerases.
Taken together, these observations strongly suggest that the C-terminal domain of the COG1353 proteins is a previously undetected DNA polymerase distantly related to all of the above polymerase families. Given their degree of conservation and widespread presence in the Archaea, it appears most likely that this predicted DNA polymerase evolved from a common ancestor with other polymerases at an early stage of archaeal evolution. The GGDEF cyclases and adenylyl cyclase, which are more closely related to these predicted DNA polymerases, might have been derived from them and subsequently disseminated by horizontal gene transfer (HGT).
Genes of COGs 1336, 1367, 1604, 1337, 1332 and 1583 are always seen in close proximity to the predicted COG1353 polymerase, often in tandem (Fig. 1A). PSI-BLAST searches revealed relatively weak but, in many cases, statistically significant similarity between proteins from these COGs. For example, in a search starting with the sequence of the Rv2821c protein (COG1337), with the profile-inclusion cut-off set at E = 0.01, some members of COG1332 are detected in the second iteration, members of COG1604 and COG1336 appear in the fourth iteration and members of COG1367 in the fifth iteration. In the same search, a member of COG1567 (PH0166) appears above the cut-off in the fourth iteration. A reverse search started with the PH0166 sequence as the query detects all members of COG1567 and includes the first protein of COG1337 (MTH1080) on the fifth iteration. In the latter search, some proteins from other COGs represented in the neighborhood also appear, albeit with E-values that are below the cut-off. In particular, in the sixth iteration, proteins SSO1437 (COG1583) and AF0067 (COG1688) were detected with E-values of 0.13 and 0.29, respectively.
Proteins from COGs 1336, 1367, 1604, 1337 and 1332 produced a multiple alignment with many conserved positions and five common motifs (Fig. 5), which supports the notion that these COGs belong to the same protein family. The remaining COGs detected in BLAST searches failed to unequivocally align with the above five COGs, but shared at least some of the same conserved motifs and appeared to be compatible with the same fold as judged by similar patterns of predicted secondary structure elements (Fig. 5; alignments are available upon request). Thus, we believe that all these COGs comprise a previously undetected superfamily of repair-associated mysterious proteins (RAMPs). To identify all potential members of the RAMP superfamily and improve multiple alignments, we used all identified proteins as query sequences for exhaustive PSI-BLAST searches. This resulted in the identification of over 90 RAMPs, mostly in Archaea. All families of the RAMP superfamily share at least one C-terminal motif, which contains a glycine-rich loop (motif V, Fig. 5). Two other conserved motifs in the N-terminal portion of RAMPs show distinct structural features. Motif II also consists of a loop followed by an α-helix (Fig. 5). Motif I is a β-strand followed by a conserved glycine. However, none of these motifs was detectable in COG1583 members; this COG remains a provisional member of the RAMP superfamily (Fig. 5).
Genes coding for several other uncharacterized proteins tend to be associated with the core genes of the putative repair system. One of these (COG1343) shows a patchy distribution among bacteria and Archaea, and others are seen specifically in a subset of bacteria (e.g. COG3649 in T.maritima, B.halodurans and S.pyogenes) or Archaea (COG3574 in Pyrococci, A.fulgidus and S.solfataricus). COG3578, which is represented in A.fulgidus, S.solfataricus and A.pernix, includes additional members of the RecB nuclease family.
Several other proteins are typically encoded in the vicinity of the predicted new polymerase gene. Two of these (COG1517 and COG3574) are specific for Archaea and two others were detected in both Archaea and bacteria (COG3337 and COG1421). Members of COG1517 contain a distinct motif with a ‘hhDhoH’ signature and several other conserved polar amino acids, which could contribute to a potential catalytic center of an enzyme, perhaps yet another nuclease (alignment is available upon request). Members of the remaining polymerase-associated COGs are small proteins, for which no functional prediction is currently possible. At least one more functional prediction can be made on the basis of operon organization conservation. A gene for a predicted helix–turn–helix transcriptional regulator (COG2462) is located within the neighborhood in most archaeal genomes. This Archaea-specific protein is likely to regulate the expression of one or more of the operons in the analyzed neighborhood.
Although none of the genes in the analyzed neighborhood has been functionally characterized, the repertoire of predicted functions, which include a DNA helicase, several DNases and a polymerase (Table 1), strongly suggests that this neighborhood consists of genes together comprising a previously undetected DNA repair system. The uncharacterized proteins encoded within this neighborhood might function as accessory, DNA-binding or regulatory subunits of these repair complexes and perhaps as a sliding clamp for the predicted DNA polymerase. Of particular interest in this latter context are the RAMP proteins, which, given their remarkable proliferation, could form large hetero-oligomeric complexes.
A more specific functional role for the predicted repair system could be that of a functional equivalent of the mutagenic repair systems of bacteria and eukaryotes, which center around the translesion DNA polymerases of the UmuC-DinB-Rad30-Rev1 superfamily (69,70). A priori, such a system should be considered important in thermophiles, to counteract DNA damage caused by exposure to high temperatures. Among thermophilic organisms whose genomes have been completely or partially sequenced, only Sulfolobus encodes a predicted UmuC-DinB-Rad30-Rev1 superfamily polymerase (71). In particular, although this type of polymerase is present in most free-living bacteria, it is conspicuously missing in the hyperthermophiles T.maritima and A.aeolicus. In contrast, the only sequenced genome of a mesophilic archaeon, that of Halobacterium NRC-1, does have a DinB ortholog, but completely lacks the predicted new repair system. Thus, it seems plausible that the predicted novel repair system is the thermophilic counterpart of the mesophilic translesion repair system that also mediates adaptive mutagenesis (72). Technically, it is possible that the functional system encoded by the gene neighborhood described here is involved in RNA metabolism rather than in DNA repair. This, however, appears unlikely given that the most characteristic genes of this neighborhood encode a RecB family nuclease, which is specifically associated with repair systems (73) and a predicted polymerase fused to a phosphoesterase domain, an architecture typical of DNA polymerases (74).
The predicted novel repair system shows conspicuous evolutionary plasticity. This becomes particularly obvious when genomes of closely related species, such as two Thermoplasmas, two Bacilli and three Pyrococci, are compared (Fig. 1A). Thermoplasma acidophilum does not encode any members of this system, whereas T.volcanium has what seems to be a rudimentary form, with only the predicted polymerase, the putative nuclease of COG1518, and several RAMPs and uncharacterized proteins. The disruption of the repair system probably started already in the common ancestor of the two Thermoplasma species, with T.acidophilum subsequently losing it completely. Bacillus halodurans has two large putative operons with genes for components of the predicted repair system, but B.subtilis has no counterpart to any of these genes. Among the three Pyrococcus species, P.abyssi resembles T.volcanium in having only remnants of the system, represented by the helicase, a stand-alone HD-nuclease and three RAMPs. Pyrococcus furiosus has a more complete system, which also includes the RecB family nuclease, the predicted polymerase and extra RAMPs. Finally, P.horikoshii has a complete, complex system, with triplication of the helicase and surrounding genes. Thus, substantial changes in this system tend to occur relatively rapidly, on a time-scale commensurate with the evolution of individual species. Gene loss is prominent among these evolutionary modifications, but amplification of parts of the system, and possibly acquisition of additional members through HGT (see below) also take place.
Cross-genome comparison of the repertoires and organization of the genes encoding components of the predicted novel repair system supports the notion that the core (helicase-nuclease) and the polymerase-RAMP modules of the superoperon have a degree of independence. The examples discussed above indicate that one of the modules can be independently lost. They also undergo rearrangement that includes reversal of the relative orientation of the modules in a superoperon (compare the gene organization in A.aeolicus and T.maritima in Figure 1A) and probable operon disruption (compare the gene organization in A.aeolicus and A.fulgidus). Furthermore, on some occasions, the modules have undergone independent amplification, such as the duplication of the polymerase module in T.maritima and triplication of the helicase-nuclease module in P.horikoshii.
The modular gene/operon organization of the predicted repair system probably also entails functional modularity. It seems likely that the stand-alone helicase-nuclease module that is present, for example, in P.abyssi and A.pernix (Fig. 1A) retains some limited functionality in repair, but probably functions on its own or in conjunction with a different polymerase. Conversely, in T.volcanium and M.tuberculosis, the predicted DNA polymerase might interact with a distinct helicase. Furthermore, the deletion of some of the predicted nucleases in P.abyssi, T.volcanium, B.halodurans, S.pyogenes and E.coli suggests a degree of redundancy among these enzymes. Such partial functional redundancy is typical of other repair pathways (75).
Overall, the number of genes coding for components of the predicted repair system varies to a great degree between genomes (Figs 1 and 6), with about 90 genes in S.solfataricus (>3% of all genes in this genome) and the minimal set of three genes in E.coli. The prevalence of this system in thermophiles is obvious. Bacillus halodurans is the only mesophile that has the principal genes of both the helicase-nuclease and the polymerase-RAMP modules and, even in this case, the system is less elaborate than it is in most thermophiles (Figs 1A and 6). Most mesophiles have no trace of this system, and several species in which it is represented have only remnants of one or both modules (Fig. 1A). A search of unfinished prokaryotic genomes detected homologs of different proteins from the new repair system, particularly of the helicase-nuclease module, in many diverse bacteria (Table 2). Again, the three thermophiles for which large amounts of genome sequence were available, Chlorobium tepidum, Carboxydothermus hydrogenoformans and Bacillus stearothermophilus, showed a greater representation of this system than mesophiles (Table 2).
The obvious plasticity of the new repair system raises the issue of a possible role of HGT in its evolution (as already alluded to above). The notion that HGT occurred more than once during evolution of this system is supported by notable conservation of certain gene arrays in phylogenetically distant genomes (Fig. 1A). The strongest case in point is the conservation of the gene order in the polymerase module between the archaeon A.fulgidus and the bacteria A.aeolicus and B.halodurans (Fig. 1A). These observations suggest that these gene cassettes (probable operons) disseminated via HGT as a single entity. In fact, in an early comparison, the apparent superoperon that comprises the predicted repair system in A.aeolicus has been noticed as the largest constellation of ‘archaeal’ genes in the genome of this hyperthermophilic bacterium, and its presence was one of the arguments supporting massive HGT between bacterial and archaeal hyperthermophiles (76).
To examine further the contribution of HGT to the evolution of the predicted new repair system, phylogenetic trees were constructed for the four genes that are most common in the conserved neighborhood (COGs 1203, 1518, 1468 and 1353). All four trees showed clear indications of multiple HGT events (Fig. 7). In particular, each tree supports HGT from Archaea to A.aeolicus, in agreement with the conservation of gene order between this bacterium and some Archaea (see above). The tree topologies suggest independent HGT events between Archaea and different bacterial groups as well as between different bacterial lineages. For example, in the trees for the putative novel nuclease (COG1518) and the RecB-type nuclease (COG1468), the proteins from B.halodurans and B.stearothermophilus occupy very different positions instead of being adjacent as expected from the phylogeny of the corresponding species (Fig. 7A and C). The former belongs to a cluster of several bacterial species, which is located between two archaeal clusters, whereas the latter is part of another, smaller group of diverse bacterial species, which lies within one of the archaeal clusters (Fig. 7A and C). Furthermore, in the tree for COG1203 helicases, a third bacterial cluster, which combines proteins from proteobacteria, the Bacillus-Clostridium group of Gram-positive bacteria and a spirochete, joins the second archaeal cluster (Fig. 7B). Thus, the topology of this tree can be explained through three independent HGT events between Archaea and bacteria, followed by some additional HGT within the bacterial and possibly archaeal domains. Alternatively, it could be postulated that the existence of the third bacterial cluster, which is separated from the Archaea by a long branch, reflects vertical inheritance from the last common ancestor of Archaea and bacteria, with subsequent multiple gene losses resulting in the extant patchy phyletic distribution.
Unlike the trees for the other three analyzed proteins, the tree for predicted polymerases has B.halodurans and B.stearothermophilus proteins in the same cluster (Fig. 7D); this emphasizes distinct evolutionary fates of different genes within the predicted new repair system.
The apparent multiple HGT and gene loss events preclude a definitive conclusion as to the origin of the predicted repair system described here. One scenario would posit that this system originally evolved in hyperthermophilic Archaea and subsequently was disseminated through the prokaryotic world via multiple HGTs. Under this scenario, many mesophilic bacteria acquired (parts of) this system from thermophiles and subsequently lost some of the acquired genes. An alternative possibility, which is best compatible with the hypothesis that the last universal common ancestor of modern life forms was a hyperthermophile (77,78), is that the core of this system already existed in this hypothetical ancestral organism, with numerous coordinated gene losses occurring in various lineages that became mesophilic. One such lineage is the eukaryotes whose common ancestor might have originally inherited this repair system. A variant of this hypothesis is that the helicase-nuclease and polymerase-RAMP modules evolved independently at a very early stage of evolution. Subsequently, they might have been brought together to form a single repair system in Archaea, and this system was acquired by some, primarily thermophilic bacteria via HGT. At a mechanistic level, the association of the predicted repair system with thermophily and the apparent near incompatibility of this system with the translesion repair pathway based on UmuC-DinB-Rad30-Rev1 superfamily polymerases remain mysterious and, hopefully, will be targets for future experimental studies.
A previously undetected DNA repair system that is largely specific for thermophiles was predicted through the use of a relatively permissive approach to gene context analysis, examination of partially conserved gene neighborhoods, which does not emphasize exact conservation of local gene order. The use of such an approach was important because of extreme evolutionary plasticity of the novel repair system. The evolution of this system appears to have involved frequent genomic rearrangements, modular and sporadic gene loss and multiple HGT events. Experimental validation of the predictions made here should include both demonstration of individual biochemical activities, particularly those of the predicted novel polymerase and nuclease, and of the RAMPs, and elucidation of the physiological role of the system as a whole. The latter type of experiments might shed light on the intriguing and unsolved question: how do thermophiles cope with the increased level of DNA damage that is inevitable in their natural habitats?
The release of unfinished genome sequences by The Institute for Genomic Research, the Sanger Center, Oklahoma’s Advanced Center for Genome Technology and Washington University Genome Sequencing Center is gratefully acknowledged. K.M. is supported by the Microbial Genome Program, Office of Biological and Environmental Research, DOE (DE-FG02-98ER62583).