Sequence profile analysis of the RelE- and ParE-type post-segregational cell killing (PSK) toxins from diverse bacteria and archaea has unified these proteins into a single superfamily. Further comparative analysis suggests that the core of the eukaryotic nonsense-mediated RNA decay system has probably evolved from a PSK-related system.
Several prokaryotic plasmids maintain themselves in their hosts by means of diverse post-segregational cell killing systems. Recent findings suggest that chromosomally encoded copies of toxins and antitoxins of post-segregational cell killing systems - such as the RelE system - might function as regulatory switches under stress conditions. The RelE toxin cleaves ribosome-associated transcripts, whereas another post-segregational cell killing toxin, ParE, functions as a gyrase inhibitor.
Using sequence profile analysis we were able to unify the RelE- and ParE-type toxins with several families of small, uncharacterized proteins from diverse bacteria and archaea into a single superfamily. Gene neighborhood analysis showed that the majority of these proteins were encoded by genes in characteristic neighborhoods, in which genes encoding toxins always co-occurred with genes encoding transcription factors that are also antitoxins. The transcription factors accompanying the RelE/ParE superfamily may belong to unrelated or distantly related superfamilies, however. We used this conserved neighborhood template to transitively search genomes and identify novel post-segregational cell killing-related systems. One of these novel systems, observed in several prokaryotes, contained a predicted toxin with a PilT-N terminal (PIN) domain, which is also found in proteins of the eukaryotic nonsense-mediated RNA decay system. These searches also identified novel transcription factors (antitoxins) in post-segregational cell killing systems. Furthermore, the toxin Doc defines a potential metalloenzyme superfamily, with novel representatives in bacteria, archaea and eukaryotes, that probably acts on nucleic acids.
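The transitive use of a conserved neighborhood template can be illustrated with a minimal sketch. It assumes that each genome has already been reduced to an ordered list of protein-family labels; the labels (`RelE_like`, `TF_x`, `PIN_like`), the toy genomes, the function names and the support threshold are all hypothetical and are not taken from the study, which relied on sequence profile searches rather than pre-assigned labels.

```python
from collections import Counter, deque

# Toy data (hypothetical): each genome is an ordered list of family labels.
TOY_GENOMES = {
    "g1": ["a1", "TF_x", "RelE_like", "a2"],
    "g2": ["b1", "RelE_like", "TF_x", "b2"],
    "g3": ["c1", "TF_x", "PIN_like", "c2"],
    "g4": ["d1", "PIN_like", "TF_x", "d2"],
}

def conserved_neighbors(genomes, family, min_support=2):
    """Families found directly next to `family` in at least `min_support` genomes."""
    support = Counter()
    for genes in genomes.values():
        here = set()
        for i, fam in enumerate(genes):
            if fam == family:
                here.update(genes[max(i - 1, 0):i])   # left neighbor, if any
                here.update(genes[i + 1:i + 2])       # right neighbor, if any
        here.discard(family)
        support.update(here)                          # count each genome once
    return {fam for fam, n in support.items() if n >= min_support}

def transitive_search(genomes, seed, max_rounds=2, min_support=2):
    """Breadth-first expansion from a seed family; returns {family: round found}."""
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        fam = queue.popleft()
        if seen[fam] >= max_rounds:
            continue
        for nb in sorted(conserved_neighbors(genomes, fam, min_support)):
            if nb not in seen:
                seen[nb] = seen[fam] + 1
                queue.append(nb)
    return seen

result = transitive_search(TOY_GENOMES, "RelE_like")
```

In this toy case the known toxin family leads to its conserved partner in the first round, and that partner leads to a second, unrelated family in the next round. The support threshold is what keeps sporadic flanking genes from being pulled into the search.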
The tightly maintained gene neighborhoods of post-segregational cell killing-related systems appear to have evolved by in situ displacement of genes for toxins or antitoxins by functionally equivalent but evolutionarily unrelated genes. We predict that the novel post-segregational cell killing-related systems containing a PilT-N terminal domain toxin and the eukaryotic nonsense-mediated RNA decay system are likely to function via a common mechanism, in which the PilT-N terminal domain cleaves ribosome-associated transcripts. The core of the eukaryotic nonsense-mediated RNA decay system has probably evolved from a post-segregational cell killing-related system.
In eukaryotes, RNA interference (RNAi) is a major mechanism of defense against viruses and transposable elements as well as of regulating translation of endogenous mRNAs. The RNAi systems recognize the target RNA molecules via small guide RNAs that are completely or partially complementary to a region of the target. Key components of the RNAi systems are proteins of the Argonaute-PIWI family, some of which function as slicers, the nucleases that cleave the target RNA that is base-paired to a guide RNA. Numerous prokaryotes possess the CRISPR-associated system (CASS) of defense against phages and plasmids that is, in part, mechanistically analogous but not homologous to eukaryotic RNAi systems. Many prokaryotes also encode homologs of Argonaute-PIWI proteins but their functions remain unknown.
We present a detailed analysis of Argonaute-PIWI protein sequences and the genomic neighborhoods of the respective genes in prokaryotes. Whereas eukaryotic Ago/PIWI proteins always contain PAZ (oligonucleotide binding) and PIWI (active or inactivated nuclease) domains, the prokaryotic Argonaute homologs (pAgos) fall into two major groups in which the PAZ domain is either present or absent. The monophyly of each group is supported by a phylogenetic analysis of the conserved PIWI-domains. Almost all pAgos that lack a PAZ domain appear to be inactivated, and the respective genes are associated with a variety of predicted nucleases in putative operons. An additional, uncharacterized domain that is fused to various nucleases appears to be a unique signature of operons encoding the short (lacking PAZ) pAgo form. By contrast, almost all PAZ-domain containing pAgos are predicted to be active nucleases. Some proteins of this group (e.g., that from Aquifex aeolicus) have been experimentally shown to possess nuclease activity, and are not typically associated with genes for other (putative) nucleases. Given these observations, the apparent extensive horizontal transfer of pAgo genes, and their common, statistically significant over-representation in genomic neighborhoods enriched in genes encoding proteins involved in the defense against phages and/or plasmids, we hypothesize that pAgos are key components of a novel class of defense systems. The PAZ-domain containing pAgos are predicted to directly destroy virus or plasmid nucleic acids via their nuclease activity, whereas the apparently inactivated, PAZ-lacking pAgos could be structural subunits of protein complexes that contain, as active moieties, the putative nucleases that we predict to be co-expressed with these pAgos. All these nucleases are predicted to be DNA endonucleases, so it seems most probable that the putative novel phage/plasmid-defense system targets phage DNA rather than mRNAs. 
Given that in eukaryotic RNAi systems, the PAZ domain binds a guide RNA and positions it on the complementary region of the target, we further speculate that pAgos function on a similar principle (the guide being either DNA or RNA), and that the uncharacterized domain found in putative operons with the short forms of pAgos is a functional substitute for the PAZ domain.
The hypothesis that pAgos are key components of a novel prokaryotic immune system that employs guide RNA or DNA molecules to degrade nucleic acids of invading mobile elements implies a functional analogy with the prokaryotic CASS and a direct evolutionary connection with eukaryotic RNAi. The predictions of the hypothesis including both the activities of pAgos and those of the associated endonucleases are readily amenable to experimental tests.
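The claim of statistically significant over-representation of pAgo genes in defense-enriched neighborhoods can be made concrete with a small worked example. The sketch below uses a one-sided hypergeometric test, which is an assumption: the abstract does not state which test was applied. All counts in the usage line are invented for illustration only.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for a hypergeometric draw.

    N: total number of genes considered
    K: genes located in defense-enriched neighborhoods
    n: genes of interest (for example, pAgo genes)
    k: genes of interest observed in defense-enriched neighborhoods
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Invented numbers: 40 of 60 genes of interest fall in regions that
# hold only 5,000 of 100,000 genes overall.
p_value = hypergeom_sf(40, 100_000, 5_000, 60)
```

A small tail probability indicates that the observed co-localization would be unlikely if the genes of interest were placed without regard to defense islands.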
This article was reviewed by Daniel Haft, Martijn Huynen, and Chris Ponting.
Metabolic pathways in eubacteria and archaea often are encoded by operons and/or gene clusters (genome neighborhoods) that provide important clues for assignment of both enzyme functions and metabolic pathways. We describe a bioinformatic approach (genome neighborhood network; GNN) that enables large scale prediction of the in vitro enzymatic activities and in vivo physiological functions (metabolic pathways) of uncharacterized enzymes in protein families. We demonstrate the utility of the GNN approach by predicting in vitro activities and in vivo functions in the proline racemase superfamily (PRS; InterPro IPR008794). The predictions were verified by measuring in vitro activities for 51 proteins in 12 families in the PRS that represent ∼85% of the sequences; in vitro activities of pathway enzymes, carbon/nitrogen source phenotypes, and/or transcriptomic studies confirmed the predicted pathways. The synergistic use of sequence similarity networks and GNNs will facilitate the discovery of the components of novel, uncharacterized metabolic pathways in sequenced genomes.
DNA molecules are polymers in which four nucleotides—guanine, adenine, thymine, and cytosine—are arranged along a sugar backbone. The sequence of these four nucleotides along the DNA strand determines the genetic code of the organism, and can be deciphered using various genome sequencing techniques. Microbial genomes are particularly easy to sequence as they contain fewer than several million nucleotides, compared with the 3 billion or so nucleotides that are present in the human genome.
Reading a genome sequence is straightforward, but predicting the physiological functions of the proteins encoded by the genes in the sequence can be challenging. In a process called genome annotation, the function of a protein is predicted by comparing the relevant gene to the genes of proteins with known functions. However, microbial genomes and proteins are hugely diverse and over 50% of the microbial genomes that have been sequenced have not yet been related to any physiological function. With thousands of microbial genomes waiting to be deciphered, large scale approaches are needed.
Zhao et al. take advantage of a particular characteristic of microbial genomes. DNA sequences that code for two proteins required for the same task tend to be closer to each other in the genome than two sequences that code for unrelated functions. Operons are an extreme example; an operon is a unit of DNA that contains several genes that are expressed as proteins at the same time.
Zhao et al. have developed a bioinformatic method called the genome neighbourhood network approach to work out the function of proteins based on their position relative to other proteins in the genome. When applied to the proline racemase superfamily (PRS), which contains enzymes with similar sequences that can catalyze three distinct chemical reactions, the new approach was able to assign a function to the majority of proteins in a public database of PRS enzymes, and also revealed new members of the PRS family. Experiments confirmed that the proteins behaved as predicted. The next challenge is to develop the genome neighbourhood network approach so that it can be applied to more complex systems.
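The core idea of the genome neighbourhood network approach, tallying which protein families recur near members of a query family, can be sketched as follows. This is a simplified illustration and not the published implementation: the window size, the family labels (`PRS`, `famA` and so on), the toy genomes and the function name are all assumptions.

```python
from collections import Counter

# Toy data (hypothetical): ordered family labels per genome.
TOY_GENOMES = {
    "g1": ["famA", "PRS", "famB"],
    "g2": ["famC", "famA", "PRS"],
    "g3": ["PRS", "famD", "famA"],
}

def neighborhood_profile(genomes, query_family, window=10):
    """Fraction of query genes that have each family within +/- `window` genes."""
    hits = 0
    counts = Counter()
    for genes in genomes.values():
        for i, fam in enumerate(genes):
            if fam != query_family:
                continue
            hits += 1
            lo = max(i - window, 0)
            nearby = set(genes[lo:i]) | set(genes[i + 1:i + 1 + window])
            nearby.discard(query_family)
            counts.update(nearby)          # each family counted once per query gene
    if hits == 0:
        return {}
    return {fam: n / hits for fam, n in counts.items()}

profile = neighborhood_profile(TOY_GENOMES, "PRS", window=2)
```

Families with a high co-occurrence fraction are the natural candidates for enzymes acting in the same pathway as the query family, which is the kind of clue the approach uses to propose functions.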
sequence similarity network; genome neighborhood network; functional assignment; other
Recently, it has been shown that a predicted P-loop ATPase (the HerA or MlaA protein), which is highly conserved in archaea and also present in many bacteria but absent in eukaryotes, has a bidirectional helicase activity and forms hexameric rings similar to those described for the TrwB ATPase. In this study, the FtsK–HerA superfamily of P-loop ATPases, in which the HerA clade comprises one of the major branches, is analyzed in detail. We show that, in addition to the FtsK and HerA clades, this superfamily includes several families of characterized or predicted ATPases which are predominantly involved in extrusion of DNA and peptides through membrane pores. The DNA-packaging ATPases of various bacteriophages and eukaryotic double-stranded DNA viruses also belong to the FtsK–HerA superfamily. The FtsK protein is the essential bacterial ATPase that is responsible for the correct segregation of daughter chromosomes during cell division. The structural and evolutionary relationship between HerA and FtsK and the nearly perfect complementarity of their phyletic distributions suggest that HerA similarly mediates DNA pumping into the progeny cells during archaeal cell division. It appears likely that the HerA and FtsK families diverged concomitantly with the archaeal–bacterial division and that the last universal common ancestor of modern life forms had an ancestral DNA-pumping ATPase that gave rise to these families. Furthermore, the relationship of these cellular proteins with the packaging ATPases of diverse DNA viruses suggests that a common DNA pumping mechanism might be operational in both cellular and viral genome segregation. The herA gene forms a highly conserved operon with the gene for the NurA nuclease and, in many archaea, also with the orthologs of eukaryotic double-strand break repair proteins MRE11 and Rad50. 
HerA is predicted to function in a complex with these proteins in DNA pumping and repair of double-stranded breaks introduced during this process and, possibly, also during DNA replication. Extensive comparative analysis of the ‘genomic context’ combined with in-depth sequence analysis led to the prediction of numerous previously unnoticed nucleases of the NurA superfamily, including a specific version that is likely to be the endonuclease component of a novel restriction-modification system. This analysis also led to the identification of previously uncharacterized nucleases, such as a novel predicted nuclease of the Sir2-type Rossmann fold, and phosphatases of the HAD superfamily that are likely to function as partners of the FtsK–HerA superfamily ATPases.
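The complementarity of the HerA and FtsK phyletic distributions noted above can be expressed as a simple score: the fraction of genomes that encode exactly one of the two proteins. The sketch below is only an illustration of that idea; the genome names and presence sets are invented, and the function name is hypothetical.

```python
def complementarity(present_a, present_b, genomes):
    """Fraction of genomes that carry exactly one of two genes.

    present_a, present_b: sets of genome names in which each gene is found.
    genomes: iterable of all genome names considered.
    """
    genomes = list(genomes)
    if not genomes:
        raise ValueError("no genomes supplied")
    exactly_one = sum((g in present_a) != (g in present_b) for g in genomes)
    return exactly_one / len(genomes)

# Invented presence/absence data for four placeholder genomes.
all_genomes = ["arch1", "arch2", "bact1", "bact2"]
has_herA = {"arch1", "arch2"}
has_ftsK = {"bact1", "bact2", "arch2"}
score = complementarity(has_herA, has_ftsK, all_genomes)
```

A score close to 1 corresponds to the nearly perfect complementarity described in the text, whereas two genes that tend to co-occur would score close to 0.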
The bacterial SOS response is an elaborate program for DNA repair, cell cycle regulation and adaptive mutagenesis under stress conditions. Using sensitive sequence and structure analysis, combined with contextual information derived from comparative genomics and domain architectures, we identify two novel domain superfamilies in the SOS response system. We present evidence that one of these, the SOS response associated peptidase (SRAP; Pfam: DUF159) is a novel thiol autopeptidase. Given the involvement of other autopeptidases, such as LexA and UmuD, in the SOS response, this finding suggests that multiple structurally unrelated peptidases have been recruited to this process. The second of these, the ImuB-C superfamily, is linked to the Y-family DNA polymerase-related domain in ImuB, and also occurs as a standalone protein. We present evidence using gene neighborhood analysis that both these domains function with different mutagenic polymerases in bacteria, such as Pol IV (DinB), Pol V (UmuCD) and ImuA-ImuB-DnaE2 and also other repair systems, which either deploy Ku and an ATP-dependent ligase or a SplB-like radical SAM photolyase. We suggest that the SRAP superfamily domain functions as a DNA-associated autoproteolytic switch that recruits diverse repair enzymes upon DNA damage, whereas the ImuB-C domain performs a similar function albeit in a non-catalytic fashion. We propose that C3Orf37, the eukaryotic member of the SRAP superfamily, which has been recently shown to specifically bind DNA with 5-hydroxymethylcytosine, 5-formylcytosine and 5-carboxycytosine, is a sensor for these oxidized bases generated by the TET enzymes from methylcytosine. Hence, its autoproteolytic activity might help it act as a switch that recruits DNA repair enzymes to remove these oxidized methylcytosine species as part of the DNA demethylation pathway downstream of the TET enzymes.
This article was reviewed by RDS, RF and GJ.
The DNA single-strand annealing proteins (SSAPs), such as RecT, Redβ, ERF and Rad52, function in RecA-dependent and RecA-independent DNA recombination pathways. Recently, they have been shown to form similar helical quaternary superstructures. However, despite the functional similarities between these diverse SSAPs, their actual evolutionary affinities are poorly understood.
Using sensitive computational sequence analysis, we show that the RecT and Redβ proteins, along with several other bacterial proteins, form a distinct superfamily. The ERF and Rad52 families show no direct evolutionary relationship to these proteins and define novel superfamilies of their own. We identify several previously unknown members of each of these superfamilies and also report, for the first time, bacterial and viral homologs of Rad52. Additionally, we predict the presence of aberrant HhH modules in RAD52 that are likely to be involved in DNA-binding. Using the contextual information obtained from the analysis of gene neighborhoods, we provide evidence of the interaction of the bacterial members of each of these SSAP superfamilies with a similar set of DNA repair/recombination proteins. These include different nucleases or Holliday junction resolvases, the ABC ATPase SbcC and the single-strand-binding protein. We also present evidence of independent assembly of some of the predicted operons encoding SSAPs and in situ displacement of functionally similar genes.
There are three evolutionarily distinct superfamilies of SSAPs, namely the RecT/Redβ, ERF, and RAD52, that have different sequence conservation patterns and predicted folds. All these SSAPs appear to be primarily of bacteriophage origin and have been acquired by numerous phylogenetically distant cellular genomes. They generally occur in predicted operons encoding one or more of a set of conserved DNA recombination proteins that appear to be the principal functional partners of the SSAPs.
Holliday junction resolvases (HJRs) are key enzymes of DNA recombination. A detailed computer analysis of the structural and evolutionary relationships of HJRs and related nucleases suggests that the HJR function has evolved independently from at least four distinct structural folds, namely RNase H, endonuclease, endonuclease VII–colicin E and RusA. The endonuclease fold, whose structural prototypes are the phage λ exonuclease, the very short patch repair nuclease (Vsr) and type II restriction enzymes, is shown to encompass a far greater diversity of nucleases than previously suspected. This fold unifies archaeal HJRs, repair nucleases such as RecB and Vsr, restriction enzymes and a variety of predicted nucleases whose specific activities remain to be determined. Within the RNase H fold a new family of predicted HJRs, which is nearly ubiquitous in bacteria, was discovered, in addition to the previously characterized RuvC family. The proteins of this family, typified by Escherichia coli YqgF, are likely to function as an alternative to RuvC in most bacteria, but could be the principal HJRs in low-GC Gram-positive bacteria and Aquifex. Endonuclease VII of phage T4 is shown to serve as a structural template for many nucleases, including McrA and other type II restriction enzymes. Together with colicin E7, endonuclease VII defines a distinct metal-dependent nuclease fold. As a result of this analysis, the principal HJRs are now known or confidently predicted for all bacteria and archaea whose genomes have been completely sequenced, with many species encoding multiple potential HJRs. Horizontal gene transfer, lineage-specific gene loss and gene family expansion, and non-orthologous gene displacement seem to have been major forces in the evolution of HJRs and related nucleases.
A remarkable case of displacement is seen in the Lyme disease spirochete Borrelia burgdorferi, which does not possess any of the typical HJRs, but instead encodes, in its chromosome and each of the linear plasmids, members of the λ exonuclease family predicted to function as HJRs. The diversity of HJRs and related nucleases in bacteria and archaea contrasts with their near absence in eukaryotes. The few detected eukaryotic representatives of the endonuclease fold and the RNase H fold have probably been acquired from bacteria via horizontal gene transfer. The identity of the principal HJR(s) involved in recombination in eukaryotes remains uncertain; this function could be performed by topoisomerase IB or by a novel, so far undetected, class of enzymes. Likely HJRs and related nucleases were identified in the genomes of numerous bacterial and eukaryotic DNA viruses. Gene flow between viral and cellular genomes has probably played a major role in the evolution of this class of enzymes. This analysis resulted in the prediction of numerous previously unnoticed nucleases, some of which are likely to be new restriction enzymes.
A detailed analysis of protein domains involved in DNA repair was performed by comparing the sequences of the repair proteins from two well-studied model organisms, the bacterium Escherichia coli and yeast Saccharomyces cerevisiae, to the entire sets of protein sequences encoded in completely sequenced genomes of bacteria, archaea and eukaryotes. Previously uncharacterized conserved domains involved in repair were identified, namely four families of nucleases and a family of eukaryotic repair proteins related to the proliferating cell nuclear antigen. In addition, a number of previously undetected occurrences of known conserved domains were detected; for example, a modified helix-hairpin-helix nucleic acid-binding domain in archaeal and eukaryotic RecA homologs. There is a limited repertoire of conserved domains, primarily ATPases and nucleases, nucleic acid-binding domains and adaptor (protein-protein interaction) domains that comprise the repair machinery in all cells, but very few of the repair proteins are represented by orthologs with conserved domain architecture across the three superkingdoms of life. Both the external environment of an organism and the internal environment of the cell, such as the chromatin superstructure in eukaryotes, seem to have a profound effect on the layout of the repair systems. Another factor that apparently has made a major contribution to the composition of the repair machinery is horizontal gene transfer, particularly the invasion of eukaryotic genomes by organellar genes, but also a number of likely transfer events between bacteria and archaea. Several additional general trends in the evolution of repair proteins were noticed; in particular, multiple, independent fusions of helicase and nuclease domains, and independent inactivation of enzymatic domains that apparently retain adaptor or regulatory functions.
We report an in-depth computational study of the protein sequences and structures of the superfamily of archaeo-eukaryotic primases (AEPs). This analysis greatly expands the range of diversity of the AEPs and reveals the unique active site shared by all members of this superfamily. In particular, it is shown that eukaryotic nucleo-cytoplasmic large DNA viruses, including poxviruses, asfarviruses, iridoviruses, phycodnaviruses and the mimivirus, encode AEPs of a distinct family, which also includes the herpesvirus primases whose relationship to AEPs has not been recognized previously. Many eukaryotic genomes, including chordates and plants, encode previously uncharacterized homologs of these predicted viral primases, which might be involved in novel DNA repair pathways. At a deeper level of evolutionary connections, structural comparisons indicate that AEPs, the nucleases involved in the initiation of rolling circle replication in plasmids and viruses, and origin-binding domains of papilloma and polyoma viruses evolved from a common ancestral protein that might have been involved in a protein-priming mechanism of initiation of DNA replication. Contextual analysis of multidomain protein architectures and gene neighborhoods in prokaryotes and viruses reveals remarkable parallels between AEPs and the unrelated DnaG-type primases, in particular, tight associations with the same repertoire of helicases. These observations point to a functional equivalence of the two classes of primases, which seem to have repeatedly displaced each other in various extrachromosomal replicons.
The complete 1,751,377-bp sequence of the genome of the thermophilic archaeon Methanobacterium thermoautotrophicum deltaH has been determined by a whole-genome shotgun sequencing approach. A total of 1,855 open reading frames (ORFs) have been identified that appear to encode polypeptides, 844 (46%) of which have been assigned putative functions based on their similarities to database sequences with assigned functions. A total of 514 (28%) of the ORF-encoded polypeptides are related to sequences with unknown functions, and 496 (27%) have little or no homology to sequences in public databases. Comparisons with Eucarya-, Bacteria-, and Archaea-specific databases reveal that 1,013 of the putative gene products (54%) are most similar to polypeptide sequences described previously for other organisms in the domain Archaea. Comparisons with the Methanococcus jannaschii genome data underline the extensive divergence that has occurred between these two methanogens; only 352 (19%) of M. thermoautotrophicum ORFs encode sequences that are >50% identical to M. jannaschii polypeptides, and there is little conservation in the relative locations of orthologous genes. When the M. thermoautotrophicum ORFs are compared to sequences from only the eucaryal and bacterial domains, 786 (42%) are more similar to bacterial sequences and 241 (13%) are more similar to eucaryal sequences. The bacterial domain-like gene products include the majority of those predicted to be involved in cofactor and small molecule biosyntheses, intermediary metabolism, transport, nitrogen fixation, regulatory functions, and interactions with the environment. Most proteins predicted to be involved in DNA metabolism, transcription, and translation are more similar to eucaryal sequences. Gene structure and organization have features that are typical of the Bacteria, including genes that encode polypeptides closely related to eucaryal proteins. 
There are 24 polypeptides that could form two-component sensor kinase-response regulator systems and homologs of the bacterial Hsp70-response proteins DnaK and DnaJ, which are notably absent in M. jannaschii. DNA replication initiation and chromosome packaging in M. thermoautotrophicum are predicted to have eucaryal features, based on the presence of two Cdc6 homologs and three histones; however, the presence of an ftsZ gene indicates a bacterial type of cell division initiation. The DNA polymerases include an X-family repair type and an unusual archaeal B type formed by two separate polypeptides. The DNA-dependent RNA polymerase (RNAP) subunits A', A", B', B" and H are encoded in a typical archaeal RNAP operon, although a second A' subunit-encoding gene is present at a remote location. There are two rRNA operons, and 39 tRNA genes are dispersed around the genome, although most of these occur in clusters. Three of the tRNA genes have introns, including the tRNAPro (GGG) gene, which contains a second intron at an unprecedented location. There is no selenocysteinyl-tRNA gene nor evidence for classically organized IS elements, prophages, or plasmids. The genome contains one intein and two extended repeats (3.6 and 8.6 kb) that are members of a family with 18 representatives in the M. jannaschii genome.
The use of nucleases as toxins for defense, offense or addiction of selfish elements is widely encountered across all life forms. Using sensitive sequence profile analysis methods, we characterize a novel superfamily (the SUKH superfamily) that unites a diverse group of proteins including Smi1/Knr4, PGs2, FBXO3, SKIP16, Syd, herpesviral US22, IRS1 and TRS1, and their bacterial homologs. Using contextual analysis we present evidence that the bacterial members of this superfamily are potential immunity proteins for a variety of toxin systems that also include the recently characterized contact-dependent inhibition (CDI) systems of proteobacteria. By analyzing the toxin proteins encoded in the neighborhood of the SUKH superfamily we predict that they possess domains belonging to diverse nuclease and nucleic acid deaminase families. These include at least eight distinct types of DNases belonging to the HNH/EndoVII and restriction endonuclease folds, and RNases of the EndoU-like and colicin E3-like cytotoxic RNase folds. The N-terminal domains of these toxins indicate that they are extruded by several distinct secretory mechanisms such as the two-partner system (shared with the CDI systems) in proteobacteria, ESAT-6/WXG-like ATP-dependent secretory systems in Gram-positive bacteria and the conventional Sec-dependent system in several bacterial lineages. The hedgehog-intein domain might also release a subset of toxic nuclease domains through auto-proteolytic action. Unlike classical colicin-like nuclease toxins, the overwhelming majority of toxin systems with the SUKH superfamily is chromosomally encoded and appears to have diversified through a recombination process combining different C-terminal nuclease domains to N-terminal secretion-related domains. Across the bacterial superkingdom these systems might participate in discriminating ‘self’ or kin from ‘non-self’ or non-kin strains.
Using structural analysis we demonstrate that the SUKH domain possesses a versatile scaffold that can be used to bind a wide range of protein partners. In eukaryotes it appears to have been recruited as an adaptor to regulate modification of proteins by ubiquitination or polyglutamylation. Similarly, another widespread immunity protein from these toxin systems, namely the suppressor of fused (SuFu) superfamily has been recruited for comparable roles in eukaryotes. In animal DNA viruses, such as herpesviruses, poxviruses, iridoviruses and adenoviruses, the ability of the SUKH domain to bind diverse targets has been deployed to counter diverse anti-viral responses by interacting with specific host proteins.
The genome sequence of the extremely thermophilic bacterium Aquifex aeolicus encodes alternative sigma factor σN (σ54, RpoN) and five potential σN-dependent transcriptional activators. Although A. aeolicus possesses no recognizable nitrogenase genes, two of the activators have a high degree of sequence similarity to NifA proteins from nitrogen-fixing proteobacteria. We identified five putative σN-dependent promoters upstream of operons implicated in functions including sulfur respiration, nitrogen assimilation, nitrate reductase, and nitrite reductase activity. We cloned, overexpressed (in Escherichia coli), and purified A. aeolicus σN and the NifA homologue, AQ_218. Purified A. aeolicus σN bound to E. coli core RNA polymerase and bound specifically to a DNA fragment containing E. coli promoter glnHp2 and to several A. aeolicus DNA fragments containing putative σN-dependent promoters. When combined with E. coli core RNA polymerase, A. aeolicus σN supported A. aeolicus NifA-dependent transcription from the glnHp2 promoter. The E. coli activator PspFΔHTH did not stimulate transcription. The NifA homologue, AQ_218, bound specifically to a DNA sequence centered about 100 bp upstream of the A. aeolicus glnBA operon and so is likely to be involved in the regulation of nitrogen assimilation in this organism. These results argue that the σN enhancer-dependent transcription system operates in at least one extreme environment, and that the activator and σN have coevolved.
Occurrence of the hsp70 (dnaK) gene was investigated in various members of the domain Archaea comprising both euryarchaeotes and crenarchaeotes and in the hyperthermophilic bacteria Aquifex pyrophilus and Thermotoga maritima representing the deepest offshoots in phylogenetic trees of bacterial 16S rRNA sequences. The gene was not detected in 8 of 10 archaea examined but was found in A. pyrophilus and T. maritima, from which it was cloned and sequenced. Comparative analyses of the HSP70 amino acid sequences encoded in these genes, and others in the databases, showed that (i) in accordance with the vicinities seen in rRNA-based trees, the proteins from A. pyrophilus and T. maritima form a thermophilic cluster with that from the green nonsulfur bacterium Thermomicrobium roseum and are unrelated to their counterparts from gram-positive bacteria, proteobacteria/mitochondria, chlamydiae/spirochetes, deinococci, and cyanobacteria/chloroplasts; (ii) the T. maritima HSP70 clusters with the homologues from the archaea Methanobacterium thermoautotrophicum and Thermoplasma acidophilum, in contrast to the postulated unique kinship between archaea and gram-positive bacteria; and (iii) there are exceptions to the reported association between an insert in HSP70 and gram negativity, or vice versa, absence of insert and gram positivity. Notably, the HSP70 from T. maritima lacks the insert, although T. maritima is phylogenetically unrelated to the gram-positive bacteria. These results, along with the absence of hsp70 (dnaK) in various archaea and its presence in others, suggest that (i) different taxa retained either one or the other of two hsp70 (dnaK) versions (with or without insert), regardless of phylogenetic position; and (ii) archaea are aboriginally devoid of hsp70 (dnaK), and those that have it must have received it from phylogenetically diverse bacteria via lateral gene transfer events that did not involve replacement of an endogenous hsp70 (dnaK) gene.
A computational method was developed for delineating connected gene neighborhoods in bacterial and archaeal genomes. These gene neighborhoods are not typically present, in their entirety, in any single genome, but are held together by overlapping, partially conserved gene arrays. The procedure was applied to comparing the orders of orthologous genes, which were extracted from the database of Clusters of Orthologous Groups of proteins (COGs), in 31 prokaryotic genomes and resulted in the identification of 188 clusters of gene arrays, which included 1001 of 2890 COGs. These clusters were projected onto actual genomes to produce extended neighborhoods that include additional genes adjacent to the genes from the clusters and transcribed in the same direction; in total, 2387 COGs were included in the neighborhoods. Most of the neighborhoods consist predominantly of genes united by a coherent functional theme, but also include a minority of genes without an obvious functional connection to the main theme. We hypothesize that although some of the latter genes might have unsuspected roles, others are maintained within gene arrays because of the advantage of expression at a level that is typical of the given neighborhood. We designate this phenomenon ‘genomic hitchhiking’. The largest neighborhood includes 79 genes (COGs) and consists of overlapping, rearranged ribosomal protein superoperons; apparent genomic hitchhiking is particularly typical of this neighborhood and other neighborhoods that consist of genes coding for translation machinery components. Several neighborhoods involve previously undetected connections between genes, allowing new functional predictions. Gene neighborhoods appear to evolve via complex rearrangement, with different combinations of genes from a neighborhood fixed in different lineages.
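The idea of neighborhoods "held together by overlapping, partially conserved gene arrays" can be illustrated with a minimal sketch. It assumes that each conserved array is given as a tuple of COG identifiers and that two arrays are linked when they share at least `min_shared` COGs; this linking rule, the function name `connected_neighborhoods`, and the toy identifiers are illustrative assumptions, not the procedure used in the study.

```python
from itertools import combinations


def connected_neighborhoods(arrays, min_shared=2):
    """Merge gene arrays into connected neighborhoods.

    arrays: iterable of tuples of COG identifiers (hypothetical input format).
    Two arrays are linked if they share at least `min_shared` COGs
    (an assumed criterion); linked arrays are merged transitively.
    Returns a list of sorted COG lists, ordered by their first COG.
    """
    sets = [set(a) for a in arrays]
    parent = list(range(len(sets)))  # union-find forest over array indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sets)), 2):
        if len(sets[i] & sets[j]) >= min_shared:
            parent[find(j)] = find(i)

    clusters = {}
    for i, s in enumerate(sets):
        clusters.setdefault(find(i), set()).update(s)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Toy arrays: no single array contains the whole first neighborhood.
arrays = [
    ("COG0001", "COG0002", "COG0003"),
    ("COG0002", "COG0003", "COG0004"),
    ("COG0004", "COG0005", "COG0006"),
    ("COG0010", "COG0011"),
    ("COG0011", "COG0010", "COG0012"),
]
neighborhoods = connected_neighborhoods(arrays)
```

With the default threshold the third array stays separate because it shares only one COG with the second; lowering `min_shared` to 1 merges it into the first neighborhood, which shows how sensitive the result is to the overlap criterion.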
The PD-(D/E)XK nuclease superfamily, initially identified in type II restriction endonucleases and later in many enzymes involved in DNA recombination and repair, is one of the most challenging targets for protein sequence analysis and structure prediction. Typically, the sequence similarity between these proteins is so low that most of the relationships between known members of the PD-(D/E)XK superfamily were identified only after the corresponding structures were determined experimentally. Thus, it is tempting to speculate that among the uncharacterized protein families, there are potential nucleases that remain to be discovered, but their identification requires more sensitive tools than traditional PSI-BLAST searches.
The low degree of amino acid conservation hampers the identification of new members of the PD-(D/E)XK superfamily based solely on sequence comparisons to known members. Therefore, we used the recently developed method HHsearch for sensitive detection of remote similarities between protein families represented as profile Hidden Markov Models enhanced by secondary structure. To detect significant similarities, we compared known families of PD-(D/E)XK nucleases to a database comprising the COG and PFAM profiles corresponding to both functionally characterized and uncharacterized protein families. The initial candidates for new nucleases were subsequently verified by sequence-structure threading, comparative modeling, and identification of potential active site residues.
In this article, we report identification of the PD-(D/E)XK nuclease domain in numerous proteins implicated in interactions with DNA but with unknown structure and mechanism of action (such as putative recombinase RmuC, DNA competence factor CoiA, a DNA-binding protein SfsA, a large human protein predicted to be a DNA repair enzyme, predicted archaeal transcription regulators, and the head completion protein of phage T4) and in proteins for which no function has been assigned to date (such as YhcG, various phage proteins, novel candidates for restriction enzymes). Our results contribute to the reduction of "white spaces" on the sequence-structure-function map of the protein universe and will help to jump-start the experimental characterization of new nucleases, many of which may be important for a complete understanding of the mechanisms that govern the evolution and stability of the genome.
In previous studies, gene neighborhoods—spatial clusters of co-expressed genes in the genome—have been defined using arbitrary rules such as requiring adjacency, a minimum number of genes, a fixed window size, or a minimum expression level. In the current study, we developed a Gene Neighborhood Scoring Tool (G-NEST) which combines genomic location, gene expression, and evolutionary sequence conservation data to score putative gene neighborhoods across all possible window sizes simultaneously.
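Scoring putative neighborhoods "across all possible window sizes simultaneously" can be sketched as follows. This toy example uses only one signal, the mean pairwise Pearson correlation of expression profiles within each window of consecutive genes; the actual G-NEST score also draws on genomic location and evolutionary conservation, so the function names, the scoring rule, and the data below are assumptions made purely for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def score_windows(expression, min_size=2):
    """Score every window of consecutive genes, for every window size.

    expression: list of per-gene expression profiles in chromosomal order
    (hypothetical input format). Returns {(start, size): score}, where the
    score is the mean pairwise correlation inside the window.
    """
    n = len(expression)
    # Compute each pairwise correlation once and reuse it across windows.
    corr = {
        (i, j): pearson(expression[i], expression[j])
        for i, j in combinations(range(n), 2)
    }
    scores = {}
    for size in range(min_size, n + 1):
        for start in range(n - size + 1):
            pairs = list(combinations(range(start, start + size), 2))
            scores[(start, size)] = sum(corr[p] for p in pairs) / len(pairs)
    return scores


# Toy data: genes 0-2 are co-expressed, gene 3 is anti-correlated with them.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.5, 2.5, 3.5, 4.5],
    [4.0, 3.0, 2.0, 1.0],
]
scores = score_windows(expression)
strong = [window for window, score in scores.items() if score > 0.9]
```

Because every window size is scored, no fixed window length or minimum gene count has to be chosen in advance; a threshold such as the 0.9 used here is applied only afterwards and is itself an arbitrary choice for this example.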
Using G-NEST on atlases of mouse and human tissue expression data, we found that large neighborhoods of ten or more genes are extremely rare in mammalian genomes. When they do occur, neighborhoods are typically composed of families of related genes. Both the highest scoring and the largest neighborhoods in mammalian genomes are formed by tandem gene duplication. Mammalian gene neighborhoods contain highly and variably expressed genes. Co-localized noisy gene pairs exhibit lower evolutionary conservation of their adjacent genome locations, suggesting that their shared transcriptional background may be disadvantageous. Genes that are essential to mammalian survival and reproduction are less likely to occur in neighborhoods, although neighborhoods are enriched with genes that function in mitosis. We also found that gene orientation and protein-protein interactions are partially responsible for maintenance of gene neighborhoods.
Our experiments using G-NEST confirm that tandem gene duplication is the primary driver of non-random gene order in mammalian genomes. Non-essentiality, co-functionality, gene orientation, and protein-protein interactions are additional forces that maintain gene neighborhoods, especially those formed by tandem duplicates. We expect G-NEST to be useful for other applications such as the identification of core regulatory modules, common transcriptional backgrounds, and chromatin domains. The software is available at http://docpollard.org/software.html
Computational biology; Genomics; Gene expression; Gene duplication; Transcription; Cluster analysis; Gene neighborhood; Gene cluster; Bioinformatics; Evolution
We have identified a novel structure-specific nuclease in highly fractionated extracts of the thermophilic archaeon Methanothermobacter thermautotrophicus (Mth). The 71 kDa protein product of open reading frame mth1090 is a nuclease with ATPase activity, which we call Nar71 (Nuclease-ATPase in Repair, 71 kDa). The nar71 gene is located in a gene neighbourhood proposed by genomics to encode a novel DNA repair system conserved in thermophiles. The biochemical characterization of Nar71 presented here is the first analysis from within this neighbourhood, and it supports the insight from genomics. Nuclease activity of Nar71 is specific for 3′ flaps and flayed duplexes, targeting single-stranded DNA (ssDNA) regions. This activity requires Mg2+ or Mn2+ and is greatly reduced in ATP. In ATP, Nar71 displaces ssDNA, also with high specificity for 3′ flap and flayed duplex DNA. Strand displacement is weak compared with nuclease activity, but in ATPγS it is abolished, suggesting that Nar71 couples ATP hydrolysis to DNA strand separation. ATPase assays confirmed that Nar71 is stimulated by ssDNA, though not double-stranded DNA. Mutation of Lys-117 in Nar71 abolished ATPase and nuclease activity, and we describe a separation-of-function mutant (K68A) that has lost ATPase activity but retains nuclease activity. A model of possible Nar71 function in DNA repair is presented.
The elaborate eukaryotic DNA replication machinery evolved from the archaeal ancestors that themselves show considerable complexity. Here we discuss the comparative genomic and phylogenetic analysis of the core replication enzymes, the DNA polymerases, in archaea and their relationships with the eukaryotic polymerases. In archaea, there are three groups of family B DNA polymerases, historically known as PolB1, PolB2 and PolB3. All three groups appear to descend from the last common ancestors of the extant archaea but their subsequent evolutionary trajectories seem to have been widely different. Although PolB3 is present in all archaea, with the exception of Thaumarchaeota, and appears to be directly involved in lagging strand replication, the evolution of this gene does not follow the archaeal phylogeny, conceivably due to multiple horizontal transfers and/or dramatic differences in evolutionary rates. In contrast, PolB1 is missing in Euryarchaeota but otherwise seems to have evolved vertically. The third archaeal group of family B polymerases, PolB2, includes primarily proteins in which the catalytic centers of the polymerase and exonuclease domains are disrupted and accordingly the enzymes appear to be inactivated. The members of the PolB2 group are scattered across archaea and might be involved in repair or regulation of replication along with inactivated members of the RadA family ATPases and an additional, uncharacterized protein that are encoded within the same predicted operon. In addition to the family B polymerases, all archaea, with the exception of the Crenarchaeota, encode enzymes of a distinct family D the origin of which is unclear. We examine multiple considerations that appear compatible with the possibility that family D polymerases are highly derived homologs of family B. 
The eukaryotic DNA polymerases show a highly complex relationship with their archaeal ancestors including contributions of proteins and domains from both the family B and the family D archaeal polymerases.
DNA replication; archaea; mobile genetic elements; DNA polymerases; enzyme inactivation
Lateral gene transfer (LGT) is an important factor contributing to the evolution of prokaryotic genomes. The Aquificae are a hyperthermophilic bacterial group whose genes show affiliations to many other lineages, including the hyperthermophilic Thermotogae, the Proteobacteria, and the Archaea. Previous phylogenomic analyses focused on Aquifex aeolicus identified Thermotogae and Aquificae either as successive early branches or sisters in a rooted bacterial phylogeny, but many phylogenies and cellular traits have suggested a stronger affiliation with the Epsilonproteobacteria. Different scenarios for the evolution of the Aquificae yield different phylogenetic predictions. Here, we outline these scenarios and consider the fit of the available data, including three sequenced Aquificae genomes, to different sets of predictions. Evidence from phylogenetic profiles and trees suggests that the Epsilonproteobacteria have the strongest affinities with the three Aquificae analyzed. However, this pattern is shown by only a minority of encoded proteins, and the Archaea, many lineages of thermophilic bacteria, and members of genus Clostridium and class Deltaproteobacteria also show strong connections to the Aquificae. The phylogenetic affiliations of different functional subsystems showed strong biases: Most but not all genes implicated in the core translational apparatus tended to group Aquificae with Thermotogae, whereas a wide range of metabolic and cellular processes strongly supported the link between Aquificae and Epsilonproteobacteria. Depending on which sets of genes are privileged, either Thermotogae or Epsilonproteobacteria is the most plausible adjacent lineage to the Aquificae. Both scenarios require massive sharing of genes to explain the history of this enigmatic group, whose history is further complicated by specific affinities of different members of Aquificae to different partner lineages.
Aquifex aeolicus; Thermotogae; phylogenomics; hyperthermophiles; lateral gene transfer
Escherichia coli pol V (UmuD′2C), the main translesion DNA polymerase, ensures continued nascent strand extension when the cellular replicase is blocked by unrepaired DNA lesions. Pol V is characterized by low sugar selectivity, which can be further reduced by a Y11A “steric-gate” substitution in UmuC that enables pol V to preferentially incorporate rNTPs over dNTPs in vitro. Despite efficient error-prone translesion synthesis catalyzed by UmuC_Y11A in vitro, strains expressing umuC_Y11A exhibit low UV mutability and UV resistance. Here, we show that these phenotypes result from the concomitant dual actions of Ribonuclease HII (RNase HII) initiating removal of rNMPs from the nascent DNA strand and nucleotide excision repair (NER) removing UV lesions from the parental strand. In the absence of either repair pathway, UV resistance and mutagenesis conferred by umuC_Y11A are significantly enhanced, suggesting that the combined actions of RNase HII and NER lead to double-strand breaks that result in reduced cell viability. We present evidence that the Y11A-specific UV phenotype is tempered by pol IV in vivo. At physiological ratios of the two polymerases, pol IV inhibits pol V–catalyzed translesion synthesis (TLS) past UV lesions and significantly reduces the number of Y11A-incorporated rNTPs by limiting the length of the pol V–dependent TLS tract generated during lesion bypass in vitro. In a recA730 lexA(Def) ΔumuDC ΔdinB strain, plasmid-encoded wild-type pol V promotes high levels of spontaneous mutagenesis. By contrast, umuC_Y11A-dependent spontaneous mutagenesis is only ∼7% of that observed with wild-type pol V; it increases to ∼39% of wild-type levels in an isogenic ΔrnhB strain and to ∼72% of wild-type levels in a ΔrnhA ΔrnhB double mutant. Our observations suggest that errant ribonucleotides incorporated by pol V can be tolerated in the E. coli genome, but at the cost of higher levels of cellular mutagenesis.
E. coli pol V, a complex formed by umuC and umuD gene products, is a “founding” member of the Y-family of DNA polymerases that have been identified in all domains of life. The primary cellular function of Y-family polymerases is the replication of damaged DNA. We discovered that pol V is characterized by unusually poor sugar selectivity and readily incorporates ribonucleotides into DNA. The extent of ribonucleotide incorporation can be modulated by substituting amino acids at, or adjacent to, the “steric gate” in the active site of the DNA polymerase. Principally, by taking a genetic approach, supported by in vitro biochemical data, we show that SOS mutations triggered by pol V–catalyzed errant ribonucleotide incorporation are kept in check by the action of nucleotide excision repair operating in conjunction with RNase HII and, unexpectedly, by another error-prone Y-family polymerase, pol IV. Our studies provide new insight into a growing field investigating the processing of ribonucleotides that are misincorporated by DNA polymerases and how these basic mechanisms contribute to cell survival and mutagenesis.
Y family DNA polymerases are specialized for replication of damaged DNA and represent a major contribution to cellular resistance to DNA lesions. Although the Y family polymerase active sites have fewer contacts with their DNA substrates than replicative DNA polymerases, Y family polymerases appear to exhibit specificity for certain lesions. Thus, mutation of the steric gate residue of Escherichia coli DinB resulted in the specific loss of lesion bypass activity. We constructed variants of E. coli UmuC with mutations of the steric gate residue Y11 and of residue F10 and determined that strains harboring these variants are hypersensitive to UV light. Moreover, these UmuC variants are dominant negative with respect to sensitivity to UV light. The UV hypersensitivity and the dominant negative phenotype are partially suppressed by additional mutations in the known motifs in UmuC responsible for binding to the β processivity clamp, suggesting that the UmuC steric gate variant exerts its effects via access to the replication fork. Strains expressing the UmuC Y11A variant also exhibit decreased UV mutagenesis. Strikingly, disruption of the dnaQ gene encoding the replicative DNA polymerase proofreading subunit suppressed the dominant negative phenotype of a UmuC steric gate variant. This could be due to a recruitment function of the proofreading subunit or involvement of the proofreading subunit in a futile cycle of base insertion/excision with the UmuC steric gate variant.
Bacteria of the genus Deinococcus are extremely resistant to ionizing radiation (IR), ultraviolet light (UV) and desiccation. The mesophile Deinococcus radiodurans was the first member of this group whose genome was completely sequenced. Analysis of the genome sequence of D. radiodurans, however, failed to identify unique DNA repair systems. To further delineate the genes underlying the resistance phenotypes, we report the whole-genome sequence of a second Deinococcus species, the thermophile Deinococcus geothermalis, which at its optimal growth temperature is as resistant to IR, UV and desiccation as D. radiodurans, and a comparative analysis of the two Deinococcus genomes. Many D. radiodurans genes previously implicated in resistance, but for which no sensitive phenotype was observed upon disruption, are absent in D. geothermalis. In contrast, most D. radiodurans genes whose mutants displayed a radiation-sensitive phenotype in D. radiodurans are conserved in D. geothermalis. Supporting the existence of a Deinococcus radiation response regulon, a common palindromic DNA motif was identified in a conserved set of genes associated with resistance, and a dedicated transcriptional regulator was predicted. We present the case that these two species evolved essentially the same diverse set of gene families, and that the extreme stress-resistance phenotypes of the Deinococcus lineage emerged progressively by amassing cell-cleaning systems from different sources, but not by acquisition of novel DNA repair systems. Our reconstruction of the genomic evolution of the Deinococcus-Thermus phylum indicates that the corresponding set of enzymes proliferated mainly in the common ancestor of Deinococcus. 
Results of the comparative analysis weaken the arguments for a role of higher-order chromosome alignment structures in resistance; more clearly define and substantially revise downward the number of uncharacterized genes that might participate in DNA repair and contribute to resistance; and strengthen the case for a role in survival of systems involved in manganese and iron homeostasis.
CRISPR-Cas adaptive immunity systems of bacteria and archaea insert fragments of virus or plasmid DNA as spacer sequences into CRISPR repeat loci. Processed transcripts encompassing these spacers guide the cleavage of the cognate foreign DNA or RNA. Most CRISPR-Cas loci, in addition to recognized cas genes, also include genes that are not directly implicated in spacer acquisition, CRISPR transcript processing or interference. Here we comprehensively analyze sequences, structures and genomic neighborhoods of one of the most widespread groups of such genes that encode proteins containing a predicted nucleotide-binding domain with a Rossmann-like fold, which we denote CARF (CRISPR-associated Rossmann fold). Several CARF protein structures have been determined but functional characterization of these proteins is lacking. The CARF domain is most frequently combined with a C-terminal winged helix-turn-helix DNA-binding domain and “effector” domains most of which are predicted to possess DNase or RNase activity. Divergent CARF domains are also found in RtcR proteins, sigma-54 dependent regulators of the rtc RNA repair operon. CARF genes frequently co-occur with those coding for proteins containing the WYL domain with the Sm-like SH3 β-barrel fold, which is also predicted to bind ligands. CRISPR-Cas and possibly other defense systems are predicted to be transcriptionally regulated by multiple ligand-binding proteins containing WYL and CARF domains which sense modified nucleotides and nucleotide derivatives generated during virus infection. We hypothesize that CARF domains also transmit the signal from the bound ligand to the fused effector domains which attack either alien or self nucleic acids, resulting, respectively, in immunity complementing the CRISPR-Cas action or in dormancy/programmed cell death.
CRISPR; Rossmann fold; beta barrel; DNA-binding proteins; phage defense
Translesion synthesis is a DNA damage tolerance mechanism by which damaged DNA in a cell can be replicated by specialized DNA polymerases without being repaired. The Escherichia coli umuDC gene products, UmuC and the cleaved form of UmuD, UmuD′, comprise a specialized, potentially mutagenic translesion DNA polymerase, polymerase V (UmuD′2C). The full-length UmuD protein, together with UmuC, plays a role in a primitive DNA damage checkpoint by decreasing the rate of DNA synthesis. It has been proposed that the checkpoint is manifested as a cold-sensitive phenotype that is observed when the umuDC gene products are overexpressed. Elevated levels of the beta processivity clamp along with elevated levels of the umuDC gene products, UmuD′C, exacerbate the cold-sensitive phenotype. We used this observation as the basis for genetic selection to identify two alleles of umuD′ and seven alleles of umuC that do not exacerbate the cold-sensitive phenotype when they are present in cells with elevated levels of the beta clamp. The variants were characterized to determine their abilities to confer the umuD′C-specific phenotype of UV-induced mutagenesis. The umuD variants were assayed to determine their proficiencies in UmuD cleavage, and one variant (G129S) rendered UmuD noncleavable. We found at least two UmuC residues, T243 and L389, that may further define the beta binding region on UmuC. We also identified UmuC S31, which is predicted to bind to the template nucleotide, as a residue that is important for UV-induced mutagenesis.
Comparative analysis of the sequences of enzymes encoded in a variety of prokaryotic and eukaryotic genomes reveals convergence and divergence at several levels. Functional convergence can be inferred when structurally distinct and hence non-homologous enzymes show the ability to catalyze the same biochemical reaction. In contrast, as a result of functional diversification, many structurally similar enzyme molecules act on substantially distinct substrates and catalyze diverse biochemical reactions. Here, we present updates on the ATP-grasp, alkaline phosphatase, cupin, HD hydrolase, and N-terminal nucleophile (Ntn) hydrolase enzyme superfamilies and discuss the patterns of sequence and structural conservation and diversity within these superfamilies. Typically, enzymes within a superfamily possess common sequence motifs and key active site residues, as well as (predicted) reaction mechanisms. These observations suggest that the strained conformation (the entatic state) of the active site, which is responsible for the substrate binding and formation of the transition complex, tends to be conserved within enzyme superfamilies. The subsequent fate of the transition complex is not necessarily conserved and depends on the details of the structures of the enzyme and the substrate. This variability of reaction outcomes limits the ability of sequence analysis to predict the exact enzymatic activities of newly sequenced gene products. Nevertheless, sequence-based (super)family assignments and generic functional predictions, even if imprecise, provide valuable leads for experimental studies and remain the best approach to the functional annotation of uncharacterized proteins from new genomes.
Enzyme Catalysis; Enzyme Mechanisms; Enzyme Structure; Evolution; Phosphodiesterases; Convergence; Divergence