The origin and evolution of “ORFans” (suspected genes without known relatives) remain unclear. Here, we take advantage of a unique opportunity to examine the population diversity of thousands of ORFans, based on a collection of 35 complete genomes of isolates of Escherichia coli and Shigella (which is included phylogenetically within E. coli). As expected from previous studies, ORFans are shorter and AT-richer in sequence than non-ORFans. We find that ORFans often are very narrowly distributed: the most common pattern is for an ORFan to be found in only one genome. We compared within-species population diversity of ORFan genes with those of two control groups of non-ORFan genes. Patterns of population variation suggest that most ORFans are not artifacts, but encode real genes whose protein-coding capacity is conserved, reflecting selection against nonsynonymous mutations. Nevertheless, nonsynonymous nucleotide diversity is higher than for non-ORFans, whereas synonymous diversity is roughly the same. In particular, there is a several-fold excess of ORFans in the highest decile of diversity relative to controls, which might be due to weaker purifying selection, positive selection, or a subclass of ORFans that are decaying.
ORFan; lineage-specific genes; evolution; population genetics; positive selection; negative selection
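As a rough illustration of the diversity statistic discussed above, the following minimal sketch computes overall nucleotide diversity (the mean proportion of differing sites over all pairs of sequences) for a set of aligned, ungapped sequences. It does not separate synonymous from nonsynonymous sites, which the study does and which would require codon-aware site counting. The function name `nucleotide_diversity` and the toy alignment are assumptions made for illustration, not the authors' pipeline.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Mean pairwise proportion of differing sites (hypothetical helper).

    `aligned` is a list of equal-length, ungapped sequences of one gene
    sampled from different genomes.
    """
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned[0])
    if length == 0 or any(len(s) != length for s in aligned):
        raise ValueError("sequences must be non-empty and of equal length")

    total = 0.0
    pair_count = 0
    for a, b in combinations(aligned, 2):
        # Count sites at which the two sequences differ.
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        total += diffs / length
        pair_count += 1
    return total / pair_count


# Toy alignment (made-up data): three copies of a 4-bp region.
toy_pi = nucleotide_diversity(["AAAA", "AAAT", "AAAA"])
```

Applying the same calculation separately to synonymous and nonsynonymous sites would give the two quantities the abstract compares between ORFans and control genes.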
The origin of microbial ORFans, ORFs having no detectable homology to other ORFs in the databases, is one of the unexplained puzzles of the post-genomic era. Several hypotheses on the origin of ORFans have been suggested in the last few years, most of which are based on selected, relatively small subsets of ORFans. One of these hypotheses is that ORFans have been acquired through lateral transfer from viruses. Here we carry out a comprehensive, genome-wide study on the origins of ORFans to quantify the strength of current evidence supporting this hypothesis.
We performed similarity searches by querying all current ORFans against the public virus protein database. Surprisingly, we found that only 2.8% of all microbial ORFans have detectable homologs in viruses, while the percentage of non-ORFans with detectable homologs in viruses is 7.9%, a significantly higher figure. This suggests that the current evidence for the origin of ORFans from lateral transfer from viruses is at best weak. However, an analysis of individual genomes revealed a number of organisms with much higher percentages, many of them belonging to the Firmicutes and Gamma-proteobacteria. We provide evidence suggesting that the current virus database may be biased towards those viruses attacking Firmicutes and Gamma-proteobacteria.
We conclude that as more viral genomes are sequenced, more microbial ORFans will find homologs in viruses, but this trend may vary considerably among individual genomes. Thus, lateral transfer from viruses alone is unlikely to explain the origin of the majority of ORFans in the majority of prokaryotes; consequently, other, not necessarily mutually exclusive, mechanisms are likely to better explain the origin of the increasing number of ORFans.
ORFans are open reading frames (ORFs) with no detectable sequence similarity to any other sequence in the databases. Each newly sequenced genome contains a significant number of ORFans. Therefore, ORFans entail interesting evolutionary puzzles. However, little can be learned about them using bioinformatics tools, and their study seems to have been underemphasized. Here we present some of the questions that the existence of so many ORFans has raised and review some of the studies aimed at understanding ORFans, their functions and their origins. These works have demonstrated that ORFans are an untapped source of research, requiring further computational and experimental studies.
As each newly sequenced genome contains a significant number of protein-coding ORFs that are species-, family- or lineage-specific, many interesting questions arise about the evolution and role of these ORFs and of the genomes they are part of. We refer to these poorly conserved ORFs as singleton or paralogous ORFans if they are unique to one genome, or as orthologous ORFans if they appear only in a family of closely related organisms and have no homolog in other genomes. In order to study and classify ORFans we have constructed the ORFanage, an ORFan database. This database consists of the predicted ORFs in fully sequenced microbial genomes, and enables searching for the three types of ORFans in any subset of the genomes chosen by the user. The ORFanage could help in choosing interesting targets for further genomic and evolutionary studies. The ORFanage is accessible via http://www.bioinformatics.buffalo.edu/ORFanage.
Motivation: Intriguingly, sequence analysis of genomes reveals that a large number of genes are unique to each organism. The origin of these genes, termed ORFans, is not known. Here, we explore the origin of ORFan genes by defining a simple measure called ‘composition bias’, based on the deviation of the amino acid composition of a given sequence from the average composition of all proteins of a given genome.
Results: For a set of 47 prokaryotic genomes, we show that the amino acid composition bias of real proteins, random ‘proteins’ (created by using the nucleotide frequencies of each genome) and ‘proteins’ translated from intergenic regions are distinct. For ORFans, we observed a correlation between their composition bias and their relative evolutionary age. Recent ORFan proteins have compositions more similar to those of random ‘proteins’, while the compositions of more ancient ORFan proteins are more similar to those of the set of all proteins of the organism. This observation is consistent with an evolutionary scenario wherein ORFan genes emerged and underwent a large number of random mutations and selection, eventually adapting to the composition preference of their organism over time.
Supplementary information: Supplementary data are available at Bioinformatics online.
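The 'composition bias' idea described above can be sketched as follows. The abstract does not state the exact formula, so this sketch assumes a Euclidean distance between the 20-dimensional amino acid frequency vector of a sequence and the genome-wide average vector; the function names (`composition`, `genome_average`, `composition_bias`) and the toy sequences are hypothetical.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Relative frequency of each standard amino acid in one sequence."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def genome_average(proteins):
    """Average composition over all proteins of a genome (unweighted mean)."""
    comps = [composition(p) for p in proteins]
    return [sum(col) / len(comps) for col in zip(*comps)]


def composition_bias(seq, average):
    """Assumed measure: Euclidean distance from the genome average."""
    return math.sqrt(
        sum((f - g) ** 2 for f, g in zip(composition(seq), average))
    )


# Toy example (made-up sequences, not real proteins).
average = genome_average(["ACDE", "ACDE"])
typical = composition_bias("ACDE", average)   # matches the average
atypical = composition_bias("AAAA", average)  # deviates from the average
```

Under this assumed measure, a larger value means a composition further from the genome norm, which is the direction the abstract associates with more recent ORFans.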
Mimivirus isolated from A. polyphaga is the largest virus discovered so far. It is unique among all the viruses in having genes related to translation, DNA repair and replication which bear close homology to eukaryotic genes. Nevertheless, only a small fraction of the proteins (33%) encoded in this genome has been assigned a function. Furthermore, a large fraction of the unassigned protein sequences bear no sequence similarity to proteins from other genomes. These sequences are referred to as ORFans. Because of their lack of sequence similarity to other proteins, they cannot be assigned putative functions using standard sequence comparison methods. As part of our genome-wide computational efforts aimed at characterizing Mimivirus ORFans, we have applied fold-recognition methods to predict the structure of these ORFans, and further functions were derived based on conservation of functionally important residues in sequence-template alignments.
Using fold recognition, we have identified highly confident computational 3D structural assignments for 21 Mimivirus ORFans. In addition, highly confident functional predictions for 6 of these ORFans were derived by analyzing the conservation of functional motifs between the predicted structures and proteins of known function. This analysis allowed us to classify these 6 previously unannotated ORFans into their specific protein families: carboxylesterase/thioesterase, metal-dependent deacetylase, P-loop kinases, 3-methyladenine DNA glycosylase, BTB domain and eukaryotic translation initiation factor eIF4E.
Using stringent fold recognition criteria we have assigned three-dimensional structures for 21 of the ORFans encoded in the Mimivirus genome. Further, based on the 3D models and an analysis of the conservation of functionally important residues and motifs, we were able to derive functional attributes for 6 of the ORFans. Our computational identification of important functional sites in these ORFans can be the basis for a subsequent experimental verification of our predictions. Further computational and experimental studies are required to elucidate the 3D structures and functions of the remaining Mimivirus ORFans.
Acanthamoeba polyphaga mimivirus is the largest known virus in both particle size and genome complexity. Its 1.2-Mb genome encodes 911 proteins, among which only 298 have predicted functions. The composition of purified isolated virions was analyzed by using a combined electrophoresis/mass spectrometry approach allowing the identification of 114 proteins. Besides the expected major structural components, the viral particle packages 12 proteins unambiguously associated with transcriptional machinery, 3 proteins associated with DNA repair, and 2 topoisomerases. Other main functional categories represented in the virion include oxidative pathways and protein modification. More than half of the identified virion-associated proteins correspond to anonymous genes of unknown function, including 45 “ORFans.” As demonstrated by both Western blotting and immunogold staining, some of these “ORFans,” which lack any convincing similarity in the sequence databases, are endowed with antigenic properties. Thus, anonymous and unique genes constituting the majority of the mimivirus gene complement encode bona fide proteins that are likely to participate in well-integrated processes.
ORFan genes can constitute a large fraction of a bacterial genome, but due to their lack of homologs, their functions have remained largely unexplored. To determine if particular features of ORFan-encoded proteins promote their presence in a genome, we analyzed properties of ORFans that originated over a broad evolutionary timescale. We also compared ORFan genes to another class of acquired genes (HOPs), which have homologs in other bacteria. A total of 54 ORFan and HOP genes selected from different phylogenetic depths in the Escherichia coli lineage were cloned, expressed, purified and subjected to CD spectroscopy. A majority of genes could be expressed, but only 18 yielded sufficient soluble protein for spectral analysis. Of these, half were significantly α-helical, three were predominantly β-sheet, and six were of intermediate or indeterminate structure. Although a higher proportion of HOPs yielded soluble proteins with resolvable secondary structures, ORFans resembled HOPs with regard to most of the other features tested. Overall, we found that those ORFan and HOP genes that have persisted in the E. coli lineage were more likely to encode soluble and folded proteins, more likely to display environmental modulation of their gene expression, and by extrapolation, are more likely to be functional.
E. coli; genome evolution; lateral gene transfer; ORFans; protein folding
Despite numerous comparative mitochondrial genomics studies revealing that animal mitochondrial genomes are highly conserved in terms of gene content, supplementary genes are sometimes found, often arising from gene duplication. Mitochondrial ORFans (ORFs having no detectable homology and unknown function) were found in bivalve molluscs with Doubly Uniparental Inheritance (DUI) of mitochondria. In DUI animals, two mitochondrial lineages are present: one transmitted through females (F-type) and the other through males (M-type), each showing a specific and conserved ORF. The analysis of 34 mitochondrial major Unassigned Regions of Musculista senhousia F- and M-mtDNA allowed us to verify the presence of novel mitochondrial ORFs in this species and to compare them with ORFs from other species with ascertained DUI, with other bivalves and with animals showing new mitochondrial elements. Overall, 17 ORFans from nine species were analyzed for structure and function. Many clues suggest that the analyzed ORFans arose from endogenization of viral genes. The co-option of such novel genes by viral hosts may have determined some evolutionary aspects of host life cycle, possibly involving mitochondria. The structure similarity of DUI ORFans within evolutionary lineages may also indicate that they originated from independent events. If these novel ORFs are in some way linked to DUI establishment, a multiple origin of DUI has to be considered. These putative proteins may have a role in the maintenance of sperm mitochondria during embryo development, possibly masking them from the degradation processes that normally affect sperm mitochondria in species with strictly maternal inheritance.
mitochondrial ORFans; mitochondrial inheritance; Doubly Uniparental Inheritance of mitochondria; endogenous virus
The mimivirus genome contains many genes that lack homologs in the sequence database and are thus known as ORFans. In addition, mimivirus genes that encode proteins belonging to known fold families are in some cases fused to domain-sized segments that cannot be classified. One such ORFan region is present in the mimivirus enzyme R596, a member of the Erv family of sulfhydryl oxidases. We determined the structure of a variant of full-length R596 and observed that the carboxy-terminal region of R596 assumes a folded, compact domain, demonstrating that these ORFan segments can be stable structural units. Moreover, the R596 ORFan domain fold is novel, hinting at the potential wealth of protein structural innovation yet to be discovered in large double-stranded DNA viruses. In the context of the R596 dimer, the ORFan domain contributes to formation of a broad cleft enriched with exposed aromatic groups and basic side chains, which may function in binding target proteins or localization of the enzyme within the virus factory or virions. Finally, we find evidence for an intermolecular dithiol/disulfide relay within the mimivirus R596 dimer, the first such extended, intersubunit redox-active site identified in a viral sulfhydryl oxidase.
We have identified conserved orthologs in completely sequenced genomes of double-strand DNA phages and arranged them into evolutionary families (phage orthologous groups [POGs]). Using this resource to analyze the collection of known phage genomes, we find that most orthologs are unique in their genomes (having no diverged duplicates [paralogs]), and while many proteins contain multiple domains, the evolutionary recombination of these domains does not appear to be a major factor in evolution of these orthologous families. The number of POGs has been rapidly increasing over the past decade, the percentage of genes in phage genomes that have orthologs in other phages has also been increasing, and the percentage of unknown “ORFans” is decreasing as more proteins find homologs and establish a family. Other properties of phage genomes have remained relatively stable over time, most notably the high fraction of genes that are never or only rarely observed in their cellular hosts. This suggests that despite the renowned ability of phages to transduce cellular genes, these cellular “hitchhiker” genes do not dominate the phage genomic landscape, and a large fraction of the genes in phage genomes maintain an evolutionary trajectory that is distinct from that of the host genes.
Bacterial species, and even strains within species, can vary greatly in their gene contents and metabolic capabilities. We examine the evolution of this diversity by assessing the distribution and ancestry of each gene in 13 sequenced isolates of Escherichia coli and Shigella. We focus on the emergence and demise of two specific classes of genes, ORFans (genes with no homologs in present databases) and HOPs (genes with distant homologs), since these genes, in contrast to most conserved ancestral sequences, are known to be a major source of the novel features in each strain. We find that the rates of gain and loss of these genes vary greatly among strains as well as through time, and that ORFans and HOPs show very different behavior with respect to their emergence and demise. Although HOPs, which mostly represent gene acquisitions from other bacteria, originate more frequently, ORFans are much more likely to persist. This difference suggests that many adaptive traits are conferred by completely novel genes that do not originate in other bacterial genomes. With respect to the demise of these acquired genes, we find that strains of Shigella lose genes, both by disruption events and by complete removal, at accelerated rates.
Changes in genetic repertoires can alter the adaptive strategy of an organism, especially in bacteria, in which genes are continually gained and lost. Mapping the gains and losses of genes in the densely sequenced clade of Escherichia coli and Shigella shows that these genomes harbour two types of acquired genes: HOPs, which are those acquired genes with homologs in distantly related bacteria; and ORFans, which are genes without any known homologs. Surprisingly, the two classes of acquired genes display very different patterns of gain and loss. HOPs are acquired more frequently, though they rarely persist in the recipient genomes. In contrast, ORFans are much more likely to be maintained over evolutionary timescales, suggesting that despite their unknown origins, they will more often confer novel and beneficial traits to the recipient genome.
Simpler biological systems should be easier to understand and to engineer towards pre-defined goals. One way to achieve biological simplicity is through genome minimization. Here we looked for genomic islands in the fresh water cyanobacterium Synechococcus elongatus PCC 7942 (genome size 2.7 Mb) that could be used as targets for deletion. We also looked for conserved genes that might be essential for cell survival.
By using a combination of methods we identified 170 xenologs, 136 ORFans and 1401 core genes in the genome of S. elongatus PCC 7942. These represent 6.5%, 5.2% and 53.6% of the annotated genes respectively. We considered that genes in genomic islands could be found if they showed a combination of: a) unusual G+C content; b) unusual phylogenetic similarity; and/or c) a small number of the highly iterated palindrome 1 (HIP1) motif plus an unusual codon usage. The origin of the largest genomic island by horizontal gene transfer (HGT) could be corroborated by lack of coverage among metagenomic sequences from a fresh water microbialite. Evidence is also presented that xenologous genes tend to cluster in operons. Interestingly, most genes coding for proteins with a diguanylate cyclase domain are predicted to be xenologs, suggesting a role for horizontal gene transfer in the evolution of Synechococcus sensory systems.
Our estimates of genomic islands in PCC 7942 are larger than those predicted by other published methods like SIGI-HMM. Our results set a guide to non-essential genes in S. elongatus PCC 7942 indicating a path towards the engineering of a model photoautotrophic bacterial cell.
We discovered a novel interaction between phage P22 and its host Salmonella Typhimurium LT2 that is characterized by a phage mediated and targeted derepression of the host dgo operon. Upon further investigation, this interaction was found to be instigated by an ORFan gene (designated pid for phage P22 encoded instigator of dgo expression) located on a previously unannotated moron locus in the late region of the P22 genome, and encoding an 86 amino acid protein of 9.3 kDa. Surprisingly, the Pid/dgo interaction was not observed during strict lytic or lysogenic proliferation of P22, and expression of pid was instead found to arise in cells that upon infection stably maintained an unintegrated phage chromosome that segregated asymmetrically upon subsequent cell divisions. Interestingly, among the emerging siblings, the feature of pid expression remained tightly linked to the cell inheriting this phage carrier state and became quenched in the other. As such, this study is the first to reveal molecular and genetic markers authenticating pseudolysogenic development, thereby exposing a novel mechanism, timing, and populational distribution in the realm of phage–host interactions.
Viruses of bacteria, also referred to as (bacterio)phages, are the most abundant biological entity on earth and have a tremendous impact on the ecology of their hosts. It has traditionally been recognized that upon infection by a temperate phage the host cell is forced either to produce and release new virions during lytic development or to replicate and segregate the phage chromosome together with its own genetic material during lysogenic development. These developmental paths are orchestrated by a dedicated set of phage–host interactions that are able to sense and redirect host cell physiology. In addition to this classical bifurcation of temperate phage development, many studies on phage biology in natural ecosystems hypothesize the existence and significance of stable phage carrier cells that are not engaged in either lytic or lysogenic proliferation. Using Salmonella Typhimurium and phage P22 as a model system, we provide substantial evidence authenticating the existence of the phage carrier state and demonstrate that this state (i) is asymmetrically inherited among carrier cell siblings and (ii) enables the execution of a novel phage–host interaction that is not encountered during lytic or lysogenic proliferation.
Although the study of phage infection has a long history and catalyzed much of our current understanding in bacterial genetics, molecular biology, evolution and ecology, it seems that microbiologists have only just begun to explore the intricacy of phage–host interactions. In a recent manuscript by Cenens et al. we found molecular and genetic support for pseudolysogenic development in the Salmonella Typhimurium–phage P22 model system. More specifically, we observed the existence of phage carrier cells harboring an episomal P22 element that segregated asymmetrically upon subsequent divisions. Moreover, a newly discovered P22 ORFan protein (Pid) able to derepress a metabolic operon of the host (dgo) proved to be specifically expressed in these phage carrier cells. In this addendum we expand on our view regarding pseudolysogeny and its effects on bacterial and phage biology.
Salmonella Typhimurium; phage P22; phage carrier state; phage–host interactions; pseudolysogeny
A large-scale survey of potential recently acquired integrative elements in 119 archaeal and bacterial genomes reveals that many recently acquired genes have originated from integrative elements
Archaeal and bacterial genomes contain a number of genes of foreign origin that arose from recent horizontal gene transfer, but the role of integrative elements (IEs), such as viruses, plasmids, and transposable elements, in this process has not been extensively quantified. Moreover, it is not known whether IEs play an important role in the origin of ORFans (open reading frames without matches in current sequence databases), whose proportion remains stable despite the growing number of complete sequenced genomes.
We have performed a large-scale survey of potential recently acquired IEs in 119 archaeal and bacterial genomes. We developed an accurate in silico Markov model-based strategy to identify clusters of genes that show atypical sequence composition (clusters of atypical genes or CAGs) and are thus likely to be recently integrated foreign elements, including IEs. Our method identified a high number of new CAGs. Probabilistic analysis of gene content indicates that 56% of these new CAGs are likely IEs, whereas only 7% likely originated via horizontal gene transfer from distant cellular sources. Thirty-four percent of CAGs remain unassigned, which may reflect a still poor sampling of IEs associated with bacterial and archaeal diversity. Moreover, our study contributes to the issue of the origin of ORFans, because 39% of these are found inside CAGs, many of which likely represent recently acquired IEs.
Our results strongly indicate that archaeal and bacterial genomes contain an impressive proportion of recently acquired foreign genes (including ORFans) coming from a still largely unexplored reservoir of IEs.
The universally conserved J-domain proteins (JDPs) are obligate cochaperone partners of the Hsp70 (DnaK) chaperone. They stimulate Hsp70's ATPase activity, facilitate substrate delivery, and confer specific cellular localization to Hsp70. In this work, we have identified and characterized the first functional JDP protein encoded by a bacteriophage. Specifically, we show that the ORFan gene 057w of the T4-related enterobacteriophage RB43 encodes a bona fide JDP protein, named Rki, which specifically interacts with the Escherichia coli host multifunctional DnaK chaperone. However, in sharp contrast with the three known host JDP cochaperones of DnaK encoded by E. coli, Rki does not act as a generic cochaperone in vivo or in vitro. Expression of Rki alone is highly toxic for wild-type E. coli, but toxicity is abolished in the absence of endogenous DnaK or when the conserved J-domain of Rki is mutated. Further in vivo analyses revealed that Rki is expressed early after infection by RB43 and that deletion of the rki gene significantly impairs RB43 proliferation. Furthermore, we show that mutations in the host dnaK gene efficiently suppress the growth phenotype of the RB43 rki deletion mutant, thus indicating that Rki specifically interferes with DnaK cellular function. Finally, we show that the interaction of Rki with the host DnaK chaperone rapidly results in the stabilization of the heat-shock factor σ32, which is normally targeted for degradation by DnaK. The mechanism by which the Rki-dependent stabilization of σ32 facilitates RB43 bacteriophage proliferation is discussed.
Bacteriophages are the most abundant biological entities on earth. As a consequence, they represent the largest reservoir of unexplored genetic information. They control bacterial growth, mediate horizontal gene transfer, and thus exert profound influence on microbial ecology and growth. One of the striking features of bacteriophages is that they code for many open reading frames of thus far unknown biological function (called ORFans), which have been referred to as the dark matter of our biosphere. Here we have extensively characterized such a novel ORFan-encoded protein, Rki, encoded by the large, virulent enterobacteriaceae bacteriophage RB43. We show that Rki functions to control the host stress-response during the early stages of bacteriophage infection, specifically by interacting with the host DnaK/Hsp70 chaperone to stabilize the major host heat-shock factor, σ32.
In the enterobacterial species Escherichia coli and Salmonella enterica, expression of horizontally acquired genes with a higher than average AT content is repressed by the nucleoid-associated protein H-NS. A classical example of an H-NS–repressed locus is the bgl (aryl-β,D-glucoside) operon of E. coli. This locus is “cryptic,” as no laboratory growth conditions are known to relieve repression of bgl by H-NS in E. coli K12. However, repression can be relieved by spontaneous mutations. Here, we investigated the phylogeny of the bgl operon. Typing of bgl in a representative collection of E. coli demonstrated that it evolved clonally and that it is present in strains of the phylogenetic groups A, B1, and B2, while it is presumably replaced by a cluster of ORFans in the phylogenetic group D. Interestingly, the bgl operon is mutated in 20% of the strains of phylogenetic groups A and B1, suggesting erosion of bgl in these groups. However, bgl is functional in almost all B2 isolates and, in approximately 50% of them, it is weakly expressed at laboratory growth conditions. Homologs of bgl genes exist in Klebsiella, Enterobacter, and Erwinia species and also in low GC-content Gram-positive bacteria, while absent in E. albertii and Salmonella sp. This suggests horizontal transfer of bgl genes to an ancestral Enterobacterium. Conservation and weak expression of bgl in isolates of phylogenetic group B2 may indicate a functional role of bgl in extraintestinal pathogenic E. coli.
Horizontal gene transfer, an important mechanism in bacterial adaptation and evolution, requires mechanisms to avoid uncontrolled and possibly disadvantageous expression of the transferred genes. Recently, it was shown that the protein H-NS selectively silences genes gained by horizontal transfer in enteric bacteria. Regulated expression of these genes can then evolve and be integrated into the regulatory network of the new host. Our analysis of the catabolic bgl (aryl-β,D-glucoside) operon, which is silenced by H-NS in E. coli, provides a snapshot on the evolution of such a locus. Genes of the bgl operon were presumably gained by horizontal transfer from Gram-positive bacteria to ancestral enteric bacteria. In E. coli, the bgl operon co-evolved with the diversification of the species into four phylogenetic groups. In one phylogenetic group the bgl operon is functional. However, in two other phylogenetic groups, bgl accumulates disrupting mutations, and it is absent in the fourth group. This indicates that the H-NS–silenced bgl operon evolved differently in E. coli and is presumably positively selected in one phylogenetic group, while it is neutrally or negatively selected in the other groups.
Bacteriophages are the most abundant biological entities in our biosphere, characterized by their hyperplasticity, mosaic composition, and the many unknown functions (ORFans) encoded by their immense genetic repertoire. These genes are potentially maintained by the bacteriophage to allow efficient propagation on hosts encountered in nature. To test this hypothesis, we devised a selection to identify bacteriophage-encoded gene(s) that modulate the host Escherichia coli GroEL/GroES chaperone machine, which is essential for the folding of certain host and bacteriophage proteins. As a result, we identified the bacteriophage RB69 gene 39.2, of previously unknown function and showed that homologs of 39.2 in bacteriophages T4, RB43, and RB49 similarly modulate GroEL/GroES.
Production of wild-type bacteriophage T4 Gp39.2, a 58-amino-acid protein, (a) enables diverse bacteriophages to plaque on the otherwise nonpermissive groES or groEL mutant hosts in an allele-specific manner, (b) suppresses the temperature-sensitive phenotype of both groES and groEL mutants, (c) suppresses the defective UV-induced PolV function (UmuCD) of the groEL44 mutant, and (d) is lethal to the host when overproduced. Finally, as proof of principle that Gp39.2 is essential for bacteriophage growth on certain bacterial hosts, we constructed a T4 39.2 deletion strain and showed that, unlike the isogenic wild-type parent, it is incapable of propagating on certain groEL mutant hosts. We propose a model of how Gp39.2 modulates GroES/GroEL function.
For a very long time, Type II restriction enzymes (REases) have been a paradigm of ORFans: proteins with no detectable similarity to each other and to any other protein in the database, despite common cellular and biochemical function. Crystallographic analyses published until January 2008 provided high-resolution structures for only 28 of 1637 Type II REase sequences available in the Restriction Enzyme database (REBASE). Among these structures, all but two possess catalytic domains with the common PD-(D/E)XK nuclease fold. Two structures are unrelated to the others: R.BfiI exhibits the phospholipase D (PLD) fold, while R.PabI has a new fold termed ‘half-pipe’. Thus far, bioinformatic studies supported by site-directed mutagenesis have extended the number of tentatively assigned REase folds to five (now including also GIY-YIG and HNH folds identified earlier in homing endonucleases) and provided structural predictions for dozens of REase sequences without experimentally solved structures. Here, we present a comprehensive study of all Type II REase sequences available in REBASE together with their homologs detectable in the nonredundant and environmental samples databases at the NCBI. We present the summary and critical evaluation of structural assignments and predictions reported earlier, new classification of all REase sequences into families, domain architecture analysis and new predictions of three-dimensional folds. Among 289 experimentally characterized (not putative) Type II REases, whose apparently full-length sequences are available in REBASE, we assign 199 (69%) to contain the PD-(D/E)XK domain. The HNH domain is the second most common, with 24 (8%) members. When putative REases are taken into account, the fraction of PD-(D/E)XK and HNH folds changes to 48% and 30%, respectively. Fifty-six characterized (and 521 predicted) REases remain unassigned to any of the five REase folds identified so far, and may exhibit new architectures. 
These enzymes are proposed as the most interesting targets for structure determination by high-resolution experimental methods. Our analysis provides the first comprehensive map of sequence-structure relationships among Type II REases and will help to focus the efforts of structural and functional genomics of this large and biotechnologically important class of enzymes.
During the long history of biological evolution, genome structures have undergone enormous changes. Nevertheless, some traits or vestiges of the primordial genome (defined in this paper as the most primitive nucleic acid genome for life on earth) may remain in modern genetic systems. Finding these traits or vestiges is of great importance for the study of the origin and evolution of genomes. The shorter a sequence is, the less likely it is to be modified during genome evolution and, if it is mutated, the more easily it can reappear at the same site or at another site. Consequently, the genomic frequencies of very short nucleotide sequences, such as dinucleotides, have a considerable chance of being conserved over billions of years of evolution. Prokaryotic genomes are very diverse and span a wide range of GC content. Therefore, to find traits or vestiges of the primordial genome that remain in modern genetic systems, we studied the characteristics of dinucleotide frequencies across bacterial and archaeal genomes. We analyzed the dinucleotide frequency patterns of the whole-genome sequences from more than 1300 prokaryotic species (bacterial and archaeal genomes available as of December 2012). The results show that the frequencies of the dinucleotides AC, AG, CA, CT, GA, GT, TC, and TG are well conserved across various genomes, while the frequencies of other dinucleotides vary considerably among species. The dinucleotide frequency conservation/variation pattern seems to correlate with the distributions of dinucleotides throughout a genome and across genomes. Further analysis indicates that the phenomenon appears to be determined by strand symmetry of genomic sequences (the second parity rule) and by GC content variation among genomes. We discuss some possible origins of strand symmetry and propose that the frequency conservation of some dinucleotides may provide insights into the genomic composition of the primordial genetic system.
dinucleotide frequency; compositional analysis; whole-genome sequences; strand symmetry; GC content; primordial genome; origin and evolution of genomes
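The dinucleotide frequency analysis described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`dinucleotide_frequencies`, `strand_symmetry_deviation`) and the choice of overlapping windows on a single strand are assumptions made for illustration. The sketch counts the 16 dinucleotides in a sequence and measures how far the frequencies depart from the second parity rule, under which a dinucleotide and its reverse complement occur at about the same frequency on one strand.

```python
from collections import Counter
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def dinucleotide_frequencies(seq):
    """Relative frequencies of the 16 dinucleotides in one strand.

    Overlapping windows are used; windows containing any symbol
    other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if set(pair) <= set("ACGT"):
            counts[pair] += 1
    total = sum(counts.values())
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product("ACGT", repeat=2)
    }


def strand_symmetry_deviation(freqs):
    """Half the summed |f(XY) - f(revcomp(XY))| over all dinucleotides.

    The value is 0 when the second parity rule holds exactly. Each
    complementary pair is visited twice, hence the division by 2.
    """
    return sum(
        abs(freqs[d] - freqs[reverse_complement(d)]) for d in freqs
    ) / 2


if __name__ == "__main__":
    freqs = dinucleotide_frequencies("ACGTTGCAACGT")  # toy sequence
    for dinuc, f in sorted(freqs.items()):
        print(dinuc, round(f, 3))
    print("symmetry deviation:", round(strand_symmetry_deviation(freqs), 3))
```

Applied to a whole genome, the returned dictionary gives one 16-value profile per species, and such profiles can then be compared across genomes. It may also be worth noting that the eight dinucleotides reported as conserved (AC, AG, CA, CT, GA, GT, TC, TG) are exactly those that pair one A/T base with one G/C base, and that they form four reverse-complement pairs (AC/GT, AG/CT, CA/TG, GA/TC).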
Varicella-zoster virus (VZV) expresses at least six viral transcripts during latency. One of these transcripts, derived from open reading frame 63 (ORF63), is one of the most abundant viral RNAs expressed during latency. The VZV ORF63 protein has been detected in human and experimentally infected rodent ganglia by several laboratories. We have deleted >90% of both copies of the ORF63 gene from the VZV genome. Animals inoculated with the ORF63 mutant virus had lower mean copy numbers of latent VZV genomes in the dorsal root ganglia 5 to 6 weeks after infection than animals inoculated with parental or rescued virus, and the frequency of latently infected animals was significantly lower in animals infected with the ORF63 mutant virus than in animals inoculated with parental or rescued virus. In contrast, the frequency of animals latently infected with viral mutants in other genes that are equally or more impaired for replication in vitro, compared with the ORF63 mutant, is similar to that of animals latently infected with parental VZV. Examination of dorsal root ganglia 3 days after infection showed high levels of VZV DNA in animals infected with either ORF63 mutant or parental virus; however, by days 6 and 10 after infection, the level of viral DNA in animals infected with the ORF63 mutant was significantly lower than that in animals infected with parental virus. Thus, ORF63 is not required for VZV to enter ganglia but is the first VZV gene shown to be critical for establishment of latency. Since the present vaccine can reactivate and cause shingles, a VZV vaccine based on the ORF63 mutant virus might be safer.
A better understanding of the size and abundance of open reading frames (ORFs) in whole genomes may shed light on the factors that control genome complexity. Here we examine the statistical distributions of open reading frames (i.e. distribution of start and stop codons) in the fully sequenced genomes of 297 prokaryotes and 14 eukaryotes.
By fitting mixture models to data from whole genome sequences we show that the size-frequency distributions for ORFs are strikingly similar across prokaryotic and eukaryotic genomes. Moreover, we show that (i) a large fraction (60–80%) of ORF size-frequency distributions can be predicted a priori with a stochastic assembly model based on GC content, and that (ii) size-frequency distributions of the remaining “non-random” ORFs are well-fitted by log-normal or gamma distributions, and similar to the size distributions of annotated proteins.
Our findings suggest that stochastic processes have played a primary role in the evolution of genome complexity, and that common processes govern the conservation and loss of functional genomic units in both prokaryotes and eukaryotes.
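To make concrete how GC content alone can shape an ORF size distribution, the following Python sketch uses a deliberately simple stand-in for the stochastic model mentioned above. The abstract does not specify that model, so everything here is an assumption: bases are drawn independently with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, only the three standard stop codons (TAA, TAG, TGA) are considered, and start codons are ignored. Under these assumptions the number of sense codons before the first stop follows a geometric distribution.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # standard genetic code


def stop_codon_probability(gc):
    """Probability that a random codon is a stop codon.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (an illustrative simplification).
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    p_gc = gc / 2
    p_at = (1 - gc) / 2
    base = {"A": p_at, "T": p_at, "G": p_gc, "C": p_gc}
    return sum(base[c[0]] * base[c[1]] * base[c[2]] for c in STOP_CODONS)


def random_orf_length_pmf(gc, k):
    """P(exactly k sense codons precede the first stop codon)."""
    p = stop_codon_probability(gc)
    return (1 - p) ** k * p


def mean_random_orf_length(gc):
    """Expected number of sense codons before the first stop codon."""
    p = stop_codon_probability(gc)
    if p == 0:
        return float("inf")  # no stop codon can occur when gc == 1
    return (1 - p) / p


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        print(
            f"GC={gc:.1f}",
            f"P(stop)={stop_codon_probability(gc):.4f}",
            f"mean length={mean_random_orf_length(gc):.1f} codons",
        )
```

Because all three stop codons are AT-rich, this toy model predicts longer random ORFs at higher GC content. Observed ORFs that are much longer than the geometric expectation would be candidates for the "non-random" component described in the abstract.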
(Bacterio)phage PVP-SE1, isolated from a German wastewater plant, presents a high potential value as a biocontrol agent and as a diagnostic tool, even compared to the well-studied typing phage Felix 01, due to its broad lytic spectrum against different Salmonella strains. Sequence analysis of its genome (145,964 bp) shows it to be terminally redundant and circularly permuted. Its G+C content, 45.6 mol%, is lower than that of its hosts (50 to 54 mol%). We found a total of 244 open reading frames (ORFs), representing 91.6% of the coding capacity of the genome. Approximately 46% of encoded proteins are unique to this phage, and 22.1% of the proteins could be functionally assigned. This myovirus encodes a large number of tRNAs (n=24), reflecting its lytic capacity and evolution through different hosts. Tandem mass spectrometric analysis using electrospray ionization revealed 25 structural proteins as part of the mature phage particle. The genome sequence was found to share homology with 140 proteins of the Escherichia coli bacteriophage rV5. Both phages are unrelated to any other known virus, which suggests that an “rV5-like virus” genus should be created within the Myoviridae to contain these two phages.
Horizontal transfer of transposable elements (TEs) plays a key role in prokaryote genome evolution. Most TEs do not encode the enzymatic machinery allowing them to transfer between host cells and it is widely assumed in the literature that horizontal transfer of prokaryote TEs is mediated by other mobile genetic elements such as phages and plasmids. In a recent study, we have shown that phages are less tolerant to insertion sequences (IS, the most frequent class of prokaryote TEs) and therefore have a lower cargo capacity than plasmids. Consequently, while our analysis confirmed the crucial role of plasmids as efficient vehicles of IS horizontal transfer, we concluded that phages are unlikely to efficiently shuttle IS elements between prokaryotes. Here, we discuss whether or not the distribution pattern observed for IS elements in phages and plasmids also holds for other TEs, such as transposons and mobile introns. We also further explore various factors that may impact the relative capacity of phages and plasmids to mediate TE horizontal transfer among prokaryotes.
transposable element; insertion sequence; horizontal transfer; phage; plasmid; purifying selection; prokaryote