Genome-wide studies have already shed light on the evolution and enormous diversity of the viral world. Nevertheless, one of the unresolved mysteries in comparative genomics today is the abundance of ORFans: ORFs with no detectable sequence similarity to any other ORF in the databases. Recently, studies attempting to understand the origin and functions of bacterial ORFans have been reported. Here we present a first genome-wide identification and analysis of ORFans in the viral world, with a focus on bacteriophages.
Almost one-third of all ORFs in 1,456 complete virus genomes correspond to ORFans, a figure significantly larger than that observed in prokaryotes. Like prokaryotic ORFans, viral ORFans are shorter and have a lower GC content than non-ORFans. Nevertheless, a statistically significantly lower GC content is found in only a minority of viruses. Focusing on phages, we find that 38.4% of phage ORFs have no homologs in other phages, and 30.1% have homologs in neither the viral nor the prokaryotic world. Phages with different host ranges have different percentages of ORFans, reflecting different sampling status and suggesting differing levels of diversity. Similarity searches of the phage ORFeome (ORFans and non-ORFans) against prokaryotic genomes show that almost half of the phage ORFs have prokaryotic homologs, suggesting the major role that horizontal transfer plays in bacterial evolution. Surprisingly, the percentage of phage ORFans with prokaryotic homologs is only 18.7%. This suggests that phage ORFans play a lesser role in horizontal transfer to prokaryotes, but may be among the major players contributing to the vast phage diversity.
Although the current sampling of viral genomes is extremely low, ORFans and near-ORFans are likely to continue to grow in number as more genomes are sequenced. The abundance of phage ORFans may be partially due to the expected vast viral diversity, and may be instrumental in understanding viral evolution. The functions, origins and fates of the majority of viral ORFans remain a mystery. Further computational and experimental studies are likely to shed light on the mechanisms that have given rise to so many bacterial and viral ORFans.
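The bookkeeping behind figures such as the ORFan fraction and the GC-content contrast reported above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the function names and the `hits` mapping (ORF identifier to the set of other genomes with a significant match) are hypothetical, and the choice of search tool and significance cutoff is left entirely to the caller.

```python
from statistics import mean


def gc_content(seq: str) -> float:
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of a DNA sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def split_orfans(orfs: dict, hits: dict):
    """Split ORFs into ORFans and non-ORFans.

    orfs: ORF id -> nucleotide sequence.
    hits: ORF id -> set of genome ids with a significant match
          (hypothetical input; an ORF with no entry or an empty set is an ORFan).
    """
    orfans = {k: s for k, s in orfs.items() if not hits.get(k)}
    non_orfans = {k: s for k, s in orfs.items() if hits.get(k)}
    return orfans, non_orfans


def summarize(orfs: dict, hits: dict) -> dict:
    """ORFan fraction plus mean GC content of each class (None if a class is empty)."""
    orfans, non_orfans = split_orfans(orfs, hits)
    return {
        "orfan_fraction": len(orfans) / len(orfs),
        "mean_gc_orfans": mean(gc_content(s) for s in orfans.values()) if orfans else None,
        "mean_gc_non_orfans": mean(gc_content(s) for s in non_orfans.values()) if non_orfans else None,
    }
```

Whether a difference in mean GC content is statistically significant for a given genome would need a separate test on the per-ORF values; the sketch reports only the summary numbers.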
Mitochondrial ORFans (open reading frames having no detectable homology and with unknown function) were discovered in bivalve molluscs with doubly uniparental inheritance (DUI) of mitochondria. In these animals, two mitochondrial lineages are present, one transmitted through eggs (F-type), the other through sperm (M-type), each showing a specific ORFan. In this study, we used in situ hybridization and immunocytochemistry to provide evidence for the expression of the Ruditapes philippinarum male-specific ORFan (orf21): both the transcript and the protein (RPHM21) were localized in spermatogenic cells and mature spermatozoa; the protein was localized in sperm mitochondria and nuclei, and in early embryos. Also, in silico analyses of the orf21 flanking region and RPHM21 structure supported its derivation from viral sequence endogenization. We propose that RPHM21 prevents the recognition of M-type mitochondria by the degradation machinery, allowing their survival in the zygote. The process might involve a mechanism similar to that of Modulators of Immune Recognition, viral proteins involved in the immune recognition pathway, to which RPHM21 showed structural similarities. A viral origin of RPHM21 may also support a developmental role, because some integrated viral elements are involved in development and sperm differentiation of their host. Mitochondrial ORFans could be responsible for or participate in the DUI mechanism, and their viral origin could explain the acquired capability of M-type mitochondria to avoid degradation and invade the germ line, which is what viruses do best: elude the host immune system and proliferate.
mitochondrial ORFan; viral endogenization; novel mitochondrial protein; testis expression; doubly uniparental inheritance of mitochondria; embryo development
ORFans are open reading frames (ORFs) with no detectable sequence similarity
to any other sequence in the databases. Each newly sequenced genome contains a
significant number of ORFans. Therefore, ORFans entail interesting evolutionary
puzzles. However, little can be learned about them using bioinformatics tools, and
their study seems to have been underemphasized. Here we present some of the
questions that the existence of so many ORFans has raised and review some of
the studies aimed at understanding ORFans, their functions and their origins. These
works have demonstrated that ORFans are an untapped source of research, requiring
further computational and experimental studies.
A large-scale survey of potential recently acquired integrative elements in 119 archaeal and bacterial genomes reveals that many recently acquired genes have originated from integrative elements
Archaeal and bacterial genomes contain a number of genes of foreign origin that arose from recent horizontal gene transfer, but the role of integrative elements (IEs), such as viruses, plasmids, and transposable elements, in this process has not been extensively quantified. Moreover, it is not known whether IEs play an important role in the origin of ORFans (open reading frames without matches in current sequence databases), whose proportion remains stable despite the growing number of completely sequenced genomes.
We have performed a large-scale survey of potential recently acquired IEs in 119 archaeal and bacterial genomes. We developed an accurate in silico Markov model-based strategy to identify clusters of genes that show atypical sequence composition (clusters of atypical genes or CAGs) and are thus likely to be recently integrated foreign elements, including IEs. Our method identified a high number of new CAGs. Probabilistic analysis of gene content indicates that 56% of these new CAGs are likely IEs, whereas only 7% likely originated via horizontal gene transfer from distant cellular sources. Thirty-four percent of CAGs remain unassigned, which may reflect a still poor sampling of IEs associated with bacterial and archaeal diversity. Moreover, our study contributes to the issue of the origin of ORFans, because 39% of these are found inside CAGs, many of which likely represent recently acquired IEs.
Our results strongly indicate that archaeal and bacterial genomes contain an impressive proportion of recently acquired foreign genes (including ORFans) coming from a still largely unexplored reservoir of IEs.
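The idea of flagging genes with atypical sequence composition using a Markov model, and then grouping neighbouring atypical genes into clusters, can be illustrated with a small Python sketch. The model order, scoring function and thresholds of the actual strategy are not given above, so everything here is an assumption: a first-order nucleotide Markov chain trained on all genes of a genome, a per-transition mean log-likelihood as the score, and a hypothetical `min_genes` cutoff for what counts as a cluster.

```python
import math

BASES = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """First-order Markov chain: log P(next base | current base), with pseudocounts."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for s in seqs:
        s = s.upper()
        for a, b in zip(s, s[1:]):
            if a in counts and b in counts[a]:  # skip ambiguous bases such as N
                counts[a][b] += 1
    model = {}
    for a in BASES:
        total = sum(counts[a].values())
        model[a] = {b: math.log(counts[a][b] / total) for b in BASES}
    return model


def mean_log_likelihood(seq, model):
    """Average log-probability per transition; lower means more atypical."""
    seq = seq.upper()
    pairs = [(a, b) for a, b in zip(seq, seq[1:]) if a in model and b in model[a]]
    if not pairs:
        return float("nan")
    return sum(model[a][b] for a, b in pairs) / len(pairs)


def atypical_clusters(scores, threshold, min_genes=2):
    """Index ranges (start, end) of consecutive genes scoring below threshold.

    scores are taken in genome order; NaN scores never count as atypical.
    """
    clusters, start = [], None
    for i, s in enumerate(scores):
        if s < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_genes:
                clusters.append((start, i - 1))
            start = None
    if start is not None and len(scores) - start >= min_genes:
        clusters.append((start, len(scores) - 1))
    return clusters
```

In use, one would train the model on every gene of a genome, score each gene in chromosomal order, and pick the threshold from the score distribution (for example, a low percentile); that choice strongly affects how many clusters are reported.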
Despite numerous comparative mitochondrial genomics studies revealing that animal mitochondrial genomes are highly conserved in terms of gene content, supplementary genes are sometimes found, often arising from gene duplication. Mitochondrial ORFans (ORFs having no detectable homology and unknown function) were found in bivalve molluscs with Doubly Uniparental Inheritance (DUI) of mitochondria. In DUI animals, two mitochondrial lineages are present: one transmitted through females (F-type) and the other through males (M-type), each showing a specific and conserved ORF. The analysis of 34 mitochondrial major Unassigned Regions of Musculista senhousia F- and M-mtDNA allowed us to verify the presence of novel mitochondrial ORFs in this species and to compare them with ORFs from other species with ascertained DUI, with other bivalves and with animals showing new mitochondrial elements. Overall, 17 ORFans from nine species were analyzed for structure and function. Many clues suggest that the analyzed ORFans arose from endogenization of viral genes. The co-option of such novel genes by viral hosts may have determined some evolutionary aspects of host life cycle, possibly involving mitochondria. The structure similarity of DUI ORFans within evolutionary lineages may also indicate that they originated from independent events. If these novel ORFs are in some way linked to DUI establishment, a multiple origin of DUI has to be considered. These putative proteins may have a role in the maintenance of sperm mitochondria during embryo development, possibly masking them from the degradation processes that normally affect sperm mitochondria in species with strictly maternal inheritance.
mitochondrial ORFans; mitochondrial inheritance; Doubly Uniparental Inheritance of mitochondria; endogenous virus
As each newly sequenced genome contains a significant number of protein-coding ORFs that are species-, family- or lineage-specific, many interesting questions arise about the evolution and role of these ORFs and of the genomes they are part of. We refer to these poorly conserved ORFs as singleton or paralogous ORFans if they are unique to one genome, or as orthologous ORFans if they appear only in a family of closely related organisms and have no homolog in other genomes. In order to study and classify ORFans we have constructed the ORFanage, an ORFan database. This database consists of the predicted ORFs in fully sequenced microbial genomes, and enables searching for the three types of ORFans in any subset of the genomes chosen by the user. The ORFanage could help in choosing interesting targets for further genomic and evolutionary studies. The ORFanage is accessible via http://www.bioinformatics.buffalo.edu/ORFanage.
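The three ORFan types defined above lend themselves to a simple decision rule once similarity hits are available. The Python sketch below is one possible reading of those definitions, not the ORFanage implementation: the function name, the representation of hits as genome identifiers, and the assumption that a singleton has no hits at all while a paralogous ORFan has hits only within its own genome are all illustrative.

```python
def classify_orf(own_genome, hit_genomes, family):
    """Classify one ORF from the genomes in which it has significant hits.

    own_genome:  id of the genome that encodes the ORF.
    hit_genomes: genome ids of significant hits to *other* ORFs
                 (may include own_genome when the ORF has paralogs).
    family:      user-chosen set of closely related genomes, including own_genome.
    """
    hits = set(hit_genomes)
    if not hits:
        return "singleton ORFan"    # no similar ORF anywhere
    if hits == {own_genome}:
        return "paralogous ORFan"   # similar ORFs only within the same genome
    if hits <= set(family):
        return "orthologous ORFan"  # similar ORFs only within the chosen family
    return "non-ORFan"              # at least one hit outside the family
```

Because the result depends on the `family` argument, the same ORF can be an orthologous ORFan for one subset of genomes and a non-ORFan for another, which mirrors the user-selectable subsets described above.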
The mimivirus genome contains many genes that lack homologs in the sequence database and are thus known as ORFans. In addition, mimivirus genes that encode proteins belonging to known fold families are in some cases fused to domain-sized segments that cannot be classified. One such ORFan region is present in the mimivirus enzyme R596, a member of the Erv family of sulfhydryl oxidases. We determined the structure of a variant of full-length R596 and observed that the carboxy-terminal region of R596 assumes a folded, compact domain, demonstrating that these ORFan segments can be stable structural units. Moreover, the R596 ORFan domain fold is novel, hinting at the potential wealth of protein structural innovation yet to be discovered in large double-stranded DNA viruses. In the context of the R596 dimer, the ORFan domain contributes to formation of a broad cleft enriched with exposed aromatic groups and basic side chains, which may function in binding target proteins or localization of the enzyme within the virus factory or virions. Finally, we find evidence for an intermolecular dithiol/disulfide relay within the mimivirus R596 dimer, the first such extended, intersubunit redox-active site identified in a viral sulfhydryl oxidase.
Mimivirus isolated from A. polyphaga is the largest virus discovered so far. It is unique among all the viruses in having genes related to translation, DNA repair and replication which bear close homology to eukaryotic genes. Nevertheless, only a small fraction of the proteins (33%) encoded in this genome has been assigned a function. Furthermore, a large fraction of the unassigned protein sequences bear no sequence similarity to proteins from other genomes. These sequences are referred to as ORFans. Because of their lack of sequence similarity to other proteins, they cannot be assigned putative functions using standard sequence comparison methods. As part of our genome-wide computational efforts aimed at characterizing Mimivirus ORFans, we have applied fold-recognition methods to predict the structures of these ORFans, and we derived functional predictions from the conservation of functionally important residues in sequence-template alignments.
Using fold recognition, we have identified highly confident computational 3D structural assignments for 21 Mimivirus ORFans. In addition, highly confident functional predictions for 6 of these ORFans were derived by analyzing the conservation of functional motifs between the predicted structures and proteins of known function. This analysis allowed us to classify these 6 previously unannotated ORFans into their specific protein families: carboxylesterase/thioesterase, metal-dependent deacetylase, P-loop kinases, 3-methyladenine DNA glycosylase, BTB domain and eukaryotic translation initiation factor eIF4E.
Using stringent fold recognition criteria we have assigned three-dimensional structures for 21 of the ORFans encoded in the Mimivirus genome. Further, based on the 3D models and an analysis of the conservation of functionally important residues and motifs, we were able to derive functional attributes for 6 of the ORFans. Our computational identification of important functional sites in these ORFans can be the basis for a subsequent experimental verification of our predictions. Further computational and experimental studies are required to elucidate the 3D structures and functions of the remaining Mimivirus ORFans.
Bacterial species, and even strains within species, can vary greatly in their gene contents and metabolic capabilities. We examine the evolution of this diversity by assessing the distribution and ancestry of each gene in 13 sequenced isolates of Escherichia coli and Shigella. We focus on the emergence and demise of two specific classes of genes, ORFans (genes with no homologs in present databases) and HOPs (genes with distant homologs), since these genes, in contrast to most conserved ancestral sequences, are known to be a major source of the novel features in each strain. We find that the rates of gain and loss of these genes vary greatly among strains as well as through time, and that ORFans and HOPs show very different behavior with respect to their emergence and demise. Although HOPs, which mostly represent gene acquisitions from other bacteria, originate more frequently, ORFans are much more likely to persist. This difference suggests that many adaptive traits are conferred by completely novel genes that do not originate in other bacterial genomes. With respect to the demise of these acquired genes, we find that strains of Shigella lose genes, both by disruption events and by complete removal, at accelerated rates.
Changes in genetic repertoires can alter the adaptive strategy of an organism, especially in bacteria, in which genes are continually gained and lost. Mapping the gains and losses of genes in the densely sequenced clade of Escherichia coli and Shigella shows that these genomes harbour two types of acquired genes: HOPs, which are those acquired genes with homologs in distantly related bacteria; and ORFans, which are genes without any known homologs. Surprisingly, the two classes of acquired genes display very different patterns of gain and loss. HOPs are acquired more frequently, though they rarely persist in the recipient genomes. In contrast, ORFans are much more likely to be maintained over evolutionary timescales, suggesting that despite their unknown origins, they will more often confer novel and beneficial traits to the recipient genome.
ORFan genes can constitute a large fraction of a bacterial genome, but due to their lack of homologs, their functions have remained largely unexplored. To determine if particular features of ORFan-encoded proteins promote their presence in a genome, we analyzed properties of ORFans that originated over a broad evolutionary timescale. We also compared ORFan genes to another class of acquired genes (HOPs), which have homologs in other bacteria. A total of 54 ORFan and HOP genes selected from different phylogenetic depths in the Escherichia coli lineage were cloned, expressed, purified and subjected to CD spectroscopy. A majority of genes could be expressed, but only 18 yielded sufficient soluble protein for spectral analysis. Of these, half were significantly α-helical, three were predominantly β-sheet, and six were of intermediate or indeterminate structure. Although a higher proportion of HOPs yielded soluble proteins with resolvable secondary structures, ORFans resembled HOPs with regard to most of the other features tested. Overall, we found that those ORFan and HOP genes that have persisted in the E. coli lineage were more likely to encode soluble and folded proteins, more likely to display environmental modulation of their gene expression, and by extrapolation, are more likely to be functional.
E. coli; genome evolution; lateral gene transfer; ORFans; protein folding
Genomic analysis of giant viruses, such as Mimivirus, has revealed that more than half of the putative genes have no known functions (ORFans). We knocked down Mimivirus genes using short interfering RNA as a proof of concept to determine the functions of giant virus ORFans. As fibers are easy to observe, we targeted a gene encoding a protein absent in a Mimivirus mutant devoid of fibers as well as three genes encoding products identified in a protein concentrate of fibers, including one ORFan and one gene of unknown function. We found that knocking down these four genes was associated with depletion or modification of the fibers. Our strategy of silencing ORFan genes in giant viruses opens a way to identify their complete gene repertoire and may clarify the role of these genes, differentiating between junk DNA and truly used genes. Using this strategy, we were able to annotate four proteins in Mimivirus and 30 homologous proteins in other giant viruses. In addition, we were able to annotate >500 proteins from cellular organisms and 100 from metagenomic databases.
Mimivirus; giant virus; Megavirales; fiber; short interfering RNA; RNA interference; nucleocytoplasmic large DNA virus
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique—shotgun sequencing—allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data—nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.
The GOS data identified 6.12 million predicted proteins covering nearly all known prokaryotic protein families, and several new families. This almost doubles the number of known proteins and shows that we are far from identifying all the proteins in nature.
The origin and evolution of “ORFans” (suspected genes without known relatives) remain unclear. Here, we take advantage of a unique opportunity to examine the population diversity of thousands of ORFans, based on a collection of 35 complete genomes of isolates of Escherichia coli and Shigella (which is included phylogenetically within E. coli). As expected from previous studies, ORFans are shorter and AT-richer in sequence than non-ORFans. We find that ORFans often are very narrowly distributed: the most common pattern is for an ORFan to be found in only one genome. We compared within-species population diversity of ORFan genes with those of two control groups of non-ORFan genes. Patterns of population variation suggest that most ORFans are not artifacts, but encode real genes whose protein-coding capacity is conserved, reflecting selection against nonsynonymous mutations. Nevertheless, nonsynonymous nucleotide diversity is higher than for non-ORFans, whereas synonymous diversity is roughly the same. In particular, there is a several-fold excess of ORFans in the highest decile of diversity relative to controls, which might be due to weaker purifying selection, positive selection, or a subclass of ORFans that are decaying.
ORFan; lineage-specific genes; evolution; population genetics; positive selection; negative selection
Motivation: Intriguingly, sequence analysis of genomes reveals that a large number of genes are unique to each organism. The origin of these genes, termed ORFans, is not known. Here, we explore the origin of ORFan genes by defining a simple measure called ‘composition bias’, based on the deviation of the amino acid composition of a given sequence from the average composition of all proteins of a given genome.
Results: For a set of 47 prokaryotic genomes, we show that the amino acid composition bias of real proteins, random ‘proteins’ (created by using the nucleotide frequencies of each genome) and ‘proteins’ translated from intergenic regions are distinct. For ORFans, we observed a correlation between their composition bias and their relative evolutionary age. Recent ORFan proteins have compositions more similar to those of random ‘proteins’, while the compositions of more ancient ORFan proteins are more similar to those of the set of all proteins of the organism. This observation is consistent with an evolutionary scenario wherein ORFan genes emerged and underwent a large number of random mutations and selection, eventually adapting to the composition preference of their organism over time.
Supplementary information: Supplementary data are available at Bioinformatics online.
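The composition bias measure described above can be illustrated with a short Python sketch. The exact formula is not given in the abstract, so this version makes two explicit assumptions: the genome background is the pooled residue frequency over all proteins, and the bias is the Euclidean distance between a protein's 20-dimensional amino acid frequency vector and that background. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Frequencies of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def genome_background(proteins):
    """Pooled composition over all proteins of a genome (an assumed definition)."""
    return composition("".join(proteins))


def composition_bias(seq, background):
    """Euclidean distance between a protein's composition and the background."""
    freqs = composition(seq)
    return math.sqrt(sum((f - b) ** 2 for f, b in zip(freqs, background)))
```

Under this reading, a protein whose composition matches the genome average scores 0, and the score grows as the composition drifts away from it. The same function can be applied to random or intergenic translations to compare the three classes of sequences.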
Viruses have been suggested to be the largest source of genetic diversity on Earth. Genome sequencing and metagenomic surveys reveal that novel genes with unknown functions are abundant in viral genomes. Yet few observations exist for the processes and frequency by which these genes are gained and lost. The surface waters of marine environments are dominated by marine picocyanobacteria and their co-existing viruses (cyanophages). Recent genome sequencing of cyanophages has revealed a vast array of genes that have been acquired from their cyanobacterial hosts. Here, we re-sequenced the cyanophage S-PM2 genome after 10 years of near continuous passage through its marine Synechococcus host. During this time a spontaneous mutant (S-PM2d) lacking 13% of the S-PM2 ORFs became dominant in the cyanophage population. These ORFs are found at a single locus and are not homologous to any proteins in any other sequenced organism (ORFans). We demonstrate a fitness cost to S-PM2WT associated with possession of these ORFs under standard laboratory growth. Metagenomic surveys reveal these ORFs are present in various aquatic environments, are likely of cyanophage origin and appear to be enriched in environments from the extremes of salinity (freshwater and hypersaline). We posit that these ORFs contribute to the flexible gene content of cyanophages and offer a distinct fitness advantage in freshwater and hypersaline environments.
The holE gene is an enterobacterial ORFan gene (open reading frame [ORF] with no detectable homology to other ORFs in a database). It encodes the θ subunit of the DNA polymerase III core complex. The precise function of the θ subunit within this complex is not well established, and loss of holE does not result in a noticeable phenotype. Paralogs of holE are also present on many conjugative plasmids and on phage P1 (hot gene). In this study, we provide evidence indicating that θ (HolE) exhibits structural and functional similarities to a family of nucleoid-associated regulatory proteins, the Hha/YdgT-like proteins that are also encoded by enterobacterial ORFan genes. Microarray studies comparing the transcriptional profiles of Escherichia coli holE, hha, and ydgT mutants revealed highly similar expression patterns for strains harboring holE and ydgT alleles. Among the genes differentially regulated in both mutants were genes of the tryptophanase (tna) operon. The tna operon consists of a transcribed leader region, tnaL, and two structural genes, tnaA and tnaB. Further experiments with transcriptional lacZ fusions (tnaL::lacZ and tnaA::lacZ) indicate that HolE and YdgT downregulate expression of the tna operon by possibly increasing the level of Rho-dependent transcription termination at the tna operon's leader region. Thus, for the first time, a regulatory function can be attributed to HolE, in addition to its role as structural component of the DNA polymerase III complex.
Acanthamoeba polyphaga mimivirus is the largest known virus in both particle size and genome complexity. Its 1.2-Mb genome encodes 911 proteins, among which only 298 have predicted functions. The composition of purified isolated virions was analyzed by using a combined electrophoresis/mass spectrometry approach allowing the identification of 114 proteins. Besides the expected major structural components, the viral particle packages 12 proteins unambiguously associated with transcriptional machinery, 3 proteins associated with DNA repair, and 2 topoisomerases. Other main functional categories represented in the virion include oxidative pathways and protein modification. More than half of the identified virion-associated proteins correspond to anonymous genes of unknown function, including 45 “ORFans.” As demonstrated by both Western blotting and immunogold staining, some of these “ORFans,” which lack any convincing similarity in the sequence databases, are endowed with antigenic properties. Thus, anonymous and unique genes constituting the majority of the mimivirus gene complement encode bona fide proteins that are likely to participate in well-integrated processes.
The universally conserved J-domain proteins (JDPs) are obligate cochaperone partners of the Hsp70 (DnaK) chaperone. They stimulate Hsp70's ATPase activity, facilitate substrate delivery, and confer specific cellular localization to Hsp70. In this work, we have identified and characterized the first functional JDP protein encoded by a bacteriophage. Specifically, we show that the ORFan gene 057w of the T4-related enterobacteriophage RB43 encodes a bona fide JDP protein, named Rki, which specifically interacts with the Escherichia coli host multifunctional DnaK chaperone. However, in sharp contrast with the three known host JDP cochaperones of DnaK encoded by E. coli, Rki does not act as a generic cochaperone in vivo or in vitro. Expression of Rki alone is highly toxic for wild-type E. coli, but toxicity is abolished in the absence of endogenous DnaK or when the conserved J-domain of Rki is mutated. Further in vivo analyses revealed that Rki is expressed early after infection by RB43 and that deletion of the rki gene significantly impairs RB43 proliferation. Furthermore, we show that mutations in the host dnaK gene efficiently suppress the growth phenotype of the RB43 rki deletion mutant, thus indicating that Rki specifically interferes with DnaK cellular function. Finally, we show that the interaction of Rki with the host DnaK chaperone rapidly results in the stabilization of the heat-shock factor σ32, which is normally targeted for degradation by DnaK. The mechanism by which the Rki-dependent stabilization of σ32 facilitates RB43 bacteriophage proliferation is discussed.
Bacteriophages are the most abundant biological entities on earth. As a consequence, they represent the largest reservoir of unexplored genetic information. They control bacterial growth, mediate horizontal gene transfer, and thus exert profound influence on microbial ecology and growth. One of the striking features of bacteriophages is that they code for many open reading frames of thus far unknown biological function (called ORFans), which have been referred to as the dark matter of our biosphere. Here we have extensively characterized such a novel ORFan-encoded protein, Rki, encoded by the large, virulent enterobacteriaceae bacteriophage RB43. We show that Rki functions to control the host stress-response during the early stages of bacteriophage infection, specifically by interacting with the host DnaK/Hsp70 chaperone to stabilize the major host heat-shock factor, σ32.
For a very long time, Type II restriction enzymes (REases) have been a paradigm of ORFans: proteins with no detectable similarity to each other and to any other protein in the database, despite common cellular and biochemical function. Crystallographic analyses published until January 2008 provided high-resolution structures for only 28 of 1637 Type II REase sequences available in the Restriction Enzyme database (REBASE). Among these structures, all but two possess catalytic domains with the common PD-(D/E)XK nuclease fold. Two structures are unrelated to the others: R.BfiI exhibits the phospholipase D (PLD) fold, while R.PabI has a new fold termed ‘half-pipe’. Thus far, bioinformatic studies supported by site-directed mutagenesis have extended the number of tentatively assigned REase folds to five (now including also GIY-YIG and HNH folds identified earlier in homing endonucleases) and provided structural predictions for dozens of REase sequences without experimentally solved structures. Here, we present a comprehensive study of all Type II REase sequences available in REBASE together with their homologs detectable in the nonredundant and environmental samples databases at the NCBI. We present the summary and critical evaluation of structural assignments and predictions reported earlier, new classification of all REase sequences into families, domain architecture analysis and new predictions of three-dimensional folds. Among 289 experimentally characterized (not putative) Type II REases, whose apparently full-length sequences are available in REBASE, we assign 199 (69%) to contain the PD-(D/E)XK domain. The HNH domain is the second most common, with 24 (8%) members. When putative REases are taken into account, the fraction of PD-(D/E)XK and HNH folds changes to 48% and 30%, respectively. Fifty-six characterized (and 521 predicted) REases remain unassigned to any of the five REase folds identified so far, and may exhibit new architectures. 
These enzymes are proposed as the most interesting targets for structure determination by high-resolution experimental methods. Our analysis provides the first comprehensive map of sequence-structure relationships among Type II REases and will help to focus the efforts of structural and functional genomics of this large and biotechnologically important class of enzymes.
An algorithm is presented that returns the optimal pairwise gapped alignment of two sets of signed numerical sequence values. One distinguishing feature of this algorithm is a flexible comparison engine (based on both relative shape and absolute similarity measures) that does not rely on explicit gap penalties. Additionally, an empirical probability model is developed to estimate the significance of the returned alignment with respect to randomized data. The algorithm's utility for biological hypothesis formulation is demonstrated with test cases including database search and pairwise alignment of protein hydropathy. However, the algorithm and probability model could possibly be extended to accommodate other diverse types of protein or nucleic acid data, including positional thermodynamic stability and mRNA translation efficiency. The algorithm requires only numerical values as input and will readily compare data other than protein hydropathy. The tool is therefore expected to complement, rather than replace, existing sequence- and structure-based tools and may inform medical discovery, as exemplified by a proposed similarity between a chlamydial ORFan protein and a bacterial colicin pore-forming domain. The source code, documentation, and a basic web-server application are available.
Trend discovery is an important way to generate understanding from large amounts of data. We have developed a novel tool that discovers significantly similar trends shared between two numerical data sets. Since the tool's algorithmic method compares both the relative shapes of the “peaks” and “valleys” in the data and the absolute magnitudes of the numerical values, we believe the tool is tolerant of imperfections and could be applicable to a wide range of scientific, engineering, social, or economic problems. In short, if measurements can be converted to a series of numbers, our tool may be useful for trend discovery. Since we are a protein biophysics group, we are most naturally interested in discovering new similarities between proteins, and we have discovered a particularly interesting, statistically significant similarity between a protein unique to Chlamydia and a bacterial pore-forming protein, colicin. This previously unreported similarity may have medical relevance, and we are currently experimentally testing the properties of the chlamydial protein in the laboratory. In a second example, we demonstrate the tool's ability to easily recover a known, but difficult to detect, relationship between two other GPCR proteins.
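The idea of scoring two numerical profiles by both relative shape and absolute magnitude, and then judging significance against randomized data, can be illustrated with a minimal sketch. This is not the published algorithm: it is ungapped, it uses Pearson correlation as a stand-in shape measure and a scaled mean absolute difference as a stand-in magnitude measure, and every function name, the equal weighting of the two measures, and the `scale` and `trials` parameters are assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    sa, sb = pstdev(a), pstdev(b)
    if sa == 0 or sb == 0:
        return 0.0
    ma, mb = mean(a), mean(b)
    cov = mean((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sa * sb)


def window_score(a, b, scale):
    """Average a shape term (correlation) and a magnitude term (closeness of values)."""
    shape = pearson(a, b)
    # Hypothetical magnitude measure: 1 for identical values, 0 once the
    # mean absolute difference reaches `scale`.
    magnitude = 1.0 - min(1.0, mean(abs(x - y) for x, y in zip(a, b)) / scale)
    return 0.5 * (shape + magnitude)


def best_ungapped_match(query, target, scale=1.0):
    """Slide `query` along `target`; return (best_score, offset). No gaps are modeled."""
    n = len(query)
    best = (float("-inf"), -1)
    for offset in range(len(target) - n + 1):
        score = window_score(query, target[offset:offset + n], scale)
        if score > best[0]:
            best = (score, offset)
    return best


def empirical_p_value(query, target, scale=1.0, trials=200, seed=0):
    """Fraction of shuffled targets whose best score reaches the observed best score."""
    rng = random.Random(seed)
    observed, _ = best_ungapped_match(query, target, scale)
    shuffled = list(target)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        score, _ = best_ungapped_match(query, shuffled, scale)
        if score >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

In such a sketch the input could be any per-position numerical profile, for example a windowed hydropathy trace; a gapped comparison, as described in the abstract, would need a dynamic-programming step that is deliberately left out here.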
The increasing number of genome sequences of archaea and bacteria shows their adaptation to different environmental conditions at the genomic level. Aeropyrum spp. are aerobic and hyperthermophilic archaea. Aeropyrum camini was isolated from a deep-sea hydrothermal vent, and Aeropyrum pernix was isolated from a coastal solfataric vent. To investigate the adaptation strategy in each habitat, we compared the genomes of the two species. Shared genome features were a small genome size, a high GC content, and a large portion of orthologous genes (86 to 88%). The genomes also showed high synteny. These shared features may have been derived from the small number of mobile genetic elements and the lack of a RecBCD system, a recombinational enzyme complex. In addition, the specialized physiology (aerobic and hyperthermophilic) of Aeropyrum spp. may also contribute to the entire-genome similarity. Despite the stability of the genomes, synteny was interrupted by two proviruses, A. pernix spindle-shaped virus 1 (APSV1) and A. pernix ovoid virus 1 (APOV1), and by clustered regularly interspaced short palindromic repeat (CRISPR) elements. Spacer sequences derived from the A. camini CRISPR showed significant matches with protospacers of the two proviruses infecting A. pernix, indicating that A. camini interacted with viruses closely related to APSV1 and APOV1. Furthermore, a significant fraction of the nonorthologous genes (41 to 45%) were proviral genes or ORFans probably originating from viruses. Although the genomes of A. camini and A. pernix were conserved, we observed nonsynteny that was attributed primarily to virus-related elements. Our findings indicated that the genomic diversification of Aeropyrum spp. is substantially caused by viruses.
The identification of novel giant viruses from the nucleocytoplasmic large DNA viruses group and their virophages has increased in the last decade and has helped to shed light on viral evolution. This study describes the discovery, isolation and characterization of Samba virus (SMBV), a novel giant virus belonging to the Mimivirus genus, which was isolated from the Negro River in the Brazilian Amazon. We also report the isolation of an SMBV-associated virophage named Rio Negro (RNV), which is the first Mimivirus virophage to be isolated in the Americas.
Based on a phylogenetic analysis, SMBV belongs to group A of the putative Megavirales order, possibly a new virus related to Acanthamoeba polyphaga mimivirus (APMV). SMBV is the largest virus isolated in Brazil, with an average particle diameter of about 574 nm. The SMBV genome contains 938 ORFs, of which nine are ORFans. The 1,213.6 kb SMBV genome is one of the largest genomes of any group A Mimivirus described to date. Electron microscopy showed RNV particle accumulation near SMBV and APMV factories, resulting in the production of defective SMBV and APMV particles and decreasing the infectivity of these two viruses by several logs.
This discovery expands our knowledge of Mimiviridae evolution and ecology.
Mimiviridae; DNA virus; Giant virus; NCLDV; Virophage; Amazon; Brazil
Nucleomorphs are the remnant nuclei of algal endosymbionts that were engulfed by nonphotosynthetic host eukaryotes. These peculiar organelles are found in cryptomonad and chlorarachniophyte algae, where they evolved from red and green algal endosymbionts, respectively. Despite their independent origins, cryptomonad and chlorarachniophyte nucleomorph genomes are similar in size and structure: they are both <1 million base pairs in size (the smallest nuclear genomes known), comprise three chromosomes, and possess subtelomeric ribosomal DNA operons. Here, we report the complete sequence of one of the smallest cryptomonad nucleomorph genomes known, that of the secondarily nonphotosynthetic cryptomonad Cryptomonas paramecium. The genome is 486 kbp in size and contains 518 predicted genes, 466 of which are protein coding. Although C. paramecium lacks photosynthetic ability, its nucleomorph genome still encodes 18 plastid-associated proteins. More than 90% of the “conserved” protein genes in C. paramecium (i.e., those with clear homologs in other eukaryotes) are also present in the nucleomorph genomes of the cryptomonads Guillardia theta and Hemiselmis andersenii. In contrast, 143 of 466 predicted C. paramecium proteins (30.7%) showed no obvious similarity to proteins encoded in any other genome, including G. theta and H. andersenii. Significantly, however, many of these “nucleomorph ORFans” are conserved in position and size between the three genomes, suggesting that they are in fact homologous to one another. Finally, our analyses reveal an unexpected degree of overlap in the genes present in the independently evolved chlorarachniophyte and cryptomonad nucleomorph genomes: ∼80% of a set of 120 conserved nucleomorph genes in the chlorarachniophyte Bigelowiella natans were also present in all three cryptomonad nucleomorph genomes. This result suggests that similar reductive processes have taken place in unrelated lineages of nucleomorph-containing algae.
nucleomorph; cryptomonads; chlorarachniophytes; genome reduction; endosymbiosis
Simpler biological systems should be easier to understand and to engineer towards pre-defined goals. One way to achieve biological simplicity is through genome minimization. Here we looked for genomic islands in the freshwater cyanobacterium Synechococcus elongatus PCC 7942 (genome size 2.7 Mb) that could be used as targets for deletion. We also looked for conserved genes that might be essential for cell survival.
By using a combination of methods we identified 170 xenologs, 136 ORFans and 1401 core genes in the genome of S. elongatus PCC 7942. These represent 6.5%, 5.2% and 53.6% of the annotated genes respectively. We considered that genes in genomic islands could be found if they showed a combination of: a) unusual G+C content; b) unusual phylogenetic similarity; and/or c) a small number of the highly iterated palindrome 1 (HIP1) motif plus an unusual codon usage. The origin of the largest genomic island by horizontal gene transfer (HGT) could be corroborated by lack of coverage among metagenomic sequences from a fresh water microbialite. Evidence is also presented that xenologous genes tend to cluster in operons. Interestingly, most genes coding for proteins with a diguanylate cyclase domain are predicted to be xenologs, suggesting a role for horizontal gene transfer in the evolution of Synechococcus sensory systems.
Our estimates of genomic islands in PCC 7942 are larger than those predicted by other published methods such as SIGI-HMM. Our results provide a guide to non-essential genes in S. elongatus PCC 7942, indicating a path towards the engineering of a model photoautotrophic bacterial cell.
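Of the criteria listed above for spotting genomic islands, unusual G+C content is the simplest to illustrate. The sketch below flags genes whose G+C fraction lies far from the mean of all genes in the set. It is only an illustration of criterion (a), not the combined method used in the study: the z-score cutoff, the function names, and the per-gene (rather than windowed) calculation are all assumptions, and the phylogenetic, HIP1 and codon-usage criteria are not modeled.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_unusual_gc(genes, z_cutoff=1.5):
    """Return sorted names of genes whose G+C z-score exceeds the cutoff in absolute value.

    `genes` maps a gene name to its nucleotide sequence. The cutoff is a
    hypothetical value chosen for illustration, not one taken from the study.
    """
    fractions = {name: gc_fraction(seq) for name, seq in genes.items()}
    mu = mean(fractions.values())
    sigma = pstdev(fractions.values())
    if sigma == 0:
        return []  # every gene has the same G+C fraction, so nothing stands out
    return sorted(
        name for name, frac in fractions.items()
        if abs(frac - mu) / sigma > z_cutoff
    )
```

In practice a single compositional signal of this kind produces false positives, which is consistent with the study combining several lines of evidence before calling a region a genomic island.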
We have identified conserved orthologs in completely sequenced genomes of double-strand DNA phages and arranged them into evolutionary families (phage orthologous groups [POGs]). Using this resource to analyze the collection of known phage genomes, we find that most orthologs are unique in their genomes (having no diverged duplicates [paralogs]), and while many proteins contain multiple domains, the evolutionary recombination of these domains does not appear to be a major factor in evolution of these orthologous families. The number of POGs has been rapidly increasing over the past decade, the percentage of genes in phage genomes that have orthologs in other phages has also been increasing, and the percentage of unknown “ORFans” is decreasing as more proteins find homologs and establish a family. Other properties of phage genomes have remained relatively stable over time, most notably the high fraction of genes that are never or only rarely observed in their cellular hosts. This suggests that despite the renowned ability of phages to transduce cellular genes, these cellular “hitchhiker” genes do not dominate the phage genomic landscape, and a large fraction of the genes in phage genomes maintain an evolutionary trajectory that is distinct from that of the host genes.