Genome-wide studies have already shed light on the evolution and enormous diversity of the viral world. Nevertheless, one of the unresolved mysteries in comparative genomics today is the abundance of ORFans – ORFs with no detectable sequence similarity to any other ORF in the databases. Recently, studies attempting to understand the origin and functions of bacterial ORFans have been reported. Here we present a first genome-wide identification and analysis of ORFans in the viral world, with focus on bacteriophages.
Almost one-third of all ORFs in 1,456 complete virus genomes correspond to ORFans, a figure significantly larger than that observed in prokaryotes. Like prokaryotic ORFans, viral ORFans are shorter and have a lower GC content than non-ORFans. Nevertheless, a statistically significantly lower GC content is found in only a minority of viruses. By focusing on phages, we find that 38.4% of phage ORFs have no homologs in other phages, and 30.1% have no homologs in either the viral or the prokaryotic world. Phages with different host ranges have different percentages of ORFans, reflecting different sampling status and suggesting various diversities. Similarity searches of the phage ORFeome (ORFans and non-ORFans) against prokaryotic genomes show that almost half of the phage ORFs have prokaryotic homologs, suggesting the major role that horizontal transfer plays in bacterial evolution. Surprisingly, the percentage of phage ORFans with prokaryotic homologs is only 18.7%. This suggests that phage ORFans play a lesser role in horizontal transfer to prokaryotes, but may be among the major players contributing to the vast phage diversity.
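The length and GC-content comparison described above can be illustrated with a minimal sketch. The function names and the toy sequences below are hypothetical and are not taken from the study; the sketch only shows how the two summary statistics might be computed for an ORFan set and a non-ORFan set.

```python
from statistics import mean


def gc_content(seq: str) -> float:
    """Fraction of G or C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def summarize(seqs):
    """Mean length and mean GC content of a non-empty collection of ORF sequences."""
    return {
        "mean_length": mean(len(s) for s in seqs),
        "mean_gc": mean(gc_content(s) for s in seqs),
    }


# Toy sequences (illustrative only, not real data).
orfans = ["ATGAAATTTTAA", "ATGATATAA"]
non_orfans = ["ATGGCCGCGGGCCTGTAA", "ATGCCCGGGAAAGGCTGA"]

orfan_stats = summarize(orfans)
non_orfan_stats = summarize(non_orfans)
print(orfan_stats, non_orfan_stats)
```

Whether a difference between the two groups is statistically significant for a given genome would require a proper test on real data, which this sketch does not attempt.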
Although the current sampling of viral genomes is extremely low, ORFans and near-ORFans are likely to continue to grow in number as more genomes are sequenced. The abundance of phage ORFans may be partially due to the expected vast viral diversity, and may be instrumental in understanding viral evolution. The functions, origins and fates of the majority of viral ORFans remain a mystery. Further computational and experimental studies are likely to shed light on the mechanisms that have given rise to so many bacterial and viral ORFans.
Many species of bivalves exhibit a unique system of mtDNA transmission named Doubly Uniparental Inheritance (DUI). Under this system, species have two distinct, sex-linked mitochondrial genomes: the M-type mtDNA, which is transmitted by males to male offspring and found in spermatozoa, and the F-type mtDNA, which is transmitted by females to all offspring, and found in all tissues of females and in somatic tissues of males. Bivalves with DUI also have sex-specific mitochondrial ORFan genes (M-orf in the M mtDNA, F-orf in the F mtDNA), which are open reading frames having no detectable homology and no known function. DUI ORFan proteins have previously been characterized in silico in a taxonomically broad array of bivalves including four mytiloid, one veneroid and one unionoid species. However, the large evolutionary distance among these taxa prevented a meaningful comparison of ORFan properties among these divergent lineages. The present in silico study focuses on a suite of more closely related unionoid freshwater mussel species to provide more reliably interpretable information on patterns of conservation and properties of DUI ORFans. Unionoid species typically have separate sexes, but hermaphroditism also occurs, and hermaphroditic species lack the M-type mtDNA and possess a highly mutated version of the F-orf in their maternally transmitted mtDNA (named H-orf in these taxa). In this study, H-orfs and their respective proteins are analysed for the first time.
Despite a rapid rate of evolution, strong structural and functional similarities were found for M-ORF proteins compared among species, and among the F-ORF and H-ORF proteins across the studied species. In silico analyses suggest that M-ORFs have a role in transport and cellular processes such as signalling, cell cycle and division, and cytoskeleton organisation, and that F-ORFs may be involved in cellular traffic and transport, and in immune response. H-ORFs appear to be structural glycoproteins, which may be involved in signalling, transport and transcription. Our results also support either a viral or a mitochondrial origin for the ORFans.
Our findings reveal striking structural and functional similarities among proteins encoded by mitochondrial ORFans in freshwater mussels, and strongly support a role for these genes in the DUI mechanism. Our analyses also support the possibility of DUI systems with elements of different sources/origins and different mechanisms of action in the distantly-related DUI taxa. Parallel situations to the novel mitochondrially-encoded functions of freshwater mussel ORFans present in some other eukaryotes are also discussed.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-2986-6) contains supplementary material, which is available to authorized users.
Mitochondrial DNA; Mitochondrial ORFans; Mitochondrial inheritance; Doubly uniparental inheritance of mitochondria; Bivalvia; Unionoida
Mitochondrial ORFans (open reading frames having no detectable homology and with unknown function) were discovered in bivalve molluscs with doubly uniparental inheritance (DUI) of mitochondria. In these animals, two mitochondrial lineages are present, one transmitted through eggs (F-type), the other through sperm (M-type), each showing a specific ORFan. In this study, we used in situ hybridization and immunocytochemistry to provide evidence for the expression of Ruditapes philippinarum male-specific ORFan (orf21): both the transcript and the protein (RPHM21) were localized in spermatogenic cells and mature spermatozoa; the protein was localized in sperm mitochondria and nuclei, and in early embryos. Also, in silico analyses of orf21 flanking region and RPHM21 structure supported its derivation from viral sequence endogenization. We propose that RPHM21 prevents the recognition of M-type mitochondria by the degradation machinery, allowing their survival in the zygote. The process might involve a mechanism similar to that of Modulators of Immune Recognition, viral proteins involved in the immune recognition pathway, to which RPHM21 showed structural similarities. A viral origin of RPHM21 may also support a developmental role, because some integrated viral elements are involved in development and sperm differentiation of their host. Mitochondrial ORFans could be responsible for, or participate in, the DUI mechanism, and their viral origin could explain the acquired capability of M-type mitochondria to avoid degradation and invade the germ line, which is what viruses do best: elude the host immune system and proliferate.
mitochondrial ORFan; viral endogenization; novel mitochondrial protein; testis expression; doubly uniparental inheritance of mitochondria; embryo development
Predicted open reading frames (ORFs) that lack detectable homology to known proteins are termed ORFans. Despite their prevalence in metagenomes, the extent to which ORFans encode real proteins, the degree to which they can be annotated, and their functional contributions, remain unclear. To gain insights into these questions, we applied sensitive remote-homology detection methods to functionally analyze ORFans from soil, marine, and human gut metagenome collections. ORFans were identified, clustered into sequence families, and annotated through profile-profile comparison to proteins of known structure. We found that a considerable number of metagenomic ORFans (73,896 of 484,121, 15.3%) exhibit significant remote homology to structurally characterized proteins, providing a means for ORFan functional profiling. The extent of detected remote homology far exceeds that obtained for artificial protein families (1.4%). As expected for real genes, the predicted functions of ORFans are significantly similar to the functions of their gene neighbors (p < 0.001). Compared to the functional profiles predicted through standard homology searches, ORFans show biologically intriguing differences. Many ORFan-enriched functions are virus-related and tend to reflect biological processes associated with extreme sequence diversity. Each environment also possesses a large number of unique ORFan families and functions, including some known to play important community roles such as gut microbial polysaccharide digestion. Lastly, ORFans are a valuable resource for finding novel enzymes of interest, as we demonstrate through the identification of hundreds of novel ORFan metalloproteases that all possess a signature catalytic motif despite a general lack of similarity to known proteins. Our ORFan functional predictions are a valuable resource for discovering novel protein families and exploring the boundaries of protein sequence space. 
All remote homology predictions are available at http://doxey.uwaterloo.ca/ORFans.
metagenome; metaproteome; ORFan; orphan; remote homology; profile-profile comparison; functional annotation; comparative metagenomics
ORFans are open reading frames (ORFs) with no detectable sequence similarity
to any other sequence in the databases. Each newly sequenced genome contains a
significant number of ORFans. Therefore, ORFans entail interesting evolutionary
puzzles. However, little can be learned about them using bioinformatics tools, and
their study seems to have been underemphasized. Here we present some of the
questions that the existence of so many ORFans has raised and review some of
the studies aimed at understanding ORFans, their functions and their origins. These
works have demonstrated that ORFans are an untapped source of research, requiring
further computational and experimental studies.
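As a minimal sketch of the definition above, ORFans can be picked out of a list of ORFs given the results of a similarity search. The function name, the `(query, subject, e-value)` tuple layout and the e-value cutoff of 1e-3 are illustrative assumptions, not values taken from the text.

```python
def find_orfans(orf_ids, hits, max_evalue=1e-3):
    """Return the ORFs that have no significant hit to any other sequence.

    orf_ids: iterable of ORF identifiers in the genome of interest.
    hits: iterable of (query_id, subject_id, evalue) tuples from a similarity search.
    max_evalue: assumed significance cutoff; hits above it are ignored.
    """
    with_homolog = {
        query
        for query, subject, evalue in hits
        if query != subject and evalue <= max_evalue  # self-hits do not count
    }
    return [orf for orf in orf_ids if orf not in with_homolog]


# Toy search results (illustrative only).
example_hits = [
    ("orf_a", "orf_a", 0.0),    # self-hit, ignored
    ("orf_a", "db_x", 1e-20),   # significant hit: orf_a is not an ORFan
    ("orf_b", "db_y", 0.5),     # not significant: orf_b stays an ORFan
]
print(find_orfans(["orf_a", "orf_b", "orf_c"], example_hits))
```

The result depends entirely on the database searched and the cutoff chosen, which is one reason ORFan counts change as more genomes are sequenced.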
The mimivirus genome contains many genes that lack homologs in the sequence database and are thus known as ORFans. In addition, mimivirus genes that encode proteins belonging to known fold families are in some cases fused to domain-sized segments that cannot be classified. One such ORFan region is present in the mimivirus enzyme R596, a member of the Erv family of sulfhydryl oxidases. We determined the structure of a variant of full-length R596 and observed that the carboxy-terminal region of R596 assumes a folded, compact domain, demonstrating that these ORFan segments can be stable structural units. Moreover, the R596 ORFan domain fold is novel, hinting at the potential wealth of protein structural innovation yet to be discovered in large double-stranded DNA viruses. In the context of the R596 dimer, the ORFan domain contributes to formation of a broad cleft enriched with exposed aromatic groups and basic side chains, which may function in binding target proteins or localization of the enzyme within the virus factory or virions. Finally, we find evidence for an intermolecular dithiol/disulfide relay within the mimivirus R596 dimer, the first such extended, intersubunit redox-active site identified in a viral sulfhydryl oxidase.
A large-scale survey of potential recently acquired integrative elements in 119 archaeal and bacterial genomes reveals that many recently acquired genes have originated from integrative elements
Archaeal and bacterial genomes contain a number of genes of foreign origin that arose from recent horizontal gene transfer, but the role of integrative elements (IEs), such as viruses, plasmids, and transposable elements, in this process has not been extensively quantified. Moreover, it is not known whether IEs play an important role in the origin of ORFans (open reading frames without matches in current sequence databases), whose proportion remains stable despite the growing number of complete sequenced genomes.
We have performed a large-scale survey of potential recently acquired IEs in 119 archaeal and bacterial genomes. We developed an accurate in silico Markov model-based strategy to identify clusters of genes that show atypical sequence composition (clusters of atypical genes or CAGs) and are thus likely to be recently integrated foreign elements, including IEs. Our method identified a high number of new CAGs. Probabilistic analysis of gene content indicates that 56% of these new CAGs are likely IEs, whereas only 7% likely originated via horizontal gene transfer from distant cellular sources. Thirty-four percent of CAGs remain unassigned, which may reflect a still poor sampling of IEs associated with bacterial and archaeal diversity. Moreover, our study contributes to the issue of the origin of ORFans, because 39% of these are found inside CAGs, many of which likely represent recently acquired IEs.
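The idea of flagging genes with atypical sequence composition can be sketched with a generic first-order Markov model of nucleotide transitions. This is not the authors' method, whose details are not given here; the model order, the pseudocount, the threshold and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def train_markov(genome: str, pseudocount: float = 1.0):
    """Estimate log P(next base | previous base) from a genome sequence."""
    genome = genome.upper()
    counts = Counter(zip(genome, genome[1:]))
    model = {}
    for prev in "ACGT":
        total = sum(counts[(prev, nxt)] for nxt in "ACGT") + 4 * pseudocount
        for nxt in "ACGT":
            model[(prev, nxt)] = math.log((counts[(prev, nxt)] + pseudocount) / total)
    return model


def mean_log_likelihood(seq: str, model) -> float:
    """Average log-probability per transition of a gene under the genome model."""
    seq = seq.upper()
    pairs = [(a, b) for a, b in zip(seq, seq[1:]) if (a, b) in model]
    if not pairs:
        raise ValueError("sequence has no scorable transitions")
    return sum(model[p] for p in pairs) / len(pairs)


def atypical_genes(genes: dict, model, threshold: float):
    """Names of genes whose score falls below an assumed threshold."""
    return [name for name, seq in genes.items()
            if mean_log_likelihood(seq, model) < threshold]


# Toy AT-rich "genome" and two toy genes (illustrative only).
toy_genome = "AATTATAATTTAAATATTAT" * 50
toy_model = train_markov(toy_genome)
toy_genes = {"native_like": "ATTATAATTA", "foreign_like": "GCGCGGCCGC"}
print(atypical_genes(toy_genes, toy_model, threshold=-1.2))
```

Grouping adjacent atypical genes into clusters, and deciding whether a cluster is an integrative element, would be further steps that this sketch leaves out.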
Our results strongly indicate that archaeal and bacterial genomes contain an impressive proportion of recently acquired foreign genes (including ORFans) coming from a still largely unexplored reservoir of IEs.
Despite numerous comparative mitochondrial genomics studies revealing that animal mitochondrial genomes are highly conserved in terms of gene content, supplementary genes are sometimes found, often arising from gene duplication. Mitochondrial ORFans (ORFs having no detectable homology and unknown function) were found in bivalve molluscs with Doubly Uniparental Inheritance (DUI) of mitochondria. In DUI animals, two mitochondrial lineages are present: one transmitted through females (F-type) and the other through males (M-type), each showing a specific and conserved ORF. The analysis of 34 mitochondrial major Unassigned Regions of Musculista senhousia F- and M-mtDNA allowed us to verify the presence of novel mitochondrial ORFs in this species and to compare them with ORFs from other species with ascertained DUI, with other bivalves and with animals showing new mitochondrial elements. Overall, 17 ORFans from nine species were analyzed for structure and function. Many clues suggest that the analyzed ORFans arose from endogenization of viral genes. The co-option of such novel genes by viral hosts may have determined some evolutionary aspects of host life cycle, possibly involving mitochondria. The structure similarity of DUI ORFans within evolutionary lineages may also indicate that they originated from independent events. If these novel ORFs are in some way linked to DUI establishment, a multiple origin of DUI has to be considered. These putative proteins may have a role in the maintenance of sperm mitochondria during embryo development, possibly masking them from the degradation processes that normally affect sperm mitochondria in species with strictly maternal inheritance.
mitochondrial ORFans; mitochondrial inheritance; Doubly Uniparental Inheritance of mitochondria; endogenous virus
As each newly sequenced genome contains a significant number of protein-coding ORFs that are species-, family- or lineage-specific, many interesting questions arise about the evolution and role of these ORFs and of the genomes they are part of. We refer to these poorly conserved ORFs as singleton or paralogous ORFans if they are unique to one genome, or as orthologous ORFans if they appear only in a family of closely related organisms and have no homolog in other genomes. In order to study and classify ORFans we have constructed the ORFanage, an ORFan database. This database consists of the predicted ORFs in fully sequenced microbial genomes, and enables searching for the three types of ORFans in any subset of the genomes chosen by the user. The ORFanage could help in choosing interesting targets for further genomic and evolutionary studies. The ORFanage is accessible via http://www.bioinformatics.buffalo.edu/ORFanage.
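The three ORFan types defined above can be expressed as a small classification rule. The sketch below is one hypothetical reading of those definitions; the function name, its inputs and the label strings are illustrative and do not describe how the ORFanage database is implemented.

```python
def classify_orf(own_genome, hit_genomes, related_genomes):
    """Classify an ORF by where its homologs are found.

    own_genome: identifier of the genome that contains the ORF.
    hit_genomes: genomes in which a homolog of the ORF was detected; listing
        own_genome here means a paralog exists within the same genome.
    related_genomes: genomes regarded as closely related to own_genome.
    """
    hit_set = set(hit_genomes)
    others = hit_set - {own_genome}
    if not others:
        # No homolog outside the genome: unique to this genome.
        return "paralogous ORFan" if own_genome in hit_set else "singleton ORFan"
    if others <= set(related_genomes):
        # Homologs only within the family of closely related genomes.
        return "orthologous ORFan"
    return "non-ORFan"


print(classify_orf("E_coli_K12", [], {"E_coli_O157"}))
print(classify_orf("E_coli_K12", ["E_coli_O157"], {"E_coli_O157"}))
```

The outcome depends on which genomes the user places in the comparison subset, which mirrors the user-selected subsets described for the database.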
Bacterial species, and even strains within species, can vary greatly in their gene contents and metabolic capabilities. We examine the evolution of this diversity by assessing the distribution and ancestry of each gene in 13 sequenced isolates of Escherichia coli and Shigella. We focus on the emergence and demise of two specific classes of genes, ORFans (genes with no homologs in present databases) and HOPs (genes with distant homologs), since these genes, in contrast to most conserved ancestral sequences, are known to be a major source of the novel features in each strain. We find that the rates of gain and loss of these genes vary greatly among strains as well as through time, and that ORFans and HOPs show very different behavior with respect to their emergence and demise. Although HOPs, which mostly represent gene acquisitions from other bacteria, originate more frequently, ORFans are much more likely to persist. This difference suggests that many adaptive traits are conferred by completely novel genes that do not originate in other bacterial genomes. With respect to the demise of these acquired genes, we find that strains of Shigella lose genes, both by disruption events and by complete removal, at accelerated rates.
Changes in genetic repertoires can alter the adaptive strategy of an organism, especially in bacteria, in which genes are continually gained and lost. Mapping the gains and losses of genes in the densely sequenced clade of Escherichia coli and Shigella shows that these genomes harbour two types of acquired genes: HOPs, which are those acquired genes with homologs in distantly related bacteria; and ORFans, which are genes without any known homologs. Surprisingly, the two classes of acquired genes display very different patterns of gain and loss. HOPs are acquired more frequently, though they rarely persist in the recipient genomes. In contrast, ORFans are much more likely to be maintained over evolutionary timescales, suggesting that despite their unknown origins, they will more often confer novel and beneficial traits to the recipient genome.
Mimivirus isolated from A. polyphaga is the largest virus discovered so far. It is unique among all the viruses in having genes related to translation, DNA repair and replication which bear close homology to eukaryotic genes. Nevertheless, only a small fraction of the proteins (33%) encoded in this genome has been assigned a function. Furthermore, a large fraction of the unassigned protein sequences bear no sequence similarity to proteins from other genomes. These sequences are referred to as ORFans. Because of their lack of sequence similarity to other proteins, they cannot be assigned putative functions using standard sequence comparison methods. As part of our genome-wide computational efforts aimed at characterizing Mimivirus ORFans, we have applied fold-recognition methods to predict the structure of these ORFans, and derived further functions based on conservation of functionally important residues in sequence-template alignments.
Using fold recognition, we have identified highly confident computational 3D structural assignments for 21 Mimivirus ORFans. In addition, highly confident functional predictions for 6 of these ORFans were derived by analyzing the conservation of functional motifs between the predicted structures and proteins of known function. This analysis allowed us to classify these 6 previously unannotated ORFans into their specific protein families: carboxylesterase/thioesterase, metal-dependent deacetylase, P-loop kinases, 3-methyladenine DNA glycosylase, BTB domain and eukaryotic translation initiation factor eIF4E.
Using stringent fold recognition criteria we have assigned three-dimensional structures for 21 of the ORFans encoded in the Mimivirus genome. Further, based on the 3D models and an analysis of the conservation of functionally important residues and motifs, we were able to derive functional attributes for 6 of the ORFans. Our computational identification of important functional sites in these ORFans can be the basis for a subsequent experimental verification of our predictions. Further computational and experimental studies are required to elucidate the 3D structures and functions of the remaining Mimivirus ORFans.
The length of a protein sequence is largely determined by its function. In certain species, it may be also affected by additional factors, such as growth temperature or acidity. In 2002, it was shown that in the bacterium Escherichia coli and in the archaeon Archaeoglobus fulgidus, protein sequences with no homologs were, on average, shorter than those with homologs (BMC Evol Biol 2:20, 2002). It is now generally accepted that in bacterial and archaeal genomes the distributions of protein length are different between sequences with and without homologs. In this study, we examine this postulate by conducting a comprehensive analysis of all annotated prokaryotic genomes and by focusing on certain exceptions.
We compared the distribution of lengths of “having homologs proteins” (HHPs) and “non-having homologs proteins” (orphans or ORFans) in all currently completely sequenced and COG-annotated prokaryotic genomes. As expected, the HHPs and ORFans have strikingly different length distributions in almost all genomes. As previously established, the HHPs are indeed, on average, longer than the ORFans, and the length distributions for the ORFans have a relatively narrow peak, in contrast to the HHPs, whose lengths spread over a wider range of values. However, about thirty genomes do not obey these rules. Practically all genomes of Mycoplasma and Ureaplasma have atypical ORFan length distributions, with the mean length of ORFans larger than the mean length of HHPs. These genera constitute over 80% of atypical genomes.
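The criterion for an atypical genome described above (mean ORFan length exceeding mean HHP length) can be sketched as follows. The data layout, the function name and the toy lengths are hypothetical; real input would come from annotated genomes.

```python
from statistics import mean


def atypical_genomes(lengths_by_genome):
    """Return genomes whose mean ORFan length exceeds their mean HHP length.

    lengths_by_genome: {genome_name: {"HHP": [lengths], "ORFan": [lengths]}}
    Genomes missing either group are skipped rather than guessed at.
    """
    flagged = []
    for genome, groups in lengths_by_genome.items():
        hhp, orfan = groups.get("HHP", []), groups.get("ORFan", [])
        if hhp and orfan and mean(orfan) > mean(hhp):
            flagged.append(genome)
    return flagged


# Toy protein lengths in amino acids (illustrative only, not real data).
toy_lengths = {
    "typical_genome": {"HHP": [320, 410, 280], "ORFan": [90, 120, 150]},
    "atypical_genome": {"HHP": [300, 350], "ORFan": [500, 700]},
    "no_orfans": {"HHP": [300], "ORFan": []},
}
print(atypical_genomes(toy_lengths))
```

Comparing means is the simplest possible summary; comparing the full distributions would need a statistical test, which is outside this sketch.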
We confirmed on a comprehensive set of genomes the previous observation that HHPs and ORFans have different gene length distributions. We also showed that Mycoplasmataceae genomes have very distinctive distributions of ORFan lengths. We offer several possible biological explanations of this phenomenon, such as an adaptation to the ecological niche of Mycoplasmataceae, specifically their “quiet” co-existence with host organisms, resulting in long ABC transporters.
Electronic supplementary material
The online version of this article (doi:10.1186/s13062-015-0104-3) contains supplementary material, which is available to authorized users.
Mycoplasmataceae; COG; ORFan; HHP; Evolution; Stop codon
Giant viruses are protist-associated viruses belonging to the proposed order Megavirales; almost all have been isolated from Acanthamoeba spp. Their isolation in humans suggests that they are part of the human virome. Using a high-throughput strategy to isolate new giant viruses from their original protozoan hosts, we obtained eight isolates of a new giant viral lineage from Vermamoeba vermiformis, the most common free-living protist found in human environments. This new lineage was proposed to be the faustovirus lineage. The prototype member, faustovirus E12, forms icosahedral virions of ≈200 nm that are devoid of fibrils and that encapsidate a 466-kbp genome encoding 451 predicted proteins. Of these, 164 are found in the virion. Phylogenetic analysis of the core viral genes showed that faustovirus is distantly related to the mammalian pathogen African swine fever virus, but it encodes ≈3 times more mosaic gene complements. About two-thirds of these genes do not show significant similarity to genes encoding any known proteins. These findings show that expanding the panel of protists to discover new giant viruses is a fruitful strategy.
IMPORTANCE By using Vermamoeba, a protist living in humans and their environment, we isolated eight strains of a new giant virus that we named faustovirus. The genomes of these strains were sequenced, and their sequences showed that faustoviruses are related to but different from the vertebrate pathogen African swine fever virus (ASFV), which belongs to the family Asfarviridae. Moreover, the faustovirus gene repertoire is ≈3 times larger than that of ASFV and comprises approximately two-thirds ORFans (open reading frames [ORFs] with no detectable homology to other ORFs in a database).
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique—shotgun sequencing—allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data—nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.
The GOS data identified 6.12 million predicted proteins covering nearly all known prokaryotic protein families, and several new families. This almost doubles the number of known proteins and shows that we are far from identifying all the proteins in nature.
ORFan genes can constitute a large fraction of a bacterial genome, but due to their lack of homologs, their functions have remained largely unexplored. To determine if particular features of ORFan-encoded proteins promote their presence in a genome, we analyzed properties of ORFans that originated over a broad evolutionary timescale. We also compared ORFan genes to another class of acquired genes (HOPs), which have homologs in other bacteria. A total of 54 ORFan and HOP genes selected from different phylogenetic depths in the Escherichia coli lineage were cloned, expressed, purified and subjected to CD spectroscopy. A majority of genes could be expressed, but only 18 yielded sufficient soluble protein for spectral analysis. Of these, half were significantly α-helical, three were predominantly β-sheet, and six were of intermediate or indeterminate structure. Although a higher proportion of HOPs yielded soluble proteins with resolvable secondary structures, ORFans resembled HOPs with regard to most of the other features tested. Overall, we found that those ORFan and HOP genes that have persisted in the E. coli lineage were more likely to encode soluble and folded proteins, more likely to display environmental modulation of their gene expression, and by extrapolation, are more likely to be functional.
E. coli; genome evolution; lateral gene transfer; ORFans; protein folding
Genomic analysis of giant viruses, such as Mimivirus, has revealed that more than half of the putative genes have no known functions (ORFans). We knocked down Mimivirus genes using short interfering RNA as a proof of concept to determine the functions of giant virus ORFans. As fibers are easy to observe, we targeted a gene encoding a protein absent in a Mimivirus mutant devoid of fibers as well as three genes encoding products identified in a protein concentrate of fibers, including one ORFan and one gene of unknown function. We found that knocking down these four genes was associated with depletion or modification of the fibers. Our strategy of silencing ORFan genes in giant viruses opens a way to identify their complete gene repertoires and may clarify the role of these genes, differentiating between junk DNA and truly used genes. Using this strategy, we were able to annotate four proteins in Mimivirus and 30 homologous proteins in other giant viruses. In addition, we were able to annotate >500 proteins from cellular organisms and 100 from metagenomic databases.
Mimivirus; giant virus; Megavirales; fiber; short interfering RNA; RNA interference; nucleocytoplasmic large DNA virus
The origin and evolution of “ORFans” (suspected genes without known relatives) remain unclear. Here, we take advantage of a unique opportunity to examine the population diversity of thousands of ORFans, based on a collection of 35 complete genomes of isolates of Escherichia coli and Shigella (which is included phylogenetically within E. coli). As expected from previous studies, ORFans are shorter and AT-richer in sequence than non-ORFans. We find that ORFans often are very narrowly distributed: the most common pattern is for an ORFan to be found in only one genome. We compared within-species population diversity of ORFan genes with those of two control groups of non-ORFan genes. Patterns of population variation suggest that most ORFans are not artifacts, but encode real genes whose protein-coding capacity is conserved, reflecting selection against nonsynonymous mutations. Nevertheless, nonsynonymous nucleotide diversity is higher than for non-ORFans, whereas synonymous diversity is roughly the same. In particular, there is a several-fold excess of ORFans in the highest decile of diversity relative to controls, which might be due to weaker purifying selection, positive selection, or a subclass of ORFans that are decaying.
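Population diversity of a gene is commonly summarized as nucleotide diversity, the average number of pairwise differences per site among aligned sequences. The sketch below computes that overall quantity only; it does not separate synonymous from nonsynonymous sites as the study does, and the function name and toy alignment are illustrative assumptions.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site among equal-length aligned sequences."""
    n = len(aligned)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned[0])
    if length == 0 or any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be non-empty and of equal length")
    total_differences = 0
    for seq_a, seq_b in combinations(aligned, 2):
        total_differences += sum(a != b for a, b in zip(seq_a, seq_b))
    n_pairs = n * (n - 1) // 2
    return total_differences / (n_pairs * length)


# Toy alignment of one gene from three isolates (illustrative only).
toy_alignment = ["ATGC", "ATGC", "ATGA"]
print(nucleotide_diversity(toy_alignment))
```

Gaps and ambiguous bases are treated as ordinary characters here, so a real analysis would need to decide how to handle them before comparing ORFans with control genes.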
ORFan; lineage-specific genes; evolution; population genetics; positive selection; negative selection
Rickettsia species are strictly intracellular bacteria that have undergone a reductive genomic evolution. Despite their allopatric lifestyle, almost half of the 26 currently validated Rickettsia species have plasmids. In order to study the origin, evolutionary history and putative roles of rickettsial plasmids, we investigated the evolutionary processes that have shaped 20 plasmids belonging to 11 species, using comparative genomics and phylogenetic analysis between rickettsial, microbial and non-microbial genomes.
Plasmids were differentially present among Rickettsia species. The 11 species had 1 to 4 plasmid(s) with a size ranging from 12 kb to 83 kb. We reconstructed pRICO, the last common ancestor of the current rickettsial plasmids. pRICO was vertically inherited mainly from Rickettsia/Orientia chromosomes and diverged vertically into a single or multiple plasmid(s) in each species. These plasmids also underwent a reductive evolution by progressive gene loss, similar to that observed in rickettsial chromosomes, possibly leading to cryptic plasmids or complete plasmid loss. Moreover, rickettsial plasmids exhibited ORFans, recent gene duplications and evidence of horizontal gene transfer events with rickettsial and non-rickettsial genomes mainly from the α/γ-proteobacteria lineages. Genes related to maintenance and plasticity of plasmids, and to adaptation and resistance to stress mostly evolved under vertical and/or horizontal processes. Those involved in nucleotide/carbohydrate transport and metabolism were under the influence of vertical evolution only, whereas genes involved in cell wall/membrane/envelope biogenesis, cycle control, amino acid/lipid/coenzyme and secondary metabolites biosynthesis, transport and metabolism underwent mainly horizontal transfer events.
Rickettsial plasmids had a complex evolution, starting with a vertical inheritance followed by a reductive evolution associated with increased complexity via horizontal gene transfer as well as gene duplication and genesis. The plasmids are plastic and mosaic structures that may play biological roles similar to or distinct from their co-residing chromosomes in an obligate intracellular lifestyle.
The holE gene is an enterobacterial ORFan gene (open reading frame [ORF] with no detectable homology to other ORFs in a database). It encodes the θ subunit of the DNA polymerase III core complex. The precise function of the θ subunit within this complex is not well established, and loss of holE does not result in a noticeable phenotype. Paralogs of holE are also present on many conjugative plasmids and on phage P1 (hot gene). In this study, we provide evidence indicating that θ (HolE) exhibits structural and functional similarities to a family of nucleoid-associated regulatory proteins, the Hha/YdgT-like proteins that are also encoded by enterobacterial ORFan genes. Microarray studies comparing the transcriptional profiles of Escherichia coli holE, hha, and ydgT mutants revealed highly similar expression patterns for strains harboring holE and ydgT alleles. Among the genes differentially regulated in both mutants were genes of the tryptophanase (tna) operon. The tna operon consists of a transcribed leader region, tnaL, and two structural genes, tnaA and tnaB. Further experiments with transcriptional lacZ fusions (tnaL::lacZ and tnaA::lacZ) indicate that HolE and YdgT downregulate expression of the tna operon by possibly increasing the level of Rho-dependent transcription termination at the tna operon's leader region. Thus, for the first time, a regulatory function can be attributed to HolE, in addition to its role as structural component of the DNA polymerase III complex.
Viruses have been suggested to be the largest source of genetic diversity on Earth. Genome sequencing and metagenomic surveys reveal that novel genes with unknown functions are abundant in viral genomes. Yet few observations exist for the processes and frequency by which these genes are gained and lost. The surface waters of marine environments are dominated by marine picocyanobacteria and their co-existing viruses (cyanophages). Recent genome sequencing of cyanophages has revealed a vast array of genes that have been acquired from their cyanobacterial hosts. Here, we re-sequenced the cyanophage S-PM2 genome after 10 years of near continuous passage through its marine Synechococcus host. During this time a spontaneous mutant (S-PM2d) lacking 13% of the S-PM2 ORFs became dominant in the cyanophage population. These ORFs are found at one locus and are not homologous to any proteins in any other sequenced organism (ORFans). We demonstrate a fitness cost to S-PM2WT associated with possession of these ORFs under standard laboratory growth. Metagenomic surveys reveal these ORFs are present in various aquatic environments, are likely of cyanophage origin and appear to be enriched in environments from the extremes of salinity (freshwater and hypersaline). We posit that these ORFs contribute to the flexible gene content of cyanophages and offer a distinct fitness advantage in freshwater and hypersaline environments.
Motivation: Intriguingly, sequence analysis of genomes reveals that a large number of genes are unique to each organism. The origin of these genes, termed ORFans, is not known. Here, we explore the origin of ORFan genes by defining a simple measure called ‘composition bias’, based on the deviation of the amino acid composition of a given sequence from the average composition of all proteins of a given genome.
Results: For a set of 47 prokaryotic genomes, we show that the amino acid composition bias of real proteins, random ‘proteins’ (created by using the nucleotide frequencies of each genome) and ‘proteins’ translated from intergenic regions are distinct. For ORFans, we observed a correlation between their composition bias and their relative evolutionary age. Recent ORFan proteins have compositions more similar to those of random ‘proteins’, while the compositions of more ancient ORFan proteins are more similar to those of the set of all proteins of the organism. This observation is consistent with an evolutionary scenario wherein ORFan genes emerged and underwent a large number of random mutations and selection, eventually adapting to the composition preference of their organism over time.
Supplementary information: Supplementary data are available at Bioinformatics online.
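The 'composition bias' measure described above can be illustrated with a short sketch. The abstract states only that the measure is based on the deviation of a sequence's amino acid composition from the genome-wide average, so the Euclidean distance used here is an assumption, as are the function names `composition`, `genome_average_composition` and `composition_bias`; the published formula may differ.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Relative frequency of each of the 20 standard amino acids in seq."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def genome_average_composition(proteins):
    """Average composition pooled over all residues of all proteins."""
    counts = Counter()
    for protein in proteins:
        counts.update(c for c in protein.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids in the protein set")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_bias(seq, background):
    """Assumed metric: Euclidean distance between the two compositions."""
    comp = composition(seq)
    return math.sqrt(sum((comp[aa] - background[aa]) ** 2 for aa in AMINO_ACIDS))
```

Under this sketch, a protein whose composition matches the genome average scores 0, and larger values indicate stronger deviation. The same function could be applied to random or intergenic 'proteins' to compare their bias with that of ORFans, as the abstract describes.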
The study of bacteriophages continues to generate key information about microbial interactions in the environment. Many phenotypic characteristics of bacteriophages cannot be examined by sequencing alone, further highlighting the necessity for isolation and examination of phages from environmental samples. While much of our current knowledge base has been generated by the study of marine phages, freshwater viruses are understudied in comparison. Our group has previously conducted metagenomics-based studies of samples collected from Lake Michigan; the data presented in this study relate to four phages that were extracted from the same samples.
Four phages were extracted from Lake Michigan on the same bacterial host, exhibiting similar morphological characteristics as shown under transmission electron microscopy. Growth characteristics of the phages were unique to each isolate. Each phage demonstrated a host-range spanning several phyla of bacteria; to date, such a broad host-range has yet to be reported. Genomic data reveal genomes of a similar size and close similarities between the Lake Michigan phages and the Pseudomonas phage PB1. However, the majority of annotated genes present were ORFans, and little insight was offered into mechanisms for host-range.
The phages isolated from Lake Michigan are capable of infecting several bacterial phyla, and demonstrate varied phenotypic characteristics despite similarities in host preference, and at the genomic level. We propose that such a broad host-range is likely related to the oligotrophic nature of Lake Michigan, and the competitive benefit that this characteristic may lend to phages in nature.
Bacteriophage; Broad host-range; Freshwater; Lake Michigan
Most theories on viral evolution are speculative and lack fossil comparison. Here, we isolated a modern Pithovirus-like virus from sewage samples. This giant virus, named Pithovirus massiliensis, was compared with its prehistoric counterpart, Pithovirus sibericum, found in Siberian permafrost. Our analysis revealed near-complete gene repertoire conservation, including horizontal gene transfer and ORFans. Furthermore, all orthologous genes evolved under strong purifying selection with a non-synonymous to synonymous ratio in the same range as the ratio found in the prokaryotic world. The comparison between fossil and modern Pithovirus species provided an estimation of the cadence of the molecular clock, reaching up to 3 × 10⁻⁶ mutations/site/year. In addition, the strict conservation of HGTs and ORFans in P. massiliensis revealed the stable genetic mosaicism in giant viruses and excludes the concept of a bag of genes. The genetic stability for 30,000 years of P. massiliensis demonstrates that giant viruses evolve similarly to prokaryotes by classical mechanisms of evolution, including selection and fixation of genes, followed by selective constraints.
giant viruses; evolution; mosaicism; fossil
Acanthamoeba polyphaga mimivirus is the largest known virus in both particle size and genome complexity. Its 1.2-Mb genome encodes 911 proteins, among which only 298 have predicted functions. The composition of purified isolated virions was analyzed by using a combined electrophoresis/mass spectrometry approach allowing the identification of 114 proteins. Besides the expected major structural components, the viral particle packages 12 proteins unambiguously associated with transcriptional machinery, 3 proteins associated with DNA repair, and 2 topoisomerases. Other main functional categories represented in the virion include oxidative pathways and protein modification. More than half of the identified virion-associated proteins correspond to anonymous genes of unknown function, including 45 “ORFans.” As demonstrated by both Western blotting and immunogold staining, some of these “ORFans,” which lack any convincing similarity in the sequence databases, are endowed with antigenic properties. Thus, anonymous and unique genes constituting the majority of the mimivirus gene complement encode bona fide proteins that are likely to participate in well-integrated processes.
The identification of novel giant viruses from the nucleocytoplasmic large DNA viruses group and their virophages has increased in the last decade and has helped to shed light on viral evolution. This study describes the discovery, isolation and characterization of Samba virus (SMBV), a novel giant virus belonging to the Mimivirus genus, which was isolated from the Negro River in the Brazilian Amazon. We also report the isolation of an SMBV-associated virophage named Rio Negro (RNV), which is the first Mimivirus virophage to be isolated in the Americas.
Based on a phylogenetic analysis, SMBV belongs to group A of the putative Megavirales order, possibly a new virus related to Acanthamoeba polyphaga mimivirus (APMV). SMBV is the largest virus isolated in Brazil, with an average particle diameter of about 574 nm. The SMBV genome contains 938 ORFs, of which nine are ORFans. The 1,213.6 kb SMBV genome is one of the largest genomes of any group A Mimivirus described to date. Electron microscopy showed RNV particle accumulation near SMBV and APMV factories, resulting in the production of defective SMBV and APMV particles and decreasing the infectivity of these two viruses by several logs.
This discovery expands our knowledge of Mimiviridae evolution and ecology.
Mimiviridae; DNA virus; Giant virus; NCLDV; Virophage; Amazon; Brazil