The nucleo-cytoplasmic large DNA viruses (NCLDV) constitute an apparently monophyletic group that consists of 6 families of viruses infecting a broad variety of eukaryotes. A comprehensive genome comparison and maximum-likelihood reconstruction of NCLDV evolution reveal a set of approximately 50 conserved genes that can be tentatively mapped to the genome of the common ancestor of this class of eukaryotic viruses. We address the origins and evolution of NCLDV.
Phylogenetic analysis indicates that some of the major clades of NCLDV infect diverse animals and protists, suggestive of early radiation of the NCLDV, possibly concomitant with eukaryogenesis. The core NCLDV genes seem to have originated from different sources including homologous genes of bacteriophages, bacteria and eukaryotes. These observations are compatible with a scenario of the origin of the NCLDV at an early stage of the evolution of eukaryotes through extensive mixing of genes from widely different genomes.
The common ancestor of the NCLDV probably evolved from a bacteriophage as a result of recruitment of numerous eukaryotic and some bacterial genes, and concomitant loss of the majority of phage genes except for a small core of genes coding for proteins essential for virus genome replication and virion formation.
Bacteriophage; Eukaryogenesis; Nucleo-cytoplasmic large DNA viruses, evolution; Phylogenetic analysis
Eukaryotic Nucleo-Cytoplasmic Large DNA Viruses (NCLDV) encode most if not all of the enzymes involved in their DNA replication. It has been inferred that genes for these enzymes were already present in the last common ancestor of the NCLDV. However, the details of the evolution of these genes that bear on the complexity of the putative ancestral NCLDV and on the evolutionary relationships between viruses and their hosts are not well understood.
Phylogenetic analysis of the ATP-dependent and NAD-dependent DNA ligases encoded by the NCLDV reveals an unexpectedly complex evolutionary history. The NAD-dependent ligases are encoded only by a minority of NCLDV (including mimiviruses, some iridoviruses and entomopoxviruses) but phylogenetic analysis clearly indicates that all viral NAD-dependent ligases are monophyletic. Combined with the topology of the NCLDV tree derived by consensus of trees for universally conserved genes, this monophyly suggests that the enzyme was represented in the ancestral NCLDV. Phylogenetic analysis of ATP-dependent ligases that are encoded by chordopoxviruses, most of the phycodnaviruses and Marseillevirus failed to demonstrate monophyly and instead revealed an unexpectedly complex evolutionary trajectory. The ligases of the majority of phycodnaviruses and Marseillevirus seem to have evolved from bacteriophage or bacterial homologs; the ligase of one phycodnavirus, Emiliania huxleyi virus, belongs to the eukaryotic DNA ligase I branch; and ligases of chordopoxviruses unequivocally cluster with eukaryotic DNA ligase III.
Examination of phyletic patterns and phylogenetic analysis of DNA ligases of the NCLDV suggest that the common ancestor of the extant NCLDV encoded an NAD-dependent ligase that most likely was acquired from a bacteriophage at the early stages of evolution of eukaryotes. By contrast, ATP-dependent ligases from different prokaryotic and eukaryotic sources displaced the ancestral NAD-dependent ligase at different stages of subsequent evolution. These findings emphasize complex routes of viral evolution that become apparent through detailed phylogenomic analysis but not necessarily in reconstructions based on phyletic patterns of genes.
This article was reviewed by: Patrick Forterre, George V. Shpakovski, and Igor B. Zhulin.
The Nucleo-Cytoplasmic Large DNA Viruses (NCLDV) comprise an apparently monophyletic class of viruses that infect a broad variety of eukaryotic hosts. Recent progress in isolation of new viruses and genome sequencing has substantially expanded the known NCLDV diversity, creating additional opportunities for comparative genomic analysis and a demand for a comprehensive classification of viral genes.
A comprehensive comparison of the protein sequences encoded in the genomes of 45 NCLDV belonging to 6 families was performed in order to delineate clusters of orthologous viral genes. Using previously developed computational methods for orthology identification, 1445 Nucleo-Cytoplasmic Virus Orthologous Groups (NCVOGs) were identified, of which 177 are represented in more than one NCLDV family. The NCVOGs were manually curated and annotated and can be used as a computational platform for functional annotation and evolutionary analysis of new NCLDV genomes. A maximum-likelihood reconstruction of the NCLDV evolution yielded a set of 47 conserved genes that were probably present in the genome of the common ancestor of this class of eukaryotic viruses. This reconstructed ancestral gene set is robust to the parameters of the reconstruction procedure and so is likely to accurately reflect the gene core of the ancestral NCLDV, indicating that this virus encoded a complex machinery of replication, expression and morphogenesis that made it relatively independent from host cell functions.
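The count of orthologous groups represented in more than one family can be illustrated with a minimal Python sketch. The cluster names, genome names and family assignments below are hypothetical toy data, not actual NCVOG content, and the function name is an assumption made for illustration; it simply keeps clusters whose member genomes span at least two families.

```python
from typing import Dict, Set


def clusters_spanning_families(
    cluster_members: Dict[str, Set[str]],
    genome_family: Dict[str, str],
    min_families: int = 2,
) -> Set[str]:
    """Return clusters whose member genomes belong to at least `min_families` families."""
    shared = set()
    for cluster, genomes in cluster_members.items():
        # Map each member genome to its family and count distinct families.
        families = {genome_family[g] for g in genomes}
        if len(families) >= min_families:
            shared.add(cluster)
    return shared


# Hypothetical toy data (all names invented for illustration only).
genome_family = {
    "virusA": "Poxviridae",
    "virusB": "Poxviridae",
    "virusC": "Mimiviridae",
    "virusD": "Phycodnaviridae",
}
cluster_members = {
    "OG1": {"virusA", "virusB", "virusC", "virusD"},  # three families
    "OG2": {"virusA", "virusB"},                      # one family
    "OG3": {"virusC", "virusD"},                      # two families
}

shared = clusters_spanning_families(cluster_members, genome_family)
print(sorted(shared))
```

Applied to a real cluster table, the same filter would separate family-specific groups from those shared across families.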
The NCVOGs are a flexible and expandable platform for genome analysis and functional annotation of newly characterized NCLDV. Evolutionary reconstructions employing NCVOGs point to complex ancestral viruses.
Nucleo-Cytoplasmic Large DNA viruses (NCLDV), a diverse group that infects a wide range of eukaryotic hosts, exhibit a large heterogeneity in genome size (between 100 kb and 1.2 Mb) but have been suggested to form a monophyletic group on the basis of a small subset of approximately 30 conserved genes. NCLDV were proposed to have evolved by simplification from a cellular organism, although some of the giant NCLDV have clearly grown by accretion of genes from bacterial sources.
We demonstrate here that many NCLDV lineages appear to have undergone frequent gene exchange in two different ways. Viruses that infect protists directly (Mimivirus) or algae that exist as intracellular protist symbionts (Phycodnaviruses) acquire genes from a bacterial source. Metazoan viruses such as the Poxviruses show a predominant acquisition of host genes. In both cases, the laterally acquired genes show a strong tendency to be positioned at the tip of the genome. Surprisingly, several core genes believed to be ancestral in the family appear to have undergone lateral gene transfers, suggesting that the NCLDV ancestor might have had a smaller genome than previously believed. Moreover, our data show that the larger the genome, the higher the number of laterally acquired genes. This pattern is incompatible with a genome reduction from a cellular ancestor.
We propose that the NCLDV have evolved by significant growth of a simple DNA virus by gene acquisition from cellular sources.
The family Mimiviridae belongs to the large monophyletic group of Nucleo-Cytoplasmic Large DNA Viruses (NCLDV; proposed order Megavirales) and encompasses giant viruses infecting amoeba and probably other unicellular eukaryotes. The recent discovery of the Cafeteria roenbergensis virus (CroV), a distant relative of the prototype mimiviruses, led to a substantial expansion of the genetic variance within the family Mimiviridae. In the light of these findings, a reassessment of the relationships between the mimiviruses and other NCLDV and reconstruction of the evolution of giant virus genomes emerge as interesting and timely goals.
Database searches for the protein sequences encoded in the genomes of several viruses originally classified as members of the family Phycodnaviridae, in particular Organic Lake phycodnaviruses and Phaeocystis globosa viruses (OLPG), revealed a greater number of highly similar homologs in members of the Mimiviridae than in phycodnaviruses. We constructed a collection of 898 Clusters of Orthologous Genes for the putative expanded family Mimiviridae (mimiCOGs) and used these clusters for a comprehensive phylogenetic analysis of the genes that are conserved in most of the NCLDV. The topologies of the phylogenetic trees for these conserved viral genes strongly support the monophyly of the OLPG and the mimiviruses. The same tree topology was obtained by analysis of the phyletic patterns of conserved viral genes. We further employed the mimiCOGs to obtain a maximum likelihood reconstruction of the history of gene losses and gains among the giant viruses. The results reveal massive gene gain in the mimivirus branch and modest gene gain in the OLPG branch.
The phylogenomic results reported here suggest a substantial expansion of the family Mimiviridae. The proposed expanded family encompasses a greater diversity of viruses including a group of viruses with much smaller genomes than those of the original members of the Mimiviridae. If the OLPG group is included in an expanded family Mimiviridae, it becomes the only family of giant viruses currently shown to host virophages. The mimiCOGs are expected to become a key resource for phylogenomics of giant viruses.
Mimivirus is a nucleocytoplasmic large DNA virus (NCLDV) with a genome size (1.2 Mb) and coding capacity (~1000 genes) comparable to that of some cellular organisms. Unlike other viruses, Mimivirus and its NCLDV relatives encode homologs of broadly conserved informational genes found in Bacteria, Archaea, and Eukaryotes, raising the possibility that they could be placed on the tree of life. A recent phylogenetic analysis of these genes showed the NCLDVs emerging as a monophyletic group branching between Eukaryotes and Archaea. These trees were interpreted as evidence for an independent “fourth domain” of life that may have contributed DNA processing genes to the ancestral eukaryote. However, the analysis of ancient evolutionary events is challenging, and tree reconstruction is susceptible to bias resulting from non-phylogenetic signals in the data. These include compositional heterogeneity and homoplasy, which can lead to the spurious grouping of compositionally-similar or fast-evolving sequences. Here, we show that these informational gene alignments contain both significant compositional heterogeneity and homoplasy, which were not adequately modelled in the original analysis. When we use more realistic evolutionary models that better fit the data, the resulting trees are unable to reject a simple null hypothesis in which these informational genes, like many other NCLDV genes, were acquired by horizontal transfer from eukaryotic hosts. Our results suggest that a fourth domain is not required to explain the available sequence data.
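Compositional heterogeneity in an alignment can be screened with a simple chi-square statistic that compares each sequence's residue counts with those expected from the pooled composition. The text above does not state which test was applied, so the Python sketch below is an assumed stand-in rather than the authors' procedure; the function name and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import List


def composition_chi2(sequences: List[str], gap_chars: str = "-?X") -> float:
    """Chi-square statistic of per-sequence residue counts against the
    counts expected from the pooled composition of all sequences."""
    counts = [Counter(c for c in s.upper() if c not in gap_chars) for s in sequences]
    pooled: Counter = Counter()
    for c in counts:
        pooled.update(c)
    total = sum(pooled.values())

    chi2 = 0.0
    for c in counts:
        n = sum(c.values())
        if n == 0:
            continue  # skip sequences that contain only gap characters
        for residue, pooled_count in pooled.items():
            expected = n * pooled_count / total
            chi2 += (c[residue] - expected) ** 2 / expected
    return chi2


# Hypothetical toy alignments over a two-letter alphabet.
homogeneous = ["AAGG", "GGAA", "AGAG"]    # identical composition in every row
heterogeneous = ["AAAA", "GGGG", "AAGG"]  # rows differ strongly in composition

print(composition_chi2(homogeneous))
print(composition_chi2(heterogeneous))
```

A larger statistic indicates stronger departure from a shared composition. Converting it to a p-value would need a chi-square distribution with (sequences − 1) × (residue types − 1) degrees of freedom, which this sketch omits.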
A recent work has provided strong arguments in favor of a fourth domain of Life composed of nucleo-cytoplasmic large DNA viruses (NCLDVs). This hypothesis was supported by phylogenetic and phyletic analyses based on a common set of proteins conserved in Eukarya, Archaea, Bacteria, and viruses, and implicated in the functions of information storage and processing. Recently, the genome of a new NCLDV, Cafeteria roenbergensis virus (CroV), was released. The present work aimed to determine if CroV supports the fourth domain of Life hypothesis.
A consensus phylogenetic tree of NCLDVs including CroV was generated from a concatenated alignment of four universal proteins of NCLDVs. Some features of the gene complement of CroV and its distribution along the genome were further analyzed. Phylogenetic and phyletic analyses were performed using the previously identified common set of informational genes present in Eukarya, Archaea, Bacteria, and NCLDVs, including CroV.
Phylogenetic reconstructions indicated that CroV is clearly related to the Mimiviridae family. The comparison between the gene repertoires of CroV and Mimivirus showed similarities regarding the gene contents and genome organization. In addition, the phyletic clustering based on the comparison of informational gene repertoire between Eukarya, Archaea, Bacteria, and NCLDVs unambiguously classified CroV with other NCLDVs and clearly included it in a fourth domain of Life. Taken together, these data suggest that Mimiviridae, including CroV, may have inherited a common gene content probably acquired from a common Mimiviridae ancestor.
This further analysis of the gene repertoire of CroV consolidated the fourth domain of Life hypothesis and helped to outline a functional pan-genome for giant viruses infecting phagocytic protistan grazers.
Motivation: Eukaryote-infecting nucleo-cytoplasmic large DNA viruses (NCLDVs) feature some of the largest genomes in the viral world. These viruses typically do not strongly depend on the host DNA replication systems. In line with this observation, a number of essential DNA replication proteins, such as DNA polymerases, primases, helicases and ligases, have been identified in the NCLDVs. One other ubiquitous component of DNA replisomes is the single-stranded DNA-binding (SSB) protein. Intriguingly, no NCLDV homologs of canonical OB-fold-containing SSB proteins had previously been detected. Only in poxviruses, one of seven NCLDV families, has a protein (I3) been identified as the SSB protein. However, whether I3 is related to any known protein structure has not yet been established.
Results: Here, we addressed the case of ‘missing’ canonical SSB proteins in the NCLDVs and also probed evolutionary origins of the I3 family. Using advanced computational methods, we detected homologs of the bacteriophage T7 SSB protein (gp2.5) in four NCLDV families. We found the properties of these homologs to be consistent with the SSB function. Moreover, we implicated specific residues in single-stranded DNA binding. At the same time, we found no evolutionary link between the T7 gp2.5-like NCLDV SSB homologs and the poxviral SSB protein (I3). Instead, we identified a distant relationship between I3 and small protein B (SmpB), a bacterial RNA-binding protein. Thus, apparently, the NCLDVs have two major distinct sets of SSB proteins, of bacteriophage and bacterial origin, respectively.
Supplementary data are available at Bioinformatics online.
Nucleo-cytoplasmic large DNA viruses (NCLDVs) constitute a group of eukaryotic viruses that can have crucial ecological roles in the sea by accelerating the turnover of their unicellular hosts or by causing diseases in animals. To better characterize the diversity, abundance and biogeography of marine NCLDVs, we analyzed 17 metagenomes derived from microbial samples (0.2–1.6 μm size range) collected during the Tara Oceans Expedition. The sample set includes ecosystems under-represented in previous studies, such as the Arabian Sea oxygen minimum zone (OMZ) and Indian Ocean lagoons. By combining computationally derived relative abundance and direct prokaryote cell counts, the abundance of NCLDVs was found to be in the order of 10⁴–10⁵ genomes ml⁻¹ for the samples from the photic zone and 10²–10³ genomes ml⁻¹ for the OMZ. The Megaviridae and Phycodnaviridae dominated the NCLDV populations in the metagenomes, although most of the reads classified in these families showed large divergence from known viral genomes. Our taxon co-occurrence analysis revealed a potential association between viruses of the Megaviridae family and eukaryotes related to oomycetes. In support of this predicted association, we identified six cases of lateral gene transfer between Megaviridae and oomycetes. Our results suggest that marine NCLDVs probably outnumber eukaryotic organisms in the photic layer (per given water mass) and that metagenomic sequence analyses promise to shed new light on the biodiversity of marine viruses and their interactions with potential hosts.
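One plausible reading of "combining computationally derived relative abundance and direct prokaryote cell counts" is to scale the counted prokaryote cells by the ratio of NCLDV to prokaryote marker-gene density in the metagenome. The Python sketch below assumes exactly that; the function name, its parameters and the input numbers are hypothetical and are not values from the study.

```python
def estimate_ncldv_genomes_per_ml(
    ncldv_marker_density: float,
    prokaryote_marker_density: float,
    prokaryote_cells_per_ml: float,
) -> float:
    """Scale a direct prokaryote cell count by the NCLDV:prokaryote marker ratio.

    Densities are assumed to be length-normalized read counts for single-copy
    marker genes, so that their ratio approximates a genome-number ratio.
    """
    if prokaryote_marker_density <= 0:
        raise ValueError("prokaryote marker density must be positive")
    ratio = ncldv_marker_density / prokaryote_marker_density
    return ratio * prokaryote_cells_per_ml


# Hypothetical inputs: 3 NCLDV marker hits per 100 prokaryote marker hits,
# and 5e5 prokaryote cells per ml from direct counts.
estimate = estimate_ncldv_genomes_per_ml(3.0, 100.0, 5e5)
print(f"{estimate:.0f} genomes per ml")
```

With these invented inputs the estimate falls in the 10⁴ range, the same order of magnitude the abstract reports for the photic zone. The agreement is by construction of the toy numbers, not a reproduction of the result.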
eukaryotic viruses; marine NCLDVs; taxon co-occurrence; oomycetes
The recently discovered Pandoraviruses are by far the largest viruses known, with their 2 megabase genomes exceeding in size the genomes of numerous bacteria and archaea. Pandoraviruses show a distant relationship with other nucleocytoplasmic large DNA viruses (NCLDV) of eukaryotes, lack some of the NCLDV core genes and in particular do not appear to be specifically related to the other, better characterized family of giant viruses, the Mimiviridae. Here we report phylogenetic analysis of 6 core NCLDV genes that confidently places Pandoraviruses within the family Phycodnaviridae, with an apparent specific affinity with Coccolithoviruses. We conclude that, despite their many unusual characteristics, Pandoraviruses are highly derived phycodnaviruses. These findings imply that giant viruses have independently evolved from smaller NCLDV on at least two occasions.
This article was reviewed by Patrick Forterre and Lakshminarayan Iyer. For the full reviews, see the Reviewers’ reports section.
Genomes of nucleocytoplasmic large DNA viruses (NCLDVs) encode enzymes that catalyze the formation of disulfide bonds between cysteine amino acid residues in proteins, a function essential for the proper assembly and propagation of NCLDV virions. Recently, a catalyst of disulfide formation was identified in baculoviruses, a group of large double-stranded DNA viruses considered phylogenetically distinct from NCLDVs. The NCLDV and baculovirus disulfide catalysts are flavin adenine dinucleotide (FAD)-binding sulfhydryl oxidases related to the cellular Erv enzyme family, but the baculovirus enzyme, the product of the Ac92 gene in Autographa californica multiple nucleopolyhedrovirus (AcMNPV), is highly divergent at the amino acid sequence level. The crystal structure of the Ac92 protein presented here shows a configuration of the active-site cysteine residues and bound cofactor similar to that observed in other Erv sulfhydryl oxidases. However, Ac92 has a complex quaternary structural arrangement not previously seen in cellular or viral enzymes of this family. This novel assembly comprises a dimer of pseudodimers with a striking 40-degree kink in the interface helix between subunits. The diversification of the Erv sulfhydryl oxidase enzymes in large double-stranded DNA viruses exemplifies the extreme degree to which these viruses can push the boundaries of protein family folds.
Heterocapsa circularisquama DNA virus (HcDNAV; previously designated as HcV) is a giant virus (girus) with a ~356-kbp double-stranded DNA (dsDNA) genome. HcDNAV lytically infects the bivalve-killing marine dinoflagellate H. circularisquama, and currently represents the sole DNA virus isolated from dinoflagellates, one of the most abundant protists in marine ecosystems. Its morphological features, genome type, and host range previously suggested that HcDNAV might be a member of the family Phycodnaviridae of Nucleo-Cytoplasmic Large DNA Viruses (NCLDVs), though no supporting sequence data were available. NCLDVs currently include two families found in aquatic environments (Phycodnaviridae, Mimiviridae), one mostly infecting terrestrial animals (Poxviridae), another isolated from fish, amphibians and insects (Iridoviridae), and the last one (Asfarviridae) exclusively represented by the animal pathogen African swine fever virus (ASFV), the agent of a fatal hemorrhagic disease in domestic swine. In this study, we determined the complete sequence of the type B DNA polymerase (PolB) gene of HcDNAV. The viral PolB was transcribed from at least 6 h post inoculation (hpi), suggesting its crucial function for viral replication. Most unexpectedly, the HcDNAV PolB sequence was found to be closely related to the PolB sequence of ASFV. In addition, the amino acid sequence of HcDNAV PolB showed a rare amino acid substitution within a highly conserved motif: YSDTDS was found in HcDNAV PolB instead of the YGDTDS found in most dsDNA viruses. Together with the previous observation of ASFV-like sequences in the Sorcerer II Global Ocean Sampling metagenomic datasets, our results further reinforce the idea that the terrestrial ASFV has its evolutionary origin in marine environments.
Acanthamoeba polyphaga Mimivirus is a giant double-stranded DNA virus defining a new family, the Mimiviridae, among the Nucleo-Cytoplasmic Large DNA Viruses (NCLDV). We used ultrastructural studies to shed light on the different steps of the Mimivirus replication cycle: entry via phagocytosis, release of viral DNA into the cell cytoplasm through fusion of viral and vacuolar membranes, and finally viral morphogenesis in an extraordinary giant cytoplasmic virus factory (VF). Fluorescent staining of the AT-rich Mimivirus DNA showed that it enters the host nucleus prior to the generation of a cytoplasmic independent replication centre that forms the core of the VF. Assembly and filling of viral capsids were observed within the replication centre, before release into the cell cytoplasm where progeny virions accumulated. 3D reconstruction from fluorescent and differential interference contrast images revealed the VF emerging from the cell surface as a volcano-like structure. Its size grew dramatically during the 24 h infectious lytic cycle. Our results showed that Mimivirus replication is an extremely efficient process that results from a rapid takeover of cellular machinery, and takes place in a unique and autonomous giant assembly centre, leading to the release of a large number of complex virions through amoebal lysis.
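The motif comparison described above (YSDTDS in HcDNAV PolB versus the canonical YGDTDS) lends itself to a small scanning example. The Python sketch below locates either form in a protein sequence; the sequence fragments are invented for illustration and are not real PolB sequences, and the function name is an assumption.

```python
import re
from typing import List, Tuple

# YGDTDS is the canonical motif named in the text; YSDTDS is the variant
# reported for HcDNAV PolB. The pattern accepts G or S at the second position.
MOTIF = re.compile(r"Y[GS]DTDS")


def find_polb_motifs(protein: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched motif) for each hit in a protein sequence."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein.upper())]


# Hypothetical sequence fragments (not real PolB sequences).
canonical_like = "MKLAVYGDTDSIFVE"
variant_like = "MKLAVYSDTDSIFVE"

print(find_polb_motifs(canonical_like))
print(find_polb_motifs(variant_like))
```

Reporting the matched string alongside its position makes it easy to tabulate which sequences carry the G form and which carry the S form.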
The discovery of Mimivirus, with its very large genome content, made it possible to identify genes common to the three domains of life (Eukarya, Bacteria and Archaea) and to generate controversial phylogenomic trees congruent with that of ribosomal genes, branching Mimivirus at its root. Here we used sequences from metagenomic databases, Marseillevirus and three new viruses extending the Mimiviridae family to generate the phylogenetic trees of eight proteins involved in different steps of DNA processing. Compared to the three ribosomally defined domains, we report a single common origin for Nucleocytoplasmic Large DNA Viruses (NCLDV), with DNA processing genes rooted between Archaea and Eukarya and a topology congruent with that of the ribosomal tree. As for translation, we found in our new viruses, together with Mimivirus, five proteins rooted deeply in the eukaryotic clade. In addition, comparison of the informational gene repertoire based on phyletic pattern analysis supports the existence of a clade containing NCLDVs clearly distinct from those of Eukarya, Bacteria and Archaea. We hypothesize that the core genome of NCLDV is as ancient as the three currently accepted domains of life.
Ectocarpus siliculosus virus-1 (EsV-1) is a lysogenic dsDNA virus belonging to the super family of nucleocytoplasmic large DNA viruses (NCLDV) that infect Ectocarpus siliculosus, a marine filamentous brown alga. Previous studies indicated that the viral genome is integrated into the host DNA. In order to find the integration sites of the viral genome, a genomic library from EsV-1-infected algae was screened using labelled EsV-1 DNA. Several fragments were isolated and some of them were sequenced and analyzed in detail.
Analysis revealed that the algal genome is split by a copy of viral sequences that have a high identity to EsV-1 DNA sequences. These fragments are interspersed with DNA repeats, pseudogenes and genes coding for products involved in DNA replication, integration and transposition. Some of these gene products are not encoded by EsV-1 but are present in the genome of other members of the NCLDV family. Further analysis suggests that the Ectocarpus algal genome contains traces of the integration of a large dsDNA viral genome; this genome could be the ancestor of the extant NCLDV genomes. Furthermore, several lines of evidence indicate that the EsV-1 genome might have originated in these viral DNA pieces, implying the existence of a complex integration and recombination system. A protein similar to a new class of tyrosine recombinases might be a key enzyme of this system.
Our results support the hypothesis that some dsDNA viruses are monophyletic and evolved principally through genome reduction. Moreover, we hypothesize that phaeoviruses have probably developed an original replication system.
The mimivirus L544 gene product was expressed in E. coli and crystallized; preliminary phasing of a MAD data set was performed using the selenium signal present in a crystal of recombinant selenomethionine-substituted protein.
Mimivirus is the prototype of a new family (the Mimiviridae) of nucleocytoplasmic large DNA viruses (NCLDVs), which already include the Poxviridae, Iridoviridae, Phycodnaviridae and Asfarviridae. Mimivirus specifically replicates in cells from the genus Acanthamoeba. Proteomic analysis of purified mimivirus particles revealed the presence of many subunits of the DNA-directed RNA polymerase II complex. A fully functional pre-transcriptional complex appears to be loaded in the virions, allowing mimivirus to initiate transcription within the host cytoplasm immediately upon infection independently of the host nuclear apparatus. To fully understand this process, a systematic study of mimivirus proteins that are predicted (by bioinformatics) or suspected (by proteomic analysis) to be involved in transcription was initiated by cloning and expressing them in Escherichia coli in order to determine their three-dimensional structures. Here, preliminary crystallographic analysis of the recombinant L544 protein is reported. The crystals belonged to the orthorhombic space group C222₁ with one monomer per asymmetric unit. A MAD data set was used for preliminary phasing using the selenium signal present in a selenomethionine-substituted protein crystal.
nucleocytoplasmic large DNA viruses; transcription; DNA-directed RNA polymerases; mimivirus
Nucleocytoplasmic large DNA viruses (NCLDVs) are characterized by large genomes that often encode proteins not commonly found in viruses. Two species in this group are Acanthocystis turfacea chlorella virus 1 (ATCV-1) (family Phycodnaviridae, genus Chlorovirus) and Acanthamoeba polyphaga mimivirus (family Mimiviridae), commonly known as mimivirus. ATCV-1 and other chlorovirus members encode enzymes involved in the synthesis and glycosylation of their structural proteins. In this study, we identified and characterized three enzymes responsible for the synthesis of the sugar L-rhamnose: two UDP-D-glucose 4,6-dehydratases (UGDs) encoded by ATCV-1 and mimivirus and a bifunctional UDP-4-keto-6-deoxy-D-glucose epimerase/reductase (UGER) from mimivirus. Phylogenetic analysis indicated that ATCV-1 probably acquired its UGD gene via a recent horizontal gene transfer (HGT) from a green algal host, while an earlier HGT event involving the complete pathway (UGD and UGER) probably occurred between a protozoan ancestor and mimivirus. While ATCV-1 lacks an epimerase/reductase gene, its Chlorella host may encode this enzyme. Both UGDs and UGER are expressed as late genes, which is consistent with their role in posttranslational modification of capsid proteins. The data in this study provide additional support for the hypothesis that chloroviruses, and maybe mimivirus, encode most, if not all, of the glycosylation machinery involved in the synthesis of specific glycan structures essential for virus replication and infection.
The family Phycodnaviridae encompasses a diverse and rapidly expanding collection of large icosahedral, dsDNA viruses that infect algae. These lytic and lysogenic viruses have genomes ranging from 160 to 560 kb. The family consists of six genera based initially on host range and supported by sequence comparisons. The family is monophyletic with branches for each genus, but the phycodnaviruses have evolutionary roots that connect them with several other families of large DNA viruses, referred to as the nucleocytoplasmic large DNA viruses (NCLDV).
In the last twenty years, numerous giant, dsDNA, icosahedral viruses have been discovered and assigned to the nucleocytoplasmic large dsDNA virus (NCLDV) clade. The major capsid proteins of these viruses consist of two consecutive jelly-roll domains, assembled into trimers, with pseudo 6-fold symmetry. The capsomers are assembled into arrays that have either p6 (as in Paramecium bursaria Chlorella virus-1) or p3 symmetry (as in Mimivirus). Most of the NCLDV viruses have a membrane that separates the nucleocapsid from the external capsid.
Double-jelly-roll capsid protein; Giant icosahedral DNA viruses; Unique vertices; Assembly
Mimivirus, a giant dsDNA virus infecting Acanthamoeba, is the prototype of the Mimiviridae family, the latest addition to the nucleocytoplasmic large DNA viruses (NCLDVs). Its 1.2 Mb genome was initially predicted to encode 917 genes. A subsequent RNA-Seq analysis precisely mapped many transcript boundaries and identified 75 new genes.
We now report a much deeper analysis using the SOLiD™ technology combining RNA-Seq of the Mimivirus transcriptome during the infectious cycle (202.4 million reads), and a complete genome re-sequencing (45.3 million reads). This study corrected the genome sequence and identified several single nucleotide polymorphisms. Our results also provided clear evidence of previously overlooked transcription units, including an important RNA polymerase subunit distantly related to euryarchaeal homologues. The total Mimivirus gene count is now 1018, 11% greater than the original annotation.
This study highlights the huge progress brought about by ultra-deep sequencing for the comprehensive annotation of virus genomes, opening the door to a complete one-nucleotide resolution level description of their transcriptional activity, and to the realistic modeling of the viral genome expression at the ultimate molecular level. This work also illustrates the need to go beyond bioinformatics-only approaches for the annotation of short protein and non-coding genes in viral genomes.
We report an in-depth computational study of the protein sequences and structures of the superfamily of archaeo-eukaryotic primases (AEPs). This analysis greatly expands the range of diversity of the AEPs and reveals the unique active site shared by all members of this superfamily. In particular, it is shown that eukaryotic nucleo-cytoplasmic large DNA viruses, including poxviruses, asfarviruses, iridoviruses, phycodnaviruses and the mimivirus, encode AEPs of a distinct family, which also includes the herpesvirus primases whose relationship to AEPs has not been recognized previously. Many eukaryotic genomes, including chordates and plants, encode previously uncharacterized homologs of these predicted viral primases, which might be involved in novel DNA repair pathways. At a deeper level of evolutionary connections, structural comparisons indicate that AEPs, the nucleases involved in the initiation of rolling circle replication in plasmids and viruses, and origin-binding domains of papilloma and polyoma viruses evolved from a common ancestral protein that might have been involved in a protein-priming mechanism of initiation of DNA replication. Contextual analysis of multidomain protein architectures and gene neighborhoods in prokaryotes and viruses reveals remarkable parallels between AEPs and the unrelated DnaG-type primases, in particular, tight associations with the same repertoire of helicases. These observations point to a functional equivalence of the two classes of primases, which seem to have repeatedly displaced each other in various extrachromosomal replicons.
We examined functional and evolutionary patterns in the recently constructed set of 5,873 clusters of predicted orthologs from seven eukaryotic genomes. The analysis reveals a conserved core of largely essential eukaryotic genes as well as major diversification and innovation associated with evolution of eukaryotic genomes.
Sequencing the genomes of multiple, taxonomically diverse eukaryotes enables in-depth comparative-genomic analysis which is expected to help in reconstructing ancestral eukaryotic genomes and major events in eukaryotic evolution and in making functional predictions for currently uncharacterized conserved genes.
We examined functional and evolutionary patterns in the recently constructed set of 5,873 clusters of predicted orthologs (eukaryotic orthologous groups or KOGs) from seven eukaryotic genomes: Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Encephalitozoon cuniculi. Conservation of KOGs through the phyletic range of eukaryotes strongly correlates with their functions and with the effect of gene knockout on the organism's viability. The approximately 40% of KOGs that are represented in six or seven species are enriched in proteins responsible for housekeeping functions, particularly translation and RNA processing. These conserved KOGs are often essential for survival and might approximate the minimal set of essential eukaryotic genes. The 131 single-member, pan-eukaryotic KOGs we identified were examined in detail. For around 20 that remained uncharacterized, functions were predicted by in-depth sequence analysis and examination of genomic context. Nearly all these proteins are subunits of known or predicted multiprotein complexes, in agreement with the balance hypothesis of evolution of gene copy number. Other KOGs show a variety of phyletic patterns, which points to major contributions of lineage-specific gene loss and the 'invention' of genes new to eukaryotic evolution. Examination of the sets of KOGs lost in individual lineages reveals co-elimination of functionally connected genes. Parsimonious scenarios of eukaryotic genome evolution and gene sets for ancestral eukaryotic forms were reconstructed. The gene set of the last common ancestor of the crown group consists of 3,413 KOGs and largely includes proteins involved in genome replication and expression, and central metabolism. Only 44% of the KOGs, mostly from the reconstructed gene set of the last common ancestor of the crown group, have detectable homologs in prokaryotes; the remainder apparently evolved via duplication with divergence and invention of new genes.
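The reconstruction of ancestral gene sets from phyletic patterns can be illustrated with a small sketch. The abstract does not state which parsimony criterion was applied, so the code below assumes Dollo parsimony (a gene may be gained only once but lost independently in several lineages) purely for illustration. The tree topology, the species labels (sp1 to sp5) and the KOG names are hypothetical placeholders, not data from the study.

```python
# Toy rooted tree as nested tuples; leaf labels are hypothetical placeholders.
TREE = ((("sp1", "sp2"), "sp3"), ("sp4", "sp5"))

# Hypothetical phyletic patterns: KOG name -> set of species that carry it.
PATTERNS = {
    "kogA": {"sp1", "sp4"},
    "kogB": {"sp1", "sp2"},
    "kogC": {"sp5"},
    "kogD": {"sp3", "sp4", "sp5"},
}


def leaves(node):
    """Return the set of leaf labels found under a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def dollo_present(node, present):
    """Infer presence of a gene at `node` under the assumed Dollo rule.

    The gene is placed at `node` if it occurs in some leaf below the node
    and either also occurs outside the node's subtree, or occurs in at
    least two of the node's child subtrees (so the single gain cannot be
    placed any deeper in the tree).
    """
    below = leaves(node)
    if not (below & present):
        return False
    if isinstance(node, str):
        return True
    if present - below:
        return True
    children_hit = sum(1 for child in node if leaves(child) & present)
    return children_hit >= 2


def ancestral_gene_set(node, patterns):
    """Collect all genes inferred to be present at `node`."""
    return {
        gene
        for gene, species in patterns.items()
        if dollo_present(node, frozenset(species))
    }


if __name__ == "__main__":
    print(sorted(ancestral_gene_set(TREE, PATTERNS)))
```

In this toy example, a gene found on both sides of the root is assigned to the root ancestor, while a gene confined to one subtree is assigned only to ancestors inside that subtree. Genes in an ancestor that are missing from a descendant leaf are then read as lineage-specific losses.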
The KOG analysis reveals a conserved core of largely essential eukaryotic genes as well as major diversification and innovation associated with evolution of eukaryotic genomes. The results provide quantitative support for major trends of eukaryotic evolution noticed previously at the qualitative level and a basis for detailed reconstruction of evolution of eukaryotic genomes and biology of ancestral forms.
The concept of a genomic core, defined as the set of genes ubiquitous in all genomes of a monophyletic group, has become crucial in comparative and evolutionary genomics. However, it is still a matter of debate whether lateral gene transfers (LGT) may affect the components of genomic cores, preventing their use to retrace species evolution. We have recently reconstructed the phylogeny of Archaea by using two large concatenated datasets of core proteins involved in translation and transcription, respectively. The resulting trees were largely congruent, showing that informational gene components of the archaeal genomic core belonging to two distinct molecular systems contain a coherent signal for archaeal phylogeny. However, some incongruence remained between the two phylogenies. This may be due to undetected LGT and/or to a lack of sufficient phylogenetic signal in the datasets.
We present evidence strongly favoring the latter hypothesis. In fact, we have updated our transcription and translation datasets with five new archaeal genomes for a total of 6384 and 2928 amino acid positions, respectively, and 25 taxa. This increase in taxonomic sampling led to the nearly complete convergence of the transcription-based and translation-based trees on a single phylogenetic pattern for archaeal evolution. In fact, only a single incongruence persisted between the two phylogenies. This concerned Methanopyrus kandleri, whose placement remained strongly biased in the transcription tree due to its above-average evolutionary rates; this bias could not be counterbalanced because closely related and/or slower-evolving relatives were not available.
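One common way to quantify how far two trees are from congruence is to count the clades found in one tree but not in the other (a rooted variant of the Robinson-Foulds distance). The abstract does not say how congruence was assessed, so the following is only a minimal sketch under that assumption. The taxon labels a to d are hypothetical and do not represent archaeal lineages.

```python
def clades(tree):
    """Return the set of clades (as frozensets of leaf labels) in a rooted tree.

    Trees are nested tuples whose leaves are strings.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def rooted_rf_distance(tree_a, tree_b):
    """Count clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


if __name__ == "__main__":
    # Hypothetical trees that differ only in the placement of one taxon.
    transcription_tree = ((("a", "b"), "c"), "d")
    translation_tree = ((("a", "c"), "b"), "d")
    print(rooted_rf_distance(transcription_tree, translation_tree))
```

A distance of zero means the two trees contain exactly the same clades. A single misplaced taxon, as described for Methanopyrus kandleri, would leave a small nonzero distance whose size depends on how far the taxon is displaced.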
To our knowledge, this is the first report of evidence that the phylogenetic signal harbored by components of the archaeal translation apparatus is confirmed by additional markers belonging to a second molecular system (i.e. transcription). This rules out the risk of circularity when inferring species evolution by small subunit ribosomal RNA and ribosomal protein sequences, since it has been suggested that concerted LGT may affect these markers. Our results strongly support the existence of a core of proteins that has evolved mainly through vertical inheritance in Archaea, and carries a bona fide phylogenetic signal that can be used to retrace the evolutionary history of this domain. The identification and analysis of additional molecular markers not affected by LGT should continue defining the emerging picture of a genuine phylogenetic core for the third domain of life.
Genome size and complexity, as measured by the number of genes or protein domains, is remarkably similar in most extant eukaryotes and generally exhibits no correlation with their morphological complexity. Underlying trends in the evolution of the functional content and capabilities of different eukaryotic genomes might be hidden by simultaneous gains and losses of genes.
We reconstructed the domain repertoires of putative ancestral species at major divergence points, including the last eukaryotic common ancestor (LECA). We show that, surprisingly, during eukaryotic evolution domain losses in general outnumber domain gains. Only at the base of the animal and the vertebrate sub-trees do domain gains outnumber domain losses. The observed gain/loss balance has a distinct functional bias, most strikingly seen during animal evolution, where most of the gains represent domains involved in regulation and most of the losses represent domains with metabolic functions. This trend is so consistent that clustering of genomes according to their functional profiles results in an organization similar to the tree of life. Furthermore, our results indicate that metabolic functions lost during animal evolution are likely being replaced by the metabolic capabilities of symbiotic organisms such as gut microbes.
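Once ancestral domain repertoires are available, gains and losses on each branch follow from simple set differences, and they can be tallied by functional category. The sketch below is illustrative only: the node names, domain identifiers and category labels are hypothetical, and the procedure used to reconstruct the repertoires is not shown.

```python
from collections import Counter


def branch_events(parent_domains, child_domains):
    """Return (gained, lost) domain sets along one parent-to-child branch."""
    gained = child_domains - parent_domains
    lost = parent_domains - child_domains
    return gained, lost


def summarize(edges, repertoires, category):
    """Tally gains and losses per branch, grouped by functional category."""
    summary = {}
    for parent, child in edges:
        gained, lost = branch_events(repertoires[parent], repertoires[child])
        summary[(parent, child)] = {
            "gains": Counter(category[d] for d in gained),
            "losses": Counter(category[d] for d in lost),
        }
    return summary


# Hypothetical repertoires for three reconstructed ancestors.
REPERTOIRES = {
    "LECA": {"d1", "d2", "d3", "d4"},
    "animal_ancestor": {"d1", "d2", "d5", "d6", "d7"},
    "fungal_ancestor": {"d1", "d3"},
}

# Hypothetical functional annotation of each domain.
CATEGORY = {
    "d1": "metabolic", "d2": "other", "d3": "metabolic", "d4": "metabolic",
    "d5": "regulatory", "d6": "regulatory", "d7": "regulatory",
}

EDGES = [("LECA", "animal_ancestor"), ("LECA", "fungal_ancestor")]

if __name__ == "__main__":
    for branch, events in summarize(EDGES, REPERTOIRES, CATEGORY).items():
        print(branch, dict(events["gains"]), dict(events["losses"]))
```

In this toy data set the branch leading to the animal ancestor gains only regulatory domains and loses only metabolic ones, which mirrors the functional bias described above.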
While protein domain gains and losses are common throughout eukaryote evolution, losses oftentimes outweigh gains and lead to significant differences in functional profiles. Results presented here provide additional arguments for a complex last eukaryotic common ancestor, but also show a general trend of losses in metabolic capabilities and gain in regulatory complexity during the rise of animals.
Recent advances in genomics and metagenomics reveal remarkable diversity of viruses and other selfish genetic elements. In particular, giant viruses have been shown to possess their own mobilomes that include virophages, small viruses that parasitize giant viruses of the Mimiviridae family, and transpovirons, distinct linear plasmids. One of the virophages, known as Mavirus, a parasite of the giant Cafeteria roenbergensis virus, shares several genes with large eukaryotic self-replicating transposons of the Polinton (Maverick) family, and it has been proposed that the polintons evolved from a Mavirus-like ancestor.
We performed a comprehensive phylogenomic analysis of the available genomes of virophages and traced the evolutionary connections between the virophages and other selfish genetic elements. The comparison of the gene composition and genome organization of the virophages reveals 6 conserved, core genes that are organized in partially conserved arrays. Phylogenetic analysis of those core virophage genes, for which a sufficient diversity of homologs outside the virophages was detected, including the maturation protease and the packaging ATPase, supports the monophyly of the virophages. The results of this analysis appear incompatible with the origin of polintons from a Mavirus-like agent but rather suggest that Mavirus evolved through recombination between a polinton and an unknown virus. Altogether, virophages, polintons, a distinct Tetrahymena transposable element Tlr1, transpovirons, adenoviruses, and some bacteriophages form a network of evolutionary relationships that is held together by overlapping sets of shared genes and appears to represent a distinct module in the vast total network of viruses and mobile elements.
The results of the phylogenomic analysis of the virophages and related genetic elements are compatible with the concept of network-like evolution of the virus world and emphasize multiple evolutionary connections between bona fide viruses and other classes of capsid-less mobile elements.