Conceived and designed the experiments: DR MB MAM. Performed the experiments: MB MAM GG. Analyzed the data: DR MB MAM GG BLS. Contributed reagents/materials/analysis tools: MB MAM GG. Wrote the paper: DR MB MAM.
The discovery of Mimivirus, with its very large genome content, made it possible to identify genes common to the three domains of life (Eukarya, Bacteria and Archaea) and to generate controversial phylogenomic trees congruent with that of ribosomal genes, branching Mimivirus at its root. Here we used sequences from metagenomic databases, Marseillevirus and three new viruses extending the Mimiviridae family to generate the phylogenetic trees of eight proteins involved in different steps of DNA processing. Compared to the three ribosomally defined domains, we report a single common origin for the DNA processing genes of Nucleocytoplasmic Large DNA Viruses (NCLDVs), rooted between Archaea and Eukarya, with a topology congruent with that of the ribosomal tree. As for translation, we found in our new viruses, together with Mimivirus, five proteins rooted deeply in the eukaryotic clade. In addition, comparison of informational gene repertoires based on phyletic pattern analysis supports the existence of a clade containing NCLDVs clearly distinct from those of Eukarya, Bacteria and Archaea. We hypothesize that the core genome of NCLDVs is as ancient as the three currently accepted domains of life.
Molecular sequence analyses have made it possible to partially identify the origins of genes. A putative tree of life based on ribosomal analysis was postulated that includes three domains: Eukarya, Bacteria and Archaea. Viruses are excluded from this classification system due to a lack of evidence that they possess a core of genes and because they lack ribosomes. The double jelly-roll motif that can be found in the capsids of viruses infecting Archaea, Eukarya, or Bacteria, however, is proposed as evidence for an ancestry of some viruses. The description of the giant Mimivirus, which shares similar features with cells, has recently generated much debate about the nature of viruses in the living world. In our study, we identified a common set of proteins present in all forms of life, including cellular organisms and viruses, particularly giant viruses, and we used this set of viral and cellular proteins to perform phylogenetic reconstructions.
In modern cells, biosynthesis of DNA precursors (dNTPs) is achieved in two steps by two essential enzymes (Figure 1): ribonucleotide reductase (RNR) and thymidylate synthase (TdS). First, reduction of RNA precursors (rNTPs) into dNTPs is catalyzed by RNR; then deoxythymidine 5′-monophosphate (dTMP) is produced from deoxyuridine 5′-monophosphate (dUMP) by thymidylate synthase. These two key enzymes are thought to have been involved in the transition from the ancient RNA world to the DNA world. They have been identified in cells and in nucleocytoplasmic large DNA viruses (NCLDVs) and were used for phylogenetic reconstructions. RNR proteins have been classified into three classes; we focus on classes I and II, as class III proteins share very little similarity with the other two classes. The RNR phylogeny (Figure S1) shows prokaryotic and eukaryotic clades and supports the emergence of the RNR of three NCLDV families separately from the eukaryotes. Two families of non-homologous TdS, ThyA and ThyX, have been identified in living organisms, but the thymidylate synthases reported in NCLDVs are mostly of the ThyA type. The phylogenetic reconstruction of the ThyA type (Figure S2) showed two clusters, one that includes bacteria, bacteriophages and Archaea, and the other that includes eukaryotes and their viruses. In the second group, ThyA in all NCLDV lineages was scattered within the eukaryotic clade, except for ThyA from Mimivirus and Marseillevirus, which emerge in a separate clade. Phylogenetic trees inferred for these two proteins also show NCLDV branches within eukaryotic clades. We assumed, as others have, that multiple gene exchanges or non-orthologous gene displacements have occurred during the extensive evolution between eukaryotes and viruses, thus blurring the topology.
Four key proteins are involved in DNA replication and repair. Three enzymes, namely the DNA polymerase family B (DNAP B), the topoisomerase II A (TopoIIA) and the Flap endonuclease (FEN), together with the processivity factor Proliferating Cell Nuclear Antigen (PCNA) (Figure 1), are widely distributed in NCLDVs and were therefore used for phylogenetic reconstruction. The DNAP B tree (Figure S3) was built from an alignment of archaeal, bacterial, viral and eukaryotic α, δ and ζ protein sequences. Sequences from NCLDVs and Herpesviridae were sister clades rooted between the δ and ζ eukaryotic subfamilies. The TopoIIA tree (Figure S4) was obtained from an alignment of archaeal, bacterial, eukaryotic and NCLDV sequences. The TopoIIA phylogenetic reconstruction showed eukaryotic monophyly, whereas NCLDVs are not monophyletic but emerged as a bush between the eukaryotic and prokaryotic clades. PCNA and FEN were not found in Bacteria, and their respective phylogenetic trees (Figures S5 and S6, respectively) support distinct Archaea, eukaryote and NCLDV monophylies. Therefore, these four protein phylogenies support, with a high degree of confidence, the existence of a viral clade with ancestral DNA replication machinery branching separately from Eukarya and Archaea.
In addition, we investigated the phylogenetic reconstruction of two proteins involved in transcription. The DNA-dependent RNA polymerase II (RNAP II) is a ubiquitous enzyme, and its phylogenetic tree reveals a clade containing NCLDVs clearly distinct from those of Eukarya, Bacteria and Archaea (Figure 2 and Figure S7). Transcription factor II B (TFIIB), absent in bacteria, is a general transcription factor that forms part of the RNA polymerase II pre-initiation complex. Its phylogenetic analysis displayed similar results (Figure 3), with the exception that bacteria are absent. These results lead to the conclusion that highly conserved proteins involved in RNA biosynthesis in NCLDVs have emerged in a clade as ancient as those of eukaryotes and Archaea.
The Mimivirus genome contains several genes encoding proteins involved in translation. We performed phylogenetic reconstructions for five proteins involved in translation, including four aminoacyl-tRNA synthetases (Figures S8, S9, S10, S11) and the putative elongation factor EF-1 (Figure S12), which are also found in the draft genomes of three newly described members of the Mimiviridae family (Terravirus, Courdovirus and Moumouvirus). The arginyl-tRNA synthetase and cysteinyl-tRNA synthetase phylogenetic trees (Figures S8 and S10) show blurred topologies, with Mimiviridae branching inside the eukaryotic clade, but the topology of the methionyl-tRNA synthetase tree (Figure S11), which contains three distinct eukaryotic clades, supports the existence of a Mimiviridae clade branching before the eukaryotic clade containing Amoebozoa and opisthokonts. The topology of the tyrosyl-tRNA synthetase tree supports the hypothesis of a transfer from an ancient viral clade to an amoeba. It was previously suggested that the aminoacyl-tRNA synthetases are among the more commonly exchanged genes, and the aminoacyl-tRNA synthetase trees, including that of Mimiviridae, support this hypothesis and show that viruses are also included in the confused evolutionary scenarios of these proteins. Proteins related to EF-1 are divided into different subfamilies according to their functional domains. That of Mimiviridae displays the GTP-binding protein domain (GTPBP1) found in a protein family divergent from the bona fide EF-1 family. The phylogenetic tree (Figure S12) exhibited three large clades corresponding to the archaeal and eukaryotic translation protein families related to EF-1, one clade corresponding to the divergent bacterial elongation factor EF-Tu, and another corresponding to GTPBP1 proteins. This tree supported the emergence of the Mimiviridae elongation factor inside the GTPBP1 eukaryotic clade.
In conclusion, the phylogenetic trees based on translational proteins suggest that Mimiviridae acquired some of the genes at the root of the eukaryote clade.
As DNA processing genes appear to be the most conserved genes between cellular organisms and viruses, we further constructed a different type of tree, known as a phyletic pattern, that is based on the comparison of the gene repertoire involved in information storage and processing. A dendrogram (Figure 4) was constructed from the presence/absence of the respective genes of each virus and cellular organism in the database of Clusters of Orthologous Groups (COGs) related to DNA processing functions. The topology of this gene content tree supports the existence of a clade containing NCLDVs clearly distinct from those of Eukarya, Bacteria and Archaea. In accordance with the phylogenetic tree based on NCLDV core genes, viruses of the same family roughly grouped together inside the NCLDV clade. This tree based on the informational gene repertoire is compatible with the above phylogenetic trees, clearly showing four distinct clades, and with the ribosomal phylogenetic trees (excluding viruses).
Our results attest to the existence of a single and ancestral source of genes allowing for the biosynthesis of DNA in large DNA viruses. Based on the study of these genes, we speculate that they appeared simultaneously with or just after the emergence of the modern eukaryote lineages. This topology may be biased by the fact that the current eukaryotes, arising from the fusion of a proto-eukaryote with an alpha-proteobacterium, emerged a billion years ago. This was followed by a bottleneck and the disappearance of all proto-eukaryotes that did not successfully form a symbiotic fusion with bacteria. Therefore, there are no traces of the viruses of these proto-eukaryotes. This might explain the branching of some NCLDVs inside the eukaryote clade. An analysis of the current genealogy of organisms showed that the group of genes associated with DNA synthesis has a common source that is conserved in three or four current domains of organisms (including viruses) and has a topology comparable to that of ribosomal phylogenetic trees (excluding viruses). Finally, of the nine trees considered, five included the three canonical domains and NCLDVs; the other four included only three clades: the ribosomal tree, and the three trees containing NCLDVs and two domains (PCNA, FEN, and TFIIB). We conclude that trees based on ribosomal proteins are not sufficient to represent all forms of life, as they do not include viruses.
In addition, as previously suggested, it is possible that the ancestral cellular DNA machinery came from DNA viruses, which would thus in turn have provided genes allowing DNA replication. This work also confirms and extends the seminal work of Iyer et al., who identified for the first time a core genome of the NCLDVs with nine genes common to all species of this group of viruses. Interestingly, the name NCLDV does not reflect the striking complexity of this group of viruses, which may contain both DNA and RNA. Finally, in this study, we found that the informational gene repertoires, irrespective of their individual source, also reveal four major clades.
Because the set of genes associated with the DNA machinery is found in both viruses and cells, the origin of this gene set is open to question, and some authors have suggested that large DNA viruses might have contributed to the formation of the nucleus of eukaryotes. In this scenario, current forms of life would result from the fusion of the world of RNA and proteins (including ribosomes) with that of DNA; RNA and protein molecules have been preserved in modern cells (Archaea, Bacteria and Eukarya), whereas some viruses (i.e., Mimiviridae) containing both RNA and DNA have conserved translation-associated proteins such as tRNA synthetases, which are probably remnants of an ancestral translation apparatus. In any case, we speculate that DNA machinery is a clear remnant of a biological “Big Bang” generating at least two distinct forms of life: DNA-based and DNA-RNA-ribosome-based. Since this ancestral time, two forms of life have co-developed: cells, which are ribosome-encoding organisms (REOs) diversified into three domains (Archaea, Bacteria, and Eukarya), and capsid-encoding organisms (CEOs) (bona fide viruses, parasites of the REOs). Accepting this ancestry allows us to reinterpret the origin of two families of tailed viruses (i.e., Myoviridae and Siphoviridae) infecting both bacteria and Archaea. One can suggest that they derive from ancestral viruses infecting the REO ancestor rather than being transferred from one domain to another. This makes sense if one considers that Siphoviridae appear to be the most common viral family on Earth. Finally, several previous studies focusing on the common genetic features in viruses from Archaea and from eukaryotes suggest that there is a conserved endonuclease structure in influenza virus and in the endonucleases of Archaea and bacteria.
Evidence that the picorna-like superfamily of RNA viruses originated before the radiation of eukaryotes and a phylogenetic tree of the guanylyl transferase domain branching poxviruses at the origin of eukaryotes might also be reinterpreted in this light and are congruent with the hypothesis of CEO ancestry. Based on these data, we propose a scenario in which NCLDVs emerged from the rhizome of life with roots arising at the very beginning of life (Figure 5).
Selected proteomes of organisms representing major phyla of the three domains of life (Eukarya, Archaea and Bacteria) were downloaded from NCBI and JGI. BLASTP searches were performed with this set of proteins to identify conserved proteins in the complete proteomes that are available for the six NCLDV families (Asfarviridae, Asco-Iridoviridae, Phycodnaviridae, Poxviridae, Marseillevirus and Mimiviridae) and the draft proteomes of three new Mimiviridae. Groups of homologs were built using the criterion that sequences were homologous if the Best Blast Hit (BBH) showed an alignment length of over 70 amino acids and a percent identity over 20%. Five proteins matching these criteria that were also present in at least the three domains of life were identified: thymidylate synthase, RNR, DNAP B, TopoIIA (also called gyrase in bacteria), and RNAP II. FEN, PCNA and TFIIB were found only in eukaryotes and Archaea. Aminoacyl-tRNA synthetases and EF-1-like proteins were identified only in Mimiviridae and the three domains of life. For this set of proteins, homologs among all the viral proteomes, together with the homologs with the largest coverage relative to our query in environmental sequence data (env_nr), were detected using BLASTP.
The Courdovirus protein sequences have been deposited in GenBank under the following accession numbers: HQ223093 for cysteinyl-tRNA_synthetase, HQ223094 for tyrosyl-tRNA_synthetase, HQ223095 for methionyl-tRNA_synthetase, HQ223096 for arginyl-tRNA_synthetase, HQ223097 for GTPB1_like, HQ404368 for TFIIB.
The Terravirus protein sequences have been deposited in GenBank under the following accession numbers: HQ223098 for cysteinyl-tRNA_synthetase, HQ223099 for tyrosyl-tRNA_synthetase, HQ223100 for methionyl-tRNA_synthetase, HQ223101 for arginyl-tRNA_synthetase, HQ223102 for GTPB1_like.
The Moumouvirus protein sequences have been deposited in GenBank under the following accession numbers: HQ223103 for cysteinyl-tRNA_synthetase, HQ223104 for tyrosyl-tRNA_synthetase, HQ223105 for GTPB1_like, HQ404369 for TFIIB.
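The homology criteria described above (a best BLAST hit with an alignment length over 70 amino acids and a percent identity over 20%) can be illustrated with a minimal Python sketch. It assumes BLASTP results in the default 12-column tabular format (qseqid, sseqid, pident, length, ..., evalue, bitscore) and assumes that the best hit is the one with the highest bit score per query, selected before the thresholds are applied; the function names, the sequence identifiers and the example rows are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass

# Thresholds stated in the text.
MIN_ALIGNMENT_LENGTH = 70    # amino acids (strictly greater than)
MIN_PERCENT_IDENTITY = 20.0  # percent (strictly greater than)


@dataclass(frozen=True)
class Hit:
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    bit_score: float


def parse_tabular(lines):
    """Parse BLAST tabular lines (assumed default 12-column layout)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[11])))
    return hits


def best_hits(hits):
    """Keep one hit per query: the highest bit score (assumed BBH definition)."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best


def homologs(hits):
    """Apply the length and identity thresholds to each query's best hit."""
    return {
        query: hit
        for query, hit in best_hits(hits).items()
        if hit.alignment_length > MIN_ALIGNMENT_LENGTH
        and hit.percent_identity > MIN_PERCENT_IDENTITY
    }


def row(*fields):
    """Build one tab-separated line from its fields (toy-data helper)."""
    return "\t".join(str(x) for x in fields)


# Hypothetical toy rows, for illustration only.
EXAMPLE = [
    row("virus_prot_1", "cell_prot_A", 35.2, 180, 110, 5, 1, 180, 10, 189, 1e-30, 120.0),
    row("virus_prot_1", "cell_prot_B", 28.0, 90, 60, 3, 5, 94, 20, 109, 1e-8, 60.5),
    row("virus_prot_2", "cell_prot_C", 45.0, 50, 27, 1, 1, 50, 1, 50, 1e-6, 55.0),
    row("virus_prot_3", "cell_prot_D", 18.5, 200, 160, 8, 1, 200, 1, 200, 1e-4, 48.0),
]

kept = homologs(parse_tabular(EXAMPLE))
print(sorted(kept))
```

In this toy input only virus_prot_1 passes: the best hit of virus_prot_2 is too short and that of virus_prot_3 falls below the identity threshold.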
T-Coffee or Muscle was used to construct multiple alignments, and conserved blocks were selected using Gblocks. The corrected alignments were then used for maximum likelihood (ML) and Bayesian inference (BI). ML phylogenies were inferred using FastTree with the JTT+CAT substitution model. Values near branches are SH-like local supports computed by the FastTree program and are used as confidence values of tree branches. For the Bayesian approach, phylogenies were inferred using MrBayes; the WAG matrix was used, and model parameters (gamma shape and proportion of invariant sites) were allowed to vary during the Markov chain Monte Carlo (MCMC) run. Four MCMC chains were run for 1 million generations and sampled every 100th generation. The first 100,000 trees were discarded, and the sumt command of MrBayes was used to compute clade posterior probabilities. Trees were displayed using MEGA 4.
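The sampling schedule stated above fixes how many trees are saved per run. The short Python sketch below works through that arithmetic under two labeled assumptions: the burn-in figure of 100,000 is read here as a number of generations (one possible reading of the text), and the starting tree is not counted. The function name is hypothetical.

```python
def mcmc_sample_counts(generations, sample_every, burnin_generations):
    """Return (sampled, discarded, kept) tree counts for a single run.

    Assumes the burn-in is expressed in generations and that the
    starting tree (generation 0) is not counted.
    """
    sampled = generations // sample_every
    discarded = burnin_generations // sample_every
    return sampled, discarded, sampled - discarded


# Values from the text: 1 million generations, sampled every 100th.
sampled, discarded, kept = mcmc_sample_counts(1_000_000, 100, 100_000)
print(sampled, discarded, kept)
```

Under these assumptions, the burn-in would remove the first tenth of the sampled trees before clade posterior probabilities are summarized.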
Comparison of informational gene repertoires was performed using a phyletic pattern indicating the presence/absence of the respective informational genes of all species present in the corresponding functional categories of Clusters of Orthologous Groups (COGs) based on 66 genomes. Only COG functional categories related to Information storage and processing (COG categories J, A, K, L and B) and Nucleotide transport and metabolism (COG category F) were used to build a matrix by assigning a “1” if a genome contains at least one ortholog and a “0” if not. As eukaryotes are weakly represented in the initial COG database, we extended the initial matrix by assigning a COG functional category to proteins of 14 added eukaryotic proteomes using BLASTP against the COG database (e-value < 10⁻³). We also extended the phyletic pattern matrix with NCLDV species by assigning NCLDV proteins to the corresponding COGs, which was done by linking functional categories of the NCVOG database with those of COGs using the same parameters. The Euclidean distance matrix was computed from the 0/1 matrix, and a dendrogram was then built by hierarchical clustering using in-house R scripts.
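The phyletic pattern procedure described above (presence/absence matrix, Euclidean distances, hierarchical clustering) can be sketched in a few lines of Python. This is an illustration only, not the in-house R code: the linkage method is not stated in the text, so average linkage is assumed here, and the genome names and ortholog counts are made-up toy values.

```python
import math
from itertools import combinations


def to_presence_absence(ortholog_counts):
    """Assign 1 if a genome has at least one ortholog in a COG, else 0."""
    return {g: [1 if n >= 1 else 0 for n in counts]
            for g, counts in ortholog_counts.items()}


def euclidean(a, b):
    """Euclidean distance between two equal-length 0/1 vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(patterns):
    """Pairwise distances keyed by the unordered pair of genome names."""
    return {frozenset((a, b)): euclidean(patterns[a], patterns[b])
            for a, b in combinations(patterns, 2)}


def average_linkage(patterns):
    """Agglomerative clustering (average linkage is an assumption).

    Returns the dendrogram as nested tuples of genome names.
    """
    dist = distance_matrix(patterns)
    # Each cluster is (subtree, list of member genomes).
    clusters = [(name, [name]) for name in patterns]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            mi, mj = clusters[i][1], clusters[j][1]
            d = sum(dist[frozenset((a, b))] for a in mi for b in mj)
            d /= len(mi) * len(mj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical ortholog counts for six hypothetical COGs (toy data).
counts = {
    "genome_A": [2, 1, 0, 0, 0, 0],
    "genome_B": [1, 3, 0, 0, 0, 1],
    "genome_C": [0, 0, 1, 1, 0, 0],
    "genome_D": [0, 0, 4, 1, 2, 1],
}

patterns = to_presence_absence(counts)
tree = average_linkage(patterns)
print(tree)
```

With these toy values, genomes A and B share one pattern block and C and D another, so the sketch groups each pair before joining the two pairs at the root.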
Figure S1. Bayesian phylogenetic tree of RNR (31 sequences, 166 positions). GI or JGI numbers are listed next to the corresponding taxonomic name of each cellular organism and virus. A color code was used to represent taxonomic groups: Bacteria in purple, Archaea in green, Eukarya in blue, NCLDVs in red, other viruses and phages in pink, and environmental sequences in black. Numbers at nodes are Bayesian posterior probabilities. The scale bar represents the number of estimated changes per position for a unit of branch length.
Figure S2. Bayesian phylogenetic tree of ThyA (44 sequences, 159 positions). Detailed legend is the same as in Figure S1.
Figure S3. Bayesian phylogenetic tree of DNAP B (62 sequences, 139 positions). Detailed legend is the same as in Figure S1.
Figure S4. Bayesian phylogenetic tree of TopoIIA (48 sequences, 140 positions). Detailed legend is the same as in Figure S1.
Figure S5. Bayesian phylogenetic tree of PCNA (40 sequences, 174 positions). Detailed legend is the same as in Figure S1.
Figure S6. Bayesian phylogenetic tree of FEN (37 sequences, 304 positions). Detailed legend is the same as in Figure S1.
Figure S7. Bayesian phylogenetic tree of RNAP II (80 sequences, 272 positions). Detailed legend is the same as in Figure S1.
Figure S8. Maximum-likelihood phylogenetic tree of arginyl-tRNA synthetase (32 sequences, 123 positions). Detailed legend is the same as in Figure S1 except that numbers at nodes are SH-like local supports.
Figure S9. Maximum-likelihood phylogenetic tree of tyrosyl-tRNA synthetase (33 sequences, 201 positions). Detailed legend is the same as in Figure S1 except that numbers at nodes are SH-like local supports.
Figure S10. Maximum-likelihood phylogenetic tree of cysteinyl-tRNA synthetase (38 sequences, 156 positions). Detailed legend is the same as in Figure S1 except that numbers at nodes are SH-like local supports.
Figure S11. Maximum-likelihood phylogenetic tree of methionyl-tRNA synthetase (51 sequences, 204 positions). Detailed legend is the same as in Figure S1 except that numbers at nodes are SH-like local supports.
We are grateful to Ghislain Fournous for help with bioinformatic analyses; Lina Barrassi, Angélique Campocasso and Isabelle Pagnier for their technical assistance in isolation of new viruses; and Catherine Robert for technical assistance in genome sequencing. We thank Christelle Forzale for help with Figures 1 and 5.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was funded by the Centre National de la Recherche Scientifique (CNRS, crédits récurrents). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.