To investigate the evolutionary origins of proteins encoded by the Poxviridae family of viruses, we examined all poxvirus protein coding genes using a method of characterizing and visualizing the similarity between these proteins and taxonomic subsets of proteins in GenBank. Our analysis divides poxvirus proteins into categories based on their relative degree of similarity to two different taxonomic subsets of proteins such as all eukaryote vs. all virus (except poxvirus) proteins. As an example, this allows us to identify, based on high similarity to only eukaryote proteins, poxvirus proteins that may have been obtained by horizontal transfer from their hosts. Although this method alone does not definitively prove horizontal gene transfer, it allows us to provide an assessment of the possibility of horizontal gene transfer for every poxvirus protein. Potential candidates can then be individually studied in more detail during subsequent investigation.
Results of our analysis demonstrate that in general, proteins encoded by members of the subfamily Chordopoxvirinae exhibit greater similarity to eukaryote proteins than to proteins of other virus families. In addition, our results reiterate the important role played by host gene capture in poxvirus evolution; highlight the functions of many genes poxviruses share with their hosts; and illustrate which host-like genes are present uniquely in poxviruses and which are also present in other virus families.
The Poxviridae are a large family of double-stranded DNA viruses whose members have a linear genome of 130 to 300 kbp and replicate in the cytoplasm of eukaryote cells. The poxvirus family is composed of two subfamilies: the Entomopoxvirinae, comprising viruses that infect insects, and the Chordopoxvirinae, comprising viruses that infect vertebrates. Both subfamilies are further divided into genera, groups of viral species with genetic and antigenic similarity to one another. Chordopoxviruses are categorized into 9 genera: Avipoxvirus, Capripoxvirus, Cervidpoxvirus, Leporipoxvirus, Molluscipoxvirus, Orthopoxvirus, Parapoxvirus, Suipoxvirus, and Yatapoxvirus. Entomopoxviruses are categorized into the genera Alphaentomopoxvirus, Betaentomopoxvirus, and Gammaentomopoxvirus. The genus Orthopoxvirus is the best characterized, and contains the species Variola virus, isolates of which are the causative agent of smallpox, as well as the species Vaccinia virus, containing less virulent but better studied viruses.
The poxvirus family is postulated to have a common evolutionary origin with four other families of large eukaryotic DNA viruses, collectively referred to as nucleocytoplasmic large DNA viruses (NCLDV) (Iyer et al., 2001). These families include Asfarviridae, containing viruses which infect both pigs and parasitic arthropods; Iridoviridae, whose members infect invertebrates, fish or amphibians; Phycodnaviridae, containing viruses which infect eukaryotic algae; and the recently discovered Mimiviridae, whose members are only known to infect amoebae.
Poxvirus virions are ovoid or brick-shaped, and consist of an envelope surrounding an outer membrane, which itself surrounds a densely packed and membrane bound core containing a double-stranded DNA genome, enzymes, and transcription factors. Unlike most viruses, poxvirus virions do not rely on particular cell surface receptors, but are capable of binding and penetrating the outer membrane of nearly any cell type. Virion cores are released into the cytoplasm where they immediately synthesize early mRNAs that are translated into growth factors, cell signaling and immune defense molecules, enzymes, and other factors necessary for DNA replication and intermediate transcription. Uncoating of the core next allows the DNA genome to be replicated to form concatemeric molecules along with the transcription of intermediate genes that when translated, provide late transcription factors. Subsequent transcription and translation of the late genes produces virion structural proteins, enzymes, and early transcription factors for packaging into virions (Moss, 2001). During virion formation, the concatemeric DNA genomes are resolved into individual genomes, packaged into the core membranes, and mature within the cytoplasm to form infectious mature virions (MV). These are subsequently wrapped in modified Golgi membranes and transported to the periphery of the cell via attached actin filaments. Fusion of the wrapped virions with the plasma membrane results in release of enveloped virus (EV) (Condit et al., 2006).
DNA viruses have much lower mutation rates and genetic variability than RNA viruses, with nucleotide substitution rates closer to those of their hosts, on the order of 10⁻⁷ to 10⁻⁹ mutations per site per round of replication (Drake and Hwang, 2005; Duffy et al., 2008). Their resulting genetic stability, together with their high levels of host specificity, has led in part to the hypothesis that many DNA viruses cospeciate with their hosts (DeFilippis and Villarreal, 2001). Intricate relationships with their hosts are evidenced by the many immunological and cellular factors these viruses have obtained through host gene capture via recombination between viral and host DNA. Such acquisition of new coding information by poxviruses may contribute to the ability of the virus to manipulate the host immune response and other cellular machinery to provide a selective advantage for virus replication.
Evidence suggests many orthopoxviruses occasionally cross into other mammals from rodent reservoir populations, either as zoonotic infections of humans or via mutations that allow colonization of new host species (Esposito and Fenner, 2001; Li et al., 2007; Likos et al., 2005). Understanding species crossing events of both types is essential to understanding the threats poxviruses pose today. Investigation of evolutionary clues buried within the genome sequence of the virus, such as captured host genes left by such historical host interactions, may help us to better understand mechanisms of zoonotic infection, viral tropism, evolutionary adaptation, and pathogenesis.
In general, recombination may occur via homologous recombination, site-specific recombination, or non-homologous end joining, and may be between viral genomes or between a viral genome and some other genetic entity, such as the genome or cDNA of the viral host, or a co-infecting parasite or a plasmid. Recombination may be an important source of genetic variation among viruses, where it is often associated with rapid evolutionary divergence, due to the potential of providing a selective advantage much more quickly than through the accumulation of point mutations. Recombination has been detected in both DNA and RNA viruses including species of the families Caulimoviridae, Flaviviridae, Herpesviridae, Papillomaviridae, Picornaviridae, Potyviridae, Poxviridae, Polyomaviridae, Retroviridae, the genus Tobamovirus, and bacteriophages in the order Caudovirales (DeFilippis and Villarreal, 2001). Such evidence has led to the “modular” theory of virus evolution, whereby many viral genomes represent mosaics of genetic material obtained through multiple recombination events (Botstein, 1980; Shackelton and Holmes, 2004).
Acquisition of host genes and apparent selection for maintenance of those genes has been documented in many virus families (Iyer et al., 2001; McFadden and Murphy, 2000). Many of the apparently host-derived genes fall into one of two well-defined categories of gene function: immunomodulatory genes and genes involved in nucleic acid metabolism. Viral proteins that are very similar to host genes are documented to interfere with a variety of host immune defense mechanisms including antigen display, cytokines and their receptors, cytoplasmic signaling resulting from immune activation, and genes involved in resistance of cells to oxidative stress and apoptosis (Hughes and Friedman, 2005; Shackelton and Holmes, 2004). Many DNA viruses encode genes involved in nucleic acid metabolism, with which they redirect the host nucleotide precursor pool to viral DNA synthesis (Iyer et al., 2006). These enzymes are often clearly similar to host enzymes, and are often very highly conserved, probably due to functional restraints on structure and biochemical properties. There are many additional, possibly host-derived genes, whose functions have not yet been fully explored, but based at least on similarity to other proteins, seem to manipulate various intracellular processes to facilitate steps in the viral life cycle. Examples of these are genes involved in signaling pathways, lipid and carbohydrate metabolism, vesicle transport, and protein-protein interactions (Afonso et al., 2000; Geserick et al., 2004; Laidlaw et al., 1998; Werden and McFadden, 2008).
Several methods are available to detect genes that may have been horizontally transferred into virus genomes from hosts or other sources. These include phylogenetic inference, compositional features such as codon and nucleotide bias, and patterns of presence and absence of genes within genomes. A well accepted and widely used method to detect horizontal gene transfer (HGT) is demonstration of phylogenetic clustering of the gene of interest with taxa unrelated to the current genome in which the gene is found, to the exclusion of taxa more closely related to the current genome. This method provides information about the potential donor and recipient organisms, but its potential caveats include limited phylogenetic samples, undetected presence of paralogs, and unequal rates of evolution between lineages (Katz, 2002). Compositional features that may be used to detect recently horizontally acquired genes include nucleotide composition, oligonucleotide frequencies, and codon usage (Koonin and Wolf, 2008), but these methods work only for very recent HGT because the anomalous signatures of such genes decay rapidly due to continued evolution of the host genome (Katz, 2002; Koonin and Wolf, 2008; Monier et al., 2007), and these methods do not give information about the donor lineage (Katz, 2002). Presence of a gene within only a related subset of a taxonomic group is a possible indicator of HGT if apparent orthologs of the gene are present in unrelated taxa.
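As a rough illustration of the compositional approach described above, the following Python sketch compares the GC content of a single gene with that of the genome it sits in. It is a minimal sketch only: the function names are hypothetical, GC content is just one of the compositional features mentioned, and no threshold for calling a gene anomalous is implied by the source.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    sequence = sequence.upper()
    counted = [base for base in sequence if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def gc_deviation(gene, genome):
    """Difference between a gene's GC content and its genome's GC content.

    A large absolute value may hint at a recently acquired gene; what counts
    as "large" is left to the analyst (an assumption, not a published cutoff).
    """
    return gc_content(gene) - gc_content(genome)
```

Because such signatures decay as the transferred gene evolves in its new genome, a small deviation does not rule out an older transfer.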
Sequence similarity alone is not accepted as a definitive demonstration of HGT or of close evolutionary relationship, since, for example, such results may be dependent on sampling biases present in the search databases used (Koski and Golding, 2001). However, sequence similarity measures can be a powerful tool for scanning very large amounts of data to find promising individual protein candidates for further analysis. Such sequence similarity analyses may also provide evidence for possible large-scale evolutionary trends across an entire virus taxon. This report therefore presents our effort to assess overall trends in HGT for members of the family Poxviridae and to identify individual poxvirus genes that show evidence of HGT for more detailed, subsequent studies.
Protein databases for various taxonomic groups were assembled and searched using BLASTP for best matches to query sets of viral proteins. Results were processed with perl scripts, and displayed in two-dimensional taxonomic group plots. This straightforward method of visually comparing two sets of BLAST scores for a set of proteins has been utilized previously to compare proteins of a single genome to the proteins of two other genomes (NCBI, 2006; Rasko et al., 2005), and to compare proteins of one taxonomically grouped set of genomes to the proteins of two other taxonomically grouped sets of genomes (Lefkowitz et al., 2006).
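The post-processing step described above can be sketched in a few lines. The original work used perl scripts that are not shown here; the Python version below is a stand-in that assumes BLAST results were saved in the standard 12-column tabular layout (query id first, bit score last). The function name and the example rows are hypothetical.

```python
from collections import defaultdict


def best_bitscores(blast_lines):
    """Return {query_id: highest bit score} from tabular BLAST output lines.

    Assumes the standard 12-column tabular layout, in which column 1 is the
    query id and column 12 is the bit score.
    """
    best = defaultdict(float)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        bitscore = float(fields[11])
        if bitscore > best[query_id]:
            best[query_id] = bitscore
    return dict(best)


# Made-up rows for illustration only (ids and scores are not real data).
example = [
    "poxA\teuk1\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-150\t520",
    "poxA\teuk2\t60.0\t300\t120\t0\t1\t300\t1\t300\t1e-90\t310",
    "poxB\teuk3\t30.0\t120\t84\t0\t1\t120\t1\t120\t1e-05\t48.5",
]
eukaryote_best = best_bitscores(example)
```

Running the same function once per taxonomic database gives one dictionary of best scores per database, which is all that is needed to place each query protein on a two-dimensional plot.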
When the two taxonomic group databases being compared yield similar scores for a query protein, this indicates that each group contains at least one protein with about the same degree of pairwise similarity to the query protein. A set of query proteins with scores that are similar between the two databases creates a diagonal between the two axes.
When a point lies closer to one axis than to the other, this indicates that its best BLAST hit in the taxonomic group represented by the nearer axis has a greater degree of pairwise similarity to the query protein than its best match in the taxonomic group represented by the opposing axis.
Any tendency of a set of query proteins to skew towards a particular taxonomic group might suggest a common evolutionary origin for those sequences either through descent from a common ancestor, or through multiple horizontal gene transfer events.
Each point on the graph is plotted based on the score of its single highest scoring hit in each of the target databases. In many cases, the target database provides many hits with scores that are nearly as high as the score for the single best hit. So the identity of the protein with the single best hit is only one representative of the group of all proteins with hits that exhibit closely related scores. The BLASTP bitscore is one of several metrics that can provide an indication of the degree of similarity between two proteins. No pairwise sequence metric can definitively establish an evolutionary relationship between two protein sequences, but many, including bitscore, can give clues regarding protein similarities. The similarities can be collectively examined to gauge general trends in the similarity between proteomes, and individual similarities might be suggestive of evolutionary relationships between proteins, which may then be followed up using more rigorous methods of investigation, such as phylogenetic analyses, to assess the nature and likelihood of possible evolutionary relationships between individual proteins.
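The way each point is placed, and the caveat that the single best hit is only one representative of a group of near-equal hits, can be made concrete with a short Python sketch. The function names, the choice to plot a missing hit as a score of 0, and the 5% tolerance are all assumptions made for illustration; the source does not specify them.

```python
def plot_points(x_best, y_best):
    """Pair per-query best scores from two databases into (x, y) coordinates.

    A query with no hit in one database is given 0.0 on that axis
    (an assumption made for this sketch).
    """
    queries = sorted(set(x_best) | set(y_best))
    return {q: (x_best.get(q, 0.0), y_best.get(q, 0.0)) for q in queries}


def near_best_hits(hits, tolerance=0.05):
    """Return subject ids scoring within `tolerance` (a fraction) of the top score.

    `hits` is a list of (subject_id, bitscore) pairs for one query.
    """
    if not hits:
        return []
    top = max(score for _, score in hits)
    return [subject for subject, score in hits if score >= top * (1.0 - tolerance)]


# Made-up scores for illustration only.
virus_best = {"poxA": 200.0, "poxC": 95.0}
eukaryote_best = {"poxA": 520.0, "poxB": 48.5}
points = plot_points(virus_best, eukaryote_best)
```

Listing the near-best hits alongside the top hit is one way to avoid reading too much into the identity of a single best-scoring protein.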
Phylogenetic analyses were conducted on a small number of poxvirus proteins suggested by taxonomic group plots as having potentially interesting evolutionary histories. Each poxvirus protein was aligned with the protein providing its best BLASTP score as depicted in the plot, along with similar sequences from representative taxa. Sequences were aligned using the CLUSTALW algorithm (Thompson et al., 1994) implemented in MEGA version 4 (Tamura et al., 2007). Consensus phylogenetic trees were constructed by the Maximum Parsimony method using MEGA version 4, by the Maximum Likelihood method using Garli 0.96 (Zwickl, 2006), and by Bayesian inference using MrBayes 3.12 (Ronquist and Huelsenbeck, 2003).
Our initial analysis used as a query set all proteins predicted to be encoded by all species and isolates in both the chordopoxvirus and entomopoxvirus subfamilies of the Poxviridae. This query set was used to probe three protein databases: all proteins encoded by eukaryotes, all proteins encoded by bacteria, and all virus-encoded proteins except those encoded by poxviruses. The results from the viral protein database were plotted against results from eukaryote proteins (Fig. 1a) and against results from bacterial proteins (Fig. 1b), and results from the eukaryote and bacterial protein databases were plotted against each other (Fig. 1c). Overall, the resulting plots show that chordopoxvirus proteins tend to exhibit greater similarity to eukaryotic proteins than to bacterial or viral proteins, suggesting that many poxvirus proteins may share a common evolutionary origin with proteins of their eukaryotic hosts. The plots in Fig. 1 distinguish between chordopoxvirus and entomopoxvirus subsets of poxvirus proteins. Although entomopoxviruses share several of the host-like genes present in chordopoxviruses, entomopoxvirus proteins do not show the same general skew towards greater similarity with eukaryotic proteins in comparison to other viral proteins. This could be due to the relative shortage of insect sequences in GenBank or to a bias of entomopoxviruses towards trading genes with other insect viruses over acquiring them from hosts.
The prominent very high scoring proteins that skew towards the virus protein axis in both plots (Fig. 1a and 1b) are encoded by the copy of the avian retrovirus, reticuloendotheliosis virus, which has integrated into the genome of fowlpox virus, of the Avipoxvirus genus (Hertig et al., 1997). These proteins score very high against the virus database because they are identical to those encoded by reticuloendotheliosis virus. Their closest cousins among eukaryote-encoded proteins, also providing high blast bitscores, are those coded for by endogenous retroviruses of pig, koala and possum.
Many proteins lie in clusters, which are nearly always made up of orthologous proteins from various poxvirus species. Because of slight sequence variations between orthologs, the proteins in a cluster get slightly different scores against the target database proteins, but the scores remain close enough to form an orthologous cluster. One example is the cluster of ribonucleotide reductase large subunit (RNR1) poxvirus orthologs that is circled in Fig. 1a, 1b, and 1c.
The proteins in Fig. 1a segregate into six categories based on their location on the plot: A) along, or very close to the virus axis; B) in the region between the diagonal and the virus axis; C) on the diagonal between the two axes; D) in the region between the diagonal and the eukaryote axis; E) along or close to the eukaryote axis; and F) proteins that fall near the origin and therefore do not exhibit significant sequence similarity to any proteins from other virus families or from eukaryotic species. Each category has its own range of values for the ratio of virus similarity to eukaryote similarity. For example, proteins in category A have recognizable sequence similarity to proteins of other viruses, as compared to their insignificant levels of similarity to proteins of eukaryotes, while category E is just the inverse, with a high eukaryote-to-virus sequence similarity ratio. Category C proteins have relatively equal levels of sequence similarity to proteins of both viruses and eukaryotes, and proteins in regions B and D on either side of the diagonal have recognizable similarity to proteins in both the eukaryote and virus databases, but score higher against one database than against the other. Poxvirus proteins plotted in the same region may have similar scores or similar score ratios, but they are not necessarily similar to one another in any other way, either by sequence similarity, by sequence length, by distribution among poxviruses, or by the species in either searched database which provide their closest match. Region F contains the majority of points (77%). A breakdown by numbers and percentages of points in each region of the plots in Fig. 1 is shown in Table 1. A table of poxvirus proteins present in each region of Fig. 1a is available as supplemental Table S2.
Table S2 identifies the taxonomic subset of poxviruses that encode each protein, the eukaryotic and/or virus species that exhibit the best scores to the poxvirus protein(s) and what is known about the function of the protein. This approach to evolutionary classification of poxvirus proteins is similar to that used to classify proteins of molluscum contagiosum virus (Senkevich et al., 1997).
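The six-way division of Fig. 1a into regions A through F can be expressed as a simple rule on each point's position. The Python sketch below is one possible formalization, not the procedure used in the study: the source gives no numeric boundaries, so the minimum score and the angular cutoffs are hypothetical, as is the choice to place virus scores on the x axis and eukaryote scores on the y axis.

```python
import math


def classify_region(virus_score, eukaryote_score,
                    min_score=50.0, axis_angle=10.0, diagonal_halfwidth=10.0):
    """Assign a point to region A-F from its two best bit scores.

    All three cutoffs are hypothetical illustration values. The angle is
    measured from the virus axis (0 degrees) towards the eukaryote axis (90).
    """
    if virus_score < min_score and eukaryote_score < min_score:
        return "F"  # near the origin: no significant similarity on either axis
    angle = math.degrees(math.atan2(eukaryote_score, virus_score))
    if angle <= axis_angle:
        return "A"  # along the virus axis
    if angle < 45.0 - diagonal_halfwidth:
        return "B"  # between the virus axis and the diagonal
    if angle <= 45.0 + diagonal_halfwidth:
        return "C"  # on or near the diagonal
    if angle < 90.0 - axis_angle:
        return "D"  # between the diagonal and the eukaryote axis
    return "E"      # along the eukaryote axis
```

Using an angle rather than raw score differences keeps the regions tied to the ratio of the two scores, which matches the description of each category having its own range of virus-to-eukaryote similarity ratios.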
The following sections outline representative poxvirus proteins from each category identified in Fig. 1a, identifying and discussing them in terms of their similarity to proteins from other virus families and/or eukaryotic species, their degree of distribution among poxvirus species, and their general category of function or putative function.
Poxvirus proteins that fall in region A of the plot, along the virus axis, have significant levels of sequence similarity to proteins of viruses in other virus families, but have no significant similarity to proteins of eukaryotes. For each poxvirus protein and its high scoring non-poxvirus protein or proteins, this high level of similarity could be due to a shared evolutionary origin, or to convergent evolution of proteins serving the same role in viruses with similar evolutionary niches.
The highest scoring chordopoxvirus proteins along the virus axis are the large group of homologues of the variola virus protein B22R, whose high scores against the virus database result from a single possible relative of this protein present in cyprinid herpesvirus 3 (CyHV-3), a recently discovered member of the family Alloherpesviridae which is notable for having several genes with unexpectedly high levels of similarity to poxvirus genes (Ilouzea et al., 2006). B22R is present in every chordopoxvirus genus except parapoxvirus, and is the largest protein encoded by poxviruses. While its function is still unknown, it is predicted to contain carboxyl-terminal transmembrane domains and cysteine residues which may mediate disulfide bond formation (Tulman et al., 2006). The position of this protein in a sparsely populated area of the plot and its potential relationship to a protein in a herpesvirus make it a good candidate for further investigation by phylogenetic analysis. The consensus tree of the high scoring sequence from CyHV-3 and representative poxvirus sequences in Fig. 2a shows that a horizontal transfer event may have occurred between virus predecessors of crocodile poxvirus and CyHV-3.
Nucleoside triphosphatase I (NPH-1) transcription termination factor is the only protein appearing along the virus axis that is encoded by both entomopoxviruses and chordopoxviruses. This protein is found in most chordopoxvirus genera as well as in Melanoplus sanguinipes entomopoxvirus (MSEV), and all versions get best scores against viruses of the NCLDV group.
Among proteins along the virus axis with scores above 100 (approximate E values less than 10⁻²²), there are 11 groups of orthologous proteins encoded by the entomopoxviruses, and 5 orthologous groups of proteins encoded by the chordopoxviruses. The highest scoring points include entomopoxvirus DNA and RNA repair enzymes, RNA ligase, and NAD+ dependent DNA ligase. Some entomopoxvirus proteins plotted in this region get their highest scores against proteins of viruses in the NCLDV group, but equally many get high scores against proteins of viruses in the family Baculoviridae, where several, including the Fusolin/gp37 protein and the Methionine-threonine-glycine (MTG) motif gene family member, appear to enhance virus infectivity of the insect host (Dall et al., 2001). While many of the entomopoxvirus proteins plotted in this region score very high against proteins in other viruses, they are of unknown function, and contain no characterized domains.
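The correspondence between a bit score threshold and an approximate E value, mentioned at the start of the paragraph above, follows from the standard BLAST relation E = m × n × 2^(−S'), where S' is the bit score and m × n is the size of the search space. The Python sketch below shows the calculation; the function name is hypothetical and the search-space size must be supplied by the user, since the effective database sizes used in the study are not given here.

```python
def evalue_from_bitscore(bitscore, search_space):
    """Expected number of chance hits with at least this bit score.

    Uses E = search_space * 2**(-bitscore), where search_space is the
    product of the effective query and database lengths (supplied by the
    caller; no value from the study is assumed).
    """
    return search_space * 2.0 ** (-bitscore)
```

Because the E value scales linearly with the search space, the same bit score corresponds to different E values against databases of different sizes, which is one reason to plot bit scores rather than E values when comparing two databases.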
Proteins plotted in region B, between the virus axis and the diagonal, have relatively high sequence similarity to proteins of other viruses as compared to their levels of similarity to eukaryote proteins. These poxvirus proteins may have a shared evolutionary origin with both virus and eukaryote ancestors, with greater similarity between the virus homologs due to similar evolutionary selection pressure and functional constraints on the virus genes, in contrast to the selection pressure on the eukaryotic versions of the protein. Poxvirus proteins in this category may share only one or a few protein domains with similar eukaryotic proteins, while best hits with proteins from other virus families exhibit similarity across the entire protein sequence.
Besides the proteins encoded by the reticuloendotheliosis virus integrated into fowlpox virus, the only points with scores above 100 which fall into region B, between the virus axis and the diagonal, are encoded by members of the species Canarypox virus. CNPV153 most closely matches the viral replication protein, Rep, of members of the family Circoviridae, and the CNPV227 N1R/p28-like protein most closely matches a protein of acanthamoeba polyphaga mimivirus, of the family Mimiviridae.
Region C, surrounding the diagonal, contains proteins whose sequences are globally conserved throughout most DNA viruses and eukaryotes, but it also contains proteins which get a high score against sequences present in only one or a few members of the eukaryote or virus kingdom. Poxvirus proteins plotted in this region find best scores in the virus kingdom among possibly distantly related viruses, i.e. members of the NCLDV, as well as among species of the families Herpesviridae and Adenoviridae. Many of these proteins are universally highly conserved, function in the synthesis and maintenance of DNA and RNA, and are present in many members of both the poxvirus family and the virus family in which the highest score is obtained, as well as in most eukaryotes. The ultimate origin of these proteins is uncertain, and their entries into the virus lineages may have occurred concurrently with the inception of the first ancestors of these viruses, or at many different times during the evolution of the different virus families. Other proteins plotted in this region are apparently of eukaryote origin, have functions involving immune response and intracellular processes, and seem likely to have been transferred horizontally from hosts into the corresponding virus families.
The highest scoring proteins along the diagonal are the large and small subunits of ribonucleotide reductase (RNR) (class 1A), an enzyme that controls the cellular concentration of deoxyribonucleotides. Although there are three classes of RNR, only class 1, subclass A is found in eukaryote-infecting viruses. RNR class 1 is made up of large (RNR1) and small (RNR2) subunits, with two of each subunit required to associate into a heterotetramer to form a functioning enzyme (Stubbe, 1990). Both subunits are very well conserved in all major taxonomic groups in which RNR class 1 appears: eukaryotes, eubacteria, bacteriophages and eukaryotic viruses. The large subunit of RNR (RNR1) is present in orthopoxviruses and suipoxviruses, while the small subunit of RNR (RNR2) is present in most chordopoxviruses. For both subunits, the percent identities between queries and highest scoring hits are between 80% and 90%, with such high levels of sequence conservation likely due to the stringent structural requirements the enzyme must maintain in order to function (Torrents et al., 2002). Although many chordopoxvirus species encode only the small subunit, it is probably functioning in association with host-encoded RNR1, based on the finding that even RNR subunits from vastly different species can associate to form heterotetramers (Hamann et al., 1998).
In addition to the very high scoring RNR proteins, many other enzymes involved in nucleotide synthesis and metabolism are high on the diagonal, including deoxyuridine-triphosphatase (dUTPase), thymidine kinase (TK), thymidylate kinase (ThyK), deoxycytidine kinase, and the one example of thymidylate synthase present in poxviruses. All these enzymes catalyze steps in pyrimidine metabolism, in particular converting cellular pools of RNA components into nucleotides for synthesis of DNA. Also high on the diagonal are DNA polymerase, alpha and beta subunits of RNA polymerase, and DNA photolyase, a DNA repair enzyme well conserved in all branches of life, but notably missing from placental mammals. The poxvirus proteins in this category are widely, some even ubiquitously, distributed among poxviruses, and have most similar viral proteins outside poxviruses in a wide variety of double stranded DNA viruses, including members of the postulated NCLDV group of viruses such as the phycodnaviruses, iridoviruses and mimiviruses, as well as in viruses outside this group, such as adenoviruses and herpesviruses. Eukaryotic best hits come from an even wider range, spanning everything from fungi and plants to vertebrates and invertebrate animals. These types of proteins fulfill basic needs of DNA viruses and all organisms with DNA genomes, and both their omnipresence in nature and the high levels of sequence conservation can be confounding factors in attempts to phylogenetically trace their individual evolutionary lineages.
Many poxvirus proteins plotted on the diagonal have limited distribution among poxviruses and have best virus hits almost exclusively in putatively unrelated viruses, such as members of the baculovirus and herpesvirus families. Proteins in this category, which probably participate in downregulation of the host immune response, include interleukin-10 (IL-10) proteins and complement-control proteins. This category also includes semaphorins and c-type lectin-like proteins whose functions in poxviruses are unknown, but similar proteins in other organisms have roles in immunological pathways. Poxvirus encoded apoptosis-inhibiting proteins and copper/zinc superoxide dismutase protect infected cells against programmed cell death. These virally encoded proteins find highest scores among eukaryotes which seem likely to be hosts or closely related to hosts of the respective viruses, which makes the viral proteins seem likely to be the products of independent horizontal gene transfer events from hosts. Although actual assessments of such potential gene transfers may be provided only by further analysis of each gene group, notably by phylogenetic inference, the locations of the points on these plots and the identities of the highest scoring proteins on each axis suggest candidates for study, and provide clues as to which proteins may yield the most interesting results. A good candidate for further study is IL-10, presumably of eukaryotic origin, but with several apparently homologous proteins among poxviruses and herpesviruses. A phylogenetic reconstruction of several viral and host IL-10 sequences is provided in Fig. 2b. Analysis of the phylogenetic relationship between these proteins suggests the possibility of several independent IL-10 HGT events between hosts and infecting viruses. Three HGT events are suggested into different lineages of herpesviruses, and two separate HGT events are suggested for poxviruses, with one each into the capripoxvirus and parapoxvirus lineages.
It is notable that for many of these HGT events, the eukaryote IL-10 protein most closely related to a specific virus IL-10 protein is that of the particular host species infected by that virus.
A few orthologous groups of proteins plotted in this region have functions unrelated to DNA/RNA/nucleotide synthesis and have closest viral hits in viruses of the NCLDV group. The eukaryote species providing the highest scores to these proteins seem unlikely to be hosts of the respective poxviruses. 3-beta-hydroxysteroid dehydrogenase proteins are widely distributed among poxvirus genera, are likely used to suppress the host inflammatory response, and find most similar virus proteins in fish-infecting iridoviruses. Orthopoxvirus and entomopoxvirus species encode a protein called vaccinia-related ser/thr kinase, which is widely distributed in the animal kingdom and seems to participate in regulation of cell cycle (Kang et al., 2008) and has closest virus relatives in iridovirus species. A few proteins with very limited poxvirus distribution have unknown functions and highest pairwise similarity to proteins of NCLDV member species. The ultimate evolutionary origins of these proteins are unknown.
Region D, between the diagonal and the eukaryote axis, is one of the most densely populated regions of the plot, with many poxvirus proteins that show significant hits to eukaryotic proteins and lower scores to homologs in other viruses. As with region C, region D contains proteins whose high scores seem due to universal sequence conservation, as well as proteins of presumably eukaryote origin, whose high scores on both axes most likely reflect historical transfer of these genes by separate routes into poxviruses and other virus families.
Several poxvirus protein families have points in both regions C and D, including the vaccinia-related kinase family, the c-type lectin-like proteins, the TK enzymes, and ankyrin repeat proteins. As with their orthologs in region C, best virus matches for these are found both among NCLDV and non-NCLDV DNA viruses. All these are encoded by viruses in many poxvirus genera, and get best eukaryote hits among a variety of animals.
The best scoring protein sequences in region D of the plot are the ATP-dependent DNA ligases encoded by several poxvirus genera. These all have higher pairwise identity to proteins of various mammals than to their best virus hits, which are all among putatively unrelated nucleopolyhedrovirus (NPV) species, a group of viruses in the baculovirus family. This is the only DNA-related enzyme unique to this region of the plot.
Two additional region D proteins with wide distribution among poxvirus genera may have functions modulating host immune response. These are G protein-coupled receptors (GPCR) with significant similarity to known CC chemokine receptors, and proteins in the serpin superfamily of proteinase inhibitors, which are implicated in the regulation of tumor progression, of inflammation, and of cell death (Silverman et al., 2001; Viswanathan et al., 2009). Various mammals provide the best eukaryotic blast scores for most of these sequences, but some avipoxvirus proteins score best to a chicken protein. Herpesviruses provide best virus scores to most of the GPCR proteins, while mimivirus gives the best score for all the serpins, as it has the only known viral serpin outside the poxvirus family. A third set of proteins, the soluble tumor necrosis factor receptor (TNFR) II homologs, has slightly less widespread poxvirus distribution. These putatively protect infected cells from TNF-mediated cell death, and get highest scores to proteins of various mammals, and to viruses of the herpesvirus and iridovirus families.
Proteins with very limited distribution among poxvirus genera include a protein similar to eukaryotic initiation factor-4a (eIF-4a) and a protein possibly functioning as an oligoribonuclease, both encoded by diachasmimorpha longicaudata entomopoxvirus, the first known symbiotic entomopoxvirus, which infects a parasitic wasp. The best eukaryotic scores for these proteins come from potentially host-like species and both find best virus scores against NCLDV members. A possible dual specificity protein phosphatase, encoded by canarypox virus, and a protein similar to human MHC Class I, encoded by squirrel poxvirus, are plotted in this region for blast scores against proteins of vertebrates and non-NCLDV viruses. MHC Class I-like proteins encoded by poxviruses of several other genera are plotted very close to the origin due to low pairwise sequence identity to their best matches on both axes. MSEV and squirrel poxvirus each encode a sequence of unknown function; although these sequences do not share high identity with one another, both may be chromosome segregation ATPases.
Region E contains poxvirus proteins that get notable scores against eukaryotic proteins and essentially insignificant scores against viral proteins. Poxvirus proteins that appear in this region are most likely of eukaryotic origin and have been transferred into poxviruses or ancestors of poxviruses, but have not been transferred or at least not maintained in sequenced viruses of any other present day virus family. Members of the poxvirus family may be the only virus species carrying these eukaryotic genes simply because poxviruses are more effective at capturing or maintaining host genes than other viruses. Alternatively, many of these genes may be absent from other viruses since they confer little or no selective advantage to these viruses, but do confer selective advantage to poxviruses due to unique aspects of their biology.
Poxvirus proteins plotted in this region include enzymes involved in lipid and carbohydrate metabolism, nucleotide metabolism, protection against oxidative damage, and intracellular processes including signaling, cell cycle control and apoptosis. The few proteins in this region which have wide distribution among poxvirus genera are kelch proteins and tyrosine protein kinase-like proteins. These proteins have unknown functions, and get best scores to proteins in a variety of vertebrates.
Orthologous groups of proteins from avipoxviruses appear more often in this region than proteins from any other genus. Several of these proteins have unknown function, but the functional characterizations of the others span the whole range of functions attributed to proteins of region E. Each orthologous group of avipoxvirus proteins gets best scores against a variety of eukaryotes, mostly vertebrates. Again, assessments of potential horizontal gene transfers may be provided only by detailed phylogenetic analysis of each gene group, but the wide range of vertebrates providing highest scores for each orthologous group is notable, and preliminary phylogenetic analyses (data not shown) may indicate that, although they score very well against vertebrate proteins, many of these avipoxvirus proteins may have begun diverging from the original host-acquired proteins in the ancient past.
Glutathione peroxidase protects against oxidative damage, and is the only avipoxvirus protein in region E that is also encoded by another poxvirus genus. Molluscum contagiosum virus encodes an ortholog of glutathione peroxidase that gets its highest blastp score against a similar protein in macaque, while the avipoxvirus sequences get highest scores against insect versions of the protein.
As with the avipoxvirus proteins, many of these proteins may have been transferred into the poxvirus lineage in the relatively distant past, from early vertebrates. The phylogenetic tree of the enzyme monoglyceride lipase (Fig. 2c), which appears in this region of the plot, provides evidence that the origin of the poxvirus homolog may represent a more ancient gene transfer into a poxvirus ancestor from an unknown host.
Many orthologous groups of proteins in this region have best blast scores scattered over a wide range of vertebrates, rather than among a narrowly defined group of species related to a potential HGT source. However, with only two exceptions both encoded by avipoxvirus species, all proteins find best scores against vertebrates, rather than against the wide variety of metazoa which provide the best scores for many of the potentially more universally conserved proteins plotted closer to the diagonal.
Approximately 77% of poxvirus proteins fall very close to the origin of this plot. These include genes that may be unique to the poxvirus family, as well as genes that in poxviruses have primary sequences too divergent to achieve high blastp scores against potentially orthologous proteins outside poxviruses. Examples of the former, poxvirus-specific genes include a DNA-binding phosphoprotein (Cop-F17R) and a structural protein (Cop-A12L). Examples of the latter, sequence-diverged genes, are a putative ATPase (Cop-A32L) and a capsid protein (Cop-D13L), both postulated to have orthologs in all members of NCLDV and included in the originally proposed core NCLDV genes (Iyer et al., 2001).
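How a point's position on the plot relates to the categories discussed above can be sketched in code. The following Python fragment is an illustration only, not the procedure used to generate the figures. The function name `classify_point`, the cutoffs `origin_cutoff` and `axis_cutoff`, and the width `diagonal_band` are all hypothetical, and the scores are assumed to be normalized to the range 0 to 1.

```python
def classify_point(euk_score, virus_score,
                   origin_cutoff=0.1, axis_cutoff=0.1, diagonal_band=0.1):
    """Assign a protein to a descriptive plot category.

    euk_score and virus_score are the best normalized scores (0 to 1)
    against eukaryote proteins and against non-poxvirus virus proteins.
    All cutoff values are hypothetical and chosen only for illustration.
    """
    # Low scores on both axes: poxvirus-specific or highly diverged proteins.
    if euk_score < origin_cutoff and virus_score < origin_cutoff:
        return "near origin"
    # Notable eukaryote score, essentially no virus score.
    if virus_score < axis_cutoff:
        return "near eukaryote axis"
    # Notable virus score, essentially no eukaryote score.
    if euk_score < axis_cutoff:
        return "near virus axis"
    # Comparable scores on both axes.
    if abs(euk_score - virus_score) <= diagonal_band:
        return "on diagonal"
    if euk_score > virus_score:
        return "below diagonal (eukaryote side)"
    return "above diagonal (virus side)"
```

With these assumed cutoffs, a protein scoring 0.6 against eukaryotes and 0.02 against other viruses would be labeled "near eukaryote axis", while one scoring 0.05 and 0.02 would be labeled "near origin". Different cutoffs would move the boundaries between categories.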
In addition to the comparison of poxvirus proteins to proteins of eukaryotes and other viruses, we also compared the similarity of poxvirus proteins to proteins of bacteria and other viruses (Fig. 1b). For almost all poxvirus proteins, bacteria provide lower pairwise scores than eukaryotes. Notably, most of the large groups of proteins that lie on and below the diagonal in Fig. 1a skew towards the virus axis in Fig. 1b due to the absence of similar proteins in the bacterial kingdom.
With the exception of one entomopoxvirus protein, all proteins between the virus axis and the diagonal get higher scores against eukaryotes (Fig. 1a) than against bacteria. The exception is NAD+ dependent DNA ligase, encoded by MSEV and amsacta moorei entomopoxvirus (AMEV), which gets slightly higher scores against a sulfur-oxidizing bacterium and a fish-infecting mycoplasma than against its eukaryote best hits in amoeba. The unique status of this point on the plot marks it as a potentially interesting candidate for additional investigation. Preliminary analyses (data not shown) indicate that while apparent homologs of this gene are found predominantly in bacterial genomes, a few are also found among species of bacteriophage and NCLDV, indicating a potential for interesting horizontal gene transfer events. These and all other such suggested relationships must of course be rigorously tested by phylogenetic analysis to provide the most reliable assessment of gene transfer pathways.
As in Fig. 1a, the diagonal in Fig. 1b contains several proteins highly conserved throughout nature. The diagonal in Fig. 1b also contains several proteins that in Fig. 1a were below the diagonal, showing high similarity to eukaryote proteins, but whose scores against bacterial proteins are more comparable to their scores against other virus proteins, thus shifting them to the diagonal in Fig. 1b.
Nearly all points below the diagonal in Fig. 1b exhibit high bitscores against both bacterial and eukaryote proteins, although the eukaryote scores are usually higher. Among these proteins, all proteins with significant scores against virus proteins have mimivirus proteins as their best virus scores—possibly not surprising considering the many bacteria-like features of the mimivirus genome.
Poxvirus proteins plotted near the bacterial axis have similar scores against their best eukaryote protein hits. This region contains more avipoxvirus genes than genes from any other poxvirus genus. The only proteins in this region to get better bacterial than eukaryote scores come from the entomopoxvirus subfamily. These are the two different leucine-rich repeat (LRR) proteins encoded by AMEV, which get moderately good scores against eukaryotes (yeast and plants), but get somewhat better scores against both a gram-negative anaerobic bacterium and a symbiotic green sulfur bacterium.
Although individual poxviruses usually contain more than 150 genes, only 49 of these are present in all of the fully sequenced poxviruses, with larger subsets being shared among members of each genus (Lefkowitz et al., 2006). In poxvirus genomes, the conserved "core" genes are involved in key functions such as replication, transcription and virion assembly, and tend to cluster in the central region of the linear genome, while genes that are unique to specific genera or species are distributed towards the two ends of the genome. Many of these peripheral genes encode proteins that manipulate host immune response and cellular processes, including apoptosis, antigen presentation and recognition, interferon functions and immune signaling processes.
Cowpox virus strain Gri-90 has one of the largest genomes among orthopoxviruses, and contains essentially all genes found in other members of the genus. For this reason, it serves well as an archetypical orthopoxvirus genome for the purpose of orthopoxvirus gene analysis. All proteins of this strain were analyzed by taxonomic group plots, to compare the relationships of core and non-core protein subsets with eukaryotes and with viruses outside the poxvirus family (Fig. 3). In Fig. 3a, proteins were classified according to genomic location, as centrally (red points) or non-centrally (black points) located, where the central region of the genome is defined as all genes from G13L to A47L. In Fig. 3b, proteins were classified according to the number of poxvirus species with conserved orthologs, with red points representing the most widely conserved proteins among poxviruses, and black points representing proteins of most limited distribution.
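The location-based classification used for Fig. 3a can be illustrated with a short Python sketch. It assumes that gene names are available as a list in their order along the linear genome. The function name `classify_by_location` and the toy gene list are hypothetical, and only the boundary genes G13L and A47L come from the text.

```python
def classify_by_location(gene_order, first_central="G13L", last_central="A47L"):
    """Label each gene as 'central' or 'non-central'.

    gene_order lists gene names in their order along the linear genome.
    Genes from first_central to last_central (inclusive) are central.
    """
    start = gene_order.index(first_central)
    end = gene_order.index(last_central)
    return {
        gene: "central" if start <= i <= end else "non-central"
        for i, gene in enumerate(gene_order)
    }

# Toy gene order with placeholder names; this is not the real Gri-90 map.
toy_order = ["left_1", "left_2", "G13L", "mid_1", "mid_2", "A47L", "right_1"]
labels = classify_by_location(toy_order)
```

In this toy example, `left_1`, `left_2` and `right_1` are labeled non-central, and the remaining four genes, including both boundary genes, are labeled central.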
Results show that the diagonal contains universally conserved as well as species-specific genes (Fig. 3b), and contains proteins with both central and peripheral locations (Fig. 3a). However, the proteins that lie to the eukaryote side of the diagonal are predominantly non-centrally located and appear in a very limited number of species. Presence of these genes in only one or a few genera or species strongly suggests the genes were acquired by the cowpox virus lineage subsequent to its divergence (or the divergence of the most recent orthopoxvirus ancestor) from the other poxvirus genera. High scores with eukaryotic proteins may also indicate relatively recent transfer of the genes from eukaryotes, and/or strong selection for sequence identity with host proteins. The sparsely populated area near the virus axis has only proteins widely conserved among poxviruses, and these are almost exclusively centrally located, with the one exception being the poxvirus B22R protein. B22R is a surface glycoprotein that is conserved in every chordopoxvirus genus, and as mentioned above, has only one possible homolog outside the poxvirus family, in CyHV-3.
A genome map of cowpox virus strain Gri-90 (Shchelkunov et al., 1998) (GenBank accession no. X94355) in Fig. 4 depicts all cowpox virus genes color coded according to the degree of similarity of each cowpox virus protein to its best hit when compared against all virus (non-poxvirus) or all eukaryotic proteins. Genes and their descriptions are provided in Table 2. Genes are labeled by their restriction fragment name and are colored according to the highest blastp bitscore obtained by the encoded poxvirus protein when searched against the respective taxonomy database. Bitscores are normalized by dividing by the highest possible bitscore the query protein could achieve, i.e. the bitscore it receives when compared to itself. Therefore the highest possible score for each comparison is 1. The map demonstrates the higher levels of similarity poxvirus proteins have to eukaryote proteins in comparison to virus proteins outside the poxvirus family. In addition, it is apparent that with only a few exceptions, poxvirus proteins with high levels of sequence identity to proteins of other organisms tend to lie towards the edges of the linear genome. Exceptions include S2R: thymidine kinase, L4L: ribonucleotide reductase large subunit, R2L: glutaredoxin 1, and E8L: carbonic anhydrase (virion protein).
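The normalization described above amounts to a single division, sketched below in Python. The function name `normalized_bitscore` and the example bitscores are hypothetical, and the sketch assumes that the self-comparison bitscore of each query has already been obtained.

```python
def normalized_bitscore(hit_bitscore, self_bitscore):
    """Divide a hit's bitscore by the query's self-comparison bitscore."""
    if self_bitscore <= 0:
        raise ValueError("self_bitscore must be positive")
    return hit_bitscore / self_bitscore

# Hypothetical bitscores for one query protein.
self_score = 400.0
best_eukaryote = normalized_bitscore(300.0, self_score)  # 0.75
best_virus = normalized_bitscore(100.0, self_score)      # 0.25
```

In this invented example the protein would be colored as more similar to its best eukaryote hit (0.75) than to its best virus hit (0.25). A query compared to itself gives exactly 1.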
Protein coding genes of poxviruses have been the subject of much research. Poxvirus immunomodulatory genes, those both with and without host homologs, have been extensively examined (Finlay and McFadden, 2006; Iyer et al., 2006; McFadden and Murphy, 2000; Monier et al., 2007; Seet et al., 2003; Stanford et al., 2007) as have the gene content and gene families present in poxvirus species, and evolutionary relationships based on phylogenies of those genes (Bratke and McLysaght, 2008; Gubser et al., 2004; Iyer et al., 2001; Iyer et al., 2006; Lefkowitz et al., 2006; McLysaght et al., 2003; Upton et al., 2003; Xing et al., 2006). It is apparent that many genes have entered poxvirus genomes via horizontal transfer both from their hosts and also possibly from other viruses.
From an evolutionary perspective, the genes poxviruses share with other viruses have been examined most notably in the context of exploring the hypothesis that the poxvirus family may share a common ancestor with several other families of large DNA viruses (the NCLDV). This hypothesis is based largely on the set of similar proteins these viruses share (at a sequence and/or functional level), which may have served as a "core" set of NCLDV genes. Poxviruses also code for genes with significant sequence similarity to genes from non-NCLDV virus family members, including virulence genes shared by entomopoxviruses, baculoviruses and iridoviruses (Dall et al., 2001; Means et al., 2007), host-interaction genes present in poxviruses and herpesviruses (Afonso et al., 2000; Iyer et al., 2006; McFadden and Murphy, 2000), and other poxvirus proteins with notable levels of similarity to genes of a recently discovered fish herpesvirus (Ilouze et al., 2006).
The potential for horizontal gene transfer into poxviruses has been examined using several methods, including phylogenetic reconstructions, gene synteny analysis, and anomalous base composition. Phylogenetic reconstructions of gene families with members in other viruses and their hosts have suggested that multiple horizontal gene transfer (HGT) events have taken place into poxvirus genomes from other viruses (Dall et al., 2001) and from their eukaryotic hosts (Bratke and McLysaght, 2008; Hughes, 2002; Hughes and Friedman, 2005; Monier et al., 2007). Anomalous base composition (DaSilva and Upton, 2005; Monier et al., 2007), and gene synteny analysis (Bratke and McLysaght, 2008; McLysaght et al., 2003) have found evidence for HGT from hosts to poxviruses, including multiple HGT events for some genes. All methods of analysis conclude that the presence of many genes is best explained by HGT, although the process may not be frequent and recent (Lefkowitz et al., 2006; Monier et al., 2007), and some genes with noted similarity to genes of other organisms are proposed to not have been obtained via HGT (Hughes and Friedman, 2005; Iyer et al., 2001).
The goals of our current analysis were to develop a method of measuring and visualizing the similarities of all proteins expressed by virus isolates belonging to the entire poxvirus family to various taxonomically distinct sets of proteins from other organisms. This analysis was designed to detect overall trends in gene similarity and to detect individual genes that may be of interest due to anomalous characteristics with regard to such levels of similarity. Each individual protein may then be further investigated with regard to its function, distribution in poxviruses and other organisms, and via phylogenetic analysis, to determine its most likely evolutionary history.
Proteins identified as interesting candidates for follow-up research by this method may be further studied using more traditional phylogenetic methods, as illustrated by our initial phylogenetic analyses of proteins in Fig. 2. Overall trends in sequence similarity of different subsets of poxvirus proteins, as well as information about individual proteins implicated by our analysis, may contribute valuable information about the evolution of poxviruses and the mechanisms of host pathogenesis.
Overall, analysis by taxonomic group plots shows that chordopoxvirus proteins tend to exhibit greater similarity to eukaryotic proteins than to bacterial or viral proteins, suggesting that many poxvirus proteins may share a common evolutionary origin derived from proteins of their eukaryotic hosts. Although entomopoxviruses also contain host-like genes, both with and without homologs in chordopoxviruses, entomopoxvirus proteins do not show the same general skew towards similarity to eukaryotic proteins. However, entomopoxviruses encode quite a few proteins with notably greater similarity to proteins of other viruses than to bacterial or eukaryotic proteins. The relatively small sampling of insect proteins available in GenBank could partly account for the low scores of these proteins to the eukaryote database, with insects being represented by 799,971 proteins and 210 complete genomes, compared to a vertebrate collection of 1,787,682 proteins and 1,559 complete genomes. However, with only 3 exceptions, all chordopoxvirus proteins which achieve similarly high scores to proteins of other viruses are proteins with sequences universally conserved throughout nature, such as ribonucleotide reductase, DNA photolyase and RNA polymerase. For viruses of both the entomopoxvirus and chordopoxvirus subfamilies, the most similar virus proteins outside the poxvirus family are found both among members of the postulated NCLDV group of viruses and among non-NCLDV members, with viruses of the families Baculoviridae, Herpesviridae and Iridoviridae most represented.
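The difference in database sampling quoted above can be expressed as simple ratios. The Python lines below only restate the counts given in the text; the variable names are illustrative.

```python
# Counts of GenBank proteins and complete genomes quoted in the text.
insect_proteins, insect_genomes = 799_971, 210
vertebrate_proteins, vertebrate_genomes = 1_787_682, 1_559

protein_ratio = vertebrate_proteins / insect_proteins
genome_ratio = vertebrate_genomes / insect_genomes

print(f"vertebrate/insect proteins: {protein_ratio:.2f}")
print(f"vertebrate/insect genomes:  {genome_ratio:.2f}")
```

Vertebrates are thus represented by roughly 2.2 times as many proteins and about 7.4 times as many complete genomes as insects.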
Inspection of the individual proteins represented on the plots reveals that many of the proteins are universally highly conserved. These function in the synthesis and maintenance of DNA and RNA, and are present in many virus species, as well as in most eukaryotes. All the poxvirus enzymes that convert cellular pools of nucleotides for RNA synthesis into deoxyribonucleotides for synthesis of DNA fall either on the diagonal or just below it. The ultimate origin of these proteins is uncertain, but those with greater similarity to eukaryote proteins may have been transferred more recently into the poxvirus lineage than into the other virus families in which they appear, or these proteins may have been constrained for functional purposes towards high sequence identity with host proteins.
Many other proteins highlighted by this analysis are apparently of eukaryote origin, and fall either on the diagonal, just below it, or near the eukaryote axis, depending on their degree of similarity to proteins presumably transferred into viruses outside the poxvirus family. These have functions involving immune response and intracellular processes, and seem likely to have been transferred horizontally from hosts into poxviruses as well as into the families of other, non-poxvirus viruses. The functions of these proteins are presumed to be advantageous to the biology of viruses in all families where these proteins appear.
Proteins near the eukaryotic axis in Fig. 1a are only present in viruses of the family Poxviridae. The majority of these proteins are involved in the manipulation of intracellular processes, including redox state, protein signaling cascades, and lipid and carbohydrate metabolism, or in the manipulation of the extracellular environment. Some of these proteins are of unknown function. The fact that these eukaryotic-like proteins are found only among viruses in the poxvirus family may be informative about what cellular processes and signaling cascades are unique to poxvirus infections. Finally, there are many proteins that are seemingly unique to poxviruses, with no significant sequence similarity to known proteins among other viruses, eukaryotes or bacteria.
Together these results give us a picture of the many different subsets of proteins present in poxviruses, and allow us to draw some conclusions about each subset based on where else in nature proteins of these types appear. Investigation of the similarities and origins of particular proteins may yield further insights into poxvirus evolution and pathogenesis. For example, the fact that the poxvirus versions of universally highly conserved enzymes such as RNR have significantly more sequence similarity to RNR of eukaryotes than to those of bacteria or other viruses may imply a need for interoperability of the poxvirus enzymes with host proteins. Another example is the presence of different clusters of poxvirus TK sequences, where TK encoded by entomopoxviruses and avipoxviruses cluster together on the plot in a different location from the cluster of TK proteins encoded by poxviruses in other genera, agreeing with previously published suggestions that the TK enzymes of avipoxvirus, entomopoxvirus and the other chordopoxvirus genera may have different origins (Bratke and McLysaght, 2008; Koonin and Senkevich, 1992).
Finally, by using taxonomic group plots to study the proteome of cowpox virus, we show that the most host-like genes tend to lie at the ends of the linear genome and have the most limited distributions among poxvirus species.
More explicit conclusions about individual proteins, including gene origins, relationships to proteins of other organisms, and details of potential horizontal gene transfer events, will require additional, more extensive analyses at the level of each individual gene. Such investigations will require phylogenetic reconstruction of individual protein families utilizing sequences obtained from accurate annotations of poxvirus genomes, with particular attention to providing an accurate gene prediction for each genome and to the presence or absence of particular genes in each genome.
In conclusion, using taxonomic group plots to analyze proteins of poxviruses confirms the presence of many eukaryotic-like proteins in the genomes of poxvirus species, underscoring the importance of host gene capture in the evolution of these viruses. These results also provide an overview of the functional significance of many of the genes poxviruses share with their hosts, and expose which host genes are captured uniquely by poxviruses and which are captured by other virus families as well. Information yielded by more comprehensive phylogenetic comparison of poxvirus genes with genes of their hosts and other viruses will illustrate details of molecular mechanisms of poxvirus adaptation and survival throughout the history of the virus family, giving a richer picture of the evolution of this once devastating and still dangerous group of viral pathogens.
We would like to thank the staff of the Viral Bioinformatics Research Center (www.vbrc.org) for invaluable contributions, support and guidance. This work was supported by NIH/NIAID Contract No. HHSN266200400036C to EJL.