Retroviral and related endogenous retroviral sequences (ERVs) are integral parts of most eukaryotic genomes, sometimes constituting over 50% of them [1]. Their ability to transpose and transfer horizontally [2] confers genetic flexibility to complex genomes like those of humans [4], chimpanzees [5], other primates and vertebrates.
The origin of retroviruses is lost in a prebiotic mist. Assuming a 0.2% neutral substitution rate per million years [6] and a 50% divergence limit for nucleotide sequence recognition, retroviral sequences older than 250 million years (50% divided by 0.2% per million years) cannot be found in current genomes. If any of their genes are selected for, they may stay recognizable longer. Thus, although the ERV record has limitations, the reconstruction of retrovirus evolution differs fundamentally from that of other viruses, owing to the ERVs preserved in the ever richer archive of genomic assemblies. According to the VIIth ICTV report [7], Retroviridae borders on Pararetroviridae (e.g. Hepatitis B), Metaviridae (Gypsy-like) and Pseudoviridae (Copia-like). Together with the even more distant relatives Mal-R [8] and DIRS [9] retrotransposons and chromoviruses [10], not included here, they show that retroviruses are part of a vast retrotransposon sequence universe. In this work, we concentrated on retroviruses. An ancestral retrovirus likely had structural traits which at present are common denominators of the diverse related sequences. Although some structural traits may be absent in individual viruses, readily identifiable common denominators are 5'LTR, PBS, Gag (MA, CA and NC), Pro, Pol, Env, PPT and 3'LTR [11]. The most universal trait is the pol gene, with its reverse transcriptase (RT), RNase H and integrase (IN). Other conserved but distinguishing traits whose use in phylogenetic inference and retroviral classification is discussed here are: nucleotide bias, number of zinc fingers, translational strategy, C-terminal Pro and Pol motifs, presence of dUTPase and accessory genes, and LTR length. Env is an unreliable evolutionary marker, as exemplified by the hybrid betaretroviral MPMV [11], but can be useful in narrow phylogenies to demarcate a specific group.
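The age limit for recognizing neutrally evolving retroviral sequences, stated earlier in this section, follows from a simple division of the divergence limit by the substitution rate. The sketch below reproduces that arithmetic; the function name `detection_horizon_my` is hypothetical, and the linear model (no correction for multiple substitutions at the same site) is an assumption made to match the figures quoted in the text.

```python
def detection_horizon_my(rate_pct_per_my: float, divergence_limit_pct: float) -> float:
    """Age in million years at which neutral divergence reaches the recognition limit.

    Assumes divergence accumulates linearly with time, with no correction
    for repeated substitutions at the same site (a simplifying assumption).
    """
    if rate_pct_per_my <= 0:
        raise ValueError("substitution rate must be positive")
    return divergence_limit_pct / rate_pct_per_my


# Figures quoted in the text: 0.2% per million years, 50% divergence limit.
horizon = detection_horizon_my(rate_pct_per_my=0.2, divergence_limit_pct=50.0)
print(f"Detection horizon: about {horizon:.0f} million years")
```

Sequences under selection would diverge more slowly than the neutral rate used here, which is why selected genes may remain recognizable beyond this horizon.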
Retroviral taxonomy has traditionally been based on observed phenotypic qualities of exogenous retroviruses (XRVs) [7]. Classification using ERVs, for which phenotypic information is almost completely lacking, necessitates an approach based on nucleotide sequence analysis. Seven retroviral genera have been described (alpha-, beta-, gamma-, delta-, epsilon-, lenti- and spuma-like retroviruses) using sequence similarities, mainly in the Pol RT region. Although much work remains before all ERVs are fully characterized, ERVs have also been divided into loosely defined classes, originally based on HERVs [12]. When the RT region is analyzed, the gammaretroviruses cluster as class I and the betaretroviruses as class II elements [12]. The spuma- and spumalike elements group within class III [14]. Lenti- and deltaretroviruses have no known endogenous counterparts [15]. This was also the case in our computerized genomewide screenings (see below).
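The correspondence between retroviral genera and ERV classes described in this paragraph can be written as a small lookup table. This is only a sketch of the relationships stated in the text; the names `ERV_CLASS_BY_GENUS`, `GENERA_WITHOUT_KNOWN_ERVS` and `erv_class` are hypothetical, and genera for which this passage gives no class assignment (alpha and epsilon) are deliberately left out.

```python
from typing import Optional

# Class assignments as stated in the text (based on clustering in the RT region).
ERV_CLASS_BY_GENUS = {
    "gamma": "class I",
    "beta": "class II",
    "spuma": "class III",
}

# Genera reported in the text to have no known endogenous counterparts.
GENERA_WITHOUT_KNOWN_ERVS = {"lenti", "delta"}


def erv_class(genus: str) -> Optional[str]:
    """Return the ERV class for a genus, or None if no endogenous members are known.

    Raises KeyError for genera that this passage does not assign to a class.
    """
    key = genus.strip().lower()
    if key in GENERA_WITHOUT_KNOWN_ERVS:
        return None
    return ERV_CLASS_BY_GENUS[key]


for name in ("gamma", "beta", "spuma", "lenti"):
    print(name, "->", erv_class(name))
```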
ERV classification and grouping was originally based on sequence similarity between the proviral PBS and the host tRNA [11]. This classification has proved useful for some ERVs, e.g. HERV-E [16], and for most of HERV-H [17]. However, it is inconsistent for many other ERV groups that have alternative PBSes [18], e.g. HERV-H/F [17], ERV3 [16], and ERV9/HERV-W [19]. We did not extend these analyses here.
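Although PBS-based grouping was not pursued in this work, the comparison it rests on is easy to illustrate: the PBS is expected to be complementary to the 3' end of the priming tRNA. The sketch below scores that complementarity position by position. The function names and both sequences are invented for illustration (they are not taken from any real provirus or tRNA), and an ungapped, equal-length comparison is assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA string written 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def pbs_match_fraction(pbs: str, trna_3prime: str) -> float:
    """Fraction of PBS positions that would pair with the tRNA 3' end.

    Both sequences are given 5'->3' in DNA letters. An ungapped comparison
    of equal-length sequences is assumed.
    """
    expected = reverse_complement(trna_3prime)
    if len(expected) != len(pbs):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(pbs.upper(), expected))
    return matches / len(pbs)


# Invented toy sequences, for illustration only.
toy_trna_3prime = "AGCTTGGACCA"
intact_pbs = "TGGTCCAAGCT"    # perfectly complementary to the toy tRNA end
mutated_pbs = "TGGTCCAAGTT"   # one substitution relative to intact_pbs

print(pbs_match_fraction(intact_pbs, toy_trna_3prime))
print(pbs_match_fraction(mutated_pbs, toy_trna_3prime))
```

A provirus whose PBS scores comparably against several tRNAs, or poorly against all of them, is the kind of case in which PBS-based naming becomes inconsistent.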
In several papers ([17]; Jern et al., submitted), we have used Pol similarity for ERV classification. Pol is highly conserved, and its large size (800–1100 aa) provides adequate information for a relatively detailed classification. This is facilitated by the program RetroTector© [Sperber G.O. et al., in preparation], which reconstructs probable Pol proteins ("puteins") from different reading frames in the often damaged gene candidates. Puteins are favored over nucleotide sequences because they are more conserved and easier to align, and therefore allow phylogenetic inference and taxonomy over greater evolutionary distances. This is further discussed in the Methods and Results sections of this paper. A number of reliable distinguishing features must be defined to enable a durable retroviral taxonomy which can encompass the many new ERVs and XRVs, and to trace their evolution. In this study, we compared phylogenetic trees, based on Pol similarity, with distinct structural features of possible use as taxonomic and phylogenetic markers.
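To illustrate why a putein may have to be assembled from more than one reading frame, the sketch below translates a short sequence in all three forward frames before and after a single-base insertion. It is a naive illustration using the standard genetic code, not RetroTector's algorithm; the function names and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward frame; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame(dna: str) -> list[str]:
    """Translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


intact = "ATGGCCAAAGGGTTT"                # toy open reading frame
damaged = intact[:6] + "T" + intact[6:]   # single-base insertion (frameshift)

print(three_frame(intact))
print(three_frame(damaged))
```

In the damaged sequence, frame 0 runs into a stop codon right after the insertion, while the downstream residues of the intact protein reappear in frame 1. Joining such frame-specific segments is the idea behind reconstructing a putein from a damaged gene candidate.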