Endogenous retroviruses (ERVs) are copies or remnants of exogenous viruses derived from past infections of germ cells and subsequent integration into the host genome (7, 8). Most ERVs are defective, having accumulated random inactivating mutations, and are therefore not pathogenic. However, many intact ERVs have been associated with different host diseases (4, 21). Those from pigs are considered potentially hazardous during xenotransplantation due to reactivation or recombination with exogenous retroviruses (18). ERVs have been studied extensively in mammals and birds (9, 10, 16), while knowledge of ERVs in reptiles is limited to a few host species (17, 24). Studies of a few full-length and partial pol gene sequences of some crocodilian species (order Crocodylia) have shown that some ERV sequences clustered distantly from Spumavirus while others clustered closely with Gammaretrovirus-like retroviruses (16, 17). However, their diversity, lineage specificity, and functionality have not been assessed across all extant crocodilian species.
The extant crocodilian lineage consists of 23 species classified in three families, the Alligatoridae (alligators), Crocodylidae (crocodiles), and Gavialidae (gharials). The Crocodylidae and Alligatoridae families are unambiguously recognized, whereas gharials are either grouped within the Crocodylidae or assigned to a separate family, the Gavialidae (6, 12, 19). It has been estimated, based on DNA and amino acid data, that the Alligatoridae and Crocodylidae lineages diverged from a common ancestor about 97 to 103 million years ago (12, 19).
Here we analyze the distribution, potential to function, and phylogenetic relationships of ERVs in 20 extant crocodilian species by studying the protease-reverse transcriptase (pro-pol) gene. The pro-pol gene (0.8 to 1 kb) was amplified using two sets of degenerate primers (23). The PCR amplicons were gel purified and cloned using standard protocols, and about three ERV inserts were sequenced from each species by the sequencing service at the Australian Genome Research Facility. Sixty-five ERV DNA sequences were generated and translated to putative amino acid sequences using the universal genetic code in the Molecular Toolkit interface (http://arbl.cvmbs.colostate.edu/molkit/index.html). Open reading frames and levels of similarity to other available ERV sequences were determined using the blastx tool available through NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi). Two aligned datasets were generated using the CLUSTALW program (22); the first consisted of 286 predicted amino acids from 65 novel and 9 published (16, 17) crocodilian ERVs (CERVs), and the second comprised 258 predicted amino acids from the novel crocodilian sequences and 73 published endogenous and exogenous retroviral sequences after exclusion of gaps. These known retroviruses included avian leukosis virus (Alpharetrovirus); murine leukemia virus (Gammaretrovirus); mouse mammary tumor virus and Jaagsiekte sheep retrovirus (Betaretrovirus); bovine leukemia virus and human T-lymphotropic virus (Deltaretrovirus); human immunodeficiency virus type 1, equine infectious anemia virus, and visna virus (Lentivirus); human foamy virus (Spumavirus); walleye dermal sarcoma virus (Epsilonretrovirus); 4 avian, 3 reptilian, 2 amphibian, and 14 mammalian ERVs, including human ERVs (13, 14); 3 chicken ERVs (10); 6 avian viruses similar to the alpha/beta group (7); and 22 Gammaretrovirus-like ERVs (9, 16). The best-fit model for the two datasets (JTT matrix model with parameter α = 1.61) was selected using ProtTest software (1) and used to construct neighbor-joining (NJ) trees in the MEGA 4 program (20). In addition, a maximum parsimony analysis was performed. Levels of amino acid similarity between CERV and other ERV sequences were also ascertained in the MEGA 4 program, using the p-distance option.
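The p-distance referred to above is the proportion of aligned sites at which two sequences differ, and percent similarity follows as 100 × (1 − p). The following Python sketch illustrates the calculation on a toy alignment after removal of every column containing a gap. It is not the MEGA 4 implementation; the function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol in the aligned input


def strip_gap_columns(aligned):
    """Drop every alignment column in which at least one sequence has a gap."""
    names = list(aligned)
    seqs = [aligned[name] for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    keep = [i for i in range(len(seqs[0])) if all(s[i] != GAP for s in seqs)]
    return {name: "".join(s[i] for i in keep) for name, s in zip(names, seqs)}


def p_distance(a, b):
    """Proportion of sites at which two equal-length, gap-free sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_p_distances(aligned):
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    clean = strip_gap_columns(aligned)
    return {(m, n): p_distance(clean[m], clean[n])
            for m, n in combinations(clean, 2)}


# Toy alignment (hypothetical sequences, not data from this study).
toy = {
    "seq_a": "MKT-LVAG",
    "seq_b": "MKTALVSG",
    "seq_c": "MRTALIAG",
}
distances = pairwise_p_distances(toy)
similarity = {pair: 100.0 * (1.0 - d) for pair, d in distances.items()}
```

In the toy alignment the fourth column is removed because seq_a carries a gap there, leaving seven comparable sites per pair.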
ERVs were found in all crocodilian lineages examined (Fig. ). All 65 sequences show deletions, and 28 of these contain in-frame stop codons (Fig. ). These mutations indicate that all CERVs generated in this study are defective and, therefore, nonfunctional, as has been observed in other vertebrate hosts (2, 3, 11, 15). Although only nonfunctional sequences were identified in members of the Alligatoridae and Crocodylidae, our study does not exclude the possibility of functional ERVs in these lineages.
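Screening a nucleotide sequence for in-frame stop codons can be illustrated with a short Python sketch that translates each forward reading frame with the standard genetic code and reports where stops occur. This is an illustrative approximation of the screening step, not the tool used in this study; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one forward frame (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def stop_positions(protein):
    """0-based positions of stop codons in a translated sequence."""
    return [i for i, aa in enumerate(protein) if aa == "*"]


def frames_without_stops(dna):
    """Forward frames whose translation contains no stop codon."""
    return [f for f in range(3) if not stop_positions(translate(dna, f))]


# Hypothetical 15-nt example, not a sequence from this study.
example = "ATGGCCTAAGGATTT"
frame0 = translate(example, 0)
stops = stop_positions(frame0)
open_frames = frames_without_stops(example)
```

For an internal gene fragment such as a pro-pol amplicon, any stop codon in the expected frame interrupts the open reading frame, which is why every stop is reported rather than only those preceding the final codon.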
The NJ and maximum parsimony phylogenetic trees were consistent and showed that CERVs cluster into two distinct major clades (Fig. ), named CERV 1 and 2. CERV 1 consists of 44 sequences from 12 species of Crocodylidae, revealing for the first time the existence of a host lineage-specific ERV clade for crocodiles. CERV 2 consists of 30 sequences representing eight Alligatoridae species and the only Gavialidae species. Four sequences from three Crocodylidae species (Crocodylus niloticus, Crocodylus palustris, and Mecistops cataphractus) also clustered within CERV 2. Interestingly, sequences from both CERV 1 and 2 were found in a single Crocodylidae species (Crocodylus niloticus). Pairwise genetic distances show that the variation within CERV 1 (distance = 0.070 ± 0.006 [mean ± standard deviation]) is much lower than that within CERV 2 (distance = 0.459 ± 0.020). Analyses also show that all CERV 2 sequences are distinct, revealing additional diversity and new minor clades within this unusual clade (Fig. ) that have not been reported previously. In contrast, CERV 1 sequences show a high degree of similarity between species, and some of them are identical within species, including those from C. intermedius, C. siamensis, and C. moreletii.
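A within-clade comparison of this kind can be sketched in Python by averaging the pairwise distances between sequences assigned to the same clade. The sketch below reports the mean and the sample standard deviation over pairs; this is an assumption about how the spread is summarized, and MEGA may estimate it differently. All names and distance values are hypothetical toy data, not results from this study.

```python
from statistics import mean, stdev


def within_group_summary(distances, group_of):
    """Mean and sample SD of pairwise distances within each group.

    distances: {frozenset({name1, name2}): distance}
    group_of:  {name: group label}
    Pairs whose members belong to different groups are ignored.
    """
    by_group = {}
    for pair, d in distances.items():
        a, b = tuple(pair)
        if group_of[a] == group_of[b]:
            by_group.setdefault(group_of[a], []).append(d)
    return {group: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
            for group, vals in by_group.items()}


# Hypothetical toy distances for two groups of three sequences each.
toy_group_of = {"s1": "A", "s2": "A", "s3": "A",
                "t1": "B", "t2": "B", "t3": "B"}
toy_distances = {
    frozenset({"s1", "s2"}): 0.05,
    frozenset({"s1", "s3"}): 0.07,
    frozenset({"s2", "s3"}): 0.09,
    frozenset({"t1", "t2"}): 0.40,
    frozenset({"t1", "t3"}): 0.45,
    frozenset({"t2", "t3"}): 0.50,
    frozenset({"s1", "t1"}): 0.60,  # between-group pair, ignored
}
summary = within_group_summary(toy_distances, toy_group_of)
```

Keying the distances by unordered pairs avoids storing both (a, b) and (b, a) for a symmetric measure.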
Analysis of phylogenetic relationships with known ERVs showed that CERV 1 clusters closely with two clades of Gammaretrovirus (birds and reptiles/mammals/amphibians/human), showing polytomy and long branch lengths with respect to each other (Fig. ). Clade CERV 2 clustered distantly from all known ERVs (Fig. ). These relationships are also supported by the amino acid similarity between a known representative of Gammaretrovirus, murine leukemia virus, and CERV clades. In agreement with the results of previous CERV studies (17), the amino acid similarity was lower for CERV 2 (24%) and higher for CERV 1 (47%) (Table 1).
TABLE 1. Percentages of similarity between CERV 1 and CERV 2 and exogenous members of the Retroviridae
The comparison of CERV and host species phylogenies (6, 12, 19) showed discordance. While crocodilian host phylogenies based on DNA sequence data are quite well defined, this was not the case within the CERV clades. Both CERV 1 and 2 showed a mixture of internal topologies from symmetric (bushlike), to random, to asymmetric with differential branch lengths, making it difficult to assess any coevolutionary patterns. Given that CERV 1 appeared to be specific to the Crocodylidae family and that its internal branch lengths were considerably shorter than those observed in the crocodile host phylogeny (5) and CERV 2, it is plausible that CERV 1 has a lower mutation rate and represents a relatively recent retroviral infection that occurred after the divergence of the Crocodylidae and Alligatoridae families from the common ancestor.
The current investigation has confirmed the existence of two groups of ERVs and revealed additional distribution and diversity among extant members of the Crocodylia. Interestingly, we found a host lineage-specific clade which could have potential for use in the identification of members of the Crocodylidae at the family level. The data generated here will assist future studies identifying orthologous and paralogous ERVs among crocodilian species to assess the variation, distribution, and taxonomy of these retroelements within crocodilian species and populations.