|Home | About | Journals | Submit | Contact Us | Français|
Endogenous retroviruses (ERVs) are the proviral phase of exogenous retroviruses that become integrated into a host germ line. They can play an important role in the host genome. Bioinformatic tools have been used to detect ERVs in several vertebrates, primarily primates and rodents. Less information is available regarding ERVs in other mammalian groups, and the source of this information is basically experimental. We analyzed the genome of the cow (Bos taurus) using three different methods. A BLAST-based method detected 928 possible ERVs, LTR_STRUC detected 4,487 elements flanked by long terminal repeats (LTRs), and Retrotector detected 9,698 ERVs. The ERVs were not homogeneously distributed across chromosomes; the number of ERVs was positively correlated with chromosomal size and negatively correlated with chromosomal GC content. The bovine ERVs (BoERVs) were classified into 24 putative families, with 20 of them not previously described. One of these new families, BoERV1, was the most abundant family and appeared to be specific to ruminants. An analysis of representatives of ERV families from rodents, primates, and ruminants showed a phylogenetic relationship following their hosts' relationships. This study demonstrates the importance of using multiple methods when trying to identify new ERVs and shows that the number of bovine ERV families is not as limited as previously thought.
Endogenous retroviruses (ERVs) are the proviral phase of exogenous retroviruses that were once inserted into a host germ line and have remained integrated into the host genome for generations. ERVs have been detected in all mammals and a wide range of other vertebrates. Their typical structure is composed of a central part with the three major genes (gag, pol, and env) flanked by two long terminal repeats (LTRs) that were identical when the retrovirus entered the host germ line (4).
The biological significance of retrotransposons, including ERVs, ranges from their contributions to mutation, development, and disease to their roles in gene and genome evolution. In humans, mice, and sheep, for example, an env gene of retroviral origin, acquired independently in the different species, is involved in placenta morphogenesis (11). It has been suggested that ERVs could be possible contributors to or markers of disease in experimental animals and, in recent years, in human diseases, although their role as the etiological agent remains to be established (11). The expression of ERVs in humans has been linked to poor prognosis in breast cancer (9) and the malignant transformation of melanoma cells (36) and may play a role in multiple sclerosis (28). In addition, ERV-mediated recombination events have had profound effects on shaping hosts' genomes, and new ERV integrations introduce added variation to the host transcriptomes (11).
At present, there is no well-established or accepted standard for naming and classifying ERVs. For human ERVs (HERVs), tRNA complementary to the primer binding site (PBS) has traditionally been used for this purpose. This classification, however, is inaccurate, as proviruses from the same phylogenetic groups may display differences in the PBS, while otherwise unrelated proviruses may use the same tRNA as a primer. A more useful strategy for classifying ERVs is phylogenetic and related comparisons (11). ERVs are now sorted into three classes depending on the phylogenetic relationship with the exogenous retrovirus genus: class I ERVs are related to the genera Gammaretrovirus and Epsilonretrovirus; class II ERVs are related to the genera Alpharetrovirus, Betaretrovirus, Deltaretrovirus, and Lentivirus; and class III ERVs are related to the genus Spumavirus (8).
The availability of whole-genome sequences has made possible genome-wide analyses for the detection of ERVs using bioinformatic tools. In the last years RepeatMasker (A. F. A. Smit, R. Hubley, and P. Green, personal communication), a program designed to identify repetitive sequences using the Repbase database (13), has been widely used to generate an overview of repetitive elements in whole-genome sequences, among them the ERVs. For mammals, other programs have also been used in order to identify ERVs. BLAST-based searches were first used for humans (40, 41) and rodents (2). A program called LTR_STRUC (22) was applied to the chimpanzee genome in combination with a BLAST-based search (29). A new program specifically designed for ERV detection, called Retrotector, has recently been reported (38).
The mammalian order Cetartiodactyla has become a major focus of attention in comparative genomics because it comprises a phylogenetically distant clade of eutherian mammals related to primates, which diverged from a common ancestor ~85 million years ago (18). Bos taurus is one of the world's most important food animal species and is also among the most biologically interesting due to the unique physiology of its digestive, reproductive, and immune systems. The unveiling of the cattle genome sequence in 2009 allowed the first comprehensive effort to catalogue the diversity of transposable elements in the cattle genome (5). Interspersed repeats cover 46.54% of the genome. Among these, non-LTR retrotransposon LINEs account for 23.29% of the genome, and SINEs account for 17.66% of it. LTR retrotransposons, which include ERVs, account for 3.20% of the genome (5).
The cattle genome has also been analyzed experimentally for ERV elements. A PCR-based approach (43) detected a number of bovine ERVs (BERVs [here BoERVs]), which were classified into four families, named β3, γ4, γ7, and γ9, on the basis of their similarity to ovine ERVs (OERVs). The structures and sequences of BERV-β3 and the abundant BERV-γ4 elements were also analyzed. Those studies suggested that the expansion of the ERV family was more limited in cows than it was in other artiodactyls such as pigs and sheep (42-44).
To detect ERVs in the cow genome, we used three different methods: BLAST-based searches using retroviral sequences, the LTR_STRUC program, and the Retrotector program. ERVs that were detected by at least two of the methods and whose reverse transcriptase (RT) region was longer than 500 nucleotides were used to define bovine ERV families. Finally, representatives from each bovine ERV family and from ERV families from other species were used to study the relationship between the ERVs of different species.
We analyzed the Btau_3.1 version of a Hereford cow (Bos taurus) genome (×7.1 coverage). It was retrieved from the Baylor College of Medicine (http://www.hgsc.bcm.tmc.edu/projects/bovine/index.html).
Three different strategies for the detection of ERVs were applied and compared. The first strategy was based on the similarity of sequences. Segments of amino acid sequences from gag and env (TM region) genes from 12 well-annotated exogenous retroviruses were used as a search query (Table (Table1).1). In the case of the pol gene (RT region), one piece for each retroviral class was used (from Moloney murine leukemia virus [MLV], Mason-Pfizer monkey virus [MPMV], and bovine foamy virus). In each chromosome, individual gene segments were searched by using tBLASTn implemented in the NCBI-BLAST 2.2.14 program (1). The results of this search were parsed by homemade scripts written in PHP. To build the possible ERV elements, the results for each gene were compared. Based on the 12 search query retroviruses, the distances between gene queries were calculated (Table (Table1).1). A region was considered to be a possible ERV if the distance between the gag and pol genes was 868 ± 217 bp, the distance between pol and env genes was 3,851 ± 799 bp, and the distance between the gag and env genes was 4,719 ± 1,016 bp. If at least two matches were within the limits of these distances, the region was defined as a possible ERV.
In the second strategy, LTR_STRUC 1.1 (22) was used to find LTR elements. LTR_STRUC scans the genomic sequence for the presence of similar regions of length typical for LTRs (LTR pairs) and within the expected size of full-length LTR ERVs. If putative LTRs are found, the program then searches for additional retrotransposon features, such as primer binding sites (PBSs), polypurine tracts (PPTs), and target site repeats (TSRs), and assigns a reliability score to the hit based on the presence or absence of each of these features. In this work the predefined parameters were used.
We also used Retrotector v.1.0 (38) to find possible ERV elements. Briefly, Retrotector recognizes consensus motifs and constructs putative ERV proteins (“puteins”) from the different reading frames in the gene candidates. The program uses codon statistics, frequency of stop codons, and alignment to known retrovirus proteins to approximate an original open reading frame (ORF). The predefined parameters were used, and the cutoff score value was set at 250.
We have used the name bovine endogenous retrovirus (BoERV) for the ERVs described in this work in order to avoid confusion with previous names (43).
The distribution of the detected elements was tested chromosome by chromosome, as proposed previously by Villesen et al. (41). The expected number of elements (based on chromosomal mean density and length) was compared with the observed number by means of the χ2 test and the G test (37), each with 1 degree of freedom: χ2 = Σ [(observed − expected)2/expected] and G = 2 Σ observed × ln(observed/expected).
Correlation analyses between the number of elements detected by each method and chromosome length, GC content, and gene density were performed by using R language (32).
Relationships between the detected elements were determined by a phylogenetic analysis of the RT region. The 247 sequences with a high degree of similarity in BLAST-based searches (RT region of >500 nucleotides) were used along with the 12 exogenous retroviruses used as a search query, 8 detected cow ERVs (BoERVs) (43), and 12 sheep ERVs (17). Retroviruses used in previous research, such as BoEV (GenBank accession number X99924), HERV-E (accession number M10976), porcine ERV (PERV) (accession number AJ293656), MPMV (accession number NC_001550), intracisternal A particle (IAPM) (accession number M17551), ovine maedi visna virus (OMVV) (accession number NC_001511), feline foamy virus (FeFV) (accession number U78765), murine ERV-like (MuERV-L) (accession number Y12713), and Drosophila endogenous element ZAM (accession number AJ000387) were used as an outgroup. The sequences were aligned by using the MAFFT 5.861 program (14) (FFT-NS-1 option) and cleaned with Gblocks 0.91 (6) (minimum length of block of 5, allowed gap position with half, minimum number of sequences for a flank position of 146, and maximum number of contiguous nonconserved positions of 10). Phylogenetic trees were built by using three different methods. One was the neighbor-joining (NJ) method implemented in MEGA 3.1 (19). It relies on p distance using the pairwise deletion option and 1,000 bootstrap replicates. A second method was the maximum likelihood (ML) method implemented in Phyml 2.4.4 (10). The model used in the analysis was the GTR+G model (α = 2.71), as estimated by Modeltest 3.7 (31) with 1,000 bootstrap replicates. The third method was the Bayesian inference method implemented in MrBayes 3.1 (35). Four default-setting Metropolis-coupled Markov chain Monte Carlo methods were performed in two runs for 106 generations with trees sampled every 100 generations. The analysis was set to use the GTR+I+G model. The first 2,500 trees were discarded in the burn-in, and a 50% majority-rule consensus tree was computed from the remaining trees.
The bovine ERV putative families that we detected were defined based on the support of phylogenetic trees. A cluster was considered a putative family when the clustering was significant in at least two of the phylogenetic methods (bootstrap values of >70 for neighbor joining and maximum likelihood and Bayesian posterior probability of >95 for Bayesian inference). In order to confirm the families with a solitary member, MegaBlast (45) searches were carried out by using the solitary sequences as a query.
To elucidate the insertion time of elements classified into families, we used the divergence of LTRs estimated by LTR_STRUC and Retrotector. The dates of ERV insertion can be estimated mainly by the LTR comparison and the individual divergence relative to a consensus sequence. However, there are many difficulties in obtaining an accurate consensus sequence, especially for short insertions, so, as was done by many other authors, the LTR comparison was chosen for estimates of insertion dates. These divergence figures were then corrected to account for the presence of multiple mutations at the same site, back mutations, and convergent substitutions by using the Kimura two-parameter model (16). Nine elements with highly divergent LTRs were not included in this analysis. The insertion time of each element was estimated by applying a substitution rate of 2.3 × 10−9 to 5.0 × 10−9 to the divergence (12).
The representative members of a BoERV family were chosen as the closest element to the consensus sequence of the family or, when consensus was not possible, to the element with fewer stop codons. To build the consensus sequence, the amino acid sequences of the members of each family were aligned by using ClustalW (39), and consensus was determined by using the cons program of the EMBOSS suite (33). The distances between individual sequences and the consensus were calculated by using MEGA 3.1 (number of differences and pairwise deletion options).
In order to amplify BoERV1 in sheep, PCR was performed with the primers 5′-TGTGCTGAGACAGAGGAAGC-3′ (forward) and 5′-CCTATGGCCCTAGTCCCTTC-3′ (reverse) in six samples of Latxa breed sheep. PCR conditions consisted of 5 min at 94°C followed by 25 cycles of a 55°C annealing step for 30 s, polymerization at 72°C for 30 s, denaturation at 94°C for 30 s, and one final cycle at 72°C for 7 min. Reaction conditions were as follows: 13.1 μl of water, 2 μl of buffer, 0.8 μl of MgCl2, 0.3 μl of deoxynucleoside triphosphate (dNTP), 0.3 μl of each primer, 0.2 μl of Taq, and 3 μl of DNA.
The relationships of the BoERV groups with other species' ERV groups were analyzed by using a phylogenetic tree. We used 16 representative sequences from human ERV families (40), 38 from chimpanzees (29), 27 from mice (2, 23), 7 from rats (2), 14 from sheep (17), 8 from pigs (27), and 24 from cows (our data). The sequences were aligned by using MAFFT (14) (linsi option). Positions with gaps in more than half of the sequences were eliminated. The tree was constructed by Bayesian inference implemented in MrBayes 3.1 (35) (106 generations; RtREV matrix+G+I).
To estimate the insertion time of the human HERV-S71 element and the chimpanzee CERV3 element, we used the LTR divergence as described previously by Johnson and Coffin (12). The LTRs of each element were aligned with ClustalW, and the distance was calculated by using MEGA 3.1 (K2P substitution model). The insertion time was estimated by applying a substitution rate of 2.3 × 10−9 to 5.0 × 10−9 (12).
In the cow genome, the BLAST-based search detected 928 ERVs, LTR_STRUC identified 4,487 elements flanked with LTRs, and Retrotector detected 9,698 possible ERVs (Table (Table2).2). Only 172 elements were detected by all three methods (Fig. (Fig.1).1). A total of 739 of the elements detected by the BLAST-based search were also detected by Retrotector. Retrotector identified 8,183 elements that were not detected by either the BLAST-based search or LTR_STRUC (Fig. (Fig.1).1). The elements detected by the three methods were those that were best preserved from a structural standpoint; e.g., in 81% of the elements detected, Retrotector detected motifs from the NC region of the gag gene, and in 83% of the elements, it detected motifs from the RT region of the pol gene. However, the elements detected by a single program and LTR_STRUC and Retrotector together did not provide evidence for these motifs in most cases (35% and 54%, respectively).
The gene compositions of the detected elements also differed from method to method (Table (Table3).3). The elements detected by Retrotector included the three major genes. Most of the BLAST-based search elements included two genes. Surprisingly, most of the elements detected by LTR_STRUC did not include the pol gene. Based on the Retrotector results, 78 regions encompassed either a gag or an env ORF longer than 500 codons or a pol ORF longer than 700 codons (which approached the size of intact viral proteins).
The distribution of the BoERVs was not concordant across the three methods (Table (Table2).2). In the BLAST-based search, significantly more elements were detected on chromosomes 18, 28, and X than would be expected for a homogeneous distribution, while significantly fewer elements were detected on chromosomes 14 and 20. The number of elements that LTR_STRUC detected in chromosomes 1, 2, and 3 was significantly higher than would be expected for a homogeneous distribution, while significantly fewer elements were detected in chromosomes 17, 18, 19, and 22. Retrotector identified significantly more elements than expected in chromosomes 1, 2, 3, 6, 9, and X and fewer elements than expected in chromosomes 13, 18, 19, 22, 23, 24, and 25. In these analyses, the significance levels were similar by the χ2 and G tests. Using the G test, the overall distribution of BoERVs in the entire genome was not significantly homogeneous for the three methods.
The number of ERVs detected was strongly and positively correlated with the chromosome size by all three methods (Spearman's ρ = 0.7677 and P < 0.001 by BLAST, ρ = 0.9720 and P < 0.001 by LTR_STRUC, and ρ = 0.9735 and P < 0.001 by Retrotector) and inversely correlated with the GC content (ρ = −0.4968 and P < 0.001 by BLAST, ρ = −0.7066 and P < 0.001 by LTR_STRUC, and ρ = −0.6979 and P < 0.001 by Retrotector). No significant correlation was observed between the number of ERVs detected and chromosomal gene density.
We conducted a chromosome-by-chromosome search for solo LTRs based on Retrotector results (data not shown). Based on this analysis, the average rate of solo LTRs/ERVs was 6.06.
A phylogenetic tree based on the well-conserved pol gene RT region of selected BoERVs (detected with at least two methods and having an RT region with >500 nucleotides) with other endogenous and exogenous retroviruses showed that most of the elements were related to class I or class II outgroup elements. Thus, they can be classified as such by homology. No class III-related elements were observed (Fig. (Fig.2).2). Based on this tree, the BoERV elements were classified into 24 families (BoERV1 to BoERV24) according to the tree topology and the statistical support of the clustering relationships (Fig. (Fig.22 and Table Table4).4). Overall, we defined 18 families related to class I ERVs and 6 related to class II ERVs. Some of these groups were also related to ovine ERVs. It proved possible to classify all 24 families into three groups based on the number of ERVs included (Table (Table4).4). BoERV1 (82 clustered elements), BoERV3 (66 elements), and BoERV18 (16 elements) in class I and BoERV24 (22 clustered elements) in class II were the most abundant families. A second group of families had between 4 and 10 elements. In the third group, eight elements were isolated and had no significant relationship with any other element.
Most of the BoERVs included in the phylogenetic analysis were related to class I elements and were classified into 18 families. The average length of class I-related families was between 8,209 bases (BoERV18 family) and 15,219 bases (BoERV16). The longest ERV was from the BoERV16 family (23,004 bases), due to a duplication of the pol and env retroviral genes. The shortest ERV was from BoERV12 (7,058). Apart from the examples described above, the duplication of gag and env genes was observed for other ERVs. To rule out the presence of other genes, a search was made for ORFs in the longest elements. No different gag, pol, or env ORFs were detected. Thus, it is very likely that the presence of these elements was due to assembly errors. The remaining ERVs showed a typical length of between 7 and 12 kb (4).
Among the most abundant families, the BoERV1 family was quite heterogeneous. In contrast, the elements of the BoERV3 and BoERV18 families were similarly homogeneous. The BoERV3 family was related to previously described cow BERV-γ4 and sheep OERV-γ4 elements, and BoERV18 was related to the ovine OERV-γ9 element.
The representative element of the BoERV1 and BoERV3 families had the LPQG and YVDD motifs (Fig. (Fig.3).3). In the case of the BoERV18 family, the first motif was PPQG.
The remaining groups were represented by few or solitary elements. In the cases of BoERV2 and BoERV13, more related sequences were obtained in the MegaBlast analysis. They had not been included in the phylogenetic tree because they did not fulfill the conditions previously established.
The divergence level varied from family to family: the elements of BoERV8, BoERV9, and BoERV10 were less divergent than the elements of BoERV5, BoERV7, and BoERV15. BoERV7 was related to the previously detected cow BERV-γ7 and ovine OERV-γ7 elements, and BoERV16 was related to a previously detected BoERV (GenBank accession number X99924) and BERV-γ9 bovine elements. In BoERV12 and BoERV15, the two motifs were conserved. In the representative element of BoERV9, the YVDD motif was present. However, the representative elements of the BoERV2, BoERV4, BoERV5, BoERV6, BoERV10, BoERV11, BoERV16, and BoERV17 groups did not keep at least one of the two characteristic motifs.
The representative BoERVs from families with one or two members had different conservation levels of their functional motifs. BoERV2 had a deletion in the YVDD motif, and BoERV4, BoERV13, BoERV14, and BoERV17 had an amino acid change. Neither of the motifs was conserved in BoERV6 and BoERV11.
BoERV1 could be the oldest of the class I-related families because it contained an ERV that was inserted between 126 and 58 million years ago (MYA) based on the LTR divergence. This family also had the most recent insertion activity, since the youngest member was inserted recently. In sheep samples, we were able to amplify a BoERV1 conserved sequence from the RT region with a length of 150 bases (data not shown). The BoERV7 family could be the youngest, inserted between 19 and 9 MYA. Due to the uncertainty of the age estimates of the ERV sequences, which were based on the comparison of the LTRs of the elements, these values are only a rough estimate of the insertion time. Although different evolution rates and a correction were applied, the effect of recombination or gene conversion events, leading to the homogenization of the 5′ and 3′ LTRs, must be taken into account, and in this sense, the divergence times calculated in our study may have been underestimated.
The class II elements were grouped into six families, ranging in average length from 8,881 bases to 11,077 bases. After the MegaBlast analysis, BoERV22 consisted of more than one element. Some of these elements had not been included in the phylogenetic tree because they did not fulfill the conditions previously established.
The longest ERV was from BoERV24 (25,030 bases), and the shortest was from BoERV21 (6,283 bases). As with the longest class I-related BoERV, the longest BoERV in the class II-related families contained all genes duplicated.
Within-group divergence varied. The elements of BoERV23 were more divergent than those of BoERV19. The BoERV24 group contained some tightly related elements and some highly divergent ones. The BoERV24 family was related to the previously detected bovine BERV-β3 element. Moreover, we did not detect class II bovine ERVs related to endogenous Jaagsiekte sheep retroviruses (enJSRVs) in this genome version.
With one exception, the representative elements of the class II-related families conserved the LPQG and YMDD motifs. In the BoERV23 family, the LPQG motif was replaced by QPQG, and YMDD was replaced by YLDG.
The oldest class II-related family could be BoERV24, whose oldest ERV was inserted between 56 and 26 MYA. The newest family could be BoERV19 (between 27 and 13 MYA). In addition, a member of this family could be the youngest element, with a recent insertion.
In the phylogenetic tree for the ERVs from cow, sheep, pig, human, chimpanzee, mouse, and rat, ERV elements were grouped into three classes. Within each class, the representative ERVs clustered following the relationships with their host genomes. There was a close relationship between ERVs from cow and sheep, which clustered together in four phylogenetic lineages (Fig. (Fig.44).
The representatives of the different classes were grouped into polytomic nodes. However, there were clear relationships between human/chimpanzee families (HERV-I/-ADP and CERV20/21/22/23/24/25), human/chimpanzee/bovine families (ERV9/HERV-W, CERV15/16/17/18/29, and BoERV1/2), chimpanzee/bovine families (CERV19 and BoERV11), and human/chimpanzee/mouse families (HERV-L, CERV42, and MuERV-L/Mmr20). There were also suggestions of bovine/ovine relationships (BoERV15/16/17/18 and OERV-G9) and human/chimpanzee/porcine/bovine relationships (RRHERV-I/HERV-E, CERV4/5/6/7, PERV-g4, and BoERV12/13/14) (Fig. (Fig.44).
Surprisingly, one phylogenetic lineage contained elements from human, chimpanzee, mouse, pig, and sheep species but not from cows. This lineage was studied in depth, and the insertion time was estimated by using LTR divergence: the human HERV-S71 element was inserted between 19.5 and 8.9 MYA, and the chimpanzee CERV3 element was inserted between 33 and 15.8 MYA.
This study describes an attempt to systematically identify and characterize endogenous retroviruses in the cow genome. Although we used only located genomic information, leaving contigs untested, in this study we identified nearly 10,000 putative BoERVs that were distributed in a nonhomogeneous way across chromosomes. By comparing three different methods for ERV detection, we found that each method yields different and, in some cases, discordant information.
The BLAST-based search detected the fewest elements (928 elements), most of which were also detected by Retrotector. As the criteria used in the BLAST-based search were quite strict, the elements that it detected could be considered to be highly conserved ERVs.
LTR_STRUC detected 4,487 elements. It identified more elements without the RT region than did BLAST and Retrotector. It also detected many elements that were not identified by BLAST and Retrotector. Because LTR_STRUC is designed to find elements flanked by LTRs, it may be able to detect elements with a noncanonical structure (22).
Retrotector detected the most possible BoERVs (9,698) and had the most overlapping detections. In most of the elements detected by Retrotector, all three main genes were identified. It is thus clear that it is more efficient than BLAST-based detection and able to detect elements that are not as highly conserved (38).
Comparison of different genomes is problematic because various methods have been used to detect ERVs. In previous studies of human (20), mouse (26), rat (7), dog (21), cat (30), and cow (5), RepeatMasker and Repbase were used to detect repetitive elements. However, as stated previously by Sperber et al. (38), results from RepeatMasker and Retrotector cannot be directly compared because the RepeatMasker output is difficult to organize into proviruses. In addition, Retrotector rarely detects elements less than 1,000 bp long, and RepeatMasker can detect much shorter repeats and single LTRs. Moreover, the secondary integration of proviruses into each other, a feature of old elements, can also be a problem (38).
In a previous study of the cow genome, 142,096 ERVs were detected with PALS/PILER (5), while we identified 928 with BLAST, 4,487 with LTR_STRUC, and 9,698 with Retrotector. The genome coverage of the elements detected by the different programs was also discordant: 1.75% of the genome by PALS/PILER, 0.36% by BLAST, 1.77% by LTR_STRUC, and 4.29% by Retrotector. These data suggest that the coverage is similar or greater with fewer elements. Thus, the abundance of short elements by methods such as RepeatMasker and PALS/PILER make cross-species comparisons difficult. In addition, the classification of the elements detected by the different programs adds complexity to the comparison: RepeatMasker uses the Repbase annotation (13; Smit et al., personal communication), and Retrotector uses its own motif database (38). Thus, we found that RepeatMasker and Retrotector did not routinely sort the same element into the same class. For example, among ERVs classified as class I by the Retrotector method, 64.72% were classified as ERV1 and 35.28% were classified as ERVL by RepeatMasker.
One explanation for the different distributions of ERVs across bovine chromosomes could lie in the target elements employed by the methods used to identify ERVs. In the analysis of the chromosomal distribution of the elements detected, the various methods showed different chromosomes that did not follow any homogeneous distribution. The nature of the elements detected by each method could be a good reason for this discrepancy.
Across chromosomes, the BLAST-based and Retrotector methods identified significantly more ERVs in the X chromosome than would be expected from a homogeneous distribution. A similar excess of ERVs has been observed for the human X chromosome (41).
The number of ERVs detected was positively correlated with chromosome length (P < 0.001 for all three methods) and negatively correlated with the GC content of the chromosome (P < 0.001 for all three methods). No correlation was observed between the number of ERVs detected and gene or pseudogene density. In humans, the number of class I and class III ERVs—but not the number of class II ERVs—has been negatively correlated with GC content (24). The insertion preferences of ERVs in the cow genome should be analyzed in greater detail to gain a better understanding of the preferences of bovine ERVs.
Phylogenetic analysis based on the RT region of a number of selected elements was used to cluster these elements into 24 putative families, which we called BoERV families. Previously, 4 retroviral families were detected (43), which are included in the 24 families that we identified. Although it wad previously suggested that the BERV-γ4 family, referred to here as BoERV3, was the most abundant (43), we found that BoERV1 was actually the most abundant. This family had not previously been identified in any mammal. One possible explanation for this is that the members of this family have some nucleotide differences in the region where hybridization took place with the primers used for pig, sheep, and cow (43). We used PCR to amplify a 150-base sequence in sheep, so it is possible that BoERV1 could be a ruminant-specific ERV family.
The comparison of ERV family numbers was limited to four species with defined families (human, chimpanzee, mouse, and rat). In cows, the number of families (24 putative families) was higher than that for mouse (20 families) (23) and lower than those for chimpanzee (42 families) (29) and human (31 families) (15). In the case of rodents, where information is available only for class II elements in two species, the number of families in cow (six families) was similar to those in rat and mouse (seven families) (2). To the best of our knowledge, no information is currently available on dog and cat ERV families.
We did not detect any class III-related ERVs. Although this could be an artifact due to the distance from the reference sequences used for the BLAST-based search and the limits of class III element detection by Retrotector (38), it is more likely because the presence of class III ERVs in the cow genome is limited. In fact, although a number of sequences related to class III were amplified previously by Bénit et al. (3), the amplification signal was weak, and these sequences were quite short.
The relationship between representatives of the ERV families from different species is interesting. In general, the lineages of the different ERV groups are divided following the species phylogeny, with humans and chimpanzees on one side and cows, sheep, and pigs on the other side. Representative elements of the scarce murine class I families were included in our analysis, but their relationship with representative elements of the ERV families of other species remains obscure. Even so, representative elements of the human/chimpanzee groups and, to a lesser extent, mouse/rat and pig/sheep groups tend to follow the pattern of previously reported comparisons of each pair (2, 17, 29). Following this pattern, the representative bovine elements cluster with the representative sheep elements, as obtained by experiments (17) with most of the lineages. In some cow breeds, ovine enJSRV-related env, orf-x, and LTR sequences have been detected (25). However, bovine ERVs closely related to enJSRVs were not detected in the version of the genome used in our study. This genome sequence belongs to a Hereford animal, while Morozov et al. analyzed animals from Simmental and Limousine breeds. For humans, it was suggested that a combination of genetic and environmental factors could contribute to determining the prevalence of enJSRV-related sequences in different populations (34). Thus, it is possible that different breeds of cow could also have different prevalences of enJSRV-related sequences.
Related to the relationship of ERV families of different species, in one lineage, representatives of human, chimpanzee, pig, and sheep groups were present, while cattle elements were absent. To account for this absence, we estimated the insertion time of the elements in this lineage. As there is no genomic information available for pigs and sheep, estimates were available only for human (19.5 to 8.9 MYA) and chimpanzee (33 to 15.8 MYA) elements. These insertion times were later than the divergence of ruminants and primates. Based on the weak support of the tree topology, a single infection is unlikely. In this lineage, two independent infections by a similar virus could have been detected, and in the case of ruminants, it is possible that cows lost this element at some point.
The absence of some ERV families in cows, compared with sheep and pigs, has prompted some authors to suggest that cows have a limited number of ERV families (43). Taking into account that the numbers of ERV families described were 31 for humans (15), 42 for chimpanzees (29), 20 for mice (23), and 24 putative families for cows (this study), BoERVs may not be as scarce as previously stated. Moreover, we detected one family, BoERV1, that had not been detected previously but that appears to be present at least in ruminants.
As described above, we did not detect any class III elements. It was suggested previously that in primates and mice (18), ERVs related to this class have been subjected to one or two bursts of copy number. If so, it is possible that the difference in the number of ERV families with primates and mice could be based on this burst of class III-related ERVs. Finally, the whole picture could be also confused by the intense selective breeding processes that have accompanied the domestication of cows (4).
In conclusion, we identified several thousand ERVs in the genome of Bos taurus by three different methods. The number detected depended on the technique used, ranging from a low of 928 using a BLAST-based method to 9,698 using Retrotector. When attempting to detect new ERVs, the use of different methods is advisable. ERVs did not appear to be randomly scattered across the chromosome but were more abundant on some, especially the X chromosome, than on others. Among the 24 detected families, 20 were newly described ERV families. The most abundant BoERV1 family is described for the first time. Finally, representatives of ERV families from rodents, primates, and ruminants showed a phylogenetic relationship following their hosts' relationships.
This is indeed the first genome-wide approach for the detection and characterization of bovine endogenous retroviruses. Further in-depth analyses are thus needed to uncover the whole picture of these genomic elements in cattle.
K.G.-E. was a recipient of a UPV/EHU grant (Vice-Rectorate of Basque and Plurilingualism). This work has been partially funded by UPV/EHU by means of projects EHU06/107 and GIU07/62 (B.M.J.).
We thank anonymous reviewers for highly useful comments and improvements to the manuscript. We thank Maialen Sistiaga for helping in the experimental work.
Published ahead of print on 4 August 2010.