This report comprises the most complete and up-to-date analysis of HML-2 proviruses and solo LTRs that can be found in the published human genome. The HML-2 group comprises the most recent integrations of endogenous retroviruses in humans, and includes many members that are polymorphic within the species. It has been hypothesized that these endogenous elements represent the closest relatives of extant exogenous betaretroviruses that may retain the capacity to infect humans [16, 18]. These putative exogenous viruses have a proposed role in breast cancer as well as biliary cirrhosis, although no such virus has been convincingly detected [73-75]. Based on expression patterns, many groups have also suggested a role for endogenous HML-2 proviruses in various diseases, from breast, ovarian, and skin cancers to schizophrenia and arthritis [21, 24, 28, 34, 39, 44, 76]; however, a functional link to these diseases also remains to be established. Given the mounting evidence suggesting a clinically significant role for HML-2 proviruses in disease, it is surprising that no study has yet fully described all HML-2 proviruses in the published human genome. Because it is unlikely that each provirus contributes equally to every disease with which HML-2 expression is associated, identifying which provirus is expressed in a given disease remains impossible unless all known proviruses are characterized.
Here, we have identified and characterized 91 provirus elements present in the human genome, adding almost 30 more than have been previously described. We have also identified 944 solo LTR elements, over 1500 fewer than previously expected in the human genome [12], and 300 fewer than the closest suggested estimate [60]. Discrepancies between our estimate of solo LTR number and that published previously [60] are likely due to our exclusion of elements that are not full length or near full length (> 750 bp). It is unlikely that this variance is due to differences in genomic builds: when we compared the build used previously to the current build using the same criteria, we found the same number of solo LTR elements in both builds, and not the number reported previously (data not shown). While we believe our list is as comprehensive and thorough as possible, for the sake of accuracy it excludes many partial provirus elements that lack sufficient sequence to determine their grouping within HERV-K elements. In addition, at least 4 provirus elements are polymorphic and are therefore not present in the published human genome as full-length elements. It is thus very likely that 91 is an underestimate of the actual number of HML-2 proviruses, but it is the best approximation available to date. Previous groups have attempted to compile a list of HML-2 proviruses in humans, but the most complete of these identified only 54 proviruses, and many of the loci have changed since the publication of that report [52].
Because of the inherent ambiguity of genomic builds, it becomes increasingly important to develop a standard for HML-2 provirus identification. Previous publications reporting HML-2 proviruses in the human genome reference these proviruses by the accession numbers of the corresponding BACs, which provide little to no useful information about the proviruses described. Comparing information on HML-2 proviruses is therefore difficult: different groups use different nomenclature to define proviruses, and many use different accession numbers for the same provirus, as multiple BACs may contain the same locus. Finally, some accession numbers are for BACs that include more than one provirus [11, 51]. This confusion can be reduced through a few measures that this study provides: 1) deposition of all HML-2 sequences identified and their flanking sequences into GenBank; 2) standardization of HML-2 nomenclature; 3) subclassification of HML-2 for functional studies; 4) thorough analyses of all HML groups to define criteria for what qualifies a new element to belong to an existing group. Here, we have created a database of HML-2 provirus and flanking sequences, deposited it in GenBank, and clearly defined the properties of all known HML-2 elements.
Although attempts have been made to standardize nomenclature for HERVs using the tRNA primer as the defining characteristic [77, 78], we believe that this approach is not appropriate for the HERV-K elements. HML-5 has a sequence that suggests priming from a Met tRNA, which would place it in a HERV-M group despite its being closely related to HERV-K [77]. In addition, many elements lack primer-binding sites or have mutated ones, precluding their classification under this system. Therefore, while we agree that a standardized nomenclature for HERVs in the genome is necessary, we propose that all betaretrovirus-like elements be identified as HML-X (where "X" is 1-11, based upon phylogenetic similarity to known HML groups) followed by their locus on the human chromosome. An example from this study would be HML-2(3q12.3). While this nomenclature is limited to human genomes, it does provide a useful reference point when analyzing betaretrovirus-like ERVs in non-human primates. Further work is necessary to define the properties of all endogenous retroviruses in the human genome.
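The proposed naming scheme can be illustrated with a minimal sketch that builds and parses names of the form HML-X(locus). The function names and the regular expression for the cytogenetic locus are assumptions made for illustration only; the locus pattern was chosen to accept the loci quoted in this report (for example 3q12.3, 8p23.1d, and Xq28b) and is not part of the proposal itself.

```python
import re
from typing import Tuple

# Assumed locus pattern (illustrative, not part of the proposal):
# chromosome (1-22, X, Y), arm (p or q), band, optional sub-band,
# and an optional lowercase suffix for several proviruses in one band.
_LOCUS = r"(?:[1-9]|1[0-9]|2[0-2]|X|Y)[pq]\d+(?:\.\d+)?[a-z]?"
_NAME = re.compile(rf"HML-(\d{{1,2}})\(({_LOCUS})\)")


def format_hml_name(group: int, locus: str) -> str:
    """Build a name such as 'HML-2(3q12.3)' from a group number and locus."""
    if not 1 <= group <= 11:
        raise ValueError("HML group must be between 1 and 11")
    if re.fullmatch(_LOCUS, locus) is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    return f"HML-{group}({locus})"


def parse_hml_name(name: str) -> Tuple[int, str]:
    """Split a name such as 'HML-2(8p23.1d)' into (group, locus)."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not an HML-X(locus) name: {name!r}")
    group = int(match.group(1))
    if not 1 <= group <= 11:
        raise ValueError("HML group must be between 1 and 11")
    return group, match.group(2)
```

Because the name carries both the phylogenetic group and the chromosomal position, two reports that cite different BAC accession numbers for the same provirus would resolve to the same identifier under this scheme.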
It seems likely that most HML-2 proviruses are the result of independent integration events that have been preserved within the genome. However, there are 17 elements that are in the genome as a consequence of transposition events that include both a complete provirus and at least 1500 kb of flanking DNA. Based upon our estimates, these proviruses have been in the genomes of primates for 20-30 million years, though it is likely that these transpositional events occurred very recently, around the split of humans and chimpanzees (~5.5 mya [68]). This is seen in the incomplete expansion of elements 8p23.1d and 11p15.4, which do not have a corresponding provirus in chimps, while 8p23.1c and 8p23.1b do. The expansion of elements in the Xq28 locus corresponds to gene duplication of the cancer testis antigen 1 (CTAG1) into CTAG1A and CTAG1B; both CTAG1A and CTAG1B are exclusively expressed in malignant tissues or normal testis [79], the same expression pattern as that of HML-2 proviruses. This gene duplication is present in the chimpanzee, human, and orangutan published genomes, but not in the rhesus genome, consistent with the estimated integration time of the Xq28b provirus (~21 mya). The duplications in 1p36.21 are found within the PRAMEF gene cluster, comprising genes that are closely related to PRAME, another gene that is exclusively expressed in malignant tissues or normal testis [80]. The duplicated 5A elements are all flanked by genes encoding hypothetical proteins, so the significance of this expansion remains to be seen. Nevertheless, it is interesting that the same element, along with flanking sequence, was transposed multiple times, while most other elements were not; this strongly implies that the transposition is due to some element in the flanking DNA, not the provirus itself. This pattern contrasts with the ERV9 family of endogenous retroviruses, which have continued retrotranspositional activity within the genome since the hominid divergence within the primate lineage [81].
We have reclassified the different subgroups of HML-2 proviruses based upon unique signatures of our 1087 HML-2 LTR sequences (947 solo LTRs and 140 provirus-associated LTRs). We did not observe, within subgroups of our sequences, the sequence polymorphisms previously used to define the groups [14], likely because of our much larger sample set. However, we did observe an LTR5Hs-specific 4-base insertion at positions 585 and 10718 of the HML-2 provirus alignment (Additional File 1), which was found in ~80% of all LTR5Hs proviruses. LTR5A/B sequences also have a unique insertion at positions 806/10957, which is found in all LTR5A/B sequences but in no LTR5Hs sequences. Furthermore, LTR5A can be identified by unique insertions at positions 182/10317. All of the figures in this publication reflect our definitions of LTR grouping rather than the previous, less accurate groupings. It should be noted that our reclassification of LTRs is significant for categorizing viruses, as all of our phylogenetic trees (Figures and ) of provirus genes confirm monophyly of the subgroups. As such, we feel our method of grouping LTRs is a rigorous and predictive method for identifying HML-2 elements in genomes sequenced in the future.
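The signature-based grouping described above can be sketched as a simple decision rule over one row of the provirus alignment. This is an illustration under stated assumptions, not the procedure used in this study: it assumes the row is a gapped string from the alignment, that the quoted positions are 1-based start columns in the 5' LTR, and that an insertion shows up as non-gap characters in those columns. Only the 4-base width at position 585 comes from the text; the widths used for positions 806 and 182 are placeholders.

```python
GAP_CHARS = "-."

# Marker name -> (1-based start column in the 5' LTR, window width).
# Position 585 with width 4 is taken from the text. The widths for the
# other two markers are NOT stated in the text and are placeholders.
MARKERS = {
    "LTR5Hs": (585, 4),
    "LTR5A/B": (806, 1),  # placeholder width
    "LTR5A": (182, 1),    # placeholder width
}


def has_insertion(aligned_row: str, start: int, width: int) -> bool:
    """Return True if any column in the window holds a base rather than a gap."""
    window = aligned_row[start - 1 : start - 1 + width]
    return any(base not in GAP_CHARS for base in window)


def classify_ltr(aligned_row: str) -> str:
    """Assign an LTR subgroup from the presence of the signature insertions."""
    has_hs = has_insertion(aligned_row, *MARKERS["LTR5Hs"])
    has_ab = has_insertion(aligned_row, *MARKERS["LTR5A/B"])
    has_a = has_insertion(aligned_row, *MARKERS["LTR5A"])
    if has_ab:
        # The 806 insertion is shared by LTR5A and LTR5B; 182 separates them.
        return "LTR5A" if has_a else "LTR5B"
    if has_hs:
        return "LTR5Hs"
    # The LTR5Hs marker is present in only ~80% of LTR5Hs proviruses,
    # so its absence alone does not exclude that subgroup.
    return "unresolved"
```

The "unresolved" outcome is deliberate: a sequence that lacks all three insertions may still be an LTR5Hs element, so in practice it would need to be placed by phylogenetic analysis rather than by these markers alone.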
It is of interest that proviruses and solo LTRs appear to have been differentially maintained within the genome. Under a neutral model of evolution, one would expect the numbers of proviruses and solo LTR elements on any given chromosome to be approximately proportional to its size or gene density. In general, this principle holds true, though four chromosomes stand out: chromosomes 2, 4, 8, and 17. While chromosomes 2 and 17 are gene rich, they are relatively devoid of both proviral and solo LTR elements. Conversely, chromosomes 4 and 8 are seemingly enriched in HML-2 elements compared to RefSeq genes. Furthermore, we observed an enrichment of proviruses compared to solo LTRs on chromosome 8, and an enrichment of solo LTRs compared to proviruses on chromosome 2. A possible explanation for the latter is that human chromosome 2 is the product of a fusion of two smaller chromosomes found in other primates. When this fusion event took place, it is conceivable that recombination among many highly similar DNA sequences occurred, leading to production of more solo LTRs than proviruses on this chromosome. It is difficult to determine if this is the case, as most non-human primate genomes are unfinished and many proviral loci are not assigned to any given chromosome. Initial analysis identified one provirus on Chromosome 2a and 2b in chimpanzee and at least 5 proviruses in orangutan (data not shown). Further drafts of non-human primate genomes are necessary for this type of analysis to be performed in other species.
The significance of the increased ratio of proviruses to solo LTRs on chromosome 8 is unclear, although the distribution may simply be skewed by the expansion in the 8p23.1 locus. Removal of two proviruses on chromosome 8 puts the point within the 95% confidence interval of the solo LTR-provirus correlation. Nevertheless, our study shows that endogenous retroviruses can be used to study genome evolution, as they are present in numbers sufficiently large to be informative, but much more manageable than SINE or LINE elements. Finally, the strong correlation of HML-2 elements with gene regions may reflect a propensity to integrate in such regions [71] or, conceivably, some sort of protection against mechanisms designed to remove transposable elements [82]. However, this conclusion may be an oversimplification of a more complicated mechanism of regulating repetitive elements within the genome. The fact that so many elements are maintained in or near genes may provide a partial explanation for the correlation of HML-2 gene expression with various disease states.
While disease association of HML-2 proviruses is controversial, many believe that HML-2 expression in diseased tissue is a byproduct of cellular dysfunction. Others have argued that exogenous retroviruses may recombine with homologous endogenously expressed HML-2 elements yielding infectious viruses that cause disease. This study is also the first to thoroughly identify and characterize all available human HML-2 proviruses. Correlation of HML-2 expression to disease onset is well-supported, and suggests that provirus expression may be a useful biomarker for certain diseases, particularly breast cancer, where no useful diagnostic marker currently exists [
83]. Here, we have provided a list of provirus open reading frames (Tables and ) that may represent putative targets for detection of disease using HML-2 proteins or mRNA transcripts as biomarkers. We are also making available complete files of the sequences identified through deposition in GenBank (accession numbers: JN675007-JN675097) along with flanking sequences (accession numbers: JN675098-JN675187). Finally, we have aligned these sequences (Additional File 1) and provided them as a useful reference that can be viewed using any sequence viewing software. These steps should prove helpful in identifying and categorizing HML-2 expression in disease and in assigning detected sequences to specific proviruses and, therefore, chromosomal locations.
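For readers who wish to work with the deposited records, the accession ranges quoted above can be expanded into individual accession numbers. The following is a minimal sketch; the function name is hypothetical, and it assumes only that both ends of a range share the same letter prefix and the same number of digits. It performs no database query.

```python
import re
from typing import List

_ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accession_range(first: str, last: str) -> List[str]:
    """List every accession from first to last, inclusive."""
    m_first = _ACCESSION.fullmatch(first)
    m_last = _ACCESSION.fullmatch(last)
    if m_first is None or m_last is None:
        raise ValueError("accessions must be letters followed by digits")
    prefix, start = m_first.groups()
    prefix_last, end = m_last.groups()
    if prefix != prefix_last or len(start) != len(end):
        raise ValueError("range ends must share prefix and digit count")
    if int(start) > int(end):
        raise ValueError("range is reversed")
    width = len(start)  # keep any leading zeros in the numeric part
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]


# The two ranges quoted in the text.
provirus_accessions = expand_accession_range("JN675007", "JN675097")
flanking_accessions = expand_accession_range("JN675098", "JN675187")
```

Expanded this way, the provirus range contains 91 accession numbers and the flanking-sequence range contains 90.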
Two genes not analyzed for expression are np9 and rec, alternative splice products of type 1 and type 2 env genes, respectively. Although rec transcripts are found in normal and cancer tissues, np9 mRNA has only been detected in tumor tissue, as observed in mammary carcinoma biopsies, suggesting a possible role in tumorigenesis [64, 84, 85]. The type 1 proviruses all belong to the LTR5Hs subgroup, the most recent subgroup of HML-2 elements in the genome. Six of the 20 type 1 proviruses contain open reading frames for the env gene despite lacking the 292 bp sequence required for expressing functional Env. It is possible that the retention of an open reading frame in the remaining env sequence plays some role in the disease association of HML-2 proviruses.
The observations that type 1 proviruses are found almost exclusively within the LTR5Hs group of proviruses but are not monophyletic, combined with their patent incompetence for independent replication, are most consistent with their arising repeatedly by gene conversion of existing proviruses or by recombination between genomes arising from replication-competent type 2 proviruses during reverse transcription prior to integration. Of these two models, recombination during reverse transcription is by far the more likely. First, if gene conversion post-integration were so frequent, it would also be seen in other parts of the genome, particularly in the LTR, where it is readily detected [53, 65, 66]. However, such events, although they can be detected over evolutionary time, are quite infrequent for HML-2 proviruses [86]. By contrast, recombination during reverse transcription of copackaged RNA genomes is the rule during retrovirus replication, and averages of 5-10 crossovers per genome per replication cycle have been estimated. Since all initial integrations almost certainly arose from infection of the germ line by an HML-2 virus produced by a somatic cell that also contained and expressed type 1 proviruses, the heterozygous virions necessary for recombinant formation would have been very frequent, and such recombinants would arise at high frequency. An interesting topic for speculation is whether the deletion itself or the Np9 protein that results from it promotes this process in some way, for example by causing higher levels of expression of type 1 genome RNA.
The polymorphic nature of HML-2 proviruses may play an important role in the polymorphism of diseases with which they are associated. Genome-wide association studies (GWAS) have proven very useful for correlating single nucleotide polymorphisms (SNPs) to various diseases [87]. We attempted to determine if there were any proviruses or solo LTR elements present between SNPs shown to be involved in disease; however, we did not identify any proviruses that were linked to disease-associated SNPs (data not shown). This result does not preclude the possibility of association of polymorphic proviruses not present in the published genome with these SNPs. Also, many SNPs found on repetitive elements like proviruses are precluded from GWAS analysis, thereby eliminating the possibility of studying disease association of polymorphic proviruses. The abundance of solo LTRs and proviruses in close proximity to genes would indicate that there is some protection for these elements within the genome. For that matter, dysregulation of solo LTR formation and recombination of proviruses might play an important role in disease.