The human genome contains numerous HERV insertions that retain 5′ and 3′ LTRs and varying remnants of the viral coding genes in between. Many more insertions now exist as solo-LTRs, the remnants of full-length insertions that were reduced to a single LTR by inter-LTR recombination. Inter-LTR comparison is the standard method for estimating the age of full-length HERV insertions. At the time of insertion, the two LTR sequences are presumed to be identical, and differences between the 5′ and 3′ LTRs of the same provirus are believed to arise from substitutions that accumulate in a clocklike manner after insertion. To elucidate the timeline of insertion of human-specific endogenous retroviruses into the human genome, we obtained the complete genome sequences of all previously reported human-specific full-length HERV-K insertions using the UCSC genome browser (hg18).
It has previously been hypothesized that gene conversion may be rampant among HERV loci. Pervasive gene conversion would prevent us from applying a molecular clock to these data, since observed sequence variation between LTR regions would then be attributable to recombination rather than to the stepwise accumulation of mutations predicted by the neutral theory of evolution. We reconstructed a maximum likelihood phylogeny of all HERV-K LTRs to determine the prevalence of gene conversion between insertions in the HERV-K family. The clustering of the 5′ and 3′ LTR sequences from each insertion locus in our phylogeny suggests that gene conversion is rare among HERV-Ks, with the single exception of HERV-K115. Evidence of a gene conversion event in HERV-K115 has been reported previously.
Phylogenetic tree of full-length HERV-K (HML-2) LTR sequences.
To estimate the insertion time of each human-specific endogenous retrovirus, we calculated the age of each human-specific full-length HERV-K insertion using the traditional inter-LTR comparison method. Inter-LTR divergence measurements for the fourteen insertions that showed no gene conversion in the LTR region were converted to insertion age estimates by applying an established HERV-K LTR-specific divergence rate of 0.13% per million years (Myr). Based on this method, HERV-K1p31.1 (referred to as HERV-K116 from here onwards) and HERV-K106 are the most likely to be the youngest full-length endogenous retroviruses in the human genome. Because their 5′ and 3′ LTR sequences are identical, an inter-LTR comparison can only provide an upper-bound age estimate. At the divergence rate of 0.13%/Myr, a single mutation is predicted to arise within an LTR every 0.8 Myr. With no sequence differences between their LTRs, inter-LTR analysis suggests that both HERV-K116 and HERV-K106 are younger than the time required for one sequence difference to arise between the two LTRs, or 0.8 Myr.
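The arithmetic behind the 0.8 Myr bound can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes an LTR length of 960 bp (the length quoted later for the HERV-K106 LTR region), treats divergence as the raw fraction of differing sites with no correction for multiple hits or gaps, and uses hypothetical function names.

```python
# Minimal sketch of the inter-LTR age calculation (illustrative only).
# Assumptions not stated at this point in the text:
#   - the LTR is 960 bp long
#   - divergence is the uncorrected fraction of differing sites

DIVERGENCE_RATE_PCT_PER_MYR = 0.13  # HERV-K LTR divergence rate from the text
LTR_LENGTH_BP = 960                 # assumed LTR length


def inter_ltr_age_myr(n_differences, ltr_length_bp=LTR_LENGTH_BP,
                      rate_pct_per_myr=DIVERGENCE_RATE_PCT_PER_MYR):
    """Convert a count of 5'/3' LTR differences into an age in Myr."""
    divergence_pct = 100.0 * n_differences / ltr_length_bp
    return divergence_pct / rate_pct_per_myr


if __name__ == "__main__":
    # Time needed for a single difference to appear between the two LTRs.
    print(f"{inter_ltr_age_myr(1):.2f} Myr per inter-LTR difference")
    # Identical LTRs give an age of zero, so only the one-difference
    # value above can serve as an upper bound.
    print(f"{inter_ltr_age_myr(0):.2f} Myr for identical LTRs")
```

With zero observed differences the point estimate is uninformative, which is why the one-difference value is read as an upper bound for HERV-K106 and HERV-K116.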
Human-specific complete HERV-K (HML-2) proviruses within the human genome.
We used Multalin to align HERV-K113, K115, K106, and K116 with the experimentally reconstituted infectious HERVs KCON and Phoenix, which are based on artificial consensus sequences of full-length human-specific HERV-K (HML-2) insertions. Although all four HERV insertions (K113, K115, K106, and K116) were similar to the reconstituted viruses, each contained mutations unique to that insertion (Figure S1). Both HERV-K106 and HERV-K116 are members of the type I HERV-K family, as evidenced by the 292 bp ‘deletion’ in env that is the signature of all type I HERV members. The presence of this 292 bp env deletion in multiple HERV-K type I members suggests that it may have been present in the infectious ancestral precursors of these viruses and probably does not render a HERV insertion dysfunctional on its own. However, HERV-K116 also has a 2846 bp deletion in its pol gene. In contrast, HERV-K106 exhibits relatively intact retroviral genome architecture. Thus, these data suggest that HERV-K106 is the youngest endogenous retrovirus that survives largely intact in the human genome.
Genome organization and haplotypes of HERV-K106.
While inter-LTR comparison is a useful tool for estimating the age of HERV insertions, it becomes less informative as the number of differences between the LTRs falls, and it provides only an upper-bound age estimate when the LTRs are identical. We recently developed an alternative dating method to infer insertion age when the inter-LTR method is inapplicable (for example, for solo-LTRs and for HERV loci such as K115 that show evidence of gene conversion). HERV insertions with identical 5′ and 3′ LTR sequences represent another scenario in which our alternative method is useful. We applied this method to the HERV-K106 insertion to derive a more precise estimate of its age. Our approach applies coalescent inference to inter-host sequence variation in one of the proviral LTR sequences to estimate HERV insertion age. We generated complete HERV-K106 3′ LTR sequences from 51 individuals representing various ethnicities and three different geographical locations within the United States. PCR amplification and sequencing of the HERV-K106 LTR revealed three single nucleotide polymorphic positions (SNPs): 133, 403, and 835 (numbered according to their position in the GenBank reference sequence AF164620). Four HERV-K106 haplotypes could be constructed from the SNPs identified in the 3′ LTR region.
Base frequencies and haplotypes of HERV-K106 3′ LTR with haplotype frequencies in various ethnic groups within the United States.
The coalescent estimation of insertion age rests on the assumption that the insertion site is evolving neutrally. We therefore tested whether this assumption holds for the HERV-K106 insertion site. We performed 10,000 coalescent simulations using MS software and Schaffner's calibrated model of human genome evolution to calculate the probability of observing exactly 3 SNPs in a neutrally evolving, 960 bp stretch of human DNA. We assumed no recombination and an inferred human mutation rate of 9.0×10⁻⁹ substitutions per site per generation. These simulations yielded a Gaussian mutational probability distribution with a mean of 8 SNPs, and showed that the presence of only 3 SNPs in a 960 bp neutrally evolving region of the human genome deviates significantly from expectations under Schaffner's model (p<0.05). These findings suggest one of two possibilities: either HERV-K106 is evolving under selection, or it is evolving neutrally but has not evolved in tandem with the human genome long enough to conform to the predictions of Schaffner's human genome-based model. We explored the selection hypothesis by performing standard tests of neutrality on the K106 locus, including Tajima's D, and Fu and Li's D* and H (performed using DnaSP). In all cases, we could not reject the null hypothesis that HERV-K106 is evolving neutrally (p>0.10). HERV-K106 itself may have been evolving neutrally yet still have been driven to fixation by hitchhiking if the region flanking the insertion had been under positive selection. We used the HGDP Selection Browser to test whether the genomic region containing the HERV-K106 insertion site is under selection by calculating the iHS and XP-EHH statistics on genotypes in the Human Genome Diversity Panel. The iHS and XP-EHH statistics are haplotype homozygosity-based tests used to detect signatures of recent selection on variants that have not yet reached fixation, and can be applied to detect selective sweeps in alleles that have approached fixation in one population but remain polymorphic in the overall human population. Even though HERV-K106 is fixed in all humans, the genetic region flanking the K106 insertion may contain SNPs that could reveal whether a selective sweep has occurred in this region. We found that HERV-K106 lies in a genomic location with only a few genes nearby (Figure S2), and the region of chromosome 3 containing HERV-K106 exhibited no signatures of selection, as neither iHS nor XP-EHH yielded extreme values (Figure S3). Together, these data support the conclusion that K106 is evolving neutrally and has shared its evolutionary history with the human genome for only a relatively short period of time.
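To make one of the neutrality tests above concrete, the sketch below computes Tajima's D from a set of aligned haplotypes using the standard published formula. It is an illustration under stated assumptions, not a reimplementation of DnaSP: the function name and the toy haplotypes are hypothetical, sites with more than two alleles are counted once, and no p-value is computed.

```python
# Minimal sketch of Tajima's D for aligned, equal-length sequences.
# The toy haplotypes in the usage example are invented for illustration
# and are NOT the HERV-K106 data.
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Return Tajima's D for a list of aligned sequences (n >= 4)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating (polymorphic) alignment columns.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")

    # pi: mean number of pairwise differences.
    n_pairs = n * (n - 1) / 2
    pi = sum(sum(a != b for a, b in zip(x, y))
             for x, y in combinations(sequences, 2)) / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Hypothetical sample: one rare variant among ten haplotypes.
    rare = ["AAA"] * 9 + ["TAA"]
    # Hypothetical sample: one variant at intermediate frequency.
    common = ["AAA"] * 5 + ["TAA"] * 5
    print("rare variant:  ", tajimas_d(rare))
    print("common variant:", tajimas_d(common))
```

An excess of rare variants pushes D below zero and an excess of intermediate-frequency variants pushes it above zero; values near zero are consistent with neutrality at constant population size.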
We constructed a maximum likelihood phylogeny of all 94 observed HERV-K106 3′ LTR haplotype sequences to estimate the age of the K106 insertion (Figure S4). According to coalescent theory, the genetic distance to the inferred most recent common ancestor (MRCA) should reflect the time that has elapsed since the establishment of the ancestral sequence and, in this particular case, the age of the proviral insertion itself. We used two previously reported evolutionary rates to translate genetic distance into coalescence time. The upper bound for the coalescence-based age estimate was inferred using the HERV-K LTR-specific mutation rate of 1.3×10⁻⁹, and the inferred mammalian genome mutation rate of 2.2×10⁻⁹ was used to calculate a lower-bound estimate. Based on these two rates, we estimate that HERV-K106 integrated into the human genome between 91,000 and 154,000 years ago, after the emergence of anatomically modern humans.
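The conversion from genetic distance to coalescence time can be sketched as a simple division by the rate. Several things here are assumptions rather than values stated in the text: both rates are taken to be in substitutions per site per year, the function names are hypothetical, and the distance of 2.0×10⁻⁴ substitutions per site is back-calculated from the reported age bounds purely for illustration.

```python
# Minimal sketch: genetic distance to the MRCA -> age bounds in years.
# Assumptions (not stated explicitly in the text):
#   - rates are in substitutions per site per year
#   - the example distance is back-calculated from the reported bounds

RATE_HERVK_LTR = 1.3e-9   # slower rate, gives the upper-bound age
RATE_MAMMALIAN = 2.2e-9   # faster rate, gives the lower-bound age


def coalescence_age_years(distance_to_mrca, rate_per_site_per_year):
    """Time needed to accumulate the given distance at the given rate."""
    return distance_to_mrca / rate_per_site_per_year


def age_bounds_years(distance_to_mrca):
    """Return (lower, upper) age bounds for a distance to the MRCA."""
    lower = coalescence_age_years(distance_to_mrca, RATE_MAMMALIAN)
    upper = coalescence_age_years(distance_to_mrca, RATE_HERVK_LTR)
    return lower, upper


if __name__ == "__main__":
    illustrative_distance = 2.0e-4  # substitutions per site (assumed)
    lower, upper = age_bounds_years(illustrative_distance)
    print(f"lower bound: {lower:,.0f} years")
    print(f"upper bound: {upper:,.0f} years")
```

Because the two bounds come from the same distance, their ratio equals the ratio of the two rates (2.2/1.3), which matches the ratio of the reported bounds.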