|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: RCS MAC FC KDM. Performed the experiments: RCS MAC. Analyzed the data: RCS MAC. Contributed reagents/materials/analysis tools: RCS MAC AP. Wrote the paper: RCS MAC AP FC KDM.
Endogenous retroviruses (ERVs), the remnants of retroviral infections in the germ line, occupy ~8% and ~10% of the human and mouse genomes, respectively, and affect their structure, evolution, and function. Yet we still have a limited understanding of how the genomic landscape influences integration and fixation of ERVs. Here we conducted a genome-wide study of the most recently active ERVs in the human and mouse genome. We investigated 826 fixed and 1,065 in vitro HERV-Ks in human, and 1,624 fixed and 242 polymorphic ETns, as well as 3,964 fixed and 1,986 polymorphic IAPs, in mouse. We quantitated >40 human and mouse genomic features (e.g., non-B DNA structure, recombination rates, and histone modifications) in ±32 kb of these ERVs’ integration sites and in control regions, and analyzed them using Functional Data Analysis (FDA) methodology. In one of the first applications of FDA in genomics, we identified genomic scales and locations at which these features display their influence, and how they work in concert, to provide signals essential for integration and fixation of ERVs. The investigation of ERVs of different evolutionary ages (young in vitro and polymorphic ERVs, older fixed ERVs) allowed us to disentangle integration vs. fixation preferences. As a result of these analyses, we built a comprehensive model explaining the uneven distribution of ERVs along the genome. We found that ERVs integrate in late-replicating AT-rich regions with abundant microsatellites, mirror repeats, and repressive histone marks. Regions favoring fixation are depleted of genes and evolutionarily conserved elements, and have low recombination rates, reflecting the effects of purifying selection and ectopic recombination removing ERVs from the genome. In addition to providing these biological insights, our study demonstrates the power of exploiting multiple scales and localization with FDA. These powerful techniques are expected to be applicable to many other genomic investigations.
Approximately half of the human genome is composed of repetitive elements. Yet we do not completely understand why certain elements insert in particular genomic locations, and what determines which elements are retained and which are eliminated from the genome. To address these questions we studied endogenous retroviruses, one type of repetitive elements which occupy ~10% of the human and mouse genomes, together with genomic features characterizing various biological processes (e.g., recombination and transcription) in the neighborhoods of these elements. Using statistical techniques, we identified enrichment of genomic features in the vicinity of endogenous retroviruses of different evolutionary ages. Features overrepresented adjacent to young endogenous retroviruses are expected to have facilitated their insertion in the genome. Features overrepresented adjacent to older endogenous retroviruses are expected to have facilitated both their insertion and their chances of being sustained in the genome. Our analyses allowed us to explain the uneven distribution of endogenous retroviruses along the genome, and thus to better understand the interaction of different biological processes in shaping the evolution of genome architecture.
Endogenous Retroviruses (ERVs) are Class I Transposable Elements (TEs) considered to be remnants of germ-line retrovirus infections inherited by the next generations . As all Class I TEs, ERVs transpose via an RNA intermediate, i.e. they “retrotranspose”. Because they possess Long Terminal Repeats (LTRs), they are also known as LTR-retrotransposons. Depending on the similarity of their gene content to that of certain retroviruses, ERVs are classified as Gammaretrovirus-, Betaretrovirus-, and Spumaretrovirus-like [1–3]. Full-length ERVs possess three retroviral coding genes (i.e. gag, pol, and env) and LTR flanking sequences ]. In most cases, the internal genes are deleted by recombination of LTRs, converting ERVs into solo-LTRs [5,6].
Most ERVs have accumulated numerous mutations that render them inactive . However, some rare examples of young ERVs that have coding capacity, are expressed and are transpositionally active, have been described in mammals, e.g. in koala , mouse , cat , sheep , and mule deer . Active ERVs are transposition-competent and have integrated recently; hence for them, polymorphic events–in terms of presence/absence–are observed at the population level but the allele frequencies of integrations are low. For instance, CrERVγ is an endogenous gammaretrovirus that was recently detected in mule deer . This ERV has been invading the germ line of mule deer since its speciation from white-tailed deer approximately 1.1 million years ago (MYA), and the copies found display polymorphisms in the wild mule deer population. In total, in this species, there are on average 100 full-length copies of the CrERVγ per haploid genome . However, if solo-LTR elements are included, this estimate increases two-to-three-fold .
ERVs occupy ~8% of the reference human genome (they are called HERVs for Human ERVs), and have been integrating in it starting more than 35 MYA [6,7,11,12]. However, only the HERV-K family has been active during the past 6 MY–since the divergence of human and chimpanzee . Moreover, among 113 human-specific HERV-K elements only 15 are full-length and none is infectious , though about a dozen were found to be polymorphic in 100 individuals from diverse populations indicating retrotransposition activity in the recent past [13,14]. In cell lines, however, two HERV-K named Phoenix  and HERV-KCON  were reconstructed to be infectious, producing retroviral particles and causing in vitro integrations. Expression of HERV mRNA varies among tissues. Importantly, significant expression levels were detected in testis as well as placenta [17,18]. Some analyses are available for HERV-Ks embryonic expression [19,20].
Approximately 10% of the reference mouse genome is derived from LTR elements, including ERVs . In mouse oocytes, approximately 13% of transcripts were reported to be derived from MaLRs (a type of LTR elements) as detected from ESTs . Also, mice have highly active ERVs causing up to 10–12% of spontaneous germ-line insertional mutations–most of which are due to activity of IAP (Intracisternal A Particle) and MusD/ETn (or ETns in short; Early Transposon family) elements . IAPs and ETns are both non-infectious betaretroviruses. In the mouse genome, full-length IAPs contain retroviral genes needed for retrotransposition; however there are also partially deleted copies (ERVs missing genes or other sequences). ETns consist of non-coding sequences and are aided by MusD proteins to retrotranspose . Insertional polymorphisms have been detected for both IAPs and ETns in multiple mouse strains; additionally, some insertions arose prior to the divergence of these strains [24,25]. In the rodent lineage, out of seventeen species studied, three (Mus, Spermophilus, and Cavia) possess 80% of all IAP loci found in these species . These elements are absent from monkeys and apes . Mouse IAPs and ETns are known to transpose in different mouse strains causing mutations in the germ line; both polymorphic (in terms of presence/absence) and fixed elements are known for each mouse strain .
The exaptation of ERVs–i.e. the recruitment of their sequences to perform a new function as regulatory or coding sequences–has influenced the evolution of genomes in multiple ways. Some enhancers and promoters derived from ERVs assume new roles in gene regulation; e.g., the alternative promoter of the CYP19 gene–an enzyme important for estrogen biosynthesis–leads to its high expression levels in the primate placenta [3,27]. Another interesting example of ERV exaptation associated with the evolution of placenta is syncytin, a gene derived from the env gene of HERV-W . Other ERV genes were exapted to function as proteases, RNA-dependent DNA polymerase with RNAse H, and integrases, as well as structural proteins, in diverse organisms . Importantly, while ERVs have been relevant to genome evolution, they have also been implicated in the development of multiple diseases by disrupting genes, modifying regulatory sequences or altering gene expression. Though causal links have not been definitely established, the diseases that have been associated with ERV retrotransposition or expression include multiple sclerosis, cancer and psoriasis in human [3,11,29,30]; and obesity, diabetes, and cancer in mouse .
Notwithstanding the role ERVs play in the architecture, evolution, and function of genomes, our knowledge of how the genomic landscape influences their integration and fixation is still limited. Gene density and GC content have been shown to be negative predictors of ERV density–for not only full-length elements but also solo-LTRs [31–33]. In contrast, reconstructed HERV-Ks integrate in regions with high numbers of gene transcription units . Similarly, in vitro IAP integrations occurred preferentially in actively transcribed domains of HeLa cells . Interestingly, human and mouse ERVs that are located in introns are mostly present in antisense orientation avoiding gene expression disruption [32,36,37]. Other important genomic characteristics of ERVs and their genomic neighborhoods are high levels of methylation and epigenetic modifications used by the genome to limit transposition [3,38]. ETns, however, show decreased methylation when located in the vicinity of transcription start sites and expressed genes . Chromosome location is another relevant feature of ERV distribution, as illustrated by the description of 100 previously unknown HERV-Ks in the centromeres of 15 chromosomes . It has been suggested  that the accumulation of ERVs is the net result of two processes–integration, which can be biased towards certain genomic landscapes, and purifying selection, which removes ERVs disrupting the function of important elements, e.g. of genes. Disentangling these two processes can be challenging and requires the investigation of ERVs that integrated in the genome at different times.
Several approaches have been used to elucidate the relationships between genomic features and distribution of TEs. Most studies of the associations between genomic features and TE density, Integration Site (IntS) preferences, or neighboring sequences characteristics were performed employing statistical methods such as ROC curves , non-parametric tests , Fisher exact tests , maximum likelihood techniques , MANOVA , and multiple regressions [45,46]. The main limitation of many past studies was the low data resolution determined by available technologies. However, resolution has recently improved, e.g., with the release of ENCODE and ModENCODE consortia data . The application of innovative statistical approaches though has not kept pace with the improvement in data. Statistical methodology should address the fact that many features of the genome act jointly in defining its biological functionality. Being able to consider multiple genomic features simultaneously, e.g., with multiple regression analyses [45,46], is essential to obtain meaningful biological conclusions. Moreover, with the availability of higher resolution data, it becomes paramount to use statistical techniques capable of detecting and differentiating effects at different scales and locations, e.g., one genomic feature may be generally enriched or depleted in the broad flanks of a TE, while another may show enrichment or depletion at a specific location in close proximity of the element’s IntS.
To perform more powerful and effective analyses, one can view genomic features as “curves” composed of measurements in consecutive genomic intervals. In this framework, Functional Data Analysis (FDA) techniques can be exploited to extract signals from these curves, taking advantage of the ordered nature of the measurements and considering different scales and locations, i.e. sizes and positions of genomic intervals (see  and  for a comprehensive introduction to FDA). This class of techniques includes curve smoothing and registration methods, functional principal component analysis, functional hypothesis testing, functional regression, and functional clustering . In the last decade FDA has been utilized in an increasing number of biomedical applications [51,52], particularly in cardiovascular research [53–55] and kinesiology . Although still limited in number, some applications of FDA also exist in the context of genetics and genomics, e.g., in genetic association studies [57–59], epistasis analysis , and ChIP-seq peak shape clustering .
Here, applying FDA methodology, we address three questions about the biology of ERVs. First, what genomic features are significant for ERV integration and fixation? Second, at what genomic scales and locations are these features influential? Third, and finally, how do genomic features work in concert to provide signals essential for integration and fixation of ERVs? Using genome-wide data, we applied the recently developed Interval Testing Procedure (ITP)  to determine the influence of flanking sequence features on integration and fixation of mouse (polymorphic and fixed ETns and IAPs) and human (fixed and in vitro HERV-Ks) ERVs. As a result, we detected diverse genomic features that affect integration and fixation of these elements (e.g. gene content, replication timing, AT count, and LINE content), and did so differentiating effects at various scales and locations in the flanking regions. Finally, we employed multiple Functional Logistic Regression (FLR) models to capture the combined effects of a restricted set of features resulting in a compact group of genome features that define the genomic landscape of integration and/or fixation preferences for ERVs. Importantly, the functional testing procedures and regression techniques we extended, employed and demonstrated in this study can be broadly applied in genomics.
In this study we analyzed in vitro, polymorphic, and fixed ERVs. The distributions of in vitro and polymorphic ERVs are only marginally influenced by selection and thus provide a more accurate view of integration preferences. Fixed ERVs, in contrast, carry information about both integration and fixation. We interrogated the genomic neighborhoods (32-kb flanking sequences upstream and 32-kb flanking sequences downstream of each element, so there is no overlap among flanking regions to maximize the number of ERVs in the study) of one human and two mouse ERV families. In mouse, we considered 1,866 ETns (242 polymorphic and 1,624 fixed) and 5,950 IAPs (1,986 polymorphic and 3,964 fixed) detected genome-wide by Zhang and colleagues ; elements were considered to be fixed if they were shared among four mice strains, and polymorphic if they were present in the C57BL/6J strain but not in the other three strains (see Methods). As control regions, we considered 1,379 continuous 64-kb regions of the mouse genome that did not overlap with the flanking sequences of ERVs (see Methods, Table 1). In human, we considered 826 fixed HERV-Ks (Table 1) annotated by Subramanian and colleagues . We also extracted the genomic locations of 1,065 in vitro HERV-K integrations in human embryonic kidney and fibrosarcoma cell lines  (Table 1). A total of 1,690 control regions were generated similarly to those in mouse (see Methods). Human and mouse ERVs in our analyses ranged from solo-LTRs (~60 bp) to full-length elements (~9 kb) (Table A in S1 Text). The number of ERVs present on each chromosome correlated with chromosome size (Fig A in S1 Text). Human chromosome 19 was an outlier with an overrepresentation of fixed HERV-Ks (Fig A in S1 Text).
We selected a diverse set of genomic features (Table 2) that could be implicated in ERV integration or fixation as reported by previous ERV [31–33] and non-ERV TE studies [45,46]. In total, we considered 41 and 43 genomic features in mouse and human ERV flanking regions, respectively (derived from 43 datasets in mouse and 44 datasets in human). These features reflected DNA conformation (e.g., G-quadruplex), DNA sequence (e.g., A/T content), position on the chromosome (e.g., distance to the closest centromere and telomere), recombination (e.g., local recombination rates), replication (e.g., replication timing), gene regulation and expression (e.g., histone marks and DNase I hypersensitive sites), as well as selection (e.g., exons and most conserved elements). Where possible, we specifically utilized features studied in embryonic stem cells (ESCs) or in sperm cells as they most closely proxy characteristics of germ-line and embryonic cells . Four low-resolution features (replication timing, recombination rates, distance to telomere, and distance to centromere) were represented by a single value for each 64-kb region. For each high-resolution feature, we measured either its content (fraction of the genomic window covered by the feature), its count or its weighted average (WA, only for methylation and expression features) in each of the 64 1-kb windows constituting the flanks of each ERV and each control region (Fig 1A, see Methods). We applied hierarchical clustering to screen out high-resolution genomic features that present strong correlations with each other (Figs B and C in S1 Text). For example, for human, exon content was highly correlated with gene expression in ESCs and thus we removed the latter from the analysis. As a result, a total of 35 mouse and 36 human high-resolution genomic features (derived from 35 datasets in mouse and 37 datasets in human) were retained for further analysis (Figs D and E in S1 Text).
To identify genomic features significantly affecting the ERV distributions in the human and mouse genomes, we contrasted flanking regions of fixed ERVs (either mouse ETn and IAP, or human HERV-K) vs. control regions; such a comparison is expected to reflect both integration and fixation preferences (Fig 1B). In an attempt to disentangle genomic features affecting ERV integration from those affecting their fixation, we conducted additional comparisons; namely, we contrasted flanking regions of polymorphic mouse ERVs (ETn and IAP) vs. mouse control regions, and flanking regions of in vitro HERV-K vs. human control regions. In these comparisons integration preferences are expected to be substantially more prominent than fixation preferences because selection had substantially less time to act on polymorphic or in vitro ERVs than on fixed ERVs. Finally, to pinpoint genomic features significant for ERV fixation, we contrasted flanking regions of fixed vs. polymorphic mouse elements (ETns and IAPs), and of fixed vs. in vitro HERV-Ks. In a way, the analysis of fixed ERVs vs. controls can be viewed as “cumulating” that of polymorphic or in vitro ERVs vs. controls, and that of fixed vs. polymorphic or in vitro ERVs. In total, we conducted nine comparisons (Fig 1B), each using four different statistical techniques as described below (Fig 1C). Admittedly, polymorphic integrations are affected by selection to a greater degree than in vitro ones, however we are not in possession of both of these data types for the species in our study; only polymorphic data are available for mouse ERVs and only in vitro data are available for human ERVs.
First, we tested whether ERV presence was significantly affected by low-resolution features using a univariate permutation test (see Methods, Table B in S1 Text; Fig 2); this is appropriate because these features are represented by a single value for each 64-kb region. Second, for the high-resolution genomic features, we employed the two-population Interval Testing Procedure (ITP) for functional data  to assess whether each feature, when considered alone, had significantly different content (or count, or WA) in a comparison, e.g. in ERV flanking regions vs. controls (see Methods for details). This technique is particularly suitable for our analysis because it considers the data as a curve over the 64 1-kb windows comprising each region, instead of taking one value for the region (e.g., an average over the 64 windows). ITP combines inference on the whole curve with component-wise inference (i.e. inference on measurements comprising the curve). Thus, it allows us to select relevant genomic features detecting both the scale and the location at which each feature acts (see Methods for more details). From this analysis we expect to detect genomic features that: (1) show significant enrichment/depletion locally, especially in windows close to the IntS of ERVs or further away from it–we call these localized differential landscape (LDL) features (e.g. Fig 3A); (2) show a uniform level of significant enrichment/depletion throughout all 64 1-kb windows–we call these invariant differential landscape (IDL) features (e.g. Fig 3B); or (3) are not significant over the whole 64-kb region examined. In order to capture different nuances of the data, we performed ITP using three test statistics (mean difference, median difference, and variance ratio; Figs Figs44–6 and F-T in S1 Text), however, below we focus on results concerning mean differences.
Third, to determine the individual explanatory power of major predictors, we fitted single Functional Logistic Regressions (FLRs; Figs Figs1C1C and and7,7, Table C in S1 Text) for each feature found to be significant in ITP. This analysis allowed us to summarize and better quantify the results obtained by univariate permutation test and ITP (Fig 7). Moreover, it allowed us to identify features that, by themselves, explained a percent of deviance in excess of 20%. These are clearly very relevant predictors (Table C in S1 Text) but we did not include them in our final multiple FLR models (see below) as they would hide the concurrent effects of other potentially relevant predictors.
Fourth, we examined the joint effects of the remaining significant predictors (as determined by ITP and univariate permutation test) via multiple Functional Logistic Regression (Fig 1C and Tables Tables33–11). The multiple FLR models expressed the likelihood of being in the neighborhood of an ERV vs. control (or of a fixed vs. a polymorphic mouse ERV, or of a fixed vs. in vitro HERV-K) as a joint function of several predictors. In particular, IDL features and low-resolution features that proved significant in univariate permutation tests were treated as scalar predictors represented by their averages across the 64 windows constituting each region. In contrast, LDL features were treated as functional predictors with curves evaluated at customized scales and intervals to capture the specific behavior of each LDL feature, e.g. around the IntS, as suggested by the ITP. Importantly, the modified ITP we employed (see Methods) gave us detailed information on the best scale and location, i.e. on the subregions on which to study the curve, for each of these functional predictors.
To identify genomic features affecting the distribution of ETns in the mouse genome–as a result of both integration and fixation preferences of these elements–we contrasted flanking sequences of fixed ETns vs. control regions. Univariate permutation tests applied to the low-resolution features (Fig 2A and Table B in S1 Text) indicated that the flanking regions of fixed ETns have lower recombination rates and later replication timing, and are closer to centromeres. ITP indicated (Fig 4A and Fig F in S1 Text) that all four microsatellites types, LINEs, L1 target sites, AT count, and the H3K9me3 histone mark are overrepresented, while SINEs, exons, introns, CpG islands, most conserved elements, all features associated with CpG methylation, ESC expression, and two histone marks (H3K27me3 and H3K36me3) are underrepresented, throughout the whole fixed ETn flanking regions. ITP also identified features with interesting localized behaviors: recombination hotspot content and the H3K27ac histone mark are overrepresented immediately next to the IntS, while DNase I hypersensitive sites and three histone marks (H3K4me1, H3K4me3, and H3K9ac) are underrepresented everywhere except for immediately next to the IntS. We found that also Z-DNA repeats and G-quadruplex repeats are underrepresented, with a yet more complex local behavior. Next, single FLRs revealed that all four microsatellites types, LINEs, introns, most conserved elements, and ESC expression have very strong effects (relative contribution to the deviance explained, RCDE ≥26%; Fig 7A and Table C in S1 Text). We therefore excluded these predictors from the final multiple FLR model, which explained 51.1% of the deviance in discriminating fixed ETns from controls (Table 3). The two strongest scalar predictors in such a model (i.e. H3K9me3 and L1 target sites, RCDE 20.5% and 23.5%, respectively) had positive effects, while AT count and H3K27me3 had negative effects (RCDE 17.8% and 4.7%, respectively; see Discussion for the explanation of the negative effect of AT count in multiple FLRs). The two strongest functional predictors–CpG methylation and exon content (RCDE of ~10%)–had negative effects. Interestingly, the H3K4me1 mark (RCDE 5.1%) had a strong positive effect strictly localized at the IntS (-4 to 4 kb) and a negative effect away from the IntS, while mirror repeats (RCDE 3.1%) had a positive effect on the whole region.
To highlight integration preferences, we contrasted flanking sequences of polymorphic ETns vs. control regions. In this analysis (Fig 4B and Fig G in S1 Text) we found many similarities but also a number of noteworthy differences with respect to the analysis of fixed ETns vs. controls (Fig 4A and Fig F in S1 Text). For instance, similar to fixed ETns, the flanking sequences of polymorphic ETns appeared to replicate later than controls (Table B in S1 Text and Fig 2A) suggesting that this feature might be important for ETn integration. The underrepresentation of exons, CpG islands, and several histone marks was weaker for polymorphic ETns vs. controls than for fixed ETns vs. controls suggesting that ETns can integrate but tend not to become fixed in such environments (Fig 4B and 4A). Moreover, the content of DNase I hypersensitive sites did not differ significantly from controls in a relatively large region surrounding the IntS for polymorphic ETns (-6 kb to +16 kb; Fig 4B) but only at 1 kb upstream from the IntS for fixed ETns (Fig 4A). Single FLRs displayed very similar explained deviances for microsatellites and LINEs in the two comparisons, while the explained deviances for intron content and most conserved elements content were lower for polymorphic ETns vs. controls than for fixed ETns vs. controls (Fig 7A, Tables Tables44 and and3)–confirming3)–confirming that the former are subject to weaker selection effects. The final multiple FLR model for polymorphic ETns vs. controls explained 48.4% of the deviance (Table 4) and was similar to the model for fixed ETns vs. controls (Table 3).
Next, to highlight fixation preferences, we contrasted flanking regions of fixed vs. polymorphic ETns. This analysis, too, revealed some similarities and some important differences relative to that of fixed ETn vs. controls. Fixed ETns were located in regions with lower recombination rates and closer to centromeres compared to polymorphic ETns (Table B in S1 Text and Fig 2A) suggesting that these features are important for fixation. The ITP tests contrasting fixed vs. polymorphic ETns (Fig 4C and Fig H in S1 Text) did not identify microsatellites as features differentiating fixation and integration propensities. However, they did reveal a more localized underrepresentation of exons and most conserved elements around the IntS (Fig 4C)–as compared to the ITP tests contrasting fixed ETns vs. controls (Fig 4A). Recombination hotspots, which were overrepresented in polymorphic ETns vs. controls (Fig 4B) and strongly overrepresented near the IntS in fixed ETns vs. controls (Fig 4A), were underrepresented in fixed vs. polymorphic ETns (Fig 4C). Also, several histone marks (H3K9me3, H3k4me1, and H3K9ac) were overrepresented near the IntS in fixed vs. polymorphic ETns (Fig 4C), but did not show a significant difference near the IntS in fixed ETn vs. controls (Fig 4A). Interestingly, DNase I hypersensitive sites were overrepresented -1 kb upstream and underrepresented up to 2 kb downstream from the IntS (Fig 4C). Single FLRs did not identify features which, individually, had great strength in characterizing fixed vs. polymorphic ETns (all explained deviances <3.5%) (Fig 7A and Table C in S1 Text). However, taken together in the context of multiple FLR, eight features explained 15% of the deviance in discriminating fixed vs. polymorphic ETns (Table 5) and reiterated most of our observations from the ITP tests (Fig 4C).
Interestingly, the genomic features significant in distinguishing between the flanking sequences of fixed IAP and control regions were very similar to those identified in the analogous comparison for ETns. For instance, just as fixed ETns (Table B in S1 Text and Fig 2A), fixed IAPs were found in regions with lower recombination, later replication, and smaller distance to the centromere than controls (Table B in S1 Text and Fig 2B). Most predictors found to be significant by the ITP tests were also shared between the fixed IAPs vs. controls and the fixed ETns vs. controls comparisons (Figs (Figs5A5A and and4A,4A, respectively). Several noteworthy exceptions included a uniform underrepresentation of H3K27ac throughout the 64 kb region flanking fixed IAPs (Fig 5A and Fig I in S1 Text), while this histone mark was overrepresented near the IntS of fixed ETns (Fig 4A and Fig H in S1 Text). Transcription start sites were underrepresented for fixed IAPs, while they were not significant for fixed ETns. Also, three histone marks (H3K4me1, H3K4me3, and K3K9ac), as well as DNase I hypersensitive sites, were uniformly underrepresented throughout the 64 kb region flanking fixed IAPs, but not significantly different from controls immediately next to the IntS for ETns (Fig 5A, Fig I in S1 Text, Fig 4A, and Fig H in S1 Text). Major individual predictors, as identified by single FLRs, were also remarkably similar between fixed IAPs (Table C in S1 Text and Fig 7B) and fixed ETns (Table C in S1 Text and Fig 7A)–with only one extra predictor for fixed IAPs; H3K9me3 content. The multiple FLR model had a 33.3% deviance explained (Table 6) and again showed many similarities to the analogous model for fixed ETns (Table 3).
Next, following a logic similar to the one used above for ETns, we attempted to separate integration vs. fixation preferences for IAPs. Just as was observed for ETns, genomic features significant in distinguishing the flanking sequences of polymorphic IAPs from control regions (Fig 5B and Fig J in S1 Text) were very similar to those distinguishing the flanking sequences of fixed IAP from control regions (Fig 5A and Fig I in S1 Text). The underrepresentation of Z-DNA repeats was more localized (close to the IntS), while recombination hotspot content was more uniformly overrepresented in polymorphic IAPs vs. controls than in fixed IAPs vs. controls. Moreover, transcription start sites showed no significant differences and the H3K27ac histone mark was only weakly underrepresented in polymorphic IAPs vs. controls, while these features were more prominent in fixed IAPs vs. controls suggesting their importance for fixation rather than integration. Single FLRs identified the same group of major predictors for polymorphic IAPs vs. controls as were identified for polymorphic ETns vs. controls, with the addition of the functional predictor ESC expression (explained deviance 22%)(Fig 7B and Table C in S1 Text). We excluded these features from the final multiple FLR model, which explained 30.4% of the deviance in discriminating polymorphic IAPs vs. controls (Table 7) and was similar to the analogous model for ETns (Table 4).
When comparing fixed and polymorphic IAPs (Fig 5C and Fig K in S1 Text), the ITP identified only three features that had lower means throughout the fixed IAP integration regions–SINEs, DNase I hypersensitive sites, and recombination hotspots. However, striking differences were observed for 12 genomic features with localized landscape. For instance, fixed IAPs revealed strong signatures of depressed means within ±4 kb from the IntS for features such as dinucleotide and trinucleotide microsatellites, most conserved elements, unmethylated CpGs, and the histone mark H3K27me3. Introns and two histone marks (i.e. H3K4me1 and H3K27ac) were also underrepresented in a larger area around the IntS. Moreover, we observed overrepresentation of LINEs and H3K9me3 (the mark of heterochromatic regions) surrounding the IntS in fixed vs. polymorphic IAPs. Exon content was underrepresented in a few windows downstream of the IntS, while AT count was overrepresented in most windows of both flanks except at the IntS. Single FLRs for fixed vs. polymorphic IAPs identified H3K9me3 as a very strong functional predictor (explained deviance of 21.5%, Fig 7B and Table C in S1 Text). All other features, singularly, could explain only a very low portion of the deviance (<2%), and even their concurrent effect was low, producing a multiple FLR model with an explained deviance of 3.8% (Table 8).
Flanking sequences of fixed HERV-Ks were characterized by lower recombination rates and later replication timing than controls (Table B in S1 Text and Fig 2C). Based on the ITP (Fig 6A), microsatellites (of all four types), LINEs, recombination hotspots, L1 target sites, AT count, and H3K9me3 marks were overrepresented, while G-quadruplex repeats, SINEs, replication origins, cytosine methylation level features (i.e. CpG, CHG, and CHH), and four histone marks (H3K36me3, H3K4me1, H3K27ac, and H3K9ac) were underrepresented, throughout these flanking sequences. Exons, introns, and most conserved elements were underrepresented as well, especially near the IntS. Single FLR fits revealed that all four microsatellites types (scalar predictors), introns, and H1-hESC transcript expression, considered as functional predictors, each individually explained 26–76% of the deviance. These features were excluded from the final multiple FLR model (Fig 7C and Table C in S1 Text), which explained 79% of the deviance (Table 9). The model included LINEs and recombination hotspots as positive scalar predictors (RCDE of 13% and 9.7%, respectively). In terms of functional predictors, L1 target sites was the strongest predictor with a positive effect in the whole region from -30 to 30 kb, and stronger away from the IntS (RCDE 27.1%). In addition, AT count had a negative effect stronger away from the IntS (RCDE 12.6%), hypomethylation in testis and CpG islands had positive and negative effects, respectively, for the whole integration region (both RCDE of ~5%), and G-quadruplex repeats had a negative effect near and upstream of the IntS (RCDE 1.9%).
To highlight HERV-Ks’ integration preferences, we contrasted flanking sequences of in vitro HERV-Ks vs. control regions. The former replicated later and were more distant from the centromere than the latter (Table B in S1 Text and Fig 2C). The ITP indicated (Fig 6B) that G-quadruplex repeats, L1 target sites, recombination hotspots, replication origins, CHH methylation, and four histone marks associated with active transcription or promoters (H3K36me3, H3K4me1, H3K27ac, and H3K9ac) were underrepresented, while LINEs, AT count, and the H3K9me3 mark were overrepresented, throughout the flanking sequences compared to control regions. Additionally, the ITP indicated that there were fewer CHG methylated sites near the IntS of in vitro HERV-Ks than in control regions. Other genomic features had significant differences further away from the IntS (e.g., SINEs, TSS ENCODE, H1-hESC transcript expression, and H3K27me3), or showed more complex localized behaviors (Fig 6B and Fig M in S1 Text). The multiple FLR model for in vitro HERV-Ks vs controls explained 24.4% of the deviance (Table 10).
To focus on HERV-K fixation preferences, we contrasted flanking sequences of fixed vs. in vitro HERV-Ks. Recombination rates, replication timing, and distance to centromere were significantly different, with lower means for fixed compared to in vitro HERV-Ks (Table B in S1 Text and Fig 2C). The ITP (Fig 6C and Fig N in S1 Text) indicated that microsatellites, L1 target sites, and recombination hotspots were overrepresented throughout the flanking sequences of fixed vs. in vitro HERV-Ks. Introns and most conserved elements were underrepresented throughout the whole region too, but the difference was stronger near the IntS. Exons were underrepresented ±2 kb next to the IntS. Moreover, three histone marks (H3K4me3, H3K27ac, and H3K9ac) were underrepresented upstream of the IntS, while the H3K9me3 mark was overrepresented downstream of the IntS. CpG and CHG methylated sites, as well as H3K36me3, were generally underrepresented but had more complex localization patterns. Single FLRs identified six positive scalar predictors–all four microsatellites types, recombination hotspots, and L1 target sites–each of which, individually, explained as much as 85% of the deviance (Fig 7C and Table C in S1 Text).
Here, to investigate patterns in the distributions of ERV elements, we studied a number of genomic features in the flanking sequences of human and mouse ERVs with FDA, a class of statistical techniques still relatively underused in the field of genomics. In particular, we contrasted features between the flanking regions of fixed ERVs vs. controls, polymorphic (or in vitro) ERVs vs. controls, and fixed vs. polymorphic ERVs. The first type of contrast reflects both fixation and integration preferences with the latter somewhat eroded by selection, the second type highlights integration preferences with minimal influence of selection–particularly in the case of in vitro ERVs, while the third type captures fixation preferences. We observed that various genomic features are overrepresented or underrepresented in some but not other contrasts (Fig 7). Moreover, while some features present significant differences over the whole flanking regions considered, others present localized differences–especially close to the IntS. Below we discuss our findings, relating them to biological processes proxied by the various genomic features under analysis (Table 2).
We explored the associations between ERVs and a diverse set of non-B DNA conformation predictors inferred from the primary DNA sequence of the human and mouse genomes. In vivo, such conformations are formed transiently during recombination, repair, transcription, and replication, frequently causing genomic instability  and were found to be associated with the presence of DNA transposons [44,46]. We observed that mirror repeats and A-phased repeats are overrepresented in the flanking regions of fixed ETns, as well as of fixed and polymorphic IAPs, as compared with control regions (A-phased repeats are also overrepresented in the vicinity of fixed HERV-Ks). The overrepresentation of these repeats in the flanking regions of both fixed and polymorphic IAPs suggests their role in ERV integration–the lack of significant overrepresentation for polymorphic ETns perhaps being due to a more limited statistical power, given the smaller sample size (Table 1). A subset of mirror repeats–triplex repeats–are thought to bind mismatch and nucleotide excision repair proteins , therefore we propose that these protein complexes might be recognized by the integrase machinery and trigger ERV integration. This hypothesis needs to be evaluated experimentally. Mirror repeats have also been associated with low gene expression levels . In agreement with this we found ERVs enriched in regions with low levels of transcription (see below). A-phased repeats cause double helix bends that have been implicated in nucleosome assembly and expansion of trinucleotide microsatellites [65,80] and might be important for the recognition of IntSs by the ERV integrase, as suggested by retroviral studies [81–84]. Moreover, unlike mirror repeats that do not have base composition bias, A-phased repeats are adenine-rich , resonating with the effect of A/T nucleotides (see below).
G-quadruplex and Z-DNA repeats displayed negative associations with the ERVs. G-quadruplex repeats are underrepresented in fixed vs. control and in vitro (or polymorphic) vs. control contrasts for both HERV-Ks and IAPs and thus likely inhibit ERV integration. Z-DNA repeats might inhibit ERV integration as well, because they are underrepresented in the flanking regions of both fixed and polymorphic IAPs. Importantly, these two types of repeats appear to be inhibitive of ERV integration beyond their GC-rich composition  because in several of our models they appear as significant predictors of ERV distributions together with AT-content (Tables (Tables33 and and9).9). Both G-quadruplex and Z-DNA repeats are enriched in promoters and in the 5’ and 3’ gene termini , and therefore we cannot exclude the possibility that purifying selection removes ERVs from such regions.
We observed a strong overrepresentation of all four types of microsatellites in the fixed and polymorphic mouse ERVs compared to controls, suggesting the importance of microsatellites for ERV integration. Many microsatellites form non-canonical DNA structures–e.g., (AG)n repeats form triplexes, (AT)n form four-stranded cruciforms, while (CA)n and (GC)n form Z-DNA [85,86]–which lead to genome instability  and may be used by the integrases to recognize potential IntSs. In the human genome, we found an enrichment of microsatellites of all four types for fixed HERV-Ks compared with controls, and for fixed vs. in vitro HERV-Ks. Therefore, microsatellites might not be directly relevant to HERV-K integration, and instead be more relevant for their fixation (caution should be exercised though when comparing the results from in vitro vs. control and polymorphic vs. control contrasts, as the latter are more influenced by selection). The flanking regions of in vitro HERV-Ks actually had an underrepresentation of mononucleotide microsatellites, potentially because such microsatellites are frequently present as (A/T)n repeats located at the 3’ ends of retrotransposed genes–where HERV-K integrations might be selected against. Additionally, (A/T)n repeats are found in Alus that are also underrepresented in the in vitro HERV-K flanking regions (see below).
Corroborating previous studies [24,32], we observed that both mouse and human ERVs integrate and become fixed in AT-rich genomic regions. Indeed, AT-content was a significant predictor in eight out of nine ITP contrasts (except for in vitro HERV-K). Moreover, L1 target sites  are overrepresented in the flanking regions of polymorphic and fixed mouse ERVs as compared to controls, suggesting that these sequences correlate with mouse ERV integration. L1 target sites are also overrepresented in the flanking regions of fixed HERV-Ks vs. controls, and of fixed vs. in vitro HERV-Ks, but are underrepresented in the flanking sequences of in vitro HERV-Ks vs. controls, thus suggesting that these sequences correlate with HERV-K fixation. The enrichment of L1 target sites in the vicinity of ERVs can be explained by the high AT-content of these sequences, and thus it might simply reflect the AT-richness of the genomic regions in which ERVs integrate or are fixed, or perhaps it also indicates the enrichment of LINEs in these flanking regions (see below). Both AT nucleotides and L1 target sites have positive effects as single predictors, however, in the context of multiple FLR, when the two predictors are considered jointly, we observed a negative effect of AT nucleotides. Indeed, given the high correlation of the two predictors (Figs B and C in S1 Text), the positive effect shown by AT nucleotides when considered on their own is covered by a statistically dominant effect of L1 target sites in the joint analysis.
Mouse and human ERVs tend to be present in genomic regions rich in LINEs, but depleted of SINEs. This trend is significant for the fixed vs. control and polymorphic (or in vitro) vs. control contrasts, but not for the fixed vs. polymorphic (or in vitro) contrasts, arguing for a link with ERV integration. An association of ERVs with LINEs might reflect a preference towards integration in AT-rich sequences for both types of TEs. SINEs, in contrast, accumulate in GC-rich regions of the genome [12,89]. Additionally, a strong positive relationship between LINEs and ERVs, as evidenced by ITP and FLR, could also be explained by H3K9me3 histone marks known to be located in regions rich in these TEs  (see below). An important consideration in our analysis is that control regions are almost completely depleted of ERVs (see Methods), as we excluded even older ERVs than the ones we studied here from our control regions; this might underestimate the influence of such ERVs in integration preferences of the studied ERVs.
In our study, ERV flanking regions had a late replicating profile. More specifically, the flanking regions of fixed ERVs replicated later than those of polymorphic ERVs, and the flanking regions of polymorphic ERVs replicated later than control regions. This was true for all three types of ERVs studied. Moreover, the flanking regions of both fixed and in vitro HERV-Ks presented a low content of replication origins–also a signature of late replication . We hypothesize that the ERV integrase machinery targets late replicating regions because they are AT-rich  or that ERV integration might be coordinated with DNA replication, similar to Tf1 retrotransposon integration at stalled replication forks  and as proposed for DNA transposons .
Recombination appears to be important for both integration and fixation of ERVs. The flanking sequences of fixed ERVs have lower recombination rates than those of polymorphic (or in vitro) ERVs and of controls (Fig 2), suggesting a signature of fixation preference. This observation is in concordance with the hypothesis that ERVs are removed from highly recombining regions via ectopic recombination [6,31,94]. Alternatively, drift might fix ERVs in low recombining regions of the genome where selection is weaker. Katzourakis and colleagues  did not find a correlation between HERV fixation and recombination rates, but the discrepancy between our results and theirs might be due to the different HERV families investigated in the two studies.
We also observed that polymorphic and fixed mouse ERVs are located in genomic neighborhoods with higher content of recombination hotspots than controls, suggesting a role of recombination hotspots in mouse ERV integration. The overrepresentation of hotspots right next to the IntS for polymorphic ETns further supports this observation. In human, Myers and colleagues  detected overrepresentation of two retrovirus-like elements (THE1A and THE1B) in regions enriched with recombination hotspots. Moreover, for DNA transposons, it has been proposed that recombination hotspots are required by the transposition mechanism , and perhaps a similar interaction is essential for ERVs. In contrast, the comparison of fixed vs. polymorphic mouse ERVs indicated an underrepresentation of recombination hotspots suggesting that their high concentration prevents fixation. HERV-Ks presented an almost opposite pattern: recombination hotspots were overrepresented when contrasting fixed HERV-Ks vs. controls and fixed vs. in vitro HERV-Ks, and were underrepresented when contrasting in vitro HERV-Ks vs. controls, arguing for an association between recombination hotspots and HERV-K fixation but not integration. Note that we found experimentally validated recombination hotspots  to be significant in mouse, and predicted ones  to be significant in human, potentially explaining some differences in the results (experimentally validated recombination hotspots in human [68,69]) were not significant in our models).
We found that fixed mouse ERVs are located closer to centromeres than polymorphic mouse ERVs, and that the latter are located closer to centromeres than control regions. The preferential location of ERVs next to centromeres might be explained by their integration in AT-rich regions and by their fixation in regions with low recombination rates (see above). Indeed, recombination rates and GC-content, which are highly correlated with each other, are markedly decreased near centromeres . In contrast, only small differences in the distance to centromere were observed among fixed HERV-Ks, in vitro HERV-Ks, and control regions.
Both methylated and unmethylated CpG sites were underrepresented in the flanking regions of fixed and polymorphic mouse ERVs compared to controls, reflecting integration of these elements in AT-rich areas of the genome. Interestingly, hypomethylated CpGs representing contiguous domains of low methylation measured in sperm (mean size of 1.8 kb) , are overrepresented in the flanking regions of fixed HERV-Ks compared to controls and to the flanking regions of in vitro HERV-Ks, suggesting a link with HERV-K fixation. Hypomethylated CpGs overlap with promoters and other regulatory regions  and thus selection might not tolerate ERVs in such areas of the genome.
All three ERV families studied tend to occur in regions with low DNase I hypersensitive sites content, i.e. in areas with closed chromatin. However, our results are conflicting as to whether this is an integration or fixation preference. Indeed, the signal for HERV-Ks is generally weak, in contrast to Brady and colleagues  who reported integration close to DNase I cleavage sites for in vitro HERV-K (i.e. HERV-Kcon). There is no significant signal near the IntS in the polymorphic ETns vs. controls contrast, and the signal in the fixed vs. polymorphic ETns comparison is also weak (Fig 4). For IAPs, the signal is stronger in the fixed and polymorphic vs. control contrasts than in the fixed vs. polymorphic contrast, arguing for a link with integration. Integration in areas with closed chromatin was previously observed for several retroviruses (e.g., MLV and HIV), and it was proposed that the nucleosomal DNA is targeted for integration by retroviral integrases [81,96].
As reported previously for retroviruses , histone marks are important predictors of ERV distributions. Overall, our results corroborate previous findings (e.g., ) suggesting that ERVs integrate in areas of repressed chromatin. Consistent with previous studies [38,90,99–101], we observed an underrepresentation of histone marks associated with transcribed chromatin (H3K36me3 and H3K9ac), promoters (H3K4me3), and enhancers (H3K4me1 and H3K27ac; ) in the flanking regions of fixed and polymorphic ERVs. In agreement with this observation, we observed an overrepresentation of the H3K9me3 mark specific to repressed chromatin [102,103] in the flanking regions of both fixed and polymorphic elements. Surprisingly, and at odds with other studies (e.g., investigating youngest ETns and IAPs , and ERV-Ls ), we observed an underrepresentation of the H3K27me3 mark associated with repressed chromatin in the flanking regions of ERVs. What can explain the opposite results we obtained for H3K9me3 and H3K27me3, two marks of repressed chromatin which are one enriched and the other one depleted in the flanking regions of ERVs? While both marks signal repressed chromatin in early embryonic development, they are found in different regions of the genome . On the one hand, the H3K9me3 mark is associated with heterochromatin formation due to the presence of tandem repeats , and thus its enrichment in the flanking regions of ERVs supports the strong association we found between ERVs and microsatellites. On the other hand, the H3K27me3 mark is abundant in gene-rich regions , and thus its depletion in the flanking regions of ERVs might reflect purifying selection acting against ERV integration in or around genes. Also, our results might be specific to the ESCs–as they agree with those of Hiratani and colleagues  who also studied ESCs. Some differences between our study and that of Brady and colleagues  might be explained by the use of histone modification data from different cells (Brady and colleagues utilized the data generated for CD4+ cells ). Finally, some of the differences among studies may be due to differences in the protocols used to construct control data sets.
Our results concerning ERVs fixation preferences (from the fixed vs. polymorphic, or in vitro, contrasts) with respect to histone marks suggest that ETns are fixed in areas rich in enhancer marks (H3K4me1 and H3K27ac) and this signal is localized at the IntS, while IAPs show an opposite trend (depletion of these marks) that is less localized. ETns bind transcription factors  and therefore could act as enhancers attracting H3K27ac, however we would like to see our observation of an association between ETns and enhancer marks validated in subsequent studies including histone marks from different ESC.
In our analysis, exons, introns, and most conserved elements were underrepresented in the flanking regions of fixed and polymorphic (or in vitro) ERVs vs. controls, as well as in the flanking regions of fixed vs. polymorphic (or in vitro) ERVs. This may be evidence of purifying selection acting against integration and fixation of ERVs in areas of the genome rich in genes and most conserved elements. The signal was frequently weaker in the polymorphic (or in vitro) vs. control contrast, and localized close to the IntS in the fixed vs. polymorphic contrast. Our observation that mouse polymorphic ERVs are more prevalent in gene-rich regions than mouse fixed ERVs corroborated the findings of Zhang and colleagues . Our results for fixed ERVs are also in agreement with the results of Medstrand and colleagues  who observed that older ERVs are underrepresented within 5 kb from genes. A selective purge from areas of the genome that are rich in genes and most conserved elements appears to be a characteristic of most TEs [24,32,36,45,46]. Alus, preferentially integrating within genes [42,45], appear to be an exception. We also observed that ERVs are more prevalent in areas with low levels of transcription–likely reflecting both an integration preference for repressed chromatin and a fixation preference for gene-poor neighborhoods. This result contrasts observations made by Brady and colleagues . These differences are probably due to the strict selection of control regions in our analyses and the different methods of analysis.
Some of the genomic features analyzed here were included in previous studies of ERV distributions, and our results are largely consistent with such studies [97,108–110]. However, our analysis included the most comprehensive list of genomic features to date, with many features not considered in previous studies (e.g., replication, most conserved elements, diverse DNA conformation features, L1 target sites, and recombination hotspots for human and mouse). Moreover, our use of FDA techniques allowed us to effectively investigate localization and scale at which the effects of these features unfold.
As a result, we are in a position to propose a comprehensive model for the integration and fixation preferences of the mouse and human ERVs considered in our study (Fig 8). ERVs integrate in regions of the genome with high AT-content, enriched in A-phased repeats (as well as mirror repeats for mouse ERVs) and microsatellites–the former possessing and the latter frequently presenting non-canonical DNA structure. This highlights the potential importance of unusual DNA bendability in ERV integration, in agreement with previous studies [96,111]. Interestingly, some non-canonical DNA structures–e.g., G-quadruplexes and Z-DNA–appear to inhibit ERV integration (with Z-DNA avoidance strictly localized at the IntS site for polymorphic IAPs). ERV integration regions are rich in LINEs, replicate late, and have low density of replication origins. Replication hotspots might assist integration of mouse ERVs (with the signal localized close to the IntS for ETns). Our results on histone marks and DNase I hypersensitive sites suggest that integration occurs in gene-poor regions with closed chromatin–mimicking the behavior of retroviruses, whose integrases were proposed to target nucleosomal DNA [81,96]. Our observations also indicate that ETns and IAPs differ locally in their sensitivity to certain histone marks associated with promoters or expression and open chromatin, with the flanking regions of IAPs completely depleted of these features. However, for ETns these features are not significant next to the IntS.
The ERVs in our study are preferentially fixed in areas of the genome where both genes and most conserved elements are scarce (the signal is frequently stronger next to the IntS). There is also a strong preference of ERVs to be fixed in areas with low recombination rates, likely because in such areas they have a lower probability of being removed via ectopic recombination. The initial integration occurs preferentially relatively close to the centromere, likely because of increased AT-richness, and fixed ERVs are located even closer to the centromere, likely because of reduced recombination rates. ETns also show a fixation preference for areas rich in enhancer histone marks (alternatively they might induce the formation of these marks), and this signal is localized at the IntS.
Utilizing data on the distributions of fixed, polymorphic, and in vitro ERVs, and comparing their flanking regions with parts of the genome devoid of ERVs (controls) and with each other, we were able to separate, at least partially, the integration and fixation preferences of these elements. Ideally, one would like to utilize all three groups of flanking regions plus controls to study each element family. However, we were only in possession of data on polymorphic elements in the mouse genome, and of data on in vitro elements in the human genome. Because of this, observed patterns may not be entirely comparable; in particular, polymorphic mouse ERVs may experience more purifying selection than in vitro human ERVs. Moreover, one drawback of using in vitro integrations is that these are produced in cell lines using ERV constructs.
The complexity of our study of ERV biology reflects the intricacy of the genome and of the mechanisms affecting the integration and fixation of these elements. We found evidence of the existence of genomic features associated with both insertion sites and the ability of elements to be fixed in the genome. Our analysis leveraged the availability of data on polymorphic ERVs and in vitro integrations, which we could contrast with fixed elements; we suggest applying a similar strategy to future studies of TE distribution. We expect further refinements of our model as more genomic data become available, e.g., nuclear lamina interactions and Hi-C profiles data  providing additional information on chromatin conformations that may influence DNA accessibility for new integrations. Finally, the FDA statistical methodology we utilized, in particular the ITP and FLR techniques, can be employed to analyze a variety of genomic data. These techniques are very versatile; the ITP does not require assumptions on the distributions underlying the data, and both the ITP and the FLR allow one to effectively characterize location and scale of a phenomenon, linking detected effects to specific intervals. We thus expect them to substantially contribute to numerous applications in genomics research.
We retrieved ETn and IAP elements fixed (1,868 and 5,064, respectively) in mouse strain C57BL/6J, and those polymorphic (248 and 2,224, respectively) for C57BL/6J with respect to strains A/J, DBA/2J and 129X1/SvJ (reference genome mm9)  (Table 1). We assume ERVs shared among four mice strains to be fixed, even though these elements are transpositionally active and some of them might be absent in mouse strains not analyzed here. In addition, we chose 1,036 full-length and solo-LTR HERV-K elements from the human genome (hg19)  (Table 1). This data set includes all known polymorphic elements (a total of five at the time of writing this manuscript)  but, because they are also expected to be at least a few hundred years old , we analyzed them together and refer to them all as fixed HERV-Ks. All elements were manually re-annotated, to avoid errors from automatic annotations in the genomes. The data sets were filtered to include only the elements in the 60 bp to 11 kb length range (Table A in S1 Text). We also obtained the IntS of in vitro HERV-K integrations in human embryonic kidney (293T, female) and fibrosarcoma (HT1080, male) cell lines from  (Table 1). Out of the 1,565 integrations reported in this study, we located 1,208 by alignment of the IntS sequence to hg19 using BLAT . Only matches with 100% identity over the total length of the IntS sequence were considered. In addition, IntS sequences that matched multiple locations (a total of 31 cases with integration site sequences <100 bp) were discarded because we could not identify their integration sites definitively.
For each element, we considered flanking sequences spanning 32-kb upstream and 32-kb downstream of the element, so that we could maximize the number of ERVs analyzed without overlapping of the flanking regions. This 64-kb region was then divided into 64 1-kb windows on which to quantify genomic features (Fig 1A). We eliminated elements for which the 32-kb flanking sequences overlapped by over 320-bp (1%) with genome gaps (i.e. sequences with Ns) or by over 1-bp with other elements’ flanking regions. Similarly, we generated 64-kb control regions, selecting areas that overlapped by at most 2% (1%) with mouse (human) LTR elements (as annotated in the corresponding genome at the UCSC Table browser ), or with the 32-kb flanking sequences of our ERVs (Table 1). After this filtering, we were left with 1,866 ETns (242 polymorphic and 1,624 fixed), 5,950 IAPs (1,986 polymorphic and 3,964 fixed), 826 fixed HERV-Ks and 1,065 in vitro HERV-Ks, which represent approximately four fifths of the elements analyzed in the original publications, respectively (Table 1). For ERVs that were annotated in the reverse strand (i.e. negative orientation with respect to the chromosome orientation), we inverted the 64-kb regions to consistently evaluate the effects of the genomic features with respect to the ERVs. Most of the functional datasets used in this study lack data for the Y chromosome, therefore we excluded the ERVs on this chromosome from our analyses.
We selected a set of genomic features associated with ERVs by prior studies, or found to be significant in other TE experiments. The data were obtained from the UCSC Table Browser , ENCODE , and previous publications (Table 2); when necessary the lift-over tool  was used to convert genome locations to mm9 and hg19. In total, we quantified 38 mouse and 39 human high-resolution genomic features, derived from to 39 mouse and 40 human datasets (Table 2) in 1-kb windows along the flanking sequences of our elements and our control regions (Fig 1A). Genome-wide microsatellite features were extracted from mouse and human genomes requiring a minimum of 9, 5, 4 and 4 motif repeats to define mono-, di-, tri-, and tetra-nucleotide microsatellites, respectively . Mono-, di-, tri-, and tetranucleotide microsatellites were analyzed separately due to their unique genome distribution and mutation rates [118,119]. High-resolution recombination rates (in mouse), as well as methylation levels and expression levels, were quantified as weighted averages (average of rates or levels weighted by the total number of base pairs in a window) in each 1-kb window. All other features were quantified as coverages (fraction of the genomic window covered by the feature) in each 1-kb window, except L1 target sites, AT and CG nucleotides for which we used counts.
In order to reduce multicollinearity in downstream analyses, we applied a hierarchical clustering with Spearman’s rank correlation (distance = 1-|correlation|) and complete linkage , and selected one feature from each cluster above a 80% threshold in mouse and in human (Figs B and C in S1 Text). This reduced the set of features under investigation to 35 and 36 (35 and 37 datasets) in mouse and human, respectively (Figs D and E in S1 Text).
In addition to these high-resolution features, we selected two features that were available at low resolution in human and mouse (recombination rates and replication timing) and we considered their average value as a single value for each 64-kb region. Replication timing was measured as the log2 of early/late S-phase populations of cells in culture ; therefore regions that replicate earlier present high replication timing “values”. Moreover, distances to telomere and centromere (measured considering the element or the center of the control region) were included in the study.
For each high-resolution genomic feature, we considered curves described by the whole 64-kb signal as the object of our study, embedding the problem in the framework of Functional Data Analysis (FDA) [48,49]. Importantly, FDA allowed us to naturally incorporate in the analysis the consecutive ordering of the measurements of each feature along the genome. The analysis was divided in two main phases (see Fig 1A). First, we considered each individual genomic feature separately, in order to assess whether it had significant discriminatory power in a given comparison (e.g. when contrasting ERV flanking sequences vs. controls; Fig 1B). We performed this phase using an extended version of the Interval Testing Procedure for functional data (ITP) recently introduced by . ITP is able not only to assess whether differences exist between the distributions of the curves in, e.g., ERV flanking sequences vs. control regions, but also to anchor statistically significant differences to specific locations (1-kb windows) or sub-intervals in the regions, and to specific scales (e.g., a specific 5-kb sub-interval near the integration site, comprising a difference that is registered as significant up to a scale of 30-kb). Also, in any given comparison (Fig 1B), low-resolution features (recombination rates, replication timing, distance to telomere and distance to centromere) were tested using the simple univariate non parametric test employed by the first step of the ITP on each basis component (see below for more on the ITP). In the second phase we evaluated individual features fitting single Functional Logistic Regressions (FLR), and dealt with multiple predictors simultaneously by means of multiple FLR models (see Fig 1C) [48,49]. Combining the results obtained with these FDA methods we were able to perform an extensive, genome-wide evaluation of the effects of genomic landscape on integration and fixation of the various ERV families considered in this study (Fig 1C).
In order to capture the influence of different genomic features on integration and/or fixation of ERVs, we applied the ITP and FLR to nine pair-wise comparisons (Fig 1B): (1) ETn polymorphic vs. control, (2) ETn fixed vs. controls, (3) ETn fixed vs. polymorphic, (4) IAP polymorphic vs. controls, (5) IAP fixed vs. controls, (6) IAP fixed vs. polymorphic, (7) in vitro HERV-Ks vs. controls, (8) fixed HERV-Ks vs. controls and (9) fixed vs. in vitro HERV-Ks. Only non-overlapping regions were employed in these comparisons, leading to the sample sizes indicated in Table 1.
Let XF,R(w) w = 1,…,64 be the signal (measured as count, coverage or weighted average) related to the genomic feature F in the 64-kb region R–this will be one of two types of regions; e.g., the flanking sequences of an ERV element or a control region. We considered this signal as 64 equidistant pointwise measurements of the curve xF,R(t) within the interval [-31.5; 31.5] surrounding the IntS of an ERV, or the center of a control region (see Figs D and E in S1 Text). In order to determine which are the features that best distinguish between the two types of regions, we employed a functional hypothesis test on the differences between the curves distributions in the two groups. Many FDA methods exist to deal with this inferential problem globally, considering the distribution of the whole curve at the same time , or locally, considering each component of the curve separately. Since we were interested both in global and local significance, we used the ITP , which allowed us to simultaneously test for global differences between the two distributions and to impute observed differences to specific subregions. Because of its non-parametric (permutational) nature, this test does not require assumptions on the distributions underlying the data; this too made it highly attractive for our application, since many of the genomic features under consideration have signals that appear to follow complex, non-regular distributions. Importantly, we also introduced some extensions to the original ITP of . In its extended version, the procedure allows us to (i) use multiple test statistics–capturing distinct aspects of the curves distributions–in assessing differences, and (ii) evaluate multiple scales by changing the maximum length of the intervals on which the global test is performed. Below we give a brief description of the ITP approach (full detail can be found in ).
Let x1,i(t) i = 1,…,n1 and x2,i(t) i = 1,…,n2 be two random samples from two independent stochastic curves. In our application, these are the curves xF,R(t) related to a feature F, in the two groups that we want to compare (e.g., flanking sequences of ERV vs. control regions). We consider the problem of testing the null hypothesis that the distributions of the two stochastic curves are equal, versus the alternative hypothesis that the two distributions differ. The first step of the ITP consists in decomposing the observed curves on a suitable reduced basis [48,49], i.e. represent each curve xg,i(t) as the set of coefficients obtained in the expansion
Then, in the second step, a univariate permutation test  on each basis component c(k) is performed to assess whether its distribution is significantly different for the two stochastic curves (or underlying “population” of curves). Under the null hypothesis that the distributions are the same, all permutations of the n1+n2 “observed” coefficient values have the same chance to occur. Hence, we can compute the empirical null distribution of a test statistics considering the values it assumes on all the different permutations; the p-value is generated dividing the number of permutations with a test statistics more extreme than the one observed on the data by the total number of permutations. The third step is to perform analogous multivariate permutation tests on each possible set of contiguous components (i.e. on the interval components c(1)-c(2), c(2)-c(3), …, c(1)-c(2)-c(3),…). Here we employ the Nonparametric Combination Procedure developed by Pesarin and Salmaso , that allow us to implement a multivariate test by combining univariate test statistics obtained with synchronized permutations. In detail, the same permutations of the n1+n2 values are applied to all components k = 1,…, K. Finally, in order to control the familywise error rate over sets of contiguous components, the fourth step computes an adjusted p-value for each component as the maximum among the p-values of all tests whose null hypothesis includes the component under consideration. Such a strategy exploits the ordered nature of the basis components creating a multiple testing correction that is coherent with the structure of data, and allows us to impute observed differences to specific subregions.
We chose to represent the curves using B-splines of order 1 (piecewise constant basis) with 65 or 33 nodes in the interval [-31.5; 31.5] and thus retain all the variability of the genomic features observed at 1-kb resolution (Figs D and E in S1 Text). In this way, we were able to test directly the raw data (65 nodes choice, one value every 1-kb) as well as a piecewise constant smoothed version of the raw data (33 nodes, one value every two 1-kb windows) without introducing substantial biases in the ITP. Notably, even if we are considering order 1 B-splines to represent the curves, the functional test statistics of the ITP (e.g. the mean curves) as computed on the data are rather smooth and their distributions do not show marked discontinuities (Figs F-N in S1 Text). The test p-values were computed considering 10,000 random permutations.
In its extended implementation, the ITP can be performed using different test statistics, each highlighting a particular aspect of the curves distributions. In particular, we considered three test statistics: 1) the sample mean difference, 2) the sample median difference, and 3) the sample variance ratio. In addition to introducing different test statistics, we modified the ITP so that differences between the distributions of the two stochastic curves are evaluated at all possible resolutions; that is, we corrected the p-values controlling the familywise error rate on intervals of all possible maximum length. In this way, we were able not only to evaluate the importance of a genomic feature in characterizing ERVs (globally or at specific locations) when considering the whole 64-kb flanking region, but also to establish if the feature of interest is able to differentiate between events and controls only at a smaller scale.
All genomic features found significant (p-value<0.05), considering at least one test statistics in the ITP, were included in the Functional Logistic Regression (FLR) analysis phase (see Fig 1C and below). In particular, features that showed significant differences between the two groups independently of scales and locations (invariant differential landscape features–IDL) were represented through their mean values across the 64 1-kb windows and treated as scalar predictors in the FLR. On the other hand, features that were significant only at a particular scale and/or for specific locations, considering at least one test statistics (localized differential landscape features–LDL) were treated as functional predictors. Genomic features that were significant on the whole 64-kb region but showed stronger differences (e.g., in means) in a particular location were also treated as functional predictors.
To test for significant differences (e.g., between ERVs flanking sequences and controls, or other comparisons) in recombination rate, replication timing, distance from telomere and distance from centromere (low-resolution features), we employed an univariate version of the ITP described above, where a single value is considered for each region. Similarly to the ITP, we tested the null hypothesis that the feature under study has the same distribution in elements and controls (or other comparisons), versus the alternative hypothesis that the two distributions differ, by means of a univariate permutation test. We considered again three test statistics: 1) the sample mean difference, 2) the sample median difference, and 3) the sample variance ratio. For each, we computed the test p-value estimating the distribution of the test statistics under the null hypothesis with 10,000 random permutations. The features that resulted significant (p-value<0.05) with respect to at least one test statistics were included in the FLR as scalar predictors (see Fig 1C and below).
The second phase of our analysis consisted in fitting single and multiple FLR models [48,49] using as potential predictors the genomic features selected with the ITP (Fig 1C). The goal of single FLR fits was to identify predictors with very strong predictive power in each comparison. After setting aside these major predictors, whose strength may obscure the role of other features, the goal of multiple FLR was to relate the discrimination (e.g., between the flanking sequences of ERVs and control regions) with all remaining features simultaneously.
The details are as follows. For each comparison, we generated a binary response Y encoding events and controls as “1” and “0”, respectively, and considered the functional model
where p represents the probability of an event conditionally to the observed features. Here ZFi represent the scalar predictors, i.e. the features that emerged as important for all scales and locations (IDL features) and the low-resolution features selected through the univariate test, while xFi(t) represent the functional predictors (LDL features) with support intervals IFi, respectively. In detail, the signal XF,R(w) w = 1,…,64 in region R for a genomic feature F that was significant independently of scales and locations in the ITP, was summarized by its sample mean over the 64 windows
and then included in the FLR model as scalar predictor. Conversely, the curve xF,R(t) with pointwise evaluations XF,R(w) w = 1,…,64 corresponding to a genomic feature F that was significant exclusively in the interval IF was included in the model as functional predictor in the selected interval. For each functional predictor, we expanded predictor curve and related coefficient function on the same reduced basis , i.e.
This basis (with its specific support) was chosen separately for each functional predictor, in order to minimize the dimension KF while capturing the scale that emerged as significant in the ITP.
To arrive to a meaningful FLR model (Fig 1C) we first considered each scalar predictor ZF alone and fitted the single logistic regression model:
where pF represents the probability of an event conditionally to the observed feature F. Whenever the distribution of the scalar predictor was skewed, we regularized it with a shifted logarithmic transformation and fit the model with the transformed data. More specifically, we took logs after adding a positive shift parameter s chosen to simultaneously maximize the p-values of the Shapiro-Wilk normality test on the transformed scalar signal in both events and controls. We computed the proportion of deviance explained (DE) by the model, or pseudo R-squared, as where Dnull is the null deviance and Dmodel the model residual deviance . This measure revealed the discriminatory strength of each individual scalar predictor (e.g., to distinguish between flanking sequences of ERVs and control regions). Similarly, we fitted a single functional logistic regression model for each individual functional predictor xF(t):
where pF represents again the probability of an event conditionally to the observed feature F. Similar to scalar predictors, functional predictors were log-transformed as needed, and their strength was measured by means of the DE. Scalar or functional predictors that were very strong (DE > 20%) were noted and interpreted (see Results and Discussion), but not included in the multiple FLR–as their inclusion would obscure subtler contributions and co-significance of other, weaker predictors.
Next, we chose the best subset among the remaining scalar predictors using a multiple logistic regression model with LASSO regularization after predictors standardization . The optimal regularization coefficient was chosen as the one maximizing the mean 10-fold cross validation misclassification rate. After identifying the best subset of scalar predictors, say , we fitted the standard additive multiple logistic model (without LASSO penalization)
We then augmented this scalar model attempting to add the remaining functional predictors one at a time in a stepwise forward fashion (i.e. adding each time the functional predictor inducing the biggest gain in terms of DE), as long as the DE increased more than 1% and the AIC decreased. Finally, we excluded scalar predictors that may have been rendered non-significant by the introduction of the functional predictors in a stepwise backward fashion–leading to our final multiple FLR model for the comparison under consideration.
The importance of each predictor (scalar or functional) in the final multiple FLR model was measured by its relative contribution to the deviance explained (RCDE), defined as where Dnull is the null deviance, Dmodel the multiple FLR model residual deviance, and Dred model the residual deviance of the multiple FLR model obtained by removing the predictor of interest .
All data manipulations were performed with in-house scripts and Galaxy tools (e.g. lift-over, make windows, assign weighted-average, and feature coverage). Statistical analyses were performed in the R environment using the packages fda.usc , car , glmnet , pheatmap  and a modified version of the functions in fdatest .
We thank Simone Vantini and Piercesare Secchi for their insight and help with the use of FDA techniques; Dixie Mager, Ross Hardison, and Erika Kvikstad for their comments on the manuscript; Samarth Rangavittal for computational advice; and Mary Poss and Dixie Mager for their guidance and several discussions concerning ERVs.
This work was supported by the National Science Foundation (grant number DBI-0965596 to KDM); the National Institute of General Medical Sciences (grant number GM087472 to KDM and Kristin Eckert); the Penn State Clinical and Translational Science Institute and the Pennsylvania Department of Health using Tobacco Settlement Funds (the Department specifically disclaims responsibility for any analyses, interpretations or conclusions). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
All relevant data are within the paper and its Supporting Information files.