We used a set of conserved, and by extension functional, elements in the human genome that were defined using a phylogenetic hidden Markov model on a multi-species alignment of 29 mammalian genomes. This method identifies regions of the genome with strong cross-species conservation based on a depletion in substitutions. Regions with strong cross-species conservation show evidence of being under strong negative selection and are not explained by low mutation rates. While some regions of the genome have faster or slower mutation rates in extant populations, the locations of these regions are not consistent throughout mammalian lineages and therefore have a minimal effect when looking for regions of the genome exhibiting cross-species conservation. Using cross-species conservation to elucidate functional regions will miss functional elements that are not under strong selection, are lineage specific, or undergo rapid turnover. Once such elements are counted, the total number of bases under selection may be two to three times what is detected by cross-species conservation.
To create a subset of putative regulatory regions, we removed all conserved elements that overlap protein-coding exons, 3′ untranslated regions (UTRs), 5′ UTRs, or exons of currently identified non-coding RNA genes (see Methods). The resulting set of approximately 2.6 million conserved non-exonic elements (CNEEs) totals 75 Mbp, which is 2.6% of the human genome. These CNEEs are under selection but do not appear in mature transcripts. A previous study demonstrated that 50% of 437 CNEEs tested at a single time point in development acted as tissue-specific enhancer elements. CNEEs also act as repressors, insulators, matrix attachment regions, and regulators of splicing.
To understand which of these putative regulatory elements in the human genome are the result of mobile element insertions, we examined the overlap of our CNEEs with mobile element annotations generated by running RepeatMasker on the human genome (see Methods). We did not keep every CNEE that overlapped a mobile element insertion, only those with a majority of their bases annotated as originating in a SINE, LINE, LTR, or DNA transposon insertion. This resulted in a set of 284,857 conserved non-exonic elements, totaling nearly 7 Mbp of sequence, that are likely to have been exapted from mobile element insertions.
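The majority-overlap rule can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used in this study: the half-open interval representation and the names `covered_bases` and `is_exapted` are introduced here, and all intervals are assumed to lie on the same chromosome.

```python
from typing import List, Tuple

# Assumption: intervals are half-open [start, end) on a single chromosome.
Interval = Tuple[int, int]


def covered_bases(element: Interval, repeats: List[Interval]) -> int:
    """Count bases of `element` covered by at least one repeat annotation."""
    start, end = element
    # Clip each overlapping repeat to the element, then sweep left to right.
    clipped = sorted(
        (max(start, s), min(end, e)) for s, e in repeats if s < end and e > start
    )
    covered = 0
    cursor = start  # rightmost base already counted
    for s, e in clipped:
        s = max(s, cursor)  # skip bases counted by an earlier repeat
        if e > s:
            covered += e - s
            cursor = e
    return covered


def is_exapted(element: Interval, repeats: List[Interval]) -> bool:
    """True when a strict majority of the element's bases lie in repeats."""
    length = element[1] - element[0]
    return 2 * covered_bases(element, repeats) > length
```

Overlapping repeat annotations are merged implicitly by the sweep, so a base is never counted twice. An element with exactly half of its bases in repeats is not kept, which is one reading of "a majority of their bases".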
More than 11% of CNEEs in the extant human genome have been exapted from mobile element insertions. With ~280,000 exaptations and ~4.4 million mobile element fragments in the current reference assembly of the human genome, more than 6% of mobile element fragments show signs of selective constraint for a non-exonic function. Not only have mobile elements played a significant role in the evolution of gene regulation on the human lineage, but a non-negligible portion of repeat fragments in the reference genome appear to be under selective constraint.
The exaptation of mobile element classes and superfamilies.
CNEEs are Functional in Humans
To ensure that the set of CNEEs co-opted from mobile elements is still evolving under constraint in human populations, we examined the derived allele frequency spectrum of single nucleotide polymorphisms currently segregating in the Yoruban population (see Methods). The set of co-opted mobile elements exhibits a characteristically lower mean rank of derived allele frequencies. This shift is indicative of regions under current, or very recent, selective constraint, where the majority of mutations are deleterious and rarely progress to high frequencies in the population. While the set of CNEEs as a whole and the CNEEs co-opted from mobile elements both show a significant shift relative to intronic regions (Mann-Whitney U test), which act as a conservative proxy for neutrally evolving sequence, the shift is not as severe as that found in protein-coding regions. The set of CNEEs co-opted from mobile elements does not show a significant shift relative to the set of CNEEs as a whole (P ≈ 0.6) and appears to be under a similar level of constraint in present-day humans.
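As an illustration of the rank-based comparison, the sketch below computes a Mann-Whitney U statistic for two samples of derived allele frequencies. It is a minimal pure-Python version under stated assumptions: ties receive midranks, the two-sided p-value uses the normal approximation with a tie correction and no continuity correction, and the function name `mann_whitney_u` is introduced here. It is not the implementation used in this study.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        midrank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_two_sided
```

A value of U for the first sample that falls below `n1 * n2 / 2` means its values tend to rank lower, which is the direction expected for derived allele frequencies in constrained sequence compared with a neutral proxy.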
The frequency of rare derived alleles is greater in CNEEs compared to neutral sites.
The set of CNEEs exapted from mobile elements shows enrichments for functional regions identified by biochemical assays. The regions are enriched for transcription factor binding sites (NRSF, 3.1x; c-Fos, 1.8x; c-Jun, 1.7x; BATF, 1.6x; JunD, 1.6x; USF1, 1.5x; NF-E2, 1.5x; SIX5, 1.5x), clusters of DNase hypersensitivity sites (1.6x), and H3K27 acetylation (1.2x) identified by the ENCODE Consortium in human cell lines. Only 25% of the CNEEs from mobile elements were overlapped by a DNase hypersensitivity site; however, some of these regions may be functional only in a tissue, time point, or environmental condition other than those measured by ENCODE, and in some cases there may have been technical difficulties in assaying repetitive regions of the genome.
CNEEs Exapted from Mobile Elements Resemble the Set of all CNEEs
The subset of CNEEs exapted from mobile elements has a visually similar distribution of lengths to the set of non-exapted CNEEs. However, the mean length of the exapted set is less than that of the non-exapted set (25 bp and 30 bp, respectively), showing a slight bias for exaptation events to be smaller and for the distribution to have a slightly different shape (Kolmogorov-Smirnov test). This slight bias towards the exapted elements being smaller may be due to mobile elements being unable to carry very large regulatory modules, as many mobile elements are only a few hundred bases in length.
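The comparison of two length distributions can be illustrated with the two-sample Kolmogorov-Smirnov statistic, the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only the statistic, not a p-value. The function name `ks_statistic` is introduced here for illustration and is not taken from this study.

```python
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(v) - F_b(v)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    na, nb = len(a_sorted), len(b_sorted)
    i = j = 0
    d = 0.0
    # Walk through the pooled values in increasing order. Once one sample is
    # exhausted its ECDF equals 1 and the gap can only shrink, so we can stop.
    while i < na and j < nb:
        v = min(a_sorted[i], b_sorted[j])
        while i < na and a_sorted[i] == v:
            i += 1
        while j < nb and b_sorted[j] == v:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d
```

With samples as large as the CNEE sets, even a small D can be statistically significant, which is consistent with two distributions that look alike by eye yet differ under the test.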
Exapted CNEEs and non-exapted CNEEs have similar length distributions.
The rate of substitution is also visually similar for both sets. However, the exapted elements evolve at a mean of 0.30 times the neutral rate, while the non-exapted set of CNEEs evolves at 0.32 times the neutral rate (see Methods). The distribution of substitution rates has a slightly different shape for the exapted elements (Kolmogorov-Smirnov test). It is possible that this difference reflects a genuinely slower rate of evolution, but it is likely due to CNEEs under more severe constraint being closer to the consensus sequence that originally inserted and therefore easier to identify as mobile element insertions.
Exapted CNEEs and non-exapted CNEEs have similar distributions of constraint.
Previous studies used genome-wide enrichment tests to demonstrate that CNEEs exapted from mobile elements cluster near transcription factors and developmental genes, which had also been observed for the set of CNEEs as a whole. However, the similarities between the set of all CNEEs and the subset exapted from mobile elements go beyond clustering near this set of genes. The density plots of the two sets closely correlate with each other. We quantified this similarity by calculating the Pearson product-moment correlation coefficient for the changes between the two density functions. The correlation coefficient is 0.55 when comparing those CNEEs originating through exaptation to those originating by other mechanisms. This demonstrates that CNEEs created through the exaptation of mobile elements have similar locations in the genome to those CNEEs originating by other means.
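One possible reading of "the changes between the two density functions" is the bin-to-bin differences of each binned density, which are then correlated. The sketch below follows that reading. It is an assumption rather than the documented procedure, and the names `density_changes` and `pearson` are introduced here for illustration.

```python
import math
from typing import List, Sequence


def density_changes(counts: Sequence[float]) -> List[float]:
    """Differences between consecutive bins of a binned density."""
    return [counts[i + 1] - counts[i] for i in range(len(counts) - 1)]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient of two equal-length series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def density_change_correlation(exapted: Sequence[float], other: Sequence[float]) -> float:
    """Correlate the bin-to-bin changes of two densities on the same bins."""
    return pearson(density_changes(exapted), density_changes(other))
```

Correlating changes rather than raw densities asks whether the two sets rise and fall in the same genomic windows, independent of the difference in their overall abundance.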
Mobile elements co-opted as conserved non-exonic elements (CNEEs) are rarer than expected in gene deserts.
The regions of divergence between the density plots of CNEEs and the subset of CNEEs from mobile element exaptations are rare. The few deviations that do exist consistently occur in the centers of the largest gene deserts. To our knowledge, cis-regulatory elements have only been shown to act over distances of up to 1 Mb from the transcription start site (TSS) of the gene being controlled, yet thousands of CNEEs are present in the centers of these large gene deserts, over 1 Mb away from any currently known gene. It is these CNEEs that are rarely found to be exapted from LINEs, SINEs, LTRs, or DNA transposons. Only 1.7% of the CNEEs from exaptation events are more than 1 Mb from the closest TSS, versus 3.1% for non-exapted CNEEs (hypergeometric test). This observation holds for both stable gene deserts, which resist rearrangements and have more than 2% of their bases conserved between human and chicken, and variable gene deserts, where less than 2% of the sequence is conserved. The edges of gene deserts, which harbor large amounts of regulatory material for the developmental genes often found at their borders, contain a number of exapted elements that reflects the density of CNEEs as a whole, even though the centers of the gene deserts do not.
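A hypergeometric test for depletion can be sketched as follows. The setup is one plausible framing and an assumption on our part: the population is all CNEEs, the "successes" are CNEEs more than 1 Mb from the closest TSS, the draws are the exapted CNEEs, and the lower tail gives the probability of seeing so few distal exapted elements by chance. The function names are introduced here for illustration.

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_lower_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X <= k) when `draws` items are taken without replacement from a
    population of size `population` containing `successes` marked items."""
    lo = max(0, draws - (population - successes))
    hi = min(k, successes, draws)
    log_denominator = log_comb(population, draws)
    total = 0.0
    for i in range(lo, hi + 1):
        log_term = (
            log_comb(successes, i)
            + log_comb(population - successes, draws - i)
            - log_denominator
        )
        total += math.exp(log_term)
    return min(total, 1.0)
```

Working in log space keeps the binomial coefficients from overflowing at genome scale. For counts on the order of millions the tail probability may still underflow to zero in floating point, so a log p-value would be needed to report extremely small values.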
We can demonstrate how ancient this process is by explicitly dating each exaptation event. It is possible to date insertions of repetitive elements by analyzing a large multiple alignment of vertebrate species. We assign each insertion to the branch of the human lineage preceding the speciation of the most divergent species that possesses the insertion (see Methods). This method confirms that the exaptation of mobile elements as CNEEs on the human lineage is an ancient process. We detect 133 exaptation events predating the speciation of ray-finned fish from the human lineage, indicating that this mechanism has been influential for at least 450 million years. These 133 exaptation events are only identifiable as such because they have been evolving slowly enough, and are large enough, that they still provide significant alignments to the consensus of the mobile element that deposited them hundreds of millions of years ago. We also have a poor understanding of the mobile elements that were active at this time, since they are rarely active into the present day and their consensus may have changed over time. For these reasons it is likely that many of the CNEEs that were created in our early vertebrate ancestors were deposited by mobile elements, but the exapted area was too small, too quickly evolving, or from a mobile element that was inactivated too long ago for us to recognize the origins of these functional elements. Thus, the statistic of over 11% of CNEEs coming from a mobile element insertion is a lower bound on how much mobile elements have contributed to our current repertoire of gene regulation.
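The dating rule can be sketched as a lookup against an ordered list of outgroups. The species list below is a small hypothetical stand-in for the real alignment, ordered from the closest to the most divergent relative of human, and the function name `date_insertion` is introduced here for illustration.

```python
from typing import Iterable

# Assumption: a toy set of outgroups, ordered from least to most divergent
# from human. The real analysis uses a much larger vertebrate alignment.
DIVERGENCE_ORDER = [
    "chimp", "mouse", "elephant", "opossum", "chicken", "frog", "zebrafish",
]


def date_insertion(species_with_insertion: Iterable[str]) -> str:
    """Assign an insertion to the human-lineage branch that precedes the
    split from the most divergent species sharing the insertion."""
    rank = {species: i for i, species in enumerate(DIVERGENCE_ORDER)}
    present = [sp for sp in species_with_insertion if sp in rank]
    if not present:
        # No outgroup shares the insertion: it postdates every split listed.
        return "human-specific"
    oldest = max(present, key=lambda sp: rank[sp])
    return "predates human-" + oldest + " split"
```

Only the most divergent carrier matters under this rule, so an insertion missing from an intermediate species (for example through lineage-specific deletion) is still dated by its deepest occurrence.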
Using such dating methods, it was shown that the appearance of new CNEEs near different categories of genes has not been uniform during vertebrate evolution. In particular, in early vertebrate evolution, new CNEEs appeared near transcription factors and genes involved in embryonic development twice as frequently as near other types of genes, but this trend ended before the emergence of mammals. Such development-associated genes often flank large gene deserts, so based on this result one might expect an enrichment for ancient CNEEs in large gene deserts, and in particular in the middle of large gene deserts. This is what we find. This tendency for gene deserts to have more ancient CNEEs may explain the observation above that a smaller fraction of CNEEs in these regions come from exaptations of known repetitive elements. This may be due in part to our incomplete knowledge of older mobile element families, which has a disproportionate influence on our statistics in regions that are dominated by ancient CNEEs.
Ancient CNEEs are more likely to be found far from transcription start sites.
All Mobile Element Superfamilies Contribute to Regulatory Innovation
Along with analyzing the set of exaptation events as a whole, we can decompose it into subsets based on the class or superfamily of the mobile element that was exapted. All 36 superfamilies of LINEs, SINEs, LTRs, and DNA transposons in the human genome have contributed to the increase in putative regulatory material on the human lineage. These repeat superfamilies have been active at various times over the course of vertebrate evolution. The mechanism of the host genome capturing and refining regulatory elements from repeats has not been isolated to one family or one time period in history. This is a process that was happening as far back as we can currently detect mobile element insertions in the human genome.
Some mobile element superfamilies have provided more putative regulatory sequence than others. The L1 superfamily of LINEs appears to have contributed the largest number of CNEEs to the human genome. This may be expected since L1s have almost 1 million copies in the human genome and account for more than 1 out of every 6 bases. The mobile elements that contributed the greatest number of CNEEs relative to their copy number in the genome are all ancient superfamilies that have not been recently active on the human lineage. The top four superfamilies in terms of relative CNEE contribution are also the top four superfamilies in terms of the percentage of their insertions predating the ancestor of placental mammals (hypergeometric test). For ancient superfamilies, the insertions not under selection have disappeared due to neutral decay, leaving only the slowly evolving exapted copies. It is often difficult to infer the consensus sequence of a mobile element from only a handful of ancient exapted copies. As a result, these ancient exaptations are either tentatively placed in a family or annotated from another species where the repeat is still active. The latter was the case with the DeuSINE, which was found to have a near-ancestral version still active in the coelacanth. The DeuSINE was active so long ago on the human lineage that there are more CNEEs attributed to its insertions than there are insertions. The frequent occurrence of multiple conserved elements within a single DeuSINE insertion shows that with the 29 mammalian genomes we now have sufficient resolution not only to see that an insertion is evolving under purifying selection, but also to interrogate exactly which sections of the insertion are under constraint. In the case of the DeuSINE, we see that when an exaptation event happened, it often placed more than one section of the consensus under selection.
Contribution of mobile element classes, superfamilies, and families.
Contribution of mobile element classes, superfamilies, and families relative to their abundance.
We have limited statistical power to detect very recent exaptation events. As a mobile element insertion happens closer to the present day, we have fewer orthologous sequences in other species and therefore less branch length over which to notice a resistance to mutations. Many of the recently active mobile elements may be depositing functional sequence, but we will be unable to detect these exaptations. For this reason, many of the mobile elements with few exaptations per genomic instance are recently active.
Mobile Elements Carry Functional and Nearly-functional Regulatory Elements
With mobile element insertions contributing at least 47% of the extant human genome, we would expect a number of CNEEs to arise out of mobile element insertions by chance, just as can happen with neutrally evolving DNA. If this is the only process by which mobile elements create functional sequence for the host genome, then we would expect the probability of a base position in the consensus coming under selection to be directly proportional to how often that base appears in the genome. However, if a mobile element insertion harbors elements that are functional in the host, nearly-functional, or in some way preferential to the molecular machinery of the host that interacts with DNA, we would expect these bases in the consensus to be overrepresented in the exapted copies relative to the genomic background. In a previous study, we showed that for many mobile elements there is a bias as to where exaptation events happen along the consensus sequence, a finding consistent with the host co-opting functional, pre-functional, or preferential sequences carried by the mobile element.
We have detected 259 regions of consensus sequences that are more than twice as likely to be exapted as would be expected from their genomic prevalence. Each peak is based on data from at least 40 exaptation events to avoid small sample sizes. These 259 sections of consensus sequences have an average length of 11 base pairs and delineate regions in the consensus sequence that are more likely to be utilized by the host genome after insertion. To better understand the significance of evolutionary constraint repeatedly occurring in the same region of the mobile element consensus, we randomly placed the set of CNEEs throughout the genome 1000 times. During these 1000 trials only 147 peaks of 2X overrepresentation occurred by random chance, i.e. an average of 0.15 overrepresented peaks per genome. This contrasts with the 259 peaks of 2X overrepresentation we detect in the extant human genome.
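The peak criterion can be sketched on per-position coverage along a single consensus sequence. This is a simplified illustration under stated assumptions, not the procedure used in this study: `exapted_cov[i]` and `genomic_cov[i]` are hypothetical counts of exapted and genomic copies covering consensus position `i`, the expected exapted count at a position is taken to be proportional to genomic coverage, and contiguous positions passing both thresholds are merged into one peak.

```python
from typing import List, Sequence, Tuple


def find_peaks(
    exapted_cov: Sequence[int],
    genomic_cov: Sequence[int],
    fold: float = 2.0,
    min_events: int = 40,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs of consensus positions where the
    exapted coverage exceeds `fold` times its expectation and is supported
    by at least `min_events` exaptation events."""
    if len(exapted_cov) != len(genomic_cov):
        raise ValueError("coverage vectors must have the same length")
    total_exapted = sum(exapted_cov)
    total_genomic = sum(genomic_cov)
    peaks: List[Tuple[int, int]] = []
    run_start = None
    for pos, (ex, gen) in enumerate(zip(exapted_cov, genomic_cov)):
        # Expected exapted coverage if exaptation simply tracked prevalence.
        expected = total_exapted * gen / total_genomic if total_genomic else 0.0
        is_hit = ex >= min_events and expected > 0 and ex / expected > fold
        if is_hit and run_start is None:
            run_start = pos
        elif not is_hit and run_start is not None:
            peaks.append((run_start, pos))
            run_start = None
    if run_start is not None:
        peaks.append((run_start, len(exapted_cov)))
    return peaks
```

Running the same function on coverage vectors built from randomly placed CNEEs would give the kind of null peak count that the shuffling experiment reports.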
It is possible that these preferentially exapted regions of the repeat consensus contain generally useful characteristics for a section of regulatory DNA, such as high GC content, a DNA structure easily accessible for protein binding, or a general predisposition to be methylated. The alternate explanation is that the mobile element contributes a specific binding site which is then used by the host. In the case of the former, the human paralogs representing the peak will have diverged under different selective constraints and therefore share few similarities in the extant human genome. In the case of the latter, the human paralogs will have been evolving under a similar selective constraint, much as orthologs after speciation.
Just as the orthologs of a binding site conserved across species may be aligned to elucidate the preference for A, C, G, and T at various positions, the same can be done with paralogous exaptations. For each section of the consensus where exaptations preferentially occur, we used MEME on the human paralogs to define a motif common to most, or all, of the exaptations. Of the 259 peaks, 225 are defined by a motif greater than 8 base pairs in length and an E-value less than 0.01, after correction for multiple tests (see Methods). We then compared the sequence motifs from the human paralogs against known vertebrate transcription factor binding profiles (see Methods). There are 6 matches between motifs defined by paralogous exaptations in the human genome and known binding motifs for transcription factors. All 6 of these matches between paralogous motifs and TF binding motifs have a corrected p-value less than 0.01.
Paralogous instances of mobile elements show selective pressures matching transcription factor binding preferences.
An example of human paralogs defining a motif that matches a known transcription factor binding profile is the L1MC4 element, which appears to have a section of its 5′ end conserved to act as a binding site for one of the octamer transcription factors. The consensus of the L1MC4 element does not contain an octamer binding site that is then retained after insertion, but rather it contains a nearly-functional site that is a single substitution from being functional. The substitution is a CpG dinucleotide undergoing a transition to a TpG dinucleotide, which is a common substitution that happens at 12 times the normal rate of transitions. While consensus L1MC4 instances do not match the octamer binding profile upon insertion, it seems that they are poised to bind an octamer family transcription factor after a single commonly-occurring mutation that may then be driven to fixation by selection. A similar phenomenon has been shown in Alu elements, where deamination may result in p53 binding sites.
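The idea of a site that is one CpG transition away from a binding motif can be sketched as a scan over a consensus sequence. The example sequence below is hypothetical and is not the L1MC4 consensus. `ATGCAAAT` is used as a commonly cited octamer consensus, the function name `poised_sites` is introduced here, and only the forward strand of the motif is scanned.

```python
from typing import List, Tuple


def poised_sites(consensus: str, motif: str) -> List[Tuple[int, int]]:
    """Find exact motif matches that appear only after one CpG transition.

    Returns (match_start, mutated_position) pairs. Both outcomes of CpG
    deamination are tried: CG -> TG (C to T on the given strand) and
    CG -> CA (C to T on the opposite strand, seen as G to A here).
    """
    consensus = consensus.upper()
    motif = motif.upper()
    k = len(motif)
    hits: List[Tuple[int, int]] = []
    for i in range(len(consensus) - 1):
        if consensus[i:i + 2] != "CG":
            continue
        for mutated_pos, dinucleotide in ((i, "TG"), (i + 1, "CA")):
            mutant = consensus[:i] + dinucleotide + consensus[i + 2:]
            # Only windows that contain the mutated base can be new matches.
            first = max(0, mutated_pos - k + 1)
            last = min(mutated_pos, len(mutant) - k)
            for start in range(first, last + 1):
                if mutant[start:start + k] == motif:
                    hits.append((start, mutated_pos))
    return hits


# Hypothetical consensus fragment with a CpG inside a near-octamer site.
example_hits = poised_sites("GGACGCAAATCC", "ATGCAAAT")
```

A fuller treatment would score windows against a position weight matrix rather than an exact string and would also scan the reverse complement, but the exact-match version is enough to show how a single CpG to TpG change can complete a site.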
L1MC4 may be a fecund source of octamer binding sites.
Estimating the Contribution of Mobile Elements to Gene Regulatory Innovations
We have conservatively estimated a lower bound of 11% on the fraction of CNEEs deriving from mobile element insertions. A more accurate estimate is obtained by counting the CNEEs appearing on a single branch and determining how many of these CNEEs have their origins in mobile element insertions. We have chosen the branch of the human lineage following the split with marsupials (opossum) and prior to the speciation of Atlantogenata (elephant). We selected this branch because it is close enough to the present that we understand many of the mobile elements that were active at the time, but ancient enough that we can easily detect selection based on orthologous regions in other species. On this branch we calculate that ~19.2%, almost 1 in 5, of the CNEEs are the product of an exaptation event involving a mobile element. This is an increase from the ~16% that was estimated for the same branch at the time when the opossum genome was first published. To test the robustness of this estimate to the method of repeat annotation, we repeated the calculation using the Censor software package. This yielded a slightly higher, yet similar, estimate of ~19.6%. While this appears to be a robust estimate for the ~40 My of the branch, it is unclear how generalizable the contribution of transposons over this time interval is to all of human evolution. It is possible that the influx of mobile elements, the regulatory potential of mobile elements, and the rate of regulatory innovations have not been consistent through time. Large changes in these variables may lead to a non-uniform contribution of mobile elements to regulatory innovations during human evolution.