The numerous discovered cases of domesticated transposable element (TE) proteins led to the recognition that TEs are a significant source of evolutionary innovation. However, much less is known about the reverse process, whether and to what degree the evolution of TEs is influenced by the genome of their hosts. We addressed this issue by searching for cases of incorporation of host genes into the sequence of TEs and examined the systems-level properties of these genes using the Saccharomyces cerevisiae and Drosophila melanogaster genomes. We identified 51 cases where the evolutionary scenario was the incorporation of a host gene fragment into a TE consensus sequence, and we show that both the yeast and fly homologues of the incorporated protein sequences have central positions in the cellular networks. An analysis of selective pressure (Ka/Ks ratio) detected significant selection in 37% of the cases. Recent research on retrovirus-host interactions shows that virus proteins preferentially target hubs of the host interaction networks enabling them to take over the host cell using only a few proteins. We propose that TEs face a similar evolutionary pressure to evolve proteins with high interacting capacities and take some of the necessary protein domains directly from their hosts.
In recent years, the traditional view that transposable elements (TEs) are only a burden to their host organism has shifted. Although their parasitic nature is not questioned, the discoveries that many proteins originate from TEs and that TEs contributed to the invention of several key cellular machineries of multicellular organisms highlighted their significance in evolutionary innovations (1–3). The best known cases of TE domestication include the RAG protein of the immune system of vertebrates (4), CENP-B protein of centromeres (5–7), light sensing in plants (8), regulation of telomere length (9) or developmental regulation [PAX6 gene, (10)]. Although numerous cases of domestication have already been identified in model organisms, their real number remains unclear, as estimates range from thousands to dozens even in the well-characterized human genome [see (11) versus (12,13)]. Besides providing the raw material for novel genes, TEs also contributed to regulation and generation of allelic diversity: 25% of promoters in human genes contain TE sequences (14), whereas the activity of Helitrons, the eukaryotic rolling circle transposons [reviewed in (15)] and MuDR (MULE) transposons (16) resulted in the sometimes massive amplification of functional genes in their hosts (17–20). The ability of DNA transposons to mobilize fragments of DNA has even resulted in the development of highly efficient vectors (e.g. Sleeping Beauty transposon) for gene transfer (21).
Naturally occurring gene capturing has been well studied in the maize and rice genomes, where it occurs at a high rate, involves only particular repeat types like Helitrons or MuDR repeats and is a major force shaping the genome. During gene capturing, fragments or entire genes are incorporated into the transposon, and the subsequent amplification of the repeat results in a high copy number of the gene as well. However, Helitrons are unusual among other TEs in their ability to mobilize adjacent DNA [also the prokaryotic relatives of Helitrons are known to mobilize DNA fragments, including antibiotic resistance genes and virulence factors (22)], and even in Helitrons, the captured gene fragments only rarely contribute to the evolution of the transposon sequence itself (17).
As the global sequencing effort completed the genomes of most classic model organisms, attention has turned towards several other eukaryotic genomes, either owing to their phylogenetic importance [i.e. (23)], or owing to being important for a narrower research community (24). Besides providing key insights into genome organization and function, these studies have also revealed a large diversity of TEs that previously had not been appreciated: repeat classes that had been thought to be extinct in mammals were found [like DNA transposons in bats, (25)], entirely novel classes of repeats were discovered, e.g. Polintons (26,27), and the diversity of known families was greatly expanded (28,29).
Here, we investigate the incorporation of host genes into TE sequences, but in a narrower sense than was reported for Helitrons or MULEs: we focus only on those cases that ‘made it’ to the consensus sequence of the transposon and thus could influence its evolution. The incorporation of a protein domain into a TE has been described mostly in cases where it resulted in the emergence of a novel repeat type; in non-LTR repeats, the acquisition of an endonuclease, RNase H domain and an ORF1 protein in the early history of R2–R4-like repeats resulted in the emergence of their currently most common families [L1, I, Jockey, CR1; (30)]. Incorporation of genes has also been documented in LTR retrotransposons, where the (multiple) transitions from the transposon state to a viral state were enabled by the acquisition of envelope proteins, usually from other viruses (31). In addition, the acquisition of a small number of proteins with unclear function by LTR elements has been described (32), and among DNA transposons, the acquisition of the helicase domain in Helitrons has been reported (15).
The main objective of this article is to search for cases of protein incorporation into TEs and test the systems-level properties of these proteins, i.e. whether the incorporated sequences originate from a random selection of host proteins, or TEs selectively incorporate genes with distinct properties within the cellular networks. Currently, cellular networks are well characterized only in a small number of model organisms; therefore, we use budding yeast (Saccharomyces cerevisiae) and the fruitfly (Drosophila melanogaster), as yeast is the eukaryote with the most-understood interactome, whereas among multicellular organisms, the Drosophila interactome is especially well characterized. Our results indicate that (i) the acquisition of host proteins by TEs is not a rare exceptional event but is relatively common; (ii) the incorporated genes participate in significantly more protein–protein interactions (PPIs) than expected by chance; (iii) are central within the interaction networks (i.e. have high betweenness and closeness centrality); and (iv) a considerable fraction of them are subject to selection, thus contribute to the evolution of the TEs.
We searched 4848 verified yeast ORF sequences and 13 909 Drosophila genes against RepBase (v. 15.12), the main database of eukaryotic TEs. Only consensus sequences were used from RepBase to exclude dubious TEs from the analysis. In the case of Drosophila genes, we used only the core region of each gene, i.e. the regions that are present in all known alternative splicing products. This was necessary because TEs can occasionally be incorporated into transcripts by alternative splicing, and the functionality of such splice products is uncertain (33). The sequences of eukaryotic TEs were downloaded from RepBase [http://www.girinst.org, v. 15.12, (34)]. The sequences of yeast proteins were downloaded from the Saccharomyces Genome Database (http://www.yeastgenome.org); Drosophila genes were downloaded from FlyBase (http://flybase.org).
The DNA sequences of TEs were translated in all six frames, and the sequence comparisons between yeast and Drosophila proteins and the translated TE sequences were made with the jackhmmer tool of the hmmer package (35), with a bit score cutoff of 27. In the homologous sequence fragments of TE, yeast and Drosophila proteins, we identified conserved domains using the Pfam database (36) v. 24 (http://pfam.sanger.ac.uk) with hmmscan. To decide whether the homology between a yeast protein and a TE sequence represents transposon domestication or the reverse process, the incorporation of a protein fragment into a TE, we implemented the following automated protocol. First, we built the taxonomic tree of ~180 000 taxa, using known phylogenetic relationships from the Pfam database (ncbi_taxonomy table). Next, we screened the Uniprot database (http://www.uniprot.org) with the sequence fragment from yeast or fly from a homologous pair of TE and yeast/Drosophila proteins, using jackhmmer (bit score threshold 27), excluding all matches of viral or TE origin. Using the species of the Swissprot hits, we identified a subtree on the global tree; in the cases where this resulted in no hits, we used Uniprot hits. We repeated the same procedure for the repeat fragment, using the 6-frame translated RepBase as the sequence database, and then compared the two trees (Figure 1B). The tree with the broader taxonomic span (i.e. the one that contains the other) points to the source of the sequence (TE domestication versus protein incorporation into a TE). This method performs well in the case of protein incorporation events, where the phylogenetic spread of a protein domain is typically wide, but its homologue is present only in a small number of TEs; however, it is less reliable in the case of ancient TE domestications: in the case of common domains, e.g. zinc fingers that are widespread both in TEs and host proteins, the sequence exchange events happened too long ago to reliably identify their source taxon, and also several independent domestication events may not be distinguished from each other, leading to a very broad phylogenetic distribution of such proteins. In the phylogenetic analysis of the activity-regulated cytoskeletal-associated protein (Arc) gene, multiple alignments were made with muscle (37), and the maximum likelihood phylogenetic tree (1000 bootstrap replications) was built with MEGA5 (38).
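The comparison of taxonomic spans described above can be illustrated with a minimal Python sketch. It is not the in-house implementation: it assumes that every hit is represented by its root-to-leaf lineage, that the subtree spanned by a set of hits is identified by the path to their last common ancestor, and that one subtree contains the other when its last common ancestor lies on the other's path. The function names and the toy lineages are hypothetical.

```python
from typing import List, Sequence


def common_ancestor_path(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the shared root-to-LCA path of a set of root-to-leaf lineages."""
    path: List[str] = []
    if not lineages:
        return path
    # Walk down from the root while all lineages agree on the taxon name.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            path.append(ranks[0])
        else:
            break
    return path


def infer_direction(host_lineages: Sequence[Sequence[str]],
                    te_lineages: Sequence[Sequence[str]]) -> str:
    """Assumed decision rule: the broader (containing) subtree is the source."""
    host_path = common_ancestor_path(host_lineages)
    te_path = common_ancestor_path(te_lineages)
    # A shorter path means an LCA closer to the root, i.e. a broader span.
    if len(host_path) < len(te_path) and te_path[:len(host_path)] == host_path:
        return "protein incorporation into a TE"
    if len(te_path) < len(host_path) and host_path[:len(te_path)] == te_path:
        return "TE domestication"
    return "undetermined"


# Toy example with hypothetical lineages.
host_hits = [
    ["Eukaryota", "Fungi", "Saccharomyces"],
    ["Eukaryota", "Metazoa", "Drosophila"],
    ["Eukaryota", "Viridiplantae", "Zea"],
]
te_hits = [
    ["Eukaryota", "Metazoa", "Danio"],
    ["Eukaryota", "Metazoa", "Drosophila"],
]
print(infer_direction(host_hits, te_hits))
```

In this toy case the host homologues span all eukaryotes whereas the TE homologues are confined to Metazoa, so the sketch reports incorporation into a TE; equally broad spans are returned as undetermined.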
We used Monte Carlo simulations to determine whether the number of genetic and protein interactions (degree), betweenness centrality and closeness centrality of the gene fragments that were incorporated into TEs are significantly different from the random expectation. First, we took 100 000 random samples of genes without replacement from the yeast and Drosophila data sets and determined the parameter of interest, e.g. the median node degree of the proteins in the sample. Next, we determined the number of random samples with a median equal to or higher than the one observed in the proteins homologous with TEs. Significance (p) was determined as p = (n + 1)/(N + 1), where n is the number of random samples with medians equal to or higher than in the observed sample, and N is the total number of random samples. All analyses were carried out with perl scripts developed in-house. Betweenness centrality and closeness centrality were calculated with Pajek, a program for the analysis of large networks (http://pajek.imfm.si); in the analyses of yeast interactions, only those interactions were used where at least one of the interacting partners is a verified ORF.
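The resampling procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the in-house perl code: each random sample is assumed to have the same size as the observed gene set, and the function name, the default number of samples and the toy degree values are hypothetical.

```python
import random
from statistics import median
from typing import Mapping, Sequence


def empirical_p_value(values: Mapping[str, float],
                      observed_genes: Sequence[str],
                      n_samples: int = 10_000,
                      seed: int = 0) -> float:
    """Monte Carlo p-value, p = (n + 1) / (N + 1), for the sample median."""
    rng = random.Random(seed)
    population = list(values)
    sample_size = len(observed_genes)  # assumption: same size as observed set
    observed_median = median(values[g] for g in observed_genes)

    n = 0  # random samples whose median is equal to or higher than observed
    for _ in range(n_samples):
        # random.sample draws without replacement
        sample = rng.sample(population, sample_size)
        if median(values[g] for g in sample) >= observed_median:
            n += 1
    return (n + 1) / (n_samples + 1)


# Toy example: 100 hypothetical genes whose degree equals their index.
toy_degree = {f"g{i}": float(i) for i in range(100)}
hubs = ["g95", "g96", "g97", "g98", "g99"]
print(empirical_p_value(toy_degree, hubs, n_samples=2000, seed=1))
```

Adding 1 to both the numerator and the denominator prevents a reported p-value of exactly zero, so the smallest attainable value is 1/(N + 1).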
We tested for significant selection in the incorporated proteins with two methods. Wherever the incorporated protein fragment was present in more than one TE family, we compared the two closest homologous families to decide whether their Ka/Ks ratio is significantly different than 1. We identified Ka and Ks values with PAML (39) and tested whether their ratio significantly differs from one with a likelihood ratio test: we fixed the Ka/Ks ratio at 1, and fitted a similar maximum-likelihood model to the alignment of the two sequences (40). The difference between the log-likelihoods of the two models was tested with Chi-square tests, to test whether the null model assuming neutral evolution (Ka/Ks = 1) in the TE protein performs significantly worse than the one where Ka and Ks could vary independently. In those cases where additional TE homologs could not be found, we applied a different (and less powerful) procedure; we searched the NCBI ref_mrna database with the captured fragment of the TE protein with tblastn. Using the best match to the TE fragment, we searched NCBI ref_mrna again, and using the results of the two searches identified an outgroup sequence, which is at least as distant (in terms of bit score) both to the TE fragment and its best homologue as they are to each other. These three sequences were used to construct an unrooted phylogenetic tree, and we identified the separate evolutionary rates for all of its branches with PAML. We tested whether the Ka/Ks ratio of the TE branch is significantly different from 1, similarly as for two homologous TEs, with a likelihood ratio test: we fixed the Ka/Ks ratio of the TE branch of the tree at 1 and fitted a similar maximum-likelihood model to the alignment of the three sequences, and the difference between two models was used to test whether the null model with Ka/Ks = 1 in the TE branch of the tree performs significantly worse than the one where the evolutionary rate was allowed to vary.
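The likelihood ratio test used in both procedures can be illustrated with a short Python sketch. It assumes that the two log-likelihoods have already been obtained (e.g. from PAML runs with the Ka/Ks ratio fixed at 1 and estimated freely) and that the test has one degree of freedom because the null model fixes a single parameter; the text does not state the degrees of freedom, so this is an assumption. The function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """Chi-square (1 d.f., assumed) p-value for a likelihood ratio test.

    lnl_null: log-likelihood of the model with Ka/Ks fixed at 1
    lnl_alt:  log-likelihood of the model with Ka/Ks estimated freely
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    # Guard against tiny negative values caused by optimizer noise.
    statistic = max(statistic, 0.0)
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods for one TE alignment.
p = lrt_p_value(lnl_null=-1234.6, lnl_alt=-1230.1)
print(f"LRT p-value: {p:.4f}")
```

A significant result only indicates that Ka/Ks differs from 1; whether this reflects purifying (Ka/Ks < 1) or adaptive (Ka/Ks > 1) selection follows from the estimated ratio itself.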
We identified 38 yeast proteins and 145 Drosophila proteins that show significant similarity to mobile elements (Figure 1, Supplementary Table S1, Supplementary UCSC Genome Browser track). By comparing the taxonomic distribution of the homologues of each protein with the taxonomic distribution of its corresponding TE sequence (see Methods and Figure 1B), we determined whether the matches are likely to be the result of domestication or capturing of a protein fragment by a TE. We identified only six cases that represent a domestication of a TE sequence, whereas the phylogenies of 108 genes support the protein incorporation scenario (Supplementary Table S1, see also Supplementary Figure S1A–C for examples). In 67 cases, the homologous sequences are so widespread both in the hosts and among TEs that our method could not infer whether the sequence was originating from a host or a TE, and in two cases, the phylogenies within TEs and hosts show no relationship (thus, horizontal transfer may be involved).
A particularly interesting case of domestication is the Arc gene of Drosophila (FBgn0033926). Arc genes in mammals have received considerable attention in recent years because they are key regulators of synaptic plasticity required for normal brain functioning and long-term memory formation (41,42). In deuterostomes, they are present only in tetrapods and contain a domesticated fragment of a gag protein from a Gypsy retrotransposon (43). In Drosophila, the Arc genes are also expressed in neurons and regulate behavioral responses to stress (44), although unlike in mammals, they do not influence synaptic plasticity. Drosophila Arc genes also contain a domesticated fragment of a gag protein from an (insect) Gypsy retrotransposon and show homology to mammalian Arc genes (Figure 2). However, the absence of Arc-like genes in protostomes other than insects (and in deuterostomes other than tetrapods) together with their phylogeny (Figure 2) suggests that the gag proteins of Gypsy retrotransposons were recruited twice independently, and in both cases, the resulting ‘Arc’ genes gained functions in the neural system and can be seen as an example of ‘convergent domestication’.
Most proteins do not operate in isolation but interact with other proteins and form multi-protein complexes, which perform a particular cellular function. Using PPIs from the BioGRID (v. 3.1.83) database for yeast (45) and from the FlyBase database (46) for Drosophila, we tested with Monte Carlo simulations whether the incorporated genes have distinct positions in the protein interaction network, i.e. whether the median number of protein interactions (degree) of the captured genes, their betweenness centrality (the fraction of shortest paths of the network that pass through a particular node) and closeness centrality (the inverse of the mean distance between node v and all other nodes reachable from it) are significantly higher than expected by chance. We found that the incorporated genes that are present in the PPI databases have significantly higher degree and centrality measures than expected by chance (Figures 3C and 4) both in yeast and Drosophila. To rule out any detection biases caused by the phylogenetic distance between Drosophila/yeast and the host species of TEs, we determined the divergence time between Drosophila/yeast and the TE hosts with the TimeTree application (47) and tested whether the network characteristics of incorporated proteins are positively correlated with divergence. None of the network parameters showed a significant correlation (Supplementary Figure S2).
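Two of these network measures, degree and closeness centrality, can be made concrete with a small Python sketch on an undirected, unweighted toy network. It follows the definition of closeness given above (the inverse of the mean distance to all reachable nodes); Pajek may normalize the value differently, and the node names and helper functions are hypothetical. Betweenness centrality is omitted for brevity.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from a list of interactions."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b:  # ignore self-interactions
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree(graph: Dict[str, Set[str]], node: str) -> int:
    """Number of distinct interaction partners of a node."""
    return len(graph[node])


def closeness(graph: Dict[str, Set[str]], node: str) -> float:
    """Inverse of the mean shortest-path distance to all reachable nodes."""
    distance = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    reachable = len(distance) - 1
    if reachable == 0:
        return 0.0  # isolated node: closeness is defined here as 0
    return reachable / sum(distance.values())


# Toy star-shaped network: one hub with three partners.
ppi = build_graph([("hub", "a"), ("hub", "b"), ("hub", "c")])
print(degree(ppi, "hub"), closeness(ppi, "hub"), closeness(ppi, "a"))
```

In the toy network the hub reaches every other protein in one step and therefore has the maximal closeness of 1, whereas each peripheral protein needs two steps to reach the other two.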
An important question is whether the incorporated protein fragments/domains are themselves highly interacting or merely come from highly interacting proteins. Currently, this can be tested only indirectly because data on direct domain–domain interactions are still limited and have been compiled only for proteins present in the PDB database (48), and thus represent only a small subset of all protein interactions. We used the 3did database (49), which contains 6260 interactions between 4302 Pfam domains of PDB entries, to test whether the Pfam domains found in the incorporated sequences interact with more domains than would be expected from randomly selected ones. First, we searched for the presence of conserved protein domains in the protein fragments homologous to a TE using the Pfam database. Altogether, we identified 149 conserved domains in the yeast and fly genes homologous to a TE (Supplementary Tables S2 and S3), of which 74 are found in the protein incorporation cases. These represent only 37 different domains though, as frequently similar domains are incorporated into different TEs. From the 37 different Pfam domains of the incorporated proteins, 27 are present in the 3did database. Using Monte Carlo simulations, we found that the mean number of interactions of these domains with other Pfam domains is significantly higher (3.44, P = 0.023) than expected by chance (2.14 ± 0.56), supporting the hypothesis that TEs pick up domains with a higher number of interacting partners than the average.
Besides physical interactions, genetic interactions provide an alternative means to depict functional connections between genes. A genetic interaction is defined as the difference between the fitness effect of a double gene deletion mutant and the expected multiplicative effect of the two individual deletions. For example, an extreme case is a synthetic lethal interaction, a lethal double mutant phenotype where the individual deletion mutants of the two genes are both viable. Recently, large-scale genetic interaction maps have become available for yeast [e.g. (50)], which enabled the characterization of the entire functional landscape of the yeast cell. Genetic interaction data for Drosophila are much less abundant and are available only for a small fraction (12%) of genes. Similarly to PPIs, we used genetic interactions deposited in the BioGRID and FlyBase databases and tested whether the incorporated genes have higher node degree, betweenness and closeness centrality in the genetic interaction network as well. We found that node degree and closeness centrality of the incorporated genes are significantly higher than the random expectation only in Drosophila, whereas betweenness centrality is not significantly different from the random expectation either in yeast or in Drosophila (Figure 3B and C).
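The definition of a genetic interaction given above can be written as a one-line score, sketched here in Python. Fitness values are assumed to be expressed relative to the wild type (wild type = 1.0); the function name and the example values are hypothetical.

```python
def genetic_interaction_score(w_a: float, w_b: float, w_ab: float) -> float:
    """Observed double-mutant fitness minus the multiplicative expectation.

    w_a, w_b: relative fitness of the two single deletion mutants
    w_ab:     relative fitness of the double deletion mutant
    """
    return w_ab - w_a * w_b


# No interaction: the double mutant matches the multiplicative expectation.
print(genetic_interaction_score(0.5, 0.5, 0.25))
# Synthetic lethality: both single mutants are viable, the double mutant is not.
print(genetic_interaction_score(1.0, 1.0, 0.0))
```

Negative scores correspond to aggravating interactions such as synthetic lethality, whereas positive scores indicate that the double mutant is fitter than expected.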
Finally, using Monte Carlo simulations, we tested whether the incorporated proteins interact more frequently with each other in the host cellular network than random proteins and whether they have related functions. Although the number of direct PPIs is low in both species, we found that their number is still much higher than expected between randomly chosen proteins (Figure 3), indicating that TEs are under selection to incorporate genes with particular functions. To test this, we estimated the enrichment of GO terms in them with GOrilla (51); however, the incorporated genes cannot be assigned to a single GO term either in yeast or in Drosophila. The most enriched (P < 10^−5) molecular function terms are metal ion binding in Drosophila and cofactor binding and glyceraldehyde-3-phosphate dehydrogenase activity in yeast (Supplementary Table S4).
Randomly inserting or deleting a sequence into a protein can lead to the loss of its function, and thus an important question concerning the TE proteins that captured a DNA fragment of their host is whether such proteins remained functional. Although domain rearrangements are relatively common in the evolution of genes (52), and recent studies show that mid-domain breaks can also result in functional proteins (53), it is not possible to prove in silico that a particular chimeric protein is or was functional (i.e. enzymatically active). Nevertheless, several lines of evidence indicate that many of these TE proteins contribute to the fitness of the TEs.
A functional protein is likely to be subject to purifying or adaptive selection. We tested for signatures of selection in the captured gene fragments of those TE proteins in which the captured sequence was at least 50 amino acid residues long (Supplementary Table S5). We used two methods (see Materials and Methods for details): whenever possible, we compared the two closest homologous TE families sharing the same incorporated protein fragment, and in the remaining cases, we built an unrooted phylogenetic tree with the TE sequence, the closest RefSeq protein and an outgroup, and tested for selection on the TE branch of the tree. There is very little or no difference between the incorporated gene fragment and its closest RefSeq homolog (Ks < 0.01) in nine cases; thus, these sequences could not be tested reliably. In a further 19 cases, the TE sequence and the closest RefSeq or RepBase homologs are so highly diverged (Ks > 10) that the saturation of synonymous substitutions also makes any tests of selection unreliable. From the 24 cases of protein incorporation where 0.01 < Ks < 10, we could detect significant selection in nine cases [P < 0.05, likelihood ratio test, (40)], with one case indicating adaptive evolution (hAT-N22_DR) and the remaining eight indicating purifying selection (Supplementary Table S5).
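The triage of cases described above can be summarized in a short Python sketch. The Ks bounds and the significance level are taken from the text; the treatment of values lying exactly on a bound, the function name and the example numbers are assumptions made for illustration.

```python
def classify_case(ks: float, ka_ks: float, p_value: float,
                  alpha: float = 0.05) -> str:
    """Classify one incorporation case by Ks range and test outcome."""
    if ks <= 0.01:
        # Too few synonymous differences to test reliably.
        return "untestable: too similar"
    if ks >= 10:
        # Synonymous substitutions are saturated.
        return "untestable: saturated"
    if p_value >= alpha:
        return "no significant selection"
    # Significant deviation from Ka/Ks = 1: the direction gives the type.
    return "adaptive evolution" if ka_ks > 1 else "purifying selection"


# Hypothetical cases.
print(classify_case(ks=0.005, ka_ks=0.3, p_value=0.001))
print(classify_case(ks=0.8, ka_ks=0.2, p_value=0.003))
print(classify_case(ks=0.8, ka_ks=2.5, p_value=0.01))
```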
The incorporation of a new domain may result in a protein that contains conflicting molecular features, for example, the presence of both extracellular and nuclear domains within the same protein. To exclude those cases where the domain composition/structure already indicates a dysfunctional protein, we tested all TE proteins with a captured gene fragment with MisPred, a pipeline designed to detect mispredicted and abnormal proteins based on such conflicts (54). In the majority of the proteins with incorporated gene fragments, MisPred found no signs of abnormality, except five cases: BEL1-I_SM and Helitron-N3_ZM contain a truncated Pfam domain owing to mid-domain breaks, whereas the respective proteins in Gypsy-1-I_MI, Gypsy-40_Mad-I and SRV_MM-int contain transmembrane helices (55), which is unexpected in the case of TEs, as they are not part of membranes. In the case of Gypsy-1-I_MI, the transmembrane helix is not in the incorporated fragment; thus, it may be a misannotation or may have an unknown function. (We found additional four cases of proteins with transmembrane helices in TEs where the origin of the domain was unclear or owing to domestication).
Finally to investigate how the captured gene fragments influence the function of the TE proteins, we predicted the 3D structure of each chimeric protein that carries a fragment of a non-TE gene and is shorter than 500 amino acids (13 cases) with I-TASSER (56,57). Owing to the lack of sufficiently good templates in the Protein Data Bank (www.pdb.org), only three of the proteins have sufficiently high quality models (estimated TM-score ≥ 0.5) that also the positioning of the individual domains is likely to be correct, and therefore their function can be predicted with high confidence (Figure 5). These models were analysed with COFACTOR, a part of the I-TASSER pipeline that predicts the function and catalytic centers of proteins using their tertiary structure. Although in the three structures, the incorporated protein fragment provides a different functionality (oxidoreductase activity, methyltransferase activity - RNA binding, DNA binding; see Figure 5), the sequence of the incorporated fragment overlaps with the predicted binding sites of the proteins (Figure 5), further supporting the hypothesis that capturing host genes contributed to the emergence of new functions.
Overall, our findings suggest that not only TE proteins contribute to the evolution of their hosts but the reverse process of protein ‘junkification’ might be also significant in explaining the origin and the diversity of the TE sequences. The central position of the homologues of the incorporated genes both in the yeast and Drosophila PPI networks suggests that TEs acquire genes with particular characteristics and function, and that these sequences are either not picked up randomly by the repeats or are not retained randomly in the repeats. Different TEs show variability in their insertion preferences; non-LTR retrotransposons, which do not move horizontally or are active in hosts with compact genomes, typically target gene-poor, AT-rich regions [i.e. L1s or Alus in the human genome, (12)] or heterochromatin [i.e. Ty5 retrotransposons in the yeast genome, (58)], most likely to minimize their deleterious effect on host fitness. In contrast, genomic parasites capable of horizontal transfer like DNA transposons, LTR retrotransposons and retroviruses either show little target selectivity, or preferentially insert near actively transcribed genes (59,60), which, in consequence, might be more likely incorporated into a TE. A test of this hypothesis would be if genes with homologues to different repeat classes would show different patterns, i.e. genes with homology to an LTR retrotransposon would be hubs of PPI networks, whereas genes with homology to non-LTR retrotransposons would not; however, the number of genes with homology to non-LTR retroelements is too low to allow a meaningful comparison.
An alternative hypothesis is that TEs incorporate gene fragments randomly, but only a small fraction of sequences (those with a high number of interactions and high centrality) remain in the TE consensus owing to selection. A number of observations support this hypothesis. First, in the majority of the TEs, the captured protein fragment resides within the predicted genes of the TE and not in the intergenic regions of the repeat, and has a similar strand orientation to the nearest TE protein (Supplementary UCSC Browser track); moreover, in a significant fraction of the cases where there has been enough time to accumulate nucleotide differences between the host gene and the incorporated sequence (Ks > 0.01), significant selection could be detected (Supplementary Table S5). Second, all repeats in our analysis are consensus sequences that are present in multiple copies in their host genome; thus, the incorporation of foreign sequences clearly did not make these repeats dysfunctional. Third, research on virus-host interactions indicates that incorporation of proteins with a high degree of interactions and centrality may be beneficial for the TEs. Recently, Calderwood et al. (61) demonstrated that the proteins of Epstein–Barr virus preferentially interact with hubs of human protein interaction networks, and this pattern was subsequently confirmed for many other viruses and even parasitic bacteria (62). The most likely cause of the preferential interaction with highly connected proteins is that targeting hubs of the host's cellular network is the most efficient way of using its resources, i.e. diverting its pathways to the use of the parasite, especially if the parasite has only a few proteins to achieve this task.
As autonomous TEs encode only few proteins (frequently only one), the efficient use of the host resources may favor the evolution of multifunctional proteins with abilities to interact with several pathways of the host interactome, and the simplest way to achieve this is to acquire such protein domains directly from the host. In addition, retroviruses and retrotransposons were shown to interact with overlapping sets of proteins (63), further corroborating this hypothesis.
As the identified cases of gene capturing did not happen in yeast or Drosophila, the question arises to what extent the yeast and Drosophila homologues of the captured proteins have similar properties to the captured ones. We used these species to investigate the systems-level characteristics of the incorporated proteins because they have much better characterized interactomes than most other model organisms, and the fact that we observe a similar pattern in two very distantly related species, and also in domain–domain interactions, provides strong support for the generality of our results. Recent findings indicate that PPIs are highly conserved and evolve three orders of magnitude slower than protein sequences themselves (64), which explains the qualitatively similar results in the two species. However, it is unclear how far genetic interactions are conserved across species. Such comparisons are challenging owing to technological differences between model organisms (e.g. RNAi is used for gene knockdown in multicellular organisms, whereas in-frame deletion is used in yeast), and also the number of genes for which information is available is very different [see (65) for review]. The lack of a significant effect in yeast and the low number of genes with known genetic interactions in fly indicate that any conclusions on genetic interactions cannot be readily generalized at this point.
Supplementary Data are available at NAR Online: Supplementary Tables 1–5, Supplementary Figures 1 and 2, Supplementary UCSC Genome Browser track.
Funding for open access charge: Hungarian Scientific Research Fund [PD83571 to G.A., NK77978 to A.Sz. and PD75261 to P.B.]; International Human Frontier Science Program Organization (to B.P.); ‘Lendület Program’ of the Hungarian Academy of Sciences (to B.P.); NSF Career Award [DBI 0746198 to Y.Z.]; National Institute of General Medical Sciences [GM083107 and GM084222 to Y.Z.].
Conflict of interest statement. None declared.
The authors thank Csaba Pál for useful comments and suggestions.