A large proportion of the genome of most eukaryotic organisms consists of highly repetitive mobile genetic elements. The sum of these elements is called the “mobilome,” which in eukaryotes is made up mostly of transposons. Transposable elements contribute to disease, evolution, and normal physiology by mediating genetic rearrangement, and through the “domestication” of transposon proteins for cellular functions. Although ‘omics studies of mobilome genomes and transcriptomes are common, technical challenges have hampered high-throughput global proteomics analyses of transposons. In a recent paper, we overcame these technical hurdles using a technique called “proteomics informed by transcriptomics” (PIT), and thus published the first unbiased global mobilome-derived proteome for any organism (using cell lines derived from the mosquito Aedes aegypti). In this commentary, we describe our methods in more detail, and summarise our major findings. We also use new genome sequencing data to show that, in many cases, the specific genomic element expressing a given protein can be identified using PIT. This proteomic technique therefore represents an important technological advance that will open new avenues of research into the role that proteins derived from transposons and other repetitive and sequence diverse genetic elements, such as endogenous retroviruses, play in health and disease.
Mobile genetic elements are DNA sequences that can move within and between genomes. In eukaryotes, transposons make up the majority of such elements, comprising between 5% (yeast; Saccharomyces cerevisiae) and 77% (frog; Rana esculenta) of an organism's genome.1 The sum of an organism's transposable elements is referred to as its mobilome. We recently reported the first high-throughput global profiling of an organism's mobilome-derived proteome.2 In this commentary, we provide a more focused description of our transposon proteomics method, and discuss which aspects of transposon biology are best studied proteomically. While our emphasis here is on transposons, our technique is equally useful for studying endogenous retroviruses and other repetitive and/or sequence-diverse elements that are not fully represented in reference genome databases.
The fact that transposable elements constitute such a large proportion of most eukaryotic genomes makes their study important for fully understanding an organism's biology. The most widely known activity of transposons is their ability to transpose and insert themselves into new positions within the genome. Class I elements replicate via a “copy and paste” mechanism in which an RNA transcript derived from the genomic transposon sequence acts as a template for complementary DNA (cDNA) production by a transposon-encoded reverse transcriptase.3-5 This cDNA copy integrates elsewhere in the genome through the action of a transposon-encoded integrase to create new copies of the element.4,5 Class II elements do not replicate via an RNA intermediate.3,6 Instead, “cut and paste” DNA transposons use transposase enzymes to excise and insert themselves elsewhere within the genome, with copies generated through DNA repair mechanisms, and during S phase if the donor, but not the acceptor, site has been replicated before transposition.3,6 Non-RNA-mediated “copy and paste” transposition mechanisms also exist.3,6 Transposons express several proteins during transposition, including enzymes and structural proteins.3-5 Some transposable elements do not encode their own proteins, hijacking the machinery of other elements instead; these include short interspersed nuclear elements (SINEs) and miniature inverted repeat transposable elements (MITEs).3 These non-autonomous elements are not detectable proteomically and will not be discussed further. Individual transposons tend to lose their ability to transpose over time, both through host defense mechanisms and through the acquisition of inactivating mutations.7-12
Transposition is biologically interesting because the insertion of transposons into host gene coding sequences or regulatory elements can generate new phenotypes. Exons or entire genes may be copied, disrupted or shuffled, new introns created, epigenetic modifications altered, and gene expression modulated.6,13,14 Large-scale chromosomal rearrangements also occur.6 Transposon activity is therefore both a driver in the evolution of new functions,6,13,14 and a contributing factor in diseases such as cancer and hereditary disorders.6,13-17
Defining the transpositionally active mobilome is challenging. Genomic studies only reveal whether a transposable element was recently active in general terms, evidenced by new genomic insertions in offspring compared to parents, or by insertion site variation between individuals or species in which elements have been active since the last common ancestor.9 Transposition in specific cells or tissues under varying conditions, however, is difficult to capture. On the other hand, RNA sequencing can detect transposon RNA in individual samples, but also picks up RNA-mediated host defenses against mobile elements that are not indicative of transposition.18-20 Reporter assays measuring the transposition of specific elements are useful for targeted studies, but do not provide a complete picture of the active mobilome and do not identify which genomic copies of an element are active. In contrast to these approaches, proteomics has the potential to provide a complete picture of mobilome activity by identifying all protein-producing transposons in a sample, many of which will be in the process of active transposition.
In addition to transposition-mediated effects, it has become evident that transposons can be “domesticated” and their genetic material co-opted for new cellular functions.1,6,21 At least 50–100 plant and mammalian proteins are known to originate from transposons.1 For example, transposase-derived genes contribute to V(D)J recombination during B- and T-cell receptor maturation, and the DNA-binding domains of several transcription factors and proteins involved in chromosome segregation also originate from transposases.1,6 Meanwhile, proteins derived from the structural gag and env proteins of long-terminal repeat (LTR) retrotransposons (class I) and endogenous retroviruses have been linked to placental development, cell proliferation, apoptosis, and antiviral defenses.1,21-23
Transposon-derived cellular genes can be distinguished from non-domesticated transposable elements by their lack of functional transposition sequences, lack of inactivating mutations, evolution under purifying selection, and single-copy coding sequences that are maintained at orthologous loci across species.1 Such genes, especially those with known functions, should in principle be annotated in reference genomes. However, identifying domesticated transposons, particularly recently domesticated ones, can be challenging in genomes containing many related and recently active transposable elements.6 Domesticated transposon-derived proteins have so far been identified either serendipitously in molecular studies of cellular and disease mechanisms, or through bioinformatic genome analyses that provide no evidence for protein production and often focus on just one type of transposon protein. Here too, unbiased proteomic experiments can help identify unrecognised cellular functions derived from mobile genetic elements by surveying the complete repertoire of transposons that demonstrably produce protein. Protein function may also be hinted at from protein expression dynamics in different contexts (e.g. cancerous versus non-cancerous cells).
Proteomics therefore has several advantages over genomics and transcriptomics in measuring global mobilome activity, and can make valuable contributions to all investigations into the numerous aspects of normal physiology and disease processes in which transposons and transposon-derived proteins play a role. The major limitation of proteomics is that it cannot definitively prove active transposition, even if all proteins from a single element are detected. On the other hand, detection of only a single protein from a given element may not discount active transposition, due to experimental limits of protein detection and the potential contribution of proteins from other transposons to transposition. Nevertheless, proteomics provides a valuable springboard into mechanistic follow-on studies, and adds the capability of detecting transposition-independent protein expression from mobile genetic elements.
In typical global proteomic workflows, protein isolated from an experimental sample is separated by gel electrophoresis, tryptically digested, and analyzed by liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) to produce a set of spectra that correspond to the detected peptides (Fig. 1A). Peptides, and ultimately proteins, are identified by comparing these spectra to spectra bioinformatically predicted from protein annotation in reference genomes (Fig. 1A). From obtaining good DNA sequence coverage of transposons present in the genome, to bioinformatically relating detected peptides back to individual transposable elements, there are several hurdles that make mobilome proteomics technically challenging.
(1) Coverage of highly repetitive elements is frequently incomplete in genomes sequenced using Sanger and Illumina platforms, because short reads often do not span the full length of large transposons.24 (2) High quality genome annotation of transposable elements is often lacking, partly because their highly repetitive and sequence-diverse nature complicates their identification, and partly because automating transposon annotation is difficult.24-26 (3) To facilitate gene annotation, repetitive sequences are purposefully masked in genome assemblies,27 meaning reference genomes cannot be used for predicting transposon proteins, peptides, and spectra in proteomic workflows. (4) Dedicated repetitive element databases such as Repbase (girinst.org)28 and Tefam (tefam.biochem.vt.edu) do exist, but mostly list consensus sequences of phylogenetically related elements.28,29 Individual transposons may diverge considerably from this consensus.29 (5) Reference genomes may not accurately reflect the mobilome of a given experimental sample, because transposon sequences and insertion sites can vary substantially between populations, individuals, and tissues.9,30,31 (6) Large copy numbers (up to one million copies for the most common transposon family (Alu) in humans)9,32 make bioinformatically assigning detected peptides to a specific genomic copy of an element virtually impossible (Fig. 1Ci; but see later).
These specific challenges are exacerbated by the generally poor assembly and annotation quality of many genome sequences, and the large and diverse array of bioinformatic tools used to identify repetitive elements, which complicate comparisons between genomes.24 Performing proteomics on endogenous retroviruses and other sequence-diverse non-annotated genetic elements poses similar challenges.
We recently performed the first high-throughput global proteomic analysis of an organism's transposon proteome in a cell line derived from the mosquito Aedes aegypti.2 Several previous studies had proteomically analyzed a subset of protein spots excised after 2D gel electrophoresis, but had focused on only a limited selection of transposon proteins (e.g., transposase).33-40 The method we used, “proteomics informed by transcriptomics” (PIT),41,42 solves the aforementioned problems afflicting mobilome proteomics by circumventing the requirement for genome annotation and instead identifying peptides based on matched RNA-Seq data (Fig. 1B). In PIT, the experimental sample is split; protein is extracted from one part and processed for LC-MS/MS as usual, while RNA is isolated from the rest and used for RNA-Seq. RNA sequencing reads are assembled into transcripts de novo (without the use of a reference genome) using one of several bioinformatic transcriptome assembly programmes, and translated in silico to predict proteins, peptides, and spectra that are ultimately used to determine which proteins were detected by LC-MS/MS (Fig. 1B).41,42 The result is a bespoke reference database exquisitely matched to the proteome of the experimental sample, which is limited only by RNA sequencing depth.42 PIT therefore solves the combined problems of incomplete repetitive element sequence coverage, identification, and annotation in genomes, as well as the potentially poor fit of experimental data to reference databases.
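The in silico step of PIT described above (translating assembled transcripts and predicting the peptides they could yield) can be illustrated with a minimal sketch. This is not the pipeline used in the study: de novo assembly, spectrum prediction, and spectrum matching are handled by dedicated software and are not shown. The function names, the stop-to-stop ORF definition, the 10-residue and 6-residue length cut-offs, and the simple trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions chosen for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement a DNA string; characters other than ACGT are kept as-is."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_orfs(transcript, min_aa=10):
    """Translate all six frames; return stop-to-stop segments of >= min_aa residues."""
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Codons containing ambiguous bases are translated as "X".
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            orfs.extend(p for p in protein.split("*") if len(p) >= min_aa)
    return orfs


def tryptic_digest(protein, min_length=6):
    """Cleave after K or R unless the next residue is P (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_length]


def build_peptide_index(transcripts, min_aa=10, min_length=6):
    """Map each predicted tryptic peptide to the transcripts that could produce it."""
    index = {}
    for name, seq in transcripts.items():
        for orf in six_frame_orfs(seq, min_aa):
            for pep in tryptic_digest(orf, min_length):
                index.setdefault(pep, set()).add(name)
    return index
```

Calling `build_peptide_index` on a dictionary of assembled transcript sequences returns a sample-specific lookup from predicted peptide to source transcript, which is the role the bespoke reference database plays in PIT. Because the database is built from the sample's own RNA, transposon copies that are missing from or divergent in a reference genome are still represented.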
In our study, we identified transposon proteins by BLASTing the in silico translation of detected peptide-associated transcripts against the Tefam and Repbase reference databases.2 Using the full-length amino acid sequence is important, because the short (<20 residue) peptides detected by LC-MS/MS could map to multiple transposons, while increased sequence coverage allows specific elements to be detected confidently (Fig. 1C). Although nucleotide BLAST could in theory be performed instead, protein BLAST is preferable because it reduces divergence from the consensus by excluding synonymous sequence differences. In this way, we identified a total of 136 transposon proteins in our sample with high confidence.2 It is important to tailor the thresholds for transposon protein identification to each species and reference database, as we observed differences in a side-by-side comparison of the Tefam and Repbase databases.2 Only 15 of the 136 identified transposon proteins closely matched the Ae. aegypti transposon reference database,2 confirming the aforementioned technical challenges to transposon proteomics posed by incomplete transposon identification, the inclusion of only consensus sequences in databases, and potential differences between a given experimental system and the reference genome.
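The point about short peptides versus longer sequence coverage can be made concrete with a toy example. The sketch below is not the BLAST-based procedure used in the study; it uses exact substring matching, and the two reference sequences and the function name are hypothetical. It only shows that a single short peptide can be consistent with several related elements, whereas additional sequence narrows the candidates.

```python
def consistent_references(peptides, references):
    """Return names of reference proteins that contain every peptide as an exact substring."""
    return sorted(
        name for name, seq in references.items()
        if all(pep in seq for pep in peptides)
    )


# Hypothetical toy sequences, not real transposon proteins.
references = {
    "element_A": "MSTLKDGYDDLVNRWAEQLK",
    "element_B": "MSTLRDGYDDLVNRWSEQLK",
}

# One short peptide from a conserved region is ambiguous ...
print(consistent_references(["DGYDDLVNR"], references))
# ... while adding sequence from a divergent region resolves the source.
print(consistent_references(["DGYDDLVNR", "WAEQLK"], references))
```

In practice, divergence from database consensus sequences means exact matching would be far too strict, which is why an alignment-based search with tailored thresholds is needed.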
Importantly, we also validated PIT's ability to make biologically relevant observations about mobile genetic elements.2 For example, non-LTR retrotransposons (class I) encode 2 ORFs, with ORF1 often truncated and not transcribed.5 This was reflected in our PIT data, with fewer proteins detected for ORF1 than ORF2 for non-LTR retrotransposons.2 Another interesting finding was the overabundance of proteins detected from LTR retrotransposons compared to other elements,2 despite the fact that non-LTR retrotransposons are more abundant in Ae. aegypti.43 Although this result must be interpreted with caution, as our proof-of-principle study included just one data point from a cell line that may not reflect the in vivo situation, our results are in agreement with the enrichment of LTR retrotransposon-derived small RNAs, known to correlate with transposon activity,44 in the related insect Drosophila melanogaster.45 Since LTR retrotransposons specifically are implicated in antiviral defenses in Ae. aegypti,22,23 we postulated that this mosquito may differentially allow LTR retrotransposons to remain active while suppressing other elements. If this result is corroborated, investigating the mechanisms by which the organism achieves differential transposon silencing, and copes with the potential deleterious consequences of heightened LTR retrotransposon activity, would be highly interesting. Although a discordance between genomic abundance and transposition activity has previously been observed in genomic studies,30 we are the first to describe this at the protein level,2 which may reflect not only transposition but also other (possibly cellular) functions of transposon proteins.
After publishing our study, the genome for the Ae. aegypti cell line we used (Aag2) was sequenced and made available at vectorbase.org.46,47 We wanted to test whether combining our PIT data with a matched genome sequence would allow us to pinpoint precisely which genomic copies of an element express protein. We therefore BLASTed (blast.ncbi.nlm.nih.gov) the full experimentally determined sequence of our 17 detected transposon transcripts that were associated with at least 2 peptides against the Aag2 cell genome (with repeats unmasked). In principle, each detected RNA transcript sequence should match the genomic DNA sequence at the locus from which it derives with 100% sequence identity across the full transcript length (100% query coverage). In practice, many of the thousands of genomic copies of a transposable element may be almost identical to each other and the transcript. Furthermore, sequencing errors and differences between our Aag2 cell clone and the published reference sequence may reduce the observed sequence identity. For our purposes, we considered transcripts exhibiting at least 99% nucleic acid sequence identity over at least 99% query coverage to be an “exact match.” Using these criteria, we were able to identify the exact genomic transposon sequence expressing protein for 5 elements (Fig. 2A). By cross-referencing the Aag2 contig containing the protein-expressing transposon with the Ae. aegypti reference genome (Liverpool strain version L3,43 vectorbase.org), and a physical chromosome map for Ae. aegypti,48 we were also able to identify the physical chromosomal location of the identified elements (Fig. 2B).
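The matching criteria described above can be sketched as a small filter over alignment results. This is an illustration under stated assumptions, not the analysis code used for Fig. 2: the `Hit` record, its field names, the category labels, and the example values are hypothetical, and each `locus` is assumed to identify one distinct genomic copy (for example a contig name plus coordinates). Both thresholds are treated as minimum values.

```python
from collections import namedtuple

# Hypothetical record for one transcript-to-genome alignment.
Hit = namedtuple("Hit", "query locus pct_identity pct_query_coverage")


def classify_transcripts(queries, hits, min_identity=99.0, min_coverage=99.0):
    """Label each transcript by how many genomic loci pass both thresholds."""
    passing = {q: set() for q in queries}
    for h in hits:
        if (h.query in passing
                and h.pct_identity >= min_identity
                and h.pct_query_coverage >= min_coverage):
            passing[h.query].add(h.locus)

    calls = {}
    for q, loci in passing.items():
        if len(loci) == 1:
            label = "unique"      # a single source locus can be named
        elif loci:
            label = "ambiguous"   # several near-identical copies
        else:
            label = "no_match"    # nothing close enough in this genome
        calls[q] = (label, sorted(loci))
    return calls


# Made-up example values for three transcripts.
hits = [
    Hit("tx1", "contig_12:1000-5200", 99.8, 100.0),
    Hit("tx1", "contig_40:300-4400", 96.1, 100.0),
    Hit("tx2", "contig_07:50-3100", 99.5, 99.6),
    Hit("tx2", "contig_19:900-3950", 99.4, 99.9),
    Hit("tx3", "contig_03:10-900", 99.9, 41.0),
]
calls = classify_transcripts(["tx1", "tx2", "tx3"], hits)
```

Requiring high query coverage as well as high identity matters here because a short, perfectly matching fragment of a transcript (as for `tx3` in the made-up data) says little about which full-length copy was transcribed.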
However, it is not always possible to map protein-expressing transposons in this way. For example, 5 transposon transcripts matched multiple almost identical genomic transposon sequences and could thus not be accurately located to a single source (Fig. 2A). Identical insertions contained within larger repeat regions are also expected to complicate this kind of analysis. Finally, several transcripts had no close match in the reference genome (Fig. 2A), either due to incomplete sequence coverage of repetitive elements,24 or because these elements differ between our clone of the cell line and the sequenced clone. Due to mobilome divergence, it was not possible to map protein-expressing transposons directly and accurately using the main Ae. aegypti reference genome (Liverpool strain version L3,43 vectorbase.org; data not shown), highlighting the need for perfectly matched genome, transcriptome, and proteome data for this kind of analysis.
We therefore provide proof-of-principle that PIT can not only characterize the global profile of the mobilome-derived proteome, but also that detected transposon proteins can be matched to their precise genomic source. It should be noted however that, overall, our proteomic approach is likely facilitated by the fact that mosquitoes encode a large diversity of mobile genetic elements, each with a relatively low copy number compared to mammals.24 Using PIT (and other approaches) to characterize the mobilome-derived proteome may be more challenging in humans and (almost all) other placental mammals, where the major active protein-producing transposable element is the highly abundant non-LTR retrotransposon L1.24
Our PIT pipeline allows interrogation of the mobilome-derived proteome in a global and unbiased way for the first time, opening up exciting new opportunities for defining the total contribution of transposon-derived proteins to cellular function, as well as for characterizing transposon activity in different contexts. Importantly, global transposon proteome profiling will allow the field to move away from targeted studies and the serendipitous discovery of transposon protein functions in health and disease, and toward holistic experiments that give a complete picture of the positive and negative impacts of the mobilome on its host organism. Combining transcriptomic and proteomic data with matched genomic information provides a powerful toolkit for dissecting the contribution of individual transposons, out of the thousands of genomic copies of an element, to the overall global activity of the mobilome. Inherently, our methods are equally valuable for studying endogenous retroviruses and other repetitive and/or divergent genetic elements that may or may not be accurately represented in reference genomes. The tools we have developed therefore open exciting new avenues of research into the dynamic role these mobile DNA sequences play in cellular function, disease, and the evolution of new phenotypes, while also capturing their changing activity during invasion and eventual silencing, inactivation, and domestication in new hosts.
No potential conflicts of interest were disclosed.
The authors thank Catriona Macfarlane for constructive comments on the manuscript.
The work associated with this commentary was funded by: Wellcome Trust fellowship 096062 and a seed grant from the Faculty of Health and Medical Sciences, University of Surrey, to KM; Medical Research Council (UK) grant G0801973 to ADD; BBSRC grant BB/M02542X/1 to ADD and DAM; BBSRC grants BB/M020118/1, BB/L018438/1, and BB/K016075/1 to DAM.