Here we show that WGS data from next-generation resequencing projects can be used successfully to identify large samples of de novo insertions in order to discover properties of TE target sites in D. melanogaster. Assuming results for the families studied here can be generalized to other TE families, the major biological findings of this work are: (i) TSDs for TIR and LTR elements are less than 10 bp in length, (ii) TSD lengths for TIR and LTR elements are shared by related TE families in the same clade, (iii) TSMs for TIR and LTR elements are palindromes, and (iv) target sequence preferences for TIR and LTR element-encoded TSMs extend beyond the limits of the TSD. We believe these general conclusions about TIR and LTR target site preferences are robust for several reasons. First, for strains of D. melanogaster that have been independently sequenced using 454 and Illumina technologies, the insertion location, orientation and TSD are highly consistent across platforms (). Thus, it is unlikely that the fundamental data used here to infer properties of TE insertion are heavily biased by platform-specific sequencing errors. Second, our results based on population genomic data from wild-type flies are consistent with previous findings in D. melanogaster based on spontaneous and artificially generated mutations in lab strains (). This reproducibility across data types reciprocally implies that the inferences about TSD and TSM properties from both large-scale population genomic and classical data are reliable. Finally, we observe consistent phylogenetic signals in TSD length and TSM properties among related clades of TE families that are not predefined by constraints in our methodology and can only arise by common biological processes.
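To make concrete what "palindromic" means for a target site motif, the following minimal sketch tests whether a DNA sequence equals its own reverse complement and scores how far a position frequency matrix departs from that symmetry. The function names, the matrix representation (one dictionary of base frequencies per position) and the example sequences are illustrative assumptions, not the procedure or data used in this study.

```python
# Illustrative sketch only: names and inputs are hypothetical, not from the study.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindromic(seq: str) -> bool:
    """A DNA palindrome reads the same on both strands (5' to 3')."""
    return seq.upper() == reverse_complement(seq)


def palindrome_mismatch(pfm: list) -> float:
    """Mean per-position deviation of a frequency matrix from strand symmetry.

    pfm is a list of dicts mapping each base to its frequency at that
    position. Column i is compared with the complemented mirror column
    (n - 1 - i); a perfectly palindromic motif scores 0.0.
    """
    n = len(pfm)
    total = 0.0
    for i, column in enumerate(pfm):
        mirror = pfm[n - 1 - i]
        total += sum(abs(column[b] - mirror[COMPLEMENT[b]]) for b in "ACGT")
    return total / n


if __name__ == "__main__":
    print(is_palindromic("GAATTC"))   # True
    print(is_palindromic("GATTACA"))  # False
```

For odd-length motifs the central column is compared with its own complement, so it contributes no mismatch only when A and T (and C and G) are equally frequent at that position.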
Our use of next-generation sequence data to study the details of target site preferences joins a growing number of applications that attempt to identify TE insertion mutations based on targeted or whole-genome resequencing. Broadly speaking, the aims of these previous techniques fall into two major classes: (i) genome-wide screens for insertions in DNA pools from a single TE family induced by artificial mutagenesis to identify genomic regions that are essential for growth in bacteria [38], [39], [40], [41] or tumors [42], [43], [44], and (ii) genome-wide screens in individuals/strains for spontaneous insertions from one or more TE families to study population genomics and genome evolution [45], [46], [47], [48]. The aim of our method for TE insertion discovery differs from these previous methods in that our approach is designed to reveal the mechanistic details of transposon insertion site preferences. As such, our approach employs stringent filtering to identify only well-supported de novo insertion sites, and attempts to annotate insertions at exact nucleotide-level resolution rather than provide a comprehensive map of all TE insertions in all strains.
In terms of studying TE insertion site preferences, our next-generation sequencing-based population-genomic approach has many advantages over traditional methods. Our method can be applied in any species with active TEs, requires no artificial mutagenesis, is high-throughput and fully automated, generates TSD and TSM information simultaneously for all active TE families, uses a common biological data source and consistent computational methods for all TE families studied, allows direct comparison of pre-integration and post-integration genomic sequences, is based on naturally-occurring mutational events, and identifies the exact breakpoints of TE integration in the genome. Nevertheless, there are several key limitations of our TE insertion site discovery approach that prevent its comprehensive application to all TE families and its use in other applications (e.g. population genetics). First, our method requires both termini of a full-length element to be present for a de novo insertion to be detected. Thus, we cannot identify incomplete de novo TE insertions such as 5′ truncated non-LTR retrotransposons. While our method can find full-length non-LTR elements, the variable TSD length of these TEs prevented automated inference of optimal TSD length for downstream filtering and TSM inference, which is why they were excluded from this study. Second, we require TE-junction information to be contained in a single read, and our sequence similarity thresholds effectively require ~30 bp of homology to both TE and flanking DNA. Thus, our approach requires a minimal read length, which we find empirically to be greater than 65 bp. This limitation could be bypassed in principle by using paired-end data and attempting to assemble contigs that span the TE-flanking region junction. Third, we are not able to identify de novo insertions in repetitive regions of the genome (i.e. TE-rich pericentromeric regions) and thus many potential de novo TE insertion sites are not included in our data set. Despite these shortcomings, our approach has permitted the general properties of TIR and LTR element target sites in D. melanogaster to be generated in an automated and reproducible manner.
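The read-length limitation described above follows from simple arithmetic, which can be sketched as follows. The sketch assumes a hard threshold of 30 matching bases on each side of the TE-flank junction (the text gives ~30 bp as an effective requirement, not an exact cutoff), and the function and parameter names are hypothetical rather than taken from our pipeline.

```python
# Illustrative sketch only: the 30 bp cutoff approximates the "~30 bp" in the
# text, and all names here are hypothetical.
MIN_MATCH = 30


def spans_junction(read_length: int, junction_offset: int,
                   min_match: int = MIN_MATCH) -> bool:
    """Return True if a read has enough sequence on both sides of a junction.

    junction_offset is the number of read bases that precede the boundary
    between TE sequence and flanking genomic sequence.
    """
    left_side = junction_offset
    right_side = read_length - junction_offset
    return left_side >= min_match and right_side >= min_match


def usable_offsets(read_length: int, min_match: int = MIN_MATCH) -> int:
    """Count junction positions within a read that satisfy both thresholds."""
    return max(0, read_length - 2 * min_match + 1)


if __name__ == "__main__":
    for length in (50, 60, 65, 100):
        print(length, usable_offsets(length))
```

Under these assumptions a read shorter than 60 bp can never span a junction, and reads only slightly longer do so at very few positions, which is in line with the empirical minimum of more than 65 bp reported above.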
With the ability to generate a wealth of data on the natural target site properties for large numbers of TE families, genome-wide properties of TE target sites can now be uncovered in other species to test the generality of the conclusions reported here and further illuminate the molecular biology of transposition. Previous results from other species using classical approaches support our ultimate conclusion that TSMs (which incorporate all lower level features of the data including position, orientation and TSD) are generally palindromic structures for TIR elements (see references in [6] and [17], [18], [49], [50]) and LTR elements/retroviruses [17], [51], [52], [53], [54], [55], [56]. Given the strong concordance between population genomic and classical data types in D. melanogaster (), we are confident that application of next-generation sequencing-based population genomic methods to study TE target site properties will support this general finding across a wide range of species and TE families. Importantly, the common palindromic nature of TIR and LTR target sites suggests similar mechanisms for TIR and LTR insertion, which is supported by the fact that retroviral-like LTR elements use integrases that share catalytic activity with transposases of TIR elements [57]. Palindromic target sites are also generally consistent with transposases or integrases acting as multimeric complexes (e.g. [58], [59]), with the target site entering the catalytic complex along an axis of two-fold symmetry [60], [61]. Finally, the general AT-richness of TSMs may imply that flexibility of the target site sequence is a crucial factor for the integration of many TE families [62]. These connections reveal how combining inferences from the rich natural resource of population genomic data with detailed structural and functional studies will benefit future work on the mechanistic basis of TE insertion into host genomes.