Mob Genet Elements. 2016; 6(6): e1241050.
Published online 2016 September 26. doi:  10.1080/2159256X.2016.1241050
PMCID: PMC5173273

LTRclassifier: A website for fast structural LTR retrotransposons classification in plants

ABSTRACT

Automatic classification of LTR retrotransposons is a major challenge in large-scale genomics. Many tools have been developed to detect these elements, but their automatic classification remains difficult. Here we propose a simple approach, LTRclassifier, based on HMM recognition followed by BLAST analyses, (i) to classify plant LTR retrotransposons into their respective superfamilies, and (ii) to automatically provide a basic functional annotation of these elements. The method was tested on various TE databases and shown to be robust and fast. This tool is available as a web service hosted at the IRD bioinformatics facility, http://LTRclassifier.ird.fr/.

KEYWORDS: classification, LTR retrotransposons, superfamily

Introduction

Because complete genome sequences are now widely available, many efficient tools have been developed to identify LTR (Long Terminal Repeat) retrotransposons.1-3 These tools generally rely on recognizing the structure of these elements: terminal direct repeats (LTRs), flanked by target site duplications, with a reverse transcriptase (RT) motif between the 2 LTRs. However, very few of them provide information about the non-RT motifs, or a classification at the superfamily level (copia or gypsy, Fig. 1; see Wicker et al.4).

Figure 1.
Structure of LTR retrotransposons. The order of RT-RH/INT determines the corresponding superfamily. For gypsy elements, the CD (Chromo Domain) is present for a subset of elements only (Athila related). GAG: Group-specific AntiGen; AP: Aspartic Protease; ...

Some recent heavy implementations have been published for local analyses and classification,5-7 but no online tool or lightweight implementation has been proposed so far. Here we describe LTRclassifier, a web server based on a suite of Perl scripts that classifies a set of LTR retrotransposons into their superfamilies. LTRclassifier was tested on datasets ranging from a single sequence to thousands of LTR retrotransposons, and showed a high recovery rate in a short amount of time.

Results and discussion

Annotation and classification results

In terms of efficiency by step (Table 1), LTRclassifier assigned a superfamily to 26% to 99% of the elements, depending on the public TE database used as the test set. With the exception of RepBase (see the discussion of limits below), LTRclassifier automatically classified about 75% of the total elements, with a sensitivity of 80% to 98% and a specificity higher than 95% (see Materials and methods). Its precision is higher than 97%, as it wrongly annotated as LTR retrotransposons only 2.2% of the complete DNA elements from TREP (data not shown).

Table 1.
Benchmarking on various nucleic databases of LTR retrotransposons (the number of LTR retrotransposons is shown in brackets). N/A means that no classification information at the superfamily level was available from the database.

LTRclassifier can quickly annotate (less than 2 h for 4,000 LTR retrotransposon sequences), with high confidence (95% to 97.5% specificity), the frames of any LTR retrotransposon sequence, whether a complete element or not. It detects the standard HMM motifs, protein motifs and, if needed, nucleic similarity, and provides position and annotation information for proteins and motifs in tabular text files. Moreover, it reports 2 types of abnormal structures to improve annotation: (i) coding motifs detected on opposite strands (e.g., frame +1 for gag and -1 for RVE), and (ii) motifs from both copia and gypsy detected in the same element.
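The two abnormal-structure checks described above can be illustrated with a minimal Python sketch. It is not taken from the LTRclassifier Perl scripts: the `MotifHit` record, the `abnormal_flags` function and the flag names are hypothetical, and the sketch assumes each motif hit carries its translation frame and, where the motif is diagnostic, a superfamily label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifHit:
    """Hypothetical record for one motif hit (not the tool's own format)."""
    name: str         # motif label, e.g. "gag" or "rve"
    frame: int        # translation frame: +1..+3 or -1..-3
    superfamily: str  # "copia", "gypsy", or "" when the motif is not diagnostic


def abnormal_flags(hits):
    """Return the set of abnormal-structure flags raised by a list of hits."""
    flags = set()
    # (i) coding motifs found on both the positive and the negative strand
    strands = {hit.frame > 0 for hit in hits}
    if len(strands) == 2:
        flags.add("opposite_strands")
    # (ii) motifs diagnostic of both copia and gypsy in the same element
    families = {hit.superfamily for hit in hits if hit.superfamily}
    if {"copia", "gypsy"} <= families:
        flags.add("mixed_superfamilies")
    return flags


example = [MotifHit("gag", +1, ""), MotifHit("rve", -1, "")]
print(sorted(abnormal_flags(example)))  # ['opposite_strands']
```

An element raising either flag would be a candidate for manual inspection rather than automatic acceptance.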

Limits of the approach on consensus sequences

As shown in Table 1, the classification rate on RepBase is low (only 26.2% of LTR retrotransposons vs. almost 70% for the TIGR grasses database). Most sequences in RepBase are consensus sequences from large families and subfamilies of repeats, and are used with tools such as RepeatMasker,8 Censor9 or BLASTN to mask and annotate repetitive DNA in genomes. These consensus nucleic sequences do not represent true sequences and contain induced frameshifts that alter the translation, which prevents the PfamScan/HMM analysis, and to a lesser extent the BLASTX one, from providing good results under our conditions (data not shown). The structure of RepBase is thus very efficient for massive genome (or transcriptome) annotation, but a consensus sequence cannot be used to annotate the functional components of the element itself.

Conclusion

Many currently available tools can detect the genomic structure of LTR retrotransposons (LTRharvest,2 LTR_finder,10 LTR_Struct,1 REPET11). However, only a few provide information about the functional annotation of those elements in terms of proteins and motifs, and these are generally heavy to install (e.g., REPCLASS,7 LTRdigest,12 TEclassifier from REPET, PASTEC13) and not available online (Table 2).

Table 2.
Comparison of existing classification tools.

Here we present a fast online tool able to functionally annotate a large set of LTR retrotransposons. The data it provides are robust enough to make LTRclassifier an efficient companion tool for any genome annotation improvement or LTR retrotransposon evolution study.

Materials and methods

Pipeline of annotation

The whole pipeline is shown in Fig. 2. First, the element sequences are translated in the 6 frames (positive and negative; for nucleic data only) using the transeq tool from the EMBOSS package,14 and the translated sequences are submitted to a PfamScan.pl v1.3 analysis,15 using version 26.0 of the Pfam database, to identify their putative functional motifs. Specific motifs were manually selected (Table 3) after extensive analyses of well-defined LTR retrotransposons. The retention criterion was that a motif be associated with normal LTR retrotransposon motifs, and only with them.
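To make the 6-frame translation step concrete, here is a minimal, dependency-free Python sketch. It is only an illustration and not a replacement for transeq: it uses the standard genetic code, translates unknown codons as X, and numbers the negative frames simply as frames 1 to 3 of the reverse complement, which may differ from the transeq numbering convention.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; incomplete or ambiguous codons
    are dropped (trailing bases) or rendered as 'X' (non-ACGT characters)."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return a dict mapping frame number (+1..+3, -1..-3) to its translation."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(frame, protein)
```

Each of the six translations would then be written out as a protein sequence and passed to the motif search.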

Figure 2.
General scheme of the analysis. The 6-frame translation and the recurrent BLASTN steps apply to nucleic data only.

Table 3.
Pfam motifs retained for the classification.

The results retained from the PfamScan step are 85% significant (see the PfamScan.pl documentation for more details), with a minimal score of 50. Other motifs are retained but are used only to validate the sequence as an LTR retrotransposon, not for superfamily classification. Based on the identified motifs, the elements are then classified as copia, gypsy or unclassified. The Pfam-classified elements are also provided with a functional annotation based on the location of their Pfam motifs.
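The motif-based decision can be sketched in a few lines of Python. This is an assumption-laden illustration, not the tool's actual logic: the motif names in `DIAGNOSTIC` are placeholders (the real list is the one given in Table 3), and returning "unclassified" when motifs from both superfamilies are present is a choice made for this sketch only.

```python
MIN_SCORE = 50.0  # minimal PfamScan score quoted in the text

# Placeholder motif-to-superfamily map; the real motifs are listed in Table 3.
DIAGNOSTIC = {"MOTIF_COPIA": "copia", "MOTIF_GYPSY": "gypsy"}


def classify(hits, diagnostic=DIAGNOSTIC, min_score=MIN_SCORE):
    """Classify one element from its motif hits.

    hits: iterable of (motif_name, score) pairs.
    Returns "copia", "gypsy" or "unclassified".
    """
    families = {
        diagnostic[name]
        for name, score in hits
        if score >= min_score and name in diagnostic
    }
    # Exactly one superfamily supported: accept it. None or both: leave the
    # element unclassified so that later steps can handle it.
    if len(families) == 1:
        return families.pop()
    return "unclassified"


print(classify([("MOTIF_COPIA", 87.3), ("OTHER", 120.0)]))  # copia
print(classify([("MOTIF_GYPSY", 42.0)]))                     # unclassified
```

Elements left unclassified at this stage are the ones passed on to the similarity-based steps.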

The elements not yet classified are then subjected to a BLASTX16 analysis, using as reference database an in-house concatenation of the TREPprot17 and plant RepbaseProt18,19 (developed for REPET) databases. The criteria retained for classification here are at least 25% identity over a minimal length of 80 residues.
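The identity and length thresholds above could be applied to BLASTX results as in the following Python sketch. It assumes the hits are available in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, ..., e-value, bit score); whether LTRclassifier uses this exact output format is an assumption, and keeping the hit with the highest bit score per query is a choice made for the illustration.

```python
MIN_IDENTITY = 25.0  # percent identity threshold quoted in the text
MIN_LENGTH = 80      # minimal alignment length, in residues


def best_hits(tabular_lines, min_identity=MIN_IDENTITY, min_length=MIN_LENGTH):
    """Return {query_id: subject_id} for the best hit passing both thresholds."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        bitscore = float(fields[11])
        if identity < min_identity or length < min_length:
            continue
        # keep only the highest-scoring acceptable hit for each query
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


rows = [
    "te1\tprotA\t31.5\t120\t80\t2\t1\t360\t5\t124\t1e-20\t95.0",
    "te1\tprotB\t45.0\t60\t33\t0\t1\t180\t1\t60\t1e-10\t70.0",   # too short
    "te2\tprotC\t22.0\t200\t150\t6\t1\t600\t1\t200\t1e-8\t60.0",  # identity too low
]
print(best_hits(rows))  # {'te1': 'protA'}
```

The superfamily of the retained subject sequence would then be transferred to the query element; that lookup depends on how the reference database is annotated and is not shown here.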

The final step for nucleic input is a recurrent BLASTN analysis, which attempts to classify the remaining unclassified sequences through their homology with already classified ones. This BLASTN is performed on the internal sequence only, if the LTR positions can be determined. To do so, the system runs a BLASTN of the sequence against itself and checks whether 2 highly similar sequences (the LTRs) can be identified, one on each side of the target sequence; if so, the internal sequence is extracted and used in the next step. Otherwise, the BLASTN analysis is performed on the whole sequence. The thresholds are 80% identity over at least 80 bases, and a minimal length of 80 bases for the query (as described in Wicker et al.4). The newly classified elements are then added to the database, and the analysis is repeated until no more elements can be added to the classified ones.
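The recurrent step is essentially a fixed-point iteration, which the Python sketch below illustrates. The BLASTN search itself is abstracted behind a caller-supplied `is_similar` function, so the function names and data structures here are hypothetical; only the 80% identity, 80 bases and 80-base query thresholds come from the text.

```python
def passes_thresholds(identity, aln_length, query_length):
    """Thresholds quoted in the text: 80% identity over at least 80 bases,
    with a query of at least 80 bases."""
    return identity >= 80.0 and aln_length >= 80 and query_length >= 80


def iterate_classification(classified, unclassified, is_similar):
    """Propagate superfamily labels until no new element can be classified.

    classified:   dict mapping element id -> superfamily
    unclassified: iterable of element ids still without a label
    is_similar:   function (query_id, reference_id) -> bool, standing in for
                  a BLASTN hit that passes the thresholds
    Returns (classified, still_unclassified).
    """
    classified = dict(classified)
    pending = set(unclassified)
    while True:
        newly = {}
        for seq_id in pending:
            for ref_id, family in classified.items():
                if is_similar(seq_id, ref_id):
                    newly[seq_id] = family
                    break
        if not newly:  # fixed point reached: nothing was added this round
            return classified, pending
        # newly classified elements become references for the next round
        classified.update(newly)
        pending -= set(newly)


links = {("B", "A"), ("C", "B")}  # toy similarity relation
result, left = iterate_classification(
    {"A": "copia"}, ["B", "C", "D"], lambda q, r: (q, r) in links
)
print(result, left)
```

In this toy run, B is classified in the first round through A, C in the second round through B, and D remains unclassified, which mirrors how labels spread through successive iterations.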

The final output will be a set of files, as described in Table S1 and Fig. S1.

Databases description

The databases used for the annotation itself are Pfam (version 26.0), RepBaseProt (version 17.0 for the REPET pipeline, 5,844 protein sequences) and TREPprot (version 11.0, 193 protein sequences). Pfam contains HMM motifs for functional annotation; TREPprot and RepBaseProt contain protein sequences of already known TEs from diverse organisms. Here we limited the protein data set to plant LTR retrotransposon sequences, but it could in theory be extended to any kingdom to identify copia and gypsy elements.

Testing material (sequences and computer)

The sequences used for implementation and benchmarking are RetrOryza,20 TIGR grasses repeats (one representative element per family), TIGR Oryza repeats (all annotated elements), RepBase (v. 17) and TREPtotal (v. 11). From those sequence libraries, we selected only the LTR retrotransposon sequences for our tests.

The benchmarks were performed using 4 threads per assay. The current version uses 2 threads for each analysis.

The sensitivity was calculated as the percentage of elements classified by LTRclassifier out of the total number of already classified elements in the reference test set. The specificity was expressed as the percentage of elements correctly classified by LTRclassifier compared to the reference test set (already annotated elements).
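The two percentages defined above can be computed as in the following Python sketch. The text does not state the denominator of the second percentage explicitly; this sketch assumes it is the number of reference elements that LTRclassifier assigned to a superfamily. The function name and the dictionary layout are hypothetical.

```python
def benchmark(reference, predicted):
    """Compute the two benchmark percentages for one reference set.

    reference: dict element id -> superfamily annotated in the database
    predicted: dict element id -> superfamily assigned by the classifier
               (missing ids or "unclassified" count as not classified)
    Returns (percent_classified, percent_correct).
    """
    if not reference:
        raise ValueError("reference set is empty")
    # reference elements that received a superfamily from the classifier
    called = {
        element: predicted[element]
        for element in reference
        if predicted.get(element, "unclassified") != "unclassified"
    }
    percent_classified = 100.0 * len(called) / len(reference)
    # assumption: correctness is measured among the classified elements only
    correct = sum(1 for element, family in called.items()
                  if family == reference[element])
    percent_correct = 100.0 * correct / len(called) if called else 0.0
    return percent_classified, percent_correct


reference = {"a": "copia", "b": "gypsy", "c": "gypsy", "d": "copia"}
predicted = {"a": "copia", "b": "gypsy", "c": "copia", "d": "unclassified"}
print(benchmark(reference, predicted))
```

In this toy example, 3 of the 4 reference elements are classified and 2 of those 3 agree with the reference annotation.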

Availability and implementation

The perl scripts are available at http://LTRclassifier.ird.fr/LTRclassifierScripts.tar.gz, under GPLv3.

The version implemented on the IRD bioinformatics cluster is available at http://LTRclassifier.ird.fr/. The input is limited to 8 Mbytes of data, with a minimum of a single sequence.

Supplementary Material

KMGE_S_1241050.pdf:

Abbreviations

BLAST
basic local alignment search tool
HMM
Hidden Markov Models
LTR
Long Terminal Repeats

Disclosure of potential conflicts of interest

No potential conflicts of interest were disclosed.

Acknowledgments

The authors thank Benoit Piegu for identifying the consensus problem, and the two anonymous reviewers for their comments.

References

[1] McCarthy EM, McDonald JF. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 2003; 19:362-7; PMID:12584121; http://dx.doi.org/10.1093/bioinformatics/btf878
[2] Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 2008; 9:18; PMID:18194517; http://dx.doi.org/10.1186/1471-2105-9-18
[3] Flutre T, Duprat E, Feuillet C, Quesneville H. Considering transposable element diversification in de novo annotation approaches. PLoS One 2011; 6:e16526; PMID:21304975; http://dx.doi.org/10.1371/journal.pone.0016526
[4] Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet 2007; 8:973-82; PMID:17984973; http://dx.doi.org/10.1038/nrg2165
[5] Abrusán G, Grundmann N, DeMester L, Makalowski W. TEclass—a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 2009; 25:1329-30; http://dx.doi.org/10.1093/bioinformatics/btp084
[6] Steinbiss S, Kastens S, Kurtz S. LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons. Mob DNA 2012; 3:18; PMID:23131050; http://dx.doi.org/10.1186/1759-8753-3-18
[7] Feschotte C, Keswani U, Ranganathan N, Guibotsy ML, Levine D. Exploring repetitive DNA landscapes using REPCLASS, a tool that automates the classification of transposable elements in eukaryotic genomes. Genome Biol Evol 2009; 1:205-20; PMID:20333191; http://dx.doi.org/10.1093/gbe/evp023
[8] Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996-2010. http://www.repeatmasker.org
[9] Kohany O, Gentles AJ, Hankus L, Jurka J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics 2006; 7:474; PMID:17064419; http://dx.doi.org/10.1186/1471-2105-7-474
[10] Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 2007; 35:W265-8; PMID:17485477; http://dx.doi.org/10.1093/nar/gkm286
[11] Flutre T, Duprat E, Feuillet C, Quesneville H. Considering transposable element diversification in de novo annotation approaches. PLoS One 2011; 6:e16526; PMID:21304975; http://dx.doi.org/10.1371/journal.pone.0016526
[12] Steinbiss S, Willhoeft U, Gremme G, Kurtz S. Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res 2009; 37:7002-13; PMID:19786494; http://dx.doi.org/10.1093/nar/gkp759
[13] Hoede C, Arnoux S, Moisset M, Chaumier T, Inizan O, Jamilloux V, Quesneville H. PASTEC: an automatic transposable element classification tool. PLoS One 2014; 9:e91929; PMID:24786468; http://dx.doi.org/10.1371/journal.pone.0091929
[14] Rice P, Longden I, Bleasby A. EMBOSS: The European molecular biology open software suite. Trends Genet 2000; 16:276-7; PMID:10827456; http://dx.doi.org/10.1016/S0168-9525(00)02024-2
[15] Mistry J, Bateman A, Finn RD. Predicting active site residue annotations in the Pfam database. BMC Bioinformatics 2007; 8:298; PMID:17688688; http://dx.doi.org/10.1186/1471-2105-8-298
[16] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215:403-10; PMID:2231712; http://dx.doi.org/10.1016/S0022-2836(05)80360-2
[17] Wicker T, Matthews DE, Keller B. TREP: a database for Triticeae repetitive elements. Trends Plant Sci 2002; 7:561-2; http://dx.doi.org/10.1016/S1360-1385(02)02372-5
[18] Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005; 110:462-7; PMID:16093699; http://dx.doi.org/10.1159/000084979
[19] Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 2015; 6:11; PMID:26045719; http://dx.doi.org/10.1186/s13100-015-0041-9
[20] Chaparro C, Guyot R, Zuccolo A, Piégu B, Panaud O. RetrOryza: a database of the rice LTR-retrotransposons. Nucleic Acids Res 2007; 35:D66-70; PMID:17071960; http://dx.doi.org/10.1093/nar/gkl780

Articles from Mobile Genetic Elements are provided here courtesy of Taylor & Francis