Automatic classification of LTR retrotransposons is a major challenge in large-scale genomics. Many tools have been developed to detect them, but their automatic classification remains difficult. Here we propose a simple approach, LTRclassifier, based on HMM recognition followed by BLAST analyses (i) to classify plant LTR retrotransposons into their respective superfamilies, and (ii) to automatically provide a basic functional annotation of these elements. The method was tested on various TE databases and shown to be robust and fast. This tool is available as a web service implemented at the IRD bioinformatics facility, http://LTRclassifier.ird.fr/.
Because complete genome sequences are now available, many efficient tools have been developed to identify LTR (Long Terminal Repeat) retrotransposons.1-3 These tools are generally based on recognition of the structure of those elements: terminal direct repeats (LTRs), flanked by a target site duplication, with a reverse transcriptase (RT) motif between the 2 LTRs. However, only very few of them are able to provide information about the non-RT motifs, or to provide a classification into superfamilies (copia or gypsy, Fig. 1; see Wicker et al.4).
Some recent heavy implementations have been published for local analyses and classification,5-7 but no online tool or lightweight implementation has been proposed yet. Here we describe LTRclassifier, a web server based on a suite of Perl scripts that classifies a set of LTR retrotransposons into their superfamilies. LTRclassifier was tested on different datasets of LTR retrotransposons, from a single sequence to thousands, and showed a very efficient recovery score in a short amount of time.
In terms of efficiency (Table 1), we can assign from 26% to 99% of elements to a superfamily, depending on the public TE database used as the target for the analysis. Thus, with the exception of RepBase (see the discussion of the limits below), LTRclassifier was able to automatically classify about 75% of the total elements, with a sensitivity of 80% to 98% and a specificity higher than 95% (see Materials & Methods). Its precision is higher than 97%, as it wrongly annotated only 2.2% of the complete DNA elements from TREP as LTR retrotransposons (data not shown).
LTRclassifier can annotate the frames of any LTR retrotransposon sequence, complete or not, quickly (less than 2 h for 4,000 LTR retrotransposon sequences) and with high confidence (95% to 97.5% specificity). It can detect the normal HMM motifs, protein motifs and, if needed, nucleic similarity, and provides position/annotation information for proteins and motifs in tabular text files. Moreover, it provides information about 2 types of abnormal structures to improve annotation: (i) detection of opposite strands as coding (e.g., frame +1 for gag and -1 for RVE), and (ii) detection of motifs from both copia and gypsy.
As shown in Table 1, the classification results on RepBase are low (only 26.2% of LTR retrotransposons vs almost 70% for the TIGR grasses database). Most sequences in RepBase are consensus sequences from large families and subfamilies of repeats, and are used with tools such as RepeatMasker,8 Censor9 or BLASTN to mask and annotate repetitive DNA in genomes. These consensus nucleic sequences do not represent true sequences and contain induced frameshifts that alter the translation. This prevents the PfamScan/HMM analysis and, to a lesser extent, the BLASTX analysis from providing good results under our conditions (data not shown). The structure of RepBase is thus very efficient for massive genome (or transcriptome) annotation, but a consensus sequence cannot be used to annotate the functional components of the element itself.
Many currently available tools can detect the genomic structure of LTR retrotransposons (LTRharvest,2 LTR_finder,10 LTR_Struct,1 REPET11). However, only a few provide information about the functional annotation of those elements in terms of proteins and motifs, and these are generally heavy to install (e.g., REPCLASS,7 LTRdigest,12 TEclassifier from REPET, PASTEC13) and not available online (Table 2).
Here we present a fast online tool able to functionally annotate a large set of LTR retrotransposons. The data provided are robust enough to make LTRclassifier an efficient companion tool for any genome annotation improvement and any study of LTR retrotransposon evolution.
The whole pipeline is shown in Fig. 2. First, the element sequences are translated in the 6 frames (positive and negative; for nucleic data only) using the transeq tool from the EMBOSS package,14 and the translated sequences are submitted to a PfamScan.pl v1.3 analysis,15 using version 26.0 of the Pfam database, to identify their putative functional motifs. Specific motifs were manually selected (Table 3) after extensive analyses using well-defined LTR retrotransposons. The retention criterion was that a motif be associated with normal LTR retrotransposon motifs, and only with them.
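As an illustration of the first step, the following is a minimal Python sketch of a six-frame translation using the standard genetic code. It is not the pipeline's code, which relies on transeq; the function names are hypothetical, and the numbering of the reverse frames is a convention of this sketch that may not match the one used by transeq.

```python
# Illustrative six-frame translation (standard genetic code).
# ASSUMPTION: function names and reverse-frame numbering are choices of this
# sketch; the actual pipeline calls EMBOSS transeq instead.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame labels (+1..+3, -1..-3) to the corresponding translations."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

Each of the six protein strings would then be scanned independently for motifs, which is why a frameshift inside an element splits its motifs across several frames.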
The results retained from the PfamScan step are 85% significant (see the PfamScan.pl documentation for more details), with a minimal score of 50. Other motifs are retained but are used only to validate the sequence as an LTR retrotransposon, not for superfamily classification. Based on the identified motifs, the elements are then classified as copia, gypsy or unclassified. The Pfam-classified elements are also provided with their functional annotation, based on the location of the Pfam motifs.
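The motif-based decision can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the motif names are placeholders standing in for the lists of Table 3, the record layout and function name are hypothetical, and treating an element carrying motifs from both superfamilies as unclassified is a choice of this sketch, not a documented behavior of LTRclassifier.

```python
from dataclasses import dataclass

MIN_SCORE = 50.0  # minimal PfamScan score quoted in the text

# ASSUMPTION: placeholder names; the real motif lists are those of Table 3.
COPIA_MOTIFS = {"copia_motif_A"}
GYPSY_MOTIFS = {"gypsy_motif_A"}


@dataclass
class PfamHit:
    motif: str    # motif name reported for the hit
    score: float  # hit score
    frame: int    # translation frame, +1..+3 or -1..-3


def classify_element(hits):
    """Return (superfamily, warnings) for the Pfam hits of one element."""
    kept = [h for h in hits if h.score >= MIN_SCORE]
    motifs = {h.motif for h in kept}
    has_copia = bool(motifs & COPIA_MOTIFS)
    has_gypsy = bool(motifs & GYPSY_MOTIFS)

    warnings = []
    if has_copia and has_gypsy:
        warnings.append("motifs from both copia and gypsy")
    if any(h.frame > 0 for h in kept) and any(h.frame < 0 for h in kept):
        warnings.append("coding motifs on opposite strands")

    if has_copia and not has_gypsy:
        return "copia", warnings
    if has_gypsy and not has_copia:
        return "gypsy", warnings
    # ASSUMPTION: conflicting or missing evidence leaves the element unclassified.
    return "unclassified", warnings
```

Elements returned as unclassified here are the ones that would be passed on to the similarity-based steps.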
The elements not yet classified are then subjected to a BLASTX16 analysis, using as reference database an in-house concatenation of the TREPprot17 and plant RepbaseProt18,19 (developed for REPET) databases. The criteria retained for classification here are 25% identity over a minimal length of 80 residues.
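The thresholds above can be applied to BLAST results with a simple filter. The sketch below assumes the hits are available in the standard BLAST+ tabular format (`-outfmt 6`, default columns), in which the third and fourth columns are the percent identity and the alignment length; the source does not state which output format the pipeline parses, and the function name and the inclusive comparisons are assumptions of this sketch.

```python
MIN_IDENTITY = 25.0  # percent identity threshold quoted in the text
MIN_LENGTH = 80      # minimal alignment length, in residues


def passing_hits(tabular_text):
    """Yield (query, subject, identity, length) for hits meeting both thresholds.

    ASSUMPTION: input is BLAST+ tabular output with the default column order
    (qseqid, sseqid, pident, length, ...).
    """
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if identity >= MIN_IDENTITY and length >= MIN_LENGTH:
            yield query, subject, identity, length
```

How the superfamily is then chosen among several passing hits (for example, by best score) is not specified in the text and is left out of this sketch.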
The final step for nucleic input is an iterative BLASTN analysis, which tries to classify the still unclassified sequences using their homology with already classified ones. This BLASTN is performed on the internal sequence only, if the LTR positions can be determined. To do so, the system runs a BLASTN of the sequence against itself and checks whether 2 highly similar sequences (the LTRs) can be identified, one on each side of the target sequence; if so, the internal sequence is extracted and used in the next step. Otherwise, the BLASTN analysis is performed on the whole sequence. The thresholds are 80% identity over at least 80 bases, with a minimal length of 80 bases for the query (as described in Wicker et al.4). The elements classified at this step are then added to the database, and the analysis is repeated until no more elements can be added to the classified ones.
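The iteration described above amounts to propagating labels until a fixed point is reached. The following Python sketch illustrates this under simplifying assumptions: the pairwise BLASTN hits are taken as precomputed tuples rather than recomputed against a growing database, the first qualifying hit decides the label, and all names are hypothetical.

```python
MIN_IDENTITY = 80.0    # percent identity threshold quoted in the text
MIN_ALIGN_LENGTH = 80  # minimal alignment length, in bases
MIN_QUERY_LENGTH = 80  # minimal query length, in bases


def propagate_classification(classified, hits, query_lengths):
    """Extend known labels to unclassified elements through similarity hits.

    classified:    dict element -> superfamily for already classified elements
    hits:          list of (query, subject, identity, align_length) tuples
    query_lengths: dict element -> sequence length in bases
    ASSUMPTION: hits are precomputed; the real pipeline reruns BLASTN instead.
    """
    labels = dict(classified)
    changed = True
    while changed:  # repeat until no more elements can be added
        changed = False
        for query, subject, identity, align_length in hits:
            if query in labels or subject not in labels:
                continue
            if query_lengths.get(query, 0) < MIN_QUERY_LENGTH:
                continue
            if identity >= MIN_IDENTITY and align_length >= MIN_ALIGN_LENGTH:
                labels[query] = labels[subject]
                changed = True
    return labels
```

For example, if element B matches a classified element A, and element C matches only B, then C becomes classifiable once B has been added, which is the benefit of iterating.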
The final output will be a set of files, as described in Table S1 and Fig. S1.
The databases used for the annotation itself are Pfam (version 26.0), RepBaseProt (version 17.0 for the REPET pipeline, 5,844 protein sequences) and TREPprot (version 11.0, 193 protein sequences). Pfam contains HMM motifs for functional annotation; TREPprot and RepBaseProt contain protein sequences of already known TEs from diverse organisms. Here we limited the set of protein data to plant LTR retrotransposon sequences, but it could theoretically be extended to any kingdom to identify copia and gypsy elements.
The sequences used for implementation and benchmarking are from RetrOryza,20 TIGR grasses repeats (one representative element per family), TIGR Oryza repeats (all annotated elements), RepBase (v. 17) and TREPtotal (v. 11). From those sequence libraries, we selected only the LTR retrotransposon sequences for our tests.
The benchmarks were performed using 4 threads per assay. The current version uses 2 threads for each analysis.
The sensitivity was calculated as the percentage of elements classified by LTRclassifier out of the total number of already classified elements in the reference tested set. The specificity was expressed as the percentage of elements correctly classified by LTRclassifier compared to the reference tested data set (already annotated elements).
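The two measures defined above can be computed as in the following Python sketch. The first is the share of reference elements that receive a classification. For the second, this sketch assumes that the share of correct classifications is taken over the elements that were classified; the text does not state the denominator explicitly, so this is an interpretation. The function name and the "unclassified" label are hypothetical.

```python
def benchmark(reference, predicted):
    """Return (classified_percent, correct_percent) for one reference set.

    reference: dict element -> known superfamily (already annotated elements)
    predicted: dict element -> superfamily assigned by the tool, or "unclassified"
    ASSUMPTION: the correct percentage is computed over classified elements only.
    """
    classified = [
        element for element in reference
        if predicted.get(element, "unclassified") != "unclassified"
    ]
    correct = [e for e in classified if predicted[e] == reference[e]]
    classified_percent = 100.0 * len(classified) / len(reference) if reference else 0.0
    correct_percent = 100.0 * len(correct) / len(classified) if classified else 0.0
    return classified_percent, correct_percent
```

For instance, with 4 reference elements of which 3 are classified and 2 of those agree with the reference, the first measure is 75% and the second about 66.7%.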
The Perl scripts are available at http://LTRclassifier.ird.fr/LTRclassifierScripts.tar.gz, under GPLv3.
The version implemented on the IRD bioinformatics cluster is available at http://LTRclassifier.ird.fr/. The input is limited to 8 Mbytes of data, with a minimum of a single sequence.
No potential conflicts of interest were disclosed.
The authors thank Benoit Piegu for identifying the consensus problem, and the two anonymous reviewers for their comments.
Francois Sabot http://orcid.org/0000-0002-8522-7583