|Home | About | Journals | Submit | Contact Us | Français|
Rapidly growing number of sequenced genomes requires fast and accurate computational tools for analysis of different transposable elements (TEs). In this paper we focus on rapid and reliable procedure for classification of autonomous non-LTR retrotransposons based on alignment and clustering of their reverse transcriptase (RT) domains. Typically, the RT domain protein sequences encoded by different non-LTR retrotransposons are similar to each other in terms of significant BLASTP E-values. Therefore, they can be easily detected by the routine BLASTP searches of genomic DNA sequences coding for proteins similar to the RT domains of known non-LTR retrotransposons. However, detailed classification of non-LTR retrotransposons, i.e. their assignment to specific clades, is a slow and complex procedure that is not formalized or integrated as a standard set of computational methods and data. Here we describe a tool (RTclass1) designed for the fast and accurate automated assignment of novel non-LTR retrotransposons to known or novel clades using phylogenetic analysis of the RT domain protein sequences. RTclass1 classifies a particular non-LTR retrotransposon based on its RT domain in less than 10 minutes on a standard desktop computer and achieves 99.5% accuracy. RT1class1 works either as a standalone program installed locally or as a web-server that can be accessed distantly by uploading sequence data through the internet (http://www.girinst.org/RTphylogeny/RTclass1).
All eukaryotic transposable elements (TEs) belong to only two types: retrotransposons and DNA transposons (Craig et al., 2002; Jurka et al., 2007; Kapitonov and Jurka, 2008). All genomic and extra-chromosomal copies of retrotransposons are transposed through an RNA intermediate. Their messenger RNA (mRNA) is expressed in the host cell, reverse transcribed, and the resulting DNA copy (cDNA) is integrated into the host genome. Reverse transcription and integration steps are catalyzed by reverse transcriptase (RT) and endonuclease/integrase (EN/IN), which are encoded by autonomous retrotransposons. Unlike retrotransposons, DNA transposons are transposed by transfering their copies from one chromosomal location to another without copying their RNA intermediates. DNA transpositions are catalyzed by DNA transposases encoded by autonomous DNA transposons (Craig et al., 2002).
Eukaryotic retrotransposons can be divided into four classes: non-long terminal repeat (non-LTR) retrotransposons, LTR retrotransposons, Penelope, and DIRS retrotransposons. While the first two classes are well established and studied (Eickbush and Malik, 2002), the Penelope and DIRS classes were only recently introduced (Arkhipova et al., 2003; Evgen’ev and Arkhipova, 2005; Poulter and Goodwin, 2005; Lorenzi et al., 2006; Gladyshev and Arkhipova, 2007). Members of all the four retrotransposon classes are present in the genomes of all eukaryotic kingdoms: Protista, Plantae, Fungi, and Animalia.
A typical autonomous non-LTR retrotransposon generally referred to as LINE (Long INterspersed Element) contains one or two open reading frames (ORFs), and an internal RNA polymeraseII promoter in its 5’-terminal region that drives transcription of the full-length retrotransposon. Both its RT and EN domains are universally encoded by the same ORF (Eickbush and Malik, 2002). An mRNA expressed during transcription of a genomic non-LTR retrotransposon serves as a template for reverse transcription, and the resulting cDNA is inserted in the genome. The mechanism of retrotransposition and integration of LINEs into the genome is is viewed as a process called target-primed reverse transcription (TPRT) (Luan et al., 1993; Eickbush and Malik, 2002). According to the TPRT model, reverse transcription is primed by the free 3’ hydroxyl group at the target DNA nick introduced by EN.
Despite the basic common mechansism, there are some variations in target preferences, duplications/deletions at the insertion sites as well as in the speed and accuracy of reverse transcription. Such variations often correlate with sequence differences among proteins encoded by different phylogenetic groups of non-LTR retrotransposons (Eickbush and Malik, 2002). Therefore, meaningful classification is an important step in studies of LINE elements.
Here we describe a simple approach to produce a semi-automatic classification of autonomous non-LTR retrotransposons based on phylogenetic analysis of their RT domain protein sequences. In modern taxonomy, a monophyletic group of living or fossil organisms that consists of a single common ancestor and all its descendants is often referred to as a clade (from klados or “branch” in ancient Greek). The term clade was introduced in 1959 by Julian Huxley (Huxley, 1959), and became popular in evolutionary biology during the last 20 years. In 1999, Malik, Burke and Eickbush proposed the use of the term “clade” to represent those non-LTR retrotransposons that share the same structural features, are grouped together based on phylogenetic analysis of the reverse transcriptase domain, and dated back to the Precambrian era (Malik et al., 1999).
Based on structural features of non-LTR retrotransposons and phylogeny of RTs, they were assigned to five groups, called R2, L1, RTE, I, and Jockey, which were subdivided prior 2003 into 15 clades: CRE, NeSL, R4, R2, L1, RTE, Tad1, R2, LOA, I, Ingi, Jockey, CR1, Rex1, and L2 (Malik et al., 1999; Lovsin et al., 2001; Eickbush and Malik, 2002). Another nine clades, including Outcast, Hero, RandI (also known as Dualen), Daphne, Tx1, RTEX, Proto1 and Proto2 have been reported since 2003 (see table 1). Currently, we consider 28 different clades, including the L2A, L2B, Nimb and RTETP clades introduced in this manuscript (figure 1, table 1, and supplemental figure 2). It is believed that the R2 group is composed of the most ancient non-LTR retrotransposons: the CRE, NeSL, R2, Hero, and R4 clades, which are characterized by a single ORF coding for the RT and EN domains. The endonuclease domain is similar to different restriction enzymes and is always preceded by the RT domain. Most likely, the restriction-like endonuclease (RLE) in retrotransposons from the R2 group is responsible for their frequent target-site specificity (Kojima and Fujiwara, 2005b). Members of the L1, RTE, I, Jockey groups encode the apurinic-apyrimidinic endonuclease (APE), which is always N terminal to the RT domain. In addition to RT and EN, all retrotransposons from the I group code for ribonuclease (RNase) H (Eickbush and Malik, 2002). Analogously, diverse plant L1 retrotransposons also code for the RNAse H (V.K., unpublished data). Usually, Non-LTR retrotransposons are transmitted vertically, with only few exceptions in the RTE clade (Kordis and Gubensek, 1997). RandI/Dualen retrotransposons identified in the green algae form an ancient clade; like R2-group elements, they contain only one ORF that codes for a protein with the RLE domain (although its similarity to known RLEs is marginal) (Kojima and Fujiwara, 2005a). In addition, this protein contains the conserved APE domain (Kapitonov and Jurka, 2004; Kojima and Fujiwara, 2005a). Therefore, we consider the RandI clade as a founder of a new group of non-LTR retrotransposons (figure 1 and supplmental figure 2).
The RT domain is functionally the most important and the only domain present universally in all autonomous non-LTR retrotransposons (Eickbush and Malik, 2002) (see also figure 1). Moreover, numerous studies by independent groups devoted to the assignment of non-LTR retrotransposons to different clades have relied basically on phylogeny of their RT domains and produced results that seem to be quite stable and reliable despite the amount of new data accumulated after publications (Malik et al., 1999; Lovsin et al., 2001; Eickbush and Malik, 2002; Kojima and Fujiwara, 2004; Putnam et al., 2007). Therefore, the RT-based phylogeny is probably unavoidable and mostly sufficient approach for assignment of diverse retrotransposons to known clades of non-LTR retrotransposons. However, the current methods of robust phylogenetic analysis are extremely slow. A construction of a reliable tree for some 100 protein sequences often takes more than a day, especially if these sequences are less than 20% identical to each other, as is typical for the RT domain of non-LTR retrotransposons (Fig. 2). Moreover, selection of diverse protein sequences encoded by different families of non-LTR retrotransposons is crucial for obtaining reliable results regarding classification of novel retrotransposons. Another problem is a huge diversity and complexity of modern methods of phylogenetic analysis (http://evolution.genetics.washington.edu/phylip/software.html). As a result, the classification of novel retrotransposons can be either inaccurate or unreasonably time consuming. Therefore, the use of well established reference sets of protein sequences encoded by previously classified non-LTR retrotransposons and of reliable and fast methods of phylogenetic analysis are highly important as “the rosetta stone” in future studies induced by an explosion of sequence data.
A basic scheme depicting an assignment of a protein sequence to a specific clade of non-LTR retrotransposons is outlined in Figure 3. Hereafter, we use the term “classification” as a synonym of the assignment to a known or new clade. First, a set of protein sequences of the RT domain encoded by known classified non-LTR retrotransposons was collected from Repbase (Jurka et al., 2005) and GenBank. Our choice of sequences included in this collection, named the “RTclass1 learning set”, was limited by two self-imposed restrictions: (i) the protein sequence identity between any two sequences included in the collection must be ≤60%, and (ii) the collection must contain only currently active or young non-LTR retrotransposons, represented mostly by their consensus sequences. The first restriction forces us to increase the amount of useful information in the learning set not just by the increase of the number of sequences but rather by the increase of the RT protein diversity covered by the included sequences. Most likely, inclusion of numerous sequences highly identical to each other would lead to dramatically slow computations, without significant improvement of the classification accuracy. The second restriction minimizes the amount of background noise introduced in the learning set due to numerous “dead mutations” accumulated in genomic copies of non-LTR retrotransposons that lost their mobility many million years ago. The average pairwise protein sequence identity between RTs that belong to two different clades is only 19% (Fig. 2). Therefore, even a small number of “dead mutations” included in the learning set may lead to errors in the multiple alignment and wrong classification. We consider a particular family of non-LTR retrotransposon to be young as long as the host genome contains several members/copies of this family that are less than 10% divergent from each other. The consensus sequence built from a multiple alignment of the genomic copies should be free of numerous “dead mutations”. Sometimes, the genome contains only a single copy of a particular family of non-LTR retrotransposons. We consider this single copy retrotransposon young if it codes for the standard-size ORFs without stop-codons interrupting them. Given that the standard ORF encoding the RT-containing protein is longer than 3 kb, the absence of stop-codons in such a long sequence ensures small number of “dead mutations” accumulated in the retrotransposon. As a result, the current RTclass1 learning set consists of 211 protein sequences of the RT domain from diverse families of classified non-LTR retrotransposons that belong to all known clades (see supplemental figures S1 and figure S2).
When a protein sequence encoded by a new retrotransposon is taken for classification (Fig. 3), boundaries of its RT domain are defined based on BLASTP similarities to the RT domain sequences collected in the learning set. In the next step, the multiple alignment of the new RT sequence with all sequences from the learning set is obtained by using MUSCLE (Edgar, 2004a; Edgar, 2004b). Given that the learning set was composed of N=211 sequences, we have tested whether the multiple alignment of N+1 sequences could be replaced by the realignment of the new unclassified sequence with the existing multiple alignment of the N sequences. In such realignment, the multiple alignment of N sequences can be only modified by indels introduced simultaneously at the same positions in all N sequences. Such an addition of a new sequence to the old multiple alignment, also known as the “profile alignment,” is implemented in CLUSTAL (Larkin et al., 2007) as well as MUSCLE (Edgar, 2004a) and is extremely fast. Unfortunately, the accuracy of the profile alignment of highly diverse RT domain sequences encoded by non-LTR retrotransposons is not adequate. For instance, we prepared a random sample of 15 non-LTR retrotransposons that were not included in the learning set. In seven (out of the 15) retrotransposons, the “profile alignment”-based classification differed from the expected classification supported by the standard multiple alignment. Therefore, we cannot rely on the profile alignment.
Nevertheless, according to the previously reported estimates, execution time of standard multiple alignment by MUSCLE increases only tenfold as the number of ~300-aa protein sequences increases from 200 to 1000 (Edgar, 2004a). Currently, the multiple alignment of the 211 RT domain sequences from the learning set takes ~10 seconds. Therefore, a multiple alignment of an expanded set of 1000 RT sequences (this is the expected number of sequences included in the learning set in the next two years) will take less than 2 minutes.
Our main objective is to develop a fast and reliable method that would permit to assign unclassified non-LTR retrotransposons to known and novel clades, either locally through a pipeline installed on a standard desktop computer or distantly via a web-server. Here, we present a simple procedure for ranking different methods and programs developed specifically for fast phylogenetic analysis of thousands of proteins sequences, including BIONJ (Gascuel, 2000), Clearcut (Sheneman et al., 2006), FastME (Desper and Gascuel, 2002), PHYLO_WIN (Galtier et al., 1996), QuickTree (Howe et al., 2002) and RaxML (Stamatakis et al., 2005).
For each method listed above, we used the same multiple alignment of previously classified 100 RT domains representing established clades of non-LTR retrotransposons, which we collected during recent identification and classification of non-LTR retrotransposons in the Nematostella vectensis genome (Putnam et al., 2007). The model phylogenetic tree of RT domain sequences encoded by these retrotransposons was constructed by using MEGA4 (Tamura et al., 2007). This tree was also supported by numerous studies of non-LTR retrotransposons in the past (Eickbush and Malik, 2002; Kojima and Fujiwara, 2004; Kojima and Fujiwara, 2005a). In the model tree, which represented 17 different clades of non-LTR retrotransposons, all sequence names were grouped into 17 model clusters, where each cluster was composed of the names of sequences that belonged to the same clade.
For each tested method we created 1000 bootstrap trees by generating permutations in the original multiple alignment by SEQBOOT from the PHYLIP package (Felsenstein, 2005). Every bootstrap tree, based on its Newick format representation (http://evolution.genetics.washington.edu/phylip/newicktree.html), was automatically split into all possible clusters. A cluster was defined as a Newick substring bordered by the left and right parentheses at its left and right ends and containing equal numbers of left and right parentheses, including the two border parentheses. As a result, every bootstrap cluster contained unique sequence names, and all clusters together contained the complete set of 100 sequence names. In a set of all possible clusters identified in the bootstrap tree, we kept only those 17 clusters that were closest to the 17 model tree clusters.
To determine how close was each cluster in the bootstrap tree to a particular model cluster, we counted the following numbers: the number of sequence names from the model cluster that were not present in the bootstrap cluster (n1), and the number of sequence names in the bootstrap cluster that were not present in the corresponding model cluster (n2). Based on these two numbers, we calculated the error α of the consensus cluster as
where M and T were the numbers of all sequence names constituting the model and consensus clusters, respectively. The smaller the α error, the closer is the bootstrap cluster to the model cluster. By screening all possible consensus clusters for every model cluster, we identified the unique consensus cluster characterized by the smallest error α .
For each bootstrap tree generated by the same tested method, after choosing all 17 best bootstrap tree clusters closest to the corresponding 17 model clusters/clades, we calculated the error δ of the tested method as
where Mi and Ti were the numbers of sequence names present in the corresponding to each other model and predicted bootstrap cluster i, respectively. The error of the method was calculated as the mean value of δ in the 1000 bootstrap trees. Among all tested methods (Table 2), BIONJ had the lowest error and was chosen as the best method for assignment of novel non-LTR retrotransposons to known clades.
The RTclass1 dataset is composed of 211 RT domain protein sequences that belong to 28 clades: CRE, R2, R4, RandI/Dualen, NeSL, Hero, Tx1, L1, Proto1, Proto2, RTETP, RTEX, RTE, I, Nimb, Ingi, Outcast, Jockey, Crack, Daphne, L2, CR1, Rex1, L2A, L2B, Tad1, R1, and Loa (Table 1). All 28 clades are currently introduced into the classification scheme implemented in Repbase (Jurka et al., 2005; Kapitonov and Jurka, 2008). To keep the nomenclature of individual non-LTR retrotransposons simple, we recommend the following standard rule for naming every novel non-LTR retrotransposon: “name of the clade”-”family number”_”species abbreviation” (Kapitonov and Jurka, 2008).
The basic classification scheme implemented in RTclass1 is flexible and allows simple modifications by choosing different methods of multiple alignment, estimation of the protein distances, and inferring phylogenetic trees (Fig. 3). The input protein sequence can be assigned to one of the known or novel clades in less than 10 minutes either by submitting it to the RTclass1 web-server or by executing the stand-alone program locally on a standard desktop computer with Linux operating system. The classification output consists of several reports: (1) the name of the clade the input sequence belongs to, or if it cannot be classified the word “out-group” is displayed, indicating an unknown, potentially novel clade; (2) the consensus phylogenetic tree built from 1000 bootstrap trees that can be viewed by any standard web-browser; (3) the “real” tree built from the multiple alignment of the input RT sequence and the RTclass1 sequences; (4) the multiple alignment.
The analyzed protein sequence should be in the FASTA format. In the first step, the RTclass1 tool uses WU-BLAST/CENSOR (Kohany et al., 2006) to extract the RT domain from to the analyzed sequence (Fig. 3). In the second step, RTclass1 creates the multiple alignment of the analyzed domain sequence and the RTclass1 dataset of RT domains. This multiple alignment is transformed by random bootstrap permutations via SEQBOOT (Felsenstein, 2005) into 1000 bootstrap multiple alignments, which are used later for calculation of 1000 protein distance matrixes by CLEARCUT (Sheneman et al., 2006). For each protein distance matrix, RTclass1 infers 1000 phylogeny trees by using BIONJ (Gascuel, 2000). In the next step, by analyzing the obtained 1000 bootstrap trees via Consense (Felsenstein, 2005), RTclass1 creates the consensus bootstrap tree and identifies the model cluster that contains the input sequence.
A particular eukaryotic genome may contain non-LTR retrotransposons that belong to more than 100 families (Putnam et al., 2007; Putnam et al., 2008). The phylogeny-based classification of all these families, even in its simplest version described here, takes more than 16 hours on an average desktop computer and demands hours of manual work. On the other hand, the phylogenetic analysis appears not to be necessary for classification of a non-LTR retrotransposon with RT domain over 40% identical to the domain encoded by some classified retrotransposon present in the learning set. This “magic” 40% threshold is well supported by the distribution of the inter-clade pairwise protein identity obtained for 211 RT sequences constituting the learning set (Fig. 2). In this set, the number of different pairs of sequences that belong to different clades equals 13780. The distribution of the pairwise protein identity of all these 13780 pairs is characterized by the mean and standard deviation equal 19.3% and 4.2%, respectively; with minimum and maximum values equal to 8% and 39%. Therefore, as long as the protein identity between two RT domain sequences is ≥40%, covering over 90% of their sequence length, one can safely assume that both sequences belong to the same clade.
The efficiency of this approach on a genome scale level can be demonstrated by our recent studies of non-LTR retrotransposons in the lancelet genome, which contains 192 families of non-LTR retrotransposons identified computationally (Putnam et al., 2008). As illustrated in figure 4, only 40 families need to be passed through the phylogenetic analysis described above to obtain accurate classification. Out of all 192 families, 105 can be immediately assigned to known clades based on ≥40% identity of their RT domain sequences to one of the classified RTs from the RTclass1 learning set (Fig. 4, step 1). Due to their dominant vertical transmission mode, most families of non-LTR retrotransposons present in a particular genome are much closer to each other rather than to non-LTR retrotransposons identified in other species, including those that constitute the RTclass1 learning set. Therefore, some of the 105 families classified based on high identities between their RTs to the RT1class sequences can be ≥40% identical to the remaining 87 unclassified families. In fact, using these 105 sequences as a new learning set 2, we found that another 42 families can be classified based on high identities of their RT sequences to classified sequences from the set 2 (Fig. 4, step 2). Repeating iteratively the described procedure (Fig. 4, steps 3–4), we have only 40 families left that cannot be classified based on BLASTP identity of their RTs to the previously classified RTs.
While the assignment of novel non-LTR retrotransposons by the RTclass1 tool is reliable and accurate, we would urge potentials users of this tool to be cautious in inferring the macro-topology of the global tree of non-LTR retrotransposons, e.g. the topology of branches connecting different clades to each other. Given the low identity between RTs from different clades (Fig. 2), an accurate reconstruction of evolution of clades, especially the oldest ones, needs additional studies and sophisticated methods of phylogenetic analysis that would take into account both variations of the mutation rate at different amino acid positions of the RT domain and variations of the mutation rate in different species.
To improve the current classification procedure, we are planning to enhance it by analysis of other protein domains in non-LTR retrotransposons, including endonucleases, ribonuclease and ORF1-encoded proteins. In one year, we are going to increase significantly the number of diverse RT domain sequences from young non-LTR retrotransposons (from 211 to ~1000) constituting the RTclass1 dataset. We would also like to encourage a feedback from potential users, including requests for submissions of new sequences and clades in the RTclass1 dataset and Repbase.
We would like to thank Oleksiy Kohany for help with putting the RTclass1 tool to the web-server and Irina Arkhipova for valuable comments on the manuscript. This work was supported by the National Institutes of Health grant 5 P41 LM006252.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.