Miniature inverted-repeat transposable elements (MITEs) are a special type of Class 2 non-autonomous transposable element (TE) that are abundant in the non-coding regions of the genes of many plant and animal species. The accurate identification of MITEs has been a challenge for existing programs because they lack coding sequences and, as such, evolve very rapidly. Because of their importance to gene and genome evolution, we developed MITE-Hunter, a program pipeline that can identify MITEs as well as other small Class 2 non-autonomous TEs from genomic DNA data sets. The output of MITE-Hunter is composed of consensus TE sequences grouped into families that can be used as a library file for homology-based TE detection programs such as RepeatMasker. MITE-Hunter was evaluated by searching the rice genomic database and comparing the output with known rice TEs. It discovered most of the previously reported rice MITEs (97.6%), and found sixteen new elements. MITE-Hunter was also compared with two other MITE discovery programs, FINDMITE and MUST. Unlike MITE-Hunter, neither of these programs can search large genomic data sets including whole genome sequences. More importantly, MITE-Hunter is significantly more accurate than either FINDMITE or MUST as the vast majority of their outputs are false-positives.
Transposable elements (TEs) reside in all characterized eukaryotic genomes where they are often the largest component. For example, sequences derived from TEs make up at least 31% of the genome of dog (Canis familiaris), 38% of mouse (Mus musculus), 46% of human (Homo sapiens) and 85% of maize (Zea mays ssp. mays L.) (1–4). TEs have structural features and classification systems that serve to distinguish them from simpler repetitive sequences like microsatellite repeats. TEs are divided into two classes based on the molecule involved in transposition: retrotransposons (Class 1) move via an RNA intermediate, whereas DNA transposons (Class 2) move via a DNA intermediate. In each class, TEs are further divided into superfamilies and families (5). In plants, six Class 2 superfamilies have been identified thus far: Tc1/Mariner, PIF/Harbinger, hAT, MULE, CACTA and Helitron (5,6). With the exception of Helitrons, TEs in the other five superfamilies have terminal inverted repeats (TIRs) and transpose through a cut-and-paste mechanism. TEs are also classified as autonomous or non-autonomous elements based on whether they can produce functional transposase.
Miniature inverted-repeat TEs (MITEs) are a special type of Class 2 non-autonomous element that is present in high copy numbers in many eukaryotic genomes. For example, ~56000 MITEs were identified in sorghum (Sorghum bicolor) (7), 73500 in rice (Oryza sativa) (8) and 150000 in human (9). Ever since their discovery almost 20 years ago (10,11), MITEs have been the subject of increasing interest in both plants and animals (12–15). Unlike the ‘traditional’ low copy non-autonomous TEs (such as the Ds element of maize), MITEs are uniformly short (most <500bp) and amplify rapidly from one or a few elements to very high copy numbers (16). The two largest MITE families, Stowaway and Tourist, were found to be members of the Tc1/Mariner and the PIF/Harbinger superfamilies, respectively (12,17–19). MITEs have also been reported from the hAT and MULE superfamilies (13,20).
While the rapidly expanding databases of genomic sequence present an opportunity to expand the study of MITEs, they also pose a significant challenge to their correct and efficient annotation. Many TE annotation programs have been developed that use one or more of the following computational approaches: (i) homology-based, (ii) de novo, (iii) polymorphism-based and (iv) structure-based (21–23). Homology-based TE annotation is powerful at detecting TEs that share sequence similarity with known elements, but it is inadequate at identifying full length or novel TEs. Methods using de novo approaches can discover all TEs as long as they have multiple copies. However, the drawback of this approach is that its output is a mixture of TEs from all superfamilies and non-TE repeats. As such, the manual identification and classification of TEs from the output of de novo methods is often very tedious and time consuming. Polymorphism-based approaches can discover new TEs but the output is also a mixture of different types of sequences. More importantly, their application is limited to the comparison of data sets from very closely related species. When compared to the other algorithms, structure-based approaches are very effective at discovering certain TE types like LTR retrotransposons. However, currently available programs are less successful at identifying other TE types like non-autonomous Class 2 transposons (including MITEs) because they possess few distinguishing structural features.
To date three programs have been developed exclusively to find MITEs: TRANSPO (24), FINDMITE (15) and MUST (25). TRANSPO is a homology-based program that requires known MITE sequences. As such it is not effective at finding new MITEs (21). FINDMITE and MUST are structure-based TE discovery programs that can be used to discover new MITEs because they search for common MITE structural features rather than similar sequences. However, because MITEs have only two common structural features, TIRs and target site duplications (TSDs), many sequences that are not MITEs are in the outputs of FINDMITE and MUST. Thus, the false-positive rates of these programs are very high and extensive manual curation is required to filter false-positives from their output files.
Here, we present MITE-Hunter, a program that accurately discovers MITEs as well as other short non-autonomous ‘cut-and-paste’ Class 2 TEs in genomic data sets including those of whole genomes. To evaluate MITE-Hunter, we compared it with FINDMITE and MUST. We chose the rice genome to evaluate the performance of MITE-Hunter because rice harbors abundant and well-annotated Class 2 TEs and MITEs (8,26,27). In the examples reported in this study, MITE-Hunter missed only two known rice MITEs and discovered 16 previously unknown elements. Compared to FINDMITE and MUST, MITE-Hunter has a much lower false-positive rate and its output is easier to check and classify. MITE-Hunter and related programs can be freely downloaded at http://target.iplantcollaborative.org/.
MITE-Hunter is a UNIX program pipeline composed mainly of Perl scripts. Given genomic sequences as the input data, MITE-Hunter identifies Class 2 non-autonomous TEs and produces outputs of consensus sequences classified into families. MITE-Hunter can use multiple processors (default 5 CPUs). The MITE-Hunter pipeline has five main steps that are summarized in Figure 1: (i) identify TE candidates through a structure-based approach, (ii) identify and filter false-positives using an approach based on the pairwise sequence alignment (PSA), (iii) generate exemplars, (iv) identify and filter false-positives using an approach based on the multiple sequence alignment (MSA), generate consensus sequences and predict TSDs and (v) group consensus sequences into families. Details of each step are presented in the results section.
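The structure-based idea behind step (i) can be illustrated with a short Python sketch. It is not MITE-Hunter's implementation (which is written mainly in Perl); it assumes perfect TIRs and exact TSDs of fixed length, and the function names, parameters and default values are hypothetical choices made for illustration only.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_candidates(seq, tir_len=10, tsd_len=2, min_len=50, max_len=2000):
    """Brute-force scan for spans whose two ends form a perfect TIR
    and that are flanked by identical direct repeats (a TSD).

    Returns a list of (start, end, tsd) with 0-based, end-exclusive
    coordinates of the candidate element (TSDs lie just outside).
    All thresholds are illustrative assumptions.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for start in range(tsd_len, n - min_len + 1):
        # The right end of a TIR element is the reverse complement of its left end.
        target = reverse_complement(seq[start:start + tir_len])
        last_end = min(n - tsd_len, start + max_len)
        for end in range(start + min_len, last_end + 1):
            if seq[end - tir_len:end] != target:
                continue
            left_tsd = seq[start - tsd_len:start]
            right_tsd = seq[end:end + tsd_len]
            if left_tsd == right_tsd:
                hits.append((start, end, left_tsd))
    return hits
```

Because many genomic sequences carry TIR-like and TSD-like ends by chance, a scan of this kind is only a candidate generator; the later filtering steps are what separate true elements from false-positives.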
Build 5 of the rice IRGSP/RAP genome sequence was used (28), as were Repbase version 14.02 (29) and RepeatMasker 3.26 (Smit, A.F.A., Hubley,R. and Green,P., unpublished data; http://www.repeatmasker.org). TE copy number was calculated using a previously described method (4). Pairwise sequence alignment (PSA) used BLAST (30) and multiple sequence alignment (MSA) used Muscle (31). All computation was done on a Linux cluster.
We applied MITE-Hunter to the rice genome with default parameters. MITE-Hunter completed the analysis in ~44h. Details of the algorithms and results of each step of MITE-Hunter are presented below.
To test the authenticity of the MITE-Hunter output we curated the 700 rice TEs (Figure 2). Each MSA file was manually analyzed for TIR and TSD structures that are characteristic of Class 2 TE superfamilies found in plant genomes. A TE is validated if it has at least three full-length copies and its ends, characterized by TIRs and TSDs, can be recognized from the MSA file. TEs that do not meet these criteria are considered to be false-positives. Using these strict parameters, we identified 46 false-positives. In addition, eight solo LTRs and four short Helitrons were identified and classified as false-positives. These 12 elements were in the MITE-Hunter output because they coincidentally have TIR-like and TSD-like structures near their ends. After removing these elements there were 642 TEs remaining from the original 700, resulting in a false-positive rate of 8.3% [(46 + 8 + 4)/700].
In addition to the 58 false-positives, we were unable to classify 15 TEs into superfamilies. Although these sequences appeared to be TEs (based on their MSA files), their TSDs and TIRs were ambiguous because they contained too many mismatches. As such, they were classified as unknowns.
The remaining 627 TEs were confirmed to be ‘cut-and-paste’ Class 2 TEs and were classified into previously described superfamilies. However, during the classification process we found that several families contained TEs belonging to more than one superfamily. By comparing their sequences, we discovered that this problem was caused by 14 compound TEs that were formed by the insertion of one superfamily member into another (Figure 2-a). Because TEs were grouped into families based on their similarity, these 14 compound TEs dragged TEs from different superfamilies together. In addition, we identified another 12 compound TEs that were formed by the fusion of two TEs from the same superfamily (Figure 2-b and -c). These 26 compound TEs have low full-length copy numbers in the genome and were excluded from the following analysis. Thus 601 TE consensus sequences remained.
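The curation arithmetic reported above can be retraced with a few lines of Python. The counts are those given in the text; the variable and function names are illustrative only.

```python
# Counts reported for the curation of the 700 MITE-Hunter consensus sequences.
initial = 700
steps = [
    ("false-positives failing the copy-number or TIR/TSD criteria", 46),
    ("solo LTRs", 8),
    ("short Helitrons", 4),
    ("unknowns with ambiguous TIRs and TSDs", 15),
    ("compound TEs spanning two superfamilies", 14),
    ("compound TEs within one superfamily", 12),
]


def curation_funnel(total, removals):
    """Return the number of sequences left after each removal step."""
    remaining = []
    for _, count in removals:
        total -= count
        remaining.append(total)
    return remaining


remaining = curation_funnel(initial, steps)

# Only the first three categories are counted as false-positives.
false_positive_rate = sum(count for _, count in steps[:3]) / initial

for (label, count), left in zip(steps, remaining):
    print(f"-{count:>3} {label}: {left} remaining")
print(f"false-positive rate: {false_positive_rate:.1%}")
```

The running totals pass through 642 (after the false-positives), 627 (after the unknowns) and 601 (after the compound TEs), matching the figures in the text.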
Manual curation revealed that some TE consensus sequences in the MITE-Hunter output were missing sequences, or carried additional sequences, at their ends. This problem is caused by the existence of false TIR and TSD structures near the authentic ones. The missing or additional sequences are mostly short and can be manually identified after locating the real TIRs and TSDs in the MSA files. After correcting the consensus sequences of the remaining 601 Class 2 TEs (by adding the missing sequence or trimming the additional sequence), the similarity between some TE sequences satisfied the grouping criteria in Step III (Figure 1C). As such we reran the programs in Steps III and V of MITE-Hunter and obtained the final data set composed of 551 TE consensus sequences grouped into 401 families. Of these, 97 Tc1/Mariner TEs are grouped into 86 families, 146 PIF/Harbingers into 104 families, 123 hATs into 95 families, 173 Mutators into 110 families and 12 CACTAs into 6 families.
To identify and characterize MITEs from MITE-Hunter output, we performed a RepeatMasker search of the rice genomic database using the curated 551 TE sequences as the query. From the RepeatMasker output, we counted the copy number of each TE (data not shown). To distinguish MITEs from lower copy Class 2 non-autonomous TEs, we defined a MITE as a Class 2 non-autonomous TE of <800bp and with at least 100 full-length copies in the genome. Potential MITEs that have not experienced significant amplification were defined as having fewer copies (10–99) but high sequence identity (identity ≥99%). Based on these criteria, we identified 132 rice MITEs from the MITE-Hunter output, including 15 hAT-MITEs, 22 Mutator-MITEs, 50 Stowaways and 45 Tourists. No additional CACTA MITEs were found.
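The working definition above translates directly into a small classifier. The Python sketch below is illustrative only: the record fields and labels are hypothetical, and it assumes that the <800bp length limit also applies to potential MITEs and that identity is supplied as a single fraction per element, neither of which the text states explicitly.

```python
from dataclasses import dataclass


@dataclass
class ClassTwoTE:
    """Hypothetical record for one curated Class 2 non-autonomous TE."""
    name: str
    length_bp: int
    full_length_copies: int
    identity: float  # sequence identity among copies, as a fraction (0 to 1)


def classify(te):
    """Apply the copy-number and length criteria described in the text."""
    if te.length_bp >= 800:
        return "other"  # too long to count as a MITE
    if te.full_length_copies >= 100:
        return "MITE"
    if 10 <= te.full_length_copies <= 99 and te.identity >= 0.99:
        return "potential MITE"  # few copies, but highly similar to each other
    return "other"


examples = [
    ClassTwoTE("example_high_copy", 300, 250, 0.90),
    ClassTwoTE("example_recent", 300, 40, 0.995),
    ClassTwoTE("example_low_copy", 300, 40, 0.95),
]
for te in examples:
    print(te.name, classify(te))
```

The example records are invented to exercise each branch and do not correspond to real rice elements.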
To estimate the false-negative rate of MITE-Hunter we used the rice Class 2 non-autonomous elements in the Repbase as the reference data set. Repbase was selected for this analysis because it is a collective TE database containing most, if not all, previously reported rice Class 2 TEs (29). However, because Repbase contains both Class 1 and 2 autonomous and non-autonomous TEs, the first step was to retrieve only rice Class 2 non-autonomous elements. From these we then selected 230 elements that were <1.7kb because the longest rice TE found by MITE-Hunter has 1676bp. The 230 elements were manually checked using the same approach that was applied to the MITE-Hunter output. Thirty-two of the 230 elements were excluded because they lack multiple full-length copies. In addition, 13 were excluded because their TIR and TSD structures could not be identified from MSA files. The remaining 185 Repbase TEs were classified into Class 2 TE superfamilies. By using the same approach as was used for identifying MITEs from the MITE-Hunter output, we identified 101 MITE-like elements from the 185 Repbase TEs, including 4 hAT-MITEs, 19 Mutator-MITEs, 40 Stowaways and 38 Tourists.
The false-negative rates of MITE-Hunter were calculated separately for Class 2 non-autonomous TEs and MITEs as follows. First, we used the curated 551 Class 2 non-autonomous TEs discovered by MITE-Hunter as the query to mask the Repbase data set using RepeatMasker. On average, 84.9% of the sequences in the Repbase data set were masked (Table 1, second column). Using a similar approach, 97.6% of MITE sequences in the Repbase were masked by the TEs in the MITE-Hunter output (Table 1, third column). Thus the false-negative rate of MITE-Hunter is 15.1% for Class 2 non-autonomous TEs and 2.4% for MITEs. MITE-Hunter failed to identify only two Tourist MITEs (OSTE23 and ID-4) that were in Repbase. In contrast, using the data of the Repbase as the libraries, 47.9% of Class 2 non-autonomous TEs and 83.4% of MITEs in the MITE-Hunter output were masked (Table 1, the last two columns). Sixteen MITEs discovered by MITE-Hunter were not found in Repbase including 1 Tourist, 11 hAT-MITEs and 4 Mutator-MITEs.
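The false-negative rates follow from the masked fractions by simple subtraction, as this short Python sketch shows. The masked fractions are the values reported in the text; the function name is illustrative.

```python
def false_negative_rate(masked_fraction):
    """Fraction of the reference set left unmasked by the query library."""
    if not 0.0 <= masked_fraction <= 1.0:
        raise ValueError("masked_fraction must lie between 0 and 1")
    return 1.0 - masked_fraction


# Fractions of the Repbase reference masked by the MITE-Hunter output.
reported = {
    "Class 2 non-autonomous TEs": 0.849,
    "MITEs": 0.976,
}

for label, masked in reported.items():
    print(f"{label}: {false_negative_rate(masked):.1%} false-negative")
```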
We tested the ability of two previously published MITE finding programs, FINDMITE and MUST, to discover MITEs in the rice genomic data set using default parameters. Importantly, when we attempted to use the entire genomic sequence (~372.8Mb) as the input data, both FINDMITE and MUST reported errors and quit. As such we applied FINDMITE and MUST to a much smaller data set, rice chromosome 12 (~28.2Mb) (Table 2). MUST completed the task in ~5h and 30min and generated 5485 putative TE sequences. Because FINDMITE requires users to define the TSD sequence and length, we chose ‘TA’, which is the TSD sequence of Stowaway MITEs. FINDMITE finished in <1min and generated 10 864 putative Stowaways. To calculate the false-positive rate, we randomly sampled 100 TE sequences from the output of each program and checked them using the same approach as was used for evaluating MITE-Hunter. With only 15 and 14 validated TEs for FINDMITE and MUST, respectively, both programs have a false-positive rate of over 80%. To perform an impartial comparison, we also applied MITE-Hunter to the rice chromosome 12 data set. Using default parameters, MITE-Hunter finished in 1h and 40min and generated 114 TE consensus sequences that were grouped into 88 families. Through manual curation, five TEs were identified as false-positives, resulting in a false-positive rate of 4.4%. Because the input data is a small subset of the rice genome, we did not compare the results of FINDMITE and MUST to the Repbase data to calculate the false-negative rate.
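The chromosome 12 comparison can be summarized numerically as follows. This Python sketch uses only the counts given in the text; for FINDMITE and MUST the rate is an estimate from a random sample of 100 sequences, whereas for MITE-Hunter all 114 consensus sequences were checked (109 validated is derived here as 114 minus the 5 reported false-positives).

```python
def false_positive_rate(checked, validated):
    """Share of checked sequences that failed manual validation."""
    return (checked - validated) / checked


# (sequences checked, sequences validated) on rice chromosome 12.
chr12 = {
    "FINDMITE": (100, 15),
    "MUST": (100, 14),
    "MITE-Hunter": (114, 109),
}

rates = {name: false_positive_rate(*counts) for name, counts in chr12.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```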
A necessary prerequisite for the comprehensive analysis of MITEs is their identification in newly sequenced genomes. Two programs were previously developed for this purpose, FINDMITE and MUST. However, as demonstrated in this study, both FINDMITE and MUST have very high false-positive rates (~85%) and cannot efficiently utilize whole genomic data sets like that from rice. To remedy this situation, we developed MITE-Hunter, which is a structure-based program pipeline that can efficiently identify TEs that have TIR and TSD structures from whole genome data sets. Important features of MITE-Hunter are discussed below.
MITE-Hunter has an efficient approach to reduce the high false-positive rate, which is the main limitation of currently available MITE discovery programs. The vast majority of rice genomic sequences with TIR-like and TSD-like structures are not Class 2 TEs. MITE-Hunter has two modules to filter false-positives, both of which exploit the principle that homologs of a true TE share sequence similarity only within the terminal structures. The main difference between the two modules is that one detects sequence similarity through the PSA approach while the other uses the MSA approach. The MSA-based module is more powerful at identifying false-positives but it is slower than the PSA-based module. To achieve both high speed and high sensitivity, the PSA-based module is first performed in Step II to filter most of the false-positives while the MSA-based module is performed in Step IV to filter the remaining false-positives. Because MITE-Hunter has such a system to identify and filter artificial TE candidates, the false-positive rate of MITE-Hunter (4.4–8.3%) is at least ten times lower than that of either FINDMITE (85%) or MUST (86%).
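One reading of the filtering principle is that copies of a true TE resemble each other inside the element but not in the flanking sequence, because each copy inserted at a different genomic location. The Python sketch below illustrates that reading only. It substitutes a simple ungapped, position-by-position identity for the BLAST and Muscle alignments used by MITE-Hunter, and the function names, thresholds and decision rule are hypothetical.

```python
from itertools import combinations


def ungapped_identity(a, b):
    """Fraction of matching positions over the shorter sequence (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def flank_supported_pairs(copies, element_min=0.8, flank_max=0.5):
    """Count copy pairs that are similar inside the element but
    dissimilar in both flanks.

    copies: list of (left_flank, element, right_flank) strings.
    """
    supported = 0
    for (l1, e1, r1), (l2, e2, r2) in combinations(copies, 2):
        similar_inside = ungapped_identity(e1, e2) >= element_min
        distinct_flanks = (
            ungapped_identity(l1, l2) <= flank_max
            and ungapped_identity(r1, r2) <= flank_max
        )
        if similar_inside and distinct_flanks:
            supported += 1
    return supported


def passes_flank_filter(copies, min_pairs=1):
    """Keep a candidate only if enough copy pairs have unrelated flanks."""
    return flank_supported_pairs(copies) >= min_pairs
```

Under this rule, a candidate whose copies match each other well beyond the putative TIRs and TSDs (for example, a fragment of a larger repeat) is rejected, whereas a candidate whose similarity stops at the element boundaries is kept.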
MITE-Hunter is competent at discovering Class 2 non-autonomous TEs especially MITEs. In our test, MITE-Hunter rediscovered most of the known rice Class 2 non-autonomous TEs (85%) and almost all MITEs (97.6%) in Repbase [Table 1, second and third columns]. Only two MITEs (OSTE23 and ID-4) in Repbase were missed by MITE-Hunter. OSTE23 is a very old MITE family and its TIR and TSD structures are difficult to detect even by manual examination of the MSA file. ID-4 has two mismatches in the TIRs that were not identified in Step I of MITE-Hunter.
Compared to other MITE discovery programs, the MITE-Hunter output is much easier to curate manually. First, the number of TEs in the MITE-Hunter output is very small because MITE-Hunter generates consensus sequences that best represent the whole TE data set of the genome being analyzed. As shown in the results section, MITE-Hunter generated 700 consensus TEs from the entire rice genomic data set. In contrast, FINDMITE generated ~10000 putative Stowaway MITEs using only the smallest rice chromosome (#12) as the input data set. Using the same data set MUST generated about 5000 elements. Second, for each TE sequence in its output, MITE-Hunter generates a MSA file and predicts TSDs, which are useful for both TE validation and classification. The validity of each TE discovered by MITE-Hunter can be determined by identifying TIRs and TSDs from the MSA file by manual inspection. Finally, in the output of MITE-Hunter, identified TEs are automatically grouped into families based on sequence similarity, which further helps manual curation by users. These features are of value to all users, especially those who need a TE data set that is 100% accurate and is classified into superfamilies.
In summary, MITE-Hunter is the first program to efficiently and accurately identify MITEs from whole genome sequence. Whereas the rice Class 2 non-autonomous TEs in Repbase were the products of many studies, MITE-Hunter was able to find virtually all the MITEs in a relatively short time frame and to do so accurately. Finally, the MITE-Hunter output is easy to curate as it contains highly condensed TE consensus sequences that are grouped into families. The validity of a TE discovered by MITE-Hunter can be quickly judged from the automatically generated MSA file, which is, to our knowledge, a unique feature of MITE-Hunter.
The National Science Foundation (NSF) plant genome (0607123 to S.R.W.). Funding for open access charge: The NSF plant genome grant 0607123.
Conflict of interest statement. None declared.
We thank Yaowu Yuan for valuable discussions of both of the programs and the article. We thank Hao Wang for installing and running MUST.