An important aspect in the study of the TEs in a genome is knowledge of their abundance and diversity, as well as classification and further organization of this information in a clear and comprehensive manner. Aiming at the discovery of unknown, active elements, we performed a search and characterization of TEs in the sequenced genome of An. gambiae. We used a combined strategy (Figure ) to identify and characterize repetitive elements present in the genome of the primary vector of malaria, An. gambiae. Our strategy combines an extensive search of repetitive elements within this genome utilizing the PILER-DF algorithm as a first screening method, together with a detailed characterization of each of the retrieved sequences by analyzing signature characteristics of TEs as well as performing homology-based analysis of the obtained sequences against several databases. This automated pipeline includes the different steps suggested recently by Wicker et al. for classifying eukaryotic TEs. The results are compiled in a database of repetitive elements in the mosquito genome, called AnoTExcel (Additional Files 1), that provides information for all the TE families identified, along with offering the individual sequences found in each family.
Pipeline for the identification and characterization of transposable elements in the Anopheles gambiae genome. Flow chart indicating the steps followed for the characterization of repeats in An. gambiae.
AnoTExcel is organized as an Excel spreadsheet whose cells contain, in hyperlinked format, the results of the various analyses performed to characterize each family. Each line of the spreadsheet represents a cluster containing a variable number of sequences with a high degree of identity among them; in most cases, a cluster constitutes a TE family. The spreadsheet columns contain the analyses performed on the sequences and are grouped into four types, colored for easier viewing as follows. Blue columns give the classification of the families. Green columns present the general characteristics of the sequences in each cluster, such as the presence of TIRs and their sequences, direct repeats (LTRs), consensus and centroid sequences, ORF sequences, etc. Orange columns present the results of the BLAST searches performed against several databases (see Methodology section). Yellow columns (FZ-GQ) present the results of a new clustering performed by tblastx at different identity thresholds (from 35% to 95% identity) over at least 50% of the length of the alignments. This was performed to identify shorter fragments of sequences belonging to the same family that were included in different clusters when more stringent conditions were applied. Clusters that share some degree of identity are colored in column A of AnoTExcel.
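The relaxed re-clustering described for the yellow columns can be illustrated with a minimal sketch. It assumes that pairwise tblastx results have already been reduced to tuples of (cluster A, cluster B, percent identity, aligned fraction); that tuple layout, the function name `group_clusters`, the single-linkage grouping rule, and the example values are all illustrative assumptions, not the procedure actually used to build AnoTExcel.

```python
from collections import defaultdict


def group_clusters(hits, min_identity, min_coverage=0.5):
    """Single-linkage grouping of cluster IDs linked by qualifying hits.

    hits: iterable of (id_a, id_b, percent_identity, aligned_fraction).
    A hit qualifies when identity >= min_identity and the aligned
    fraction is >= min_coverage. Returns sorted groups of 2+ members.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical hits; identities and coverages are invented for illustration.
example_hits = [
    (121, 87, 62.0, 0.55),
    (137, 144, 48.0, 0.60),
    (43, 197, 91.0, 0.30),  # high identity, but aligned over < 50% of the length
]

# One grouping per identity threshold, from 35% to 95% in steps of 10.
for threshold in range(35, 100, 10):
    print(threshold, group_clusters(example_hits, threshold))
```

Lowering the threshold merges more clusters, which is how shorter fragments of one family that were split under stringent conditions can be brought back together.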
This is not an exhaustive database, as some of the element families already known to be present in An. gambiae were not identified by our algorithm; however, it has the important and distinctive characteristic of presenting all the individual sequences within each family (including fragments or short sequences belonging to TE families) as well as their global alignments in FASTA format, the consensus and centroid sequences, and detailed information on their structural characteristics. In addition, it presents the results of several BLAST searches performed against different databases and includes, in hyperlinked format, the file with the significant matches to each database.
This constitutes a rich resource that includes important information on the different families of TEs present in the genome of An. gambiae besides serving as a platform for the analysis of TEs present in other sequenced genomes.
AnoTExcel contains 3826 repetitive sequences larger than 400 nts in length that were grouped into 245 clusters (see Methodology section). The repetitive sequences are considered "intact" because they are globally alignable and "isolated" because they are surrounded by unique sequences; therefore, they correspond to insertional events of different repetitive sequences, mainly TEs [25]. Our strategy in generation of the clusters permitted the grouping of sequences belonging to the same TE family with different sizes or corresponding to different regions of the same element. Considering that the sequences identified by PILER-DF are isolated and unique, it is possible to state that those sequences represent different transposition states of the same TE family, allowing for evolutionary or dynamical analysis. The spreadsheet containing AnoTExcel can be found at http://exon.niaid.nih.gov/transcriptome/TE/A_gambiae/AnoTExcel-WEB.zip and the standalone data can be downloaded from http://exon.niaid.nih.gov/transcriptome/TE/A_gambiae/AnoTExcel-SA.zip. The latter file should be decompressed to the user's computer, and the Excel file should be opened from within Excel. AnoTExcel can be downloaded and stored in a personal computer; it occupies approximately 178 Mb and can be modified by adding or deleting columns according to the user's needs.
We were able to assign most of the retrieved sequences (75%) to a given TE class (I or II); the rest were classified as rRNA (1%), or repetitive sequences with no TE signatures (24%) and possibly expressed pseudogenes (Figure ).
Distribution of repeats in AnoTExcel. (a) Distribution of clusters. (b) Distribution of long terminal repeat (LTR) superfamilies. (c) Distribution of non-LTR (NLTR) superfamilies. (d) Distribution of Class II superfamilies.
In order to assess the efficiency of our approach in identifying and characterizing TEs from the genome of An. gambiae, we compared the number of sequences in AnoTExcel with those deposited in Repbase and TEfam. The results are shown in Figure . It should be noted that these figures are not fully comparable, since many elements are deposited in Repbase and TEfam under different names (overlapping). Also, Repbase and AnoTExcel present data on subfamilies (that might correspond to different transposition events in time but not to different families) that were included in Figure . In addition, some but not all of the elements present in TEfam are also deposited in Repbase (redundancy). We identified approximately half of the LTR elements that have been described so far in the genome of An. gambiae, few of the NLTRs, and almost all the known Class II elements. Moreover, we were able to identify and characterize several families of elements (both Class I and II) that have not been reported previously, even in studies that used similar approaches to identify TEs in the mosquito genome [20]. Further characterization of the TE families is given below.
TE distribution in Repbase, TEfam and AnoTExcel. Number of LTR, NLTR and Class II families present in three databases: TEfam, Repbase and AnoTExcel. The numbers in the bars indicate the number of families in each category.
The TE families identified in AnoTExcel were classified according to their class, subclass, order, superfamily, and family, following the Wicker criteria for TE classification. We also classified the families according to the length of the sequences in each cluster, i.e., according to the percentage of the full-length canonical element represented by the cluster consensus sequence, as: full-length (100% match), fragments or remnants (less than 10% of match) and, depending on the case, as Solo LTR (for solitary LTRs resulting from homologous recombination in LTR elements), MITEs (miniature inverted-repeat transposable elements), or Class II-NA (non-autonomous elements, already described as such in Repbase). The methodology used here permitted identification of families presenting both full-length elements and degenerate copies of the same family (mainly for Class I-NLTR elements). The LTR elements contain several full-length, putatively active sequences, while NLTRs and Class II elements are mainly constituted by fragments or remnant sequences. For the NLTR elements, the percentage of clusters considered as "full" in AnoTExcel is notably higher than the percentage of full-length sequences that in fact exists, indicating that many of the NLTR clusters constitute a mixture of both full-length and fragmented sequences.
Notably, although the An. gambiae genome has been scrutinized in search of TEs before using several methodologies [22], we have been able to identify novel TE families, as detailed further below.
Elements Class I, order LTR
The LTR elements in the genome of An. gambiae constitute a numerous order, although there are only three superfamilies: Ty1-Copia, Pao-Bel, and Ty3-Gypsy, the last being the most diverse LTR superfamily within the mosquito genome.
The LTR elements identified in AnoTExcel were characterized based on the presence of flanking LTRs, on positive matches to Peptidase_A17, RVT_1, RVT_2, or RVE in Pfam, or directly on their significant matches to known LTR elements already deposited in the TE databases (TEfam and Repbase) or in the GenBank non-redundant database.
In AnoTExcel, they correspond to 19% of all the families identified (Figure ) and 26.4% of the total TEs identified, totaling 46 different clusters, and are represented by members of the three main superfamilies. Novel elements corresponding to the Copia and Pao-Bel superfamilies were also identified and will be described later.
Full-length and fragmented elements are not equally represented in the different LTR superfamilies. While the majority of the Pao-Bel elements are represented by full-length copies (68%), the Gypsys are represented by Solo-LTRs (50%), and the few Copia families identified correspond to full-length elements.
The predicted activity of the full-length sequences was also studied (Additional File 3). Based on i) the degree of nucleotide identity among the sequences belonging to the same families; ii) the degree of identity among the LTRs of individual elements (identical LTRs in all of them); iii) the indication of expression based on positive matches to EST databases in AnoTExcel; and iv) the presence of full-length ORFs containing conserved domains for the proteins involved in transposition, we concluded that a substantial fraction of the LTR elements described here corresponds to putatively active elements, as indicated by their presence within cDNA libraries (clusters 45, 115, 238, 140, 130, 191, 110, 199, 101, 98, 131, 182, 104, 119, 171) (Additional File 3). The LTR families contain elements presenting an average nucleotide identity higher than 99.5%, and comparison of the LTRs among all the sequences in each of the clusters showed a high degree of identity for the three superfamilies (average of 99.13% for Pao-Bel, 99.04% for Copia, and 99.11% for Gypsy). In addition, the identity between the two LTRs present in the individual sequences was calculated and appeared to be very high for some of the LTR members (averages of 99.81%, 99.64%, and 99.83% for Pao-Bel, Copia, and Gypsy, respectively) (Additional File 3). The presence of full-length ORFs in the sequences reinforces the idea that these families are active or have been very recently inactivated [19].
The Copia superfamily in the mosquito genome is represented by five different families: Copia1-5_AG [27] and the Mtanga family [30]. These families have been reported to Repbase, and none of them is present in TEfam. In AnoTExcel, elements belonging to two families, Copia3_AG (cluster 172) and Copia5_AG (cluster 150), were identified. The consensus sequences of these two families are identical to the corresponding consensus sequences deposited in Repbase. Element Copia3_AG is apparently a truncated element with no conserved protein domains. The Copia5_AG sequences present some characteristics of activity, such as the presence of conserved protein domains for RT and RVE and identical within-sequence LTRs, even though they presented no positive matches to the mRNA and the EST databases (Additional File 3).
Novel Copia elements
Two clusters (numbers 134 and 149 of AnoTExcel) were characterized as elements belonging to the Copia superfamily, although it was not possible to assign them to any of the already described families within this superfamily, since they present high nucleotide distances against all the previously characterized Copia elements (Additional File 4). The consensus sequences of these clusters showed significant matches by tblastx with the Copia5_AG consensus sequence and Copia_DM from Drosophila melanogaster, respectively (see AnoTExcel). A phylogenetic analysis including all Copia elements previously identified in An. gambiae and a Copia element from D. melanogaster showed no clear clustering of any of these sequences with the previously described Copia elements (Figure ).
Figure 4 Maximum likelihood phylogenetic tree of the pol region of Copia elements from the genome of Anopheles gambiae. The final alignment encompasses a region of 1928 aa positions. The tree was generated using default settings for MUSCLE 3.7 (as alignment tool), ...
Additional File 5 presents the main characteristics of the sequences belonging to the novel LTR elements described in AnoTExcel. The four sequences within cluster 134 are 4399 nts long, with a p-distance at the nucleotide level, considering the full-length alignment, of 0.0009 (standard deviation [sd] = 0.0003). The LTRs are 149 nts long and present a very high degree of identity (p-dist = 0.0033; sd = 0.0038). In two of these sequences, the 3' and 5' LTRs are identical (Additional File 5), indicating that they have inserted recently and have not had time to accumulate mutations between the LTRs. The consensus sequence presents an ORF of 1350 aa containing conserved regions for integrase (RVE), reverse transcriptase (RVT_2), and RNase H, a primer-binding site (PBS) for proline, and a polypurine signal at position 4234, information that reinforces the idea of activity of this family.
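For readers unfamiliar with the measure, the p-distance reported for clusters and LTR pairs is simply the proportion of aligned sites at which two sequences differ. The following is a minimal sketch; the function names, the treatment of gaps (columns with a gap in either sequence are skipped), and the toy sequences are assumptions made for illustration and do not reproduce the software or settings used for the values reported here.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence carries a gap are ignored
    (an assumption; other gap treatments are possible).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_p_distance(aligned_seqs):
    """Mean p-distance over all pairs of sequences in an alignment."""
    pairs = list(combinations(aligned_seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy 5' and 3' LTRs of one element (invented sequences).
ltr_5 = "TGTAGCATCA"
ltr_3 = "TGTAGCATCA"
print(p_distance(ltr_5, ltr_3))  # identical LTRs give a distance of 0.0
```

Because the two LTRs of an element are identical at the time of insertion, a within-element LTR distance of zero is the pattern interpreted in the text as evidence of recent transposition.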
The mean aa distance of the consensus sequence of cluster 134 to each of the consensus sequences of the already described Copia elements shown in Figure is 0.6520, indicating that it does not correspond to any of the already described Copia families (Additional File 4). The consensus sequence of cluster 134 shows two regions spanning 2008 and 1560 nts with 65% and 69% identity, respectively, with element Copia-5_AG. The unique ORF of cluster 134 presents 56% identity along the whole sequence with Copia5_AG. Still, the 149-nts-long LTRs of cluster 134 do not show any identity with the 108-nts-long LTRs from Copia-5_AG, indicating that they are not members of the same TE family.
The synonymous substitutions among the sequences in cluster 134 were more than six times more frequent than the substitutions at non-synonymous sites, which indicates the presence of purifying selection operating on them. Tajima's test, on the other hand, was not significant for these sequences, although the fact that there are only four sequences, with few segregating sites along the alignment, might limit the ability of the test to detect selective pressures operating on them (Additional File 5).
The consensus of the sequences in this cluster has been deposited in Repbase under the denomination Copia-6_AG.
The four sequences in cluster 149 are 2245 nts long with a mean p-distance of 0.0083 and sd = 0.0014, considering the whole alignment. The LTRs are 168 nts long and are identical in all the sequences, suggesting a recent transposition event.
The consensus sequence has an ORF of 615 aa that contains no conserved regions. The sequences present two different PBSs: three sequences have a PBS for valine, and one has a PBS for methionine. The consensus sequence for this cluster shows no regions of identity with previously identified elements. In the phylogenetic analysis shown in Figure , the cluster 149 sequence groups with a Copia element from D. melanogaster, although with a non-significant bootstrap value.
The mean aa distance of the consensus sequence of cluster 149 to each consensus of the already described Copia elements shown in Figure was 0.7645.
The BLAST searches performed on an EST library as well as on a predicted expressed mRNA, both from An. gambiae, gave positive matches (Additional File 5). So, even if this sequence apparently corresponds to an inactive element, it is being expressed, as demonstrated by its positive matches to both the mRNA and the EST databases, which suggests that this family is transcriptionally active. The consensus of the sequences in this cluster has been deposited in Repbase under the denomination Copia-7_AG.
According to Repbase, there are 20 families of Pao-Bel elements in the mosquito genome (AGM1 and Bel1-19_AG). AnoTExcel presents sequences of 15 of them plus 2 novel families. Eleven of these clusters contain full-length sequences that were further analyzed. They all contain ORFs with protein domains for RVT, PeptA17, and RVE (with the exception of cluster 110, belonging to the family Bel16_AG, which does not present an RVE domain), show very high sequence identity within each cluster and, in all the cases, the within-element LTRs are identical, indicating that they are active at present or were so very recently. In addition, ten of these families are being expressed, as indicated by their positive matches to the EST database (Additional File 3). A phylogenetic analysis of the RT-PeptA17 conserved domains of the Pao-Bel elements confirmed the classification of the clusters based on the information present in AnoTExcel. In this phylogeny, only those sequences containing conserved domains for the RT were included.
Novel Pao-Bel elements description
Two families of elements characterized as Pao-Bel due to their matches to Pao-Bel elements deposited in Repbase by tblastx (clusters 174 and 185, respectively) were further characterized.
Cluster 174 is composed of three sequences presenting a high degree of identity among them (p-dist = 0.0068; sd = 0.0010) (Additional File 5) and a total length of 5769 nts; two of them contain 213-nts-long identical LTRs, indicating that these sequences might have transposed very recently. The consensus sequence presents an ORF of 1758 aa with conserved domains for RT, PeptA17, and RVE, also indicating that these sequences might belong to an active or recently active family. The blastn performed on TEfam and Repbase showed that this family has not been reported before; still, the tblastx on Repbase gave positive matches to several Pao-Bel families, indicating that this family belongs to the Pao-Bel superfamily.
The consensus sequence of this cluster has three regions (nucleotide positions 700-1132, 2538-3907, and 4440-5357) with 63% identity with element Bel16-I_AG. Phylogenetic analysis performed with already characterized Pao-Bel elements showed that the consensus sequence of cluster 174 grouped together with the representative sequence of BEL16_AG with a significant likelihood branch support value (Figure ). Nonetheless, the LTRs of both families show no identities and present different sizes, indicating that they do not correspond to the same family.
Figure 5 Maximum likelihood phylogenetic tree of the pol region of Bel elements from the genome of Anopheles gambiae. The final alignment encompasses a region of 423 aa positions. Refer to legend of Figure 3 for detailed information. Two sequences belonging to ...
The sequences in cluster 174 present 100% identity with two cDNAs spanning 786 nts from a cDNA library, indicating that this element is being expressed.
On the other hand, we identified more substitutions in synonymous than in non-synonymous positions, with a dN/dS value of 0.2093, indicating selective pressure on the ORF of these sequences. Tajima's test was not executed due to the small number of sequences.
The consensus of the sequences in this cluster has been deposited in Repbase under the denomination Bel20_AG.
The three sequences in cluster 185 are 3803 nts long, with a p-distance of 0.0033 and sd = 0.00077, presenting LTRs of 227 nts that are identical within each element. The consensus sequence presents an ORF of 927 aa that has no conserved domains for known proteins. The LTR finder program detected a PBS for asparagine and a PPT signal.
The results obtained after blastn in the TE databases, TEfam and Repbase, indicate no identity with known elements; nevertheless, the tblastx against Repbase shows significant e-values with several Pao-Bel elements, mainly with families Bel-18_AG [32] and Bel-3_AG [33]. The LTRs of these families are not, however, similar in length or nucleotide composition to the LTRs of cluster 185.
The BLAST searches performed on an EST library as well as on a predicted expressed mRNA, both from An. gambiae, gave positive matches (Additional File 5). Nevertheless, the absence of conserved domains in the ORF present in this sequence made it impossible to compare it phylogenetically with other Pao-Bel sequences.
Analysis of dS vs. dN showed no differences in the proportion of substitutions in synonymous vs. non-synonymous positions, indicating neutral evolution (Additional File 5). The general characteristics of the sequences in this family suggest that it corresponds to a non-autonomous family composed of few elements. The consensus of the sequences in this cluster has been deposited in Repbase under the denomination Bel-21_AG.
The Gypsy superfamily is the most diverse LTR superfamily in terms of the number of different families that have been previously identified [16]. Traditionally, this superfamily had been classified into nine different lineages based on phylogenetic analysis of the RT, RNase H, and INT domains [35]. Six of these lineages have been previously reported in insects, and five of them (Gypsy, Mag, Mdg1, CsRn1, and Mdg3) were described in An. gambiae. Tubio et al. reported the identification of a huge variety of families within each of these lineages.
In AnoTExcel, we identified families with full-length sequences belonging to each of the so-called lineages, but we failed to identify the majority of the individual families.
AnoTExcel presents six clusters with full-length sequences in addition to other clusters presenting fragmented elements or solo LTRs. The six full-length clusters have signs of activity (Additional File 3), presenting functional ORFs, and five of them have positive matches to the EST library.
Most of the Gypsy elements that we identified correspond to solo LTRs. In An. gambiae, Tubio et al. found a high proportion of solo LTRs belonging to this superfamily [19] and interpreted this as evidence of a slow turnover of LTR elements in the genome of An. gambiae. This would mean that each individual copy remains for a longer time in the genome, enhancing the chance of homologous recombination, which is the proposed mechanism for generating this type of deteriorated LTR element [36]. The lower proportion of solo LTRs in the other LTR superfamilies might suggest that this process does not affect all LTR elements equally.
Elements Class I, order non-LTR
The non-LTR elements identified previously in the Anopheles genome and deposited in TEfam and/or Repbase constitute a very diverse order of elements composed of 22 different superfamilies, 7 of which were identified in AnoTExcel, belonging to the superfamilies RTE, Outcast, Jockey, I, and the majority of them to the CR1 superfamily. Some families were considered as remnants of NLTR elements for presenting little identity with NLTR elements from TEfam and/or Repbase by tblastx.
The NLTRs are usually between 3 and 8 kb long, and they usually contain two genes: the pol gene, fundamental for their replication, and the gag gene, which is also present in the LTR retrotransposons and retroviruses.
In AnoTExcel, the NLTR elements were classified based on their positive matches to RVT_1 in Pfam and/or their positive matches to polyprotein in the TE databases, as well as positive matches to sequences present in the specific TE databases or GenBank NR database.
In total, 32 clusters were classified as NLTRs (Figure ), corresponding to 13% of all the families identified and 18.4% of the total TE families. The families of NLTR elements contain, overall, more sequences than the LTR families (20.6 versus 5.1 sequences per family); that is to say, each family is more abundant and more heterogeneous, because most of the clusters are composed of both full-length and fragmented sequences. Most of the full-length sequences present conserved protein domains for exo-endo phosphatase and RT (Additional File 6). Nine of these clusters present a second ORF but no apparent conserved domains. The p-distances among all the sequences within each cluster are small and, together with the significant results of Tajima's test as well as the significant matches to the EST library (Additional File 6), indicate that some of these clusters are transcribed.
The majority of the NLTR families present in AnoTExcel correspond to the CR1 superfamily. Most correspond to fragmented sequences, and the full-length sequences keep a high nucleotide identity.
AnoTExcel also presents one cluster with signs of expression belonging to the I superfamily. Both the Jockey and Outcast superfamilies seem to have active members that have already been reported [18]. Three clusters belong to the RTE superfamily; two of them present full-length ORFs and positive matches to the EST library, and one has a positive match to the predicted expressed mRNA.
The actual transposons, or Class II elements, are characterized by the presence of a gene coding for a transposase enzyme flanked by TIRs and have been recently classified into two subclasses (1 and 2) according to the number of DNA strands that are cut during the transposition event [11]. Elements of class II belonging to subclass 1 and presenting TIRs have been subsequently classified into nine superfamilies, six of which have been previously identified in the Anopheles genome. Members belonging to all these superfamilies are present in AnoTExcel, where class II elements have been classified according to the presence of TIRs and to positive matches to already characterized elements deposited in any of the databases analyzed.
The largest share of the TE families identified in AnoTExcel belongs to this class (43%) (Figure ), which corresponds to 55.2% of all the TE sequences retrieved by PILER. Elements belonging to six different superfamilies (P, Tc1-Mariner, Transib, PIF-Harbinger, piggyBac, and hAT) were identified, as well as Helitrons, which belong to subclass 2. Tc1-Mariner constitutes the most numerous superfamily, with 39 different families, representing 35% of the class II families (Figure ).
We were not able to identify copies of full-length piggyBac families [39] or the Herves element, which corresponds to an active class II element [40]. This element belongs to the hAT superfamily, and although we identified other hAT elements in the genome, all of them constitute truncated copies [40]. On the other hand, 32 clusters with a variable number of sequences harboring TIRs with no relationship to previously known elements were also identified. These elements have been classified as novel Class II MITE-like elements and are characterized below.
The great majority of the class II elements identified here correspond to highly deteriorated sequences, represented by elements with different degrees of deterioration, including several already characterized families of MITEs, NA (non-autonomous families already identified in Repbase), fragments, and a few remnant clusters. Only four clusters harbor full-length sequences, belonging to superfamilies Tc1-Mariner (clusters 41, 114 and 133) and P (cluster 161) (Additional File 7). All contain full-length TIRs and, except for cluster 114, they contain conserved domains for transposase and positive matches to the EST library, so they probably constitute active or recently active elements (Additional File 7).
MITEs were originally described in plants [41] and have been found in other eukaryotic organisms, including mosquitoes [44]. They are small, non-autonomous elements (~100-500 bp) that contain TIRs but do not code for any protein. They are believed to originate from full-length active elements that lose their coding capacity but maintain the TIRs, which allow their mobilization, in a parasitic manner, by active transposases. There is ample evidence indicating a relationship between active elements and MITEs, although there is no clear mechanism that explains their generation [47]. They are normally present in high copy number. In AnoTExcel, 18 clusters were classified as MITEs of previously characterized families. Some of them had been identified as MITEs before, belonging to the superfamilies Tc1-Mariner, P, and Harbinger, while others have not been reported as such until now (e.g., the MITEs from the Gambol elements). The Gambol elements belong to the Tc1-Mariner superfamily; they contain the characteristic DD34E motif [53] and are represented by 13 different families that are deposited in TEfam. Here we identified ten families related to Gambol elements, six of which have TIRs with high identity to the TIRs of the original elements but of smaller size (Figure ). This might indicate that just a part of the TIR is necessary for element mobilization. In the six examples shown here, the extreme outer region of the TIRs is maintained while the inner region of the TIRs is not present. Also, the internal regions of all the Gambol MITE-like elements presented in Figure have no significant nucleotide similarity with any internal region of the Gambol counterpart elements. They all have quite small sizes, and in the case of the MITE-like elements of Gambol_Ele1, clusters 100 and 112, they constitute two subfamilies with almost identical TIRs but with different sizes and internal regions of different nucleotide composition, indicating that they probably originated in different events (Figure ).
Figure 6 Structure of MITE-like elements belonging to the Gambol superfamily. (i) The diagram represents the Gambol element as described in TEfam; striped arrows represent the terminal inverted repeats (TIRs). (ii) Representation of the structure of the MITE-like ...
It is possible that more than one family of MITEs evolved from a unique master copy, generating diverse families that keep on transposing and evolving depending on the presence of an active element in the genome, as has been demonstrated for P elements in An. gambiae. The origin of the internal region of MITEs is controversial: it might derive from internal regions of the respective master TEs followed by nucleotide degeneration or, alternatively, it might originate from an ectopic site [51].
MITE-like novel elements
Thirty-two clusters presenting TIRs or palindromic flanking ends, but with no identity to known TEs, either in the TIR or in the internal region, were identified. TIR lengths vary from 18 to 204 nts and total lengths from 430 to 1678 nts, and six of the clusters are, in fact, palindromic repeats. Palindromic TEs have been previously identified in Caenorhabditis elegans and D. melanogaster and also in Mariner elements identified in An. gambiae. Palindromes are predicted to form secondary structures; in bacteria, repetitive extragenic palindromic elements (REP) have been described as hotspots for transposition, indicating a relationship between REPs and TEs [56]. Little is known about the role of these sequences within genomes, however.
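The distinction between sequences flanked by TIRs and fully palindromic repeats can be made concrete with a short sketch. It checks only exact matches between the 5' end of a sequence and the reverse complement of its 3' end; real TIRs often carry mismatches, which this simplification does not tolerate. The function names and the toy sequence are assumptions for illustration and are not part of the pipeline described here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def tir_length(seq):
    """Length of the longest exact terminal inverted repeat.

    Position i from the 5' end is compared with the complement of
    position i from the 3' end; the scan stops at the first mismatch
    or at the middle of the sequence.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    length = 0
    while length < len(seq) // 2 and seq[length] == rc[length]:
        length += 1
    return length


def is_palindromic(seq):
    """True when the whole sequence equals its reverse complement."""
    seq = seq.upper()
    return len(seq) > 0 and seq == reverse_complement(seq)


# Toy element: 5-nt TIRs (CAGTG ... CACTG) around an unrelated internal region.
toy_element = "CAGTG" + "GTTACCCGA" + "CACTG"
print(tir_length(toy_element), is_palindromic(toy_element))
```

Under this definition, a palindromic repeat is the limiting case in which the inverted repeat extends across the entire sequence, leaving no distinct internal region.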
Nine of these clusters constitute sequences longer than 1000 nts, and 17 are longer than 800 nts (Additional File 8). Some of these families present more than 60 individual copies, but none showed signs of autonomous activity, as they only present small ORFs and no signs of conserved motifs. A significant proportion presents matches to EST databases, suggesting that they are being expressed, and five of them present significant matches to the mRNA databases.
Tajima's test for the group of sequences within these clusters showed significant values in 14 of them. Tajima's D is normally used as a selective neutrality test statistic; however, it has been suggested that sudden population expansions can lead to negative D values, moving the observed D value outside the 95% confidence interval derived for a neutral locus and stationary population [57]. Considering this, it is possible that the significantly negative D values obtained for the clusters analyzed here correspond to TE families that are in a process of expansion; indeed, they all constitute quite numerous families within the Anopheles genome.
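To make the statistic concrete, the following sketch computes Tajima's D from a gap-free comparison of aligned sequences using the standard formulas of Tajima (1989): D contrasts the mean number of pairwise differences with the number of segregating sites scaled by a1. The function name, the exclusion of gapped columns, and the toy alignment are assumptions for illustration; the sketch does not assess significance and is not the software used for the tests reported here.

```python
from collections import Counter
from math import sqrt


def tajimas_d(aligned_seqs, gap="-"):
    """Tajima's D for a list of aligned sequences of equal length.

    Columns containing a gap are skipped (an assumption). Returns None
    when there are no segregating sites, where D is undefined.
    """
    n = len(aligned_seqs)
    if n < 4:
        # For n < 4 the variance term of the statistic is zero.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    seg_sites = 0
    pairwise_diffs = 0.0
    for column in zip(*(s.upper() for s in aligned_seqs)):
        if gap in column:
            continue
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Number of sequence pairs that differ at this column.
            pairwise_diffs += (n * n - sum(c * c for c in counts.values())) / 2
    if seg_sites == 0:
        return None

    pi = pairwise_diffs / (n * (n - 1) / 2)  # mean pairwise differences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignment with two singleton variants (invented data).
toy_alignment = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TCGTACGT"]
print(tajimas_d(toy_alignment))
```

An excess of rare variants, as in the toy alignment where every variant is a singleton, pulls the mean pairwise difference below the expectation from the number of segregating sites and yields a negative D, the pattern discussed above for expanding families.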
We believe that these sequences are MITE-like elements, although we cannot identify the putative master TE because of the lack of identity with known elements.
MITEs are known to be present in several genomes, and they have been associated with master TEs [51]. They share their TIRs with active elements, and it appears that they manage to survive and spread within genomes by borrowing the transposase from active elements. What is not known is whether their internal regions derive from internal regions of the master elements or if they are copied from an ectopic site by a conversion process after a double-strand break. The clustering with less stringent conditions performed by tblastx, shown in the last columns of AnoTExcel, permitted the identification of sequences with a high degree of identity only among the TIR regions, i.e., different subfamilies of MITE-like elements. Three groups of clusters among the novel MITE-like elements (clusters 121-87; 137-144; and 43-197) were identified by the less stringent clustering. In all cases, they share high identity in TIR regions but low identity in internal regions (Figure ). It is possible that the divergence time between these elements is so long that it is impossible to find any detectable similarity.
Figure 7 Structure of three MITE-like families belonging to unknown transposable elements. (a) Family of MITE-like elements including clusters 121 and 87 from AnoTExcel. (b) Clusters 144 and 137; and (c) clusters 197 and 43. Black squares represent the terminal ...
It is interesting to note that many known MITEs, such as the Stowaway family in rice, have no homologies with other TEs, leaving an open question regarding the origin and means of replication of these small, non-autonomous elements [58]. This appears to be the case for the orphan MITE-like sequences presented here.
It is apparent that the Class II elements present in the genome of An. gambiae are composed of a variety of different structurally degenerated sequences that might represent different stages in the process of deterioration of these elements, which in turn might be differentially involved in the regulation of Class II families [59].