Pre-mRNA splicing is essential to ensure accurate expression of many genes in eukaryotic organisms. In Entamoeba histolytica, a deep-branching eukaryote, approximately 30% of the annotated genes are predicted to contain introns; however, the accuracy of these predictions has not been tested. In this study, we mined an expressed sequence tag (EST) library representing 7% of amoebic genes and found evidence supporting splicing of 60% of the testable intron predictions, the majority of which contain a GUUUGU 5′ splice site and a UAG 3′ splice site. Additionally, we identified several splice site misannotations, evidence for the existence of 30 novel introns in previously annotated genes, and identified novel genes through uncovering their spliced ESTs. Finally, we provided molecular evidence for the E. histolytica U2, U4, and U5 snRNAs. These data lay the foundation for further dissection of the role of RNA processing in E. histolytica gene expression.
Eukaryotic genes are often expressed as discontinuous units requiring the removal of intervening RNA sequences (introns) in order to discern their reading frames and ensure their accurate expression. The pre-mRNA-splicing reaction partners are brought into proximity through dynamic rearrangements of the spliceosome, an RNP complex composed of numerous proteins and five noncoding snRNAs: U1, U2, U4, U5, and U6 (18, 27). The precise splice sites are characterized by conserved sequence elements.
Entamoeba histolytica infects an estimated 500 million people annually (41). Cysts are ingested in food and water contaminated with fecal matter and excyst into the disease-causing trophozoite in the small intestine. In most people, this results in asymptomatic colonization and reencystation with no subsequent pathology. However, 50 million of those infected each year develop invasive disease (bloody diarrhea or liver abscesses) (41). How E. histolytica regulates gene expression during host invasion, encystation, excystation, and trophozoite vegetative growth is largely unknown.
Prior to completion of the E. histolytica genome sequence, only a few introns had been reported (24, 33, 34, 40). Based on these limited data, the consensus amoebic 5′ and 3′ splice sites (5′, GUUUGU; 3′, UAG) and the lack of a well-conserved branch point consensus were described (40) and incorporated into the computational gene finders used for genome annotation (24). Given that only a few examples of introns had previously been uncovered, it was surprising that the genome-sequencing project revealed 3,188 introns in the 9,938 predicted genes (24). Correct intron removal is therefore a necessity for the accurate expression of roughly a third of the presently annotated E. histolytica genes. However, the vast majority of these intron predictions lacked molecular validation. The absence of a systematic test of splice site predictions and splicing in this organism presents a significant barrier to our ability to understand its genome structure and the role of RNA processing in amoebic gene regulation.
In this study, we computationally mined an E. histolytica expressed sequence tag (EST) library for hallmarks of splicing. The questions we sought to address were (i) how accurate are the current intron predictions and (ii) how complete is our understanding of splicing in this organism. We compared the intron predictions to the processing patterns deduced from EST analysis, mined the ESTs for novel introns, and used covariance models to computationally identify E. histolytica snRNAs. We found evidence supporting the splicing of several predicted introns and identified several splice site misannotations, novel introns in annotated genes, and novel intron-containing genes. In addition, we identified EST evidence for intron retention and provided molecular evidence for U2, U4, and U5 snRNAs. These data are the result of the largest-scale test of splicing in this organism to date and form the basis for dissecting the interplay between the spliceosome and other cellular machinery involved in amoebic-gene regulation.
The E. histolytica EST library was created from pooled total RNA (from parasites in the mid-log and stationary phases and from a mouse model of amoebic colitis) (Barbara Mann, personal communication). The datasets containing the intron predictions, EST sequences, and gene predictions were downloaded from The Institute for Genomic Research (http://www.tigr.org/tdb/e2k1/eha1/).
In order to determine which genomic loci were likely to encode the ESTs, we aligned the EST sequences to the genome sequence data, using the BLAT alignment program (with the version 30 default parameters) (20). Per the default parameters, there were no restrictions on the size of the gap or the amount of 5′ and 3′ overlap between the ESTs and the genomic sequence. Because each EST should be nearly identical to the corresponding genomic region (some mismatch was allowed for sequencing errors), we considered alignments that had ≥98% sequence identity between the genomic regions and the full-length EST transcript (matches of >0.98 × QuerySize).
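The identity filter described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it assumes that BLAT was run with its tab-separated PSL output and that header lines have already been skipped, and the function names `parse_psl_line` and `passes_identity_filter` are hypothetical. The threshold is applied as "matches ≥ 98% of the query size", following the ≥98% wording, and integer arithmetic is used to avoid floating-point rounding at the boundary.

```python
def parse_psl_line(line):
    """Parse one tab-separated PSL alignment line into a small dict.

    Only the PSL columns needed for filtering and gap detection are kept.
    """
    f = line.rstrip("\n").split("\t")
    return {
        "matches": int(f[0]),      # number of matching bases
        "strand": f[8],
        "q_name": f[9],            # EST identifier
        "q_size": int(f[10]),      # full EST length (QuerySize)
        "t_name": f[13],           # genomic scaffold
        # blockSizes and tStarts are comma-terminated lists in PSL
        "block_sizes": [int(x) for x in f[18].rstrip(",").split(",")],
        "t_starts": [int(x) for x in f[20].rstrip(",").split(",")],
    }


def passes_identity_filter(rec, min_percent=98):
    """Keep alignments whose matches cover >= min_percent of the full EST."""
    return rec["matches"] * 100 >= min_percent * rec["q_size"]
```

Under these assumptions, a 500-nucleotide EST with 495 matching bases would be retained, whereas one with 480 matching bases would be discarded.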
In order to identify possible introns, we computed the coordinates of unaligned gap regions (i.e., the putative introns) from the BLAT alignments described above. The EST gap coordinates were computationally compared to the 3,188 predicted intron coordinates determined from the genome sequence project (24). If the EST gap coordinates matched the coordinates of a predicted intron, we counted this intron as “spliced as predicted” (see Table S1 in the supplemental material). If the EST gap coordinates did not match the predicted intron, we counted that intron as “spliced but at coordinates other than those predicted” (Table 1). If the EST gap coordinates did not map to a region known to contain an intron or gene, we deemed that intron “novel.” If no ESTs mapped to a region containing a predicted intron, we deemed that intron prediction “untestable” and did not consider it further. If only ungapped ESTs mapped to a region containing a predicted intron, we deemed that intron “not spliced as predicted” (see Table S2 in the supplemental material).
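The classification rules above can be expressed as a short Python sketch. It is illustrative only and rests on several assumptions that are not part of the original methods: coordinates are 0-based and half-open, an alignment is represented as a hypothetical tuple `(aln_start, aln_end, gaps)`, only gaps in the genomic (target) coordinates are considered, and a prediction counts as "spanned" when an EST alignment covers the entire predicted intron. The minimum gap of 23 nucleotides is the value used when testing the predictions.

```python
def target_gaps(t_starts, block_sizes, min_gap=23):
    """Return genomic gaps (start, end) between consecutive aligned blocks."""
    gaps = []
    for i in range(len(t_starts) - 1):
        gap_start = t_starts[i] + block_sizes[i]  # end of block i
        gap_end = t_starts[i + 1]                 # start of block i + 1
        if gap_end - gap_start >= min_gap:
            gaps.append((gap_start, gap_end))
    return gaps


def classify_prediction(predicted, est_alignments):
    """Assign a predicted intron (start, end) to one of four categories."""
    p_start, p_end = predicted
    # ESTs whose alignment covers the whole predicted intron
    spanning = [a for a in est_alignments if a[0] <= p_start and a[1] >= p_end]
    if not spanning:
        return "untestable"
    gaps = [g for a in spanning for g in a[2]]
    if predicted in gaps:
        return "spliced as predicted"
    if any(g[0] < p_end and g[1] > p_start for g in gaps):
        return "spliced at other coordinates"
    return "not spliced as predicted"
```

In this sketch an exact coordinate match takes precedence, so an intron supported by one exactly matching EST is counted as spliced as predicted even if other ESTs differ.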
We computationally identified the U2 and U4 spliceosomal RNAs using a combination of Hidden Markov models (HMMs) and stochastic context-free grammars (SCFGs), techniques that search for conservation in the primary sequence and secondary structures between a query sequence and a training set (3, 11, 23, 36). U2 and U4 in the Rfam database Release 7.0 were used to train the above programs (14, 15). The majority of the genome sequence was filtered out, using HMMs (default parameters, version 2.3.2). The remainder of the genome sequence with the greatest similarity to known U2 and U4 snRNAs was further scored, using an SCFG (internal package) against the models obtained from Rfam (default parameters, version 0.7). In order to identify U5 snRNA, we downloaded all 235 full sequences of U5 from the Rfam database. We used BLAT (standard parameters, version 30) to align each of these sequences against the full E. histolytica genome sequence.
E. histolytica strain HM-1:IMSS was grown axenically in Trypticase-yeast extract-iron-serum (TYI-S-33) medium (9, 26). Trophozoites were grown to log phase, and total RNA was isolated, using Trizol reagent. Genomic DNA was isolated as indicated by Ali et al. (1).
One microgram of total RNA was treated with DNase I and incubated with 0.5 μg of oligo(dT)15 for 10 min at 95°C, and reverse transcription and cDNA amplification were performed as described by Ehrenkaufer et al. (12). The PCR products were electrophoresed on a 6% native acrylamide gel and stained with ethidium bromide. The cDNA PCR products were cloned into a TOPO-TA vector (Invitrogen) and sequenced, and splicing of the intron was determined based on its absence from the cDNA. For Northern blot analysis, 10 μg of total RNA from E. histolytica HM-1:IMSS trophozoites was electrophoresed on a 6% acrylamide-7 M urea gel along with a radiolabeled 10-base-pair marker (Invitrogen), transferred onto a Hybond-N+ (Amersham) nylon membrane, and cross-linked, using a Stratalinker. Oligonucleotide probes (see Table S3 in the supplemental material) were prepared and used to probe the membrane as described by Davis and Ares (7).
The following sequences have been deposited in GenBank under the numbers indicated: U2 snRNA, BK006130; U4 snRNA, BK006131; and U5 snRNA, BK006132.
The E. histolytica genome sequence was completed in 2005 and led to a list of 3,188 putative introns in 9,938 predicted genes (24). This is a substantial number of introns compared to the paucity of introns in the related protists Giardia lamblia and Trichomonas vaginalis, suggesting that splicing plays a greater role in amoebic-gene regulation (4, 32, 35, 38). In order to gather a global view of the predicted introns, we determined their sizes, their positions with respect to the start codon, and the nucleotide frequencies at the 5′ and 3′ splice sites. Distribution analysis of the predicted E. histolytica intron sizes indicated that the vast majority are small, ~40 nucleotides in length (Fig. 1A). This is consistent with previous reports of small introns in E. histolytica (33, 39, 40) and comparable to intron sizes from the single-cell parasites T. vaginalis and G. lamblia (4, 32, 35, 38). We noticed that 35 of the predicted E. histolytica introns are smaller than 23 nucleotides. Although spliceosomal introns as small as 23 nucleotides have been validated in the ciliate Paramecium (42), the 23-nucleotide intron size may reflect a lower limit on the geometric constraints for snRNA binding and lariat formation; thus, we concluded that these introns are likely not real (see Table S2 in the supplemental material). Finally, we found that in E. histolytica, the highest proportion of introns are located toward the 5′ end of the transcript (Fig. 1B), a feature commonly found in intron-sparse genomes (29).
Analyses of the predicted splice sites indicate that the primary 5′ splice site is composed of GUUUGU and the 3′ splice site is UAG (Fig. 1C), consistent with the previous limited reports of introns in E. histolytica (25, 34, 40). One of the unique features of the spliceosomal introns identified in T. vaginalis and G. lamblia is the incorporation of a well-conserved branch point sequence into an extended 3′ splice site (32, 35). Of the known T. vaginalis introns, the branch point sequence ACUAAC is incorporated into the extended 3′ splice site, prompting speculation that T. vaginalis spliceosomes may combine the steps of branch point- and 3′-splice site recognition (38). In contrast, only 90 of the 3,188 predicted E. histolytica introns contain this sequence (data not shown), indicating that this branch point sequence is not strictly conserved in E. histolytica introns. However, sequences that resemble the degenerate mammalian branch point are found in many E. histolytica introns (40). Lastly, a substantial number of E. histolytica genes are predicted to contain multiple introns (24), raising the issue of whether some of these genes undergo regulated or alternative splicing.
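A tally of splice site sequences of the kind summarized above can be sketched as follows. This is a hedged illustration, not the analysis code used in the study: the function name `splice_site_counts` is hypothetical, intron sequences are assumed to be supplied 5′ to 3′ on the transcribed strand, the first six nucleotides are taken as the 5′ splice site and the last three as the 3′ splice site, and the minimum length of nine nucleotides is an arbitrary guard so that the two windows never overlap.

```python
from collections import Counter


def splice_site_counts(intron_seqs):
    """Count 5' hexamers and 3' trimers across a set of intron sequences."""
    donors = Counter()
    acceptors = Counter()
    for seq in intron_seqs:
        rna = seq.upper().replace("T", "U")  # report sites in the RNA alphabet
        if len(rna) < 9:
            continue  # too short to hold both windows without overlap
        donors[rna[:6]] += 1
        acceptors[rna[-3:]] += 1
    return donors, acceptors
```

Calling `donors.most_common(1)` on the result would return the most frequent 5′ splice site hexamer in the input set.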
Although 3,188 introns have been predicted in E. histolytica, fewer than 20 have been experimentally validated (25, 33, 40). In order to determine the accuracy of the intron predictions, we directly compared the predicted introns to their spliced counterparts by mining an EST library for hallmarks of splicing. To accommodate the putative intron, we allowed for gaps of ≥23 nucleotides to occur in the EST relative to its genome sequence (Fig. 1A). Of the 3,188 predicted intronic loci, 275 are spanned by ESTs that satisfy these criteria and are therefore testable. In order to determine if the predictions matched the ESTs, we compared the EST gap coordinates to those of the predicted intron. One hundred sixty-four of the EST gap coordinates matched the coordinates of the predicted intron, indicating that they are spliced exactly as annotated (see Table S1 in the supplemental material), at splice sites primarily composed of GUUUGU-UAG (Fig. 1D). However, for other introns, the predicted coordinates did not match those deduced from the ESTs, indicating that these predictions are incorrect (Table 1). In general, we noticed that splice sites that were incorrectly predicted to use a splice donor other than the preferred GUUUGU are not used in vivo, in favor of a nearby GUUUGU. Likewise, a nearby UAG 3′ splice acceptor site appears to be utilized over GAG, AAG, and, in some instances, even a neighboring UAG. Moreover, in nearly all cases, the spliced intron was smaller than predicted. Lastly, although 103 of the 275 testable putative introns contain canonical splice sites, we failed to find evidence for their removal in any of their corresponding ESTs (Fig. 1D; also see Table S2 in the supplemental material). This suggests that these either are not introns, are not spliced under the conditions represented in the EST library, or have such low splicing efficiency that no spliced isoforms were cloned.
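The proportions implied by these counts can be checked with a few lines of Python. The counts 275, 164, and 103 are taken from the text; the size of the third group is obtained here by subtraction, which assumes that the three categories are mutually exclusive and together account for all testable predictions. The helper name `pct` is hypothetical.

```python
testable = 275       # predicted introns spanned by ESTs
as_predicted = 164   # EST gap coordinates match the prediction exactly
no_evidence = 103    # only unspliced ESTs observed

# Assumption: the remainder corresponds to introns spliced at other coordinates
other_coordinates = testable - as_predicted - no_evidence


def pct(n, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


summary = {
    "spliced as predicted": pct(as_predicted, testable),
    "no evidence of splicing": pct(no_evidence, testable),
    "spliced at other coordinates": pct(other_coordinates, testable),
}
```

Under that assumption, about 59.6% of the testable predictions are spliced as predicted and about 37.5% show no evidence of splicing, in line with the rounded figures of 60% and 37% quoted elsewhere in the text.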
In order to identify novel processing events within the E. histolytica EST database, we mined the ESTs for transcripts with intron-like features independent of any prior predictions. We queried the ESTs for regions that have two or more blocks of sequence with at least 98% identity to the genomic sequence and are separated by a gap of 40 to 200 nucleotides and hand collated the data. In total, we identified 35 novel introns, each of which was classified into one of three categories based on how it affected the protein-reading frame (Table 2).
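The gap-size query described above lends itself to a simple filter. The sketch below is an assumption-laden illustration rather than the procedure actually used: the function name `novel_candidates` is hypothetical, coordinates are 0-based and half-open on a single scaffold, and a gap is treated as a novel candidate only if it does not overlap any previously predicted intron, which is one possible reading of "novel". The hand collation step is not modeled.

```python
def novel_candidates(est_gaps, predicted_introns, min_len=40, max_len=200):
    """Return EST gaps of intron-like size that overlap no predicted intron."""
    candidates = []
    for start, end in est_gaps:
        if not (min_len <= end - start <= max_len):
            continue  # outside the 40 to 200 nucleotide window
        overlaps_prediction = any(
            start < p_end and end > p_start
            for p_start, p_end in predicted_introns
        )
        if not overlaps_prediction:
            candidates.append((start, end))
    return candidates
```

Candidates returned by such a filter would still require manual inspection of splice site sequences and reading frames before being assigned to a class.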
Class I is the largest class of novel introns we identified. These introns are located in or near annotated genes but in regions not annotated to be intronic; i.e., they were predicted to be exonic or in regions immediately proximal to an open reading frame. However, in silico translation of the surrounding spliced sequence revealed an extension of the protein-coding region of the adjacent genes.
Class II introns map immediately proximal to annotated open reading frames. However, in contrast to Class I introns, in silico translation of the surrounding spliced sequences did not alter the protein-coding region of the adjacent genes, suggesting that these introns reside in their untranslated regions (UTRs). Thus, their retention or removal does not affect the protein-coding potential of the gene.
Class III introns are located in regions currently annotated as “intergenic” and not predicted to have any protein-coding potential. However, in silico translation of the spliced sequences surrounding the introns uncovered several novel proteins with extended reading frames. Most of these predicted genes have not been previously identified in E. histolytica but have homologs in other organisms. One of the novel genes (on BLAT scaffold 154) lacks homology to any known proteins and contains two introns (one represented by an EST and the other identified computationally while deciphering the protein-reading frame). Splicing of both introns was confirmed by reverse transcription (RT)-PCR and cDNA sequencing (data not shown).
In order to experimentally confirm splicing of the novel introns identified above, we performed RT-PCR on cDNA generated from log-phase E. histolytica HM-1:IMSS trophozoites grown under standard axenic culture conditions. In all cases tested, PCR amplification of cDNA using exonic primers spanning the novel introns generated a product smaller than that amplified from genomic DNA, consistent in size with that from splicing of the predicted introns from these transcripts (Fig. 2). The cDNAs for acriflavin resistance protein, pantothenate kinase, 47.m00184, 21.m00231, and (154.m), a novel gene with no homology to any known protein in the GenBank database, were cloned and sequenced (data not shown). In all cases, the sequencing results confirmed that the splice sites indicated in Table 2 were used. Given the canonical splice donor and acceptor sequences in Table 2, we expect that the remaining novel introns are likewise correct. These data demonstrate that the novel introns we identified are efficiently spliced in log-phase E. histolytica trophozoites and suggest that many additional introns remain to be uncovered.
Multi-intron-containing genes are generally a feature of higher eukaryotes and are often accompanied by alternative splicing, such as exon skipping and mutually exclusive exons (17). Approximately 6% of the presently annotated genes in E. histolytica are predicted to be multi-intron containing (24). However, none of the ESTs that span two or more predicted introns in a gene exhibit evidence for exon skipping or mutually exclusive exons (data not shown). Moreover, we found no evidence of exon skipping or mutually exclusive exons in RT-PCR experiments in log-phase E. histolytica trophozoites using primers that span several exons in 10 other multi-intron-containing genes (data not shown).
Other forms of alternative splicing, such as intron retention, are more prevalent in lower eukaryotes with fewer multi-intron-containing genes and smaller introns (21). In order to see if there was any evidence in the ESTs for intron retention, we compared the number of spliced ESTs to the number of unspliced ESTs for each of the 164 introns for which there is EST evidence of splicing (see Table S1 in the supplemental material). While 87% of the 164 introns are spliced in 100% of their representative ESTs, 13% are spliced in only a fraction of their representative ESTs. Two possibilities can readily explain this observation: (i) the “unspliced” ESTs for an individual intron are derived from its pre-mRNAs cloned prior to splicing; or (ii) the “unspliced” ESTs for an individual intron are derived from a distinct growth condition in which the intron is selectively retained, i.e., intron retention. Additional directed and high-throughput experiments, such as splicing-sensitive microarrays (5), and larger cDNA libraries are needed to identify individual processing events and monitor the alterations in processing during parasite growth and development.
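The comparison of spliced and unspliced ESTs per intron amounts to a simple ratio, which can be sketched as below. The function names and the input layout (a dictionary mapping a hypothetical intron identifier to a pair of spliced and unspliced EST counts) are assumptions made for illustration; the counts in any real analysis would come from the EST alignments.

```python
def spliced_fraction(spliced, unspliced):
    """Fraction of ESTs spanning an intron in which the intron is removed."""
    total = spliced + unspliced
    if total == 0:
        raise ValueError("no ESTs span this intron")
    return spliced / total


def summarize_retention(counts):
    """Split introns into fully spliced and partially spliced groups.

    counts maps an intron identifier to (spliced_ests, unspliced_ests).
    """
    fully = [k for k, (s, u) in counts.items() if s > 0 and u == 0]
    partially = [k for k, (s, u) in counts.items() if s > 0 and u > 0]
    return fully, partially
```

Introns in the partially spliced group are candidates for intron retention only; as noted above, unspliced ESTs may equally reflect pre-mRNAs cloned before splicing.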
Examples of regulated splicing have been described in other systems as a mechanism to turn transcripts on and off (2, 6, 8, 22, 37). Because we have not tested every growth condition in the life of an amoeba, we cannot formally exclude the possibility that the 37% of introns for which we see no evidence of splicing are indeed spliced under a given condition. One point at which alternate isoforms of the same pre-mRNA may be generated is the developmental switch between the trophozoite and cyst forms of E. histolytica. Microarray data indicate that ~15% of annotated genes change ±3-fold between trophozoites and cysts of E. histolytica (12). Whether these changes in RNA abundance between the life cycle stages reflect alterations in transcription frequency or decay as a result of regulated processing remains to be tested.
Finally, some genes are known to generate different proteins as a result of splicing at alternate 5′ and 3′ splice sites (10, 16). In order to see if there was any evidence in the EST library for alternate 5′- and 3′-splice site usage, we individually mined each spliced intron for examples of ESTs in which the coordinate of one splice site was fixed while the other varied. We found no evidence for alternate 5′-splice site usage. However, 89.m00113, a gene with similarity to human Sm_B/B′ protein, has representative ESTs in which different 3′ splice sites are used for the penultimate intron, which would introduce two additional amino acids in the C terminus (data not shown). Curiously, the human Sm_B and Sm_B′ isoforms are derived from alternative splicing using different 3′ splice sites of the penultimate intron that are distinguishable by autoantibodies generated in people with systemic lupus erythematosus (19). Thus, overall, we found EST evidence for candidate intron retention and alternative 3′-splice site usage.
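The search for introns in which one splice site is fixed while the other varies can be sketched by grouping EST gaps on one coordinate. This is a minimal illustration under assumptions not stated in the text: gaps are given as hypothetical `(donor, acceptor)` genomic coordinate pairs for plus-strand transcripts on a single scaffold, and the function names are invented for this example.

```python
from collections import defaultdict


def alternative_acceptors(gaps):
    """Map each 5' splice site to its 3' sites when more than one is used."""
    by_donor = defaultdict(set)
    for donor, acceptor in gaps:
        by_donor[donor].add(acceptor)
    return {d: sorted(a) for d, a in by_donor.items() if len(a) > 1}


def alternative_donors(gaps):
    """Map each 3' splice site to its 5' sites when more than one is used."""
    by_acceptor = defaultdict(set)
    for donor, acceptor in gaps:
        by_acceptor[acceptor].add(donor)
    return {a: sorted(d) for a, d in by_acceptor.items() if len(d) > 1}
```

Two acceptor sites that lie six nucleotides apart behind the same donor, for example, would differ by two codons in the spliced transcript, the kind of difference described for the penultimate intron of 89.m00113.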
snRNAs bound in the spliceosomal complex of over 150 proteins interact with the intron through RNA-RNA interactions (18). The pre-mRNA reaction partners for the two catalytic steps of splicing are brought into proximity through dynamic rearrangements of the pre-mRNA/snRNA and snRNA/snRNA complexes requiring U1, U2, U4, U5, and U6 snRNAs (27). To date, U6 is the only E. histolytica snRNA that has been identified (28). Given the essential role of the snRNAs in splicing, we queried the E. histolytica genome for the presence of the U1, U2, U4, and U5 snRNAs.
U2 snRNA is involved in pre-mRNA/snRNA base pairing and juxtapositioning of the branch point adenosine for the first transesterification reaction. In order to identify the E. histolytica U2 snRNA, we downloaded 553 U2 snRNA sequences from Rfam and built an HMM to look for conserved features. The region on scaffold 25 from 23993 to 24173 had the greatest similarity to known U2 snRNAs and was selected for Northern blot analysis. We saw U2 accumulate as a predominant species, 178 nucleotides in length, in trophozoite RNA (Fig. 3C). Its putative secondary structure is similar to those of other known U2 snRNAs, including the branch point binding sequence and the Sm binding site (data not shown), and is predicted to interact with U6 snRNA in the conserved fashion. The U4 snRNA base pairs with U6 snRNA, acting as its chaperone and maintaining it in an unfolded conformation while part of the U4/U5/U6 tri-snRNP (13). We applied the above approach to identify U4 snRNA based on the 372 U4 snRNA sequences in Rfam. We identified the region on scaffold 150 from 39898 to 40028. Subsequent Northern blot analysis of this region uncovered a predominant band 125 nucleotides in length (Fig. 3C). This putative U4 snRNA is able to interact with the previously identified U6 snRNA in a conserved fashion. Of note, the U4 snRNA also seems to lack the terminal 3′ stem loop found in higher eukaryotes (30).
U5 snRNA interacts with the exons upstream of the 5′ splice site and downstream of the 3′ splice site, tethering them in the active site for the second transesterification (31). Our efforts to identify the E. histolytica U5 snRNA using the above means failed. Therefore, we used BLAT to align each of the 235 U5 sequences in the Rfam database against the E. histolytica genome scaffolds. We identified a region on scaffold 283 from 9300 to 9468 with significant homology to the U5 sequences from Entosiphon sulcatum, Oryza sativa, Zea mays, and Arabidopsis thaliana. Northern blot analysis of this region uncovered a single band 118 nucleotides in length (Fig. 3C). Secondary structure prediction showed its potential to form the evolutionarily conserved site in stems I and II as well as the Sm binding site (Fig. 3B). Using the computational approaches outlined above, we were unable to identify U1. Whether this indicates that the E. histolytica U1 sequence is substantially different or that it has escaped being sequenced is not clear at present.
Despite the ability of RNA processing to markedly alter the coding potential of genes, the mechanisms that control these events in E. histolytica are poorly understood. We compared the splice patterns mined from EST data to 275 computational intron predictions. We found evidence supporting the splicing of 60% of introns exactly as predicted. Additionally, we identified several splice site misannotations, novel introns in annotated genes, and novel intron-containing genes. Since the EST data we analyzed represented ~7% of the predicted amoebic genes, our work indicates that a larger-scale EST library would significantly improve gene annotation and uncover additional useful information regarding mechanisms of RNA processing in E. histolytica. This work represents the first large-scale test of splicing in a deep-branching eukaryote and indicates that similar analyses in other systems may be similarly fruitful.
We thank all members of the Singh lab, specifically Gretchen Ehrenkaufer and Jason Hackney, for critical and editorial comments on the manuscript; Neil Hall and Lis Caler (TIGR) for providing the EST sequences and incorporating data into genome reannotation; Barbara Mann (University of Virginia) for providing information on the EST library; and Neha Gupta for preliminary RT-PCR analysis of intron-containing genes.
This work was supported by NIH grants AI-053724 to Upinder Singh and T32 AI-07502 to Carrie A. Davis.
Published ahead of print on 27 April 2007.
†Supplemental material for this article may be found at http://ec.asm.org/.