In higher eukaryotes, alternative splicing and polyadenylation, which generate multiple isoforms from a single messenger RNA (mRNA) precursor (pre-mRNA), are the major mechanisms for expanding the diversity of transcriptomes and proteomes (1). Numerous studies have demonstrated that alternative splicing is pervasive in higher eukaryotes. For example, ~95% of multi-exon genes undergo alternative splicing in humans (3), and at least 42% of intron-containing genes are alternatively spliced in Arabidopsis thaliana. Moreover, microheterogeneity (6) and long-range heterogeneity (7) of polyadenylation site usage in eukaryotic mRNAs have been found to be extensive. Recent RNA-Seq studies of polyadenylation site heterogeneity have further demonstrated the pervasiveness of alternative polyadenylation in animals (2) and plants (10). The functional consequences of physiologically regulated alternative splicing are well documented (12), and the impacts of alternative polyadenylation on mRNA coding capacity, localization, translation efficiency and stability have also been described (13). Nonetheless, the proportion of alternative isoforms that are physiologically regulated, as opposed to those derived solely from the inherent stochasticity of RNA processing (14), remains largely unknown.
How much of the observed alternative splicing and polyadenylation is a consequence of stochastic noise in RNA processing? A number of studies have attempted to address this question. Based on comparative analyses of human and mouse expressed sequence tag (EST) data, Sorek et al. proposed that a significant portion of alternative isoforms is likely to be non-functional and might result from aberrant rather than regulated splicing events. Melamud and Moult (14) further showed that the number of alternative isoforms and their abundance can be predicted by a simple stochastic noise model, demonstrating that most alternative splicing in humans is a consequence of stochastic noise in the splicing machinery. More recently, Pickrell et al. used RNA-Seq to demonstrate the existence of a large class of low-abundance, unconserved isoforms, providing empirical support for the hypothesis of noisy splicing. The extent of stochastic noise in polyadenylation is less well studied, even though genome-wide atlases of polyadenylation sites have been mapped in a number of model organisms (2). Quantifying the properties of alternative splicing and polyadenylation events in a wider range of eukaryotes would help to clarify the inherent stochasticity of these processes and hence provide insight into the prevalence of functionally relevant alternative isoforms.
In this study, we sequenced the poly(A)+ transcriptome of Entamoeba histolytica at saturated depth and quantified the extent of alternative usage of splicing and polyadenylation sites in its mRNAs. E. histolytica is an enteric parasite of humans that causes amoebiasis in ~10% of infected individuals, resulting in 50 million cases of dysentery annually (17). E. histolytica belongs to the Amoebozoa kingdom, which represents one of the earliest branches from the last common ancestor of all eukaryotes and is phylogenetically distinct from the ‘model organisms’ of animals, fungi and plants (18). Because most observations on alternative splicing and polyadenylation were derived from studying these model organisms, it is of interest to extend them to other, less characterized kingdoms.
Initial analyses of the E. histolytica genome in 2005 (assembly of ~23 Mb with 888 scaffolds) predicted 9938 coding genes (average size: 1.17 kb), comprising 49% of the genome (19). About 25% of these genes were predicted to contain introns, and only 6% of them were predicted to contain multiple introns (19). This initial analysis provided the research community with the first blueprint of the E. histolytica genome and opened the way to post-genomic high-throughput studies, e.g. transcriptomics and proteomics. Nonetheless, the genome is AT-rich and highly repetitive, and thus this initial assembly might contain misassembled regions and partially sequenced or unidentified genes (20). Therefore, the genome was reassembled 5 years after its initial analysis, with >100 artifactual tandem duplications eliminated, reducing the assembly size to ~20 Mb with 1496 scaffolds (21). Re-annotation of the new assembly reduced the predicted gene number to 8201 and changed 40% of the original gene models (21). Even so, most of the gene models were based solely on in silico prediction and lacked supporting experimental data, e.g. complementary DNA (cDNA)/EST.
The primary goal of this study is to quantify the heterogeneity of splicing and polyadenylation in E. histolytica, an organism with few introns and short 3′ untranslated regions (UTRs) (22), providing insights into the stochastic noise of these processes in lower eukaryotes. In addition, as a resource for the Entamoeba community, we revised the gene model annotations of E. histolytica in AmoebaDB based on our sequencing data.