We have developed a cost-effective method to analyze whole-genome expression and splicing profiles from organisms employing leader trans-splicing. This was tested on T. brucei, where we obtained 85% coverage of all genes with fewer than 1 million sequence tags. A major improvement over traditional microarray technology is the possibility of sequencing directly from the spliced leader, which allows selective analysis of the transcriptome of intracellular parasites, like the amastigote forms of Trypanosoma cruzi, or of forms that are closely associated with their host and extremely difficult to purify, like the epimastigote form of T. brucei in tsetse salivary glands.
Using the SLT approach we sequenced more than 1 million splice site tags from poly(A)-mRNA from each of three life cycle stages of
T. brucei. Each sequence tag covered at least 24 nucleotides 3′ of the spliced leader/5′ UTR junction. Even though the analyzed strains (Antat 1.1 and Mitat 1.2) were not identical to the genome strain (TREU 927), mapping of the 24mer sequence tags onto the TREU 927 genome was very successful: the majority (98%) could be aligned with high statistical significance using a combination of mismatch and sequence quality scores (see
Materials and Methods). In the process, this method revealed the existence of a new expression site associated gene in an active VSG expression site.
In general, our data are in good agreement with the current genome annotation, and strongly support both the number and positions of putative transcription start sites identified by binding of specific histones
[8]. In addition, 98% of the sequence tags from the procyclic form map in the sense orientation of annotated transcription units, and only 2% in the zones between transcription units, where the direction of transcription is not immediately obvious from the annotations. The splicing patterns indicated that transcription overlaps in several of the converging strand switch regions as has been suggested previously for converging transcription units transcribed by Pol I and Pol II, respectively
[22]. Interestingly, bloodstream form cells show twice the number of sequence tags mapping to the zones between the transcription units compared to the procyclic form, indicating that the control of transcription initiation may be less specific, or degradation less efficient in this life stage. It is also possible that splice site recognition is different, or not equally efficient between the stages, leading to the increased number of processed RNAs from the regions between the transcription units.
The large variation in transcript abundance that we detected within transcription units was expected and is in good agreement with previous studies, strengthening the notion that steady state RNA levels in trypanosomes are regulated mainly post-transcriptionally at the level of RNA processing and/or RNA stability. When we analyzed the monocistronic transcription units we found >70% unlikely to be expressed in the life cycle stages analyzed because we could not detect any splice sites within 2 kb upstream of the start codon. Of the remaining 76 transcripts, only 8 showed differential expression.
Previous microarray studies of
T. brucei using genomic arrays or single probe arrays have indicated a rather static transcriptome with relatively few changes between the bloodstream and procyclic forms. Estimates ranged from 2–6% of the genome being regulated at the transcript abundance level
[9],
[10],
[11],
[12],
[13]. A more recent study by Jensen
et al. using eight oligonucleotides per gene on a Nimblegen microarray found that up to 700 transcripts (8%) change expression between the two life cycle stages
[12]. Considering these studies, it would appear that a multiprobe array is more likely to detect a larger set of regulated genes. Furthermore, arrays with only one probe in the 5′ region of the genes are likely to be affected by misannotation of the start sites of open reading frames and/or by alternative splicing. Using SLT, we found 30% of the transcripts from protein-coding genes to be significantly regulated between the long slender bloodstream form and the procyclic form. The number of changes increased to 40% of all genes when extended to include the short stumpy form of the parasite. Even at a more conservative threshold (≥2-fold, significantly changed), 35% of all genes exhibited changes in transcript abundance. In conclusion, we have found that a much larger cohort of genes changes expression levels during development than was previously thought to be the case. This is in line with recent findings that about 50% of the genome of
T. cruzi is regulated at the level of transcript abundance
[14]. When we compared our results with the previous microarray studies, we found different levels of correlation in gene expression depending on the study. While part of the differences might be due to the higher sensitivity of the SLT approach, the low level of correlation might also be due to strains and culture conditions that vary considerably between the different studies. We also employed three different approaches to verify the results obtained by SLT. (i) Results from RT qPCR showed strong positive correlation with the SLT approach for the expression level changes between life stages and the differential splicing events (
Figure S7,
Table S4,
Table S6). (ii) We also performed two RNAi experiments indicating that the abundance of the RNAi target transcript as quantified by SLT is in excellent agreement with quantification by Northern blots. The overall correlation between the induced and uninduced transcriptomes was on a par with the technical reproducibility of RNAseq, further supporting that SLT tag counts serve as a sufficient proxy for comparisons between different life cycle stages or cell lines of the same organism (
Figure S8). (iii) We included the correlation to one run of regular RNAseq of poly(A)-mRNA (24 million sequence tags,
Figure S9). While not perfect, the correlation between SLT and RNAseq (Spearman ρ = 0.69) is nearly on a par with that between a technical comparison of RNAseq and microarrays (Spearman ρ = 0.75, e.g.
[37]).
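For readers who wish to reproduce this type of comparison, the following is a minimal Python sketch of a Spearman rank correlation between two sets of per-gene counts. It is an illustration only: the function names and the toy counts are assumptions introduced here, and the sketch is not the code used to generate Figure S9.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-gene counts from two methods (same gene order in both lists).
slt_counts = [5, 80, 12, 300, 40]
rnaseq_counts = [9, 60, 30, 250, 20]
rho = spearman(slt_counts, rnaseq_counts)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is a natural choice here because tag counts from different methods are not expected to be linearly related across a dynamic range of several orders of magnitude.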
During the revision of this manuscript Siegel
et al. published a study describing the expression profile in bloodstream and procyclic forms of
T. brucei using RNAseq
[38]. Although the approaches used in the two studies are different (RNAseq vs. SLT), many features are well correlated: the mean number of splice sites per gene (2.6 versus 2.7–2.9), the mean lengths of the 5′ UTRs (184 versus 105–135), the number of genes with internal splice sites (488 versus 496–558) and the large dynamic range (10^5 to 10^6). However, there are also a number of features that do not correlate so well, the most striking of which is the difference in expression profile. We identified almost 40% of the genes as being regulated significantly while Siegel
et al. found only about 10%. Several factors could explain the differences observed: (i) strain differences and the number of life stages analyzed, (ii) growth conditions for bloodstream forms (in vitro versus in vivo), (iii) cDNA preparation, with RNAseq being more likely to capture precursors and breakdown intermediates (e.g. intron sequences), and (iv) scaling of the data (tags/million versus constant median count/gene). The major advantage of SLT when compared to conventional RNAseq is the cost-effective mapping of 5′ splice sites in spliced leader-bearing organisms, especially intracellular parasites where the purification of RNA is very difficult.
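To illustrate why the choice of scaling in point (iv) can matter, the following Python sketch contrasts the two schemes on toy data. The gene names, counts and function names are hypothetical and chosen only for illustration; the sketch does not reproduce the normalization pipeline of either study.

```python
from statistics import median


def tags_per_million(counts):
    """Scale counts so that they sum to one million tags."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def median_scaled(counts, target_median=1.0):
    """Scale counts so that the median count per gene equals target_median."""
    med = median(counts.values())
    return {gene: c * target_median / med for gene, c in counts.items()}


def fold_change(stage_a, stage_b, gene):
    """Ratio of the scaled abundance of one gene between two stages."""
    return stage_a[gene] / stage_b[gene]


# Hypothetical raw counts: only the abundant transcript "big" differs between stages.
stage_a = {"g1": 100, "g2": 200, "g3": 300, "g4": 400, "big": 600}
stage_b = {"g1": 100, "g2": 200, "g3": 300, "g4": 400, "big": 6000}

fc_tpm = fold_change(tags_per_million(stage_a), tags_per_million(stage_b), "g1")
fc_med = fold_change(median_scaled(stage_a), median_scaled(stage_b), "g1")
print(f"g1 fold change, tags/million:  {fc_tpm:.3f}")
print(f"g1 fold change, median-scaled: {fc_med:.3f}")
```

In this toy case the raw count of g1 is identical in both stages, yet scaling to tags/million reports an apparent change because a single abundant transcript alters the total, whereas median scaling does not. Which scheme is more appropriate depends on the biology, for example on whether total mRNA content per cell differs between stages.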
According to our study, VSG transcripts account for 7–11% of spliced mRNAs in bloodstream form trypanosomes, which is in excellent agreement with the value obtained from hybridization data
[39]. The levels of most other transcripts, however, are at least two to three orders of magnitude lower. Assuming there to be approximately 40,000 mRNA molecules in a procyclic trypanosome, and approximately half that number in a bloodstream form (Haanstra and co-workers and our calculations), a large number of transcripts (>5000, 65%) would be present at <1 mRNA per cell
[40]. Similar results have been reported for yeast where more than 80% of all transcripts are present at ≤2 copies per cell
[41],
[42]. The question whether the small number of transcripts is distributed evenly across the population of trypanosomes, or accumulates in very few cells during a transcriptional burst, remains to be investigated. Recently it has been suggested that transcription in yeast is much more steady, with fewer transcriptional bursts than seem to occur in mammalian cells
[43],
[44],
[45].
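The arithmetic behind these copy number estimates can be sketched in a few lines of Python. The sketch assumes that the fraction of sequence tags assigned to a transcript equals its fraction of the cellular mRNA pool, which is a simplification; the function names are introduced here for illustration, and the per-cell totals are the approximate values quoted above.

```python
MRNA_PER_PROCYCLIC_CELL = 40_000    # approximate value quoted in the text
MRNA_PER_BLOODSTREAM_CELL = 20_000  # "approximately half that number"


def copies_per_cell(tags_per_million, mrna_per_cell):
    """Estimated mRNA copies per cell, assuming tag fraction equals mRNA fraction."""
    return tags_per_million * mrna_per_cell / 1_000_000


def one_copy_threshold_tpm(mrna_per_cell):
    """Tags per million that correspond to one mRNA molecule per cell."""
    return 1_000_000 / mrna_per_cell


# A transcript below these abundances would average less than one copy per cell.
print(one_copy_threshold_tpm(MRNA_PER_PROCYCLIC_CELL))
print(one_copy_threshold_tpm(MRNA_PER_BLOODSTREAM_CELL))

# VSG at 7% and 11% of spliced mRNAs in bloodstream forms.
print(copies_per_cell(70_000, MRNA_PER_BLOODSTREAM_CELL))
print(copies_per_cell(110_000, MRNA_PER_BLOODSTREAM_CELL))
```

Under these assumptions, one mRNA per cell corresponds to 25 tags per million in procyclic forms and 50 tags per million in bloodstream forms, and VSG at 7–11% corresponds to roughly 1,400–2,200 molecules per bloodstream form cell.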
The data presented here describe the splice sites for 85% of the annotated genes in the
T. brucei genome. Spliced leader addition sites were very well conserved within and between life cycle stages. AG was the predominant splice acceptor dinucleotide of the major splice sites; however, 20% of the minor splice sites used other dinucleotides, predominantly GG. The least abundant acceptor dinucleotide was CC followed by any pyrimidine combination. The major splice sites contained a 15 (±6) nucleotide polypyrimidine stretch at an average distance of 31 nucleotides (±19) upstream of the actual splice acceptor site. Both these findings are in very good agreement with previously published experimental data on individual splice sites
[46]. The UTR length distribution indicated 5′ UTRs with a median length between 32 and 47 nucleotides, which is similar to yeast (50 nt;
[47]). Interestingly, we detected a shift towards longer UTRs in the long slender bloodstream form when compared to the procyclic form. This is at least in part due to the differential use of splice sites in the two life forms and might indicate differential regulation of translation. About half of the major splice sites in the different life stages could not be predicted using our current splice site recognition model, although the majority contained the signals that conform to our current understanding of splice sites. An obvious consequence of alternative splicing would be the change of N-terminal targeting sequences as has been shown for
T. cruzi (LYT1) and for alternative
cis-splicing in other systems
[48],
[49],
[50],
[51]. Our analysis indicated that this is likely to be the case for several AARS that are essential in the cytosol and mitochondrion
[35],
[36] thus providing evidence that alternative splicing is a potential mechanism for dual localization of proteins similar to what has been reported for the LYT1 gene in
T. cruzi [48].
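The splice site features discussed above (the acceptor dinucleotide and the upstream polypyrimidine stretch) can be extracted from a genomic sequence along the following lines. This Python sketch is an illustration under stated assumptions: the function names, the 100-nucleotide search window, the toy sequence, and the convention of measuring the distance from the 3′ end of the stretch to the acceptor dinucleotide are all choices made here, not the procedure used in this study.

```python
import re


def acceptor_dinucleotide(seq, site):
    """Two nucleotides immediately upstream of the spliced leader addition site.

    `site` is the 0-based index of the first nucleotide of the 5' UTR.
    """
    if site < 2:
        raise ValueError("site must leave room for an acceptor dinucleotide")
    return seq[site - 2:site].upper()


def longest_polypyrimidine(seq, site, window=100):
    """Longest run of C/T within `window` nt upstream of the acceptor dinucleotide.

    Returns (length, distance), where distance is the number of nucleotides
    between the 3' end of the run and the acceptor dinucleotide, or None if
    the window contains no pyrimidine.
    """
    if site < 2:
        raise ValueError("site must leave room for an acceptor dinucleotide")
    start = max(0, site - 2 - window)
    upstream = seq[start:site - 2].upper()
    best = None
    for match in re.finditer(r"[CT]+", upstream):
        if best is None or len(match.group()) > len(best.group()):
            best = match
    if best is None:
        return None
    return len(best.group()), len(upstream) - best.end()


# Toy sequence: 5 nt flank, 15 nt polypyrimidine stretch, 10 nt spacer,
# AG acceptor, then the start of the 5' UTR.
seq = "GAGAG" + "TCTTTCCTTTTCTCT" + "GAGAAGGAAG" + "AG" + "ACGTACGT"
site = 5 + 15 + 10 + 2

print(acceptor_dinucleotide(seq, site))
print(longest_polypyrimidine(seq, site))
```

Applied to every mapped splice site, tallies of the first function would give an acceptor dinucleotide distribution, and the second would give length and distance distributions for the polypyrimidine stretch. With the same coordinate convention, the 5′ UTR length is the distance from `site` to the first nucleotide of the start codon.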
More than 500 transcripts with splice sites exclusively 3′ of the annotated start site were identified by SLT. While we cannot exclude that a very small fraction is still spliced upstream of the annotated AUG, we consider it much more likely that the
bona fide start codon is within the open reading frame. A second set of transcripts (>600) indicated the possibility of 5′ extensions to the currently annotated open reading frames, which would effect changes in the N-terminus of the corresponding protein. The evaluation of these N-terminal extensions is much more difficult and will require additional experiments. However, by screening the recently published proteomics dataset from Panigrahi and co-workers, we were able to identify 11 candidates where peptides corresponding to a region upstream of the annotated start codon are expressed in procyclic form trypanosomes
[33]. Depending on the life cycle stage, we also identified 90–97 transcripts in which alternative splicing ablated the start codon, suggesting this form of splicing plays a minor but significant role in regulating gene expression.
Most surprising was that a large number of transcripts showed differential abundance of alternative splice variants in the three life stages analyzed. A previous report indicated that
T. brucei used different splice sites on an artificial construct, but it remained unclear if, and how frequently, this might occur in the
T. brucei transcriptome
[52]. We found more than 600 differentially spliced transcripts between the life stages, supporting the idea that alternative splicing has functional consequences for the regulation of parasite development. One open question is whether the actual splicing event is regulated or whether the differential abundance is a result of altered stability of the RNA transcripts. So far, we have been unable to detect sequence elements in the vicinity of the alternative splice sites that would explain the differential regulation of splicing itself. It is worth noting, however, that transcripts encoding several of the core components of the spliceosome, such as SMD1, SMD3 and SMG, are themselves differentially regulated during development; this may reflect an adaptation of the splicing machinery to differences in the major splicing targets, such as the VSG and procyclin transcripts, or to the subtleties of alternative splicing. These hypotheses should now be testable on a genome-wide scale, using the SLT approach in combination with RNA knockdown of specific splicing components.