|Home | About | Journals | Submit | Contact Us | Français|
Recent advances in high-throughput sequencing present a new opportunity to deeply probe an organism's transcriptome. In this study, we used Illumina-based massively parallel sequencing to gain new insight into the transcriptome (RNA-Seq) of the human malaria parasite, Plasmodium falciparum. Using data collected at seven time points during the intraerythrocytic developmental cycle, we (i) detect novel gene transcripts; (ii) correct hundreds of gene models; (iii) propose alternative splicing events; and (iv) predict 5′ and 3′ untranslated regions. Approximately 70% of the unique sequencing reads map to previously annotated protein-coding genes. The RNA-Seq results greatly improve existing annotation of the P. falciparum genome with over 10% of gene models modified. Our data confirm 75% of predicted splice sites and identify 202 new splice sites, including 84 previously uncharacterized alternative splicing events. We also discovered 107 novel transcripts and expression of 38 pseudogenes, with many demonstrating differential expression across the developmental time series. Our RNA-Seq results correlate well with DNA microarray analysis performed in parallel on the same samples, and provide improved resolution over the microarray-based method. These data reveal new features of the P. falciparum transcriptional landscape and significantly advance our understanding of the parasite's red blood cell-stage transcriptome.
Plasmodium falciparum malaria is responsible for more than one million deaths annually, most of which occur in young children (Breman, 2001). For several decades, the development of new antimalarial compounds has been slow, mostly due to a lack of well-defined Plasmodium-specific targets, adding to a growing concern as established drugs become ineffective due to widespread resistance in the field (Arav-Boger and Shapiro, 2005). In 2002, the genome of the 3D7 clone of P. falciparum was sequenced (Gardner et al., 2002), renewing hope that progress towards reducing the burden of malaria would be greatly accelerated. The P. falciparum genome encodes roughly 5400 genes and has the lowest G+C content (19%) of any genome sequenced to date. Approximately half of the predicted coding sequences (CDSs) are uncharacterized, with little sequence similarity outside the Plasmodium genus, and a large number of genes and gene families are unique to P. falciparum. Furthermore, the proteome contains a high proportion of low complexity sequence where poly-asparagine regions are highly prevalent (Aravind et al., 2003).
Making use of these sequence data, two groups in 2003 used different DNA microarray platforms (70-mer oligonucleotide based, and Affymetrix) to measure transcript levels during the intraerythrocytic developmental cycle (IDC) of the P. falciparum parasite. These studies revealed a highly ordered cascade of gene expression (Bozdech et al., 2003; Le Roch et al., 2003). A later study demonstrated that gene expression levels were similar between different strains of disparate geographical origin, suggesting highly conserved modes of transcriptional regulation during development (Llinas et al., 2006). More recently gene expression studies in a variety of other life cycle stages have been performed (Young et al., 2005; Siau et al., 2008; Tarun et al., 2008). Previous efforts to sequence RNA transcripts from P. falciparum have focused on expressed sequence tags (Lu et al., 2007) or have analysed a small fraction of the full-length cDNA sequences from the Plasmodium spp. (Wakaguri et al., 2009a,b;). These low-resolution studies already demonstrated inaccuracies in P. falciparum gene model predictions and suggested that both variable length untranslated regions (UTRs) and diversity in splicing were prevalent in the transcriptome.
The depth of sequence obtainable with highly parallel sequencing technologies such as Illumina's Genome Analyzer (Bentley et al., 2008), 454 (Droege and Hill, 2008), SOLiD (Ondov et al., 2008) (http://solid.appliedbiosystems.com) platforms make it possible to sequence cDNA and obtain high coverage of all transcribed genes. In 2008, the first applications of high-throughput sequencing technologies to the direct sequencing of expressed RNA transcripts (RNA-Seq) from human tissue (Pan et al., 2008; Sultan et al., 2008), yeast (Nagalakshmi et al., 2008; Wilhelm et al., 2008) and mouse (Mortazavi et al., 2008) were reported. RNA-Seq can reliably be used to correct gene annotations, confirm new and existing splice forms, analyse UTR regions, define non-coding RNAs or find new transcripts (Wang et al., 2009). In general, samples for RNA-Seq are produced by reverse transcription of purified mRNA using oligo(dT) and random priming. The sequencing results, typically short reads between 25 and 75 bp, as either singletons or as paired-end reads can then be mapped onto a reference genome with computational tools such as SSAHA2 (Ning et al., 2001), MAQ (Li et al., 2008), BWA (Li and Durbin, 2009) and ELAND (http://www.illumina.com/). From the mapping position of each read, the cumulative occurrence per base pair can be calculated to generate genome coverage plots.
In this study, we applied RNA-Seq to seven time points from the asexual IDC of P. falciparum with the aim of capturing features associated with all expressed RNA transcripts and measuring splicing dynamics that occur during parasite development. Despite the high A+T content of the genome, which presents challenges for mapping transcripts, we were able to detect transcription of 4871 genes during the 48 h IDC. While a previous report had demonstrated the feasibility of the short read sequencing approach for P. falciparum, these results were based on sequencing a biased mRNA sample prepared using an oligo(dT) affinity strategy (Sorber et al., 2008). Our methodology provides an improvement in genome-wide sequence coverage due to an enhanced enrichment strategy for mRNAs. Overall, our results indicate a higher level of transcription than previously realized by DNA microarray studies due to our ability to detect far more low abundance species. We have identified over 100 new transcripts in the genome and reannotated more than 10% of the existing gene models. Finally, we were able to identify numerous alternative splicing events and further our understanding of 5′ and 3′ UTRs, which may lend further insight into gene regulation in this important human pathogen.
Due to the high abundance of ribosomes in all cells, ribosomal RNAs (rRNAs) account for over 90% of all cellular RNAs. Therefore, the most standard method for rRNA depletion is to affinity purify polyadenylated mRNA species using an oligo(dT) sepharose. For P. falciparum, this methodology does not work effectively because of the high A+T content of the genome (80% in coding and 90% in non-coding regions). In previous DNA microarray-based genome-wide transcriptome analyses, this effect has largely been ignored with cDNA being synthesized directly from total RNA using oligo(dT) followed by in vitro reverse transcription (Chen et al., 2003) or a combination of oligo(dT) and random priming (Bozdech et al., 2003). Either of these methods results in an over-representation of parasite rRNAs in the material assayed on the microarray but, in general, does not compromise the performance of the microarray. However, for a sequence-based approach, removal of abundant rRNA is essential in order to maximize coverage of other low expressed transcripts, because the total number of possible reads per machine run is finite.
In an attempt to remove a significant fraction of plasmodial rRNAs, we tested two possible mRNA enrichment strategies. The first method is affinity-based and targets 26 high abundance rRNA species as well as the 32 most abundant mRNA species in P. falciparum such as the histones, merozoite surface protein 1 and several heat shock proteins (based on previous microarray data sets) (Bozdech et al., 2003; Chen et al., 2003) (see Table S1). To accomplish this, we first bound 58 biotin-tagged DNA oligonucleotides to magnetic streptavidin beads and then incubated isolated total RNA in the presence of the immobilized (complementary) DNA sequences. Subsequently, the unbound RNAs were isolated and subjected to reverse transcription for Illumina sequencing (see Experimental procedures). Our second method was to treat with Terminator™ 5′-Phosphate-Dependent Exonuclease (Epicentre) (a processive 5′-3′ exonuclease that digests RNA containing a free 5′-monophosphate end, thereby removing rRNAs, transfer RNAs, and apicoplast-derived RNAs) and then to affinity deplete with our 58 biotin-tagged DNA oligonucleotides. Independent sequencing runs were performed to compare the effectiveness of these approaches (Table 1, Table S1). Although comparable numbers of reads (> 4.5 million) were attained for all three samples, 60% of the reads from the undepleted sample mapped against rRNA loci targeted by our depletion strategies and covered only 1% of the genome more than 10 times. However, using the affinity oligonucleotide depletion or combined oligonucleotide and Terminator™ depletion strategies, 58% and 21%, respectively, mapped to the depleted regions. Furthermore, the percentage of the genome with more than 10× coverage increased to 2% and 7% respectively, and overall genome coverage increased more than five times over the undepleted sample. Finally, the genome coverage is overall much greater in the depleted samples and is maximal (51%) in the combined Terminator™ and oligonucleotide-depleted sample. Based on these results, we conclude that for enrichment of P. falciparum mRNAs for high-throughput sequencing, the combined oligonucleotide and Terminator™ exonuclease depletion strategy was the best. We therefore used a combined method of Terminator™ exonuclease depletion in conjunction with the biotin-oligonucleotide depletion strategy for all subsequent experiments.
Using highly synchronized 3D7 parasites, we collected RNA samples at seven different time points every 8 h for 48 h, thus capturing the entire IDC of P. falciparum from the ring stage to mature schizonts. Total RNA samples were processed as described above and cDNA generated by reverse transcription using a 1:1 combination of oligo(dT) and random nonamer primers (see Experimental procedures). Illumina sequencing reads were mapped and processed using standard methods developed at the Sanger Institute (see Fig. 1, Experimental procedures). Complete mapping statistics are presented in Table S2. In summary, between 51% (early rings) and 20% (late schizonts) of the reads mapped uniquely against the genome. Of these, around 50–60% of the reads mapped to predicted protein-coding genes and 20% mapped within 1.0 kb up- or downstream of genes, demarcating possible UTRs. For each individual time point, we calculated a transcript abundance value based on the geometric mean depth of coverage (see below) of all sequencing reads corresponding to a given genomic region (see Experimental procedures). These mapped regions include previously annotated genes (Table S3), annotated pseudo genes (Table S4) and novel genes predicted by our results (Table S5). The temporal data capture the dynamic variation of mRNA abundance values during the IDC and correlate well with previous microarray studies (see below). The total range of transcriptional activity captured by these data varied by five orders of magnitude. Using data from CDS only, variation in gene expression by a factor of up to 8200 was observed.
As these are the first temporal, sequencing-based transcriptome data generated from P. falciparum mRNAs during the IDC, we wanted to directly compare and validate our sequencing results with microarray data generated in parallel using the same RNA samples. Previous studies have found variable correlations when measuring gene expression with different technologies (Bloom et al., 2009), suggesting that both methods have their biases when compared with quantitative PCR. When plotted according to the phase of gene expression, as previously determined by Llinas et al. (2006), the well-established cascade of gene expression is faithfully reproduced by both data sets (Fig. 2A and B). The RNA-Seq and DNA microarray data are in good agreement, with Pearson correlations between these data sets ranging from 0.7 to 0.85 at various time points (Fig. 2C, Table S2).
In addition to the high correlation with microarray data, we estimated the hours post invasion (hpi) for each time point of the RNA-Seq in the IDC using a statistical likelihood-based estimator that calculates the relative temporal sequence and a generalized overview expression curve for each time point (Fig. S1) (Lemieux et al., 2009). These data recapitulate the systematic temporal progression of parasite development during this experiment and demonstrate that the cultures are well synchronized.
Our RNA-Seq data capture approximately 90% of the predicted CDS in the P. falciparum genome. As the data are highly variable across the length of a transcript, for each predicted CDS locus in GeneDB (http://www.genedb.org/), we have calculated an mRNA abundance value based on the geometric mean of the sequencing reads (Table S3). Many P. falciparum genes contain low complexity sequence which means that reads cannot always be unambiguously aligned to these regions. In order to correct for this mapping error we have adjusted the geometric mean, using as a denominator the region of each gene that we calculate to be mappable (see Experimental procedures). It is notable that the percentage of uniquely mapping reads is lower at the schizont stage than at other time points. This may be explained both by the fact that this set of reads were 37 bp rather than 54 bp (and so were less likely to map uniquely) and to the observation that the proportion of proteins containing low complexity sequence appears higher towards the end of the IDC (data not shown). The data can easily be used to discriminate exons within a given gene and to identify novel transcripts (Fig. 3). Even for highly complex intron/exon structures such as that seen for the GTPase activator (PF11_0152), a coding gene that is comprised of 11 exons, we can easily confirm the predicted splice sites (Fig. 3). Moreover, adjacent genes with opposite temporal expression profiles can be observed using the RNA-Seq data. For example, PFI0180w (alpha tubulin) is maximally expressed during the trophozoite stage, while the next gene PFI0185w is not expressed, while the following downstream gene, PFI0190w is a ring stage RNA (Fig. 3).
For some sequenced transcripts, putative UTRs can be discerned (Figs 3 and and4).4). However, in general the depth of sequencing coverage of the UTRs is lower compared with the exonic regions and is mostly attributable to the poor mapping of the relatively short Illumina reads onto the extremely high A+T-rich (> 90%) UTR regions. Unfortunately, this RNA-Seq data set does not serve to capture short non-protein coding transcripts previously predicted by several groups (Chakrabarti et al., 2007; Mourier et al., 2008; Mishra et al., 2009) because we used a fragment size cut-off around 200 bp, excluding possible short RNAs.
The RNA-Seq data are highly informative for correcting structural boundaries of predicted genes (Table 2). The most common correction was elongation or shortening of predicted intron/exon boundaries, in which case, the coverage plots of all expressed time points unambiguously drop towards zero before the splice site. For example, we predicted a shortened first exon for PF10_0022, a member of the PHISTc exported protein family (Sargeant et al., 2006) (Fig. 4) and confirmed this experimentally by RT-PCR and directed sequencing of the PCR product across the splice junction (Fig. S2). Using the RNA-Seq data, we performed a genome-wide update of all P. falciparum splice sites, incorporating mostly data from this study but also changes due to homology to other recently sequenced species (Carlton et al., 2008; Pain et al., 2008) or expressed sequence tag evidence (Haas et al., 2008). In all, 423 genes were modified (8% of the total predicted genes) 202 splice sites for new exons added, with the majority being confirmed by two or more RNA-Seq read pairs. Despite low coverage of UTRs, we were able to identify several gene models with strong UTR signals, and report 192 spliced UTR events (with at least two confirming reads).
Splicing is highly prevalent throughout the P. falciparum transcriptome. We used our RNA-Seq data to perform a genome-wide analysis of splicing and splice-site variants to identify a total of 8496 splice sites in 3D7. As expected, our data confirmed that the majority of splice sites in the genome were previously properly predicted. However, our data also serve to correct mispredicted splice sites and identify new and alternative splice sites. For splice site analysis, we used partially mapping reads (bridging reads), where two segments of the sequencing read map to different positions along the same chromosome but are anchored in the genome by the other end of the read pair. For new splice site reads, only a portion of the sequence maps to the CDS sequence, calling for re-analysis of the non-mapping segments (see Experimental procedures). In total, 294 possible new splice sites containing canonical splice donors and acceptors were confirmed by at least two RNA-Seq reads (Table 2). All new gene model predictions are available through GeneDB and have been deposited at http://PlasmoDB.org version 6.0 (Bahl et al., 2003).
Using reads spanning existing and new splice sites, we built an algorithm to identify alternative splicing (see Experimental procedures). We identified four types of alternative splice events: exon-skipping; intron retention/creation; 3′ and 5′ alternative splicing (coordinate changes to external exons) and 3′ and 5′ splicing that results in an alternative start or stop codon (Table 3). Out of a total of 84 alternative splice events, the most common is the 3′ and 5′ alternative splicing (56 events), followed by exon skipping (16 events). We required at least two confirming reads for a new splice site to be automatically assigned. All predictions were manually verified and included in the GeneDB database. In six instances (PFF0630c, PFI0280c, PF10_0149, MAL13P1.144, MAL13P1.267, MAL13P1.277) we found more than one alternative splicing event in a single gene. Figure 5A shows an example of two 5′ alternative splicing events for PF14_0581 (putative apicoplast ribosomal component). The new splice site occurs mostly in the 16 and 24 hpi samples, whereas the previously known splice site occurs in the later time points. This observation brings up the tantalizing possibility that alternative splicing is temporally regulated during the P. falciparum lifecycle. Overall, we did not detect a temporal bias for alternative splicing at any specific time point. In another case of alternative splicing in the PF14_0108 gene (unknown function), exon skipping was detected. Both transcripts were validated by RT-PCR and sequencing (Fig. S2).
We carefully analysed the expression signals that mapped neither to regions on the genome that correspond to annotated genes nor to their likely UTRs. A region was called positive (i.e. transcriptionally active) if a window of 60 bp was fully covered and had a geometric mean score of 5 or greater. All regions were verified by manual inspection. Using these criteria, we identified 107 novel blood-stage transcripts that were previously unknown (Table S5). These appear to be non-protein coding as no bona fide protein-encoding gene prediction is possible for these novel transcripts. With the methods used here we are unable to determine from which strand these transcripts arise. Such non-protein coding transcripts have been described before as mRNA-like non-coding RNAs (mlnc RNA) (Rodriguez et al., 2004; Griffiths-Jones, 2007). There is growing evidence for a large number (> 34 000) of these mlncRNAs in the human genome (Carninci et al., 2005). They are likely to be transcribed by RNA polymerase II but their function in most cases is unknown. One exception is the mammalian Xist mlncRNA transcript that is derived from a pseudogene and plays a vital role in X-dosage compensation (Duret et al., 2006). The length of these transcripts in P. falciparum varies from 57 to 8931 bp (median is 918 bp) and the majority appear to be shared with other Plasmodium parasites sequenced to date (Table S5). We do not, however, have any experimental evidence to suggest that these orthologous genomic regions are transcribed during the blood stages of infection in other Plasmodium parasites. We show the expression profile of one of these novel transcripts in Fig. 3.
There are 81 annotated pseudogenes in the P. falciparum 3D7 genome (Tables S4 and S5), of which we detect transcription from 38 in our sequencing data. Of these, 17 were previously found to be expressed in a periodic manner during the IDC (Bozdech et al., 2003) suggesting their recent evolutionary conversion to pseudogenes.
Using high-throughput sequencing (RNA-Seq), this study provides the first in-depth sequencing-based analysis of the P. falciparum transcriptome, derived from seven time points from the red blood cell stage of development. Despite the high A+T content of this genome, our data provide near-complete genome coverage of RNA transcripts at single base pair resolution and provide significant improvement to our understanding of the global transcriptome. We identify rare stage-specific alternative splicing events, novel transcripts and low abundance transcripts, and correct existing gene models using the deep read coverage provided by Illumina sequencing. Our high-coverage RNA-Seq data were made possible by the use of a new strategy for depleting high-abundance transcripts (predominantly rRNAs) that would otherwise dominate the sequencing library. This consisted of a combination of an exonuclease digestion, specific for uncapped RNA, followed by oligonucleotide depletion of the most abundant transcripts.
Overall, using the geometric mean coverage for each gene across the seven time points measured, we recapitulate the periodic gene expression patterns previously reported (Bozdech et al., 2003; Le Roch et al., 2003). Using a parallel analysis of the same samples used for deep sequencing by DNA microarrays resulted in a good correlation between these independent methods. The sequencing data provide an expanded IDC transcriptome because we were able to capture the transcript abundance for lower abundance species, resulting in 4871 transcripts detected during the erythrocytic stages of P. falciparum life cycle, suggesting that roughly 90% of the genome is transcriptionally active during this stage. Some genes that are known to be expressed in other life cycle stages (such as PFI0185w, expressed in gametocytes, Fig. 3) show no expression during the IDC. In addition to improved sensitivity in detecting transcript expression, these higher resolution RNA-Seq data provide detailed structural data for each of these 4871 transcripts.
We have used the RNA-Seq data for the verification and correction of existing gene models and to create a number of new gene models: 423 existing gene models were corrected and 121 new genes were added to the current annotations. Seventy-five per cent of splice sites were confirmed with at least two read pairs. For all novel transcripts detected here, we also report the gene expression patterns during parasite development in red blood cells. Future improvement in technology such as increased read length, larger insert sizes and strand-specific reads will dramatically improve overall mapability, our ability to analyse splice sites and UTRs, and possibly to identify anti-sense transcripts. RNA-Seq of additional life cycle stages such as gametocytes, oocytes, sporozoites and liver stages is also expected to unravel stage-specific alternative splicing events and add more new transcripts.
Our results identified only 84 alternative splice sites, which is perhaps unexpected, because in many higher eukaryotes over half of transcripts are alternatively spliced (Zavolan and van Nimwegen, 2006). A recent study seeking to identify alternative splicing in late-stage schizont genes and in gametocytes found alternative splicing for 16% of the 88 open reading frames characterized (Iriko et al., 2009), but found only two events of alternative splicing in the blood stages. While we detected 84 alternative splicing events, we anticipate that with increased sequencing read length, it will become easier to identify bridging reads and thus enhance our current understanding of the role of alternative splicing. A major question that remains unanswered is what mechanisms are being used to regulate the production of alternative splice forms and are they functional?
The true power of RNA-Seq is that from one experiment, the full transcript, from transcription start site to polyadenylation signal, will be captured. Unfortunately, due to the high A+T content of the P. falciparum genome as well as the prevalence of low complexity regions, this study falls short of providing significant information about non-coding regions. Undoubtedly, on a per-gene basis, there is a lot of information to be gained, but this is difficult to generalize computationally across the genome.
The results of this work demonstrate how RNA-Seq can further our understanding of blood-stage transcription, including insights into post-transcriptional events. Given that our understanding of transcriptional regulation remains poor for P. falciparum, accurate gene models of RNA transcripts are essential. Ongoing efforts are underway to extend sequencing-based transcriptome analyses to other Plasmodium life cycle stages including gametocytogenesis, mosquito development and liver-stage development. Of great interest is the sequencing of transcripts from individual patient isolates to continue to address the degree to which there are differences between the in vivo and in vitro transcriptome of this parasite as has been previously suggested (Daily et al., 2007; Lemieux et al., 2009).
Highly synchronous P. falciparum cultures were attained by growing 50 ml cultures of parasites in RPMI 1640 culture medium (with standard supplements) and using 5% sorbitol to select for ring-stage parasites by standard methods (Trager and Jensen, 1976; Lambros and Vanderberg, 1979). Initially, one 50 ml culture (5% parasitaemia) was synchronized 2 h post invasion and subsequently at 10 h, following re-invasion, newly formed rings were again selected for with two sorbitol treatments. The cultures were then expanded to sufficient culture flasks at a final parasitaemia of 10%. Total RNA from P. falciparum was isolated as described previously (Bozdech et al., 2003). Briefly, a TRIZOL extraction was performed followed by an overnight sodium acetate in isopropanol precipitation, with a final 70% ethanol wash.
For custom depletion of P. falciparum rRNAs and high-abundance transcripts, 500 µl of streptavidin beads (Dynabeads, Invitrogen) was washed (10 mM Tris-HCl, pH 7.5, 1.0 mM EDTA, 2.0 M NaCl) three times and resuspended in 200 µl of the same buffer plus 200 µl of a mix of biotin-labelled oligonucleotides (100 pmol of each) (Table S1) and incubated at 37°C for 10 min. The beads were then washed three times with 1 mM sodium citrate, pH 6.4. 25 µg of total RNA was then incubated for 10 min at 65°C in a final volume of 200 µl 1 mM sodium citrate, 0.4 M guanidine, 1 µl RNAseAway. After cooling to room temperature for 10 min, the bound RNA was removed by magnetic purification of the beads. The free RNA was cleaned using a Zymo RNA column. For Terminator™ 5′-Phosphate-Dependent Exonuclease (Epicentre) reactions, the manufacturer's recommendations were followed. Briefly, 25 µg of P. falciparum total RNA was incubated with 2 µl of exonuclease in a final volume of 40 µl for 1 h. After 1 h the reaction was stopped with 2 µl 0.1 mM EDTA, pH 8.0 and immediately added to the biotin-labelled immobilized beads (see above). For P. falciparum, cDNA synthesis and purification, we used a combination of random oligonucleotide and oligo(dT) primers as previously published (Bozdech et al., 2003).
Sequencing libraries for the Illumina GA II platform were constructed by shearing the enriched cDNA by nebulisation (35 psi, 6 min) followed by end-repair with Klenow polymerase, T4 DNA polymerase and T4 polynucleotide kinase (to blunt-end the DNA fragments). A single 3′ adenosine moiety was added to the cDNA using Klenow exo- and dATP. The Illumina adapters (containing primer sites for sequencing and flowcell surface annealing) were ligated onto the repaired ends on the cDNA and gel-electrophoresis was used to separate library DNA fragments from unligated adapters by selecting cDNA fragments between 200 and 250 bp in size. Ligated cDNA fragments were recovered following gel extraction at room temperature to ensure representation of AT-rich sequences. Libraries were amplified by 18 cycles of PCR with Phusion DNA polymerase (Finnzymes Reagents).
The efficacy of each stage of library construction was ascertained in a quality control step that involved measuring the adapter-cDNA on an Agilent DNA 1000 chip. Sequencing libraries diluted to 2 nM were denatured with sodium hydroxide and diluted to 3.5 pM in hybridisation buffer for loading onto a single lane of an Illumina GA flowcell. Cluster formation, primer hybridisation and single-end (or read pairs) of either 37 or 54 sequencing cycles were performed using proprietary reagents according to manufacturer's recommended protocol (https://icom.illumina.com/).
As we used both 37 and 54 bp read pairs in this experiment, we compared the RNA-Seq results from the sample sequenced both ways. The correlation between the geometric mean (defined as the exponential of the product of the natural logarithm of the coverage of n unique bases of a given CDS) was between 0.98 and 1.0. In the calculation of this mean, non-unique regions of the gene were ignored. If one read of a mate pair can be placed in a non-unique region, this coverage is not counted.
Figure 1 describes the read-processing pipeline used in this study. We used SSAHA2 (Ning et al., 2001) and SSAHA_pileup to align the Illumina reads against the P. falciparum 3D7 (version 2.1.4) reference genome. We also used MAQ (Li et al., 2008) to investigate the success of the different depletion strategies because MAQ randomly places repetitive reads such as those aligning to rRNA sequences. In the SSAHA2 mapping, we only included reads where one end of the pair aligned uniquely to the genome and the distance between the pairs was within the expected insert size. The output of SSAHA_pileup was used to create the coverage plots over the genome. We generated uniqueness plots for all possible windows of 30 or 50 bp over the genome. All repeat regions longer than the read length in a gene are ignored for the expression calculation. To measure the expression levels of each gene, we used the geometric mean.
For each time point, 12 µg of total RNA was prepared for hybridization by indirect amino-allyl cDNA labelling as previously described (Bozdech et al., 2003). A pool of 3D7 total RNA from all IDC stages was utilized as the reference sample. Array hybridizations were performed as described using a recently designed new-generation P. falciparum-specific long oligonucleotide DNA microarray (Hu et al., 2007). The arrays were scanned using an Axon 4200A scanner and images analysed using Axon GenePix software (Axon Instruments, Union City, CA, USA). Due to technical reasons, we were not able to successfully attain DNA microarray hybridization data for the 40 h time point.
The IDC maps (Fig. 2) were created in R with the heatmap command. The order for the genes was taken from Llinas et al. (2006). The correlation with the microarray was also calculated in R using the Pearson correlation. To compare the correlation between the different read lengths, we used the logarithm of the geometric mean of each CDS. For the comparison with the array data, we used the expression of CDS of the 54 bp reads and compared it with the Cy5 values of the array data. If for a given gene all time points of the array were not available, the gene was ignored for the analysis.
To find new splice sites or to confirm existing splice sites, we used bridging reads. These are reads that would map over a splice site in the nucleotide sequence of the gene, but in the alignment output will split align to two different regions of a chromosome, as they are mapping to two different exons. To find new splice sites, we look for reads that do not map entirely on the CDS. We then required that these possible splice site confirming reads map with no more than one mismatch to an existing CDS or a new exon and each part of the read must have a match of at least 13 bp.
The non-mapping regions of the read provide information about where another exon is beginning. Therefore, those reads are remapped with SSAHA2 using very sensitive parameters (-seeds 1 -score 10 -cmatch 9 -ckmer 6 -kmer 9 -cut 2000000 -skip 1), generating multiple partial hits. A read identifies a putative new splice site if two partial hits map to the same chromosome and strand, their distance is less than 10 000 bp, and the genomic sequence contains a splice donor (GT) and acceptor (AG).
To define alternative splicing (Table 3) we mapped all the reads against gene models containing the new splice site and counted how many reads confirmed the new splice site. We required at least two confirming reads to call the potential new splice site.
To estimate the coverage of UTR regions, we analysed the 1 kb upstream and downstream regions of each CDS (providing this region did not overlap with another gene). We counted the number of mapping reads and the amount of covered bases for each 5′ and 3′ region.
Twenty micrograms of mixed asexual stage total RNA prepared according to the Trizol method was DNase treated and cleaned up using a QIAgen RNeasy column before reverse transcription using Superscript III (Invitrogen) with a 1:1 mixture of poly-dT : poly-dN according to manufacturer's protocol. One hundred nanograms of cDNA was used in each PCR reaction and run for 35 cycles. Primers used are listed in Table S6.
We thank Mike Quail and Mandy Sanders of the pathogen genomics group and members of the Illumina library preparation and sequencing groups within the Wellcome Trust Sanger Institute (WTSI) and acknowledge the support of the WTSI core sequencing and informatics groups. We thank Erandi De Silva and Moriah Szpara for critical reading and helpful advice with this manuscript. This investigation was supported by the Arnold and Mabel Beckman Foundation (ML), the Center for Quantitative Biology (P50 GM071508) (ML), the Wellcome Trust (grant number WT 085775/Z/08/Z), and through a programme grant to C.N. T.O. and S.A. were funded by the European Union 6th Framework Program grant to the BioMalPar Consortium (grant number LSHP-LT-2004-503578). T.O., D.W., S.A., T.K., L.R.S., J.L., B.B. and A.P. performed research; D.W., A.P., M.B., C.N. and M.L. designed research, T.O., S.A., J.L., U.B., B.B., M.B., C.N. and M.L. analysed data, and T.O., A.P., M.B., C.N. and M.L. wrote the paper. The authors declare that they have all seen the manuscript and have no conflict of interest.
Additional supporting information may be found in the online version of this article.
Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.