|Home | About | Journals | Submit | Contact Us | Français|
Sequencing of RNAs (RNA-Seq) has revolutionized the field of transcriptomics, but the reads obtained often contain errors. Read error correction can have a large impact on our ability to accurately assemble transcripts. This is especially true for de novo transcriptome analysis, where a reference genome is not available. Current read error correction methods, developed for DNA sequence data, cannot handle the overlapping effects of non-uniform abundance, polymorphisms and alternative splicing. Here we present SEquencing Error CorrEction in Rna-seq data (SEECER), a hidden Markov Model (HMM)–based method, which is the first to successfully address these problems. SEECER efficiently learns hundreds of thousands of HMMs and uses these to correct sequencing errors. Using human RNA-Seq data, we show that SEECER greatly improves on previous methods in terms of quality of read alignment to the genome and assembly accuracy. To illustrate the usefulness of SEECER for de novo transcriptome studies, we generated new RNA-Seq data to study the development of the sea cucumber Parastichopus parvimensis. Our corrected assembled transcripts shed new light on two important stages in sea cucumber development. Comparison of the assembled transcripts to known transcripts in other species has also revealed novel transcripts that are unique to sea cucumber, some of which we have experimentally validated.
Transcriptome analysis has been revolutionized by next-generation sequencing technologies (1). The sequencing of polyadenylated RNAs (RNA-Seq) is rapidly becoming standard practice in the research community owing to its ability to accurately measure RNA levels (2,3), detect alternative splicing (4) and RNA editing (5), determine allele (6) and isoform-specific expression (7,8) and perform de novo transcriptome assembly (9–11).
Although RNA-Seq experiments are often more accurate than their microarray predecessors (2,7), they still exhibit a high error rate. These errors can have a large impact on the downstream bioinformatics analysis and lead to wrong conclusions regarding the set of transcribed mRNAs. One class of errors concerns biases in the abundance of read sequences due to RNA priming preferences (12,13), fragment size selection (14,15) and GC-content (16). Sequencing errors, which are a result of mistakes in base calling of the sequencer (mismatch), or the insertion or deletion of a base (indel), are another important source of errors for which no general solution for RNA-Seq is currently available. For example, error rates of up to 3.8% were observed when using Illumina’s GenomeAnalyzer (17).
A common approach to sequencing error removal is read trimming of bad-quality bases from the read end to improve downstream analysis (4,18). Such an approach reduces the absolute amount of errors in the data but can also lead to significant loss of data, which affects our ability to identify lowly expressed transcripts.
A number of approaches were primarily proposed for the correction of DNA sequencing data (19). These methods use suffix trees (20,21), k-mer indices (22,23) and multiple alignments (24). While successful, as we show in ‘Results’ section, these approaches are not always suited for RNA-Seq data. Unlike genome sequencing, which often results in uniform coverage, transcripts exhibit non-uniform expression levels. The only error correction method that we are aware of that explicitly targets non-uniform coverage data is Hammer (25). Unfortunately, Hammer cannot be used to correct reads, as it only outputs corrected k-mers of much shorter length. Even after contacting the authors of Hammer and using their implementation, we could not use it with standard methods for read alignment or assembly, and we are not aware of other articles that had. Finally, all the above methods often fail at the border of alternatively spliced exons, which may lead to false-positive corrections.
Other sequencing error correction methods have been designed for tag-based sequencing or microRNA sequencing where the read spans the complete tag or transcript region under investigation (26–28). These methods, including SEED (28), are based on clustering similar read sequences, but do not consider partially overlapping read sequences, alternative splicing and the correction of indel errors.
Here we present the first general method for SEquencing Error CorrEction in Rna-seq data (SEECER) that specifically addresses the shortcomings of previous approaches. SEECER is based on a probabilistic framework using hidden Markov models (HMMs). SEECER can handle different coverage levels of transcripts, joins partially overlapping reads into contigs to improve error correction, avoids the association of reads at exon borders of alternative splicing events and supports the correction of mismatch and indel errors. Because SEECER does not rely on a reference genome, it is applicable to de novo RNA-Seq. We tested SEECER using diverse human RNA-Seq datasets and show that the error correction greatly improves performance of the downstream assembly and that it significantly outperforms previous approaches. We next used SEECER to correct RNA-Seq data for the de novo transcriptome assembly of the sea cucumber. The ability to accurately analyze de novo RNA-Seq data allowed us to identify both conserved and novel transcripts and provided important insights into sea cucumber development.
Error correction of a read is done by trying to determine its context (overlapping reads from the same transcript) and using these to identify and correct errors. SEECER builds a set of contigs from reads, where each contig is theoretically a subsequence of a transcript. Ideally, we would like each contig to be exactly one transcript. However, in several cases, transcripts may share common subsequences owing to sequence repeats or alternative splicings. In such cases, each contig in our model represents an unbranched subsequence of some transcript.
We use a profile HMM to represent contigs. Such models are appropriate for handling the various types of read errors we anticipate (including substitutions and insertion/deletion). Owing to several restrictions imposed by the read data, even though we may need to handle a large number of contigs, learning these HMMs can be done efficiently (linearly in the size of the reads assigned to the contig).
Profile HMM is a HMM that was originally developed to model protein families to allow multiple sequence alignment with gaps in the protein sequences (see Supplement). Here, we extend profile HMMs to model the sequencing of reads from a contig. We thus call this a contig HMM. Each contig HMM includes a consensus sequence based on the set of reads assigned to this contig. The consensus is constructed from the most probable output nucleotides of the match states. Using this consensus sequence we can make correction to the reads assigned to this contig HMM.
The core functionality of SEECER is constructing the contig HMM from sequencing reads. We now outline the details of each step in the following sections.
We maintain a global pool (Figure 1, step 0) of reads during the execution of our method. SEECER creates many threads, each independently builds a separate contig HMM. For each such HMM, we start with a random read as the seed and iteratively extend it using overlapping reads. See supplement for discussion on how to avoid collision between two HMMs.
Using the seed read, we obtain an initial set of reads to use for constructing the HMM contig (Figure 1, step 1). We build a k-mer hash dictionary, where the keys are k-mers and the values are the indices of the reads and the position of the k-mers within them. This hash table could be large, and hence we discard k-mers appearing in less than c reads (here we use c = 3). Counting of k-mers is efficiently done using Jellyfish (29), a parallel k-mer counter. After counting, only k-mers that appear at least c times are stored in a hash table that also records the positions of the k-mer within a read, and as a result, we keep memory requirements as small as possible. Read sequences are saved in the ReadStore from the SeqAn library (30).
SEECER starts the contig construction by selecting (without replacement) a random read (or seed) s from the pool of reads. We use the dictionary to retrieve a set of reads such that each read in shares at least one k-mer with the seed s. At the same time, we record the locations of the shared k-mers among the reads to construct a multiple-sequence alignment . For each column i () of , let Ti be the nucleotide that is the most frequent in that column. Let be set of such nucleotides from all columns. Using our current alignment we define , that is, mi is the set of reads that have a mismatch with Ti. For each read x, we also define . In other words, m(x) is the set of columns for which x has a mismatch with T.
Because it is only based on k-mer matches, our initial set is most likely from a mixture of different transcripts. This situation arises from genomic repeats and alternative splicing. To build a homogenous contig, we use cluster analysis to identify the largest subset of , which satisfies a quality measure.
To identify the largest subset, the main challenge is in distinguishing genuine errors from other intrinsic difference such as repeats. Note that real biological differences should be supported by a set of reads with similar mismatches to the consensus. This means that we could identify a set of reads associated with intrinsic differences by looking at the intersections of mis. Based on this intuition, we use a clustering algorithm (spectral clustering (31) and a spectral relaxation of k-means (32)) to find clusters of columns in M. These clusters are used to identify coherent subsets of reads, which we then use as the initial set for learning a contig HMM. See supplement for details on how to extract a biologically similar subset using this clustering method and on how to efficiently implement the clustering step.
SEECER has two learning options (Figure 1, step 3). In the first one, we implemented online expectation maximization (EM) algorithm (33) in which we restricted the alignment to have at most v indels to speed up the Forward–Backward algorithm. In the second one, we estimate the parameters based on the alignment of reads using k-mer positions. The first option is much slower than the second because we have to run Forward–Backward algorithm until the EM converges. The second option is faster because we only need to do one pass over all reads. Our experiments show that the second option is good enough for correction and keeps the runtime tractable because often the set of reads is consistent and the amount of errors is low, therefore yielding a good read alignment.
We discard positions in the contig HMM with high entropy of the emission probabilities in the match states. Entropy is a probabilistic statistic, which captures the uncertainty in the discrete distribution of emissions. Positions with high entropy (default max entropy = 0.6) indicate that the initial alignment estimation is not reliable because the set of reads is not consistent. For example, at splitting positions in alternative splicing events, reads from different isoforms may be retrieved, which will lead to high entropy. By discarding these ambiguities, we improve the contig quality and reduce false-positive corrections.
Before contig extension (Figure 1, step 4) all parameters learned for the HMM thus far are fixed. We iteratively extend the contig HMM by repeatedly retrieving more reads sharing k-mers with the new consensus using the dictionary. Each additional read is partially aligned to the HMM, and read bases that are not overlapping the HMM are used to learn the newly extended columns of the HMM, repeating cluster analysis and entropy computation. This iterative process stops when we cannot retrieve any new reads or extend the consensus further.
After the construction of the contig HMM, each read that was used in the construction is aligned to the HMM using Viterbi’s algorithm. Reads whose log-likelihood of being generated by the contig HMM exceeds a threshold of −1 are considered ‘assigned’ to that HMM. We also restrict the number of corrections for a single read to five to avoid making false-positive corrections. Finally, assigned reads are removed from the pool of reads (Figure 1, step 6).
We remove ambiguous bases (Ns) from the read sequences before running SEECER by randomly substituting an N with one of the nucleotides (A,T,G,C). However, if there are regions with many Ns in a read, we discard the whole read unless these regions occur at the end, in which case, we truncate and keep the read if the new truncated length is at least half of the original. Reads that have >70% of their bases all As or all Ts are also discarded, as they likely originate from sequenced poly-A tails.
Three human paired-end RNA-seq datasets were downloaded for the comparisons: 55 M reads of length 45 bp (ID SRX011546, http://www.ncbi.nlm.nih.gov/sra/) (6), 64 M reads of length 76 bp (34) were downloaded from the GEO database (35) (Accession: GSM759888) and 145 M reads of length 101 bp from the ENCODE consortium (http://genome.ucsc.edu/cgi-bin/hgFileUi?g=wgEncodeCshlLongRnaSeq). The spliced alignment of reads was performed using TopHat version 1.3.3 and Bowtie version 0.12.5 (36). Number of aligned reads is reported for uniquely mapped reads as described in (3).
Quake version 0.3 (22) was run as suggested in the manual for RNA-Seq data, the k-mer size was set to 18 and the automatic cutoff mode was disabled, instead all k-mers with count 1 were classified as erroneous. The other programs were run as follows: Coral version 1.4 (24) with the -illumina option, HiTEC 64 bit version 1.0.2 (21) with options 57000000 4, and Echo version 1.12 (23) with options –ncpu 8 -nh 1024 -b 2000000.
To measure the accuracy of the read error correction methods, we used Tophat to align original and corrected reads to the human reference sequence. Using the reference sequence as ground truth, we used the following definitions (37): a false positive was a base that was changed (corrected) although it was correct in the original read. A true positive was a base that was corrected to the nucleotide in the reference. A false negative was a base that was not corrected even though it is wrong, while a true negative was a base that was left uncorrected and aligned with the reference. The gain metric was computed as explained in (37).
We used Oases (version 0.2.5) for the de novo RNA-Seq assembly for the human and sea cucumber datasets. Similar to (11), we conducted a merged assembly for k = 21, … , 35 using default parameters. SEED (version 1.5.1) was run with default parameters, and the resulting cluster sequences were used as input to Oases as described in (28). The evaluation of the human assemblies was conducted by aligning assembled transfrags to the human genome with Blat version 34 (38) and comparing with Ensembl 65 transcript annotation to derive 80% and full-length covered transcripts, as previously described (11). The evaluation metrics were computed using custom scripts.
Gravid Parastichopus parvimensis adults were spawned by heat shock and embryos grown in artificial sea water at 15°C. Total RNA was extracted from 2-day-old gastrula and 6-day-old larvae using the Total Mammalian RNA Miniprep kit (Sigma). RNA was sent to the Wistar Institute for library preparation with Illumina adaptors and 72-bp paired-end sequencing was performed on a Solexa Genome Analyzer II. First strand cDNA synthesis was performed with the iScript Select cDNA Synthesis Kit (BioRad).
From the top 100 expressed transfrags that were expressed in both time points, 14 were randomly selected, seven with a match to either RefSeq or Swissprot and seven without a match. For the validation, PCR primers were designed with Primer3Plus (39) to amplify ~300–500 bp products corresponding to the 14 selected transfrags (primer sequences are provided in Supplementary Table S11). The PCR was performed using GoTaq (Promega) standard protocols on RNA samples from the first time point.
For peptide searches, we used Blastx (40) with an E-value cutoff of 10−5 to avoid spurious alignments in Swissprot (41) and the Sea Urchin known proteome (SPU_peptide.fasta at http://www.spbase.org/SpBase/download/). Similarly, for the search in Refseq (42), we used Blastn with the same cutoff. The expression of all assembled transfrags was quantified using RNA-seq by expectation maximization (RSEM) with default parameters (43) after read alignment of the reads to the transfrags with Bowtie (44). The Gene Ontology (GO) annotation for the known and predicted Sea Urchin proteome was downloaded from SpBase (annotation.build6.tar at http://www.spbase.org/SpBase/download/). GO enrichment analysis was done using FuncAssociate 2.0 (45) with a multiple-testing corrected P-value cutoff of 0.05.
SEECER and other error correction methods were run with an 8 core Intel Xeon CPU with 2.40 GHz and 128 GB RAM. The de novo assembly with Oases was run on a 48 core AMD Opteron machine with 265 GB RAM.
The new RNA-Seq datasets for the transcriptome of P. parvimensis were deposited on NCBI Sequence Read Archive (SRA) under the Accession SRA052605.
Figure 1 presents a high-level overview of SEECER’s read error correction. The overall goal is to model each contig with a HMM allowing us to model substitutions, insertions and deletions. We start by selecting a random read from the set of reads that have not yet been assigned to any HMM contig. Next, we extract (using a fast hashing of k-mers method) all reads that overlap with the selected read in at least k nucleotides. Because the subset of overlapping reads can be derived from alternatively spliced or repeated segments, we next perform clustering of these reads selecting the most coherent subset (see ‘Materials and Methods’ section) for forming the initial set of our HMM contig. Using this set, we learn an initial HMM using the alignment specified by the k-mer matches. This learning step can either directly rely on the multiple alignment of reads or use standard HMM learning (Expectation Maximization) but with a limited number of indels to keep the run time of the Forward–Backward algorithm linear (see ‘Materials and Methods’ section). Next, we use the consensus sequence defined by the HMM to extract more reads from our unassigned set by looking for those that overlap the current consensus in k or more nucleotides. These additional reads likely overlap the edges of the HMM (because those overlapping the center have been previously retrieved) and so they can be used to extend the HMM in both directions in a similar manner to the method used to construct the initial HMM. This process (learning HMM, retrieving new overlapping reads, etc.) repeats until no more reads overlap the current HMM or the entropy at the edges of the HMM exceeds a predefined threshold (see ‘Materials and Methods’ section).
When the algorithm terminates for a HMM, we determine for each of the reads that were used to construct the HMM how likely it is that they have been generated by this contig HMM. For those reads where this likelihood is above a certain threshold, we use the HMM consensus to correct errors in places where the read sequence disagrees with the HMM. We use several filtering steps to avoid false-positive corrections including testing for the number of similar errors at the same position, the entropy of a position in the HMM and the number of corrections made to a single read. See ‘Materials and Methods’ section for complete details.
We first tested SEECER on human data to compare it with other approaches that are widely used for other sequencing data (primarily DNA sequencing as mentioned above). Unlike de novo RNA-Seq data, when analyzing human data, we can use a reference genome to determine the accuracy of the resulting corrections and assembly. An established metric to measure the success of error correction after read alignment is the gain metric (19), which is defined as the ratio of newly created versus correctly removed errors (see ‘Materials and Methods’ section).
Before testing SEECER on the human data, we used a subset of ~34 million reads to assess the influence of the two main parameters for SEECER, the length of k for the initial hashing phase and the value for the maximum entropy at a position. Our experiments show that k = 17 works best for this subset (Supplementary Figures S1 and S2) with stable results for similar values. The maximum entropy was set to 0.6, as lower entropy values resulted in fewer corrections because fewer contigs could be constructed (Supplementary Figure S3).
We next have used these parameters to compare SEECER with four other methods for correcting the reads by initially testing their ability to improve the unique alignment of reads to the human genome after correction (see ‘Materials and Methods’ section). We used three diverse datasets to compare SEECER with the k-mer–based methods Quake (22) and ECHO (23), Coral (24), which relies on multiple alignments of reads for correction, as well as with HiTEC (21), which builds a suffix tree and automatically estimates parameters for correction.
The first dataset we used was derived from human T-cell RNA sequencing resulting in 55 million paired-end reads of length 45 bp (6). In Table 1, we list important statistics regarding the success of the error correction methods. Using SEECER, the number of aligned reads increased by 8.4% when compared with the uncorrected reads, much higher than Quake (3.6%), Coral (4.5%) and ECHO (1.3%). Unlike the other methods, error correction with HiTEC did not result in a higher number of reads mapped. Similarly, the number of reads that align without mismatch errors to the reference sequence using SEECER increased by 50%, which was by far the biggest improvement for all methods tested (Supplementary Figure S4). None of the error correction methods uses paired-end information, and therefore the number of properly aligned read pairs can serve as a good indicator for the accuracy of the error correction. Again, SEECER error corrected reads showed the highest improvement with 15% more pairs properly aligned. The gain metric shows the normalized difference between true-positive and false-positive corrections (Supplementary Table S1) and, again, SEECER outperforms the other methods. In addition, we investigated the error bias in terms of read positions and forward/reverse read strands. Figure 2 presents the distribution of mismatches following TopHat alignments relative to the read positions before and after error correction by SEECER. As can be seen, the previously reported bias that higher error rates are found at read ends for Illumina data (17), is observed in our data as well. However, after SEECER error correction, much of this bias is removed, and the corrected reads have a more uniform distribution of mismatches along the read positions. See Supplementary Figures S6–S9 for details on other types of corrections made by SEECER.
To further test the influence of error correction on downstream analysis, we investigated the ability to identify homozygous single-nucleotide polymorphisms (SNPs) before and after error correction. This analysis demonstrates the usefulness of error correction for such downstream SNP studies and, in particular, shows that using SEECER corrected reads leads to the identification of the highest number of SNPs. See Supplementary Table S6 and Supplementary Figure S10 for details.
While the ability to align individual reads is important, another important goal of de novo RNA-Seq experiments is transcriptome assembly. To test the impact of error correction on downstream assembly, we used the Oases de novo assembler (11). In addition to the read-based error correction methods we compared with above, we have compared with SEED read clustering and subsequent Oases assembly as previously suggested (28). In Table 1, the results for the human T-cell data are shown. An important metric for assembly comparisons is the number of full-length assembled transcripts. Compared with the original reads, after SEECER error correction, 21% more transcripts are reconstructed to full length. SEECER also leads to a 46% increase of detected alternative isoforms (Supplementary Table S4). An example of how Oases benefits from the SEECER error correction is shown in Figure 3 for the gene EIF3CL (see also Supplementary Results). Quake, Echo and Coral led to a lower improvement of assembled full-length transcripts with 13, 9.6 and 19.6%, respectively, whereas SEED and HiTEC resulted in a reduction of full-length reconstructed transcripts of −22 and −2.7%, respectively. The clustering approach used by SEED discards some of the data, which leads to loss of lowly to mid-level expressed transcripts (Supplementary Figure S5). The correction of the human dataset with SEECER took ~12.25 h, whereas the assembly with Oases took 19 h.
To test the scalability of SEECER when using datasets with more reads and longer read length, we further tested SEECER on two additional human datatsets: a HeLa cell line dataset of 64 M reads of length 76 bp (GEO Accession: GSM759888) (34) and 145 M reads of length 101 bp from the ENCODE consortium (see ‘Materials and Methods’ section). Owing to the time requirements of the assembly step, we have only focused here on the top three performing methods in our original analysis (SEECER, Quake and Coral). SEECER scales well, and for both datasets, it achieves the best performance for the number of aligned reads, read pairs, full-length assembly and gain (Tables 2 and and3).3). Additional information about the number of true-positive and false-positive corrections can be found in Supplementary Tables S2 and S3. While SEECER memory requirements scaled more or less linearly with the size of the dataset, Coral’s requirements did not scale in a similar manner. Specifically, we could not run Coral on the largest dataset (Table 3) because its memory requirements were beyond the available memory on the machine we used to test all methods.
The sea urchin Strongylocentrotus purpuratus is a model system for understanding the genetic mechanisms of embryonic development, e.g. (46). Other species of echinoderms, including the Californian warty sea cucumber P. parvimensis (Figure 4A), are being developed as comparative developmental model systems, e.g. (47). This work, however, is limited by the absence of a sequenced genome for the sea cucumber. It is thus critical for comparative studies that methods are developed to inexpensively obtain highly accurate transcriptomes for organism for which no sequenced genome exists.
To test how SEECER can help in this direction, we have produced two new datasets for the transcriptome of P. parvimensis. These datasets allow us to determine the expressed mRNAs at the embryonic gastrula (time point 1) and feeding larval (time point 2) stages, which provides insights into the development of this species. Illumina paired-end 72-nt sequencing was conducted and resulted in 88 641 446 and 85 575 446 reads for time points 1 and 2, respectively. We have next used SEECER to correct errors in these datasets, resulting in 28 655 078 and 25 546 050 corrections for 19 465 515 and 17 305 905 reads, respectively. Each corrected read set was then used to produce a de novo RNA-Seq assembly. Error correction took ~4.7 and ~4.6 h, whereas de novo assembly took ~11.3 and ~13 h for time points 1 and 2. In all, 850 056 transcript fragments (transfrags) were assembled for the embryonic stages (time 1) and 682 913 transfrags for the larval (time 2) stage using Oases (see ‘Materials and Methods’ section).
The only other echinoderm with a sequenced genome is the sea urchin S. purpuratus, which last shared a common ancestor with sea cucumbers ~350 million years ago (48). Thus, we initially analyzed the similarity between the transfrags we obtained and sea urchin proteins. For the embryonic and larva stages 261 405 and 189 101 transfrags mapped to fragments of 13 330 and 11 793 distinct known peptides in sea urchin (minimum length 50 amino acids for each match). Although we only sequenced RNAs from two developmental stages, thereby not sampling much of the long developmental process and many adult tissues of these organisms, the assembled transfrags from both time points nonetheless matched to >50% of known sea urchin peptides. This suggests both that we have achieved a high-sequence coverage in the assembly, and that many of the sea cucumber genes are already being expressed during early development. In addition, the fact that 14% of these matches were restricted to only one of the two time points suggests that we are able to detect stage-specific developmentally regulated genes, an important requirement for developmental studies (see Figure 4B). To illustrate the usefulness of de novo sequencing, we next performed a GO enrichment analysis for sea urchins’ peptides matched to both time points, and those matched only to time point 1 or time point 2. The results are presented in Supplementary Tables S7–S9.
Time point 1 embryos are undergoing active development including cell movements involved with gastrulation. Larval stages (time point 2) meanwhile are actively swimming and feeding in the water column. As can be seen in the GO analysis, many differences in expression between these stages are of mRNAs that encode for proteins involved in energy metabolism, which is likely due to a switch in how sessile non-feeding embryos and motile feeding larvae use energy resources. We also find an enrichment of expression of genes involved in RNA splicing and translation control in time point 1 (embryos), which may be related to the active transcriptional processing requirements of early embryogenesis. This analysis thus provides an entry point into understanding these important processes.
Although 62–65% of transfrags matched known sea urchin peptides, 297 173 and 255 672 sea cucumber transfrags for time points 1 and 2 did not significantly match any sea urchin peptide (see ‘Materials and Methods’ section). We computed the expression levels of the assembled transfrags and investigated the top 100 expressed transfrags that we could not match to sea urchin peptides from both time points in more detail (see ‘Materials and Methods’ section). In the top 100, 28 and 9 transfrags matched to the RefSeq and Swissprot databases, respectively. Still, we were unable to match 64 transfrags expressed in both time points to any known entry in these databases. To further test the accuracy of our correction and assembly and whether the non-matched transfrags are indeed novel expressed RNAs, we have performed additional follow-up experiments. We selected 14 transfrags that were highly expressed in both time points and performed RT-PCR analysis on these to confirm that the predicted products could be amplified from sea cucumber–derived embryonic cDNA (Figure 4C and Supplementary Table S11). Of the 14, seven were derived from transfrags that matched known peptides and the other seven were derived from transfrags with no match to any of the databases we looked at. As can be seen in Figure 4C, all 14 transfrags were successfully validated, indicating that these are indeed expressed mRNAs and lending support to our correction and assembly procedure.
We have developed and tested SEECER, a new method based on profile HMMs to perform error correction in RNA-Seq data. Our method does not require a reference genome. We first learn a contig HMM using a subset of reads and use the HMM to correct errors in reads that are likely associated with the HMM. Our method can handle non-uniform coverage and alternative splicing, both key challenges when performing RNA-Seq. We tested SEECER using complex human RNA-Seq data and have shown that it outperforms several other error correction methods that have been used for RNA-Seq data, in some cases leading to a large improvement in our ability to correctly identify full-length transcripts. We next applied it to perform de novo transcriptome correction and assembly of sea cucumber expression data, providing new insights regarding the development of this species and identifying novel transcripts that cannot be matched to proteins in other species. We note that although a recent report of a 454 sequencing analysis of mixed embryo, larval and adult tissues provides some coverage of an unrelated species, the Japanese sea cucumber Apostichopus japonicas (49), to the best of our knowledge, this is the first published transcriptome of P. parvimensis.
Our analysis of the sea cucumber data indicates that we were able to obtain good transcriptome coverage. The expressed genes from the two developmental stages matched 50% of the protein-coding regions of sea urchin. In addition, de novo correction and assembly was able to accurately detect taxon-specific transcripts. This is critical for comparative development studies, which, in the absence of a genome sequence, often rely on gene discovery from homology searches in related model species. Full appreciation of the role of species-specific genes is essential to understand the developmental origins of animal diversity.
Although one of the main motivations for developing SEECER are applications of de novo RNA-Seq, the human data are useful because alignments allow us to explore the accuracy of the methods, and it is thus a common practice for testing sequencing error correction approaches (19). However, we would like to point out that the classification into false and true positives/negatives is based on the human reference sequence, which may miss haplotype allels. Thus, the false-positive rates reported in the tables may be slightly higher than the real false-positive rates. Nevertheless, we doubt that this approach favors any of the methods because none of them use the reference sequence for performing corrections.
The genome read error correction methods Quake and Coral were able to correct many reads but resulted in a large number of false negatives, as indicated by their lower rates of aligned reads and the drop in the gain statistic compared with SEECER. Coral was the closest to SEECER in terms of the resulting number of full-length assembled transcripts for two of the three datasets. However, Coral seems to suffer from lack of scalability, which may be problematic as dataset size increase. Indeed, its memory requirements for the largest dataset we analyzed were larger than the capacity of our machine cluster.
Our experiments have shown that read clustering leads to a loss of assembled full-length transcripts, especially for low-to-mid level expressed transcripts, because parts of the data are discarded. Owing to non-uniform expression levels in RNA-Seq data, error correction sensitivity critically depends on a method’s ability to detect errors. The performance drop for HiTEC and ECHO, compared with the other methods tested, may be explained by their uniform coverage assumption leading to missing higher frequency errors in highly expressed genes. In contrast, Quake and Coral do not have these strong assumptions and perform much better. However, unlike SEECER, they do not use a probabilistic HMM model and read clustering. These steps allowed SEECER to outperform all other methods in the number of alignable reads, full-length assemblies and false-negative rate with only linear increase in memory requirements for larger datasets.
While we have focused here on the improvement to RNA assembly following error correction, it has been shown that de novo assemblies allow reliable detection of genes that are differentially expressed between two conditions (50). Thus, by improving the resulting assembly, SEECER is likely to improve downstream differential expression analyses as well.
There are many directions to improve SEECER further by using base call quality scores to improve performance on lowly expressed transcripts or using the paired-end information to improve construction of contigs. Currently, SEECER was designed to work without an available reference sequence (de novo RNA-Seq), but an available reference sequence could help with correction of repetitive regions and lowly expressed transcripts.
Finally, while we have primarily developed SEECER for RNA-Seq data, it may also prove useful for single-cell and single-molecule sequencing. In other studies, including metagenomics and ribosome profiling experiments, researchers encounter sequencing data where the coverage is non-uniform and as such SEECER, which does not assume uniformity, can improve the analysis of these data as well.
Supplementary Data are available at NAR Online: Supplementary Tables 1–11, Supplementary Methods, Supplementary Results, Supplementary Figures 1–11 and Supplementary References [51–55].
National Institutes of Health (NIH) [1RO1 GM085022] and National Science Foundation (NSF) [DBI-0965316 award to Z.B.J.]. Funding for open access charge: NIH [1RO1 GM085022 to Z.B.J.].
Conflict of interest statement. None declared.
Animals were collected by Patrick Leahy and Peter Halmay or Marinus Inc under permit number SC-11478 from California Department of Fish and Game to V.F.H. This research was funded by NSF grant IOS-0844948 and the Winters Foundation to V.F.H.