Whole transcriptome sequencing (RNA-seq) is a powerful transcriptional profiling technology using next generation sequencing platforms [
4,
23-
25] and has signaled a new age in clinical genomics. Several recent studies have indicated that RNA-seq will be more useful than the current microarray technology due to the increased dynamic range of signal of sequencing [
4,
26-
28] and its ability to identify the exact location of transcription boundaries at single base resolution. RNA-seq also provides the sequence information needed to identify single nucleotide variants, map variant transcription start sites, and detect novel transcript splicing. These features make RNA-seq particularly useful for studying complex transcriptomes, such as those found in the clinical blood samples.
Although attractive, clinical application of RNA-seq is feasible only if the tool can demonstrate high specificity, sensitivity and reproducibility with limited amount of starting material. Although current technology for transcriptome sequencing requires at least 100

ng total RNA (tens of thousands of cell equivalents), along with additional enrichment steps to select for poly(A)

+

RNA and/or to reduce the content of ribosomal RNA (rRNA) prior to NGS library construction, to minimize the loss of input material researchers tend to start with a minimum of 1 microgram total RNA. The utility of RNA-seq data generated is believed to be sensitive to read length, mapping and assembly of reads and statistical and computational challenges. However, with the current availability of substantially improved mapping software, these challenges are expected to be well tackled. Microarrays, on the other hand, are known to suffer from reduced dynamic range of signals due to saturation biases and high background, and non-specific or cross-hybridization resulting in false-positive signals, especially for transcripts that have low expression levels.
Considering these challenges inherent to each of the high throughput platforms, we undertook this comparative study in an attempt to better understand the relative merits of high density exon microarrays and RNA-seq for biomarker discovery in the clinical setting. We chose the sickle cell disease because strong differential expression has been previously observed, and the phenotype of the disease is in manifested in blood, an accessible tissue for study. Sampling whole blood, a globin abundant tissue, allowed us to examine the potential interference of high abundant globin transcripts during sequencing and also to potentially discover novel genes that are associated with sickle cell disease from cell types such as nucleated red cells in addition to the conventional peripheral blood mononuclear cells. We believe that this is the first study that has compared the 2 platforms on a monogenic human disease model using easily accessible whole blood clinical specimens mimicking a large scale clinical research project.
In this study we used 50

ng of total RNA on exon arrays without any globin reduction or poly(A)

+

enrichment to identify differentially expressed genes and alternative splicing events in sickle cell disease. The same samples were also analyzed by RNA-seq using 1.5 micrograms of total RNA. RNA-seq analysis generated an average of 83% mappable reads from the whole blood samples after poly(A)

+

enrichment. The globin reduction process had insignificant effect on the mappable read count, even for sickle cells samples which have high levels of globin RNA suggetsing that it is not necessary to reduce the globins while analyzing whole blood samples by RNA-seq.
As expected, comparison of the dynamic range of the two platforms confirmed that RNA-seq has a dramatically larger range varying from 4 to 16 fold. This enticing feature of RNA-seq effectively removes the saturation biases inherent to the array platform. Examination of the technical reproducibility and coefficient of variation of the array and RNA-seq platforms at the gene and exon levels, as a function of the mean expression level, indicated that the coefficient of variation for microarray is much lower than that for RNA-seq and is also independent of the expression level for each transcript, suggesting that technical variability within group is higher in the RNA-seq platform than the arrays. A similar observation has been reported by Marioni et al. [
15]. They observed extremely high CVs when the read counts were low, a domain where Poisson counting error dominates in RNA-seq. In this domain, microarray produces moderately low CV (20%), suggesting that microarray may in fact be more effective at detecting expression changes for low-abundance genes.
Our comparative analysis of detection sensitivity with material from clinical samples revealed that even with the usage of 30 times less starting material (50

ng vs 1.5 micrograms) Exon arrays could detect as many transcripts above background as in RNA-seq. It should be noted that the sequencing depth in this study (~10 million reads) is comparable with most published RNA-seq studies. Xu and others [
29] from their comparative study using GG Exon arrays to RNA-seq reported that although both platforms detect similar expression changes at the gene level, the Exon array is more sensitive at the exon level and deeper sequencing is required to adequately cover low abundance transcripts [
29]. It should be mentioned here that with the latest much improved sequencing instruments, it would be easier to generate

~

80 million reads and this would substantially increase the sensitivity of detection in RNA-seq platform.
We found 331 transcripts with differentially expressed transcripts in SCD. These included genes involved in pathways related to sickle cell disease such as inflammatory response, oxidoreductase pathways, stress response, cell signaling and apoptosis. Of these 331 transcripts which showed a high degree of correlation (R

=

0.64), 96 genes were identified by both the technologies. A similar observation has also been reported by correlating gene expression arrays and RNA-seq on their study on differentially expressed liver and kidney tissues [
15]. Only 11 genes out of the 331 genes from the current study showed an opposite trend in differential expression in sickle cell disease, suggesting that the number of false positives was small, using either method.
Gene ontology analysis of these genes helped to classify their molecular functions into ten highly significant functional pathway such as cellular cycle regulators, apoptosis, oxidative stress response, inflammation and immune response, free radical scavenging, protein modification, and hematopoiesis. Examination of these pathways suggests that differentially regulation may be in response to oxidant and hemolytic stress, vascular injury and participation in repair and homeostasis [
17,
30-
34]. Interestingly, GDF15 expression was upregulated, which also has been observed in thalassemia intermedia, and associated with repression of hepcidin, an important mediator of the inflammatory response on erythropoiesis [
35].
We also observed upregulated expression of several reticulocyte specific genes as expected in SCD where a higher proportion of reticulocytes are observed. This finding validates the performance of both technologies in identifying alterations relevant to sickle cell disease. From a biological perspective, the whole blood expression profile provided a window into real time erythrocyte expression profiles. Insights into the transcription profile of these red blood cells may contribute greatly to our understanding of mechanism of disease, prognosis, and responses to therapeutics.
Using ExonAnova analysis on RNA-seq data, we identified 16 alternatively spliced genes. While further validation of these splice variants is needed, it is interesting to note that both
RHCE and
RHD are components of the important Rh antigen system on red cells. However the potential implications of altered Rh splicing in SCD is still unclear. Deficiency of
UNC13D is known to result in defective exocytosis of cytolytic granules of cytotoxic T lymphocytes and natural killer cells, causing immune dysregulation [
36]. Whether altered expression of UNC13D in SCD could contribute to the relative immune compromise of SCD may merit future investigation.
To illustrate the power of RNA-seq in detecting differential transcription not associated with known genes, we scanned the entire genome for novel differential expression focusing only on unannotated genomic regions. By doing so, we found an interesting region to include an apparently novel, minor exon between exons 4 and 5 in the
ALAS2 gene with SCD patients showing at least six times higher expression levels compared to the control subjects (p

=

0.003). This could suggest alternative splicing in SCD which might serve as an
ALAS2 transcription regulator. Follow up of this suggestion would require a functional analysis of this newly identified region of
ALAS2 but is beyond the scope of the current study but is planned for the future. ALAS2 gene expression is restricted to the erythroid lineage [
37] and plays a pivotal role in heme synthesis. In addition to heme-mediated feedback inhibition of enzymatic function, ALAS-2, a member of a small family of genes is modulated by iron [
38]. This ability of RNA-seq to identify regions in detail holds great promise for the future discovery of novel transcripts and biomarkers in clinical genomic studies.
Another key advantage of RNA-seq over existing technologies for transcriptomic studies is its ability to identify sequence variations in expressed transcripts. To illustrate the feasibility of identifying sequence variation in expressed genes, we focused on the known single nucleotide mutation in SCD in which glutamic acid-6 is replaced by valine (GAG replaced by GTG). We were able to successfully detect this mutation in all the sickle cell patients. Interestingly, we were also able to identify, that same mutation in heterozygous combination with a hemoglobin C beta globin variant having glutamic acid replaced by lysine (GAG replaced by AAG) in one compound heterozygous sickle cell patient, thereby demonstrating the ability of RNA-seq to reliably identify heterozygous single base mutations in the expressed transcripts.
In conclusion, our study clearly illustrates a high level of concordance between the array platform and the RNA-seq technology, and suggests that the high density Exon array still remains a powerful tool to generate meaningful data when the amount of material is limited. Although RNA-seq is still in the early stages of use in clinical studies, it has clear advantages over the array based transcriptomic methods, based on its ability to discover novel transcripts, identify sequence variants, and increased dynamic range of signals leading to increase fold change in measured expression levels. With the rapid evolution of NGS instruments and library preparation methods with multiplexing barcodes, longer read lengths and large number of paired end reads associated with reduced cost per lane is highly feasible in the near future. The use of picogram to few nanogram amounts to total RNA for RNAseq still needs to be optimized in order to capture low abundance transcripts.
We believe that the results from this study provide guidelines on the choice of tools in the form of arrays or RNA-seq for clinical transcriptomic studies using limited amount of starting material. We believe that the selection of an appropriate tool for clinical genomic studies is mostly driven by the biological question underlying the study: whether a formal hypothesis is being tested or is the study intended to better describe the complete transcriptome and discover novel transcripts. An emerging approach is to apply both RNA-seq and arrays in combination, in large scale clinical studies, where RNA-seq is used first to define the transcriptome elements associated with the disease in question, followed by high throughput and reliable screening of these elements on thousands of patient samples using the arrays. Integrating data from both microarray and RNA-seq experiments may open up new possibilities for creating meaningful informational networks which will aid our understanding of disease pathology and development of novel therapeutics.