Massively parallel sequencing for transcriptome profiling generates digital counts of gene expression levels, in contrast to the "analog" hybridization signals from traditional microarray or quantitative PCR methods. Our analysis highlights the precision of DGE profiling, as demonstrated by the high reproducibility between technical replicates of the same library run in different lanes, and between biological replicates of different libraries on the same or different flow cell runs. The correlations between technical replicates were >0.97. Although the correlation between biological replicates of two different libraries (L6 and L8) was below 0.95, the variance likely came from library construction rather than the sequencing process, since the correlations between biological replicates of the libraries constructed at one lab (Florida) were excellent (>0.95, Additional File 2). We also demonstrated that DGE profiling is accurate. The fold changes of genes between HBRR and UHRR measured by DGE were highly correlated with data obtained from qPCR. This correlation is similar to that between microarrays and qPCR.
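As an illustration of how replicate reproducibility of this kind can be quantified, the following minimal sketch computes a Pearson correlation between per-gene tag counts from two lanes. It is not the pipeline used in this study: the choice of Pearson correlation, the log2 transformation with a pseudocount of 1, the function names, and the example counts are all assumptions made for illustration.

```python
import math


def log2_counts(counts, pseudocount=1.0):
    """Log2-transform tag counts; the pseudocount (an assumption) avoids log(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def replicate_correlation(counts_a, counts_b):
    """Correlation of log2 tag counts for the same genes in two replicates."""
    return pearson(log2_counts(counts_a), log2_counts(counts_b))


# Made-up tag counts for six genes in two lanes (hypothetical data).
lane_1 = [0, 12, 150, 980, 4300, 25]
lane_2 = [1, 10, 162, 1010, 4150, 31]
r = replicate_correlation(lane_1, lane_2)
print(f"replicate correlation: {r:.3f}")
```

The same function could be applied to log2 fold changes (HBRR versus UHRR) from DGE and qPCR to compare platforms, provided both vectors list the same genes in the same order.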
With 20 million tags each from the HBRR and UHRR libraries, which is the current throughput of one lane on the GA II, DGE detected 10-20% more transcripts than microarrays, a majority of which were expressed at levels below the sensitivity threshold of microarray platforms. The detection of lower-expressed genes and the wider dynamic range were shown to be the main advantages of DGE over microarray analysis, since the other parameters evaluated were mostly comparable between the two platforms. It has been suggested that lower-expressed transcripts may account for nearly half of all transcripts in a cell, and play critical but currently undefined roles in pathology and physiology. DGE's ability to quantify these transcripts may open new horizons for the application of genomic profiling to translational research. In addition, DGE may lead to the discovery of new functional genomic regions. For example, 13-15% of the reads from our DGE libraries aligned to intergenic regions, which may correspond to novel transcripts. However, since no DNase treatment was performed during RNA extraction, these intergenic reads could also derive from contaminating genomic DNA, although this scenario is less likely because polyA+ RNAs were selected during library construction.
One limitation of 3' DGE is that using a particular enzyme such as DpnII for library preparation requires the presence of DpnII restriction site(s) in the mRNA. Some transcripts, even though highly expressed, may lack DpnII site(s) and are therefore not represented in the libraries. Among all 43,569 transcripts in the Human RefSeq RNA database (version of Aug 24, 2009), 2,912 (6.68%) do not have a DpnII site. Among 28,061 mature RNAs with accession numbers starting with NM_ (excluding XR_, NR_ and XM_ accession numbers), 571 (2.04%) do not have a DpnII site. Additional File 5 lists all Human RefSeq RNAs and the number of DpnII sites within each molecule. In addition, non-polyadenylated transcripts are not represented in the current libraries. In an effort to measure both mRNA and non-polyadenylated RNAs, we treated total RNA with the RiboMinus™ Transcriptome Isolation Kit (Invitrogen) prior to library preparation to deplete 18S and 28S rRNAs and to enrich polyadenylated mRNA, non-polyadenylated RNA, pre-processed RNA, tRNA, and small rRNAs (5S rRNA, 5.8S rRNA). However, sequencing of the RiboMinus libraries revealed that ~70% of the reads were rRNAs (data not shown). Therefore, we decided to prepare polyadenylated mRNA libraries for the current study to avoid wasting sequencing depth on rRNAs.
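A tally of this kind can be sketched as follows. The sketch is not the script used to produce Additional File 5: the function names, the toy accessions, and the toy sequences are hypothetical. It relies only on the DpnII recognition sequence, GATC, and counts its occurrences in each transcript.

```python
# DpnII recognition sequence. GATC cannot overlap itself, so str.count is exact.
DPNII_SITE = "GATC"


def count_dpnii_sites(sequence):
    """Count DpnII sites in one transcript; accepts DNA or RNA letters, any case."""
    normalized = sequence.upper().replace("U", "T")
    return normalized.count(DPNII_SITE)


def summarize_sites(transcripts):
    """Return per-transcript site counts, accessions lacking a site, and their fraction.

    `transcripts` maps an accession string to its sequence (hypothetical input format).
    """
    site_counts = {acc: count_dpnii_sites(seq) for acc, seq in transcripts.items()}
    lacking = sorted(acc for acc, n in site_counts.items() if n == 0)
    fraction = len(lacking) / len(site_counts) if site_counts else 0.0
    return site_counts, lacking, fraction


# Toy sequences for illustration only; these are not real RefSeq records.
toy_transcripts = {
    "TOY_0001": "ATGGATCCGTTAGATCAAA",  # two GATC sites
    "TOY_0002": "ATGCCCGTTAGCTAAAAAA",  # no GATC site
    "TOY_0003": "augggaucuuaa",         # RNA letters, one site after U -> T
}

site_counts, lacking, fraction = summarize_sites(toy_transcripts)
print(site_counts)
print(f"{len(lacking)} of {len(site_counts)} transcripts lack a DpnII site ({fraction:.2%})")
```

Applied to a full transcript set, the reported percentage is simply the number of site-free transcripts divided by the total, as in 2,912 of 43,569 above.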
Several challenges currently limit the adoption of 3' DGE as a replacement for microarrays. The higher cost per sample and low sample throughput per run limit the sample size in a study. Currently 3' tag DGE analysis costs ~$1200 per lane of sequencing, including library construction. It also has long individual run times, sequencing only ~12 bases per day, and sample preparation is significantly more difficult and time-consuming than for microarrays. Finally, the bioinformatics challenges associated with 3' DGE analysis are significant, including storage, archiving, and retrieval of the vast volume of data, and development of algorithms to assemble and align sequence reads as short as 35-40 nucleotides [19]. However, advances in these new sequencing technologies will substantially increase sample throughput, and techniques such as multiplexing could reduce the cost of sequencing per sample. The bioinformatics challenges are currently being addressed, with faster and more powerful alignment tools being developed.
We are aware that RNA-Seq enables a more extensive profiling of the transcriptome by facilitating the discovery of fusion genes [20], detection and quantification of alternative splice forms, and characterization of expressed mutations and polymorphisms [14]. However, in order to comprehensively sequence the full length of all transcripts, the sequencing depth of RNA-Seq needs to be significantly greater than that of 3' tag DGE profiling. It was estimated that at least 40 million reads (compared to <5 million reads required for DGE) need to be sequenced from a single library to achieve 90% coverage of the transcriptome [10], making RNA-Seq even more financially demanding than 3' tag DGE. In addition, the increased complexity of the data poses even greater analytic challenges. Therefore, the advantages of 3' tag DGE over microarrays in profiling lower-expressed genes and measuring with greater dynamic range make it arguably attractive for use in today's medical and biological research.