|Home | About | Journals | Submit | Contact Us | Français|
The hippocampal expression profiles of wild-type mice and mice transgenic for δC-doublecortin-like kinase were compared with Solexa/Illumina deep sequencing technology and five different microarray platforms. With Illumina's digital gene expression assay, we obtained ~2.4 million sequence tags per sample, their abundance spanning four orders of magnitude. Results were highly reproducible, even across laboratories. With a dedicated Bayesian model, we found differential expression of 3179 transcripts with an estimated false-discovery rate of 8.5%. This is a much higher figure than found for microarrays. The overlap in differentially expressed transcripts found with deep sequencing and microarrays was most significant for Affymetrix. The changes in expression observed by deep sequencing were larger than observed by microarrays or quantitative PCR. Relevant processes such as calmodulin-dependent protein kinase activity and vesicle transport along microtubules were found affected by deep sequencing but not by microarrays. While undetectable by microarrays, antisense transcription was found for 51% of all genes and alternative polyadenylation for 47%. We conclude that deep sequencing provides a major advance in robustness, comparability and richness of expression profiling data and is expected to boost collaborative, comparative and integrative genomics studies.
Gene expression microarrays are at present the default technology for transcriptome analysis. Since they rely on sequence-specific probe hybridization, they suffer from background and cross-hybridization problems and measure only the relative abundances of transcripts (1). Moreover, only predefined sequences are detected. In contrast, tag-based sequencing methods like SAGE (Serial Analysis of Gene Expression) measure absolute abundance and are not limited by array content (2). However, laborious and costly cloning and sequencing steps have thus far greatly limited the use of SAGE. This has radically changed with the introduction of deep sequencing technology, enabling the simultaneous sequencing of up to millions of different DNA molecules. The shared idea behind the different deep sequencing approaches is the clonal detection of single DNA molecules at physically isolated locations(3–5). We used the Solexa/Illumina 1G Genome Analyzer, in which adapter sequences, ligated to both ends of the DNA molecule, are bound to a glass surface coated with complementary oligonucleotides. This is followed by solid-phase DNA amplification and sequencing-by-synthesis (6). The system yields millions of short reads (currently up to 36 bp), and is therefore very suitable for tag-based transcriptome sequencing. The technology is also referred to as Digital Gene Expression tag profiling (DGE), and is essentially an improved version of the earlier Massively Parallel Signature Sequencing (MPSS) technology(3,7).
The first steps of the procedure are similar to classical LONG-SAGE. Two restriction enzymes are used to generate tags, cutting at the most 3' CATG and 17 bp downstream of the first enzyme site. Unlike in classical SAGE, tags are neither concatenated nor cloned, but sequenced immediately. The unprecedented sequencing depth now enables the analysis of individual biological samples, while pooling of samples was previously the only affordable option in SAGE. Our results include a striking example of the intrinsic hazards of pooling in expression profiling.
The biological question addressed in the current study was the identification of transcripts differentially expressed in the hippocampus between wild-type and transgenic mice overexpressing a splice variant of the doublecortin-like kinase-1 (Dclk1) gene. This splice variant, δC-doublecortin-like kinase (DCLK)-short, makes the kinase constitutively active (8), and causes subtle behavioral phenotypes (Schenk et al., in preparation). The exact same RNA samples have been analyzed before on five different genome-wide microarray expression profiling platforms (9), which detected few differences in expression between the two groups. We report here that DGE detects a lot more small, yet significant differences between the two groups of mice, including those in antisense transcripts and transcripts with different 3′-untranslated regions (UTRs). Furthermore, we discuss the advantages of deep sequencing over microarray expression profiling.
Wild-type male C57/BL6j and transgenic male mice overexpressing DCLK-short with a C57/BL6j background were individually housed 7 days prior to the start of the experiment. Animals were housed under standard conditions, 12 h/12 h light/dark cycle and had access to food and water ad libitum. Wild-type (N = 4) and transgenic (N = 4) tissue samples were collected by taking the brain from the skull and quickly dissecting out both hippocampi. Dissection was performed at 0° C to prevent degradation of RNA. Hippocampi were put directly in pre-chilled tubes containing Trizol reagent (Invitrogen Life Technologies, Carlsbad, CA, USA). All animal treatments were approved by the Leiden University Animal Care and Use Committee (UDEC# 01022).
After transfer to ice-cold Trizol, hippocampi were homogenized using a tissue homogenizer (Salm&Kipp, Breukelen, The Netherlands) and total RNA was isolated according to the manufacturer's protocol. After precipitation, RNA was purified with Qiagen's RNeasy kit with on-column DNase digestion. The quality of the RNA was assessed with the RNA 6000 Labchip kit in combination with the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA), using the Eukaryote Total RNA Nano assay according to the manufacturer's instructions.
Sequence tag preparation was done with Illumina's Digital Gene Expression Tag Profiling Kit according to the manufacturer's protocol (version 2.1B). A schematic overview of the procedure is given in Supplementary Figure 1. One microgram of total RNA was incubated with oligo-dT beads to capture the polyadenlyated RNA fraction. First- and second-strand cDNA synthesis were performed while the RNA was bound to the beads. While on the beads, samples were digested with NlaIII to retain a cDNA fragment from the most 3′ CATG to the poly(A)-tail. Subsequently, the GEX adapter 1 was ligated to the free 5′ end of the RNA, and a digestion with MmeI was performed, which cuts 17 bp downstream of the CATG site. At this point, the fragments detach from the beads. After dephosphorylation and phenol extraction, the GEX adapter 2 was ligated to the 3′ end of the tag. A PCR amplifcation with 15 cycles using Phusion polymerase (Finnzymes) was performed with primers complementary to the adapter sequences to enrich the samples for the desired fragments. The resulting fragments of 85 bp were purified by excision from a 6% polyacrylamide TBE gel. The DNA was eluted from the gel debris with 1× NEBuffer 2 by gentle rotation for 2 h at room temperature. Gel debris were removed using Spin-X Cellulose Acetate Filter (2 ml, 0.45 µm) and the DNA was precipitated by adding 10 µl of 3 M sodium acetate (pH 5.2) and 325 µl of ethanol (–20°C), followed by centrifugation at 14 000 r.p.m. for 20 min. After washing the pellet with 70% ethanol, the DNA was resuspended in 10 µl of 10 mM Tris–HCl, pH8.5 and quantified the DNA with a Nanodrop 1000 spectrophotometer.
Cluster generation was performed after applying 4 pM of each sample to the individual lanes of the Illumina 1G flowcell. After hybridization of the sequencing primer to the single-stranded products, 18 cycles of base incorporation were carried out on the 1G analyzer according to the manufacturer's instructions. Image analysis and basecalling were performed using the Illumina Pipeline, where sequence tags were obtained after purity filtering. This was followed by sorting and counting the unique tags. The raw data (tag sequences and counts) have been submitted to Gene Expression Omnibus (GEO) under series GSE10782.
All tags were annotated using a database provided by Illumina. Briefly, a preprocessed database of all possible CATG + 17-nt tag sequences was created, using mouse genome (mm8 version from UCSC site) and mouse transcriptome (all refseq, mRNA and ESTs found in GenBank as of November 2006 and Unigene version Mm159). All tags were classified based on the location and orientation in the original sequence as outlined in Supplementary Table 1. The genome was used as a backbone for tag clustering, using tag per genome position as a unique key. Best possible ‘local’ annotation was chosen for each genome location. Finally, best annotation for each distinct tag sequence was chosen based on quality of local annotation and number of transcripts in that location. The total number of genome and transcriptome hits for each tag is also recorded. This nonredundant set of all tags (‘tophit’) could be used as a lookup table for all experimental tags annotation. Only perfect matches were considered, and no mismatches were allowed.
The total set of all annotation tags could be separated into several groups: canonical transcriptomic tags—3′-most tags from known transcripts (the 52 281 tags most expected in a DGE tag profiling experiment); noncanonical transcriptomic tags–all tags in the mouse genome that map to any known exon (both strands) but not 3′-most or derived from few ESTs only (~1.6 million tags); tags derived from ribosomal (46 tags) and mitochondrial RNA (108 tags); REPEAT tags—tags that map to the genome more than 100 times (2900 tags); and tags that map to the genome but not to any known exon (~17 million ‘just genome’ tags).
The microarray analysis of the exact samples as used for DGE is described in our previous paper (9). Microarray data are available through Gene Expression Omnibus under series GSE8349.
To enable comparison with microarray probes, all canonical sequence tags and microarray probe sequences were put in FASTA format and then aligned to the ENSEMBL mus_musculus_core_46_36g cDNA (transcript) database using the PERL API. The probe sequences on the Agilent (AGL: WMG G4122A), Illumina (ILL: Sentrix Mouse-6 Expression BeadChip) and home-spotted long oligonucleotide arrays (LGTC: 65-mer Sigma-Compugen mouse library, version 1), were provided by the manufacturer. For the Affymetrix chips (AFF: Mouse Genome 430 v2.0 Array), the sequence from the first probe in the probeset to the last probe in the probeset was taken. For the Applied Biosystems arrays (ABI: AB1700), only the surrounding 180 nt of the probe were given and these were taken into the alignment. Microarray and Ilumina DGE tag-profiling results were compared in pairs. Only ENSEMBL transcripts that were shared between the Illumina Genome Analyzer platform and a certain microarray platform were considered.
Initially, a Student's t-test was performed to determine significant differences in gene expression between the group of wild-type and transgenic samples. Before performing the t-test, we corrected for differences in the total number of counts by multiplying with a linear scaling factor that is defined as the total number of tags obtained for a certain sample divided by the average number of obtained tags in all samples. In addition, we stabilized the variance by applying a square root transformation on the linearly scaled data. This square root transformation gives a better stabilization of the variance in the region of low abundance than a logarithmic transformation. In addition, the square root transformation can handle observations with zero counts.
As a better suited alternative for the t-test, we applied a Bayesian model developed by Vencio et al. (10). We considered only canonical tags which had at least one count in each group. It fits a probability density function per gene and per group, employing the Beta-Binomial distribution, and taking into account the number of observed tags in each sample and the library size (=total number of tags) for each sample. A Bayesian error rate is calculated that reflects the posteriori chance that the probability density function of the group of wild types is in reality not different from the one of the transgenic mice. To estimate the number of false positives in the list of differentially expressed genes obtained by setting a cutoff on the maximum Bayesian error rate, we calculated the number of genes below the same Bayesian error rate in all unique permutations for the comparison of two groups, where the first group contained two wild-type and two transgenic mice, and the second group contained the other two wild-type and transgenic mice.
The RNA samples used for the qPCR assays were the same as for the DGE experiments. cDNA was synthesized using the Transcriptor First Strand cDNA Synthesis Kit (Roche). Quantitative RT-PCRs (qPCRs) were done on the Lightcycler480 (Roche), with SYBR-Green detection or (when amplification efficiencies with SYBR-Green were below 90%) using the universal probe library (UPL, Roche). Each cDNA was analyzed in quadruplicate, after which the average threshold cycle (Ct) was calculated per sample. The relative expression levels were calculated with the 2−ΔΔCt method, while using the average threshold cycles for all genes analyzed to correct for differences in cDNA input.
The global test (11) (available from Bioconductor: www.bioconductor.org) was used to test which Gene Ontology (GO)-defined pathways were significantly deregulated in DCLK compared to wild-type mice. After summarization of the tags for each Entrez Gene entry, the global test was run on the scaled and square root transformed data. The asymptotic method was used to calculate the P-values. Additional filtering of pathways was done on the median of the z-scores for each gene in the pathway (median should be >1.5), to retrieve only those pathways for which the majority of genes contribute to the significance of the pathways.
We sequenced hippocampal DGE libraries from four individual wild-type and four individual DLCK transgenic mice. We obtained 2.4 ± 1.2·106 sequence reads per sample with ~2.0·105 unique tag sequences. Figure 1 shows the distribution of the tags over the different classes that we discriminate (see ‘Materials and methods’ section and Supplementary Table 1). Canonical tags, i.e. those which map to the most 3′ CATG site in high-confidence transcripts, account for 70% of the total number of reads. Since they account for only 20% of all unique tags, these appear to have an overall much higher abundance than tags corresponding to low-confidence transcripts (see also Supplementary Figure 2). Around 8% of the reads mapped to mitochondrial RNAs. The collective percentage of reads in repeat regions, regions with no evidence for transcription, and tags that could not be mapped to the genome was around 12%.
To evaluate the reproducibility of DGE across different laboratories, the same RNAs were pooled and a wild-type and a transgenic pool were analyzed in triplicate at a different site (Illumina Inc., Hayward, CA) using the same protocol. The Pearson correlation coefficients for the number of counts and the normalized (scaled and square root-transformed) number of counts across technical replicates in the same laboratory were >0.99.The correlation between the normalized number of counts from the summed individual samples in our laboratory and the pool analyzed in the other laboratory were 0.98 and 0.96 for wild-type and transgenic samples, respectively (plots in Supplementary Figure 3). This is indicative of low technical variability, even across different laboratories.
The dynamic range of DGE is three to four orders of magnitude. The most abundant transcript, arising from the Ckb gene (brain isoform of creatine kinase), comprises 0.55% of all canonical tags [5.5·103 transcripts per million (t.p.m.)]. The lowest expressed transcripts which were still consistently detected in all samples had an abundance of 2 t.p.m., which corresponds with an average of ~0.3 copies per cell (12). The hippocampus is a rich source of unique transcripts: 28 341 different canonical tags were detected in both wild-type and transgenic groups; including noncanonical mappings increases the number even further. Within the noncanonical group alone, 45 550 tags were identified in both groups.
DGE is able to discriminate between transcripts with different 3′-ends when they are separated by at least one restriction site. A remarkable 47% of the detected ENSEMBL transcripts were detected by more than one tag. This is unlikely to be caused by partial digestion of the NlaIII enzyme, in which case the more abundant and the less abundant tag for the same transcript would be found at an approximately fixed ratio. In addition, the majority of tags had been identified before in LONG-SAGE libraries. Most likely, it is due to the use of alternative polyadenylation signals in the 3′-UTR. In addition, a small fraction may be explained by alternative cleavage site selection from the same polyadenylation site (13). The observed 47% alternative polyadenylation is much higher than the 29% estimated previously based on EST sequences (14). We note that the actual incidence may yet be higher, because 3′-ends downstream of the annotated ENSEMBL transcripts are not mapped to the transcript, and alternative polyadenylation sites with no CATG sites in-between are missed. On the other hand, we have only investigated the hippocampus, while this incidence may well vary between tissues.
By considering canonical and noncanonical tags with an abundance of >2 t.p.m., and employing the strand-specific nature of the sequencing reads obtained, we find evidence for bidirectional transcription in 51% of all detectable Unigene clusters. While confirming earlier observations of bidirectional transcription in the majority of genes (15–19), our results show that the antisense transcripts are also expressed at substantial levels. Although in most cases the sense transcripts have higher abundance than the antisense transcripts, the opposite is true in 11% of the cases (Supplementary Figure 4). The on-the-bead cDNA synthesis, together with the absence of a correlation between the abundance of sense and antisense transcripts (i.e. antisense tags are generally not more prominent in highly abundant transcripts), almost excludes the possibility that the antisense tags are found due to reverse transcriptase artifacts, as suggested previously (20).
As a first indication for subtle, yet significant differential gene expression between the two groups of mice, the intragroup Pearson correlations (among wild-type or transgenic samples) were higher (0.96) than the correlations between samples from the different experimental groups (0.93) (P-value: 0.056, permutation test, Supplementary Table 2). A Fisher or similar 2 × 2 contingency table statistical test has previously been used to identify tags with significantly different abundance in two pooled SAGE libraries (21). In such experiments, biological variation between samples is not addressed. Our sequencing of the pooled samples clearly demonstrates the hazards of pooling. Table 1 shows tags that were highly significant in the pooled experiment (based on Fisher's test), and not significant when analyzing the individual samples (Student's t-test). Clearly, these tags originate from wild-type sample 1 only. Significant expression of the Mup1 transcript in wild-type sample 1 only was confirmed by qRT-PCR (Supplementary Table 3). Detailed study shows that all these transcripts are highly expressed in blood. Blood contamination of one of the samples, not noted during the tissue dissection procedure, thus leads to the false-positive identification of several differential transcripts in the pooled experiment. While sequencing of pooled SAGE libraries was previously the only option, it is now both advisable and affordable to sequence individual samples.
Since we sequenced multiple libraries from individual samples, we can estimate within- and between-group variation. Initially, we used a Student's t-test, which takes into account both types of variation, to identify genes differentially expressed between the two groups. In doing this, however, we found some important flaws of classic statistics: a t-test can only be applied in a meaningful sense after normalization for the total number of tags in the library and proper variance stabilization. We did this by linear scaling based on the total number of counts and subsequent square root transformation. The square root transformation (approximately) stabilizes the variance of raw counts but not of the scaled data. Hence, we cannot stabilize variance while normalizing for library size at the same time (22). This problem is particularly prominent in our experiment where one wild-type and one transgenic sample had, respectively, 3 and 10 times lower numbers of counts than the other samples. Vencio et al. (10) proposed a Bayesian method for the analysis of replicated SAGE data that takes into account stochastic effects for the low abundant genes as well as differences in library size. It reports the Bayesian error rate which can be interpreted as the chance that a gene is found differentially expressed under the null hypothesis. With a Bayesian error rate of 0.05, we detected 1559 up- and 1620 downregulated canonical tags in the comparison of transgenic with wild-type mice. The distribution of the detected fold-changes can be inferred from the Volcano plot in Figure 2 and ranged between 71-fold (2700089E24Rik, found only once in all wild-type samples, but 19 times in transgenic samples) and 1.13-fold. Differentially expressed tags were found in the entire range of expression levels (Supplementary Figure 5). A list of the 20 most significant tags is given in Table 2. Vencio's test does not consider multiple testing. To estimate the number of false positives obtained, we calculated Bayesian error rates when permuting the samples (Supplementary Figure 6). The number of tags found differentially expressed with an error rate of 0.05 in the permutated situations was 270 ± 103. Thus, the false discovery rate in our list of 3179 differentially expressed genes is estimated to be 8.5%.
In addition to differentially expressed canonical tags, we detected differential expression of 2479 noncanonical and 15 mitochondrial tags.
By alternative splicing the DCLK gene produces numerous proteins. Recent functional studies from a.o. knock-out mice strongly indicate involvement of the DCLK gene in several molecular pathways. Some are microtubule-associated proteins (23) that may regulate microtubule-guided transport of SNARE-protein containing synaptic vesicles (24), while the DCLK-short variant has Ca++/calmodulin-dependent protein kinase (CaMK) properties (8,25). In the current study, we evaluated which biological pathways were affected in the hippocampus by the expression of the DCLK-short isoform. The global test (11) was applied to the DGE data to identify the differential regulation of gene sets, as defined by the Gene Ontology consortium. Unlike commonly used overrepresentation tests or gene set enrichment analysis, this method uses the gene expression measurements of a particular set of genes, giving optimal power for small sample-size experiments and detection of gene sets where many genes display a small effect. The most significantly affected pathways are reported in Table 3. Strikingly, the CaMK pathway was the second most significant pathway. Disturbances in the expression of genes in the CaMK pathway are possibly a consequence of transcriptional feedback mechanisms. Also in line with the function of the DCLK gene, we find indications for disturbed synaptic vesicle transport along microtubules as a result of alterations in gene expression of vesicle SNARE proteins (rank 19) and microtubule plus-end binding proteins (rank 1), potentially affecting neurotransmitter release and axonal outgrowth.
Before the development of deep sequencing technology, construction of a large-scale SAGE library containing up to 100 000 canonical tags would typically take 1 year and a considerable financial investment. The number of tags in such a library is 60 times smaller than the number of tags we obtained for each group of samples in a 3-day experiment. To illustrate the effect of the increased sequencing depth, we have compared our results to the results from a simulated SAGE experiment, which includes only 1/60 of the original number of DGE reads, randomly taken. The number of detected differentially expressed genes decreased 15-fold, from 3179 with the original number of reads to 200 in the simulated SAGE experiment (Bayesian error rate <0.05). The lowest abundance of a significantly detected differentially expressed transcript was 0.8 t.p.m. in our deep sequencing experiment versus 91 t.p.m. in the simulated SAGE experiment. As noted before (26), many of the genes with most significant changes in expression are low-abundant genes and would not have been identified in a typical SAGE experiment.
The exact same RNA samples had been analyzed previously by five different whole-genome expression microarray platforms: Applied Biosystems, Affymetrix, Agilent, Illumina and home-spotted oligonucleotide arrays (9). We compared the results from DGE and the microarray experiments after mapping all canonical tags and microarray probes to the ENSEMBL transcript database. With DGE, we detected 15 189 ENSEMBL transcripts with abundances >2 t.p.m. With most microarray platforms, a lower number of transcripts gave signal above background, except for Agilent, where there may have been considerable background signal caused by cross-hybridization (Table 4). Affymetrix had the highest percentage of transcripts in common with DGE. In general, less abundant transcripts were more difficult to detect with microarrays. The median expression of 538 transcripts detected by DGE but not by any of the microarray platforms had a median abundance of only 4 t.p.m., while the transcripts that were detected by all platforms had a median abundance of 106 t.p.m.
Figure 3 shows the correlation between absolute transcript abundance and microarray probe intensity. In line with other reports (27–29), we observed a reasonable correlation between the intensity of the microarray hybridization signal and the number of tags sequenced. The correlation was highest for Affymetrix chips (Pearson correlation: 0.63). For the Affymetrix data, intensities of the 11 perfectly matched probes were summarized into a single value. Indeed, the use of 11 different probes per transcript, in contrast to a single probe per transcript for the other platforms, should even out probe-specific hybridization characteristics. The correlation in detected transcripts was higher than previously found for SAGE or MPSS versus Affymetrix (30,31), mainly due to the higher number of tags sequenced with DGE.
Technical replicate measurements were used to compare the precision of DGE with that of microarrays. As a measure for precision we determined the distribution of the differences between independent replicate measurements of log ratios between wild-type and transgenic samples, as proposed by Irizarry et al. (1). Figure 4A shows the distribution of these differences for DGE and the two microarray platforms with highest and lowest precision (Agilent and home-spotted oligonucleotide arrays, respectively). The distribution of DGE is narrower (interquartile range (IQR): 0.51) than that of Agilent (IQR: 0.61) and home-spotted arrays (IQR: 0.75), indicating that DGE has a higher precision than microarrays.
With DGE we found a much wider distribution of fold-changes between the closely correlated groups of mice than for the microarray platforms, where the highest fold-change measured was 2. By DGE, we observed 1491 significantly differentially expressed tags (error rate <0.05) with an absolute fold change >2 (Figure 2). The only three genes that were significant on all microarray platforms and confirmed by qRT-PCR, Plac9, D14Ertd449e and Gabra2, were also significant in DGE (Bayesian error rates of 2.0·10−48, 3.5·10−47 and 3.9·10−12, respectively). For the comparison between DGE and qPCR, we selected 29 significant genes from the DGE experiments (randomly chosen and covering the entire range of significance (Bayesian error rates between error rates between 1·10−47 and 0.05) and fold-changes), and 33 genes significant genes from the microarray analyses (9). Results are given in Supplementary Table 3 and displayed in Figure 4B. The fold-changes obtained by DGE were generally also more extreme than those obtained by quantitative PCR, as is evident from the slopes of the curve. Out of 62 genes assayed, 43 demonstrated a concordant direction of change for DGE and qPCR, but only five were significant according to both technologies.
We made a more general comparison of the lists of differentially expressed genes from the DGE and microarray experiments. Differential gene expression for DGE was established with Vencio's algorithm as described above (estimated FDR 8.5%) and for microarrays with the Empirical Bayes model LIMMA (32) (estimated FDR 10%). Complete results on correspondence between DGE tag counts and microarrays are reported in Table 5. The biggest overlap was found with the Affymetrix platform (P = 1.2·10−5; chi-square test): 31 transcripts were significant on both platforms with expression changes in the same direction. Also when assessing the correlation of the expression between transgenic and wild-type mice, Affymetrix chips were found to correlate better with DGE (Pearson correlation: 0.25) than the other microarray platforms (Supplementary Figure 7). The number of differentially expressed transcripts by DGE was closest to the number detected with the Agilent platform (2414 and 2710). However, the overlap between these transcripts was hardly greater than would be expected by chance and there was little correspondence in the direction of change.
Deep sequencing is a powerful technique for the identification of differentially expressed transcripts. The large sequencing depth clearly boosts the detection of differential expression of low-abundant transcripts that are well beyond the reach of classical SAGE. The sequencing depth of the Solexa/Illumina DGE technology compares favorably to the earlier MPSS system from Lynx Therapeutics [7·105 sequences per run (7)] and Roche [454 sequencer, 3·105 sequences per run (33)], and is comparable to the polony multiplex analysis of gene expression (34).
Instead of sequencing SAGE tags, some recently published papers now describe the use of random shotgun RNA sequencing (RNASeq) (27–29,35–38). This overcomes the limitations of tag-based methods in the detection of transcripts alternative splicing in regions remote from the 3′-end, and enables detection of allele-specific transcription. With the continuously increasing number of reads at reduced costs, RNASeq will become affordable for standard differential gene expression analysis. However, at the present throughput it is favorable to use methods that provide a specific tag for each transcript, when the aim is to detect subtle expression differences in larger group of samples: We demonstrate that ~2 million tags are required to reliably detect low abundant genes with DGE, whereas RNASeq requires at least 20 million tags per sample to obtain reasonable coverage of most transcripts (29,36).
We have implemented a dedicated Bayesian method to identify genes that are significantly differentially expressed between two groups of biological replicates. In most previously published reports analyzing differential gene expression in count-based data, the statistical tests applied did not account for within-group variation (28,34). We illustrated the importance of proper estimation of within- and between-group variation by showing that classical tests lead to the identification of false-positive genes due to the presence of a single blood contaminated sample. In an earlier deep sequencing report (27), in analogy to microarray data analysis, quantile normalization and a moderated t-statistic as implemented in the R package Limma were used to find differentially expressed genes. We believe that our method is better suited for the comparison of independent sequence libraries because one of the intrinsic properties of the test is that it puts more weight on samples which were sequenced at greater sequencing depth.
The availability of biological replicate measurements allowed us to use the global test (11), which takes into account the expression levels in individual samples, for the detection of disturbances in several biological pathways. Several of the identified pathways were highly relevant given the function of the DCLK1 protein (8,23–25). These pathways had not been identified by any of the microarrays using the same statistical test (9).
Our results demonstrate many advantages of DGE over expression microarray technology: (i) DGE gives an unbiased view of the transcriptome, not limited by predictions of expressed transcripts used to determine array content; (ii) DGE detects high levels of differential polyadenylation and antisense transcription, which are not detectable with standard microarrays; (iii) DGE data are more precise than microarray data; (iv) DGE data analysis requires a lower number of preprocessing steps (like background correction and normalization), which facilitates interlaboratory comparisons; (v) interlaboratory comparability of DGE data is high, probably due to the avoidance of hybridization processes, which are notoriously difficult to standardize (1); and (vi) DGE is more sensitive in the detection of low-abundant transcripts and of small changes in gene expression. This is probably due to the absence of background signal and saturation effects, which are major causes of ratio compression on microarrays (39). Some of these advantages have already been discussed in older literature comparing tag-based methods (SAGE, MPSS) and microarray data (2,26,30,31,40–45). The higher sequencing depth of DGE and the avoidance of laborious cloning steps add to the presumably superior precision and accuracy of DGE over these older methods, in particular when low-abundant transcripts are considered, and makes DGE a much more practical technique.
The correlation between DGE and microarrays and between DGE and qPCR assays was clear but modest. In accordance to what has been previously reported in comparisons between SAGE or MPSS and microarrays (31,40), the correlation between tag-based methods and microarrays was particularly poor for low-abundant transcripts. An important reason for the relatively low correlation across different technologies is the great similarity between our two sample sets. The resulting small differences in gene expression are difficult to pick up with microarrays, as also shown in the inter-microarray comparison of the same samples published recently (9), and also with qPCR assays. In samples with larger differences in gene expression, like the samples analyzed by the MAQC consortium (46), the correlation is likely higher. We believe that, apart from differences in sensitivity, an important reason is that the different platforms detect different transcripts. The microarray probes and qPCR assays detect, in many cases, a mix of different transcripts (1), where DGE can discriminate between transcripts with different 3′-ends; standard qPCR assays will detect cumulative presence of sense and antisense transcripts. Indeed, when all DGE tags behave similarly, as with the Gabra2 gene where we find 6 tags with an ~2.5-fold decrease in the DCLK mice (four from the sense and two from the antisense strand, see Supplementary Table 4), DGE results are consistent with all microarray platforms and qPCR (see Supplementary Figure 8). In many other cases, there will be no co-regulation between alternatively spliced transcripts or sense and antisense transcripts, which, especially in low-abundance situations, will result in poor correlation with microarrays and qPCR. In addition to the limited overlap in transcripts detected by both DGE and microarrays, many transcripts are detected only by one or a few of the platforms. For DGE, missing data for some transcripts are likely attributable to the absence of a CATG or a unique tag sequence (estimated frequency: 1% of murine RefSeq RNAs); for microarrays this is due to inadequate probe design. We also noted that there was a higher consistency between the fold-changes obtained by qPCR and microarrays when compared to those obtained by DGE. Apart from the explanations mentioned above, this is likely attributable to the fact that DGE measures absolute expression levels and DGE data are Poisson distributed (47), while qPCR and microarrays provide relative expression levels, which are log normal distributed.
Our finding that DGE results were more consistent with Affymetrix results than with other microarray platforms is consistent with an earlier study (31,42), in which MPSS results correlated better with Affymetrix than with other arrays. We think this lies in the use of multiple probes per gene, which should even out most probe-specific effects. Sequence biases in the different technologies have been described before. Comparative analysis of SAGE and microarrays shows that the GC content of microarray probes is important for detection sensitivity and for the correlation across technologies (26,30,41,43–45). We investigated GC bias in the DGE tags. The overall GC percentage observed in our tags is 42%. This is lower than for classical SAGE or MPSS (44) and better reflects the relatively low GC content of 3'-UTRs (48). By ranking the tags from high to low abundance, we find a higher percentage of Ts in the higher abundant tags (Supplementary Figure 9). This supports an earlier observation that highly expressed genes contain more T-rich 3′-UTRs than lowly expressed genes (48). Thus, the GC bias in DGE seems to be limited, but needs further investigation, also in the light of a recently published study where considerable overrepresentation of GC-rich sequences was observed in Solexa/Illumina-based resequencing experiments (49).
We foresee that further enhancements in sequencing depth will yet improve accuracy, in particular for low-abundant transcripts. Whole transcript sequencing (RNAseq) is another step forward. These advances, in combination with the currently achieved improvements in sensitivity, resolution and, notably, interlaboratory consistency, will tremendously boost the field of expression profiling. Multicenter biobanking and rare disease studies, where biological materials are scarce and widely spread and legal and logistical limitations may impede sharing of source materials, would gain enormously from better possibilities for robust post hoc integration of results. Also basic research and comparative genomics fields, which have been held back by extensive and lengthy standardization issues, will greatly benefit from the major improvement of data portability.
Supplementary Data are available at NAR Online.
The Netherlands Genomics Initiative/Netherlands Organization for Scientific Research (NGI/NWO); a VENI-grant from the Dutch Organization for Scientific Research (NWO grant 2005/03808/ALW to P.A.C.’tH.).
Conflict of interest statement. None declared.
We would like to thank Irina Khrebtukova and Gary Schroth (Illumina Inc., Hayward,CA) for sample preparation protocols, analysis of the pooled samples and assistance with data analysis. Michiel van Galen and Mattias den Hollander (LUMC and Hogeschool Leiden) are acknowledged for skillful assistance in bioinformatics. We would like to thank Jelle Goeman for helpful comments and Prof. Silvère van der Maarel and Prof. Rune Frants for critically reading the manuscript.