Our main novel finding is the extent to which normalization affects differential expression results: sensitivity varies more between normalization procedures, than between test statistics. Although the standard total-count normalization results in Poisson variation across replicate lanes, it has poor detection sensitivity when benchmarked against qRT-PCR. Instead, we propose scaling gene counts by a quantile of the gene-count distribution (the upper-quartile) and show that such normalization improves sensitivity without loss of specificity.
An important aspect of the MAQC datasets, which could have an impact on the interpretation of the analyses presented, is the large difference in gene expression between Brain and UHR. Often, gene expression analyses consider much more closely related sets of samples, with only relatively few genes expected to be differentially expressed. In the comparison of Brain and UHR, by contrast, only 5-30% of genes examined by qRT-PCR were deemed as non
-differentially expressed (depending on the choice of the multiple testing procedure used to correct the p
-values). Indeed, there may be no truly non-DE genes queried by the qRT-PCR experiment, but rather, very small differences in expression for every gene. This creates a possibility for errors when specifying a set of true negatives; we have tried to control for this by a careful and stringent definition of true negatives and by evaluating the effect of changes in this definition (see [Additional file 2
: Supplemental Section S6]).
Furthermore, the extreme difference in transcriptional profiles between the Brain and UHR samples means that the p
-values from the sequencing experiment are smaller than would be expected if all the genes were truly non-DE. In particular, the p
-values for non-DE genes (according to qRT-PCR) do not follow the expected uniform distribution, but are noticeably shifted toward zero ([Additional file 1
: Supplemental Figure S10]). The microarray data demonstrate the same behavior ([Additional file 1
: Supplemental Figure S10]), suggesting it is caused by the samples under consideration and not by inherent problems of the statistical methods. In contrast to the qRT-PCR tests for differential expression, the tests applied to sequencing data take into account the total number of reads mapping to each gene and, as a result, tend to have greater power for longer genes.
Another possible critique is that the improvement of UQ over total-count normalization is due to this normalizations more closely matching the normalization
procedure used with the qRT-PCR data rather than proper reflection of actual biological differences; in other words, UQ normalization might be closely matching the effect of dividing by POLR2A, as is done with the qRT-PCR data but not the underlying biology. Indeed, additional scaling of the microarray data by POLR2A slightly improves the ROC compared to the standard microarray quantile normalization ([Additional file 1
: Supplemental Figure S8b]). It is more likely, however, that total-count normalization, with its reliance on high-count genes, poorly reflects biological differences. This can be seen by taking a closer look at the POLR2A gene, which was chosen as a reference for qRT-PCR data because of its very similar expression in UHR and Brain across many qRT-PCR replicates [13
]: the UHR to Brain fold-change of POLR2A is estimated as 1.3 for total-count normalization in contrast to 0.97 for upper-quartile normalization and 0.90 for microarray data.
In regards to DE test statistics, the GLM-based likelihood ratio statistics and Fisher's exact statistics perform equally well in terms of sensitivity and handling of low-count genes. We find likelihood ratio tests appealing because of their generality. Indeed, using the GLM framework, one can adjust for potential confounding variables, including quantitative covariates, e.g., age of sample, as well as accommodate different count distributions (negative binomial in cases of over-dispersion).
A serious concern with all the DE methods considered here is the inherent dependence of power on read count, which in turn is related to both gene expression level and length. As most DE studies produce gene-lists, which are often then related to functional annotation (e.g., GO), it is undesirable for significance values to be driven by features such as length. A weighted analysis based on gene length might lead to a reasonable length-independent ranking of genes, that would allow short genes with large effects to gain in significance compared to long genes with small effects.
We find that technical variation is quite low across lanes and flow-cells and slightly larger across library preparations. In all cases, however, the effect on differential expression results is minimal. As noted above, the MAQC datasets are unusual, in that we expect extremely large differences in expression between Brain and UHR and only small library preparation effects because of the high quality of the RNA. In practice, library preparation effects may be closer in magnitude to biological effects.
We have demonstrated that while there are some differences between phi X and auto-calibration in the early stages of the analysis pipeline, the differences in terms of differential expression are small. Overall, auto-calibration seems advantageous, as it yields more balanced designs, frees up one lane per flow-cell, and produces a larger number of higher quality reads per lane.
The analysis conducted in this work, as well as others, is predicated on a "whole-gene" view of expression profiling. We evaluated technical effects, phi X calibration, and normalization methods using a very constrained UI gene definition. We limited ourselves to such a strict definition in order to ensure that the evaluation was not biased by alternative splicing or overlapping genes. Our UI gene definition is a gross over-simplification, as a large amount of biologically relevant information is lost; we exclude more than 50% of the reads which fall within Ensembl genes.
As high-throughput sequencing becomes more prevalent, our ability to precisely characterize the transcriptome of a sample will dramatically increase. More refined analyses, such as isoform-level expression, allele-specific expression, and genome annotation (segmentation), involve comparing distinct regions within a sample as opposed to the same region across samples. Such analyses will require an understanding of the effect of sequence composition on base coverage to account for the heterogeneity of base-level count distributions