We have compared differential expression results based on our three proposed within-lane GC-content normalization methods and the CQN method of Hansen et al
], on the MAQC and Yeast datasets. Only full-quantile GC-content normalization appears to effectively remove the dependence of the proportion of DE genes on GC-content. This could mean either that, for some biological reason, GC-richer genes are more likely to be truly DE (in which case normalization erroneously removes this dependence) or that GC-content bias is so strong that an aggressive normalization method is needed. Since we are not aware of any plausible biological explanation for the dependence of DE results on GC-content, we believe that the MAQC and Yeast data require a full-quantile approach and that merely regressing counts on GC-content is not sufficient to completely remove the bias.
To rule out a biological reason for the dependence of DE on GC-content, we compared UHR vs. Brain DE results based on the MAQC-2 RNA-Seq data to those based on Affymetrix chip data [23
]. Figure clearly indicates that the dependence of p
-values on GC-content is technology-specific, i.e., unlike RNA-Seq p
-values, microarray p
-values do not depend on GC-content. Full-quantile within-lane normalization reduces the dependence of p
-values on GC-content. Interestingly, and encouragingly, the much smaller p
-values for the RNA-Seq data suggest that this newer assay is more powerful than microarrays for DE analysis (although it is unclear how to relate numbers of lanes and numbers of chips in terms of sample size).
Another well-known selection bias in RNA-Seq is due to gene length [3
]. For the Yeast dataset, we noticed only a minor length effect on read counts and DE results (Figure S17, Panel (a)). In fact, genes with high GC-content tend to be shorter, so there seems to be a compensation effect due to sequence composition (data not shown). Mappability does not appear to affect read counts for this dataset (Figure S17, Panel (b)). Gene length bias for the MAQC datasets is discussed in [3
In addition to their good performance noted above, our proposed normalization methods offer a number of advantages. They are very simple to implement and extend and lead to DE results that are robust to tuning parameters such as the number of GC-content bins (Figure S16). They could be applied to other genomic regions (e.g., exons), either "from scratch" or by retaining the scaling from a previous gene-level normalization. They can easily be adapted to incorporate other sequence features such as gene length and mappability. Note, however, that in the process of adjusting for GC-content one may already be adjusting indirectly for other covariates such as length. Controls (e.g., housekeeping genes, spiked-in sequences) could also be included.
Our normalization procedures return genes-by-lanes tables of normalized counts, on the original unlogged scale and rounded to the nearest integer. Some authors have argued that it is better to leave the count data unchanged to preserve their sampling properties and instead use an offset
for normalization purposes in the statistical model for read counts [21
]. It is out of the scope of this article to discuss whether it is preferable to normalize counts prior to modeling or to perform normalization within the model. Nevertheless, it is worth noting that our normalization approaches can easily be modified to produce an offset, by considering the difference between normalized and unnormalized counts, in a manner similar to Hansen et al
]. The EDASeq package implements both strategies, i.e., its normalization functions can return either a table of normalized counts or a table of offsets.
We identified differentially expressed genes using a likelihood ratio test based on a negative binomial model for read counts. For the MAQC datasets, Bullard et al
] found that it was appropriate to model read counts using the Poisson distribution (negative binomial distribution with null dispersion parameter). For the Yeast dataset, substantial over-dispersion remains after both within and between-lane normalization (Figures S3 and S18), which precludes relying on the Poisson distribution. Over-dispersion is greatly reduced by between-lane normalization and much less so by within-lane GC-content normalization. The four within-lane normalization procedures seem to have similar impact on the mean-variance relationship (with slightly smaller variances for CQN), so that DE results do not appear to be driven by differences in dispersion estimates for the different procedures. Furthermore, for the Yeast dataset, goodness-of-fit analysis suggests that a negative binomial model with common dispersion parameter for the ensemble of genes is sufficient to capture the over-dispersion present in the counts (data not shown). Virtually identical results were obtained for three over-dispersion scenarios implemented in edgeR: tagwise, trended, and common dispersion. Note that the violation of Type I error control for the Yeast pseudo-datasets is actually not as serious as it might seem at first. Indeed, the largest deviations correspond to culture/library preparation effects (worst-case scenario of Figure ) and nominal and actual Type I error rates are close for most pseudo-datasets (Figure S12). A detailed evaluation of read count models and DE methods is out-of-scope here, since our aim is to compare normalization approaches for a given DE method.
There are two different types of GC-content effects. The first effect is to act as a proxy for sample size
, in a similar manner as length, and relates to power
: as GC-content increases, read counts first increase then decrease, and evidence in favor of DE increases. If the effect was not sample-specific and simply a proxy for sample size, one would expect no dependence of expression fold-changes on GC-content and the effect on p
-values to be due to the dependence of the variance on GC-content (a simple calculation can be done in the case of length and assuming counts are roughly proportional to the product of gene length and expression level). One could therefore argue that it is not justified after all to normalize for GC-content and, in particular, that FQ normalization is too aggressive. Indeed, as seen in Figure , within-lane normalization methods not only remove the dependence of fold-changes on GC-content, but also tend to reduce the spread of fold-changes at high GC-content (especially for FQ). This results in an overall decrease in the proportion of genes declared DE (Figures and ). Other approaches which account for GC-content could be based on standardized p
-values, i.e., p
-values that explicitly account for sample size [34
]. A rule-of-thumb for standardizing a p
based on a sample size of n
to sample size 100 is
. In lieu of the sample size n
, one could use gene length or GC-content. The second and more insidious effect, however, is sample-specific
and hence biases
fold-changes and the resulting DE statistics (likelihood ratio statistics and p
-values). In particular, the standardized p
-value approach does not address the sample-specificity (and complexity) of the GC-content effect and would still lead to biased DE results. Likewise for methods that correct for the GC-content bias after performing DE tests, e.g., in a fashion similar to that proposed in Young et al
] for gene length bias in context of Gene Ontology analysis. We therefore find it preferable to adjust for GC-content prior to statistical modeling and DE analysis. The value of performing a within-lane GC-content normalization before combining/comparing counts between lanes is further supported by Figure , which shows that p
-values based on microarray data do not vary with GC-content and hence suggests that the GC-content effect is a technology-related artifact. Of the normalization procedures we considered, full-quantile normalization seems most effective at removing the dependence of DE results on GC-content. However, results may vary in a dataset-specific manner and less aggressive approaches, such as loess or median normalization, may be robust alternatives. In the absence of controls, we recommend a thorough exploration of the data before choosing an appropriate normalization. In summary, there is a trade-off
between bias removal and power: without within-lane GC-content normalization, fold-changes are biased, however normalization may mask true DE.
GC-content bias is even more of an issue when comparing read counts between species, e.g., allele-specific expression in diploid hybrid of S. bayanus
and S. cerevisiae
]. We are considering extensions of our methods to address GC-content bias for between-species, within-gene DE analyses.
It would also be interesting to consider adaptations of our methods to other sequencing assays, such as ChIP-Seq and DNA-Seq.
Finally, as with microarrays, positive and negative controls
(e.g., housekeeping genes, spiked-in sequences) are essential for conclusive validation and comparison of any inference method, e.g., in terms of bias, variance, Type I error, and power. Controls could also be incorporated as "anchors" within the normalization procedure itself [35
]. The use of controls from the External RNA Control Consortium (ERCC) in the recent article of Jiang et al
] is an encouraging step in this direction.