We set out to evaluate technical, biological, and experimental variation in gene expression measured by mRNA-Seq counts from n

=

25 subjects. Technical variation in these data were in keeping with Poisson distributional assumptions in general as has been reported by some [

3-

5], while others have reported greater than Poisson variation for low counts [

18]. The biological variance across all observed genes was over-dispersed relative to the mean, increasing as a quadratic function of the mean. This has important implications for study planning in terms of power and sample size estimation as well as choice of statistical modeling procedures to detect differential expression. Differential expression results were not included here since the focus is on the mean-variance relationship.

Over-dispersion, i.e., variance larger than the mean, is a common phenomenon in observed count data [

7]. Others have suggested that over-dispersion in mRNA-Seq data will likely be observed in mRNA-Seq biological replicates and have suggested models relying on OD Poisson or NB assumptions to account for this [

5,

6,

13,

16,

19]. There was clear evidence of over-dispersion from the standard Poisson distribution in our data. A quadratic function described the mean-variance relationship across all genes better than a linear function, which is consistent with the NB family of distributions. The interpretation of the NB distribution as counts from several within-subject Poisson distributions, each subject having their own mean parameter makes biological sense as well in the study at hand. In μ

+

μ

^{2}, the square root of

corresponds to the subject-to-subject CV of the Gamma distributed Poisson mean parameters. The CV corresponding to the global estimate of

for the data presented herein is 36%. We have observed a CV of approximately 9% for technical variation in genetically similar rats and 22% for biological replicates in genetically similar rats (data not shown). This is in keeping with the expectation that biological variation is larger than technical variation, and human variation is larger than genetically inbred animal model variation.

There are several potential sources of experimental variation in these studies including PCR library preparation batch, flow cell, lane, and for large studies, machine and software upgrades. Some effects can be addressed through normalization while others must be modeled directly. We found that normalization strategy had little impact on GOF statistics. Bullard et al. [

5] reported similar results and found that the 75% count normalization resulted in nearly no bias and the best sensitivity and specificity without increasing variance. Our data was similar to theirs in that a small percent of the genes contributed a large portion of the counts. Thus, we chose to use the 75% count as the normalization scale in our models. Others have evaluated normalization strategies beyond this [

17,

20].

Bullard et al. found experimental variation due to library PCR preparation batch to be greater than that due to flow cell [

5]. We found that inclusion of library PCR batch, lane-pair and flow cell each resulted in a dramatic reduction in the large GOF statistics resulting in the GOF statistics being too small (i.e., less than expected) in the majority of genes. The GOF statistics were reduced to the extent that, in this particular experiment with only one subject per treatment per flow cell, we believe these terms should not be included in the differential expression models in order to avoid over-fitting. Note that this experiment is an incomplete block design with respect to subject, in which case there is not complete recovery of all effects. The discussion of incomplete recovery of effects has spanned decades in the statistical literature. The classic text by Scheffe points to the use of random effects as a compromise between the extremes of not including the effects or including them as fixed effects [

21]. Since estimation in a NB model of both the dispersion parameter and random effects is not straightforward, the experimental effects were modeled as fixed effects for the purposes of this investigation.

Estimation of the over-dispersion parameter has been the subject of much research; see for example the summary in [

15]. While the edgeR local estimates of over-dispersion resulted in the most reasonable GOF distributions, the results herein point to the need for further investigation in the estimation of variance for these data. Biologically it makes sense for the variance to differ across genes, and that the mean-variance relationship may vary across genes as well. For example, some systems in the body such as pH are very tightly controlled with dramatic consequences for deviations from the baseline normal levels. Other systems in the body such as cholesterol are less tightly controlled, and allowed to vary widely before consequences are observed. However, with small sample sizes it is likely too difficult to estimate the mean-variance relationship accurately and precisely on a per-gene basis. Local or per-gene estimates of variance are common practice in gene expression microarray and shotgun proteomic literature [

22-

24]. Anders and Huber recently presented data from two pools of fruit fly embryos in each of two study groups that agree with this assumption and they described a model for variation based on NB distributional assumptions with a “shot noise” portion and a “raw variance” portion [

6]. Alternatively, the fact that these model fits result in smaller than expected GOF statistics for a majority of genes may point to the need for a linear variance function for some genes and quadratic variance function for others due to biological phenomenon such as alternative splicing. Indeed, such a modeling strategy has been proposed [

25]. It is difficult to compare per-gene appropriateness of OD Poisson and NB distributional assumptions due to the poor behavior of the residual deviance statistic and the fact that estimation of the OD Poisson dispersion parameter is a function of the Pearson statistic. Application of a power transformation to the data in order to use Poisson modeling assumptions has recently been proposed [

17]. We expect continual improvement in informatics mapping algorithms to result in decreased variance; e.g. including reads that map to multiple locations on the genome has been shown to improve technical reproducibility [

26]. Robust GLMs based on M-estimators may be an attractive alternative as well. A Gaussian approximation to the Poisson distribution is said to be valid when the mean is ≥10 [

27]; thus linear models which account for the mean-variance relationship, e.g. via weighted least squares, may be an attractive alternative since model fitting and assessment of model fit is more straightforward than for GLMs.

Our study had several strengths and limitations. We reported on technical and biological variation in mRNA-Seq data from a relatively large sample set and provide valuable insight into biological variation that will be useful to many researchers. The data herein were from true human biological replicates rather than contrived cell line replicates. We expect that the conclusion of a quadratic mean-variance relationship for biological replication will extend to other experimental settings. However, the precise estimates of over-dispersion are expected to be study-specific; it is plausible they would be smaller in inbred mouse models, smaller yet in cell lines. The present study was not designed to assess bias in estimating gene expression, fold-changes or sensitivity and specificity of fold-change detection. While the modeling strategy utilized does not account for the degrees of freedom used in estimating the dispersion parameter, thousands of data points are used in this estimation, so the estimate should be very stable and precise [

14]. Unfortunately there was a software upgrade mid-way through the study. Technology in the field of next generation sequencing is progressing at a rapid pace; indeed, data from third generation sequencing is already available. This rapid pace makes upgrades, whether in software or hardware, difficult to avoid in large studies making it important to utilize randomization and blocking when determining assay processing order.