RNA sequencing has generated much excitement because of the advantages it offers over microarrays. This excitement has led to a barrage of publications discounting the importance of biological variability, as microarray publications did in the 1990s. By comparing microarray and sequencing data, we demonstrate that expression measurements exhibit biological variability across individuals irrespective of measurement technology. Our analysis suggests that RNA-sequencing experiments designed to estimate biological variability are more likely to produce reproducible results.
RNA sequencing (RNAseq) technology provides various advantages over microarrays. For example, it is possible to measure alternative transcription1 or measure transcription for non-coding regions2 de novo. Another potential advantage is low technical variation2-4. This has led to rapid adoption of the technology and a recent surge of publications5. However, the euphoria has led many of these publications to discount the influence of biological variability, forgetting perhaps that unwanted variability in gene expression measurements is not due only to measurement error. Gene expression is a stochastic process6 and is known to vary between units considered to be of the same population, for example in samples from a specific healthy tissue across individuals7. In a typical experiment, variation in gene expression measurements can be decomposed8 as:
Var(Expr) = Across-Group Variability + Measurement Error + Biological Variability
Group variability is the variation in gene expression due to the groups under consideration in an experiment. For example, it is well known that gene expression profiles for tumor samples differ from expression profiles for matched healthy controls9. This type of variability can be measured by comparing samples from different biological groups and is typically the outcome of interest. The second component of gene expression variation, measurement error, can be estimated with technical replicates – different aliquots of the same sample measured with a technology multiple times. This is the type of variation that may be reduced with technology improvements4. Well-known sources of technical variability in both sequencing and microarray studies are laboratory10, 11 and batch12 effects. The third component of expression variation is true biological variability, which can only be measured by considering expression measurements taken from multiple biological samples within the same group. Regardless of the technology used to measure expression levels, the true gene expression levels will vary among individuals, because expression is inherently a stochastic process6. In an experiment where the group comparison is of primary interest, both measurement error and biological variation may be confused with the outcome of interest: the estimated difference in expression between groups.
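The three components described above can be illustrated with a small simulation. This is only a sketch and is not taken from the study: the standard deviations, the group means and the function name `simulate_gene` are assumptions chosen for illustration. Each log-scale expression value is drawn as a group mean plus a per-individual biological deviation plus a per-measurement error. Technical replicates of one sample then recover only the measurement-error variance, biological replicates within one group recover the sum of biological and measurement-error variance, and pooling two groups adds the across-group component.

```python
import random
import statistics

def simulate_gene(n_individuals, n_tech_reps, group_mean, sd_bio, sd_tech, rng):
    """Return one list of technical replicates per simulated individual."""
    data = []
    for _ in range(n_individuals):
        # True expression level of this individual: group mean + biological deviation
        true_level = group_mean + rng.gauss(0.0, sd_bio)
        # Each measurement adds independent measurement error
        data.append([true_level + rng.gauss(0.0, sd_tech) for _ in range(n_tech_reps)])
    return data

rng = random.Random(0)
SD_BIO, SD_TECH = 0.5, 0.1   # assumed values, for illustration only
MEAN_A, MEAN_B = 8.0, 9.0    # assumed group means (the group effect)

# Many technical replicates of ONE individual: only measurement error varies
tech_reps = simulate_gene(1, 2000, MEAN_A, SD_BIO, SD_TECH, rng)[0]
var_tech = statistics.variance(tech_reps)

# One measurement on each of MANY individuals in the same group:
# biological variability and measurement error both contribute
group_a = [reps[0] for reps in simulate_gene(2000, 1, MEAN_A, SD_BIO, SD_TECH, rng)]
group_b = [reps[0] for reps in simulate_gene(2000, 1, MEAN_B, SD_BIO, SD_TECH, rng)]
var_within = statistics.variance(group_a)

# Pooling both groups adds the across-group component as well
var_pooled = statistics.variance(group_a + group_b)

print(f"technical replicates only : {var_tech:.3f} (target {SD_TECH**2:.3f})")
print(f"within one group          : {var_within:.3f} (target {SD_BIO**2 + SD_TECH**2:.3f})")
print(f"both groups pooled        : {var_pooled:.3f}")
```

Under these assumed values the technical-replicate variance is far smaller than the within-group variance, so an experiment with technical replicates alone would badly understate the noise relevant to a group comparison.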
To illustrate how biological variability among individuals within the same group is not eliminated by sequencing technology, we collected public data from two of the only RNA-sequencing experiments with a large number of biological replicates, n=60 and n=69, respectively13, 14. We compared a subset of these sequencing data (n=43 and n=51 samples, respectively) with microarray data from two different platforms15, 16. In each comparison, the exact same cell lines were analyzed on both technologies. In study one, m=14,797 genes had expression measurements from both sequencing and microarrays on all samples. In study two, m=7,157 genes had expression measurements from both technologies on all samples (Supplementary Methods).
For each expressed gene in each of the two studies, we calculated an estimate of the variability in expression levels across individuals as measured with microarrays and sequencing (Supplementary Methods). We found that variability in expression for each gene was similar in microarray and sequencing technologies (Fig. 1a-b). The same trend existed for different choices of variability measures (Supplementary Fig. 1a-b) and for different methods of calculating expression from sequencing (Supplementary Fig. 1c-d). We also found that transcripts showed substantial differences in biological variability. For example, COX4NB was not strongly variable in either population while RASGRP1 was highly variable for both populations, again regardless of technology (Fig. 1c). The technical variability for both genes was substantially smaller than the total variability (Supplementary Fig. 2a). These results are consistent with biological variability being a property of gene expression itself, rather than of the technology used to measure expression. To confirm this result, we estimated the proportion of the total variability for each gene that is attributable to biology by applying a mixed effects model to data from the sequencing (11 samples) and microarray (14 samples) experiments for which we had two technical replicates. In general, most of the observed variation was biological rather than technical (Supplementary Fig. 2b).
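The idea of splitting a gene's variability into biological and technical parts using duplicate measurements can be sketched as follows. The exact mixed effects model is described in the Supplementary Methods and is not reproduced here; as a simpler stand-in, this sketch uses the method-of-moments estimator from a balanced one-way random-effects ANOVA. The function name `biological_fraction` and the toy values are hypothetical and are not data from the study.

```python
import statistics

def biological_fraction(replicates):
    """Estimate the share of variance that is biological for one gene.

    replicates: one list per individual, each holding the same number
    k >= 2 of technical replicate measurements (balanced design).
    """
    n = len(replicates)
    k = len(replicates[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in replicates):
        raise ValueError("need a balanced design with n >= 2 individuals and k >= 2 replicates")

    means = [statistics.fmean(r) for r in replicates]
    grand_mean = statistics.fmean(means)

    # Within-individual mean square: estimates the technical variance
    ms_within = sum((x - m) ** 2 for r, m in zip(replicates, means) for x in r) / (n * (k - 1))
    # Between-individual mean square: estimates technical + k * biological variance
    ms_between = k * sum((m - grand_mean) ** 2 for m in means) / (n - 1)

    var_tech = ms_within
    var_bio = max(0.0, (ms_between - ms_within) / k)  # truncate negative estimates at zero
    total = var_bio + var_tech
    return var_bio / total if total > 0 else float("nan")

# Hypothetical log-scale values: four individuals, two technical replicates each
toy_gene = [[8.1, 8.3], [9.0, 8.8], [7.2, 7.4], [8.6, 8.6]]
frac = biological_fraction(toy_gene)
print(f"estimated biological share of variance: {frac:.2f}")
```

In this toy example the replicate pairs agree closely while the individuals differ, so nearly all of the estimated variance is assigned to biology. A dedicated mixed-model routine would be preferable for unbalanced designs or when additional covariates are involved.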
Biological variability has important implications for the design, analysis and interpretation of RNA-sequencing experiments. For example, a large observed difference in expression of COX4NB between two groups is likely important, since the expression of this gene varies little across individuals. Meanwhile, that same difference in expression for RASGRP1 may be meaningless, since the expression for that gene is highly variable. If only a few biological replicates are available, it will be impossible to estimate the level of biological variability in expression for each gene in a study. Supplementary Table 1 summarizes a large number of published RNA-sequencing studies over the last three years. In every case, except for the two studies we analyzed here, conclusions were based on a small number (n ≤ 2) of biological replicates. One goal of RNA-sequencing studies may be simply to identify and catalog expression of new or alternative transcripts. However, all of these studies make broader biological statements on the basis of a very small set of biological replicates.
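The contrast between a stable and a variable gene can be made concrete with a Welch t-statistic. The values below are hypothetical log-scale expression levels invented for illustration; they are not measurements of COX4NB or RASGRP1. Both genes show the same mean difference between groups, but the statistic differs greatly because the within-group spread differs.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch t-statistic for the difference in means (group_b minus group_a)."""
    var_a = statistics.variance(group_a)  # needs at least two values per group
    var_b = statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.fmean(group_b) - statistics.fmean(group_a)) / se

# Hypothetical values: six individuals per group, same shift of 1.0 for both genes
stable_a = [8.0, 8.1, 7.9, 8.0, 8.05, 7.95]     # low biological variability
stable_b = [x + 1.0 for x in stable_a]
variable_a = [8.0, 9.5, 6.5, 8.8, 7.2, 8.0]     # high biological variability
variable_b = [x + 1.0 for x in variable_a]

t_stable = welch_t(stable_a, stable_b)
t_variable = welch_t(variable_a, variable_b)
print(f"stable gene   : t = {t_stable:.1f}")
print(f"variable gene : t = {t_variable:.1f}")
```

With these made-up numbers the stable gene yields a t-statistic of roughly 24 and the variable gene roughly 1.6, even though the observed difference is identical. With a single biological replicate per group the within-group variance cannot be computed at all (`statistics.variance` raises an error for fewer than two values), and with two it is estimated from one degree of freedom per group.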
Our analysis has two important implications for studies performed with a small number of biological replicates: (1) significant results in these studies may be due to biological variation and may not be reproducible and (2) it is impossible to know whether expression patterns are specific to the individuals in the study or are a characteristic of the study populations. These ideas are now widely accepted for microarray experiments, where a large number of biological replicates are now required to justify scientific conclusions. Our analysis suggests that since biological variability is a fundamental characteristic of gene expression, sequencing experiments should be subject to similar requirements.
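Implication (1) can be illustrated with a simulation in which no gene truly differs between groups. The sketch below is hypothetical: the standard deviations, the number of genes and the function name `false_positive_rate` are assumptions, and a simple z-test with a known noise level stands in for whatever test a given study might use. It compares a test that assumes only technical noise with one that assumes the full (biological plus technical) noise, using two biological replicates per group.

```python
import math
import random

def false_positive_rate(n_genes, n_per_group, sd_bio, sd_tech, assumed_sd, rng, z_crit=1.96):
    """Fraction of null genes (no true group difference) called significant
    by a z-test that assumes per-sample noise with SD `assumed_sd`."""
    se = assumed_sd * math.sqrt(2.0 / n_per_group)  # assumed SE of a difference in means
    hits = 0
    for _ in range(n_genes):
        # Both groups share the same mean (0); samples differ only by noise
        a = [rng.gauss(0.0, sd_bio) + rng.gauss(0.0, sd_tech) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, sd_bio) + rng.gauss(0.0, sd_tech) for _ in range(n_per_group)]
        diff = sum(b) / n_per_group - sum(a) / n_per_group
        if abs(diff / se) > z_crit:
            hits += 1
    return hits / n_genes

rng = random.Random(1)
SD_BIO, SD_TECH = 0.5, 0.1                      # assumed values, for illustration only
sd_total = math.sqrt(SD_BIO ** 2 + SD_TECH ** 2)

# Test that treats technical variation as the only source of noise
naive = false_positive_rate(5000, 2, SD_BIO, SD_TECH, assumed_sd=SD_TECH, rng=rng)
# Test that accounts for biological variation as well
full = false_positive_rate(5000, 2, SD_BIO, SD_TECH, assumed_sd=sd_total, rng=rng)

print(f"false positives, technical noise only : {naive:.2f}")
print(f"false positives, total noise          : {full:.2f}")
```

Under these assumptions the test that ignores biological variability flags well over half of the null genes, while the test using the total noise stays near the nominal 5% level. In practice the total noise is unknown and must be estimated from biological replicates, which is why their number matters.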