Large-scale gene expression studies allow one to characterize transcriptional variation with respect to measured variables of interest, such as differing environments, treatments, time points, phenotypes, or clinical outcomes. However, a number of unmeasured or unmodeled factors may also influence the expression of any particular gene. Besides inducing widespread dependence in measurements across genes [1], these influential factors create additional sources of differential expression, which, unlike gene-specific fluctuations, represent common sources of variation in gene expression that can be observed among multiple genes.
We call “primary measured variables” (or primary variables) those variables that are explicitly modeled in the analysis of an expression study. These variables may or may not be associated with any given gene's expression variation. We classify all the remaining sources of expression variation into three basic types. “Unmodeled factors” are sources of variation that are explained by measured variables but are not explicitly included in the statistical model (e.g., because their relationship to expression is intractable or because the relevant measured variables were excluded owing to sample size restrictions). “Unmeasured factors” are sources of expression variation that are not measured in the course of the study; because they cannot be modeled either, we also refer to them as unmodeled factors. Finally, “gene-specific noise” refers to random fluctuations in gene expression that are realized independently from gene to gene.
As a simple example meant only for illustrative purposes, consider a human expression study where disease state in a particular tissue type is the primary variable. Suppose that in addition to changes in expression being associated with disease state, the age of the individuals also has a substantial influence on expression. Thus, some genes exhibit differential expression with respect to disease state, some with respect to age, and some with respect to both. If age is not included in the model when identifying differential expression with respect to disease state, we show that this may (a) induce extra variability in the expression levels due to the effect of age, decreasing our power to detect associations with disease state, (b) introduce spurious signal because the effect of age on expression may be confounded with disease state, or (c) induce long-range dependence in the apparent “noise” of the expression data, complicating any assessment of statistical significance for differential expression. In practice, even if age were known, it might be one of dozens of available measured factors, making it statistically intractable to determine which to include in the model. Furthermore, even measured factors such as age may act on distinct sets of genes in different ways, or may interact with an unobserved factor, making the effect of age on expression difficult to model. “Expression heterogeneity” (EH) is used here to describe patterns of variation due to any unmodeled factor.
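Point (b) of the age example can be made concrete with a small simulation. The sketch below is illustrative only: the sample size, effect sizes, and function names (`simulate_gene`, `disease_effect_naive`, `disease_effect_adjusted`) are assumptions chosen for this illustration and are not taken from any study. It simulates a single gene that has no true disease effect but does depend on age, with diseased individuals older on average, and compares the estimated disease effect with and without age in the model.

```python
import random
import statistics


def simulate_gene(n=400, age_shift=10.0, age_effect=0.1, seed=0):
    """Simulate one gene with NO true disease effect but a real age effect.

    All numeric settings are hypothetical. Diseased individuals are older
    on average by `age_shift` years, so age is confounded with disease.
    """
    rng = random.Random(seed)
    disease = [i % 2 for i in range(n)]  # 0 = healthy, 1 = diseased
    age = [rng.gauss(50.0 + age_shift * d, 8.0) for d in disease]
    expr = [age_effect * a + rng.gauss(0.0, 0.5) for a in age]
    return disease, age, expr


def slope(x, y):
    """Least-squares slope of y on x (model includes an intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after a least-squares fit on x with an intercept."""
    b = slope(x, y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]


def disease_effect_naive(disease, expr):
    """Disease coefficient when age is left out of the model."""
    return slope(disease, expr)


def disease_effect_adjusted(disease, age, expr):
    """Disease coefficient when age is also in the model.

    Uses the Frisch-Waugh-Lovell identity: remove age from both the
    disease indicator and the expression values, then regress the
    residuals on each other.
    """
    return slope(residuals(age, disease), residuals(age, expr))


if __name__ == "__main__":
    disease, age, expr = simulate_gene()
    print("naive disease effect:   ", disease_effect_naive(disease, expr))
    print("adjusted disease effect:", disease_effect_adjusted(disease, age, expr))
```

Under these assumed settings the naive estimate is pulled away from zero by the age difference between groups, while the age-adjusted estimate stays close to the true value of zero. The same setup with `age_shift=0.0` removes the confounding and leaves only the extra variability described in (a).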
Major sources of expression variation are due to technical [3], environmental [5], demographic [7], or genetic [9] factors. It is well known that sources of variation due to experimental design or large-scale systematic sources of signal may be present in expression data [3], sometimes even after normalization has been applied [14]. Genetic factors can also have a large-scale impact on gene expression levels. Specific genetic loci have been shown to influence the expression of hundreds or thousands of genes in several organisms [10]. Expression heterogeneity is particularly pronounced in human expression data, especially in the study of complex systems, such as cancer or responses to stress [16]. Recently, Lamb et al. proposed the “Connectivity Map” for identifying functional connections between cancer subtypes, genetic background, and drug action [19]. Lamb et al. noted EH (e.g., due to cell type and batch effects) presented a major hurdle for extracting relevant biological signal from the Connectivity Map.
In each of these studies, expression variation with respect to one or at most a handful of variables is explored. However, it is likely that in each study multiple sources of EH will act on distinct, but possibly overlapping, sets of genes. Normalization techniques are commonly used to detect and adjust for systematic expression variation due to well-characterized laboratory and technical sources [12]. However, to date there has been no approach for identifying and accounting for all sources of systematic expression variation, including variation due to unmeasured or unmodeled factors of both biological and technical sources. We show here that biological sources of variation not modeled in the analysis can be just as problematic as technical sources of variation.
Here, we introduce “surrogate variable analysis” (SVA) to identify, estimate, and utilize the components of EH. The accompanying figure shows the effects of failing to account for unmodeled factors in a differential expression analysis, and the potential benefits of the SVA approach. EH causes drastic increases in the variability of the ranking of genes for differential expression (panel A), distorts the null distribution, potentially causing highly conservative or anticonservative significance estimates (panel B), and reduces the power to distinguish true associations between a measured variable of interest and gene expression (panel C). However, employing SVA in these studies produces operating characteristics nearly equivalent to what one would obtain with no EH at all.
Figure. Impact of Expression Heterogeneity
We apply SVA to three distinct expression studies [7], where each study contains clear patterns of EH (Figure S1). These studies represent major classes of gene expression studies performed in practice: genetic dissection of expression variation, differential expression analysis between disease classes, and differential expression over time. We show that SVA is able to accurately identify and estimate the impact of unmodeled factors in each type of study, using only the expression data itself. We further show that SVA improves accuracy and consistency in detecting differential expression. SVA orders the significant gene lists to more accurately and reproducibly reflect the ordering of the genes with respect to their true differential expression signal. SVA is particularly useful in producing reproducible results in microarray studies, because adjusting for surrogate variables reduces differential expression due to sources other than the primary variables. These results indicate that EH is prevalent across a range of studies and that SVA can be used to capture and account for these patterns to improve the characterization of biological signal in expression analyses.
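To give some intuition for how an unmodeled factor might be recovered from the expression data alone, the following sketch simulates a study with one hidden factor, removes the primary variable from each gene, and extracts the dominant sample-wise pattern left in the residuals. This is not the SVA algorithm itself, which this section does not specify; it is a deliberately simplified stand-in based on power iteration, and every name and numeric setting in it (`simulate_study`, `remove_primary`, `leading_sample_pattern`, the gene counts and effect sizes) is a hypothetical choice made for the illustration.

```python
import random
import statistics


def simulate_study(n_genes=300, n_samples=40, seed=1):
    """Simulate genes x samples data with one modeled and one hidden factor.

    All settings are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    primary = [j % 2 for j in range(n_samples)]            # modeled variable
    hidden = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # unmodeled factor
    expr = []
    for i in range(n_genes):
        b_primary = 1.0 if i < 50 else 0.0      # first 50 genes respond to primary
        b_hidden = 1.0 if i % 3 == 0 else 0.0   # every third gene responds to hidden
        expr.append([
            b_primary * primary[j] + b_hidden * hidden[j] + rng.gauss(0.0, 1.0)
            for j in range(n_samples)
        ])
    return primary, hidden, expr


def remove_primary(primary, row):
    """Residuals of one gene after a least-squares fit on the primary variable."""
    mx, my = statistics.fmean(primary), statistics.fmean(row)
    sxx = sum((x - mx) ** 2 for x in primary)
    b = sum((x - mx) * (y - my) for x, y in zip(primary, row)) / sxx
    return [y - my - b * (x - mx) for x, y in zip(primary, row)]


def leading_sample_pattern(resid, n_iter=50, seed=0):
    """Leading right singular vector of the residual matrix via power iteration."""
    rng = random.Random(seed)
    n = len(resid[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        u = [sum(r[j] * v[j] for j in range(n)) for r in resid]        # R v
        v = [sum(r[j] * ui for r, ui in zip(resid, u)) for j in range(n)]  # R^T u
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


if __name__ == "__main__":
    primary, hidden, expr = simulate_study()
    resid = [remove_primary(primary, row) for row in expr]
    pattern = leading_sample_pattern(resid)
    # The sign of a singular vector is arbitrary, so report the magnitude.
    print("abs correlation with hidden factor:", abs(correlation(pattern, hidden)))
```

In this simulated setting the recovered pattern tracks the hidden factor closely even though the factor was never supplied to the analysis. A pattern estimated in this way could then be included as an additional covariate when testing for association with the primary variable.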