We have introduced the sva package, including the popular ComBat function for removing batch and other unmeasured or unmodeled sources of variation. We have also introduced the first function for removing batch effects in genomic prediction problems. The sva package is freely available from the Bioconductor website and is compatible with widely used differential expression software such as limma.
3.1 Surrogate variables versus direct adjustment
The goal of sva is to remove all unwanted sources of variation while protecting the contrasts due to the primary variables specified in the function call. This leads to the identification of features that are consistently different between groups, removing all common sources of latent variation.
In some cases, latent variables may be important sources of biological variability. If the goal of the analysis is to identify heterogeneity in one or more subgroups, the sva function may not be appropriate. For example, suppose the cancer samples are expected to comprise two distinct but unknown subgroups of biological interest. If these subgroups have a large impact on expression, then one or more of the estimated surrogate variables may be highly correlated with subgroup membership (Teschendorff et al., 2011). This is true regardless of whether the surrogate variables are estimated with principal components, singular vectors (Leek and Storey, 2007) or independent components (Teschendorff et al., 2011). However, removing surrogate variables that are correlated with the phenotype of interest from the model may lead to inconsistent and anti-conservatively biased significance analysis, especially if unknown latent variables are correlated with the phenotype of interest (Leek and Storey, 2007). Thus, whether excluding surrogate variables improves inference remains an open problem.
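The point that an estimated surrogate variable can track an unknown subgroup can be illustrated with a small simulation. The following is a minimal sketch in plain Python, not the algorithm implemented in the sva package: it assumes a simplified estimator (the leading singular vector of the residuals after removing the primary variable), and all names, sample sizes and effect sizes are hypothetical choices made for illustration.

```python
import math
import random


def residualize_on_group(row, group):
    """Remove the primary-variable effect from one feature by subtracting group means."""
    means = {}
    for g in set(group):
        vals = [x for x, gi in zip(row, group) if gi == g]
        means[g] = sum(vals) / len(vals)
    return [x - means[gi] for x, gi in zip(row, group)]


def top_right_singular_vector(matrix, n_iter=200):
    """Power iteration on R^T R; returns a unit-length vector with one entry per sample."""
    n = len(matrix[0])
    v = [float(j + 1) for j in range(n)]  # arbitrary non-degenerate starting vector
    for _ in range(n_iter):
        u = [sum(r[j] * v[j] for j in range(n)) for r in matrix]  # u = R v
        w = [sum(matrix[i][j] * u[i] for i in range(len(matrix))) for j in range(n)]  # w = R^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


random.seed(1)
group = [0] * 6 + [1] * 6   # primary variable (known), hypothetical design
subgroup = [0, 1] * 6       # hidden biological subgroup (unknown to the analyst)

# 40 simulated features: 10 respond to the primary variable, 20 to the hidden subgroup.
expr = []
for i in range(40):
    primary_effect = 1.0 if i < 10 else 0.0
    hidden_effect = 2.0 if 10 <= i < 30 else 0.0
    expr.append([primary_effect * g + hidden_effect * s + random.gauss(0, 0.3)
                 for g, s in zip(group, subgroup)])

# Protect the primary variable by removing it, then take the leading residual pattern.
residuals = [residualize_on_group(row, group) for row in expr]
surrogate = top_right_singular_vector(residuals)

corr_hidden = pearson(surrogate, subgroup)
corr_primary = pearson(surrogate, group)
print("correlation with hidden subgroup:", round(corr_hidden, 3))
print("correlation with primary variable:", round(corr_primary, 3))
```

In this simulated setting the estimated pattern is strongly correlated (up to sign) with the hidden subgroup and uncorrelated with the primary variable, so adjusting for it would remove exactly the heterogeneity that a subgroup-discovery analysis is looking for.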
In contrast, direct adjustment only removes the effect of known batch variables. Batch effects are the best-known source of latent variation in genomic experiments (Leek et al., 2010). However, many other variables may have a substantial impact on genomic measurements, from environmental variables (Gibson, 2008) to genetic variation (Brem et al., 2002; Schadt et al., 2003). These variables may be the focus of the study being performed. But many studies focus on identifying the association between genomic measurements and specific outcomes or phenotypes. In these studies, genetic and environmental variables are often unmeasured or unmodeled. If ignored, these biological variables may act in the same way as batch effects, obscuring signal, reducing power and biasing biological conclusions (Leek and Storey, 2007).
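To make the effect of an ignored variable concrete, the sketch below compares a model that omits a known batch indicator with one that includes it as a covariate. This is only one simple form of direct adjustment, written in plain Python under assumed values; it does not reproduce ComBat's empirical Bayes procedure, and the design, effect sizes and function names are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ols(design, y):
    """Least-squares coefficients from the normal equations X'X beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


random.seed(7)
group = [0] * 6 + [1] * 6                       # primary variable
batch = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]    # known batch, partly confounded with group

# One simulated feature: true group effect 1.0, batch effect 2.0, small noise.
y = [1.0 * g + 2.0 * b + random.gauss(0, 0.05) for g, b in zip(group, batch)]

# Model without batch: intercept + group.
naive = ols([[1.0, g] for g in group], y)[1]
# Direct adjustment: intercept + group + known batch indicator.
adjusted = ols([[1.0, g, b] for g, b in zip(group, batch)], y)[1]

print("group effect ignoring batch:", round(naive, 3))
print("group effect adjusting for batch:", round(adjusted, 3))
```

Because batch membership is unevenly distributed across the two groups in this design, the unadjusted estimate absorbs part of the batch effect, while the adjusted estimate stays close to the simulated group effect. The same mechanism applies to any unmodeled variable that is correlated with the primary variable, which is the situation surrogate variable analysis is meant to address when the variable is not recorded.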
As a rule of thumb, when there are many known or unknown potential confounders, surrogate variable adjustment may be more appropriate. Alternatively, when one or more biological groups are known to be heterogeneous, and there are known batch variables, direct adjustment may be more appropriate.