Statistical design and analysis of perturbation screens involve (i) experimental design, (ii) normalization, (iii) summarization of the phenotype of each sample from multiple replicates and estimation of the associated variation, (iv) determination of ‘hits’, i.e. samples with systematic changes in phenotype and (v) evaluation and quality control.
Experimental design: a typical design of a perturbation screen is overviewed in . Samples in the screen are processed in 96-well or similar plates. To enable high throughput, the samples are profiled with a small number of replicates [e.g. 4, as recommended by (
Zhang and Heyse, 2009)], and all replicates of a sample are systematically allocated to the same plate. The screens typically require tens, hundreds or even thousands of plates. Therefore, the plates are handled in batches defined by the availability of biological material and capacity of equipment.
The scored phenotypes can be systematically altered by batches and plates (
Malo et al., 2006), within-plate effects due e.g. to rows and columns on the plate (
Malo et al., 2006) or excessive evaporation of media around the edges (
Wiles et al., 2008). To account for these artifacts, one or more control samples are included in all plates. These can be negative controls (e.g. unperturbed samples) and positive controls (e.g. samples with known changes in the phenotype).
Malo et al. (2006) recommend to allocate the controls around the edges of the plate, in order to limit the negative effects of evaporation on the perturbed samples.
The limited capacity of plates, the limited within-plate replication of perturbed samples, the absence of between-plate replication and a small number of controls makes elimination of experimental artifacts in perturbation screens extremely challenging in practice. We argue in
Section 3 that the existing statistical methods under-use the information provided by the controls in these situations, and that it is possible to obtain a more specific detection of hits by a separate use of two or more distinct positive or negative controls.
Normalization: scored phenotypes undergo quality control to eliminate the outlying or failed samples or plates. After that, a normalization procedure accounts for confounding and for experimental artifacts, and makes the scored phenotypes comparable across samples, batches and plates.
Two most frequently used families of normalization are sample based and control based. Sample-based normalization methods (detailed in
Supplementary Section 1) assume that the majority of perturbations do not affect the phenotype. Examples are
B-score (
Tukey, 1960),
Z-score and plate-wise median (
Collins et al., 2006).
Malo et al. (2006) reviewed sample-based normalizations for perturbation screens and recommended using
B-score. Another popular method is quantile normalization, which was introduced for the analysis of gene expression microarrays (
Bolstad et al., 2003;
Yang et al., 2002), and is applied to perturbation screens (
Bankhead et al., 2009). Principal component analysis can be used to account for the batch effect (
Leek et al., 2010), and surrogate variable analysis can help remove the heterogeneous effect of plates between batches (
Leek and Storey, 2007). Within-plate artifacts can be normalized using lowess smoothing (
Baryshnikova et al., 2010).
Sample-based normalization is attractive because it is based on the entire collection of measurements in the experiment, uses the maximal number of observations and therefore produces an accurate estimate of the normalized phenotype. However, it is not appropriate for screens where many perturbations affect the phenotype, and also in secondary and confirmatory screens. Alternative normalization procedures, based on controls, are more appropriate in these situations (
Birmingham et al., 2009). Examples of control-based normalization are detailed in
Supplementary Section 2. Given a relatively small number of controls in a plate, control-based normalization can only account for limited types of experimental artifacts, and can yield highly variable estimates of bias.
Wiles et al. (2008) compared the performance of seven sample-based and control-based normalization methods, and found, in the words of
Birmingham et al. (2009), that ‘no single method excelled’ in all situations. Software implementations, such as the ones in the open-source Bioconductor packages
RNAither (
Rieber et al., 2009) and
cellHTS2 (
Boutros et al., 2006) offer multiple above-mentioned alternatives.
In this work, we demonstrate that control-based normalization can improve the accuracy of results, as compared to the currently available methods, in screens where a large proportion of samples show changes in the phenotypes. We argue that such procedure should involve more than one control sample, and should be used not only for normalization, but also for estimation of residual between-plate variation.
Summarization of phenotypes and estimation of variation: this step summarizes the normalized phenotypes of a biological sample across replicates in a single value, typically by averaging, and estimates the associated variation. Estimation of variation is important, as it allows us to distinguish random variation from stress-related changes in the phenotype. Most existing methods estimate the variation by sample variance (
Collins et al., 2006), or by its robust alternatives.
Malo et al. (2006) recommended using Empirical Bayes approach to variance stabilization, which was originally introduced in the context of gene expression microarrays (
Smyth, 2004,
2005), but is applicable directly to the context of perturbation screens. The approach is summarized in mathematical formulation in
Supplementary Section 3.
The goal of this article is to demonstrate that such estimation of variation has serious deficiencies, in particular in screens with sensitive phenotypes and no between-plate replication. If the experimental design allocates all replicates of a biological sample in the same plate, these methods only estimate the variation within the plate. In other words, the methods assume that within-plate variation represents the full extent of variation of the normalized phenotypes.
We argue in
Section 3.3 that this assumption oversimplifies the structure of variation in the screens, and is rarely verified. We note that in control-based normalization, where normalizing quantities are estimated from a small number of observations, estimates of plate- and batch-specific bias are subject to uncertainty. Moreover, the effect of batches and plates on the phenotype can differ somewhat across biological samples, and further contribute to the variation. We show that appropriately accounting for this residual variation can play an important role in the determination of hits.
Determination of hits: determination of hits is formalized as testing the null hypothesis ‘the perturbed phenotype is consistent with the phenotype of a control’ or ‘the perturbed phenotype is consistent with the average phenotype of all perturbations’ against the corresponding alternative. The test is conducted using a test statistic, such as the Student's T or the moderated T above, which compares the summary quantification of the phenotype to its estimate of variation. Depending on the experiment, the reference distribution of the statistic is assumed Student or Normal, or is estimated empirically based on controls. Non-parametric alternatives, e.g. the Mann–Whitney test and the Rank Product test (
Rieber et al., 2009) can also be used, but have lower power.
The second aspect of determination of hits is the selection of the test statistic cutoff, which controls the rate of false positive hits at the desired level. Multiple testing procedures controlling for the false discovery rate (FDR), such as
Benjamini and Hochberg (1995) or
Efron (2008), can be used directly. Alternatively, (
Zhang et al., 2008) developed a specialized Bayesian procedure, which directly models the probabilities of phenotypes and controls FDR. Using ordered
Z-scores,
Kaplow et al. (2009) designed a tool called RNAiCut for automated identification of pathway-relevant hits. Although all these approaches are appropriate, their sensitivity and specificity depend on the choice of the test statistic, and in particular on its estimate of variation.
Evaluation: development of statistical methods for high-throughput screens is challenging in part because of difficulties in their evaluation on experimental datasets. The evaluation is facilitated in the case of multivariate phenotypes, where we can examine the consistency of normalized phenotypes of the controls in a multivariate space. We use such multivariate phenotypes, and both control-based and sample-based evaluation in
Section 5.