Traditional microarray analysis methods are oblivious to sample cell-type composition. They can neither distinguish between variations in gene expression resulting from an actual physiological change versus differences in cell-type frequency, nor identify the contributions of different cell types to the total measured gene expression. Therefore, their power to detect differentially expressed genes is strongly affected by the sample variation in cell-type frequencies1–3.
Ideally, one would perform between-group differential expression analysis for each of the cell types in a tissue. Experimental methods for isolating subsets of tissues, such as cell sorting or enrichment, are prohibitively expensive and may affect cell physiology and gene expression4,5. In theory, a statistics-based alternative is to quantify the relative abundance of each cell type in each sample, then deconvolve and compare cell type–specific average expression profiles for groups of mixed tissue samples (Fig. 1). Cell-type subset composition can be measured using labeled antibodies to cell-surface markers and flow cytometry, quantified by histology analyses6 or even estimated from the gene expression data by deconvolution from cell type–specific probes7–10. Though previous attempts at gene expression deconvolution have assumed deconvolution to be linear6–8, the relationship between the gene expression in mixed samples and the actual gene expression of the constituting cell subsets is unclear. This prevents assessment of the accuracy of deconvolution-derived profiles, their widespread application and development of such statistics-based techniques.
Figure 1 Overview of csSAM. Different cell types are denoted by circles, diamonds and hexagons. csSAM identifies cell type–specific differential expression, as shown by the arrows on the right.
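The linearity assumption discussed above can be written compactly. The notation below is introduced here for illustration only and is not taken from the original text: x_{ij} is the measured expression of probe j in mixed sample i, w_{ik} is the relative frequency of cell type k in sample i, h_{kj} is the average expression of probe j in cell type k, and \varepsilon_{ij} is a residual error term.

```latex
x_{ij} = \sum_{k=1}^{K} w_{ik}\, h_{kj} + \varepsilon_{ij},
\qquad \sum_{k=1}^{K} w_{ik} = 1, \quad w_{ik} \ge 0
```

Under this model, deconvolution amounts to estimating the h_{kj} from the measured x_{ij} and the known or estimated w_{ik}.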
We tested the relationship between measured gene expression in mixed samples and the expression of genes in the isolated pure subsets, in a situation in which all factors are known. We analyzed tissue samples from the brain, liver and lung of a single rat in isolation (referred to as 'measured pure tissue') as well as in ten different mixture ratios (referred to as 'measured mixtures'; Supplementary Table 1) using Affymetrix expression arrays (Online Methods). Such mixtures mimic the common scenario in which biological samples in a dataset are heterogeneous and vary in the relative frequency of the component subsets from one another.
Next, we reconstituted mixture sample expression profiles by weighting each measured pure tissue expression profile by the frequency of that tissue in a given mixture sample and summing across tissues. Overall, experimentally measured mixture data had high correlation with the reconstituted mixture data (r > 0.95; Supplementary Fig. 1). Probes for which data deviated from the diagonal comprised only a small fraction of the probes up to a twofold expression change cutoff (Supplementary Fig. 2); these probes were more abundant in experimentally measured mixtures than in reconstituted samples, likely because of nonlinear biases in sample amplification and normalization procedures or probe cross-hybridization (Supplementary Note 1, Supplementary Fig. 3 and Supplementary Table 2).
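The reconstitution step described above can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not the authors' code: the mixing proportions, the simulated intensities, the noise model and the choice to correlate on the log2 scale are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for measured pure tissue profiles:
# 3 tissues (brain, liver, lung) x 1000 probes, linear-scale intensities.
n_probes = 1000
pure = rng.lognormal(mean=6.0, sigma=1.0, size=(3, n_probes))

# Hypothetical mixing proportions (rows: mixtures; columns: brain, liver, lung).
fractions = np.array([
    [0.05, 0.25, 0.70],
    [0.70, 0.05, 0.25],
    [0.25, 0.70, 0.05],
    [0.45, 0.45, 0.10],
])


def reconstitute(fractions, pure):
    """Frequency-weighted sum of pure tissue profiles, one row per mixture."""
    return fractions @ pure


def per_sample_correlation(a, b):
    """Pearson r between matching rows of a and b, computed on the log2 scale."""
    log_a, log_b = np.log2(a), np.log2(b)
    return np.array([np.corrcoef(x, y)[0, 1] for x, y in zip(log_a, log_b)])


reconstituted = reconstitute(fractions, pure)

# Simulated "measured" mixtures: reconstituted signal with multiplicative noise.
measured = reconstituted * rng.lognormal(mean=0.0, sigma=0.1, size=reconstituted.shape)

r = per_sample_correlation(measured, reconstituted)
print("per-mixture correlation:", np.round(r, 3))
```

With real data, `pure` and `measured` would be replaced by the array measurements and `fractions` by the known mixture ratios.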
The high correlation that we observed between the measured and reconstituted mixtures suggests that statistical deconvolution of tissue-specific expression profiles from complex tissue samples using linear regression should yield accurate expression estimates for most genes. To test this, we applied linear regression fitting to the measured mixture samples using the mixture ratios (Online Methods). For each tissue, a comparison of the estimated expression profile of each subset to the measured expression pattern in the pure tissue showed a high correlation (Fig. 2), indicating that we could accurately deconvolute subset-specific expression patterns for the majority of genes from whole-sample measurements.
Figure 2 Statistical deconvolution of complex tissues yields accurate estimates of pure tissue-subset expression. (a–c) Density plots of estimated tissue-specific gene expression deconvoluted from mixed tissue samples plotted against measured gene expression ...
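The regression step can be illustrated with ordinary least squares, fitting all probes at once. This is a sketch under stated assumptions rather than the published implementation: the ten mixing ratios below are hypothetical placeholders (the actual ratios are in Supplementary Table 1), the data are simulated, and the fit has no intercept or non-negativity constraint, which the original method may or may not use.

```python
import numpy as np

# Hypothetical mixing proportions for ten mixtures (columns: brain, liver, lung).
FRACTIONS = np.array([
    [0.05, 0.25, 0.70],
    [0.70, 0.05, 0.25],
    [0.25, 0.70, 0.05],
    [0.70, 0.25, 0.05],
    [0.05, 0.70, 0.25],
    [0.25, 0.05, 0.70],
    [0.45, 0.45, 0.10],
    [0.10, 0.45, 0.45],
    [0.45, 0.10, 0.45],
    [0.34, 0.33, 0.33],
])


def deconvolve(mixtures, fractions):
    """Least-squares estimate of per-tissue expression profiles.

    mixtures:  (n_samples, n_probes) measured mixture expression
    fractions: (n_samples, n_tissues) known tissue proportions
    returns:   (n_tissues, n_probes) estimated pure tissue profiles
    """
    estimated, *_ = np.linalg.lstsq(fractions, mixtures, rcond=None)
    return estimated


rng = np.random.default_rng(0)
pure = rng.lognormal(mean=6.0, sigma=1.0, size=(3, 2000))      # simulated pure profiles
noise = rng.lognormal(mean=0.0, sigma=0.05, size=(10, 2000))   # simulated measurement noise
mixtures = (FRACTIONS @ pure) * noise

estimated = deconvolve(mixtures, FRACTIONS)

# Compare each estimated profile with the corresponding simulated pure profile.
r_per_tissue = [np.corrcoef(estimated[k], pure[k])[0, 1] for k in range(3)]
print("correlation per tissue:", np.round(r_per_tissue, 3))
```

Because the same design matrix of proportions applies to every probe, a single `lstsq` call solves all per-probe regressions together.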
Accurate deconvolution of cell type–specific expression profiles enables the development and application of statistical techniques aimed at maximizing the information obtainable from a heterogeneous tissue gene expression assay. To estimate the specificity and sensitivity of statistical deconvolution to detect differentially expressed genes, we compared deconvoluted and measured differences in gene expression between tissues. Akin to fold change, all probes whose estimated abundance difference was greater than a set threshold were predicted to be differentially expressed. We compared these to a 'gold standard' set of differentially expressed probes between tissues identified from the pure tissue sample measurements (Online Methods). Receiver operating characteristic (ROC) curve analysis showed the detection of differentially expressed genes by statistical deconvolution to be both highly specific and sensitive, with an area under the curve of 0.85 or greater (Supplementary Fig. 4).
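As an illustration of this evaluation, the sketch below scores each probe by the absolute difference between two estimated profiles and computes the area under the ROC curve against a gold-standard labeling. It uses the rank-sum identity for the AUC rather than sweeping thresholds explicitly. The simulated data, the cutoff of 1.0 used to define the gold standard and the function name are assumptions made for the example.

```python
import numpy as np


def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Tied scores receive their average rank.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Average rank of each distinct score value (ranks start at 1).
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


rng = np.random.default_rng(0)

# Simulated pure profiles for two tissues; the first 50 probes truly differ.
pure_a = rng.normal(8.0, 1.0, size=500)
pure_b = pure_a.copy()
pure_b[:50] += 3.0
gold = np.abs(pure_a - pure_b) > 1.0  # hypothetical gold-standard cutoff

# Simulated deconvolved estimates: pure profiles plus estimation error.
est_a = pure_a + rng.normal(0.0, 0.3, size=500)
est_b = pure_b + rng.normal(0.0, 0.3, size=500)

auc = roc_auc(np.abs(est_a - est_b), gold)
print("AUC:", round(float(auc), 3))
```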
In real-life settings, differences are often assayed between groups of samples, each containing many cell types, and no 'gold standard' gene list exists to tell true difference from noise. To test the utility of our method to address an important clinical problem in a complex tissue, we applied cell type–specific significance analysis of microarrays (csSAM) to human whole-blood gene expression array data from 24 kidney transplant recipients. Of these, 15 were experiencing acute rejection of the kidney, whereas 9 were stable after transplant. Blood cells represent a particularly complex tissue type, with over a dozen distinct cell types that can vary in frequency up to 10–20-fold between healthy individuals. In this case, data on white blood cell subsets from Coulter counter measurements were available for all individuals analyzed (Supplementary Table 3), distinguishing five major cell types: lymphocytes, monocytes, neutrophils, eosinophils and basophils.
We observed high variation in relative cell-type frequency between individuals but detected no significant differences in cell-type frequencies between the two groups (P ≥ 0.24 for all cell types). Whole-blood differential expression analysis using a previously published method, significance analysis of microarrays (SAM)11, revealed no differentially expressed genes between the two groups, even at a relatively permissive false discovery rate (FDR) of 0.3 and after reduction in the number of multiple hypothesis tests (Fig. 3a and Supplementary Fig. 5).
Figure 3 csSAM reveals cell type–specific differential expression undetectable at heterogeneous tissue level. (a–f) Differential expression analysis in whole blood (a) and the indicated cell types (b–f) between samples from individuals ...
Next, for each of the two groups of individuals, we deconvoluted the cell type–specific gene expression profile of each quantified cell type by linear regression analysis. Each such cell type–specific expression profile represents the average for that cell type in that group of individuals. We used these deconvolved cell type–specific expression profiles to perform cell type–specific differential expression analysis (Online Methods). For each gene, in each cell type, we calculated the contrast in its deconvoluted expression between groups of individuals. We repeated the deconvolution and cell-type contrast procedure with permuted group-label data. To assess differences in a gene's expression between the two deconvolved profiles of a cell type, we calculated the FDR as the ratio of the average number of genes whose contrast exceeds a given threshold in the permuted datasets to the number of genes exceeding the same threshold in the real dataset (Online Methods).
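The procedure in this paragraph (per-group deconvolution, between-group contrast, label permutation and FDR estimation) can be sketched as follows. This is a simplified illustration, not the csSAM implementation: the contrast here is a plain difference of deconvolved profiles, whereas the published method may standardize it, and all function names, thresholds and simulated data are assumptions made for the example.

```python
import numpy as np


def deconvolve(expr, freqs):
    """Least-squares cell type-specific profiles: (n_cell_types, n_genes)."""
    est, *_ = np.linalg.lstsq(freqs, expr, rcond=None)
    return est


def contrast(expr, freqs, is_case):
    """Difference between deconvolved profiles of the two groups (case minus control)."""
    case = deconvolve(expr[is_case], freqs[is_case])
    control = deconvolve(expr[~is_case], freqs[~is_case])
    return case - control


def permutation_fdr(expr, freqs, is_case, thresholds, n_perm=200, seed=0):
    """Per-cell-type FDR estimates at each threshold, from permuted group labels.

    Returns (fdr, called), both of shape (n_cell_types, n_thresholds).
    """
    rng = np.random.default_rng(seed)
    thresholds = np.asarray(thresholds, dtype=float)
    observed = np.abs(contrast(expr, freqs, is_case))
    called = (observed[:, :, None] >= thresholds).sum(axis=1)
    null_called = np.zeros(called.shape, dtype=float)
    for _ in range(n_perm):
        permuted = rng.permutation(is_case)  # shuffle group labels across samples
        null = np.abs(contrast(expr, freqs, permuted))
        null_called += (null[:, :, None] >= thresholds).sum(axis=1)
    null_called /= n_perm
    # FDR = average permuted calls / real calls; the denominator is clamped
    # to 1 when nothing is called, and the estimate is capped at 1.
    fdr = np.minimum(null_called / np.maximum(called, 1), 1.0)
    return fdr, called


# Simulated example: 24 samples (15 cases, 9 controls), 5 cell types, 50 genes.
rng = np.random.default_rng(1)
freqs = rng.dirichlet(np.ones(5), size=24)   # hypothetical cell-type frequencies
is_case = np.arange(24) < 15
base = rng.normal(8.0, 1.0, size=(5, 50))    # control cell-type profiles
shift = np.zeros((5, 50))
shift[1, :10] = 2.0                          # 10 genes changed in cell type 1 only
expr = np.where(is_case[:, None], freqs @ (base + shift), freqs @ base)

fdr, called = permutation_fdr(expr, freqs, is_case, thresholds=[0.5, 1.0], n_perm=20)
print("genes called per cell type and threshold:\n", called)
```

The simulated example is noise-free so that the recovered contrast can be checked against the planted shift. Real data would require enough samples per group, with sufficiently variable cell-type frequencies, for the regression to be well conditioned.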
Though we detected no differentially expressed genes between the two groups in whole-blood analyses, sample heterogeneity may have masked biological differences. Applying the csSAM procedure to the kidney transplant dataset for each of the five quantified cell types, we identified 318 differentially expressed genes in monocytes at an FDR of 0.15 (Fig. 3). We identified no genes as differentially expressed even at an FDR of 0.3 in any of the other cell types (Fig. 3). However, repeating the analysis with separate one-tailed tests for up- and downregulated genes identified differentially expressed genes in lymphocytes and neutrophils between these two groups of individuals, as well as 137 genes upregulated in monocytes in samples from individuals experiencing acute kidney rejection at an FDR of 0.05 (Supplementary Fig. 6).
In conclusion, here we described the csSAM algorithm, which addresses the extensive loss of biological signal in microarray datasets when analyzing complex tissue samples that vary in cellular composition. What are the limitations of this methodology? First, probe saturation and cross-hybridization may result in inaccuracies of cell-specific expression profiles, though these do not seem to have a large effect on the accuracy of downstream differential expression analysis. Similarly, for those genes whose cellular expression changes in response to changes in the cell subset composition of their microenvironment, deconvolved cell type–specific expression profiles may be inaccurate. More sophisticated alternatives to linear regression may be developed to address this problem. Unlike traditional methodologies, csSAM accuracy benefits from variation between samples. Though additional experiments would be needed to identify csSAM's lower detection boundaries, accurate estimates of rare cell types may be aided by sample enrichment or inclusion of highly variable samples, which will yield cell-type frequency–dependent changes in transcript amounts. The key advantage of csSAM is that it localizes the identified differential expression to a particular cellular context, which allows clear hypothesis formulation for follow-up experiments. Though the principal test case here involves blood cells, our methodology is readily usable with microarray analysis of any heterogeneous tissue and can be applied to other types of molecular measurements as well.