Several statistical approaches have been proposed to deconvolute gene expression profiles obtained from heterogeneous tissue samples into cell-type-specific subprofiles. Most of the methods are based on a framework first proposed by Venet et al., which incorporates the linearity assumption that the expression of each gene in a mixture of cell types is a weighted average of the expression values that would exist for pure populations of those cell types. The weights are determined by the proportional composition of the cell types in the mixture and hence are the same for every gene but differ among sample mixtures. Since the publication of Venet et al., several additional publications have addressed deconvolution of gene expression profiles from complex tissues (for example, [5]). Without reviewing the details that distinguish the various methods, we attempt here to summarize the status of this area of development.
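The linearity assumption can be written compactly. The notation below is introduced here for illustration only and is not taken from the cited papers: $x_{ij}$ is the measured expression of gene $i$ in mixture sample $j$, $s_{ik}$ is the expression of gene $i$ in a pure population of cell type $k$, $w_{kj}$ is the proportion of cell type $k$ in sample $j$, and $\varepsilon_{ij}$ is an error term.

```latex
x_{ij} = \sum_{k=1}^{K} s_{ik}\, w_{kj} + \varepsilon_{ij},
\qquad w_{kj} \ge 0,
\qquad \sum_{k=1}^{K} w_{kj} = 1 .
```

In matrix form this reads $X = SW + E$. The weights $w_{kj}$ carry no gene index, which is the sense in which they are the same for every gene but differ among samples.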
When the proportions of the cell types in each mixture sample are known from fluorescence-activated cell sorting analysis, histopathological evaluation or other experimental methods, deconvolution is relatively straightforward: it can be solved as a linear regression problem in which the known proportions are the predictors and the cell-type-specific gene expression levels are the regression coefficients. Under these conditions, the regression problem can be solved separately for each gene.
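The regression formulation can be sketched in a few lines of Python. This is a minimal illustration on simulated, noise-free data and is not the implementation of any of the cited methods; the function name `estimate_cell_type_expression` and all numbers are assumptions made for the example.

```python
import numpy as np

def estimate_cell_type_expression(mixtures, proportions):
    """Least-squares estimate of cell-type-specific expression.

    mixtures:    (n_samples, n_genes) measured expression of the mixed samples
    proportions: (n_samples, n_cell_types) known cell-type fractions per sample
    returns:     (n_cell_types, n_genes) estimated expression per cell type
    """
    # Each column of `mixtures` (one gene) is regressed on the proportions;
    # lstsq solves all of these per-gene regressions in a single call.
    coef, _, _, _ = np.linalg.lstsq(proportions, mixtures, rcond=None)
    return coef

# Toy data (made up): 3 cell types, 4 genes, 6 mixture samples.
rng = np.random.default_rng(0)
true_expression = rng.uniform(1.0, 10.0, size=(3, 4))
proportions = rng.dirichlet(np.ones(3), size=6)   # each row sums to 1
mixtures = proportions @ true_expression          # noise-free linear mixing

estimated = estimate_cell_type_expression(mixtures, proportions)
print(np.allclose(estimated, true_expression))
```

No intercept is fitted because the proportions in each sample sum to one. With real data the estimates are noisy and can be negative, which is one reason constrained fits are sometimes preferred.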
In some cases the cell-type-specific gene expression levels may be of interest in their own right, or interest may focus on differences in expression among cell types. For cancer studies, however, interest often centers on differential expression among classes of tumors (such as responders versus non-responders to a treatment), with expression from normal epithelium and infiltrating immune cells of lesser interest. Shen-Orr et al. developed cell-type-specific significance analysis of microarrays (csSAM) to identify differentially expressed genes for each cell type from microarray data on sample mixtures. The relationship between measured gene expression in mixed samples and the expression of genes in the isolated pure subsets was tested experimentally for synthetic mixtures of liver, brain and lung cells from rats. Their in silico synthesized mixture expression profiles, obtained by multiplying the measured pure tissue expression profiles by the proportion of the tissue subset in a given mixture sample, were highly correlated with the experimentally measured expression profiles for the mixtures. This provided direct support for the linearity assumption of all previous models. The deconvoluted estimates of cell-type-specific expression were in good agreement with expression measured in pure cell types for the vast majority of probes.
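The in silico check described above amounts to forming a weighted average of the pure profiles and correlating it with the profile measured on the physical mixture. The sketch below illustrates this with hypothetical numbers; the "measured" profile is simulated by adding noise to the prediction, so it shows the computation only and not the published result.

```python
import numpy as np

def in_silico_mixture(pure_profiles, fractions):
    """Weighted average of pure profiles.

    pure_profiles: (n_tissues, n_probes) expression measured in pure tissue
    fractions:     (n_tissues,) proportion of each tissue in the mixture
    """
    return np.asarray(fractions) @ np.asarray(pure_profiles)

# Hypothetical numbers for three tissues and five probes.
pure = np.array([
    [8.0, 2.0, 5.0, 1.0, 9.0],   # tissue A
    [1.0, 7.0, 5.0, 3.0, 2.0],   # tissue B
    [4.0, 4.0, 0.5, 6.0, 3.0],   # tissue C
])
fractions = np.array([0.5, 0.3, 0.2])

predicted = in_silico_mixture(pure, fractions)

# Stand-in for an array measured on the physical mixture:
# the prediction plus small simulated noise.
rng = np.random.default_rng(1)
measured = predicted + rng.normal(scale=0.1, size=predicted.shape)

# Pearson correlation between predicted and measured profiles.
r = np.corrcoef(predicted, measured)[0, 1]
print(np.round(predicted, 2), round(float(r), 3))
```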
The authors [8] then applied csSAM to human whole blood gene expression array data from kidney transplant recipients. In analyses of whole blood, no differentially expressed genes were detected between the rejection group and the stable group. However, when csSAM was applied to each of the five quantified cell types (monocyte, basophil, neutrophil, eosinophil and lymphocyte), a large number of differentially expressed genes were identified between the two groups in two individual cell types. The method requires experimental measurements of the proportional composition of the component cell types in each sample. Although there are some pre-processing issues such as normalization that require further consideration, csSAM seems to be a useful tool for analysis of gene expression profiling of heterogeneous samples with known relative cell type frequencies. Source code for csSAM in the R statistical programming language is available [8].
Several investigations performed deconvolution when the proportions of the component cell types were unknown but expression of signature genes in pure cell types was known (for example, [5]). Abbas et al. developed an approach to estimate the proportions of white blood cell subtypes in samples from patients with systemic lupus erythematosus. First, they selected the most highly expressed signature probesets (genes) among several of the 18 immune cell types of interest using the expression data from the pure cells. They then used expression profiles for these signature genes to solve a linear equation for the proportions of the 18 immune cell subtypes in both healthy donors and patients with lupus. The deconvoluted results allowed them to find patterns of leukocyte dynamics and their correlations with clinical outcomes. In circumstances such as those described by Abbas et al., in which careful preliminary studies have been conducted to identify signature genes and determine their expression in pure cell subtypes, such deconvolution can be successful.
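When signature expression in pure cells is known, the roles in the regression are reversed: the signature matrix supplies the predictors and the proportions are the unknowns. The sketch below is a simplified illustration and not the procedure of Abbas et al.; it uses ordinary least squares, then sets negative estimates to zero and rescales so that the fractions sum to one, which is a crude assumption made here for brevity. The signature matrix and fractions are made up.

```python
import numpy as np

def estimate_proportions(signature, mixture):
    """Estimate cell-type fractions for one mixed sample.

    signature: (n_genes, n_cell_types) expression of signature genes in pure cells
    mixture:   (n_genes,) expression of the same genes in the mixed sample
    """
    coef, _, _, _ = np.linalg.lstsq(signature, mixture, rcond=None)
    coef = np.clip(coef, 0.0, None)   # crude fix: negative fractions set to zero
    total = coef.sum()
    if total == 0.0:
        raise ValueError("all estimated fractions are zero")
    return coef / total               # rescale so the fractions sum to 1

# Hypothetical signature matrix: 6 genes, 3 cell types; each cell type has
# two genes that are high in it and low in the others.
signature = np.array([
    [10.0, 1.0, 1.0],
    [ 9.0, 0.5, 1.5],
    [ 1.0, 8.0, 0.5],
    [ 0.5, 9.5, 1.0],
    [ 1.0, 1.0, 7.0],
    [ 1.5, 0.5, 8.5],
])
true_fractions = np.array([0.6, 0.1, 0.3])
mixture = signature @ true_fractions   # noise-free mixed sample

estimated_fractions = estimate_proportions(signature, mixture)
print(np.round(estimated_fractions, 3))
```

The quality of such estimates depends on how well the signature genes discriminate among cell types; nearly collinear signature columns make the solution unstable.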
Some proposals for deconvolution have been made for cases in which neither the proportions of the cell types in the mixtures nor signature genes are known. These approaches use a variety of methods, such as non-negative matrix factorization [9]. The validations available are limited, however, and the number of samples required for accurate deconvolution may be large [9]. Consequently, when measurements of the proportions of the component cell types in individual samples are not available and signature genes for each cell subtype are unknown, we believe that the status of deconvolution of expression profiles of mixtures is less clear.
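To make the fully unsupervised setting concrete, the following sketch factors a non-negative expression matrix into two non-negative factors using the generic Lee-Seung multiplicative updates for squared error. It is a textbook form of non-negative matrix factorization, not the specific algorithm of any cited publication, and the data are simulated; the function name `nmf` and its parameters are assumptions made for the example.

```python
import numpy as np

def nmf(X, n_components, n_iter=500, seed=0, eps=1e-9):
    """Factor non-negative X (genes x samples) as S @ W by multiplicative updates.

    S: (n_genes, n_components) candidate cell-type expression profiles
    W: (n_components, n_samples) candidate mixing weights
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    S = rng.uniform(0.1, 1.0, size=(n_genes, n_components))
    W = rng.uniform(0.1, 1.0, size=(n_components, n_samples))
    for _ in range(n_iter):
        # Lee-Seung updates for the squared-error objective; eps avoids 0/0.
        W *= (S.T @ X) / (S.T @ S @ W + eps)
        S *= (X @ W.T) / (S @ W @ W.T + eps)
    return S, W

# Toy data (made up): 2 hidden cell types, 30 genes, 12 samples.
rng = np.random.default_rng(42)
true_S = rng.uniform(0.0, 10.0, size=(30, 2))
true_W = rng.dirichlet(np.ones(2), size=12).T   # each column sums to 1
X = true_S @ true_W

S_hat, W_hat = nmf(X, n_components=2)
rel_error = np.linalg.norm(X - S_hat @ W_hat) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.4f}")
```

A small reconstruction error does not imply that the true profiles or proportions have been recovered. The factors are determined at best up to the ordering and scaling of components, and in general the factorization is not unique, which is consistent with the caution expressed above.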
Identifying genes that are differentially expressed among groups of diseased tissue samples is a frequent objective of gene expression profiling. Many of the publications referenced here ignore class information (such as disease versus normal or responder versus non-responder) in performing the deconvolution and state or imply that the deconvoluted cell-type-specific expression profiles can then be used with standard software packages for investigating class comparisons [6]. This approach is potentially problematic, however, because the deconvoluted expression profiles are no longer statistically independent. Shen-Orr et al. indicate that the deconvolution should be performed separately for each class being compared and that in using permutation tests to assess statistical significance, deconvolution should be repeated for each permutation of class labels.
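The recommendation can be illustrated with a small permutation test in which the deconvolution is rerun for every relabeling of the samples. This is a sketch under simplifying assumptions and is not csSAM itself: the test statistic is a plain absolute difference of least-squares estimates, no false discovery rate is computed, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def deconvolve(mixtures, proportions):
    """Per-gene least squares: (n_cell_types, n_genes) expression estimates."""
    coef, _, _, _ = np.linalg.lstsq(proportions, mixtures, rcond=None)
    return coef

def class_difference(mixtures, proportions, labels):
    """Deconvolve each class separately; return class 1 minus class 0."""
    in_class1 = labels == 1
    return (deconvolve(mixtures[in_class1], proportions[in_class1])
            - deconvolve(mixtures[~in_class1], proportions[~in_class1]))

def permutation_pvalues(mixtures, proportions, labels, n_perm=200, seed=0):
    """Permutation p-value for each (cell type, gene) pair."""
    rng = np.random.default_rng(seed)
    observed = np.abs(class_difference(mixtures, proportions, labels))
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)
        # The deconvolution is repeated for this permutation of class labels.
        permuted = np.abs(class_difference(mixtures, proportions, shuffled))
        exceed += permuted >= observed
    return (exceed + 1.0) / (n_perm + 1.0)

# Simulated data: 2 cell types, 3 genes, 20 samples per class.
rng = np.random.default_rng(3)
n_per_class = 20
labels = np.repeat([0, 1], n_per_class)
proportions = rng.dirichlet(np.ones(2) * 2, size=2 * n_per_class)
expr_class0 = np.full((2, 3), 5.0)
expr_class1 = expr_class0.copy()
expr_class1[0, 0] = 9.0   # only cell type 0, gene 0 differs between classes
mixtures = np.where(labels[:, None] == 1,
                    proportions @ expr_class1,
                    proportions @ expr_class0)
mixtures = mixtures + rng.normal(scale=0.3, size=mixtures.shape)

pvals = permutation_pvalues(mixtures, proportions, labels)
print(np.round(pvals, 3))   # rows: cell types, columns: genes
```

Deconvolving once on all samples and then permuting the resulting profiles would treat the estimates as independent observations, which they are not; repeating the fit inside the loop keeps the null distribution consistent with how the observed statistic was computed.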