We have created a novel algorithm for predicting in vitro drug response from a signature of basal gene expression. Unlike previous methods, this approach incorporates multivariate interactions among the input variables (gene expression levels), automatically identifies core cell lines associated with each drug, and models drug response as a continuous variable. As demonstrated, this approach outperforms a comparable method based on univariate differential gene expression.
Although statistical tests of differential gene expression have been an important tool for the analysis of microarray data, interactions between the biological pathways that drive gene expression levels provide another layer of information that can be mined using multivariate approaches such as Random Forest. Since regression trees incorporate variable interactions as a natural consequence of data partitioning, they provide an ideal algorithmic approach for incorporating variable interactions in the creation of a gene expression signature. Techniques for explicitly encoding gene–gene interactions, such as multifactor dimensionality reduction (MDR), may also be worthwhile to investigate in future work. Although single trees do not generally provide the statistical power of other multivariate techniques, ensemble methods such as Random Forest that randomly sample from both cases and input variables have been shown to be competitive with class-leading techniques such as support vector machines and stochastic gradient boosting (Diaz-Uriarte and Alvarez de Andres, 2006). In addition, Random Forest requires little or no parameter tuning and is therefore suitable for machine-learning tasks, such as an in silico screen, that require the creation of a large number of statistical models.
The heterogeneity of cell line panels such as the NCI-60 presents a challenge to the creation of drug gene expression signatures. Previous workers have created models using only those NCI-60 cell lines that show extreme values of IC50 response to a particular drug. However, defining resistant and sensitive cell lines becomes problematic because many drugs show IC50 distributions across the NCI-60 that are neither normal nor uniform. Using a rank-based definition of drug sensitivity may also produce non-optimal training sets for drugs in which the IC50 distribution is skewed. To overcome these obstacles, we created a novel approach to identify core cell lines for each drug using the case proximity metric in Random Forest. We note that another group has recently published a method for associating drugs with sets of core cell lines (Kutalik et al., 2008). However, that approach is based on a fully linear method, does not incorporate variable interactions and was not used to develop predictive models of drug response.
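As a minimal illustration of the case proximity metric, the sketch below uses the standard Random Forest definition, in which the proximity of two cases is the fraction of trees in which they fall in the same terminal node. The function names, the toy leaf assignments and the mean-proximity threshold used to pick "core" cases are assumptions made for illustration only; they are not the selection rule used in this study.

```python
from itertools import combinations


def proximity_matrix(leaf_ids):
    """Pairwise case proximity from terminal-node assignments.

    leaf_ids[i][t] is the terminal node that case i reaches in tree t.
    Proximity of two cases = fraction of trees in which they share a node.
    """
    n_cases = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for i in range(n_cases):
        prox[i][i] = 1.0  # a case always shares a node with itself
    for i, j in combinations(range(n_cases), 2):
        shared = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def core_cases(prox, threshold):
    """Hypothetical rule: keep cases whose mean proximity to all
    other cases is at least `threshold` (an illustrative assumption)."""
    n_cases = len(prox)
    core = []
    for i in range(n_cases):
        others = [prox[i][j] for j in range(n_cases) if j != i]
        if sum(others) / len(others) >= threshold:
            core.append(i)
    return core


# Toy example: four cell lines, four trees (made-up node labels).
toy_leaf_ids = [
    [1, 2, 1, 3],
    [1, 2, 1, 3],
    [1, 2, 2, 3],
    [4, 5, 6, 7],  # never shares a terminal node with the others
]
toy_prox = proximity_matrix(toy_leaf_ids)
toy_core = core_cases(toy_prox, threshold=0.4)
```

In this toy example the fourth case never shares a terminal node with any other case, so its mean proximity is zero and it is excluded from the core set, while the first three cases are retained.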
Functional screens that combine basal gene expression and drug response from panels of cell lines such as the NCI-60 may prove to be an important tool for the discovery of compound leads—especially for complex and heterogeneous diseases such as cancer. By experimentally testing the inhibitory profiles of 40 FDA-approved cancer drugs in seven glioma cell lines, we have provided one of the most complete validation tests to date of this approach. The predictive algorithm that we have developed can be generalized to other problems in machine learning that require the generation of predictive signatures from large numbers of input variables that may exhibit a high degree of noise and self-correlation.