Over the past decade, DNA microarray technology has proven to be a powerful tool for exploring gene expression patterns in biological systems. Many medical applications of microarrays involve class prediction, that is, prediction of a categorical class or phenotype based on the expression profile of the patient. The classes often represent diagnostic categories or binary treatment response. For example, Wang et al. [1] developed a gene-expression-based predictor of whether a patient with advanced melanoma would respond to IL2-based treatment.
Challenges arise in the development and validation of predictive models in settings where the number of candidate predictors (p) is much larger than the number of cases (n). Many algorithms have been studied for developing and evaluating gene-expression-based predictors of a categorical class variable. Widely used classification methods include the compound covariate predictor [2], diagonal linear discriminant analysis [3], nearest neighbor [4] and shrunken centroid methods [5], support vector machines [6], and random forests [7], all of which are available in the BRB-ArrayTools software, provided without charge for non-commercial purposes by the National Cancer Institute [8]. Sophisticated methods of complete cross-validation or bootstrap re-sampling make efficient use of the data and avoid biased estimates of predictive accuracy [9]. Methods for predicting survival risk based on censored survival times and microarray data have been described by several authors and recently compared by Bovelstad et al. [10]. Methods of complete cross-validation are much less developed for such settings, and most published studies involving survival prediction transform the outcome data into discrete categories (see the review by Dupuy & Simon [11]).
There have been relatively few publications using linear regression models to predict a continuous response based on microarray expression profiles. Standard linear regression methods are problematic when the number of predictor variables exceeds the number of cases because X'X is singular, where X is the design matrix, so the least-squares coefficients are not uniquely determined. Software available to biomedical investigators has not included the more sophisticated methods needed for developing and properly validating continuous response models in the p > n setting. One example of a study predicting a continuous response is that of Bibikova et al. [12], who identified a group of 16 genes significantly associated with Gleason scores for prostatic carcinomas. They avoided the p > n problem by first identifying 16 genes that individually appeared predictive of the Gleason score, and then fitting a single-variable linear regression model for each of the 16 genes. The final predicted Gleason grade for each sample was the average of the 16 predicted values, one from each independently fitted model. Although this method has the merit of simplicity, the method of validation they used was problematic, and consequently their model requires further validation with an independent data set.
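The averaging strategy described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used by Bibikova et al.: the function names (`fit_line`, `top_genes`, `fit_averaged_model`, `predict_averaged`) are hypothetical, and ranking genes by absolute Pearson correlation with the response is an assumed selection rule, since the text says only that the genes "individually appeared predictive".

```python
def fit_line(x, y):
    """Least-squares fit of y on a single predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def abs_corr(x, y):
    """Absolute Pearson correlation (0.0 if either variable is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def top_genes(X, y, k):
    """Indices of the k genes most correlated with y (assumed selection rule).

    X is a list of samples; each sample is a list of p expression values.
    """
    p = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]


def fit_averaged_model(X, y, k):
    """Fit one single-gene regression per selected gene.

    Returns a list of (gene_index, intercept, slope) tuples.
    """
    model = []
    for j in top_genes(X, y, k):
        intercept, slope = fit_line([row[j] for row in X], y)
        model.append((j, intercept, slope))
    return model


def predict_averaged(model, sample):
    """Average the predictions of the single-gene models for one sample."""
    preds = [intercept + slope * sample[j] for j, intercept, slope in model]
    return sum(preds) / len(preds)


# Toy data: gene 0 determines the response exactly (y = 1 + 2 * gene 0),
# gene 1 is unrelated noise.
X_demo = [[0.0, 5.0], [1.0, -1.0], [2.0, 4.0], [3.0, 0.0]]
y_demo = [1.0, 3.0, 5.0, 7.0]
model = fit_averaged_model(X_demo, y_demo, k=1)
prediction = predict_averaged(model, [4.0, 9.9])
print(model, prediction)
```

Because each regression involves a single predictor, no p x p matrix is ever inverted, which is how this approach sidesteps the singular X'X. Note that gene selection here uses whatever samples are passed to `fit_averaged_model`, so any sample used for selection cannot also serve as an unbiased test case.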
To properly estimate the accuracy of a prediction model, the test set cannot be used for selecting the genes to be included in the model or for estimating the parameters of the model. This key principle of separating the data used for model development from the data used for model validation must be carefully observed when using either a split-sample or a cross-validation approach to estimating prediction accuracy. A simulation study [13] shows the importance of cross-validating all steps of model building when estimating the error rate, especially the feature selection step, which is often overlooked. Enormous bias in the estimate of prediction error can result if the full dataset is used for gene selection and sample splitting or cross-validation is applied only to fitting a model based on those selected genes. Unfortunately, the survey by Dupuy and Simon [11] indicated that improper use of incomplete cross-validation is prevalent in the published literature on class prediction methods, and this problem also occurred in the study of Bibikova et al. [12].
We have evaluated three linear regression algorithms that can be used for prediction of a continuous response based on high-dimensional gene expression data. The first two algorithms are Least Angle Regression (LAR) [14] and LASSO [15]. LASSO is a penalized regression method. It identifies regression coefficients for all genes by minimizing a weighted sum of the mean squared prediction error for cases in the training set and the sum of the absolute values of all regression coefficients. The weighting factor is optimized by cross-validation. LAR can be viewed as an accelerated version of forward stagewise regression [16,17]. The algorithm developed by Efron et al. [14] is highly efficient and can also be used to find the LASSO solution. Both methods develop relatively parsimonious models and do not require a prior gene selection step. There have been many applications of LASSO in different fields, for example to protein mass spectrometry data [18] and SNP data [19]. To our knowledge, the use of these models with gene expression profiles to predict a continuous outcome has not been reported. The third algorithm we evaluated is the averaged linear regression (ALM) method used by Bibikova et al. [12]. We used an unbiased complete cross-validation approach in order to obtain a correct error estimate for the model. All methods were tested using simulations in which the gene expression levels were based on a real dataset, and by analysis of two sets of real gene expression data.
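The difference between complete and incomplete cross-validation can be made concrete with a small simulation sketch. Everything in it is an illustrative assumption rather than the study's actual procedure: the data are pure noise (no gene is truly related to the response), genes are ranked by absolute correlation, the predictor is an average of single-gene regressions, and the names `cv_mse` and `select_inside_folds` are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y on a single predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def abs_corr(x, y):
    """Absolute Pearson correlation (0.0 if either variable is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute correlation with y."""
    p = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]


def cv_mse(X, y, k, n_folds, select_inside_folds):
    """Cross-validated mean squared error of the averaged single-gene predictor.

    select_inside_folds=True  -> complete CV: genes are re-selected from the
                                 training part of every fold.
    select_inside_folds=False -> incomplete CV: genes are selected once from
                                 all samples, including future test cases.
    """
    n = len(y)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    genes_all = None if select_inside_folds else top_genes(X, y, k)
    squared_error = 0.0
    for test_idx in folds:
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        genes = top_genes(X_tr, y_tr, k) if select_inside_folds else genes_all
        # Regression coefficients always come from the training part only.
        lines = [(j,) + fit_line([row[j] for row in X_tr], y_tr) for j in genes]
        for i in test_idx:
            pred = sum(a + b * X[i][j] for j, a, b in lines) / len(lines)
            squared_error += (pred - y[i]) ** 2
    return squared_error / n


random.seed(0)
n, p, k = 40, 1000, 10
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]  # unrelated to every gene

mse_incomplete = cv_mse(X, y, k, n_folds=5, select_inside_folds=False)
mse_complete = cv_mse(X, y, k, n_folds=5, select_inside_folds=True)
print("incomplete CV MSE:", round(mse_incomplete, 3))
print("complete CV MSE:  ", round(mse_complete, 3))
```

Since the response is independent of every gene, no predictor can genuinely do better than the training mean. The incomplete estimate nevertheless tends to look better, because the selected genes were chosen using the held-out responses; only the complete procedure reflects the absence of signal.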