Microarray experiments that monitor gene expression profiles associated with different disease phenotypes have become commonplace in biomedical research. Each DNA sequence represented in microarrays can be considered a potential biomarker. Classification and prediction using such genomic measurements may contribute to a better understanding of the genetic pathways involved in diseases, and hence may lead to better diagnosis of disease and better prescription of treatment. Analysis of genomic data is challenging due to the high dimensionality of the data and the low sample size. Although the number of genes assayed is large, there may be only a small number of genes that are associated with variations of phenotypes. When standard methods are applied directly, the resulting estimates are usually not "regular", i.e., they are not unique or are ill-behaved. Regularization, through which we achieve unique and well-behaved estimates, is usually needed. Regularization can be achieved via model reduction or variable selection methods.
Several dimension reduction techniques have been employed for classification using genomic data. Examples include partial least squares [1], principal component regression [2], and singular value decomposition under the Bayesian framework [3], among others. By using low dimensional projections of covariates as features in model estimation, one may obtain estimators with better prediction performance due to the bias-variance tradeoff. One drawback of such dimension reduction techniques is that all genes are used in estimation and prediction, so biological interpretation of such classifiers is usually not straightforward. Moreover, if certain genes are not associated with the clinical outcome, it is important to exclude them from the predictive model.
An alternative to the dimension reduction techniques is to use methods that are capable of simultaneous biomarker selection and model fitting, which can be realized by penalization. Such methods include the least absolute shrinkage and selection operator (LASSO) [5], the least angle regression (LARS) [6], and the threshold gradient directed regularization method (TGDR) [7]. These methods can produce parsimonious models with a small number of biomarkers and hence clearer biological interpretations.
In this article, we propose an approach for simultaneous estimation and biomarker selection using a scaled TGDR method.
It is important to assess both false-positive and false-negative errors, since the clinical and financial consequences of the two types of errors can differ significantly. A common practice is to use the receiver operating characteristic (ROC) curve [8], where the classification performance can be measured by the area under the ROC curve (AUC). Advantages of the ROC method include: (1) it does not assume a parametric form of the class probability, unlike logistic regression (although one can construct an ROC curve from a logistic regression fit, that fit assumes a parametric form of the class probability, and this assumption may not be satisfied); (2) it is adaptable to outcome-dependent sampling, for example the case-control design; and (3) it is capable of penalizing false positives and false negatives differently. Therefore, the ROC method may be preferable in biomarker selection and classification using genomic measurements.
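As a minimal sketch of the AUC discussed above, the empirical AUC of a scalar classification score can be computed as the fraction of (diseased, healthy) pairs in which the diseased subject has the higher score, with ties counted as one half. The function name `empirical_auc` and the argument names are assumptions made for illustration only; they do not come from the cited work.

```python
import numpy as np


def empirical_auc(scores_diseased, scores_healthy):
    """Empirical AUC of a scalar score (illustrative helper, hypothetical name).

    Returns the proportion of (diseased, healthy) pairs ranked correctly,
    counting tied pairs as 1/2.
    """
    d = np.asarray(scores_diseased, dtype=float)[:, None]  # column vector
    h = np.asarray(scores_healthy, dtype=float)[None, :]   # row vector
    # Broadcasting builds the full n_diseased x n_healthy comparison table.
    correct = (d > h).astype(float) + 0.5 * (d == h)
    return float(correct.mean())


# Toy scores, invented purely for illustration.
auc = empirical_auc([1.0, 3.0], [2.0, 2.0])
```

Because the statistic only compares diseased subjects with healthy subjects, it involves no parametric model for the class probability, and it does not depend on the ratio of cases to controls in the sample.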
] proposed the empirical AUC as the objective function for combining multiple biomarkers in a low dimensional setting. Ma and Huang [11] proposed a smooth sigmoid approximation of the empirical AUC for high-dimensional data. An alternative to the empirical AUC is the binormal AUC [8]. The binormal AUC technique was developed in parallel with, but separately from, the empirical AUC method. For small sample sizes, the empirical AUC may change dramatically under small perturbations of the data and differ significantly from the expected AUC, whereas the binormal AUC is more stable. Studies with low-dimensional biomarkers show that the binormal AUC may provide valuable information beyond the empirical AUC [8]. For data with high dimensional covariates or large sample sizes, the binormal AUC is computationally much cheaper than the empirical AUC. Since both the empirical AUC and the binormal AUC are extensively used in low dimensional settings, it is of great interest to extend the study of [11] and explore the use of the binormal AUC for disease classification with microarray data.
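To make the contrast between the two objective functions concrete, the following is a minimal sketch rather than the implementation used in this article or in [11]. It assumes the standard binormal form, in which the scores of each class are treated as normally distributed so that the AUC equals Φ((μ_D − μ_H) / sqrt(σ_D² + σ_H²)), and it uses a generic sigmoid smoothing of the pairwise indicator. The function names and the smoothing parameter `sigma` are hypothetical.

```python
import math

import numpy as np


def binormal_auc(scores_diseased, scores_healthy):
    """Binormal AUC: Phi((mu_D - mu_H) / sqrt(var_D + var_H)).

    Assumes the scores in each class are approximately normal. Needs at
    least two observations per class and a non-zero total variance.
    """
    d = np.asarray(scores_diseased, dtype=float)
    h = np.asarray(scores_healthy, dtype=float)
    sd_of_difference = math.sqrt(d.var(ddof=1) + h.var(ddof=1))
    z = (d.mean() - h.mean()) / sd_of_difference
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sigmoid_auc(scores_diseased, scores_healthy, sigma=0.1):
    """Smoothed empirical AUC: the indicator I(d > h) is replaced by a sigmoid.

    `sigma` is an assumed smoothing parameter; smaller values track the
    empirical AUC more closely.
    """
    d = np.asarray(scores_diseased, dtype=float)[:, None]
    h = np.asarray(scores_healthy, dtype=float)[None, :]
    # sigmoid(x) = 0.5 * (1 + tanh(x / 2)), which avoids overflow in exp.
    smooth = 0.5 * (1.0 + np.tanh((d - h) / (2.0 * sigma)))
    return float(smooth.mean())
```

In this sketch the binormal AUC needs only the two class means and variances, whereas the pairwise versions compare every diseased subject with every healthy subject, which is consistent with the difference in computational cost noted above.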
In this article, we propose an approach for biomarker selection and classification with microarray data by optimizing the binormal AUC. The scaled TGDR method, which is a modified version of the TGDR, is adopted for estimation and biomarker selection. Tuning parameters are selected using V-fold cross validation, and Monte Carlo based methods are proposed for evaluation purposes. We assess the proposed approach by extensive simulation studies and demonstrate it on two cancer datasets. Compared with the method in [11], which uses a smoothed version of the empirical AUC as the objective function, the contributions of this paper are as follows. First, the binormal AUC, which is at least as important as the empirical AUC, is used as the objective function. Second, we propose using the scaled TGDR, which can significantly reduce the computational cost. Third, we propose the occurrence index, which is a way to rank the selected biomarkers and measure their relative stability in the presence of sampling variation.