


Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 21.

Published online 2009 April 14. doi: 10.2202/1544-6115.1438

PMCID: PMC2861315

Copyright © 2009 The Berkeley Electronic Press. All rights reserved


We propose a method for prediction in Cox's proportional hazards model when the number of features (regressors), *p*, exceeds the number of observations, *n*. The method assumes that the features are independent in each risk set, so that the partial likelihood factors into a product over features. As such, it is analogous to univariate thresholding in linear regression and to nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage (CUS). The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We illustrate the new method on real and simulated data, and compare it to other proposed methods for survival prediction with a large number of predictors.

High-dimensional regression problems are challenging, and are a current focus of statistical research. One simplifying assumption that can be useful is independence of the features. While this assumption will rarely be even approximately true, in practice it leads to simple estimates whose predictive performance is sometimes as good as, or better than, that of more ambitious multivariate methods.

In linear regression with squared error loss and a lasso penalty, independence of the features leads to the *univariate soft-thresholding* estimates, which often behave well.

In classification problems, the idea of class-wise independence forms the basis for diagonal discriminant analysis. In that setting, if we assume that the features are independent within each class, and incorporate a lasso penalty, the resulting estimator is the *nearest shrunken centroid* procedure proposed by Tibshirani et al. (2001); we review this fact in the next section. The nearest shrunken centroid classifier is very simple but useful in practice, and has many attractive properties. It has a single tuning parameter that automatically selects the features retained in the model, and it provides a single ranking of all features that is independent of the value of the tuning parameter.

We review the details of the regression and classification problems in the next section. Then we extend the idea of class-wise feature independence to the Cox model for survival data: this is the main contribution of the paper. We study this new procedure on both real and simulated datasets, and compare it to competing methods for this problem.

Consider the usual regression situation: we have data (**x**_{i}, *y*_{i}), *i* = 1, 2, . . . , *N*, where **x**_{i} = (*x*_{i1}, . . . , *x*_{ip}) is a vector of *p* features and *y*_{i} is the response for observation *i*. The lasso estimates the coefficients *β* = (*β*_{1}, . . . , *β*_{p}) by minimizing

$$\sum _{i=1}^{N}{({y}_{i}\hspace{0.17em}-\hspace{0.17em}\sum _{j}{\beta}_{j}{x}_{ij})}^{2}\hspace{0.17em}+\hspace{0.17em}\lambda \sum _{j=1}^{p}\left|{\beta}_{j}\right|.$$

(1)

Now if we assume that the features are independent (uncorrelated), it is easy to show that

$${\widehat{\beta}}_{j}=\text{sign}({\widehat{\beta}}_{j}^{\text{o}}){(|{\widehat{\beta}}_{j}^{\text{o}}|-\lambda )}^{+}$$

(2)

where
${\widehat{\beta}}_{j}^{\text{o}}$ are the ordinary (univariate) least squares estimates. These are *univariate soft-thresholded* estimates, and they perform particularly well when *p* ≫ *n*. Zou & Hastie (2005) study these estimates as a special case of their elastic net procedure. See Fan & Lv (2008) for some theoretical support of univariate soft-thresholding.
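As a concrete illustration, here is a small numerical sketch of the univariate soft-thresholding estimates (2). The helper name `univariate_soft_threshold` is ours, not the paper's, and the scale of *λ* assumes each feature has been standardized.

```python
import numpy as np

def univariate_soft_threshold(X, y, lam):
    """Univariate soft-thresholded regression estimates (eq. 2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Univariate least-squares estimate for each feature j: x_j^T y / x_j^T x_j
    beta_ols = X.T @ y / np.sum(X**2, axis=0)
    # Soft-threshold: sign(b) * (|b| - lam)_+
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
```

With orthonormal features this reproduces the lasso solution exactly; with merely uncorrelated features it is the estimate obtained by ignoring the cross-products.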

Here we review the nearest shrunken centroid method for high-dimensional classification and its connection to the lasso. Consider the *diagonal-covariance* linear discriminant rule for classification. The *discriminant score* for class *k* is

$${\delta}_{k}\left({x}^{*}\right)\hspace{0.17em}=\hspace{0.17em}-\sum _{j=1}^{p}\frac{{\left({x}_{j}^{*}\hspace{0.17em}-\hspace{0.17em}{\overline{x}}_{kj}\right)}^{2}}{{s}_{j}^{2}}\hspace{0.17em}+\hspace{0.17em}2\hspace{0.17em}\text{log}\hspace{0.17em}{\pi}_{k}.$$

(3)

Here
${x}^{*}={({x}_{1}^{*},{x}_{2}^{*},\dots ,{x}_{p}^{*})}^{T}$ is a vector of expression values for a test observation, *s*_{j} is the pooled within-class standard deviation of the *j*th feature, ${\overline{x}}_{kj}$ is the mean of feature *j* in class *k*, and *π*_{k} is the prior probability of class *k*. The classification rule is then

$$C({x}^{*})=\ell \hspace{0.17em}\text{if}\hspace{0.17em}{\delta}_{\ell}({x}^{*})={\text{max}}_{k}\hspace{0.17em}{\delta}_{k}({x}^{*}).$$

(4)

We see that the diagonal LDA classifier is equivalent to a nearest centroid classifier after appropriate standardization.
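The rule (3)-(4) can be sketched in a few lines. The helper name `diag_lda_predict` and the dictionary layout for centroids and priors are our illustrative assumptions, not part of the paper.

```python
import numpy as np

def diag_lda_predict(x_star, centroids, s, priors):
    """Classify x_star by the diagonal LDA rule (3)-(4)."""
    scores = {}
    for k, mu in centroids.items():
        # discriminant score: minus standardized squared distance to the
        # class centroid, plus twice the log prior
        scores[k] = -np.sum((x_star - mu) ** 2 / s ** 2) + 2.0 * np.log(priors[k])
    # assign to the class with the largest discriminant score
    return max(scores, key=scores.get)
```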

The nearest shrunken centroid classifier regularizes this further, and is defined as follows. Let

$${d}_{kj}\hspace{0.17em}=\hspace{0.17em}\frac{{\overline{x}}_{kj}\hspace{0.17em}-\hspace{0.17em}{\overline{x}}_{j}}{{m}_{k}\left({s}_{j}\hspace{0.17em}+\hspace{0.17em}{s}_{0}\right)},$$

(5)

where ${\overline{x}}_{j}$ is the overall mean for gene *j*, ${m}_{k}=\sqrt{1/{n}_{k}-1/n}$, and *s*_{0} is a small positive constant. Each *d*_{kj} is soft-thresholded to give ${d}_{kj}^{\prime}=\text{sign}({d}_{kj}){(|{d}_{kj}|-\mathrm{\Delta})}^{+}$, and the class centroids are shrunken toward the overall mean: ${\overline{x}}_{kj}^{\prime}={\overline{x}}_{j}+{m}_{k}({s}_{j}+{s}_{0}){d}_{kj}^{\prime}$. These shrunken centroids replace the original centroids in the discriminant score (3).

Finally, we can derive the nearest shrunken centroid classifier as the solution to a lasso-regularized problem that assumes independence of the features. Consider a (naive Bayes) Gaussian model for classification in which the features *j* = 1, 2, . . . , *p* are assumed to be independent within each class *k* = 1, 2, . . . , *K*. With observations *i* = 1, 2, . . . , *n* and *C*_{k} equal to the set of indices of the *n*_{k} observations in class *k*, we estimate the class means by solving

$$\underset{\left\{{\mu}_{j},{\mu}_{jk}\right\}}{\text{min}}\hspace{0.17em}\left\{\frac{1}{2}\sum _{j=1}^{p}\sum _{k=1}^{K}\sum _{i\in {C}_{k}}\frac{{\left({x}_{ij}\hspace{0.17em}-\hspace{0.17em}{\mu}_{j}\hspace{0.17em}-\hspace{0.17em}{\mu}_{jk}\right)}^{2}}{{s}_{j}^{2}}\hspace{0.17em}+\hspace{0.17em}\lambda \sum _{j=1}^{p}\sum _{k=1}^{K}\sqrt{{n}_{k}}\frac{\left|{\mu}_{jk}\right|}{{s}_{j}}\right\}.$$

(6)

By direct calculation, it is easy to show that the solution is equivalent to the nearest shrunken centroid estimator (5), with *s*_{0} set to zero and *m*_{k} equal to $1/\sqrt{{n}_{k}}$.
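The centroid-shrinkage computation can be sketched as follows. This is a minimal sketch: `shrunken_centroids` is a hypothetical helper name, Δ plays the role of the soft-threshold derived from λ, and we take ${m}_{k}=\sqrt{1/{n}_{k}-1/n}$ as in the original nearest shrunken centroid proposal.

```python
import numpy as np

def shrunken_centroids(X, labels, delta, s0=0.0):
    """Compute nearest-shrunken-centroid class centroids from (5)."""
    classes = np.unique(labels)
    n, p = X.shape
    xbar = X.mean(axis=0)
    # pooled within-class standard deviation s_j
    ss = np.zeros(p)
    for k in classes:
        Xk = X[labels == k]
        ss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    s = np.sqrt(ss / (n - len(classes)))
    centroids = {}
    for k in classes:
        nk = (labels == k).sum()
        mk = np.sqrt(1.0 / nk - 1.0 / n)
        # standardized centroid deviations d_kj of eq. (5)
        d = (X[labels == k].mean(axis=0) - xbar) / (mk * (s + s0))
        # soft-threshold, then shrink the centroid toward the overall mean
        d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
        centroids[k] = xbar + mk * (s + s0) * d_shrunk
    return centroids
```

With Δ = 0 the centroids are the ordinary class means; as Δ grows, every centroid collapses onto the overall mean and the corresponding features drop out of the classifier.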

For the rest of this paper we consider the right-censored survival data setting. The data available are of the form (*y*_{1}, **x**_{1}, *δ*_{1}), . . . , (*y*_{n}, **x**_{n}, *δ*_{n}), where *y*_{i} is the survival time, **x**_{i} = (*x*_{i1}, . . . , *x*_{ip}) the feature vector, and *δ*_{i} the censoring indicator for individual *i*: *δ*_{i} = 1 if the failure time is observed and *δ*_{i} = 0 if it is censored.

The proportional-hazards model for survival data, also known as the Cox model, assumes that

$$\lambda (t|x)\hspace{0.17em}=\hspace{0.17em}{\lambda}_{0}(t)\hspace{0.17em}\text{exp}\left(\sum _{j}{x}_{j}{\beta}_{j}\right)$$

(7)

where *λ*(*t*|**x**) is the hazard at time *t* given predictor values **x** = (*x*_{1}, . . . , *x*_{p}), and *λ*_{0}(*t*) is an unspecified baseline hazard function.

One usually estimates the parameter *β* = (*β*_{1}, *β*_{2}, . . . , *β*_{p}) by maximizing the partial likelihood

$$\text{PL}(\beta )\hspace{0.17em}=\hspace{0.17em}\prod _{k\in D}\frac{\text{exp}({x}_{k}\beta )}{\left\{{\sum}_{m\in {R}_{k}}\text{exp}({x}_{m}\beta )\right\}}.$$

Here *D* is the set of indices of the failure times, and *R*_{k} is the set of indices of the individuals at risk at time *y*_{k}.

If *p* > *n*, the maximizer of the partial likelihood is not unique, and some regularization must be used. One possibility is to include a “lasso” penalty and jointly optimize over the parameters, as suggested by Tibshirani (1997).

Here we extend the idea of feature independence to the Cox model for survival data. Now the partial likelihood term exp(**x**_{k}*β*)/Σ_{m∈R_{k}} exp(**x**_{m}*β*) equals Pr(*k* | **x**_{i}, *i* ∈ *R*_{k}), the probability that individual *k* is the one who fails at that time, given the risk set and given that exactly one failure occurs. If we now assume that the features are independent within each risk set, this probability factors over the features, and the partial likelihood becomes

$$\text{PL}(\beta )\hspace{0.17em}=c\hspace{0.17em}\cdot \hspace{0.17em}\prod _{j=1}^{p}\hspace{0.17em}\prod _{k\in D}\frac{\text{exp}({x}_{kj}{\beta}_{j})}{\left\{{\sum}_{m\in {R}_{k}}\text{exp}({x}_{mj}{\beta}_{j})\right\}}$$

The log partial likelihood is

$$\ell (\beta )\hspace{0.17em}=\hspace{0.17em}\sum _{j=1}^{p}\sum _{k=1}^{K}({x}_{kj}{\beta}_{j}\hspace{0.17em}-\hspace{0.17em}\text{log}\sum _{m\in {R}_{k}}\text{exp}({x}_{mj}{\beta}_{j}))$$

(8)

$$\equiv \hspace{0.17em}\sum _{j=1}^{p}{g}_{j}({\beta}_{j}).$$

(9)

We propose the *Cox univariate shrinkage* (CUS) estimator as the maximizer of the penalized partial log-likelihood

$$J(\beta )\hspace{0.17em}=\hspace{0.17em}\sum _{j=1}^{p}{g}_{j}({\beta}_{j})\hspace{0.17em}-\hspace{0.17em}\lambda \sum _{j=1}^{p}\left|{\beta}_{j}\right|.$$

(10)

Here *λ* ≥ 0 is a tuning parameter, and we solve this problem for a range of *λ* values. Note that maximizing (10) is just a set of *p* one-dimensional maximizations, since we can maximize each function *g*_{j}(*β*_{j}) − *λ*|*β*_{j}| separately.
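Each one-dimensional problem can be handed to any scalar optimizer; the grid search below is a deliberately crude but self-contained sketch. The names `g_j` and `cus_fit_one`, and the data layout (a list of death indices paired with risk sets), are our assumptions for illustration.

```python
import numpy as np

def g_j(beta, x, deaths, risk_sets):
    """Per-feature Cox log partial likelihood g_j(beta_j) from eq. (9)."""
    total = 0.0
    for k, R in zip(deaths, risk_sets):
        # contribution of one failure: x_kj * beta - log sum over the risk set
        total += x[k] * beta - np.log(np.sum(np.exp(x[R] * beta)))
    return total

def cus_fit_one(x, deaths, risk_sets, lam, grid=None):
    """Maximize g_j(b) - lam*|b| over a grid of candidate values."""
    if grid is None:
        grid = np.linspace(-5.0, 5.0, 2001)  # grid includes 0 exactly
    obj = np.array([g_j(b, x, deaths, risk_sets) - lam * abs(b) for b in grid])
    return grid[np.argmax(obj)]
```

In practice one would solve each univariate problem exactly (the objective is concave in *β*_{j}), but the grid version makes the structure of (10) plain: *p* independent scalar problems.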

Note that the estimates will tend to be biased towards zero, especially for larger values of *λ*. In the test set and cross-validation computations in this paper, we fit the single predictor *z* = **x**^{T}$\widehat{\beta}$ in a Cox model on the new (test or validation) data, and so obtain a scale factor that debiases the estimate. If we are interested in the coefficient values themselves, a better approach would be to fit the single predictor *γ***x**_{i}^{T}$\widehat{\beta}$ in a Cox model on the training data, and hence obtain the adjusted estimates $\widehat{\gamma}{\widehat{\beta}}_{j}$.

The lasso penalty above might not make sense if the predictors are in different units. Therefore we first standardize each predictor *x*_{j} by dividing it by its standard deviation, so that all features are on a comparable scale.

Zhao et al. (2005) collected gene expression data on 14,814 genes from 177 kidney cancer patients. Survival times (possibly censored) were also measured for each patient. The data were split into 88 samples to form the training set; the remaining 89 formed the test set. Figure 1 shows the solution paths for Cox univariate shrinkage for 500 randomly chosen genes.

Notice that the solution paths have a particularly nice form: each is monotonically decreasing in absolute value as *λ* increases, and once it hits zero, it stays at zero. It is easy to prove that this holds in general; the proof is given in the Appendix.

Note also that the profiles decrease at different rates. This is somewhat surprising: in the least squares regression case under orthogonality of the predictors, the lasso estimates are simply the soft-thresholded versions of the least squares estimates ${\widehat{\beta}}_{j}^{\text{o}}$, that is, $\text{sign}({\widehat{\beta}}_{j}^{\text{o}}){(|{\widehat{\beta}}_{j}^{\text{o}}|-\lambda )}^{+}$, so every nonzero profile decreases at the same rate as *λ* grows. Here, the amount of shrinkage depends on the individual likelihood functions *g*_{j}.

Let *U*_{j} and *V*_{j} be the gradient of the (unpenalized) log-partial likelihood and the (negative) observed Fisher information for feature *j*, evaluated at *β*_{j} = 0. Then ${U}_{j}/\sqrt{{V}_{j}}$ is the usual Cox score statistic for feature *j*, and one can show that

$${\widehat{\beta}}_{j}(\lambda )\hspace{0.17em}\ne \hspace{0.17em}0\hspace{0.17em}\iff \hspace{0.17em}\left|{U}_{j}\right|/\sqrt{{V}_{j}}\hspace{0.17em}>\hspace{0.17em}\lambda .$$

(11)

The proof is given in the Appendix. Therefore we have the following useful fact:

*For any value of λ, the set of predictors that appear in the model estimated by Cox univariate shrinkage are just those whose score statistic exceeds λ in absolute value*.

That is, the Cox univariate shrinkage method uses a fixed ranking of all features based on the Cox score statistic. This is very convenient for interpretation, as the Cox score is often used for determining the univariate significance of features (e.g. in the SAM procedure of Tusher et al. (2001)). However, the non-zero coefficients given to these predictors are not simply the score statistics, or soft-thresholded versions of them.
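Under this property, determining the active set for any *λ* only requires the score statistics at *β* = 0, where the risk-set weights are uniform. A sketch under the same death-index/risk-set data layout as before; `cox_scores` and `active_set` are hypothetical helper names.

```python
import numpy as np

def cox_scores(X, deaths, risk_sets):
    """Cox score statistics U_j / sqrt(V_j) at beta = 0, one per feature."""
    U = np.zeros(X.shape[1])
    V = np.zeros(X.shape[1])
    for k, R in zip(deaths, risk_sets):
        XR = X[R]
        # at beta = 0 the exp weights are all 1, so the score contribution
        # is the deviation of x_k from the risk-set mean, and the
        # information contribution is the risk-set variance
        U += X[k] - XR.mean(axis=0)
        V += XR.var(axis=0)
    return U / np.sqrt(V)

def active_set(X, deaths, risk_sets, lam):
    """Property (11): feature j enters the model iff |U_j|/sqrt(V_j) > lam."""
    return set(np.where(np.abs(cox_scores(X, deaths, risk_sets)) > lam)[0])
```

Because the ranking is fixed, the active sets are nested: raising *λ* can only drop features, never reorder them.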

Figure 3 shows the drop in test sample deviance for CUS (Cox univariate shrinkage), lasso, SPC (Supervised principal components) and UST (univariate soft thresholding), plotted against the number of genes in the model. The CUS and SPC methods work best.

Figure 4 shows the univariate Cox coefficients plotted against the Cox score statistics, for the kidney cancer data. Although the two measures are correlated, this correlation is far from perfect, and again this underlines the difference between our proposed univariate shrinkage method, which enters features based on the size of the Cox score, and simple soft-thresholding of the univariate coefficients. The latter ranks features based on the size of their *unregularized* coefficient, while the former looks at the standardized effect size at the null end of the scale.

To estimate the tuning parameter *λ* we can use cross-validation. However, it is not clear how to apply cross-validation in the setting of partial likelihood, since this function is not a simple sum of independent terms. Verweij & van Houwelingen (1993) propose an interesting method for leave-one-out cross-validation of the partial likelihood: for each left-out observation they compute the probability of its sample path over the risk sets. For the fitting methods presented in this paper, this leave-one-out cross-validation was too slow, requiring too many model fits. When we instead tried 5- or 10-fold partial likelihood cross-validation, the resulting curve was quite unstable. So we settled on a simpler approach: 5-fold cross-validation in which we fit a Cox model to the left-out one-fifth and record the drop in deviance. The results for the kidney cancer data are shown in Figure 2. The curve is a reasonable estimate of the test deviance curve of Figure 3.

In some microarray datasets, there are duplicate genes, or multiple clones representing the same gene. In this case, the data for the different features will be highly correlated, and it makes sense to average the data for each gene before applying CUS. One can also average genes whose pairwise correlation is very high, whether or not they are supposed to represent the same gene. For example, when we averaged genes with correlation greater than 0.9 in the kidney cancer dataset, the number of genes dropped by a few hundred and the test deviance curve changed very little.
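One simple way to implement this preprocessing is a greedy pass over the features: take the next unmerged feature as a seed, group it with every remaining feature whose absolute correlation with it exceeds the threshold, and replace the group by its average. This is our sketch of the idea, not the paper's exact procedure; `average_correlated` is a hypothetical helper.

```python
import numpy as np

def average_correlated(X, thresh=0.9):
    """Greedily merge features with |correlation| > thresh into averages."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    p = X.shape[1]
    used = np.zeros(p, dtype=bool)
    merged = []
    for j in range(p):
        if used[j]:
            continue
        # the group contains j itself (C[j, j] = 1) plus any unmerged
        # feature strongly correlated with it
        group = np.where((C[j] > thresh) & ~used)[0]
        used[group] = True
        merged.append(X[:, group].mean(axis=1))
    return np.column_stack(merged)
```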

The CUS procedure enters predictors into the model based on their univariate Cox scores. Thus if two predictors are strongly predictive and correlated with each other, both will appear in the model. In some situations it is desirable to pare down the predictors to a smaller, more independent set. The lasso does this selection automatically, but sometimes doesn’t predict as well as other methods (as in Figure 3). An alternative approach is *pre-conditioning* (Paul et al. 2008), in which we apply the standard (regression) lasso to the fitted values from a model. In Paul et al. (2008) the initial fitting method used was supervised principal components. Here we start with the CUS fit using 500 genes, which is approximately the best model from Figure 2. We then apply the regression version of the lasso to the fitted values, and we report the drop in test deviance in Figure 5, along with that for CUS itself (green curve copied from Figure 3). We see that pre-conditioning gives about the same drop in test deviance as the 500-gene CUS model, but using fewer than 100 genes. And it performs better here than the Cox model lasso (blue curve in Figure 3), a finding supported theoretically in Paul et al. (2008). Figure 6 shows the univariate Cox scores for the predictors entered by each method. We see that the pre-conditioning approach enters only predictors that are individually strong, while the usual lasso enters many predictors that are individually weak.
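The regression step of pre-conditioning needs nothing more than an ordinary lasso solver applied to the fitted values; any implementation would do. A minimal cyclic coordinate-descent sketch (`lasso_cd` is a hypothetical helper):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Cyclic coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove all features except j from y
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            # univariate lasso update: soft-threshold the inner product
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta
```

Pre-conditioning then amounts to running this solver with the CUS fitted values as the response, e.g. `lasso_cd(X, X @ beta_cus, lam)`, so the lasso sees a denoised target rather than the raw survival outcome.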

We generated Gaussian samples of size *n* = 50 with *p* = 1000 predictors, with population correlation *ρ*(*x*_{i}, *x*_{j}) = *ρ* between each pair of predictors. The outcome was then generated from a proportional hazards model in which only a subset of the predictors had nonzero coefficients.

Figure 8 shows the number of genes correctly included in the model versus the number of genes in the model, for each of CUS and lasso. Neither method is very good at isolating the true non-zero predictors, but surprisingly the lasso does no better than the simple CUS procedure.

In Figures 9 and 10 we chose the model complexity by cross-validation and report the drop in test set deviance (Figure 9) and the C-index (Figure 10). The latter is the time-integrated area under the ROC curve, appropriate for survival studies (Heagerty & Zheng 2005). In every case CUS performs on average as well as or better than the lasso and supervised principal components.

We applied three methods (lasso, supervised principal components, and CUS) to three real datasets. The first is the kidney cancer data discussed earlier; the others are the diffuse large cell lymphoma data of Rosenwald et al. (2002) (7399 genes and 240 patients) and the breast cancer data of van ’t Veer et al. (2002) (24,881 genes and 295 patients). We randomly divided each dataset into training and test sets ten times and averaged the results. The first two datasets came with test sets of size 88 and 80 respectively, so we retained those test set sizes. For the third dataset we took a test set of about one-third (98 patients), and to ease the computational load, restricted attention to the 10,000 genes with largest variance. Figure 11 shows the median test deviance plus and minus one standard error over the 10 random splits, plotted against the number of genes in each model. The CUS method yields a larger drop in test deviance than the lasso in at least two of the datasets, and performs a little better than SPC overall.

The R package uniCox implementing the methods of this paper will soon be available on CRAN.

I thank Daniela Witten for helpful comments, and acknowledge support from National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

*Monotonicity of profiles.* The subgradient equation for each *β** _{j}* is:

$$\sum _{k=1}^{K}({x}_{kj}\hspace{0.17em}-\hspace{0.17em}{d}_{k}\frac{{\sum}_{m\in {R}_{k}}{x}_{mj}\text{exp}({x}_{mj}{\beta}_{j})}{{\sum}_{m\in {R}_{k}}\text{exp}({x}_{mj}{\beta}_{j})})\hspace{0.17em}-\hspace{0.17em}\lambda \hspace{0.17em}\cdot \hspace{0.17em}{t}_{j}\hspace{0.17em}=\hspace{0.17em}0$$

(12)

where *t*_{j} = sign(*β*_{j}) if *β*_{j} ≠ 0 and *t*_{j} ∈ [−1, 1] otherwise, and *d*_{k} is the number of failures at the *k*th unique failure time.

*Claim*: $|{\widehat{\beta}}_{j}(\lambda )|$ is monotone non-increasing in *λ*, and once it hits zero, it stays at zero.

*Proof*: for ease of notation, suppose we have a single *β* having the subgradient equation

$$g(\beta )-\lambda t=0$$

(13)

with *t* = sign(*β*) if *β* ≠ 0, and |*t*| ≤ 1 if *β* = 0.

Then *g*′(*β*) is the negative of the observed information and it is easy to show that *g*′(*β*) *<* 0.

Denote the solution to (13) for parameter *λ* by $\widehat{\beta}(\lambda )$. Suppose that we have a solution $\widehat{\beta}(\lambda )\ne 0$ for some *λ*; WLOG assume $\widehat{\beta}(\lambda )>0$. Then by (13) we have

$$g(\widehat{\beta}(\lambda ))=\lambda $$

Then if *λ*′ > *λ*, we cannot have $\widehat{\beta}(\lambda \prime )\ge \widehat{\beta}(\lambda )$, since this would imply $g(\widehat{\beta}(\lambda \prime ))\le g(\widehat{\beta}(\lambda ))=\lambda <\lambda \prime $, contradicting (13). Hence $\widehat{\beta}(\lambda \prime )<\widehat{\beta}(\lambda )$.

On the other hand, if $\widehat{\beta}(\lambda )=0$ then

$$g(0)-\lambda t=0$$

for some *t* with |*t*| ≤ 1. Then if *λ*′ > *λ*, the equation *g*(0) − *λ*′*t*′ = 0 can be solved by taking *t*′ = *t*(*λ*/*λ*′), which also satisfies |*t*′| ≤ 1. Hence $\widehat{\beta}(\lambda \prime )=0$.

*Proof of (11)*. Let

$${g}_{j}({\beta}_{j})\hspace{0.17em}=\hspace{0.17em}\sum _{k=1}^{K}({x}_{kj}\hspace{0.17em}-\hspace{0.17em}{d}_{k}\frac{{\sum}_{m\in {R}_{k}}{x}_{mj}\text{exp}({x}_{mj}{\beta}_{j})}{{\sum}_{m\in {R}_{k}}\text{exp}({x}_{mj}{\beta}_{j})})$$

Suppose *U*_{j} = *g*_{j}(0) > 0. Since *g*_{j} is strictly decreasing, the subgradient equation *g*_{j}(*β*_{j}) − *λt*_{j} = 0 has a nonzero solution exactly when *g*_{j}(0) exceeds the penalty threshold; with the predictors standardized as described earlier, this condition can be written as ${U}_{j}/\sqrt{{V}_{j}}>\lambda $. The case *U*_{j} < 0 is symmetric, and together these give (11).

- Bickel, P. & Levina, E. (2004), ‘Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations’, *Bernoulli* 10, 989–1010.
- Efron, B. (2008), ‘Empirical Bayes estimates for large scale prediction problems’, unpublished.
- Fan, J. & Lv, J. (2008), ‘Sure independence screening for ultra-high dimensional feature space’, *Journal of the Royal Statistical Society Series B*, to appear.
- Heagerty, P. J. & Zheng, Y. (2005), ‘Survival model predictive accuracy and ROC curves’, *Biometrics* 61, 92–105.
- Paul, D., Bair, E., Hastie, T. & Tibshirani, R. (2008), ‘“Pre-conditioning” for feature selection and regression in high-dimensional problems’, *Annals of Statistics* 36(4), 1595–1618.
- Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B. & Staudt, L. M. (2002), ‘The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma’, *The New England Journal of Medicine* 346, 1937–1947.
- Tibshirani, R. (1997), ‘The lasso method for variable selection in the Cox model’, *Statistics in Medicine* 16, 385–395.
- Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. (2001), ‘Diagnosis of multiple cancer types by shrunken centroids of gene expression’, *Proceedings of the National Academy of Sciences* 99, 6567–6572.
- Tusher, V., Tibshirani, R. & Chu, G. (2001), ‘Significance analysis of microarrays applied to transcriptional responses to ionizing radiation’, *Proceedings of the National Academy of Sciences USA* 98, 5116–5121.
- van ’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R. & Friend, S. H. (2002), ‘Gene expression profiling predicts clinical outcome of breast cancer’, *Nature* 415, 530–536.
- Verweij, P. & van Houwelingen, H. (1993), ‘Cross-validation in survival analysis’, *Statistics in Medicine* 12, 2305–2314.
- Zhao, H., Tibshirani, R. & Brooks, J. (2005), ‘Gene expression profiling predicts survival in conventional renal cell carcinoma’, *PLOS Medicine*, 511–533.
- Zou, H. & Hastie, T. (2005), ‘Regularization and variable selection via the elastic net’, *Journal of the Royal Statistical Society Series B* 67(2), 301–320.
