


Stat Appl Genet Mol Biol. Author manuscript; available in PMC 2010 May 11.

Published in final edited form as: Stat Appl Genet Mol Biol. Published online 2009 February 11. doi: 10.2202/1544-6115.1423

PMCID: PMC2867485

NIHMSID: NIHMS191413



Use of microarray technology often leads to high-dimensional and low-sample size (HDLSS) data settings. A variety of approaches have been proposed for variable selection in this context. However, only a small number of these have been adapted for time-to-event data where censoring is present. Among standard variable selection methods shown both to have good predictive accuracy and to be computationally efficient is the elastic net penalization approach. In this paper, adaptations of the elastic net approach are presented for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. Assessment of the two methods is conducted through simulation studies and through analysis of microarray data obtained from a set of patients with diffuse large B-cell lymphoma where time to survival is of interest. The approaches are shown to match or exceed the predictive performance of a Cox-based and an AFT-based variable selection method. The methods are moreover shown to be much more computationally efficient than their respective Cox- and AFT-based counterparts.

Analysis of high-dimensional and low-sample size (HDLSS) data is increasingly an objective of interest. Such settings arise frequently in the analysis of DNA microarray data, where the number of genes typically far exceeds sample size. In this setting, a frequent objective is the identification of a subset of genes whose expression levels are significantly correlated with a given clinical outcome or classification. Estimation of the effect of each identified gene is also usually desired. Identified genes are then often employed to build a predictive model in which prediction of outcome for new patients is conducted.

A number of variable selection and estimation methodologies based on the maximization of a penalized likelihood have been proposed. Methods of penalization include traditional approaches such as AIC (Akaike, 1973) and BIC (Schwarz, 1978) as well as more recent developments including bridge regression (Frank and Friedman, 1993), the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), LARS (Efron et al., 2004), the elastic net (Zou and Hastie, 2005), and MM algorithms (Hunter and Li, 2005). Implementation of a number of these methods, however, is not feasible in HDLSS environments.

Microarray data analysis is further complicated when the outcome of interest is a time to an event. In these cases, either dropout or study termination may occur prior to event occurrence for a number of subjects. Typically, then, a number of the outcome variables are censored.

Several authors have proposed variable selection methods for HDLSS time-to-event data under the Cox proportional hazards model (Cox, 1972). For example, Cox-based methods utilizing kernel transformations (Li and Luan, 2003), threshold gradient descent minimization (Gui and Li, 2005a), and lasso penalization (see Gui and Li, 2005b; Segal, 2005; Park and Hastie, 2007) have been proposed.

Likewise, a few authors have proposed variable selection methods based on accelerated failure time models (see Wei, 1992). Methods based on the lasso penalization and the threshold gradient descent (Huang et al., 2006) have been proposed as well as an approach based on Bayesian variable selection (Sha et al., 2006).

There are a number of drawbacks to current methods of variable selection in HDLSS settings when censored data is present. The Li and Luan (2003) method is limited, for example, in that for prediction, all genes in the data set are included; a straightforward method of gene selection for prediction is not outlined. The TGD approaches of Gui and Li (2005a) and Huang et al. (2006) seem to be limited in that, at least in initial data analyses, very small changes in the threshold parameter dramatically altered the number of variables selected. Hence, effective identification of the optimal threshold might be unwieldy. A second drawback is that in the same analyses, the TGD method appeared to have less predictive power than alternative methods (see Gui and Li, 2005a; Gui and Li, 2005b). Use of the lasso in the methods proposed by Gui and Li (2005b) and Huang et al. (2006) might also lead to difficulties. For one, when the number of variables *p* is larger than the number of subjects *n*, the number of variables selected by the lasso is at most *n*. This restriction may be problematic for gene expression data where *p* ≫ *n*. A second drawback of the lasso is a result of its lack of strict convexity. Zou and Hastie (2005) show that for non-strictly convex penalty functions such as the lasso, performance is suboptimal when highly correlated variables are present. Given a set of highly correlated variables associated with outcome, procedures that employ a penalty function that is not strictly convex often will identify only one of the variables and ignore the others. This limitation might be particularly problematic in the analysis of gene expression data where identification of an entire set of correlated genes may lead to an improved understanding of the biological pathway. It should be noted that the adaptive lasso, a recent improvement to the lasso, has been proposed by Zhang and Lu (2007) for censored data.
While the approach overcomes a number of the drawbacks of lasso, use of the adaptive lasso may not be appropriate in high-dimensional data settings without reliance upon ridge regression (see Zhang and Lu, 2007; Lu and Zhang, 2007).

Modification of the elastic net penalization approach may be useful for the analysis of HDLSS time-to-event data. First, the elastic net approach is not limited in the number of variables selected by the number of available subjects. That is, the number of variables selected can be greater than the number of subjects. Second, the elastic net penalty function is strictly convex and therefore will more frequently identify an entire set of correlated genes than do methods based on penalty functions that are not strictly convex. Finally, as shown by Zou and Hastie (2005), the elastic net is computationally efficient. To date, the only attempt to employ the elastic net penalization approach to HDLSS censored data under the AFT model (Wang et al., 2008) employs an imputation approach based on the Buckley and James algorithm (Buckley and James, 1979). However, the Buckley-James approach entails an iterative least squares procedure that is known to suffer from convergence problems (see Wu and Zubovic, 1995) and is more computationally intensive than other methods.

In this paper, two elastic net based variable selection methods for high-dimensional low sample size time-to-event data are presented. First, a Cox elastic net (EN-Cox) approach is outlined that is based on the Cox proportional hazards model and utilizes modifications of the algorithms proposed by Tibshirani (1997) and Gui and Li (2005b). Second, an accelerated failure time elastic net (EN-AFT) approach is presented which employs a mean imputation approach for the estimation of AFT model parameters. The approaches are shown to be an improvement over existing methods in terms of prediction accuracy and computational efficiency.

In the linear regression setting, the elastic net objective function is defined (Zou and Hastie, 2005) as

$$L({\lambda}_{1},{\lambda}_{2},\beta )={\mid \mathbf{y}-\mathbf{X}\beta \mid}^{2}+{\lambda}_{2}\sum _{j=1}^{p}{\beta}_{j}^{2}+{\lambda}_{1}\sum _{j=1}^{p}\mid {\beta}_{j}\mid $$

(1)

for some fixed, non-negative *λ*_{1} and *λ*_{2}, where **y** = (*y*_{1}, …, *y _{n}*) is the centered response vector for the *n* subjects, **X** is the *n* × *p* matrix of standardized covariates, and ** β** = (*β*_{1}, …, *β _{p}*)′ is the vector of regression coefficients.

To adjust for HDLSS data settings (and the resultant difficulties in the estimation of ** β**), Zou and Hastie employ two simple modifications to the elastic net model. First, an augmentation of the data is employed in which **X** is extended by $\sqrt{{\lambda}_{2}}\mathbf{I}$ and **y** by *p* zeros, so that the elastic net problem can be solved as a lasso-type problem. Second, the resulting estimate is rescaled to correct for the double shrinkage induced by the two penalties. The elastic net estimator can then be expressed as

$$\widehat{\beta}=\mathrm{arg}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{min}}_{\beta}\left[{\beta}^{\prime}\left(\frac{{\mathbf{X}}^{\prime}\mathbf{X}+{\lambda}_{2}\mathbf{I}}{1+{\lambda}_{2}}\right)\beta -2{\mathbf{y}}^{\prime}\mathbf{X}\beta +{\lambda}_{1}\sum _{j=1}^{p}\mid {\beta}_{j}\mid \right].$$

(2)
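As a concrete illustration, the criterion in (1)–(2) can be minimized by coordinate descent. The sketch below is ours, not the authors' implementation (the function names and the choice of coordinate descent are illustrative); setting *λ*_{1} = 0 recovers the ridge estimator, which provides a convenient numerical check.

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - g, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iter=500):
    """Coordinate descent for the naive elastic net criterion (1):
    |y - Xb|^2 + lam2 * sum(b_j^2) + lam1 * sum(|b_j|)."""
    p = X.shape[1]
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with column j's contribution removed
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, lam1 / 2.0) / (col_ss[j] + lam2)
    return b
```

With *λ*_{1} = 0 the stationarity condition gives $\widehat{\beta}={({\mathbf{X}}^{\prime}\mathbf{X}+{\lambda}_{2}\mathbf{I})}^{-1}{\mathbf{X}}^{\prime}\mathbf{y}$, the ridge solution, so the two computations should agree.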

Of interest, then, is the elastic net estimator when the outcome is time to an event and censoring is present. Let time *t _{i}* for subject *i*, *i* = 1, …, *n*, denote the true time to event, let *y _{i}* denote the observed time (the minimum of *t _{i}* and the censoring time), and let *δ _{i}* denote the event indicator, with *δ _{i}* = 1 if the event was observed and *δ _{i}* = 0 if the observation was censored.

Under the Cox proportional hazards model, the hazard function for individual *i* is specified as *λ*(*t _{i}*) = *λ*_{0}(*t _{i}*) exp(** β**′**x**_{i}), where *λ*_{0} is an unspecified baseline hazard function and **x**_{i} is the covariate vector for individual *i*. The log partial likelihood is

$$\ell \left(\beta \right)=\frac{1}{n}\sum _{r\in D}\mathrm{ln}\left(\frac{\mathrm{exp}\left({\beta}^{\prime}{\mathbf{x}}_{\left(r\right)}\right)}{\sum _{j\in {R}_{r}}\mathrm{exp}\left({\beta}^{\prime}{\mathbf{x}}_{j}\right)}\right),$$

(3)

where *D* denotes the set of indices for observed events. The Cox elastic net estimate of ** β** in this setting can be obtained through adaptation of a quadratic programming approach outlined by Tibshirani and Hastie (see Hastie and Tibshirani, 1990; Tibshirani, 1997). Namely, let $\eta =\mathbf{X}\beta$, $\mathbf{u}=\partial \ell /\partial \eta$, and $\mathbf{A}=-{\partial}^{2}\ell /\partial \eta \,\partial {\eta}^{\prime}$. A one-term Taylor series expansion of the log partial likelihood about the current estimate then takes the form ${({\mathbf{z}}_{0}-\mathbf{X}\beta )}^{\prime}\mathbf{A}({\mathbf{z}}_{0}-\mathbf{X}\beta )$, where ${\mathbf{z}}_{0}=\eta +{\mathbf{A}}^{-}\mathbf{u}$ and ${\mathbf{A}}^{-}$ denotes a generalized inverse of **A**.

Modifying the approach of Gui and Li (2005b), this approximation can be rewritten as ${({\stackrel{~}{\mathbf{z}}}_{0}-\stackrel{~}{\mathbf{X}}\beta )}^{\prime}({\stackrel{~}{\mathbf{z}}}_{0}-\stackrel{~}{\mathbf{X}}\beta )$, where ${\stackrel{~}{\mathbf{z}}}_{0}=\mathbf{Q}{\mathbf{z}}_{0}$ and $\stackrel{~}{\mathbf{X}}=\mathbf{QX}$, where **Q** = **A ^{1/2}**. An estimate, $\widehat{\mathbf{A}}$, of **A** is computed at each iteration from the current value of $\widehat{\beta}$; following Gui and Li (2005b), a diagonal approximation of **A** may be employed to ease computation when *p* is large. Substituting ${\stackrel{~}{\mathbf{z}}}_{0}$ and $\stackrel{~}{\mathbf{X}}$ for **y** and **X** in (2), the EN-Cox estimate is

$$\widehat{\beta}=\mathrm{arg}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{min}}_{\beta}\left[{\beta}^{\prime}\left(\frac{{\stackrel{~}{\mathbf{X}}}^{\prime}\stackrel{~}{\mathbf{X}}+{\lambda}_{2}\mathbf{I}}{1+{\lambda}_{2}}\right)\beta -2{\stackrel{~}{\mathbf{z}}}^{\prime}\stackrel{~}{\mathbf{X}}\beta +{\lambda}_{1}\sum _{j=1}^{p}\mid {\beta}_{j}\mid \right].$$

(4)

Estimation of $\widehat{\beta}$ is accomplished through the following algorithm:

1. Set tuning parameters and initialize $\widehat{\beta}=0$.
2. Compute $\eta$, **u**, $\widehat{\mathbf{A}}$, and **Q** based on the current value of $\widehat{\beta}$.
3. Let **z**_{0} = **z** for the first iteration; otherwise compute **z**_{0}.
4. Compute $\stackrel{~}{\mathbf{X}}=\mathbf{QX}$ and ${\stackrel{~}{\mathbf{z}}}_{0}=\mathbf{Q}{\mathbf{z}}_{0}$.
5. Minimize ${({\stackrel{~}{\mathbf{z}}}_{0}-\stackrel{~}{\mathbf{X}}\widehat{\beta})}^{\prime}({\stackrel{~}{\mathbf{z}}}_{0}-\stackrel{~}{\mathbf{X}}\widehat{\beta})$ subject to the elastic net constraints.
6. Update $\widehat{\beta}$.
7. Repeat steps 2–6, subject to the elastic net constraints, until $\widehat{\beta}$ does not change.

Of note, **Q** can be obtained through the Cholesky decomposition of $\widehat{\mathbf{A}}$. Selection of tuning parameters in Step 1 and their effect on the elastic net constraints in Steps 5 and 7 are discussed in Section 2.5.
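The iteration above can be sketched end to end. The code below is an illustrative reimplementation under simplifying assumptions (distinct event times, a diagonal approximation to $\widehat{\mathbf{A}}$, a penalized rather than constrained elastic net solve, and fixed iteration counts in place of a convergence check); it is not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iter=150):
    """Penalized elastic net solve via coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, lam1 / 2.0) / (col_ss[j] + lam2)
    return b

def en_cox(X, t, delta, lam1, lam2, n_outer=20):
    """EN-Cox sketch: alternate a quadratic approximation of the Cox log
    partial likelihood with an elastic net solve on the working response."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_outer):
        eta = X @ beta
        w = np.exp(eta)
        u = np.zeros(n)               # gradient of loglik with respect to eta
        a = np.zeros(n)               # diagonal approximation of A
        for r in np.where(delta == 1)[0]:
            at_risk = t >= t[r]       # risk set at event time t_r
            s = w[at_risk].sum()
            pi = np.where(at_risk, w / s, 0.0)
            u -= pi
            a += pi * (1.0 - pi)
        u += delta
        a = np.maximum(a, 1e-8)       # guard subjects preceding all events
        z0 = eta + u / a              # working response
        q = np.sqrt(a)                # Q = A^{1/2} for diagonal A
        beta = elastic_net_cd(q[:, None] * X, q * z0, lam1, lam2)
    return beta
```

On simulated exponential survival data with a single strong covariate, the sketch recovers a clearly positive coefficient for that covariate while shrinking the noise coefficients toward zero.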

When the assumption of proportional hazards is not tenable, the accelerated failure time (AFT) model can be utilized. The AFT model is a linear regression model in which the response *t _{i}*, through a monotone transformation *h* (typically the logarithm), is related linearly to covariates **x**_{i}:

$$h\left({t}_{i}\right)={\beta}_{0}+{\mathbf{x}}_{i}^{\prime}\beta +{\epsilon}_{i},\phantom{\rule{1em}{0ex}}i=1,\dots ,n,$$

(5)

where *h*(·) is the log transformation or some other monotone function. In this case, the Cox assumption of multiplicative effect on hazard function is replaced with the assumption of multiplicative effect on outcome. In other words, it is assumed that the variables **x**_{i} act multiplicatively on time and therefore affect the rate at which individual *i* proceeds along the time axis.

Because censoring is present, the standard least squares approach cannot be employed to estimate the regression parameters in (5) even when *p* < *n*. One approach for AFT model implementation entails the replacement of censored *y _{i}* with imputed values. One such approach is that of mean imputation, in which each censored *y _{i}* is replaced by an estimate of its conditional expectation given survival beyond *y _{i}*, computed from the Kaplan-Meier estimator:

$$h\left({y}_{i}^{\ast}\right)={\delta}_{i}\,h\left({y}_{i}\right)+(1-{\delta}_{i}){\left\{\widehat{S}({y}_{i})\right\}}^{-1}\sum _{{t}_{\left(r\right)}>{y}_{i}}h\left({t}_{\left(r\right)}\right)\Delta \widehat{S}\left({t}_{\left(r\right)}\right),$$

(6)

where $\widehat{S}$ is the Kaplan-Meier estimator (Kaplan and Meier, 1958) of the survival function and $\Delta \widehat{S}\left({t}_{\left(r\right)}\right)$ is the step of $\widehat{S}$ at event time *t*_{(r)}.

Datta et al. (2007) recently assessed the performance of several approaches to AFT model implementation, including reweighting the observed *t _{i}*, replacement of each censored observation with a mean imputed value, and replacement with Buckley-James imputed values.
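The mean imputation step in (6) can be implemented in a few lines. The sketch below assumes distinct observed times and takes *h* to be the identity; the function names are illustrative, not the authors' code.

```python
import numpy as np

def km_survival(t, delta):
    """Kaplan-Meier estimate S evaluated just after each observed time,
    assuming distinct times."""
    order = np.argsort(t)
    n = len(t)
    s = 1.0
    surv = {}
    for i, idx in enumerate(order):
        if delta[idx] == 1:
            s *= (n - i - 1) / (n - i)   # at-risk count is n - i
        surv[t[idx]] = s
    return surv

def mean_impute(t, delta):
    """Replace each censored y_i by the estimated conditional mean
    E[T | T > y_i], i.e. equation (6) with h the identity."""
    surv = km_survival(t, delta)
    times = np.sort(np.unique(t))
    s_after = np.array([surv[x] for x in times])
    s_before = np.concatenate(([1.0], s_after[:-1]))
    steps = s_before - s_after           # Delta S; nonzero at event times only
    out = np.asarray(t, dtype=float).copy()
    for i in range(len(t)):
        if delta[i] == 0 and surv[t[i]] > 0:
            later = times > t[i]
            out[i] = (times[later] * steps[later]).sum() / surv[t[i]]
    return out
```

For observed times (1, 2, 3, 4) with the second observation censored, the imputed value is 3.5, the Kaplan-Meier-weighted mean of the later event times.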

Of interest, then, is the elastic net estimate of ** β** for settings in which *p* ≫ *n* and censoring is present. Under mean imputation, the AFT elastic net (EN-AFT) estimate is obtained through the following algorithm:

1. Set tuning parameters and initialize $\widehat{\beta}=0$.
2. Minimize ${\sum}_{i}{({y}_{i}^{\ast}-{\mathbf{x}}_{i}^{\prime}\widehat{\beta})}^{2}$ subject to the elastic net constraints.
3. Update $\widehat{\beta}$.
4. Repeat steps 2–3, subject to the elastic net constraints, until $\widehat{\beta}$ does not change.

Selection of tuning parameters in Step 1 and their effect on the elastic net constraints in Steps 2 and 4 are discussed in Section 2.5.

Zou and Hastie (2005) show that the elastic net is superior to the lasso in its ability to identify entire groups of highly correlated variables in the linear regression setting. This characteristic can be referred to as a grouping effect. A variable selection method, then, that exhibits the grouping effect will assign non-zero coefficients to an entire set of highly correlated variables. This characteristic is especially important in analysis of gene expression data where identification of an entire set of correlated genes may lead to an improved understanding of the biological pathway.

Both EN-Cox and EN-AFT exhibit the grouping effect. Because EN-AFT is based on a linear regression model, this follows by the same reasoning outlined by Zou and Hastie (2005). By similar reasoning, it is also easy to show that EN-Cox exhibits the grouping effect for 0 < *λ*_{2} ≤ 1. Proposition 1 describes the expected behavior of EN-Cox for an extreme case and Proposition 2 provides a general property of EN-Cox when correlated variables are present. Derivations of Propositions 1 and 2 are provided in the Appendix.

**Proposition 1.** Let **x**_{i} = **x**_{j} for some *i*, *j* ∈ {1, …, *p*}. Let $\widehat{\beta}$ be the EN-Cox estimate of the Cox regression parameter ** β**. Then ${\widehat{\beta}}_{i}={\widehat{\beta}}_{j}$.

Proposition 1 states that given identical covariate vectors **x**_{i} and **x**_{j}, the EN-Cox estimate of ** β** will assign identical values to ${\widehat{\beta}}_{i}$ and ${\widehat{\beta}}_{j}$.

**Proposition 2.** Let transformed response vector $\stackrel{~}{\mathbf{z}}$ and covariate matrix $\stackrel{~}{\mathbf{X}}$ be mean-centered and standardized. Let original covariate vectors **x**_{i} and **x**_{j} be highly correlated, with correlation *ρ*. Without loss of generality, assume *ρ* > 0. Let $\widehat{\beta}$ be the EN-Cox estimate of the Cox regression parameter ** β** and assume $\mathrm{sign}\left({\widehat{\beta}}_{i}\right)=\mathrm{sign}\left({\widehat{\beta}}_{j}\right)$. Then for fixed *λ*_{1} and *λ*_{2} > 0,

$$\frac{\mid {\widehat{\beta}}_{i}-{\widehat{\beta}}_{j}\mid}{\mid \stackrel{~}{\mathbf{z}}\mid}\le \frac{\sqrt{2\left(1-{\mathbf{x}}_{i}^{\prime}\mathbf{A}{\mathbf{x}}_{j}\right)}}{{\lambda}_{2}},$$

(7)

where ${\mathbf{x}}_{i}^{\prime}\mathbf{A}{\mathbf{x}}_{j}$ is equal to the correlation between transformed covariate vectors ${\stackrel{~}{\mathbf{x}}}_{i}$ and ${\stackrel{~}{\mathbf{x}}}_{j}$.

Proposition 2 states that the standardized difference between the EN-Cox estimates ${\widehat{\beta}}_{i}$ and ${\widehat{\beta}}_{j}$ corresponding to correlated variables **x**_{i} and **x**_{j} is bounded above by a function of the correlation between the transformed covariate vectors ${\stackrel{~}{\mathbf{x}}}_{i}$ and ${\stackrel{~}{\mathbf{x}}}_{j}$. Of note, Propositions 1 and 2 extend the results of Zou and Hastie (2005) to settings in which censored data are present. Further examination of the grouping effect of EN-Cox and EN-AFT is provided in Section 3.2.
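The grouping behavior described above is easy to verify numerically in the linear regression setting. The snippet below (an illustration of ours, not the authors' code) fits a duplicated covariate with and without the ridge component of the penalty:

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iter=4000):
    """Coordinate descent for |y - Xb|^2 + lam2|b|^2 + lam1|b|_1."""
    p = X.shape[1]
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, lam1 / 2.0) / (col_ss[j] + lam2)
    return b

rng = np.random.default_rng(2)
x = rng.standard_normal(50)
X = np.column_stack([x, x, rng.standard_normal(50)])  # columns 1 and 2 identical
y = 2.0 * x + 0.1 * rng.standard_normal(50)

b_en = elastic_net_cd(X, y, lam1=1.0, lam2=1.0)       # strictly convex penalty
b_lasso = elastic_net_cd(X, y, lam1=1.0, lam2=0.0)    # not strictly convex
```

With *λ*_{2} > 0 the two duplicated columns receive identical coefficients, as Proposition 1 predicts for EN-Cox; with *λ*_{2} = 0 the lasso assigns the entire effect to a single column and sets the other to zero.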

The elastic net requires the selection of two tuning parameters, *λ*_{1} and *λ*_{2}. Alternatives to *λ*_{1} are possible. The various choices correspond to different methods of identifying the stopping point of the procedure and hence affect Steps 4 and 6 of the algorithms outlined in Sections 2.2 and 2.3. Among those alternatives proposed is the maximum number of steps *k* allowable in the entire solution path where one iteration of the above algorithms constitutes a single step. The choice of *k* is useful as its selection requires no prior knowledge (or guesswork) regarding the actual values of the regression coefficients and is employed in both EN-Cox and EN-AFT.

Evaluation of the two parameters *λ*_{2} and *k* across a two-dimensional surface of parameter values is required. Potential values of *λ*_{2} should span a wide range, *e.g*., *λ*_{2} = (0, 0.01, 0.1, 1, 10, 100). The potential values of *k* will depend on the size of the data set. Tuning parameter selection can be implemented through use of cross-validation methods over a rough grid of candidate values for *λ*_{2} and *k*. In the current setting, selection of *λ*_{2} and *k* under both EN-Cox and EN-AFT is conducted through use of a cross validation score (CVS) (Huang, 2006; see also Verweij and van Houwelingen, 1993; Huang and Harrington, 2002):

$$\mathit{CVS}(\mathbf{X},{\lambda}_{2},k)=\sum _{i=1}^{n}\left[\ell (\mathbf{X},{\widehat{\beta}}_{{\lambda}_{2},k,-\left(i\right)})-\ell ({\mathbf{X}}_{-\left(i\right)},{\widehat{\beta}}_{{\lambda}_{2},k,-\left(i\right)})\right],$$

(8)

where ${\widehat{\beta}}_{{\lambda}_{2},k,-\left(i\right)}$ consists of the coefficient estimates (for a given variable selection approach) obtained while excluding the *i ^{th}* subject for fixed values of *λ*_{2} and *k*. The values of *λ*_{2} and *k* that maximize the cross validation score are then selected.

Potentially viable alternatives to the above approach include, but are not limited to, BIC (see Wang et al., 2007) as well as the approaches outlined by Zhang and Lu (2007) and by Wang et al. (2008).
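The cross validation score amounts to each subject's held-out contribution to the log-likelihood, evaluated at the leave-one-out fit. The sketch below illustrates the computation with a Gaussian log-likelihood and a ridge fit standing in for the penalized Cox/AFT fits; the names and the candidate grid are illustrative assumptions.

```python
import numpy as np

def gauss_loglik(X, y, beta):
    """Gaussian log-likelihood up to constants (unit variance)."""
    r = y - X @ beta
    return -0.5 * float(r @ r)

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cvs(X, y, lam):
    """Cross validation score in the spirit of (8): full-data minus
    leave-one-out log-likelihood, both at the leave-one-out fit, summed."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b = ridge_fit(X[keep], y[keep], lam)
        total += gauss_loglik(X, y, b) - gauss_loglik(X[keep], y[keep], b)
    return total
```

Tuning parameters are then chosen to maximize the score over the candidate grid; a grossly over-penalized fit scores worse than a moderate one.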

Assessment of EN-Cox and EN-AFT can be conducted through analysis of predictive performance using time-dependent receiver operator characteristic (ROC) curves (Heagerty et al., 2000). In general, for dichotomous disease-status indicator *D* and continuous diagnostic test outcome *X*, an ROC curve is defined as the plot of the sensitivity of the test *X* > *c* versus (1 − specificity) over *c* ∈ (−∞, ∞). Heagerty et al. extend this formulation to time-to-event data when censoring is present. Given linear risk score function *f*(*X*) = ** β**′*X*, time-dependent sensitivity and specificity are defined as

$$\text{sensitivity}(c,t\mid f(X))=P\left[f(X)>c\mid \delta (t)=1\right]$$

(9)

$$\text{specificity}(c,t\mid f(X))=P\left[f(X)\le c\mid \delta (t)=0\right],$$

(10)

where *δ*(*t*) is the event indicator at time *t*. At each time *t*, an ROC curve is generated for $\beta =\widehat{\beta}$ and an associated area under the curve (AUC) is calculated. The plot of AUC over time is then helpful in assessing the predictive performance of a given variable selection method.
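A simplified version of the time-dependent AUC can be computed directly from risk scores. The sketch below uses cumulative cases and omits the censoring adjustment of the Heagerty et al. estimator, so it is an illustration rather than their method.

```python
import numpy as np

def cumulative_auc(score, time, event, t):
    """Empirical AUC at time t: cases have an observed event by t; controls
    are still event-free at t. No censoring adjustment is applied."""
    case = (time <= t) & (event == 1)
    ctrl = time > t
    if case.sum() == 0 or ctrl.sum() == 0:
        return float("nan")
    s_case = score[case][:, None]
    s_ctrl = score[ctrl][None, :]
    # proportion of correctly ordered (case, control) pairs; ties count half
    return float((s_case > s_ctrl).mean() + 0.5 * (s_case == s_ctrl).mean())
```

A perfectly ranked cohort yields AUC(*t*) = 1; plotting the value over a grid of *t* gives the AUC-over-time curves used to compare methods.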

Analyses were performed using the R software package (http://www.r-project.org). The R implementation of the Cox-based and AFT-based elastic net models presented in this paper is available at http://statweb.byu.edu/engler/ENET.

Diffuse large-B-cell lymphoma (DLBCL) is a common type of non-Hodgkin’s lymphoma in adults. Heterogeneity in response to treatment has suggested the existence of clinically distinct subtypes. Rosenwald et al. (2002) utilized Lymphochip DNA microarrays to collect and analyze gene expression data from 240 biopsy samples of DLBCL tumors. For each subject, 7399 gene expression measurements were obtained. During the time of follow-up, 138 patient deaths were observed (*i.e*., 42.5% censoring).

Analysis of the Rosenwald et al. DLBCL data was conducted using both EN-Cox and EN-AFT. For comparison purposes, analysis was also conducted using the Gui and Li (2005b) lasso (LASSO-Cox) method. To assess the effect of differing imputation methods under the AFT model, separate analyses were conducted using the mean imputation method described in Section 2.3 and the Buckley-James imputation method (Wang et al., 2008). A training set of 160 randomly selected subjects was utilized. Selection of tuning parameters for each method was conducted using half of the training set while model fit (*i.e*., variable selection and coefficient estimation) was conducted using the other half. Predictive performance was assessed using a validation set composed of the 80 subjects not in the training set.

The methods varied in the number of gene expressions identified as significantly associated with survival. Both EN-Cox and EN-AFT identified a greater number of significant features than LASSO-Cox. EN-AFT computed under mean imputation (EN-AFT-M) identified 13 genes, EN-AFT computed under Buckley-James imputation (EN-AFT-BJ) identified 18 genes, EN-Cox identified 16 genes, and LASSO-Cox identified 7 genes.

To assess predictive performance, the median AUC for each six month interval (for which there was data) was then calculated and plotted for each method. Results are presented in Figure 1. For the first ten years of follow-up, the median AUC for EN-AFT-M is 0.61 and is 0.56 for EN-AFT-BJ. Use of the Cox model results in a median AUC of 0.58 for both EN-Cox and LASSO-Cox. Instability in AUC estimates for subsequent times (post year 10) appears to be due to sparsity of event times. For this analysis, then, EN-AFT-M outperformed EN-AFT-BJ (in terms of prediction) using a smaller set of identified genes. The predictive performance of EN-AFT-M was also slightly superior to EN-Cox and LASSO-Cox in this data analysis.

Comparison of predictive performance (area under the ROC curve, over time) for the Rosenwald DLBCL data set.

Several features of the variable selection process for this data set are notable. First, EN-COX, EN-AFT-M, and EN-AFT-BJ each select genes not identified by any of the other three variable selection methods. In part, this is due to the noise of gene expression data. Such results are also indicative of the stochastic nature of the variable selection process.

Second, the methods based on the elastic net penalization do exhibit the grouping effect discussed in Section 2.4 while LASSO-Cox does not. For example, both EN-Cox and LASSO-Cox select gene 5442, but EN-Cox also selects gene 5301 which is moderately correlated with gene 5442 (*ρ* = 0.43). EN-AFT-M and EN-AFT-BJ each identify correlated gene expressions. For example, gene 5254 and gene 5296 (uniquely identified by EN-AFT-M) are correlated (*ρ* = 0.57). Likewise, genes 1671, 2154, and 5773 (uniquely identified by EN-AFT-BJ) are correlated (*ρ* ≥ 0.51). With regard to LASSO-Cox, *ρ* ≤ 0.30 for any two identified gene expressions.

In summary, both EN-Cox and EN-AFT-M (EN-AFT based on mean imputation) perform as well as or better than the lasso-based method and EN-AFT-BJ (EN-AFT based on Buckley-James imputation) in terms of predictive power. It is additionally important to note that the elastic-net based methods are much more computationally efficient than their Cox-based and AFT-based counterparts (see Section 3.3); completion of the LASSO-Cox method required several days, while EN-AFT-M (including parameter selection through cross-validation) completed in well under an hour.

In order to assess performance of EN-Cox and EN-AFT, several simulation studies were conducted under different data scenarios. For each scenario, covariate data was simulated following the strategy for generating gene expressions proposed by Gui and Li (2005b) which allows for correlation between certain subsets of the data. In essence, an *n* × *n* array *B* is initially generated from a uniform *U*(−1.5, 1.5) distribution. A second set of data *C* can then be generated utilizing the normalized, orthogonal basis of the initial array. Gui and Li (2005b) demonstrate that the maximum correlation between any two data vectors selected from *B* and *C*, respectively, can be specified during the data generation process. Implementation of this procedure can be conducted by prespecifying *p _{γ}* genes significantly associated with outcome. The gene expression data associated with these *p _{γ}* genes can then be generated with a prespecified correlation structure, while the remaining *p* − *p _{γ}* gene expressions are generated independently.

For each of the following three data scenarios, 100 simulations were conducted in which, for each simulation, data for *n* = 150 subjects and *p* = 200 gene expressions were generated. For each data set, subjects were randomly divided into two training sets of *n _{t}* = 50 each and one prediction set of the remaining 50 subjects. The first training set was used for tuning parameter selection, the second for model fitting, and the prediction set for assessment of predictive performance.

It was first of interest to establish baseline performance for EN-Cox and EN-AFT in a relatively simple setting in which no correlation existed between any of the covariate vectors and where, on average, about 40% of the event times were censored. For this first data scenario, then, data for the first *p _{γ}* gene expressions were drawn from a uniform *U*(−1.5, 1.5) distribution, independently of the remaining gene expressions.

For the second data scenario, it was of interest to assess the grouping effect of EN-Cox and EN-AFT. That is, the performance of EN-Cox and EN-AFT was assessed for a scenario in which subsets of the *p _{γ}* variables were highly correlated. First, data for two correlated groups of variables, {**x**_{1}, **x**_{2}, **x**_{3}} and {**x**_{4}, **x**_{5}, **x**_{6}}, were generated using the correlation structure described above; the remaining gene expressions were generated independently.
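Correlated covariate groups of this kind can be mimicked with a simple latent-factor construction; note this is an illustrative stand-in for the Gui and Li (2005b) orthogonal-basis procedure, not a reimplementation of it.

```python
import numpy as np

def correlated_groups(n, p, groups, rho, rng):
    """Gene-expression-like covariates in which the columns listed in each
    group share a common latent factor; the pairwise within-group
    correlation is approximately rho**2."""
    X = rng.uniform(-1.5, 1.5, size=(n, p))
    for g in groups:
        z = rng.uniform(-1.5, 1.5, size=n)   # shared latent factor
        for j in g:
            e = rng.uniform(-1.5, 1.5, size=n)
            X[:, j] = rho * z + np.sqrt(1.0 - rho ** 2) * e
    return X
```

With `rho = 0.95`, columns within a group correlate at roughly 0.9 while columns in different groups (and ungrouped columns) remain essentially uncorrelated.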

Finally, it was of interest to assess the performance of EN-Cox and EN-AFT when an elevated level of censoring was present. For this third data scenario, gene expression data were generated as described above for Scenario 1. Likewise, the same *β*_{γ} parameter vector was used. The level of censoring, however, was increased to 60%.

For each of the three scenarios, performance of EN-Cox and EN-AFT was assessed in two ways. First, the relative frequency of selection of significant variables (*i.e*., variables with *β _{j}* ≠ 0) across the 100 simulations was calculated; results are presented in Tables 1–3.

Variable selection results (frequency of selection) for LASSO-Cox (L-Cox), EN-Cox, and EN-AFT (BJ: Buckley-James imputation, M: mean imputation) methods for independent variables, 40% censoring. The column “Actual” denotes the true parameter values.

Variable selection results (frequency of selection) for LASSO-Cox (L-Cox), EN-Cox, and EN-AFT (BJ: Buckley-James imputation, M: mean imputation) methods for correlated variables, 40% censoring. Variables 1–6 are grouped into two sets: {**x**_{1}, **x**_{2}, **x**_{3}} and {**x**_{4}, **x**_{5}, **x**_{6}}.

Variable selection results (frequency of selection) for LASSO-Cox (L-Cox), EN-Cox, and EN-AFT (BJ: Buckley-James imputation, M: mean imputation) methods for independent variables, 60% censoring. The column “Actual” denotes the true parameter values.

Second, predictive performance was assessed as described in Section 2.6. For each simulation, the AUC was calculated at each unique event time. Because unique times varied across simulations, the time scale was divided into equal sized “bins”. The average AUC in each time-bin was then calculated. Figure 3.2 contains the plotted average AUCs over time for each of the three scenarios. For comparison purposes, the same sets of data were also analyzed using the Gui and Li (2005b) LASSO-Cox procedure for censored data. To assess the effect of imputation method under the AFT model, separate analyses were conducted using the mean imputation method of Section 2.3 and the Buckley-James imputation method of Wang et al. (2008).

Results for the first scenario (*i.e*., independent covariates, 40% censoring) are presented in Table 1 and in Figure 3.2 (under “Scenario 1”). For this simple scenario, the Cox-based methods seem roughly equivalent in terms of performance results; both EN-Cox and LASSO-Cox have a median AUC (across all times) of 0.80. With regard to the AFT-based methods, both EN-AFT-M and EN-AFT-BJ appear to outperform the Cox-based models in this setting, more frequently identifying variables of interest. The AFT approach based on mean imputation (EN-AFT-M) performs particularly well. For coefficients with moderate or high absolute effects (*β*_{1–5}), the mean frequency of selection of EN-AFT-M is 0.986. With regard to the selection of the remaining non-zero coefficient (*β*_{6}), EN-AFT-M outperforms the Cox-based methods as well as the method based on Buckley-James imputation (EN-AFT-BJ).

Results for the second scenario (*i.e*., grouped covariates with high correlation within groups, 40% censoring) are presented in Table 2 and in Figure 3.2 (under “Scenario 2”). With regard to variable selection (Table 2), the AFT-based selection methods exhibit the highest accuracy, followed by EN-Cox and then LASSO-Cox. The LASSO-Cox does not exhibit the grouping effect but instead appears to select one of several highly correlated variables and ignore the others. For example, in about half the simulations, LASSO-Cox selects *β*_{1}, ignoring *β*_{2} and *β*_{3}, whereas in the remaining simulations, LASSO-Cox selects *β*_{2}, ignoring *β*_{1} and *β*_{3}. A similar pattern is observed for the second group of correlated variables, *β*_{4}, *β*_{5}, and *β*_{6}. As in the first setting, the performance of EN-AFT-M with regard to frequency of variable selection is superior to the Cox-based methods and to EN-AFT-BJ. Regarding predictive performance (Figure 3.2), EN-AFT-M, EN-AFT-BJ, and EN-Cox all perform well, each with a median AUC (across all times) of 0.92. The over-time average AUC of LASSO-Cox in this setting is 0.82.

Results for the third scenario (*i.e*., independent covariates, 60% censoring) are presented in Table 3 and Figure 3.2 (under “Scenario 3”). For this scenario in which a high level of censoring is present, the three elastic net methods outperform LASSO-Cox in both variable selection accuracy and in predictive performance. Interestingly, while the three elastic net methods are roughly equivalent with regard to variable selection, EN-Cox (median AUC: 0.76) appears to slightly outperform the two AFT-based methods (EN-AFT-M median AUC: 0.68, EN-AFT-BJ median AUC: 0.66) in terms of predictive performance. The poorer predictive performance of the AFT-based methods may be due, in part, to the fact that the required imputation in the AFT models is based on fewer observed events and is therefore less accurate.

In summary, then, EN-Cox performs as well as or better than LASSO-Cox in each of the three scenarios. The improvement of EN-Cox is particularly notable when correlated covariates are present. Moreover, the computational efficiency of EN-Cox exceeds that of LASSO-Cox. With regard to the AFT-based methods, EN-AFT-M performs as well as or better than EN-AFT-BJ in all three scenarios, particularly with regard to frequency of variable selection. Additionally, the improvement in computational efficiency is substantial.

Use of the elastic net penalty leads to computationally efficient algorithms. Typical run times (3.2 GHz Xeon Linux workstation) for EN-AFT-M, EN-AFT-BJ, EN-Cox, and LASSO-Cox are listed in Table 4 for various data set dimensionalities.

Comparison of computation times for LASSO-Cox (L-Cox), EN-Cox, EN-AFT (BJ: Buckley-James imputation, M: mean imputation) methods (in seconds). *p*: number of variables, *N*: number of subjects.

Note that the run times listed in Table 4 are for fixed tuning parameters and that differences in run times are even more pronounced when time of cross-validation is included. For example, a typical total run-time (cross-validation and model fitting) for *N* = 150 and *p* = 200 for EN-AFT-M is 25.0 seconds whereas the EN-AFT-BJ time is 2716.6 seconds. For *N* = 150 and *p* = 1000, the total run time for EN-AFT-M is 47.6 seconds and is 106280.7 seconds for EN-AFT-BJ.

Adaptation of the elastic net penalization criterion for use in high-dimensional and low-sample size censored data settings leads to computationally efficient variable selection methods with good predictive performance. Through simulation studies, EN-Cox and EN-AFT were shown to perform well in comparison to the Gui and Li (2005b) LASSO-Cox approach in simple settings with low censoring and independent covariates. The two methods were also shown to outperform LASSO-Cox in settings with a high degree of censoring and in settings where sets of highly correlated variables were present. The EN-AFT approach entailing mean imputation was also shown to outperform the approach based on Buckley-James imputation (Wang et al., 2008) in terms of both frequency of variable selection and computational efficiency.

Several features of the EN-Cox and EN-AFT implementations may warrant further investigation. Methods have been proposed for improving the computational efficiency of LASSO-Cox (Segal, 2005). While EN-Cox was shown to perform efficiently in comparison to LASSO-Cox, further improvements may be possible. For example, utilization of the penalized likelihood approach of Park and Hastie (2007) may be of particular interest under the Cox model.

The presented models can also be adapted to situations in which it is of interest to assign separate penalty functions to different coefficients or groups of coefficients. That is, equation (1) can be extended to

$$L({\lambda}_{1},{\lambda}_{2},\beta )={\mid \mathbf{y}-\mathbf{X}\beta \mid}^{2}+{\lambda}_{2}\sum _{j=1}^{p}{W}_{2j}{\beta}_{j}^{2}+{\lambda}_{1}\sum _{j=1}^{p}{W}_{1j}\mid {\beta}_{j}\mid ,$$

(11)

where the *W*_{mj}, *m* = 1, 2, *j* = 1, …, *p*, are prespecified weights applied to the ridge (*m* = 2) and lasso (*m* = 1) penalty terms for the individual coefficients.
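The weighted criterion in equation (11) is straightforward to evaluate directly. The sketch below is a minimal NumPy illustration (the function name and the toy data are hypothetical, not from the paper); with unit weights it reduces to the ordinary elastic net criterion of equation (1).

```python
import numpy as np

def weighted_enet_loss(y, X, beta, lam1, lam2, w1, w2):
    """Weighted elastic net criterion of equation (11):
    |y - X beta|^2 + lam2 * sum(w2 * beta^2) + lam1 * sum(w1 * |beta|)."""
    resid = y - X @ beta
    return (resid @ resid
            + lam2 * np.sum(w2 * beta ** 2)
            + lam1 * np.sum(w1 * np.abs(beta)))

# Toy data chosen so the residual is exactly zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
beta = rng.standard_normal(5)
y = X @ beta
w_unit = np.ones(5)  # W_mj = 1 for all m, j recovers equation (1)
loss = weighted_enet_loss(y, X, beta, 1.0, 1.0, w_unit, w_unit)
```

Setting, say, `w1[j] = 0` for a coefficient exempts it from the lasso penalty, which is the kind of coefficient-specific penalization the extension is meant to allow.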

It may also be of interest to obtain standard error estimates for the EN-Cox or EN-AFT regression coefficients. One possible approach is based on an adaptation of the lasso local quadratic approximation (LQA) proposed by Fan and Li (2001) (see also Zou, 2006). First, assume the nonzero elements of ** β** have been identified, perhaps through an initial EN-Cox or EN-AFT analysis. Let

$$\widehat{\beta}=\mathrm{arg}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{min}}_{\beta}\left[{\beta}^{\prime}\left(\frac{{\mathbf{X}}^{\prime}\mathbf{X}+{\lambda}_{2}\mathbf{I}}{1+{\lambda}_{2}}\right)\beta -2{\mathbf{y}}^{\prime}\mathbf{X}\beta +{\lambda}_{1}\sum _{j=1}^{p}\left\{\mid {\beta}_{j0}\mid +\frac{1}{2\mid {\beta}_{j0}\mid}({\beta}_{j}^{2}-{\beta}_{j0}^{2})\right\}\right],$$

(12)

where **y** and **X** are replaced with $\stackrel{~}{\mathbf{z}}$ and $\stackrel{~}{\mathbf{X}}$ for EN-Cox and where **y** is replaced with **y*** for EN-AFT. Let *β*_{m} consist of the *m* nonzero elements of ** β** and let

$${\widehat{\beta}}_{m}=(1+{\lambda}_{2}){\left\{{\mathbf{X}}_{m}^{\prime}{\mathbf{X}}_{m}+{\lambda}_{2}\mathbf{I}+{\lambda}_{1}\Sigma ({\beta}_{0})\right\}}^{-1}{\mathbf{X}}_{m}^{\prime}\mathbf{y},$$

(13)

where $\Sigma \left({\beta}_{0}\right)=\mathrm{diag}\left(\frac{1}{\mid {\beta}_{10}\mid},\dots ,\frac{1}{\mid {\beta}_{m0}\mid}\right)$, consistent with the quadratic approximation in (12). Equation (13) can then be utilized to obtain the sandwich estimator for the covariance matrix of ${\widehat{\beta}}_{m}$.
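Equation (13) can be evaluated directly. The sketch below is an illustration under simplifying assumptions, not code from the paper: it computes the closed-form LQA solution and one common sandwich-type covariance estimate assuming homoscedastic errors, with `y` standing in for the working response ($\stackrel{~}{\mathbf{z}}$ for EN-Cox, **y*** for EN-AFT). The function name `lqa_enet_estimate` is hypothetical.

```python
import numpy as np

def lqa_enet_estimate(y, Xm, beta0, lam1, lam2):
    """LQA solution of equation (13) for the m nonzero coefficients,
    plus a sandwich-type covariance estimate (assumes homoscedastic
    errors; sigma^2 is estimated from the residuals)."""
    m = Xm.shape[1]
    Sigma0 = np.diag(1.0 / np.abs(beta0))        # Sigma(beta_0) in the text
    H = Xm.T @ Xm + lam2 * np.eye(m) + lam1 * Sigma0
    Hinv = np.linalg.inv(H)
    beta_hat = (1.0 + lam2) * Hinv @ Xm.T @ y    # equation (13)
    resid = y - Xm @ beta_hat
    sigma2 = resid @ resid / max(len(y) - m, 1)
    cov = sigma2 * (1.0 + lam2) ** 2 * Hinv @ (Xm.T @ Xm) @ Hinv
    return beta_hat, cov

# Hypothetical example: beta0 plays the role of the nonzero estimates
# obtained from an initial EN-Cox or EN-AFT analysis.
rng = np.random.default_rng(1)
Xm = rng.standard_normal((30, 3))
beta0 = np.array([1.0, -2.0, 0.5])
y = Xm @ beta0 + 0.1 * rng.standard_normal(30)
beta_hat, cov = lqa_enet_estimate(y, Xm, beta0, 0.1, 0.1)
se = np.sqrt(np.diag(cov))   # standard errors for the m coefficients
```

With both tuning parameters set to zero the formula reduces to ordinary least squares, which provides a quick sanity check.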

Finally, a current drawback of the elastic net is that, like the lasso, it may not always yield consistent results (see Ghosh, 2007). The adaptive elastic net, proposed by Zou and Zhang (2009), resolves this issue for HDLSS data. Adaptation of this new approach for HDLSS censored data settings will be of future interest.

Suppose, for contradiction, that ${\widehat{\beta}}_{i}\ne {\widehat{\beta}}_{j}$. Define the estimator ${\widehat{\beta}}^{\ast}$ as follows: let ${\widehat{\beta}}_{k}^{\ast}={\widehat{\beta}}_{k}$ for all *k* ≠ *i, j*, and let ${\widehat{\beta}}_{k}^{\ast}=p{\widehat{\beta}}_{i}+(1-p){\widehat{\beta}}_{j}$ for *k* = *i, j*, with *p* = 1/2. Since **x**_{i} = **x**_{j}, clearly $T{\mathbf{x}}_{i}={\stackrel{~}{\mathbf{x}}}_{i}={\stackrel{~}{\mathbf{x}}}_{j}=T{\mathbf{x}}_{j}$, so $\stackrel{~}{\mathbf{X}}{\widehat{\beta}}^{\ast}=\stackrel{~}{\mathbf{X}}\widehat{\beta}$ and ${\mid \stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}{\widehat{\beta}}^{\ast}\mid}^{2}={\mid \stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}\widehat{\beta}\mid}^{2}$. However, because the elastic net penalization function $f\left(\beta \right)={\lambda}_{2}{\sum}_{k=1}^{p}{\beta}_{k}^{2}+{\lambda}_{1}{\sum}_{k=1}^{p}\mid {\beta}_{k}\mid $ is additive over coordinates and strictly convex in each coordinate, it is the case that

$$f\left({\widehat{\beta}}_{i}^{\ast}\right)+f\left({\widehat{\beta}}_{j}^{\ast}\right)=2f\left({\textstyle \frac{1}{2}}{\widehat{\beta}}_{i}+{\textstyle \frac{1}{2}}{\widehat{\beta}}_{j}\right)\phantom{\rule{1em}{0ex}}<\phantom{\rule{1em}{0ex}}f\left({\widehat{\beta}}_{i}\right)+f\left({\widehat{\beta}}_{j}\right).$$

Because ${\widehat{\beta}}_{k}^{\ast}={\widehat{\beta}}_{k}$ for every *k* ≠ *i, j*, and because *f*(·) is additive over coordinates, it follows that $f\left({\widehat{\beta}}^{\ast}\right)<f\left(\widehat{\beta}\right)$. The loss term is unchanged, so $\widehat{\beta}$ cannot be a minimizer, a contradiction. Hence, ${\widehat{\beta}}_{i}={\widehat{\beta}}_{j}$.
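The grouping property just established can be checked numerically. The sketch below uses a small coordinate-descent loop for the naive elastic net criterion $\mid \mathbf{y}-\mathbf{X}\beta \mid^{2}+{\lambda}_{2}\mid \beta \mid^{2}+{\lambda}_{1}{\sum}_{j}\mid {\beta}_{j}\mid$ (written for illustration only; it is not the implementation used in the paper) on a design containing two identical columns. With λ₂ > 0 the two fitted coefficients coincide, as the proof requires.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def naive_enet(X, y, lam1, lam2, n_iter=2000):
    """Cyclic coordinate descent for |y - Xb|^2 + lam2|b|^2 + lam1|b|_1
    (a minimal sketch for illustration)."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual correlation for coordinate j
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j])
            b[j] = soft(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    return b

# Columns 0 and 1 are identical; column 2 is independent noise.
rng = np.random.default_rng(2)
x = rng.standard_normal(100)
X = np.column_stack([x, x, rng.standard_normal(100)])
y = 2.0 * x + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(100)
b = naive_enet(X, y, lam1=1.0, lam2=1.0)
# b[0] and b[1] agree up to solver tolerance; each carries half the signal.
```

A pure lasso fit (λ₂ = 0) of the same data carries no such guarantee: the weight on the duplicated pair can be split arbitrarily between the two columns.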

By definition,

$${\frac{\partial L({\lambda}_{1},{\lambda}_{2},\beta )}{\partial {\beta}_{k}}\mid}_{\beta =\widehat{\beta}}=0\phantom{\rule{1em}{0ex}}\text{for}\phantom{\rule{thickmathspace}{0ex}}{\widehat{\beta}}_{k}\ne 0.$$

(14)

Also, note that

$$L({\lambda}_{1},{\lambda}_{2},\widehat{\beta})\le L({\lambda}_{1},{\lambda}_{2},\beta =0).$$

(15)

By (14) (for non-zero ${\widehat{\beta}}_{i}$ and ${\widehat{\beta}}_{j}$),

$$-2{\stackrel{~}{\mathbf{x}}}_{i}^{\prime}(\stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}\widehat{\beta})+{\lambda}_{1}\mathrm{sign}\left({\widehat{\beta}}_{i}\right)+2{\lambda}_{2}{\widehat{\beta}}_{i}=0,$$

and

$$-2{\stackrel{~}{\mathbf{x}}}_{j}^{\prime}(\stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}\widehat{\beta})+{\lambda}_{1}\mathrm{sign}\left({\widehat{\beta}}_{j}\right)+2{\lambda}_{2}{\widehat{\beta}}_{j}=0.$$

Hence, subtracting the second equation from the first (the *λ*_{1} terms cancel when ${\widehat{\beta}}_{i}$ and ${\widehat{\beta}}_{j}$ share the same sign),

$${\widehat{\beta}}_{i}-{\widehat{\beta}}_{j}=\frac{1}{{\lambda}_{2}}({\stackrel{~}{\mathbf{x}}}_{i}^{\prime}-{\stackrel{~}{\mathbf{x}}}_{j}^{\prime})(\stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}\widehat{\beta}),\phantom{\rule{1em}{0ex}}\text{so}\phantom{\rule{1em}{0ex}}\mid {\widehat{\beta}}_{i}-{\widehat{\beta}}_{j}\mid \le \frac{1}{{\lambda}_{2}}\mid {\stackrel{~}{\mathbf{x}}}_{i}-{\stackrel{~}{\mathbf{x}}}_{j}\mid \phantom{\rule{thinmathspace}{0ex}}\mid \stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}\widehat{\beta}\mid ,$$

where $\mid \mathbf{x}\mid =\sqrt{{\mathbf{x}}^{\prime}\mathbf{x}}$ and the inequality follows from the Cauchy-Schwarz inequality. By (15),

$${\mid \stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}\widehat{\beta}\mid}^{2}\le {\mid \stackrel{~}{\mathbf{z}}\mid}^{2},$$

since $\stackrel{~}{\mathbf{z}}$ is centered. Hence,

$$\frac{\mid {\widehat{\beta}}_{i}-{\widehat{\beta}}_{j}\mid}{\mid \stackrel{~}{\mathbf{z}}\mid}\le \frac{1}{{\lambda}_{2}}\mid {\stackrel{~}{\mathbf{x}}}_{j}-{\stackrel{~}{\mathbf{x}}}_{i}\mid \frac{\mid \stackrel{~}{\mathbf{z}}-\stackrel{~}{\mathbf{X}}\widehat{\beta}\mid}{\mid \stackrel{~}{\mathbf{z}}\mid}\le \frac{1}{{\lambda}_{2}}\mid {\stackrel{~}{\mathbf{x}}}_{j}-{\stackrel{~}{\mathbf{x}}}_{i}\mid =\frac{1}{{\lambda}_{2}}\sqrt{2(1-{\mathbf{x}}_{i}^{\prime}\mathbf{A}{\mathbf{x}}_{j})},$$

where ${\stackrel{~}{\mathbf{x}}}_{i}^{\prime}{\stackrel{~}{\mathbf{x}}}_{j}={\mathbf{x}}_{i}^{\prime}\mathbf{A}{\mathbf{x}}_{j}$ is the correlation between the standardized variables ${\stackrel{~}{\mathbf{x}}}_{i}$ and ${\stackrel{~}{\mathbf{x}}}_{j}$.
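The final bound can be verified numerically. For two standardized predictors whose elastic net coefficients are both positive, the stationarity conditions displayed above reduce to a 2×2 linear system, so the exact minimizer is available without an iterative solver. The snippet below is an illustration under those assumptions (the seed, dimensions, and tuning values are arbitrary choices, not from the paper).

```python
import numpy as np

# Two highly correlated predictors, centered and scaled to unit norm
# (as the bound assumes); the response is centered.
rng = np.random.default_rng(3)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)
X /= np.sqrt(np.sum(X ** 2, axis=0))
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(n)
y -= y.mean()

lam1, lam2 = 0.01, 0.5
G = X.T @ X
# Stationarity with both coefficients positive:
#   -2 x_j'(y - X b) + lam1 + 2 lam2 b_j = 0  =>  (G + lam2 I) b = X'y - lam1/2
b = np.linalg.solve(G + lam2 * np.eye(2), X.T @ y - lam1 / 2.0)

rho = G[0, 1]  # correlation of the unit-norm columns
bound = np.sqrt(2.0 * (1.0 - rho)) / lam2
ratio = abs(b[0] - b[1]) / np.linalg.norm(y)
# By the derivation above, ratio never exceeds bound.
```

As the correlation approaches one, the bound (and hence the coefficient difference) is forced toward zero, which is the quantitative content of the grouping effect.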

- Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F, editors. Proceedings of the Second International Symposium on Information Theory. Budapest: Akademiai Kiado; 1973. pp. 267–281.
- Buckley J, James I. Linear regression with censored data. Biometrika. 1979;66:429–436.
- Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Datta S. Estimating the mean life time using right censored data. Statistical Methodology. 2005;2:65–69.
- Datta S, Le-Rademacher J, Datta S. Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO. Biometrics. 2007;63:259–271.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407–499.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1359.
- Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–148.
- Ghosh S. Adaptive elastic net: an improvement of elastic net to achieve oracle properties. IUPUI technical report no. PR07-01. 2007.
- Gui J, Li H. Threshold gradient descent method for censored data regression, with applications in pharmacogenomics. Pacific Symposium on Biocomputing. 2005a;10:272–283.
- Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005b;21:3001–3008.
- Hastie T, Tibshirani R. Exploring the nature of covariate effects in the proportional hazards model. Biometrics. 1990;46:1005–1016.
- Heagerty PJ, Lumley T, Pepe MS. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics. 2000;56:337–344.
- Huang J, Harrington D. Penalized partial likelihood regression for right-censored data. Biometrics. 2002;58:781–791.
- Huang J, Ma S, Xie H. Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics. 2006;62:813–820.
- Hunter D, Li R. Variable selection using MM algorithms. Annals of Statistics. 2005;33:1617–1642.
- Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association. 1958;53:457–481.
- Li H, Luan Y. Kernel Cox regression models for linking gene expression profiles to censored survival data. Pacific Symposium on Biocomputing. 2003;8:65–76.
- Lu W, Zhang HH. Variable selection for proportional odds model. Statistics in Medicine. 2007;26:3771–3781.
- Park MY, Hastie T. An *L*_{1} regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B. 2007;69:659–677.
- Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher R, Gascoyne RD, Muller-Hermelink K, Smeland EB, Staudt LM. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England Journal of Medicine. 2002;346:1937–1947.
- Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
- Segal MR. Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited. Biostatistics. 2005;7:268–285.
- Sha N, Tadesse MG, Vannucci M. Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics. 2006;22:2262–2268.
- Tibshirani R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani R. The LASSO method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395.
- Verweij P, Van Houwelingen H. Cross-validation in survival analysis. Statistics in Medicine. 1993;12:2305–2314.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568.
- Wang S, Nan B, Zhu J, Beer DG. Doubly penalized Buckley-James method for survival data with high-dimensional covariates. Biometrics. 2008;64:132–140.
- Wei LJ. The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Statistics in Medicine. 1992;11:1871–1879.
- Wu CSP, Zubovic Y. A large-scale Monte Carlo study of the Buckley-James estimator with censored data. Journal of Statistical Computation and Simulation. 1995;51:97–119.
- Zhang HH, Lu W. Adaptive lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics. 2009, to appear.