
Variable selection in the accelerated failure time model via the bridge method

Abstract

In high throughput genomic studies, an important goal is to identify a small number of genomic markers that are associated with the development and progression of diseases. A representative example is microarray prognostic studies, where the goal is to identify genes whose expressions are associated with disease-free or overall survival. Because of the high dimensionality of gene expression data, standard survival analysis techniques cannot be directly applied. In addition, among the thousands of genes surveyed, only a subset are disease-associated. Gene selection is therefore needed along with estimation. In this article, we model the relationship between gene expressions and survival using the accelerated failure time (AFT) model. We use bridge penalization for regularized estimation and gene selection. An efficient iterative computational algorithm is proposed. Tuning parameters are selected using V-fold cross validation. We use a resampling method to evaluate the prediction performance of the bridge estimator and the relative stability of the identified genes. We show that the proposed bridge estimator is selection consistent under appropriate conditions. Analysis of two lymphoma prognostic studies suggests that the bridge estimator can identify a small number of genes and can have better prediction performance than the Lasso.

Keywords: Bridge penalization, Censored data, High dimensional data, Selection consistency, Stability, Sparse model

1 Introduction

High throughput technologies make it possible to identify genomic markers that are associated with disease development and progression. Gene profiling studies have been extensively conducted using microarrays. Identification of genomic markers from the analysis of microarray data may lead to a better understanding of the genomic mechanisms underlying disease development and assist future clinical diagnosis and prognosis. Among the many disease outcomes measured in microarray studies, censored disease-free or overall survival has attracted much attention. See Alizadeh et al. (2000), Rosenwald et al. (2003), and Dave et al. (2004) for representative examples. Because of the high dimensionality of gene expression data, standard survival analysis techniques cannot be directly used. In addition, among the thousands of genes surveyed, only a subset may be associated with disease. Thus, gene selection is needed along with survival model construction.

When analyzing censored survival data with microarray gene expression measurements, the Cox proportional hazards model and the additive risk model have been adopted (Gui and Li 2005; Ma and Huang 2007). An alternative to those models is the accelerated failure time (AFT) model. Unlike the Cox and additive models, the AFT model is a linear regression model, in which logarithm (or in general a known monotone transformation) of the failure time is directly regressed on gene expressions (Kalbfleisch and Prentice 1980). Compared with the Cox and additive models, the AFT model has an intuitive linear regression interpretation (Wei 1992). In this article, we apply the method of Zhou (1992) and Stute (1993), which uses the Kaplan–Meier weights to account for censoring and has a weighted least squares loss function. The simple form of the loss function makes this estimation approach especially suitable for high dimensional data.

To tackle the high dimensionality of gene expression data, various dimension reduction and variable selection techniques have been employed. Previously used dimension reduction techniques include principal component analysis, singular value decomposition, partial least squares, and others. Among the many variable selection techniques developed, penalized selection has attracted extensive attention. Penalization methods put penalties on the regression coefficients. By properly balancing goodness of fit and model complexity, penalization approaches can lead to parsimonious models with reasonable fit.

The best known example of penalization methods is the Lasso (Tibshirani 1996), which has been used in gene expression analysis with survival data (Gui and Li 2005; Ma and Huang 2007; Wang et al. 2008). However, it has been shown that the Lasso is in general not variable selection consistent (Leng et al. 2006). Various penalization methods with consistent selection have been proposed; examples include the adaptive Lasso and the SCAD. Another penalty that also enjoys consistent selection is the bridge penalty. Under the conventional setup, i.e., when the number of observations is much larger than the number of covariates, the bridge penalty has been investigated; see for example Fu (1998). Huang et al. (2008a) show that the bridge penalty can have the oracle estimation and selection properties in linear regression models with a divergent number of covariates.

In this article, we consider genomic studies where gene expressions are measured along with censored disease survival. The bridge penalization approach is used for regularized estimation and gene selection. The rest of the article is organized as follows. The AFT model and bridge estimation are introduced in Sect. 2. An efficient computational algorithm is proposed in Sect. 3. Resampling based methods for evaluating prediction performance and the relative stability of selected genes are proposed in Sect. 4. Asymptotic selection consistency is established in Sect. 5. Analyses of two lymphoma studies are provided in Sect. 6 to illustrate the proposed method. The article concludes with discussions in Sect. 7. Proofs are given in the Appendix.

2 Bridge estimation in the AFT model

Let Ti be the logarithm of the failure time and Xi be the p-dimensional vector of gene expressions for the ith subject in a random sample of size n. The AFT model assumes

T_i = α + X_i′β + ε_i,   i = 1, …, n,
(1)

where α is the intercept, β ∈ ℝ^p is the regression coefficient, and ε_i is the error term. When T_i is subject to right censoring, we observe (Y_i, δ_i, X_i), where Y_i = min{T_i, C_i}, C_i is the logarithm of the censoring time, and δ_i = 1{T_i ≤ C_i} is the censoring indicator.

Estimation in the AFT model with an unspecified error distribution has been studied extensively. The following two approaches have received special attention. The first is the Buckley-James estimator, which adjusts censored observations using the Kaplan–Meier estimator (Buckley and James 1979); the second is the rank based estimator motivated by the score function of the partial likelihood (Ying 1993). Although both perform well with a small number of covariates, with high dimensional gene expression data both approaches have high computational cost.

A computationally more feasible approach is the weighted least squares (LS) approach (Zhou 1992; Stute 1993). Let F̂_n be the Kaplan–Meier estimator of F, the distribution function of T. It can be computed as F̂_n(y) = ∑_{i=1}^n w_i 1{Y_{(i)} ≤ y}. Here the w_i's are the jumps in the Kaplan–Meier estimator, computed as w_1 = δ_{(1)}/n and

w_i = δ_{(i)}/(n−i+1) ∏_{j=1}^{i−1} ((n−j)/(n−j+1))^{δ_{(j)}},   i = 2, …, n.

The w_i's have also been referred to as the Kaplan–Meier weights in Stute (1993). Here Y_{(1)} ≤ ⋯ ≤ Y_{(n)} are the order statistics of the Y_i's, δ_{(1)}, …, δ_{(n)} are the associated censoring indicators, and X_{(1)}, …, X_{(n)} are the associated covariates. The weighted LS loss function is

(1/2) ∑_{i=1}^n w_i (Y_{(i)} − α − X_{(i)}′β)².

We center X_{(i)} and Y_{(i)} at their w_i-weighted means. Let X̄_w = ∑_{i=1}^n w_i X_{(i)} / ∑_{i=1}^n w_i and Ȳ_w = ∑_{i=1}^n w_i Y_{(i)} / ∑_{i=1}^n w_i. Denote X*_{(i)} = (n w_i)^{1/2}(X_{(i)} − X̄_w) and Y*_{(i)} = (n w_i)^{1/2}(Y_{(i)} − Ȳ_w). We can then rewrite the weighted LS loss function as

Q_n(β) = (1/2) ∑_{i=1}^n (Y*_{(i)} − X*_{(i)}′β)².
(2)
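
For concreteness, the weights and the rescaled data can be computed as in the following minimal Python sketch (our own illustration rather than the authors' code; the function names are ours, and ties in the observed times are broken arbitrarily by the sort):

    import numpy as np

    def km_weights(y, delta):
        # Kaplan-Meier weights of Stute (1993): w_1 = delta_(1)/n and
        # w_i = delta_(i)/(n-i+1) * prod_{j<i} ((n-j)/(n-j+1))^{delta_(j)}.
        order = np.argsort(y)             # sort by observed (log) time
        d = delta[order].astype(float)
        n = len(y)
        w = np.zeros(n)
        w[0] = d[0] / n
        prod = 1.0                        # running product over j = 1, ..., i-1
        for i in range(1, n):             # i is the 0-based rank
            prod *= ((n - i) / (n - i + 1.0)) ** d[i - 1]
            w[i] = d[i] / (n - i) * prod
        return order, w

    def center_and_rescale(X, y, delta):
        # Return X*_(i) = (n w_i)^{1/2} (X_(i) - Xbar_w) and the analogous
        # Y*_(i), so that Q_n(beta) = 0.5 * ||ys - Xs @ beta||^2 as in (2).
        order, w = km_weights(y, delta)
        Xo, yo = X[order], y[order]
        xbar, ybar = w @ Xo / w.sum(), w @ yo / w.sum()
        r = np.sqrt(len(y) * w)
        return r[:, None] * (Xo - xbar), r * (yo - ybar)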

The bridge penalized objective function is

L_n(β) = Q_n(β) + λ ∑_{j=1}^p |β_j|^γ,
(3)

where λ is a data dependent tuning parameter and γ > 0 is the bridge index. The value β̂_n that minimizes (3) is called the bridge estimator (Frank and Friedman 1993; Fu 1998).

The bridge estimator includes two important special cases. When γ = 2, it is the familiar ridge estimator, which does not have a “built-in” variable selection mechanism. When γ = 1, it is the Lasso estimator. In this article, we focus on the case with 0 < γ < 1.

3 Computation

3.1 Computational algorithm

Direct minimization of Ln(β) is difficult, since the bridge penalty is not convex. An approximation approach is proposed in Huang et al. (2008a). As an alternative, we consider the following approach, which is more efficient and does not need any approximation.

For 0 < γ < 1, define

S_n(β, θ) = Q_n(β) + ∑_{j=1}^p θ_j^{1−1/γ} |β_j| + τ_n ∑_{j=1}^p θ_j,
(4)

where τn is a penalty parameter.

Proposition 1 If λ = τ_n^{1−γ} γ^{−γ} (1−γ)^{γ−1}, then β̂_n minimizes L_n(β) if and only if (β̂_n, θ̂) minimizes S_n(β, θ) subject to θ_j ≥ 0 for j = 1, …, p.

This proposition can be proved as in Huang et al. (2009), in which a similar result is shown for the group bridge estimator in linear regression without censoring. Based on Proposition 1, we propose the following iterative algorithm for computing the bridge estimate in the AFT models.

  1. Compute an initial estimate β^(0). Specifically, we propose using the Lasso estimate, i.e., the minimizer of (3) with γ = 1.
    For s = 1, 2, …
  2. Compute θ_j^{(s)} = ((1−γ)/(τ_n γ))^γ |β_j^{(s−1)}|^γ, j = 1, …, p.
  3. Compute β^{(s)} = arg min_β {Q_n(β) + ∑_{j=1}^p (θ_j^{(s)})^{1−1/γ} |β_j|}.
  4. Repeat Steps 2–3 until convergence.
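
For completeness, the update in Step 2 is the exact minimizer of (4) over θ for fixed β, which we verify here (a short added derivation; it parallels the group bridge computation in Huang et al. 2009). For β_j ≠ 0, setting the partial derivative of (4) with respect to θ_j to zero gives

(1 − 1/γ) θ_j^{−1/γ} |β_j| + τ_n = 0   ⟹   θ_j = ((1−γ)/(τ_n γ))^γ |β_j|^γ.

Substituting this θ_j into Step 3 and using λ = τ_n^{1−γ} γ^{−γ} (1−γ)^{γ−1} from Proposition 1, the Step 3 weight simplifies to (θ_j^{(s)})^{1−1/γ} = γλ |β_j^{(s−1)}|^{γ−1}, so that each iteration solves a weighted Lasso whose weights equal the derivative of the bridge penalty at the current iterate.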

The proposed algorithm always converges, since at each step the non-negative objective function (4) decreases. In our numerical studies, convergence is usually achieved within ten iterations. In Step 1, we choose the Lasso estimate as the initial value, with the penalty parameter in the Lasso criterion determined by V-fold cross validation. Theorem 1 in Sect. 5 establishes that the Lasso tends to select all the important genes plus a few false positives. Thus, using the Lasso as the starting value will not miss any important genes. The main computational cost comes from Step 3, which computes a weighted Lasso estimate and can be carried out with many existing algorithms, such as the LARS (Efron et al. 2004) or boosting (Ma and Huang 2007). In this article, we adopt boosting, since its computational cost is relatively insensitive to the number of genes.
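
To fix ideas, the full iteration can be sketched in Python as below. Step 3 is solved here with the coordinate-descent Lasso solver from scikit-learn instead of boosting (a substitution for illustration only), using the fact that a weighted Lasso reduces to a standard Lasso after rescaling each column by the reciprocal of its weight; the weights γλ|β_j|^{γ−1} are the Step 2–3 composition derived above, and the convergence threshold is our choice:

    import numpy as np
    from sklearn.linear_model import Lasso

    def weighted_lasso(X, y, pen):
        # Solve min_b 0.5*||y - X b||^2 + sum_j pen[j]*|b_j| by rescaling
        # column j by 1/pen[j]; sklearn's Lasso uses the 1/(2n)-scaled loss.
        n = X.shape[0]
        Xt = X / pen
        fit = Lasso(alpha=1.0 / n, fit_intercept=False, max_iter=100000).fit(Xt, y)
        return fit.coef_ / pen

    def bridge_aft(Xs, ys, lam, gamma=0.5, n_iter=20, tol=1e-6):
        # Iterative bridge estimate of Sect. 3.1 on centered/rescaled data.
        n, p = Xs.shape
        beta = Lasso(alpha=lam / n, fit_intercept=False,
                     max_iter=100000).fit(Xs, ys).coef_
        # Step 1: Lasso start (the paper tunes this penalty separately by CV).
        for _ in range(n_iter):
            active = np.abs(beta) > 1e-10    # zero coefficients carry an
            beta_new = np.zeros(p)           # infinite weight and stay zero
            if active.any():
                pen = gamma * lam * np.abs(beta[active]) ** (gamma - 1.0)
                beta_new[active] = weighted_lasso(Xs[:, active], ys, pen)
            if np.max(np.abs(beta_new - beta)) < tol:    # Step 4
                return beta_new
            beta = beta_new
        return beta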

3.2 Tuning parameter selection

We use V-fold cross validation to determine the tuning parameter λ. For a pre-defined integer V, partition the data randomly into V non-overlapping subsets of equal size. For a given λ, define the CV score as ∑_{υ=1}^V Q^{(υ)}(β̂^{(−υ)}), where β̂^{(−υ)} is the bridge estimator of β based on the data without the υth subset and Q^{(υ)} is the loss function defined in (2) evaluated on the υth subset. The optimal tuning parameter is defined as the minimizer of the CV score. In this article, we set V = 5.
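
A sketch of this selection loop, under the simplifying assumption that the Kaplan–Meier weights and the rescaling of Sect. 2 have been computed once on the full sample; bridge_aft is the sketch from Sect. 3.1, and the candidate grid for λ is left unspecified:

    import numpy as np

    def cv_score(Xs, ys, lam, V=5, seed=0):
        # Sum over the V folds of Q^(v)(beta^(-v)) for a candidate lambda.
        rng = np.random.default_rng(seed)
        n = Xs.shape[0]
        folds = np.array_split(rng.permutation(n), V)
        score = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            beta = bridge_aft(Xs[train], ys[train], lam)
            score += 0.5 * np.sum((ys[test] - Xs[test] @ beta) ** 2)
        return score

    # Choose the minimizer over a grid of candidate values (grid is ours):
    # lam_hat = min(lam_grid, key=lambda l: cv_score(Xs, ys, l, V=5))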

4 Evaluation

With gene expression data, p ≫ n. Most conventional evaluation techniques are valid only in the p ≪ n scenario and cannot be applied here. In this study, we are most interested in two aspects: (1) prediction performance, that is, whether the identified genes and the corresponding AFT model can make proper predictions for subjects not used in the model estimation; and (2) stability of the identified genes. Earlier studies have shown that gene signatures identified from analysis of gene expression data may suffer from low reproducibility; that is, genes identified using different data sets may differ significantly. Ideally, evaluation should be based on independent data, which is usually not available. As an alternative, we propose the following resampling based approaches.

4.1 Evaluation of prediction

We propose prediction evaluation based on random partitions as follows.

  1. Partition the data randomly into a training set of size n1 and a testing set of size n2 with n1 + n2 = n. We use n1 = (2/3)n.
  2. Compute the bridge estimate using the training set data. Cross validation is used to select the optimal tuning parameter for the training set.
  3. Use the training set estimate to make predictions for subjects in the testing set. Specifically, we first compute the risk scores Xβ̂. We then dichotomize the risk scores at the median to create two risk groups (referred to as the high and low risk groups, respectively), compare the survival functions of the two groups, and compute the logrank statistic.
  4. To avoid overfitting caused by a “lucky” partition, we repeat Steps 1–3 B = 500 times. Each time a new partition is made and the value of the logrank statistic is computed.

We partition the dataset in Step 1. To generate a fair evaluation, we recompute the tuning parameter for each individual partition in Step 2. In Step 3, we adopt the logrank statistic as the evaluation measure. A larger logrank statistic suggests that the high and low risk groups are better separated and hence that the proposed approach is more effective. We create only two risk groups, mainly because of the small sample sizes. By repeating the partitioning process many times, we obtain a Monte Carlo estimate of the distribution of the logrank statistic (as opposed to the single logrank statistic reported in several earlier studies). We call it the observed predictive distribution (OPD) of the logrank statistic.

When n ≫ p, the logrank statistic is asymptotically χ² distributed. With gene expression data and n ≪ p, it is not clear how accurate the χ² approximation is. To tackle this problem, we propose the following permutation based approach to generate the reference distribution for the OPD. We first randomly permute the event times together with the censoring indicators. We then follow the same procedure as for the OPD and obtain a Monte Carlo estimate of the distribution of the logrank statistic under permutation. We call it the permutation predictive distribution (PPD) of the logrank statistic. After permutation, the event times and gene expressions are expected to be independent. The distribution of logrank statistics so computed can serve as the reference distribution for the OPD.

Calculations of the OPD and PPD are parallel: the OPD is calculated from the observed data, whereas the PPD is calculated from the permuted data. Well separated OPD and PPD indicate that the proposed approach can identify genes and models with satisfactory prediction performance, whereas substantially overlapping distributions suggest that either the proposed approach is not effective or the gene expressions do not have good discriminant power.
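
Since the OPD and PPD computations differ only in whether the survival outcomes are permuted first, a single routine can generate both. The sketch below uses the logrank test from the lifelines package (an assumed dependency) together with the center_and_rescale and bridge_aft sketches from earlier sections; for brevity it holds the tuning parameter fixed across partitions, whereas the procedure above reselects it by cross validation on every training set:

    import numpy as np
    from lifelines.statistics import logrank_test

    def predictive_distribution(X, y, delta, lam, B=500, permute=False, seed=0):
        # Monte Carlo distribution of the test-set logrank statistic:
        # OPD when permute=False, PPD when permute=True (Sect. 4.1).
        # y holds the observed log-times; the logrank test is rank based,
        # so using log-times rather than times does not change it.
        rng = np.random.default_rng(seed)
        n = len(y)
        n1 = int(2 * n / 3)
        stats = np.empty(B)
        for b in range(B):
            yy, dd = y.copy(), delta.copy()
            if permute:                      # break the gene/survival link
                perm = rng.permutation(n)
                yy, dd = yy[perm], dd[perm]
            idx = rng.permutation(n)
            tr, te = idx[:n1], idx[n1:]
            Xs, ys = center_and_rescale(X[tr], yy[tr], dd[tr])
            beta = bridge_aft(Xs, ys, lam)
            risk = X[te] @ beta              # risk scores X beta-hat
            hi = risk > np.median(risk)      # dichotomize at the median
            res = logrank_test(yy[te][hi], yy[te][~hi],
                               event_observed_A=dd[te][hi],
                               event_observed_B=dd[te][~hi])
            stats[b] = res.test_statistic
        return stats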

4.2 Evaluation of stability

The prediction evaluation described above assesses the overall performance of the proposed approach and the selected genes/models. In what follows, we evaluate the relative stability of each identified gene. The rationale behind the proposed approach is that, if a gene is more “important” or more “stable”, it should be identified more often in analyses of multiple data sets. Since multiple independent data sets are not available, we again resort to random sampling.

We first randomly sample n1 = (2/3)n subjects. We then use the bridge approach to identify genes in the sampled subset. We repeat this procedure B = 500 times. For the jth gene, we count cj, the number of times it is identified. The proportion oj = cj/B measures the relative importance and stability of the jth gene, and will be referred to as the observed occurrence index (OOI). Following the same rationale as in the previous section, we also permute the data and recompute the occurrence index, which will be referred to as the permutation occurrence index (POI), since permuted data are used. The occurrence indices are simply byproducts of the prediction evaluation and incur no additional computational cost.
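
As noted, the occurrence indices are byproducts of the prediction evaluation; written as a standalone loop for clarity (again holding the tuning parameter fixed, a simplification), the computation is:

    import numpy as np

    def occurrence_index(X, y, delta, lam, B=500, seed=0):
        # Fraction of B random 2/3-subsamples in which each gene is selected;
        # run on permuted (y, delta) to obtain the POI instead of the OOI.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        counts = np.zeros(p)
        for _ in range(B):
            sub = rng.choice(n, size=int(2 * n / 3), replace=False)
            Xs, ys = center_and_rescale(X[sub], y[sub], delta[sub])
            counts += np.abs(bridge_aft(Xs, ys, lam)) > 1e-10
        return counts / B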

5 Asymptotic properties

In this section, we investigate asymptotic properties of the proposed bridge approach with p ≫ n. We are especially interested in the gene selection consistency property, because once genes are consistently selected, standard approaches can lead to consistent estimates. We note that for fixed p, the asymptotic results can be obtained easily using standard approaches. Since the case with fixed p is not relevant to our data applications, we will not consider it here.

We note that, with the proposed iterative algorithm, the Lasso estimate is used as the starting value. Genes not selected by the Lasso will not be selected in the final model. Thus, it is crucial to first establish properties of the Lasso estimate under the present data/model setup. Careful inspection of the proposed computational algorithm shows that, once the initial estimate is obtained, each step computes an adaptive Lasso estimate. Thus, we are able to use methods similar to those in Huang et al. (2008b), which studies properties of the adaptive Lasso in high dimensional linear regression models, to establish properties of the bridge estimate.

We consider the rescaled X*_{(i)} and Y*_{(i)} defined in Sect. 2. For simplicity of notation, we use X_{(i)} and Y_{(i)} to denote X*_{(i)} and Y*_{(i)} hereafter. Let Y = (Y_{(1)}, …, Y_{(n)})′. Let X be the n × p covariate matrix with row vectors X_{(1)}′, …, X_{(n)}′. Let X_1, …, X_p be the p columns of X. Let W = diag(nw_1, …, nw_n) be the diagonal matrix of the Kaplan–Meier weights. For A ⊆ {1, …, p}, let X_A = (X_j, j ∈ A) be the matrix with columns X_j for j ∈ A. Denote Σ_A = X_A′WX_A/n. Denote the cardinality of A by |A|.

Let β0 = (β01, …, β0p)′ be the true value of the regression coefficients. Let A1 = {j : β0j ≠ 0} be the set of nonzero coefficients and let q = |A1|. We make the following assumptions.

  • (A1)
    The number of nonzero coefficients q is finite.
  • (A2)
    (a) The observations (Y_i, X_i, δ_i), 1 ≤ i ≤ n, are independent and identically distributed; (b) The errors ε_1, …, ε_n are independent and identically distributed with mean 0 and finite variance σ². Furthermore, they are subgaussian, in the sense that there exist K_1, K_2 > 0 such that the tail probabilities of ε_i satisfy P(|ε_i| > x) ≤ K_2 exp(−K_1x²) for all x ≥ 0 and all i.
  • (A3)
    (a) The errors (ε_1, …, ε_n) are independent of the Kaplan–Meier weights (w_1, …, w_n); (b) The covariates are bounded. That is, there is a constant M > 0 such that |X_ij| ≤ M, 1 ≤ i ≤ n, 1 ≤ j ≤ p.
  • (A4)
    The covariate matrix satisfies the sparse Riesz condition (SRC) with rank q*: there exist constants 0 < c_* < c^* < ∞ such that, for q* = (3 + 4C)q and C = c^*/c_*, with probability converging to 1, c_* ≤ ν′Σ_Aν/‖ν‖² ≤ c^* for all A with |A| = q* and all ν ∈ ℝ^{q*}, where ‖·‖ is the ℓ2 norm.

By (A1), the model is sparse in the sense that, although the total number of covariates may be large, the number of covariates with nonzero coefficients is still small. The tail probability assumption in (A2) has been used with high-dimensional linear regression models; see for example Zhang and Huang (2008). With assumption (A3), it can be shown that the subgaussian tail property still holds under censoring. The SRC condition (A4) was formulated in the study of the Lasso in linear regression without censoring (Zhang and Huang 2008). This condition implies that all the eigenvalues of any d × d submatrix of X′WX/n with d ≤ q* lie between c_* and c^*. It ensures that any model with dimension no greater than q* is identifiable.

We first consider the Lasso estimator defined as β̃ = arg min_β {Q_n(β) + λ ∑_{j=1}^p |β_j|}. With β̃ = (β̃_1, …, β̃_p)′, let Ã_1 = {j : β̃_j ≠ 0} be the set of nonzero Lasso estimated coefficients.

Theorem 1 Suppose that (A1)–(A4) hold and λ ≥ O(1)√(n log p). Then

  1. With probability converging to 1, |Ã_1| ≤ (2 + 4C)q.
  2. If λ/n → 0 and (log p)/n → 0, then with probability converging to 1, all the covariates with nonzero coefficients are selected.
  3. ‖β̃ − β_0‖²_2 ≤ 16λ²q/(n²c_*²) + O_p(|Ã_1| log p/(n c_*²)). In particular, if λ = O(√(n log p)), then ‖β̃ − β_0‖²_2 = O_p(log p/n).

This theorem suggests that, with high probability, the number of covariates selected by the Lasso is at most a finite multiple of the number of covariates with nonzero coefficients. Moreover, all the covariates with nonzero coefficients are selected with probability converging to one. This justifies using the Lasso as the initial estimator in the algorithm proposed in Sect. 3.1. In addition, the Lasso estimator is estimation consistent.

Starting from the initial Lasso estimator β̃, we denote by β̂ the estimate after one iteration of the algorithm described in Sect. 3.1, taking γ = 1/2 for concreteness. Simple algebra shows that the weight computed from Step 2 of the proposed algorithm is (θ_j^{(1)})^{1−1/γ} = (λ/2)|β̃_j|^{−1/2}. Thus Step 3 of the proposed algorithm is

β̂ = arg min_β {Q_n(β) + (λ/2) ∑_{j=1}^p |β̃_j|^{−1/2} |β_j|}.

β̂ computed above takes the form of an adaptive Lasso estimator. Of note, the penalty parameter here is the same λ as used in the initial Lasso estimator.

For any vector x = (x_1, x_2, …), denote its sign vector by sgn(x) = (sgn(x_1), sgn(x_2), …), where sgn(x_i) = 1, 0, or −1 according as x_i > 0, x_i = 0, or x_i < 0.

Theorem 2 Suppose that (A1)–(A4) are satisfied, (log p)/n → 0, and λ = O(√(n log p)). Then

P(sgn(β̂) = sgn(β_0)) → 1.

The above theorem shows that the one-step estimator is sign consistent. Thus, the one-step estimator is selection consistent, in the sense that it can correctly distinguish covariates with zero and nonzero coefficients with probability converging to 1. Following similar arguments, we can prove that any finite-step estimator (computed from the algorithm described in Sect. 3.1) is sign consistent and hence selection consistent. We note that, although the one-step estimator is selection consistent, our numerical studies suggest that iterating until convergence tends to improve finite sample performance.

We note that in Theorem 2, we allow log p = o(n) or p = exp(o(n)). Thus the dimension of covariates can be larger than the sample size, which accommodates gene expression data.

6 Data analysis

6.1 Mantle cell lymphoma data

A study using microarray expression analysis of mantle cell lymphoma (MCL) is reported in Rosenwald et al. (2003). The primary goal of this study is to identify genes that have good predictive power of patients’ survival risk. Among 101 untreated patients with no history of previous lymphoma, 92 were classified as having MCL based on established morphologic and immunophenotypic criteria. Survival times of 64 patients were available and the other 28 patients were censored. The median survival time was 2.8 years (range 0.02 to 14.05 years). Lymphochip DNA microarrays were used to quantify mRNA expression in the lymphoma samples from the 92 patients. The gene expression data that contains expression values of 8810 cDNA elements is available at http://llmpp.nih.gov/MCL/.

We model survival with the AFT model and use the proposed bridge approach for gene selection. Although there is no limit on the number of genes the proposed approach can handle, we pre-process the data as follows to exclude noise and gain stability: (1) unsupervised screening: compute the interquartile ranges of all gene expressions and remove genes whose interquartile ranges fall below their first quartile; 6608 genes pass this screening; (2) supervised screening: compute the correlation coefficients of the uncensored survival times with gene expressions and select the 500 genes with the largest absolute correlation coefficients. We then standardize these 500 gene expressions to have zero mean and unit variance. We note that the supervised screening utilizes the survival information; in the random sampling based evaluation, to guarantee a fair evaluation, the supervised screening needs to be conducted for each sampled dataset.
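
These two screening steps can be sketched as follows; reading “interquartile ranges smaller than their first quartile” as the first quartile of the IQRs themselves is our interpretation, consistent with roughly three quarters (6608 of 8810) of the genes passing:

    import numpy as np

    def screen(X, y, delta, keep=500):
        # Unsupervised screening: drop genes in the lowest quartile of IQRs.
        q1, q3 = np.percentile(X, [25, 75], axis=0)
        iqr = q3 - q1
        X = X[:, iqr >= np.quantile(iqr, 0.25)]
        # Supervised screening: rank genes by |correlation| with the
        # uncensored survival times, then standardize the top `keep`.
        unc = delta == 1
        r = np.array([abs(np.corrcoef(X[unc, j], y[unc])[0, 1])
                      for j in range(X.shape[1])])
        top = np.argsort(r)[::-1][:keep]
        Z = X[:, top]
        return (Z - Z.mean(axis=0)) / Z.std(axis=0), top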

We employ the proposed approach and select the optimal tuning with 5-fold cross validation. Genes selected with the bridge approach and their corresponding estimates are shown in Table 1. For comparison, we also provide the Lasso estimate. 40 and 34 genes are identified using the Lasso and bridge approaches, respectively. Because of the special setup of the computational algorithm, genes identified using the bridge are a subset of those identified using the Lasso.

Table 1
Mantle cell lymphoma data

We evaluate prediction performance using the approach described in Sect. 4.1. For comparison, we also evaluate the Lasso approach using the same technique. We show in Fig. 1 (upper panels) the density estimates of the OPD and PPD. We can see that (1) the bridge yields well separated OPD and PPD (Wilcoxon test p-value < 0.001), which suggests satisfactory prediction performance: the 90th percentile of the PPD is 3.32, and 62% of the logrank statistics from the OPD are larger than that value; (2) the PPD is close to the χ² distribution, although a small discrepancy remains: the 95th and 90th percentiles of the PPD are 4.11 and 3.32, respectively, slightly larger than their χ² counterparts (3.84 and 2.71); (3) the prediction performance of the Lasso is also satisfactory, but inferior to that of the bridge: the mean and median of the Lasso OPD are 4.617 and 3.616, smaller than their bridge counterparts (5.319 and 4.404).

Fig. 1
MCL data. Left-upper panel: Lasso estimation; right-upper panel: bridge estimation. Green solid line: OPD; red dash-dotted line: PPD; black dotted line: density estimate from 500 χ² distributed random variables. Lower panels: occurrence indices (see Sect. 4.2)

We evaluate the stability of the identified genes using the approach described in Sect. 4.2. Results are shown in the lower panels of Fig. 1. For a better view, we only plot the 500 genes that pass the screening in the whole dataset. We can see that (1) most of the identified genes have relatively large OOIs; (2) a few genes that are not identified nevertheless have moderate OOIs; it is not clear why those genes are not identified, and this question is worth investigating in future studies; and (3) with permuted data, the POIs for all genes are small, with no genes having significantly larger occurrence indices than others.

6.2 Diffuse large B-cell lymphoma data

The DLBCL (diffuse large B-cell lymphoma) data were first analyzed in Rosenwald et al. (2002). This dataset consists of 240 patients with DLBCL, including 138 patient deaths during follow-up. Expression profiles of 7399 genes are available. Missing values are imputed using a K-nearest neighbors approach. We carry out supervised screening and select the 500 genes with the largest absolute marginal correlation coefficients with the uncensored event times to gain stability. Gene expressions are then normalized to have zero mean and unit variance. We note that, in the random sampling based evaluation, the supervised screening is conducted for each sampled dataset.

With the proposed approach and optimal tuning selected using 5-fold cross validation, 44 genes are identified. As a comparison, the Lasso identifies 46 genes. The UNI-QID, gene names, and bridge and Lasso estimated coefficients are shown in Table 2.

Table 2
DLBCL data

We evaluate prediction performance and show the results in Fig. 2. Conclusions similar to those in Sect. 6.1 can be drawn. With the bridge, the OPD has mean 4.416 and median 3.674, which are larger than their Lasso counterparts (3.53 and 2.59). 61% of the logrank statistics from the OPD are greater than the 90th percentile of the PPD. The Wilcoxon test suggests that the OPD and PPD are well separated (p-value < 0.001). Evaluation of stability using the occurrence index is presented in Fig. 2 (lower panels). The observations are similar to those summarized in Sect. 6.1.

Fig. 2
DLBCL data. Left-upper panel: Lasso estimation; right-upper panel: bridge estimation. Green solid line: OPD; red dash-dotted line: PPD; black dotted line: density estimate from 500 χ² distributed random variables. Lower panels: occurrence indices (see Sect. 4.2)

6.3 Remark

Analyses of the MCL and DLBCL data suggest that the bridge approach identifies a smaller number of genes than the Lasso. With gene expression data, a smaller number of identified genes means more focused hypotheses for future confirmation studies, and is thus preferred. In addition, the prediction performance of the bridge is better than that of the Lasso. We note that, although prediction is not based on completely independent data, by properly using resampling and comparing the bridge and Lasso on the same basis, the prediction comparison is expected to be valid.

7 Discussions

Genomic studies with high dimensional markers measured along with censored survival outcomes are increasingly common. In this article, we model the relationship between gene expressions and censored survival with AFT models. AFT models have been commonly adopted and provide useful alternatives to the Cox and additive hazards models. Of note, since it is still not clear how to compare different models under the “large p, small n” setting, we do not pursue any model comparison. More methodological studies are needed before such a comparison can be conducted.

We propose using the bridge penalty for gene selection. Our numerical studies suggest that the bridge has better performance than the Lasso in terms of variable selection in AFT models. There are other penalties, for example the adaptive Lasso and the SCAD, that can be used in the present setup. Since it is beyond the scope of this paper to compare the proposed method with all existing alternatives, we only pursue a comparison with the Lasso, which has been commonly used as a benchmark.

Acknowledgements

This work is partially supported by CA120988 from the National Cancer Institute and DMS 0805670 from the National Science Foundation. We thank the editors and reviewers for their helpful and constructive comments on an earlier version of the paper.

Appendix: Proofs

Let τ = (τ_1, …, τ_n)′, where τ_i = √(w_i) ε_{(i)} = √(w_i)(Y_{(i)} − X_{(i)}′β_0).

Lemma 1 Suppose that conditions (A2) and (A3) hold. Let ξ_j = ∑_{i=1}^n X_{ij}τ_i, 1 ≤ j ≤ p. Let ξ_n = max_{1≤j≤p} |ξ_j|. Then

E(ξ_n) ≤ C_1 √(log p) (√(2C_2 n log p) + 4 log(2p) + C_2 n)^{1/2},

where C_1 and C_2 are two positive constants. In particular, when log(p)/n → 0,

E(ξ_n) = O(1)√(n log p).

Proof of Lemma 1 Let s_{nj}² = ∑_{i=1}^n X_{ij}². Conditional on the X_ij's, assumptions (A2) and (A3) imply that the ξ_j's are subgaussian. Let s_n² = max_{1≤j≤p} s_{nj}². By (A2) and the maximal inequality for subgaussian random variables (Van der Vaart and Wellner 1996, Lemmas 2.2.1 and 2.2.2),

E(max_{1≤j≤p} |ξ_j| | X_{ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ p) ≤ C_1 s_n √(log p),

for a constant C1 > 0. Therefore,

E(max_{1≤j≤p} |ξ_j|) ≤ C_1 √(log p) E(s_n).
(5)

Since

∑_{i=1}^n E[X_{ij}² − EX_{ij}²]² ≤ 4C_2 n,
(6)

and

max_{1≤j≤p} ∑_{i=1}^n EX_{ij}² ≤ C_2 n,
(7)

by Lemma 4.2 of van de Geer (2008), (6) implies

E(max_{1≤j≤p} |∑_{i=1}^n {X_{ij}² − EX_{ij}²}|) ≤ √(2C_2 n log p) + 4 log(2p).

Therefore, by (7) and the triangle inequality,

E s_n² ≤ √(2C_2 n log p) + 4 log(2p) + C_2 n.

Now since E s_n ≤ (E s_n²)^{1/2}, we have

E s_n ≤ (√(2C_2 n log p) + 4 log(2p) + C_2 n)^{1/2}.
(8)

The lemma follows from (5) and (8).

In the proofs below, let Y* = W1/2Y and X* = W1/2X. Then

Q_n(β) = (1/2) ∑_{i=1}^n (Y*_i − X*_i′β)² = (1/2) ‖Y* − X*β‖²,

where ‖·‖ is the ℓ2 norm.

Proof of Theorem 1, part (i) Part (i) follows from the proof of Theorem 1 of Zhang and Huang (2008). The only difference is that here we use the subgaussian assumption to control certain tail probabilities, instead of the normality condition assumed in Zhang and Huang (2008). Since subgaussian random variables have the same tail behavior as normal random variables, the argument of Zhang and Huang goes through.

Proof of Theorem 1, part (ii) Part (ii) follows from part (iii) and the assumption that the number of nonzero coefficients is fixed. Thus the absolute values of the nonzero coefficients are bounded away from 0 by a positive constant independent of n.

Proof of Theorem 1, part (iii) By the definition of β̃,

‖Y* − X*β̃‖²_2 + 2λ ∑_{j=1}^p |β̃_j| ≤ ‖Y* − X*β_0‖²_2 + 2λ ∑_{j=1}^p |β_{0j}|.

Thus

‖Y* − X*β̃‖²_2 + 2λ ∑_{j∈A_1} |β̃_j| ≤ ‖Y* − X*β_0‖²_2 + 2λ ∑_{j∈A_1} |β_{0j}|.

This implies

‖Y* − X*β̃‖²_2 − ‖Y* − X*β_0‖²_2 ≤ 2λ ∑_{j∈A_1} |β̃_j − β_{0j}|.

That is,

‖X*(β̃ − β_0)‖² − 2τ′X*(β̃ − β_0) ≤ 2λ ∑_{j∈A_1} |β̃_j − β_{0j}|.
(9)

Let B = A_1 ∪ Ã_1 = {j : β_{0j} ≠ 0 or β̃_j ≠ 0}. Note that |B| ≤ q* with probability converging to 1 by part (i), where q* is given in (A4). Denote X*_B = (X*_j, j ∈ B), β̃_B = (β̃_j, j ∈ B), and β_{0B} = (β_{0j}, j ∈ B). Denote

η_B = X*_B(β̃_B − β_{0B}).

Since A_1 ⊆ B,

∑_{j∈A_1} |β̃_j − β_{0j}| ≤ √|A_1| ‖β̃_{A_1} − β_{0A_1}‖ ≤ √|A_1| ‖β̃_B − β_{0B}‖.
(10)

By (9) and (10),

‖η_B‖² − 2τ′η_B ≤ 2λ√|A_1| ‖β̃_B − β_{0B}‖.
(11)

Let τ*_B be the projection of τ onto the span of X*_B, i.e., τ*_B = X*_B(X*_B′X*_B)^{−1}X*_B′τ. We have

τ′η_B = τ′X*_B(β̃_B − β_{0B}) = {(X*_B′X*_B)^{−1/2}X*_B′τ}′{(X*_B′X*_B)^{1/2}(β̃_B − β_{0B})}.

Therefore, by the Cauchy–Schwarz inequality,

2|τ′η_B| ≤ 2‖τ*_B‖·‖η_B‖ ≤ 2‖τ*_B‖² + (1/2)‖η_B‖².
(12)

Combining (11) and (12),

‖η_B‖² ≤ 4‖τ*_B‖² + 4λ√|A_1| ‖β̃_B − β_{0B}‖.
(13)

By the SRC condition (A4), ‖η_B‖² ≥ nc_*‖β̃_B − β_{0B}‖². Thus (13) implies

nc_*‖β̃_B − β_{0B}‖² ≤ 4‖τ*_B‖² + (4λ√|A_1|)²/(2nc_*) + (1/2)nc_*‖β̃_B − β_{0B}‖².

It follows that

‖β̃_B − β_{0B}‖² ≤ 8‖τ*_B‖²/(nc_*) + 16λ²|A_1|/(n²c_*²).
(14)

Now

‖τ*_B‖² = ‖(X*_B′X*_B)^{−1/2}X*_B′τ‖² ≤ (1/(nc_*))‖X*_B′τ‖² ≤ (1/(nc_*)) max_{A:|A|≤q*} ‖X*_A′τ‖².

We have

max_{A:|A|≤q*} ‖X*_A′τ‖² = max_{A:|A|≤q*} ∑_{j∈A} |X*_j′τ|² ≤ q* max_{1≤j≤p} |X*_j′τ|².

By Lemma 1,

max_{1≤j≤p} |X*_j′τ|² = n max_{1≤j≤p} |n^{−1/2}X*_j′τ|² = O_p(n log p).

Therefore,

‖τ*_B‖² = O_p(q* log p/c_*).
(15)

The result follows from (14) and (15).

Proof of Theorem 2 The proof follows the argument of Huang et al. (2008b), so we only provide the basic idea below. Let a_j = |β̃_j|^{−1/2}/2, 1 ≤ j ≤ p. By the Karush–Kuhn–Tucker conditions, β̂ = (β̂_1, …, β̂_p)′ is the unique solution of the adaptive Lasso if

X*_j′(Y* − X*β̂) = λ a_j sgn(β̂_j)  if β̂_j ≠ 0;   |X*_j′(Y* − X*β̂)| ≤ λ a_j  if β̂_j = 0,
(16)

and the vectors {X*_j : β̂_j ≠ 0} are linearly independent. Recall A_1 = {j : β_{0j} ≠ 0}. Let s̃_{n1} = (a_j sgn(β_{0j}), j ∈ A_1)′, X*_{A_1} = (X*_j, j ∈ A_1), and β_{0A_1} = (β_{0j}, j ∈ A_1)′. So X*_{A_1} is an n × q matrix.

Define

β̂_{A_1} = (X*_{A_1}′X*_{A_1})^{−1}(X*_{A_1}′Y* − λ s̃_{n1}) = β_{0A_1} + C_{11}^{−1}(X*_{A_1}′τ − λ s̃_{n1})/n,
(17)

where C_{11} = X*_{A_1}′X*_{A_1}/n. If sgn(β̂_{A_1}) = sgn(β_{0A_1}), then the equation in (16) holds for β̂ = (β̂_{A_1}′, 0′)′. Thus, since X*β̂ = X*_{A_1}β̂_{A_1} for this β̂,

sgn(β̂) = sgn(β_0)   if   sgn(β̂_{A_1}) = sgn(β_{0A_1})  and  |X*_j′(Y* − X*_{A_1}β̂_{A_1})| ≤ λ a_j for all j ∉ A_1.
(18)

Let H_n = I_n − X*_{A_1}C_{11}^{−1}X*_{A_1}′/n. It follows from (17) that Y* − X*_{A_1}β̂_{A_1} = τ − X*_{A_1}(β̂_{A_1} − β_{0A_1}) = H_n τ + X*_{A_1}C_{11}^{−1}s̃_{n1}λ/n, so that by (18),

sgn(β̂) = sgn(β_0)   if   sgn(β_{0j})(β_{0j} − β̂_j) < |β_{0j}| for all j ∈ A_1  and  |X*_j′(H_n τ + X*_{A_1}C_{11}^{−1}s̃_{n1}λ/n)| < λ a_j for all j ∉ A_1.
(19)

Thus, by (19) and (17),

P{sgn(β̂) ≠ sgn(β_0)} ≤ P{|e_j′C_{11}^{−1}X*_{A_1}′τ|/n ≥ |β_{0j}|/2 for some j ∈ A_1}
  + P{|e_j′C_{11}^{−1}s̃_{n1}|λ/n ≥ |β_{0j}|/2 for some j ∈ A_1}
  + P{|X*_j′H_n τ| ≥ λ a_j/2 for some j ∉ A_1}
  + P{|X*_j′X*_{A_1}C_{11}^{−1}s̃_{n1}|/n ≥ a_j/2 for some j ∉ A_1}
  = P{B_{n1}} + P{B_{n2}} + P{B_{n3}} + P{B_{n4}}, say,

where ej is the unit vector in the direction of the j-th coordinate. Therefore, to prove the theorem, it suffices to show that each probability in the last line converges to zero. The same argument as in Huang et al. (2008b) can be used here and is omitted. This completes the outline of the proof.

Contributor Information

Jian Huang, Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA. Department of Biostatistics, University of Iowa, Iowa City, IA 52242, USA.

Shuangge Ma, Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520, USA.

References

  • Alizadeh AA, Eisen MB, Davis RE, Ma C, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511.
  • Buckley J, James I. Linear regression with censored data. Biometrika. 1979;66:429–436.
  • Dave SS, Wright G, Tan B, et al. Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. New Engl J Med. 2004;351:2159–2169.
  • Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat. 2004;32:407–499.
  • Frank IE, Friedman JH. A statistical view of some chemometrics regression tools (with discussion). Technometrics. 1993;35:109–148.
  • Fu WJ. Penalized regressions: the bridge versus the Lasso. J Comput Graph Stat. 1998;7:397–416.
  • Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–3008.
  • Huang J, Ma SG, Xie HL. Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics. 2006;62:813–820.
  • Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Stat. 2008a;36:587–613.
  • Huang J, Ma S, Zhang C. Adaptive Lasso for high-dimensional regression models. Stat Sinica. 2008b;18:1603–1618.
  • Huang J, Ma SG, Xie HL, Zhang C-H. A group bridge approach for variable selection. Biometrika. 2009;96:339–355.
  • Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. New York: John Wiley; 1980.
  • Leng C, Lin Y, Wahba G. A note on the LASSO and related procedures in model selection. Stat Sinica. 2006;16:1273–1284.
  • Ma S, Huang J. Additive risk survival model with microarray data. BMC Bioinform. 2007;8:192.
  • Rosenwald A, Wright G, Chan WC, Connors JM, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B cell lymphoma. New Engl J Med. 2002;346:1937–1947.
  • Rosenwald A, Wright G, Wiestner A, Chan WC, et al. The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell. 2003;3:185–197.
  • Stute W. Consistent estimation under random censorship when covariables are available. J Multivar Anal. 1993;45:89–103.
  • Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996;58:267–288.
  • van de Geer S. High-dimensional generalized linear models and the Lasso. Ann Stat. 2008;36:614–645.
  • van der Vaart AW, Wellner JA. Weak convergence and empirical processes: with applications to statistics. New York: Springer; 1996.
  • Wang S, Nan B, Zhu J, Beer DG. Doubly penalized Buckley-James method for survival data with high-dimensional covariates. Biometrics. 2008;64:132–140.
  • Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med. 1992;11:1871–1879.
  • Ying ZL. A large sample study of rank estimation for censored regression data. Ann Stat. 1993;21:76–99.
  • Zhang C-H, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann Stat. 2008;36:1567–1594.
  • Zhou M. M-estimation in censored linear models. Biometrika. 1992;79:837–841.