Microarray techniques provide a way of monitoring gene expressions on a global scale. An important application of microarray is to discover subsets of genes that are associated with occurrence of certain diseases, for example breast cancer, leukemia or lymphoma. Biomedical studies have been conducted to measure gene expression levels and patients' survival information simultaneously. This article is partly motivated by studies like the Mantle Cell Lymphoma (MCL) study reported in [1
], where expression levels of 8810 genes and survival information are measured for 92 subjects. A main goal of the MCL study is to discover subsets of genes that are linked with patients' survival risk. Statistically, standard estimation approaches cannot yield a unique estimator, when the dimension of gene expressions is greater than the sample size. Biologically, it is reasonable to assume that only a small number of genes are relevant to predicting the lymphoma occurrence. It is thus of great interest to develop statistical methodologies that can carry out simultaneous estimation and dimension reduction or variable selection.
Dimension reduction has been extensively investigated for linear regression models [2
]. One widely used approach is to use low dimensional projections of the covariates (genes) as surrogates for the true covariates. Examples include the ridge regression, partial least squares (PLS), and principal component regression (PCR). For a detailed discussion, see [2
]. Including all the covariates in the predictive models through projections may introduce noises and lead to poor predictive performance. In addition, the underlying assumption that all covariates are associated with the disease progression is not necessarily true.
An alternative approach is to use variable selection techniques and identify the important covariates. This can usually be accomplished by using penalization methods, where penalties with data dependent tuning parameters are used to control the sparsity of the models. A general scheme is outlined in [4
]. Penalization methods are especially effective when there exist a small number of large covariate effects. Compared with projection approaches, penalized variable selection may be preferred in microarray studies since it is capable of selecting a small number of individual covariates and only those covariates are used in the predictive models. Further biological investigation can focus on those identified influential genes only.
Modeling survival outcome with high dimensional gene expressions is more challenging due to the presence of censoring and the use of complicated semiparametric models. One approach used by [5
] is to cluster genes first, and then use the sample averages of the gene expression levels in a Cox model for right censored survival outcome. Another well developed clustering based algorithm is the gene harvesting procedure of [6
]. Standard PLS method has been employed in [7
], where the resulted PLS components are used in the Cox model. A penalized estimation procedure has been considered for the Cox model using kernels [8
], under the assumption that the covariate effects are smooth functions of gene expression levels. The Lasso (least absolute shrinkage and selection) type estimator in the Cox model with right censored data has been proposed [9
], and [10
] applies the LARS algorithm in computing the Lasso estimator in the Cox model. In a recent study, [11
] applies the principal component regression to additive risk models with right censored data. Aforementioned studies, except [11
], assume the proportional hazards models for survival time. Although this is the most straightforward assumption, it is not necessarily satisfied.
Survival analysis assuming the additive risk model may provide more insights beyond the proportional hazards model analysis. An additive model is generally adopted when it is reasonable to assume that the covariate effects under consideration contribute additively to the conditional hazard. Consider one special form of the additive risk models investigated in [12
], where we model the conditional hazard at time t
given a d
-dimensional vector of time-independent covariates Z
. Here β0
(·) denote the unknown regression parameter and the unknown baseline hazard function, respectively. Previous studies have concluded its sound biological and empirical basis [13
] and satisfactory statistical properties [12
The objective function under the additive risk model proposed in [12
] takes a least-squares form, which makes estimation computationally easier than under the proportional hazards model. This is especially important for high dimensional microarray studies where computation cost is a serious concern. Inspired by the special form of the objective function, we propose a Lasso approach for estimation and gene selection in the additive risk model (1), which minimizes a least-squares objective function subject to a L1
constraint, when high dimensional microarray measurements are present. Because of the nature of the L1
constraint, the Lasso method shrinks coefficients and sets some estimated coefficients exactly zero. Thus it can yield a sparse model.
In this article, we investigate survival analysis with high dimensional gene expressions under the additive risk model. The goal is to provide an alternative estimation and gene selection method, when the proportional hazard assumption cannot be satisfied. Beyond modeling the survival risk in an additive manner, the proposed approach is computationally affordable and well behaved. Evaluation of stability of individual genes and overall predictive power, which has been ignored in some of previous studies, are also investigated.