In biomedical studies, it is of substantial interest to develop risk prediction scores using high-dimensional data such as gene expression data for clinical endpoints that are subject to censoring. In the presence of well-established clinical risk factors, investigators often prefer a procedure that also adjusts for these clinical variables. While accelerated failure time (AFT) models are a useful tool for the analysis of censored outcome data, they assume that covariate effects on the logarithm of the time-to-event are linear, which is often unrealistic in practice. We propose to build risk prediction scores through regularized rank estimation in partly linear AFT models, where high-dimensional data such as gene expression data are modeled linearly and important clinical variables are modeled nonlinearly using penalized regression splines. We show through simulation studies that our model has better operating characteristics than several existing models. In particular, we show that there is a non-negligible effect on prediction as well as feature selection when nonlinear clinical effects are misspecified as linear. This work is motivated by a recent prostate cancer study, in which investigators collected gene expression data along with established prognostic clinical variables, and the primary endpoint is time to prostate cancer recurrence.
We analyzed the prostate cancer data and evaluated the prediction performance of several models based on the extended c statistic for censored data, showing that 1) the relationship between the clinical variable prostate-specific antigen (PSA) and prostate cancer recurrence is likely nonlinear, i.e., the time to recurrence decreases as PSA increases and starts to level off when PSA becomes greater than 11; 2) correct specification of this nonlinear effect improves performance in prediction and feature selection; and 3) the addition of gene expression data does not seem to further improve the performance of the resultant risk prediction scores.
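The extended c statistic evaluated here generalizes Harrell's concordance index to censored data. As a concrete reference point, a minimal sketch of the censored concordance computation (the core idea, not necessarily the exact extension used in the study) with hypothetical data is:

```python
import numpy as np

def censored_c_statistic(time, event, risk):
    """Harrell-style concordance index for right-censored data.

    A pair (i, j) is usable when the subject with the shorter observed
    time experienced the event; the pair is concordant when that subject
    also has the higher predicted risk score. Ties in risk count 1/2.
    """
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant, usable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] and time[i] < time[j]:  # i's event observed first
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable

# Perfectly anti-monotone risk (shorter survival -> higher risk) gives c = 1.
time  = [2.0, 4.0, 6.0, 8.0]
event = [1, 1, 0, 1]
risk  = [4.0, 3.0, 2.0, 1.0]
print(censored_c_statistic(time, event, risk))  # -> 1.0
```

Censored subjects contribute only as the longer-lived member of a pair, since their true event time is only known to exceed the observed time.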
Accelerated Failure Time Model; Feature Selection; Lasso; Partly Linear Model; Penalized Splines; Rank Estimation; Risk Prediction
In breast cancer research, it is important to identify genomic markers associated with prognosis. Multiple microarray gene expression profiling studies have been conducted, searching for prognosis markers. Genomic markers identified from the analysis of single datasets often suffer from a lack of reproducibility because of small sample sizes. Integrative analysis of data from multiple independent studies has a larger sample size and may provide a cost-effective solution.
We collect four breast cancer prognosis studies with gene expression measurements. An accelerated failure time (AFT) model with an unknown error distribution is adopted to describe survival. An integrative sparse boosting approach is employed for marker selection. The proposed model and boosting approach can effectively accommodate heterogeneity across multiple studies and identify genes with consistent effects.
Our simulation study shows that the proposed approach outperforms alternatives, including meta-analysis and intensity approaches, by identifying the majority or all of the true positives while maintaining a low false positive rate. In the analysis of breast cancer data, 44 genes are identified as associated with prognosis. Many of the identified genes have been previously suggested as associated with tumorigenesis and cancer prognosis. The identified genes and corresponding predicted risk scores differ from those using alternative approaches. Monte Carlo-based prediction evaluation suggests that the proposed approach has the best prediction performance.
Integrative analysis may provide an effective way of identifying breast cancer prognosis markers. Markers identified using the integrative sparse boosting analysis have sound biological implications and satisfactory prediction performance.
Breast cancer prognosis; Gene Expression; Integrative analysis; Sparse boosting
The goal of this study is to extend the applications of parametric survival models to cases in which the accelerated failure time (AFT) assumption is not satisfied, and to examine parametric and semiparametric models under different proportional hazards (PH) and AFT assumptions.
The data for 12,531 women diagnosed with breast cancer in British Columbia, Canada, during 1990–1999 were divided into eight groups according to patients’ age and stage of disease, and the AFT and PH assumptions were assessed separately in each group. For parametric models, we fitted the saturated generalized gamma (GG) distribution and compared it with the conventional AFT model. Using a likelihood ratio statistic, both models were compared to simpler forms including the Weibull and lognormal. For semiparametric models, either Cox's PH model or the stratified Cox model was fitted according to the PH assumption, which was tested using Schoenfeld residuals. The GG family was compared to the log-logistic model using the Akaike information criterion (AIC) and Bayesian information criterion (BIC).
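The discrimination step among parametric families can be illustrated, ignoring censoring for simplicity, by comparing AIC values of candidate distributions fitted by maximum likelihood on simulated data; the data-generating family should attain the lower AIC:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated survival times from a Weibull(shape=1.5, scale=2) distribution
t = stats.weibull_min.rvs(c=1.5, scale=2.0, size=500, random_state=rng)

def aic(dist, data, **fit_kwargs):
    """AIC = 2k - 2*loglik for a scipy.stats family fitted by MLE."""
    params = dist.fit(data, **fit_kwargs)
    loglik = dist.logpdf(data, *params).sum()
    return 2 * len(params) - 2 * loglik

aic_weibull = aic(stats.weibull_min, t, floc=0)  # location fixed at 0
aic_lognorm = aic(stats.lognorm, t, floc=0)
# aic_weibull should be smaller: the generating family fits better
```

A full treatment of the study's data would need the censored likelihood (each censored observation contributes its survival probability rather than a density), which `scipy.stats.fit` does not handle directly.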
When the PH and AFT assumptions were satisfied, semiparametric and parametric models both provided valid descriptions of breast cancer patient survival. When the PH assumption was not satisfied but the AFT condition held, the parametric models performed better than the stratified Cox model. When neither the PH nor the AFT assumption was met, the lognormal distribution provided a reasonable fit.
When both the PH and AFT assumptions are satisfied, the parametric and semiparametric models provide complementary information. When the PH assumption is not satisfied, the parametric models should be considered, whether the AFT assumption is met or not.
Breast cancer; generalized gamma distribution; parametric regression; stratified Cox model; survival analysis
Use of microarray technology often leads to high-dimensional and low-sample size (HDLSS) data settings. A variety of approaches have been proposed for variable selection in this context. However, only a small number of these have been adapted for time-to-event data where censoring is present. Among standard variable selection methods shown both to have good predictive accuracy and to be computationally efficient is the elastic net penalization approach. In this paper, adaptations of the elastic net approach are presented for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. Assessment of the two methods is conducted through simulation studies and through analysis of microarray data obtained from a set of patients with diffuse large B-cell lymphoma where time to survival is of interest. The approaches are shown to match or exceed the predictive performance of a Cox-based and an AFT-based variable selection method. The methods are moreover shown to be much more computationally efficient than their respective Cox- and AFT-based counterparts.
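The abstract's methods adapt the elastic net penalty to Cox and AFT losses. As a minimal stand-in for the penalty's behavior, plain elastic net least squares in scikit-learn already shows sparse selection in an HDLSS setting (all data here are simulated and hypothetical):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
n, p = 80, 200                         # HDLSS: far more predictors than samples
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

# l1_ratio mixes the lasso part (sparsity) with the ridge part, which
# stabilizes estimation when predictors are correlated, as genes often are.
model = ElasticNet(alpha=0.1, l1_ratio=0.7, max_iter=5000).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices with exactly nonzero weight
```

The Cox- and AFT-based versions in the paper replace the squared-error loss with the partial likelihood or a censoring-adjusted criterion, but the penalty and its selection behavior are the same.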
survival analysis; microarray; elastic net; variable selection
Most genomic data have ultra-high dimensions with more than 10,000 genes (probes). Regularization methods with L1 and Lp penalties have been extensively studied in survival analysis with high-dimensional genomic data. However, when the sample size n ≪ m (the number of genes), directly identifying a small subset of genes from ultra-high-dimensional (m > 10,000) data is time-consuming and computationally inefficient. In current microarray analysis, the common practice is to first select a few thousand (or a few hundred) genes using univariate analysis or statistical tests, and then apply a LASSO-type penalty to further reduce the number of disease-associated genes. This two-step procedure may introduce bias and inaccuracy and cause biologically important genes to be missed.
The accelerated failure time (AFT) model is a linear regression model and a useful alternative to the Cox model for survival analysis. In this paper, we propose a nonlinear kernel-based AFT model and an efficient variable selection method with adaptive kernel ridge regression. Our proposed variable selection method is based on the kernel matrix and the dual problem, which involves only a much smaller n × n matrix. It is very efficient when the number of unknown variables (genes) is much larger than the number of samples. Moreover, the primal variables are explicitly updated and the sparsity in the solution is exploited.
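The computational advantage of the dual formulation can be sketched with plain kernel ridge regression on simulated data (no censoring, no adaptive weighting): the linear system solved is n × n regardless of how many genes the design matrix has.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1.0, gamma=0.5):
    """Dual-form kernel ridge: solve (K + lam*I) alpha = y.
    Only an n x n system is solved, however many features X has."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=0.5):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 1000))   # n = 60 samples, 1000 "genes"
y = np.sin(X[:, 0])                   # outcome depends nonlinearly on one gene
alpha = kernel_ridge_fit(X, y, lam=0.1, gamma=0.001)
pred = kernel_ridge_predict(X, alpha, X, gamma=0.001)  # in-sample fit
```

The paper's method additionally accounts for censoring in the AFT loss and adapts the penalty for variable selection; this sketch only shows why the dual keeps the cost tied to n rather than to the number of genes.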
Our proposed methods can simultaneously identify survival-associated prognostic factors and predict survival outcomes with ultra-high-dimensional genomic data. We have demonstrated the performance of our methods with both simulated and real data. In these studies the proposed method performs well at a modest computational cost.
Identification of genes associated with cancer development and progression using microarray data is challenging because of the high dimensionality and cluster structure of gene expressions. Here the clusters are composed of multiple genes with coordinated biological functions and/or correlated expressions. In this article, we first propose a hybrid approach for clustering gene expressions. The hybrid approach uses both pathological pathway information and correlations of gene expressions. We propose using the group bridge, a novel clustering penalization approach, for analysis of cancer microarray data. The group bridge approach explicitly accounts for the cluster structure of gene expressions, and is capable of selecting gene clusters and genes within those selected clusters that are associated with cancer. We also develop an iterative algorithm for computing the group bridge estimator. Analysis of three cancer microarray datasets shows that the proposed approach can identify biologically meaningful gene clusters and genes within those identified clusters.
Gene selection; group bridge; microarray; clustering; penalization
Microarray techniques survey gene expressions on a global scale. Extensive biomedical studies have been designed to discover subsets of genes that are associated with survival risks for diseases such as lymphoma and construct predictive models using those selected genes. In this article, we investigate simultaneous estimation and gene selection with right censored survival data and high dimensional gene expression measurements.
We model the survival time using the additive risk model, which provides a useful alternative to the proportional hazards model and is adopted when the absolute effects, instead of the relative effects, of multiple predictors on the hazard function are of interest. A Lasso (least absolute shrinkage and selection operator) type estimate is proposed for simultaneous estimation and gene selection. The tuning parameter is selected using V-fold cross-validation. We propose leave-one-out cross-validation based methods for evaluating the relative stability of individual genes and the overall prediction significance.
We analyze the MCL and DLBCL data using the proposed approach. A small number of probes represented on the microarrays are identified, most of which have sound biological implications in lymphoma development. The selected probes are relatively stable and the proposed approach has overall satisfactory prediction power.
We consider estimation and variable selection in the partial linear model for censored data. The partial linear model for censored data is a direct extension of the accelerated failure time model, the latter of which is a very important alternative model to the proportional hazards model. We extend rank-based lasso-type estimators to a model that may contain nonlinear effects. Variable selection in such a partial linear model has direct application to high-dimensional survival analyses that attempt to adjust for clinical predictors. In the microarray setting, previous methods can adjust for other clinical predictors by assuming that clinical and gene expression data enter the model linearly in the same fashion. Here, we select important variables after adjusting for prognostic clinical variables, with the clinical effects allowed to be nonlinear. Our estimator is based on stratification and can be extended naturally to account for multiple nonlinear effects. We illustrate the utility of our method through simulation studies and application to the Wisconsin prognostic breast cancer data set.
Lasso; Logrank; Penalized least squares; Survival analysis
There has been increasing interest in predicting patients’ survival after therapy by investigating gene expression microarray data. In regression and classification models with high-dimensional genomic data, boosting has been successfully applied to build accurate predictive models and conduct variable selection simultaneously. We propose Buckley-James boosting for semiparametric accelerated failure time models with right-censored survival data, which can be used to predict survival of future patients using high-dimensional genomic data. In the spirit of the adaptive LASSO, twin boosting is also incorporated to fit sparser models. The proposed methods provide a unified approach to fitting linear models and nonlinear effects models with possible interactions, and can perform variable selection and parameter estimation simultaneously. The proposed methods are evaluated by simulations and applied to a recent microarray gene expression data set for patients with diffuse large B-cell lymphoma under the current gold standard therapy.
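At the heart of the Buckley-James approach is an imputation step that replaces each censored response with its conditional expectation under the Kaplan-Meier estimate of the residual distribution. A minimal sketch of that step on raw (rather than regression-residual) times, with hypothetical numbers:

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate at each distinct event time
    (assumes no tied times, for simplicity)."""
    order = np.argsort(time)
    t, d = np.asarray(time)[order], np.asarray(event)[order]
    times, surv = [], []
    s, at_risk = 1.0, len(t)
    for i in range(len(t)):
        if d[i] == 1:
            s *= 1.0 - 1.0 / at_risk
            times.append(t[i]); surv.append(s)
        at_risk -= 1
    return np.array(times), np.array(surv)

def bj_impute(y, event, km_times, km_surv):
    """One Buckley-James-style step: replace each censored y with the
    KM-based conditional expectation E[Y | Y > y]."""
    out = y.astype(float).copy()
    for i in np.flatnonzero(1 - np.asarray(event)):
        mask = km_times > y[i]
        if mask.any():
            # discrete KM probability mass at each event time
            s_prev = np.concatenate(([1.0], km_surv))[:-1]
            mass = (s_prev - km_surv)[mask]
            out[i] = (km_times[mask] * mass).sum() / mass.sum()
    return out

time  = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 0, 1, 1])
km_t, km_s = kaplan_meier(time, event)
imputed = bj_impute(time, event, km_t, km_s)  # censored y = 2 becomes 3.5
```

In the full estimator this imputation is applied to residuals from the current fit and alternated with least-squares (or, here, boosting) updates until convergence.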
boosting; accelerated failure time model; Buckley-James estimator; censored survival data; LASSO; variable selection
The aim of this study is to evaluate the prognostic factors of overall survival (OS) after haematopoietic stem cell transplant (HSCT) in acute lymphoblastic leukaemia (ALL) patients using accelerated failure time (AFT), Cox proportional hazard (PH), and Cox time-varying coefficient models.
206 patients who underwent HSCT in Shariati Hospital between 1993 and 2007 were enrolled. There was evidence of marked departures from the proportional hazards assumption for two prognostic factors, relapse and chronic graft-versus-host disease (cGVHD) (P < .001). Performance of the AFT and Cox models was assessed using explained variation and goodness-of-fit methods. Discrimination among the exponential, Weibull, generalized gamma (GG), log-logistic, and lognormal distributions was done using maximum likelihood and the Akaike information criterion.
The 5-year OS was 52% (95% CI: 47.3–56.7). Peak mortality hazard occurred at months 6–7 after HSCT, followed by a decreasing trend. In univariate analysis, the data were better fitted by the GG distribution than by other distributions. Univariate analysis using the GG distribution showed a positive association of OS with acute graft-versus-host disease (aGVHD) (P = .021), no relapse (P < .001), cGVHD (P < .001), neutrophil recovery (P < .001) and platelet recovery (P < .001). Based on Cox PH models, however, cGVHD and relapse were the predictive factors of OS (P < .001). Multivariate analysis indicated that OS is related to relapse (P < .001) and platelet recovery (P = .037), and the predictive power of the Weibull AFT model was superior to that of the Cox PH and Cox time-varying coefficient models (R2 = 0.46 for AFT, R2 = 0.21 for Cox PH and R2 = 0.34 for Cox time-varying coefficient). Cox–Snell residuals showed that the Weibull AFT model fitted the data better than the other distributions in multivariate analysis.
We conclude that AFT models can be a useful tool for identifying prognostic factors of OS in acute lymphoblastic leukaemia patients.
In the presence of high-dimensional predictors, it is challenging to develop reliable regression models that can be used to accurately predict future outcomes. Further complications arise when the outcome of interest is an event time, which is often not fully observed due to censoring. In this paper, we develop robust prediction models for event time outcomes by regularizing Gehan’s estimator for the AFT model (Tsiatis, 1990) with a LASSO penalty. Unlike existing methods based on inverse probability weighting and the Buckley and James estimator (Buckley and James, 1979), the proposed approach does not require additional assumptions about the censoring and always yields a solution that is convergent. Furthermore, the proposed estimator leads to a stable regression model for prediction even if the AFT model fails to hold. To facilitate the adaptive selection of the tuning parameter, we detail an efficient numerical algorithm for obtaining the entire regularization path. The proposed procedures are applied to a breast cancer dataset to derive a reliable regression model for predicting patient survival based on a set of clinical prognostic factors and gene signatures. Finite sample performances of the procedures are evaluated through a simulation study.
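The Gehan objective being regularized is a convex, piecewise-linear sum over pairs of residuals. A small numeric sketch of the loss (without the LASSO penalty or the linear-programming solver, and with simulated uncensored data whose event labels are for illustration only):

```python
import numpy as np

def gehan_loss(beta, X, logt, event):
    """Convex Gehan loss: sum over pairs of delta_i * max(e_j - e_i, 0),
    where e_k = log t_k - x_k' beta are the AFT residuals and delta_i
    indicates that observation i is an observed event."""
    e = logt - X @ beta
    diff = e[None, :] - e[:, None]        # diff[i, j] = e_j - e_i
    return float((np.asarray(event)[:, None] * np.maximum(diff, 0.0)).sum())

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -1.0, 0.5])
logt = X @ beta_true + 0.2 * rng.standard_normal(100)
event = rng.random(100) < 0.7   # illustrative event indicators only

# The loss is smaller near the true coefficients than at beta = 0.
loss_true = gehan_loss(beta_true, X, logt, event)
loss_zero = gehan_loss(np.zeros(3), X, logt, event)
```

Because each pairwise term is piecewise linear in beta, adding an L1 penalty keeps the whole problem expressible as a linear program, which is what makes the full regularization path tractable.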
AFT model; LASSO regularization; linear programming
With the growing number of public repositories for high-throughput genomic data, it is of great interest to combine the results produced by independent research groups. Such a combination allows the identification of common genomic factors across multiple cancer types and provides new insights into the disease process. In the framework of the proportional hazards model, classical procedures, which consist of ranking genes according to the estimated hazard ratio or the p-value from a test of no association between survival and gene expression level, are not suitable for gene selection across multiple genomic datasets with different sample sizes. We propose a novel index for identifying genes with a common effect across heterogeneous genomic studies; the index is designed to remain stable regardless of sample size and has a straightforward interpretation in terms of the percentage of separability between patients according to their survival times and gene expression measurements.
The simulation results show that the proposed index is not substantially affected by the sample size of the study or by censoring. They also show that its separability performance is higher than that of indices of predictive accuracy relying on the likelihood function. A simulated example illustrates the good operating characteristics of our index. In addition, we demonstrate that it is linked to the score statistic and possesses a biologically relevant interpretation.
The practical use of the index is illustrated for identifying genes with common effects across eight independent genomic cancer studies of different sample sizes. The meta-selection allows the identification of four genes (ESPL1, KIF4A, HJURP, LRIG1) that are biologically relevant to the carcinogenesis process and have a prognostic impact on survival outcome across various solid tumors.
The proposed index is a promising tool for identifying factors having a prognostic impact across a collection of heterogeneous genomic datasets of various sizes.
Dimension reduction, model and variable selection are ubiquitous concepts in modern statistical science and deriving new methods beyond the scope of current methodology is noteworthy. This note briefly reviews existing regularization methods for penalized least squares and likelihood for survival data and their extension to a certain class of penalized estimating functions. We show that if one’s goal is to estimate the entire regularized coefficient path using the observed survival data, then all current strategies fail for the Buckley-James estimating function. We propose a novel two-stage method to estimate and restore the entire Dantzig-regularized coefficient path for censored outcomes in a least-squares framework. We apply our methods to a microarray study of lung adenocarcinoma with sample size n = 200 and p = 1036 gene predictors and find 10 genes that are consistently selected across different criteria and an additional 14 genes that merit further investigation. In simulation studies, we found that the proposed path restoration and variable selection technique has the potential to perform as well as existing methods that begin with a proper convex loss function at the outset.
Accelerated failure time; Boosting; Dantzig selector; Lasso; Penalized least squares; Rank regression; Survival analysis
Crossing hazard functions have extensive applications in modeling survival data. However, existing studies in the literature mainly focus on comparing crossed hazard functions and estimating the time at which the hazard functions cross, and there is little theoretical work on conditions under which hazard functions from a model will have a crossing. In this paper, we investigate crossing status of hazard functions from the proportional hazards (PH) model, the accelerated hazard (AH) model, and the accelerated failure time (AFT) model. We provide and prove conditions under which the hazard functions from the AH and the AFT models have no crossings or a single crossing. A few examples are also provided to demonstrate how the conditions can be used to determine crossing status of hazard functions from the three models.
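For the Weibull family, one of these conclusions can be checked numerically: an AFT (pure scale) change leaves the hazards proportional, so they never cross, while a shape change produces a single crossing. A small sketch with illustrative parameter values:

```python
import numpy as np

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (k/lambda) * (t/lambda)^(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def crossings(shape1, scale1, shape2, scale2):
    """Count sign changes of h1 - h2 on a fine grid (numeric check only)."""
    t = np.linspace(0.01, 20.0, 20000)
    diff = weibull_hazard(t, shape1, scale1) - weibull_hazard(t, shape2, scale2)
    return int((np.sign(diff[:-1]) != np.sign(diff[1:])).sum())

# Same shape, different scales (an AFT shift within the Weibull family):
# the hazards are proportional and never cross.
print(crossings(1.5, 1.0, 1.5, 2.0))   # -> 0
# Different shapes: the hazards cross exactly once.
print(crossings(0.8, 1.0, 1.5, 1.0))   # -> 1
```

The grid check is only a numerical illustration of the analytic conditions proved in the paper; crossings outside the plotted range would not be detected.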
accelerated hazard model; accelerated failure time model; monotone hazard; U-shape hazard; bell-shape hazard
Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA) algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis) using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes.
We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL) data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation) dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test). Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data. Once again, we assigned the patients in the validation set to significantly distinct risk groups (p-value = 0.00139).
The strength of the iterative BMA algorithm for survival analysis lies in its ability to account for model uncertainty. The results from this study demonstrate that our procedure selects a small number of genes while eclipsing other methods in predictive performance, making it a highly accurate and cost-effective prognostic tool in the clinical setting.
The censored linear regression model, also referred to as the accelerated failure time (AFT) model when the logarithm of the survival time is used as the response variable, is widely seen as an alternative to the popular Cox model when the assumption of proportional hazards is questionable. Buckley and James [Linear regression with censored data, Biometrika 66 (1979) 429−436] extended the least squares estimator to the semiparametric censored linear regression model in which the error distribution is completely unspecified. The Buckley–James estimator performs well in many simulation studies and examples. The direct interpretation of the AFT model is also more attractive than that of the Cox model in practical situations, as Cox himself has pointed out. However, the application of Buckley–James estimation has been limited in practice, mainly due to its elusive variance. In this paper, we use the empirical likelihood method to derive a new test and confidence interval based on the Buckley–James estimator of the regression coefficient. A standard chi-square distribution is used to calculate the P-value and the confidence interval. The proposed empirical likelihood method does not involve variance estimation. It also shows much better small sample performance than some existing methods in our simulation studies.
Censored data; Wilks theorem; Accelerated failure time model; Linear regression model
Survivorship experiments play a central role in aging research and are performed to evaluate whether interventions alter the rate of aging and increase lifespan. The accelerated failure time (AFT) model is seldom used to analyze survivorship data, but offers a potentially useful statistical approach that is based upon the survival curve rather than the hazard function. In this study, AFT models were used to analyze data from 16 survivorship experiments that evaluated the effects of one or more genetic manipulations on mouse lifespan. Most genetic manipulations were found to have a multiplicative effect on survivorship that is independent of age and well-characterized by the AFT model “deceleration factor”. AFT model deceleration factors also provided a more intuitive measure of treatment effect than the hazard ratio, and were robust to departures from modeling assumptions. Age-dependent treatment effects, when present, were investigated using quantile regression modeling. These results provide an informative and quantitative summary of survivorship data associated with currently known long-lived mouse models. In addition, from the standpoint of aging research, these statistical approaches have appealing properties and provide valuable tools for the analysis of survivorship data.
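The deceleration factor has a simple quantile interpretation: if a treatment multiplies lifespans by a constant c, every survival quantile scales by the same c. A small simulated illustration (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
c = 1.3                                      # deceleration factor (illustrative)
control = rng.weibull(2.0, 5000) * 10.0      # baseline lifespans
treated = c * rng.weibull(2.0, 5000) * 10.0  # AFT: lifespans stretched by c

# Under the AFT model the ratio of corresponding quantiles is constant = c,
# independent of age -- the "multiplicative effect" described above.
qs = np.quantile(treated, [0.25, 0.5, 0.75]) / np.quantile(control, [0.25, 0.5, 0.75])
# each entry of qs is close to 1.3
```

Age-dependent effects show up as quantile ratios that drift with the quantile level, which is why quantile regression is the natural follow-up analysis when the constant-c assumption fails.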
AFT model; Cox; insulin-like growth factor; proportional hazard; survival analysis
Mining gene expression data to identify genes associated with patient survival, so that such genes can be used to achieve more accurate prognoses, is an ongoing problem in cancer prognostic studies using microarrays. The least absolute shrinkage and selection operator (lasso) is often used for gene selection and parameter estimation in high-dimensional microarray data. The lasso shrinks some of the coefficients to zero, and the amount of shrinkage is determined by the tuning parameter, often chosen by cross-validation. The model determined by this cross-validation contains many false positives whose coefficients are actually zero. We propose a method for estimating the false positive rate (FPR) for lasso estimates in a high-dimensional Cox model. We performed a simulation study to examine the precision of the FPR estimate by the proposed method. We applied the proposed method to real data and illustrated the identification of false positive genes.
cancer prognostic; false positive rate; gene selection; high-dimensional regression; microarray data; survival analysis
An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers.
The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs.
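The binormal AUC admits a closed form, Phi(a / sqrt(1 + b^2)) with a = (mu1 - mu0)/sigma1 and b = sigma0/sigma1, which is what makes it a cheap, smooth objective compared with the empirical (Mann-Whitney) AUC. A small sketch comparing the two on simulated marker values:

```python
import numpy as np
from scipy.stats import norm

def binormal_auc(x0, x1):
    """Binormal AUC: Phi(a / sqrt(1 + b^2)),
    a = (mu1 - mu0) / sigma1, b = sigma0 / sigma1."""
    a = (x1.mean() - x0.mean()) / x1.std(ddof=1)
    b = x0.std(ddof=1) / x1.std(ddof=1)
    return norm.cdf(a / np.sqrt(1.0 + b * b))

def empirical_auc(x0, x1):
    """Mann-Whitney estimate: P(marker in cases > marker in controls)."""
    return (x1[:, None] > x0[None, :]).mean() + 0.5 * (x1[:, None] == x0[None, :]).mean()

rng = np.random.default_rng(6)
x0 = rng.normal(0.0, 1.0, 300)   # controls
x1 = rng.normal(1.0, 1.0, 300)   # cases
auc_b = binormal_auc(x0, x1)
auc_e = empirical_auc(x0, x1)
# Theoretical AUC here is Phi(1/sqrt(2)) ~ 0.76; both estimates land nearby.
```

The proposed method uses this binormal AUC (for a linear combination of many biomarkers) as the objective, with threshold gradient directed regularization handling the high-dimensional estimation.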
In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulation and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
genomic studies; semiparametric prognosis models; model comparison
We propose a general strategy for variable selection in semiparametric regression models by penalizing appropriate estimating functions. Important applications include semiparametric linear regression with censored responses and semiparametric regression with missing predictors. Unlike the existing penalized maximum likelihood estimators, the proposed penalized estimating functions may not pertain to the derivatives of any objective functions and may be discrete in the regression coefficients. We establish a general asymptotic theory for penalized estimating functions and present suitable numerical algorithms to implement the proposed estimators. In addition, we develop a resampling technique to estimate the variances of the estimated regression coefficients when the asymptotic variances cannot be evaluated directly. Simulation studies demonstrate that the proposed methods perform well in variable selection and variance estimation. We illustrate our methods using data from the Paul Coverdell Stroke Registry.
Accelerated failure time model; Buckley-James estimator; Censoring; Least absolute shrinkage and selection operator; Least squares; Linear regression; Missing data; Smoothly clipped absolute deviation
Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption on the null distribution such as normality may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.
We present a new approach to assigning resamples (such as bootstrap samples or permutations) more efficiently within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running-time overhead. In two experimental studies, a breast cancer microarray dataset and a genome-wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate than the traditional uniform resample allocation.
Our experiments demonstrate that a more sophisticated allocation strategy can improve inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, the gain in efficiency grows as the number of tests becomes large. R code for our algorithm and the shortcut method is available at .
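To illustrate the general idea (this is our own sketch, not the authors' algorithm: the function names and the Beta-posterior heuristic below are assumptions), one can allocate each batch of permutations to the test whose estimated p-value is still most ambiguous around a significance threshold, instead of spreading resamples uniformly:

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_stat(x, labels):
    """Two-sample statistic: absolute difference in group means."""
    return abs(x[labels].mean() - x[~labels].mean())

def adaptive_perm_pvalues(data, labels, total_batches, batch=50, alpha=0.05):
    """Permutation p-values with adaptive batch allocation: after a uniform
    warm-up, each batch goes to the test whose p-value estimate lies the
    fewest posterior standard deviations from the threshold alpha."""
    m = len(data)
    obs = np.array([perm_stat(x, labels) for x in data])
    hits = np.zeros(m)            # count of resampled stats >= observed
    draws = np.zeros(m)

    def spend(i):
        hits[i] += sum(perm_stat(data[i], rng.permutation(labels)) >= obs[i]
                       for _ in range(batch))
        draws[i] += batch

    for i in range(m):            # warm-up: one batch per test
        spend(i)
    for _ in range(total_batches - m):
        p = (hits + 1) / (draws + 2)             # Beta(1,1) posterior mean
        sd = np.sqrt(p * (1 - p) / (draws + 3))
        spend(int(np.argmin(np.abs(p - alpha) / sd)))
    return (hits + 1) / (draws + 1)              # standard permutation estimate
```

Tests whose p-values are clearly far from the threshold stop consuming resamples early, which is where the efficiency gain for large numbers of tests comes from.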
Gene co-expressions have been widely used in the analysis of microarray gene expression data. However, the co-expression patterns between two genes can be mediated by cellular states, as reflected by expression of other genes, single nucleotide polymorphisms, and activity of protein kinases. In this paper, we introduce a bivariate conditional normal model for identifying the variables that can mediate the co-expression patterns between two genes. Based on this model, we introduce a likelihood ratio test and a penalized likelihood procedure for identifying the mediators that affect gene co-expression patterns. We propose an efficient computational algorithm based on iteratively reweighted least squares and cyclic coordinate descent, and show that when the tuning parameter in the penalized likelihood is appropriately selected, the procedure has the oracle property in selecting the variables. We present simulation results to compare with existing methods and show that the likelihood ratio-based approach performs comparably to or better than the existing liquid association method, and that the penalized likelihood procedure can be quite effective in selecting the mediators. We apply the proposed method to yeast gene expression data in order to identify the kinases or single nucleotide polymorphisms that mediate the co-expression patterns between genes.
Cyclic coordinate descent; eQTL; Gene regulation; Penalized likelihood; Variable selection
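The cyclic coordinate descent at the core of such penalized likelihood procedures can be sketched in its simplest form, a lasso-penalized least-squares problem (a generic stand-in for the bivariate conditional normal likelihood; the IRLS outer loop would repeatedly call an update like this on reweighted data):

```python
import numpy as np

def lasso_cd(X, y, lam, iters=100):
    """Cyclic coordinate descent for 0.5/n * ||y - X @ beta||^2 + lam * ||beta||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0) / n    # per-column scale x_j'x_j / n
    r = y.astype(float).copy()           # residual y - X @ beta
    for _ in range(iters):
        for j in range(d):
            r += X[:, j] * beta[j]                       # drop coordinate j
            rho = X[:, j] @ r / n                        # partial covariance
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
            r -= X[:, j] * beta[j]                       # restore with new value
    return beta
```

Each coordinate update is a closed-form soft-thresholding step, which is what makes the cyclic scheme fast enough for high-dimensional mediator selection.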
It is rather challenging for current variable selectors to handle situations where the number of covariates under consideration is ultra-high. Consider a motivating clinical trial of the drug bortezomib for the treatment of multiple myeloma, where overall survival and expression levels of 44760 probesets were measured for each of 80 patients with the goal of identifying genes that predict survival after treatment. This dataset defies analysis even with regularized regression. Some remedies have been proposed for the linear model and for generalized linear models, but there are few solutions in the survival setting and, to our knowledge, no theoretical support. Furthermore, existing strategies often involve tuning parameters that are difficult to interpret. In this paper we propose and theoretically justify a principled method for reducing dimensionality in the analysis of censored data by selecting only the important covariates. Our procedure involves a tuning parameter that has a simple interpretation as the desired false positive rate of this selection. We present simulation results and apply the proposed procedure to analyze the aforementioned myeloma study.
Cox model; Multiple myeloma; Sure independence screening; Ultra-high-dimensional covariates; Variable selection
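A heavily simplified caricature of screening with an interpretable false-positive-rate tuning parameter (our own stand-in: the paper's procedure handles censored survival outcomes and carries formal guarantees, whereas this sketch uses plain correlations and a permutation-calibrated cutoff) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

def screen(X, y, fpr=0.05, n_perm=200):
    """Marginal screening with a permutation-calibrated cutoff: keep
    covariates whose |correlation| with y exceeds the (1 - fpr) quantile
    of the same scores computed against permuted responses."""
    score = np.abs(np.corrcoef(X.T, y)[-1, :-1])   # marginal |corr| per feature
    null = np.array([np.abs(np.corrcoef(X.T, rng.permutation(y))[-1, :-1])
                     for _ in range(n_perm)])      # null score distribution
    return np.flatnonzero(score > np.quantile(null, 1 - fpr))
```

The appeal of this style of tuning is that `fpr` directly approximates the fraction of null covariates expected to pass the screen, rather than being an abstract penalty weight.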
It has been demonstrated that genes in a cell do not act independently: they interact with one another to complete certain biological processes or to implement certain molecular functions. How to incorporate biological pathways or functional groups into the model and identify survival-associated gene pathways is still a challenging problem. In this paper, we propose a novel iterative gradient-based method for survival analysis with group Lp-penalized global AUC summary maximization. Unlike the LASSO, the Lp penalty (p < 1), with the adaptive LASSO as a special implementation, is asymptotically unbiased and has the oracle property. We first extend Lp from individual gene identification to a group Lp penalty for pathway selection, and then develop a novel iterative gradient algorithm for penalized global AUC summary maximization (IGGAUCS). This method incorporates genetic pathways into global AUC summary maximization and identifies survival-associated pathways instead of individual genes. The tuning parameters are determined using 10-fold cross-validation with the training data only, and prediction performance is evaluated on test data. We apply the proposed method to survival outcome analysis with gene expression profiles and identify multiple pathways simultaneously. Experimental results with simulated and gene expression data demonstrate that the proposed procedures can be used to identify important biological pathways related to the survival phenotype and to build a parsimonious model for predicting survival times.
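The iterative-gradient-plus-group-penalty structure can be sketched on a toy binary outcome (a hedged illustration, not IGGAUCS itself: we smooth the pairwise AUC with a sigmoid and, for a closed-form update, substitute the group-lasso proximal step for the paper's Lp, p < 1, penalty):

```python
import numpy as np

def fit_group_auc(X, y, groups, lam=0.05, lr=0.5, iters=300):
    """Gradient ascent on a sigmoid-smoothed AUC of the risk score X @ beta
    against a binary outcome y, followed by a group soft-thresholding step.
    (Group-lasso prox used here in place of the Lp, p < 1, penalty.)"""
    beta = np.zeros(X.shape[1])
    Xp, Xn = X[y == 1], X[y == 0]
    for _ in range(iters):
        diff = (Xp @ beta)[:, None] - (Xn @ beta)[None, :]
        sig = 1.0 / (1.0 + np.exp(-diff))
        w = sig * (1.0 - sig)                       # sigmoid derivative weights
        grad = (w.sum(1) @ Xp - w.sum(0) @ Xn) / w.size
        beta += lr * grad                           # ascend the smoothed AUC
        for g in np.unique(groups):                 # group-wise shrinkage
            m = groups == g
            norm = np.linalg.norm(beta[m])
            beta[m] *= max(0.0, 1.0 - lr * lam / norm) if norm > 0 else 0.0
    return beta
```

Because the penalty acts on whole-group norms, uninformative pathways are zeroed out together rather than gene by gene, which is the behavior the group penalty is designed to produce.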