Mining gene expression data to identify genes associated with patient survival, so that such genes can be used to achieve more accurate prognoses, is an ongoing problem in microarray-based cancer prognostic studies. The least absolute shrinkage and selection operator (lasso) is often used for gene selection and parameter estimation in high-dimensional microarray data. The lasso shrinks some of the coefficients to zero, with the amount of shrinkage determined by a tuning parameter, often chosen by cross-validation. The model determined by this cross-validation contains many false positives whose coefficients are actually zero. We propose a method for estimating the false positive rate (FPR) of lasso estimates in a high-dimensional Cox model. We performed a simulation study to examine the precision of the FPR estimate obtained by the proposed method. We applied the proposed method to real data and illustrated the identification of false positive genes.
cancer prognostic; false positive rate; gene selection; high-dimensional regression; microarray data; survival analysis
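The core idea above, counting how many truly-zero coefficients survive cross-validated lasso selection, can be sketched on simulated data. This is not the paper's Cox-model procedure; a linear model with known sparse truth stands in, and all settings are illustrative:

```python
# Sketch: with a known sparse truth, count the zero-coefficient
# predictors that cross-validated lasso selection keeps anyway.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, k = 100, 200, 5                  # samples, predictors, true signals
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0                         # only the first k coefficients are nonzero
y = X @ beta + rng.standard_normal(n)

fit = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(fit.coef_)
false_pos = np.setdiff1d(selected, np.arange(k))
fpr = len(false_pos) / (p - k)         # fraction of true zeros selected
```

In simulations like this one, the cross-validation-tuned model typically retains the true signals plus a nonzero fraction of null predictors, which is the phenomenon the proposed FPR estimator targets.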
We consider estimation and variable selection in the partial linear model for censored data. The partial linear model for censored data is a direct extension of the accelerated failure time model, the latter of which is a very important alternative to the proportional hazards model. We extend rank-based lasso-type estimators to a model that may contain nonlinear effects. Variable selection in such a partial linear model has direct application to high-dimensional survival analyses that attempt to adjust for clinical predictors. In the microarray setting, previous methods can adjust for other clinical predictors by assuming that clinical and gene expression data enter the model linearly in the same fashion. Here, we select important variables after adjusting for prognostic clinical variables, whose effects are allowed to be nonlinear. Our estimator is based on stratification and can be extended naturally to account for multiple nonlinear effects. We illustrate the utility of our method through simulation studies and application to the Wisconsin prognostic breast cancer data set.
Lasso; Logrank; Penalized least squares; Survival analysis
There has been increasing interest in predicting patients’ survival after therapy by investigating gene expression microarray data. In regression and classification models with high-dimensional genomic data, boosting has been successfully applied to build accurate predictive models and conduct variable selection simultaneously. We propose Buckley-James boosting for semiparametric accelerated failure time models with right-censored survival data, which can be used to predict survival of future patients from high-dimensional genomic data. In the spirit of the adaptive LASSO, twin boosting is also incorporated to fit sparser models. The proposed methods offer a unified approach for fitting linear models and nonlinear-effects models with possible interactions, and they perform variable selection and parameter estimation simultaneously. The proposed methods are evaluated by simulations and applied to a recent microarray gene expression data set for patients with diffuse large B-cell lymphoma under the current gold standard therapy.
boosting; accelerated failure time model; Buckley-James estimator; censored survival data; LASSO; variable selection
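The variable-selection mechanics of componentwise boosting can be sketched in a few lines. The paper's Buckley-James step (imputing censored responses before each fit) is omitted here; this simplified, uncensored version only shows how boosting selects variables by updating one predictor per iteration:

```python
# Minimal componentwise L2-boosting on uncensored data: at each step,
# fit every single predictor to the current residual and take a small
# step on the best-fitting one. Coordinates never chosen stay at zero.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

nu, steps = 0.1, 200                   # learning rate, boosting iterations
beta = np.zeros(p)
resid = y - y.mean()
for _ in range(steps):
    coefs = X.T @ resid / (X ** 2).sum(axis=0)       # univariate fits
    sse = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
    j = np.argmin(sse)                               # best single predictor
    beta[j] += nu * coefs[j]
    resid -= nu * coefs[j] * X[:, j]

selected = np.flatnonzero(np.abs(beta) > 1e-8)
```

Twin boosting, mentioned in the abstract, would run this procedure a second time with the first-round coefficients reweighting the selection criterion, analogous to the adaptive LASSO.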
There have been relatively few publications using linear regression models to predict a continuous response based on microarray expression profiles. Standard linear regression methods are problematic when the number of predictor variables exceeds the number of cases. We have evaluated three linear regression algorithms that can be used for the prediction of a continuous response based on high-dimensional gene expression data. The three algorithms are the least angle regression (LAR), the least absolute shrinkage and selection operator (LASSO), and the averaged linear regression method (ALM). All methods are tested using simulations based on a real gene expression dataset and analyses of two real gene expression datasets, with an unbiased complete cross-validation approach. Our results show that the LASSO algorithm often provides a model with somewhat lower prediction error than the LAR method, but both of them perform more efficiently than the ALM predictor. We have developed a plug-in for BRB-ArrayTools that implements the LAR and the LASSO algorithms with complete cross-validation.
regression model; gene expression; continuous outcome
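"Complete" cross-validation, as used above, means the tuning parameter is re-selected inside every training fold, so the outer error estimate is not biased by tuning on the full data. A hedged sketch of that nested scheme (illustrative sizes and seeds):

```python
# Nested ("complete") cross-validation: the lasso penalty is chosen by
# inner CV on each outer training fold only, never on the held-out fold.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(12)
n, p = 120, 80
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:4] = 2.0
y = X @ beta + rng.standard_normal(n)

errors = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LassoCV(cv=5, random_state=0).fit(X[train], y[train])  # inner CV
    errors.append(np.mean((y[test] - model.predict(X[test])) ** 2))
cv_mse = float(np.mean(errors))        # unbiased estimate of prediction error
```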
Personalized medicine is predicated on the concept of identifying subgroups of a common disease for better treatment. Identifying biomarkers that predict disease subtypes has been a major focus of biomedical science. In the era of genome-wide profiling, there is controversy as to the optimal number of genes to supply as input to a feature selection algorithm for survival modeling.
Patients and methods
The expression profiles and outcomes of 544 patients were retrieved from The Cancer Genome Atlas. We compared four different survival prediction methods: (1) a 1-nearest neighbor (1-NN) survival prediction method; (2) a random patient selection method and a Cox-based regression method with nested cross-validation; (3) least absolute shrinkage and selection operator (LASSO) optimization using whole-genome gene expression profiles; and (4) LASSO optimization using gene expression profiles of cancer pathway genes only.
The 1-NN method performed better than the random patient selection method in terms of survival predictions, although it does not include a feature selection step. The Cox-based regression method with LASSO optimization using whole-genome gene expression data demonstrated higher survival prediction power than the 1-NN method, but was outperformed by the same method when using gene expression profiles of cancer pathway genes alone.
The 1-NN survival prediction method may require more patients for better performance, even when omitting censored data. Using preexisting biological knowledge for survival prediction is reasonable as a means to understand the biological system of a cancer, unless the analysis goal is to identify completely unknown genes relevant to cancer biology.
brain; feature selection; glioblastoma; personalized medicine; survival modeling; TCGA
Recent biomedical studies often measure two distinct sets of risk factors: low-dimensional clinical and environmental measurements, and high-dimensional gene expression measurements. For prognosis studies with right censored response variables, we propose a semiparametric regression model whose covariate effects have two parts: a nonparametric part for low-dimensional covariates, and a parametric part for high-dimensional covariates. A penalized variable selection approach is developed. The selection of parametric covariate effects is achieved using an iterated Lasso approach, for which we prove the selection consistency property. The nonparametric component is estimated using a sieve approach. An empirical model selection tool for the nonparametric component is derived based on the Kullback-Leibler geometry. Numerical studies show that the proposed approach has satisfactory performance. Application to a lymphoma study illustrates the proposed method.
Semiparametric regression; variable selection; right censored data; iterated Lasso
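The iterated-Lasso idea for the parametric part can be sketched as a two-step fit: an initial lasso supplies per-coefficient weights for a second, adaptively penalized fit. This is an illustrative stand-in (the paper's estimator also handles censoring and a sieve-estimated nonparametric part, not reproduced here); the column-rescaling trick is a standard way to get coefficient-specific penalties from a plain lasso solver:

```python
# Two-step iterated lasso: step-1 coefficients reweight the step-2
# penalty, so coefficients shrunk near zero in step 1 are penalized
# much harder in step 2.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 60
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = 2.0
y = X @ beta + rng.standard_normal(n)

step1 = Lasso(alpha=0.1).fit(X, y)
w = np.abs(step1.coef_) + 1e-6          # weights from the first-stage fit
step2 = Lasso(alpha=0.1).fit(X * w, y)  # scaling columns = penalty 1/w_j
coef = step2.coef_ * w                  # map back to the original scale
selected = np.flatnonzero(coef)
```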
In high-throughput genomic studies, an important goal is to identify a small number of genomic markers that are associated with the development and progression of diseases. A representative example is microarray prognostic studies, where the goal is to identify genes whose expressions are associated with disease-free or overall survival. Because of the high dimensionality of gene expression data, standard survival analysis techniques cannot be directly applied. In addition, among the thousands of genes surveyed, only a subset are disease-associated. Gene selection is needed along with estimation. In this article, we model the relationship between gene expressions and survival using the accelerated failure time (AFT) models. We use the bridge penalization for regularized estimation and gene selection. An efficient iterative computational algorithm is proposed. Tuning parameters are selected using V-fold cross-validation. We use a resampling method to evaluate the prediction performance of the bridge estimator and the relative stability of identified genes. We show that the proposed bridge estimator is selection consistent under appropriate conditions. Analysis of two lymphoma prognostic studies suggests that the bridge estimator can identify a small number of genes and can have better prediction performance than the Lasso.
Bridge penalization; Censored data; High dimensional data; Selection consistency; Stability; Sparse model
Censored median regression has proved useful for analyzing survival data in complicated situations, say, when the variance is heteroscedastic or the data contain outliers. In this paper, we study sparse estimation for censored median regression models, an important problem in high-dimensional survival data analysis. In particular, a new procedure is proposed to minimize an inverse-censoring-probability weighted least absolute deviation loss subject to the adaptive LASSO penalty, resulting in a sparse and robust median estimator. We show that, with a proper choice of the tuning parameter, the procedure can identify the underlying sparse model consistently and has desirable large-sample properties, including root-n consistency and asymptotic normality. The procedure also enjoys great advantages in computation, since its entire solution path can be obtained efficiently. Furthermore, we propose a resampling method to estimate the variance of the estimator. The performance of the procedure is illustrated by extensive simulations and two real data applications, including one microarray gene expression survival data set.
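The inverse-censoring-probability weights used above can be computed from a Kaplan-Meier estimate of the censoring distribution: uncensored subjects get weight 1/G(T_i−), censored subjects get weight 0. A minimal sketch (ties are ignored; all simulation settings are illustrative):

```python
# IPCW weights: Kaplan-Meier on the *censoring* distribution (censorings
# play the role of events), then weight each uncensored subject by the
# inverse of the censoring survivor function just before its time.
import numpy as np

def ipcw_weights(time, event):
    """event=1 means the true failure time was observed."""
    order = np.argsort(time)
    d_cens = 1 - event[order]              # censoring indicators, sorted
    n = len(time)
    w = np.zeros(n)
    s = 1.0                                # G(t-) just before current time
    for i in range(n):
        if event[order[i]] == 1:
            w[order[i]] = 1.0 / s          # uncensored: weight 1/G(T_i-)
        s *= 1.0 - d_cens[i] / (n - i)     # KM update after this time
    return w

rng = np.random.default_rng(4)
T = rng.exponential(1.0, 200)              # true event times
C = rng.exponential(2.0, 200)              # censoring times
time = np.minimum(T, C)
event = (T <= C).astype(int)
w = ipcw_weights(time, event)
```

These weights would then multiply the absolute-deviation loss terms in the penalized median regression.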
We present a framework for hyperspectral image (HSI) analysis validation, specifically abundance fraction estimation based on HSI measurements of water soluble dye mixtures printed on microarray chips. In our work we focus on the performance of two algorithms, the Least Absolute Shrinkage and Selection Operator (LASSO) and the Spatial LASSO (SPLASSO). The LASSO is a well known statistical method for simultaneously performing model estimation and variable selection. In the context of estimating abundance fractions in a HSI scene, the “sparse” representations provided by the LASSO are appropriate as not every pixel will be expected to contain every endmember. The SPLASSO is a novel approach we introduce here for HSI analysis which takes the framework of the LASSO algorithm a step further and incorporates the rich spatial information which is available in HSI to further improve the estimates of abundance. In our work here we introduce the dye mixture platform as a new benchmark data set for hyperspectral biomedical image processing and show our algorithm’s improvement over the standard LASSO.
(000.5490) Probability theory, stochastic processes, and statistics; (110.4234) Multispectral and hyperspectral imaging; (120.0120) Instrumentation, measurement, and metrology; (170.0170) Medical optics and biotechnology; (170.3880) Medical and biological imaging; (180.0180) Microscopy; (350.4800) Optical standards and testing
We consider variable selection in the Cox regression model (Cox, 1975, Biometrika 62, 269–276) with covariates missing at random. We investigate the smoothly clipped absolute deviation penalty and adaptive least absolute shrinkage and selection operator (LASSO) penalty, and propose a unified model selection and estimation procedure. A computationally attractive algorithm is developed, which simultaneously optimizes the penalized likelihood function and penalty parameters. We also optimize a model selection criterion, called the ICQ statistic (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658), to estimate the penalty parameters and show that it consistently selects all important covariates. Simulations are performed to evaluate the finite sample performance of the penalty estimates. Also, two lung cancer data sets are analyzed to demonstrate the proposed methodology.
ALASSO; Missing data; Partial likelihood; Penalized likelihood; Proportional hazards model; SCAD; Variable selection
Variable selection in genome-wide association studies can be a daunting task and statistically challenging because there are more variables than subjects. We propose an approach that uses principal-component analysis (PCA) and least absolute shrinkage and selection operator (LASSO) to identify gene-gene interaction in genome-wide association studies. A PCA was used to first reduce the dimension of the single-nucleotide polymorphisms (SNPs) within each gene. The interaction of the gene PCA scores were placed into LASSO to determine whether any gene-gene signals exist. We have extended the PCA-LASSO approach using the bootstrap to estimate the standard errors and confidence intervals of the LASSO coefficient estimates. This method was compared to placing the raw SNP values into the LASSO and the logistic model with individual gene-gene interaction. We demonstrated these methods with the Genetic Analysis Workshop 16 rheumatoid arthritis genome-wide association study data and our results identified a few gene-gene signals. Based on our results, the PCA-LASSO method shows promise in identifying gene-gene interactions, and, at this time we suggest using it with other conventional approaches, such as generalized linear models, to narrow down genetic signals.
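The PCA-LASSO pipeline described above can be sketched end to end on simulated genotypes. Gene boundaries, sizes, and penalty settings here are arbitrary illustrations, not the workshop analysis:

```python
# PCA-LASSO sketch: reduce each gene's SNPs to its first principal
# component, form pairwise products of the gene scores, and let the
# lasso pick gene-gene interaction signals.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, genes, snps_per_gene = 200, 5, 10
scores = []
for g in range(genes):
    G = rng.integers(0, 3, size=(n, snps_per_gene)).astype(float)  # 0/1/2 genotypes
    scores.append(PCA(n_components=1).fit_transform(G).ravel())
S = np.column_stack(scores)                     # one PC score per gene

pairs = [(i, j) for i in range(genes) for j in range(i + 1, genes)]
Z = np.column_stack([S[:, i] * S[:, j] for i, j in pairs])
y = 0.5 * Z[:, 0] + rng.standard_normal(n)      # signal on the first pair

fit = Lasso(alpha=0.05).fit(Z, y)
picked = [pairs[k] for k in np.flatnonzero(fit.coef_)]
```

The bootstrap extension in the abstract would refit this on resampled rows to get standard errors for the retained interaction coefficients.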
We use least absolute shrinkage and selection operator (LASSO) regression to select genetic markers and phenotypic features that are most informative with respect to a trait of interest. We compare several strategies for applying LASSO methods in risk prediction models, using the Genetic Analysis Workshop 17 exome simulation data consisting of 697 individuals with information on genotypic and phenotypic features (smoking, age, sex) in 5-fold cross-validated fashion. The cross-validated averages of the area under the receiver operating curve range from 0.45 to 0.63 for different strategies using only genotypic markers. The same values are improved to 0.69–0.87 when both genotypic and phenotypic information are used. The ability of the LASSO method to find true causal markers is limited, but the method was able to discover several common variants (e.g., FLT1) under certain conditions.
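The comparison of genotype-only versus genotype-plus-phenotype strategies scored by cross-validated AUC can be sketched as follows (simulated data; feature names and effect sizes are illustrative, not the GAW17 data):

```python
# L1-penalized logistic risk model scored by 5-fold cross-validated AUC,
# once with genotypes only and once adding phenotypic covariates.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n, p = 300, 100
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # genotype matrix
pheno = rng.standard_normal((n, 3))                   # age/sex/smoking stand-ins
logit = G[:, 0] - G[:, 1] + 2.0 * pheno[:, 0]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
auc_geno = cross_val_score(lasso_logit, G, y, cv=5, scoring="roc_auc").mean()
auc_both = cross_val_score(lasso_logit, np.hstack([G, pheno]), y, cv=5,
                           scoring="roc_auc").mean()
```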
In the presence of high-dimensional predictors, it is challenging to develop reliable regression models that can be used to accurately predict future outcomes. Further complications arise when the outcome of interest is an event time, which is often not fully observed due to censoring. In this paper, we develop robust prediction models for event time outcomes by regularizing Gehan’s estimator for the AFT model (Tsiatis, 1990) with a LASSO penalty. Unlike existing methods based on inverse probability weighting and the Buckley and James estimator (Buckley and James, 1979), the proposed approach does not require additional assumptions about the censoring and always yields a solution that is convergent. Furthermore, the proposed estimator leads to a stable regression model for prediction even if the AFT model fails to hold. To facilitate the adaptive selection of the tuning parameter, we detail an efficient numerical algorithm for obtaining the entire regularization path. The proposed procedures are applied to a breast cancer dataset to derive a reliable regression model for predicting patient survival based on a set of clinical prognostic factors and gene signatures. Finite sample performances of the procedures are evaluated through a simulation study.
AFT model; LASSO regularization; linear programming
Because every disease has its unique survival pattern, it is necessary to find a suitable model to simulate follow-ups. DNA microarray is a useful technique to detect thousands of gene expressions at one time and is usually employed to classify different types of cancer. We propose combination methods of penalized regression models and nonnegative matrix factorization (NMF) for predicting survival. We tried L1- (lasso), L2- (ridge), and L1-L2 combined (elastic net) penalized regression on microarray data from diffuse large B-cell lymphoma (DLBCL) patients and found that the combined L1-L2 method predicts survival best, with the smallest logrank P value. Furthermore, 80% of the selected genes have been reported to correlate with carcinogenesis or lymphoma. Through NMF we found that DLBCL patients can be divided clearly into 4 groups, implying that DLBCL may have 4 subtypes with slightly different survival patterns. Next, we excluded some patients whom NMF indicated were hard to classify and ran the three penalized regression models again. We found that survival prediction performance improved, with lower logrank P values. Therefore, we conclude that after preselection of patients by NMF, penalized regression models can predict DLBCL patients' survival successfully.
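The two-stage pipeline, NMF to subgroup patients and then penalized regression on expression, can be sketched on simulated data. The paper's actual survival models are Cox-type and compared by logrank P values; here a continuous stand-in for log survival time replaces the censored outcome, and all sizes are illustrative:

```python
# Stage 1: NMF clusters patients by their dominant metagene.
# Stage 2: an elastic net relates expression to a survival stand-in.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
n, p, k = 120, 80, 4
V = rng.gamma(2.0, 1.0, size=(n, p))          # nonnegative expression matrix

nmf = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(V)                      # patient loadings on k metagenes
group = W.argmax(axis=1)                      # dominant metagene = subgroup

log_time = V[:, 0] - 0.5 * V[:, 1] + rng.standard_normal(n)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(V, log_time)
n_selected = np.count_nonzero(enet.coef_)
```

In the paper's workflow, patients whose loadings do not clearly favor one metagene would be dropped before refitting the penalized models.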
Dimension reduction, model and variable selection are ubiquitous concepts in modern statistical science, and deriving new methods beyond the scope of current methodology is noteworthy. This note briefly reviews existing regularization methods for penalized least squares and likelihood for survival data and their extension to a certain class of penalized estimating functions. We show that if one’s goal is to estimate the entire regularized coefficient path using the observed survival data, then all current strategies fail for the Buckley-James estimating function. We propose a novel two-stage method to estimate and restore the entire Dantzig-regularized coefficient path for censored outcomes in a least-squares framework. We apply our methods to a microarray study of lung adenocarcinoma with sample size n = 200 and p = 1036 gene predictors and find 10 genes that are consistently selected across different criteria and an additional 14 genes that merit further investigation. In simulation studies, we found that the proposed path restoration and variable selection technique has the potential to perform as well as existing methods that begin with a proper convex loss function at the outset.
Accelerated failure time; Boosting; Dantzig selector; Lasso; Penalized least squares; Rank regression; Survival analysis
Our purpose was to develop and test a predictive model of the acute glucose response to exercise in individuals with type 2 diabetes.
Design and methods
Data from three previous exercise studies (56 subjects, 488 exercise sessions) were combined and used as a development dataset. A mixed-effects Least Absolute Shrinkage Selection Operator (LASSO) was used to select predictors among 12 potential predictors. Tests of the relative importance of each predictor were conducted using the Lindemann Merenda and Gold (LMG) algorithm. Model structure was tested using likelihood ratio tests. Model accuracy in the development dataset was assessed by leave-one-out cross-validation.
Prospectively captured data (47 individuals, 436 sessions) was used as a test dataset. Model accuracy was calculated as the percentage of predictions within measurement error. Overall model utility was assessed as the number of subjects with ≤1 model error after the third exercise session. Model accuracy across individuals was assessed graphically. In a post-hoc analysis, a mixed-effects logistic regression tested the association of individuals’ attributes with model error.
Minutes since eating, a non-linear transformation of minutes since eating, post-prandial state, hemoglobin A1c, sulfonylurea status, age, and exercise session number were identified as novel predictors. Minutes since eating, its transformations, and hemoglobin A1c combined to account for 19.6% of the variance in glucose response. Sulfonylurea status, age, and exercise session each accounted for <1.0% of the variance. In the development dataset, a model with random slopes for pre-exercise glucose improved fit over a model with random intercepts only (likelihood ratio 34.5, p < 0.001). Cross-validated model accuracy was 83.3%.
In the test dataset, overall accuracy was 80.2%. The model was more accurate in pre-prandial than post-prandial exercise (83.6% vs. 74.5% accuracy, respectively). 31/47 subjects had ≤1 model error after the third exercise session. Model error varied across individuals and was weakly associated with within-subject variability in pre-exercise glucose (odds ratio 1.49, 95% confidence interval 1.23-1.75).
The preliminary development and test of a predictive model of acute glucose response to exercise is presented. Further work to improve this model is discussed.
Prediction; Exercise; Type 2 Diabetes; Blood Glucose
There are several techniques for fitting risk prediction models to high-dimensional data arising from microarrays. However, the biological knowledge about relations between genes is only rarely taken into account. One recent approach incorporates pathway information, available, e.g., from the KEGG database, by augmenting the penalty term in Lasso estimation for continuous response models.
As an alternative, we extend componentwise likelihood-based boosting techniques for incorporating pathway information into a larger number of model classes, such as generalized linear models and the Cox proportional hazards model for time-to-event data. In contrast to Lasso-like approaches, no further assumptions for explicitly specifying the penalty structure are needed, as pathway information is incorporated by adapting the penalties for single microarray features in the course of the boosting steps. This is shown to result in improved prediction performance when the coefficients of connected genes have opposite sign. The properties of the fitted models resulting from this approach are then investigated in two application examples with microarray survival data.
The proposed approach results not only in improved prediction performance but also in structurally different model fits. Incorporating pathway information in the suggested way is therefore seen to be beneficial in several ways.
Censored data are increasingly common in many microarray studies that attempt to relate gene expression to patient survival. Several new methods have been proposed in the last two years. Most of these methods, however, are not available to biomedical researchers, leading to many from-scratch re-implementations of ad hoc, suboptimal approaches to survival data.
We have developed SignS (Signatures for Survival data), an open-source, freely-available, web-based tool and R package for gene selection, building molecular signatures, and prediction with survival data. SignS implements four methods which, according to existing reviews, perform well and, by being of a very different nature, offer complementary approaches. We use parallel computing via MPI, leading to large decreases in user waiting time. Cross-validation is used to assess predictive performance and stability of solutions, the latter an issue of increasing concern given that there are often several solutions with similar predictive performance. Biological interpretation of results is enhanced because genes and signatures in models can be sent to other freely-available on-line tools for examination of PubMed references, GO terms, and KEGG and Reactome pathways of selected genes.
SignS is the first web-based tool for survival analysis of expression data, and one of the very few with biomedical researchers as target users. SignS is also one of the few bioinformatics web-based applications to extensively use parallelization, including fault tolerance and crash recovery. Because of its combination of methods implemented, usage of parallel computing, code availability, and links to additional data bases, SignS is a unique tool, and will be of immediate relevance to biomedical researchers, biostatisticians and bioinformaticians.
The least absolute shrinkage and selection operator (LASSO) can be used to predict SNP effects. This operator has the desirable feature of including in the model only a subset of explanatory SNPs, which can be useful both in QTL detection and GWS studies. LASSO solutions can be obtained by the least angle regression (LARS) algorithm. The big issue with this procedure is to define the best constraint (t), i.e. the upper bound of the sum of absolute values of the SNP effects, which roughly corresponds to the number of SNPs to be selected. Usai et al. (2009) dealt with this problem by a cross-validation approach and defined t as the average number of selected SNPs over all replications. Nevertheless, in small populations, such an estimator could give underestimated values of t. Here we propose two alternative ways to define t and compare them with the "classical" one.
The first (strategy 1), was based on 1,000 cross-validations carried out by randomly splitting the reference population (2,000 individuals with performance) into two halves. The value of t was the number of SNPs which occurred in more than 5% of replications. The second (strategy 2), which did not use cross-validations, was based on the minimization of the Cp-type selection criterion which depends on the number of selected SNPs and the expected residual variance.
The size of the subset of selected SNPs was 46, 189 and 64 for the classical approach and strategies 1 and 2, respectively. The classical approach and strategy 2 gave similar results and indicated quite clearly the regions where QTLs with additive effects were located. Strategy 1 confirmed such regions and added further positions, which gave a less clear scenario. Correlations between GEBVs estimated with the three strategies and TBVs in progenies without phenotypes were 0.9237, 0.9000 and 0.9240 for the classical approach and strategies 1 and 2, respectively.
This suggests that the Cp-type selection criterion is a valid alternative to cross-validation for defining the best constraint when selecting subsets of predictive SNPs by the LASSO-LARS procedure.
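Choosing the lasso constraint by an information criterion rather than cross-validation can be illustrated with scikit-learn, whose LassoLarsIC runs LARS and picks the model minimizing AIC, for Gaussian errors a Cp-type criterion up to constants. This is an illustrative analogue, not the authors' exact Cp formula:

```python
# Two ways to pick the lasso model on the LARS path: an AIC/Cp-type
# criterion (no resampling) versus 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoLarsIC, LassoLarsCV

rng = np.random.default_rng(8)
n, p = 150, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:6] = 1.5
y = X @ beta + rng.standard_normal(n)

ic = LassoLarsIC(criterion="aic").fit(X, y)   # criterion-based choice
cv = LassoLarsCV(cv=5).fit(X, y)              # cross-validated choice
n_ic = np.count_nonzero(ic.coef_)
n_cv = np.count_nonzero(cv.coef_)
```

As in the abstract, the criterion-based choice avoids the repeated refits (and small-sample instability) of cross-validation while typically selecting a comparable subset.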
We study the Cox models with semiparametric relative risk, which can be partially linear with one nonparametric component, or multiple additive or nonadditive nonparametric components. A penalized partial likelihood procedure is proposed to simultaneously estimate the parameters and select variables for both the parametric and the nonparametric parts. Two penalties are applied sequentially. The first penalty, governing the smoothness of the multivariate nonlinear covariate effect function, provides a smoothing spline ANOVA framework that is exploited to derive an empirical model selection tool for the nonparametric part. The second penalty, either the smoothly-clipped-absolute-deviation (SCAD) penalty or the adaptive LASSO penalty, achieves variable selection in the parametric part. We show that the resulting estimator of the parametric part possesses the oracle property, and that the estimator of the nonparametric part achieves the optimal rate of convergence. The proposed procedures are shown to work well in simulation experiments, and then applied to a real data example on sexually transmitted diseases.
Backfitting; partially linear models; penalized variable selection; proportional hazards; penalized partial likelihood; smoothing spline ANOVA
Motivation: DNA microarrays are routinely applied to study diseased or drug-treated cell populations. A critical challenge is distinguishing the genes directly affected by these perturbations from the hundreds of genes that are indirectly affected. Here, we developed a sparse simultaneous equation model (SSEM) of mRNA expression data and applied Lasso regression to estimate the model parameters, thus constructing a network model of gene interaction effects. This inferred network model was then used to filter data from a given experimental condition of interest and predict the genes directly targeted by that perturbation.
Results: Our proposed SSEM–Lasso method demonstrated substantial improvement in sensitivity compared with other tested methods for predicting the targets of perturbations in both simulated datasets and microarray compendia. In simulated data, for two different network types, and over a wide range of signal-to-noise ratios, our algorithm demonstrated a 167% increase in sensitivity on average for the top 100 ranked genes, compared with the next best method. Our method also performed well in identifying targets of genetic perturbations in microarray compendia, with up to a 24% improvement in sensitivity on average for the top 100 ranked genes. The overall performance of our network-filtering method shows promise for identifying the direct targets of genetic dysregulation in cancer and disease from expression profiles.
Availability: Microarray data are available at the Many Microbe Microarrays Database (M3D, http://m3d.bu.edu). Algorithm scripts are available at the Gardner Lab website (http://gardnerlab.bu.edu/SSEMLasso).
Supplementary information: Supplementary Data are available at Bioinformatics online.
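A greatly simplified sketch of lasso-based network inference in the spirit of SSEM-Lasso: regress each gene on all the others and treat nonzero coefficients as directed edges. The paper's simultaneous-equation model and perturbation filtering are not reproduced; this is the generic neighborhood-selection idea on simulated data:

```python
# Per-gene lasso regressions define an interaction network: a nonzero
# coefficient of gene k when predicting gene j gives an edge k -> j.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p = 100, 20
E = rng.standard_normal((n, p))            # expression matrix
E[:, 1] += 0.8 * E[:, 0]                   # gene 0 drives gene 1
E[:, 2] += 0.8 * E[:, 1]                   # gene 1 drives gene 2

edges = set()
for j in range(p):
    others = np.delete(np.arange(p), j)
    fit = Lasso(alpha=0.1).fit(E[:, others], E[:, j])
    for k in others[np.flatnonzero(fit.coef_)]:
        edges.add((int(k), int(j)))        # edge k -> j
```

In the SSEM-Lasso workflow, such a network would then be used to filter a perturbation profile and rank genes as likely direct targets.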
High-throughput gene expression technologies such as microarrays have been utilized in a variety of scientific applications. Most of the work has been done on assessing univariate associations between gene expression profiles and clinical outcome (variable selection) or on developing classification procedures with gene expression data (supervised learning). We consider a hybrid variable selection/classification approach based on linear combinations of the gene expression profiles that maximize an accuracy measure summarized using the receiver operating characteristic curve. Under a specific probability model, this leads to the consideration of linear discriminant functions. We incorporate an automated variable selection approach using LASSO. An equivalence between LASSO estimation and support vector machines allows for model fitting using standard software. We apply the proposed method to simulated data as well as data from a recently published prostate cancer study.
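The LASSO/SVM connection invoked above can be illustrated with an L1-penalized linear support vector classifier, which performs variable selection and classification in one fit. This is a generic sketch with simulated data, not the paper's ROC-optimizing discriminant:

```python
# L1-penalized linear SVM: the l1 penalty zeroes out uninformative
# features, so the fitted discriminant uses only a sparse gene subset.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)
n, p = 200, 50
X = rng.standard_normal((n, p))
w = np.zeros(p); w[:3] = 1.5               # only 3 informative features
y = (X @ w + rng.standard_normal(n) > 0).astype(int)

clf = LinearSVC(penalty="l1", dual=False, C=0.5, max_iter=5000).fit(X, y)
selected = np.flatnonzero(clf.coef_.ravel())
auc = roc_auc_score(y, clf.decision_function(X))   # in-sample ROC area
```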
With competing risks failure time data, one often needs to assess the covariate effects on the cumulative incidence probabilities. Fine and Gray proposed a proportional hazards regression model to directly model the subdistribution of a competing risk. They developed the estimating procedure for right-censored competing risks data, based on the inverse probability of censoring weighting. Right-censored and left-truncated competing risks data sometimes occur in biomedical research. In this paper, we study the proportional hazards regression model for the subdistribution of a competing risk with right-censored and left-truncated data. We adopt a new weighting technique to estimate the parameters in this model. We have derived the large sample properties of the proposed estimators. To illustrate the application of the new method, we analyze the failure time data for children with acute leukemia. In this example, the failure times for children who had bone marrow transplants were left truncated.
competing risks; cumulative incidence function; proportional hazards model; subdistribution
Because multiple loci control complex diseases, there is great interest in testing markers simultaneously instead of one by one. In this paper, we applied two model selection algorithms: the stochastic search variable selection (SSVS) and the least absolute shrinkage and selection operator (LASSO) to two quantitative phenotypes related to rheumatoid arthritis (RA).
The Genetic Analysis Workshop 16 data include 2,062 unrelated individuals and 545,080 single-nucleotide polymorphism markers from the Illumina 550k chip. We performed our analyses on the cases, as quantitative phenotype data were not provided for the controls. The performance of the two algorithms was compared. Using sure independence screening as the prescreening procedure, both SSVS and LASSO give small models. No markers are identified in the human leukocyte antigen region of chromosome 6 that was shown to be associated with RA. SSVS and LASSO identify seven common loci, some of them on the genes LRRC8D, LRP1B, and COLEC12. These genes have not been reported to be associated with RA. LASSO also identified a common locus on the gene KTCD21 for the two phenotypes (markers rs230662 and rs483731, respectively).
SSVS outperforms LASSO in simulation studies. Both SSVS and LASSO give small models on the RA data; however, this depends on the model parameters. We also demonstrate the ability of both LASSO and SSVS to handle more markers than the number of samples.
Microarray technology is widely used in cancer diagnosis. Successfully identifying gene biomarkers will significantly help to classify different cancer types and improve prediction accuracy. The regularization approach is one of the effective methods for gene selection in microarray data, which generally contain a large number of genes and a small number of samples. In recent years, various approaches have been developed for gene selection from microarray data. Generally, they are divided into three categories: filter, wrapper and embedded methods. Regularization methods are an important embedded technique and perform continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying regularization techniques to gene selection. The most popular regularization technique is the Lasso (L1), and many L1-type regularization terms have been proposed in recent years. Theoretically, Lq-type regularization with a lower value of q would lead to better solutions with more sparsity. Moreover, the L1/2 regularization can be taken as a representative of Lq (0 < q < 1) regularizations.
In this work, we investigate a sparse logistic regression with the L1/2 penalty for gene selection in cancer classification problems, and propose a coordinate descent algorithm with a new univariate half thresholding operator to solve the L1/2 penalized logistic regression. Experimental results on artificial and microarray data demonstrate the effectiveness of our proposed approach compared with other regularization methods. Especially, for 4 publicly available gene expression datasets, the L1/2 regularization method achieved its success using only about 2 to 14 predictors (genes), compared to about 6 to 38 genes for ordinary L1 and elastic net regularization approaches.
From our evaluations, it is clear that the sparse logistic regression with the L1/2 penalty achieves higher classification accuracy than ordinary L1 and elastic net regularization approaches, while fewer but informative genes are selected. This is an important consideration for screening and diagnostic applications, where the goal is often to develop an accurate test using as few features as possible in order to control cost. Therefore, sparse logistic regression with the L1/2 penalty is an effective technique for gene selection in real classification problems.
Gene selection; Sparse logistic regression; Cancer classification
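The coordinate-descent scheme described above cycles through coefficients, solving a univariate L1/2-penalized problem for each. The paper uses a closed-form half-thresholding operator; to keep this sketch self-checking, each univariate update is instead minimized numerically over a grid, a stand-in for that operator (least squares replaces the logistic loss, and all settings are illustrative):

```python
# Coordinate descent for L1/2-penalized least squares. Each coordinate
# update solves argmin_b 0.5*(b - z)^2 + lam*sqrt(|b|), here by grid
# search as a numerical stand-in for the closed-form half-thresholding.
import numpy as np

def half_threshold(z, lam, grid=np.linspace(-5, 5, 2001)):
    obj = 0.5 * (grid - z) ** 2 + lam * np.sqrt(np.abs(grid))
    return grid[np.argmin(obj)]

rng = np.random.default_rng(11)
n, p, lam = 100, 30, 0.5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)            # unit-norm columns simplify updates
beta_true = np.zeros(p); beta_true[:3] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b = np.zeros(p)
for _ in range(30):                        # coordinate-descent sweeps
    for j in range(p):
        r = y - X @ b + X[:, j] * b[j]     # partial residual excluding j
        b[j] = half_threshold(X[:, j] @ r, lam)

selected = np.flatnonzero(b)
```

The L1/2 penalty has a wide exact-zero region, so the fitted model keeps only the few strong coefficients, matching the sparser-than-L1 behavior the abstract reports.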