|Home | About | Journals | Submit | Contact Us | Français|
Feature selection and prediction are the most important tasks for big data mining. The common strategies for feature selection in big data mining are L 1, SCAD and MC+. However, none of the existing algorithms optimizes L 0, which penalizes the number of nonzero features directly.
In this paper, we develop a novel sparse generalized linear model (GLM) with L 0 approximation for feature selection and prediction with big omics data. The proposed approach approximate the L 0 optimization directly. Even though the original L 0 problem is non-convex, the problem is approximated by sequential convex optimizations with the proposed algorithm. The proposed method is easy to implement with only several lines of code. Novel adaptive ridge algorithms (L 0ADRIDGE) for L 0 penalized GLM with ultra high dimensional big data are developed. The proposed approach outperforms the other cutting edge regularization methods including SCAD and MC+ in simulations. When it is applied to integrated analysis of mRNA, microRNA, and methylation data from TCGA ovarian cancer, multilevel gene signatures associated with suboptimal debulking are identified simultaneously. The biological significance and potential clinical importance of those genes are further explored.
The developed Software L 0ADRIDGE in MATLAB is available at https://github.com/liuzqx/L0adridge.
The online version of this article (doi:10.1186/s13040-017-0159-z) contains supplementary material, which is available to authorized users.
Integrating multilevel molecular and clinical data to design preventive, diagnostic, and therapeutic solutions that are individually tailored to each patient’s requirements is the ultimate goal of precision medicine. However, the huge number of features makes it neither practical nor feasible to predict clinical outcomes with all omics features directly. Thus, selecting a small subset of informative features (biomarkers) to conduct association studies and clinical predictions has become an important step toward effective big data mining. Statistical tests or univariate correlation analysis for feature selection ignore the interacting relationship among genes. To evaluate the predictive power of the features, one appealing approach for feature selection is L 0 regularized sparse modeling, which penalizes the number of nonzero features directly. L 0 is known as the most essential sparsity measure and has nice theoretical properties. However, it is computational impossible to perform an exhaustive search when analyzing omics data sets with millions of features. L 0 penalized optimization is known to be NP-hard in general (Lin et al. 2010).
One common strategy for feature selection is to replace the non-convex L 0 with the L 1 norm. L 1 is a convex relaxation and loose approximation of L 0. Although L 1 penalized sparse models  can be solved efficiently, the estimators with L 1 are penalized too much and asymptotically biased. In addition, L 1 inclines to select more spurious features than necessary, and may not always choose the true model consistently . Theoretically, L 1 never outperforms L 0 by a constant . Depending on the location of true optimum, L 1 may perform much worse than L 0 [4, 5]. As a result, the convex relaxation techniques have been shown to be suboptimal in many cases . More recent approaches aimed to reduce bias and overcome discontinuity include the non-convex SCAD  and MC+ . However, none of the existing algorithms directly approximate the L 0 optimization problem. Either SCAD or MC+ has been rarely used for feature selection in big data analytics because of their computational intensity with multiple tuning parameters. On the other hand, recent research works including ours show that sparse regression models with L 0 penalty (local solution) outperforms L 1 (global solution) by a substantial margin [5, 9–11].
Debulking cytoreductive surgery is a standard treatment for ovarian cancer. The goal of debulking is to remove as much visible cancer as possible. However, if tumor nodules have invaded vital organs, surgeons may not be able to remove them without compromising the patient’s life. Leaving tumor nodules larger than 1 cm is defined as suboptimal debulking (cytoreduction). It has been shown that suboptimal debulking is associated with reduced chemosensitivity and poor survival in ovarian cancer. Biomarkers derived from multi-omics data may help physicians decide which patients should undergo surgery and which should be treated with chemotherapy first [12–14]. Identifying biomarkers from multi-omics data has been an exciting but challenging task. Sparse modeling is one of the important approaches for simultaneous phenotype prediction and biomarker identification. In this paper, we propose a L 0 penalized generalized linear regression (GLM) for feature selection and prediction. Adaptive ridge algorithm (L 0ADRIDGE) is developed to approximate L 0 penalized GLM with sequential convex optimization and is efficient in handling ultra high-dimensional omics data. The proposed method outperforms other cutting-edge convex and non-convex penalties including L 1, SCAD and MC+ with simulations. When applied to the important suboptimal debulking prediction problem in ovarian cancer, the proposed approach identifies multilevel molecular signatures through mining methylation, microRNA and mRNA expression data jointly from TCGA. The identified molecular signatures are further evaluated using public databases.
Given an input X N×P, where NP, and output Y, we have a generalized linear model with canonical link in the following form:
where G is a canonical link function. Different link functions lead to different models. For instance, a logit link function leads to logistic regression, while an exponential link function leads to Poisson regression.
The distribution of Y in GLM is assumed to be from the exponential families with the following probability (density) function:
where ϕ is a dispersion parameter, and different functions A(), B() and C() are for different distributions Y . The corresponding mean and variance are:
where V(μ)=B ′′(θ). Let Y=[Y 1,…,Y N]t, X=[x 1,x 2,…,x N]t, and μ=[μ 1,…,μ N]t, so and . The log-likelihood of Y is
Dropping the constants A(ϕ), and C(Y i,ϕ), we have the simplified log likelihood as follows:
Hence, L 0 penalized error function to minimize is
where is the number of nonzero elements in β, μ i=G(θ i) and . If we define , then . Equation (2) is equivalent to
which is equivalent to the following system:
Given η and , the derivative of E w.r.t. β is
where indicates element-wise division. The Hessian matrix is
where V i=V(μ i)=B ″(θ i)=G ′(θ i), i=1,…,N, and let , we have
The Newton-Raphson iteration for β is
Let , we have
Different link functions will lead to different regression models as shown as in Table 1.
Other GLMs such as negative binomial, gamma, and inverse Gaussian can be implemented accordingly with a different V(μ). When dealing with big data problems with NP,where N is the number of samples and P is the number of parameters, the inverse of a P×P matrix is time-consuming and computational challenging. We proposed an efficient algorithm to calculate the inverse of a much smaller N×N matrix as follows (Liu et al. 2015):
So that when NP, we have a much efficient estimation:
The adaptive ridge algorithm (L 0ADRIDGEA) is implemented in MATLAB are as follows:
The algorithm is easy to implement and very efficient for either small sample size and large dimension or large sample size and small dimension big data problem. The regularized parameter λ can be determined either by cross-validation or by AIC and BIC with λ=2 and λ= log(N), respectively. We further discuss that the proposed method is a L 0 approximation and converges to L 0 when the number of iterations m→∞.
Algorithm justification: Given a high-dimensional big feature matrix X N×P(NP) and a threshold γ for the coefficient estimates, L 0 rejects all the coefficient estimates below γ to 0 and keeps the large coefficients unchanged. This is the same as defining a binary vector s=[…,1,0,…,1]t, with the value of 0 or 1 for each feature, where s j=1 if the coefficient estimate for that feature is above the threshold γ, and 0 otherwise. Let S=diag(s) be a matrix with s on its diagonal, we have the selected feature matrix X S=XS. We can build the standard models with the matrix X S, if we know s in advance. For instance, we can estimate the coefficients of a GLM with L 2 regulation given X S and Y with
where , , and because of the special structure of matrix S. It is guaranteed that the estimate is 0 for feature j with s j=0. However, in reality we do not know s. Estimating both s and θ is an NP-hard problem, since we need to solve a mixed-integer optimization problem. Comparing Eq. (7) with Eq. (5), β new=(DX t VX+λI)−1 DX t Z, it is clear that S is replaced by D and a binary s j is approximated by a continuous in proposed algorithm. Therefore, the proposed method is a L 0 approximation.
Recall the iterative system in Eq. (3), note that each feature is penalized by a different penalty, which is inversely proportional to the squared magnitude of that parameter estimator η j. i.e.,
Smaller β j will lead to larger λ j. A tiny β j, will become smaller and λ j will be getting larger in each iteration of L 0ADRIDGE algorithm. β j→0, and λ j→∞. On the other hand, a larger β j will lead a finite λ j, and nonzero β j, when the number of iteration goes to ∞. The solution of L 0ADRIDGE will converge to that of Eq. (7), because the effect of nonzero η j will be canceled out in Eq. (5). Note that our proposed methods will find a sparse solution with a large number of iterations and small ε, even though the solution of L 2 regularized modeling is not sparse. Small parameters (β js) become smaller at each iteration and will eventually go to zero (below the machine ε). We can also set a parameter to 0 if it is below predefined ε=1e−6 to speed up the convergence of the algorithm.
Poisson Regression: Our first simulation was used to evaluate the performance of our method for high dimensional Poisson regression. The data was generated from Poisson distribution with different sample sizes (N) and dimensions (P). However, only features 1, 5, 10 and the constant term are used to generate the Poisson counts with [β 0,β 1,β 5,β 10]=[1,0.5,0.5,0.4]. The count Y is generated with Y=Poisson(μ), where mean μ= exp(β X). The proposed method is compared with the glmnet ( and SparseReg package [17, 18]. glmnet and SparseReg implemented the elastic net, SCAD, and MC+ penalties with an efficient path algorithm. We compare the performance of our approach with L 1 (glmnet), SCAD and MC+ using the popular BIC (λ= log(N)) criteria. Our L 0ADRIDGE is compared to the glmnet for L 1 and SparseReg for both SCAD and MC+. The results of different methods are presented in Table 2.
Table 2 shows that our L 0ADRIDGE consistently achieved the best performance with BIC and different sample sizes and dimensions. With BIC, although MC+ has the lowest square root of mean squared error (rMSE), and fits the data better, L 0ADRIDGE achieves the least absolute bias , highest percentage of identified true model (PTM), and lowest false discovery rate (FDR) under different simulation settings. The average number of selected features (ANSF) with L 0ADRIDGE is also closest to the true number 4. Particularly, L 0ADRIDGE found 100% true model with the lowest average absolute bias (0.086) under the dimension of P=10,000 and sample size of N=500, indicating that the proposed approach is efficient under extra-high dimensional setting. Another interesting finding is that the square root of mean squared errors and absolute biases with L 0ADRIDGE did not vary much across different simulation setting, indicating the robustness of the proposed approach. Moreover, L 0ADRIDGE with BIC is slightly faster than different routines implemented in glmnet and SparseReg in computational time. Finally, BIC apparently is not a good model selection criteria for L 1, SCAD and MC+. More features are selected than necessary. A larger λ is needed for selecting the correct model. We reported the results with a larger λ on Additional file 1: Table SS1,1, and demonstrated that both SCAD and MC+ can achieve a much smaller FDR, but a larger absolute bias and rMSE.
Logistic regression: The logistic regression data was generated with the coefficients of [β 1,β 5,β 10]=[0.5,0.5,−0.4], respectively, and the remaining coefficients were set to zero. The score z=X β+ε, where ε is the random noise with the signal to noise ratio of 4. Then, the probability y is generated from the logistic function y=1/(1+e −z). Note that y is the true probability instead of binary (1/0) in this simulation. Unlike the previous example, the optimal values of λ in this simulation were selected with the standard 5-fold cross-validation. We divided the λ from λ min=1e−4, to λ max into 100 equal intervals in log-scale, then chose the optimal λ with the smallest test error. The simulation was also repeated 100 times. The computational results were reported in Table 3. The values in the parenthesis are the positive/negative standard deviation.
Table 3 shows that L 0ADRIDGE outperforms L 1, SCAD and MC+ with a substantial margin under the 5-fold cross-validation. Cross-validation is a standard tool for parameter selection in machine learning. L 0ADRIDGE achieved the smallest test square root of mean squared error, least absolute biases, the lowest FDR, and highest percentages of identified true models The average number of selected features are 3.33 and 3.41 for the dimensions of 100 and 1000, respectively, which are the closest to the true number of features 3. In contrary, L 1, SCAD and MC+ selected unnecessary features. L 1 on average identified 17.1 and 50.92 features, and SCAD selected 18.35 and 73.03 features on average for the dimensions of 100 and 1000, respectively, while MC+ performed slightly better, choosing 10.41 and 24.8 features for the dimensions of 100 and 1000, respectively. More impressively, out of 100 simulations, L 0ADRIDGE identified the true model 81 and 80 times with different dimensions, while L 1 and SCAD could not find the true model once, and MC+ only identified the true model 2 times for the dimension of 100, indicating the super performance of L 0ADRIDGE under cross-validation. Finally, L 0ADRIDGE is robust. The test square root of mean squared error and other performance measures did not vary much when the dimension increased from 100 to 1000. It is worth noting that our proposed method performs well with the popular statistical model selection criteria such as BIC and cross-validation. Other popular methods such as L 1, SCAD, and MC+ select more features than necessary with such criteria. Therefore, many popular packages including the commercial MATLAB usually choose a larger λ one standard deviation above the minimum test error with cross-validation, which is arbitrary and leads to larger bias. To overcome such bias in parameter estimation, some packages re-estimate the parameters with the selected features and standard GLM model. Unlike these methods, our proposed method performed much better without any postprocessing. Finally, the algorithm is very robust with different initialization. With N=100, P=1000 and 100 times of different randomized initialization, we achieved the trMSE of 0.437(±.003), average absolute bias of 0.0763(±.07), ANSF of 3.39(±1.154), PTM of 85% and FDR of 6.9%, which is quite similar to the results with a fixed initialization.
The Cancer Genomic Atlas (TCCA) has generated a large amount of next generation sequencing and other omics data for ovarian adenocarcinoma (OC). In this study, we conducted integrated analysis of RNA-seq, miRNA expression, promoter methylation, and debulking status data from 367 OC patients. There are 342 microRNAs, 13,911 mRNA expression (in FPKM), and 21,985 promoter methylation values available. We first normalized different omics data and screened the debulking associated microRNA, mRNAs, and methylation promoters with the P-values of less than 0.01 with the training data only. Based on the central dogma of biology, suboptimal debulking is associated with microRNA expression, gene expression, and DNA methylation; gene expression is a function of microRNA expression and DNA methylation; and microRNA expression is regulated by DNA methylation. L 0 Logistic regression was used for suboptimal debulking prediction, while L 0 penalized Poisson regression was used for gene expression and microRNA expression prediction with FPKM. FPKM, representing fragments per kilobase of exon per million fragments mapped, measures the normalized read counts for RNA-seq. Three-fold cross validation was used for gene selection and validation. We reported the gene signatures with the best predicted area under the ROC curves (AUCs). Molecular signatures that are directly or indirectly associated with suboptimal debulking are shown in Fig. 1.
Figure 1 indicates that there are 16 gene signatures including 7 mRNAs and 9 epigenetic markers directly associated with debulking status. Even though there is no microRNA directly associated with debulking, eight microRNA signatures are indirectly associated with debulking through their association with mRNA signatures. Moreover, there are additional 18 epigenetic markers indirectly associated with debulking. The 7 mRNAs directly associated with debulking are EIF3D, PPP1R7, ADA, HSD17B1, SRBD1, ZNF621, and BARX1, where EIF3D, PPP1R7, BARX1 and ZNF621 have positive correlations and the other 3 genes have negative correlations with suboptimal debulking. Among the 7 mRNAs, ADA (Adenosine Deaminase) is a well-studied gene in ovarian neoplasms. ADA levels were found to be significantly higher in patients with ovarian cancers as compared with benign ovarian tumors . ADA has been regarded as a potential biomarker for diagnosis and an agent for the treatment of ovarian cancer . Other mRNAs such as BARX1, EIF3D, PPP1R7, and HSD17B1 are also known to be associated with different cancers or other diseases. At the microRNA level, there are 8 microRNAs indirectly associated with debulking including mir-183, let-7b, mir-9-1, mir-377, mir-202, mir-758, mir-375, and mir-30c-2. While let-7b, mir-30c-2, and mir-377 are positively correlated with suboptimal debulking through mRNAs ADA and BARX1 indirectly, the other 5 microRNAs have indirectly negative correlations with suboptimal debulking. Seven of eight microRNAs except for mir-758 are known to be associated with ovarian cancer. Particularly, let-7b is known to be an unfavorable prognostic biomarker and predict of molecular and clinical subclasses in high-grade serous ovarian carcinoma, and it may also be useful for discriminating between controls and patients with serous ovarian cancer [21, 22]. Mir-183 is known to be associated with multiple cancers. It regulates target oncogene (Tiam1), and reduce the migration, invasion and viability of ovarian cancer cells . Finally, at the DNA level, nine epigenetically modified genes directly associated with debulking are SSX1, TBR1, ZNF621, ORC3L, COL22A1, SPEF2, SSU72, EEF1D, and ZNF621, where EEF1D, SSU72, and ORC3L are positively associated with suboptimal debulking, while 6 other epigenetic genes are negatively correlated with suboptimal debulking. In addition, 18 other epigenetic genes indirectly associated with debulking may also have biological implications. Finally, integration of multi-omic data increases the prediction power substantially. Besides analyzing three types of omics data together, we performed the same three-fold cross validation for gene expression, methylation, and microRNA expression separately. The AUC curves are in Fig. 2.
Figure 2 shows that the best predicted AUC over 100 simulations for integrated data is 0.88, while the best predictive AUCs for gene expression, methylation, and microRNA over 100 simulations are 0.81, 0.84, and 0.76, respectively. The AUC with integrated data achieved the highest AUC, indicating the importance of multi-omics data mining. Genes selected with mRNA, microRNA, and methylations separately are reported in the supplementary document. In addition, we also compare the selected features and the same number of top genes identified with statistical test. The results are reported on Additional file 1: Table SS2,2, and demonstrate that although individual genes are more statistically significant, combination of a panel of genes with standard logistic regression has less predictive power and test AUC (0.79).
Biomarkers from multi-omics data may predict disease status and help physicians to make clinical decisions. L 0 based GLM, which directly penalizes the number of nonzero parameters, has nice theoretical properties and leads to essential sparsity for biomarker discovery. Optimizing the L 0 regularization is a crucial, but difficult problem. We have developed an adaptive ridge algorithm (L 0ADRIDGE) for approximating L 0 penalized GLM. The algorithm is easy to implement and efficient for problems with either an ultra-high dimension and small sample size, or a low-dimension and large sample size. It outperforms the other cutting edge regularization methods including L 1, SCAD and MC+ through simulations. When applied to the integration of multilevel omics data from TCGA and the prediction of suboptimal debulking from ovarian cancer, it can identify a panel of gene signatures achieving the best prediction power. We also demonstrate that prediction power of a model with multi-omics data increases substantially, when comparing with a model with one omics data, indicating the importance of big data mining.
This work was partially supported by the DMS-1343506 grant from the National Science Foundation (ZL), and P01 CA098912-10 and UL1 TR0001881-01 grants from NIH. The funders had no role in the preparation of the article.
DMS-1343506 of NSF (ZL), P01 CA098912-10 and UL1 TR0001881-01.
L 0 ADRIDGE in MATLAB is available at https://github.com/liuzqx/L0adridge.
Additional file 1(86K, pdf) Table S1. Performance of different GLM methods for Poisson regression over 100 simulations, where values in the parenthesis are the standard deviations, and ANSF: Average number of selected features; rMSE: Average square root of mean squared error; : average absolute bias when comparing true and estimated parameters. PMS: Performance Measures. PTM: Percentage of true models. FDR: False discovery rates. The L0ADRIDGE is compared to the best performance chosen from λ=0.9λ
max and λ=0.5λ
max with both for SCAD and MC+. Table S2. The comparison of performance of the our sparse modeling approach and the top genes selected with Student’s t-test. The results demonstrate that although each gene is more statistically significant with statistical test, the combination of the panel of genes has less predictive power and test AUC with standard logistic regression and three-fold cross valida- tion, indicating the collinearity among theses genes. (PDF 86 kb)
Table S1. Performance of different GLM methods for Poisson regression over 100 simulations, where values in the parenthesis are the standard deviations, and ANSF: Average number of selected features; rMSE: Average square root of mean squared error; : average absolute bias when comparing true and estimated parameters. PMS: Performance Measures. PTM: Percentage of true models. FDR: False discovery rates. The L0ADRIDGE is compared to the best performance chosen from λ=0.9λ max and λ=0.5λ max with both for SCAD and MC+. Table S2. The comparison of performance of the our sparse modeling approach and the top genes selected with Student’s t-test. The results demonstrate that although each gene is more statistically significant with statistical test, the combination of the panel of genes has less predictive power and test AUC with standard logistic regression and three-fold cross valida- tion, indicating the collinearity among theses genes. (PDF 86 kb)
ZL conceptualized and designed method, developed the software, and wrote the manuscript. SF, MDP helped in method design and manuscript writing and revised the manuscript critically. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-017-0159-z) contains supplementary material, which is available to authorized users.
Zhenqiu Liu, Email: gro.shsc@xzuil.
Fengzhu Sun, Email: ude.csu.efisnrod@nusf.
Dermot P. McGovern, Email: email@example.com.