Qual Technol Quant Manag. Author manuscript; available in PMC 2010 July 27.

Published in final edited form as:

Qual Technol Quant Manag. 2009; 6(3): 271–280.

PMCID: PMC2910252

NIHMSID: NIHMS104018

Runze Li, Associate Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111;

Runze Li: ril4@psu.edu; Dennis K. J. Lin: dkl5@psu.edu

The first step in many applications of response surface methodology is typically the screening process. Variable selection plays an important role in screening experiments when a large number of potential factors are introduced in a preliminary study. Traditional approaches, such as best subset variable selection and stepwise deletion, may not be appropriate in this situation. In this paper we introduce a variable selection procedure via penalized least squares with the SCAD penalty. An algorithm to find the penalized least squares solution is suggested, and a standard error formula for the penalized least squares estimate is derived. With a proper choice of the regularization parameter, it is shown that the resulting estimate is root *n* consistent and possesses an oracle property; namely, it works as well as if the correct submodel were known. An automatic, data-driven approach is proposed to select the regularization parameter. Examples are used to illustrate the effectiveness of the newly proposed approach. The computer codes (written in MATLAB) to perform all calculations for the automatic, data-driven variable selection procedure are available from the authors.

Many preliminary studies in industrial experimentation contain a large number of potentially relevant factors, but the active effects are believed to be sparse. The variable selection (screening) process is typically the first step in many applications of response surface methodology, and it has received increasing attention recently. Traditional approaches, such as best subset variable selection and stepwise deletion, may not be appropriate in this situation (Westfall, Young and Lin [18]). All-subset regression is typically preferable to stepwise regression, but it is extremely time-consuming and may in fact be infeasible in practice.

Fan and Li [5] proposed a nonconcave penalized likelihood approach to selecting significant variables. Unlike stepwise variable selection and subset regression, their approach deletes insignificant variables by estimating their coefficients to be 0. Thus, it estimates the regression coefficients and selects significant variables simultaneously, which enables one to establish sampling properties for the resulting estimates. From a theoretical point of view, they showed that their approach possesses an oracle property with a proper choice of penalty function and regularization parameter; namely, the true coefficients that are zero are automatically estimated as zero, and the other coefficients are estimated as if the true submodel were known in advance.

Building on the work of Fan and Li [5], we introduce a variable selection procedure for analyzing experimental designs. Our approach can be directly implemented for cases in which the final (projective) design matrix is of full rank. The design matrix for the full model, however, is allowed to be singular. Our settings are different from those in Fan and Li [5], which focus on likelihood-based linear models and in which covariates are assumed to be random. In our settings, we have shown that the resulting estimate is root *n* consistent and possesses an oracle property. We also propose an iterative ridge regression algorithm to find the penalized least squares solution. A standard error formula for the estimate is derived. This enables us to make statistical inference on the resulting model. An automatic data-driven approach to finding the tuning parameter is suggested.

This paper is organized as follows. In Section 2, the method for variable selection via nonconvex penalized least squares is introduced. Section 3 discusses issues related to practical implementation and derives a standard error formula for the resulting estimate. Section 4 summarizes the proposed procedure as an algorithm. In Section 5, we demonstrate the nonconvex penalized least squares method with two examples: one based on a Plackett-Burman design and the other on a uniform design (Fang and Lin [7]). Section 6 presents conclusions and discussion.

Suppose that the observations (**x**_{i}, *y*_{i}), *i* = 1, ⋯, *n*, are a random sample with density *f*(*y*, **x**^{T}** β**). Fan and Li [5] proposed estimating ** β** by maximizing the penalized likelihood

$$\frac{1}{n}\sum _{i=1}^{n}logf({y}_{i},{\mathbf{x}}_{i}^{T}\mathit{\beta})-\sum _{j=1}^{d}{p}_{\lambda}(\mid {\beta}_{j}\mid ),$$

where *d* is the dimension of **x**_{i}, *p _{λ}*(·) is a penalty function, and *λ* is a regularization (tuning) parameter.

Several penalty functions have been used in the literature. The *L*_{2} penalty
${p}_{\lambda}(\mid \beta \mid )={\scriptstyle \frac{\lambda}{2}}\mid \beta {\mid}^{2}$ yields a ridge regression, and the *L*_{1} penalty *p _{λ}*(|*β*|) = *λ*|*β*| yields the LASSO (Tibshirani [17]).

Some conditions on *p _{λ}*(|·|) are needed for the penalized likelihood approach to be effective. In particular, Fan and Li [5] required the penalty to produce estimators that are nearly unbiased for large coefficients, sparse (small estimated coefficients are automatically set to zero), and continuous in the data.

A simple penalty function that satisfies all three mathematical requirements is the smoothly clipped absolute deviation (SCAD) penalty proposed by Fan and Li [5], whose first order derivative is defined by
${p}_{\lambda}^{\prime}(\beta )=\lambda \left\{I(\beta \le \lambda )+{\scriptstyle \frac{{(a\lambda -\beta )}_{+}}{(a-1)\lambda}}I(\beta >\lambda )\right\}$ for some *a* > 2 and *β* > 0. For simplicity of presentation, we will use SCAD for all procedures using the SCAD penalty. Denote by ** β**_{0} the true value of ** β**, and let
${\mathit{\beta}}_{0}={({\beta}_{10},\cdots ,{\beta}_{d0})}^{T}={({\mathit{\beta}}_{10}^{T},{\mathit{\beta}}_{20}^{T})}^{T}$. Without loss of generality, it is assumed that ${\mathit{\beta}}_{20}=\mathbf{0}$.

From a Bayesian point of view and simulation comparisons, Fan and Li [5] suggested *a* = 3.7, which will be used throughout this paper. Figure 1 depicts the *L*_{2}, *L*_{1}, *L*_{0.5} and SCAD penalties. From Figure 1, it can be seen that all penalties but the *L*_{2} penalty are singular at the origin; in fact, this is a necessary condition for sparsity (Fan and Li [5]).
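The SCAD penalty and its derivative are simple to compute. The following is a minimal sketch in Python (the paper's own code is in MATLAB); the function names are mine, the closed form of *p _{λ}* is obtained by integrating the derivative above, and *a* = 3.7 is the recommended default.

```python
import numpy as np

def scad_deriv(beta, lam, a=3.7):
    """First derivative p'_lambda(beta) of the SCAD penalty for beta >= 0:
    equals lambda on [0, lambda], decays linearly to 0 on (lambda, a*lambda],
    and is 0 beyond a*lambda (Fan and Li's form with a > 2)."""
    beta = np.asarray(beta, dtype=float)
    return lam * (beta <= lam) + np.maximum(a * lam - beta, 0.0) / (a - 1.0) * (beta > lam)

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta|), the integral of scad_deriv:
    a quadratic spline with knots at lambda and a*lambda."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(
        b <= lam,
        lam * b,
        np.where(
            b <= a * lam,
            -(b ** 2 - 2.0 * a * lam * b + lam ** 2) / (2.0 * (a - 1.0)),
            (a + 1.0) * lam ** 2 / 2.0,
        ),
    )
```

With *λ* = 1 and *a* = 3.7, for instance, coefficients below *λ* are penalized at the full rate *λ* while coefficients beyond *aλ* incur a constant penalty and zero derivative, which is what yields the near-unbiasedness of large estimates.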

Assume that observations *y*_{1}, ···, *y _{n}* are an independent sample from the linear regression model

$${y}_{i}={\mathbf{x}}_{i}^{T}\mathit{\beta}+{\epsilon}_{i}$$

(3.1)

with E(*ε _{i}*) = 0 and var(*ε _{i}*) = *σ*^{2}. The penalized least squares criterion is defined as

$$Q(\mathit{\beta})\equiv \frac{1}{2n}\sum _{i=1}^{n}{({y}_{i}-{\mathbf{x}}_{i}^{T}\mathit{\beta})}^{2}+\sum _{j=1}^{d}{p}_{\lambda}(\mid {\beta}_{j}\mid ).$$

(3.2)

Using matrix notation, (3.2) can be expressed as

$$Q(\mathit{\beta})=\frac{1}{2n}{\left|\right|\mathbf{y}-\mathbf{X}\mathit{\beta}\left|\right|}^{2}+\sum _{j=1}^{d}{p}_{\lambda}(\mid {\beta}_{j}\mid ).$$
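The criterion (3.2) is straightforward to evaluate. Below is a small Python sketch (the paper's code is in MATLAB, and the function name is mine) that computes *Q*(** β**) with the SCAD penalty:

```python
import numpy as np

def pls_objective(beta, X, y, lam, a=3.7):
    """Penalized least squares criterion Q(beta) of (3.2):
    (1/2n)||y - X beta||^2 + sum_j p_lambda(|beta_j|) with the SCAD penalty."""
    n = len(y)
    b = np.abs(np.asarray(beta, dtype=float))
    # SCAD penalty as a quadratic spline with knots at lam and a*lam
    pen = np.where(
        b <= lam,
        lam * b,
        np.where(b <= a * lam,
                 -(b ** 2 - 2.0 * a * lam * b + lam ** 2) / (2.0 * (a - 1.0)),
                 (a + 1.0) * lam ** 2 / 2.0))
    resid = y - X @ beta
    return resid @ resid / (2.0 * n) + pen.sum()
```

Minimizing this nonconvex objective is the computational task addressed by the iterative ridge regression algorithm of Section 4.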

When the **x**_{i} are fixed design points rather than random covariates, we impose the following two conditions:

- (C1) ${max}_{1\le k\le n}{\mathbf{x}}_{1k}^{T}{({\mathbf{X}}_{1}^{T}{\mathbf{X}}_{1})}^{-1}{\mathbf{x}}_{1k}\to 0$ as *n* → ∞; and
- (C2) **V** is finite and **V**_{11} > 0,

where **X**_{1} consists of the first *s* columns of **X**,
$\mathbf{V}={lim}_{n\to \infty}{\scriptstyle \frac{1}{n}}{\mathbf{X}}^{T}\mathbf{X}$, and **V**_{11} consists of the first *s* rows and columns of **V**. Conditions (C1) and (C2) are weakened versions of the conditions typically imposed for asymptotic normality of the least squares estimator in a linear regression model with fixed design, and **V** is finite for most experimental designs; they guarantee that asymptotic normality holds for the least squares estimator of ** β**_{1}. Li and Lin [13] also show that the penalized least squares estimate is root *n* consistent and establish the oracle property for the penalized least squares estimator, which implies that the penalized least squares estimate is more efficient than the ordinary least squares estimate. Note that the proposed procedure possesses the oracle property under these mild conditions, which do not require the design matrix of the full model to be of full rank and are frequently satisfied in practice.

Note that the SCAD penalty function is singular at the origin and may not have a second derivative at some points. In order to apply the Newton-Raphson algorithm to the penalized least squares, Fan and Li [5] suggested locally approximating the SCAD penalty function by quadratic functions as follows. Given an initial value *β*^{(0)} that is close to the true value of ** β**, when
${\beta}_{j}^{(0)}$ is not very close to 0, the derivative of the penalty is approximated by

$${[{p}_{\lambda}(\mid {\beta}_{j}\mid )]}^{\prime}={p}_{\lambda}^{\prime}(\mid {\beta}_{j}\mid )sgn({\beta}_{j})\approx \{{p}_{\lambda}^{\prime}(\mid {\beta}_{j}^{(0)}\mid )/\mid {\beta}_{j}^{(0)}\mid \}{\beta}_{j},$$

otherwise, set ${\widehat{\beta}}_{j}=0$. With the local quadratic approximation, the solution to the penalized least squares can be found by iteratively computing the following ridge regression with an initial value ${\mathit{\beta}}^{(0)}$:

$${\mathit{\beta}}^{(1)}={\{{\mathbf{X}}^{T}\mathbf{X}+n{\mathrm{\sum}}_{\lambda}({\mathit{\beta}}^{(0)})\}}^{-1}{\mathbf{X}}^{T}\mathbf{y},$$

(4.1)

where

$${\mathrm{\sum}}_{\lambda}({\mathit{\beta}}^{(0)})=\text{diag}\{{p}_{\lambda}^{\prime}(\mid {\beta}_{1}^{(0)}\mid )/\mid {\beta}_{1}^{(0)}\mid ,\dots ,{p}_{\lambda}^{\prime}(\mid {\beta}_{d}^{(0)}\mid )/\mid {\beta}_{d}^{(0)}\mid \}.$$

This can be easily implemented in many statistical packages. Hunter and Li [12] studied the convergence behavior of this algorithm by using techniques related to the EM algorithm.
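The iterative ridge regression (4.1) with the local quadratic approximation can be sketched as follows in Python (the paper's code is in MATLAB; the convergence tolerance and the cutoff `thresh`, below which a coefficient is treated as zero and dropped from the active set, are my own choices):

```python
import numpy as np

def scad_pls(X, y, lam, a=3.7, tol=1e-8, max_iter=100, thresh=1e-6):
    """Penalized least squares with the SCAD penalty via the iterative
    ridge regression (4.1) and the local quadratic approximation."""
    n, d = X.shape
    beta = np.linalg.pinv(X) @ y        # least squares initial value
    for _ in range(max_iter):
        b = np.abs(beta)
        active = b > thresh             # coefficients not yet set to zero
        if not active.any():
            return np.zeros(d)
        # Sigma_lambda = diag{p'_lambda(|b_j|) / |b_j|} on the active set
        ba = b[active]
        deriv = lam * (ba <= lam) + np.maximum(a * lam - ba, 0.0) / (a - 1.0) * (ba > lam)
        Xa = X[:, active]
        new = np.zeros(d)
        new[active] = np.linalg.solve(Xa.T @ Xa + n * np.diag(deriv / ba), Xa.T @ y)
        if np.max(np.abs(new - beta)) < tol:
            return new
        beta = new
    return beta
```

Each pass is an ordinary ridge regression on the surviving columns, so the whole procedure can indeed be implemented in any statistical package that solves linear systems.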

When the sample size is greater than the number of factors and the design matrix is of full rank, there exists a unique least squares estimate, which can serve as the initial value of ** β**. When the sample size is relatively large, the least squares estimate possesses root *n* consistency and therefore provides a good starting point.

The standard errors for estimated coefficients can be obtained directly because our procedure simultaneously performs parameter estimation and variable selection. From the iterative ridge regression, the covariance of the penalized least squares estimate can be estimated via

$$\widehat{cov}(\widehat{\mathit{\beta}})={\widehat{\sigma}}^{2}{\{{\mathbf{X}}^{T}\mathbf{X}+n{\mathrm{\sum}}_{\lambda}({\mathit{\beta}}_{0})\}}^{-1}{\mathbf{X}}^{T}\mathbf{X}{\{{\mathbf{X}}^{T}\mathbf{X}+n{\mathrm{\sum}}_{\lambda}({\mathit{\beta}}_{0})\}}^{-1},$$

(4.2)

where ${\widehat{\sigma}}^{2}$ is an estimator of *σ*^{2}, for example, the mean squared error. The formula applies only to the components that do not vanish.
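The sandwich formula (4.2) restricted to the nonvanishing components can be sketched as follows; the zero cutoff `thresh` and the use of the residual mean squared error for ${\widehat{\sigma}}^{2}$ are assumed choices of this sketch:

```python
import numpy as np

def scad_pls_cov(X, y, beta_hat, lam, a=3.7, thresh=1e-6):
    """Sandwich estimate (4.2) of cov(beta_hat) for the nonzero components,
    with sigma^2 estimated by the residual mean squared error."""
    n = X.shape[0]
    active = np.abs(beta_hat) > thresh
    Xa = X[:, active]
    b = np.abs(beta_hat[active])
    # Sigma_lambda = diag{p'_lambda(|b_j|) / |b_j|} on the active set
    deriv = lam * (b <= lam) + np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - active.sum())
    M = np.linalg.inv(Xa.T @ Xa + n * np.diag(deriv / b))
    return sigma2 * M @ (Xa.T @ Xa) @ M
```

The square roots of the diagonal entries give the standard errors reported alongside the estimated coefficients in the examples.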

To implement the SCAD, it is desirable to have an automatic data-driven method for selecting the tuning parameter *λ* in *p _{λ}*(·). Such a procedure is analogous to an automatic subset selection procedure, such as forward selection or backward elimination, but with good oracle properties. Cross validation and generalized cross-validation (Craven and Wahba [3]) can be employed to select the tuning parameter *λ*.

For the penalized least squares, we update the solution by

$${\mathit{\beta}}^{(1)}(\lambda )={\{{\mathbf{X}}^{T}\mathbf{X}+n{\mathrm{\sum}}_{\lambda}({\mathit{\beta}}^{(0)})\}}^{-1}{\mathbf{X}}^{T}\mathbf{y}$$

with an initial value *β*^{(0)}. Thus the fitted value **ŷ** of **y** is $\mathbf{X}{\{{\mathbf{X}}^{T}\mathbf{X}+n{\mathrm{\sum}}_{\lambda}(\widehat{\mathit{\beta}})\}}^{-1}{\mathbf{X}}^{T}\mathbf{y}$, and

$${\mathbf{P}}_{\mathbf{X}}\{\widehat{\mathit{\beta}}(\lambda )\}=\mathbf{X}{\{{\mathbf{X}}^{T}\mathbf{X}+n{\mathrm{\sum}}_{\lambda}(\widehat{\mathit{\beta}})\}}^{-1}{\mathbf{X}}^{T}$$

can be regarded as a projection matrix. Define the number of effective parameters in the penalized least squares fit as *e*(*λ*) = tr[**P**_{**X**}{$\widehat{\mathit{\beta}}(\lambda )$}]. The generalized cross-validation statistic is then

$$\text{GCV}(\lambda )=\frac{1}{n}\frac{{\left|\right|\mathbf{y}-\mathbf{X}\widehat{\mathit{\beta}}(\lambda )\left|\right|}^{2}}{{\{1-e(\lambda )/n\}}^{2}}$$

(4.3)

and $\widehat{\lambda}={\text{argmin}}_{\lambda}\{\text{GCV}(\lambda )\}$ is selected as the tuning parameter.
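The GCV statistic (4.3) for a fitted $\widehat{\mathit{\beta}}(\lambda )$ can be computed as below; the zero cutoff used to restrict to the active columns is my own choice, and the function name is illustrative:

```python
import numpy as np

def gcv_score(X, y, beta_hat, lam, a=3.7, thresh=1e-6):
    """GCV statistic (4.3); e(lambda) = tr[P_X{beta_hat(lambda)}] counts
    the effective parameters of the penalized least squares fit."""
    n = X.shape[0]
    active = np.abs(beta_hat) > thresh
    if not active.any():                      # null model: e(lambda) = 0
        return np.sum(y ** 2) / n
    Xa = X[:, active]
    b = np.abs(beta_hat[active])
    deriv = lam * (b <= lam) + np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)
    # P_X{beta_hat} = Xa (Xa'Xa + n Sigma)^{-1} Xa' on the active columns
    P = Xa @ np.linalg.solve(Xa.T @ Xa + n * np.diag(deriv / b), Xa.T)
    e = np.trace(P)
    rss = np.sum((y - X @ beta_hat) ** 2)
    return (rss / n) / (1.0 - e / n) ** 2
```

Minimizing this score over a grid of *λ* values gives the data-driven tuning parameter $\widehat{\lambda}$.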

The SCAD variable selection procedure is summarized below, step by step.

** Step 1.** If the number of regression coefficients is greater than the sample size, then first apply stepwise variable selection to the full model with small thresholding values.

** Step 2.** Choose a grid of values for the tuning parameter *λ*.

** Step 3.** Based on the model selected by the stepwise regression, for each grid value of *λ*, compute the penalized least squares estimate by iterating (4.1) and evaluate its GCV score (4.3).

** Step 4.** The final estimate is $\widehat{\mathit{\beta}}(\widehat{\lambda})$, where $\widehat{\lambda}$ minimizes GCV(*λ*) over the grid.
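Steps 2 to 4 can be combined into a single sketch: for each *λ* on a user-chosen grid, solve the penalized least squares by iterating (4.1) and keep the estimate minimizing GCV(*λ*). Everything below (names, grid, tolerances, the zero cutoff) is illustrative rather than the authors' MATLAB implementation.

```python
import numpy as np

def scad_fit_gcv(X, y, lam_grid, a=3.7, tol=1e-8, max_iter=100, thresh=1e-6):
    """Steps 2-4: iterate the ridge regression (4.1) for each lambda on the
    grid, score each fit by GCV (4.3), return the GCV-minimizing estimate."""
    n, d = X.shape

    def deriv(b, lam):
        # SCAD first derivative p'_lambda evaluated at |b|
        return lam * (b <= lam) + np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)

    best_gcv, best_lam, best_beta = np.inf, None, None
    for lam in lam_grid:
        beta = np.linalg.pinv(X) @ y            # least squares start
        for _ in range(max_iter):
            b = np.abs(beta)
            active = b > thresh
            if not active.any():
                beta = np.zeros(d)
                break
            Xa = X[:, active]
            Sigma = np.diag(deriv(b[active], lam) / b[active])
            new = np.zeros(d)
            new[active] = np.linalg.solve(Xa.T @ Xa + n * Sigma, Xa.T @ y)
            if np.max(np.abs(new - beta)) < tol:
                beta = new
                break
            beta = new
        active = np.abs(beta) > thresh
        if active.any():
            Xa = X[:, active]
            b = np.abs(beta[active])
            P = Xa @ np.linalg.solve(Xa.T @ Xa + n * np.diag(deriv(b, lam) / b), Xa.T)
            e = np.trace(P)
        else:
            e = 0.0
        gcv = (np.sum((y - X @ beta) ** 2) / n) / (1.0 - e / n) ** 2
        if gcv < best_gcv:
            best_gcv, best_lam, best_beta = gcv, lam, beta
    return best_beta, best_lam
```

On a sparse synthetic problem this recovers the zero pattern and leaves the large coefficients essentially unshrunk, in line with the oracle property.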

A 12-run Plackett-Burman (PB) design and the corresponding outcomes are listed in Table 1, extracted from Hamada and Wu [11]. This data set was constructed by Hamada and Wu [11] based on the linear model *Y* = *A* + 2*AB* + 2*AC* + *ε* with *ε* ~ *N*(0, *σ* = 0.25). We fit the data set by a linear regression model including all linear and interaction terms. The full model contains 66 predictors (11 main effects plus 55 two-factor interactions).

To identify the active effects, stepwise variable selection was applied to the full model using SAS. The variables AB, AC, A, AJ, GJ, FG and J were selected. The selected variables and their estimated coefficients, standard errors and *p*-values are depicted in Table 2. From Table 2, all selected variables are statistically significant at significance level *α* = 0.05. In fact, compared with the true model, stepwise variable selection results in a very misleading model, though the truly significant variables are included in the selected model. This example shows that one should be cautious when using the stepwise variable selection procedure.

Now we apply penalized least squares with the SCAD penalty to the model selected by the stepwise procedure; the estimated tuning parameter is $\widehat{\lambda}$ = 0.0121. The resulting model, including estimated coefficients and estimated standard errors, is depicted in Table 2. The square root of the mean squared error is $\widehat{\sigma}$ = 0.2479. All estimated coefficients and their standard errors are very close to their true values, so the proposed method works very well in this case. With small values of F-enter and F-remove, the stepwise variable selection procedure is able to retain the active effects, though it also includes many inactive effects. This drawback of stepwise variable selection is alleviated by the SCAD, which correctly identifies the active effects. Furthermore, the class of models considered here is much larger than the one used by Hamada and Wu [11].

In this example, we demonstrate the SCAD variable selection procedure on a data set collected in a study of the effects of environmental pollutants on human health. The data set is taken from Fang [6]; a brief description is given below, and details can be found in Fang [6]. It is believed that the contents of some metal elements in water may directly affect human health. Of interest here is the association between the contents of six selected metals and the mortality of a certain kind of cell in mice. The six selected metals are Cadmium (Cd), Copper (Cu), Zinc (Zn), Nickel (Ni), Chromium (Cr), and Lead (Pb). The levels for the content of each metal element are set to 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1, 2, 5, 8, 10, 12, 16, 18 and 20 (ppm). A uniform design (Fang and Wang [8]; Fang, Lin, Winker and Zhang [9]) is used to construct a design with 17 runs. For each run, three experiments were conducted; the average of the three outputs (mortality) is displayed in Table 3, in which the design was extracted from Fang [6].

To avoid numerical instability, all *x* variables are standardized first. Our initial model includes an intercept, all linear effects, quadratic effects and first order interaction effects. This yields a linear regression model with 28 terms, including the intercept. Since there are only 17 design combinations in this experiment, variable selection is necessary. As in the previous analysis, stepwise regression is performed first. The stepwise regression selects Cd, Cu, Ni*Cr, Zn*Pb, Cr, Pb*Pb, Zn*Cr, Pb, Zn*Zn, Cu*Cr, Ni, Cd*Ni and Ni*Pb, whose estimates, standard errors and *p*-values are depicted in Table 4. All of them except Ni*Pb are significant at level 0.05. Based on the model selected by stepwise regression, we apply the SCAD variable selection procedure to select significant variables. The selected *λ* is 0.0748. The resulting estimates of the significant effects, along with their standard errors and *p*-values, are depicted in Table 4, from which it can be seen that all six metal elements have a significant impact on mortality.

In this paper, we have introduced the SCAD variable selection procedure for screening experiments. It has been shown that the proposed approach possesses an oracle property, and two data examples illustrate its effectiveness. In our earlier work (Li and Lin [13, 14]), the proposed procedure was successfully applied to analyzing data from supersaturated designs (Lin [15, 16]). The computer codes (written in MATLAB) to perform all calculations can be obtained from the authors.

The regularity conditions (C1) and (C2) in Section 3 ensure that the proposed procedure can be used for screening experiments. Although the design matrix for the full model can be singular, condition (C2) implies that the design sub-matrix corresponding to the active factors is of full rank, and therefore aliasing among the active factors is not allowed. One should be cautious in situations where active factors or their interactions are aliased with one another; how to deal with such situations requires further research.

The authors would like to thank referees for their valuable comments and suggestions. Li was supported by a National Institute on Drug Abuse (NIDA) grant P50 DA10075, and partially supported by NSF grants DMS 0348869 and CCF 0430349.

Runze Li, Associate Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111.

Dennis K. J. Lin, University Distinguished Professor, Department of Supply Chain and Information Systems, The Pennsylvania State University, University Park, PA 16802.

1. Antoniadis A, Fan J. Regularization of wavelet approximations (with discussion). Journal of the American Statistical Association. 2001;96:939–967.

2. Beattie SD, Fong DKH, Lin DKJ. A two-stage Bayesian model selection strategy for supersaturated designs. Technometrics. 2002;44:55–63.

3. Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math. 1979;31:377–403.

4. Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–455.

5. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.

6. Fang KT. Uniform Design and Uniform Design Tables. Science Press; Beijing: 1994.

7. Fang KT, Lin DKJ. Uniform Experimental Design and Its Applications in Industry. In: Rao CR, Khattree R, editors. Handbook of Statistics in Industry. Vol. 22. North Holland; New York: 2003. pp. 131–170.

8. Fang KT, Wang Y. Applications of Number Theoretic Methods in Statistics. Chapman and Hall; London: 1994.

9. Fang KT, Lin DKJ, Winker P, Zhang Y. Uniform Design: Theory and Application. Technometrics. 2000;42:237–248.

10. Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–148.

11. Hamada M, Wu CFJ. Analysis of designed experiments with complex aliasing. Journal of Quality Technology. 1992;24:130–137.

12. Hunter DR, Li R. Variable selection using MM algorithms. Annals of Statistics. 2005;33:1617–1642.

13. Li R, Lin DKJ. Data analysis in supersaturated designs. Statistics and Probability Letters. 2002;59:135–144.

14. Li R, Lin DKJ. Comparisons of variable selection approaches for analysis of supersaturated design. Journal of Data Science. 2003;1:249–260.

15. Lin DKJ. A new class of supersaturated designs. Technometrics. 1993;35:28–31.

16. Lin DKJ. Generating systematic supersaturated designs. Technometrics. 1995;37:213–225.

17. Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B. 1996;58:267–288.

18. Westfall PH, Young SS, Lin DKJ. Forward selection error control in analysis of supersaturated designs. Statistica Sinica. 1998;8:101–117.
