Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Qual Technol Quant Manag. Author manuscript; available in PMC 2010 July 27.
Published in final edited form as:
Qual Technol Quant Manag. 2009; 6(3): 271–280.
PMCID: PMC2910252

Variable Selection for Screening Experiments


The first step in many applications of response surface methodology is typically the screening process. Variable selection plays an important role in screening experiments when a large number of potential factors are introduced in a preliminary study. Traditional approaches, such as the best subset variable selection and stepwise deletion, may not be appropriate in this situation. In this paper we introduce a variable selection procedure via penalized least squares with the SCAD penalty. An algorithm to find the penalized least squares solution is suggested, and a standard error formula for the penalized least squares estimate is derived. With a proper choice of the regularization parameter, it is shown that the resulting estimate is root n consistent and possesses an oracle property; namely, it works as well as if the correct submodel were known. An automatic and data-driven approach was proposed to select the regularization parameter. Examples are used to illustrate the effectiveness of the newly proposed approach. The computer codes (written in MATLAB) to perform all calculation are available through the authors for an automatic data-driven variable selection procedure.

Keywords: Bayesian variable selection, penalized least squares, SCAD, supersatured design

1 Introduction

Many preliminary studies in industrial experimentation contain a large number of potentially relevant factors, but the number of actual active effects are believed to be sparse. The variable selection (screening) process is typically the first step in many applications of response surface methodology. It has received increasing attention recently. Traditional approaches, such as the best subset variable selection and stepwise deletion, may not be appropriate in this situation (Westfall, Young and Lin [18]). All-subset regression is typically preferable to stepwise regression, but all-subset regression is extremely time-consuming and in fact may be infeasible in practice.

Fan and Li [5] proposed a nonconcave penalized likelihood approach to selecting significant variables. Unlike stepwise variable selection and subset regression, their approaches delete insignificant variables by estimating their coefficients to be 0. Thus, their approaches estimate the regression coefficients and select significant variables simultaneously. This enables one to establish sampling properties for resulting estimates. From a theoretic point of view, they showed that their approach possesses an oracle property with proper choice of penalty function and regularization parameter; namely, the true coefficients that are zero are automatically estimated as zero, and the other coefficients are estimated as if the true submodel were known in advance.

Building on the work of Fan and Li [5], we introduce a variable selection procedure for analyzing experimental designs. Our approach can be directly implemented for cases in which the final (projective) design matrix is of full rank. The design matrix for the full model, however, is allowed to be singular. Our settings are different from those in Fan and Li [5], which focus on likelihood-based linear models and in which covariates are assumed to be random. In our settings, we have shown that the resulting estimate is root n consistent and possesses an oracle property. We also propose an iterative ridge regression algorithm to find the penalized least squares solution. A standard error formula for the estimate is derived. This enables us to make statistical inference on the resulting model. An automatic data-driven approach to finding the tuning parameter is suggested.

This paper is organized as follows. In Section 2, the method for variable selection via nonconvex penalized least squares is introduced. Section 3 discusses the issues related to practical implementation and derive a standard error formula for the resulting estimate. Section 4 summarizes the proposed procedure by an algorithm. In Section 5, we demonstrate the nonconvex penalized least squares methods by two examples–one is based on Plackett and Burman design and the other one is based on uniform design (Fang and Lin [7]). Section 6 presents conclusions and discussions.

2 Nonconcave penalized likelihood and variable selection

Suppose that the observations (xi, yi) are independent and identically distributed samples from a population with the density f(y, xTβ). A form of penalized likelihood is


where d is the dimension of xi, pλ(·) is a penalty function and λ is a tuning parameter, which controls the model complexity. The penalty function is not necessarily the same for all coefficients. For example, one may wish to keep important factors in a parametric model and hence not be willing to penalize their corresponding coefficients. For simplicity of presentation, we will assume that the penalty function is the same for all coefficients. Extension to the case with different penalty function is straightforward, however.

Some penalty functions have been used in the literature. The L2 penalty pλ(β)=λ2β2 yields a ridge regression, and the L1 penalty pλ(|β|) = λ|β| results in LASSO (Tibshirani [17]). More generally, the Lq penalty leads to a bridge regression (see Frank and Friedman [10]). The hard thresholding penalty, defined by pλ(|β|) = λ2−(|β|−λ)2I(|β| < λ), where I(·) is an indicator function, was studied by Antoniadias and Fan [1]. The penalized likelihood estimate is a hard thresholding rule (Donoho and Johnstone [4]) when the observations are from an ordinary linear regression model with an orthonormal design matrix (Fan and Li [5]).

Some conditions on pλ(|·|) are needed for the penalized likelihood approach to be effective. In particular, pλ(|·|) should be irregular at the origin, i.e., pλ(0+)>0. Antoniadis and Fan [1] provide more insights into how to choose a penalty function from theoretic points of view. A good penalty function should result in an estimator with the following three properties: unbiasedness for a large true coefficient to avoid excessive estimation bias, sparsity (estimating a small coefficient as zero) to reduce model complexity, and continuity to avoid unnecessary variation in model prediction. Necessary conditions for unbiasedness, sparsity and continuity have been derived by Antoniadis and Fan [1]. None of the Lq, including q = 1 and 2, and the hard thresholding penalty function simultaneously satisfy these three mathematical conditions.

A simple penalty function that satisfies all the three mathematical requirements is the smoothly clipped absolute deviation (SCAD) penalty proposed by Fan and Li [5], and its first order derivative is defined by pλ(β)=λ{I(βλ)+(aλβ)+(a1)λI(β>λ)} for some a > 2 and β > 0. For simplicity of presentation, we will use SCAD for all procedures using the SCAD penalty. Denote by β0 the true value of β, and let β0=(β10,,βd0)T=(β10T,β20T)T. Without loss of generality, it is assumed that β20 = 0, and all components of β10 are not zero. Under some regularity conditions, the SCAD estimator β^=(β^1T,β^2T)T possesses the following oracle property: With probability tending to 1, (i) [beta]2 = 0; and (ii) n(β^1β10)N{0,I11(β10,0)}, where I1(β10, 0) is the Fisher information matrix for β1 knowing β2 = 0 (see Fan and Li [5]).

From a Bayesian point of view and simulation comparisons, Fan and Li [5] suggested that a = 3.7. This will be used throughout this paper. Figure 1 depicts the L2, L1, L0.5 and SCAD penalties. From Figure 1, it can be seen that all penalties but the L2 penalty are singular at the origin. In fact, this is a necessary condition for the sparsity (Fan and Li [5]).

Figure 1
Plot of penalty functions: (a) L2, (b) L1, (c) L0.5, and (d) SCAD penalties.

3 Variable Selection for Designed Experiments

Assume that observations y1, ···, yn are an independent sample from the linear regression model


with E(εi) = 0 and var(εi) = σ2, where xi corresponds to a fixed experiment design. As a natural extension of the penalized likelihood, we will propose a screening experiment approach via penalized least squares. Thus, the proposed approach does not require the error distribution to be known. Let y = (y1, ···, yn)T, and X is the corresponding design matrix. A form of penalized least squares is


Using matrix notation, (3.2) can be expressed as


When (xi, yi), i = 1, ···, n, are independent and identically distributed, It has been shown by Li and Lin [13] that the oracle property holds for the penalized least squares estimate when the covariate x is a fixed design under the following conditions. Let s be the dimension of β10, i.e., the number of nonzero coefficients.

  • (C1) max1knx1kT(X1TX1)1x1k0, as n → ∞; and
  • (C2) V is finite and V11 > 0,

where, X1 consist of the first s columns of X, V=limn1nXTX and V11consist of the first s columns and rows of V. It is worth noting that these conditions typical conditions for asymptotic normality of least squares estimator in linear regression model when the regressors are fixed designed. Furthermore, V is finite for most experiment designs. Here we weaken these two conditions to Conditions (C1) and (C2), which guarantee that the asymptotic normality holds for the least squares estimator of [beta]1. Li and Lin [13] also show that the penalized least squares estimate is root n consistent and establish the oracle property for the penalized least squares estimator. This implies that the resulting penalized least squares is more efficient than the ordinary least squares estimate. Note that the proposed procedure possesses the oracle property under some mild conditions which do not require the design matrix to be of full rank. The conditions are frequently satisfied in practice.

4 An algorithm for the nonconvex penalized least squares

Note that the SCAD penalty function is singular at the origin and may not have the second derivative at some points. In order to apply the Newton Raphson algorithm to the penalized least squares, Fan and Li [5] suggested to locally approximate the SCAD penalty function by quadratic functions as follows. Given an initial value β(0) that is close to the true value of β, when βj(0) is not very close to 0, the penalty pλ(|βj|) can be locally approximated by the quadratic function as


otherwise, set [beta]j = 0. With the local quadratic approximation, the solution for the penalized least squares can be found by iteratively computing the following ridge regression with an initial value β0:




This can be easily implemented in many statistical packages. Hunter and Li [12] studied the convergence behavior of this algorithm by using techniques related to the EM algorithm.

When the sample size is greater than the number of factors, and the design matrix is full rank, there exists a unique least squares estimate. The least squares estimate can serve as the initial value of β. When the sample size is relatively large, the least squares estimate possesses root n consistency, and hence it is very close to the true value of β. However, the number of experiments may be less than the number of potential candidates. It is obvious that the design matrix is not full rank, and thus the regression coefficients are not identifiable without further assumptions. In the context of analysis of screening experiments, it is assumed that only a few of factors are active. Here we suggest to use stepwise variable selection to construct an initial value of β. In other words, we first apply stepwise variable selection to the full model with small thresholding values (i.e. large value of significance level α) such that all active factors are included in the selected model. Of course, in this step, some of insignificant factors will remain in the selected model.

4.1 Standard error

The standard errors for estimated coefficients can be obtained directly because our procedure simultaneously performs parameter estimation and variable selection. From the iterative ridge regression, the covariance of the penalized least squares estimate can be estimated via


where [sigma with hat]2 is an estimator of σ2, for example, the mean squared errors. Here the formula applies only to components which do not vanish.

4.2 Selection of regularization parameter

To implement the SCAD, it is desirable to have an automatic data-driven method for selecting the tuning parameter λ in pλ(·). Such a procedure is analogous to an automatic subset selection procedure, such as forward selection, backward elimination, but with good oracle properties. Cross validation and generalized cross-validation can be employed to select the tuning parameter λ. However, the cross validation may not be appropriate in the current setting when the number of experiments is small. Thus we estimate λ via minimizing an approximate generalized cross-validation (GCV) statistic (Craven and Wahba [3]).

For the penalized least squares, we update the solution by


with an initial value β(0). Thus the fitted value ŷ of y is X{XTX + nΣλ(β(0))}−1XTy, and


can be regarded as a projection matrix. Define the number of effective parameters in the penalized least squares fit as e(λ) = tr[PX{[beta](λ)}]. Therefore the generalized cross-validation statistics is


and [lambda with circumflex] = argminλ{GCV(λ)}.

4.3 Implementation of SCAD variable selection procedure

The SCAD variable selection procedure is summarized below, step by step.

Step 1. If the number of regression coefficients is greater than the sample size, then first apply stepwise variable selection to the full model with small thresholding values.

Step 2. Choose a grid point set for λ, say, {λ1, ···, λL}.

Step 3. Based on the selected model by the stepwise regression, for each λl, l = 1, ···, L, find the solution for the penalized least squares with the SCAD penalty by iterating (4.1) until it converges, and further compute the GCV statistic in (4.3).

Step 4. The final estimate for β is the one with the lowest GCV score.

5 Examples

Example 1

A 12-run PB design and the corresponding outcomes are listed in Table 1, extracted from Hamada and Wu [11]. This data set was constructed by Hamada and Wu [11] based on the linear model Y = A + 2AB + 2AC + ε with ε ~ N(0, σ = 0.25). We fit the data set by a linear regression model including all linear terms and interaction terms. The full model contains 66 (11 main effects plus 55 two-factor interactions) predictors.

Table 1
PB 12 Run Design and Response in Example 1

To identify the active effects, stepwise variable selection was employed to the full model using SAS. The variables AB, AC, A, AJ, GJ, FG and J were selected. The selected variables and their estimated coefficients, standard errors and p-values are depicted in Table 2. From Table 2, all selected variables are statistically significant at significance level α = 0.05. In fact, compared with the true model, stepwise variable selection results in a very misleading model, though the truly significant variables are included the selected model. From this example, one should be cautious when using stepwise variable selection procedure.

Table 2
The selected models in Example 1

Now we apply penalized least squares with the SCAD penalty for the selected model by stepwise procedure, the estimated tuning parameter [lambda with circumflex] = 0.0121. The resulting model, including estimated coefficients and estimated standard errors, is depicted in Table 2. The square root of the mean square error is [sigma with hat] = 0.2479. All estimated coefficients and their standard errors are very close to their true values. In this case, the proposed method works very well. With small values of F-enter and F-remove, the stepwise variable selection procedure is able to retain active effects, though it also includes many inactive effects. On the other hand, this drawback of the stepwise variable selection can be alleviated by the SCAD, which may correctly identify active effects. Furthermore, the class of models considered here is much larger than the one used by Hamada and Wu [11].

Example 2

In this example, we demonstrate the SCAD variable selection procedure by using analysis of a data set collected in the study of the effects of environmental pollutants on human health. The data set is taken from Fang [6]. A brief description is given below. Details can be found in Fang [6]. It is believed that contents of some metal elements in water might directly have impact on human health. Of interest here is to study the association between contents of six selected metals and the mortality of some kind cell of mice. The six selected metals are Cadmium (Cd), Copper (Cu), Zinc (Zn), Nickel (Ni), Chromium (Cr), and Lead (Pb). The levels for the content of each metal element is set to 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1, 2, 5, 8, 10, 12, 16, 18 and 20 (ppm). A uniform design (Fang and Wang [8]; and Fang, Lin, Winker and Zhang [9]) is used to construct a design with 17 experiments. For each run, three experiments were conducted. The average of these three outputs (mortality) is displayed in Table 3, in which the design was extracted from Fang [6].

Table 3
Data for Example 2

To avoid numerical instability, all x variables are standardized first. In our initial model, we include an intercept, all linear effects, quadratic effects and the first order interaction effects. This yields a linear regression models with 28 terms, including the intercept. Since there are only 17 design combination in this experiment. Thus, variable selection is necessary. Similar to the previous analysis, the stepwise regression analysis is first performed. The stepwise regression selects Cd, Cu, Ni*Cr, Zn*Pb, Cr, Pb*Pb, Zn*Cr, Pb, Zn*Zn, Cu*Cr, Ni, Cd*Ni and Ni*Pb, whose estimate, standard error and P-value are depicted in Table 4. All of them except Ni*Pb are significant at level 0.05. Based on the selected model by stepwise regression, we apply the SCAD variable selection procedure to select significant variables. The selected λ is 0.0748. The resulting estimate of significant effects along with their standard errors and P-values are depicted in Table 4, from which it can be seen that all six metal elements have significant impact on mortality.

Table 4
The selected models in Example 2

6 Conclusion and discussion

In this paper, we have introduced the SCAD variable selection procedure for screening experiments. It has been shown that the proposed approach possesses an oracle property. Two data examples illustrated the effectiveness of the proposed approach. In our earlier work (Li and Lin [13, 14]), the proposed procedure has been successfully applied to analyzing data from supersaturated designs(Lin [15, 16]). The computer codes (written in MATLAB) to perform all calculations can be obtained from the authors.

The regularity conditions (C1) and (C2) in Section 3 ensure that the proposed procedure can be used to screen experiments. Although the design matrix for the full model can be singular, condition (C2) implies that the design sub-matrix responding to active factors is full rank, and therefore alias structure among the active factors is not allowed. One should be cautious in the situation that active factors or their interactions are aliased with another ones. It needs some further research on how to deal with such situations.


The authors would like to thank referees for their valuable comments and suggestions. Li was supported by a National Institute on Drug Abuse (NIDA) grant P50 DA10075, and partially supported by NSF grants DMS 0348869 and CCF 0430349.

Contributor Information

Runze Li, Associate Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111.

Dennis K. J. Lin, University Distinguished Professor, Department of Supply Chain and Information Systems, The Pennsylvania State University, University Park, PA 16802.


1. Antoniadis A, Fan J. Regularization of wavelets approximations (with discussions) Journal of American Statistical Association. 2001;96:939–967.
2. Beattie SD, Fong DKH, Lin DKJ. A two-stage Bayesian model selection strategy for supersatured designs. Technometrics. 2002;44:55–63.
3. Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math. 1979;31:377–403.
4. Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–455.
5. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association. 2001;96:1348–1360.
6. Fang KT. Uniform Design and Uniform Design Tables. Science Press; Beijing: 1994.
7. Fang KT, Lin DKJ. Uniform Experimental Design and Its Applications in Industry. In: Rao CR, Khattree R, editors. Handbook of Statistics in Industry. Vol. 22. North Holland; New York: 2003. pp. 131–170.
8. Fang KT, Wang Y. Applications of Number Theoretic Methods in Statistics. Chapman and Hall; London: 1994.
9. Fang KT, Lin DKJ, Winker Peter, Zhang Y. Uniform Design: Theory and Application. Technometrics. 2000;42:237–248.
10. Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics. 1993;35:109–148.
11. Hamada M, Wu CFJ. Analysis of designed experiments with complex aliasing. Journal of Quality Technology. 1992;24:130–137.
12. Hunter DR, Li R. Variable selection using MM algorithms. Annals of Statistics. 2005;33:1617–1642. [PMC free article] [PubMed]
13. Li R, Lin DKJ. Data analysis in supersatured designs. Statistics and Probability Letters. 2002;59:135–144.
14. Li R, Lin DKJ. Comparisons of variable selection approaches for analysis of supersatured design. Journal of Data Science. 2003;1:249–260.
15. Lin DKJ. A new class of supersatured designs. Technometics. 1993;35:28–31.
16. Lin DKJ. Generating system supersaturated designs. Technometrics. 1995;37:213–225.
17. Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B. 1996;58:267–288.
18. Westfall PH, Young SS, Lin DKJ. Forward selection error control in analysis of supersaturated designs. Statistica Sinica. 1998;8:101–117.