Ann Stat. Author manuscript; available in PMC 2010 May 4.
Published in final edited form as: Ann Stat. 2009; 37(4): 1733–1751. doi: 10.1214/08-AOS625
PMCID: PMC2864037

On the Adaptive Elastic-Net with a Diverging Number of Parameters

Hui Zou and Hao Helen Zhang

Abstract
We consider the problem of model selection and estimation in situations where the number of parameters diverges with the sample size. When the dimension is high, an ideal method should have the oracle property (Fan and Li, 2001; Fan and Peng, 2004) which ensures the optimal large sample performance. Furthermore, the high-dimensionality often induces the collinearity problem which should be properly handled by the ideal method. Many existing variable selection methods fail to achieve both goals simultaneously. In this paper, we propose the adaptive Elastic-Net that combines the strengths of the quadratic regularization and the adaptively weighted lasso shrinkage. Under weak regularity conditions, we establish the oracle property of the adaptive Elastic-Net. We show by simulations that the adaptive Elastic-Net deals with the collinearity problem better than the other oracle-like methods, thus enjoying much improved finite sample performance.

Keywords and phrases: Adaptive regularization, Elastic-Net, High dimensionality, Model selection, Oracle property, Shrinkage methods

1. Introduction

1.1. Background

Consider the problem of model selection and estimation in the classical linear regression model

$$ y = X\beta^* + \epsilon, \qquad (1.1) $$
where $y = (y_1,\ldots,y_n)^T$ is the response vector and $x_j = (x_{1j},\ldots,x_{nj})^T$, $j = 1,\ldots,p$, are the linearly independent predictors. Let $X = [x_1, \cdots, x_p]$ be the predictor matrix. Without loss of generality, we assume the data are centered, so the intercept is not included in the regression function. Throughout this paper we assume the errors are independent and identically distributed with zero mean and finite variance $\sigma^2$. We are interested in the sparse modeling problem where the true model has a sparse representation, i.e., some components of $\beta^*$ are exactly zero. Let $\mathcal{A} = \{j : \beta_j^* \neq 0,\ j = 1, 2, \ldots, p\}$. In this work we call the size of $\mathcal{A}$ the intrinsic dimension of the underlying model. We wish to discover the set $\mathcal{A}$ and estimate the corresponding coefficients.

Variable selection is fundamentally important for knowledge discovery with high-dimensional data (Fan & Li 2006) and it could greatly enhance the prediction performance of the fitted model. Traditional model selection procedures follow best-subset selection and its step-wise variants. However, best-subset selection is computationally prohibitive when the number of predictors is large. Furthermore, as analyzed by Breiman (1996), subset selection is unstable, thus the resulting model has poor prediction accuracy. To overcome the fundamental drawbacks of subset selection, statisticians have recently proposed various penalization methods to perform simultaneous model selection and estimation. In particular, the lasso (Tibshirani 1996) and the SCAD (Fan & Li 2001) are two very popular methods due to their good computational and statistical properties. Efron, Hastie, Johnstone & Tibshirani (2004) proposed the LARS algorithm for computing the entire lasso solution path. Knight & Fu (2000) studied the asymptotic properties of the lasso. Fan & Li (2001) showed that the SCAD enjoys the oracle property, that is, the SCAD estimator can perform as well as the oracle if the penalization parameter is appropriately chosen.

1.2. Two fundamental issues with the $\ell_1$ penalty

The lasso estimator (Tibshirani 1996) is obtained by solving the $\ell_1$-penalized least squares problem

$$ \hat{\beta}(\text{lasso}) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1, $$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the $\ell_1$-norm of $\beta$. The $\ell_1$ penalty enables the lasso to simultaneously regularize the least squares fit and shrink some components of $\hat{\beta}(\text{lasso})$ to zero for some appropriately chosen $\lambda$. The entire lasso solution path can be computed by the LARS algorithm (Efron et al. 2004). These nice properties make the lasso a very popular variable selection method.
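To make the criterion concrete, here is a minimal sketch (not part of the original paper) of fitting the lasso criterion above with scikit-learn on simulated data. The data, the value of $\lambda$, and the mapping to scikit-learn's `alpha` (whose documented objective is $(1/2n)\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so $\alpha = \lambda/(2n)$) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, lars_path

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.concatenate([np.array([3.0, 1.5, 0.0, 0.0, 2.0]), np.zeros(p - 5)])
y = X @ beta_true + rng.standard_normal(n)

# Paper-style criterion: ||y - X b||_2^2 + lam * ||b||_1.
# sklearn's Lasso minimizes (1/(2n))||y - X b||_2^2 + alpha*||b||_1,
# so alpha = lam / (2n) gives the same minimizer.
lam = 50.0
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))

# Entire lasso solution path via the LARS algorithm (Efron et al. 2004).
alphas, _, coef_path = lars_path(X, y, method="lasso")
print("path computed at", len(alphas), "breakpoints")
```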

Despite its popularity, the lasso has two serious drawbacks: the lack of the oracle property and instability with high-dimensional data. First, the lasso does not have the oracle property. Fan & Li (2001) first pointed out that asymptotically the lasso has non-ignorable bias for estimating the nonzero coefficients, and they conjectured that the lasso may not have the oracle property because of this bias. The conjecture was recently proven in Zou (2006), which further showed that the lasso can be inconsistent for model selection unless the predictor matrix (or the design matrix) satisfies a rather strong condition. Zou (2006) proposed the following adaptive lasso estimator

$$ \hat{\beta}(\text{AdaLasso}) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p \hat{w}_j |\beta_j|, $$

where $\{\hat{w}_j\}_{j=1}^p$ are the adaptive data-driven weights, which can be computed by $\hat{w}_j = (|\hat{\beta}_j^{\,ini}|)^{-\gamma}$, where $\gamma$ is a positive constant and $\hat{\beta}^{\,ini}$ is an initial root-$n$-consistent estimate of $\beta$. Zou (2006) showed that with an appropriately chosen $\lambda$, the adaptive lasso performs as well as the oracle. Candes, Wakin & Boyd (2007) used the adaptive lasso idea to enhance sparsity in sparse signal recovery via reweighted $\ell_1$ minimization.
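As an illustration of how the weighted $\ell_1$ criterion can be computed in practice, the following sketch uses the standard column-rescaling reduction (solve a plain lasso in $x_j/\hat{w}_j$ and rescale the solution back). The OLS initial estimate, the small offset guarding against zero weights, and the scikit-learn parameterization are assumptions made for this example, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Weighted-l1 criterion ||y - X b||^2 + lam * sum_j w_j |b_j| with
    w_j = |beta_init_j|^{-gamma}; beta_init is the OLS fit (root-n consistent
    when p is fixed).  Solved via the column-rescaling reduction to a plain lasso."""
    n, p = X.shape
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
    w = (np.abs(beta_init) + 1e-8) ** (-gamma)   # small offset avoids division by zero
    Xw = X / w                                   # rescale columns: x_j / w_j
    inner = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(Xw, y)
    return inner.coef_ / w                       # map back: b_j = btilde_j / w_j
```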

Second, $\ell_1$ penalization methods can perform very poorly when there are highly correlated variables in the predictor set. The collinearity problem is often encountered in high-dimensional data analysis. Even when the predictors are independent, as long as the dimension is high, the maximum sample correlation can be large, as shown in Fan & Lv (2007). Collinearity can severely degrade the performance of the lasso. As shown in Zou & Hastie (2005), the lasso solution paths are unstable when predictors are highly correlated. Zou & Hastie (2005) proposed the Elastic-Net as an improved version of the lasso for analyzing high-dimensional data. The Elastic-Net estimator is defined as follows:

$$ \hat{\beta}(\text{enet}) = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \right\}. \qquad (1.4) $$

If the predictors are standardized (each variable has mean zero and unit $L_2$-norm), then we should change $(1+\lambda_2/n)$ to $(1 + \lambda_2)$, as in Zou & Hastie (2005). The $\ell_1$ part of the Elastic-Net performs automatic variable selection, while the $\ell_2$ part stabilizes the solution paths and hence improves the prediction. In an orthogonal design, where the lasso is shown to be optimal (Donoho, Johnstone, Kerkyacharian & Picard 1995), the Elastic-Net automatically reduces to the lasso. However, when the correlations among the predictors become high, the Elastic-Net can significantly improve the prediction accuracy over the lasso.
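The following sketch shows one way to compute the Elastic-Net estimator in (1.4) with scikit-learn. The mapping between $(\lambda_1, \lambda_2)$ and scikit-learn's `(alpha, l1_ratio)` is derived from scikit-learn's documented objective and is an assumption of this example, not something specified in the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def elastic_net(X, y, lam1, lam2):
    """Naive elastic-net criterion ||y - X b||^2 + lam2*||b||_2^2 + lam1*||b||_1,
    rescaled by (1 + lam2/n) as in (1.4).
    sklearn's ElasticNet minimizes
      (1/(2n))||y - X b||^2 + alpha*l1_ratio*||b||_1 + 0.5*alpha*(1 - l1_ratio)*||b||_2^2,
    so we need alpha*l1_ratio = lam1/(2n) and alpha*(1 - l1_ratio) = lam2/n."""
    n = X.shape[0]
    alpha = lam1 / (2 * n) + lam2 / n
    l1_ratio = (lam1 / (2 * n)) / alpha
    fit = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                     fit_intercept=False, max_iter=50000).fit(X, y)
    return (1 + lam2 / n) * fit.coef_
```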

1.3. The adaptive Elastic-Net

The adaptively weighted $\ell_1$ penalty and the Elastic-Net penalty improve the lasso in two different directions: the adaptive lasso achieves the oracle property of the SCAD, and the Elastic-Net handles the collinearity. However, following the arguments in Zou & Hastie (2005) and Zou (2006), we can easily see that the adaptive lasso inherits the instability of the lasso for high-dimensional data, while the Elastic-Net lacks the oracle property. Thus, it is natural to combine the ideas of the adaptively weighted $\ell_1$ penalty and the Elastic-Net regularization to obtain a better method that improves the lasso in both directions. To this end, we propose the adaptive Elastic-Net, which penalizes the squared error loss using a combination of the $\ell_2$ penalty and the adaptive $\ell_1$ penalty. Since the adaptive Elastic-Net is designed for high-dimensional data analysis, we study its asymptotic properties under the assumption that the dimension diverges with the sample size.

Pioneering papers on asymptotic theory with a diverging number of parameters include Huber (1973) and Portnoy (1984), which studied M-estimators. Recently, Fan, Peng & Huang (2005) studied a semi-parametric model with a growing number of nuisance parameters, whereas Lam & Fan (2007) investigated profile likelihood ratio inference with a growing number of parameters. In particular, our work is influenced by Fan & Peng (2004), who studied the oracle property of nonconcave penalized likelihood estimators and argued forcefully why it is important to study the validity of the oracle property when the dimension diverges. We would like to know whether the adaptive Elastic-Net enjoys the oracle property with a diverging number of predictors. This question is thoroughly investigated in this paper.

The rest of the paper is organized as follows. In Section 2 we introduce the adaptive Elastic-Net. Statistical theory, including the oracle property, of the adaptive Elastic-Net is established in Section 3. In Section 4 we use simulation to compare the finite sample performance of the adaptive Elastic-Net with the SCAD and other competitors. Section 5 discusses how to combine the SIS of Fan & Lv (2007) with the adaptive Elastic-Net to deal with ultra-high dimensional cases. Technical proofs are presented in Section 6.

2. Method

The adaptive Elastic-Net can be viewed as a combination of the Elastic-Net and the adaptive lasso. Suppose we first compute the Elastic-Net estimator $\hat{\beta}(\text{enet})$ as defined in (1.4); then we construct the adaptive weights by

$$ \hat{w}_j = \left(|\hat{\beta}_j(\text{enet})|\right)^{-\gamma}, \qquad j = 1, \ldots, p, \qquad (2.1) $$
where γ is a positive constant. Now we solve the following optimization problem to get the adaptive Elastic-Net estimates

$$ \hat{\beta}(\text{AdaEnet}) = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1^* \sum_{j=1}^p \hat{w}_j |\beta_j| \right\}. \qquad (2.2) $$

From now on, we write $\hat{\beta} = \hat{\beta}(\text{AdaEnet})$ for the sake of convenience.

If we force $\lambda_2$ to be zero in (2.2), then the adaptive Elastic-Net reduces to the adaptive lasso. Following the arguments in Zou & Hastie (2005), we can easily show that in an orthogonal design the adaptive Elastic-Net reduces to the adaptive lasso, regardless of the value of $\lambda_2$. This is desirable because in that setting the adaptive lasso achieves the optimal minimax risk bound (Zou 2006). The role of the $\ell_2$ penalty in (2.2) is to further regularize the adaptive lasso fit whenever collinearity may cause serious trouble.

We know the Elastic-Net naturally adopts a sparse representation. One can use $\hat{w}_j = (|\hat{\beta}_j(\text{enet})| + 1/n)^{-\gamma}$ to avoid division by zero. We can also define $\hat{w}_j = \infty$ when $\hat{\beta}_j(\text{enet}) = 0$. Let $\hat{\mathcal{A}}_{enet} = \{j : \hat{\beta}_j(\text{enet}) \neq 0\}$ and let $\hat{\mathcal{A}}_{enet}^c$ denote its complement set. Then we have $\hat{\beta}_{\hat{\mathcal{A}}_{enet}^c} = 0$ and

$$ \hat{\beta}_{\hat{\mathcal{A}}_{enet}} = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \arg\min_{\beta} \|y - X_{\hat{\mathcal{A}}_{enet}} \beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1^* \sum_{j \in \hat{\mathcal{A}}_{enet}} \hat{w}_j |\beta_j| \right\}, \qquad (2.3) $$

where $\beta$ in (2.3) is a vector of length $|\hat{\mathcal{A}}_{enet}|$, the size of $\hat{\mathcal{A}}_{enet}$.

The $\ell_1$ regularization parameters, $\lambda_1^*$ and $\lambda_1$, are directly responsible for the sparsity of the estimates, and their values are allowed to differ. On the other hand, we use the same $\lambda_2$ for the $\ell_2$ penalty component in both the Elastic-Net and the adaptive Elastic-Net estimators, because the $\ell_2$ penalty plays the same role in both.
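To summarize Section 2 operationally, here is a minimal sketch of the two-stage procedure, not the authors' implementation. Stage one computes the Elastic-Net fit of (1.4); stage two solves (2.2) using the $(|\hat{\beta}_j(\text{enet})| + 1/n)^{-\gamma}$ weights mentioned above, the augmented-data representation of the $\ell_2$ penalty, and the column-rescaling trick for the weighted $\ell_1$ penalty. The reliance on scikit-learn and its parameterization is an assumption of the sketch.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def naive_enet(X, y, lam1, lam2):
    # Stage 1: Elastic-Net fit of (1.4); sklearn's objective is
    # (1/(2n))||y - Xb||^2 + a*r*||b||_1 + 0.5*a*(1-r)*||b||_2^2.
    n = X.shape[0]
    alpha = lam1 / (2 * n) + lam2 / n
    r = (lam1 / (2 * n)) / alpha
    fit = ElasticNet(alpha=alpha, l1_ratio=r,
                     fit_intercept=False, max_iter=50000).fit(X, y)
    return (1 + lam2 / n) * fit.coef_

def adaptive_enet(X, y, lam1, lam2, lam1_star, gamma):
    """Two-stage adaptive Elastic-Net (criterion (2.2)), sketched via
    (i) weights w_j = (|b_enet_j| + 1/n)^{-gamma} and
    (ii) the augmented-data identity
        ||y - Xb||^2 + lam2*||b||^2 = ||y_a - X_a b||^2,
        X_a = [X; sqrt(lam2) I], y_a = [y; 0]."""
    n, p = X.shape
    b_enet = naive_enet(X, y, lam1, lam2)
    w = (np.abs(b_enet) + 1.0 / n) ** (-gamma)

    X_a = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
    y_a = np.concatenate([y, np.zeros(p)])

    # Weighted-l1 part handled by rescaling columns by 1/w_j; sklearn's Lasso
    # minimizes (1/(2m))||.||^2 + alpha*||.||_1 with m = n + p rows here.
    m = n + p
    inner = Lasso(alpha=lam1_star / (2 * m),
                  fit_intercept=False, max_iter=50000).fit(X_a / w, y_a)
    return (1 + lam2 / n) * inner.coef_ / w
```

Setting `lam2 = 0` in this sketch recovers the adaptive lasso fit, consistent with the remark above that the adaptive Elastic-Net reduces to the adaptive lasso when the $\ell_2$ penalty is removed.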

3. Statistical Theory

In our theoretical analysis, we assume the following regularity conditions throughout.

  • (A1) We use λmin(M) and λmax(M) to denote the minimum and maximum eigenvalues of a positive definite matrix M, respectively. Then we assume
    $$ b \le \lambda_{\min}\!\left(\tfrac{1}{n} X^T X\right) \le \lambda_{\max}\!\left(\tfrac{1}{n} X^T X\right) \le B, $$
    where b and B are two positive constants.
  • (A2)
  • (A3) $E[|\epsilon|^{2+\delta}] < \infty$ for some $\delta > 0$.
  • (A4)
    $$ \lim_{n \to \infty} \frac{\log(p)}{\log(n)} = \nu \qquad \text{for some } 0 \le \nu < 1. $$
    To construct the adaptive weights (ŵ), we take a fixed γ such that $\gamma > \frac{2\nu}{1-\nu}$. In our numerical studies we let $\gamma = \frac{2\nu}{1-\nu} + 1$ to avoid tuning γ; see the short numerical check after this list. Once γ is chosen, we choose the regularization parameters according to the following conditions:
  • (A5)
  • (A6)
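As referenced in condition (A4) above, here is a short numerical check (an illustration, not part of the paper) that the default choice $\gamma = 2\nu/(1-\nu) + 1$ reproduces the values used later in the simulations.

```python
# gamma > 2*nu/(1 - nu) is required by (A4); the default gamma = 2*nu/(1 - nu) + 1
# gives gamma = 3 for nu = 1/2 (Example 1) and gamma = 5 for nu = 2/3 (Example 2).
for nu in (0.5, 2.0 / 3.0):
    gamma = 2 * nu / (1 - nu) + 1
    print(f"nu = {nu:.3f}  ->  gamma = {gamma:.0f}")
```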

Conditions (A1) and (A2) assume that the predictor matrix is reasonably well behaved. Similar conditions were considered in Portnoy (1984). Note that in the linear regression setting, condition (A1) is exactly condition (F) in Fan & Peng (2004). Condition (A3) is used to establish the asymptotic normality of $\hat{\beta}(\text{AdaEnet})$.

It is worth pointing out that condition (A4) is weaker than that used in Fan & Peng (2004), where p is assumed to satisfy $p^4/n \to 0$ or at most $p^3/n \to 0$; this means their results require $\nu < \frac{1}{3}$. Our theory removes this limitation. For any $0 \le \nu < 1$, we can choose an appropriate γ to construct the adaptive weights, and the oracle property holds as long as $\gamma > \frac{2\nu}{1-\nu}$. Also note that in the finite-dimension setting ν = 0, so any positive γ can be used, which agrees with the results in Zou (2006).

Condition (A6) is similar to condition (H) in Fan & Peng (2004). Basically, condition (A6) allows the nonzero coefficients to vanish but at a rate that can be distinguished by the penalized least squares. In the finite dimension setting the condition is implicitly assumed.


Given the data (y, X), let ŵ = (ŵ1, …, ŵp) be a vector whose components are all non-negative and can depend on (y, X). Define

$$ \hat{\beta}_{\hat{w}}(\lambda_2, \lambda_1) = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \sum_{j=1}^p \hat{w}_j |\beta_j| \right\}, $$

for non-negative parameters $\lambda_2$ and $\lambda_1$. If $\hat{w}_j = 1$ for all j, we denote $\hat{\beta}_{\hat{w}}(\lambda_2, \lambda_1)$ by $\hat{\beta}(\lambda_2, \lambda_1)$ for convenience.

Theorem 3.1. If we assume the model (1.1) and condition (A1), then


In particular, when ŵj = 1 for all j, we have


It is worth mentioning that the derived risk bounds are non-asymptotic. Theorem 3.1 is very useful for the asymptotic analysis. A direct corollary of Theorem 3.1 is that, under conditions (A1)–(A6), $\hat{\beta}(\lambda_2, \lambda_1)$ is a root-(n/p)-consistent estimator. This convergence rate is the same as that of the SCAD (Fan & Peng 2004). The root-(n/p) consistency result suggests that it is appropriate to use the Elastic-Net to construct the adaptive weights.


Theorem 3.2. Let us write $\beta^* = (\beta^*_{\mathcal{A}}, 0)$ and define

$$ \tilde{\beta}^*_{\mathcal{A}} = \arg\min_{\beta} \left\{ \|y - X_{\mathcal{A}}\beta\|_2^2 + \lambda_2 \sum_{j \in \mathcal{A}} \beta_j^2 + \lambda_1^* \sum_{j \in \mathcal{A}} \hat{w}_j |\beta_j| \right\}. $$

Then, with probability tending to 1, $\left(\left(1 + \frac{\lambda_2}{n}\right)\tilde{\beta}^*_{\mathcal{A}},\, 0\right)$ is the solution to (2.2).

Theorem 3.2 provides an asymptotic characterization of the solution to the adaptive Elastic-Net criterion. The definition of $\tilde{\beta}^*_{\mathcal{A}}$ borrows the concept of the "oracle" (Donoho & Johnstone 1994, Fan & Li 2001, Fan & Peng 2004, Zou 2006). If there were an oracle informing us of the true subset model, then we would use this oracle information and the adaptive Elastic-Net criterion would become that in (2.3). Theorem 3.2 tells us that, asymptotically speaking, the adaptive Elastic-Net works as if it had such oracle information. Theorem 3.2 also suggests that the adaptive Elastic-Net should enjoy the oracle property, which is confirmed in the next theorem.


Theorem 3.3. Under conditions (A1)–(A6), the adaptive Elastic-Net has the oracle property; that is, the estimator $\hat{\beta}(\text{AdaEnet})$ satisfies:

  1. Consistency in selection: $\Pr\left(\{j : \hat{\beta}(\text{AdaEnet})_j \neq 0\} = \mathcal{A}\right) \to 1$;
  2. Asymptotic normality: $\alpha^T \dfrac{I + \lambda_2 \Sigma_{\mathcal{A}}^{-1}}{1 + \lambda_2/n}\, \Sigma_{\mathcal{A}}^{1/2} \left(\hat{\beta}(\text{AdaEnet})_{\mathcal{A}} - \beta^*_{\mathcal{A}}\right) \to_d N(0, \sigma^2)$, where $\Sigma_{\mathcal{A}} = X_{\mathcal{A}}^T X_{\mathcal{A}}$ and $\alpha$ is a vector of norm 1.

By Theorem 3.3, the selection consistency and the asymptotic normality of the adaptive Elastic-Net remain valid when the number of parameters diverges. Technically speaking, the selection consistency result is stronger than what Theorem 3.2 implies, although Theorem 3.2 plays an important role in the proof of Theorem 3.3. As a special case, when we let $\lambda_2 = 0$, which is a choice satisfying conditions (A5) and (A6), Theorem 3.3 tells us that the adaptive lasso enjoys the selection consistency and the asymptotic normality:

$$ \alpha^T \Sigma_{\mathcal{A}}^{1/2} \left(\hat{\beta}(\text{AdaLasso})_{\mathcal{A}} - \beta^*_{\mathcal{A}}\right) \to_d N(0, \sigma^2). $$
4. Numerical Studies

In this section we present simulations to study the finite sample performance of the adaptive Elastic-Net. We considered five methods in the simulation study: the lasso (Lasso), the Elastic-Net (Enet), the adaptive lasso (ALasso), the adaptive Elastic-Net (AEnet) and the SCAD. In our implementation, we set $\lambda_2 = 0$ in the adaptive Elastic-Net to obtain the adaptive lasso fit. There are several commonly used tuning parameter selection methods, such as cross-validation, generalized cross-validation (GCV), AIC and BIC. Zou, Hastie & Tibshirani (2007) suggested using BIC to select the lasso tuning parameter, and Wang, Li & Tsai (2007) showed that for the SCAD, BIC is a better tuning parameter selector than GCV and AIC. In this work, we used BIC to select the tuning parameter for each method.
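The BIC tuning rule described above can be sketched as follows. The grid of λ values, the use of the lasso as the illustrative fitting method, and the number-of-nonzero-coefficients degrees-of-freedom estimate (motivated by Zou, Hastie & Tibshirani 2007) are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bic_select(X, y, lambdas):
    """Pick the l1 tuning parameter by BIC:
       BIC(lam) = n*log(RSS/n) + df(lam)*log(n),
    with df estimated by the number of nonzero coefficients.
    Illustrated here with the lasso (paper-style lam mapped to sklearn's alpha)."""
    n = X.shape[0]
    best_lam, best_bic, best_coef = None, np.inf, None
    for lam in lambdas:
        coef = Lasso(alpha=lam / (2 * n), fit_intercept=False,
                     max_iter=50000).fit(X, y).coef_
        rss = np.sum((y - X @ coef) ** 2)
        bic = n * np.log(rss / n) + np.count_nonzero(coef) * np.log(n)
        if bic < best_bic:
            best_lam, best_bic, best_coef = lam, bic, coef
    return best_lam, best_coef
```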

Fan & Peng (2004) considered simulation models in which $p_n = [4n^{1/4}] - 5$ and $|\mathcal{A}| = 5$. Our theory allows $p_n = O(n^{\nu})$ for any ν < 1. Thus, we are interested in models in which $p_n = O(n^{\nu})$ with $\nu > \frac{1}{3}$. In addition, we allow the intrinsic dimension $|\mathcal{A}|$ to diverge with the sample size as well, because such designs make model selection and estimation more challenging than in the fixed-$|\mathcal{A}|$ situations.

Example 1

We generated data from the linear regression model

$$ y = x^T \beta^* + \epsilon, $$
where $\beta^*$ is a p-dimensional vector, $\epsilon \sim N(0, \sigma^2)$ with σ = 6, and x follows a p-dimensional multivariate normal distribution with zero mean and covariance Σ whose (j, k) entry is $\Sigma_{j,k} = \rho^{|j-k|}$, $1 \le j, k \le p$. We considered ρ = 0.5 and ρ = 0.75. Let $p = p_n = [4n^{1/2}] - 5$ for n = 100, 200, 400, and let $\mathbf{1}_m$ / $\mathbf{0}_m$ denote an m-vector of ones/zeros. The true coefficients are $\beta^* = (3 \cdot \mathbf{1}_q,\, 3 \cdot \mathbf{1}_q,\, 3 \cdot \mathbf{1}_q,\, \mathbf{0}_{p-3q})^T$ with $|\mathcal{A}| = 3q$ and $q = [p_n/9]$. In this example $\nu = \frac{1}{2}$, hence we used γ = 3 for computing the adaptive weights in the adaptive Elastic-Net.

For each estimator $\hat{\beta}$, estimation accuracy is measured by the mean squared error (MSE), defined as $E[(\hat{\beta} - \beta^*)^T \Sigma (\hat{\beta} - \beta^*)]$. Variable selection performance is gauged by (C, IC), where C is the number of zero coefficients correctly estimated as zero and IC is the number of nonzero coefficients incorrectly estimated as zero.
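For concreteness, a sketch of the Example 1 design and the (MSE, C, IC) summaries is given below. The random seed and the use of numpy are of course not part of the paper, and the reported MSE is the per-replication quantity whose average over replications estimates the expectation above.

```python
import numpy as np

def make_example1(n, rho, sigma=6.0, seed=0):
    """Simulation design of Example 1: p = [4 n^{1/2}] - 5, q = [p/9],
    beta* = (3*1_q, 3*1_q, 3*1_q, 0_{p-3q}), x ~ N(0, Sigma) with
    Sigma_{jk} = rho^{|j-k|}."""
    rng = np.random.default_rng(seed)
    p = int(4 * np.sqrt(n)) - 5
    q = p // 9
    beta = np.concatenate([3.0 * np.ones(3 * q), np.zeros(p - 3 * q)])
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, Sigma

def summarize(beta_hat, beta, Sigma):
    """Per-replication MSE = (bhat - b)' Sigma (bhat - b);
    C  = #{true zeros estimated as zero};
    IC = #{true nonzeros estimated as zero}."""
    diff = beta_hat - beta
    mse = float(diff @ Sigma @ diff)
    C = int(np.sum((beta == 0) & (beta_hat == 0)))
    IC = int(np.sum((beta != 0) & (beta_hat == 0)))
    return mse, C, IC
```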

Table 1 documents the simulation results. Several interesting observations can be made.

  1. When the sample size is large (n = 400), the three oracle-like estimators outperform the lasso and the Elastic-Net which do not have the oracle property. That is expected according to the asymptotic theory.
  2. The SCAD and the adaptive Elastic-Net are the best when the sample size is large and the correlation is moderate. However, the SCAD can perform much worse than the adaptive Elastic-Net when the correlation is high (ρ = 0.75) or the sample size is small.
  3. Both the Elastic-Net and the adaptive lasso can do significantly better than the lasso. What is more interesting is that the adaptive Elastic-Net often outperforms the Elastic-Net and the adaptive lasso.
Table 1. Simulation I: model selection and fitting results based on 100 replications.

Example 2

We considered the same setup as in Example 1, except that we let $p = p_n = [4n^{2/3}] - 5$ for n = 100, 200, 800. Since $\nu = \frac{2}{3}$, we used γ = 5 for computing the adaptive weights in the adaptive Elastic-Net and the adaptive lasso. The estimation problem in this example is even more difficult than that in Example 1. To see why, note that when n = 200 the dimension increases from 51 in Example 1 to 131 in this example, and the intrinsic dimension $|\mathcal{A}|$ almost triples.

The simulation results are presented in Table 2, from which we can see that the three observations made in Example 1 remain valid in this example. Furthermore, for every combination of (n, p, $|\mathcal{A}|$, ρ), the adaptive Elastic-Net has the best performance.

Table 2. Example 2: model selection and fitting results based on 100 replications.

5. Ultra-high dimensional data

In this section we discuss how the adaptive Elastic-Net can be applied to ultra-high dimensional data in which p > n. When p is much larger than n, Candes & Tao (2007) suggested using the Dantzig selector, which can achieve the ideal estimation risk up to a log(p) factor under the uniform uncertainty condition. Fan & Lv (2007) showed that the uniform uncertainty condition may easily fail and that the log(p) factor is too large when p is exponentially large. Moreover, the computational cost of the Dantzig selector would be very high when p is large. To overcome these difficulties, Fan & Lv (2007) introduced the Sure Independence Screening (SIS) idea, which reduces the ultra-high dimensionality to a relatively large scale $d_n$ with $d_n < n$. Then lower-dimensional methods such as the SCAD can be used to estimate the sparse model. This procedure is referred to as SIS+SCAD. Under regularity conditions, Fan & Lv (2007) proved that SIS misses true features with an exponentially small probability and that SIS+SCAD has the oracle property if $d_n = o(n^{1/3})$. Furthermore, with the help of SIS, the Dantzig selector can achieve the ideal risk up to a log($d_n$) factor, rather than the original log(p).

Inspired by the results of Fan & Lv (2007), we consider combining the adaptive Elastic-Net and SIS when p > n. We first apply SIS to reduce the dimension to dn and then fit the data by using the adaptive Elastic-Net. We call this procedure SIS+AEnet.
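A minimal sketch of the SIS+AEnet procedure is given below. The screening statistic (componentwise correlation magnitudes $|x_j^T y|$ on standardized columns) follows the SIS idea of Fan & Lv (2007), and the `adaptive_enet` function is assumed to be the Section 2 sketch given earlier; this is an illustration, not the authors' implementation.

```python
import numpy as np

def sis(X, y, d):
    """Sure Independence Screening: rank features by the magnitude of the
    componentwise (marginal) correlation |x_j' y| (columns assumed standardized)
    and keep the top d."""
    scores = np.abs(X.T @ y)
    keep = np.argsort(scores)[::-1][:d]
    return np.sort(keep)

def sis_aenet(X, y, d, lam1, lam2, lam1_star, gamma):
    """SIS + adaptive Elastic-Net: screen down to d features, then fit the
    adaptive Elastic-Net (the Section 2 sketch) on the retained columns."""
    keep = sis(X, y, d)
    n, p = X.shape
    coef = np.zeros(p)
    coef[keep] = adaptive_enet(X[:, keep], y, lam1, lam2, lam1_star, gamma)
    return coef
```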


Theorem 5.1. Suppose the conditions for Theorem 1 in Fan & Lv (2007) hold. Let $d_n = O(n^{\nu})$ with ν < 1. Then SIS+AEnet produces an estimator that has the oracle property.

We note here that Theorem 5.1 is a direct consequence of Theorem 1 in Fan & Lv (2007) and Theorem 3.3; its proof is therefore omitted. Theorem 5.1 is similar to Theorem 5 in Fan & Lv (2007), but with one difference: SIS+AEnet retains the oracle property when $d_n$ exceeds $O(n^{1/3})$, while Theorem 5 in Fan & Lv (2007) assumes $d_n = o(n^{1/3})$.

To demonstrate SIS+AEnet, we consider the simulation example used in Fan & Lv (2007) (Section 3.3.1). The model is $y = x^T\beta^* + 1.5\,N(0,1)$, where $\beta^* = (\beta_1^T, \mathbf{0}_{p-|\mathcal{A}|})^T$ with $|\mathcal{A}| = 8$. Here $\beta_1$ is an 8-dimensional vector whose components have the form $(-1)^u (a_n + |z|)$, where $a_n = 4\log(n)/\sqrt{n}$, u is randomly drawn from Bernoulli(0.4) and z is randomly drawn from the standard normal distribution. We generated n = 200 observations from the above model. Before applying the adaptive Elastic-Net, we used SIS to reduce the dimensionality from 1000 to $d_n = [5.5\, n^{2/3}] = 188$. The estimation problem is still rather challenging, as we need to estimate 188 parameters using only 200 observations. From Table 3 we see that SIS+AEnet performs favorably compared with SIS+SCAD.
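The data-generating scheme just described can be sketched as follows. The i.i.d. standard normal design and the random seed are assumptions of the sketch (the text above does not restate the design distribution), and `a_n` is taken as given in the text.

```python
import numpy as np

def make_fan_lv_example(n=200, p=1000, seed=0):
    """Data-generating scheme of Section 5 (following Fan & Lv 2007, Sec. 3.3.1):
    8 nonzero coefficients of the form (-1)^u (a_n + |z|), u ~ Ber(0.4),
    z ~ N(0,1), a_n = 4*log(n)/sqrt(n) as stated in the text; noise 1.5*N(0,1).
    Predictors drawn i.i.d. standard normal here (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    a_n = 4 * np.log(n) / np.sqrt(n)
    signs = np.where(rng.random(8) < 0.4, -1.0, 1.0)   # (-1)^u with u ~ Ber(0.4)
    beta = np.zeros(p)
    beta[:8] = signs * (a_n + np.abs(rng.standard_normal(8)))
    X = rng.standard_normal((n, p))
    y = X @ beta + 1.5 * rng.standard_normal(n)
    return X, y, beta
```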

Table 3. A demonstration of SIS+AEnet: model selection and fitting results based on 100 replications.

6. Proofs


Proof of Theorem 3.1. We write

$$ \hat{\beta}(\lambda_2, 0) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2. $$

By the definition of [beta]ŵ2, λ1) and [beta]2, 0), we know




From the above two inequalities, we have


On the other hand, we have




Note that $\lambda_{\min}(X^T X + \lambda_2 I) = \lambda_{\min}(X^T X) + \lambda_2$. Therefore, we end up with


which results in the following inequality


Note that


which implies that


Combining (6.3) and (6.4), we have



We have used condition (A1) in the last inequality. When ŵj = 1 for all j, we have



Proof of Theorem 3.2. We show that $\left(\left(1 + \frac{\lambda_2}{n}\right)\tilde{\beta}^*_{\mathcal{A}},\, 0\right)$ satisfies the Karush–Kuhn–Tucker (KKT) conditions of (2.2) with probability tending to 1. By the definition of $\tilde{\beta}^*_{\mathcal{A}}$, it suffices to show


or equivalently


Let $\eta = \min_{j \in \mathcal{A}} |\beta_j^*|$ and $\hat{\eta} = \min_{j \in \mathcal{A}} |\hat{\beta}_j(\text{enet})|$. We note that



Then by Theorem 3.1 we obtain


Moreover, let M=(λ1*n)11+γ and we have


where we have used Theorem 3.1 in the last step. By the model assumption, we have


which gives the following inequality


We now bound $E\left(\|\beta^*_{\mathcal{A}} - \tilde{\beta}^*_{\mathcal{A}}\|_2^2\, I(\hat{\eta} > \eta/2)\right)$. Let

$$ \tilde{\beta}^*_{\mathcal{A}}(\lambda_2, 0) = \arg\min_{\beta} \left\{ \|y - X_{\mathcal{A}}\beta\|_2^2 + \lambda_2 \sum_{j \in \mathcal{A}} \beta_j^2 \right\}. $$

Then by using the same arguments for deriving (6.1), (6.2) and (6.3), we have


Note that $\lambda_{\min}(X_{\mathcal{A}}^T X_{\mathcal{A}}) \ge \lambda_{\min}(X^T X) \ge bn$ and $\lambda_{\max}(X_{\mathcal{A}}^T X_{\mathcal{A}}) \le \lambda_{\max}(X^T X) \le Bn$. Following the remaining arguments in the proof of Theorem 3.1, we obtain


The combination of (6.7), (6.8), (6.9) and (6.11) yields


We have chosen $\gamma > \frac{2\nu}{1-\nu}$; then, under conditions (A1)–(A6), it follows that


Thus the proof is completed.


Proof of Theorem 3.3. From Theorem 3.2 we have shown that, with probability tending to 1, the adaptive Elastic-Net estimator equals $\left(\left(1 + \frac{\lambda_2}{n}\right)\tilde{\beta}^*_{\mathcal{A}},\, 0\right)$. Therefore, in order to prove the model selection consistency result, we only need to show $\Pr\left(\min_{j \in \mathcal{A}} |\tilde{\beta}_j^*| > 0\right) \to 1$. By (6.10) we have


Note that


Following (6.6) it is easy to see that


Moreover, $\dfrac{\lambda_1^* \sqrt{p}\, \hat{\eta}^{-\gamma}}{bn + \lambda_2} = O\!\left(\dfrac{1}{\sqrt{n}}\right)\left(\dfrac{\lambda_1^* \sqrt{p}}{\sqrt{n}\, \eta^{\gamma}}\right)\left(\dfrac{\hat{\eta}}{\eta}\right)^{-\gamma}$ and


In (6.12) we have shown η2np. Thus


Hence, we have


and $\Pr\left(\min_{j \in \mathcal{A}} |\tilde{\beta}_j^*| > 0\right) \to 1$.

We now prove the asymptotic normality. For convenience write


Note that


In addition, we have


Therefore, by Theorem 3.2 it follows that with probability tending to 1, zn = T1 + T2 + T3, where


We now show that $T_1 = o(1)$, $T_2 = o_P(1)$ and $T_3 \to N(0, \sigma^2)$ in distribution. Then, by Slutsky's theorem, $z_n \to_d N(0, \sigma^2)$. By (A1) and $\alpha^T \alpha = 1$, we have


Hence it follows by (A6) that T1 = o(1). Similarly, we can bound T2 as follows


where we have used (6.10) in the last step. Then (6.13) tells us that $T_2^2 = \frac{1}{n^2} O_P(1)$. Next we consider $T_3$. Let $X_{\mathcal{A}}[i,]$ denote the i-th row of the matrix $X_{\mathcal{A}}$. With this notation we can write $T_3 = \sum_{i=1}^n r_i \epsilon_i$, where $r_i = \alpha^T (X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1/2} (X_{\mathcal{A}}[i,])^T$. Then it is easy to see that


Furthermore, we have for k = 2 + δ, δ > 0


Note that $r_i^2 \le \left\| (X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1/2} (X_{\mathcal{A}}[i,])^T \right\|_2^2 \le \left( \sum_{j \in \mathcal{A}} x_{ij}^2 \right) \lambda_{\max}\!\left( (X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1} \right) \le \dfrac{\sum_{j=1}^p x_{ij}^2}{bn}$. Hence,


From (6.14) and (6.15), the Lyapunov condition for the central limit theorem is established. Thus, $T_3 \to_d N(0, \sigma^2)$. This completes the proof.


We sincerely thank an associate editor and the referees for their helpful comments and suggestions.

Contributor Information

Hui Zou, School of Statistics, University of Minnesota, Minneapolis, MN 55455. E-mail: hzou@stat.umn.edu.

Hao Helen Zhang, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203. E-mail: hzhang2@stat.ncsu.edu.


References

  • Breiman L. 'Heuristics of instability and stabilization in model selection.' The Annals of Statistics. 1996;24:2350–2383.
  • Candes E, Tao T. 'The Dantzig selector: statistical estimation when p is much larger than n.' The Annals of Statistics, to appear. 2007.
  • Candes E, Wakin M, Boyd S. 'Enhancing sparsity by reweighted ℓ1 minimization.' Technical report, California Institute of Technology; 2007.
  • Donoho D, Johnstone I. 'Ideal spatial adaptation via wavelet shrinkage.' Biometrika. 1994;81:425–455.
  • Donoho D, Johnstone I, Kerkyacharian G, Picard D. 'Wavelet shrinkage: asymptopia? (with discussion).' Journal of the Royal Statistical Society, Series B. 1995;57:301–337.
  • Efron B, Hastie T, Johnstone I, Tibshirani R. 'Least angle regression.' The Annals of Statistics. 2004;32:407–499.
  • Fan J, Li R. 'Variable selection via nonconcave penalized likelihood and its oracle properties.' Journal of the American Statistical Association. 2001;96:1348–1360.
  • Fan J, Li R. 'Statistical challenges with high dimensionality: Feature selection in knowledge discovery.' Proceedings of the Madrid International Congress of Mathematicians 2006. 2006;Vol. III:595–622.
  • Fan J, Lv J. 'Sure independence screening for ultra-high dimensional feature space.' Technical report, Department of Operations Research and Financial Engineering, Princeton University; 2007.
  • Fan J, Peng H. 'Nonconcave penalized likelihood with a diverging number of parameters.' The Annals of Statistics. 2004;32:928–961.
  • Fan J, Peng H, Huang T. 'Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency (with discussion).' Journal of the American Statistical Association. 2005;100:781–813.
  • Huber P. 'Robust regression: Asymptotics, conjectures and Monte Carlo.' The Annals of Statistics. 1973;1:799–821.
  • Knight K, Fu W. 'Asymptotics for lasso-type estimators.' The Annals of Statistics. 2000;28:1356–1378.
  • Lam C, Fan J. 'Profile-kernel likelihood inference with diverging number of parameters.' The Annals of Statistics. 2007, to appear.
  • Portnoy S. 'Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency.' The Annals of Statistics. 1984;12:1298–1309.
  • Tibshirani R. 'Regression shrinkage and selection via the lasso.' Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  • Wang H, Li R, Tsai C. 'Tuning parameter selectors for the smoothly clipped absolute deviation method.' Biometrika. 2007;94:553–568.
  • Zou H. 'The adaptive lasso and its oracle properties.' Journal of the American Statistical Association. 2006;101:1418–1429.
  • Zou H, Hastie T. 'Regularization and variable selection via the elastic net.' Journal of the Royal Statistical Society, Series B. 2005;67:301–320.
  • Zou H, Hastie T, Tibshirani R. 'On the degrees of freedom of the lasso.' The Annals of Statistics. 2007;35:2173–2192.