
J Am Stat Assoc. Author manuscript; available in PMC 2010 September 1.

Published in final edited form as:

J Am Stat Assoc. 2010 June 1; 105(490): 727–739.

doi: 10.1198/jasa.2010.tm09380

PMCID: PMC2913333

NIHMSID: NIHMS215171


Extracting grouping structure, or identifying homogenous subgroups of predictors, in regression is crucial for high-dimensional data analysis. One low-dimensional structure in particular, grouping, when captured in a regression model, enhances predictive performance and facilitates a model's interpretability. Grouping pursuit extracts homogenous subgroups of predictors most responsible for outcomes of a response. This is the case in gene network analysis, where grouping reveals gene functionalities with regard to progression of a disease. To address challenges in grouping pursuit, we introduce a novel homotopy method for computing an entire solution surface through regularization involving a piecewise linear penalty. This nonconvex and overcomplete penalty permits adaptive grouping and nearly unbiased estimation, and is treated with a novel concept of grouped subdifferentials and difference convex programming for efficient computation. Finally, the proposed method not only achieves high performance, as suggested by numerical analysis, but also enjoys the desired optimality with regard to grouping pursuit and prediction, as shown by our theoretical results.

Essential to high-dimensional data analysis is seeking a certain lower-dimensional structure in knowledge discovery, as in web mining. Extracting one kind of lower-dimensional structure, grouping, remains largely unexplored in regression. In gene network analysis, a large amount of current genetic knowledge has been organized in terms of networks, for instance, the Kyoto Encyclopedia of Genes and Genomes (KEGG), a collection of manually drawn pathway maps representing knowledge about molecular interactions and reactions. In such situations, extracting homogenous subnetworks from a network of dependent predictors, most responsible for predicting outcomes of a response, has been one key challenge of biomedical research. There, homogenous subnetworks of genes are usually estimated for understanding a disease's progression. The central issue this article addresses is automatic identification of homogenous subgroups in regression, what we call grouping pursuit.

Now consider a linear model in which response *Y _{i}* depends on a vector of predictors **x**_{i}:

$${Y}_{i}=\mu ({\mathit{\text{x}}}_{i})+{\mathit{\epsilon}}_{i},\phantom{\rule{1em}{0ex}}\mu (\mathit{\text{x}})={\mathit{\text{x}}}^{T}\phantom{\rule{0.2em}{0ex}}\mathit{\beta}=\sum _{j=1}^{p}{x}_{j}{\beta}_{j},\phantom{\rule{1em}{0ex}}E({\epsilon}_{i})=0,\phantom{\rule{1em}{0ex}}\mathit{\text{Var}}({\epsilon}_{i})={\sigma}^{2};\phantom{\rule{1em}{0ex}}i=1,\dots ,n,$$

(1)

where **β** = (β_{1}, …, β_{p})^{T} denotes the vector of regression coefficients.

Grouping pursuit seeks variance reduction in estimation while retaining roughly the same amount of bias, which is advantageous in high-dimensional analysis. First, it collapses predictors whose sample covariances with the residual are of similar values, for best predicting outcomes of **Y**; cf. Theorem 4. Moreover, it goes beyond the notion of feature selection, because it seeks not only a set of redundant predictors, or a single group of zero-coefficient predictors, but also additional homogenous subgroups for further variance reduction. As a result, it yields higher predictive performance. These aspects are confirmed by the numerical and theoretical results in Sections 4 and 5. Second, the price to be paid for adaptive grouping pursuit is the estimation of tuning parameters, which is small compared to the potential gain of a simpler model with higher predictive accuracy.
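For concreteness, model (1) with a grouped coefficient vector can be simulated as follows; all dimensions, group sizes, and coefficient values below are illustrative choices, not taken from the paper's examples:

```python
import numpy as np

# Illustrative sketch of model (1): Y_i = x_i^T beta + eps_i, with the
# true coefficients constant within groups (three groups, one of them zero).
rng = np.random.default_rng(1)
n, p = 200, 6
beta0 = np.repeat([2.0, -1.0, 0.0], [2, 2, 2])   # K0 = 3 distinct groups
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                  # predictors centered, as assumed
eps = 0.5 * rng.standard_normal(n)   # E(eps) = 0, Var(eps) = sigma^2 = 0.25
Y = X @ beta0 + eps
Y -= Y.mean()                        # response centered
```

An oracle that knew the grouping could collapse the six predictors into three summed columns and run ordinary least squares, which is the benchmark least squares estimate under the true grouping discussed in Section 4.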

Grouping pursuit considered here is one kind of supervised clustering. Papers that investigate grouping pursuit are those of Tibshirani et al. (2005), where the Fused Lasso is proposed using an *L*_{1}-penalty with respect to a certain serial order, and Bondell and Reich (2008), where the OSCAR penalty involves pairwise *L*_{∞}-penalties for grouping variables in terms of absolute values, in addition to variable selection. Grouping pursuit differs dramatically from feature selection for grouped predictors, in the sense that the former only groups predictors *without* removing redundancy, whereas the latter removes redundancy by encouraging grouped predictors to stay together in selection; see Yuan and Lin (2006), and Zhao, Rocha and Yu (2009).

Our primary objective is achieving high accuracy in both grouping and prediction through a computationally efficient method, which seems to be difficult, if not impossible, with existing methods, especially those based on enumeration. To achieve our objective, we employ the regularized least squares method with a piecewise linear nonconvex penalty. The penalty, to be introduced in (2), involves one thresholding parameter determining which pairs are shrunk towards a common group, working jointly with one regularization parameter for shrinkage towards an unknown location. These two tuning parameters combine thresholding with shrinkage to achieve adaptive grouping, which is otherwise not possible with shrinkage alone. The penalty is overcomplete in that the number of individual penalty terms may be redundant with regard to certain grouping structures, and it is continuous but with three nondifferentiable points, leading to a significant computational advantage, in addition to the desired optimality for grouping pursuit (Theorem 3 and Corollary 1).

Computationally, the proposed penalty poses great challenges in two aspects: (a) potential discontinuities and (b) overcompleteness of the penalty, for which an effective treatment does not seem to exist in the literature; see Friedman et al. (2007) about computational challenges for a pathwise coordinate method in this type of situation. To meet the challenges, we design a novel homotopy algorithm to compute the regularization solution surface. The algorithm uses a novel concept of grouped subdifferentials to deal with overcompleteness in tracking the process of grouping, and difference convex (DC) programming to treat discontinuities due to nonconvex minimization. This, together with a model selection routine for estimators that can be discontinuous, permits adaptive grouping pursuit.

Theoretically, we derive a finite-sample probability error bound for our DC estimator, which we call DCE, computed from the homotopy algorithm for grouping pursuit. On this basis, we prove that DCE is consistent with regard to grouping pursuit as well as reconstructing the unbiased least squares estimate under the true grouping, roughly for nearly exponentially many predictors in *n*, as long as
$\frac{\mathrm{log}\phantom{\rule{0.2em}{0ex}}p}{n}\to 0$; cf. Theorem 3 for details.

For subnetwork analysis, we apply our proposed method to study predictability of a protein-protein interaction (PPI) network of genes on the time to breast cancer metastasis through gene expression profiles. In Section 5.2, 27 homogenous subnetworks are identified through a Laplacian network weight vector, which surround three tumor suppressor genes TP53, BRCA1 and BRCA2 for metastasis. There, 17 disease genes identified in the study of Wang et al. (2005) belong to 5 groups containing 1, 1, 1, 1, and 13 disease genes, indicating gene functionalities with regard to breast cancer survivability.

This article is organized in seven sections. Section 2 introduces the proposed method and the homotopy algorithm. Section 3 is devoted to selection of the tuning parameters. Section 4 presents a theory concerning optimal properties of DCE in grouping pursuit and prediction, followed by numerical examples and an application to breast cancer data in Section 5. Section 6 discusses the proposed method. Finally, the appendix contains the technical proofs.

In (1), let the true coefficient vector
${\mathit{\beta}}^{0}={({\beta}_{1}^{0},\dots ,{\beta}_{p}^{0})}^{T}$ be
${({\alpha}_{1}^{0}\phantom{\rule{0.1em}{0ex}}{\mathbf{1}}_{|{\mathcal{G}}_{1}^{0}|}^{T}\phantom{\rule{0.1em}{0ex}},\dots ,{\alpha}_{{K}^{0}}^{0}\phantom{\rule{0.1em}{0ex}}{\mathbf{1}}_{|{\mathcal{G}}_{{K}^{0}}^{0}|}^{T})}^{T}$, where *K*^{0} is the number of distinct groups,
${\alpha}_{1}^{0}<\cdots <{\alpha}_{{K}^{0}}^{0}$, and
${\mathbf{1}}_{|{\mathcal{G}}_{1}^{0}|}$ denotes a vector of 1's with length
$|{\mathcal{G}}_{1}^{0}|$. Grouping pursuit, as defined earlier, estimates the true grouping
${\mathcal{G}}^{0}=({\mathcal{G}}_{1}^{0},\dots ,{\mathcal{G}}_{{K}^{0}}^{0})$ as well as
${\mathit{\alpha}}^{0}={({\alpha}_{1}^{0},\dots ,{\alpha}_{{K}^{0}}^{0})}^{T}$. Without loss of generality, assume that the response and predictors are centered, that is, **Y**^{T}**1** = 0 and (*x*_{1j}, …, *x _{nj}*)**1** = 0; *j* = 1, …, *p*.

Ideally, one may enumerate over all possible least squares regressions to identify the best grouping. However, the total number of all possible groupings, which is the *p*th Bell number (Rota, 1964), is much larger than the number of all possible subsets in feature selection, hence enumeration is computationally infeasible even for moderate *p*. For instance, the 10th Bell number is 115,975. To circumvent this difficulty, we develop an automatic nonconvex regularization method to obtain (1) accurate grouping, (2) the least squares estimate based on the true grouping, (3) an efficient homotopy algorithm, and (4) high predictive performance.
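The growth of the number of groupings can be checked with the standard Bell-triangle recurrence:

```python
def bell(n):
    """n-th Bell number via the Bell triangle (exact integer arithmetic)."""
    row = [1]
    for _ in range(n):
        # each new row starts with the last entry of the previous row
        new = [row[-1]]
        for x in row:
            new.append(new[-1] + x)
        row = new
    return row[0]

# The number of ways to group p = 10 predictors:
print(bell(10))  # -> 115975
```

Already at *p* = 20 the count exceeds 5 × 10^{13}, which is why the paper replaces enumeration with regularization.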

Our approach utilizes a penalty involving the pairwise comparisons {*β _{j}* − *β _{j′}* : 1 ≤ *j* < *j′* ≤ *p*}.

We now propose our penalized least squares criterion for automatic grouping pursuit:

$$S(\mathit{\beta})=\frac{1}{2n}\sum _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}{({Y}_{i}-{\mathit{\text{x}}}_{i}^{T}\mathit{\beta})}^{2}+{\mathrm{\lambda}}_{1}J(\mathit{\beta}),\phantom{\rule{1em}{0ex}}J(\mathit{\beta})=\sum _{j<j\prime}G({\beta}_{j}-{\beta}_{j\prime}),$$

(2)

where λ_{1} > 0 is the regularization parameter controlling the degree of grouping. For (2), no local or global minimizer can be attained at any of the nonsmooth locally concave points of *J*(**β**).
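The criterion (2) can be evaluated directly. The sketch below assumes the clipped-*L*_{1} form *G*(*z*) = min(|*z*|, λ_{2}), which has the three nondifferentiable points *z* = 0, ±λ_{2} and matches the difference-of-convex split *S*_{1} − *S*_{2} used in Section 2.2; this form of *G* is our reading, not a quote of the paper's display:

```python
import numpy as np

def G(z, lam2):
    # assumed clipped-L1 penalty: |z| below the threshold, flat above it
    return min(abs(z), lam2)

def S(beta, X, Y, lam1, lam2):
    """Penalized least-squares criterion (2): fit term + lam1 * J(beta)."""
    n = len(Y)
    fit = 0.5 / n * np.sum((Y - X @ beta) ** 2)
    p = len(beta)
    # overcomplete penalty J: one term per pair j < j'
    J = sum(G(beta[j] - beta[k], lam2) for j in range(p) for k in range(j + 1, p))
    return fit + lam1 * J
```

Because *G* is flat beyond λ_{2}, pairs whose coefficients already differ by more than λ_{2} incur no extra cost, which is what permits nearly unbiased estimation across well-separated groups.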

**Lemma 1** Let *h*(·) be any differentiable function in ℝ^{p} and
$\widehat{\mathit{\beta}}={({\widehat{\beta}}_{1},\dots ,{\widehat{\beta}}_{p})}^{T}$ be a local minimizer of *f*(**β**) = *h*(**β**) + λ_{1}*J*(**β**).

We now introduce a novel concept of grouped subdifferentials for a convex function, which constitutes a basis of our homotopy algorithm for tracking the process of grouping.

A subgradient of a convex function *f*(** β**) with respect to

To proceed, write $\widehat{\mathit{\beta}}$ as
${({\widehat{\alpha}}_{1}{\mathbf{1}}_{|{\mathcal{G}}_{1}|}^{T}\phantom{\rule{0.1em}{0ex}},\dots ,{\widehat{\alpha}}_{K}{\mathbf{1}}_{|{\mathcal{G}}_{K}|}^{T})}^{T}$, where the index in each group is arranged increasingly, ${\widehat{\alpha}}_{1}<\cdots <{\widehat{\alpha}}_{K}$. Define

Ordinarily, group splitting can be tracked through certain transition conditions for {*b _{jj′}* :

$$|{B}_{A}|\le |A|(|{\mathcal{G}}_{k}|-|A|).$$

(3)

Subsequently, we work with {*B _{A}* :

This section treats the non-differentiable nonconvex minimization (2) through DC programming, a principle for nonconvex minimization that relies on decomposing an objective function into a difference of two convex functions. The reader may consult An and Tao (1997) for DC programming. Through this DC method, we will design in Section 2.3 a novel homotopy algorithm for a DC solution of (2), that is, a solution obtained through DC programming.

First, we decompose *S*(** β**) in (2) into a difference of two convex functions
${S}_{1}(\mathit{\beta})=\frac{1}{2n}{\sum}_{i=1}^{n}{({Y}_{i}-{\mathit{\text{x}}}_{i}^{T}\mathit{\beta})}^{2}+{\mathrm{\lambda}}_{1}{\sum}_{j<j\prime}|{\beta}_{j}-{\beta}_{j\prime}|$ and ${S}_{2}(\mathit{\beta})={\mathrm{\lambda}}_{1}{\sum}_{j<j\prime}{(|{\beta}_{j}-{\beta}_{j\prime}|-{\mathrm{\lambda}}_{2})}_{+}$, where ${(z)}_{+}=\mathrm{max}(z,0)$.

Second, we construct a sequence of upper approximations by successively replacing *S*_{2}(**β**) at iteration *m* by its linearization at the previous iterate:

$${S}_{1}\left(\mathit{\beta}\right)-{S}_{2}\left({\widehat{\widehat{\mathit{\beta}}}}^{(m-1)}\phantom{\rule{0.2em}{0ex}}\left({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2}\right)\right)-{\left(\mathit{\beta}-{\widehat{\widehat{\mathit{\beta}}}}^{(m-1)}\phantom{\rule{0.2em}{0ex}}\left({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2}\right)\right)}^{T}\nabla {S}_{2}\left({\widehat{\widehat{\mathit{\beta}}}}^{(m-1)}\left({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2}\right)\right),$$

(4)

where ∇ is the subgradient operator, ${\widehat{\widehat{\mathit{\beta}}}}^{(m-1)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})$ is the minimizer of (4) at iteration *m* − 1, and ${\widehat{\widehat{\mathit{\beta}}}}^{(-1)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})\equiv \mathbf{0}$. The last term in (4) becomes
${\mathrm{\lambda}}_{1}{\sum}_{j=1}^{p}\phantom{\rule{0.2em}{0ex}}\left({\beta}_{j}-{\widehat{\widehat{\beta}}}_{j}^{(m-1)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})\right)\times {\sum}_{j\prime :j\prime \ne j}\nabla {G}_{2}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\widehat{\beta}}}_{j}^{(m-1)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})-{\widehat{\widehat{\beta}}}_{j\prime}^{(m-1)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})\right)$ with $\nabla {G}_{2}(z)=\text{Sign}(z)I(|z|>{\mathrm{\lambda}}_{2})$ being a subgradient of ${G}_{2}$ at *z*.
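On a one-dimensional toy problem (a hedged sketch, not the paper's surface algorithm), the iteration (4) has a closed form: minimizing ½(*y* − *b*)² + λ₁|*b*| − λ₁∇*G*₂(*b*^{(m−1)})·*b* is a soft-thresholding update, and the subgradient term cancels the shrinkage once |*b*| exceeds λ₂:

```python
def soft(z, t):
    # soft-thresholding: the minimizer of 0.5*(z - b)^2 + t*|b|
    return (abs(z) - t) * (1 if z > 0 else -1) if abs(z) > t else 0.0

def dc_estimate(y, lam1, lam2, iters=20):
    """DC iterations for min_b 0.5*(y - b)^2 + lam1*min(|b|, lam2)."""
    b = 0.0
    for _ in range(iters):
        # subgradient of the convex part G2(b) = (|b| - lam2)_+
        g = (1.0 if b > 0 else -1.0) if abs(b) > lam2 else 0.0
        b = soft(y + lam1 * g, lam1)
    return b

print(dc_estimate(2.0, 0.5, 0.3))  # -> 2.0 (shrinkage cancels: nearly unbiased)
```

Starting from 0, the first step gives the biased Lasso value 1.5; the next step adds back λ₁ and converges to the unbiased value 2.0, illustrating the nearly unbiased estimation claimed for the clipped penalty.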

Third, we utilize the grouped subdifferentials to track the entire solution surface iteratively. One technical difficulty is that ${\widehat{\widehat{\mathit{\beta}}}}^{(m)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})$ would have jumps in λ_{1} if ${\widehat{\widehat{\mathit{\beta}}}}^{(0)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})$ were piecewise linear in λ_{1} given λ_{2}, in view of (6) and (7) in Theorem 1. This is undesirable for tracking by continuity through homotopy. For grouping pursuit, we therefore replace ${\widehat{\widehat{\mathit{\beta}}}}^{(0)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})$ in (4) by ${\widehat{\widehat{\mathit{\beta}}}}^{(0)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})$, where rough tuning for λ_{0} suffices because a DC algorithm is not sensitive to an initial value (An and Tao, 1997). This choice leads to a piecewise linear and continuous minimizer ${\widehat{\mathit{\beta}}}^{(1)}(\mathbf{\lambda})$ of (4) in λ_{1} given (λ_{0}, λ_{2}), where **λ** = (λ_{0}, λ_{1}, λ_{2})^{T}. Successively replacing ${\widehat{\widehat{\mathit{\beta}}}}^{(m-1)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})$ in (4) by ${\widehat{\mathit{\beta}}}^{(m-1)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})$ for *m* ≥ 1, we obtain a modified version of (4):

$${S}^{(m)}\phantom{\rule{0.1em}{0ex}}(\mathit{\beta})={S}_{1}\phantom{\rule{0.1em}{0ex}}(\mathit{\beta})-{S}_{2}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\mathit{\beta}}}^{(m-1)}\phantom{\rule{0.2em}{0ex}}\left({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2}\right)\right)-{\left(\mathit{\beta}-{\widehat{\mathit{\beta}}}^{(m-1)}\phantom{\rule{0.2em}{0ex}}\left({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2}\right)\right)}^{T}\nabla {S}_{2}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\mathit{\beta}}}^{(m-1)}\phantom{\rule{0.2em}{0ex}}\left({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2}\right)\right),$$

(5)

which yields its minimizer ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$ and the estimated grouping ${\mathcal{G}}^{(m)}(\mathbf{\lambda})$. As suggested by Theorems 1 and 2, ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$ converges in finitely many steps. Most importantly, the iterative scheme yields an estimator having the desired properties of a global minimizer, cf. Theorem 3.

Given grouping $\mathcal{G}=({\mathcal{G}}_{1},\dots ,{\mathcal{G}}_{K})$, let ${\mathit{\text{Z}}}_{\mathcal{G}}=({\mathit{\text{z}}}_{1},\dots ,{\mathit{\text{z}}}_{K})$ be an *n* × *K* matrix with ${\mathit{\text{z}}}_{k}={\mathit{\text{X}}}_{k}\mathbf{1}$, where ${\mathit{\text{X}}}_{k}$ is the design matrix formed by the predictors of ${\mathcal{G}}_{k}$; *k* = 1, …, *K*.
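The collapsed design Z_𝒢 can be formed by summing each group's columns; a minimal sketch:

```python
import numpy as np

def collapse_design(X, groups):
    """Z_G: n x K matrix whose k-th column is X_{G_k} @ 1 (row sums per group)."""
    return np.column_stack([X[:, g].sum(axis=1) for g in groups])

X = np.arange(12.0).reshape(3, 4)          # toy 3 x 4 design
Z = collapse_design(X, [[0, 1], [2, 3]])   # grouping G = ({1, 2}, {3, 4})
```

Regressing **Y** on Z then estimates one coefficient per group, which is the form taken by (6) when the penalty term vanishes.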

**Theorem 1** Assume that
${\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}\phantom{\rule{0.2em}{0ex}}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}\phantom{\rule{0.2em}{0ex}}(\mathbf{\lambda})}$ is invertible. Then ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$ defined by (5) is piecewise linear in (**Y**,

$${\widehat{\mathit{\alpha}}}^{\left(m\right)}(\mathbf{\lambda})\equiv {\left({\widehat{\alpha}}_{1}^{\left(m\right)}(\mathbf{\lambda}),\dots ,{\widehat{\alpha}}_{{K}^{\left(m\right)}(\mathbf{\lambda})}^{\left(m\right)}(\mathbf{\lambda})\right)}^{T}={\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}\right)}^{-1}\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}\mathit{\text{Y}}-n{\mathrm{\lambda}}_{1}{\mathit{\delta}}^{(m)}(\mathbf{\lambda})\right),$$

(6)

where ${\mathit{\delta}}^{(m)}(\mathbf{\lambda})\equiv {\left({\delta}_{1}^{(m)}(\mathbf{\lambda}),\dots ,{\delta}_{{K}^{(m)}(\mathbf{\lambda})}^{(m)}(\mathbf{\lambda})\right)}^{T}$, ${\delta}_{k}^{(m)}(\mathbf{\lambda})\equiv {\sum}_{j\in {\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})}{\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda})$, and

$${\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda})\equiv \sum _{{j}^{\prime}:{j}^{\prime}\ne j}\phantom{\rule{0.2em}{0ex}}\{\text{sign}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\beta}}_{j}^{(m)}(\mathbf{\lambda})-{\widehat{\beta}}_{{j}^{\prime}}^{(m)}(\mathbf{\lambda})\right)-\nabla {G}_{2}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\beta}}_{j}^{(m-1)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})-{\widehat{\beta}}_{{j}^{\prime}}^{(m-1)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})\right)\}.$$

(7)

Moreover, for $j\in {\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})$ with $|{\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})|\ge 2$, and *k* = 1, …, *K*^{(m)}(**λ**),

$${B}_{j}^{(m)}(\mathbf{\lambda})=\frac{1}{n{\mathrm{\lambda}}_{1}}{\mathit{\text{x}}}_{j}^{T}\phantom{\rule{0.2em}{0ex}}\left(\mathit{\text{I}}-{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}\phantom{\rule{0.2em}{0ex}}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}\right)}^{-1}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}\right)\mathit{\text{Y}}+{\mathit{\text{x}}}_{j}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}\right)}^{-1}{\mathit{\delta}}^{(m)}(\mathbf{\lambda})-{\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda}).$$

(8)

Theorem 1 reveals two important aspects of ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$ from (5). First, ${\widehat{\mathit{\alpha}}}^{(m)}(\mathbf{\lambda})$ and
${B}_{j}^{(m)}(\mathbf{\lambda})$ are continuous and piecewise linear in λ_{1} and
${\mathrm{\lambda}}_{1}^{-1}$, piecewise constant in λ_{2} with possible jumps, and piecewise linear in **Y** with possible jumps. In other words,
$({\widehat{\mathit{\alpha}}}^{(m)}(\mathbf{\lambda}),{B}_{j}^{(m)}(\mathbf{\lambda}))$ is continuous in λ_{1} given (λ_{0}, λ_{2}).

One efficient computational tool is a homotopy method (Allgower and Georg, 2003; Wu et al., 2009), which utilizes continuity of a solution in **λ** to compute the entire solution surface simultaneously. To our knowledge, homotopy methods for nonconvex problems have not yet received attention in the literature. This section develops a homotopy method for computing a regularization solution surface for the nonconvex minimization (2) through DC programming. One major computational challenge is that the solution may be piecewise linear with jumps in (**Y**,

The main ingredients of our DC homotopy algorithm are (1) iterating over entire DC solution surfaces, (2) utilizing the piecewise linear continuity of
$({\widehat{\mathit{\alpha}}}^{(m)}(\mathbf{\lambda}),{B}_{j}^{(m)}(\mathbf{\lambda}))$ in
$({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{1}^{-1})$ for given (λ_{0}, λ_{2}, *m*), and (3) tracking transition points (joints of a piecewise linear function) through the grouped subdifferentials. This algorithm permits efficient computation of a DC solution for the nonconvex minimization (2) with an overcomplete penalty, which is otherwise difficult to treat. To compute ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$, we proceed as follows. First, we fix one evaluation point of (λ_{0}, λ_{2}), then move along the path from λ_{1} = ∞ towards λ_{1} = 0 given (λ_{0}, λ_{2}). By Theorem 1, ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$ is piecewise linear and continuous in λ_{1} given (λ_{0}, λ_{2}). Along this path, we compute the transition points at which the derivative of ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$ with respect to λ_{1} changes. Second, we move to the other evaluation points of (λ_{0}, λ_{2}) and repeat the above process.

For given (λ_{0}, λ_{2}), transition in λ_{1} occurs when either of the following conditions is met:

- (A) Merging: Groups ${\mathcal{G}}_{l}^{(m)}(\mathbf{\lambda})$ and ${\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})$ are combined at **λ** when ${\widehat{\alpha}}_{l}^{(m)}(\mathbf{\lambda})={\widehat{\alpha}}_{k}^{(m)}(\mathbf{\lambda})$;
- (B) Splitting: Group ${\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})$ is split into two disjoint sets *A*_{1} and *A*_{2} with ${A}_{1}\cup {A}_{2}={\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})$ at **λ**, when $|{B}_{{A}_{1}}^{(m)}(\mathbf{\lambda})|=|{B}_{{A}_{2}}^{(m)}(\mathbf{\lambda})|=|{A}_{1}||{A}_{2}|$, according to (3).

In (A) and (B), at a transition point, two or more groups may be merged, and a single group may be split into two or multiple subgroups.

We now describe the basic idea for computing $\{{\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda}):{\mathrm{\lambda}}_{1}>0\}$ given (λ_{0}, λ_{2}, *m*). From (6), we track
$({\mathcal{G}}^{(m)}(\mathbf{\lambda}),{\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda}))$ along the path from λ_{1} = ∞ towards λ_{1} = 0 for given (λ_{0}, λ_{2}). Let
${\mathrm{\lambda}}_{1}^{\text{cur}}$ be the current transition point. Our algorithm successively identifies the next transition point
${\mathrm{\lambda}}_{1}^{\text{next}}$ along the path. For notational ease, we use $(\mathcal{G}=({\mathcal{G}}_{1},\dots ,{\mathcal{G}}_{K}),{\mathrm{\Delta}}_{j})$ to denote the current
$({\mathcal{G}}^{(m)}(\mathbf{\lambda}),{\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda}))$ after the current transition; *j* = 1, …, *p*. Note that
$({\mathcal{G}}^{(m)}(\mathbf{\lambda}),{\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda}))$, *j* = 1, …, *p*, remain unchanged before the next transition is reached.

For merging in (A), we compute potential merge points:

$${m}_{kl}\equiv \frac{{\left({\mathit{\text{e}}}_{k}-{\mathit{\text{e}}}_{l}\right)}^{T}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{\mathcal{G}}^{T}{\mathit{\text{Z}}}_{\mathcal{G}}\right)}^{-1}{\mathit{\text{Z}}}_{\mathcal{G}}^{T}\mathit{\text{Y}}}{n{\left({\mathit{\text{e}}}_{k}-{\mathit{\text{e}}}_{l}\right)}^{T}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{\mathcal{G}}^{T}{\mathit{\text{Z}}}_{\mathcal{G}}\right)}^{-1}\mathit{\delta}},\phantom{\rule{1em}{0ex}}\mathit{\delta}={\left({\delta}_{1},\dots ,{\delta}_{K}\right)}^{T},\phantom{\rule{1em}{0ex}}{\delta}_{k}=\sum _{j\in {\mathcal{G}}_{k}}{\mathrm{\Delta}}_{j},$$

(9)

1 ≤ *k* < *l* ≤ *K*, where ${\mathit{\text{e}}}_{k}$ is the *k*th unit vector in ℝ^{K}. Then

$${\mathrm{\lambda}}_{1,A}=max\phantom{\rule{0.2em}{0ex}}\{{m}_{kl}\in (0,{\mathrm{\lambda}}_{1}^{\text{cur}})\}$$

(10)

is a potential transition point at which ${\mathcal{G}}_{{k}^{\prime}}$ and ${\mathcal{G}}_{{l}^{\prime}}$ are combined into one group. Define
${\mathrm{\lambda}}_{1,A}=0\phantom{\rule{0.5em}{0ex}}\text{if}\phantom{\rule{0.5em}{0ex}}\{{m}_{kl}\in (0,{\mathrm{\lambda}}_{1}^{\text{cur}})\}=\varnothing$, with $\varnothing$ denoting the empty set.

For splitting in (B), we utilize (3) and the subdifferentials in (8) to compute, for each *k* = 1, …, *K*, the largest
${\mathrm{\lambda}}_{1}\in ({\mathrm{\lambda}}_{1,A},{\mathrm{\lambda}}_{1}^{\text{cur}})$ and $A\subset {\mathcal{G}}_{k}$ with |*A*| < |${\mathcal{G}}_{k}$|/2 such that

$${L}_{k}^{+}\phantom{\rule{0.1em}{0ex}}({\mathrm{\lambda}}_{1},A)\cdot {L}_{k}^{-}\phantom{\rule{0.1em}{0ex}}({\mathrm{\lambda}}_{1},A)=0,$$

(11)

where
${L}_{k}^{\pm}\phantom{\rule{0.1em}{0ex}}({\mathrm{\lambda}}_{1},A)\equiv {\sum}_{j\in A}{B}_{j}^{(m)}(\mathbf{\lambda})\mp |A|(|{\mathcal{G}}_{k}|-|A|)$,
${\eta}_{j}\equiv {\mathit{\text{x}}}_{j}^{T}{\mathit{\text{Z}}}_{\mathcal{G}}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{\mathcal{G}}^{T}\phantom{\rule{0.1em}{0ex}}{\mathit{\text{Z}}}_{\mathcal{G}}\right)}^{-1}\mathit{\delta}-{\mathrm{\Delta}}_{j}$, and
${\xi}_{j}\equiv {\mathit{\text{x}}}_{j}^{T}\phantom{\rule{0.2em}{0ex}}\left(I-{\mathit{\text{Z}}}_{\mathcal{G}}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{\mathcal{G}}^{T}\phantom{\rule{0.1em}{0ex}}{\mathit{\text{Z}}}_{\mathcal{G}}\right)}^{-1}\phantom{\rule{0.2em}{0ex}}{\mathit{\text{Z}}}_{\mathcal{G}}^{T}\right)\mathit{\text{Y}}$. It follows from (8) that
${B}_{j}^{(m)}(\mathbf{\lambda})=\frac{1}{n{\mathrm{\lambda}}_{1}}{\xi}_{j}+{\eta}_{j}$ before the next transition occurs, and hence that
${L}_{k}^{\pm}\phantom{\rule{0.2em}{0ex}}({\mathrm{\lambda}}_{1},{\mathcal{G}}_{k})={\sum}_{j\in {\mathcal{G}}_{k}}\left(\frac{1}{n{\mathrm{\lambda}}_{1}}{\xi}_{j}+{\eta}_{j}\right)$ for any λ_{1}. Unfortunately, solving (11) through enumeration over all subsets $A\subset {\mathcal{G}}_{k}$ is infeasible. In Algorithm 1 below, we develop an efficient strategy utilizing piecewise linearity of
${L}_{k}^{\pm}({\mathrm{\lambda}}_{1},A)$ in
${\mathrm{\lambda}}_{1}^{-1}$ for computing the potential transition point
${\mathrm{\lambda}}_{1,B}\equiv max\{{s}_{k}\in ({\mathrm{\lambda}}_{1,A},{\mathrm{\lambda}}_{1}^{\text{cur}}]:k=1,\dots ,K\}$, as well as its corresponding grouping, where *s _{k}* is the solution of (11) for group *k*.

To describe our strategy for solving (11), let
${A}_{k+}^{(\ell)}$ and
${A}_{k-}^{(\ell)}$ be the two subsets of ${\mathcal{G}}_{k}$ of size $\ell$ corresponding to the largest and smallest values of
${D}_{k+}^{(\ell)}$ and
${D}_{k-}^{(\ell)}$, respectively, for $\ell =1,\dots ,|{\mathcal{G}}_{k}|$ and *k* = 1, …, *K*. Since
${L}_{k}^{\pm}({\mathrm{\lambda}}_{1},A)$ is extremized over subsets of size $\ell$ at ${A}_{k\pm}^{(\ell)}$, we can refine our search to
$\{{A}_{k\pm}^{(\ell)}\}$ for solving (11). Note that
${L}_{k}^{+}\phantom{\rule{0.2em}{0ex}}({\mathrm{\lambda}}_{1},{A}_{k+}^{(\ell)})\le 0$ and
${L}_{k}^{-}\phantom{\rule{0.2em}{0ex}}({\mathrm{\lambda}}_{1},{A}_{k-}^{(\ell)})\ge 0$ by the definition of the
${B}_{j}^{(m)}(\mathbf{\lambda})$'s for
${\mathrm{\lambda}}_{1}\in [{\mathrm{\lambda}}_{1}^{\text{next}},{\mathrm{\lambda}}_{1}^{\text{cur}}]$. For *k* = 1, …, *K* and $\ell =1,\dots ,[|{\mathcal{G}}_{k}|/2]$, we seek the first zero-crossing λ_{1} values of
${L}_{k}^{\pm}\phantom{\rule{0.2em}{0ex}}({\mathrm{\lambda}}_{1},{A}_{k\pm}^{(\ell)})$, denoted as
${s}_{k\ell}^{\pm}$, as λ_{1} decreases from
${\mathrm{\lambda}}_{1}^{\text{cur}}$. For each *k* = 1, …, *K*, we start with $\ell =1$ and compute

$${s}_{k1}^{\pm}=max\phantom{\rule{0.2em}{0ex}}\{\frac{{\xi}_{j}}{\pm (|{\mathcal{G}}_{k}|-1)-{\eta}_{j}}\in ({\mathrm{\lambda}}_{1,A},{\mathrm{\lambda}}_{1}^{\text{cur}}]:j=1,\dots ,p\}.$$

(12)

For $\ell =2,\dots ,[|{\mathcal{G}}_{k}|/2]$, we observe that elements in
${A}_{k\pm}^{(\ell)}$ need to be updated as λ_{1} decreases, due to rank changes in
$\{\frac{1}{n{\mathrm{\lambda}}_{1}}{\xi}_{j}+{\eta}_{j}:j\in {\mathcal{G}}_{k}\}$. This occurs at the switching points:

$${h}_{j{j}^{\prime}}\equiv \frac{{\xi}_{j}-{\xi}_{{j}^{\prime}}}{{\eta}_{{j}^{\prime}}-{\eta}_{j}};\phantom{\rule{1em}{0ex}}\text{at which}\phantom{\rule{0.4em}{0ex}}\frac{1}{n{\mathrm{\lambda}}_{1}}{\xi}_{j}+{\eta}_{j}=\frac{1}{n{\mathrm{\lambda}}_{1}}{\xi}_{{j}^{\prime}}+{\eta}_{{j}^{\prime}};\phantom{\rule{1em}{0ex}}1\le j<{j}^{\prime}\le p.$$

(13)
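Each quantity (1/(*n*λ₁))ξ_j + η_j is linear in 1/λ₁, so two of them cross at the value given by (13); a small check, with made-up ξ and η values:

```python
def switch_point(xi_j, xi_jp, eta_j, eta_jp):
    """Value of n*lambda_1 at which the two lines xi/(n*lam1) + eta cross."""
    return (xi_j - xi_jp) / (eta_jp - eta_j)

h = switch_point(2.0, 1.0, 0.5, 1.0)   # crossing at n*lambda_1 = 2.0
# at that point the two ranked quantities coincide:
assert abs((2.0 / h + 0.5) - (1.0 / h + 1.0)) < 1e-12
```

Between consecutive switching points the ranking of the quantities is fixed, so the extreme subsets need only be recomputed at these finitely many values.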

On this basis,
${s}_{k\ell}^{\pm}$ can be computed. First, calculate the largest ${\mathrm{\lambda}}_{1}$ in the candidate set $H\equiv \{{\mathrm{\lambda}}_{1,A}\}\cup \{{h}_{j{j}^{\prime}}:1\le j<{j}^{\prime}\le p\}$.

Algorithm 1 computes the next transition point, the updated grouping, and the Δ_{j}'s.

Given the current grouping $\mathcal{G}$ and the Δ_{j}'s, with
${\mathit{\text{Z}}}_{\mathcal{G}}^{T}\phantom{\rule{0.2em}{0ex}}{\mathit{\text{Z}}}_{\mathcal{G}}$ invertible, proceed as follows.

**Step 1** (Potential transition for merging): Compute λ_{1,A} as defined in (10), as well as the corresponding grouping.

**Step 3** (Splitting points): Starting with $\ell =1$, and for *k* = 1, …, *K*, we

- compute the largest ${\mathrm{\lambda}}_{1}\in H$ such that ${L}_{k}^{+}({\mathrm{\lambda}}_{1},{A}_{k+}^{(\ell)})\le 0$, and the largest ${\mathrm{\lambda}}_{1}\in H$ such that ${L}_{k}^{-}({\mathrm{\lambda}}_{1},{A}_{k-}^{(\ell)})\ge 0$, where bisection or Fibonacci search (e.g., Gill, Murray and Wright, 1981) may be applied;
- interpolate ${L}_{k}^{+}({\mathrm{\lambda}}_{1},{A}_{k+}^{(\ell)})$ linearly to obtain ${s}_{k\ell}^{+}$ satisfying ${L}_{k}^{+}\phantom{\rule{0.2em}{0ex}}({s}_{k\ell}^{+},{A}_{k+}^{(\ell)})=0$; set ${s}_{k\ell}^{+}=0$ if no such zero crossing exists;
- compute ${s}_{k\ell}$ and the corresponding index set;
- if ${s}_{k,\ell -1}>{s}_{k,\ell}$ and $\ell \le [|{\mathcal{G}}_{k}|/2]-1$, then go to Step 3 with $\ell$ replaced by $\ell +1$. Otherwise, set ${s}_{k,\ell +1}=\cdots ={s}_{k,[|{\mathcal{G}}_{k}|/2]}=0$.

**Step 4** (Potential transition for splitting): Compute ${\mathrm{\lambda}}_{1,B}=max\phantom{\rule{0.2em}{0ex}}\{{s}_{k\ell}\}$, as well as the corresponding grouping.

**Step 5** (Transition): Compute ${\mathrm{\lambda}}_{1}^{\text{next}}=max\{{\mathrm{\lambda}}_{1,A},{\mathrm{\lambda}}_{1,B}\}$. If ${\mathrm{\lambda}}_{1}^{\text{next}}={\mathrm{\lambda}}_{1,A}>0$, two groups are merged at ${\mathrm{\lambda}}_{1}^{\text{next}}$. If ${\mathrm{\lambda}}_{1}^{\text{next}}={\mathrm{\lambda}}_{1,B}>0$, a group is split into two at ${\mathrm{\lambda}}_{1}^{\text{next}}$. Update $\mathcal{G}$ and the Δ_{j}'s. If ${\mathrm{\lambda}}_{1}^{\text{next}}=0$, no further transition can be obtained.

**Step 1** (Parameter initialization): Specify the upper bound parameter *K** and the evaluation points of (λ_{0}, λ_{2}), where *K** ≤ min{*n*, *p*}.

**Step 2** (Initialization for DC iterations): Given (λ_{0}, λ_{2}), compute ${\widehat{\mathit{\beta}}}^{(0)}({\mathrm{\lambda}}_{0},+\infty ,{\mathrm{\lambda}}_{2})$ and the corresponding $\mathcal{G}$ and Δ_{j}'s by solving (5) with *m* = 0. Compute ${\widehat{\mathit{\beta}}}^{(0)}(\mathbf{\lambda})$ along the path from λ_{1} = ∞ to λ_{1} = 0 until $|{\mathcal{G}}^{(0)}(\mathbf{\lambda})|$ = *K** using Algorithm 1, while holding (λ_{0}, λ_{2}) fixed.

**Step 3** (DC iterations): Starting from *m* = 1, compute ${\widehat{\mathit{\beta}}}^{(m)}({\mathrm{\lambda}}_{0},+\infty ,{\mathrm{\lambda}}_{2})$ and the corresponding $\mathcal{G}$ and Δ_{j}'s by solving (5); then successively compute ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$ along the path from λ_{1} = ∞ to λ_{1} = 0 until $|\mathcal{G}|$ = *K** using Algorithm 1.

**Step 4** (Stopping rule): If *S*(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$) − *S*(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) > 0, then go to Step 3 with *m* replaced by *m* + 1. Otherwise, move to the next evaluation point of (λ_{0}, λ_{2}) and go to Step 2, until all the evaluation points have been computed.

Denote by *m** the termination step of Algorithm 2, which may depend on (λ_{0}, λ_{2}). Our DC estimate, DCE, of **β** is ${\widehat{\mathit{\beta}}}^{({m}^{*})}(\mathbf{\lambda})$.

**Proposition 1** (Computational properties). For Algorithm 1,
${L}_{k}^{\pm}({\mathrm{\lambda}}_{1},{A}_{k\pm}^{(\ell)})$ is continuous, piecewise linear, and strictly monotone in λ_{1}; *k* = 1, …, *K*, $\ell =1,\dots ,[|{\mathcal{G}}_{k}|/2]$. Moreover, the computational complexities of Algorithms 1 and 2 are no greater than *O*(*p*^{2}(log *p* + *n*)) and *O*(*m***n***p*^{2}(log *p* + *n*)), respectively, where *n** is the number of transition points.

In general, it is difficult to bound *n** precisely. However, an application of a heuristic argument similar to that of Rosset and Zhu (2007, Section 3.2, p. 1019) suggests that *n** is *O*(min{*n*, *p*}) on average for group combining and splitting.

**Theorem 2** (Computation). Assume that
${\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}$ is invertible for *m* ≤ *m**. Then the solution of Algorithm 2 is unique, and the sequence *S*(${\widehat{\mathit{\beta}}}^{(m)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})$) decreases strictly in *m* unless ${\widehat{\mathit{\beta}}}^{(m)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})={\widehat{\mathit{\beta}}}^{(m-1)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})$. In addition, Algorithm 2 terminates in finitely many steps, i.e., *m** < ∞, with

$${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})={\widehat{\mathit{\beta}}}^{({m}^{*})}(\mathbf{\lambda})$$

(14)

for all **λ** over the evaluation region of **λ** and all *m* ≥ *m**.

Two distinctive properties of ${\widehat{\mathit{\beta}}}^{({m}^{*})}(\mathbf{\lambda})$ are revealed by (14), leading to fast convergence and a sharp consistency result for $\widehat{\mathit{\beta}}(\mathbf{\lambda})$. First, the stopping rule of Algorithm 2, which is enforced at one fixed λ_{0}, controls the entire surface for all (λ_{1}, λ_{2}) simultaneously, owing to the replacement of ${\widehat{\widehat{\mathit{\beta}}}}^{(m-1)}({\mathrm{\lambda}}_{1},{\mathrm{\lambda}}_{2})$ by ${\widehat{\mathit{\beta}}}^{(m-1)}({\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{0},{\mathrm{\lambda}}_{2})$ for iteration *m* in (4). Second, the iteration process terminates finitely with ${\widehat{\mathit{\beta}}}^{({m}^{*})}(\mathbf{\lambda})$ satisfying (14), because of the step function of ${S}_{2}({\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda}))$ resulting from the locally concave points *z* = ±λ_{2} of *G*(*z*). Most critically, (14) is not expected for any penalty that is not piecewise linear with non-differentiable but continuous points.

Selection of tuning parameters is important for DCE, which involves **λ** = (λ_{0}, λ_{1}, λ_{2}). In (1), the predictive performance of estimator $\widehat{\mathit{\beta}}(\mathbf{\lambda})$ is measured by MSE($\widehat{\mathit{\beta}}(\mathbf{\lambda})$), defined as
$\frac{1}{n}EL\left(\widehat{\mathit{\beta}}(\mathbf{\lambda}),{\mathit{\beta}}^{0}\right)$, where
$L\left(\widehat{\mathit{\beta}}(\mathbf{\lambda}),{\mathit{\beta}}^{0}\right)={\sum}_{i=1}^{n}{\left(\widehat{\mu}(\mathbf{\lambda},{\mathit{\text{x}}}_{i})-{\mu}^{0}({\mathit{\text{x}}}_{i})\right)}^{2}$, $\widehat{\mu}(\mathbf{\lambda},\mathit{\text{x}})={\mathit{\text{x}}}^{T}\widehat{\mathit{\beta}}(\mathbf{\lambda})$, and ${\mu}^{0}(\mathit{\text{x}})={\mathit{\text{x}}}^{T}{\mathit{\beta}}^{0}$.

One critical aspect of tuning is that DCE is a piecewise continuous estimator with jumps in **Y**. Therefore, only model selection routines that allow for estimators with discontinuities may be used for tuning of DCE. For instance, cross-validation and the generalized degrees of freedom (GDF; Shen and Huang, 2006) are applicable, but Stein's unbiased risk estimator (Stein, 1981) is not suited because of its requirement of continuity. The tuning parameters are then estimated by minimizing the model selection criterion.
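Minimizing a selection criterion over a grid of tuning values can be sketched as follows; the sketch uses plain K-fold cross-validation with a stand-in ridge fitter, since the paper's DCE and its GDF criterion would be plugged in instead, and all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 1.0, 1.0, 0.0, 0.0]) + 0.5 * rng.standard_normal(n)

def fit(X_tr, y_tr, lam):
    # stand-in ridge fitter; the DCE of Section 2 would replace this
    k = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(k), X_tr.T @ y_tr)

def cv_error(lam, folds=5):
    # K-fold cross-validated prediction error for one tuning value
    errs = []
    for f in range(folds):
        test = np.arange(f, n, folds)
        train = np.setdiff1d(np.arange(n), test)
        b = fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ b) ** 2))
    return float(np.mean(errs))

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=cv_error)   # tuning value minimizing the criterion
```

Cross-validation requires no continuity of the estimator in **Y**, which is why it remains valid for the piecewise continuous DCE.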

In practice, *σ*^{2} needs to be estimated when it is unknown. In the literature, there have been many proposals for the case of *p* < *n*; for instance, *σ*^{2} can be estimated by the residual sum of squares over (*n* − *p*). In general, estimation of *σ*^{2} in the case of *p* > *n* has not yet received much attention. In our case, we propose a simple estimator
${\widehat{\sigma}}^{2}=\frac{1}{n-{K}^{\ast}/2}{\Vert \mathit{\text{Y}}-\mathit{\text{X}}\widehat{\mathit{\beta}}({\mathbf{\lambda}}^{\ast})\Vert}^{2}$, where **λ*** = (λ_{0}, ${\stackrel{~}{\mathrm{\lambda}}}_{1}$, ∞), ${\stackrel{~}{\mathrm{\lambda}}}_{1}$ is the smallest λ_{1} reaching the upper bound *K*(λ_{0}, λ_{1}, ∞) = *K**/2, and *K** is defined in Step 2 of Algorithm 2.
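The estimator above is a residual sum of squares divided by a residual degrees of freedom. A minimal sketch in Python; the function name is illustrative, and `n_groups` plays the role of *K**/2 (an assumption about the effective model size):

```python
import numpy as np

def sigma2_hat(y, y_fit, n_groups):
    """Estimate sigma^2 as the residual sum of squares over the
    residual degrees of freedom, with n_groups in the role of K*/2."""
    resid = np.asarray(y) - np.asarray(y_fit)
    return float(resid @ resid) / (len(y) - n_groups)
```

For example, with residuals (1, 0, 0) and one group, the estimate is 1/(3 − 1) = 0.5.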

This section derives a finite-sample probability error bound, based on which we prove that ${\widehat{\mathit{\beta}}}(\mathbf{\lambda})$ is consistent with regard to grouping pursuit and predictive optimality simultaneously for the same set of values of **λ**. As a result, the true grouping ${\mathcal{G}}^{0}$ is reconstructed, as is the unbiased least squares estimate
${\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})}={\left({\widehat{\beta}}_{1}^{(\mathit{\text{ols}})},\dots ,{\widehat{\beta}}_{p}^{(\mathit{\text{ols}})}\right)}^{T}={\left({\widehat{\alpha}}_{1}^{(\mathit{\text{ols}})}{\mathbf{1}}_{|{\mathcal{G}}_{1}^{0}|},\dots ,{\widehat{\alpha}}_{{K}^{0}}^{(\mathit{\text{ols}})}{\mathbf{1}}_{|{\mathcal{G}}_{{K}^{0}}^{0}|}\right)}^{T}$ given ${\mathcal{G}}^{0}$. Here
${\widehat{\mathit{\alpha}}}^{(\mathit{\text{ols}})}={\left({\widehat{\alpha}}_{1}^{(\mathit{\text{ols}})},\dots ,{\widehat{\alpha}}_{{K}^{0}}^{(\mathit{\text{ols}})}\right)}^{T}={\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}\right)}^{-1}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}\mathit{\text{Y}}$, with
${\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}\phantom{\rule{0.2em}{0ex}}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}$ assumed invertible.

Denote by *c*_{min}($\mathcal{G}$) > 0 the smallest eigenvalue of
${\mathit{\text{Z}}}_{\mathcal{G}}^{T}\phantom{\rule{0.2em}{0ex}}{\mathit{\text{Z}}}_{\mathcal{G}}/n$, where ${\mathit{\text{Z}}}_{\mathcal{G}}$ is the design matrix based on grouping $\mathcal{G}$. Denote by
${\gamma}_{min}=min\{|{\alpha}_{k}^{0}-{\alpha}_{l}^{0}|>0:1\le k<l\le {K}^{0}\}$ the resolution level, which may depend on (*p*, *n*) and measures the level of difficulty of grouping pursuit, with a small value of *γ*_{min} indicating a difficult problem. The following result is established for ${\widehat{\mathit{\beta}}}(\mathbf{\lambda})$ from Algorithm 2.

**Theorem 3** (Error bounds for grouping pursuit and consistency). Under the model assumptions of (1) with *ε _{i}* ~ *N*(0, *σ*^{2}),

$$\begin{array}{l}P(\mathcal{G}(\mathbf{\lambda})\ne {\mathcal{G}}^{0})\le P(\widehat{\mathit{\beta}}(\mathbf{\lambda})\ne {\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})})\\ \phantom{\rule{5.6em}{0ex}}\le \frac{{K}^{0}({K}^{0}-1)}{2}\mathrm{\Phi}\left(\frac{-{n}^{1/2}({\gamma}_{min}-3{\mathrm{\lambda}}_{2}/2)}{2\sigma {c}_{min}^{-1/2}({\mathcal{G}}^{0})}\right)+p\mathrm{\Phi}\left(\frac{-n{\mathrm{\lambda}}_{1}}{\sigma {max}_{1\le j\le p}\Vert {\mathit{\text{x}}}_{j}\Vert}\right),\end{array}$$

(15)

where
$\mathrm{\Phi}(z)={(2\pi )}^{-1/2}{\int}_{-\infty}^{z}exp(-{u}^{2}/2)du$ is the cumulative distribution function of *N*(0, 1), and ‖* x_{j}*‖ is the Euclidean norm of * x_{j}*. Moreover, if

- (i) $\frac{n{({\gamma}_{min}-3{\mathrm{\lambda}}_{2}/2)}^{2}}{8{c}_{min}({\mathcal{G}}^{0}){\sigma}^{2}}-2log{K}^{0}\to \infty ,\phantom{\rule{1em}{0ex}}0<{\mathrm{\lambda}}_{2}<\frac{2}{3}{\gamma}_{min}$,
- (ii) $\frac{n{\mathrm{\lambda}}_{1}^{2}}{2{\sigma}^{2}{max}_{1\le j\le p}{\Vert {\mathit{\text{x}}}_{j}\Vert}^{2}/n}-logp\to \infty $,

then *P*($\mathcal{G}(\mathbf{\lambda})\ne {\mathcal{G}}^{0}$) ≤ *P*($\widehat{\mathit{\beta}}(\mathbf{\lambda})\ne {\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})}$) → 0. In other words, $\mathcal{G}(\mathbf{\lambda})={\mathcal{G}}^{0}$ and $\widehat{\mathit{\beta}}(\mathbf{\lambda})={\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})}$ with probability tending to 1.
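The right-hand side of (15) is fully explicit and easy to evaluate numerically. The sketch below uses only the standard library; the function name and all plugged-in values are illustrative:

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def grouping_error_bound(n, p, K0, gamma_min, lam1, lam2, sigma, c_min, max_xnorm):
    """Right-hand side of (15): an upper bound on P(G(lambda) != G^0).
    Note -n^{1/2}(gamma - 3*lam2/2) / (2*sigma*c_min^{-1/2})
    equals -sqrt(n)*(gamma - 1.5*lam2)*sqrt(c_min) / (2*sigma)."""
    t1 = (K0 * (K0 - 1) / 2.0) * Phi(-sqrt(n) * (gamma_min - 1.5 * lam2)
                                     * sqrt(c_min) / (2.0 * sigma))
    t2 = p * Phi(-n * lam1 / (sigma * max_xnorm))
    return t1 + t2
```

Plugging in, say, *n* = 100, *p* = 1000, *K*^{0} = 3, *γ*_{min} = 2, λ_{1} = λ_{2} = 0.5, *σ* = 1 shows both terms are already tiny, illustrating how the bound improves with *n* and degrades with *p*.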

**Corollary 1** (Predictive performance). Under the assumptions of Theorem 3,
$\frac{L(\widehat{\mathit{\beta}}(\mathbf{\lambda}),{\mathit{\beta}}^{0})}{L({\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})},{\mathit{\beta}}^{0})}\to 1$ in probability, and *L*($\widehat{\mathit{\beta}}(\mathbf{\lambda})$, *β*^{0}) = *O _{p}*(*K*^{0}*σ*^{2}).

Theorem 3 and Corollary 1 say that DCE consistently identifies the true grouping ${\mathcal{G}}^{0}$ and reconstructs the unbiased least squares estimator ${\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})}$ based on ${\mathcal{G}}^{0}$ as *p*, *n* → ∞. They also confirm the assertion made in the Introduction concerning consistency with nearly exponentially many predictors in *n*. Specifically, consistency occurs when (a)
$p=O(exp(n{\mathrm{\lambda}}_{1}^{2}))$ (Condition (ii)) or
$\frac{logp}{n}\to 0$; (b) ${\mathrm{\lambda}}_{1}(2{K}^{\ast}+1)<{\mathrm{\lambda}}_{2}{min}_{|\mathcal{G}|\le {({K}^{\ast})}^{2}}{c}_{min}(\mathcal{G})$; and (c)
$0<{\mathrm{\lambda}}_{2}<\frac{2}{3}{\gamma}_{min}$ and
$n{c}_{min}^{-1}({\mathcal{G}}^{0}){({\gamma}_{min}-3{\mathrm{\lambda}}_{2}/2)}^{2}-16{\sigma}^{2}log{K}^{0}\to \infty $ (Condition (i)), provided that ${max}_{1\le j\le p}{\Vert {\mathit{\text{x}}}_{j}\Vert}^{2}/n$ is bounded.

We now describe the characteristics of grouping, in particular how predictors are grouped. Denote by
${\rho}_{j}(\mathbf{\lambda})={\mathit{\text{x}}}_{j}^{T}\phantom{\rule{0.2em}{0ex}}\left(\mathit{\text{Y}}-{\mathit{\text{X}}}^{T}\widehat{\mathit{\beta}}(\mathbf{\lambda})\right)$ the sample covariance between * x_{j}* and the residual vector.

**Theorem 4** (Grouping). Let
${\mathrm{\Delta}}_{j}(\mathbf{\lambda})={\mathrm{\Delta}}_{j}^{({m}^{\ast})}(\mathbf{\lambda})$ be as defined in Theorem 1, where Δ_{j}(**λ**) = Δ_{j′}(**λ**) if *j*, *j′* ∈ ${\mathcal{G}}_{k}$(**λ**). Let *E _{k}*(

Theorem 4 says that predictors are grouped according to whether their sample covariance values fall into the same interval, where the disjoint intervals *E*_{1}(**λ**), …, *E*_{K(λ)}(**λ**) characterize the grouping. As **λ** varies, group splitting or combining may take place when these intervals split or combine.
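The interval characterization can be illustrated with a small sketch: predictors land in the same group exactly when their sample covariances ρ_{j} fall in the same interval. The intervals below are hypothetical stand-ins for *E*_{1}(**λ**), …, *E*_{K(λ)}(**λ**):

```python
def group_by_intervals(rho, intervals):
    """Assign predictor index j to group k when rho[j] lies in
    intervals[k]; intervals is a list of disjoint (low, high) pairs."""
    groups = {k: [] for k in range(len(intervals))}
    for j, r in enumerate(rho):
        for k, (lo, hi) in enumerate(intervals):
            if lo <= r <= hi:
                groups[k].append(j)
                break
    return groups
```

With ρ = (0.1, 2.3, 0.2, 2.1) and intervals (−0.5, 0.5) and (1.5, 2.5), predictors 1 and 3 form one group and predictors 2 and 4 the other; splitting or merging the intervals splits or merges the groups accordingly.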

This section examines the effectiveness of the proposed method on three simulated examples and one real application to gene network analysis.

For a fair comparison, we compare DCE with the estimator obtained from its convex counterpart, an ultra-fused version of the fused Lasso based on the convex penalty Σ_{j<j′} |*β _{j}* − *β*_{j′}|.

We perform simulations in several scenarios, including correlated predictors, different noise levels, and situations of “small *p* but large *n*” and “small *n* but large *p*”. Note that a decrease in the value of *σ*^{2} acts like an increase in the sample size in this case. We therefore fix *n* = 50 and vary the value of *σ*^{2} in Examples 1 and 2, and use different sample sizes in Example 3.

For estimating **λ** for DCE, we generate an independent tuning set
${({\mathit{\text{x}}}_{i},{\stackrel{~}{y}}_{i})}_{i=1}^{n}$ of size *n* in each example. Specifically, the estimated **λ**, denoted by $\widehat{\mathbf{\lambda}}$, is obtained by minimizing the tuning error
${n}^{-1}{\sum}_{i=1}^{n}{({\stackrel{~}{y}}_{i}-\widehat{\mu}(\mathbf{\lambda},{\mathit{\text{x}}}_{i}))}^{2}$ over the tuning set with respect to **λ**, over the path of λ_{1} > 0, a grid of λ_{0} values determined by ${\mathrm{\lambda}}_{1}^{\ast}$, and λ_{2} ∈ {0.5, 1, 1.5, 2, 2.5, 3, 5, 10, ∞} in Examples 1-3, where
${\mathrm{\lambda}}_{1}^{\ast}$ is the largest transition point corresponding to λ_{2} = ∞. The predictive performance is evaluated by MSE($\widehat{\mathit{\beta}}(\widehat{\mathbf{\lambda}})$) as defined in Section 3. The accuracy of grouping is measured by the percentage of matches between estimated and true grouped pairs of indices (*j*, *j′*), *j* ≠ *j′*. Similarly, the tuning parameters of the Lasso and of the convex counterpart of DCE are estimated over λ_{1} > 0 based on the same tuning set for a fair comparison. All numerical analyses are conducted in R 2.9.1.
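The tuning procedure amounts to a grid search minimizing the tuning-set error. A minimal sketch, with a generic `predict` callback standing in for the fitted DCE predictor $\widehat{\mu}(\mathbf{\lambda},\mathit{\text{x}})$ (the solver itself is not reproduced here; all names are illustrative):

```python
from itertools import product

def tune(predict, x_tune, y_tune, lam1_grid, lam2_grid, lam0=0.2):
    """Grid search: pick (lam0, lam1, lam2) minimizing the mean squared
    tuning error; predict(lam, x) plays the role of mu_hat(lambda, x)."""
    best, best_err = None, float("inf")
    for lam1, lam2 in product(lam1_grid, lam2_grid):
        lam = (lam0, lam1, lam2)
        err = sum((y - predict(lam, x)) ** 2
                  for x, y in zip(x_tune, y_tune)) / len(y_tune)
        if err < best_err:
            best, best_err = lam, err
    return best, best_err
```

In a toy check where the "model" predicts λ_{1}·*x* and the truth is *y* = 2*x*, the search correctly selects λ_{1} = 2 with zero tuning error.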

This example was used previously for feature selection in Zou and Hastie (2005). A random sample of ${({\mathit{\text{x}}}_{i},{y}_{i})}_{i=1}^{n}$ is generated.

As suggested in Table 1, DCE outperforms its convex counterpart and the least squares estimate based on the estimated grouping across three levels of *σ*^{2}. This is due primarily to the nonconvex penalty, which corrects the estimation bias incurred by its convex counterpart. This is evident from the fact that the convex counterpart of DCE (*m* = 0) performs worse than the least squares estimates based on the estimated grouping. Furthermore, grouping indeed offers additional improvement in predictive performance in view of the result of the Lasso. Most importantly, DCE reconstructs the least squares estimates based on the true grouping well, confirming the asymptotic results in Theorem 3 and Corollary 1. The reconstruction is nearly perfect when *σ*^{2} = .5 but less so as the value of *σ*^{2} increases towards 2, indicating that the noise level does impact the accuracy of reconstruction. Overall, grouping identification and reconstruction appear to be accurate, agreeing with the theoretical results.

MSEs as well as estimated standard errors (in parentheses) of grouping pursuit for various methods based on 100 simulation replications in Example 1. Here Full, True, Lasso, Convex, DCE, denote the least squares estimates based on the full model, the **...**

Figure 1 displays the solution paths in λ_{1} for various values of λ_{2} with λ_{0} = .2. Clearly, $\widehat{\mathit{\beta}}(\mathbf{\lambda})$ is continuous in λ_{1} given (λ_{0}, λ_{2}) and has jumps in λ_{2} given (λ_{0}, λ_{1}), as discussed earlier. Figure 2 shows four two-dimensional DCE solution surfaces for (${\widehat{\beta}}_{1}(\mathbf{\lambda}),{\widehat{\beta}}_{2}(\mathbf{\lambda}),{\widehat{\beta}}_{19}(\mathbf{\lambda}),{\widehat{\beta}}_{20}(\mathbf{\lambda})$) with respect to **λ**. In Figure 2, the four estimates are close to their corresponding least squares estimates when either λ_{1} or λ_{2} becomes small. On the other hand, they tend to be close to each other when both λ_{1} and λ_{2} become large. Note that some jumps in λ_{2} are visible for ${\widehat{\beta}}_{1}(\mathbf{\lambda})$ around **λ** = (0.2, 2.5, 2.5).

Plots of $\widehat{\mathit{\beta}}(\mathbf{\lambda})$ as a function of λ_{1} for various λ_{2} values with λ_{0} = .2 and *σ* = 1.2 in Example 1. Different components of $\widehat{\mathit{\beta}}(\mathbf{\lambda})$ are represented by different types of lines and **...**

A random sample of ${({\mathit{\text{x}}}_{i},{y}_{i})}_{i=1}^{n}$ is generated.

In this “large *p* but small *n*” example, DCE outperforms its convex counterpart with regard to predictive performance in all the cases, but the amounts of improvement vary. Here, the Lasso performs slightly better in some cases, which is expected because of the large group size of zero-coefficient predictors. Interestingly, the average number of iterations for Algorithm 2 is about 3 as compared to 4 in Example 1. Finally, the matching proportion for grouping is reasonably high.

Consider Examples 4 and 5 of Bondell and Reich (2008), which are relatively low-dimensional but highly noisy. Their Example 4 is the same as Example 1 except that *n* = 100, *p* = 40, *σ*^{2} = 15^{2} and
$\mathit{\beta}={(\underset{10}{\underbrace{0,\dots ,0}},\underset{10}{\underbrace{2,\dots ,2}},\underset{10}{\underbrace{0,\dots ,0}},\underset{10}{\underbrace{2,\dots ,2}})}^{T}$. Their Example 5 has a similar setting, but with *n* = 50, *p* = 40, *σ*^{2} = 15^{2}, and
$\mathit{\beta}={(\underset{15}{\underbrace{3,\dots ,3}},\underset{25}{\underbrace{0,\dots ,0}})}^{T}$, where * x_{i}* ~

Overall, DCE performs comparably with OSCAR in these noisy situations with *σ*^{2} = 15^{2}, performing better in their Example 4 but worse in their Example 5. This is mainly because DCE does not select variables beyond grouping. This aspect was also evident in Table 2, where DCE performs worse than the Lasso in some cases. Most noticeably, DCE performs slightly worse than its convex counterpart when *σ* is very large. This is expected, because an adaptive method tends to perform worse than its non-adaptive counterpart in a noisy situation.

Mapping the pathways giving rise to metastasis is important in breast cancer research. Recent studies suggest that gene expression profiles are useful in identifying gene subnetworks correlated with metastasis. Here we apply our proposed method to understand the functionality of subnetworks of genes for predicting the time to metastasis, which may provide novel hypotheses and confirm the existing theory for pathways involved in tumor progression.

The breast cancer metastasis data (Wang et al., 2005; Chuang et al., 2007) contain gene expression levels of 8141 genes for 286 patients, 107 of whom were detected to develop metastasis within a five year follow-up after surgery. To utilize the present gene network knowledge, we explore the PPI network previously constructed in Chuang et al. (2007).

For breast metastasis, three tumor suppressor genes, TP53, BRCA1 and BRCA2, are known to be crucial in preventing uncontrolled cell proliferation and in repairing chromosomal damage. Certain mutations of these genes increase the risk of breast cancer, cf. Soussi (2003). In our analysis, we construct a subnetwork of Chuang et al. (2007), consisting of the genes TP53, BRCA1 and BRCA2, as well as genes regulated by them. This leads to 294 expressed genes for 107 patients who developed metastasis.

For subnetwork analysis, consider a vector of *p* predictors, each corresponding to one node in an undirected graph together with edges connecting nodes. Also available is a vector of network weights ** w** = (*w*_{1}, …, *w _{p}*)^{T}.

$${Y}_{i}=\mu ({\stackrel{~}{\mathit{\text{x}}}}_{i})+{\epsilon}_{i},\phantom{\rule{1em}{0ex}}\mu (\stackrel{~}{\mathit{\text{x}}})={\stackrel{~}{\mathit{\text{x}}}}^{T}\stackrel{~}{\mathit{\beta}},\phantom{\rule{1em}{0ex}}E({\epsilon}_{i})=0,\phantom{\rule{1em}{0ex}}\text{Var}({\epsilon}_{i})={\sigma}^{2};\phantom{\rule{1em}{0ex}}i=1,\dots ,n,$$

(16)

where ${\stackrel{~}{\mathit{\beta}}}={({\stackrel{~}{\beta}}_{1},\dots ,{\stackrel{~}{\beta}}_{p})}^{T}$.

For data analysis, the 107 patients are divided randomly into two groups of 70 and 37 patients, respectively, for model building and validation. For model building, an estimated MSE based on the generalized degrees of freedom is minimized with regard to a set of grid points over the path of λ_{1} > 0, a grid of λ_{0} values determined by ${\mathrm{\lambda}}_{1}^{\ast}$, and λ_{2} ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, ∞}, where
${\mathrm{\lambda}}_{1}^{\ast}$ is the largest transition point corresponding to λ_{2} = ∞, to obtain the optimal tuning parameters based on the data perturbation method, cf. Shen and Huang (2007). Moreover, we take *Y _{i}* as the log time to metastasis (in months) and ${\stackrel{~}{\mathit{\text{x}}}}_{i}$ as in (16).

Table 3 summarizes the estimated grouped regression coefficients based on the 27 estimated groups, with the corresponding subnetworks displayed in different colors in Figure 3. Interestingly, TP53 and BRCA1 are similar but differ from BRCA2, as is evident from the corresponding color intensities in Figure 3, indicating the different roles that they play in the process of metastasis. To make sense of the estimated grouping, we examine the disease genes causing metastasis that were identified in Wang et al. (2005). Among the 17 disease genes expressed in our network, 1, 1, 1, 1 and 13 genes belong to the second, 13th, 14th, 17th and 18th groups, respectively. In fact, three disease genes form singleton groups, one disease gene belongs to a small group, and 13 are in one large group, indicating that genes work in groups according to their functionalities with regard to survivability of breast cancer. In our analysis, the mean squared prediction error is .81 based on the testing data set of 37 observations, which yields an MSE of .3. This is reasonably good relative to the estimated *σ*^{2} = .51.

Plot of the PPI subnetwork for the metastasis data, described by an undirected graph with 294 nodes and 326 edges. The three regulating genes TP53, BRCA1 and BRCA2 are represented by large nodes. There are 27 estimated groups colored over the color spectrum **...**

This article proposes a novel grouping pursuit method for high-dimensional least squares regression. The proposed method is placed in the framework of likelihood estimation and model selection. It offers a general treatment of a continuous but non-differentiable nonconvex likelihood problem, where the global penalized maximum likelihood estimate is difficult to obtain. Remarkably, our DC treatment yields the statistical properties expected of the global penalized maximum likelihood estimate. This is mainly because the DC score equation, defined through subdifferentials, equals that of the original nonconvex problem when the termination criterion is met, cf. Theorem 2. In this process, the continuous but non-differentiable penalty is essential. At present the proposed method is not designed for feature selection. To generalize, one may replace *J*(** β**) by
${\sum}_{j=1}^{p}G({\beta}_{j})+{\sum}_{j<{j}^{\prime}}G({\beta}_{j}-{\beta}_{{j}^{\prime}})$ in (2). Moreover, estimation of the tuning parameters needs to be further investigated with regard to the accuracy of selection. Further research is therefore necessary.
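Assuming *G* is a truncated-L1 function *G*(*z*) = min(|*z*|, λ_{2}), an assumption consistent with the kinks at *z* = ±λ_{2} noted earlier, though the paper's definition of *G* is not restated in this section, the generalized penalty can be sketched as:

```python
def G(z, lam2):
    # assumed truncated-L1 form: min(|z|, lam2)
    return min(abs(z), lam2)

def extended_penalty(beta, lam2):
    """Sum_j G(beta_j) + Sum_{j<j'} G(beta_j - beta_j'): the first sum
    shrinks coefficients toward zero (feature selection), the second
    shrinks pairwise differences (grouping)."""
    p = len(beta)
    selection = sum(G(b, lam2) for b in beta)
    grouping = sum(G(beta[j] - beta[k], lam2)
                   for j in range(p) for k in range(j + 1, p))
    return selection + grouping
```

For β = (0, 0, 3) and λ_{2} = 1, the selection part contributes 1 and the grouping part 2, so the penalty is 3; the truncation caps the cost of any single large coefficient or gap, which is what keeps the estimate nearly unbiased.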

We prove by contradiction. Without loss of generality, assume that
${\mathit{\beta}}^{\ast}$ is a local minimum of *f*(** β**) =

To prove continuity in λ_{1}, note that the derivative of (5) with respect to ** β** is continuous in λ_{1}.

Next we derive an expression for ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$. If *K*^{(m)}(**λ**) = 1 and
$|{\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})|\ge 2$, then the result follows from (5). Now consider the case of *K*^{(m)}(**λ**) ≥ 2. For any *j* = 1, …, *p*, write
${\sum}_{{j}^{\prime}\ne j}{b}_{j{j}^{\prime}}^{(m)}(\mathbf{\lambda})={\sum}_{{j}^{\prime}:{j}^{\prime}\sim j}{b}_{j{j}^{\prime}}^{(m)}(\mathbf{\lambda})+{\sum}_{{j}^{\prime}:{j}^{\prime}\nsim j}{b}_{j{j}^{\prime}}^{(m)}(\mathbf{\lambda})$ as
${B}_{j}^{(m)}(\mathbf{\lambda})+{\sum}_{k:k\ne g(\mathbf{\lambda},j)}|{\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})|\phantom{\rule{0.2em}{0ex}}\text{Sign}\left({\widehat{\alpha}}_{k}^{(m)}(\mathbf{\lambda})-{\widehat{\alpha}}_{g(\mathbf{\lambda},j)}^{(m)}(\mathbf{\lambda})\right)$, where *j′* ~ *j* if *j′* and *j* are in the same group and *j′* ≁ *j* otherwise. Differentiating (5) with respect to ** β**, we obtain

$$-{\mathit{\text{x}}}_{j}^{T}(\mathit{\text{Y}}-{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}{\widehat{\mathit{\alpha}}}^{(m)}(\mathbf{\lambda}))+n{\mathrm{\lambda}}_{1}\left({\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda})+{B}_{j}^{(m)}(\mathbf{\lambda})\right)=0,\phantom{\rule{1em}{0ex}}j=1,\dots ,p,$$

(17)

which is an optimality condition (Rockafellar and Wets, 2003). For *k* = 1, …, *K*^{(m)}(**λ**), invoking the sum-to-zero constraint
${\sum}_{j\in {\mathcal{G}}_{k}^{(m)}(\mathbf{\lambda})}{B}_{j}^{(m)}(\mathbf{\lambda})=0$, we have
$-{\mathit{\text{z}}}_{k,{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}(\mathit{\text{Y}}-{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}{\widehat{\mathit{\alpha}}}^{(m)}(\mathbf{\lambda}))+n{\mathrm{\lambda}}_{1}{\delta}_{k}^{(m)}(\mathbf{\lambda})=0$, implying (6). Thus (8) follows from (17).

By definition,
${L}_{k}^{\pm}({\mathrm{\lambda}}_{1},{A}_{k\pm}^{(m)})$ is piecewise linear and continuous in λ_{1}, and is strictly monotone because
${\sum}_{j\in {A}_{k+}^{(m)}}{\xi}_{j}>0$ and
${\sum}_{j\in {A}_{k-}^{(m)}}{\xi}_{j}<0$.

Let *g _{k}* = |

Uniqueness of the solution follows from strict convexity of *S*^{(m)}(** β**) in ** β**.

Our plan is to prove the result for λ_{1} = λ_{0} with **λ** = (λ_{0}, λ_{0}, λ_{2})^{T}. Then controlling the algorithm at this point implies the desired result for all **λ**. In what follows, we set λ_{1} = λ_{0} unless indicated otherwise. For convergence of Algorithm 2, it follows from (2) and (5) that for *m* ≥ 1, 0 ≤ *S*(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) = *S*^{(m+1)}(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) ≤ *S*^{(m)}(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) ≤ *S*^{(m)}(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$) = *S*(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$). This implies that lim_{m→∞} *S*(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) exists, thus leading to convergence. To study the number of steps to termination, note that *S*^{(m)}(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$) − *S*^{(m)}(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) can be written as

$$\frac{1}{2n}\phantom{\rule{0.2em}{0ex}}{\Vert \mathit{\text{X}}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})-{\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})\right)\Vert}^{2}+\frac{1}{n}\phantom{\rule{0.2em}{0ex}}{\left(\mathit{\text{Y}}-\mathit{\text{X}}{\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})\right)}^{T}\mathit{\text{X}}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})-{\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})\right)+{\mathrm{\lambda}}_{1}\phantom{\rule{0.2em}{0ex}}\{\sum _{j=1}^{p}({\widehat{\beta}}_{j}^{(m)}(\mathbf{\lambda})-{\widehat{\beta}}_{j}^{(m-1)}(\mathbf{\lambda}))\sum _{{j}^{\prime}\ne j}{G}_{2}\phantom{\rule{0.2em}{0ex}}({\widehat{\beta}}_{j}^{(m-1)}(\mathbf{\lambda})-{\widehat{\beta}}_{{j}^{\prime}}^{(m-1)}(\mathbf{\lambda}))\}+{\mathrm{\lambda}}_{1}\phantom{\rule{0.2em}{0ex}}\{\sum _{j<{j}^{\prime}}\phantom{\rule{0.2em}{0ex}}(\left|{\widehat{\beta}}_{j}^{(m-1)}(\mathbf{\lambda})-{\widehat{\beta}}_{{j}^{\prime}}^{(m-1)}(\mathbf{\lambda})\right|-\left|{\widehat{\beta}}_{j}^{(m)}(\mathbf{\lambda})-{\widehat{\beta}}_{{j}^{\prime}}^{(m)}(\mathbf{\lambda})\right|)\},$$

which can be simplified, using the following identity from (17),
$-{\mathit{\text{x}}}_{j}^{T}(\mathit{\text{Y}}-\mathit{\text{X}}{\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda}))=n{\mathrm{\lambda}}_{1}{\sum}_{j\prime :j\prime \ne j}\phantom{\rule{0.2em}{0ex}}\{{G}_{2}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\beta}}_{j}^{(m-1)}(\mathbf{\lambda})-{\widehat{\beta}}_{j\prime}^{(m-1)}(\mathbf{\lambda})\right)-{b}_{jj\prime}^{(m)}(\mathbf{\lambda})\}$, as
$\frac{1}{2n}\phantom{\rule{0.2em}{0ex}}{\Vert \mathit{\text{X}}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})-{\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})\right)\Vert}^{2}+{\mathrm{\lambda}}_{1}{\sum}_{j<j\prime}\phantom{\rule{0.2em}{0ex}}\left\{\left|{z}_{jj\prime}^{(m-1)}(\mathbf{\lambda})\right|-\left|{z}_{jj\prime}^{(m)}(\mathbf{\lambda})\right|-{b}_{jj\prime}^{(m)}(\mathbf{\lambda})\left({z}_{jj\prime}^{(m-1)}(\mathbf{\lambda})-{z}_{jj\prime}^{(m)}(\mathbf{\lambda})\right)\right\}$, where
${z}_{jj\prime}^{(m)}(\mathbf{\lambda})={\widehat{\beta}}_{j}^{(m)}(\mathbf{\lambda})-{\widehat{\beta}}_{j\prime}^{(m)}(\mathbf{\lambda})$. By convexity of |*z*|,
$\left|{z}_{jj\prime}^{(m-1)}(\mathbf{\lambda})\right|-\left|{z}_{jj\prime}^{(m)}(\mathbf{\lambda})\right|-{b}_{jj\prime}^{(m)}(\mathbf{\lambda})\left({z}_{jj\prime}^{(m-1)}(\mathbf{\lambda})-{z}_{jj\prime}^{(m)}(\mathbf{\lambda})\right)\ge 0$, implying that *S*(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$) − *S*(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) ≥ *S*^{(m)}(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$) − *S*^{(m)}(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$), which is bounded below by
$\frac{1}{2n}\phantom{\rule{0.2em}{0ex}}{\Vert \mathit{\text{X}}\phantom{\rule{0.2em}{0ex}}\left({\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})-{\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})\right)\Vert}^{2}$. That is,
$\frac{1}{2n}\phantom{\rule{0.2em}{0ex}}{\left({\widehat{\mathit{\alpha}}}^{(m)}(\mathbf{\lambda})-{\widehat{\mathit{\alpha}}}^{(m-1)}(\mathbf{\lambda})\right)}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{(m)}(\mathbf{\lambda})}\left({\widehat{\mathit{\alpha}}}^{(m)}(\mathbf{\lambda})-{\widehat{\mathit{\alpha}}}^{(m-1)}(\mathbf{\lambda})\right)$ is greater than zero unless ${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})={\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$.
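The convexity step used above is the subgradient inequality for the absolute value, assuming, as the derivation indicates, that ${b}_{jj\prime}^{(m)}(\mathbf{\lambda})$ is a subgradient of |*z*| at ${z}_{jj\prime}^{(m)}(\mathbf{\lambda})$:

```latex
|z^{(m-1)}| \;\ge\; |z^{(m)}| + b\,\bigl(z^{(m-1)} - z^{(m)}\bigr),
\qquad
b \in \partial\,|z^{(m)}| =
\begin{cases}
\{\operatorname{sign}(z^{(m)})\}, & z^{(m)} \neq 0,\\
[-1,\,1], & z^{(m)} = 0,
\end{cases}
```

which rearranges to the claimed nonnegativity of $|z^{(m-1)}| - |z^{(m)}| - b\,(z^{(m-1)} - z^{(m)})$.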

Finally, finite-step convergence follows from the strict decrease of *S*(${\widehat{\mathit{\beta}}}^{(m)}(\mathbf{\lambda})$) in *m* and the finitely many possible values of *S*_{2}(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$) in (5). For (14), note that at termination, *S*_{2}(${\widehat{\mathit{\beta}}}^{(m-1)}(\mathbf{\lambda})$) remains unchanged for *m* ≥ *m**, and so does the cost function (5) for *m* ≥ *m**. This implies termination for all **λ** = (λ_{0}, λ_{1}, λ_{2})^{T} in (14). This completes the proof.

Define event

$$F=\left\{\underset{k<l}{min}\left|{\widehat{\alpha}}_{k}^{(\mathit{\text{ols}})}-{\widehat{\alpha}}_{l}^{(\mathit{\text{ols}})}\right|>3{\mathrm{\lambda}}_{2}/2\right\}\cap \left({\cap}_{k:|{\mathcal{G}}_{k}^{0}|>1}\left\{\underset{j\in {\mathcal{G}}_{k}^{0}}{max}|{\mathit{\text{x}}}_{j}^{T}(\mathit{\text{Y}}-\mathit{\text{X}}{\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})})|<n{\mathrm{\lambda}}_{1}\right\}\right).$$

By (17) with *m* = *m** and (14), for *k* = 1, …, *K*, ${\widehat{\mathit{\beta}}}(\mathbf{\lambda})={\widehat{\mathit{\beta}}}^{({m}^{\ast})}(\mathbf{\lambda})$ satisfies

$$\{\begin{array}{ll}-{({\sum}_{j\in {\mathcal{G}}_{k}}{\mathit{\text{x}}}_{j})}^{T}(\mathit{\text{Y}}-\mathit{\text{X}}\mathit{\beta})+n{\mathrm{\lambda}}_{1}{\sum}_{j\in {\mathcal{G}}_{k}}{\mathrm{\Delta}}_{j}(\mathit{\beta})=0;& k=1,\dots ,K,\\ |{\mathit{\text{x}}}_{j}^{T}(\mathit{\text{Y}}-\mathit{\text{X}}\mathit{\beta})+n{\mathrm{\lambda}}_{1}{\mathrm{\Delta}}_{j}(\mathit{\beta})|\le n{\mathrm{\lambda}}_{1}(|{\mathcal{G}}_{k}|-1);& j\in {\mathcal{G}}_{k},\phantom{\rule{0.5em}{0ex}}|{\mathcal{G}}_{k}|>1,\end{array}$$

(18)

for some partition (${\mathcal{G}}_{1}$, …, ${\mathcal{G}}_{K}$) of {1, …, *p*} with *K* ≤ min{*n*, *p*}, where Δ_{j}(** β**) is the between-group sign sum defined analogously to ${\mathrm{\Delta}}_{j}^{(m)}(\mathbf{\lambda})$ in Theorem 1.

Note that the first event in *F*, together with the grouped subdifferentials, yields that
${\sum}_{j\in {\mathcal{G}}_{k}^{0}}{\mathrm{\Delta}}_{j}({\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})})=0$; *k* = 1, …, *K*^{0}. This, together with the least squares property that
${({\sum}_{j\in {\mathcal{G}}_{k}^{0}}{\mathit{\text{x}}}_{j})}^{T}(\mathit{\text{Y}}-\mathit{\text{X}}{\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})})=0$, implies that the first equation of (18) is fulfilled with ** β** = ${\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})}$.

It remains to show that (18) yields the unique minimizer on *F*. Define

$$\stackrel{~}{G}(z)=\{\begin{array}{ll}G(z);& \text{if}\phantom{\rule{0.2em}{0ex}}\left|z\right|\le {\mathrm{\lambda}}_{2}(1-\nu )\phantom{\rule{0.3em}{0ex}}\text{or}\phantom{\rule{0.3em}{0ex}}\left|z\right|\ge {\mathrm{\lambda}}_{2}(1+\nu ),\\ -\frac{1}{4{\mathrm{\lambda}}_{2}\nu}{(z-{\mathrm{\lambda}}_{2})}^{2}+\frac{1}{2}(z-{\mathrm{\lambda}}_{2})+{\mathrm{\lambda}}_{2}(1-\frac{\nu}{4});& \text{if}\phantom{\rule{0.2em}{0ex}}\left|z-{\mathrm{\lambda}}_{2}\right|<{\mathrm{\lambda}}_{2}\nu ,\\ -\frac{1}{4{\mathrm{\lambda}}_{2}\nu}{(z+{\mathrm{\lambda}}_{2})}^{2}-\frac{1}{2}(z+{\mathrm{\lambda}}_{2})+{\mathrm{\lambda}}_{2}(1-\frac{\nu}{4});& \text{if}\phantom{\rule{0.2em}{0ex}}\left|z+{\mathrm{\lambda}}_{2}\right|<\nu {\mathrm{\lambda}}_{2},\end{array}$$

for *ν* = 1/2. Given any grouping $\mathcal{G}$ with $|\mathcal{G}|\le {K}^{\ast}$, $\stackrel{~}{S}$(** β**) is a function of ${\mathit{\alpha}}_{\mathcal{G}}$.

To prove that ${\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})}$ is the unique minimizer of *S*(** β**) on

$$\underset{|\mathcal{G}|\le {({K}^{\ast})}^{2}}{min}\Vert {\mathit{\alpha}}_{\mathcal{G}}-{\mathit{\alpha}}_{\mathcal{G}}^{(\mathit{\text{ols}})}\Vert >0.$$

(19)

Note that (** β**) =

$$\Vert (\frac{{\mathit{\alpha}}_{\mathcal{G}}\stackrel{~}{S}(\mathit{\beta})-\frac{{\mathit{\alpha}}_{\mathcal{G}}\stackrel{~}{S}(\mathit{\beta}){\left|}_{{\mathit{\alpha}}_{\mathcal{G}}}+(\frac{{\mathit{\alpha}}_{\mathcal{G}}S(\mathit{\beta})-\frac{{\mathit{\alpha}}_{\mathcal{G}}\stackrel{~}{S}(\mathit{\beta}))}{}\Vert \ge (\underset{|\mathcal{G}|\le {({K}^{}2}^{}}{min}\Vert {\mathit{\alpha}}_{\mathcal{G}}-\frac{{\mathrm{\lambda}}_{1}}{2}{K}^{}}{}}{}}{}$$

which is bounded below by
$\left(\underset{|\mathcal{G}|\le {({K}^{\ast})}^{2}}{min}\Vert {\mathit{\alpha}}_{\mathcal{G}}-{\mathit{\alpha}}_{\mathcal{G}}^{(\mathit{\text{ols}})}\Vert \right)\frac{{\mathrm{\lambda}}_{2}}{2}>0$ by (19). This, together with Lemma 1, implies that *S*(** β**) has no local minimizer in

Note that
${\widehat{\alpha}}_{k}^{(\mathit{\text{ols}})}-{\widehat{\alpha}}_{l}^{(\mathit{\text{ols}})}~N({\alpha}_{k}^{0}-{\alpha}_{l}^{0},\phantom{\rule{0.2em}{0ex}}\mathit{\text{Var}}({\widehat{\alpha}}_{k}^{(\mathit{\text{ols}})}-{\widehat{\alpha}}_{l}^{(\mathit{\text{ols}})}))$ with
$\mathit{\text{Var}}({\widehat{\alpha}}_{k}^{(\mathit{\text{ols}})}-{\widehat{\alpha}}_{l}^{(\mathit{\text{ols}})})\le 4{c}_{min}^{-1}({\mathcal{G}}^{0}){\sigma}^{2}/n$, and
${\mathit{\text{x}}}_{j}^{T}\phantom{\rule{0.2em}{0ex}}\left(\mathit{\text{Y}}-{\mathit{\text{X}}}^{T}\phantom{\rule{0.2em}{0ex}}{\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})}\right)~N\left(0,{\sigma}^{2}{\Vert \left(\mathit{\text{I}}-{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}\right)}^{-1}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}\right)\phantom{\rule{0.2em}{0ex}}{\mathit{\text{x}}}_{j}\Vert}^{2}\right)$ with
${\Vert \left(\mathit{\text{I}}-{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}\right)}^{-1}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}\right)\phantom{\rule{0.2em}{0ex}}{\mathit{\text{x}}}_{j}\Vert}^{2}\le {\Vert {x}_{j}\Vert}^{2}$. It follows that *P*((**λ**) ≠ ^{0}) ≤ *P*(**(****λ**) ≠ ^{(ols)}) ≤ *P*(*F ^{c}*), which is upper bounded by

$$\sum _{k<l}P\phantom{\rule{0.2em}{0ex}}\left(\left|{\alpha}_{k}^{0}-{\alpha}_{l}^{0}\right|-{\mathit{\text{q}}}_{kl}^{T}\phantom{\rule{0.1em}{0ex}}\mathit{\epsilon}\le 3{\mathrm{\lambda}}_{2}/2\right)+\sum _{k=1}^{{K}^{0}}\sum _{j\in {\mathcal{G}}_{k}^{0}}P\phantom{\rule{0.2em}{0ex}}\left({\mathit{\text{x}}}_{j}^{T}(\mathit{\text{Y}}-{\mathit{\text{X}}}^{T}{\widehat{\mathit{\beta}}}^{(\mathit{\text{ols}})})\ge n{\mathrm{\lambda}}_{1}\right)\le \frac{{K}^{0}({K}^{0}-1)}{2}\mathrm{\Phi}\phantom{\rule{0.2em}{0ex}}\left(\frac{{n}^{1/2}(3{\mathrm{\lambda}}_{2}/2-{\gamma}_{min})}{2\sigma {c}_{min}^{-1/2}({\mathcal{G}}^{0})}\right)+p\phantom{\rule{0.1em}{0ex}}\mathrm{\Phi}\phantom{\rule{0.2em}{0ex}}\left(\frac{-n{\mathrm{\lambda}}_{1}}{\sigma {max}_{1\le j\le p}\Vert {x}_{j}\Vert}\right),$$

where
${\mathit{\text{q}}}_{kl}={\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}\phantom{\rule{0.2em}{0ex}}{\left({\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}^{T}\phantom{\rule{0.1em}{0ex}}{\mathit{\text{Z}}}_{{\mathcal{G}}^{0}}\right)}^{-1}({\mathit{\text{e}}}_{k}-{\mathit{\text{e}}}_{l})$ and * e_{k}* is the *k*th standard unit vector.

This is a direct consequence of Theorem 3 and the least squares property; the proof is thus omitted.

By Theorem 2, $\widehat{\mathit{\beta}}(\mathbf{\lambda})={\widehat{\mathit{\beta}}}^{({m}^{\ast})}(\mathbf{\lambda})$. From (3), for any *j* ∈ ${\mathcal{G}}_{k}$(**λ**); *k* = 1, …, *K*(**λ**), we have |*B _{j}*(

For disjointness of *E _{k}*(

^{*}Xiaotong Shen is Professor, School of Statistics, University of Minnesota, 224 Church Street S.E., Minneapolis, MN 55455 (E-mail: xshen@stat.umn.edu). His research is supported in part by National Science Foundation Grant DMS-0906616 and National Institutes of Health Grant 1R01GM081535. Hsin-Cheng Huang is Research Fellow, Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan (E-mail: chhuang@stat.sinica.edu.tw). He is supported in part by Grant NSC 97-2118-M-001-001-MY3. The authors thank the editor, the associate editor and three referees for their helpful comments and suggestions.

1. An HLT, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J Global Optim. 1997;11:253–85.

2. Allgower EL, Georg K. Introduction to Numerical Continuation Methods. SIAM; 2003.

3. Bondell HD, Reich BJ. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008;64:115–23. [PMC free article] [PubMed]

4. Chuang HY, Lee EJ, et al. Network-based classification of breast cancer metastasis. Molecular Systems Biology. 2007;3:140. [PMC free article] [PubMed]

5. Efron B. The estimation of prediction error: covariance penalties and cross-validation. J Amer Statist Assoc. 2004;99:619–32.

6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its Oracle properties. J Amer Statist Assoc. 2001;96:1348–60.

7. Fan J. Comments on “Wavelets in statistics: a review” by A. Antoniadis. J Italian Statist Assoc. 1997;6:131–138.

8. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann Applied Statist. 2007;1:302–332.

9. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–82. [PubMed]

10. Liu S, Shen X, Wong W. Computational developments of *ψ*-learning. Proc 5th SIAM Intern Conf on Data Mining; Newport Beach, CA. April, 2005; 2005. pp. 1–12.

11. Liu Y, Wu Y. Variable selection via a combination of the *L*_{0} and *L*_{1} penalties. J Comput Graph Statist. 2007;16:782–798.

12. Gill PE, Murray W, Wright MH. Practical Optimization. Academic Press; London: 1981.

13. Rockafellar RT, Wets RJ. Variational Analysis. Springer-Verlag; 2003.

14. Rosset S, Zhu J. Piecewise linear regularized solution paths. Ann Statist. 2007;35:1012–30.

15. Rota GC. The number of partitions of a set. American Mathematical Monthly. 1964;71:498–504.

16. Shen X, Huang HC. Optimal model assessment, selection and combination. J Amer Statist Assoc. 2006;101:554–68.

17. Stein C. Estimation of the mean of a multivariate normal distribution. Ann Statist. 1981;9:1135–51.

18. Soussi T. Focus on the P53 gene and cancer: advances in TP53 mutation research. Human mutation. 2003;21:173–5. [PubMed]

19. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J Royal Statist Soc, Ser B. 2005;67:91–108.

20. Wang Y, Klijn JG, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–79. [PubMed]

21. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J Royal Statist Soc Ser B. 2006;68:49–67.

22. Wu S, Shen X, Geyer C. Adaptive regularization through entire solution surface. Biometrika. 2009;96:513–527. [PMC free article] [PubMed]

23. Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. Ann Statist. 2009;37:3468–3497.

24. Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Statist Soc, Ser B. 2005;67:301–20.
