J Comput Graph Stat. Author manuscript; available in PMC 2017 March 16. Published online 2015 August 12. doi: 10.1080/10618600.2015.1067217. PMCID: PMC5353363. NIHMSID: NIHMS703415.

We consider the task of fitting a regression model involving interactions
among a potentially large set of covariates, in which we wish to enforce strong
heredity. We propose `FAMILY`, a very general framework for
this task. Our proposal is a generalization of several existing methods, such as
`VANISH` [Radchenko and
James, 2010], `hierNet` [Bien et al., 2013], the all-pairs lasso, and the lasso using
only main effects. It can be formulated as the solution to a convex optimization
problem, which we solve using an efficient alternating direction method of
multipliers (ADMM) algorithm. This algorithm has guaranteed convergence to the
global optimum, can be easily specialized to any convex penalty function of
interest, and allows for a straightforward extension to the setting of
generalized linear models. We derive an unbiased estimator of the degrees of
freedom of `FAMILY`, and explore its performance in a
simulation study and on an HIV sequence data set.

In this paper, we model a response variable with a set of main effects
and second-order interactions. The problem can be formulated as follows: we are
given a response vector *y* for *n* observations,
an *n* × *p*_{1} matrix
*X* of covariates and another *n* ×
*p*_{2} matrix *Z* of covariates. In
what follows, the notation *X*_{․,j} and *Z*_{․,k} will denote the *j*^{th} column of *X* and the *k*^{th} column of *Z*, respectively. We consider the model

$${y}_{i}={B}_{0,0}+\sum _{j=1}^{{p}_{1}}{B}_{j,0}{X}_{i,j}+\sum _{k=1}^{{p}_{2}}{B}_{0,k}{Z}_{i,k}+\sum _{j=1}^{{p}_{1}}\sum _{k=1}^{{p}_{2}}{B}_{j,k}{X}_{i,j}{Z}_{i,k}+{\epsilon}_{i},i=1,\dots ,n,$$

(1)

where *B* is a (*p*_{1} +
1) × (*p*_{2} + 1) matrix of coefficients, of
which the rows and columns are indexed from 0 to *p*_{1}
and 0 to *p*_{2} for the variables *X* and
*Z*, respectively. In the special case where *X* = *Z*, the coefficient of the (*j*, *k*)^{th} interaction is *B*_{j,k} + *B*_{k,j}.

For brevity, we re-write model (1) using array notation. We construct the *n*
× (*p*_{1} + 1) ×
(*p*_{2} + 1) array *W* as follows:
for *i* ∈ {1, …, *n*}, *j* ∈ {0, …, *p*_{1}}, *k* ∈ {0, …, *p*_{2}},

$${W}_{i,j,k}=\{\begin{array}{cc}{X}_{i,j}{Z}_{i,k}\hfill & \text{for }j\ne 0\text{ and }k\ne 0\hfill \\ {X}_{i,j}\hfill & \text{for }k=0\text{ and }j\ne 0\hfill \\ {Z}_{i,k}\hfill & \text{for }j=0\text{ and }k\ne 0\hfill \\ 1\hfill & \text{for }j=k=0\hfill \end{array}.$$

(2)

Then (1) is equivalent to the model

$$y=W*B+\epsilon ,$$

(3)

where *B* is the matrix of coefficients as in
(1), and *W* *
*B* denotes the *n*-vector whose
*i ^{th}* element takes the form ${(W*B)}_{i}\equiv {\sum}_{j=0}^{{p}_{1}}{\sum}_{k=0}^{{p}_{2}}{W}_{i,j,k}{B}_{j,k}$. The model is displayed in the left panel of Figure 1.
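
To make the array notation concrete, the following `R` sketch (with illustrative names; this is not the interface of the `FAMILY` package) builds *W* from *X* and *Z* according to (2) and evaluates *W* ∗ *B* as in (3):

```r
# Build the n x (p1+1) x (p2+1) array W of (2) from X (n x p1) and Z (n x p2).
make_W <- function(X, Z) {
  n <- nrow(X); p1 <- ncol(X); p2 <- ncol(Z)
  W <- array(NA_real_, dim = c(n, p1 + 1, p2 + 1))
  W[, 1, 1]  <- 1                  # j = k = 0: intercept
  W[, -1, 1] <- X                  # k = 0: main effects of X
  W[, 1, -1] <- Z                  # j = 0: main effects of Z
  for (j in 1:p1)
    W[, j + 1, -1] <- X[, j] * Z   # interactions X_{.,j} Z_{.,k}
  W
}

# (W * B)_i = sum_{j,k} W_{i,j,k} B_{j,k}, as in (3).
W_times_B <- function(W, B) apply(W, 1, function(Wi) sum(Wi * B))
```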

In fitting models with interactions, we may wish to impose either
*strong* or *weak* heredity [Hamada and Wu, 1992, Yates, 1978, Chipman,
1996, Joseph, 2006], defined
as follows:

- *Strong Heredity:* If an interaction term is included in the model, then both of the corresponding main effects must be present. That is, if *B*_{j,k} ≠ 0, then *B*_{j,0} ≠ 0 *and* *B*_{0,k} ≠ 0.
- *Weak Heredity:* If an interaction term is included in the model, then at least one of the corresponding main effects must be present. That is, if *B*_{j,k} ≠ 0, then either *B*_{j,0} ≠ 0 *or* *B*_{0,k} ≠ 0.

Such constraints facilitate model interpretation [McCullagh, 1984], improve statistical power [Cox, 1984], and simplify experimental designs [Bien et al., 2013]. In this paper we propose a general convex regularized regression approach which naturally and efficiently enforces strong heredity.
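
These definitions are easy to check programmatically. As a small illustration (again with hypothetical names), the following `R` helper tests whether a coefficient matrix *B*, indexed as above, obeys strong heredity:

```r
# B has rows 0..p1 and columns 0..p2, stored in R as rows 1..p1+1, cols 1..p2+1.
obeys_strong_heredity <- function(B, tol = 1e-8) {
  inter  <- abs(B[-1, -1, drop = FALSE]) > tol  # interaction block B_{j,k}
  main_x <- abs(B[-1, 1]) > tol                 # main effects B_{j,0}
  main_z <- abs(B[1, -1]) > tol                 # main effects B_{0,k}
  all(inter <= outer(main_x, main_z, "&"))      # every interaction has both parents
}
```

Replacing the conjunction with `outer(main_x, main_z, "|")` gives the corresponding check for weak heredity.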

A number of authors have considered the task of fitting interaction
models under strong or weak heredity constraints. Constraints to enforce
heredity [Peixoto, 1987, Friedman, 1991, Bickel et al., 2010, Park
and Hastie, 2008, Wu et al.,
2010] have been applied to conventional step-wise model selection
techniques [Montgomery et al., 2012,
chap. 10]. Chipman [1996] and George and McCulloch [1993] proposed
Bayesian methods. In more recent work, Hao and
Zhang [2014] proposed `iFORM`, an approach that
performs forward selection on the main effects, and allows interactions into the
model once the main effects have already been selected.
`iFORM` has a number of attractive properties,
including suitability for the ultra-high-dimensional setting, computational efficiency, and proven theoretical guarantees.

In this paper, we take a regularization approach to inducing strong heredity. A number of regularization approaches for this task have already been proposed in the literature; in fact, a strength of our proposal is that it provides a unified framework (and associated algorithm) of which several existing approaches can be seen as special cases. Choi et al. [2010] propose a non-convex approach, which amounts to a lasso [Tibshirani, 1996] problem with re-parametrized coefficients. Alternatively, some authors have enforced strong or weak heredity via convex penalties or constraints. Jenatton et al. [2011] and Zhao et al. [2009] describe a set of penalties that can be applied to a broad class of problems. As a special case they consider interaction models with strong or weak heredity; this has been further developed by Bach et al. [2012]. Radchenko and James [2010], Lim and Hastie [2013] and Bien et al. [2013] propose penalties specifically designed for interaction models with sparsity and strong heredity. We now describe the latter two approaches in greater detail.

The `hierNet` approach of Bien et al. [2013] fits the model (1) with *X* =
*Z* and *p*_{1} =
*p*_{2} = *p*. In the case of
strong heredity, using the notation of (3), they consider the problem

$$\underset{B\in {\mathbb{R}}^{(p+1)\times (p+1)},{\beta}^{\pm}\in {\mathbb{R}}^{p}}{\text{minimize}}\frac{1}{2}{\Vert y-W*B\Vert}_{2}^{2}+\lambda \sum _{j=1}^{p}({\beta}_{j}^{+}+{\beta}_{j}^{-})+\frac{\lambda}{2}{\Vert {B}_{-0,-0}\Vert}_{1}$$

$$\text{subject to}\hspace{1em}B={B}^{T},\hspace{0.5em}{B}_{0,-0}={\beta}^{+}-{\beta}^{-},$$

(4)

$${\Vert {B}_{j,-0}\Vert}_{1}\le {\beta}_{j}^{+}+{\beta}_{j}^{-},\hspace{0.5em}{\beta}_{j}^{+}\ge 0,\hspace{0.5em}{\beta}_{j}^{-}\ge 0\hspace{1em}\text{for }j=1,\dots ,p.$$

Using this notation, the coefficient for the *j*^{th} main effect is β_{j}^{+} − β_{j}^{−}; the constraint ‖*B*_{j,−0}‖_{1} ≤ β_{j}^{+} + β_{j}^{−} couples the interactions in the *j*^{th} row to the magnitude of the *j*^{th} main effect, and in this way enforces strong heredity.

Like `hierNet`, the
`glinternet` proposal of Lim and Hastie [2013] fits (1) with *X* =
*Z* and *p*_{1} =
*p*_{2} = *p*. In order to
describe this approach, we introduce some additional notation. Let α_{k} be the coefficient of the *k*^{th} main effect. We decompose α_{k} into latent coefficients, ${\alpha}_{k}={\sum}_{j\ne k}{\alpha}_{k}^{(j)}$, and consider the problem

$$\underset{{\alpha}_{0},\,{\{{\alpha}_{jk}\}}_{j\ne k},\,{\{{\alpha}_{k}^{(j)}\}}_{j\ne k}}{\text{minimize}}{\Vert y-{\alpha}_{0}-\sum _{k=1}^{p}\sum _{j\ne k}{\alpha}_{k}^{(j)}{X}_{\u2024,k}-\sum _{j\ne k}{\alpha}_{jk}({X}_{\u2024,j}*{X}_{\u2024,k})\Vert}_{2}^{2}\phantom{\rule{0ex}{0ex}}+\lambda (\sum _{j=1}^{p}|{\alpha}_{j}^{(0)}|+\sum _{j\ne k}\sqrt{{\left({\alpha}_{j}^{(k)}\right)}^{2}+{\left({\alpha}_{k}^{(j)}\right)}^{2}+{\alpha}_{jk}^{2}}),$$

(5)

where
*X*_{․,j} *
*X*_{․,k} denotes
element-wise multiplication. Strong heredity is enforced via the group lasso
[Yuan and Lin, 2006] penalties:
if either α_{jk} or
α_{kj} is estimated as non-zero,
then ${\alpha}_{j}^{(k)}$ and ${\alpha}_{k}^{(j)}$ will be estimated to be non-zero, and hence so will
α_{j} and
α_{k}.

The rest of this paper is organized as follows. In Section 2, we provide
details of `FAMILY`, our proposed approach for modeling
interactions. An unbiased estimator for its degrees of freedom is in Section 3,
and an extension to weak heredity is in Section 4. We explore
`FAMILY`'s empirical performance in simulation
in Section 5, and in an application to an HIV data set in Section 6. The
Discussion is in Section 7.

We propose a *framework
for modeling interactions with a
convex penalty*
(`FAMILY`). The `FAMILY` approach is
the solution to a convex optimization problem, which (using the notation of Section
1.1) takes the form

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\text{minimize}}\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{P}_{r}({B}_{j,\u2024})+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{P}_{c}({B}_{\u2024,k})+{\lambda}_{3}{\Vert {B}_{-0,-0}\Vert}_{1}.$$

(6)

Here, λ_{1}, λ_{2}, and λ_{3} are non-negative tuning parameters. *P*_{r} and *P*_{c} are convex penalty functions, applied to the rows and to the columns of *B*, respectively (the subscripts stand for 'row' and 'column').

As we will see, the choice of *P*_{r} and *P*_{c} determines the properties of the resulting estimator; in particular, certain choices induce strong heredity.

We suggest standardizing the columns of *X* and
*Z* to have mean zero and variance one before solving (6), in order to ensure that the main
effects and interactions are on the same scale, as is standard practice for
penalized regression estimators [Hastie et al.,
2009]. We take this approach in Sections 5 and 6.

The *main effects lasso* can be viewed as a special case of (6) in which *P*_{r} and *P*_{c} are ℓ_{1} penalties, i.e.

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\text{minimize}}\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{\Vert {B}_{j,\u2024}\Vert}_{1}+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{\Vert {B}_{\u2024,k}\Vert}_{1}+{\lambda}_{3}{\Vert {B}_{-0,-0}\Vert}_{1},$$

(7)

and where λ_{3} is chosen sufficiently large as
to shrink all of the interaction terms to 0. In this case, the lasso penalties
on the rows and columns are applied only to the main effects.

In contrast, if we take λ_{3} = 0, λ_{1} = λ_{2} = λ, and *P*_{r}(*B*_{j,․}) = ‖*B*_{j,․}‖_{1} and *P*_{c}(*B*_{․,k}) = ‖*B*_{․,k}‖_{1}, then (6) amounts to the *all-pairs lasso*,

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\text{minimize}}\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+\lambda {\Vert B\Vert}_{1}.$$

(8)

However, our main interest in this paper is to develop a convex framework for modeling interactions that obeys strong heredity. Clearly, the all-pairs lasso does not satisfy strong heredity, and the main effects lasso does so only in a trivial way (by setting all interaction coefficient estimates to zero).

We now consider three choices of *P*_{r} and *P*_{c}, each of which induces strong heredity.

[Figure 2: A graphical representation of the region *P*(β) ≤ 1, where *P*(β) = max(|β_{1}|, |β_{2}| + |β_{3}|) *(left)*; $P(\beta )=\sqrt{{\beta}_{1}^{2}+{\beta}_{2}^{2}+{\beta}_{3}^{2}}$ *(center)*; or *P*(β) = max(|β_{1}|, |β_{2}|, |β_{3}|) *(right)*.]

We first consider (6) in the case where *P*_{r}(*B*_{j,․}) = ‖*B*_{j,․}‖_{2} and *P*_{c}(*B*_{․,k}) = ‖*B*_{․,k}‖_{2}. This yields the optimization problem

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\text{minimize}}\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{\Vert {B}_{j,\u2024}\Vert}_{2}+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{\Vert {B}_{\u2024,k}\Vert}_{2}+{\lambda}_{3}{\Vert {B}_{-0,-0}\Vert}_{1}.$$

(9)

This formulation will induce strong heredity, in the sense that the coefficient estimate for an interaction between *X*_{j} and *Z*_{k} can be non-zero only if the coefficient estimates for the corresponding main effects are non-zero as well. We refer to (9) as `FAMILY.l2`.

Problem (9) is closely
related to `VANISH`, an approach for non-linear
interaction modeling [Radchenko and James,
2010]. In fact, if we take *X* =
*Z* and assume that all main effects and interactions are
scaled to have norm one in (9), and consider the case of `VANISH` with
only linear main effects and interactions, then
`VANISH` and (9) coincide exactly.

Radchenko and James [2010] attempt to solve the `VANISH` optimization problem via block coordinate descent. However, due to non-separability of the groups, their algorithm is not guaranteed to converge to the global optimum. In contrast, the algorithm in Section 2.3 is guaranteed to converge to the global optimum of (6) for any convex penalty, and can be extended to the case of generalized linear models.

We now consider (6) in the case where *P*_{r}(*B*_{j,․}) = ‖*B*_{j,․}‖_{∞} and *P*_{c}(*B*_{․,k}) = ‖*B*_{․,k}‖_{∞}. This yields

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\text{minimize}}\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{\Vert {B}_{j,\u2024}\Vert}_{\infty}+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{\Vert {B}_{\u2024,k}\Vert}_{\infty}+{\lambda}_{3}{\Vert {B}_{-0,-0}\Vert}_{1}.$$

(10)

This formulation also induces strong heredity. We refer to (10) as `FAMILY.linf`.

Finally, we consider (6) with *P*_{r}(*B*_{j,․}) = max(|*B*_{j,0}|, ‖*B*_{j,−0}‖_{1}) and *P*_{c}(*B*_{․,k}) = max(|*B*_{0,k}|, ‖*B*_{−0,k}‖_{1}), which yields

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\text{minimize}}\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}\text{max}(|{B}_{j,0}|,{\Vert {B}_{j,-0}\Vert}_{1})+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}\text{max}(|{B}_{0,k}|,{\Vert {B}_{-0,k}\Vert}_{1})+{\lambda}_{3}{\Vert {B}_{-0,-0}\Vert}_{1}.$$

(11)

In the special case where *X* =
*Z*, λ_{1} = λ_{2} =
λ, and λ_{3} = λ/2, (11) is in fact equivalent to
the `hierNet` proposal of Bien et al. [2013]. Details of this equivalence are
given in Bien et al. [2013].

Bien et al. [2013] propose to
solve `hierNet` via an ADMM algorithm which applies a
generalized gradient descent loop within each update. This leads to
computational inefficiency, especially for large *p*. In
Section 2.3, we propose a simple, stand-alone ADMM algorithm for solving
(6), which can be easily
applied to solve (11), and
consequently also the `hierNet` optimization
problem.

Given its connection to Bien et al.
[2013], we refer to (11) as `FAMILY.hierNet`.

Here we further consider the ℓ_{2}, ℓ_{∞}, and hybrid ℓ_{1}/ℓ_{∞} penalties discussed in Sections 2.2.1–2.2.3. For an arbitrary penalty *P*, the proximal operator is the solution to the optimization problem

$$\underset{\beta}{\text{minimize}}\frac{1}{2}{\Vert y-\beta \Vert}^{2}+\lambda P(\beta ).$$

(12)

We begin by presenting a well-known lemma (see e.g. Proposition 1.1, Bach et al. [2011]).

**Lemma 2.1.** *Let P*(*y*) *be a norm of y, with dual norm* ${P}_{*}(y)\equiv {\text{max}}_{z}\{{y}^{T}z:P(z)\le 1\}$. *Then the solution to* (12) *is* β̂ = 0 *if and only if P*_{*}(*y*) ≤ λ.

It is well-known that the ℓ_{2} norm is its own dual norm, and that the ℓ_{1} norm is dual to the ℓ_{∞} norm. We now derive the dual norm for the `FAMILY.hierNet` penalty. This lemma is proven in Appendix B.

**Lemma 2.2.** *The dual norm of P*(β) = max{|β_{1}|, ‖β_{−1}‖_{1}} *takes the form*

$${P}_{*}(\beta )=|{\beta}_{1}|+{\Vert {\beta}_{-1}\Vert}_{\infty}.$$

(13)

Lemmas 2.1 and 2.2 provide insight into the values of *y* for which all variables are shrunken to zero in (12). The dual norm balls for the hybrid ℓ_{1}/ℓ_{∞}, ℓ_{2}, and ℓ_{∞} norms are displayed in Figure 3. By Lemma 2.1, any *y* inside the dual norm ball leads to a zero solution of (12). For the hybrid ℓ_{1}/ℓ_{∞} norm, the shape of the dual norm ball implies that the first element of *y* plays an outsize role in whether or not the coefficient vector is shrunken to zero. Consequently, the main effects play a larger role than the interactions in determining whether sparsity is induced. In contrast, for the ℓ_{∞} and ℓ_{2} norms, the main effect and interactions play an equal role in determining whether the coefficients are shrunken to zero.

A step-by-step ADMM algorithm for solving `FAMILY`
is provided in Appendix
A.2. Here, we present an overview of this algorithm. A gentle
introduction to ADMM is provided in Appendix A.1.

We now develop an ADMM algorithm to solve (6). We define the variable
Θ = (*D*|*E*|*F*), with *D*, *E*, *F* ∈ ℝ^{(*p*_{1}+1)×(*p*_{2}+1)}.
That is, Θ is a (*p*_{1} + 1) ×
3(*p*_{2} + 1) matrix, which we partition into
*D, E*, and *F* for convenience. Then
(6) can be re-written as

$$\underset{\underset{\mathrm{\Theta}\in {\mathbb{R}}^{({p}_{1}+1)\times 3({p}_{2}+1)}}{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)},}}{\text{minimize}}\{\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{P}_{r}({D}_{j,\u2024})+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{P}_{c}({E}_{\u2024,k})+{\lambda}_{3}{\Vert {F}_{-0,-0}\Vert}_{1}\}$$

$$\text{subject to}\hspace{1em}B({I}_{({p}_{2}+1)\times ({p}_{2}+1)}|{I}_{({p}_{2}+1)\times ({p}_{2}+1)}|{I}_{({p}_{2}+1)\times ({p}_{2}+1)})=\mathrm{\Theta}.$$

(14)

The augmented Lagrangian corresponding to (14) takes the form

$${L}_{\rho}(B,\mathrm{\Theta},\mathrm{\Gamma})=\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{P}_{r}({D}_{j,\u2024})+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{P}_{c}({E}_{\u2024,k})+{\lambda}_{3}{\Vert {F}_{-0,-0}\Vert}_{1}\phantom{\rule{0ex}{0ex}}+\text{trace}({\mathrm{\Gamma}}^{T}(B(I|I|I)-\mathrm{\Theta}))+\rho /2{\Vert B(I|I|I)-\mathrm{\Theta}\Vert}_{F}^{2},$$

(15)

where Γ is a (*p*_{1} + 1)
× 3(*p*_{2} + 1)-dimensional dual variable.
For convenience, we partition Γ as follows: Γ =
(Γ_{1}|Γ_{2}|Γ_{3})
where Γ_{i} is a
(*p*_{1} + 1) ×
(*p*_{2} + 1) matrix for *i* = 1,
2, 3.

The augmented Lagrangian (15) can be rewritten as

$${L}_{\rho}(B,\mathrm{\Theta},\mathrm{\Gamma})=\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{P}_{r}({D}_{j,\u2024})+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{P}_{c}({E}_{\u2024,k})+{\lambda}_{3}{\Vert {F}_{-0,-0}\Vert}_{1}\phantom{\rule{0ex}{0ex}}+\langle {\mathrm{\Gamma}}_{1},B-D\rangle +\langle {\mathrm{\Gamma}}_{2},B-E\rangle +\langle {\mathrm{\Gamma}}_{3},B-F\rangle \phantom{\rule{0ex}{0ex}}+\rho /2{\Vert B-D\Vert}_{F}^{2}+\rho /2{\Vert B-E\Vert}_{F}^{2}+\rho /2{\Vert B-F\Vert}_{F}^{2}.$$

(16)

In order to develop an ADMM algorithm to solve (6), we must now simply figure
out how to minimize (16)
with respect to *B* with Θ held fixed, and how to
minimize (16) with respect
to Θ with *B* held fixed. Minimizing (16) with respect to
*B* amounts simply to a least squares problem. In order
to minimize (16) with
respect to Θ, we note that (16) can simply be minimized with respect to *D,
E*, and *F* separately. Minimizing (16) with respect to
*F* amounts simply to soft-thresholding [Friedman et al., 2007]. Minimizing
(16) with respect to
*D* or with respect to *E* amounts to
solving a problem that is equivalent to (12). We consider that problem next.

Details of the ADMM algorithm for solving (6) are given in Appendix A.2.
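
To make the structure of these updates concrete, here is a minimal `R` sketch of the ADMM iterations for (6) with ℓ_{2} row and column penalties (the `FAMILY.l2` case of Section 2.2.1), written in the scaled-dual form. The names are illustrative (this is not the `FAMILY` package's code), and for clarity the *B*-update re-solves a linear system directly rather than using the cached SVD described in Section 2.3.3:

```r
soft <- function(x, t) sign(x) * pmax(abs(x) - t, 0)  # elementwise soft-threshold
prox_l2 <- function(x, t) {                           # prox of t * ||.||_2
  nx <- sqrt(sum(x^2))
  if (nx <= t) 0 * x else (1 - t / nx) * x
}

# y: response; Wt: the n x (p1+1)(p2+1) matrix version of W, with columns
# ordered to match as.vector(B) (column-major).
family_l2_admm <- function(y, Wt, p1, p2, lam1, lam2, lam3, rho = 1, n_iter = 200) {
  n <- length(y); q <- (p1 + 1) * (p2 + 1)
  B <- D <- E <- Fm <- matrix(0, p1 + 1, p2 + 1)      # Fm plays the role of F
  U1 <- U2 <- U3 <- matrix(0, p1 + 1, p2 + 1)         # scaled duals Gamma_i / rho
  M <- crossprod(Wt) / n + 3 * rho * diag(q)          # fixed across iterations
  Wty <- crossprod(Wt, y) / n
  for (it in seq_len(n_iter)) {
    # B-update: a ridge-type least squares problem.
    rhs <- Wty + rho * as.vector(D - U1 + E - U2 + Fm - U3)
    B <- matrix(solve(M, rhs), p1 + 1, p2 + 1)
    # D-update: l2 prox on each penalized row (row 0 is unpenalized).
    D <- B + U1
    for (j in 2:(p1 + 1)) D[j, ] <- prox_l2(D[j, ], lam1 / rho)
    # E-update: l2 prox on each penalized column.
    E <- B + U2
    for (k in 2:(p2 + 1)) E[, k] <- prox_l2(E[, k], lam2 / rho)
    # F-update: soft-threshold the interaction block F_{-0,-0} only.
    Fm <- B + U3
    Fm[-1, -1] <- soft(Fm[-1, -1], lam3 / rho)
    # Dual ascent steps.
    U1 <- U1 + B - D; U2 <- U2 + B - E; U3 <- U3 + B - Fm
  }
  B
}
```

Swapping `prox_l2` for the ℓ_{∞} or hybrid ℓ_{1}/ℓ_{∞} proximal operators discussed below yields the other members of the `FAMILY` class.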

We saw in the previous section that the updates for
*D* and *E* in the ADMM algorithm amount
to solving the problem (12).
For *P*(β) = ‖β‖_{2},
(12) amounts to
soft-shrinkage [Simon et al., 2013,
Yuan and Lin, 2006], for which a
closed-form solution is available. For *P*(β) =
‖β‖_{∞}, an efficient algorithm
was proposed by Duchi and Singer
[2009]. We now present an efficient algorithm for solving (12) for
*P*(β) = max{|β_{1}|,
‖β_{−1}‖_{1}}.

**Lemma 2.3.** *Let* β̂ *denote the solution to* (12) *with P*(β) = max{|β_{1}|, ‖β_{−1}‖_{1}}. *Then* β̂ = *y* − *û*, *where û is the solution to*

$$\underset{u\in {\mathbb{R}}^{p},{\lambda}_{1}\in \mathbb{R}}{\text{minimize}}\frac{1}{2}{\Vert y-u\Vert}^{2}$$

$$\text{subject to}|{u}_{1}|\le {\lambda}_{1},{\Vert {u}_{-1}\Vert}_{\infty}\le \lambda -{\lambda}_{1},0\le {\lambda}_{1}\le \lambda .$$

(17)

We established in Section 2.2.4 that if λ ≥
|*y*_{1}| +
‖*y*_{−1}‖_{∞},
then the solution to (12) is zero. Therefore, we now restrict our attention to
the case λ < |*y*_{1}| +
‖*y*_{−1}‖_{∞}.
For a fixed λ_{1} ∈ [0, λ], we can see
by inspection that the solution to (17) is given by

$${u}_{1}({\lambda}_{1})=\{\begin{array}{cc}\hfill {y}_{1}\hfill & \hfill |{y}_{1}|\le {\lambda}_{1}\hfill \\ \hfill {\lambda}_{1}\text{sgn}({y}_{1})\hfill & \hfill |{y}_{1}|>{\lambda}_{1}\hfill \end{array}\hspace{1em}\text{and}\hspace{1em}{u}_{i}({\lambda}_{1})=\{\begin{array}{cc}\hfill {y}_{i}\hfill & \hfill |{y}_{i}|\le \lambda -{\lambda}_{1}\hfill \\ \hfill (\lambda -{\lambda}_{1})\text{sgn}({y}_{i})\hfill & \hfill |{y}_{i}|>\lambda -{\lambda}_{1}\hfill \end{array},$$

(18)

for *i* = 2, …,
*p*. Thus, (17) is equivalent to the problem

$$\underset{{\lambda}_{1}\in [0,\lambda ]}{\text{minimize}}\frac{1}{2}{\Vert y-u({\lambda}_{1})\Vert}^{2}.$$

(19)

**Theorem 2.4.** *Let z denote the* (*p* − 1)-*vector whose i*^{th} *element is* λ − |*y*_{i+1}|, *and let* z_{(1)} ≤ ⋯ ≤ z_{(p−1)} *denote its order statistics. Then the solution* λ̂_{1} *to* (19) *is given by*

$${\hat{\lambda}}_{1}=\{\begin{array}{cc}\hfill \lambda \hfill & \hfill {\text{if min}}_{j}\left\{\frac{|{y}_{1}|+{\sum}_{i=1}^{j}{z}_{(i)}}{j+1}\right\}\ge \lambda \hfill \\ \hfill 0\hfill & \hfill {\text{if min}}_{j}\left\{\frac{|{y}_{1}|+{\sum}_{i=1}^{j}{z}_{(i)}}{j+1}\right\}\le 0\hfill \\ \hfill {\text{min}}_{j}\left\{\frac{|{y}_{1}|+{\sum}_{i=1}^{j}{z}_{(i)}}{j+1}\right\}\hfill & \hfill \text{otherwise}\hfill \end{array}.$$

(20)

Combining Theorem 2.4 and Lemma 2.3 gives us a solution for (12) with the hybrid ℓ_{1}/ℓ_{∞} penalty. Proofs are given in Appendix B.
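
Putting Lemma 2.3, Theorem 2.4, and the zero-solution condition together, the hybrid proximal operator can be computed in closed form. A short `R` transcription of (18)–(20) (illustrative; assumes *p* ≥ 2):

```r
# Prox of lambda * max{|beta_1|, ||beta_{-1}||_1}, evaluated at y.
prox_hybrid <- function(y, lam) {
  p <- length(y)
  if (abs(y[1]) + max(abs(y[-1])) <= lam) return(rep(0, p))  # zero solution
  z    <- sort(lam - abs(y[-1]))                # order statistics z_(1) <= ...
  cand <- (abs(y[1]) + cumsum(z)) / (2:p)       # (|y_1| + sum_{i<=j} z_(i)) / (j+1)
  lam1 <- min(max(min(cand), 0), lam)           # clip to [0, lambda], per (20)
  u    <- pmin(pmax(y, lam1 - lam), lam - lam1) # u_i for i >= 2, per (18)
  u[1] <- min(max(y[1], -lam1), lam1)           # u_1, per (18)
  y - u                                         # beta-hat = y - u-hat (Lemma 2.3)
}
```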

As mentioned in Section A.1, ADMM's convergence to the global optimum is guaranteed for the convex, closed and proper objective function (6) [Boyd et al., 2011]. The computational complexity of the algorithm depends on the form of the penalty functions used.

The update for *B* is typically the most
computationally-demanding step of the ADMM algorithm for (6). As pointed out in Appendix A.2, this
can be done very efficiently. We perform the singular value decomposition of the *n* × (*p*_{1} + 1)(*p*_{2} + 1)-dimensional matrix *once*, given the data matrix *W*. Then,
in each iteration of the ADMM algorithm, the update for *B*
requires simply an efficient matrix inversion using the Woodbury matrix
formula.
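
A hedged sketch of that *B*-update, under the assumption that the matrix version of *W* is stored as `Wt`: with the thin SVD computed once, each solve with *M* = W̃ᵀW̃/*n* + 3ρ*I* reduces to a few matrix–vector products:

```r
sv <- svd(Wt)  # computed once; Wt = U diag(d) V'

# Return M^{-1} r for M = Wt'Wt/n + 3*rho*I, using Wt'Wt = V diag(d^2) V'.
solve_M <- function(r, sv, n, rho) {
  d2 <- sv$d^2 / n + 3 * rho
  vr <- crossprod(sv$v, r)                      # V' r
  r / (3 * rho) + sv$v %*% ((1 / d2 - 1 / (3 * rho)) * vr)
}
```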

We now report timing results for our
`R`-language implementation of
`FAMILY`, available in the package
`FAMILY` on `CRAN`, on an
Intel® Xeon® E5-2620 processor. We considered an example
with *n* = 350 and *p*_{1} =
*p*_{2} = 500 (for a total of 251,000 features).
Using the parametrization (33), running `FAMILY.l2` with α =
0.7 and a grid of 10 λ values takes a median time of 330 seconds,
and running `FAMILY.linf` takes a median time of 416
seconds.

The `FAMILY` optimization problem (6) can be extended to the case of
a general convex loss function *l*(·),

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\text{minimize}}\frac{1}{n}l(B)+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{P}_{r}({B}_{j,\u2024})+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{P}_{c}({B}_{\u2024,k})+{\lambda}_{3}{\Vert {B}_{-0,-0}\Vert}_{1}.$$

(21)

For instance, in the case of a binary response variable
*y*, we could take *l* to be the negative log
likelihood under a binomial model. Then (21) corresponds to a penalized logistic regression problem with
interactions. An ADMM algorithm for (21) can be derived just as in Section 2.3.1, with a modification to
the update for *B*. This is discussed in Appendix A.3.

The `FAMILY` optimization problem (6) is convex, and the algorithm
presented in Section 2.3 is guaranteed to yield a solution that achieves the
global minimum. But (6) is not
strictly convex: this means that the solution might not be unique, in the sense
that more than one value of *B* might achieve the global minimum.
However, uniqueness of the *fitted values* resulting from (6) is straightforward. This is
formalized in the following lemma. The proof is as in Lemma 1(ii) of Tibshirani et al. [2013].

*For a convex penalty function P*(·), *let* B̂ *denote the solution to the problem*

$$\underset{B\in {\mathbb{R}}^{({p}_{1}+1)\times ({p}_{2}+1)}}{\mathit{\text{minimize}}}\frac{1}{2n}{\Vert y-W*B\Vert}^{2}+P(B).$$

(22)

*The fitted values W* ∗ B̂ *are unique*.

Consider the linear model *y* =
*X*β + ε, with fixed *X*, and
ε ~ *N*_{n}(0, σ^{2}*I*_{n}). Then the
degrees of freedom of a model-fitting procedure is defined as [Stein, 1981, Efron, 1986]

$$\mathrm{d}\mathrm{f}=\frac{1}{{\sigma}^{2}}\sum _{i=1}^{n}\text{Cov}({y}_{i},{\u0177}_{i}),$$

(23)

where *ŷ _{i}* are the fitted
response values. If certain conditions hold, then

$$\mathrm{d}\mathrm{f}=E\left[\sum _{i=1}^{n}\frac{\partial {\u0177}_{i}}{\partial {y}_{i}}\right].$$

(24)

Therefore, ${\sum}_{i=1}^{n}\frac{\partial {\u0177}_{i}}{\partial {y}_{i}}$ is an unbiased estimator for the degrees of freedom of the model-fitting procedure.

Before presenting the main results of this section, we state a useful lemma.

**Lemma 3.1.** *Given a vector x* ∈ ℝ^{p}, *and an even positive integer q*,

$$\frac{{d}^{2}{\Vert x\Vert}_{q}}{d{x}^{2}}=(q-1)\text{diag}\left[{\left(\frac{x}{{\Vert x\Vert}_{q}}\right)}^{q-2}\right]\times [\frac{I}{{\Vert x\Vert}_{q}}-\frac{x{({x}^{T})}^{q-1}}{{\Vert x\Vert}_{q}^{q+1}}],$$

(25)

*where* diag(*x*) *is the diagonal
matrix with x on the diagonal, and*
(*x*)^{q}
*denotes the element-wise exponentiation of the vector
x*.
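
A direct `R` transcription of (25), useful for the estimators below (illustrative code):

```r
# Hessian of the l_q norm at x, for even q, per (25).
d2_norm_q <- function(x, q) {
  nq <- sum(x^q)^(1 / q)              # ||x||_q; q is even, so |x_i|^q = x_i^q
  w  <- (x / nq)^(q - 2)
  (q - 1) * diag(w, nrow = length(x)) %*%
    (diag(length(x)) / nq - outer(x, x^(q - 1)) / nq^(q + 1))
}
```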

We now consider the degrees of freedom of the estimator that solves the problem

$$\underset{\beta \in {\mathbb{R}}^{p}}{\text{minimize}}\frac{1}{2}{\Vert y-X\beta \Vert}_{2}^{2}+\sum _{d}{\lambda}_{d}{P}_{d}({A}_{d}\beta ),$$

(26)

where each *P*_{d}(·) is a norm, and each *A*_{d} is a fixed matrix; in our setting, *A*_{d}β extracts a row or a column of *B*, or the vector of interaction coefficients.

**Claim 3.2.** *An unbiased estimator of the degrees of freedom of* β̂, *the solution to* (26), *is given by*

$$\hat{\mathrm{d}\mathrm{f}}=\text{trace}\left({X}_{\mathcal{A}}{[{X}_{\mathcal{A}}^{T}{X}_{\mathcal{A}}+\sum _{d}{\lambda}_{d}{({A}_{d}^{\mathcal{A}})}^{T}{\ddot{P}}_{d}({A}_{d}^{\mathcal{A}}{\hat{\beta}}_{\mathcal{A}})({A}_{d}^{\mathcal{A}})]}^{-1}{X}_{\mathcal{A}}^{T}\right),$$

(27)

*where* P̈_{d}(·) *denotes the Hessian of P*_{d}, 𝒜 *denotes the set of indices of the non-zero elements of* β̂, *X*_{𝒜} *denotes the corresponding columns of X, and* ${A}_{d}^{\mathcal{A}}$ *the corresponding columns of A*_{d}.

The derivation for Claim 3.2 is outlined in Appendix C.

In this section we present estimates for the degrees of freedom of
`FAMILY.l2` and `FAMILY.linf`. An
estimate of the degrees of freedom of `FAMILY.hierNet` is
given in Bien et al. [2013].

We write `FAMILY.l2` in the form of (26),

$$\frac{1}{2}{\Vert y-\tilde{W}\tilde{B}\Vert}_{2}^{2}+n{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{\Vert {A}_{j}\tilde{B}\Vert}_{2}+n{\lambda}_{2}\sum _{k={p}_{1}+1}^{{p}_{1}+{p}_{2}}{\Vert {A}_{k}\tilde{B}\Vert}_{2}+n{\lambda}_{3}{\Vert {A}_{I}\tilde{B}\Vert}_{1},$$

(28)

where B̃ is the vectorized version of *B*, W̃ is the *n* × (*p*_{1} + 1)(*p*_{2} + 1)-dimensional matrix version of *W*, and the matrices *A*_{j}, *A*_{k}, and *A*_{I} extract the corresponding rows, columns, and interaction coefficients from B̃. We apply Claim 3.2 in order to obtain an unbiased estimate of the degrees of freedom of `FAMILY.l2`:

$${\hat{\mathrm{d}\mathrm{f}}}_{{\ell}_{2}}=\text{trace}\left({\tilde{W}}_{\mathcal{A}}{[{\tilde{W}}_{\mathcal{A}}^{T}{\tilde{W}}_{\mathcal{A}}+n{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{({A}_{j}^{\mathcal{A}})}^{T}[\ddot{P}({A}_{j}^{\mathcal{A}}{\hat{\tilde{B}}}_{\mathcal{A}})]({A}_{j}^{\mathcal{A}})+n{\lambda}_{2}\sum _{k={p}_{1}+1}^{{p}_{1}+{p}_{2}}{({A}_{k}^{\mathcal{A}})}^{T}[\ddot{P}({A}_{k}^{\mathcal{A}}{\hat{\tilde{B}}}_{\mathcal{A}})]({A}_{k}^{\mathcal{A}})]}^{-1}{\tilde{W}}_{\mathcal{A}}^{T}\right),$$

(29)

where $\ddot{P}({\upsilon}_{0})={\frac{{d}^{2}{\Vert \upsilon \Vert}_{2}}{d{\upsilon}^{2}}|}_{\upsilon ={\upsilon}_{0}}$ is of the form given in Lemma 3.1.

The ℓ_{∞} norm is not differentiable, and thus we cannot apply Claim 3.2 directly. Instead, we make use of the fact that $\underset{q\to \infty}{\text{lim}}{\Vert \beta \Vert}_{q}={\Vert \beta \Vert}_{\infty}$ in order to apply Claim 3.2 to a modified version of `FAMILY.linf` in which the ℓ_{∞} norm is replaced with an ℓ_{q} norm for a very large value of *q*. This yields the estimator

$${\hat{\mathrm{d}\mathrm{f}}}_{{\ell}_{\infty}}=\text{trace}\left({\tilde{W}}_{\mathcal{A}}{[{\tilde{W}}_{\mathcal{A}}^{T}{\tilde{W}}_{\mathcal{A}}+n{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{({A}_{j}^{\mathcal{A}})}^{T}[\ddot{P}({A}_{j}^{\mathcal{A}}{\hat{\tilde{B}}}_{\mathcal{A}})]({A}_{j}^{\mathcal{A}})+n{\lambda}_{2}\sum _{k={p}_{1}+1}^{{p}_{1}+{p}_{2}}{({A}_{k}^{\mathcal{A}})}^{T}[\ddot{P}({A}_{k}^{\mathcal{A}}{\hat{\tilde{B}}}_{\mathcal{A}})]({A}_{k}^{\mathcal{A}})]}^{-1}{\tilde{W}}_{\mathcal{A}}^{T}\right),$$

(30)

where P̈ is of the form given in Lemma 3.1. We use *q* = 500 in Section 3.4.

We now consider the numerical performance of our estimates of the
degrees of freedom of `FAMILY` in a simple simulation
setting. We use a fixed design matrix *X*, with
*n* = 100 rows and *p* = 10 main effects, and
we let *X* = *Z*. We randomly selected 15 true
interaction terms. We generated 100 different response vectors
*y*^{(1)}, …,
*y*^{(100)} using independent Gaussian noise. We
computed the true degrees of freedom as well as the estimated degrees of freedom
from (29) and (30), averaged over the 100
simulated data sets. In Figure 4, we see
almost perfect agreement between the true and estimated degrees of freedom.

We now consider a modification to the `FAMILY`
optimization problem, (6), that
imposes weak heredity. We assume that the main effects, interactions, and response
have been centered to have mean zero.

In order to enforce weak heredity, we take an approach motivated by the
latent overlap group lasso of Jacob et al.
[2009]. We let *W*^{X} denote the *n* × *p*_{1} × (*p*_{2} + 1) array with entries

$${W}_{i,j,k}^{X}=\{\begin{array}{cc}{X}_{i,j}{Z}_{i,k}\hfill & \text{for }k\ne 0\hfill \\ {X}_{i,j}\hfill & \text{for }k=0\hfill \end{array}.$$

(31)

We let *W*^{Z} denote the analogous *n* × (*p*_{1} + 1) × *p*_{2} array, with ${W}_{i,j,k}^{Z}={X}_{i,j}{Z}_{i,k}$ for *j* ≠ 0 and ${W}_{i,0,k}^{Z}={Z}_{i,k}$.

We propose to solve the optimization problem

$$\underset{\underset{{B}^{Z}\in {\mathbb{R}}^{({p}_{1}+1)\times {p}_{2}}}{{B}^{X}\in {\mathbb{R}}^{{p}_{1}\times ({p}_{2}+1)}}}{\text{minimize}}\frac{1}{2n}{\Vert y-{W}^{X}*{B}^{X}-{W}^{Z}*{B}^{Z}\Vert}_{2}^{2}+{\lambda}_{1}\sum _{j=1}^{{p}_{1}}{P}_{r}({B}_{j,\u2024}^{X})+{\lambda}_{2}\sum _{k=1}^{{p}_{2}}{P}_{c}({B}_{\u2024,k}^{Z})+{\lambda}_{3}({\Vert {B}_{\u2024,-0}^{X}\Vert}_{1}+{\Vert {B}_{-0,\u2024}^{Z}\Vert}_{1}).$$

(32)

Then the coefficient for the *j*^{th} main effect of *X* is ${B}_{j,0}^{X}$, the coefficient for the *k*^{th} main effect of *Z* is ${B}_{0,k}^{Z}$, and the coefficient for the (*j*, *k*)^{th} interaction is ${B}_{j,k}^{X}+{B}_{j,k}^{Z}$. A non-zero interaction requires the *j*^{th} row of *B*^{X} or the *k*^{th} column of *B*^{Z} to be non-zero, so that weak heredity holds.

Problem (32) can be solved using an ADMM algorithm similar to that of Section 2.3. Since the focus of this paper is on enforcing strong heredity, we leave the details of an algorithm for (32), as well as a careful numerical study, to future work.

We compare the performance of `FAMILY.l2` and
`FAMILY.linf` to the all-pairs lasso
(`APL`), the `hierNet` proposal of
Bien et al. [2013], and the
`glinternet` proposal of Lim and Hastie [2013]. `APL` can be performed using
the `glmnet` `R` package, and `hierNet` and
`glinternet` are implemented in `R`
packages available on `CRAN`. We also include the oracle model
[Fan and Li, 2001] — an
unpenalized model that uses only the main effects and interactions that are non-zero
in the true model — in our comparisons.
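
For reference, `APL` simply applies the lasso to the expanded design containing all main effects and pairwise products; a minimal `R` sketch using `glmnet` (assuming `X` and `y` are in hand):

```r
library(glmnet)
# All-pairs design: main effects plus all pairwise products of columns of X.
XZ  <- model.matrix(~ .^2 - 1, data = as.data.frame(X))
fit <- glmnet(XZ, y)  # lasso path over a grid of lambda values
```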

The forward selection proposal of Hao and
Zhang [2014], `iFORM`, is a fast screening approach
for detecting interactions in ultra-high dimensional data.
`iFORM` is intended for the setting in which the true model
is extremely sparse. In our simulation setting, we consider moderately sparse models, a setting that does not highlight the advantages of `iFORM`.
Thus, we do not include results for `iFORM` in our simulation
study.

To facilitate comparison with `hierNet` and
`glinternet`, which require *X* =
*Z*, we take *X* = *Z* in our
simulation study. Similar empirical results are obtained in simulations with
*X* ≠ *Z*; results are omitted due to
space constraints.

We consider squared error loss in Section 5.1, and logistic regression loss in Section 5.2.

We created a coefficient matrix *B*, with
*p* = 30 main effects and $\left(\begin{array}{c}\hfill p\hfill \\ \hfill 2\hfill \end{array}\right)=435$ interactions, for a total of 465 features. The first 10
main effects have non-zero coefficients, assigned uniformly from the set
{−5, −4, …, −1, 1, …, 5}. The
remaining main effects’ coefficients equal zero. We consider three
simulation settings, in which we randomly select 15, 30 or 45 non-zero
interaction coefficients, chosen to obey strong heredity. The values for the
non-zero coefficients were selected uniformly from the set {−10,
−8, …, −2, 2, …, 8, 10}. Figure 5 displays *B* in
each of the three simulation settings.

[Figure 5: For the simulation study in Section 5, the heatmap of the matrix *B* is displayed in the case of 15 (*left*), 30 (*center*), and 45 (*right*) non-zero interactions. The first row and column of each heatmap represent the main effects.]

We generated a training set, a test set, and a validation set, each
consisting of 300 observations. Each observation of *X* =
*Z* was generated independently from a *N*_{p}(0, *I*) distribution; *W* was then constructed according to (2). For each observation we
generated an independent Gaussian noise term, with variance adjusted to
maintain a signal-to-noise ratio of approximately 2.5 to 3.5. Finally, for
each observation, a response was generated according to (3).
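
The following `R` sketch illustrates one way to generate a single training set satisfying the constraints above (the exact draws in our study will of course differ):

```r
set.seed(1)
n <- 300; p <- 30
X <- matrix(rnorm(n * p), n, p)               # rows ~ N_p(0, I); X = Z
b_main <- c(sample(c(-5:-1, 1:5), 10, replace = TRUE), rep(0, p - 10))
pairs  <- t(combn(10, 2))                     # both parents among the first 10,
sel    <- pairs[sample(nrow(pairs), 15), ]    #   so strong heredity holds
b_int  <- sample(c(seq(-10, -2, 2), seq(2, 10, 2)), 15, replace = TRUE)
eta <- drop(X %*% b_main)
for (r in 1:15) eta <- eta + b_int[r] * X[, sel[r, 1]] * X[, sel[r, 2]]
sigma <- sqrt(var(eta) / 3)                   # signal-to-noise ratio of roughly 3
y <- eta + rnorm(n, sd = sigma)
```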

We applied `glinternet` and
`hierNet` for 50 different values of the tuning
parameters. For convenience, given that *X* =
*Z*, we reparametrized the `FAMILY`
optimization problem (6) as

$$\underset{B\in {\mathbb{R}}^{(p+1)\times (p+1)}}{\text{minimize}}\frac{1}{2n}{\Vert y-W*B\Vert}_{2}^{2}+(1-\alpha )\lambda \sqrt{p}\sum _{j=1}^{p}{P}_{r}({B}_{j,\u2024})+(1-\alpha )\lambda \sqrt{p}\sum _{k=1}^{p}{P}_{c}({B}_{\u2024,k})+\alpha \lambda {\Vert {B}_{-0,-0}\Vert}_{1}.$$

(33)

We applied `FAMILY.l2` and
`FAMILY.linf` over a 10 × 50 grid of
(α, λ) values, with α ∈ (0, 1) and λ
chosen to give a suitable range of sparsity.

In principle, many methods are available for selecting the tuning
parameters α and λ. These include Bayesian information
criterion, generalized cross-validation, and others. Because we do not have
an estimator for the degrees of freedom of the
`glinternet` estimator, we opted to use a
training/test/validation set approach. In greater detail, we fit each method
to the training set, selected tuning parameters based on sum of squared
residuals (SSR) on the test set, and then reported the SSR for that choice
of tuning parameters on the validation set.

It is well-known that penalized regression techniques tend to yield
models with over-shrunken coefficient estimates [Hastie et al., 2009, Fan and Li, 2001]. To overcome this problem, we obtained
*relaxed* versions of `FAMILY.l2`,
`FAMILY.linf`, `hierNet`, and
`glinternet`, by refitting an unpenalized least
squares model to the set of coefficients that are non-zero in the penalized
fitted model [Meinshausen, 2007,
Radchenko and James, 2010].

We also considered generating the observations of *X*
from a *N*_{p}(0, Σ) distribution,
where Σ was an autoregressive or an exchangeable covariance matrix.
We found that the choice of covariance matrix Σ led to little
qualitative difference in the results. Therefore, we display only results
for Σ = *I* in Section 5.1.2.

The left panel of Figure 6
displays ROC curves for `FAMILY.linf`,
`FAMILY.l2`, `hierNet`,
`glinternet`, and `APL`. These
results indicate that `FAMILY.l2` outperforms all other
methods in terms of variable selection, especially as the number of non-zero
interaction coefficients increases. When there are 45 non-zero interactions,
`FAMILY.linf` outperforms
`glinternet`, `hierNet`, and
`APL`.

[Figure 6: Results for the simulation study of Section 5.1, averaged over 100 simulated datasets, for `glinternet`, `hierNet`, `APL`, `FAMILY.l2` with α = 0.7, and `FAMILY.linf` with α = 0.83. *Left:* ROC curves. *Right:* test set SSR.]

The right panel of Figure 6
displays the test set SSR for all methods, as the tuning parameters are
varied. We observe that relaxation leads to improvement for each method: it
yields a much sparser model for a given value of the test error. This is not
surprising, since the relaxation alleviates some of the over-shrinkage
induced by the application of multiple convex penalties. The results further
indicate that when relaxation is applied, `FAMILY.l2`
performs the best, followed by `FAMILY.linf` and then
the other competitors. We once again observe that the improvement of
`FAMILY.l2` and `FAMILY.linf`
over the competitors increases as the number of non-zero interaction
coefficients increases.

Interestingly, the right-hand panel of Figure 6 indicates that though `FAMILY.l2`
performs the best when relaxation is performed, it performs quite poorly
when relaxation is not performed, in that the model with smallest test set
SSR contains far too many non-zero interactions. This is consistent with the
remark in Radchenko and James [2010]
regarding over-shrinkage of coefficient estimates.

In Table 1, we present
results on the validation set for the model that was fit on the training set
using the tuning parameters selected on the test set, as described in
Section 5.1.1. We see that `FAMILY.l2` and
`FAMILY.linf` outperform the competitors in terms
of SSR, false discovery rate, and true positive rate, especially when
relaxation is performed.

We assume that each response *y*_{i} is a Bernoulli variable with probability *p*_{i}, where

$$\text{log}\left(\frac{{p}_{i}}{1-{p}_{i}}\right)={(W*B)}_{i};i=1,\dots ,n,$$

(34)

where *W* * *B* is the
*n*-vector defined in Section 1.1. The matrices
*X* and *B* are generated in the exact
same manner as in Section 5.1.1, but now with *n* = 500
observations in the training and test sets.
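
Concretely, given the linear predictor η_{i} = (*W* ∗ *B*)_{i}, the binary responses can be drawn as follows (sketch, reusing `eta` from the data generation above):

```r
pr    <- 1 / (1 + exp(-eta))                   # p_i from (34)
y_bin <- rbinom(length(eta), size = 1, prob = pr)
```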

Once again, for convenience, we reparametrized
`FAMILY.l2` and `FAMILY.linf`
according to

$$\underset{B\in {\mathbb{R}}^{(p+1)\times (p+1)}}{\text{minimize}}-\frac{1}{n}\sum _{i=1}^{n}[{y}_{i}{(W*B)}_{i}-\text{log}(1+{e}^{{(W*B)}_{i}}\left)\right]\phantom{\rule{0ex}{0ex}}+\sqrt{p}(1-\alpha )\lambda \sum _{j=1}^{p}{P}_{r}({B}_{j,\u2024})+\sqrt{p}(1-\alpha )\lambda \sum _{k=1}^{p}{P}_{c}({B}_{\u2024,k})+\alpha \lambda {\Vert {B}_{-0,-0}\Vert}_{1}.$$

(35)

The results for logistic regression are displayed in Figure 7. The ROC curves in the left-hand
panel indicate that `FAMILY.linf` and
`FAMILY.l2` outperform the competitors in terms of
variable selection when there are 30 or 45 non-zero interactions. The SSR
curves in the right-hand panel of Figure
7 indicate that the relaxed versions of
`FAMILY.linf` and `FAMILY.l2`
perform very well in terms of prediction error on the test set, especially
as the number of non-zero interactions increases.

Rhee et al. [2006] study the
susceptibility of the HIV-1 virus to 6 nucleoside reverse transcriptase inhibitors
(NRTIs). The HIV-1 virus can become resistant to drugs via mutations in its genome
sequence. Therefore, there is a need to model HIV-1’s drug susceptibility as
a function of mutation status. We consider one particular NRTI, 3TC. The data
consists of a sparse binary matrix, with mutation status at each of 217 genomic
locations for *n* = 1057 HIV-1 isolates. For each of the
observations, there is a measure of susceptibility to 3TC. This data set was also
studied by Bien et al. [2013].

Rather than working with all 217 genomic locations, we create bins of ten
adjacent loci; this results in a design matrix with *p* = 22 features
and *n* = 1057 observations. We perform the binning because the raw
data contains mostly zeros, as most mutations occur in at most a few of the
observations; by binning the observations, we obtain less sparse data. This binning
is justified under the assumption that mutations in a particular region of the
genome sequence result in a change to a binding site, in which case nearby mutations
should have similar effects on a binding site, and hence similar associations with
drug susceptibility. This binning is also needed for computational reasons, in order
to allow for comparison to `hierNet` (specifically the version
that enforces strong heredity) using the `R` package of Bien et al. [2013]. (In Bien et al. [2013], all 217 genomic locations are analyzed using
a much faster algorithm that enforces *weak* (rather than strong)
heredity.)
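
One plausible `R` implementation of this binning step, where `muts` denotes the *n* × 217 binary mutation matrix (whether each bin records the presence of any mutation, as below, or a mutation count is our assumption):

```r
bins <- split(1:217, ceiling((1:217) / 10))   # 22 bins of ten adjacent loci
Xbin <- sapply(bins, function(ix)             #   (the last bin has seven)
  as.integer(rowSums(muts[, ix, drop = FALSE]) > 0))
```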

We split the observations into equally-sized training and test sets. We fit
`glinternet`, `hierNet`,
`FAMILY.l2`, and `FAMILY.linf` on the
training set for a range of tuning parameter values, and applied the fitted models
to the test set. In Figure 8, the test set SSR
is displayed as a function of the number of non-zero estimated interaction
coefficients, averaged over 50 splits of the data into training and test sets. The
figure reveals that all four methods give roughly similar results.

[Figure 8: The test set SSR is displayed for the HIV-1 data of Section 6, as a function of the number of non-zero interaction terms. Results are averaged over 50 splits of the observations into a training set and a test set.]

Figure 9 displays the estimated coefficient matrix, B̂, that results from applying each of
the four methods to all *n* = 1057 observations using the tuning
parameter values that minimized the average test set SSR. The estimated coefficients
are qualitatively similar for all four methods. All four methods detect some
non-zero interactions involving the 17th feature. `Glinternet`
yields the sparsest model.

In this paper, we have introduced `FAMILY`, a framework that unifies a number of existing estimators for high-dimensional models with interactions. Special cases of `FAMILY` correspond to the all-pairs lasso, the main effects lasso, `VANISH`, and `hierNet`. Furthermore, we have explored the use of `FAMILY` with ℓ_{2}, ℓ_{∞}, and hybrid ℓ_{1}/ℓ_{∞} penalties; these result in strong heredity and have good empirical performance.

The empirical results in Sections 5 and 6 indicate that the choice of
penalty in `FAMILY` may be of little practical importance: for
instance, `FAMILY.l2`, `FAMILY.linf`, and
`FAMILY.hierNet` have similar performance. However, one
could choose among penalties using cross-validation or a related approach.

We have presented a simple ADMM algorithm that can be used to solve the
`FAMILY` optimization problem for any convex penalty. It
finds the global optimum for `VANISH` (unlike the proposal in
Radchenko and James [2010]), and provides
a simpler alternative to the original `hierNet` algorithm
[Bien et al., 2013].

`FAMILY` could be easily extended to accommodate
higher-order interaction models. For instance, to accommodate third-order
interactions, we could take *B* to be a (*p* + 1)
× (*p* + 1) × (*p* + 1) coefficient
array. Instead of penalizing each row and each column of *B*, we
would instead penalize each ‘slice’ of the array.

In the simulation study in Section 5, we considered a setting with only
*p*_{1} = *p*_{2} = 30 main
effects. We did this in order to facilitate comparison to the
`hierNet` proposal, which is very computationally intensive
as implemented in the `R` package of Bien et al. [2013]. However, our proposal can be applied for
much larger values of *p*_{1} and
*p*_{2}, as discussed in Section 2.3.3.

The `R` package `FAMILY`,
available on `CRAN`, implements the methods described in this
paper.

We thank an anonymous associate editor and two referees for insightful comments that resulted in substantial improvements to this manuscript. We thank Helen Hao Zhang, Ning Hao, Jacob Bien, Michael Lim, and Trevor Hastie for providing software and helpful responses to inquiries. D.W. was supported by NIH Grant DP5OD009145, NSF CAREER Award DMS-1252624, and an Alfred P. Sloan Foundation Research Fellowship. N.S. was supported by NIH Grant DP5OD019820.

- Bach Francis, Jenatton Rodolphe, Mairal Julien, Obozinski Guillaume, et al. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning. 2011:19–53.
- Bach Francis, Jenatton Rodolphe, Mairal Julien, Obozinski Guillaume, et al. Structured sparsity through convex optimization. Statistical Science. 2012;27(4):450–468.
- Bickel Peter J, Ritov Yaacov, Tsybakov Alexandre B, et al. Hierarchical selection of variables in sparse high-dimensional regression. In: Borrowing strength: theory powering applications–a Festschrift for Lawrence D. Brown. Institute of Mathematical Statistics; 2010. pp. 56–69.
- Bien Jacob, Taylor Jonathan, Tibshirani Robert. A lasso for hierarchical interactions. The Annals of Statistics. 2013;41(3):1111–1141.
- Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning. 2011;3(1):1–122.
- Chipman Hugh. Bayesian variable selection with related predictors. Canadian Journal of Statistics. 1996;24(1):17–36.
- Choi Nam Hee, Li William, Zhu Ji. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association. 2010;105(489):354–364.
- Cox David R. Interaction. International Statistical Review/Revue Internationale de Statistique. 1984:1–24.
- Duchi John, Singer Yoram. Efficient online and batch learning using forward backward splitting. The Journal of Machine Learning Research. 2009;10:2899–2934.
- Efron Bradley. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association. 1986;81(394):461–470.
- Fan Jianqing, Li Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
- Friedman Jerome, Hastie Trevor, Höfling Holger, Tibshirani Robert, et al. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1(2):302–332.
- Friedman Jerome H. Multivariate adaptive regression splines. The Annals of Statistics. 1991:1–67.
- George Edward I, McCulloch Robert E. Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993;88(423):881–889.
- Hamada Michael, Wu CF Jeff. Analysis of designed experiments with complex aliasing. Journal of Quality Technology. 1992;24(3):130–137.
- Hao Ning, Zhang Hao Helen. Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association. 2014;109(507):1285–1301.
- Hastie Trevor, Tibshirani Robert, Friedman Jerome. The Elements of Statistical Learning. Vol. 2. Springer; 2009.
- Jacob Laurent, Obozinski Guillaume, Vert Jean-Philippe. Group lasso with overlap and graph lasso. In: Proceedings of the 26th Annual International Conference on Machine Learning; ACM; 2009. pp. 433–440.
- Jenatton Rodolphe, Audibert Jean-Yves, Bach Francis. Structured variable selection with sparsity-inducing norms. The Journal of Machine Learning Research. 2011;12:2777–2824.
- Joseph V Roshan. A Bayesian approach to the design and analysis of fractionated experiments. Technometrics. 2006;48(2):219–229.
- Lim Michael, Hastie Trevor. Learning interactions through hierarchical group-lasso regularization. arXiv preprint arXiv:1308.2719. 2013.
- McCullagh Peter. Generalized linear models. European Journal of Operational Research. 1984;16(3):285–292.
- Meinshausen Nicolai. Relaxed lasso. Computational Statistics & Data Analysis. 2007;52(1):374–393.
- Montgomery Douglas C, Peck Elizabeth A, Vining G Geoffrey. Introduction to Linear Regression Analysis. Vol. 821. John Wiley & Sons; 2012.
- Park Mee Young, Hastie Trevor. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9(1):30–50.
- Peixoto Julio L. Hierarchical variable selection in polynomial regression models. The American Statistician. 1987;41(4):311–313.
- Radchenko Peter, James Gareth M. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association. 2010;105(492):1541–1553.
- Rhee Soo-Yon, Taylor Jonathan, Wadhera Gauhar, Ben-Hur Asa, Brutlag Douglas L, Shafer Robert W. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences. 2006;103(46):17355–17360.
- Simon Noah, Friedman Jerome, Hastie Trevor, Tibshirani Robert. A sparse-group lasso. Journal of Computational and Graphical Statistics. 2013;22(2):231–245.
- Stein Charles M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics. 1981:1135–1151.
- Tibshirani Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996:267–288.
- Tibshirani Ryan J, et al. The lasso problem and uniqueness. Electronic Journal of Statistics. 2013;7:1456–1490.
- Wu Jing, Devlin Bernie, Ringquist Steven, Trucco Massimo, Roeder Kathryn. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology. 2010;34(3):275–285.
- Yates Frank. The Design and Analysis of Factorial Experiments. Imperial Bureau of Soil Science; 1978.
- Yuan Ming, Lin Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006;68(1):49–67.
- Zhao Peng, Rocha Guilherme, Yu Bin. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics. 2009;37(6A):3468–3497.