J Comput Graph Stat. Author manuscript; available in PMC 2017 March 16.
Published online 2015 August 12.
PMCID: PMC5353363
NIHMSID: NIHMS703415

# Convex Modeling of Interactions with Strong Heredity

## Abstract

We consider the task of fitting a regression model involving interactions among a potentially large set of covariates, in which we wish to enforce strong heredity. We propose FAMILY, a very general framework for this task. Our proposal is a generalization of several existing methods, such as VANISH [Radchenko and James, 2010], hierNet [Bien et al., 2013], the all-pairs lasso, and the lasso using only main effects. It can be formulated as the solution to a convex optimization problem, which we solve using an efficient alternating directions method of multipliers (ADMM) algorithm. This algorithm has guaranteed convergence to the global optimum, can be easily specialized to any convex penalty function of interest, and allows for a straightforward extension to the setting of generalized linear models. We derive an unbiased estimator of the degrees of freedom of FAMILY, and explore its performance in a simulation study and on an HIV sequence data set.

## 1 Introduction

### 1.1 Modeling Interactions

In this paper, we model a response variable with a set of main effects and second-order interactions. The problem can be formulated as follows: we are given a response vector y for n observations, an n × p1 matrix X of covariates, and another n × p2 matrix Z of covariates. In what follows, the notation X·,j and Z·,k will denote the jth column of X and the kth column of Z, respectively. The goal is to fit the model

$y_i = B_{0,0} + \sum_{j=1}^{p_1} B_{j,0} X_{i,j} + \sum_{k=1}^{p_2} B_{0,k} Z_{i,k} + \sum_{j=1}^{p_1} \sum_{k=1}^{p_2} B_{j,k} X_{i,j} Z_{i,k} + \varepsilon_i, \quad i = 1, \ldots, n,$
(1)

where B is a (p1 + 1) × (p2 + 1) matrix of coefficients, of which the rows and columns are indexed from 0 to p1 and 0 to p2 for the variables X and Z, respectively. In the special case where X = Z, the coefficient of the (j, k)th interaction is Bj,k + Bk,j, and the coefficient of the jth main effect is B0,j + Bj,0.

For brevity, we re-write model (1) using array notation. We construct the n × (p1 + 1) × (p2 + 1) array W as follows: for i ∈ {1, …, n}, j ∈ {0, …, p1}, k ∈ {0, …, p2},

$W_{i,j,k} = \begin{cases} X_{i,j} Z_{i,k} & \text{for } j \neq 0 \text{ and } k \neq 0 \\ X_{i,j} & \text{for } k = 0 \text{ and } j \neq 0 \\ Z_{i,k} & \text{for } j = 0 \text{ and } k \neq 0 \\ 1 & \text{for } j = k = 0. \end{cases}$
(2)

Then (1) is equivalent to the model

$y=W*B+ε,$
(3)

where B is the matrix of coefficients as in (1), and W * B denotes the n-vector whose ith element takes the form $(W * B)_i \equiv \sum_{j=0}^{p_1} \sum_{k=0}^{p_2} W_{i,j,k} B_{j,k}$. The model is displayed in the left panel of Figure 1.

Figure 1. Left: The model (1), for all n observations (top) and for the ith observation (bottom). The notation ⟨Wi,·,·, B⟩ denotes the inner product ∑j,k Wi,j,k Bj,k. Right: In (6), the (p1 + 1) × (p2 + 1) coefficient ...
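As a concrete illustration, the array W in (2) and the product W * B in (3) can be built directly from X and Z. The sketch below is purely illustrative (function names are ours; the FAMILY software need not materialize W this way):

```python
import numpy as np

def build_W(X, Z):
    """Construct the n x (p1+1) x (p2+1) array W of (2) from X (n x p1) and Z (n x p2)."""
    n, p1 = X.shape
    # Augment each matrix with a leading column of ones, so that index 0
    # corresponds to the intercept / main-effect slots.
    X1 = np.hstack([np.ones((n, 1)), X])   # n x (p1+1)
    Z1 = np.hstack([np.ones((n, 1)), Z])   # n x (p2+1)
    # W[i, j, k] = X1[i, j] * Z1[i, k] reproduces all four cases of (2).
    return X1[:, :, None] * Z1[:, None, :]

def W_star_B(W, B):
    """The n-vector (W * B)_i = sum_{j,k} W[i,j,k] B[j,k] of (3)."""
    return np.einsum('ijk,jk->i', W, B)
```

Row/column 0 of B then hold the intercept and main effects, and the (j, k) block with j, k ≥ 1 holds the interactions, matching the indexing used throughout.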

In fitting models with interactions, we may wish to impose either strong or weak heredity [Hamada and Wu, 1992, Yates, 1978, Chipman, 1996, Joseph, 2006], defined as follows:

• Strong Heredity: If an interaction term is included in the model, then both of the corresponding main effects must be present. That is, if Bj,k ≠ 0, then Bj,0 ≠ 0 and B0,k ≠ 0.
• Weak Heredity: If an interaction term is included in the model, then at least one of the corresponding main effects must be present. That is, if Bj,k ≠ 0, then either Bj,0 ≠ 0 or B0,k ≠ 0.
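These two definitions can be checked mechanically for any coefficient matrix B indexed as above; a small illustrative sketch (the function names are ours):

```python
import numpy as np

def satisfies_strong_heredity(B, tol=1e-10):
    """True if every non-zero interaction B[j,k] (j,k >= 1) has both
    main effects B[j,0] and B[0,k] non-zero."""
    p1, p2 = B.shape[0] - 1, B.shape[1] - 1
    for j in range(1, p1 + 1):
        for k in range(1, p2 + 1):
            if abs(B[j, k]) > tol and (abs(B[j, 0]) <= tol or abs(B[0, k]) <= tol):
                return False
    return True

def satisfies_weak_heredity(B, tol=1e-10):
    """True if every non-zero interaction has at least one main effect non-zero."""
    p1, p2 = B.shape[0] - 1, B.shape[1] - 1
    for j in range(1, p1 + 1):
        for k in range(1, p2 + 1):
            if abs(B[j, k]) > tol and abs(B[j, 0]) <= tol and abs(B[0, k]) <= tol:
                return False
    return True
```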

Such constraints facilitate model interpretation [McCullagh, 1984], improve statistical power [Cox, 1984], and simplify experimental designs [Bien et al., 2013]. In this paper we propose a general convex regularized regression approach which naturally and efficiently enforces strong heredity.

### 1.2 Summary of Previous Work

A number of authors have considered the task of fitting interaction models under strong or weak heredity constraints. Constraints to enforce heredity [Peixoto, 1987, Friedman, 1991, Bickel et al., 2010, Park and Hastie, 2008, Wu et al., 2010] have been applied to conventional step-wise model selection techniques [Montgomery et al., 2012, chap. 10]. Chipman [1996] and George and McCulloch [1993] proposed Bayesian methods. In more recent work, Hao and Zhang [2014] proposed iFORM, an approach that performs forward selection on the main effects, and allows interactions into the model once the main effects have already been selected. iFORM has a number of attractive properties, including suitability for the ultra-high-dimensional setting, computational efficiency, and proven theoretical guarantees.

In this paper, we take a regularization approach to inducing strong heredity. A number of regularization approaches for this task have already been proposed in the literature; in fact, a strength of our proposal is that it provides a unified framework (and associated algorithm) of which several existing approaches can be seen as special cases. Choi et al. [2010] propose a non-convex approach, which amounts to a lasso [Tibshirani, 1996] problem with re-parametrized coefficients. Alternatively, some authors have enforced strong or weak heredity via convex penalties or constraints. Jenatton et al. [2011] and Zhao et al. [2009] describe a set of penalties that can be applied to a broad class of problems. As a special case they consider interaction models with strong or weak heredity; this has been further developed by Bach et al. [2012]. Radchenko and James [2010], Lim and Hastie [2013] and Bien et al. [2013] propose penalties specifically designed for interaction models with sparsity and strong heredity. We now describe the latter two approaches in greater detail.

#### 1.2.1 hierNet [Bien et al., 2013]

The hierNet approach of Bien et al. [2013] fits the model (1) with X = Z and p1 = p2 = p. In the case of strong heredity, using the notation of (3), they consider the problem

$\underset{B \in \mathbb{R}^{(p+1) \times (p+1)},\; \beta^{\pm} \in \mathbb{R}^{p}}{\text{minimize}} \;\; \frac{1}{2} \|y - W * B\|_2^2 + \lambda \sum_{j=1}^{p} (\beta_j^{+} + \beta_j^{-}) + \frac{\lambda}{2} \|B_{-0,-0}\|_1$

$\text{subject to } B = B^T, \quad B_{0,-0} = \beta^{+} - \beta^{-},$
(4)

$\|B_{j,-0}\|_1 \leq \beta_j^{+} + \beta_j^{-}, \quad \beta_j^{+} \geq 0, \quad \beta_j^{-} \geq 0 \quad \text{for } j = 1, \ldots, p.$

Using this notation, the coefficient for the jth main effect is B0,j + Bj,0, and the coefficient for the (j, k)th interaction is Bj,k + Bk,j. Strong heredity is imposed by the constraint $\|B_{j,-0}\|_1 \leq \beta_j^{+} + \beta_j^{-}$.

#### 1.2.2 glinternet [Lim and Hastie, 2013]

Like hierNet, the glinternet proposal of Lim and Hastie [2013] fits (1) with X = Z and p1 = p2 = p. In order to describe this approach, we introduce some additional notation. Let αk be the coefficient of the kth main effect. We decompose αk into p parameters, i.e. $\alpha_k = \alpha_k^{(0)} + \alpha_k^{(1)} + \ldots + \alpha_k^{(k-1)} + \alpha_k^{(k+1)} + \ldots + \alpha_k^{(p)}$. We let αjk + αkj denote the coefficient for the interaction between Xj and Xk. Lim and Hastie [2013] propose to solve the optimization problem

$\underset{\alpha_0,\; \{\alpha_{jk}\}_{j \neq k},\; \{\alpha_k^{(j)}\}_{j \neq k}}{\text{minimize}} \;\; \Big\| y - \alpha_0 - \sum_{k=1}^{p} \sum_{j \neq k} \alpha_k^{(j)} X_{\cdot,k} - \sum_{j \neq k} \alpha_{jk} (X_{\cdot,j} * X_{\cdot,k}) \Big\|_2^2 + \lambda \Big( \sum_{j=1}^{p} |\alpha_j^{(0)}| + \sum_{j \neq k} \sqrt{(\alpha_j^{(k)})^2 + (\alpha_k^{(j)})^2 + \alpha_{jk}^2} \Big),$
(5)

where X·,j * X·,k denotes element-wise multiplication. Strong heredity is enforced via the group lasso [Yuan and Lin, 2006] penalties: if either αjk or αkj is estimated as non-zero, then $\alpha_j^{(k)}$ and $\alpha_k^{(j)}$ will be estimated as non-zero, and hence so will αj and αk.

### 1.3 Organization of Paper

The rest of this paper is organized as follows. In Section 2, we provide details of FAMILY, our proposed approach for modeling interactions. An unbiased estimator for its degrees of freedom is in Section 3, and an extension to weak heredity is in Section 4. We explore FAMILY's empirical performance in simulation in Section 5, and in an application to an HIV data set in Section 6. The Discussion is in Section 7.

## 2 Modeling Interactions with FAMILY

We propose a framework for modeling interactions with a convex penalty (FAMILY). The FAMILY approach is the solution to a convex optimization problem, which (using the notation of Section 1.1) takes the form

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} P_r(B_{j,\cdot}) + \lambda_2 \sum_{k=1}^{p_2} P_c(B_{\cdot,k}) + \lambda_3 \|B_{-0,-0}\|_1.$
(6)

Here, λ1, λ2, and λ3 are non-negative tuning parameters, and Pr and Pc are convex penalty functions on the rows and columns of the coefficient matrix B. The $\|B_{-0,-0}\|_1$ term denotes the element-wise ℓ1 norm of the interaction coefficients, which enforces sparsity on the interactions when λ3 is large. The right panel of Figure 1 demonstrates the action of each penalty on the matrix B.

As we will see, the choice of Pr and Pc will determine the type of structure (such as strong heredity) enforced on the fitted model. In the examples that follow, we take Pr = Pc; however, in principle, these two penalty functions need not be equal. For instance, if the features in Z are known to be of scientific importance, we might choose to perform feature selection on the main effects of X only. In this case, we might choose to use Pr(b) = ‖b‖2 and Pc(b) = 0.

We suggest standardizing the columns of X and Z to have mean zero and variance one before solving (6), in order to ensure that the main effects and interactions are on the same scale, as is standard practice for penalized regression estimators [Hastie et al., 2009]. We take this approach in Sections 5 and 6.

### 2.1 Connections to Lasso [Tibshirani, 1996]

The main effects lasso can be viewed as a special case of (6) in which Pc and Pr are ℓ1 penalties,

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} \|B_{j,\cdot}\|_1 + \lambda_2 \sum_{k=1}^{p_2} \|B_{\cdot,k}\|_1 + \lambda_3 \|B_{-0,-0}\|_1,$
(7)

and where λ3 is chosen sufficiently large as to shrink all of the interaction terms to 0. In this case, the lasso penalties on the rows and columns are applied only to the main effects.

In contrast, if we take λ3 = 0, λ1 = λ2 = λ, and $P_c(b) = P_r(b) = |b_1| + \frac{1}{2}\|b_{-1}\|_1$, where $b = (b_1, b_{-1}^T)^T$, then (6) yields the all-pairs lasso, which applies a lasso penalty to all main effects and all interactions. In this case, (6) can be re-written more simply as

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|_2^2 + \lambda \|B\|_1.$
(8)

However, our main interest in this paper is to develop a convex framework for modeling interactions that obeys strong heredity. Clearly, the all-pairs lasso does not satisfy strong heredity, and the main effects lasso does so only in a trivial way (by setting all interaction coefficient estimates to zero).
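Because (8) is an ordinary lasso on the flattened array, it can be solved with any standard lasso solver. As a minimal illustrative sketch (not the FAMILY software), proximal gradient descent (ISTA) on a flattened n × (p1+1)(p2+1) design `Wmat`, leaving the intercept column unpenalized:

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the prox of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def all_pairs_lasso(Wmat, y, lam, n_iter=500):
    """ISTA for (1/2n)||y - Wmat b||_2^2 + lam * ||b[1:]||_1 (intercept unpenalized)."""
    n, m = Wmat.shape
    b = np.zeros(m)
    step = 1.0 / (np.linalg.norm(Wmat, 2) ** 2 / n)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = Wmat.T @ (Wmat @ b - y) / n
        b = b - step * grad
        b[1:] = soft_threshold(b[1:], step * lam)     # prox step on the penalized coordinates
    return b
```

With the step size set to the reciprocal Lipschitz constant, each iteration is guaranteed not to increase the objective.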

### 2.2 FAMILY with Strong Heredity

We now consider three choices of Pr and Pc in (6) that yield an estimator that obeys strong heredity. In Section 2.2.1, we consider the case where Pr and Pc are group lasso (ℓ2) penalties. In Section 2.2.2, we consider the case where they are ℓ∞ penalties. We consider a hybrid between an ℓ1 and an ℓ∞ norm in Section 2.2.3. The unit norm balls corresponding to these three penalties are displayed in Figure 2.

Figure 2. A graphical representation of the region P(β) ≤ 1, where P(β) = max(|β1|, |β2| + |β3|) (left); $P(\beta) = \sqrt{\beta_1^2 + \beta_2^2 + \beta_3^2}$ (center); or P(β) = max(|β1|, |β2 ...

#### 2.2.1 FAMILY with an ℓ2 Penalty

We first consider (6) in the case where Pr(b) = Pc(b) = ‖b‖2, which we will refer to as FAMILY.l2. The resulting optimization problem takes the form

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} \|B_{j,\cdot}\|_2 + \lambda_2 \sum_{k=1}^{p_2} \|B_{\cdot,k}\|_2 + \lambda_3 \|B_{-0,-0}\|_1.$
(9)

This formulation will induce strong heredity, in the sense that an interaction between Xj and Xk can have a non-zero coefficient estimate only if both of the corresponding main effects are non-zero.

Problem (9) is closely related to VANISH, an approach for non-linear interaction modeling [Radchenko and James, 2010]. In fact, if we take X = Z and assume that all main effects and interactions are scaled to have norm one in (9), and consider the case of VANISH with only linear main effects and interactions, then VANISH and (9) coincide exactly.

Radchenko and James [2010] attempt to solve the VANISH optimization problem via block coordinate descent. However, due to the non-separability of the groups, their algorithm is not guaranteed to converge to the global optimum. In contrast, the algorithm in Section 2.3 is guaranteed to converge to the global optimum of (6) for any convex penalty, and can be extended to the case of generalized linear models.

#### 2.2.2 FAMILY with an ℓ∞ Penalty

We now consider (6) in the case where Pr(b) = Pc(b) = ‖b‖∞; we refer to this in what follows as FAMILY.linf. We refer the reader to Duchi and Singer [2009] for a discussion of the properties of the ℓ∞ norm, and its merits relative to the ℓ2 norm in inducing group sparsity. In this case, (6) takes the form

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} \|B_{j,\cdot}\|_\infty + \lambda_2 \sum_{k=1}^{p_2} \|B_{\cdot,k}\|_\infty + \lambda_3 \|B_{-0,-0}\|_1.$
(10)

This formulation also induces strong heredity.

#### 2.2.3 FAMILY with a Hybrid ℓ1/ℓ∞ Penalty

Finally, we consider (6) with $P_r(b) = P_c(b) = \max(|b_1|, \|b_{-1}\|_1)$. In this case, (6) takes the form

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} \max(|B_{j,0}|, \|B_{j,-0}\|_1) + \lambda_2 \sum_{k=1}^{p_2} \max(|B_{0,k}|, \|B_{-0,k}\|_1) + \lambda_3 \|B_{-0,-0}\|_1.$
(11)

In the special case where X = Z, λ1 = λ2 = λ, and λ3 = λ/2, (11) is in fact equivalent to the hierNet proposal of Bien et al. [2013]. Details of this equivalence are given in Bien et al. [2013].

Bien et al. [2013] propose to solve hierNet via an ADMM algorithm which applies a generalized gradient descent loop within each update. This leads to computational inefficiency, especially for large p. In Section 2.3, we propose a simple, stand-alone ADMM algorithm for solving (6), which can be easily applied to solve (11), and consequently also the hierNet optimization problem.

Given its connection to Bien et al. [2013], we refer to (11) as FAMILY.hierNet.

#### 2.2.4 Dual Norms

Here we further consider the ℓ2, ℓ∞, and hybrid ℓ1/ℓ∞ penalties discussed in Sections 2.2.1–2.2.3. For an arbitrary penalty P, the proximal operator is the solution to the optimization problem

$\underset{\beta}{\text{minimize}} \;\; \frac{1}{2} \|y - \beta\|^2 + \lambda P(\beta).$
(12)

We begin by presenting a well-known lemma (see e.g. Proposition 1.1, Bach et al. [2011]).

##### Lemma 2.1

Let P(y) be a norm of y with dual norm $P^*(y) \equiv \max_z \{z^T y : P(z) \leq 1\}$. Then β̂ = 0 solves (12) if and only if P*(y) ≤ λ.

It is well-known that the ℓ2 norm is its own dual norm, and that the ℓ1 norm is dual to the ℓ∞ norm. We now derive the dual norm for the FAMILY.hierNet penalty. This lemma is proven in Appendix B.

##### Lemma 2.2

The dual norm of $P(\beta) = \max\{|\beta_1|, \|\beta_{-1}\|_1\}$ takes the form

$P^*(\beta) = |\beta_1| + \|\beta_{-1}\|_\infty.$
(13)

Lemmas 2.1 and 2.2 provide insight into the values of y for which all variables are shrunken to zero in (12). The dual norm balls for the hybrid ℓ1/ℓ∞, ℓ2, and ℓ∞ norms are displayed in Figure 3. By Lemma 2.1, any y inside the dual norm ball leads to a zero solution of (12). For the hybrid ℓ1/ℓ∞ norm, the shape of the dual norm ball implies that the first element of y plays an outsize role in whether or not the coefficient vector is shrunken to zero. Consequently, the main effects play a larger role than the interactions in determining whether sparsity is induced. In contrast, for the ℓ∞ and ℓ2 norms, the main effects and interactions play an equal role in determining whether the coefficients are shrunken to zero.

Figure 3. A graphical representation of the region P*(β) ≤ 1, where P*(β) is the dual norm for P(β) = max(|β1|, |β2| + |β3|) (left); $P(\beta) = \sqrt{\beta_1^2 + \beta_2^2 + \beta_3^2}$ (center); or P(β) ...
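Lemma 2.2 is easy to verify numerically: the unit ball {P(z) ≤ 1} = {|z1| ≤ 1 and ‖z−1‖1 ≤ 1} is a product of two constraint sets, so the supremum of zᵀy splits into |y1| plus the dual of the ℓ1 ball, ‖y−1‖∞. An illustrative sketch (function names are ours):

```python
import numpy as np

def hybrid_norm(beta):
    """P(beta) = max(|beta_1|, ||beta_{-1}||_1)."""
    return max(abs(beta[0]), np.sum(np.abs(beta[1:])))

def hybrid_dual_norm(y):
    """P*(y) = |y_1| + ||y_{-1}||_inf, as in Lemma 2.2."""
    return abs(y[0]) + np.max(np.abs(y[1:]))

def dual_norm_by_search(y, n_samples=20000, seed=0):
    """Lower bound on P*(y) = max{z^T y : P(z) <= 1} from sampled feasible z."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_samples):
        z = rng.uniform(-1, 1, size=y.shape)          # |z_1| <= 1 already holds
        z[1:] /= max(np.sum(np.abs(z[1:])), 1.0)      # force ||z_{-1}||_1 <= 1
        best = max(best, float(z @ y))
    return best
```

The exact maximizer sets z1 = sgn(y1) and puts all of the ℓ1 budget of z−1 on the largest entry of |y−1|, achieving the value in (13).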

### 2.3 Algorithm for Solving FAMILY

A step-by-step ADMM algorithm for solving FAMILY is provided in Appendix A.2. Here, we present an overview of this algorithm. A gentle introduction to ADMM is provided in Appendix A.1.

#### 2.3.1 ADMM Algorithm for Solving FAMILY

We now develop an ADMM algorithm to solve (6). We define the variable Θ = (D|E|F), with D, E, F ∈ ℝ^{(p1+1)×(p2+1)}. That is, Θ is a (p1 + 1) × 3(p2 + 1) matrix, which we partition into D, E, and F for convenience. Then (6) can be re-written as

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)},\; \Theta \in \mathbb{R}^{(p_1+1) \times 3(p_2+1)}}{\text{minimize}} \left\{ \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} P_r(D_{j,\cdot}) + \lambda_2 \sum_{k=1}^{p_2} P_c(E_{\cdot,k}) + \lambda_3 \|F_{-0,-0}\|_1 \right\}$

$\text{subject to } B \left( I_{(p_2+1) \times (p_2+1)} \,|\, I_{(p_2+1) \times (p_2+1)} \,|\, I_{(p_2+1) \times (p_2+1)} \right) = \Theta.$
(14)

The augmented Lagrangian corresponding to (14) takes the form

$L_\rho(B, \Theta, \Gamma) = \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} P_r(D_{j,\cdot}) + \lambda_2 \sum_{k=1}^{p_2} P_c(E_{\cdot,k}) + \lambda_3 \|F_{-0,-0}\|_1 + \mathrm{trace}\left( \Gamma^T \left( B(I|I|I) - \Theta \right) \right) + \frac{\rho}{2} \|B(I|I|I) - \Theta\|_F^2,$
(15)

where Γ is a (p1 + 1) × 3(p2 + 1)-dimensional dual variable. For convenience, we partition Γ as follows: Γ = (Γ1|Γ2|Γ3), where Γi is a (p1 + 1) × (p2 + 1) matrix for i = 1, 2, 3.

The augmented Lagrangian (15) can be rewritten as

$L_\rho(B, \Theta, \Gamma) = \frac{1}{2n} \|y - W * B\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} P_r(D_{j,\cdot}) + \lambda_2 \sum_{k=1}^{p_2} P_c(E_{\cdot,k}) + \lambda_3 \|F_{-0,-0}\|_1 + \langle \Gamma_1, B - D \rangle + \langle \Gamma_2, B - E \rangle + \langle \Gamma_3, B - F \rangle + \frac{\rho}{2} \|B - D\|_F^2 + \frac{\rho}{2} \|B - E\|_F^2 + \frac{\rho}{2} \|B - F\|_F^2.$
(16)

In order to develop an ADMM algorithm to solve (6), we need only determine how to minimize (16) with respect to B with Θ held fixed, and how to minimize (16) with respect to Θ with B held fixed. Minimizing (16) with respect to B amounts to a least squares problem. To minimize (16) with respect to Θ, we note that (16) can be minimized with respect to D, E, and F separately. Minimizing (16) with respect to F amounts to soft-thresholding [Friedman et al., 2007], while minimizing (16) with respect to D or E amounts to solving a problem of the form (12). We consider that problem next.

Details of the ADMM algorithm for solving (6) are given in Appendix A.2.
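For concreteness, the three-block structure just described can be sketched for the FAMILY.l2 penalty (Pr = Pc = ℓ2). This is an illustrative implementation, not the FAMILY package: it assumes a flattened n × (p1+1)(p2+1) design `Wmat`, solves the B update with a cached direct inverse rather than the SVD/Woodbury scheme of Appendix A.2, and omits stopping rules:

```python
import numpy as np

def group_prox(M, t, axis):
    """Row-wise (axis=1) or column-wise (axis=0) l2 group soft-shrinkage, threshold t."""
    norms = np.linalg.norm(M, axis=axis, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return M * scale

def soft_threshold(M, t):
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def family_l2_admm(Wmat, y, p1, p2, lam1, lam2, lam3, rho=1.0, n_iter=2000):
    """ADMM for (6) with Pr = Pc = l2; B lives in R^{(p1+1) x (p2+1)}."""
    n, m = Wmat.shape
    shape = (p1 + 1, p2 + 1)
    B = D = E = F = np.zeros(shape)
    G1 = G2 = G3 = np.zeros(shape)               # dual variables Gamma_1..3
    # Cache the system matrix for the least-squares B update.
    A_inv = np.linalg.inv(Wmat.T @ Wmat / n + 3.0 * rho * np.eye(m))
    Wty = Wmat.T @ y / n
    for _ in range(n_iter):
        # B update: solve (W'W/n + 3 rho I) b = W'y/n + rho(d+e+f) - (g1+g2+g3).
        rhs = Wty + (rho * (D + E + F) - (G1 + G2 + G3)).ravel()
        B = (A_inv @ rhs).reshape(shape)
        # D update: group prox on rows 1..p1 (row 0 is unpenalized).
        D = B + G1 / rho
        D[1:, :] = group_prox(D[1:, :], lam1 / rho, axis=1)
        # E update: group prox on columns 1..p2 (column 0 is unpenalized).
        E = B + G2 / rho
        E[:, 1:] = group_prox(E[:, 1:], lam2 / rho, axis=0)
        # F update: soft-threshold the interaction block only.
        F = B + G3 / rho
        F[1:, 1:] = soft_threshold(F[1:, 1:], lam3 / rho)
        # Dual ascent.
        G1 = G1 + rho * (B - D)
        G2 = G2 + rho * (B - E)
        G3 = G3 + rho * (B - F)
    return B, D, E, F
```

At convergence B, D, E, and F agree, and each copy carries the sparsity pattern enforced by its own penalty.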

#### 2.3.2 Solving (12) for ℓ2, ℓ∞, and Hybrid ℓ1/ℓ∞ Penalties

We saw in the previous section that the updates for D and E in the ADMM algorithm amount to solving the problem (12). For P(β) = ‖β‖2, (12) amounts to soft-shrinkage [Simon et al., 2013, Yuan and Lin, 2006], for which a closed-form solution is available. For P(β) = ‖β‖∞, an efficient algorithm was proposed by Duchi and Singer [2009]. We now present an efficient algorithm for solving (12) for $P(\beta) = \max\{|\beta_1|, \|\beta_{-1}\|_1\}$.

##### Lemma 2.3

Let β̂ denote the solution to (12) with $P(\beta) = \max\{|\beta_1|, \|\beta_{-1}\|_1\}$. Then β̂ = y − û, where û is the solution to

$\underset{u \in \mathbb{R}^{p},\; \lambda_1 \in \mathbb{R}}{\text{minimize}} \;\; \frac{1}{2} \|y - u\|^2$

$\text{subject to } |u_1| \leq \lambda_1, \quad \|u_{-1}\|_\infty \leq \lambda - \lambda_1, \quad 0 \leq \lambda_1 \leq \lambda.$
(17)

We established in Section 2.2.4 that if λ ≥ |y1| + ‖y−1‖∞, then the solution to (12) is zero. Therefore, we now restrict our attention to the case λ < |y1| + ‖y−1‖∞. For a fixed λ1 ∈ [0, λ], we can see by inspection that the solution to (17) is given by

$u_1(\lambda_1) = \begin{cases} y_1 & |y_1| \leq \lambda_1 \\ \lambda_1 \, \mathrm{sgn}(y_1) & |y_1| > \lambda_1 \end{cases} \quad \text{and} \quad u_i(\lambda_1) = \begin{cases} y_i & |y_i| \leq \lambda - \lambda_1 \\ (\lambda - \lambda_1) \, \mathrm{sgn}(y_i) & |y_i| > \lambda - \lambda_1, \end{cases}$
(18)

for i = 2, …, p. Thus, (17) is equivalent to the problem

$\underset{\lambda_1 \in [0, \lambda]}{\text{minimize}} \;\; \frac{1}{2} \|y - u(\lambda_1)\|^2.$
(19)

##### Theorem 2.4

Let z denote the (p − 1)-vector whose ith element is λ − |yi+1|, and let z(1) ≤ ⋯ ≤ z(p−1) denote its order statistics. Then the solution to problem (19) is given by

$\hat{\lambda}_1 = \begin{cases} \lambda & \text{if } \min_j \left\{ \frac{|y_1| + \sum_{i=1}^{j} z_{(i)}}{j+1} \right\} \geq \lambda \\ 0 & \text{if } \min_j \left\{ \frac{|y_1| + \sum_{i=1}^{j} z_{(i)}}{j+1} \right\} \leq 0 \\ \min_j \left\{ \frac{|y_1| + \sum_{i=1}^{j} z_{(i)}}{j+1} \right\} & \text{otherwise.} \end{cases}$
(20)

Combining Theorem 2.4 and Lemma 2.3 gives us a solution for (12) with the hybrid ℓ1/ℓ∞ penalty. Proofs are given in Appendix B.
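Putting Lemma 2.3, (18), and Theorem 2.4 together, the hybrid prox reduces to one sort plus a clipping step; an illustrative sketch (our own function names), with the closed-form ℓ2 soft-shrinkage included for comparison:

```python
import numpy as np

def prox_l2(y, lam):
    """Solution of (12) for P = ||.||_2: group soft-shrinkage."""
    nrm = np.linalg.norm(y)
    return np.zeros_like(y) if nrm <= lam else (1.0 - lam / nrm) * y

def prox_hybrid(y, lam):
    """Solution of (12) for P(beta) = max(|beta_1|, ||beta_{-1}||_1),
    following Lemma 2.3, (18), and Theorem 2.4."""
    y = np.asarray(y, dtype=float)
    # Lemmas 2.1 and 2.2: the solution is zero iff the dual norm is <= lam.
    if abs(y[0]) + np.max(np.abs(y[1:])) <= lam:
        return np.zeros_like(y)
    z = np.sort(lam - np.abs(y[1:]))                    # order statistics z_(i)
    # Candidate values (|y_1| + sum_{i<=j} z_(i)) / (j + 1) for j = 0, ..., p-1.
    cand = np.concatenate(([abs(y[0])],
                           (abs(y[0]) + np.cumsum(z)) / np.arange(2, len(y) + 1)))
    lam1 = np.clip(cand.min(), 0.0, lam)                # Theorem 2.4
    u = np.empty_like(y)                                # u(lam1) from (18)
    u[0] = np.clip(y[0], -lam1, lam1)
    u[1:] = np.clip(y[1:], lam1 - lam, lam - lam1)
    return y - u                                        # Lemma 2.3
```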

#### 2.3.3 Convergence, Computational Complexity, and Timing Results

As mentioned in Section A.1, ADMM is guaranteed to converge to the global optimum because the objective function in (6) is convex, closed, and proper [Boyd et al., 2011]. The computational complexity of the algorithm depends on the form of the penalty functions used.

The update for B is typically the most computationally demanding step of the ADMM algorithm for (6). As pointed out in Appendix A.2, it can be performed very efficiently: we compute the singular value decomposition of the n × (p1 + 1)(p2 + 1)-dimensional matrix version of W once, given the data; then, in each iteration of the ADMM algorithm, the update for B requires only an efficient matrix inversion via the Woodbury matrix identity.

We now report timing results for our R-language implementation of FAMILY, available in the package FAMILY on CRAN, on an Intel® Xeon® E5-2620 processor. We considered an example with n = 350 and p1 = p2 = 500 (for a total of 251,000 features). Using the parametrization (33), running FAMILY.l2 with α = 0.7 and a grid of 10 λ values takes a median time of 330 seconds, and running FAMILY.linf takes a median time of 416 seconds.

### 2.4 Extension to Generalized Linear Models

The FAMILY optimization problem (6) can be extended to the case of a general convex loss function l(·),

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{n} l(B) + \lambda_1 \sum_{j=1}^{p_1} P_r(B_{j,\cdot}) + \lambda_2 \sum_{k=1}^{p_2} P_c(B_{\cdot,k}) + \lambda_3 \|B_{-0,-0}\|_1.$
(21)

For instance, in the case of a binary response variable y, we could take l to be the negative log likelihood under a binomial model. Then (21) corresponds to a penalized logistic regression problem with interactions. An ADMM algorithm for (21) can be derived just as in Section 2.3.1, with a modification to the update for B. This is discussed in Appendix A.3.
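In the logistic case, only the smooth part of the B update changes: the least squares loss is replaced by the binomial negative log-likelihood and its gradient. An illustrative sketch of that loss for a flattened design `Wmat` (our own notation, not the package's internals), which could drive a gradient-based B update:

```python
import numpy as np

def logistic_loss(Wmat, y, b):
    """(1/n) * negative binomial log-likelihood, with eta_i = (W * B)_i = (Wmat b)_i."""
    eta = Wmat @ b
    # log(1 + e^eta) computed stably via logaddexp(0, eta).
    return np.mean(np.logaddexp(0.0, eta) - y * eta)

def logistic_grad(Wmat, y, b):
    """Gradient of logistic_loss with respect to the flattened coefficients b."""
    eta = Wmat @ b
    p = 1.0 / (1.0 + np.exp(-eta))   # fitted probabilities
    return Wmat.T @ (p - y) / len(y)
```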

### 2.5 Uniqueness of the FAMILY Solution

The FAMILY optimization problem (6) is convex, and the algorithm presented in Section 2.3 is guaranteed to yield a solution that achieves the global minimum. But (6) is not strictly convex: this means that the solution might not be unique, in the sense that more than one value of B might achieve the global minimum. However, uniqueness of the fitted values resulting from (6) is straightforward. This is formalized in the following lemma. The proof is as in Lemma 1(ii) of Tibshirani et al. [2013].

#### Lemma 2.5

For a convex penalty function P(·), let B̂ denote the solution to the problem

$\underset{B \in \mathbb{R}^{(p_1+1) \times (p_2+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|^2 + P(B).$
(22)

The fitted values W * B̂ are unique.

## 3 Degrees of Freedom

### 3.1 Review of Degrees of Freedom

Consider the linear model y = Xβ + ε, with fixed X and ε ~ 𝒩(0, σ²In). Then the degrees of freedom of a model-fitting procedure is defined as [Stein, 1981, Efron, 1986]

$df = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}(y_i, \hat{y}_i),$
(23)

where ŷi are the fitted response values. If certain conditions hold, then

$df = E\left[ \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i} \right].$
(24)

Therefore, $\sum_{i=1}^{n} \partial \hat{y}_i / \partial y_i$ is an unbiased estimator for the degrees of freedom of the model-fitting procedure.
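For a linear smoother ŷ = Hy, the sum of partial derivatives in (24) is exactly trace(H). A small numerical check of this, using ridge regression with H = X(XᵀX + λI)⁻¹Xᵀ as the smoother (an illustrative example, not part of FAMILY):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge fitted values: y_hat = X (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def divergence(fit, y, eps=1e-6):
    """Numerically evaluate sum_i d y_hat_i / d y_i by central differences."""
    total = 0.0
    for i in range(len(y)):
        yp, ym = y.copy(), y.copy()
        yp[i] += eps
        ym[i] -= eps
        total += (fit(yp)[i] - fit(ym)[i]) / (2 * eps)
    return total
```

Because the ridge smoother is linear in y, the numerical divergence agrees with trace(H) up to floating-point error.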

Before presenting the main results of this section, we state a useful lemma.

#### Lemma 3.1

Given a vector x ∈ ℝp and an even positive integer q,

$\frac{d^2 \|x\|_q}{dx^2} = (q-1) \, \mathrm{diag}\left[ \left( \frac{x}{\|x\|_q} \right)^{q-2} \right] \left[ \frac{I}{\|x\|_q} - \frac{x \, (x^{q-1})^T}{\|x\|_q^{q+1}} \right],$
(25)

where diag(x) is the diagonal matrix with x on the diagonal, and $x^q$ denotes the element-wise exponentiation of the vector x.
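Lemma 3.1 can be sanity-checked numerically against a finite-difference Hessian; an illustrative sketch for q = 4, with the element-wise powers written exactly as in the lemma:

```python
import numpy as np

def lq_norm(x, q):
    return np.sum(np.abs(x) ** q) ** (1.0 / q)

def lq_hessian(x, q):
    """The Hessian of ||x||_q from Lemma 3.1 (q an even positive integer)."""
    nrm = lq_norm(x, q)
    d = np.diag((x / nrm) ** (q - 2))                      # diag[(x/||x||_q)^(q-2)]
    outer = np.outer(x, x ** (q - 1)) / nrm ** (q + 1)     # x (x^(q-1))^T / ||x||_q^(q+1)
    return (q - 1) * d @ (np.eye(len(x)) / nrm - outer)
```

For even q the gradient of ‖x‖q is x^{q−1}‖x‖q^{1−q}, so differencing that gradient column by column recovers the matrix above.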

### 3.2 Degrees of Freedom for a Penalized Regression Problem

We now consider the degrees of freedom of the estimator that solves the problem

$\underset{\beta \in \mathbb{R}^{p}}{\text{minimize}} \;\; \frac{1}{2} \|y - X\beta\|_2^2 + \sum_d \lambda_d P_d(A_d \beta),$
(26)

where Pd(·) is an ℓq norm for a positive q, and Ad is a p × p diagonal matrix with ones and zeros on the main diagonal. We define the active set to be 𝒜 = {j : β̂j ≠ 0}, the set of non-zero coefficient estimates. Let β̂𝒜 denote the coefficients of the active set, and let X𝒜 denote the matrix with columns corresponding to elements of the active set. Furthermore, we define $A_d^{\mathcal{A}}$ to be the sub-matrix of Ad with rows and columns in 𝒜.

#### Claim 3.2

An unbiased estimator of the degrees of freedom of β̂, the solution to (26), is given by

$\widehat{df} = \mathrm{trace}\left( X_{\mathcal{A}} \left[ X_{\mathcal{A}}^T X_{\mathcal{A}} + \sum_d \lambda_d (A_d^{\mathcal{A}})^T \ddot{P}_d(A_d^{\mathcal{A}} \hat{\beta}_{\mathcal{A}}) (A_d^{\mathcal{A}}) \right]^{-1} X_{\mathcal{A}}^T \right),$
(27)

where $\ddot{P}_d(\cdot)$ is the Hessian of the function Pd(·), and where 𝒜 is the active set.

The derivation for Claim 3.2 is outlined in Appendix C.

### 3.3 Degrees of Freedom for FAMILY

In this section we present estimates for the degrees of freedom of FAMILY.l2 and FAMILY.linf. An estimate of the degrees of freedom of FAMILY.hierNet is given in Bien et al. [2013].

#### 3.3.1 FAMILY.l2

We write FAMILY.l2 in the form of (26),

$\frac{1}{2} \|y - \tilde{W} \tilde{B}\|_2^2 + n\lambda_1 \sum_{j=1}^{p_1} \|A_j \tilde{B}\|_2 + n\lambda_2 \sum_{k=p_1+1}^{p_1+p_2} \|A_k \tilde{B}\|_2 + n\lambda_3 \|A_I \tilde{B}\|_1,$
(28)

where B̃ is the vectorized version of B, and W̃ is the n × (p1 + 1)(p2 + 1)-dimensional matrix version of W. We apply Claim 3.2 in order to obtain an unbiased estimate of the degrees of freedom of FAMILY.l2:

$\widehat{df}_{\ell_2} = \mathrm{trace}\left( \tilde{W}_{\mathcal{A}} \left[ \tilde{W}_{\mathcal{A}}^T \tilde{W}_{\mathcal{A}} + n\lambda_1 \sum_{j=1}^{p_1} (A_j^{\mathcal{A}})^T \ddot{P}(A_j^{\mathcal{A}} \hat{\tilde{B}}_{\mathcal{A}}) (A_j^{\mathcal{A}}) + n\lambda_2 \sum_{k=p_1+1}^{p_1+p_2} (A_k^{\mathcal{A}})^T \ddot{P}(A_k^{\mathcal{A}} \hat{\tilde{B}}_{\mathcal{A}}) (A_k^{\mathcal{A}}) \right]^{-1} \tilde{W}_{\mathcal{A}}^T \right),$
(29)

where $\ddot{P}(\upsilon_0) = \left. \frac{d^2 \|\upsilon\|_2}{d\upsilon^2} \right|_{\upsilon = \upsilon_0}$ is of the form given in Lemma 3.1.

#### 3.3.2 FAMILY.linf

The ℓ∞ norm is not differentiable, and thus we cannot apply Claim 3.2 directly. Instead, we make use of the fact that $\lim_{q \to \infty} \|\beta\|_q = \|\beta\|_\infty$ in order to apply Claim 3.2 to a modified version of FAMILY.linf in which the ℓ∞ norm is replaced with an ℓq norm for a very large value of q. This yields the estimator

$\widehat{df}_{\ell_\infty} = \mathrm{trace}\left( \tilde{W}_{\mathcal{A}} \left[ \tilde{W}_{\mathcal{A}}^T \tilde{W}_{\mathcal{A}} + n\lambda_1 \sum_{j=1}^{p_1} (A_j^{\mathcal{A}})^T \ddot{P}(A_j^{\mathcal{A}} \hat{\tilde{B}}_{\mathcal{A}}) (A_j^{\mathcal{A}}) + n\lambda_2 \sum_{k=p_1+1}^{p_1+p_2} (A_k^{\mathcal{A}})^T \ddot{P}(A_k^{\mathcal{A}} \hat{\tilde{B}}_{\mathcal{A}}) (A_k^{\mathcal{A}}) \right]^{-1} \tilde{W}_{\mathcal{A}}^T \right),$
(30)

where P̈ is the Hessian of the ℓq norm, of the form given in Lemma 3.1. We use q = 500 in Section 3.4.

### 3.4 Numerical Results

We now consider the numerical performance of our estimates of the degrees of freedom of FAMILY in a simple simulation setting. We use a fixed design matrix X, with n = 100 rows and p = 10 main effects, and we let X = Z. We randomly selected 15 true interaction terms. We generated 100 different response vectors y(1), …, y(100) using independent Gaussian noise. We computed the true degrees of freedom as well as the estimated degrees of freedom from (29) and (30), averaged over the 100 simulated data sets. In Figure 4, we see almost perfect agreement between the true and estimated degrees of freedom.

Figure 4. The estimated degrees of freedom as a function of the actual degrees of freedom, for (Left:) FAMILY.l2 and (Right:) FAMILY.linf. To estimate the degrees of freedom for FAMILY.linf, we used q = 500 in (30). Several values of α were used in the ...

## 4 Extension to Weak Heredity

We now consider a modification to the FAMILY optimization problem, (6), that imposes weak heredity. We assume that the main effects, interactions, and response have been centered to have mean zero.

In order to enforce weak heredity, we take an approach motivated by the latent overlap group lasso of Jacob et al. [2009]. We let WX denote the n × p1 × (p2 + 1) array defined as follows: for i ∈ {1, …, n}, j ∈ {1, …, p1}, k ∈ {0, …, p2},

$W_{i,j,k}^{X} = \begin{cases} X_{i,j} Z_{i,k} & \text{for } k \neq 0 \\ X_{i,j} & \text{for } k = 0. \end{cases}$
(31)

We let WZ denote the n × (p1+1) × p2 array defined in an analogous way. We take BX to be a p1 × (p2+1) matrix, and BZ to be a (p1 + 1) × p2 matrix.

We propose to solve the optimization problem

$\underset{B^X \in \mathbb{R}^{p_1 \times (p_2+1)},\; B^Z \in \mathbb{R}^{(p_1+1) \times p_2}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W^X * B^X - W^Z * B^Z\|_2^2 + \lambda_1 \sum_{j=1}^{p_1} P_r(B_{j,\cdot}^X) + \lambda_2 \sum_{k=1}^{p_2} P_c(B_{\cdot,k}^Z) + \lambda_3 \left( \|B_{\cdot,-0}^X\|_1 + \|B_{-0,\cdot}^Z\|_1 \right).$
(32)

Then the coefficient for the jth main effect of X is $B_{j,0}^X$, the coefficient for the kth main effect of Z is $B_{0,k}^Z$, and the coefficient for the (j, k) interaction is $B_{j,k}^X + B_{j,k}^Z$. If we take Pr and Pc to be ℓ2, ℓ∞, or hybrid ℓ1/ℓ∞ penalties, then (32) imposes weak heredity: even if the kth column of BZ has a zero estimate, the (j, k)th interaction coefficient estimate need not be zero, since it can enter through BX. However, if the jth row of BX and the kth column of BZ both have zero estimates, then the (j, k)th interaction coefficient estimate is zero.

Problem (32) can be solved using an ADMM algorithm similar to that of Section 2.3. Since the focus of this paper is on enforcing strong heredity, we leave the details of an algorithm for (32), as well as a careful numerical study, to future work.

## 5 Simulation Study

We compare the performance of FAMILY.l2 and FAMILY.linf to the all-pairs lasso (APL), the hierNet proposal of Bien et al. [2013], and the glinternet proposal of Lim and Hastie [2013]. APL can be performed using the glmnet R package, and hierNet and glinternet are implemented in R packages available on CRAN. We also include the oracle model [Fan and Li, 2001] — an unpenalized model that uses only the main effects and interactions that are non-zero in the true model — in our comparisons.

The forward selection proposal of Hao and Zhang [2014], iFORM, is a fast screening approach for detecting interactions in ultra-high dimensional data. iFORM is intended for the setting in which the true model is extremely sparse. In our simulation setting, we consider moderately sparse models, which fails to highlight the advantages of iFORM. Thus, we do not include results for iFORM in our simulation study.

To facilitate comparison with hierNet and glinternet, which require X = Z, we take X = Z in our simulation study. Similar empirical results are obtained in simulations with X ≠ Z; results are omitted due to space constraints.

We consider squared error loss in Section 5.1, and logistic regression loss in Section 5.2.

### 5.1 Squared Error Loss

#### 5.1.1 Simulation Set-up

We created a coefficient matrix B, with p = 30 main effects and $\binom{p}{2} = 435$ interactions, for a total of 465 features. The first 10 main effects have non-zero coefficients, assigned uniformly from the set {−5, −4, …, −1, 1, …, 5}. The remaining main effects' coefficients equal zero. We consider three simulation settings, in which we randomly select 15, 30 or 45 non-zero interaction coefficients, chosen to obey strong heredity. The values for the non-zero coefficients were selected uniformly from the set {−10, −8, …, −2, 2, …, 8, 10}. Figure 5 displays B in each of the three simulation settings.

Figure 5. For the simulation study in Section 5, the heatmap of the matrix B is displayed in the case of 15 (left), 30 (center), and 45 (right) non-zero interactions. The first row and column of each heatmap represent the main effects.

We generated a training set, a test set, and a validation set, each consisting of 300 observations. Each observation of X = Z was generated independently from a $N_p(0, I)$ distribution; W was then constructed according to (2). For each observation we generated an independent Gaussian noise term, with variance adjusted to maintain a signal-to-noise ratio of approximately 2.5 to 3.5. Finally, for each observation, a response was generated according to (3).

We applied glinternet and hierNet for 50 different values of the tuning parameters. For convenience, given that X = Z, we reparametrized the FAMILY optimization problem (6) as

$\underset{B \in \mathbb{R}^{(p+1) \times (p+1)}}{\text{minimize}} \;\; \frac{1}{2n} \|y - W * B\|_2^2 + p(1-\alpha)\lambda \sum_{j=1}^{p} P_r(B_{j,\cdot}) + p(1-\alpha)\lambda \sum_{k=1}^{p} P_c(B_{\cdot,k}) + \alpha\lambda \|B_{-0,-0}\|_1.$
(33)

We applied FAMILY.l2 and FAMILY.linf over a 10 × 50 grid of (α, λ) values, with α ∈ (0, 1) and λ chosen to give a suitable range of sparsity.

In principle, many methods are available for selecting the tuning parameters α and λ. These include Bayesian information criterion, generalized cross-validation, and others. Because we do not have an estimator for the degrees of freedom of the glinternet estimator, we opted to use a training/test/validation set approach. In greater detail, we fit each method to the training set, selected tuning parameters based on sum of squared residuals (SSR) on the test set, and then reported the SSR for that choice of tuning parameters on the validation set.

It is well-known that penalized regression techniques tend to yield models with over-shrunken coefficient estimates [Hastie et al., 2009, Fan and Li, 2001]. To overcome this problem, we obtained relaxed versions of FAMILY.l2, FAMILY.linf, hierNet, and glinternet, by refitting an unpenalized least squares model to the set of coefficients that are non-zero in the penalized fitted model [Meinshausen, 2007, Radchenko and James, 2010].
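The relaxation step is simple: refit unpenalized least squares using only the features whose penalized coefficient estimates are non-zero. An illustrative sketch on a flattened design (our own notation):

```python
import numpy as np

def relaxed_refit(Wmat, y, b_hat, tol=1e-10):
    """Refit OLS on the support of the penalized estimate b_hat;
    entries off the support stay exactly zero."""
    support = np.abs(b_hat) > tol
    b_relaxed = np.zeros_like(b_hat)
    if support.any():
        coef, *_ = np.linalg.lstsq(Wmat[:, support], y, rcond=None)
        b_relaxed[support] = coef
    return b_relaxed
```

The refit removes the shrinkage on the selected coefficients while preserving the sparsity pattern chosen by the penalized fit.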

We also considered generating the observations of X from a $N_p(0, \Sigma)$ distribution, where Σ was an autoregressive or an exchangeable covariance matrix. We found that the choice of covariance matrix Σ led to little qualitative difference in the results. Therefore, we display only results for Σ = I in Section 5.1.2.

#### 5.1.2 Results

The left panel of Figure 6 displays ROC curves for FAMILY.linf, FAMILY.l2, hierNet, glinternet, and APL. These results indicate that FAMILY.l2 outperforms all other methods in terms of variable selection, especially as the number of non-zero interaction coefficients increases. When there are 45 non-zero interactions, FAMILY.linf outperforms glinternet, hierNet, and APL.

Figure 6. Results for the simulation study of Section 5.1, averaged over 100 simulated datasets. The colored lines indicate the results for glinternet, hierNet, APL, FAMILY.l2 with α = 0.7, and FAMILY.linf with α = 0.83. Left: ROC ...

The right panel of Figure 6 displays the test set SSR for all methods, as the tuning parameters are varied. We observe that relaxation leads to improvement for each method: it yields a much sparser model for a given value of the test error. This is not surprising, since the relaxation alleviates some of the over-shrinkage induced by the application of multiple convex penalties. The results further indicate that when relaxation is applied, FAMILY.l2 performs the best, followed by FAMILY.linf and then the other competitors. We once again observe that the improvement of FAMILY.l2 and FAMILY.linf over the competitors increases as the number of non-zero interaction coefficients increases.

Interestingly, the right-hand panel of Figure 6 indicates that though FAMILY.l2 performs the best when relaxation is performed, it performs quite poorly when relaxation is not performed, in that the model with smallest test set SSR contains far too many non-zero interactions. This is consistent with the remark in Radchenko and James [2010] regarding over-shrinkage of coefficient estimates.

In Table 1, we present results on the validation set for the model that was fit on the training set using the tuning parameters selected on the test set, as described in Section 5.1.1. We see that FAMILY.l2 and FAMILY.linf outperform the competitors in terms of SSR, false discovery rate, and true positive rate, especially when relaxation is performed.

Simulation results, averaged over 100 simulated datasets, for the simulation set-up in Section 5.1. Tuning parameters were selected using a training/test/validation set approach, as described in Section 5.1.1. From left to right, the table's columns indicate ...

### 5.2 Logistic regression

#### 5.2.1 Simulation Set-up

We assume that each response yi is a Bernoulli variable with probability pi. We then model pi as

$\log\left(\frac{p_i}{1-p_i}\right)=(W*B)_i,\quad i=1,\ldots,n,$
(34)

where W * B is the n-vector defined in Section 1.1. The matrices X and B are generated in the exact same manner as in Section 5.1.1, but now with n = 500 observations in the training and test sets.
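The data-generating mechanism for this simulation can be sketched as follows: invert the logit link of equation (34) to obtain each $p_i$, then draw Bernoulli responses. In this illustration the linear predictor $(W*B)_i$ is replaced by a generic sparse linear predictor `eta`; the names and the simplified coefficient structure are assumptions for the sketch, not the paper's exact set-up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 30

# design matrix of main effects with standard normal entries
X = rng.normal(size=(n, p))

# sparse coefficients standing in for the linear predictor (W*B)_i;
# the actual W*B also involves interaction columns (Section 1.1)
beta = np.zeros(p)
beta[:5] = 1.0
eta = X @ beta

# invert the logit link of equation (34): p_i = 1 / (1 + exp(-eta_i))
prob = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, prob)   # Bernoulli responses y_i ~ Bern(p_i)
```

The same two lines at the end (computing `prob` and drawing `y`) apply unchanged once `eta` is replaced by the full linear predictor containing main effects and interactions.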

Once again, for convenience, we reparametrized FAMILY.l2 and FAMILY.linf according to

$\underset{B\in\mathbb{R}^{(p+1)\times(p+1)}}{\text{minimize}}\;-\frac{1}{n}\sum_{i=1}^{n}\left[y_i(W*B)_i-\log\left(1+e^{(W*B)_i}\right)\right]+p(1-\alpha)\lambda\sum_{j=1}^{p}P_r(B_{j,\cdot})+p(1-\alpha)\lambda\sum_{k=1}^{p}P_c(B_{\cdot,k})+\alpha\lambda\|B_{-0,-0}\|_1.$
(35)

#### 5.2.2 Results

The results for logistic regression are displayed in Figure 7. The ROC curves in the left-hand panel indicate that FAMILY.linf and FAMILY.l2 outperform the competitors in terms of variable selection when there are 30 or 45 non-zero interactions. The SSR curves in the right-hand panel of Figure 7 indicate that the relaxed versions of FAMILY.linf and FAMILY.l2 perform very well in terms of prediction error on the test set, especially as the number of non-zero interactions increases.

Results for the simulation study of Section 5.2, averaged over 100 simulated data sets. Details are as in Figure 6, but with α = 0.8 for FAMILY.linf.

## 6 Application to HIV Data

Rhee et al. [2006] study the susceptibility of the HIV-1 virus to 6 nucleoside reverse transcriptase inhibitors (NRTIs). The HIV-1 virus can become resistant to drugs via mutations in its genome sequence. Therefore, there is a need to model HIV-1’s drug susceptibility as a function of mutation status. We consider one particular NRTI, 3TC. The data consists of a sparse binary matrix, with mutation status at each of 217 genomic locations for n = 1057 HIV-1 isolates. For each of the observations, there is a measure of susceptibility to 3TC. This data set was also studied by Bien et al. [2013].

Rather than working with all 217 genomic locations, we create bins of ten adjacent loci; this results in a design matrix with p = 22 features and n = 1057 observations. We perform the binning because the raw data contains mostly zeros, as most mutations occur in at most a few of the observations; by binning the observations, we obtain less sparse data. This binning is justified under the assumption that mutations in a particular region of the genome sequence result in a change to a binding site, in which case nearby mutations should have similar effects on a binding site, and hence similar associations with drug susceptibility. This binning is also needed for computational reasons, in order to allow for comparison to hierNet (specifically the version that enforces strong heredity) using the R package of Bien et al. [2013]. (In Bien et al. [2013], all 217 genomic locations are analyzed using a much faster algorithm that enforces weak (rather than strong) heredity.)
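The binning described above can be sketched as a logical OR over groups of adjacent columns: a bin is marked as mutated if any locus inside it is mutated. The exact aggregation rule used for the HIV data is not stated in this section, so the OR rule and the handling of the final, shorter bin are assumptions of this sketch.

```python
import numpy as np

def bin_loci(M, bin_size=10):
    """Collapse a binary mutation matrix (n x L) into bins of adjacent
    loci: a bin is 1 if any locus inside it is mutated (logical OR).
    The last bin may contain fewer than bin_size loci."""
    n, L = M.shape
    n_bins = int(np.ceil(L / bin_size))
    B = np.zeros((n, n_bins), dtype=int)
    for b in range(n_bins):
        cols = M[:, b * bin_size:(b + 1) * bin_size]
        B[:, b] = cols.any(axis=1).astype(int)
    return B

# 217 loci in bins of ten give ceil(217/10) = 22 features, as in the text
rng = np.random.default_rng(2)
M = (rng.random((1057, 217)) < 0.01).astype(int)   # sparse mutation matrix
Mb = bin_loci(M, 10)
```

Because each bin aggregates up to ten rare events, the binned matrix is noticeably less sparse than the raw one, which is the motivation given in the text.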

We split the observations into equally-sized training and test sets. We fit glinternet, hierNet, FAMILY.l2, and FAMILY.linf on the training set for a range of tuning parameter values, and applied the fitted models to the test set. In Figure 8, the test set SSR is displayed as a function of the number of non-zero estimated interaction coefficients, averaged over 50 splits of the data into training and test sets. The figure reveals that all four methods give roughly similar results.

The test set SSR is displayed for the HIV-1 data of Section 6, as a function of the number of non-zero interaction terms. Results are averaged over 50 splits of the observations into a training set and a test set. The colored lines indicate the results ...

Figure 9 displays the estimated coefficient matrix, $\hat{B}$, that results from applying each of the four methods to all n = 1057 observations using the tuning parameter values that minimized the average test set SSR. The estimated coefficients are qualitatively similar for all four methods. All four methods detect some non-zero interactions involving the 17th feature. Glinternet yields the sparsest model.

For the HIV-1 data of Section 6, the estimated coefficient matrix $\hat{B}_{-0,-0}$ is shown for (a): glinternet; (b): hierNet; (c): FAMILY.l2 with α = 0.944; and (d): FAMILY.linf with α = 0.944. Main effects are ...

## 7 Conclusion

In this paper, we have introduced FAMILY, a framework that unifies a number of existing estimators for high-dimensional models with interactions. Special cases of FAMILY correspond to the all-pairs lasso, the main effects lasso, VANISH, and hierNet. Furthermore, we have explored the use of FAMILY with $\ell_2$, $\ell_\infty$, and hybrid $\ell_1/\ell_\infty$ penalties; these result in strong heredity and have good empirical performance.

The empirical results in Sections 5 and 6 indicate that the choice of penalty in FAMILY may be of little practical importance: for instance, FAMILY.l2, FAMILY.linf, and FAMILY.hierNet have similar performance. However, one could choose among penalties using cross-validation or a related approach.

We have presented a simple ADMM algorithm that can be used to solve the FAMILY optimization problem for any convex penalty. It finds the global optimum for VANISH (unlike the proposal in Radchenko and James [2010]), and provides a simpler alternative to the original hierNet algorithm [Bien et al., 2013].

FAMILY could be easily extended to accommodate higher-order interaction models. For instance, to accommodate third-order interactions, we could take B to be a (p + 1) × (p + 1) × (p + 1) coefficient array. Instead of penalizing each row and each column of B, we would instead penalize each ‘slice’ of the array.
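The slice penalties suggested for the third-order extension can be illustrated concretely: for a three-way coefficient array, an $\ell_2$-type slice penalty would sum the Frobenius norms of the two-dimensional slices along each axis, so that a zero slice norm forces every interaction involving that variable to zero. The sketch below computes these slice norms; it is an illustration of the idea, not part of the FAMILY software.

```python
import numpy as np

def slice_group_norms(B):
    """For a (p+1) x (p+1) x (p+1) coefficient array B, compute the
    Frobenius norm of each 2-D 'slice' along each of the three axes --
    the quantities an l2-type slice penalty would sum."""
    norms_axis0 = [np.linalg.norm(B[j, :, :]) for j in range(B.shape[0])]
    norms_axis1 = [np.linalg.norm(B[:, k, :]) for k in range(B.shape[1])]
    norms_axis2 = [np.linalg.norm(B[:, :, m]) for m in range(B.shape[2])]
    return norms_axis0, norms_axis1, norms_axis2

# a single non-zero entry contributes to exactly one slice per axis
B = np.zeros((4, 4, 4))
B[1, 2, 3] = 5.0
n0, n1, n2 = slice_group_norms(B)
```

Penalizing these norms generalizes the row/column penalties $P_r$ and $P_c$ of the second-order problem: shrinking a slice norm to zero removes variable $j$ from all third-order terms, which is how strong heredity would be enforced in the higher-order setting.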

In the simulation study in Section 5, we considered a setting with only p1 = p2 = 30 main effects. We did this in order to facilitate comparison to the hierNet proposal, which is very computationally intensive as implemented in the R package of Bien et al. [2013]. However, our proposal can be applied for much larger values of p1 and p2, as discussed in Section 2.3.3.

The R package FAMILY, available on CRAN, implements the methods described in this paper.

## Acknowledgments

We thank an anonymous associate editor and two referees for insightful comments that resulted in substantial improvements to this manuscript. We thank Helen Hao Zhang, Ning Hao, Jacob Bien, Michael Lim, and Trevor Hastie for providing software and helpful responses to inquiries. D.W. was supported by NIH Grant DP5OD009145, NSF CAREER Award DMS-1252624, and an Alfred P. Sloan Foundation Research Fellowship. N.S. was supported by NIH Grant DP5OD019820.

## References

• Bach Francis, Jenatton Rodolphe, Mairal Julien, Obozinski Guillaume, et al. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning. 2011:19–53.
• Bach Francis, Jenatton Rodolphe, Mairal Julien, Obozinski Guillaume, et al. Structured sparsity through convex optimization. Statistical Science. 2012;27(4):450–468.
• Bickel Peter J, Ritov Yaacov, Tsybakov Alexandre B, et al. Borrowing strength: theory powering applications–a Festschrift for Lawrence D. Brown. Institute of Mathematical Statistics; 2010. Hierarchical selection of variables in sparse high-dimensional regression; pp. 56–69.
• Bien Jacob, Taylor Jonathan, Tibshirani Robert. A lasso for hierarchical interactions. The Annals of Statistics. 2013;41(3):1111–1141. [PubMed]
• Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning. 2011;3(1):1–122.
• Chipman Hugh. Bayesian variable selection with related predictors. Canadian Journal of Statistics. 1996;24(1):17–36.
• Choi Nam Hee, Li William, Zhu Ji. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association. 2010;105(489):354–364.
• Cox David R. Interaction. International Statistical Review/Revue Internationale de Statistique. 1984:1–24.
• Duchi John, Singer Yoram. Efficient online and batch learning using forward backward splitting. The Journal of Machine Learning Research. 2009;10:2899–2934.
• Efron Bradley. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association. 1986;81(394):461–470.
• Fan Jianqing, Li Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
• Friedman Jerome, Hastie Trevor, Höfling Holger, Tibshirani Robert, et al. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1(2):302–332.
• Friedman Jerome H. Multivariate adaptive regression splines. The Annals of Statistics. 1991:1–67.
• George Edward I, McCulloch Robert E. Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993;88(423):881–889.
• Hamada Michael, Wu CF Jeff. Analysis of designed experiments with complex aliasing. Journal of Quality Technology. 1992;24(3):130–137.
• Hao Ning, Zhang Hao Helen. Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association. 2014;109(507):1285–1301. [PubMed]
• Hastie Trevor, Tibshirani Robert, Friedman Jerome, Hastie T, Friedman J, Tibshirani R. The elements of statistical learning. Vol. 2. Springer; 2009.
• Jacob Laurent, Obozinski Guillaume, Vert Jean-Philippe. Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning; ACM; 2009. pp. 433–440.
• Jenatton Rodolphe, Audibert Jean-Yves, Bach Francis. Structured variable selection with sparsity-inducing norms. The Journal of Machine Learning Research. 2011;12:2777–2824.
• Joseph V Roshan. A Bayesian approach to the design and analysis of fractionated experiments. Technometrics. 2006;48(2):219–229.
• Lim Michael, Hastie Trevor. Learning interactions through hierarchical group-lasso regularization. arXiv preprint arXiv:1308.2719. 2013
• McCullagh Peter. Generalized linear models. European Journal of Operational Research. 1984;16(3):285–292.
• Meinshausen Nicolai. Relaxed lasso. Computational Statistics & Data Analysis. 2007;52(1):374–393.
• Montgomery Douglas C, Peck Elizabeth A, Vining G Geoffrey. Introduction to linear regression analysis. Vol. 821. John Wiley & Sons; 2012.
• Park Mee Young, Hastie Trevor. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9(1):30–50. [PubMed]
• Peixoto Julio L. Hierarchical variable selection in polynomial regression models. The American Statistician. 1987;41(4):311–313.
• Radchenko Peter, James Gareth M. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association. 2010;105(492):1541–1553.
• Rhee Soo-Yon, Taylor Jonathan, Wadhera Gauhar, Ben-Hur Asa, Brutlag Douglas L, Shafer Robert W. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences. 2006;103(46):17355–17360. [PubMed]
• Simon Noah, Friedman Jerome, Hastie Trevor, Tibshirani Robert. A sparse-group lasso. Journal of Computational and Graphical Statistics. 2013;22(2):231–245.
• Stein Charles M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics. 1981:1135–1151.
• Tibshirani Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996:267–288.
• Tibshirani Ryan J, et al. The lasso problem and uniqueness. Electronic Journal of Statistics. 2013;7:1456–1490.
• Wu Jing, Devlin Bernie, Ringquist Steven, Trucco Massimo, Roeder Kathryn. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic epidemiology. 2010;34(3):275–285. [PubMed]
• Yates Frank. The design and analysis of factorial experiments. Imperial Bureau of Soil Science; 1978.
• Yuan Ming, Lin Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2006;68(1):49–67.
• Zhao Peng, Rocha Guilherme, Yu Bin. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics. 2009;37(6A):3468–3497.
