Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3020238

Formats

Article sections

- Summary
- 1. Introduction
- 2. Methods
- 3. Simulation Studies
- 4. Application to a Breast Cancer Data Set
- 5. Final Remarks
- Supplementary Material
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2012 March 1.

Published in final edited form as:

PMCID: PMC3020238

NIHMSID: NIHMS187314

Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

Genomic instability, the propensity of aberrations in chromosomes, plays a critical role in the development of many diseases. High throughput genotyping experiments have been performed to study genomic instability in diseases. The output of such experiments can be summarized as high dimensional binary vectors, where each binary variable records aberration status at one marker locus. It is of keen interest to understand how aberrations may interact with each other, as it provides insight into the process of the disease development. In this paper, we propose a novel method, LogitNet, to infer such interactions among these aberration events. The method is based on penalized logistic regression with an extension to account for spatial correlation in the genomic instability data. We conduct extensive simulation studies and show that the proposed method performs well in the situations considered. Finally, we illustrate the method using genomic instability data from breast cancer samples.

Genomic instability refers to the propensity of aberrations in chromosomes such as mutations, deletions and amplifications in diseased tissues. It has been thought to play a critical role in the development of many diseases, for example, many types of cancers (Klein and Klein 1985). Unlike germline mutations, these somatic chromosomal abnormalities are not passed on from parents to children (Coop and Ellis 2009). Identifying which somatic aberrations contribute to disease risk, and how they may interact with each other during disease development is of keen interest. High throughput genotyping experiments have been performed to interrogate these aberrations in diseases, providing a vast amount of information on genomic instabilities on tens of thousands of marker loci simultaneously. These data can be organized as a *n* × *p* matrix where *n* is the number of samples, *p* is the number of marker loci, and the (*i, j*)^{th} element of the matrix is the binary aberration status for the *i*th sample at the *j*th locus. Our goal is to infer the conditional dependencies between aberrations, which we refer to as oncogenic pathways, based on these binary genomic instability profiles.

Oncogenic pathways can be compactly represented by graphs, in which vertices represent aberrations and edges represent interactions between aberrations. Tools developed for graphical models (Lauritzen 1996) can therefore be employed to infer interactions among aberrations. Specifically, each vertex represents a binary random variable that codes aberration status at a locus, and an edge will be drawn between two vertices if the corresponding two random variables are conditionally dependent given all other random variables. Here, we want to point out that graphical models based on conditional dependencies provide information on “higher order” interactions compared to other methods (e.g., hierarchical clustering) which examine the marginal pairwise correlations. The latter does not tell, for example, whether a non-zero correlation is due to a direct interaction between two aberration events or due to an indirect interaction through a third intermediate aberration event.

There is a rich literature on fitting graphical models for a limited number of variables (see for example Edward 2000; Drton and Perlman 2004, and references therein). However, in genomic instability profiles, the number of genes *p* is typically much larger than the number of samples *n*. Under such high-dimension-low-sample-size scenarios, sparse regularization becomes indispensable for purposes of both model tractability and model interpretation. Some work has already been done to tackle this challenge for high dimensional continuous variables. For example, Meinshausen and Buhlmann (2006) proposed to perform neighborhood selection with lasso regression (Tibshirani 1996) for each node. Peng et al. (2009a) extended the approach by imposing the sparsity on the whole network instead of each neighborhood, and implemented a fast computing algorithm. Other approaches include penalized maximum likelihood (e.g., Yuan and Lin 2007 and Friedman et al. 2007b), various regularization methods (e.g., Li and Gui 2006; Schafer and Strimmer 2007), and Bayesian methods (e.g., Madigan et al. 1995).

In this paper, we consider binary variables and propose a novel method, LogitNet, for inferring edges, i.e., the conditional dependence between pairs of aberration events given all others. With proper assumptions on the topology of oncogenic pathways, we derive the joint probability distribution of the *p* binary variables, which naturally leads to a set of *p* logistic regression models with the combined *p* × *p* coefficient matrix being symmetric. We propose sparse logistic regression with a lasso penalty term and extend it to account for the spatial correlations within chromosomes. This extension together with the enforcement of symmetry of the coefficient matrix produces a group selection effect, which enables LogitNet to account for and also benefit from spatial correlation when inferring the edges.

LogitNet is related to the work by Ravikumar et al. (2009), which also utilized sparse logistic regression to construct a network based on high dimensional binary variables. The basic idea of Ravikumar et al. is the same as that of Meinshausen and Buhlmann’s (2006) neighborhood selection approach, in which sparse logistic regression was performed for each binary variable given all others. A sparsity constraint was then imposed on each neighborhood and the sparse regression was performed for each binary variable separately. In this approach, the symmetry of conditional dependence obtained from regressing variable *X*_{r} on variable *X*_{s} and from regressing *X*_{s} on *X*_{r} is not guaranteed. As such, it can yield contradictory neighborhoods, which makes interpretation of the results difficult. It also loses power in detecting dependencies, especially when the sample size is small. The proposed LogitNet, on the other hand, makes use of the symmetry, which produces compatible logistic regression models for all variables and thus achieves a more coherent result with better efficiency than the Ravikumar et al. approach. We show by intensive simulation studies that LogitNet performs better in terms of false positive rate and false negative rate of edge detection.

The rest of the paper is organized as follows. In section 2, we will present the model, its implementation and the selection of the penalty parameter. Simulation studies of the proposed method and the comparison with the Ravikumar et al. approach are described in Section 3. Real genomic instability data from breast cancer samples is used to illustrate the method in Section 4. We conclude the paper with a brief summary in Section 5.

Consider a *p* × 1 vector of binary variables *X*^{T} = (*X*_{1}*,* …*, X*_{p}). Here the superscript *T* is a transpose. The pattern of conditional dependencies between these binary variables can be described by an undirected graph = (*V, E*), where *V* is a set of p vertices {1*,* …*, p*}, representing binary variables (*X*_{1}*,* …*, X*_{p}); and *E* is a set of pairs of vertices such that each pair are conditionally dependent given the rest of binary variables. We assume that the sizes of *maximum cliques* in the graph are all ≤ 2. Under this assumption, the joint probability distribution Pr(*X*) can be represented as a product of functions of pairs of binary variables, i.e., cliques of size 2. Note that a *clique* is defined as a subset of vertices with each pair of vertices in the subset connected; and a *maximum clique* is a clique such that including any other vertex outside the clique will make the clique cease to be a clique. For example, in a set of four binary variables, if the edge set includes (1,2), (2,3), (3,4) and (1,4), under this assumption it can not include edge (1,3) or (2,4), as adding either edge will form maximum clique with size 3. However this assumption does not preclude the possibility of cycles. In the above example, we can see the edge set includes a cycle 1 – 2 – 3 – 4 – 1.

Let V = {1, …, p}; and X_{−(}_{r,s}_{)} denote the vector of binary variables X excluding X* _{r}* and X

$$E=\{(r,s)\mid Pr({X}_{r},{X}_{s}\mid {X}_{-(r,s)})\ne Pr({X}_{r}\mid {X}_{-(r,s)})Pr({X}_{s}\mid {X}_{-(r,s)});r,s\in V,r<s\},$$

and |E| = K. If the sizes of maximum cliques in are ≤ 2, then there exist functions {h* _{k}*, k = 1, …, K} such that

$$Pr(X)=\prod _{k=1}^{K}{h}_{k}({X}_{{r}_{k}},{X}_{{s}_{k}}),$$

where (r* _{k}*, s

The proof of Proposition 1 is largely based on the Hammersley and Clifford theorem (Lauritzen, 1996) and given in Supplementary Appendix A.

Assuming Pr(*X*) is strictly positive for all values of *X*, then the above probability distribution leads to the well known quadratic exponential model

$$Pr(X=x)={\mathrm{\Delta}}^{-1}exp({x}^{T}\theta +{z}^{T}\kappa ),$$

(1)

where *z ^{T}* = (

Under this model, the zero values in *κ* are equivalent to the conditional independence for the corresponding binary variables. The following proposition describes this result and the proof is given in Supplementary Appendix B.

If the distribution on X is (1), then X_{r} X_{s} | X_{−(r,s)} if and only if κ_{rs} = 0, for 1 ≤ r < s ≤ p.

As the goal of graphical model selection is to infer the edge set *E* which represents the conditional dependence among all the variables, the result of Proposition 2 implies that we can infer the edge between a pair of events, say *X _{r}* and

$$\frac{Pr({x}_{s}=1\mid {x}_{1},\dots ,{x}_{s-1},{x}_{s+1},\dots ,{x}_{p})}{Pr({x}_{s}=0\mid {x}_{1},\dots ,{x}_{s-1},{x}_{s+1},\dots ,{x}_{p})}=exp({\kappa}_{1s}{x}_{1}+\dots +{\kappa}_{(s-1)s}{x}_{s-1}+{\theta}_{s}+{\kappa}_{s(s+1)}{x}_{s+1}+\dots +{\kappa}_{sp}{x}_{p}).$$

Taking the log transformation of the left hand side of this equation results in the familiar form of a logistic regression model, where the outcome is the *j*th binary variable and the predictors are all the other binary variables. Doing this for each of *x*_{1}*, x*_{2}*,* …*, x _{p}*, we obtain

$$\{\begin{array}{c}\text{logit}\{Pr({x}_{1}=1\mid {x}_{2},\dots ,{x}_{p})\}={\theta}_{1}+{\kappa}_{12}{x}_{2}+\dots +{\kappa}_{1p}{x}_{p},\\ \vdots \\ \text{logit}\{Pr({x}_{p}=1\mid {x}_{1},\dots ,{x}_{p-1})\}={\kappa}_{1p}{x}_{1}+\dots +{\kappa}_{(p-1)p}{x}_{p-1}+{\theta}_{p}.\end{array}$$

(2)

The matrix of all of the regression coefficients from *p* logistic regression models can then be row combined, which we denote by . It is easy to see that the matrix is symmetric, i.e., *β _{rs}* =

Specifically, let *X _{n}*

$$l(\mathcal{B})=-\sum _{r=1}^{p}\sum _{i=1}^{n}log\phantom{\rule{0.16667em}{0ex}}\{1+exp(-{X}^{r}[i,]\mathcal{B}{[r,]}^{T}\xb7{y}_{ir})\}.$$

(3)

where
${X}^{r}[i,]=({x}_{i1}^{r},\dots ,{x}_{ip}^{r})$; and [*r*,] = (*β*_{r1}*,* …*, β*_{rp}). Note, here we have the constraints *β _{rs}* =

Recall that our interest is to infer oncogenic pathways based on genome instability profiles of tumor samples. Most often, we are dealing with hundreds or thousands of genes and only tens or hundreds of samples. Thus, regularization on parameter estimation becomes indispensable as the number of variables is larger than the sample size, *p > n*. In the past decade, _{1} norm based sparsity constraints such as lasso (Tibshirani 1996) have shown considerable success in handling high-dimension-low-sample-size problems when the true model is sparse relative to the dimensionality of the data. Since it is widely believed that genetic regulatory relationships are intrinsically sparse, we propose to use _{1} norm penalty for inferring oncogenic pathways. The penalized loss function can be written as:

$${l}_{\lambda}^{\mathit{lasso}}(\mathcal{B})=-l(\mathcal{B})+\lambda \sum _{r=1}^{p}\sum _{s=r+1}^{p}\mid {\beta}_{rs}\mid .$$

(4)

Note that _{1}-norm penalty is imposed on all off-diagonal entries of simultaneously to control the overall sparsity of the joint logistical regression model, i.e., only a limited number of *β _{rs}*,

As described in the Introduction, the LogitNet model is closely related to the work by Ravikumar et al. (2009) which fits *p* lasso logistic regressions separately (hereafter referred to as SepLogit). The key difference between the two methods is that, LogitNet enforces symmetry when estimating the regression coefficients while SepLogit does not. Thus, for LogitNet, there are only about half of the parameters needed to be estimated as for SepLogit. As a result, LogitNet is more efficient and the resulting estimates are more interpretable than that of SepLogit. We want to point out that, in SepLogit, we used the same tuning parameter *λ* for all *p* lasso logistic regressions. Technically one can choose *p* different tuning parameters for the *p* regressions. However this strategy seriously overfits the data and additionally can cause substantial variability in model selection.

In this section, we describe an algorithm for obtaining the LogitNet estimator (*λ*). The algorithm extends the gradient descent algorithm (Genkin et al. 2007) to enforce the symmetry of . Parameters are updated one at a time using a one-step Newton-Raphson algorithm, in the same spirit as the shooting algorithm (Fu 1998) and the coordinate descent algorithm (Friedman et al. 2007a) for solving the general linear lasso regressions.

More specifically, let (*β _{rs}*) and (

$$\begin{array}{l}\stackrel{.}{l}({\beta}_{rs})=\sum _{i=1}^{n}\frac{{X}^{r}[i,s]Y[i,r]}{1+exp({R}_{r})}+\sum _{i=1}^{n}\frac{{X}^{s}[i,r]Y[i,s]}{1+exp({R}_{s})},\\ \ddot{l}({\beta}_{rs})=\sum _{i=1}^{n}{({X}^{r}[i,s])}^{2}\frac{exp({R}_{r})}{{\{1+exp({R}_{r})\}}^{2}}+\sum _{i=1}^{n}{({X}^{s}[i,r])}^{2}\frac{exp({R}_{s})}{{\{1+exp({R}_{s})\}}^{2}},\end{array}$$

where *R _{r}* =

$$\mathrm{\Delta}{\beta}_{rs}^{\mathit{lasso}}=-\frac{{\stackrel{.}{l}}_{\lambda}^{\mathit{lasso}}({\beta}_{rs})}{{\ddot{l}}_{\lambda}^{\mathit{lasso}}({\beta}_{rs})}=\mathrm{\Delta}{\beta}_{rs}-\frac{\lambda}{\ddot{l}({\beta}_{rs})}sgn({\beta}_{rs}),$$

(5)

where sgn(*β _{rs}*) is a sign function, which is 1 if

To further improve the convergence speed of the algorithm, we utilize the Active-Shooting idea proposed by Peng et al. (2009a) and Friedman et al. (2009). Specifically, at each iteration, we define the set of currently non-zero coefficients as the current active set and conduct the following two steps: (1) update the coefficients in the active set until convergence is achieved; (2) conduct a full loop update on all the coefficients one by one. We then repeat (1) and (2) until convergence is achieved on all of the coefficients. Since the target model in our problem is usually very sparse, this algorithm achieves very fast convergence by focusing on the small subspace whose members are more likely to be in the model.

We note that in equation (5) the regularization shrinks the estimate towards zero by the amount determined by the penalty parameter *λ* and that each parameter is not penalized by the same amount: *λ* is weighted by the variance (*β _{rs}*)

Spatial correlation of aberrations is common in genomic instability data. When we perform the regression of *X _{r}* on all other binary variables, loci that are spatially closest to

One approach to accounting for this undesirable spatial effect is to downweight the effect of the neighboring loci of *X _{r}* when regressing

$$\mathrm{\Delta}{\beta}_{rs}^{\mathit{lasso}}=\mathrm{\Delta}{\beta}_{rs}-\lambda \frac{{w}_{rs}}{\ddot{l}({\beta}_{rs})}sgn({\beta}_{rs}),$$

where *w _{rs}* is the weight for the spatial correlation. Naturally the weight

- Calculate the odds ratio
*α*between every locus in the chromosome with the target locus by univariate logistic regression. - Plot the
*α*’s by their genomic locations and smooth the profile by loess with a window size of 10 loci. - Set the smoothed curve to 0 as soon as the curve starting from the target locus hits 0. Here “hits 0” is defined as
*< ε*, where*ε*= median|_{r}−_{r}_{r}_{+1}|. - Set the weight
*w*= exp(*α*).

It is worth noting that the above weighting scheme together with the enforcement of the symmetry of in LogitNet encourages a group selection effect, *i.e.*, highly correlated predictors tend to be in or out of the model simultaneously. We illustrate this point with a simple example system of three variables *X*_{1}*, X*_{2}, and *X*_{3}. Suppose that *X*_{2} and *X*_{3} are very close on the genome and highly correlated; and *X*_{1} is associated with *X*_{2} and *X*_{3} but sits on a different chromosome. Under our proposal, the weight matrix *w* is 1 for all entries except *w*_{23} = *w*_{32} = *a*, which is a large value because of the strong spatial correlation between *X*_{2} and *X*_{3}. Then, for LogitNet, the joint logistic regression model

$$\text{logit}({X}_{1})\sim {\beta}_{11}+{\beta}_{12}{X}_{2}+{\beta}_{13}{X}_{3},$$

(6)

$$\text{logit}({X}_{2})\sim {\beta}_{12}{X}_{1}+{\beta}_{22}+{\beta}_{23}{X}_{3},$$

(7)

$$\text{logit}({X}_{3})\sim {\beta}_{13}{X}_{1}+{\beta}_{23}{X}_{2}+{\beta}_{33},$$

(8)

is subject to the constraint |*β*_{12}| + |*β*_{13}| + *a*|*β*_{23}| *< s*. Because of the large value of *a*, *β*_{23} will likely be shrunk to zero, which ensures *β*_{12} and *β*_{13} to be nonzero in (7) and (8), respectively. With the symmetry constraint imposed on matrix, we also enforce both *β*_{12} and *β*_{13} to be selected in (6). This grouping effect would not happen if we fit only the model (6) for which only one of *β*_{12} and *β*_{13} would likely be selected (Zou and Hastie 2005), nor would it happen if we didn’t have a large value of *a* because *β*_{23} would have been the dominant coefficient in models (7) and (8). Indeed, the group selection effect of LogitNet is clearly observed in the simulation studies conducted in Section 3.

We consider two procedures for selecting the penalty parameter *λ*: cross validation (CV) and Bayesian Information Criterion (BIC).

After we derive the weight matrix *w* based on the whole data set, we divide the data into *V* non-overlapping equal subsets. Treat the *v ^{th}* subset

In order to further control the false positive findings due to stochastic variation, we employ the cv.vote procedure proposed by Peng et al. (2009b). The idea is to derive the “consensus” result of the models estimated from each training set, as variables that are consistently selected by different training sets should be more likely to appear in the true model than the ones that are selected by one or few training sets. Specifically, in the following simulation and real application sections, we report edges that have been detected in more than *V/*2 training sets as our final result.

We can also use BIC to select *λ*:

$${\lambda}_{\text{BIC}}=arg\underset{\lambda}{min}\left\{-2l({\widehat{\mathcal{B}}}_{\mathit{uns}}(\lambda )\mid X)+log(n)\sum _{r<s}I({\widehat{\beta}}_{\mathit{uns},rs}^{(v)}(\lambda )\ne 0)\right\}$$

(9)

where
${\sum}_{r<s}I({\widehat{\beta}}_{\mathit{uns},rs}^{(v)}(\lambda )\ne 0)$ gives the dimension of the parameter space of the selected model. Here again, *un-shrunk estimates*
${\widehat{\mathcal{B}}}_{\mathit{uns}}^{(v)}(\lambda )$ is used to calculate the log likelihood.

In this section, we investigate the performance of the LogitNet method and compare it with SepLogit which fits *p* separate lasso logistic regressions with the same penalty parameter value (Ravikumar et al., 2009). We use R package *glmnet* to compute the SepLogit solution and the same weight matrix as described in Section 2.3 to account for the spatial correlation. In addition, since the SepLogit method does not ensure the symmetry of the estimated matrix, there will be cases that *β _{rs}* = 0 but

We generated background aberration events with spatial correlation using a homogenous Bernoulli Markov model. It is part of the instability-selection model (Newton et al. 1998), which hypothesizes that the genetic structure of a progenitor cell is subject to chromosomal instability that causes random aberrations. The Markov model has two parameters: *δ* and *ν*, where *δ* is the marginal (stationary) probability at a marker locus and *ν* measures the strength of the dependence between the aberrations. So *δ* plays the role of a background or sporadic aberration when *ν* affects the overall rate of change in the stochastic process. Under this model, the infinitesimal rate of change from no aberration to aberration is *νδ*, and from aberration to no aberration is *ν*(1 − *δ*). We then super-imposed the aberrations at disease loci, which were generated according to a pre-determined oncogenic pathway, on the background aberration events. The algorithm for generating an aberration indicator vector *X ^{T}* = (

- Specify the topology of an oncogenic pathway for the disease loci and the transitional probabilities among the aberrations on the pathway. The
*K*disease loci are indexed by {*s*_{1}*,*···*, s*}, where_{K}*s*{1_{i}*,*…*, p*} for*i*= 1*,*…*, K*. - Generate background aberrations denoted by a
*p*× 1 vector*Z*according to the homogenous Bernoulli Markov process with preselected values of*δ*= 0.05 and*ν*= 15. - Generate aberration events at disease loci following the oncogenic pathway specified in Step 1. This is denoted by a
*p*×1 vector*U*, where indices {*s*_{1}*,*···*, s*} are disease loci. If disease locus_{K}*s*has an aberration (_{i}*U*_{si}= 1), we also assign aberrations to its neighboring loci*U*= 1, for_{t}*t*[*s*−_{i}*a*+_{i}, s_{i}*b*], where_{i}*a*and_{i}*b*are independently sampled from Uniform[0_{i}*,*30]. The rest of the elements in*U*are 0. - Combine the aberration events at disease loci and the background aberrations by assigning
*X*= 1 if_{i}*U*+_{i}*Z*0 and 0 if_{i}>*U*=_{i}*Z*= 0, for_{i}*i*= 1*,*…*, p*.

We set *n* = 200 and *p* = 600 to mimic the dimension of the real data set used in Section 4, so *V* = {1*,* …*,* 600}. We assume the 600 marker loci fall into 6 different chromosomes with 100 loci on each chromosome. We consider two different oncogenic pathway models: a chain shape and a tree shape (see Figure 1). Each model contains 6 aberration events: M = {A*,* B*,* C*,* D*,* E*,* F}. Without loss of generality, we assume these 6 aberrations are located in the middle of each chromosome, so the indices of A–F are *s*_{A} = 50, *s*_{B} = 150, ···, *s*_{F} = 550, respectively. For any u M, *X*_{su} = 1 means aberration u occurs in the sample.

We evaluate the performance of the methods by two metrics: the false positive rate (FPR) and the false negative rate (FNR) of edge detection. Denote the true edge set *E* = {(*u, v*)|*X _{u}* and

$$\mid r-{I}_{\text{u}}\mid +\mid s-{I}_{\text{v}}\mid >30,\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall (\text{u},\text{v})\in E.$$

A cutoff value of 30 is used here because in the simulation setup (see Step 3) we set the maximum aberration size around the disease locus to be 30. Similarly, we define a conditionally dependent pair (u*,* v) *E* is missed, if there is no non-zero *β* falling in the grey diamond. We then calculate FPR as the number of false detections divided by the total number of non-zero * _{rs}*,

For the chain model, aberrations A-F occur sequentially on one oncogenic pathway. The aberration frequencies and transitional probabilities along the oncogenic pathway are illustrated in Figure 1(a). The true conditionally dependent pairs in this model are

$$E=\{({s}_{\text{A}},{s}_{\text{B}}),({s}_{\text{B}},{s}_{\text{C}}),({s}_{\text{C}},{s}_{\text{D}}),({s}_{\text{D}},{s}_{\text{E}}),({s}_{\text{E}},{s}_{\text{F}})\}.$$

Based on this chain model, we generated 50 independent data sets. The heatmap of one example data matrix 200×600 is shown in Supplemental Figure S-1. We then apply LogitNet and SepLogit to each simulated data set for a series of different values of *λ*. Figure 2 shows the FPR and FNR of the two methods as a function of *λ*. For both methods, FPR decreases with *λ* while FNR increases with *λ*. Comparing the two methods, LogitNet clearly outperforms SepLogit in terms of FPR and FNR. For LogitNet, the average optimal sum of the error rates (FPR+FNR) across the 50 independent data sets is 0.014 (s.d.=0.029; FPR=0.014, FNR=0); while for SepLogit it is 0.211 (s.d.=0.144; FPR=0.116, FNR=0.028). Here the optimal sum of the error rates for each data set is defined as the smallest value for the sum of FPR+FNR that could be achieved and is obtained by examining a series of models corresponding to different values of *λ* for this data set. In Figure 2, it is the nadir of the graph for the sum of the FPR+FNR error rates in the bottom panel for each data set. Specifically, taking the data set shown in the Supplemental Figure S-1 as an example, the optimal sum of the error rates achieved by LogitNet on this data set is 0 (FPR= 0, FNR= 0), while the optimal sum of the error rates achieved by SepLogit is 0.549 (FPR= 0.549, FNR= 0). The corresponding two coefficient matrices are illustrated in Figure 3. As one can see, there is a large degree of asymmetry in the result of SepLogit: 378 out of the 398 non-zero * _{rs}* have inconsistent transpose elements,

Results of LogitNet and SepLogit for the chain model in Figure 1(a). Each grey line represents one of the 50 independent data sets. The solid line is the mean curve with the two dashed lines represent mean ± one s.d.

Results of LogitNet and SepLogit on the example data set shown in Supplemental Figure S-1. Each red dot represents a non-zero *β*_{rs}. Points in the grey diamond are deemed as correct detections. The dashed blue lines indicate the locations of aberration **...**

We evaluated the two penalty parameter selection approaches: CV and BIC, for LogitNet. Table 1 summarizes selected *λ* values, FPR and FNR for CV and BIC. We can see that on average the selected *λ* values are close to the lowest point of the “U” shape curves of the sum of FPR+FNR in the bottom panel of Figure 2, suggesting the tuning strategies perform reasonably well. Comparing CV with BIC, the CV criterion tends to selects smaller *λ* values which means larger models and thus more FPR and fewer FNR than the BIC. The average sum of the error rates for CV is 0.079, which is slightly smaller than that for BIC, 0.084.

Summary of FPR and FNR for LogitNet for using CV and BIC to choose optimal λ. Each entry is the mean (S.D.) over 50 independent data sets

We also investigated the impact of sample size on the performance of LogitNet and SepLogit by conducting a simulation experiment using *n* = 50*,* 100*,* 150*,* 200. For each *n*, we generated 10 independent data sets based on the Chain Model, and applied both methods to each simulated data set for a series of *λ* values. The average optimal sum of the error rates (FPR+FNR) for each *n* and each method are shown in Figure 4, and the corresponding FPR and FNR values are listed in Supplementary Table S-1. From the figure and the supplementary Table S-1 we can see that for both methods the sum of the error rates decreases as the sample size increases and LogitNet has much lower sum of the error rates than SepLogit for all *n*s. Interestingly, the relative improvements from increasing sample size are nearly the same for both methods.

For the tree model, we used the empirical mutagenic tree derived in Beerenwinkel et al. (2004) for an HIV data set. The details of the model are illustrated in Figure 1(b). The true conditionally dependent pairs in this model are

$$E=\{({s}_{\text{A}},{s}_{\text{B}}),({s}_{\text{B}},{s}_{\text{E}}),({s}_{\text{A}},{s}_{\text{C}}),({s}_{\text{C}},{s}_{\text{F}}),({s}_{\text{A}},{s}_{\text{D}})\}.$$

The results of LogitNet and SepLogit for these data sets are summarized in Supplementary Figure S-2. Again, LogitNet outperforms SepLogit in terms of FPR and FNR. The average optimal sum of the error rates (FPR and the FNR) achieved by LogitNet across the 50 independent data sets is 0.163 (s.d.=0.106); while for SepLogit it is 0.331 (s.d.=0.160), twice as large as LogitNet. We also evaluated CV and BIC for LogitNet. The results are summarized in Table 1. Both CV and BIC give much higher FNRs under the tree model than under the chain model. This is not surprising as some transition probabilities between aberration events along the pathway are smaller in the tree model than in the chain model. As in the chain model, we also observe that BIC gives smaller FPR and higher FNR than CV, suggesting CV tends to select larger models and thus yields less false negatives but with more false positives in detecting edges.

In this section, we illustrate our method using genomic instability data from a breast cancer study (Loo et al., 2008). In this study, the GeneChip Mapping 10K Assay (Affymetrix, Santa Clara, CA) which contains 9670 SNPs with annotated genome locations was used to genotype 166 breast tumors and their constitutional normals. The genomic instability is measured by loss of heterozygosity (LOH), a common genomic alteration in breast tumors. An LOH event at a marker locus for a tumor is a binary variable, which is 1 if the locus is heterozygous in the constitutive normal DNA but homozygous in the tumor, and 0 if the locus is heterozygous in both the normal and tumor DNA. Even though 90% of the SNPs have call rates over 95%, 75% of these SNPs are homozygous in the constitutive normal samples, hence their LOH status can not be determined. To overcome this high rate of non-informativeness at individual locus, we binned the SNPs by cytogenetic bands (cytoband) and treated each cytoband as a genomic unit in our analysis. We then re-defined LOH for a cytoband as follows: the LOH status is 1 if at least 2 informative SNPs in the cytoband show LOH and 0 otherwise. In total, there were 765 cytobands covered by these SNPs. Cytobands which had noninformative rates above 20% or showed LOH in *<* 5 samples were removed. The the average LOH rate of the remaining 601 cytobands was 12.3%.

After binning the SNPs by cytobands, about 7.5% of LOH status still couldn’t be determined. We used the multiple imputation approach to impute the unknown LOH status based on the conditional probability of LOH at the target cytoband given the available LOH status at adjacent cytobands. If the LOH status for both adjacent cytobands were also unknown, we imputed the genotype using only the marginal probability of the target cytoband. See Supplemental Appendix D for details of the multiple imputation algorithm.

We then generated 10 imputed data sets. We applied LogitNet on each of them taking into account the spatial correlation. It is worth noting that LOHs here are meant to detect somatic mutations in tumors and thus the spatial correlation among LOHs has a different pattern (for example, larger blocks) compared to the spatial correlation (linkage disequilibirum) for the inherited genetic variation, though in both cases the same SNP chips have been used. We used 10-fold CV for penalty parameter selection. To stabilize the penalty parameter selection, we averaged the CV scores
${\sum}_{v=1}^{V}l({\widehat{\mathcal{B}}}_{\mathit{uns}}^{(v)}(\lambda )\mid {X}^{(v)})$ over 10 imputed datasets for a series of *λ* values and selected the value that gives the smallest average CV score. We then applied this *λ* value to each imputed data set and obtained the final model. The total number of edges inferred on each imputed data set is summarized in Supplementary Table S-2. We can see the number of edges selected is consistent across 10 imputed data sets. We then examined the consensus edges across different imputation data sets. Three edges are inferred in at least 4 imputed datasets. These are (11q22.3, 13q33.1), (11q24.2, 13q21.33), and (11q25, 13q14.11) and they occur in 6, 10, and 4 out of 10 imputed data sets, respectively. Particularly, the interaction between 11q24.2 and 13q21.33 has been consistently detected in all of the 10 imputation data sets. The LOH frequencies at these two cytobands are shown in Supplementary Table S-3. Cytoband 11q24.2 harbors the CHEK1 gene, which is an important gene in the maintenance of genome integrity and a potential tumor suppressor. DACH1 is located on cytoband 13q21.33 and has a role in the inhibition of tumor progression and metastasis in several types of cancer (e.g., Wu et al., 2009). Both CHEK1 and DACH1 inhibit cell cycle progression through mechanisms involving the cell cycle inhibitor, CDKN1A. See Supplemental Figure S-3 for the pathway showing the interaction between CHEK1 on 11q24.2 and DACH1 on 13q21.33.

In this paper, we propose the LogitNet method for learning oncogenic pathways using high dimensional binary data. When the size of maximum cliques ≤ 2, the dependence parameters in the joint probability distribution of binary variables can be estimated by fitting a set of logistic regression models with a symmetric coefficient matrix. For high-dimension-low-sample-size data, this result is especially appealing as we can use sparse regression techniques to regularize the parameter estimation. We implemented a fast algorithm for obtaining the LogitNet estimator. For example, in our simulation study where there are 600 variables and 200 samples, it took less than 30 seconds (on average) to fit the LogitNet model on a server with two Dual/Core CPU 2.5 GHz and 6 GB RAM. With extensive simulation studies, we demonstrate that this method achieves good power in edge detection, and also performs favorably compared to an existing method by Ravikumar et al. (2009).

The likelihood function (3) is a summation of log-likelihood functions corresponding to *p* logistic regression models, and is not an exact likelihood function. A recent work by Hofling and Tibshirani (2009) showed that approximate pseudo-likelihood estimators were much faster to obtain than exact likelihood estimators and only had slightly worse accuracy. Our basic likelihood function (3) is exactly the same as the approximate pseudo-likelihood in the Hofling and Tibshirani paper. Even though we further developed the likelihood function to accommodate specific features of genomic instability data, we would still expect the same observation from Hofling and Tibshirani hold in our situation.

In LogitNet, the weighting scheme together with the enforcement of symmetry encourage a group selection effect, *i.e.*, highly spatially correlated variables tend to be in and out of the model simultaneously. It is conceivable that this group selection effect may be further enhanced by replacing the lasso penalty with the elastic net penalty proposed by Zou and Hastie (2005) as
${\lambda}_{1}{\sum}_{1\le r<s\le p}\mid {\beta}_{rs}\mid +{\lambda}_{2}{\sum}_{1\le r<s\le p}{\beta}_{rs}^{2}$. The square _{2} norm penalty may facilitate group selection within each regularized logistic regression. More investigation along these lines is warranted.

In our example using real data, the genomic instability data consist of binary variables indicating the LOH status along the genome. There are also other types of genomic instability data which may have multiple categories. For example, DNA copy number alteration at a locus (or a genomic region) is usually summarized into one of the following three status: loss, no change, and gain. In these cases, it would be more appropriate to treat variables as polytomous than binary and use polytomous logistic regression models. We want to point out that regardless of the data type, the data pre-processing is an important step in the data analysis, particularly for data mining such as our network analysis. Data preparation and filtering steps can take considerable amount of processing time. However, analyzing data that has not been carefully curated not only makes the discovery of networks more difficult but also can produce misleading results.

R package LogitNet is available to public from CRAN (http://www.r-project.org/).

Web Appendices referenced in Sections 2, 3, and 4 are available under the Paper Information

link at the Biometrics website http://www.biometrics.tibs.org.

The authors are grateful to Drs. Peggy Porter and Lenora Loo for providing the genomic instability data set to us, which has motivated this methods development work. The authors are in part supported by grants from the National Institute of Health: R01GM082802 (Wang), R01AG14358 (Chao and Hsu), and P01CA53996 (Hsu).

- Beerenwinkel N, Rahnenfhrer J, Däumer M, Hoffmann D, Kaiser R, Selbig J, Lengauer T. Learning Multiple Evolutionary Pathways from Cross-Sectional Data. Journal of Computational Biology. 2005;12:584–598. [PubMed]
- Coop A, Ellis MJ. The nature and development of cancer. In: Warrell DA, Cox TM, Firth JD, editors. Oxford Textbook of Medicine. Oxford University Press; London: 2009.
- Drton M, Perlman MD. Model selection for Gaussian concentration graphs. Biometrika. 2004;91:591–602.
- Edward D. Introduction to Graphical Modelling. 2. New York: Springer; 2000.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least Angle Regression. Annals of Statistics. 2004;32:407–499.
- Friedman J, Hastie T, Hofling H, Tibshirani R. Pathwise Coordinate Optimization. Annals of Applied Statistics. 2007a;1:302–332.
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007b;9:432–441. [PMC free article] [PubMed]
- Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Technical report. 2009. http://www-stat.stanford.edu/jhf/ftp/glmnet.pdf. [PMC free article] [PubMed]
- Fu W. Penalized Regressions: the Bridge vs the Lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
- Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technometrics. 2007;49:291–304.
- Hofling H, Tibshirani R. Estimation of sparse binary pairwise Markov networks using pseudo-liklihoods. Journal of Machine Learning Research. 2009;10:883–906. [PMC free article] [PubMed]
- Joe H, Liu Y. A model for a multivariate binary response with covariates based on compatible conditionally specified logistic regression. Statistics & Probability Letters. 1996;31:113–120.
- Klein G, Klein E. Evolution of tumors and the impace of molecular oncology. Nature. 1985;315:190–195. [PubMed]
- Lauritzen SL. Graphical Models. Clarendon Press; Oxford, United Kingdom: 1996.
- Li H, Gui J. Gradient Directed Regularization for Sparse Gaussian Concentration Graphs, with Applications to Inference of Genetic Networks. Biostatitics. 2006;7:302–317. [PubMed]
- Loo L, Ton C, Wang YW, Grove DI, Bouzek H, Vartanian N, Lin MG, Yuan X, Lawton TL, Daling JR, Malone KE, Li CI, Hsu L, Porter P. Differential patterns of allelic loss in estrogen receptor-positive infiltrating lobular and ductal breast cancer. Genes, Chromosomes and Cancer. 2008;47:1049–66. [PMC free article] [PubMed]
- Madigan D, York J. Bayesian graphical models for discrete data. International Statistical Review. 1995;63:215–232.
- Meinshausen N, Buhlmann P. High dimensional graphs and variable selection with the Lasso. Annals of Statistics. 2006;34:1436–1462.
- Newton MA, Gould MN, Reznikoff CA, Haag JD. On the statistical analysis of allelic-loss data. Statistics in Medicine. 1998;17:1425–1445. [PubMed]
- Peng J, Wang P, Zhou N, Zhu J. Partial Correlation Estimation by Joint Sparse Regression Model. Journal of the American Statistical Association. 2009a;104:735–746. [PMC free article] [PubMed]
- Peng J, Zhu J, Bergamaschi A, Han W, Noh DY, Pollack JR, Wang P. Regularized Multivariate Regression for Identifying Master Predictors with Application to Integrative Genomics Study of Breast Cancer. Technique Report. 2009b. http://arxiv.org/abs/0812.3671. [PMC free article] [PubMed]
- Ravikumar P, Wainwright M, Lafferty J. High-dimensional Ising model selection using
*l*_{1}-regularized logistic regression. Annals of Statistics. 2009 to appear. - Schafer J, Strimmer K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology. 2005;4 Article 32. [PubMed]
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statististical Society Series B. 1996;58:267–88.
- Wu K, Katiyar S, Witkiewicz A, Li A, McCue P, Song L, Tian L, Jin M, Pestell RG. The Cell Fate Determination Factor Dachshund Inhibits Androgen Receptor Signaling and Prostate Cancer Cellular Growth. Cancer Research. 2009;69:3347–3355. [PMC free article] [PubMed]
- Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;94:19–35.
- Zhang T, Oles F. Text categorization based on regularized linear classifiers. Information Retrieval. 2001;4:5–31.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |