Genomic instability refers to the propensity for chromosomal aberrations, such as mutations, deletions and amplifications, in diseased tissues. It is thought to play a critical role in the development of many diseases, including many types of cancer (Klein and Klein 1985). Unlike germline mutations, these somatic chromosomal abnormalities are not passed on from parents to children (Coop and Ellis 2009). Identifying which somatic aberrations contribute to disease risk, and how they may interact with each other during disease development, is of keen interest. High-throughput genotyping experiments have been performed to interrogate these aberrations in diseases, providing a vast amount of information on genomic instabilities at tens of thousands of marker loci simultaneously. These data can be organized as an n × p matrix, where n is the number of samples, p is the number of marker loci, and the (i, j)th element of the matrix is the binary aberration status for the ith sample at the jth locus. Our goal is to infer the conditional dependencies between aberrations, which we refer to as oncogenic pathways, based on these binary genomic instability profiles.
Oncogenic pathways can be compactly represented by graphs, in which vertices represent aberrations and edges represent interactions between aberrations. Tools developed for graphical models (Lauritzen 1996) can therefore be employed to infer interactions among aberrations. Specifically, each vertex represents a binary random variable that codes aberration status at a locus, and an edge is drawn between two vertices if the corresponding two random variables are conditionally dependent given all other random variables. Note that graphical models based on conditional dependencies provide information on “higher order” interactions compared to other methods (e.g., hierarchical clustering) that examine only marginal pairwise correlations. The latter cannot tell, for example, whether a non-zero correlation is due to a direct interaction between two aberration events or to an indirect interaction through a third, intermediate aberration event.
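The distinction between marginal and conditional dependence can be made concrete with a small exact calculation. The sketch below assumes a hypothetical three-locus chain in which aberration X1 influences X2 and X2 influences X3; the probabilities are made-up values chosen only for illustration and are not taken from any data set.

```python
from itertools import product

# Hypothetical chain X1 -> X2 -> X3 of binary aberration indicators.
# All probabilities below are made-up values for illustration only.
P_X1 = 0.3                          # P(X1 = 1)
P_X2_GIVEN_X1 = {0: 0.2, 1: 0.8}    # P(X2 = 1 | X1 = x1)
P_X3_GIVEN_X2 = {0: 0.1, 1: 0.7}    # P(X3 = 1 | X2 = x2)


def bernoulli(p_one, x):
    """Probability that a binary variable with P(1) = p_one takes value x."""
    return p_one if x == 1 else 1.0 - p_one


def joint_distribution():
    """Exact joint distribution over (x1, x2, x3) implied by the chain."""
    dist = {}
    for x1, x2, x3 in product((0, 1), repeat=3):
        dist[(x1, x2, x3)] = (
            bernoulli(P_X1, x1)
            * bernoulli(P_X2_GIVEN_X1[x1], x2)
            * bernoulli(P_X3_GIVEN_X2[x2], x3)
        )
    return dist


def covariance(dist, i, j, given=None):
    """Cov(X_i, X_j), optionally conditional on given = (index, value)."""
    if given is not None:
        k, value = given
        dist = {x: p for x, p in dist.items() if x[k] == value}
    total = sum(dist.values())  # renormalise after conditioning
    mean_i = sum(p * x[i] for x, p in dist.items()) / total
    mean_j = sum(p * x[j] for x, p in dist.items()) / total
    mean_ij = sum(p * x[i] * x[j] for x, p in dist.items()) / total
    return mean_ij - mean_i * mean_j


dist = joint_distribution()
marginal_cov = covariance(dist, 0, 2)                      # X1 vs X3, ignoring X2
conditional_covs = [covariance(dist, 0, 2, given=(1, v))   # X1 vs X3 given X2 = v
                    for v in (0, 1)]
print(marginal_cov, conditional_covs)
```

In this toy chain X1 and X3 are marginally correlated even though they do not interact directly: their covariance vanishes (up to floating-point rounding) once X2 is fixed. A method based on marginal correlations would link X1 and X3, whereas a conditional-dependence graph would contain only the edges X1–X2 and X2–X3.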
There is a rich literature on fitting graphical models for a limited number of variables (see, for example, Edward 2000; Drton and Perlman 2004, and references therein). However, in genomic instability profiles, the number of marker loci p is typically much larger than the number of samples n. Under such high-dimension-low-sample-size scenarios, sparse regularization becomes indispensable for both model tractability and model interpretation. Some work has already been done to tackle this challenge for high-dimensional continuous variables. For example, Meinshausen and Buhlmann (2006) proposed to perform neighborhood selection with lasso regression (Tibshirani 1996) for each node. Peng et al. (2009a) extended the approach by imposing sparsity on the whole network instead of on each neighborhood, and implemented a fast computing algorithm. Other approaches include penalized maximum likelihood (e.g., Yuan and Lin 2007 and Friedman et al. 2007b), various regularization methods (e.g., Li and Gui 2006; Schafer and Strimmer 2007), and Bayesian methods (e.g., Madigan et al. 1995).
In this paper, we consider binary variables and propose a novel method, LogitNet, for inferring edges, i.e., the conditional dependence between pairs of aberration events given all others. With proper assumptions on the topology of oncogenic pathways, we derive the joint probability distribution of the p binary variables, which naturally leads to a set of p logistic regression models with the combined p × p coefficient matrix being symmetric. We propose sparse logistic regression with a lasso penalty term and extend it to account for the spatial correlations within chromosomes. This extension together with the enforcement of symmetry of the coefficient matrix produces a group selection effect, which enables LogitNet to account for and also benefit from spatial correlation when inferring the edges.
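To illustrate why a joint model for binary variables can lead to a set of logistic regressions sharing a symmetric coefficient matrix, consider a pairwise (Ising-type) model. This specific form and the symbols θ are assumptions made here for illustration; the exact assumptions and parameterization used by LogitNet are those given in Section 2.

```latex
\begin{align*}
P(X_1 = x_1, \dots, X_p = x_p)
  &\propto \exp\Big( \sum_{r=1}^{p} \theta_{rr}\, x_r
     + \sum_{r < s} \theta_{rs}\, x_r x_s \Big),
  \qquad x_r \in \{0, 1\}, \\
\operatorname{logit} P(X_r = 1 \mid X_{-r} = x_{-r})
  &= \theta_{rr} + \sum_{s \neq r} \theta_{rs}\, x_s,
  \qquad \theta_{rs} = \theta_{sr}.
\end{align*}
```

Under this assumed form, the coefficient of X_s in the regression of X_r equals the coefficient of X_r in the regression of X_s, and X_r and X_s are conditionally independent given all other variables exactly when θ_rs = 0.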
LogitNet is related to the work by Ravikumar et al. (2009), which also utilized sparse logistic regression to construct a network based on high-dimensional binary variables. The basic idea of Ravikumar et al. is the same as that of Meinshausen and Buhlmann’s (2006) neighborhood selection approach: a sparse logistic regression is performed for each binary variable given all others, with a sparsity constraint imposed on each neighborhood separately. In this approach, the symmetry of the conditional dependence obtained from regressing variable Xr on variable Xs and from regressing Xs on Xr is not guaranteed. As such, it can yield contradictory neighborhoods, which makes interpretation of the results difficult. It also loses power in detecting dependencies, especially when the sample size is small. The proposed LogitNet, on the other hand, makes use of the symmetry, which produces compatible logistic regression models for all variables and thus achieves a more coherent result with better efficiency than the Ravikumar et al. approach. We show by extensive simulation studies that LogitNet performs better in terms of the false positive rate and false negative rate of edge detection.
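The notion of contradictory neighborhoods can be sketched in a few lines. The example below assumes a hypothetical 4 × 4 matrix of coefficients from separately fitted per-node sparse regressions (the values are invented, and the function names are illustrative only). It also shows the post hoc "AND" and "OR" rules commonly used with neighborhood selection to force a symmetric graph; these rules are not the LogitNet approach, which enforces symmetry during estimation.

```python
def neighborhoods(coef, tol=1e-8):
    """coef[r][s] is the coefficient of X_s when X_r is regressed on all others.

    Returns, for each node r, the set of nodes with a non-zero coefficient.
    """
    p = len(coef)
    return [
        {s for s in range(p) if s != r and abs(coef[r][s]) > tol}
        for r in range(p)
    ]


def contradictory_pairs(coef, tol=1e-8):
    """Pairs (r, s) selected in one direction of regression but not the other."""
    nb = neighborhoods(coef, tol)
    p = len(coef)
    return [
        (r, s)
        for r in range(p)
        for s in range(r + 1, p)
        if (s in nb[r]) != (r in nb[s])
    ]


def combine(coef, rule="and", tol=1e-8):
    """Symmetrise per-node neighborhoods with the AND or OR rule."""
    if rule not in ("and", "or"):
        raise ValueError("rule must be 'and' or 'or'")
    nb = neighborhoods(coef, tol)
    p = len(coef)
    edges = set()
    for r in range(p):
        for s in range(r + 1, p):
            in_r, in_s = s in nb[r], r in nb[s]
            keep = (in_r and in_s) if rule == "and" else (in_r or in_s)
            if keep:
                edges.add((r, s))
    return edges


# Hypothetical coefficients from four separately fitted sparse regressions.
B = [
    [0.0, 0.9, 0.0, 0.0],
    [0.7, 0.0, 0.4, 0.0],   # X2 enters the model for X1 ...
    [0.0, 0.0, 0.0, 0.5],   # ... but X1 does not enter the model for X2
    [0.0, 0.0, 0.6, 0.0],
]

conflicts = contradictory_pairs(B)
and_edges = combine(B, rule="and")
or_edges = combine(B, rule="or")
print(conflicts, sorted(and_edges), sorted(or_edges))
```

With these invented coefficients, the pair of nodes indexed 1 and 2 is selected in one regression but not in the other, so the AND and OR rules return different graphs. Estimating all regressions jointly under a symmetry constraint avoids having to make this choice after the fact.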
The rest of the paper is organized as follows. In Section 2, we present the model, its implementation and the selection of the penalty parameter. Simulation studies of the proposed method and the comparison with the Ravikumar et al. approach are described in Section 3. Real genomic instability data from breast cancer samples are used to illustrate the method in Section 4. We conclude the paper with a brief summary in Section 5.