A common problem in contemporary biomedical research is the occurrence of a large number of variables accompanying relatively few observations, which makes studying associations in high-dimensional data far from straightforward. Including all variables would result in a heavily overparameterized model, high computational complexity and unstable estimation of the associations [1]. In genomic medicine this methodological problem has been addressed by graphical modeling, and graphical models have been identified as a promising approach to modeling clinical data [2], and thereby to the systems approach to health and disease.
A promising approach for describing such complex relationships is graphical modeling. Graphical models [3] provide a probabilistic tool to display, analyze and visualize net-like dependence structures by drawing a graph that describes the conditional dependencies between the variables. A graphical model consists of nodes representing the variables and edges representing conditional dependencies between them. To understand graphical models it is important to understand the concept of conditional independence: two variables X and Y are conditionally independent given Z if f(x|y, z) = f(x|z). Thus, learning information about Y does not give any additional information about X once Z is known.
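As a minimal numerical illustration of this definition, consider a hypothetical chain in which Z influences both X and Y (all variable choices here are assumptions for illustration, not part of the study). The conditional distribution of X given Z then does not change when Y is also observed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical chain Z -> X, Z -> Y: X and Y are marginally dependent
# but conditionally independent given Z.
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = rng.binomial(1, np.where(z == 1, 0.7, 0.3))

for zv in (0, 1):
    # f(x | z) should match f(x | y, z) for both values of y
    p_xz = x[z == zv].mean()
    for yv in (0, 1):
        p_xyz = x[(z == zv) & (y == yv)].mean()
        print(zv, yv, round(p_xz, 3), round(p_xyz, 3))
```

The estimated f(x|y, z) agrees with f(x|z) up to sampling noise, even though X and Y are clearly associated marginally.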
Beyond association, graphical models have also been developed for estimating causal effects [4]. Recently, graphical modeling has been outlined as a tool for investigating complex phenotype data, specifically for the visualization of complex associations [5], dimension reduction, comparison of substructures and the estimation of causal effects from observational data [6]. Until now, the main focus of research has been on building Gaussian graphical models for continuous multivariate data following a multivariate normal distribution [7]. A popular way to build Gaussian graphical models are covariance selection methods [8]. These methods sort out conditionally independent variables: they aim to identify the non-zero elements of the inverse covariance matrix, since non-zero entries in the inverse covariance matrix correspond to conditionally dependent variables. However, this approach is not reliable for high-dimensional data, although it can be improved by concentrating on low-order graphs [9].
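The correspondence between zero entries of the inverse covariance matrix and missing edges can be sketched on a toy Gaussian chain (the chain structure and coefficients below are assumptions for illustration):

```python
import numpy as np

# Toy chain X1 - X2 - X3: X1 and X3 are conditionally independent given X2,
# which shows up as a (near-)zero entry in the inverse covariance matrix.
rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

precision = np.linalg.inv(np.cov(X, rowvar=False))
# Off-diagonal entries correspond to conditional dependencies: the (1,3)
# entry is close to zero, so no edge is drawn between X1 and X3.
print(np.round(precision, 2))
```

The graph read off this matrix has edges X1–X2 and X2–X3 only, matching the generating chain.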
Another approach to identifying the non-zero elements in the inverse covariance matrix has been proposed by Meinshausen and Bühlmann [10]. They propose the Least Absolute Shrinkage and Selection Operator (LASSO) [11] as a variable selection method to identify the neighborhood of each variable, i.e. its non-zero elements. The neighborhood of a variable is the set of predictor variables with non-zero coefficients when that variable is regressed on all others; estimating it separately for each variable yields the conditional independence structure. Meinshausen and Bühlmann showed that this method is superior to common covariance selection methods, in particular when the number of variables exceeds the number of observations, and they proved that the method asymptotically recovers the true graph.
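A minimal sketch of this neighborhood idea (the toy Gaussian chain and the penalty value are assumptions, not the authors' settings): each variable is regressed on all others with the LASSO, and an edge is drawn whenever a coefficient is non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Same toy chain X1 - X2 - X3 as a Gaussian example.
rng = np.random.default_rng(2)
n, p = 5_000, 3
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

edges = set()
for j in range(p):
    others = [k for k in range(p) if k != j]
    fit = Lasso(alpha=0.1).fit(X[:, others], X[:, j])
    for k, coef in zip(others, fit.coef_):
        if coef != 0:
            # "OR" rule: draw an edge if either nodewise regression selects it
            edges.add(frozenset((j, k)))
print(sorted(tuple(sorted(e)) for e in edges))
```

Only the edges X1–X2 and X2–X3 are recovered; the spurious X1–X3 coefficient is shrunk to exactly zero by the ℓ1 penalty.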
The LASSO was originally proposed for linear regression models and has become a popular model selection and shrinkage estimation method. It is based on a penalty on the sum of the absolute values of the coefficients (an ℓ1-type penalty) and can easily be adapted to other settings, for example Cox regression [12], logistic regression [13] or multinomial logistic regression [16], by replacing the residual sum of squares with the corresponding negative log-likelihood function. Important progress has been made in recent years in developing computationally efficient and consistent algorithms for the LASSO with good properties even in high-dimensional settings [17]. The so-called graphical lasso by Friedman et al. [19] uses a coordinate descent algorithm for the LASSO regression to estimate sparse graphs via the inverse covariance matrix. They describe the connection between the exact problem and the approximation suggested by Meinshausen and Bühlmann [10]. However, the determination of the right amount of penalization for these methods has remained a major problem for which no satisfactory solution exists [20].
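For comparison with the nodewise regressions, a minimal sketch of the graphical-lasso route using the GraphicalLasso estimator from scikit-learn (toy data and penalty value are assumptions):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Sparse inverse covariance estimation on the same toy chain X1 - X2 - X3.
rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

model = GraphicalLasso(alpha=0.05).fit(X)
K = model.precision_
# Edges of the estimated graph: non-zero off-diagonal precision entries.
edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if abs(K[i, j]) > 1e-6]
print(edges)
```

Here the whole sparse precision matrix is estimated jointly rather than one neighborhood at a time, but the resulting graph is the same chain.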
The current methodology primarily provides solutions for continuous data; relationships among binary data are difficult to identify using classical methods. Building binary graphical models for dichotomous data is based on the corresponding contingency tables and log-linear models [21]. The interaction terms are used to control for conditional dependencies. With a growing number of variables, model selection becomes computationally demanding and quickly exceeds feasibility, making the method difficult to adapt to high-dimensional data: a fully saturated log-linear model requires 2^p parameters, with p being the number of variables. A common solution is to reduce the problem to first-order interactions, where conditional independence is determined by first-order interaction terms.
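The combinatorial explosion behind this infeasibility is easy to make concrete:

```python
# Parameter count of a fully saturated log-linear model for p binary
# variables: one parameter per cell of the 2 x 2 x ... x 2 table.
for p in (5, 10, 20, 30):
    print(f"p = {p:2d}: 2**p = {2 ** p:,} parameters")
```

Already at p = 30 the saturated model has over a billion parameters, far beyond any realistic sample size.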
The properties of the LASSO for logistic regression have recently been investigated. Van de Geer [22] focused on the prediction error of the estimator rather than on variable selection, and proposed a truncation of the estimated coefficients to derive consistent variable selection. Bunea [23] showed the asymptotic consistency of variable selection under certain conditions for ℓ1-type penalization schemes.
The adaptation of local penalized logistic regression to graphical modeling has been proposed by Wainwright [24]. Under certain conditions on the number of observations n, the number of nodes p and the maximum neighborhood size, ℓ1-penalized logistic regression for high-dimensional binary graphical model selection gives consistent neighborhood selection [24]. Wainwright et al. showed that it suffices for n to grow logarithmically in p to achieve neighborhood consistency. Another new approach is based on an approximate sparse maximum likelihood (ASML) problem for estimating the parameters of a multivariate binary distribution; based on this approximation, a consistent neighborhood can be selected and a sensible penalty term identified [17].
However, when analyzing high-dimensional categorical data, the main problem remains that there is no rationale for the choice of the amount of penalization, controlled by the value of the penalty term, that yields consistent variable selection [20].
A possible solution might be to adapt bootstrap aggregating to these problems. Bootstrap aggregating (bagging) generates multiple versions of a classifier and aggregates the results into a single enhanced classifier. The multiple versions are formed from bootstrap replicates of the original data, each acting as the learning set of a separate classification problem. For linear regression it has been shown that bagging provides substantial gains in accuracy for variable selection and classification [26]. This idea has been carried further by Bach [27], resulting in the Bolasso (bootstrap-enhanced least absolute shrinkage operator) algorithm for variable selection in linear regression. Here, the LASSO is applied to several bootstrap replicates of a given sample, and the intersection of the variable sets selected on the replicates yields consistent model selection.
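Under assumed choices of penalty and number of bootstrap replicates, the Bolasso idea can be sketched for linear regression as follows:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Bolasso sketch: run the LASSO on bootstrap replicates of the sample and
# keep only the variables selected in every replicate.
rng = np.random.default_rng(5)
n, p = 500, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)  # true support {0, 3}

supports = []
for _ in range(32):  # number of bootstrap replicates (assumed)
    idx = rng.integers(0, n, n)  # bootstrap replicate of the original data
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    supports.append({j for j in range(p) if coef[j] != 0})

selected = set.intersection(*supports)  # intersection over all replicates
print(sorted(selected))
```

Noise variables are selected only sporadically across replicates, so the intersection strips them out while the truly relevant variables survive every replicate.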
In this paper we adapted the method of Meinshausen and Bühlmann to binary data, using the LASSO for logistic regression to identify the conditional dependence structure, and applied bagging to improve variable selection, thus adapting the Bolasso. Performance was tested on a data set with known structure, simulated by Gibbs sampling [28]. We also applied the graphical modeling methods to real-life data.
The objective of this paper was to examine the performance of the Bolasso for the development of graphical models for high-dimensional binary data, with various values of the penalty term and various numbers of bootstrap replicates. We hypothesized that the performance of the Bolasso is superior to that of competing LASSO-based methods for identifying graphical models. Specifically, our hypothesis was that the choice of the penalty is not critical as long as it is chosen sensibly, i.e. such that it corresponds to a reasonable number of selected variables.