We developed an EXpectation Propagation LOgistic REgRession (EXPLORER) model for distributed privacy-preserving online learning. The proposed framework provides a high-level guarantee for protecting sensitive information, since the information exchanged between the server and the client is the encrypted posterior distribution of the coefficients. Experimental results show that EXPLORER matches the performance (e.g., discrimination, calibration, feature selection, etc.) of the traditional frequentist logistic regression model, but provides more flexibility in model updating. That is, EXPLORER can be updated one point at a time rather than having to be retrained on the entire dataset when new observations are recorded. The proposed EXPLORER supports asynchronous communication, which relieves the participants from coordinating with one another and prevents service breakdown due to the absence of participants or interrupted communications.
Frequentist logistic regression [1] has a long and successful history of useful applications in biomedicine, including various decision support applications, e.g., anomaly detection [2], survival analysis [3], and early diagnosis of myocardial infarction [4]. Despite its simplicity and interpretability, the frequentist logistic regression approach has limitations. It requires training data to be combined in a centralized repository and cannot directly handle distributed data (the scenario in many biomedical applications [5]). It has been shown in the last decade that data privacy cannot be maintained by simply removing patient identities. For example, Sweeney showed that a simple combination of [date of birth, sex, and 5-digit zip code] was sufficient to uniquely identify over 87% of US citizens [6]. Due to privacy concerns, training data in one institute cannot be exchanged or shared with other institutes directly for the purposes of global model learning. To address this challenge, many privacy-preserving models have been studied [7, 8, 9, 10]. Among the most popular, privacy-preserving methods based on secure multi-party computing (SMC) [11, 12, 13, 14, 15] (i.e., building accurate predictive models without sharing raw data) do not change the results and seem practical compared to solutions based on data generalization and perturbation [16, 17, 18, 19], which do change the results.
For distributed model learning with multiple sites, a common scenario is that each site has a subset of records with the same fields, which is usually referred to as horizontally partitioned data. In this paper, we focus on horizontally partitioned data for distributed logistic regression learning in the Bayesian paradigm. During the past decade, numerous privacy-preserving/secure distributed frequentist regression models for horizontally partitioned data [20, 21, 22, 23, 24, 25, 26] have been studied. For example, the DataSHIELD framework [20] provides a secure multi-site regression solution without sacrificing model learning accuracy. However, in the above multi-site regression frameworks, the information matrix and score vector exchanged among multiple sites may result in information leakage during each learning iteration [27, 28]. To mitigate privacy and security risks, Karr, Fienberg, et al. studied numerous SMC-based distributed regression models [21, 22, 23, 24, 25]. Unfortunately, as mentioned by El Emam et al. in [26], the aforementioned approaches can still potentially leak sensitive personal information. Therefore, the authors of [26] proposed a secure distributed logistic regression protocol that offers stronger privacy/security protection. However, the computational complexity of this protocol grows exponentially with the number of sites.
The closest work to the method presented here is the Grid LOgistic REgression (GLORE) model [29] and the Secure Pooled Analysis acRoss K-site (SPARK) protocol [26], which train frequentist logistic regression models in a distributed, privacy-preserving manner. GLORE leverages non-sensitive decomposable intermediary results (i.e., calculated at each individual participating site) to build an accurate global model. However, as GLORE does not use any SMC protocol, there is no provable privacy guarantee. The SPARK protocol uses secure building blocks (e.g., secure matrix operations) to develop a secure distributed logistic regression protocol. However, SPARK will not scale well to a large distributed network, as its complexity grows exponentially with the network size. Both GLORE and SPARK require synchronized communication among participants (i.e., all parties have to be simultaneously online for multiple iterations of training until convergence). Additionally, the frequentist logistic regression approach is inefficient for data that are frequently updated, because the model needs to be completely retrained whenever additional observations are received.
We propose a Bayesian alternative to the distributed frequentist logistic regression model, which we call EXpectation Propagation LOgistic REgRession (EXPLORER). EXPLORER offers distributed privacy-preserving online model learning. The Bayesian logistic regression model was described previously by Ambrose et al. [30], who compared it with the frequentist logistic regression model in terms of performance. Marjerison also discussed Bayesian logistic regression [31] and suggested a Gibbs sampling based optimization, which is unfortunately a very time-consuming operation. However, both papers assumed a centralized computation environment, and privacy was not taken into consideration. In comparison, EXPLORER focuses on privacy preservation and is based on an efficient state-of-the-art inference technique (i.e., expectation propagation [32]). To the best of our knowledge, EXPLORER is the first work addressing distributed logistic regression in the Bayesian setting. EXPLORER addresses several shortcomings of the frequentist logistic regression approach and other frequentist SMC models, as illustrated in Table 1.
The major contributions of this paper are as follows. We propose a Bayesian approach for logistic regression that takes privacy into account. Like GLORE and SPARK, the EXPLORER model learns from distributed sources and does not require access to raw patient data. In addition, it provides online learning capability to avoid retraining on the entire database when a single record is updated. Furthermore, EXPLORER supports asynchronous communication, so that participants do not have to coordinate with one another. This prevents service breakdowns that result from the absence of participants or communication interruptions. Finally, we introduce the Secured Intermediate iNformation Exchange (SINE) protocol to enhance the security of the proposed EXPLORER framework, further reducing the risk of information leakage during the exchange of unprotected aggregated statistics. The proposed SINE protocol offers provable security and lightweight computational overhead to ensure the scalability of the EXPLORER framework.
We start with a quick review of the logistic regression (LR) model. Assume a training dataset D = {(x_{1}, y_{1}), (x_{2}, y_{2}), · · ·, (x_{m}, y_{m})}, where y_{i} ∈ {0, 1} and x_{i} are the binary label and the feature vector of each record, respectively, with i = 1, · · ·, m. We denote by Y = {y_{1}, · · ·, y_{m}} the set of binary labels. The posterior probability of a binary event (i.e., class label) y_{i} = 1 given an observation of a feature vector x_{i} can be expressed as a logistic function acting on a linear function β^{T}x_{i}, so that

$$P({y}_{i}=1\mid {\mathbf{x}}_{i},\mathit{\beta})=\frac{1}{1+\mathrm{exp}(-{\mathit{\beta}}^{T}{\mathbf{x}}_{i})},\phantom{\rule{2em}{0ex}}(1)$$
where the parameter vector β corresponds to the set of coefficients that need to be estimated and that are multiplied by the feature vector x_{i} (i.e., β^{T}x_{i}) to make predictions. In this paper, we drop the feature vector x_{i} from the likelihood function and denote P(y_{i} = 1 | x_{i}, β) as P(y_{i} = 1 | β) to allow a more compact notation.
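As a concrete illustration, the logistic link and the resulting Bernoulli log-likelihood can be sketched as follows (a minimal sketch; the function names are ours, not part of the EXPLORER implementation):

```python
import numpy as np

def predict_prob(beta, x):
    """P(y = 1 | x, beta) = 1 / (1 + exp(-beta^T x)) -- the logistic link."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

def log_likelihood(beta, X, y):
    """Sum of Bernoulli log-likelihoods over the training records (rows of X)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

With beta = 0 every record gets probability 0.5, which is the uninformative starting point the estimators below improve on.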
To estimate β from training datasets, existing learning algorithms can be categorized into two classes: Maximum Likelihood (ML) estimation and Maximum a Posteriori (MAP) estimation. The procedures for estimating model coefficients through the ML and MAP estimators are elaborated in the supplementary materials (Sections S3 and S4). In this paper, we focus on MAP estimation for the proposed EXPLORER framework.
In this section, we introduce the EXPLORER framework based on a factor graph, which enables the independent inter-site update of all participating sites without performance loss. In a nutshell, a factor graph is a bipartite graph (see supplementary materials, Sections S1 and S2) that comprises two kinds of nodes: factor nodes (squares) and variable nodes (circles). In a factor graph, each edge must connect a factor node and a variable node. The joint probability over all variables can be expressed as a product of factor functions, each of which contains only a subset of the variables as arguments and is represented by a factor node. Each variable node represents a random variable. EXPLORER proceeds in two phases: the coefficients are first updated within each site (i.e., the intra-site update) and then across sites (i.e., the inter-site update).
Although we are ultimately interested in distributed online model learning, let us first explain how each participating site updates its own posterior distribution (i.e., the intra-site update). In general, the posterior probability at the jth site, with j = 1, 2, · · ·, n, can be expressed as
where we introduce $m_{{h}_{j}\to {B}_{j}}({\mathit{\beta}}_{j})$ and ${f}_{i}^{j}({\mathit{\beta}}_{j})$ to capture the prior probability and the likelihood function, respectively. Moreover, m_{j} is the total number of records in the first j sites, with m_{0} = 0. However, direct evaluation of the above posterior is mathematically intractable, so we resort to the expectation propagation (EP) algorithm, a deterministic approximate inference method. In the proposed EXPLORER framework, we introduce an approximate function ${\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$, a normal distribution, for each true likelihood function ${f}_{i}^{j}({y}_{i}\mid {\mathit{\beta}}_{j})$. Therefore, the approximate posterior distribution can be expressed as
Based on the above factorization, we first introduce a variable node B_{j} (i.e., the site header) to capture the approximate posterior distribution q(β_{j}). We also introduce extra factor nodes h_{j} and ${f}_{i}^{j}$ to capture the prior probability and the likelihood function, respectively (see Fig. 1). The update of the approximate likelihood function ${\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$ relies on the factor graph based EP algorithm (see Appendix A for details).
The details of the intra-site update rules in EXPLORER are listed in Algorithm 1 (A1). Since we use the normal distribution as our approximate distribution, the messages exchanged between the factor nodes and the variable nodes can be parameterized by a mean vector and a covariance matrix. The intra-site update starts from the initialization step (A1: line 1), where, for a new model learning task, all messages are initialized as uniform distributions with zero mean and infinite variance. For an online learning task with previous results, however, we initialize the messages using their previous states. The approximate posterior is initialized as the product of the prior and all approximate likelihood functions. We then iteratively update the approximate posterior distribution of each site until it converges, through A1 lines 2 to 7. The key idea of EP is to sequentially update the approximate posterior distribution q(β_{j}) by incorporating a single true likelihood function ${f}_{i}^{j}({\mathit{\beta}}_{j})$, as shown in line 5. For more details about EP-based LR, please refer to Appendix A.
1: Initialize: each approximate likelihood function ${\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$ and the approximate posterior $q({\mathit{\beta}}_{j})={m}_{{h}_{j}\to {B}_{j}}({\mathit{\beta}}_{j}){\prod}_{i={m}_{j-1}+1}^{{m}_{j}}{\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$
2: Repeat
3: for all records i = m_{j−1} + 1, · · ·, m_{j} do
4: Get the partial posterior ${q}^{\backslash i}({\mathit{\beta}}_{j})$ by removing the approximate factor ${\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$ from the approximate posterior: ${q}^{\backslash i}({\mathit{\beta}}_{j})=q({\mathit{\beta}}_{j})/{\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$
5: Update q(β_{j}) by incorporating a single true likelihood function ${f}_{i}^{j}({\mathit{\beta}}_{j})$ according to Assumed Density Filtering (ADF) [33]: $q({\mathit{\beta}}_{j})=\mathit{Proj}[\frac{1}{{Z}_{i}}{q}^{\backslash i}({\mathit{\beta}}_{j})\,{f}_{i}^{j}({\mathit{\beta}}_{j})]$
6: Set the approximate factor: ${\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})={Z}_{i}\,q({\mathit{\beta}}_{j})/{q}^{\backslash i}({\mathit{\beta}}_{j})$
7: end for
8: Until parameters converge
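To make the ADF projection in line 5 concrete, the following sketch moment-matches a Gaussian against a single logistic likelihood via Gauss-Hermite quadrature, for the simplified case of a scalar coefficient. All names are ours; this illustrates only the moment-matching step, not the full multivariate EXPLORER update:

```python
import numpy as np

def tilted_moments(y, m, v, n_quad=50):
    """Z, mean, var of (1/Z) * sigma((2y-1)*a) * N(a; m, v), by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    w = weights / np.sqrt(2.0 * np.pi)             # weights for an N(0,1) expectation
    a = m + np.sqrt(v) * nodes                     # quadrature points under N(m, v)
    lik = 1.0 / (1.0 + np.exp(-(2 * y - 1) * a))   # logistic likelihood for y in {0, 1}
    Z = np.sum(w * lik)                            # normalizer Z_i
    mean = np.sum(w * lik * a) / Z
    second = np.sum(w * lik * a * a) / Z
    return Z, mean, second - mean ** 2

def adf_1d(xs, ys, m0=0.0, v0=10.0):
    """One ADF sweep for a scalar-coefficient model y ~ sigma(beta * x)."""
    m, v = m0, v0
    for x, y in zip(xs, ys):
        # moment-match the marginal of a = beta * x, then project back onto beta
        _, ma, va = tilted_moments(y, m * x, v * x * x)
        m, v = ma / x, va / (x * x)                # assumes x != 0
    return m, v
```

A full EP pass would additionally store each approximate factor ${\stackrel{\sim}{f}}_{i}^{j}$ and divide it out before each update (line 4); the sketch above performs only the forward ADF pass.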
To achieve the asynchronous inter-site update (see Fig. 1), we introduce an additional variable node B (i.e., the server node) to capture the global posterior probability learnt from all sites. Then, we connect each factor node h_{j} with the server variable node B for exchanging messages among sites. We assume that all sites share the same prior information, which is captured by the factor node g_{0}. The inter-site update of each approximate posterior q_{j}(β_{j}) relies on a powerful message passing algorithm (i.e., belief propagation (BP)), which has been widely used for Bayesian inference on factor graphs (see Appendix B for details). Based on this framework, we can update sites in an asynchronous way.
1: Global initialization: initialize all the messages $m_{B\to {h}_{j}}(\mathit{\beta})$ and $m_{{h}_{j}\to B}(\mathit{\beta})$ between the server variable node B and the client factor nodes h_{j}, where the subscripts B → h_{j} and h_{j} → B indicate the message directions.
2: Local initialization for all the online sites:
3: Initialize the messages $m_{{B}_{j}\to {h}_{j}}({\mathit{\beta}}_{j})$ and each approximate likelihood function ${\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$ with i = m_{j−1} + 1, · · ·, m_{j}.
4: Repeat:
5: for all the online sites (parallel update)
6: Update the intra-site message: ${m}_{{h}_{j}\to {B}_{j}}({\mathit{\beta}}_{j})=\int \delta(\mathit{\beta},{\mathit{\beta}}_{j})\,{m}_{B\to {h}_{j}}(\mathit{\beta})\,d\mathit{\beta}$
7: Set the approximate posterior: ${q}_{j}({\mathit{\beta}}_{j})={m}_{{h}_{j}\to {B}_{j}}({\mathit{\beta}}_{j}){\prod}_{i=1}^{{m}_{j}-{m}_{j-1}}{\stackrel{\sim}{f}}_{i}^{j}({\mathit{\beta}}_{j})$
8: Update the approximate posterior q_{j}(β_{j}) according to Algorithm 1.
9: Update the intra-site message at variable node B_{j}:
$${m}_{{B}_{j}\to {h}_{j}}({\mathit{\beta}}_{j})={q}_{j}({\mathit{\beta}}_{j})/{m}_{{h}_{j}\to {B}_{j}}({\mathit{\beta}}_{j})$$
10: Upload the message at factor node h_{j}:
$${m}_{{h}_{j}\to B}(\mathit{\beta})=\int \delta(\mathit{\beta},{\mathit{\beta}}_{j})\,{m}_{{B}_{j}\to {h}_{j}}({\mathit{\beta}}_{j})\,d{\mathit{\beta}}_{j}.$$
11: end for
12: Send out the message at server node B:
$${m}_{B\to {h}_{j}}(\mathit{\beta})={m}_{{g}_{0}\to B}(\mathit{\beta}){\prod}_{k=1,k\ne j}^{n}{m}_{{h}_{k}\to B}(\mathit{\beta}).$$
13: Until parameters converge
14: Get the final approximate posterior distribution by multiplying all the incoming messages at the server node B.
The details of the proposed asynchronous EXPLORER are listed in Algorithm 2 (A2). The asynchronous inter-site update starts from the global and local initialization steps (A2: lines 1 to 3), which follow the same initialization rules as the intra-site update. Then, we iteratively update the approximate posterior distribution of each site until it converges, via A2 lines 4 to 13. In A2 line 6, we choose the delta function δ(β, β_{j}) as the factor function of the factor node h_{j}, following the suggestion in [34] for mathematical convenience (see Appendix B for details). Lines 6 and 7 show the factor node and variable node updates according to the BP algorithm. In line 8, we perform an intra-site update according to Algorithm 1. Then, in line 9, we update the message from variable node B_{j} to factor node h_{j}. Finally, factor node h_{j} commits its message to the server node according to line 10, where the message can be interpreted as the belief that the server node should take value β from the jth site. When the server node has collected all the updates from the corresponding online sites, it sends the aggregated information back to each site as in line 12. Finally, the approximate posterior can be obtained by multiplying all the incoming messages at the server node B, as in line 14. The asynchronous EXPLORER allows client sites to dynamically shift between online and offline modes as needed. The impact of sites with different data sizes on convergence speed is studied in the results section. The proposed EXPLORER framework is based on the EP algorithm, which is guaranteed to converge to a fixed point for any given dataset [32].
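The message products and quotients in Algorithm 2 are simple in the Gaussian case: in natural (information) parameters, multiplication adds precisions and shifts, and division subtracts them. A minimal sketch, with our own helper names rather than the paper's code:

```python
import numpy as np

def to_natural(mu, V):
    """(mean, covariance) -> information form (precision Lambda, shift eta = Lambda mu)."""
    Lam = np.linalg.inv(V)
    return Lam, Lam @ mu

def to_moment(Lam, eta):
    """Information form back to (mean, covariance)."""
    V = np.linalg.inv(Lam)
    return V @ eta, V

def product(messages):
    """Multiply Gaussian messages: natural parameters simply add."""
    Lam = sum(msg[0] for msg in messages)
    eta = sum(msg[1] for msg in messages)
    return Lam, eta

def cavity(posterior, msg):
    """Divide one Gaussian by another (e.g. q_j / m_{h_j -> B_j}): parameters subtract."""
    return posterior[0] - msg[0], posterior[1] - msg[1]
```

The final aggregation of line 14 is then just `to_moment(*product(site_messages + [prior_message]))`, and the leave-one-out message of line 12 is the same product with the jth site's message omitted.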
Feature selection is important in logistic regression analysis. In this subsection, we introduce the distributed forward feature selection (DFFS) protocol, which is based on the traditional forward feature selection (FFS) algorithm [35] but tailored to the EXPLORER framework. To better understand the DFFS protocol, let us start with a quick review of the traditional FFS algorithm.
Suppose there are d candidate features (i.e., x_{all} = {x_{1}, x_{2}, · · ·, x_{d}}). In the first iteration, the FFS algorithm takes one feature into account at a time, so as to find the best individual feature ${\mathbf{x}}_{\mathit{sub}}^{1}=({x}_{{s}_{1}})$ for s_{1} = 1, 2, · · ·, d, i.e., the one yielding the best classification performance. Then, in the second iteration, the FFS algorithm tries to find the best subset ${\mathbf{x}}_{\mathit{sub}}^{2}=\{{x}_{{s}_{1}},{x}_{{s}_{2}}\}$ in terms of classification performance, where ${x}_{{s}_{2}}$ is chosen from the remaining d − 1 features in x_{all}. We repeat this procedure until the currently best subset ${\mathbf{x}}_{\mathit{sub}}^{k+1}=\{{x}_{{s}_{1}},{x}_{{s}_{2}},\cdots,{x}_{{s}_{k+1}}\}$ at iteration k + 1 degrades the classification performance obtained with the subset ${\mathbf{x}}_{\mathit{sub}}^{k}=\{{x}_{{s}_{1}},{x}_{{s}_{2}},\cdots,{x}_{{s}_{k}}\}$. Finally, ${\mathbf{x}}_{\mathit{sub}}^{k}=\{{x}_{{s}_{1}},{x}_{{s}_{2}},\cdots,{x}_{{s}_{k}}\}$ is returned as the output of the FFS algorithm. In a traditional centralized regression model, classification performance is usually measured through average classification accuracy using cross validation. However, in a distributed environment, it is usually infeasible to perform cross validation over distributed sites, which motivated us to develop the following DFFS protocol.
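The greedy FFS loop described above can be sketched as follows, assuming a user-supplied `score` callback (e.g., cross-validated accuracy). The function is illustrative, not the paper's implementation:

```python
def forward_select(candidates, score, tol=0.0):
    """Greedy forward feature selection: grow the subset while the score improves.

    `score(subset)` returns a classification-performance estimate for the subset;
    its exact form (e.g. cross-validated accuracy) is a user-supplied assumption.
    """
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        cand_best, cand_score = None, float("-inf")
        for f in remaining:                      # try adding each remaining feature
            s = score(selected + [f])
            if s > cand_score:
                cand_best, cand_score = f, s
        if cand_score <= best + tol:             # no addition improves: stop
            break
        selected.append(cand_best)
        remaining.remove(cand_best)
        best = cand_score
    return selected
```

In DFFS, the only change is where `score` comes from: it is the accuracy averaged over the n leave-one-site-out EXPLORER instances rather than a centralized cross validation.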
In the proposed DFFS, suppose there are n participant sites. We then create n EXPLORER instances at the server side, each with n − 1 participant sites, running in parallel, where the jth site is excluded from the jth EXPLORER instance but serves as the testing data for that instance. For example, given a candidate feature subset with l features, the jth EXPLORER instance first learns a logistic regression model based on all the participant sites except the jth site. The jth EXPLORER instance then sends its learnt model to the jth site to verify its classification performance, and the jth site reports the classification accuracy back to the server. It is worth mentioning that the information exchanged between the jth site and the server node is aggregated information. Since there are n parallel EXPLORER instances at the server side, the server can calculate an average classification accuracy based on the reports from each instance, which is analogous to a centralized n-fold cross validation. The average classification accuracy can be used as the criterion for selecting the best ${\mathbf{x}}_{\mathit{sub}}^{l}=\{{x}_{{s}_{1}},{x}_{{s}_{2}},\cdots,{x}_{{s}_{l}}\}$ at the lth DFFS iteration. Then, all the EXPLORER instances move to the (l + 1)th DFFS iteration. By repeating the above procedure, we can find the best feature subset ${\mathbf{x}}_{\mathit{sub}}^{k}=\{{x}_{{s}_{1}},{x}_{{s}_{2}},\cdots,{x}_{{s}_{k}}\}$ with the maximum classification performance.
The information exchanged among all participant sites in the EXPLORER framework is the posterior distribution of the model parameter β, which is assumed to be a normal distribution and is captured by its mean vector and covariance matrix. Compared with raw data, the posterior distribution (i.e., the mean vector and the covariance matrix) reflects only aggregated information rather than information about individual patients, which already reduces the privacy risk. However, as identified by previous studies [27, 28, 26], aggregated information may still leak private information. We therefore propose the Secured Intermediate iNformation Exchange (SINE) protocol as an optional module for further enhancing the confidentiality of the EXPLORER framework.
As illustrated in Section 3.2, the posterior distribution of the global model parameter is calculated by multiplying all the incoming messages from the n sites at the server node. In the context of Gaussian distributions, the global distribution obtained through the multiplication of all the incoming messages [36] can be captured by its mean vector μ and covariance matrix V as follows,

$${\mathbf{V}}^{-1}={\sum}_{j=1}^{n}{\mathbf{V}}_{j}^{-1},\phantom{\rule{2em}{0ex}}\mathit{\mu}=\mathbf{V}{\sum}_{j=1}^{n}{\mathbf{V}}_{j}^{-1}{\mathit{\mu}}_{j},$$

where V_{j} and μ_{j} are the covariance matrix and the mean vector obtained from the jth site (j = 1, 2, · · ·, n), respectively. The proposed SINE protocol is mainly based on a modified secure sum algorithm, which offers a provable security guarantee [37, 24].
The SINE protocol, as shown in Fig. 2, begins by generating a pair of secure random matrices ${\mathbf{R}}_{V}^{j}$ (of size d × d) and ${\mathbf{R}}_{\mu}^{j}$ (of size d × 1) at each participant site before the start of each learning iteration, where d is the dimension of μ_{j}. Meanwhile, the server also generates a random matrix ${\mathbf{R}}_{V}^{0}$ and a random vector ${\mathbf{R}}_{\mu}^{0}$ of the same sizes as those of ${\mathbf{R}}_{V}^{j}$ and ${\mathbf{R}}_{\mu}^{j}$. The server then sends ${\mathbf{R}}_{V}^{0}$ and ${\mathbf{R}}_{\mu}^{0}$ to a randomly selected site (e.g., the jth site). The jth site adds its own ${\mathbf{R}}_{V}^{j}$ and ${\mathbf{R}}_{\mu}^{j}$ to the received ${\mathbf{R}}_{V}^{0}$ and ${\mathbf{R}}_{\mu}^{0}$ and sends the sums to its neighboring site according to the standard secure sum protocol. Finally, the last site sends its sums (i.e., ${\sum}_{j=0}^{n}{\mathbf{R}}_{V}^{j}$ and ${\sum}_{j=0}^{n}{\mathbf{R}}_{\mu}^{j}$) back to the server node. According to the secure sum protocol, the server can easily recover the sums ${\sum}_{j=1}^{n}{\mathbf{R}}_{V}^{j}$ and ${\sum}_{j=1}^{n}{\mathbf{R}}_{\mu}^{j}$, since ${\mathbf{R}}_{V}^{0}$ and ${\mathbf{R}}_{\mu}^{0}$ were generated at the server side.
Now, for secure information exchange between the client sites and the server, each client site sends out the secured information ${\mathbf{V}}_{j}^{s}={\mathbf{V}}_{j}^{-1}+{\mathbf{R}}_{V}^{j}$ and ${\mathit{\mu}}_{j}^{s}={\mathbf{V}}_{j}^{-1}{\mathit{\mu}}_{j}+{\mathbf{R}}_{\mu}^{j}$ instead of the raw information ${\mathbf{V}}_{j}^{-1}$ and ${\mathbf{V}}_{j}^{-1}{\mathit{\mu}}_{j}$, respectively. At the server side, one can then easily recover the true aggregates ${\mathbf{V}}^{-1}={\sum}_{j=1}^{n}{\mathbf{V}}_{j}^{s}-{\sum}_{j=1}^{n}{\mathbf{R}}_{V}^{j}$ and $\mathit{\mu}=\mathbf{V}({\sum}_{j=1}^{n}{\mathit{\mu}}_{j}^{s}-{\sum}_{j=1}^{n}{\mathbf{R}}_{\mu}^{j})$, where the sums ${\sum}_{j=1}^{n}{\mathbf{V}}_{j}^{s}={\sum}_{j=1}^{n}{\mathbf{V}}_{j}^{-1}+{\sum}_{j=1}^{n}{\mathbf{R}}_{V}^{j}$ and ${\sum}_{j=1}^{n}{\mathit{\mu}}_{j}^{s}={\sum}_{j=1}^{n}{\mathbf{V}}_{j}^{-1}{\mathit{\mu}}_{j}+{\sum}_{j=1}^{n}{\mathbf{R}}_{\mu}^{j}$ are obtained in the current step and the mask sums ${\sum}_{j=1}^{n}{\mathbf{R}}_{V}^{j}$ and ${\sum}_{j=1}^{n}{\mathbf{R}}_{\mu}^{j}$ were already obtained in the previous step. It is worth mentioning that, unlike frequentist LR where the information matrix must be passed through all participant sites, in EXPLORER each V_{j} and μ_{j} can be updated independently by each site and aggregated at the server node, which reduces the privacy and collusion risks by avoiding inter-site communication of sensitive information. What follows is a proof of security of the proposed SINE protocol based on the secure summation principle [38].
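Assuming the mask sums have already been recovered via the secure sum round described above, the per-site masking and the server-side recovery can be sketched as follows (hypothetical helper names, 1-dimensional test case):

```python
import numpy as np

def mask_stats(V_inv, shift, R_V, R_mu):
    """A site sends V_j^{-1} + R_V^j and V_j^{-1} mu_j + R_mu^j instead of the raw stats."""
    return V_inv + R_V, shift + R_mu

def server_recover(masked, R_V_sum, R_mu_sum):
    """Subtract the securely summed masks to recover the global (mu, V)."""
    V_inv = sum(m[0] for m in masked) - R_V_sum   # = sum_j V_j^{-1}
    shift = sum(m[1] for m in masked) - R_mu_sum  # = sum_j V_j^{-1} mu_j
    V = np.linalg.inv(V_inv)
    return V @ shift, V
```

In the actual protocol the masks R_V^j and R_mu^j are drawn independently at each site, and their sums reach the server only through the ring-based secure sum, so no individual site's raw statistics are ever exposed.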
Suppose r_{k,l} and v_{k,l} are the elements at the kth row and lth column of the secure random matrix ${\mathbf{R}}_{V}^{j}$ and the inverse covariance matrix ${\mathbf{V}}_{j}^{-1}$, respectively. We also suppose that v_{k,l} is known to lie in the range [−10^{M}, 10^{M}), which can be ensured by selecting a sufficiently large M (e.g., M = 50). Then, r_{k,l} is a floating-point number with maximum precision at the 10^{−N}th digit, selected uniformly at random from the range [−10^{M}, 10^{M}), which means there are in total 2 × 10^{N+M} possible choices for any r_{k,l}. Given the sum ${v}_{k,l}^{s}={v}_{k,l}+{r}_{k,l}$ at the attacker side, the probability of recovering the original information (i.e., ${\widehat{v}}_{k,l}={v}_{k,l}$) can be expressed as

$$P({\widehat{v}}_{k,l}={v}_{k,l}\mid {v}_{k,l}^{s})=\frac{1}{2\times {10}^{N+M}},$$
where ${\widehat{v}}_{k,l}$ and ${\widehat{r}}_{k,l}$ are the attacker's estimates of v_{k,l} and r_{k,l}, respectively. Moreover, since there are d^{2} elements in ${\mathbf{V}}_{j}^{-1}$ in total, the probability of recovering the original matrix can be calculated as $P({\widehat{\mathbf{V}}}_{j}^{-1}={\mathbf{V}}_{j}^{-1})={\prod}_{k=1}^{d}{\prod}_{l=1}^{d}P({\widehat{v}}_{k,l}={v}_{k,l}\mid {v}_{k,l}^{s})={(2\times {10}^{N+M})}^{-{d}^{2}}$, where ${\widehat{\mathbf{V}}}_{j}^{-1}$ is the attacker's estimate of ${\mathbf{V}}_{j}^{-1}$. We can then always select sufficiently large N and M (e.g., M + N = 100) so that the probability $P({\widehat{\mathbf{V}}}_{j}^{-1}={\mathbf{V}}_{j}^{-1})$ that an attacker recovers the original information is negligibly small.
We evaluated EXPLORER from 5 perspectives with 6 simulated datasets and 5 clinical datasets. Specifically, the perspectives we evaluated are: (1) distributed feature selection, (2) modeling interaction, (3) classification performance, (4) model coefficient estimation, and (5) model convergence. A summary of these 6 simulated and 5 clinical datasets [1, 4, 39] can be found in Table 2, which includes the dataset description, the number of covariates, the number of samples, and the class distribution for each dataset. The rules for generating the simulated datasets are introduced within each experimental task, where we considered three different types of simulated datasets, i.e., independent and identically distributed (i.i.d.) datasets [40], correlated datasets [41], and binary datasets [42]. Moreover, Table 3 shows the detailed descriptions of the covariates in each clinical dataset used in our experiments, where numerical covariates are indicated with “*” and categorical variables are converted into binary covariates through dummy coding [43]. For example, a categorical variable with c possible values (e.g., 0, 1, and 2 for c = 3) will, under dummy coding, be converted into c − 1 binary covariates (e.g., 0 → (0, 0), 1 → (1, 0), and 2 → (0, 1)).
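The dummy coding rule above can be sketched as follows (an illustrative helper, not from the paper; level 0 is treated as the reference category):

```python
def dummy_code(value, c):
    """Encode a categorical value in {0, ..., c-1} as c-1 binary covariates.

    Level 0 is the reference level (all zeros); level k > 0 sets indicator k-1.
    """
    code = [0] * (c - 1)
    if value > 0:
        code[value - 1] = 1
    return code
```

This reproduces the mapping in the text: for c = 3, the levels 0, 1, 2 become (0, 0), (1, 0), and (0, 1), respectively.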
In our experiments, each training dataset was created by randomly choosing 80% of the records from a given dataset, and the corresponding testing dataset was formed from the remaining 20%. Moreover, in order to obtain more reliable results and to compare them statistically, we conducted each experimental task over 30 trials using the aforementioned method of training/testing dataset generation. Unless explicitly stated, each training dataset was evenly partitioned [29, 26] among all the participant EXPLORER sites. For example, if there are m records in a given training dataset and n participant sites in a task, each site holds a sub-dataset of size ${\scriptstyle \frac{m}{n}}$, where we assume that m is divisible by n. In addition, for all 2-site EXPLORER setups of datasets 3 to 11, the difference between means (DBM) of the class distribution and covariates of the 2 sites is shown in Figs. D.6 to D.14 in Appendix D, which offers an intuitive sense of how heterogeneous these sites are.
Feature selection is an important part of logistic regression analysis. In this subsection, we studied the distributed feature selection capability of a 5-site EXPLORER setup based on a simulated dataset with 5 covariates and 500 records (i.e., dataset 1 in Table 2). Simulated dataset 1 was generated by drawing samples from 5 independent normally distributed variables [44] with zero means and unit variances; a similar dataset generation strategy has been repeatedly employed in many previous studies [29, 26, 40]. Then, we randomly drew 4 values for the model parameter β from a normal distribution with mean 0 and variance 5, giving β = [1.7813, −2.2428, 0, 3.1668, 0, −1.8701]^{T}, where β_{0} = 1.7813 is the intercept, and β_{2} and β_{4} are set to zero for the purpose of studying feature selection. Finally, the outcome variable (i.e., the classification label) was drawn from a Bernoulli distribution with probability of success equal to π(β^{T}x_{i}), where π(·) is the logistic function shown in (1) and x_{i} is a vector composed of a constant “1” followed by the aforementioned 5 covariates. As we are interested in studying feature selection in this task, we treated all 500 records in simulated dataset 1 as training data and randomly split them into 5 equal-sized sub-datasets for the participant EXPLORER sites (i.e., 100 records per site) in each experimental trial.
Table 4 shows the feature selection results for simulated dataset 1 using ordinary LR with the FFS algorithm and EXPLORER with the DFFS protocol described in Section 3.3, where β′ and β″ are the model parameters averaged over 30 trials learnt by ordinary LR and EXPLORER, respectively, Prob. indicates the chance of a given covariate being selected by either the FFS or DFFS algorithm during the 30 trials, and two-sample Z-tests are performed between β′ and β″. In the two-sample Z-test, the null and alternative hypotheses are “β′ = β″” and “β′ ≠ β″”, respectively. In Table 4, we can see that the FFS algorithm and the proposed DFFS protocol achieved similar feature selection performance in terms of both qualitative and statistical comparisons, where the model parameters (i.e., β_{1} to β_{5}) used for generating the outcome variable are also listed as a reference. The two-sample Z-test results show that there is no statistically significant difference between β′ and β″. Moreover, for both the FFS and DFFS algorithms, the covariates with nonzero model parameters (i.e., β_{i} ≠ 0 for i = 1, 2, · · ·, 5) are selected in all 30 trials (i.e., Prob. equal to 1). However, for the covariates with zero model parameters (i.e., β_{i} = 0), the chances of selection are quite small (e.g., Prob. less than 0.3). Having demonstrated the DFFS capability of the proposed EXPLORER framework, in the rest of our experiments we assume that all covariates in the remaining datasets have been preselected.
Interaction effects in regression reflect the combined impact of variables, which is very important for understanding the relationships among them. In this section, we studied a 4-site EXPLORER setup with interaction terms based on a simulated dataset (i.e., dataset 2 in Table 2). Simulated dataset 2 was generated by first drawing 500 samples from a 3-dimensional multivariate normal distribution (i.e., x_{1}, x_{2}, x_{3}) with zero means and a randomly generated covariance matrix. Second, we considered the interactions between x_{1} and x_{2}, x_{1} and x_{3}, and x_{2} and x_{3}. A record vector x_{i} can thus be represented as x_{i} = [1, x_{1}, x_{2}, x_{3}, x_{1}x_{2}, x_{1}x_{3}, x_{2}x_{3}]^{T}. Moreover, we randomly drew 7 values for the model parameter β = [−1.2078, 2.9080, 0.8252, 1.3790, −1.0582, −0.4686, −0.2725]^{T}, where β_{0} = −1.2078 is the intercept. Then, the outcome variable was drawn from a Bernoulli distribution with probability of success equal to π(β^{T}x_{i}).
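Building the interaction-augmented record vector is mechanical. A sketch, assuming the correlated covariates are produced from a fixed, purely illustrative lower-triangular Cholesky factor `L` (the paper generates the covariance matrix randomly):

```python
import random

random.seed(2)

# Illustrative Cholesky factor of some 3x3 covariance matrix (an
# assumption for this sketch, not the paper's actual matrix).
L = [[1.0, 0.0,    0.0],
     [0.5, 0.8660, 0.0],
     [0.3, 0.2000, 0.9327]]

def draw_record():
    # correlated normals: x = L z with z ~ N(0, I)
    z = [random.gauss(0.0, 1.0) for _ in range(3)]
    x1, x2, x3 = (sum(L[r][c] * z[c] for c in range(3)) for r in range(3))
    # x_i = [1, x1, x2, x3, x1*x2, x1*x3, x2*x3]^T
    return [1.0, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3]
```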
In Table 5, we performed side-by-side comparisons of the Hosmer-Lemeshow (HL) goodness-of-fit test [45] (i.e., HL test) and AUCs between models learnt with and without interaction terms, for both ordinary LR and EXPLORER. In an HL test, the null and alternative hypotheses are “the model fits the data well” and “the model does not provide an adequate fit”, respectively. In a two-sample Z-test of AUCs, the null and alternative hypotheses are “the AUCs of ordinary LR and EXPLORER are equal” and “the AUCs of the two models are unequal”, respectively. In Table 5, we can see that both EXPLORER and ordinary LR achieve good HL test performance. For the AUCs, the two-sample Z-test results show no statistically significant difference between ordinary LR and EXPLORER. As simulated dataset 2 was generated with interactions, the AUCs of the models with interaction terms exceed those of the models without, demonstrating that interactions can be an important factor in some regression studies.
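The HL statistic bins records by predicted probability and compares observed with expected event counts per bin. A minimal sketch with g groups (the statistic is then referred to a chi-square distribution with g − 2 degrees of freedom):

```python
def hosmer_lemeshow(probs, labels, g=10):
    """Hosmer-Lemeshow statistic over g groups ordered by predicted
    probability; smaller values indicate better calibration."""
    pairs = sorted(zip(probs, labels), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        if not group:
            continue
        n_g = len(group)
        obs = sum(y for _, y in group)   # observed events in the group
        exp = sum(p for p, _ in group)   # expected events in the group
        if 0.0 < exp < n_g:
            stat += (obs - exp) ** 2 / (exp * (1.0 - exp / n_g))
    return stat
```

A perfectly calibrated set of predictions (observed counts equal expected counts in every group) yields a statistic of zero, while systematic miscalibration inflates it.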
Since EXPLORER is proposed for distributed privacy-preserving classification, we are very interested in its classification accuracy and how well the model fits the datasets compared with ordinary LR. In this section, classification performance is verified over 4 simulated datasets and 4 clinical datasets, for a total of 8 datasets. A brief summary of all 8 datasets can be found in Table 2, and detailed descriptions of the covariates of each clinical dataset are listed in Table 3. Besides the simulated i.i.d. and correlated datasets used in the previous tasks, we also included several simulated binary datasets generated using a binomial distribution. Moreover, we carefully chose these simulated and clinical datasets to provide a wide range of class distributions, sample sizes, and numbers of covariates. All the model parameters β used for generating the outcome variables were randomly sampled from normal distributions. In this experimental task, we only considered the 2-site EXPLORER setup; the impact of the number of participating EXPLORER sites on model convergence is discussed in Section 4.6.
First, the HL test [45] was used to verify the model fit of the proposed EXPLORER. In our experiments, we performed HL tests at the 5% significance level. Table 6 shows the HL test results (i.e., test statistic and p-value) for both EXPLORER and ordinary LR on the aforementioned 8 datasets. There are no conspicuous differences in p-value or test statistic between EXPLORER and ordinary LR on any of the 8 datasets.
Second, Table 7 presents both qualitative and statistical comparisons of AUCs between ordinary LR and the 2-site EXPLORER on the 8 datasets, where the standard deviations of the AUCs are computed over 30 trials. For the qualitative comparison, the maximum difference in AUC between ordinary LR and EXPLORER is only 0.007. Similarly, in the two-sample Z-tests of AUCs between ordinary LR and EXPLORER, no statistically significant differences were observed.
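The AUC itself can be computed directly as the Mann-Whitney probability that a randomly chosen positive record is scored above a randomly chosen negative one. A simple O(n_pos · n_neg) sketch:

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney U statistic; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives AUC = 1.0, a fully inverted ranking gives 0.0, and uninformative scores give 0.5; for large samples a rank-based O(n log n) formulation would be preferred.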
From the results above, we demonstrated that the proposed EXPLORER achieves classification and model fit performance similar to that of the ordinary LR model. However, in biomedical research, another important aspect of LR is the interpretability of the estimated coefficients β, where the linear function β^{T}x_{i} can be interpreted as the log odds of the binary event y_{i} = 1. Thus, it is also very important to verify that the coefficients estimated by the proposed EXPLORER are compatible with those of ordinary LR.
Tables 8 to 11 compare the estimated model coefficients β, their standard deviations over 30 trials, and the two-sample Z-tests of the estimated coefficients between ordinary LR and the 2-site EXPLORER on simulated datasets 3 to 6. The corresponding results for clinical datasets 7 to 10 can be found in Appendix C, Tables C.13 to C.16. The results in Tables 8 to 11 and Tables C.13 to C.16 further confirm that EXPLORER and ordinary LR are compatible.
In this section, we focus on the convergence of the inter-site update. Although, as shown in [32], the EP algorithm converges to a fixed point, we wanted to verify experimentally that the convergence accuracy of EXPLORER depends neither on the data order nor on the number of participating sites. Table 12 shows the two-sample Z-tests of estimated coefficients between the 2- and 4-site, 2- and 6-site, and 2- and 8-site EXPLORER setups, respectively, where the dataset partitioning strategies can be found in Section 4.1. We can see that there is no difference in the statistical comparison of estimated coefficients among the different settings.
We analyzed the convergence speed of the asynchronous n-site EXPLORER with n = 2, 3, · · ·, 8 using the Edinburgh dataset. Fig. 3 depicts the convergence of all 11 coefficients for an 8-site setup. In Fig. 3, we can see that convergence is very fast for all 11 coefficients and all 8 participating sites, even though the initial coefficients at different sites differ considerably. Usually, within 4 or 5 iterations, all the coefficients converge to their fixed points. To reach a Mean Squared Error (MSE) tolerance of 10^{−8}, EXPLORER takes 9 iterations on average, where the MSE of each inter-site update is calculated between the coefficients estimated after that update in the asynchronous EXPLORER phase and their converged values.
We also studied the impact of evenly partitioned dataset sizes on the convergence speed. Fig. 4 shows the convergence speed on an evenly partitioned Edinburgh dataset with n = 2 to 8, where each participating site holds approximately 1128/n randomly partitioned records. Figs. 4(a) and (b) show that the MSE decays very quickly, falling below 10^{−4} within 3 iterations. These results, especially Fig. 4(b), also illustrate that the MSEs of each site at the 1st iteration are mainly data-driven. Fig. 4(c) shows box plots of the MSE of each site after the 1st iteration for n-site setups with n = 2, 3, · · ·, 8. As the total number of records in our experimental dataset is fixed, the amount of data each site holds is inversely proportional to the number of involved sites. In Fig. 4(c), we can see that the more data each site holds, the smaller the divergence of the estimated coefficients among the sites.
Finally, we studied the impact of an unevenly partitioned Edinburgh dataset on the convergence speed. As shown in Figs. 5(a), (b) and (c), we randomly selected 500, 750 and 1000 records for the first site, respectively; the remaining n − 1 sites shared the rest of the records evenly. Figs. 5(a), (b) and (c) show, as expected, that the MSE of the first site decreases as the number of records it holds increases. When the first site holds 1000 records, its MSE drops from 2.5 to less than 10^{−2} within one iteration. Note that in our asynchronous EXPLORER, the inter-site information exchange starts at the 2nd iteration. This observation illustrates how online learning works: a model learnt from a large dataset can be used to improve a model learnt from a relatively smaller one. For example, in Fig. 5(c), the models learnt at sites 2, 3 and 4 in the first iteration improve significantly at the second iteration using the information obtained from site 1, with their MSE dropping from the order of 10^{0} to 10^{−4}. In contrast, the improvement of site 1 due to the other sites is small, as its MSE only decreases from the order of 10^{−2} to 10^{−4}. The complexity analysis can be found in the supplementary materials (Section S5).
There is currently great interest in sharing healthcare and biomedical research data to expedite discoveries that improve quality of care [46, 47, 48, 49, 29, 50, 51]. Unfortunately, healthcare institutions cannot share their data easily due to government regulations and privacy concerns. In this paper, we investigated an EXpectation Propagation LOgistic REgRession (EXPLORER) model for distributed privacy-preserving online learning. The proposed framework provides a high-level guarantee for protecting sensitive information. Experimental results show that EXPLORER matches the performance (i.e., classification accuracy and model parameter estimation) of the traditional frequentist logistic regression model. Practical applications of privacy-preserving predictive models can benefit from methods such as those employed in EXPLORER, since it does not require retraining every time a new data point is added, and it does not rely on synchronous communication as its predecessor, the distributed logistic regression model [29], does.
The proposed EXPLORER provides strong privacy protection, since the exchanged statistics are always aggregated over a local population of m_{j} − m_{j−1} records. This is analogous to the generalization operation used for table de-identification (i.e., k-anonymization [49]), and the aggregated statistics reduce the risk of a privacy breach for individual patients. For example, a malicious user eavesdropping at the server side can only observe the difference between m_{h_j→B}(β) and m_{B→h_j}(β) at each inter-site update, where m_{B→h_j}(β) is the message from the server node at the previous iteration and m_{h_j→B}(β) is the message uploaded to the server node at the current iteration. Therefore, the malicious user cannot match this aggregated difference back to any individual record in EXPLORER. A very interesting direction is to extend EXPLORER to satisfy objective privacy criteria such as ε-differential privacy, which we plan to investigate in future work.
Moreover, we introduced the secured intermediate information exchange (SINE) protocol to enhance the security of the proposed EXPLORER framework, which significantly reduces the risk of information leakage due to the exchange of unprotected aggregated statistics among EXPLORER clients and the server. The SINE protocol is based on a modified secure sum algorithm from secure multiparty computation, which offers a provable security guarantee [52].
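The secure sum primitive underlying such protocols is easy to illustrate. This is the classical ring-based version, not the exact SINE construction, and it sums integers; real-valued aggregated statistics would additionally need a fixed-point encoding:

```python
import random

def secure_sum(site_values, modulus=2 ** 61 - 1):
    """Ring-based secure sum: the initiator masks the running total with
    a random offset, each site adds its own value modulo `modulus`, and
    the initiator removes the mask at the end. No site ever sees another
    site's raw value, only a uniformly distributed partial sum."""
    mask = random.randrange(modulus)       # known only to the initiator
    running = mask
    for v in site_values:                  # pass the total around the ring
        running = (running + v) % modulus
    return (running - mask) % modulus      # initiator unmasks the true sum
```

The plain version shown here is secure against a single honest-but-curious party; hardening it against collusion (as modified secure sum variants do) requires splitting each value into multiple shares or randomizing the ring order.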
Since EXPLORER works in a distributed fashion, it introduces additional communication overhead between the server and the clients, which in general is proportional to the number of participating sites. Since we use the normal distribution as the approximating distribution, the messages propagated between the server and the clients consist of the d-dimensional mean vector β and its covariance matrix V with d(d + 1)/2 unique elements. Moreover, unlike the SPARK protocol [26], whose complexity increases exponentially with the network size, the SINE protocol is lightweight, with complexity that grows only linearly with the network size, and thus offers much better scalability.
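The per-message payload follows directly from this parameterization. A quick check of the element count, using d = 11 as in the 11-coefficient Edinburgh model:

```python
def message_elements(d):
    # d entries for the mean vector, plus d*(d+1)/2 unique entries of
    # the symmetric d x d covariance matrix V
    return d + d * (d + 1) // 2

# the 11-coefficient Edinburgh model exchanges 11 + 66 = 77 numbers
print(message_elements(11))
```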
Although EXPLORER showed performance comparable to LR, it has some theoretical limitations: its optimization procedure is not convex, so there is no guarantee of a globally optimal solution. The proposed EXPLORER also introduces communication overhead that is proportional to the number of participating sites, and its convergence speed depends on the partition size at each involved site.
In summary, EXPLORER offers an additional tool for privacy-preserving distributed statistical learning. We showed empirically on two relatively small data sets that the results are very similar to those of ordinary logistic regression. These promising results warrant further validation in larger data sets and further refinement of the methodology. The inability to openly share (i.e., transmit) patient data without onerous processes involving pairwise agreements between institutions may significantly slow down analyses that could produce important results for healthcare improvement and biomedical research advances. EXPLORER provides a means to mitigate this problem by relying on multiparty computation, without the need for extensive retraining of models or reliance on synchronous communications among sites.
Assumed-density filtering (ADF) is a sequential technique for quickly computing an approximate posterior distribution in Bayesian inference. However, the performance of ADF depends on the data order. The expectation propagation (EP) algorithm [53], an extension of ADF, obtains a good approximation to the posterior by iteratively refining the solution produced by ADF; thus, EP is usually much more accurate than ADF. EP works by approximating each likelihood term so as to minimize the Kullback-Leibler (KL) divergence between the true posterior and an approximate posterior within a tractable family (e.g., the exponential family). By iteratively performing this approximation process, the approximate distribution finally reaches a fixed point [32].
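Within the Gaussian family, the KL(p‖q)-minimizing projection reduces to matching the first two moments of the target. A 1-D numerical sketch, assuming an N(0, 1) prior, a single logistic likelihood term, and a label coded as ±1 (all assumptions for illustration):

```python
import math

def project_to_gaussian(log_unnorm, lo=-10.0, hi=10.0, n=2000):
    """Return the (mean, variance) of an unnormalized 1-D density,
    i.e. the parameters of the KL(p||q)-minimizing Gaussian q."""
    h = (hi - lo) / n
    z = m1 = m2 = 0.0
    for i in range(n + 1):
        x = lo + i * h
        w = math.exp(log_unnorm(x)) * h  # quadrature weight
        z += w
        m1 += w * x
        m2 += w * x * x
    mean = m1 / z
    return mean, m2 / z - mean * mean

def log_tilted(beta, y=1, x=2.0):
    # log of N(0, 1) prior times a logistic likelihood sigma(y * x * beta)
    return -0.5 * beta * beta - math.log1p(math.exp(-y * x * beta))

mean, var = project_to_gaussian(log_tilted)
```

With a positive observation the projected mean shifts above the prior mean of zero and the variance shrinks below the prior variance of one, which is exactly the moment-matching behavior EP exploits term by term.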
In our Bayesian logistic regression problem, the parameter β is associated with a Gaussian prior distribution as p(β) = N(β; μ_{0}, V_{0}), where μ_{0} and V_{0} are the prior mean vector and covariance matrix, respectively.
Given a training dataset D = {(x_{1}, y_{1}), (x_{2}, y_{2}), · · ·, (x_{N}, y_{N})}, the likelihood function for the parameter β is written as p(y | β) = Π_{i=1}^{N} p(y_{i} | β) = Π_{i=1}^{N} π(β^{T}x_{i})^{y_i} [1 − π(β^{T}x_{i})]^{1−y_i}, where π(·) is the logistic function.
Then let us denote the true posterior distribution of β by p(β | y) ∝ p(β) Π_{i} p(y_{i} | β) = p(β) Π_{i} f_{i}(β) and the approximate posterior by q(β | y) ∝ p(β) Π_{i} f̃_{i}(β). It is mathematically convenient to choose a Gaussian distribution for each approximation term f̃_{i}(β), so that the resulting approximate posterior is also Gaussian. To perform an efficient EP process, f̃_{i}(β) can be parameterized as a scaled Gaussian, f̃_{i}(β) = s_{i} N(β; m_{i}, V_{i}), with scale s_{i}, mean m_{i}, and covariance V_{i}.
The procedure for obtaining the posterior approximation through the EP algorithm is as follows [54]: first, initialize the term approximations f̃_{i}(β); then, for each term in turn, (i) remove f̃_{i}(β) from q(β) to form the cavity distribution q^{\i}(β) ∝ q(β)/f̃_{i}(β), (ii) moment-match a new q(β) to the tilted distribution q^{\i}(β)f_{i}(β), and (iii) set f̃_{i}(β) ∝ q(β)/q^{\i}(β); these updates are repeated until all f̃_{i}(β) converge.
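This refinement loop can be sketched end to end for a scalar coefficient. A toy sketch, assuming labels coded as ±1, a logistic likelihood, site approximations stored in natural parameters, and moment matching done by quadrature (none of which is the paper's multivariate implementation):

```python
import math

def moments(log_f, lo, hi, n=3000):
    # mean and variance of exp(log_f) on [lo, hi] by Riemann sum
    h = (hi - lo) / n
    z = m1 = m2 = 0.0
    for i in range(n + 1):
        x = lo + i * h
        w = math.exp(log_f(x)) * h
        z += w
        m1 += w * x
        m2 += w * x * x
    mean = m1 / z
    return mean, m2 / z - mean * mean

def ep_logistic_1d(data, prior_var=1.0, sweeps=10):
    """EP for p(beta) = N(0, prior_var), p(y_i | beta) = sigma(y_i x_i beta).
    Each likelihood term is approximated by a Gaussian held in natural
    parameters: precision r_i and precision-times-mean s_i."""
    r = [0.0] * len(data)
    s = [0.0] * len(data)
    r0 = 1.0 / prior_var
    for _ in range(sweeps):
        for i, (x, y) in enumerate(data):
            rq, sq = r0 + sum(r), sum(s)
            rc, sc = rq - r[i], sq - s[i]          # cavity distribution
            if rc <= 0.0:
                continue                            # skip an invalid cavity
            mc, vc = sc / rc, 1.0 / rc
            tilted = lambda b: (-0.5 * (b - mc) ** 2 / vc
                                - math.log1p(math.exp(-y * x * b)))
            sd = math.sqrt(vc)
            m, v = moments(tilted, mc - 8.0 * sd, mc + 8.0 * sd)
            r[i] = 1.0 / v - rc                     # moment-matched update
            s[i] = m / v - sc
    rq, sq = r0 + sum(r), sum(s)
    return sq / rq, 1.0 / rq                        # posterior mean, variance

# ten positive observations pull the posterior mean well above zero
mean, var = ep_logistic_1d([(1.0, +1)] * 10)
```

Repeated sweeps over the terms implement the ADF-order-independence property discussed above: after convergence the result no longer depends on the order in which the observations were processed.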
In this section, we introduce the details of the factor graph construction and message passing in EXPLORER. Let us write down the factorization of the posterior distribution at each site as follows:

p(β_{j} | Y_{1}Y_{2}Y_{3}) = Z Π_{i=1}^{m} p(y_{i} | β_{j}) ∫ δ(β_{j}, β) p(β | Y_{2}Y_{3}) dβ,
where Z = p(Y_{2}Y_{3})/p(Y_{1}Y_{2}Y_{3}) is a normalization constant. In the above equation, p(β_{j} | Y_{1}Y_{2}Y_{3}) is equivalent to the belief b(β_{j}) in the BP algorithm, which is captured by the variable node B_{j}. Therefore, in the context of the factor graph and the BP algorithm, we can interpret the product term Π_{i=1}^{m} p(y_{i} | β_{j}) as the message collected from all the factor nodes f_{i}^{j}, and the integral term ∫ δ(β_{j}, β) p(β | Y_{2}Y_{3}) dβ as the message sent from factor node h_{j}. Moreover, according to the factor node update rule in the BP algorithm, we can identify the delta function δ(β_{j}, β) as the factor function and p(β | Y_{2}Y_{3}) as the message sent from server node B. In practice, there are many ways to select factor functions to reflect the contribution of p(β_{j} | β). In this paper, we follow the suggestion of Loeliger [34] and choose δ(β_{j}, β) to represent the probability p(β_{j} | β) for mathematical convenience, because ∫ δ(β_{j}, β) p(β | Y_{2}Y_{3}) dβ = p(β_{j} | Y_{2}Y_{3}).
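In the Gaussian case, the BP variable-node update that produces the belief b(β_{j}) is simply a product of Gaussians. A scalar sketch (in EXPLORER the messages are multivariate, with a mean vector and covariance matrix):

```python
def combine_gaussian_messages(messages):
    """Belief at a variable node = product of incoming Gaussian messages,
    given as (mean, variance) pairs: precisions (1/variance) add, and
    precision-weighted means add."""
    prec = sum(1.0 / v for _, v in messages)
    mean = sum(m / v for m, v in messages) / prec
    return mean, 1.0 / prec

# two unit-variance messages centred at 0 and 2 combine to N(1, 0.5)
mean, var = combine_gaussian_messages([(0.0, 1.0), (2.0, 1.0)])
```

Dividing one Gaussian out of the belief (the cavity operation used in the inter-site update) is the same computation with the corresponding precision and precision-weighted mean subtracted instead of added.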
β      Ordinary LR        EXPLORER          two-sample Z-test
       value     std.     value     std.    Test statistic   p-value
β_0    −1.442    0.243    −1.539    0.252    1.517           0.12934
β_1     0.027    0.005     0.030    0.005   −2.234           0.02547
β_2     0.016    0.006     0.019    0.008   −1.481           0.13849
β      Ordinary LR        EXPLORER          two-sample Z-test
       value     std.     value     std.    Test statistic   p-value
β_0    −0.009    0.346    −0.009    0.308   −0.004           0.99671
β_1     0.220    0.118     0.220    0.116    0.015           0.98831
β_2     0.462    0.150     0.453    0.146    0.224           0.82279
β_3     0.258    0.437     0.185    0.409    0.667           0.50459
β_4     0.519    0.116     0.527    0.115   −0.292           0.77064
β_5    −0.795    0.133    −0.809    0.134    0.416           0.67711
β_6    −0.856    0.139    −0.862    0.137    0.152           0.87930
β_7     0.021    0.010     0.023    0.009   −0.727           0.46710
β_8    −0.010    0.002    −0.010    0.001    1.169           0.24258
β      Ordinary LR        EXPLORER          two-sample Z-test
       value     std.     value     std.    Test statistic   p-value
β_0    −2.368    0.382    −2.199    0.355   −1.779           0.07527
β_1     0.050    0.011     0.046    0.010    1.482           0.13833
β_2     0.000    0.005    −0.001    0.005    1.097           0.27271
β_3    −0.614    0.142    −0.602    0.141   −0.327           0.74355
β_4    −0.726    0.122    −0.707    0.120   −0.606           0.54419
β_5    −0.072    0.020    −0.079    0.020    1.434           0.15143
β_6     0.231    0.125     0.227    0.128    0.137           0.89088
β_7     0.437    0.096     0.427    0.096    0.411           0.68095
β_8     0.164    0.105     0.147    0.106    0.611           0.54091
β      Ordinary LR        EXPLORER          two-sample Z-test
       value     std.     value     std.    Test statistic   p-value
β_0    −1.020    0.665    −0.904    0.458   −0.785           0.43227
β_1    −0.176    0.251    −0.301    0.209    2.089           0.03673
β_2     1.281    0.242     1.204    0.198    1.345           0.17877
β_3     1.732    0.286     1.643    0.236    1.320           0.18667
β_4    −0.183    0.034    −0.195    0.033    1.484           0.13780
β_5     1.238    0.182     1.243    0.180   −0.097           0.92265
β_6     1.199    0.255     1.185    0.239    0.211           0.83267
β_7    −0.671    0.555    −0.621    0.405   −0.400           0.68887
β_8    −0.096    0.493    −0.031    0.349   −0.588           0.55628
For all 2-site EXPLORER setups of datasets 3 to 11, the difference between the means (DBM) of the class distribution and covariates of the 2 sites is shown in Figs. D.6 to D.14 in this section, which offers an intuitive sense of how heterogeneous these sites are.