Privacy is becoming a major concern when sharing biomedical data across institutions. Although methods for protecting the privacy of individual patients have been proposed, it is not clear how to protect institutional privacy, which is often a critical concern of data custodians. Building upon our previous work, Grid Binary LOgistic REgression (GLORE)^{1}, we developed an Institutional Privacy-preserving Distributed binary Logistic Regression model (IPDLR) that considers both individual and institutional privacy when building a logistic regression model in a distributed manner. We tested our method using both simulated and clinical data, showing that it is possible to protect the privacy of individuals and of institutions using a distributed strategy.
Data are increasingly being collected electronically in the field of biomedicine ^{2}. A prominent challenge in sharing these data is how to protect privacy. Although information exchange is critical in healthcare ^{3} and many people believe that the integration of data might lead to models that can help improve the quality of life ^{4}^{–}^{7}, the prospect of inadvertently leaking sensitive patient information makes data custodians reluctant to share the data ^{8}^{,}^{9}. To overcome such barriers and facilitate research ^{10}, we have to systematically handle privacy concerns and mitigate the risk of inappropriate disclosure.
An interesting problem is how to train a global statistical model without sharing patient data from distributed local sites, which we have addressed in prior work ^{1}. There are typically two types of studies, depending on how data are partitioned. If the set of patients is split among data owners, the partition is called a horizontal partition^{11}. If the attributes of each patient are split among data owners, we call this a vertical partition^{12}. An early study ^{13} discussed how to train linear regression on both horizontally and vertically partitioned data. However, compared to linear regression, binary logistic regression ^{14} (LR) is more popular in biomedical informatics ^{15}^{–}^{19}. Therefore, a model for training LR in a distributed and privacy-preserving manner is useful. Earlier work ^{20} by other authors showed how to estimate the coefficients of an LR model from distributed data and concentrated on computing efficiency, but did not discuss variance estimation, distributed ROC curves, or distributed model fit test statistic calculations. Our recent approach, the GLORE model ^{1}, which is based on the decomposition of patient data in the Newton-Raphson coefficient estimation procedure ^{21}, provides a practical solution for training an LR model, conducting the Hosmer-Lemeshow goodness-of-fit test ^{22}, and calculating the Area Under the ROC Curve (AUC) score ^{23}. Although GLORE is a very useful privacy-preserving model for individuals, it requires that sites send their intermediary results to the central server without considering institutional privacy^{24}. We define institutional privacy as the protection of important and sensitive information that, if leaked, can put an institution at a disadvantage with respect to its peers or competitors. The goal of institutional privacy is to allow institutions to remain anonymous, so that data cannot be traced back to a particular institution. This includes masking the provenance of information such as infection rates, complications, etc. 
It also includes masking the provenance of coefficients in a distributed logistic regression model. In this paper, we describe an algorithm that masks the ownership of the intermediary results (i.e., adjusted coefficients) sent to the central server, in an effort to promote institutional privacy.
In GLORE, the distributed sites calculate coefficients for the LR model using their local data at each step or iteration, send these coefficients to the central site, and receive adjusted coefficients to start the next iteration. The central site can determine which site contributed each matrix of coefficients, and hence institutional privacy is not achieved. To overcome this shortcoming in GLORE, we propose Institutional Privacy-preserving Distributed binary Logistic Regression (IPDLR). The problem is how to combine the necessary information from distributed sites at a central server in a privacy-preserving manner.
We developed the following algorithm, in which we assume that the central server needs to obtain the sum of matrices (each of size m_{1} × m_{2}) from k local sites. This algorithm is inspired by Chen ^{9}, who proposed a method for performing a log-rank test in survival analysis in a distributed fashion.

We can see that, by adopting Algorithm 1, the central server cannot recover any local matrix, and no participating site is able to infer the local matrices of the other sites. Note that Algorithm 1 applies only when the quantity needed by the server is the sum of the matrices from all local sites. Figure 1 shows the two different networks used for the basic GLORE model and for secure summation in IPDLR. Note that in the GLORE model only server-to-client communication is needed, while client-to-client communication is also necessary in IPDLR.
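The listing of Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of one common way to realize secure summation with a random matrix, consistent with the client-to-client communication described above. It assumes that the server draws the random mask, that the masked running sum is passed from site to site, and that the entries are integers so the mask cancels exactly; the function names and the toy matrices are hypothetical, and collusion between parties is not addressed.

```python
import random

def add(a, b):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def subtract(a, b):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def secure_sum(local_matrices, rng, mask_bound=10**9):
    """Ring-style secure summation (illustrative assumption, not the paper's listing).

    Returns the true total and the list of masked running sums that were
    passed along the ring, one per site.
    """
    m1, m2 = len(local_matrices[0]), len(local_matrices[0][0])
    # Assumption: the server draws a random integer mask that only it knows.
    mask = [[rng.randint(-mask_bound, mask_bound) for _ in range(m2)]
            for _ in range(m1)]
    running = mask
    passed_along = []
    for local in local_matrices:
        # Each site adds its own matrix to the masked running sum it received
        # and forwards the result to the next site (the last one to the server).
        running = add(running, local)
        passed_along.append(running)
    # The server removes its mask and is left with the sum only.
    total = subtract(running, mask)
    return total, passed_along

sites = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
total, passed_along = secure_sum(sites, random.Random(0))
print(total)
```

In this sketch the server receives only the final masked sum, and each site receives only a masked running sum, so neither sees another party's local matrix directly. Real-valued quantities would need a fixed-point encoding, or a tolerance for rounding error, for the mask to cancel.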
The innovations in the work described here are the modification of GLORE to enable institutional privacy through secure summation, and a method to calculate ROC curves in a distributed, privacy-preserving fashion. In our previous work, we described only an algorithm for AUC calculation in a distributed fashion.
IPDLR applies Algorithm 1 to the various data integration steps involved in the GLORE model, including model coefficient estimation, variance-covariance estimation, and HL test statistic computation. Specifically, the secure summation algorithm ensures confidentiality when transmitting partial derivatives for coefficients, Fisher information for the variance-covariance matrix, and the outcome distribution within each decile for HL tests. This provides stronger security for local information in the distributed computation. In addition to these improvements, we developed a privacy-preserving way to obtain the ROC curve by leveraging the secure summation algorithm, which was not possible with the basic GLORE model.
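The summation structure behind coefficient estimation can be illustrated with a minimal sketch of Newton-Raphson for logistic regression in which the server only ever combines per-site sums of partial derivatives and Fisher information. This is a standard decomposition written under our own assumptions, not the authors' implementation: the function names and the toy data are hypothetical, and the two additions inside the loop are plain sums here, whereas IPDLR would obtain them through secure summation.

```python
import math

def local_summaries(X, y, beta):
    """One site's gradient X^T (y - p) and information X^T W X at the current beta."""
    d = len(beta)
    grad = [0.0] * d
    info = [[0.0] * d for _ in range(d)]
    for row, label in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
        w = p * (1.0 - p)
        for i in range(d):
            grad[i] += row[i] * (label - p)
            for j in range(d):
                info[i][j] += w * row[i] * row[j]
    return grad, info

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit(sites, d, tol=1e-6, max_iter=50):
    """Newton-Raphson in which only per-site sums are combined."""
    beta = [0.0] * d
    for iteration in range(1, max_iter + 1):
        grad = [0.0] * d
        info = [[0.0] * d for _ in range(d)]
        for X, y in sites:
            g, h = local_summaries(X, y, beta)  # computed locally at each site
            grad = [a + b for a, b in zip(grad, g)]
            info = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(info, h)]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            return beta, iteration
    return beta, max_iter

# Hypothetical toy data: first column is the intercept, second a single feature.
site_a = ([[1, -2.0], [1, -1.0], [1, 0.0], [1, 1.0]], [0, 1, 0, 1])
site_b = ([[1, -0.5], [1, 0.5], [1, 1.5], [1, 2.5]], [0, 0, 1, 1])
pooled = (site_a[0] + site_b[0], site_a[1] + site_b[1])

beta_dist, n_iter = fit([site_a, site_b], d=2)
beta_pool, _ = fit([pooled], d=2)
print(beta_dist, beta_pool, n_iter)
```

Because the gradient and the information matrix are sums over records, splitting the records across sites changes only the order of addition, which is why the distributed and pooled fits agree up to floating-point error.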
GLORE discusses how to compute the Area Under the ROC Curve (AUC) for model validation in a distributed fashion that preserves individual privacy. However, the ROC curve supplies much more information than a single AUC score, since the ROC curve shows the one-to-one relationship between sensitivity and specificity. Some users may want to plot the ROC curve instead of getting a single AUC value. The ROC curve needs the contingency table, including true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts for each predicted probability. Please refer to Zou ^{23} for further details on ROC analyses. For the overall contingency table, if all sites send their local tables to the central server directly, the relationship between predictions and record labels (“one” or “zero”) for each site can be recovered at the central server. For example, if TP increases by 4 between two adjacent rows, then the central server knows that there are 4 records with label “1” from a specific local site with a known predicted probability. Hence, a safer method is necessary for the ROC computation. Towards this goal, Algorithm 1 is applied, and the local table information from each site is masked from the central server, which implies, according to our definition, that institutional privacy is preserved.
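The leakage described above can be made concrete with a small sketch. It assumes a local table with one cumulative (TP, FP) row per prediction, sorted in descending order, which is our own simplified layout rather than the exact table used by the authors; the function names are hypothetical, and the values are those of site S1 in the two-site example given later in this section.

```python
def cumulative_table(predictions, labels):
    """Rows of (prediction, TP, FP) for one site, sorted by descending prediction."""
    rows, tp, fp = [], 0, 0
    for pred, label in sorted(zip(predictions, labels), reverse=True):
        tp += label        # a record with label 1 raises the cumulative TP count
        fp += 1 - label    # a record with label 0 raises the cumulative FP count
        rows.append((pred, tp, fp))
    return rows

def recover_labels(rows):
    """What a curious server can infer by differencing adjacent rows."""
    recovered, prev_tp = [], 0
    for pred, tp, _fp in rows:
        recovered.append((pred, tp - prev_tp))  # number of label-1 records at pred
        prev_tp = tp
    return recovered

s1_predictions = [0.9, 0.8, 0.5, 0.3, 0.2]
s1_labels = [1, 1, 0, 1, 0]

table = cumulative_table(s1_predictions, s1_labels)
print(recover_labels(table))
```

Differencing the TP column returns the original label of every S1 record together with its predicted probability, which is exactly the information that masking the local tables is meant to hide.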
To use secure summation, each site requires a local contingency table that covers all predictions from all sites. In fact, to create a local table each site only needs the rankings of its own predictions within the overall prediction set (sorted in descending order), in addition to the size of the overall prediction set. Here, we present a method to get the local table elements (TP, TN, FP, FN) for the predictions from other sites. The four elements for a prediction with a given ranking are obtained using the following two rules when the prediction is from another site.
IPDLR applies Algorithm 2 to obtain the overall contingency table. Once the central server has the contingency table, the ROC curve is plotted as sensitivity ( $\frac{TP}{TP+FN}$) against 1 − specificity ( $\frac{FP}{TN+FP}$) for each prediction.

Figure 2 shows the flow chart for the main steps of plotting the ROC curve with three local clients. Next we use a simple artificial example to explain how Algorithm 2 works when there are only two local sites S1 and S2. Suppose site S1 has prediction set (0.9, 0.8, 0.5, 0.3, 0.2) and a corresponding class label set (1, 1, 0, 1, 0), and site S2 has prediction set (0.8, 0.7, 0.5, 0.3, 0.1) and a corresponding class label set (1, 0, 1, 0, 0).
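The additive structure that makes this example work can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: it assumes every site already knows the pooled set of distinct prediction values and uses them as thresholds (a record is called positive when its prediction is greater than or equal to the threshold), whereas the method described above relies only on rankings. The function name is hypothetical, and the element-wise sum of the two local tables is the step that secure summation would protect.

```python
def local_table(predictions, labels, thresholds):
    """One site's [TP, FP, FN, TN] counts at every shared threshold."""
    table = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(predictions, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(predictions, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(predictions, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(predictions, labels) if p < t and y == 0)
        table.append([tp, fp, fn, tn])
    return table

# Predictions and class labels of sites S1 and S2 from the example above.
s1 = ([0.9, 0.8, 0.5, 0.3, 0.2], [1, 1, 0, 1, 0])
s2 = ([0.8, 0.7, 0.5, 0.3, 0.1], [1, 0, 1, 0, 0])

# Assumption: the distinct pooled prediction values serve as shared thresholds.
thresholds = sorted(set(s1[0]) | set(s2[0]), reverse=True)

t1 = local_table(s1[0], s1[1], thresholds)
t2 = local_table(s2[0], s2[1], thresholds)

# Overall table = element-wise sum of the local tables.
overall = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(t1, t2)]

# ROC points: (1 - specificity, sensitivity) at each threshold.
roc = [(fp / (fp + tn), tp / (tp + fn)) for tp, fp, fn, tn in overall]
for t, row, point in zip(thresholds, overall, roc):
    print(t, row, point)
```

Under these assumptions the summed table is identical to the table computed from the pooled ten records, so the resulting ROC points match those of an analysis on the combined data.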
We performed a simulation study and used two clinical data sets to validate the accuracy of the IPDLR results. The computation was conducted using the R statistical language.
In the simulation study, we compared IPDLR (assuming data are evenly partitioned between 2 sites) and ordinary LR (combining all data for computation). We chose a total sample size of 1000 (500 for each site) and 9 features (i.e., variables). First, we simulated all features from a standard normal distribution, then simulated the response from a binomial distribution assuming that the log odds of the response being 1 was a linear function of the features (all coefficients were set to 1). We conducted 100 runs to compare the difference in coefficient estimates between IPDLR and LR on the same simulated data. The number of Newton-Raphson iterations to convergence was always 6 for these data when a precision of 10^{−6} was used as the iteration stop criterion.
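One run of this data-generating process could look like the sketch below. The study itself was carried out in R; this Python version is only an illustration under stated assumptions: the function name, the seed, and the way the records are split are hypothetical, and the intercept is exposed as a parameter because the description does not say explicitly whether "all coefficients" includes it (the default of 1 reads the statement literally).

```python
import math
import random

def simulate(n, n_features, rng, intercept=1.0):
    """Simulate n records: standard-normal features and a Bernoulli response.

    The log odds are intercept + sum of the features, i.e. every feature
    coefficient equals 1. The intercept value is an assumption.
    """
    rows, labels = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        log_odds = intercept + sum(x)
        p = 1.0 / (1.0 + math.exp(-log_odds))
        labels.append(1 if rng.random() < p else 0)
        rows.append([1.0] + x)  # leading 1.0 is the intercept column
    return rows, labels

rng = random.Random(2013)  # arbitrary seed, for reproducibility of the sketch
X, y = simulate(1000, 9, rng)

# Even horizontal partition: 500 records per site.
site_1 = (X[:500], y[:500])
site_2 = (X[500:], y[500:])
print(len(site_1[0]), len(site_2[0]), len(X[0]))
```

Each site then holds 500 records with 10 columns (intercept plus 9 features), matching the 10 coefficients that are compared between IPDLR and ordinary LR.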
Table 4 shows the mean absolute difference (MAD) between 2site IPDLR and LR estimations for all 10 coefficients (9 features plus 1 intercept), where the mean is for 100 runs. There are no substantial differences between IPDLR and LR estimations for all model coefficients.
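Written out, the comparison metric for coefficient $j$ can be expressed as follows, where the symbols are our own notation rather than the authors': $\hat{\beta}^{\mathrm{IPDLR}}_{j,r}$ and $\hat{\beta}^{\mathrm{LR}}_{j,r}$ denote the estimates of coefficient $j$ in run $r$, and $R = 100$ is the number of runs.

$$\mathrm{MAD}_j = \frac{1}{R}\sum_{r=1}^{R}\left|\hat{\beta}^{\mathrm{IPDLR}}_{j,r} - \hat{\beta}^{\mathrm{LR}}_{j,r}\right|, \qquad j = 0, 1, \dots, 9$$

Here $j = 0$ is taken to index the intercept and $j = 1, \dots, 9$ the features.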
Two real datasets are used to illustrate our IPDLR model. The first clinical dataset is related to myocardial infarction in Edinburgh, UK ^{15}, and has 1,253 records with one binary outcome and 48 features. We picked nine non-redundant features in this data set and evenly split the 1,253 records into two parts (627 vs. 626) to test IPDLR with 2 sites. The second dataset contains 141 records with one binary outcome denoting the presence or absence of cancer, and two biomarkers, CA19 and CA125. The 141 records were split into 71 and 70 for two sites.
Tables 5 and 6 show the results of fitting IPDLR, including coefficient estimates and their standard errors, Z-test statistics and p-values, for the Edinburgh data and for the CA19 and CA125 cancer marker data, respectively. When fitting IPDLR, the HL test statistic equals 12.983 with a p-value of 0.112 for the Edinburgh data, and 3.510 with a p-value of 0.898 for the CA19 and CA125 data, which are no different from the results of fitting ordinary LR models. Moreover, 5 and 12 Newton-Raphson iterations were needed for convergence with 10^{−6} precision for the Edinburgh data and the CA19 and CA125 data, respectively. In addition, the AUC values for IPDLR are 0.699 and 0.891 for the Edinburgh data and the CA19 and CA125 data, respectively, which are no different from the results of ordinary LR models using all data. ROC curves for IPDLR were plotted as well. Figure 3 and Figure 4 are ROC curves generated by Algorithm 2 for the Edinburgh data and for the CA19 and CA125 data, respectively. These ROC curves are exactly the same as those produced by ordinary LR. Based on the HL test and ROC curve (AUC) for the two clinical datasets, we see that LR fits the CA19 and CA125 data very well.
The proposed IPDLR model was built on top of our GLORE model, but further improves it for institutional privacy and adds a mechanism to plot ROC curves in a distributed, institutionally private fashion. In IPDLR, the provenance of original and derived data is masked. The core algorithm of IPDLR is secure summation. This algorithm is based on the creation of a random matrix and works whenever the computation requires the summation of partial information from all local sites. When performing the secure summation, properties of the partial data could also be taken into account to improve security. For example, the random matrix for the contingency table should contain only integer elements, and we could also make the column of the random matrix that corresponds to the TP column increasing. The central server could also perform error checks before accepting the partial data, for example to verify that the row sums are identical. IPDLR improves all distributed methods in GLORE, including coefficient estimation, variance-covariance matrix estimation, HL test statistic calculation and AUC value calculation. Furthermore, IPDLR proposes a privacy-preserving method for calculating the ROC curve, which was not described in our previous work. The IPDLR model improves the confidentiality of all participating clients and of the central server when the protocol is followed (i.e., all parties are trustworthy). However, it might still be possible to infer some patient information from the partial data while performing data analysis or data management. We plan to quantify the privacy risk within exchanged partial data using “differential identifiability”, a recent model proposed by Lee and Clifton ^{25}, in future work. We will also improve the robustness and security of the system by developing quality control and error checking modules for received partial data. One idea for quality control is to filter out participants with low counts. 
For example, before the IPDLR model fitting starts, the central server will send the same computing algorithm (i.e., scripts or executable JAR files) to all sites and set a threshold to rule out sites with very low observation numbers.
Although IPDLR provides the opportunity to combine data for greater statistical power, we need to decide whether fitting IPDLR with combined data will have added value (i.e., depending on the goodness-of-fit statistics at the local sites as well as for the global model). We also need to perform data preprocessing for all local sites using the same procedure before fitting IPDLR. The secure summation algorithm ensures institutional privacy. As a cost, the software design for IPDLR involves more complicated communication between parties than was needed for GLORE. We have also not addressed the problem of internal attacks that originate from within the communication network.
More work is needed to demonstrate the successful application of IPDLR in practice. However, this article presents the first steps towards a distributed logistic regression algorithm and its distributed evaluation that preserves the privacy of individuals as well as the privacy of institutions. We are currently working on a project making distributed models accessible and useful to data analysts, and we will use this opportunity to test and implement an IPDLR application in some clinical centers.
The authors were funded in part by the NIH grants R01LM009520, U54HL108460, R01HS019913, UL1RR031, 1K99LM01139201.