Privacy is becoming a major concern when sharing biomedical data across institutions. Although methods for protecting the privacy of individual patients have been proposed, it is not clear how to protect institutional privacy, which is often a critical concern of data custodians. Building upon our previous work, Grid Binary LOgistic REgression (GLORE) 1, we developed an Institutional Privacy-preserving Distributed binary Logistic Regression model (IPDLR) that considers both individual and institutional privacy when building a logistic regression model in a distributed manner. We tested our method using both simulated and clinical data, showing that it is possible to protect the privacy of individuals and of institutions using a distributed strategy.
Data are increasingly being collected electronically in the field of biomedicine 2. One of the most prominent challenges in sharing these data is how to protect privacy. Although information exchange is critical in healthcare 3 and many people believe that the integration of data might lead to models that can help improve the quality of life 4–7, the prospect of inadvertently leaking sensitive patient information makes data custodians reluctant to share the data 8,9. To overcome such barriers and facilitate research 10, we must systematically address privacy concerns and mitigate the risk of inappropriate disclosure.
An interesting problem is how to train a global statistical model without sharing patient data from distributed local sites, which we have addressed in prior work 1. There are typically two types of studies, depending on how data are partitioned. If patient records are split among data owners, this is called a horizontal partition 11. If the attributes of each patient are split among data owners, we call this a vertical partition 12. An early study 13 discussed how to train linear regression on both horizontally and vertically partitioned data. However, compared to linear regression, binary logistic regression 14 (LR) is more popular in biomedical informatics 15–19. Therefore, a model for training LR in a distributed and privacy-preserving manner is useful. Earlier work by other authors 20 showed how to estimate the coefficients of an LR model from distributed data, but it concentrated on computational efficiency and did not discuss variance estimation, distributed ROC curves, or distributed model-fit test statistic calculations. Our recent approach, the GLORE model 1, which is based on the decomposition of patient data in the Newton-Raphson coefficient estimation procedure 21, provides a practical solution for training an LR model, conducting the Hosmer-Lemeshow goodness-of-fit test 22, and calculating the Area Under the ROC Curve (AUC) score 23. Although GLORE is a very useful privacy-preserving model for individuals, it requires that sites send their intermediary results to the central server without considering institutional privacy 24. We define institutional privacy as the protection of important and sensitive information that, if leaked, can put an institution at a disadvantage with respect to its peers or competitors. The goal of institutional privacy is to allow institutions to remain anonymous, so that data cannot be traced back to a particular institution. This includes masking the provenance of information such as infection rates, complications, etc.
It also includes masking the provenance of coefficients in a distributed logistic regression model. In this paper, we describe an algorithm that masks the ownership of the intermediary results (i.e., adjusted coefficients) sent to the central server, in an effort to promote institutional privacy.
In GLORE, the distributed sites calculate coefficients for the LR model using their local data at each step or iteration, send these coefficients to the central site, and receive adjusted coefficients to start the next iteration. The central site can determine which site contributed each matrix of coefficients, and hence institutional privacy is not achieved. To overcome the shortcomings in GLORE, we propose Institutional Privacy-preserving Distributed binary Logistic Regression (IPDLR). The problem is how to combine necessary information from distributed sites into a central server in a privacy-preserving manner.
We developed the following algorithm, in which we assume that the central server needs to obtain the sum of matrices (i.e., each of size m1 × m2) from k local sites. This algorithm is inspired by Chen 9, who proposed a method for performing the log-rank test in survival analysis in a distributed fashion.
We can see that, by adopting Algorithm 1, the central server cannot recover the local matrix and each participating site is not able to infer local matrices from the other sites. Note that, in order to use Algorithm 1, the matrix sent to the server should be the sum of the matrices from all local sites. Figure 1 shows two different networks used for the basic GLORE model and for secure summation in IPDLR. Note that in the GLORE model only server-to-client communication is needed, while client-to-client communication is necessary in IPDLR.
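The masking idea behind Algorithm 1 can be sketched in a few lines. The following is a minimal, centralized simulation of a common ring-based secure summation variant, stated here as an assumption about the protocol's structure: the first site masks its matrix with a random matrix R, each remaining site adds its own matrix to the masked running total, and the first site finally removes R so that only the overall sum reaches the central server. The function name `secure_sum` and the mask range are illustrative; in the real protocol, each addition would occur at a different institution over the network.

```python
import numpy as np

def secure_sum(local_matrices, rng=None):
    """Centralized sketch of ring-based secure summation.
    No party ever observes another site's raw matrix: the server and
    intermediate sites only see mask-perturbed running totals."""
    if rng is None:
        rng = np.random.default_rng()
    shape = local_matrices[0].shape
    R = rng.uniform(-1e6, 1e6, size=shape)   # random mask held only by site 1
    running = local_matrices[0] + R          # site 1 starts the ring masked
    for M in local_matrices[1:]:             # each site adds and forwards
        running = running + M
    return running - R                       # site 1 strips the mask; only
                                             # the overall sum is revealed
```

Because every value in transit is perturbed by R, an observer of any single message learns nothing about the individual site matrices, yet the final result equals the exact sum.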
The innovations in the work described here are the modification of GLORE to enable institutional privacy through secure summation, and a method to calculate ROC curves in a distributed, privacy-preserving fashion. In our previous work, we only described an algorithm for AUC calculation in a distributed fashion.
IPDLR applies Algorithm 1 to the various data integration steps involved in the GLORE model, including model coefficient estimation, variance-covariance estimation, and H-L test statistic computation. Specifically, the secure summation algorithm ensures confidentiality when transmitting partial derivatives for coefficients, Fisher information for the variance-covariance matrix, and the outcome distribution within each decile for H-L tests. This provides stronger security for local information in the distributed computation. In addition to these improvements, we developed a privacy-preserving way to obtain the ROC curve by leveraging the secure summation algorithm, which was not possible with the basic GLORE model.
GLORE discusses how to compute the Area Under the ROC Curve (AUC) for model validation in a distributed, individual privacy-preserving fashion. However, the ROC curve supplies much more information than a single AUC score, since it shows the one-to-one relationship between sensitivity and specificity. Some users may want to plot the ROC curve instead of obtaining a single AUC value. The ROC curve requires the contingency table, including true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), for each predicted probability. Please refer to Zou 23 for details on ROC analyses. For the overall contingency table, if all sites send their local tables to the central server directly, the relationship between predictions and record labels ("one" or "zero") for each site can be recovered by the central server. For example, if TP increases by 4 between two adjacent rows, then the central server knows that there are 4 records with label "1" from a specific local site with a known predicted probability. Hence, a safer method is necessary for the ROC computation. Toward this goal, Algorithm 1 is applied, and the local table information from each site is masked from the central server, which implies, according to our definition, that institutional privacy is preserved.
To use secure summation, each site requires a local contingency table covering all predictions from all sites. In fact, to create a local table, each site only needs the rankings of its own predictions among the overall sorted (in descending order) prediction set, in addition to the size of the overall prediction set. Here, we present a method to obtain the local table elements (TP, TN, FP, FN) for the predictions from other sites. The four elements for a prediction with a given ranking are obtained using the following two rules when the prediction is from another site.
IPDLR applies Algorithm 2 to obtain the overall contingency table. Once the central server has the contingency table, the ROC curve is plotted as sensitivity (TP/(TP+FN)) against 1-specificity (FP/(FP+TN)) for each prediction.
Figure 2 shows the flow chart for the main steps of plotting the ROC curve with three local clients. Next we use a simple artificial example to explain how Algorithm 2 works when there are only two local sites S1 and S2. Suppose site S1 has prediction set (0.9, 0.8, 0.5, 0.3, 0.2) and a corresponding class label set (1, 1, 0, 1, 0), and site S2 has prediction set (0.8, 0.7, 0.5, 0.3, 0.1) and a corresponding class label set (1, 0, 1, 0, 0).
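Using this two-site example, the pooled contingency table that Algorithm 2 reconstructs distributively can be illustrated with a short centralized sketch. The helper name `contingency_rows` is hypothetical, and the code below pools the data directly rather than using secure summation; it serves only to make the target quantities (one TP/FP/TN/FN row per threshold, and the resulting sensitivity and 1-specificity values) concrete.

```python
def contingency_rows(preds, labels):
    """One (threshold, TP, FP, TN, FN) row per unique predicted
    probability, with thresholds taken in descending order.  This is
    the pooled table IPDLR assembles via secure summation."""
    pairs = list(zip(preds, labels))
    P = sum(labels)                 # total positives across sites
    N = len(labels) - P             # total negatives across sites
    rows = []
    for t in sorted(set(preds), reverse=True):
        tp = sum(1 for p, y in pairs if p >= t and y == 1)
        fp = sum(1 for p, y in pairs if p >= t and y == 0)
        rows.append((t, tp, fp, N - fp, P - tp))
    return rows

# Pooled predictions and labels from sites S1 and S2 in the example.
preds  = [0.9, 0.8, 0.5, 0.3, 0.2] + [0.8, 0.7, 0.5, 0.3, 0.1]
labels = [1, 1, 0, 1, 0] + [1, 0, 1, 0, 0]
for t, tp, fp, tn, fn in contingency_rows(preds, labels):
    sens = tp / (tp + fn)           # sensitivity = TP / (TP + FN)
    fpr  = fp / (fp + tn)           # 1 - specificity = FP / (FP + TN)
    print(f"thr={t:.1f}  TP={tp} FP={fp} TN={tn} FN={fn}  "
          f"sens={sens:.2f}  1-spec={fpr:.2f}")
```

Plotting the (1-specificity, sensitivity) pairs in threshold order traces the ROC curve; in IPDLR, the same rows are obtained without any site revealing its local table to the server.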
We performed a simulation study and used two clinical data sets to validate the accuracy for IPDLR results. The computation was conducted using the R statistical language.
In the simulation study, we compared IPDLR (assuming data are evenly partitioned between 2 sites) and ordinary LR (combining all data for computation). We chose a total sample size of 1,000 (500 for each site) and set the number of features (i.e., variables) to 9. First, we simulated all features from a standard normal distribution, then simulated the response from a binomial distribution, assuming that the log odds of the response being 1 was a linear function of the features (all coefficients were set to 1). We conducted 100 runs to compare the coefficient estimation difference between IPDLR and LR on the same simulated data. In this study, the Newton-Raphson procedure always converged in 6 iterations when a precision of 10−6 was set as the stopping criterion.
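This simulation design can be sketched as follows, assuming a plain pooled Newton-Raphson LR fit (the computation that IPDLR decomposes across sites via secure sums of the score vector and Fisher information); the function `fit_lr_newton` and all variable names are illustrative, and the exact iteration count depends on the random draw.

```python
import numpy as np

def fit_lr_newton(X, y, tol=1e-6, max_iter=50):
    """Pooled Newton-Raphson fit of a binary LR model with a 1e-6
    stopping precision.  Returns the coefficient vector (intercept
    first) and the number of iterations used."""
    n, m = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])        # prepend intercept column
    beta = np.zeros(m + 1)
    for it in range(1, max_iter + 1):
        z = np.clip(Xd @ beta, -30, 30)         # guard against overflow
        p = 1.0 / (1.0 + np.exp(-z))            # predicted probabilities
        grad = Xd.T @ (y - p)                   # score vector
        W = p * (1 - p)
        info = Xd.T @ (Xd * W[:, None])         # Fisher information matrix
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            return beta, it
    return beta, max_iter

# Simulation matching the study design: 1,000 records, 9 standard
# normal features, all true coefficients (including intercept) set to 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 9))
logit = 1.0 + X.sum(axis=1)
y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-logit))).astype(float)
beta, iters = fit_lr_newton(X, y)
```

In IPDLR, `grad` and `info` would each be assembled as a secure sum of per-site contributions at every iteration, so the server updates `beta` without seeing any site's patient-level data.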
Table 4 shows the mean absolute difference (MAD) between 2-site IPDLR and LR estimations for all 10 coefficients (9 features plus 1 intercept), where the mean is for 100 runs. There are no substantial differences between IPDLR and LR estimations for all model coefficients.
Two real datasets were used to illustrate our IPDLR model. The first clinical dataset is related to myocardial infarction at Edinburgh, UK 15; it has 1,253 records with one binary outcome and 48 features. We picked nine non-redundant features from this dataset and evenly split the 1,253 records into two parts (627 vs. 626) to test IPDLR with 2 sites. The second dataset contains 141 records with one binary outcome denoting whether the patient had cancer, and two biomarkers, CA-19 and CA-125. The 141 records were split into sets of 71 and 70 for the two sites.
Tables 5 and 6 show the results of fitting IPDLR, including coefficient estimates, their standard errors, Z-test statistics, and p-values for the Edinburgh data and for the CA-19 and CA-125 cancer marker data, respectively. When fitting IPDLR, the H-L test statistic equals 12.983 with a p-value of 0.112 for the Edinburgh data, and 3.510 with a p-value of 0.898 for the CA-19 and CA-125 data; these are no different from the results of fitting ordinary LR models. Moreover, 5 and 12 Newton-Raphson iterations were needed for convergence with 10−6 precision for the Edinburgh data and the CA-19 and CA-125 data, respectively. In addition, the AUC values for IPDLR are 0.699 and 0.891 for the Edinburgh data and the CA-19 and CA-125 data, respectively, which are no different from the results of ordinary LR models using all data. ROC curves for IPDLR were plotted as well. Figures 3 and 4 show the ROC curves generated by Algorithm 2 for the Edinburgh data and for the CA-19 and CA-125 data, respectively. These ROC curves are exactly the same as those produced by ordinary LR. Based on the H-L test and ROC curve (AUC) for the two clinical datasets, we see that LR fits the CA-19 and CA-125 data very well.
The proposed IPDLR model was built on top of our GLORE model, but it further improves institutional privacy and adds a mechanism to plot ROC curves in a distributed, institutionally private fashion. In IPDLR, the provenance of original and derived data is masked. The core algorithm of IPDLR is secure summation. This algorithm is based on the creation of a random matrix and works whenever the summation of partial information from all local sites is required in a computation. To perform the secure summation, properties of the partial data could also be considered to improve security. For example, the random matrix for the contingency table should only contain integer elements, and we could require the column that corresponds to the TP column to be monotonically increasing as well. The central server could also perform error checks before accepting the partial data, for example, to verify whether the row sums are consistent. IPDLR improves all distributed methods in GLORE, including coefficient estimation, variance-covariance matrix estimation, H-L test statistic calculation, and AUC calculation. Furthermore, IPDLR introduces a privacy-preserving method for calculating the ROC curve, which was not described in our previous work. The IPDLR model improves the confidentiality of all participating clients and of the central server when the protocol is followed (i.e., all parties are honest). However, it might still be possible to infer some patient information from the partial data while performing data analysis or data management. We plan to quantify the privacy risk within exchanged partial data using "differential identifiability", a recent model proposed by Lee and Clifton 25, in future work. We will also improve the robustness and security of the system by developing quality control and error checking modules for received partial data. One idea for quality control is to filter out participants with low counts.
For example, before the IPDLR model fitting starts, the central server will send the same computing algorithm (i.e., scripts or executable JAR files) to all sites and set a threshold to rule out sites with very low observation numbers.
Although IPDLR provides the opportunity to combine data for more statistical power, we need to decide whether fitting IPDLR with combined data will have added value (i.e., depending on the goodness-of-fit statistics at each local site as well as for the global model). We also need to preprocess the data at all local sites using the same procedure before fitting IPDLR. The secure summation algorithm ensures institutional privacy. As a cost, the software design for IPDLR involves more complicated communication between parties than was needed for GLORE. We have also not addressed the problem of internal attacks that originate from within the communication network.
More work is needed to demonstrate the successful application of IPDLR in practice. However, this article presents the first steps towards a distributed logistic regression algorithm and its distributed evaluation that preserves the privacy of individuals as well as the privacy of institutions. We are currently working on a project making distributed models accessible and useful to data analysts, and we will use this opportunity to test and implement an IPDLR application in some clinical centers.
The authors were funded in part by the NIH grants R01LM009520, U54HL108460, R01HS019913, UL1RR031, 1K99LM011392-01.