Home | About | Journals | Submit | Contact Us | Français |

**|**J Am Med Inform Assoc**|**v.20(3); May-Jun 2013**|**PMC3628047

Formats

Article sections

- Abstract
- Introduction
- Preliminaries
- The algorithm
- Experimental description
- Discussion
- Conclusion
- References

Authors

Related links

J Am Med Inform Assoc. 2013 May-Jun; 20(3): 462–469.

Published online 2012 December 13. doi: 10.1136/amiajnl-2012-001027

PMCID: PMC3628047

Noman Mohammed,^{1} Xiaoqian Jiang,^{2} Rui Chen,^{1} Benjamin C M Fung,^{1} and Lucila Ohno-Machado^{2}

Received 2012 April 20; Revised 2012 April 13; Accepted 2012 November 1.

Copyright Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions

See "Health surveillance using the internet and other sources of information" in volume 20 on page 403.

This article has been cited by other articles in PMC.

Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among existing privacy models, *ε-differential privacy* provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. All existing solutions that ensure ε*-*differential privacy handle the problem of disclosing relational and set-valued data in a privacy-preserving manner separately. In this paper, we propose an algorithm that considers both relational and set-valued data in differentially private disclosure of healthcare data.

The proposed approach makes a simple yet fundamental switch in differentially private algorithm design: instead of listing all possible records (ie, a contingency table) for noise addition, records are generalized before noise addition. The algorithm first generalizes the raw data in a probabilistic way, and then adds noise to guarantee ε*-*differential privacy.

We showed that the disclosed data could be used effectively to build a decision tree induction classifier. Experimental results demonstrated that the proposed algorithm is scalable and performs better than existing solutions for classification analysis.

The resulting utility may degrade when the output domain size is very large, making it potentially inappropriate to generate synthetic data for large health databases.

Unlike existing techniques, the proposed algorithm allows the disclosure of health data containing both relational and set-valued data in a differentially private manner, and can retain essential information for discriminative analysis.

With the wide deployment of electronic health record systems, health data are being collected at an unprecedented rate. The need for sharing health data among multiple parties has become evident in several applications ,^{1} such as decision support, policy development, and data mining. Meanwhile, major concerns have been raised about individual privacy in health data sharing. The current practice of privacy protection primarily relies on policies and guidelines, for example, the Health Insurance Portability and Accountability Act (HIPAA)^{2} in the USA. HIPAA defines two approaches to achieve de-identification: the first is *Expert Determination*, which requires that an expert certify that the re-identification risk inherent in the data is sufficiently low; the second is *Safe Harbor*, which requires the removal and suppression of a list of attributes.^{3}
*Safe Harbor* requires data disclosers to follow a checklist^{4} to remove specific information to de-identify the records.

However, there are numerous controversies on both sides of the privacy debate regarding these HIPAA privacy rules.^{5} Some think that the protections provided in the de-identified data are not sufficient.^{6} Others contend that these privacy safeguards hamper biomedical research, and that observing them may preclude meaningful studies of medical data that depend on suppressed attributes, for example, fine-grained epidemiology studies in areas with fewer than 20 000 residents or geriatric studies requiring detailed ages in those over 89.^{3} There are concerns that privacy rules will erode the efficiencies that computerized health records may create, and in some cases, interfere with law enforcement.^{5} Recently, the Institute of Medicine Committee on Health Research and the Privacy of Health Information concluded that the privacy rules do not adequately safeguard privacy and also significantly impede high-quality research.^{7} The result is that patients’ health records are not well protected at the same time that researchers cannot effectively use them for discoveries.^{8} Technical efforts are highly encouraged to make published health data both privacy-preserving and useful.^{9}

Anonymizing health data is a challenging task due to inherent *heterogeneity*. Modern health data are typically composed of different types, for example relational data (eg, demographics) and set-valued data (eg, diagnostic codes and laboratory tests). In relational data (eg, gender, age, body mass index), records contain only one value for each attribute. On the other hand, set-valued data (eg, diagnostic codes and laboratory tests) contain one or more values (cells) for each attribute. For example, the attribute-value {1*, 2*} of the *diagnostic code* contains two separate cells: {1*} and {2*}. For many medical problems, different types of data need to be published *simultaneously* so that the correlation between different data types can be preserved. Such an emerging heterogeneous data-publishing scenario, however, is seldom addressed in the existing literature on privacy technology. Current techniques primarily focus on a single type of data^{10} and therefore are unable to thwart privacy attacks caused by inferences involving different data types. In this article, we propose an algorithm so that heterogeneous health data can be published yet retain essential information for supporting data mining tasks in a differentially private manner. The following real-life scenario further illustrates the privacy threats resulting from heterogeneous health data sharing.

*Example 1* Consider the raw patient data in table 1 (the attribute ID is just for the purposes of illustration). Each row in the table represents information from a patient.

The attributes *Sex*, *Age*, and *Diagnostic code* are categorical, numerical, and set-valued, respectively. Suppose that the data owner needs to release table 1 for the purpose of classification analysis on the class attribute, which has two values, Y and , indicating whether or not the patient is deceased. If a record in the table is too specific such that not many patients can match it, releasing the data may lead to the re-identification of a patient. For example, Loukides *et al*^{11} demonstrated that for the International Classification of Diseases (ICD), Ninth Revision (ICD-9) codes (or ‘diagnostic codes’ for brevity), one source of set-valued data could be used by an adversary for linkage to patients’ identities. Needless to say, the knowledge of both relational and set-valued data about a victim makes the privacy attack easier for an adversary. Suppose that the adversary knows that the target patient is female and her diagnostic codes contain {11}. Then, record #4 can be uniquely identified, since she is the only *Female* with diagnostic codes {11,12} in the raw data. Thus, identifying her record results in disclosure that she also has {12}. Note that we do not make any assumption about the adversary's background knowledge. An adversary may have partial or full information about the set-valued data and can try to use any background knowledge to identify the victim.

To prevent such linking attacks, a number of partition-based privacy models have been proposed.^{12–16} However, recent research has indicated that these models are vulnerable to various privacy attacks^{17–20} and provide insufficient privacy protection. In this article, we employ *differential privacy*,^{21} a privacy model that provides provable privacy guarantees and that is, by definition, immune against all aforementioned attacks. Differential privacy makes no assumption about an adversary's background knowledge. A differentially private mechanism ensures that the probability of any output (released data) is almost equally likely from all nearly identical input data sets and thus guarantees that all outputs are insensitive to any single individual's data. In other words, an individual's privacy is not at risk because of inclusion in the disclosed data set.

Existing algorithms that provide differential privacy guarantees are based on two approaches: *interactive* and *non-interactive*. In an interactive framework, a data miner can pose aggregate queries through a private mechanism, and a database owner answers these queries in response. Most of the proposed methods for ensuring differential privacy are based on an interactive framework.^{22–26} In a non-interactive framework the database owner first anonymizes the raw data and then releases the anonymized version for public use. In this article, we adopt the non-interactive framework as it has a number of advantages for data mining.^{10} Current techniques that adopt the non-interactive approach publish contingency tables or marginals of the raw data.^{27–30} The general structure of these approaches is to first derive a frequency matrix of the raw data over the database domain.

For example, table 2 shows the contingency table of table 3. After that, noise is added to each count to satisfy the privacy requirement. Finally, the noisy frequency matrix is published. However, this approach is not suitable for high-dimensional data with a large domain because when the added noise is relatively large compared to the count, the utility of the data is significantly destroyed. We also confirm this point in the ‘Experimental description’ section. Our proposed solution instead first probabilistically generates a generalized contingency table and then adds noise to the counts. For example, table 4 is a generalized contingency table of table 3. Thus the count of each partition is typically much larger than the added noise.

We propose a novel technique for publishing heterogeneous health data that provides an *ε-differential privacy* guarantee. While protecting privacy is a critical element in data publishing, it is equally important to preserve the utility of the published data, since this is the primary reason for data release. Taking the decision tree induction classifier as an example, we show that our sanitization algorithm can be effectively tailored for preserving information in data mining tasks. The contributions of this article are:

- To our knowledge, a
*differentially private*data disclosure algorithm that simultaneously handles both relational and set-valued data has not been previously developed. The proposed differentially private data algorithm is based on a generalization technique and preserves information for classification analysis. Previous work^{31}suggests that deterministic generalization techniques cannot be used to achieve*ε*-differential privacy, as they depend heavily on the data to be disseminated. Yet, we show that differentially private data can be released through the addition of uncertainty in the generalization procedure. - The proposed algorithm can also handle numerical attributes. Unlike existing methods,
^{30}it does not require the numerical attributes to be pre-discretized. The algorithm adaptively determines the split points for numerical attributes and partitions the data based on the workload, while guaranteeing ε-differential privacy. This is an essential requirement for obtaining accurate classification, as we show in the ‘Discussion’ section. Moreover, the algorithm is computationally efficient. - It is well acknowledged that ε-differential privacy provides a strong privacy guarantee. However, the utility of data disclosed by
*differentially private*algorithms has received much less study. Does an interactive approach offer better data mining results than a non-interactive approach? Does differentially private data disclosure provide less utility than disclosure based on k-anonymous data? Experimental results demonstrate that our algorithm outperforms the recently proposed differentially private interactive algorithm for building a classifier^{26}and the*top-down specialization*(*TDS*) approach^{32}that publishes k-anonymous data for classification analysis.

This article is organized as follows. The ‘Preliminaries’ section provides an overview of the generalization technique and presents the problem statement. Our anonymization algorithm is explained in ‘The algorithm’ section. In the ‘Experimental description’ section, we experimentally evaluate the performance of our solution, and we summarize our main findings in the ‘Discussion’ section.

In this section, we introduce the notion of generalization in the context of data publishing, followed by a problem statement.

Let be a multiset of records, where each record represents the information of an individual with attributes . We represent the data set in a tabular form and use the terms ‘data set’ and ‘data table’ interchangeably. We assume that each attribute has a finite domain, denoted by . The domain of is defined as . To generalize a data set , we replace the value of an attribute with a more general value. The exact general value is determined according to the attribute partition.

Definition 2.1 *(Partition)* The partitions of a numerical attribute are the intervals in such that

For categorical and set-valued attributes, partitions are defined by a set of nodes from the taxonomy tree such that it covers the whole tree, and each leaf node belongs to exactly one partition.

For example, is the general value of according to the taxonomy tree of in figure 1. Similarly, 23 and 11 can be represented by the interval and the code 1*, respectively. For numerical attributes, these intervals are determined adaptively from the entire data.

*Definition 2.2 (Generalization)* Generalization is defined by a function , where maps each value to a .

Clearly, given a data set over a set of attributes , many alternative generalization functions are feasible. Each generalization function partitions the attribute domains differently. To satisfy ε-differential privacy, the algorithm must determine a generalization function that is insensitive to the underlying data. More formally, for any two data sets and , where (ie, they differ on at most one record), the algorithm must ensure that the ratio of and is bounded, where is a randomized algorithm (see online supplementary appendix A1). One naive solution satisfying ε-differential privacy is to have a fixed generalization function, irrespective of the input data set (ie, by definition *zero*-differentially private but useless). However, the proper choice of a generalization function is crucial since the data mining result varies significantly for different choices of partitioning. In the ‘Experimental description’ section, we present an efficient algorithm for determining an adaptive partitioning technique for classification analysis while guaranteeing ε-differential privacy. Online supplementary appendix A1 presents an overview of ε-differential privacy and the core mechanisms to achieve ε-differential privacy.

Suppose a data owner wants to release a de-identified data table where the symbols and correspond to predictor attributes and the class attribute, respectively, for release to the public for classification analysis. The attributes in are classified into three categories: (1) an *identifier*
attribute that explicitly identifies an individual, such as *SSN* (social security number), and *Name*. These attributes are removed before releasing the data as per the HIPAA Privacy Rule; (2) a *class* attribute that contains the class value; the goal of the data miner is to build a classifier to accurately predict the value of this attribute; and (3) a set of
*predictor* attributes , whose values are used to predict the binary label of the class attribute.

We require the class attribute to be categorical, and the predictor attribute can be categorical, numerical, or set-valued. Further, we assume that for each categorical or set-valued attribute , a taxonomy tree is provided. The taxonomy tree of an attribute specifies the hierarchy among the values. Our problem statement can be written as: given a data table and the privacy parameter , our objective is to generate a de-identified data table such that (1) satisfies ε-differential privacy, and (2) preserves as much information as possible for classification analysis.

In this section, we present an overview of our *Diff*erentially private algorithm based on *Gen*eralization (*DiffGen*). We elaborate the key steps, and prove that the algorithm is ε-differentially private in online supplementary appendix A2. In addition, we present the implementation details and analyze the complexity of the algorithm in online supplementary appendix A3.

- Input: Raw data set , privacy budget
*ε*, and number of specializations Output: Generalized data set - 1: Initialize every value in to the topmost value (see figure 2 for details);
- 2: Initialize to include the topmost value (see figure 2 for details);
- 3: Set a privacy budget for specification of predictors ;
- 4: Determine the split value for each with probability ;
- 5: Compute the score for each candidate (see online supplementary appendix A2 for details);
- 6: for to do
- 7: Select with probability ;
- 8: Specialize on and update ;
- 9: Determine the split value for each new with probability
- ;
- 10: Update score for ;
- 11: end for
- 12: return each group with count , where denotes the probability density function of Laplacian distribution.

Algorithm 1 first generalizes the predictor attributes and divides the raw data into several equivalence groups, where all the records within a group have the same attribute values. Then, the algorithm publishes the noisy counts of the groups. The general idea is to sanitize the raw data by a sequence of specializations, starting from the topmost general state as shown in figure 2. A *specialization*, written as , where denotes the set of child values of , replaces the parent value with a child value. The specialization process can be viewed as pushing the ‘cut’ of each taxonomy tree downwards. A *cut* of the taxonomy tree for an attribute , denoted by , contains exactly one value on each root-to-leaf path. The value of the set-valued attribute of a record can be generalized to a *cut* if *every* item in the record can be generalized to a node in the cut and *every* node in the cut generalizes some items in the record. For example, the value can be generalized to the hierarchy cuts and , but *not*
. Figure 2 shows a solution cut indicated by the dashed curve representing table 5, which corresponds to de-identified data to be disseminated.

Initially, *DiffGen* creates a single partition by generalizing all values in to the topmost value in their taxonomy trees (line 1). The contains the topmost value for each attribute (line 2). The specialization starts from the topmost cut and pushes down the cut iteratively by specializing some value in the current cut. At each iteration, *DiffGen* uses an *exponential mechanism*^{33} (see online supplementary appendix A1) to select a candidate for specialization (line 7). Candidates are selected based on their score values (see online supplementary appendix A2), and different heuristics (eg, *information gain* and *max frequency*) can be used to determine the score of the candidates. Then, the algorithm specializes and updates (line 8). As taxonomy trees for the numerical attributes are not given, *DiffGen* again uses the exponential mechanism to determine the split value dynamically for each numerical candidate (lines 4 and 9). *DiffGen* specializes by recursively distributing the records from the parent partition into disjoint child partitions with more specific values based on the taxonomy tree. For set-valued attributes, the algorithm computes the noisy count of each child partition to determine whether it is empty or not. Only ‘non-empty’ partitions are considered for further split in the next iteration. We provide additional details for candidate selection and the split value determination steps in online supplementary appendix A2. *DiffGen* also calculates the score for each new candidate due to the specialization (line 10). The algorithm terminates after a given number of specializations. Finally, the algorithm adds Laplace noise (see online supplementary appendix A1) to each equivalence group of the leaf partition to construct the sanitized data table . We use the following example to facilitate understanding of how to use score functions, which are based on heuristics (eg, information gain and max frequency), for specification.

*Example 2* Consider table 1 with ε=1 and h=2. Initially, the algorithm creates one root partition containing all the records that are generalized to . includes . To find the first specialization among the candidates in , we compute the scores of , [*18–65*), and . We show how to compute the information gain *(InfoGain)* and maximum frequency *(Max)* scores of for the specialization . Details of these two utility functions are discussed in online supplementary appendix 2.

- Information gain:

- 2.Max:

.

Let the first specialization be . The algorithm then creates three child partitions with the child values , , and , respectively, by replacing the node with different combinations of its children, leading , , , and to the child partition * *and , , , and to the child partition . Suppose that the noisy counts indicate that these two child partitions are ‘non-empty’. Then further splits are needed for them. There is no need to explore the child partition any more, as it is considered ‘empty’. is updated to . Suppose that the next specialization is , which creates further specialized partitions. Finally, the algorithm outputs the equivalence groups of each leaf partition along with their noisy counts as shown in figure 2 under the dotted line.

Please refer to online supplementary appendix A2 for privacy analysis of *DiffGen* and to online supplementary appendix A3 for implementation details.

In this section our objectives are to study the impact of enforcing differential privacy on the data quality in terms of classification accuracy (CA), and to evaluate the scalability of the proposed algorithm for handling large data sets. We also compare *DiffGen* with *DiffP-C4.5*,^{26} a differentially private interactive algorithm for building a classifier, and with the *TDS* approach^{32} that publishes k-anonymous data for classification analysis. All experiments were conducted on an Intel Core i7 2.7GHz PC with 12GB RAM.

We employed two real-life data sets: *MIMIC* and *Adult*. We retrieved the *MIMIC* data set from the Multi-parameter Intelligent Monitoring in Intensive Care II research database,^{34} which contains over 36 000 intensive care unit episodes. Specifically, we picked eight features (ie, marital status, gender, ethnicity, payment description, religion description, admission type, admission source, and ICD-9 code) and a target variable (ie, mortality). Of the eight features, the first seven are categorical attributes while the last one is a set-valued attribute. The publicly available *Adult*^{35} data set is the 1994 US census data set that has been widely used for testing many sanitization algorithms. *Adult* has 45 222 census records with six numerical attributes, eight categorical attributes, and a binary *class* column representing two income levels, ≤US$50 K or >US$50 K. Please refer to Fung *et al*^{32} for the description of attributes.

To evaluate the impact on classification quality, we divided the data into training and testing sets. First, we applied our algorithm to sanitize the training set and determine the . Then, the same was applied to the testing set to produce a generalized test set. Next, we built a classifier on the sanitized training set and measured the CA on the generalized records of the test set. For classification models, we used the well-known C4.5 classifier.^{36} For each experiment, we executed 10 runs and averaged the results over the runs.

We applied *DiffGen* to the *MIMIC* data set for both utility functions (ie, Max and InfoGain). Figure 3 shows the CA for Max and InfoGain, where the privacy budget ε=0.1, 0.25, 0.5, 1, and the number of specializations h=5. We used two thirds of the records to build the classifier and measured the accuracy on the remaining third of the records. Both utility functions have similar performance, where CA spans from 86% to 89% under different privacy budgets. The experimental result suggests that the proposed algorithm can achieve good CA on heterogeneous health data. We could not directly compare our method with others for the *MIMIC* data set because we are not aware of an approach that can sanitize heterogeneous data while ensuring ε-differential privacy.

To better visualize the cost and benefit of our approach, we provide additional measures: *baseline accuracy* (*BA*) is the CA measured on the raw data without sanitization. BA−CA represents the cost in terms of classification quality for achieving a given ε-differential privacy requirement. At the other extreme, we measure *lower-bound accuracy* (*LA*), which is the accuracy on the raw data with all attributes (except for the *class* attribute) removed. CA−LA represents the benefit of our method over the naive non-disclosure approach.

Figure 4A depicts the CA for the utility function Max, where the privacy budget ε=0.1, 0.25, 0.5, 1, and the number of specializations 4≤h≤16. The BA and LA are 85.3% and 75.5%, respectively, as shown in the figure by the dotted lines. For ε=1 and h=10, BA−CA is around 3% and CA−LA is 6.74%. For ε=0.5, BA−CA spans from 3.57% to 4.8%, and CA−LA spans from 5% to 6.23%. However, when ε decreases to 0.1, CA quickly decreases to about 78% (highest point), the cost increases to about 7%, and the benefit decreases to about 3%. These results suggest that for an acceptable privacy budget such as , the cost for achieving ε-differential privacy is small, while the benefit of our method over the naive method is large. Figure 4B depicts the CA for the utility function InfoGain. The performance of InfoGain is not as good as that of Max because the difference between the scores of a good and a bad attribute is much smaller for InfoGain as compared to Max. Therefore, the exponential mechanism does not work as effectively in the case of InfoGain as it does for Max.

Figure 5A shows the CA of *DiffGen*, *DiffP-C4.5*, and *TDS*. For *DiffGen*, we use utility function Max and fix the number of specializations h=15. *DiffP-C4.5* also uses the *Adult* data set and all the results of the *DiffP-C4.5* are taken from the paper by Friedman and Schuster.^{26} For *TDS* we fixed the anonymity threshold k=5 and conducted the experiment ourselves. Following the same setting,^{26} we executed 10 runs of 10-fold cross-validation to measure the CA.

Comparison of *DiffGen* with *DiffP-C4.5* and top-down specialization (*TDS*) algorithms. (A) Evaluation of averaged accuracy, where the bottom and topmost lines stand for the worst case (ie, all records generalized to one super record) and the optimal case **...**

The accuracy of *DiffGen* is clearly better than that of *DiffP-C4.5* for privacy budgets ε≤2. Note that the privacy budget should be typically smaller than 1.^{21} Even for a higher budget, the accuracy of *DiffGen* is comparable to that of *DiffP-C4.5*. The major advantage of our algorithm is that we publish data and the data miner has much better flexibility to perform the required data analysis. On the other hand, in *DiffP-C4.5* the classifier is built through interactive queries; therefore, the database has to be permanently shut down to satisfy the privacy requirement after generating only *one* classifier. The experimental result also shows that *DiffGen* performs better than *TDS*. For a higher anonymity threshold k, the accuracy of *TDS* will be lower. One advantage of *DiffGen* is that, unlike *TDS*, it does not need to ensure that every equivalence group contains k records; therefore, *DiffGen* is able to provide more detailed information than *TDS*. This result demonstrates for the first time that, if designed properly, a differentially private algorithm can provide better utility than a partition-based approach.

All previous experiments can finish the sanitization process within 30 s. We further study the scalability of our algorithm over large data sets. We generate different data sets of different sizes by randomly adding records to the *Adult* data set. For each original record r, we create α−1 variations of the record by replacing some of the attribute values randomly from the same domain. Here α is the blowup scale and thus the total number of records is α×45 222 after adding random records. Figure 5B depicts the runtime for 200 000 to million records for h=15 and ε=1.

We observed two general trends from the experiments. First, the privacy budget has a direct impact on the CA. A higher budget results in better accuracy since it ensures better attribute partitioning and lowers the magnitude of noise that is added to the count of each equivalence group. Second, the CA initially increases with the increase in the number of specializations, but decreases after a certain threshold. This is an interesting observation. The number of equivalence groups increases quite rapidly with an increase in the number of specializations, resulting in a smaller count per group. Up to a certain threshold it has a positive impact due to more precise values; however, the influence of the Laplace noise gets stronger as the number of specializations grows. Note that if the noise is as large as the count, then the disclosed data are useless. This confirms that listing all possible combinations of values (ie, contingency table) and then adding noise to their counts is not a good approach for high-dimensional data since the noise will be as big as the count. Since this is a non-interactive approach, the data owner can try different values of h to find the threshold and then release the sanitized data. Determining a good value of h adaptively, given the data set and the privacy budget, is an interesting approach for future work that we plan to investigate.

Does the algorithm yield a globally optimal solution? Can the algorithm be easily modified to anonymize sequential data? Are noise addition and generalization-based techniques desirable for the users of medical data sets? In this section, we provide answers to these questions.

The proposed algorithm does not yield an optimal *solution cut*, rather it is suboptimal. The algorithm uses an exponential mechanism which probabilistically chooses a candidate with a high score. Thus, it is possible that a different solution cut may provide better utility. However, it is important to note that maximizing the overall sum of the *Score* for specializations in the training data does not guarantee having the lowest classification error in the testing data.

While set-valued data (not considering the order) are useful for many data analysis tasks, we acknowledge that in some other analysis tasks the order of items could provide extra useful information and may pose new re-identification risks. The proposed anonymization algorithm can be extended to handle sequential data as well. In order to anonymize sequential data, we need to preprocess each item in the sequence by its order and then consider the item–order pair as the new ‘item’. Then, these new item sets could be used as an input to our algorithm, which will then prevent re-identification attacks based on the order of items. However, this is not currently implemented in the paper.

One of the inputs of our algorithm is the number of generalizations. Data users could specify the desired degree of generalizations by setting a proper value for h. Concerning the negative impact of noise (ie, on data utility) added to satisfy differential privacy, we expect the sanitized data from our algorithm to be useful for a few tasks (eg, classification as illustrated in this paper), but not all data analysis tasks, especially those that focus on attribute values of individuals. This is an inherent limitation of differential privacy methodology, and it is the price we pay to achieve a provable privacy guarantee. We acknowledge this limitation and will seek for better solutions in future work.

In summary, the generalization technique used in *DiffGen* might result in loss of information, which leads to a tradeoff problem between data utility and privacy protection, like many other privacy enhancement algorithms. The generalization is also context dependent. For example, if we want to generalize diseases (ie, asthma with respiratory disease and psoriasis with dermatological disease), resultant outputs might be adequate to answer some scientific questions (ie, classification tasks determined upfront) but may be insufficient to answer others, (eg, immunologic diseases, which could encompass both asthma and psoriasis). This is a fundamental limitation of generalization techniques and the type of lumping must be used carefully.

This paper presents a new anonymization algorithm that achieves ε-differential privacy and supports effective classification analysis for heterogeneous health data. The proposed solution connects the classical generalization technique with output perturbation to effectively sanitize raw data. Experimental results suggest that the proposed solution provides better utility than a pure differentially private interactive approach or an approach to simply produce k-anonymous data.

We thank Dr Shuang Wang for helping with the experiments and useful discussions.

**Contributors:** The authors are ranked according to their contributions. The authorship credit is based on meeting all the following criteria: (1) substantial contributions to conception and design, acquisition of data, or analysis and interpretation of data; (2) drafting the article or revising it critically for important intellectual content; and (3) final approval of the version to be published.

**Funding:** BCMF, NM, and RC are supported in part by Discovery Grants (356065-2008) and Canada Graduate Scholarships from the Natural Sciences and Engineering Research Council of Canada (NSERC). XJ and LO-M were funded in part by NIH grants 1K99LM 011392-01, R01LM009520, U54 HL108460, R01HS019913, and UL1RR031980.

**Competing interests:** None.

**Provenance and peer review:** Not commissioned; externally peer reviewed.

1. O'Keefe CM.
Privacy and the use of health data—reducing disclosure risk. Electronic J Health Informatics
2008;3:e5:1–e5:9

2. Standards for privacy of individually identifiable health information. Final Rule, 45 CFR parts 160 and 164. http://www.hhs.gov/ocr/privacy/hipaa/administrative/privacyrule/adminsimpregtext.pdf (accessed 20 Feb 2012)

3. Benitez K, Loukides G, Malin BA.
Beyond safe harbor: automatic discovery of health information de-identification policy alternatives. *The 1st ACM International Health Informatics Symposium*; ACM, 2010:163–72 [PMC free article] [PubMed]

4. Madsen E, Masys DR, Miller RA.
HIPAA Possumus. J Am Med Inform Assoc
2003;10:294. [PMC free article] [PubMed]

5. Baumer D, Earp JB, Payton FC.
Privacy of medical records: IT implications of HIPAA. ACM Comput Soc (SIGCAS)
2000;30:40–7

6. McGraw D. Why the HIPAA privacy rules would not adequately protect personal health records: Center for Democracy and Technology (CDT) brief. http://www.cdt.org/brief/why-hipaa-privacy-rules-would-not-adequately-protect-personal-health-records (accessed 20 Feb 2012)

7. Institute of Medicine
Beyond the HIPAA privacy rule: enhancing privacy, improving health through research. Washington (DC): National Academies Press (US); 2009 [PubMed]

8. The HIPAA Privacy Rule: Lacks Patient Benefit, Impedes Research Growth: Association of Academic Health Centers (AAHC) brief, Washington, DC. http://www.aahcdc.org/Resources/ReportsAndPublications/IssueBriefs/View/ArticleId/65/The-HIPAA-Privacy-Rule-Lacks-Patient-Benefit-Impedes-Research-Growth.aspx (accessed 2 Oct 2011)

9. McGraw D.
Building public trust in uses of Health Insurance Portability and Accountability Act de-identified data. J Am Med Inform Assoc
2012.
In press. doi:10.1136/amiajnl-2012-000936 [PMC free article] [PubMed]

10. Fung BCM, Wang K, Chen R, et al.
Privacy-Preserving Data Publishing: A survey of recent developments. ACM Comput Surveys
2010;42:1–53

11. Loukides G, Denny J, Malin B.
The disclosure of diagnosis codes can breach research participants’ privacy. J Am Med Inform Assoc
2010;17:322–7 [PMC free article] [PubMed]

12. Sweeney L.
*k*-anonymity: A model for protecting privacy. Int J Uncertainty Fuzziness Knowledge Based Syst
2002;10:557–70

13. Samarati P, Sweeney L.
Generalizing data to provide anonymity when disclosing information. *Process of The Acm Sigact Sigmod Sigart Symposium on Principles of Database Systems (PODS)*; 1998:1–13

14. Machanavajjhala A, Kifer D, Gehrke J, et al.
l-diversity: privacy beyond k-anonymity. ACM Trans Knowledge Discov Data (TKDD)
2007;1:3-es

15. Wong RCW, Li J, Fu AWC, et al.
(a,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. *International Conference on Knowledge Discovery and Data Mining (SIGKDD)*; Philadelphia: ACM, 2006:754–9

16. Martin DJ, Kifer D, Machanavajjhala A, et al.
Worst-case background knowledge for privacy-preserving data publishing. *IEEE 23rd International Conference on Data Engineering*; IEEE, 2007:126–35

17. Wong RCW, Fu AWC, Wang K, et al.
Minimality attack in privacy preserving data publishing. Proceedings of the 33rd International Conference on Very Large Data Bases; Vienna, Austria, 2007:543–54

18. Ganta S, Kasiviswanathan S, Smith A.
Composition attacks and auxiliary information in data privacy. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM, 2008:265–73

19. Kifer D.
Attacks on privacy and Definetti's theorem. Proceedings of the 35th SIGMOD International Conference on Management of Data; ACM, 2009:127–38

20. Wong RCW, Fu AWC, Wang K, et al.
Can the utility of anonymized data be used for privacy breaches?
ACM Trans Knowledge Discov Data
2011;5:1–24

21. Dwork C.
Differential privacy. Automata, Languages Programming. 33rd International Colloquium on Automata, Languages and Programming (ICALP). Springer Verlag, 2006;4052:1–12

22. Dwork C, McSherry F, Nissim K, et al.
Calibrating noise to sensitivity in private data analysis. Theory of Cryptography
2006;3876:265–84

23. Dinur I, Nissim K.
Revealing information while preserving privacy. Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York, NY, USA: ACM, 2003:202–10

24. Roth A, Roughgarden T.
Interactive privacy via the median mechanism. Proceedings of the 42nd ACM symposium on Theory of computing. New York, NY, 2010:765–74

25. Bhaskar R, Laxman S, Smith A, et al.
Discovering frequent patterns in sensitive data. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, 2010:503–10

26. Friedman A, Schuster A.
Data mining with differential privacy. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘10); New York, NY, USA: ACM Press, 2010:493–502

27. Bhaskar B, Chaudhuri K, Dwork C, Kale S, McSherry F, Talwar K.
Privacy, accuracy, and consistency too: a holistic solution to contingency table release. Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York, NY, USA: ACM, 2007:273–82

28. Hay M, Rastogi V, Miklau G, Suciu D.
Boosting the accuracy of differentially private histograms through consistency. Very Large Database (VLDB) Endowment
2010;3(1-2):1021–32

29. Dwork C, McSherry F, Nissim K, Smith A.
Calibrating noise to sensitivity in private data analysis. Theory of Cryptography Conference (TCC), 2006:265–84

30. Xiao X, Wang G, Gehrke J.
Differential Privacy via Wavelet Transforms. Knowledge and Data Engineering. IEEE Transactions on; 2011;23:1200–14

31. Machanavajjhala A, Gehrke J, Gotz M.
Data publishing against realistic adversaries. Proceedings of the VLDB Endowment; 2009;2:790–801

32. Fung BCM, Wang K, Yu PS.
Anonymizing classification data for privacy preservation. IEEE Trans Knowledge Data Eng
2007;19:711–25

33. McSherry F, Talwar K.
Mechanism design via differential privacy. Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07); IEEE, 2007:94–103

34. Saeed M, Villarroel M, Reisner AT, et al.
Multiparameter intelligent monitoring in intensive care II: a public-access intensive care unit database. Crit Care Med
2011;39:952–60 [PMC free article] [PubMed]

35. Frank A, Asuncion A.
UCI Machine Learning Repository, 2010

36. Quinlan JR.
C4.5: programs for machine learning. San Francisco, CA, USA: Morgan Kaufmann, 1993

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of **American Medical Informatics Association**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |