The increasing adoption of electronic medical record (EMR) systems has facilitated a rapid escalation in the amount of patient-level data collected and used in primary care settings. Additionally, such systems have become an invaluable resource for a wide range of secondary applications, including comparative effectiveness studies and novel biomedical investigations.1
Notably, EMR-derived data have contributed to a number of genome-wide association studies (GWAS), the goal of which is to unearth biomarkers that correlate with various clinical phenotypes.3
These data reuse efforts live in a special sphere for privacy protection. Various regulations, such as those promulgated by the National Institutes of Health (NIH), require that data used in NIH-sponsored research, and GWAS in particular, be shared beyond the initial collecting institution in a de-identified format.4
Given that the originating EMR systems are managed by covered entities as defined by the Health Insurance Portability and Accountability Act (HIPAA),6
data sharing is subject specifically to the de-identification standard set forth in the HIPAA Privacy Rule.7
The HIPAA de-identification standard is general and states ‘Standard: de-identification of protected health information. Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information’.7
Technically, to achieve this goal, the Privacy Rule delineates several routes by which data can be rendered de-identified. One route specified in the Privacy Rule has been referred to as the ‘Statistical Standard’. This approach requires an expert to certify that ‘the risk is very small that the information could be used…to identify an individual who is the subject of the information [using] generally accepted…methods for rendering information not individually identifiable’.7
The regulation points to various statistical methods applied by the disclosure limitation community,8
and some candidate methods for supporting this approach were suggested in the comments to the legislation.
An alternative route specified by the Privacy Rule is ‘Safe Harbor’, which enumerates 18 attributes that must be suppressed. The architects of the Privacy Rule designed this route to be ‘an easily followed cookbook approach,’7
for cases where the certification required by the statistical standard was too onerous. The enumerated list includes explicit identifiers, such as personal names and social security numbers, as well as ‘quasi-identifiers’, which, in combination, may lead to the identification of an individual. The latter class of attributes includes (but is not limited to) dates, patient ages over 89, and zip codes that are more specific than the first three digits. Although easy to follow, the Safe Harbor approach has been criticized for being too stringent, partially because it can significantly limit epidemiological and population-based studies, which may depend on some of these suppressed attributes.9
Safe Harbor has also been criticized for being an ad hoc approach,11
but many commercial and open-source health information de-identification systems favor this model.12
There are a number of social and technical factors that may have contributed to this preference. Firstly, statistical expertise is not always available and, if it is, there is no clear rule on what constitutes an ‘expert’ under the Statistical Standard. Secondly, many traditional disclosure limitation methods alluded to in the regulation are designed to handle tabular data and are not oriented to handle patient-level records or the semantics of health information. Thirdly, the federal government has not addressed how to apply specific techniques to achieve the Statistical Standard since its passage. In contrast, Safe Harbor is readily interpretable and can be applied without statistical expertise or support.
The goal of this work is to demonstrate how simple computational methods, based in risk analysis, can be designed and applied to support the dissemination of health information in ways that Safe Harbor precludes, and thus to allow data administrators to take advantage of either of the two de-identification standards proposed. We present a de-identification framework that is easy to interpret, straightforward to replicate, and quick to reconfigure across datasets. Additionally, we set up our framework to directly utilize Safe Harbor to parameterize the methods, providing evidence for why these de-identification schemes should be considered. In summary, our approach consists of two core steps. Firstly, we quantify the ‘re-identification’ risk of demographics in patient records inherent in data shared according to the Safe Harbor policy, with an emphasis on patient-specific attributes that have been leveraged to compromise patient privacy in the past.13
Secondly, we alter the granularity of the patient's attributes to generate policy alternatives with risk that is no greater than what Safe Harbor deems acceptable.
To demonstrate its effectives, we assess our approach with real GWAS patient cohorts from five medical centers. While Safe Harbor prohibits the dissemination of the cohorts' patients' ages over 89 years old, we show that such information can be disclosed with minimal impact on the overall scientific utility of the records compared with the original records. Additionally, to assist other institutions in replicating our approach, we offer an open-source version of our software.14
The remainder of this paper is organized as follows. In the following section, we provide background on re-identification risk analysis and existing approaches to measure and mitigate this problem. Next, we provide the setup of our analysis, including the cohorts employed in this study. Afterward, we detail our framework, with a formalization of the risk metrics applied and how risk is estimated for patient-level data. Then, we describe the results of applying the framework to the cohorts, and finally we conclude by discussing the limitations of the methods described and suggestions for future work.