Clinical researchers need to share data to support scientific validation and information reuse, and to comply with a host of regulations and directives from funders. Various organizations are constructing informatics resources in the form of centralized databases to ensure widespread availability of data derived from sponsored research. The widespread use of such open databases is contingent on the protection of patient privacy.
In this paper, we review several aspects of the privacy-related problems associated with data sharing for clinical research from technical and policy perspectives. We begin with a review of existing policies for secondary data sharing and privacy requirements in the context of data derived from research and clinical settings. In particular, we focus on policies specified by the U.S. National Institutes of Health and the Health Insurance Portability and Accountability Act and touch upon how these policies are related to current, as well as future, use of data stored in public database archives.
Next, we address aspects of data privacy and “identifiability” from a more technical perspective, and review how biomedical databanks can be exploited and seemingly anonymous records can be “re-identified” using various resources without compromising or hacking into secure computer systems. We highlight which data features specified in clinical research data models are potentially vulnerable or exploitable. In the process, we recount a recent privacy-related concern associated with the publication of aggregate statistics from pooled genome-wide association studies that has had a significant impact on the data sharing policies of NIH-sponsored databanks.
Finally, we conclude with a list of recommendations that cover various technical, legal, and policy mechanisms that open clinical databases can adopt to strengthen data privacy protections as they move toward wider deployment and adoption.
A number of organizations, distributed around the globe, have invested considerable effort to construct information technology infrastructure to support the management and analysis of data on human participants enrolled in clinical and translational research studies.1 Organizations are now moving towards models of broader data sharing and accessibility through open-access translational research information systems (OTRIS). OTRIS are dynamic and evolving, in terms of technical implementation and oversight, but have a common goal of establishing data warehousing infrastructure to facilitate the rapid dissemination of research findings. They aim to integrate a variety of data types, such as experimental information derived from laboratory experimentation (e.g. genome sequence, gene expression and proteomics data) with rich clinical phenotypes. OTRIS further aim to integrate data from various laboratories and other resources, so that the research community has access to a broad range of datasets to validate and reanalyze published findings, as well as mine for novel clinically-relevant discoveries. Thus, it is the intention of OTRIS managers to make their systems, and, to the extent to which it is possible, the data within, freely accessible as a resource to the public.
OTRIS raise complex ethical, legal, and social issues that developers, managers, and scientists associated with these systems will need to consider as software engineering and scientific investigation moves forward.1,2,3 Recent meetings have solicited information from ethicists, informaticists, lawyers, and biomedical scientists to characterize various issues associated with the construction of database archives, ranging from informed consent to attribution of property to identifiability of human participants in supported research projects.4 In this paper, we elaborate on the data privacy issues in the context of OTRIS. We recognize that a complete solution will require further investigation on ethical, social, and legal components of the problem, but use this forum to illustrate how policy and technology can be combined to resolve data sharing and privacy goals.
It has been stressed that the availability of OTRIS for widespread use is contingent on the protection of patient anonymity.5 While biomedical privacy policies and technologies exist, various studies suggest they are ill-equipped for environments that centralize detailed patient-specific data.6 Moreover, recent forensic science research7,8 has prompted significant changes to data sharing policies for various OTRIS, most notably the database of Genotype and Phenotype (dbGaP)9 at the U.S. National Library of Medicine.10 In the face of such threats, one must ask whether other emerging resources have similar privacy vulnerabilities. Furthermore, if such threats do exist, what measures, from both a technical and a policy perspective, should be explored to mitigate them?
In this paper, we illustrate how OTRIS are vulnerable, but it is important to note that not all emerging OTRIS are susceptible to privacy violations in the same manner. In addition, the power that responsible policies and oversight can provide in mitigating threats that remain in de-identified research settings should not be neglected. The issues raised and potential solutions offered in this paper are applicable to many informatics resources intending to share clinical and biological data for translational research purposes and, where possible, we draw on examples from emerging OTRIS to demonstrate their potential application.
Before we address technical issues, it is important to note the regulatory landscape. Data collected, shared, and used within OTRIS will be subject to various regulatory controls. The appropriateness of such controls depends on where the data are derived from. In particular, there are several primary privacy and data sharing policies that OTRIS managers must be cognizant of as they move forward. The following is an introduction to some of the relevant regulatory issues at play and should not be considered a comprehensive list.
The NIH Data Sharing Policy was designed to increase access to data collected through, or studied with, federal funding.11 The policy applies to all projects that receive at least $500,000 in annual direct funding. According to the policy, data must be shared in a “de-identified” format in a manner similar to the Safe Harbor model as defined in the Privacy Rule of the Health Insurance Portability and Accountability Act (see below). The data sharer must also remove information for which there is prior knowledge that it could be used to determine the identity of the subjects. Some investigators have argued that the sensitivity of their datasets and the lack of ability to provide provable privacy guarantees are sufficient to opt out of data sharing.
Genome wide genetic scans of sequence variations have become important, but costly, research tools for the biomedical community. The NIH created a specific policy for the collection and sharing of data derived from, or studied in, genome wide association studies.12 Similar to the 2003 data sharing policy, the GWAS policy was defined such that it applies to any project regardless of funding level in which genome wide genetic scans are produced or studied. The NIH has since designated dbGaP as the repository to which NIH sponsored investigators should submit their GWAS records. As in the NIH Data Sharing Policy, GWAS data must be de-identified prior to dissemination.
The NIH has recognized that genomic data itself may lead to the re-identification of an individual. Thus, users of GWAS datasets in dbGaP must sign a contractual use agreement that explicitly prohibits non-sanctioned uses, as well as attempts to identify subjects (see below). Other NIH groups and repositories are applying similar use agreements to assign legal constraints to the use of information stored in their OTRIS.
In the United States, when a “covered entity”, as defined by HIPAA (e.g., healthcare providers, health data clearinghouses, etc.), wishes to share data collected in the context of clinical activities, it must adhere to the Privacy Rule.13 The regulation outlines several routes by which personal health information can be shared without patient consent for secondary research purposes: 1) Safe Harbor, 2) Limited Data Set, and 3) Statistical certification.
The Safe Harbor Standard allows covered entities to publicly share data once it is stripped of an enumerated list of eighteen types of personal identifiers. These include explicit identifiers (e.g., names), “quasi-identifiers” (e.g., dates, geocodes), and traceable elements (e.g., medical record numbers). Neither clinical nor genomic data is explicitly labeled as a personal identifier and it has been debated if such data can be released under this policy.14 For years, clinical data has been shared in public resources, such as hospital discharge databases.15,16 Similarly, person-specific DNA sequences have been disclosed to public repositories, such as those at the National Center for Biotechnology Information (NCBI)17.
Various groups argue against disclosing data via Safe Harbor based on the observations that the usefulness of such data for certain types of studies (e.g., epidemiology) is questionable, but also out of “re-identification” concerns.5,18 Rather, an alternative called the Limited Data Set Standard is advocated, which allows covered entities to share more detailed data, including dates and zip codes. The tradeoff is that data recipients must enter into an acceptable use contract that prohibits re-identification. While this policy is appropriate for trusted investigators, as the quantity of centralized data and number of investigators granted access increases, such an approach may become infeasible to manage. Moreover, this policy neither prevents an individual from attempting re-identification, nor assesses the risk of re-identification.
The Statistical Standard allows sharing data in any format, provided an expert certifies “the risk is very small that the information could be used by the recipient, alone or in combination with other reasonably available information, to identify an individual”.13 Methods to quantify risks have been researched7,18, but no standards have emerged. One disclosure control method that has been considered is to “perturb” DNA sequences (e.g., AACCTATA shared as AATCAATA).19 The intuition is that as the quantity of perturbation increases, the likelihood that an investigator can determine the original sequence decreases, implying greater privacy protection. The tradeoff, however, is that perturbation can potentially obscure, or worse, lead to false, associations. Thus, it could diminish the utility and scientific credibility of the resource. A second criticism of such a protection approach is that research has shown certain types of perturbation can be filtered to reliably infer the original data.20 Despite such problems, data protection based on scientific models can be achieved, but care must be taken to design them with formal principles.
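To make the perturbation idea concrete, the following is a minimal Python sketch. It is not the method of reference 19: it assumes sequences contain only the letters A, C, G, and T, and that each base is substituted independently with probability `rate`. The names `perturb_sequence` and `hamming_distance` are hypothetical.

```python
import random

BASES = "ACGT"


def perturb_sequence(seq, rate, rng):
    """Substitute each base with a different base with probability `rate`.

    Assumption: substitutions are independent and uniform over the other
    three bases. Published perturbation schemes may differ.
    """
    shared = []
    for base in seq:
        if rng.random() < rate:
            shared.append(rng.choice([b for b in BASES if b != base]))
        else:
            shared.append(base)
    return "".join(shared)


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
original = "AACCTATA"
shared = perturb_sequence(original, rate=0.25, rng=rng)
print(original, shared, hamming_distance(original, shared))
```

Raising `rate` moves the shared sequence further from the original, which illustrates both the privacy intuition and the loss of utility described above.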
As we alluded to, data that is de-identified according to the aforementioned policies can be “re-identified” to the individuals from which the data was derived via numerous routes. As we illustrate in Figure 1, re-identification is a process and requires the satisfaction of certain conditions. First, it requires that the de-identified data is unique or “distinguishing”. In other words, we must be able to pinpoint an individual in a group of n or fewer people. Genomic sequence data, for instance, and possibly other laboratory and molecular expression data, are often highly distinguishing. However, it needs to be recognized that the ability to distinguish data is, by itself, insufficient to claim that the corresponding individual’s privacy will actually be compromised. This is because of the second condition, which is that we need a “naming” resource. Without such a resource, there is no way to link the de-identified data to an identity.* Finally, for the third condition, we need a mechanism to relate the de-identified and identified resources. Without such a relational mechanism, an adversary can do no better than randomly assigning de-identified data to named individuals.
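The first condition can be checked mechanically. The sketch below is an illustration under stated assumptions, not a procedure from this paper: records are plain dictionaries, the field names are invented, and `distinguishing_records` is a hypothetical helper that returns the records whose combination of values is shared by n or fewer records.

```python
from collections import Counter


def distinguishing_records(records, fields, n=1):
    """Return records whose values on `fields` occur in n or fewer records."""
    keys = [tuple(record[f] for f in fields) for record in records]
    group_sizes = Counter(keys)
    return [r for r, key in zip(records, keys) if group_sizes[key] <= n]


# Toy data with invented values.
records = [
    {"id": "r1", "birth_year": 1950, "gender": "F", "state": "TN"},
    {"id": "r2", "birth_year": 1950, "gender": "F", "state": "TN"},
    {"id": "r3", "birth_year": 1962, "gender": "M", "state": "TN"},
]
unique = distinguishing_records(records, ["birth_year", "gender", "state"], n=1)
print([r["id"] for r in unique])  # prints ['r3']
```

A record flagged here satisfies only the first condition; a naming resource and a relational mechanism are still required before re-identification can occur.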
There are many situations in which de-identified biomedical information can be re-identified to the patient from whom it was derived without hacking or breaking into private health information systems. For instance, in the mid-1990s it was shown that de-identified hospital discharge records, which were publicly available at the state-level, could be linked to identified public records in the form of voter registration lists. The result received notoriety because it led to the re-identification of the medical status of the governor of the Commonwealth of Massachusetts.21 This attack was achieved by linking the resources on the seemingly innocuous, but common, fields of patient’s date of birth, gender, and zip code. Various estimates indicate that the uniqueness of this combination of attributes in the U.S. population is somewhere between 65% and 87%, and with certain subpopulations even more unique.22,23
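The linkage described above amounts to a join on shared fields. The following Python sketch shows the idea with invented toy data. The field names, the `link_records` helper, and the rule of accepting only one-to-one matches are assumptions made for illustration, not details of the original study.

```python
QUASI_IDENTIFIERS = ("date_of_birth", "gender", "zip_code")


def link_records(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Pair each de-identified record with a named person when exactly one
    person in the identified resource shares its quasi-identifier values."""
    index = {}
    for person in identified:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)
    links = []
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous matches are left unlinked
            links.append((candidates[0]["name"], record["diagnosis"]))
    return links


# Invented discharge records (no names) and voter list entries (with names).
discharge = [
    {"date_of_birth": "1950-01-01", "gender": "M", "zip_code": "00001", "diagnosis": "D1"},
    {"date_of_birth": "1980-02-02", "gender": "F", "zip_code": "00002", "diagnosis": "D2"},
]
voters = [
    {"name": "Person A", "date_of_birth": "1950-01-01", "gender": "M", "zip_code": "00001"},
    {"name": "Person B", "date_of_birth": "1980-02-02", "gender": "F", "zip_code": "00002"},
    {"name": "Person C", "date_of_birth": "1980-02-02", "gender": "F", "zip_code": "00002"},
]
print(link_records(discharge, voters))  # prints [('Person A', 'D1')]
```

The second discharge record stays unlinked because two voters share its values, which is why the uniqueness of the field combination in the population drives the success of this kind of attack.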
One of the responses to the aforementioned attack was the HIPAA Safe Harbor policy. It should be recognized, however, that even the suppression of all enumerated features fails to prevent all re-identifications. In many instances, there are residual features, including the remaining demographics (e.g., race, year of birth, state of residence, and gender) that can lead to identification. However, the extent to which residual features can be applied to re-identification is context dependent and relies on the availability of the fields that can be leveraged in the attack. In Table 2, we provide some general guidelines to consider when assessing the re-identification risk of data in OTRIS. In general, it helps to partition the person-specific features into classes of relatively “high” and relatively “low” risk. We recognize that risk is more of a continuous variable, but this type of dichotomization helps illustrate how context impacts risk. Beyond riskiness of attributes, it is important to understand the routes by which data can be linked to naming sources or sensitive knowledge can be inferred, as we review below.
Higher risk features are those that are documented in multiple environments and are publicly available. These are features that can be exploited by any recipient of such records. For instance, patient or research subject demographics are high-risk identifiers. Even the demographics that are permitted under the Safe Harbor policy leave certain individuals in a unique status and thus at non-trivial risk for identification through public resources that contain similar features, such as birth, death, marriage, voter, and property assessor records, and more.
Lower risk features are those that do not appear in public records and are less readily available. For instance, clinical features, such as an individual’s diagnoses and treatments, are relatively static (because they are often mapped to standard coding terminologies for billing purposes), and can manifest in de-identified resources, such as the aforementioned hospital discharge databases, as well as identified resources, such as electronic medical records. While combinations of diagnosis and treatment codes, or temporal dependencies, can uniquely characterize a patient in a population24, the identified records are available to a much smaller group of individuals than the general public. Moreover, this select group of individuals may be relatively more trustworthy, such as care providers and business associates of the organization that generated the documented features. Additional disincentives may exist as well, such as HIPAA-related penalties that are applied in the event an individual willingly violates the terms of employment to commit a breach of privacy.
When OTRIS include data derived from biological samples, the situation becomes a bit more complex. In certain instances the information that is associated with genomic and expression data, particularly genomic data derived from a clinical setting, permits relationships to be established between de-identified and identifiable resources. The following is a summary of several attacks, with further details available elsewhere6.
There exists an inherent relationship between certain genomic data sequences and physical phenotypic manifestations. A clinical phenotype may be described in biomedical coding standards, such as the International Classification of Diseases (ICD), and may be disclosed in various settings, including “semi-private” data, such as administrative or insurance records, but also more public records, such as hospital discharge databases.
A second type of attack is made possible because genomic data is increasingly disseminated in the context of familial information. This practice is common in gene hunting expeditions. Familial information could be represented in the format of a de-identified pedigree, which reports gender, disease status, and the death status of the family members. At the same time, a variety of identified information is publicly available. One particular resource that has been exploited for identifiers is obituaries, which have wide coverage on a population and are often free to post in newspapers. Such resources tend to include information on the recently deceased individual as well as the family relations.25
Many patients (and research participants) are transient and visit multiple institutions providing care. As such, a patient’s location-visit pattern is often distinguishing and facilitates what has been termed a “trails” attack.26 In this scenario, a patient visits multiple hospitals, where his clinical and DNA-related data is collected. The facilities forward de-identified DNA records, tagged with the submitting institution, to a public centralized databank.27,28,29 Additionally, the hospitals send identifiable discharge records, including patient demographics and diagnoses to a discharge database.30 Even if there is no clear biomedical relationship between the diagnosis codes and sequence markers in the DNA, we can track the hospitals a patient has visited (i.e., the “trail”) in the discharge data and the DNA records in the repository.26 Notably, this attack is generalizable in that trails can manifest in a number of environments.31
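A simplified version of trail matching can be sketched in Python. This is an exact-match illustration only, not the algorithm of reference 26: it assumes every visit appears in both resources, and it links a DNA record to a name only when exactly one of each shares the same set of hospitals. All identifiers and function names are hypothetical.

```python
def build_trails(observations):
    """Map each entity to the set of hospitals at which it was observed."""
    trails = {}
    for entity, hospital in observations:
        trails.setdefault(entity, set()).add(hospital)
    return trails


def group_by_trail(trails):
    """Group entities that share exactly the same set of hospitals."""
    groups = {}
    for entity, hospitals in trails.items():
        groups.setdefault(frozenset(hospitals), []).append(entity)
    return groups


def link_by_trail(dna_observations, discharge_observations):
    """Link a DNA record to a name when their shared trail is unique on both sides."""
    dna_groups = group_by_trail(build_trails(dna_observations))
    name_groups = group_by_trail(build_trails(discharge_observations))
    links = {}
    for trail, dna_ids in dna_groups.items():
        names = name_groups.get(trail, [])
        if len(dna_ids) == 1 and len(names) == 1:
            links[dna_ids[0]] = names[0]
    return links


# Invented (record, hospital) pairs.
dna = [("dna1", "H1"), ("dna1", "H2"), ("dna2", "H1"), ("dna3", "H3"), ("dna4", "H3")]
discharge = [
    ("Patient A", "H1"), ("Patient A", "H2"), ("Patient B", "H1"),
    ("Patient C", "H3"), ("Patient D", "H3"),
]
print(link_by_trail(dna, discharge))  # prints {'dna1': 'Patient A', 'dna2': 'Patient B'}
```

Records `dna3` and `dna4` remain unlinked because two patients share the same trail, which shows that the attack depends on how distinguishing the location-visit patterns are.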
Genome sequence data is increasingly applied in clinical research. However, it is also a well-known distinguishing feature unto itself. Lin and colleagues19 demonstrated that only a small number (less than 100) of single nucleotide polymorphisms, or SNPs, is required to uniquely characterize an individual in the entire world’s population. SNP data is increasingly found in ancestry, clinical, molecular phenotype, and pharmaceutical efficacy association studies. Thus, if an adversary has access to an identified DNA sequence, it may be possible to learn additional information about the individual from the de-identified data in the association studies.
In recognition of this fact, dbGaP decided to publicly disseminate SNP – clinical status correlations for various datasets only as aggregated results. Specifically, for each dataset, and for each individual SNP, they publicly posted the proportion of the population that was diagnosed with (or without) a clinical feature (e.g. immunodeficiency disorder) and the corresponding SNP’s value.
However, as was recently demonstrated by Homer and colleagues7,8, such an approach does not prevent privacy threats. They demonstrated what we call a “pool attack”, where the information on several thousand SNPs could be used to determine if an identified individual’s DNA was in the set of clinically positive cases, the set of clinically negative cases, or neither of the above. Moreover, the approach involved in the attack is applicable to any environment in which aggregate statistics on biomedical datasets are available. Details of their attack are beyond the scope of this paper, but can be found elsewhere. This attack is important to note because it had significant impact on dbGaP’s public data access policy (as described in the following section).
The previous types of data and attacks may compromise privacy because they are sufficiently replicable and available in multiple datasets. However, data stored in many OTRIS is also expected to consist of functional genomics data (e.g. gene expression microarray data) derived from laboratory testing. While such data may be unique and located in multiple datasets, the extent to which this data is replicable is questionable. To the best of our knowledge there is limited research that addresses the precision of repeated functional genomics tests. However, if such information is not adequately replicable, these data may be considered less risky to share than sequence data.
Before proposing specific recommendations regarding technologies and policies to improve data privacy protections, we wish to return to the pool attack and highlight several results and policy decisions. The pool attack did not involve a compromise of identity because the adversary was already in possession of the subject’s identity and genomic data. However, the attack resulted in a breach of confidentiality because the subject did not inform the adversary of their clinical status. Given the accuracy of the attack, the NIH felt they could not publish statistical summaries of SNP-clinical class correlations without violating the privacy principles stated in their data sharing policies. As a consequence, the NIH removed all summary statistics from the public version of dbGaP.10 Following the lead of the NIH, the Wellcome Trust, the main biobanking and human genomic data dissemination agency in the UK, followed suit. The policy changes received significant attention from the popular media.33,34,35,36 While privacy advocates have lauded these actions, there are several reasons why this response is not necessarily appropriate for every OTRIS.
The attack is feasible in that it can be achieved given relatively open data sharing strategies. In fact, the approach is generalizable to other types of information derived from biological samples. However, the question remains as to how likely such an attack is in today’s climate. To carry out the attack, the adversary needs access to an identified DNA sequence, which raises the question of who would be in possession of such information. First, it has been suggested that such information could be available through forensic investigations, but it is unclear if forensic specialists would be sufficiently motivated to learn clinical information about the subject in question. Second, it has been noted that individuals beyond the forensic realm could collect biological samples, sequence them, and use the resulting information, but the economic and computational barriers are nontrivial and it is not clear that anyone would attempt to mount such an attack. This is not to say that such an attack could not be committed by motivated individuals, but that the context for executing such an attack has yet to be clearly voiced. We recognize that biological data will be increasingly available as high-throughput genomic sequencing technology becomes cheaper and more mainstream, but at the present time, the threat is believed to be more theoretical than practical.
Investigators conducting NIH-sponsored GWAS are encouraged to submit their de-identified records to dbGaP.12 The result is that dbGaP stores datasets for a number of NIH institutes. Initially, dbGaP defined its access policy according to a two-tier model. The first tier consisted of “public” information, which included summary information for each dataset, including data collection mechanisms, the types of demographic, clinical, and biological information collected, and summary statistics for the various “classes” of individuals. This was designated as public information that was readily available on the dbGaP website. The second tier of access was for person-level records, or “microdata”. To access this information, investigators must proceed through a formal evaluation process. The process begins when a new investigator submits a request to access the records in a dataset. The application is sent to an NIH data access committee (DAC). Since each dataset may have unique use limitations, and may have been sponsored by a different NIH institute, the investigator may need to make multiple requests for multiple datasets.
When the NIH decided that summary statistics for data deposited in dbGaP would no longer be accessible through the first, or public, access level, such information was moved to the second tier. However, the DAC model was designed to handle requests for individual-level datasets to validate or explore specific hypotheses, not requests to mine for new knowledge across datasets that are unrelated in the initial reasons for their collection. Thus, this approach to managing summary information could limit large-scale data mining and hypothesis-generation driven research methodologies that are gaining popularity in the biomedical domain.
From a technical perspective, the removal of summary statistics from the public realm created an “all or nothing” data access setting. Initially, researchers were permitted access to “all” of the SNPs and the relative occurrence of variant statistics, but after the policy change, researchers were shuttled into a “nothing” model, in which no statistical information on any SNP could be reported. Given the current manner in which data protection is achieved and the existing protection technologies on the market, this is a logical situation. However, as recent research suggests, there is room to create a gray solution that resides within this policy space through the use of risk analysis strategies. Consider that, as we mentioned above, if an adversary has access to summary information about only a single SNP, then the likelihood that the adversary can map an identified DNA sequence to the affected, non-affected, or none-of-the-above classes is very small. If provided with summary information about two SNPs, the probability that the adversary could link the identified record to one of the classes would be greater, but still extremely small. As we increase the number of SNPs that an investigator is permitted to have access to, the probability of linkage will increase. If data managers could determine the level of risk they are willing to tolerate, then they could disseminate information on a subset of SNPs consistent with their level of risk tolerance. This is precisely the basis for a formal and provable data protection strategy in accordance with the statistical data protection standard mentioned in the previous section.
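The risk-tolerance idea can be expressed as a small search problem. The Python sketch below assumes a hypothetical risk estimator, `estimate_risk(k)`, that returns the estimated linkage risk when summary statistics for k SNPs are released and that never decreases as k grows. No such estimator is specified in this paper, and the toy function used here is purely illustrative.

```python
def max_snps_within_tolerance(estimate_risk, tolerance, total_snps):
    """Return the largest k in [0, total_snps] with estimate_risk(k) <= tolerance.

    Assumptions: estimate_risk is non-decreasing in k, and releasing zero
    SNPs is always acceptable (0 is returned if no larger k qualifies).
    """
    low, high = 0, total_snps
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if estimate_risk(mid) <= tolerance:
            low = mid
        else:
            high = mid - 1
    return low


def toy_risk(k):
    """Illustrative placeholder only; not derived from any cited study."""
    return 1 - (1 - 0.001) ** k


print(max_snps_within_tolerance(toy_risk, tolerance=0.05, total_snps=10000))
```

In practice the hard part is the estimator itself, which would need to reflect the dataset, the pool size, and the adversary's assumed knowledge. The search only shows how a stated tolerance translates into a release decision.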
The above sections provided a high-level analysis of the existing threats and potential opportunities for OTRIS. This section formalizes specific recommendations regarding technologies and policies to improve data privacy protections. There is no single solution that will address all privacy and identifiability issues, but a combination of technical, policy, and legal mechanisms will help ameliorate potential problems.
As biomedical data sharing increases and systems move toward open access, there are certain guidelines and recommendations we believe OTRIS should consider. The following recommendations are briefly summarized in Table 3.
Given current NIH policies, it is recommended that OTRIS not post pooled statistical information on publicly accessible web servers regarding static, replicable features that are easy to derive from biological information, such as genome-wide SNP scans. Though the risk of an individual actually applying such information in a linkage attack is unknown, the posting of such information will be in direct contradiction of policies adopted by similar NIH repositories and recent statements of the NIH director.†
As noted earlier, functional genomics data may be the focus of a given database repository. It is anticipated that the reliability of data replication will be data type specific and it is thus recommended that OTRIS management discuss this issue with the scientists submitting data. If data is unreliably replicable, then the risk of publishing such information is less of a concern and the OTRIS may justify less strict oversight to access the data. If, on the other hand, such data is reliably replicable, then data in the OTRIS could be subject to a pool attack. In this case, it is recommended that such data should not be shared publicly.
It is recommended that formal data access policies be established and published on the appropriate OTRIS management’s website, or made available through the appropriate regulatory bodies. In association with a formalized policy, it is further recommended that OTRIS establish a data access committee that reviews applications for access to data. This committee may be designed in a similar manner to the dbGaP DAC, but should be tailored to the needs of the repository. Individuals that serve on this committee could be drawn, to the extent it is possible, from the following classes:
Additional groups that may be represented on such a committee could consist of
If the resource determines that data should be made available to anyone with a legitimate request, the access committee’s role may only need to define what such requests correspond to and perform expedited reviews of requests for data access.
It is recommended that the OTRIS determine what is considered acceptable use with respect to accessed data. Such information should be codified and explicitly defined in a data use contract that is agreed upon by the data recipient. It is recommended that the OTRIS work with legal experts with experience in this area to establish appropriate terms.
De-identification and controlled access are essential aspects of legal and ethical data reuse from existing research databases and electronic health records. At the same time, however, much information will come from prospective mechanistic and translational studies, including GWAS. In these cases, the informed consent process must disclose the potential data sharing mechanisms described in this paper. It should be recognized that it may not be possible to describe all of the future users of a subject’s de-identified data, and it may not be legally possible for subjects to consent to unspecified future uses.40 However, subjects should be given the opportunity to authorize future uses of their data for particular types of studies and withhold permission for others.41 Documentation of such understanding by subjects when they enter into research studies will assist Institutional Review Boards and Ethics Committees to provide the necessary certification when data is shared according to NIH policies. Moreover, clear demonstration that subjects in genetic research know and understand the potential of re-identification may lessen the regulation imposed in response to various privacy invading mechanisms, such as the aforementioned pool attack.
Though data shared through the OTRIS may be de-identified, it may be potentially re-identifiable. As such, the resource needs to determine the extent to which it is willing to assume liability for misuse of data. For instance, if a data user actually performs re-identification of a record, and such a re-identification becomes known, there should be a standing policy for how best to address and/or reprimand the user. Responding to the situation may be handled by the OTRIS itself, the user’s home institution (if one exists), the originating institution of the data, or by any combination of the parties. Regardless, policies and procedures need to be established and agreed upon by all parties involved. Again, the resource should work with appropriate legal specialists and stakeholders in this activity.
Even if the OTRIS chooses to make data “public” (or “semi-public”), it should enable auditing capability. In doing so, the OTRIS should assign a unique login and password to each data user and record each user’s activities in immutable audit logs. The resource should also determine when and how to audit users of the OTRIS. In most cases, data users will not act maliciously, but they may violate terms of service or best practices of use without realizing it.
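One way to approximate an immutable audit log in software is to chain entries by hash, so that any later alteration of a stored entry can be detected. The following is a minimal sketch under stated assumptions, not a description of any existing OTRIS: the class name `AuditLog`, the field names, and the example user and action strings are all hypothetical. Hash chaining makes tampering detectable rather than impossible, so a production system would also need protected (for example, write-once) storage.

```python
import hashlib
import json


class AuditLog:
    """Append-only log in which each entry stores the hash of the previous entry."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(record):
        # Hash only the content fields, in a stable key order.
        payload = {key: record[key] for key in ("user", "action", "time", "prev")}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, user_id, action, timestamp):
        """Record one user activity and link it to the preceding entry."""
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        record = {"user": user_id, "action": action, "time": timestamp, "prev": prev_hash}
        record["hash"] = self._digest(record)
        self._entries.append(record)
        return record["hash"]

    def verify(self):
        """Return True only if no stored entry has been altered or reordered."""
        prev_hash = self.GENESIS
        for record in self._entries:
            if record["prev"] != prev_hash or record["hash"] != self._digest(record):
                return False
            prev_hash = record["hash"]
        return True


log = AuditLog()
log.append("user-001", "download:dataset-A", "2024-01-01T00:00:00Z")
log.append("user-002", "query:aggregate", "2024-01-01T00:05:00Z")
chain_intact = log.verify()
```

Because each login is unique, every entry is attributable to a single data user, and an auditor can run `verify()` before relying on the log in a review.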
It is recommended that the managers of the OTRIS determine what they consider to be acceptable levels of risk and realistic vulnerabilities to the data in the system (the examples of high and low risk identifiers discussed earlier can help guide this discussion). If possible, the OTRIS may wish to provide access to different levels of data detail. For instance, it could provide access to aggregate statistics at one level and detailed microdata at another. At all levels, the aforementioned access committee should be involved. Moreover, it should be noted that managing aggregate statistical features of biological or molecular data in the same manner as the actual microdata is an overprotective and potentially research-limiting step. For resources of lower risk (such as aggregates), the committee may choose to apply expedited reviews to ensure that the requests are in line with acceptable use policies, similar to those applied by IRBs. The goal is to minimize the amount of time that the committee needs to spend reviewing an individual’s request to access data. In contrast, for access to more detailed information, the committee may use a more stringent review process and require additional restrictions on data access and transferability.
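The tiered arrangement described above can be sketched as a simple routing rule from the requested level of data detail to a review track. This is only an illustration: the names `DataTier`, `ReviewTrack`, `AccessRequest`, and `assign_review_track` are hypothetical, and the two-tier split is an assumption taken from the aggregate versus microdata example in the text.

```python
from dataclasses import dataclass
from enum import Enum


class DataTier(Enum):
    AGGREGATE = "aggregate statistics"
    MICRODATA = "record-level microdata"


class ReviewTrack(Enum):
    EXPEDITED = "expedited review"
    FULL = "full committee review"


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    tier: DataTier


def assign_review_track(request):
    """Route lower-risk aggregate requests to expedited review, all others to full review."""
    if request.tier is DataTier.AGGREGATE:
        return ReviewTrack.EXPEDITED
    return ReviewTrack.FULL


track = assign_review_track(AccessRequest("investigator-17", DataTier.MICRODATA))
```

Defaulting every tier other than aggregates to full review keeps the stricter process as the fallback if further tiers are added later.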
If aggregate statistics are to be made available, it should be recognized that there is no universal solution to mitigate identifiability. There is no definite set of data attributes that, if suppressed, will guarantee protection from the data being re-identified. Rather, it is recommended that risk estimates be performed to determine the level of risk involved with sharing the data (note that this risk depends on the data itself, not on the attributes alone). These risks should be deemed acceptable by the data managers, whether they are the investigators sharing data with the OTRIS, the sponsoring agency’s program managers, or OTRIS administrators.
As mentioned earlier, different types of data lead to different linkage concerns. Some data can be linked to publicly available data – especially demographics. It should be recognized that even if data shared via a database resource adheres to HIPAA Safe Harbor levels of protection, there is no guarantee the data is impregnable to re-identification. Thus, if there is concern that someone would attempt to discredit the OTRIS by identifying a single record in the database, then managers should consider disclosing data according to a more formal data protection model. One manner by which the OTRIS can formally mitigate risk is to generalize and/or suppress data to ensure that each record corresponds to a certain number of people (i.e., a minimum bin size).
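Generalization and suppression to a minimum bin size can be sketched as follows. The field names (`age`, `zip`, `sex`, `diagnosis`), the choice of ten-year age bands and three-digit ZIP prefixes, and the function names are all assumptions made for illustration; an actual resource would choose generalizations suited to its own data.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: age to a 10-year band, ZIP code to its first 3 digits."""
    low = (record["age"] // 10) * 10
    return (f"{low}-{low + 9}", record["zip"][:3], record["sex"])


def enforce_min_bin_size(records, min_bin_size):
    """Release generalized records, suppressing any bin smaller than min_bin_size."""
    keys = [generalize(record) for record in records]
    bin_counts = Counter(keys)
    released = []
    for record, key in zip(records, keys):
        if bin_counts[key] >= min_bin_size:
            age_band, zip3, sex = key
            released.append(
                {"age_band": age_band, "zip3": zip3, "sex": sex,
                 "diagnosis": record["diagnosis"]}
            )
    return released


# Hypothetical records, invented purely for this example.
records = [
    {"age": 31, "zip": "37203", "sex": "F", "diagnosis": "A"},
    {"age": 35, "zip": "37212", "sex": "F", "diagnosis": "B"},
    {"age": 38, "zip": "37240", "sex": "F", "diagnosis": "C"},
    {"age": 62, "zip": "90210", "sex": "M", "diagnosis": "D"},
]
released = enforce_min_bin_size(records, min_bin_size=3)
```

In this example the three records that share a generalized bin are released in coarsened form, while the fourth record, which sits alone in its bin, is suppressed.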
Technically, the resource may consider a formal protection model, such as k-anonymity.32 In this model, the data protector chooses a “k” that specifies the risk deemed to be acceptable; specifically, k corresponds to how many people data managers want each specific record to link to. The question remains, however, of what value of k is appropriate. For some guidance, various statistical agencies have suggested that a value of around five appears to be acceptable. Whether or not this is directly applicable to life sciences data remains an open question. If deemed acceptable, this is a solution that can be tailored to any dataset in an OTRIS. In other words, k could be made dependent on the “sensitivity” of the data in question or the amount of harm that could be committed through the data.
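As a rough illustration of how a data manager might check a table against a chosen k, the sketch below counts how many records share each combination of quasi-identifier values and compares the smallest such group with k. The function names, column names, and sample rows are hypothetical, and which columns count as quasi-identifiers is itself an assumption the data manager must make.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[name] for name in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every row is indistinguishable from at least k - 1 other rows."""
    return smallest_group_size(rows, quasi_identifiers) >= k


# Hypothetical generalized rows, invented purely for this example.
rows = [
    {"age_band": "30-39", "zip3": "372", "variant": "present"},
    {"age_band": "30-39", "zip3": "372", "variant": "absent"},
    {"age_band": "40-49", "zip3": "372", "variant": "absent"},
    {"age_band": "40-49", "zip3": "372", "variant": "present"},
]
achieved_k = smallest_group_size(rows, ["age_band", "zip3"])
meets_k_of_five = is_k_anonymous(rows, ["age_band", "zip3"], k=5)
```

A sensitivity-dependent policy could call `is_k_anonymous` with a larger k for datasets judged to carry greater potential for harm.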
The benefits of a privacy model, such as k-anonymity, are that it (i) appears to satisfy the HIPAA protection policy of the statistical or scientific standard and (ii) requires the data holder to be actively involved in the protection of data. The drawbacks are that (a) it is not clear how k-anonymization affects the utility of the data for translational hypothesis generation and data mining in particular and (b) it is not clear how k-anonymized data can be analyzed with typical statistical packages, let alone specialized software for complex data types. Nonetheless, many biomedical research studies are concerned with the discovery, or application, of common genetic and clinical variants, such that a formal data protection model may preserve enough biomedical information for future investigators. Additional research is necessary to determine when this technical method of data protection is applicable.
The policy and technology recommendations outlined above can be combined for flexible control. The recommendations should be used as the OTRIS deems necessary. The main goal is to strike an appropriate balance, where the technical aspects of data protection are complemented with acceptable use and oversight policies. If data users are more trusted, then data may be disseminated in a more specific form with stringent use contracts; if users are deemed to be less trusted, then data may be disclosed in a more aggregated form with weaker use contracts.
In conclusion, there is increasing awareness by all of the various stakeholders involved in human studies research – research sponsors, investigators and subject participants – that in order to maximize the return on the investment that all parties make in clinical research it is advantageous to make the results widely available to the research community. A variety of OTRIS resources have been established to facilitate the sharing and re-use of these valuable data. However, while the benefits of data sharing are recognized, the requirements for maintaining the autonomy, privacy and confidentiality of the research participants must also be addressed. We have presented a series of recommendations regarding both technical and policy approaches designed to minimize the risk of participant re-identification from clinical research data. By adopting these recommendations, OTRIS can balance the benefits gained by data sharing while minimizing the risk to the research participants.
The authors wish to thank Dr. Ellen Wright Clayton of Vanderbilt University, Jeff Wiser of Northrop Grumman, and the anonymous referees for helpful discussions and recommendations regarding the writing of this manuscript.
Disclosure: This work was supported in part by the following grants from the U.S. National Institutes of Health: N01AI40076, R01LM009989, U01HG004603, UL1RR023468, and UL1RR024982.
*We recognize that the lack of a readily available naming resource does not imply that data is sufficiently protected from re-identification. Nonetheless, it does indicate that it is much harder to identify an individual, or group of individuals, given the resources at hand.
†There are emerging algorithms, implemented in working software, that can help data managers determine which features can be disclosed while ensuring that the probability of classifying an individual is below a predefined threshold.39 Yet, until such approaches are tied to appropriate policy models, their implementation will be limited.