The panel developed seven broad recommendations, which are summarized in
Box 1.
Box 1. Recommendations- Determine whether initial consent and ethical approval will allow secondary research.
- Ensure that there are appropriate data security mechanisms and review bodies to protect privacy interests in aggregated databases.
- Informed consent should take into account the potential incorporation of data into aggregated databases.
- Address special challenges of using data obtained from existing databases.
- Pursue efforts directed at standardization of data.
- Establish data sharing rules, including attribution of contributions.
- Adopt “best practices” to avoid identifiability of the data.
1. Determine whether initial consent and ethical approval will allow secondary research. In general, this necessitates ensuring that data were collected with prospective informed consent where it was practicable to do so. If the investigator who submits data to a database did not collect the original biological specimen or clinical phenotype, then its history should be detailed, tracing the custody of the specimen and/or information to the original investigator and consented participant. The initial IRB or similar oversight body should typically be able to attest that there was approval of the protocol used to collect the information and confirm whether submission to the aggregated database is within the scope of the original consent. Did the consent address secondary uses of the data or the possibility that the participant could be re-identified? Where the submitting investigator has received “de-identified” specimens and data from another source, the responsible IRB should have assessed the propriety of acquisition and study of the samples at the time of proposed transfer to the submitting investigator. IRBs and those creating aggregated databases must recognize that some research protocols and informed consent processes specify limited use of participants' data. For example, such limitations may specify that data may be shared with other investigators doing research only on the same or related conditions.
Finally, as technology makes whole-genome analyses faster and cheaper, the need for sharing samples will be replaced by the sharing of data. One investigator's “affected” could be another investigator's “control.” However, IRBs historically have been concerned with informed consent for the collection and sharing of specimens (blood, DNA, etc.). Both IRBs and investigators should anticipate sharing of data derived from specimens and be able to document participants' consent for this purpose.
2. Ensure that there are appropriate data security mechanisms and review bodies to protect privacy interests in aggregated databases. By design, the research facilitated by combination and re-analysis of datasets within de-identified databases should be exempt from complete prospective IRB review. By only accepting data that cannot be directly linked to living persons, studies in these databases do not constitute human participants research under US regulations (
http://www.hhs.gov/ohrp/humansubjects/guidance/cdebiol.pdf). This may lead to a false sense of security, given that complex phenotypes and extensive genotypes may be used to re-identify individuals. Re-identification should be prevented using technology that does not significantly degrade the utility of the data (see Recommendation 7) and policies that address the appropriate uses of those data. Only data in de-identified, aggregated form should be accessible to all users. Access to record-level data must be limited, either by keeping all data within the confines of the secure database and providing users with sophisticated analytical capacity, or by requiring legal agreements to prevent attempted re-identification of participants. Downloading and redistribution of data should be carefully controlled. If allowed at all, extraction of record-level data should only be permitted for specified, pre-approved purposes. Database stewards should play an active role in the periodic review of data use to confirm that data integrity is not compromised and that any proposed data use is consistent with the initial approval by the IRB and consent of the participants. Ideally, this oversight should come from the organization (e.g., NIH Institute) holding the aggregated data, and not from the multiple IRBs that approved the original data collection.
3. Informed consent should take into account the potential incorporation of data into aggregated databases. Research into the attitudes of participants suggests that the majority of them are willing to allow their de-identified data to be used for future research by either the investigator who recruited them or by other investigators in the future [
5]. However, it seems likely that very few consent processes have fully anticipated the possibility that data would be included in databases controlled by private corporations, the federal government, or international consortia. Surveys have found that a substantial fraction of participants may not want their data used in this manner. In one report, the overwhelming majority (more than 80%) of survey respondents felt that new consent was required for each use of DNA samples for research [
6]. Some may wish to limit the use of their data (e.g., to studies of their own disease or condition). Some may limit the use of their data to exclude the study of mental illness or studies that have the potential to stigmatize their families or ethnic groups. Others may wish their data to be used for nonprofit research only, or have concerns about the custody of their data and their ability to withdraw from research. Legal action brought by the Havasupai tribal members and families with Canavan disease, along with the recent
Washington University v. Catalona decision, all point to the importance of these considerations [
7,
8].
Ethically and logically, one cannot give informed consent to unspecified actions that may or may not occur at some time in the future. As pointed out by Arnason in a discussion of Icelanders' participation in the deCODE Genetics databases, participants were given a choice of consent forms to sign—one that allowed only the current research, and one that allowed future research on stored samples [
9]. Nonetheless, it is not practicable to incorporate all future research possibilities into a research protocol or consent process. Arnason suggests that informed consent may not be possible in this context. At best, participants can grant permission to unspecified research by indicating their understanding of the scope of future uses, the degree of de-identification/anonymity entailed, whether they will be re-contacted or informed of future findings, the ability (or not) and procedure required to withdraw from the database, the potential users of their data, and measures taken to regulate the database and ensure their privacy is protected. As Caulfield points out [
10], this variation on the traditional consent model actually gives participants more autonomy over the use of specimens or data derived there from.
4. Address special challenges of using data obtained from existing databases. Many current studies are being conducted with information and specimens collected for another purpose at substantial expense and labor [
11]. Therefore, it seems inefficient and contrary to the public good to let these resources remain unused and attempt to re-establish similar registries or biobanks at public expense. However, it may never be possible to determine the specific wishes of the participants with regard to the re-use of their personal information and specimens. Given such situations, stewards of aggregated databases may consider different options. One involves oversight that includes assurance by contributing investigators that consent was obtained and that, beyond the verbatim license granted for the primary research, broader future uses by different investigators are permitted. Given the size and scope of the individual database, this oversight may come from the IRB or an institutional committee within a single university, or may involve governmental representatives, and an outside group. Second, it may be possible to address the problem of a lack of anticipated secondary use of data or specimens by consulting with legitimate representatives of the donor group(s) and obtaining approval for the new studies [
12]. Typically, this would not be a single IRB, but could include patient support groups, veteran's associations, leaders of indigenous people, etc. A third, though costly [
13], option would be to re-contact individual participants and ask permission for the new study. There is no clear consensus on which of these (or other) methods to invoke in a particular instance. Nevertheless, prudence suggests establishing a data-use committee to make such decisions. However, if new research is contemplated that carries a risk of individual or group stigmatization, then re-consent may be the appropriate choice regardless of who is responsible for making such decisions.
5. Pursue efforts directed at standardization of data. Realizing the goals of database aggregation requires interoperability among the primary databases. Accordingly, standard data protocols should be developed and used. For example, participant demographic characteristics and medical conditions should be described using controlled vocabularies allowing careful matching of participants across studies. For example, the PhenX project (
http://www.phenx.org/) supported by the National Human Genome Research Institute is developing sets of standardized measures and surveys to be used with genome-wide association studies. The Public Population Project in Genomics (
http://www.p3gconsortium.org/) is a multinational collaboration to catalog and share methods, surveys, and other tools needed to do large-scale genetic research [
14]. Currently, over 100 studies with planned enrollment of over 11 million participants are part of the effort. The Organisation for Economic Co-operation and Development has recently posted draft guidelines for the operation of genetic databanks that include best practices for data interchange (
http://www.oecd.org/document/50/0,3343,en_2649_34537_37646258_1_1_1_1,00.html). The Clinical Data Standards Interchange Consortium has designed standards for the interoperability of clinical trial data submitted to the US Food and Drug Administration as well as globally (
http://www.cdisc.org/standards/index.html). Finally, the Open Biomedical Ontologies Foundry project (
http://www.obofoundry.org/) is a multinational collaboration that is attempting to coordinate the development of ontologies of biomedical interest, including a sequence ontology for describing genetic loci and sequence variations, a phenotype ontology for describing participant characteristics, and an investigation ontology for describing laboratory methodologies, etc., that will provide the standardized vocabulary necessary for true data interoperability [
15].
6. Establish data sharing rules, including attribution of contributions. Investigators spend considerable time, effort, and money collecting data and specimens. Consequently, there is rightfully a sense of entitlement felt by investigators who may not believe it is appropriate to permit other scientists to analyze “their” data before they have completely finished with them. Large sums of public and private money have been spent to build and maintain investigative teams that are conducting original research primarily for the public benefit. These investigators may see little reward and much risk in participating in aggregated databases, despite the potential public good from sharing their data. There also may be concerns regarding the intellectual property created by the original investigator. Further, funding agencies or the investigator's employers may limit the sharing of research data. The question of data ownership will be important not only for publication, but also with regard to grant applications, promotion, etc. There are currently no firm guidelines that protect investigators with regard to ownership and sharing of data. Therefore, stewards of aggregated databases should develop rules that establish the length of time or other conditions determining when an investigator can have absolute control over the data s/he has contributed, even after submission in the aggregated database. Publications and grant applications must clearly describe the origin of data taken from public or quasi-public sources (investigator/repository), and the original contributors must be given appropriate credit.
7. Adopt “best practices” to avoid identifiability of the data. Clinical research data can be collected and stored with several levels of confidentiality. The terminology in this area is confusing, but in general, data can be fully identified, de-identified (coded) but linkable to identifiers, or de-identified and un-linkable to a code (i.e., “anonymized”) [
16]. Participants may desire that their de-identified information be treated as if it were anonymous, and many conflate anonymity and de-identification, but it is relatively rare for research to be conducted under conditions of true anonymity—blood samples without names or numbers on the tubes, or clinical phenotypes that never contained explicit identifiers, for example. But the general concept of anonymity/identification is not absolute [
17].
Genome-wide analyses by their very nature produce uniquely identifying information. It is estimated that between 75–100 single nucleotide polymorphisms or fewer than 20 microsatellite markers can unambiguously identify a single individual [
18]. Likewise, a highly detailed clinical phenotype with participant demographic characteristics can be used to re-identify an individual. The most disconcerting scenarios regarding re-identification assume that there is some other database of identified participants to compare with. While it may seem unlikely that an individual's genotype and phenotype would exist in more than one place, techniques such as “trail re-identification” have shown that this is often the case. That is, patients or research participants deposit seemingly de-identified data in multiple hospitals or other institutions that share information, revealing a visit pattern or “trail” that can be linked to individual identities [
19]. The threat of sample re-identification will grow as more data are collected and aggregated. The concern extends for database users outside the academic research arena, including pharmaceutical companies, insurance companies, and law enforcement agencies with access to rich storehouses of public and private information to re-identify the phenotypes stored by researchers.
Technical approaches exist that can mask the identities of participants in large databases beyond the ad hoc approaches of removing links between data and identifiers as prescribed in the HIPAA (Health Insurance Portability and Accountability Act) safe harbor provision. First, data can be generalized. For example, in storing genetic sequence information, AGG could become AG(G or C) at polymorphic residues [
20,
21]. Second, data can be encrypted at the source, and stored in encrypted form in the database server. A researcher would only be able to query a processing engine that would interact with the database and report the results. Analysis of record-level information would still be possible, but that analysis would be done by the database host, and not by downloading the data to client computers. Third, data can be manipulated so that each record that is shared with other users is linked or mapped to a prespecified number of potential identities in the database. Thus the original data retain utility, but now have a much lower chance of being linked to a specific individual. While the computational methodologies necessary to protect data in this manner exist, their refinement and practical application require further development.