As noted above, in practice some level of open distribution of aggregate data is necessary to communicate results in the literature. Assessing privacy risk is an important aspect of disseminating findings from GWAS, both for findings that reach significance and for those that do not. A dilemma faced by many researchers is what balance should be struck between releasing summary-level data, during publication or through searchable databases, and minimizing the risk to the privacy of study participants. For example, how do researchers determine the number of SNPs that should be placed on the web or in a supplementary table? Is releasing summary-level data from 1,000 or 5,000 SNPs reasonable? Managing and assessing the risk when sharing summary-level data should balance multiple factors, both quantitative and non-quantitative, and should follow a clear deliberation process.
Non-quantitative risk assessment should include consideration of the potential consequences of someone in a particular cohort being identified as a participant. For example, identification of participants in studies of readily observable common traits, such as obesity or hair color, would be less concerning than identification of individuals in studies of alcohol dependence, illegal behavior, or psychiatric conditions. These types of non-quantitative risk considerations are often study specific, and higher-level restrictions on access may be warranted for higher-risk studies. Within databases such as dbGaP, there is the ability to define restrictions on access through the Data Use Certification agreement. For example, some datasets require applicants to obtain IRB approval for access, while many other datasets allow general-use access following institutional and user agreement to standard sharing and reporting policies for GWAS.
Quantifying the risk of making summary-level data broadly available is an essential part of the risk assessment process, and one that lends itself to more traditional approaches for risk assessment. Box 1 introduces several key concepts in risk assessment, such as sensitivity, specificity and positive predictive value (PPV). Each of these metrics gives insight into a specific type of risk. Beyond these metrics, software tools also exist for quantifying the risk associated with summary-level data from GWAS. Notably, Sankararaman and colleagues30 published a method and software tool called SecureGenome, which takes an input genotype set and a reference set and determines, from upper bounds on the power of an optimally solved likelihood-ratio test, the number of highly ranked SNPs that can be safely exposed.
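To make the underlying idea concrete, the sketch below shows the general form of a per-individual log-likelihood-ratio membership statistic of the kind such methods build on: comparing a person's genotypes against allele frequencies estimated from the study pool versus a reference population. This is a minimal illustration under assumed Hardy-Weinberg equilibrium and independent SNPs, with hypothetical frequencies; it is not the SecureGenome implementation itself.

```python
import math

def membership_llr(genotypes, pool_freqs, ref_freqs):
    """Per-individual log-likelihood ratio for pool membership.

    genotypes: minor-allele counts (0, 1 or 2) at each SNP
    pool_freqs: estimated minor-allele frequencies in the study pool
    ref_freqs: minor-allele frequencies in a reference population
    Assumes Hardy-Weinberg equilibrium and independent SNPs.
    """
    llr = 0.0
    for g, p_pool, p_ref in zip(genotypes, pool_freqs, ref_freqs):
        # Binomial(2, p) genotype likelihood; the C(2, g) terms cancel
        # in the ratio, leaving only the frequency-dependent parts.
        llr += g * math.log(p_pool / p_ref)
        llr += (2 - g) * math.log((1 - p_pool) / (1 - p_ref))
    return llr

# Hypothetical frequencies at three SNPs; a positive statistic favors
# membership in the pool over the reference population.
score = membership_llr([1, 2, 0], [0.32, 0.55, 0.18], [0.30, 0.50, 0.20])
```

Methods such as SecureGenome go further by bounding the power of the optimal test of this type across candidate SNP sets, which is what allows a "safe" number of SNPs to be chosen.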
Box 1. Risk assessment definitions applied to sharing GWAS aggregate datasets
In order to consider risk-assessment definitions, it is useful to first think of standard ‘ability of a test to detect a disease’ measures of sensitivity, specificity, positive and negative predictive values, as shown in the upper table. Each of these can be converted to an ‘ability to classify an individual as being in a genome-wide association study (GWAS) data set’, as shown in the lower table.
The risk-assessment definitions in the context of GWAS data sets are listed below.
Type II error. The proportion of times that someone who is actually in the data set is not identified as being in the data set. For example, with 20% type II error, there is a 20% chance of failing to determine that someone is in a data set.
Type I error. The proportion of times that someone is predicted to be in the data set when they are not. For example, with 5% type I error, there is a 5% chance of determining that someone is in the data set when they are not.
Sensitivity. The ability to detect true positives (that is, the correct classification of people in the data set). In both tables, this is (a) / (a + c). For example, with a sensitivity of 30%, only 30% of test individuals in the data set will be correctly classified as being in the data set; 70% of those actually in the data set will be missed.
Specificity. The proportion of those people that are not in the data set who are correctly classified as not being in the data set (that is, true negatives). From the table, this would be (d) / (b + d). For example, with a specificity of 40%, only 40% of test individuals not in the data set will be correctly classified as not being in it; the remaining 60% will be incorrectly classified as being in the data set.
Power. The proportion of times that an individual who is actually in the data set will be correctly classified as being in the data set. For example, with 80% power, there is an 80% chance of correctly classifying someone as being in the data set.
Positive predictive value. The positive predictive value (PPV) is defined as the number of true positives divided by the total number of all positives ((a) / (a + b)). This measure is frequently used for rare disorders. Similarly, most individuals from a population would not actually be in a GWAS data set. PPV is the proportion of all individuals predicted to be positive from a population that are truly in a data set. With 20% PPV, only 20% of those identified as being in the cohort actually will be; 80% will not (and hence the ratio of false positives to true positives would be 4:1).
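The four metrics above can be computed directly from the 2x2 counts a (true positives), b (false positives), c (false negatives) and d (true negatives). The sketch below does this for an illustrative, hypothetical table chosen so that PPV is 20%, matching the example in the definition.

```python
def confusion_metrics(a, b, c, d):
    """Membership-test metrics from a 2x2 table.

    a: in the data set, classified as in (true positive)
    b: not in the data set, classified as in (false positive)
    c: in the data set, classified as not in (false negative)
    d: not in the data set, classified as not in (true negative)
    """
    sensitivity = a / (a + c)  # power = 1 - type II error
    specificity = d / (b + d)  # 1 - type I error
    ppv = a / (a + b)          # positive predictive value
    npv = d / (c + d)          # negative predictive value
    return sensitivity, specificity, ppv, npv

# Hypothetical counts: 20 true positives, 80 false positives,
# 5 false negatives, 9,895 true negatives.
sens, spec, ppv, npv = confusion_metrics(20, 80, 5, 9895)
```

With these counts the PPV is 20/(20 + 80) = 0.20, so four of every five individuals flagged as being in the data set are false positives, as in the 4:1 ratio described above.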
Positive predictive value
In this section we discuss a metric that can be used in quantitative risk assessment in the context of sharing data, one that specifically accounts for the size of the sampled population and the fact that most individuals from a population are not actually in the dataset. As highlighted by Braun et al.31, the number of false positives depends on the number of participants drawn from the population, and PPV can quantify the risk of correctly identifying an individual as being included within a dataset given that most individuals from the population are not. Calculating PPV requires determining the proportion of the 'at-risk' population that is in the genome-wide association study. For example, assume someone wished to determine whether a person was within a dataset of 1,000 European-ancestry individuals from the Framingham study (and they had genotype data for this person). Given an estimated 65,000 individuals in Framingham, approximately 75% of European ancestry, the 'at-risk' population is approximately 50,000 individuals. The prevalence is thus 1,000/50,000 = 0.02: without any test data, the probability that a randomly selected person from the at-risk population is actually in the dataset is 2%. The prevalence therefore provides the prior knowledge from which the positive predictive value is estimated. Prevalence of participants within a study may be quite low in large-scale studies, or may become reasonably high in small 'at-risk' populations, such as a GWAS of Native Hawaiian or Old Order Amish populations. The influence of prevalence on risk assessment through PPV is illustrated in , in which a simulation with a high prevalence is compared with a simulation with a low prevalence. With low prevalence, the risk of resolving membership of a cohort is greatly reduced. The strength of PPV as a measure is therefore that it inherently accounts for the prior probability that a person selected at random is actually in the dataset, and thereby for key aspects of the population, as part of risk assessment29,30.
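Given assumed test characteristics, the dependence of PPV on prevalence follows directly from Bayes' rule. The sketch below uses the 2% prevalence from the Framingham example above; the 80% power and 99% specificity are illustrative assumptions, not values from any particular membership test.

```python
def ppv(prevalence, sensitivity, specificity):
    """PPV via Bayes' rule: P(in data set | test says 'in data set')."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed test characteristics (illustrative): 80% power, 99% specificity.
p_framingham = ppv(0.02, 0.80, 0.99)    # 2% prevalence, as above
p_large_pop = ppv(0.0001, 0.80, 0.99)   # much larger 'at-risk' population
```

Under these assumptions the PPV falls from roughly 62% at 2% prevalence to under 1% at 0.01% prevalence, even though the test itself is unchanged; this is the sense in which prevalence dominates the identification risk.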
Sharing 5,000 SNPs at different prevalence or prior probabilities
As explained above, researchers are often faced with the question of how many SNPs should be included in the summary data that they release. The PPV is one way to obtain a quantitative risk assessment for different numbers of SNPs and different study sizes; provides several examples of PPV as a risk-assessment measure in simulations of releasing between a few hundred and a few thousand of the most associated SNPs by p-value under different prevalence settings. In these simulations we used a prevalence of 0.01, which could correspond to a study of cardiovascular traits in a Framingham population, and 0.001, which could correspond to a study of 1,000 individuals with major depression sampled from a population defined to include all U.S. persons of European ancestry. The results of these simulations show the importance of considering prevalence: for 5,000 SNPs and a cohort size of 500, the PPV is 29.2% at a prevalence of 0.01 and 7.5% at a prevalence of 0.001, both with a discrimination threshold of 0.001. Further results from suggest that sharing 1,000 SNPs for datasets with > 500 individuals generally leads to a low PPV, regardless of the population size. Taken together, assessing risk with PPV and/or other statistical metrics can inform discussions of non-quantitative risks.
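A useful way to read the simulation PPVs reported above is as an expected number of false positives per true identification, (1 - PPV)/PPV. The sketch below applies this conversion to the two reported values for 5,000 shared SNPs and a cohort size of 500.

```python
def false_per_true_positive(ppv):
    """Expected false positives per true positive at a given PPV."""
    return (1 - ppv) / ppv

# Reported simulation PPVs for 5,000 shared SNPs, cohort size 500:
ratio_high_prev = false_per_true_positive(0.292)  # prevalence 0.01
ratio_low_prev = false_per_true_positive(0.075)   # prevalence 0.001
```

At a prevalence of 0.01 an attacker sees roughly 2.4 false positives per true identification; at 0.001 that rises to roughly 12.3, which is why the lower-prevalence setting is far safer for the same release of 5,000 SNPs.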
Risk assessment with different prevalence parameters