A recent report by Homer et al
demonstrated a method to determine whether a given person contributed a trace amount (<0.1%) of DNA to a pool of DNA from 200 individuals based on allelic probe intensities from the same microarray genotyping technology commonly employed in GWAS. Three sets of probe data are required for Homer’s test statistic, denoted here as Tallele
: 1) from a pool to which the individual may or may not have contributed DNA; 2) from a reference pool of DNA sampled from members of the same genetic population; and 3) from DNA obtained from a single individual to be tested for membership in the first pool. Their primary applications were directed toward specific forensic challenges, such as determining whether a DNA sample contributed to a biospecimen of mixed origin. They speculated that their method could be applied to GWAS allele frequency data.
We explored that suggestion and have extended the approach of Homer et al
to propose an improved test statistic Tgeno
to detect whether an individual contributed genotypes to a GWAS, and, if so, determine their case/control phenotype. Conceptually, Tgeno
substitutes frequencies from two groups (e.g., cases and controls) and genotype states of the individual to be tested for pools of admixed DNA and measures of allelic probe intensities in the method of Homer et al
is a novel likelihood ratio statistic that can detect whether an individual contributed DNA to neither, one, or both groups given only genotypes for the individual to be tested and genotype frequencies from each group. Our statistic is similar to Tallele
except that Tgeno
contrasts the logarithm of frequencies of the each observed genotype of the tested individual (as opposed to allelic probe intensity or allele frequency in Tallele
) between two populations.
The predictive power of Tgeno was derived approximately (see Methods), and evaluated by simulation under a range of scenarios consistent with the size and scope of current GWAS designs. In each scenario, we assumed two independent groups of individuals, denoted “test” and “reference”, drawn from a homogeneous population with genotypes from independent biallelic single-nucleotide polymorphism (SNP) markers in Hardy-Weinberg equilibrium with fixed minor allele frequency (MAF). We varied the size of each group, the number and MAF of independent loci, and genotype error rate. Power was computed by simulation, randomly sampling test and reference groups as well as individuals to be tested and comparing Tgeno to its theoretical null distribution using a two-sided test that assumes Z could be a member of either group.
shows simulation results of the sensitivity and specificity of Tgeno for significance levels ranging from 0.05 to 10−6 for scenarios labeled (a)-(e). We observed that the power to detect membership in either group increased as the test group size decreased due to the larger contribution of the individual relative to the size of the test group; power also increased as the reference group size increased due to the improved knowledge of underlying population frequencies. Scenario (c) demonstrated that substantial genotyping error was required to attenuate the power of the method. Power was essentially unaffected with 1% genotyping error and only began to appreciably diminish with rates approaching 10%. Scenario (d) demonstrated that the power to detect membership does not depend on MAF. Scenario (e) emphasized the role of the size of the population that did not contain the individual being tested. Power was dramatically reduced when only 60 individuals were available in the reference group, reaching only 20% at the 0.05 significance level and no power at the 10−6 significance level; power increases to 100% at the 0.05 significance level and to 96% at the 10−6 significance level with 5,000 reference individuals. In each scenario, the predicted theoretical and empirical power matched very closely.
Theoretical power of Tgeno to detect an individual in the Test group.
In , we explored the power to detect membership of a close relative of an individual in the test group using 200,000 independent SNPs simulated under the same assumptions as in . In (f) and (g) the genotype discordance rate due to the familial relationship was a function of MAF (see Methods) and reflects no additional discordance due to assay-related genotyping errors. Power to detect membership of relatives decreased as MAF increased (and consequently the genotype discordance rate increased), but remained high for MAFs lower than 30%. In populations of European descent, the average MAF for most whole-genome SNP panels is between 20-25%.
Theoretical power of Tgeno to detect a relative of an individual in the Test group.
We applied our method to data combined from several GWAS conducted at the US National Institutes of Health using genotype data from individuals of European descent. The dataset contained 6,733 individuals affected with a disease (cases) and 6,871 unaffected individuals (controls) and genotype data for 557,005 SNPs from the Illumina HumanHap550 assay. shows a histogram of Tgeno for a hypothetical GWAS with 1,000 cases (in red), 1,000 controls (in blue) and 11,604 subjects not in the virtual study (in gray), and with the theoretical null density curve shown in black. The groups were well-differentiated, indicating that Tgeno is capable of accurately inferring membership and case/control phenotype. Both the mean and variance of the empirical null distribution were shifted from those of the theoretical standard normal distribution. These differences arise from use of fixed test and reference samples, while the theoretical null distribution holds over repeated random samples of the two groups. In addition, LD between SNPs is expected to inflate the variance of the test-statistics relative to our derivation that assumes locus independence. Thus, making inferences about Tgeno when applied to particular GWAS data requires suitable calibration of the null distribution. shows histograms comparing Tgeno and Tallele for a fixed subset of 1,000 cases and 1,000 controls for varying numbers of “top associated” SNPs.
Histogram of Tgeno for a GWAS with 1,000 cases and controls
Histograms of calibrated Tgeno and Homer’s Tallele with 1,000 cases and controls and varying numbers of SNPs
Using these data, GWAS scenarios were explored by selecting subsets of cases and controls, estimating genotype frequencies of each group, and fitting logistic genotype-phenotype association models. In each scenario, all 13,604 individuals were tested for membership conditional on a fixed set of cases and controls chosen. We also attempted to infer the phenotype of the cases and controls selected in each scenario given the knowledge that they participated in the study. Individuals not selected as cases or controls for a given scenario were used to empirically estimate the null distribution. contains Receiver Operating Characteristic (ROC) curves showing empirical sensitivity and specificity for classifying individuals as participants and the determination of their phenotype given knowledge of participation. GWAS scenarios with a fixed subset of 1,000 and 5,000 cases and an equal number of controls are shown for varying numbers of randomly chosen or top associated SNPs. These ROC curves focus on high values of specificity with 1-specificity in the range of 0.05 to 10−6
on a logarithmic scale. Supplementary Figure 1
shows ROC curves for additional GWAS scenarios with 1,000-5,000 cases and an equal number of controls. Supplementary Figure 2
is analogous to Supplementary Figure 1
except showing ROC curves for the non-log scale for the full range of specificity. Supplementary Figure 3
is analogous to Supplementary Figure 1
, except with ROC curves for Tallele
Sensitivity and specificity of Tgeno applied to GWAS data
shows a hypothetical GWAS with 1,000 cases and 1,000 controls. If the top 1,000 associated SNPs are published, the power to detect whether an individual is included in the study is approximately 43%, when the chance of a false positive, incorrectly concluding the individual is in the study, is 1 in 20. Power approaches zero when the false-positive rate is less than 1 in 1,000. If frequencies of 5,000 of the top associated SNPs are published, approximately 90% power is attained at a false-positive rate of 1 in 20 and 41% power when the false-positive rate is 1 in 1,000. When 50,000 of the top associated SNP frequencies are published, near perfect power is attained at a false-positive rate of 1 in 20 and approximately 98% power when the false-positive rate is 1 in 1,000.
We have shown the robustness of a novel test statistic, Tgeno
, an extension of the Tallele
approach by Homer et al
adapted for use on genotype frequency data rather than SNP probe intensity or allele frequency data. We found that Tgeno
was more powerful than Tallele
since the Tgeno
statistic conditions on known genotype states and is a likelihood-ratio statistic, and has optimality properties for both hypothesis testing and classification (see and Supplementary Figure 2
This method makes a critical assumption that individuals in both groups are sampled from the same population. Unless population structure or substructure is identical in both groups, incorrect inferences are likely to be made. Essentially, Tgeno and Tallele test whether an individual’s genotypes are more consistent with one group versus another (see Methods for a formal description). These statistics are valid for determining membership only when the differences between groups are due to sampling variation. Otherwise, the relatively small bias due to membership of an individual quickly becomes swamped by the systematic differences in the underlying populations. Although this characteristic may be problematic in forensic applications in which test groups are often of unknown and potentially varied population origin, GWAS case and control groups are usually drawn from genetically comparable populations.
In addition to the number and selection method of SNPs, the sensitivity of our method, that is the power to detect true membership, is affected by several parameters: the number of subjects in each group, the number of SNPs tested, genotyping error, and genetic relatedness. As the size of the group containing the individual decreases, it becomes easier to detect the contribution of a single individual; as the size of the group without the individual increases, a more precise knowledge of the background population frequencies improves power. Because this method does not rely on patterns of rare or unique variation, power is not greatly affected by MAF for relatively common SNPs (MAF>5%) (see Methods for details). Although substantial genotype error can decrease power, modest error rates (<5%) have little effect (current commercial genotyping platforms typically have error rates <0.1%). One can detect membership of siblings, parents and offspring, whereas the presence of more distant relatives would be more challenging to detect. For some applications, the ability to detect relatives is not desirable and could decrease specificity.
The conditional power to detect membership of an individual given a test and a reference sample is affected by sampling variation in the estimates of genotype frequencies between the two groups. The genetic variants need not have different underlying frequencies in the two groups, as might be a consequence of association with disease; only a non-zero difference in estimates is required. SNPs with low association p-values or, equivalently, large differences in estimated genotype frequencies between cases and controls, result in higher power because the magnitude of the ratio of frequencies between the two groups is larger. Two-sided association p-values are inversely correlated with the magnitude of the difference in genotype frequencies, but do not indicate the direction of the differences or provide additional information useful for detection of membership when using Tgeno.
This method is particularly well-suited to case-control GWAS designs since subjects are partitioned into disjoint sets that provide both groups of frequency data required by this method. Publication of genotype frequencies separately for cases and controls also increases power to detect membership, since each group is typically a large fraction of the size of the overall study. GWAS are also often composed of a combination of multiple study populations and results may be published for each subset, increasing the power of this method, particularly for rare diseases that necessitate the collaboration of multiple studies. Similarly, re-use of sets of cases and controls among several GWAS can allow the calculation of frequencies for smaller groups by removing the contribution of overlapping individuals.
Further work is required to address some issues raised by our analyses. A more thorough understanding of the test statistic is required to account for linkage disequilibrium. Aggregate statistics other than frequencies could reveal membership information. It is notable that our approach could be applied to other types of dense datasets, both genetic and non-genetic. The underlying principle is based on detecting small correlations in categorical or continuous data accumulated over a sufficient number of highly repeatable outcomes.
We anticipate that these findings will aid in understanding the implications of publishing and sharing aggregate genomic data in ways that protect the privacy of research subjects. This method could be used to determine if specific individuals participated in a clinical study and perhaps influence the decision to enroll or continue participating. Our results should be considered as a lower bound for the power to detect membership and phenotype in an aggregate genotype dataset, as more efficient methods may yet exist. In light of these developments, the policies and practices guiding genomic data sharing should continue to evolve in order to promote quality science, minimize duplicative research, and merit the ongoing trust of the research subjects who consent to participate in scientific studies.