Though genome-wide association studies (GWAS) have identified numerous susceptibility loci for common diseases, their use is limited due to the expense of genotyping large cohorts of individuals. One potential solution is to use ‘additional controls’, or genotype data from control individuals deposited in public repositories. While this approach has been used by several groups, the genetically heterogeneous nature of the population of the United States makes this approach potentially problematic. We empirically investigated the utility of this approach in a US-based GWAS. In a small GWAS of pancreatic cancer in New York, we observed clear population structure differences relative to controls from the database of Genotypes and Phenotypes (dbGaP). When we conduct the GWAS using these additional controls, we find large inflation of the test statistic that is properly corrected by using eigenvectors from principal components analysis as covariates. To deal with errors introduced due to different sources, we propose simultaneously genotyping a small number of controls along with cases and then comparing this group to the additional controls. We show that removing SNPs that show differences between these control groups reduces false-positive findings. Thus, through an empirical approach, this report provides practical guidance for using additional controls from publicly available datasets.
In recent years, genome-wide association studies (GWAS) have been used to identify numerous replicable susceptibility loci for many complex diseases. A typical GWAS involves a case-control design in which the investigator analyzes DNA samples from both affected case individuals and matched, healthy control individuals. One hurdle in conducting such studies, in which hundreds of thousands of SNPs are independently tested for association with disease, is the large sample size required to obtain adequate power to detect a modest effect after correcting for multiple testing. To address this problem, many groups have joined efforts to create large consortia with DNA samples from thousands or tens of thousands of individuals to conduct studies that are well powered to detect even a modest genetic effect. Even with large consortia, however, the cost of genotyping such a large number of samples can be prohibitive.
One potential solution to the sample size requirement of GWAS is the use of a common set of control individuals in numerous studies. In 2007, the Wellcome Trust Case Control Consortium (WTCCC) used this ‘shared controls’ approach to study seven common diseases. Rather than using controls individually matched to the cases for each disease, the WTCCC genotyped a common set of controls representative of the self-identified white European population of Great Britain and compared allele frequencies from this group with each set of case individuals. This approach has been used by others with case individuals who come from both the UK and elsewhere, including the United States [1,2,3,4,5]. Recently, Zhuang et al. published a simulation study in which they showed the theoretical potential for expanding the control group with publicly available disease samples or reference samples to increase the power of GWAS; we refer to the use of such controls from the database as ‘additional controls’.
Despite the apparent practical success of this approach and simulation studies suggesting its effectiveness, both the power and pitfalls of using additional controls from databases in the genetically heterogeneous population of the United States remain unclear. Genome-wide genotype information, along with limited phenotypic data, is available for numerous healthy individuals from the United States in the database of Genotypes and Phenotypes (dbGaP) at NIH. Therefore, in theory it should be possible to combine these data with genome-wide SNP profiles from a smaller number of cases that an individual investigator is studying to identify disease susceptibility loci. However, population stratification due to differences in genetic ancestry between people in such case and control groups and differential genotyping error from different sources could hinder an effective use of this approach. It is known that even if a study is restricted to self-identified ‘white’ individuals in the United States, genotype frequency at many loci can vary based on from where in Europe the ancestors came [7,8]. While a variety of statistical methods have been developed to identify and correct for such stratification [9,10], how such correction will influence the power and type I error rate of using common controls in US-based studies remains to be seen.
In this paper, we evaluate the use of additional controls from publicly available sources in a US-based GWAS. To do so, we utilize a small pancreatic cancer dataset for which we have genome-wide genotype data on 263 cases and 202 controls. We chose this dataset in part because four recently reported pancreatic cancer-associated SNPs could be used as true positives to estimate the power of this additional controls approach in a real setting [11,12]. We found that the rank and p value of these true disease SNPs improved significantly in our dataset with additional controls, with the added benefit of more controls reaching a plateau after a control:case ratio of 10:1 is obtained. Despite a large amount of population stratification in this joint dataset, the impact of this stratification was effectively captured and corrected by principal component analysis (PCA). We demonstrate the utility of genotyping some controls at the same time as cases for comparison with the additional controls to remove SNPs that show differential allele frequencies due to disparity in data processing and technical artifacts. We thus show systematically for the first time the practical issues that concern the use of controls from different sources. This report can serve as useful guidance when using additional controls from publicly available datasets in future studies.
The study was approved by the Memorial Sloan-Kettering Cancer Center's (MSKCC) Institutional Review Board and all participants signed informed consent.
We determined the analytical power of GWAS assuming a simple test of allelic association. We computed the power using a non-central χ2 distribution with non-centrality parameter λ. The power was computed under an additive model with the significance threshold α = 1 × 10−7. The genotype relative risk (GRR) was varied from 1.0 to 3.0 in increments of 0.1, and the disease allele frequency (DAF) was varied from 0.05 to 0.50. The number of cases used ranged from 100 to 3,000, and the control:case ratio ranged from 1:1 to 50:1.
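The power calculation described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the form of the non-centrality parameter, the use of the DAF as the control allele frequency, and the multiplicative per-allele risk used to derive the case allele frequency are all assumptions, and the function names are hypothetical. For a 1-df test, the non-central χ2 tail probability can be written with the standard normal distribution, so only the Python standard library is needed.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def allelic_ncp(n_cases, ratio, grr, daf):
    """Non-centrality parameter of a 1-df allelic test (assumed form).

    Assumptions: controls carry the risk allele at frequency `daf`, and
    each copy of the risk allele multiplies disease risk by `grr`.
    """
    n_controls = n_cases * ratio
    p0 = daf                                  # allele frequency in controls
    p1 = grr * daf / (grr * daf + 1.0 - daf)  # allele frequency in cases
    a1, a0 = 2 * n_cases, 2 * n_controls      # number of alleles per group
    p_bar = (a1 * p1 + a0 * p0) / (a1 + a0)   # pooled allele frequency
    return (p1 - p0) ** 2 / (p_bar * (1.0 - p_bar) * (1.0 / a1 + 1.0 / a0))


def power(n_cases, ratio, grr, daf, alpha=1e-7):
    """Power of the 1-df test at significance level `alpha`.

    A non-central chi-square variable with 1 df and parameter lam equals
    (Z + sqrt(lam))**2 for standard normal Z, which gives the two-term
    expression below.
    """
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)
    shift = allelic_ncp(n_cases, ratio, grr, daf) ** 0.5
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Power as a function of control:case ratio for a fixed number of cases.
    for ratio in (1, 5, 10, 20, 50):
        print(ratio, round(power(1000, ratio, grr=1.6, daf=0.20), 3))
```

Under these assumptions, the non-centrality parameter grows with the control:case ratio but is bounded, because the case sample size stays fixed; this is what produces diminishing returns at high ratios.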
The pancreatic cancer study dataset was obtained from an ongoing hospital-based case-control study conducted in conjunction with the Familial Pancreatic Tumor Registry (FPTR) at MSKCC. The samples were obtained by the MSKCC FPTR research study assistant. Patients were eligible if they were aged 21 or over, spoke English, and had pathologically or cytologically confirmed adenocarcinoma of the pancreas. Patients were recruited from the Surgical and Medical Oncology Clinics at MSKCC when seen for initial diagnosis or follow-up. Controls were visitors accompanying patients with other diseases to MSKCC or spouses of patients. They had the same age and language eligibility requirements as the cases and were not eligible if they had a personal history of cancer (except for non-melanoma skin cancer). The 263 cases and 202 controls in this analysis were recruited between June 2003 and July 2009. The participation rate among approached and eligible individuals was 76% among cases and 56% among controls. Participants provided a blood or buccal (mouthwash or saliva) sample for DNA and completed risk factor and family history questionnaires administered by the research study assistant by telephone or in person.
Genomic DNA was isolated from buccal cells using the Puregene DNA purification kit (Qiagen, Inc., Valencia, Calif., USA). DNA was also isolated from saliva samples with the Oragene saliva kits (DNA Genotek, Kanata, Ont., Canada) or from blood using the Gentra Puregene blood kit (Qiagen, Inc.). DNA samples were hydrated in 1x TE buffer. Genomic DNA was genotyped on the Illumina 370K SNP chip (either the Illumina CNV370-Duo or Illumina CNV370-Quad) at the Genomics Core Laboratory of MSKCC according to the manufacturer's protocol.
Genotypes from additional controls were obtained from the NIH's dbGaP. All individuals used are controls in the underlying study and are of European ancestry. Specifically, data from 6 studies in dbGaP genotyped using Illumina chips were used (table 1). These datasets provide 5,485 additional controls in total. Using a common set of markers present in all the datasets, we combined our MSKCC cases and controls with some or all of the additional controls to yield control:case ratios of 5:1, 10:1 or 20:1.
All genotype data were processed using PLINK. We performed several steps of quality control (QC). First, we processed the MSKCC samples alone, without additional controls. As we could not be certain of the DNA strand the genotype calls from each study are in reference to, we removed all A/T and C/G SNPs, as strand could be confused for these allele pairs. We removed individuals for whom less than 90% of genotypes were called and SNPs for which less than 10% of genotypes were called. We also removed SNPs with a minor allele frequency of <5%, or SNPs that were out of Hardy-Weinberg equilibrium in controls (p < 1 × 10−7). A total of 314,664 markers passed the QC in the MSKCC data and were used for combining data from various sources. Similar QC steps with the same parameters were performed on each of the additional controls datasets independently. The datasets were then merged using PLINK, restricting analysis to a set of SNPs common to all datasets. We calculated genome-wide identity by descent (IBD) using PLINK (--genome), and 70 individuals with excessive IBD (π̂ > 0.4) were removed from our analysis. After these steps, we applied the same thresholds for missing data, minor allele frequency, and Hardy-Weinberg equilibrium as before. We also removed 529 SNPs that showed a significant difference in rates of missing genotype calls between cases and controls (p < 1 × 10−7) and a further 723 markers that showed differential missingness (p < 1 × 10−7) between males and females. A test for differences in missingness based on local haplotype did not reveal any SNPs with strong evidence for differential missingness based on inferred genotype at the SNP (--test-mishap in PLINK; p < 1 × 10−7). We compared allele frequencies and call rates between MSKCC study samples obtained from different DNA sources (buccal, saliva, or blood) and did not find any markers showing different missingness rates or genotype frequencies due to difference in DNA source (p < 1 × 10−7).
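The removal of A/T and C/G SNPs can be illustrated with a short sketch. It is not the procedure run in the study, which used PLINK; the function names and the SNP identifiers in the usage example are hypothetical. A SNP is strand-ambiguous when its two alleles are complementary bases, because the same pair of letters is read from either DNA strand.

```python
# Watson-Crick complements of the four bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """Return True for A/T and C/G SNPs, whose alleles are complementary."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def drop_ambiguous(snps):
    """Keep only SNPs whose strand can be resolved from the allele pair.

    `snps` is an iterable of (snp_id, allele1, allele2) tuples.
    """
    return [snp for snp in snps if not is_strand_ambiguous(snp[1], snp[2])]


if __name__ == "__main__":
    example = [("snp1", "A", "G"), ("snp2", "A", "T"), ("snp3", "C", "G")]
    print(drop_ambiguous(example))  # only the A/G SNP is retained
```

For an A/G SNP, a strand flip turns the alleles into T/C, which is detectable; for an A/T SNP the flipped alleles are T/A, so a flip cannot be distinguished from a swap of allele labels.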
To perform principal components analysis to adjust for population substructure, we used the EIGENSTRAT software from the EIGENSOFT 2.0 package. We first filtered the data by removing markers in high linkage disequilibrium (LD). This gave us a set of 32,619 SNPs for which pairwise r2 values within a window of 50 SNPs are all less than a specified threshold (usually 0.1; --indep-pairwise 50 5 0.1 command in PLINK). This set of markers was then used as input for EIGENSTRAT. Principal components were computed and outliers removed using default parameters. Significant principal components were determined using the Tracy-Widom statistic (p < 0.05).
To perform additional QC to reduce false-positive findings, we tested for genotype frequency differences between each control group versus the rest of the controls. For each control group, we adjusted for the top 11 principal components and used logistic regression to test for differences in genotype frequency versus the other control groups. For the MSKCC controls, we identified 2,702 SNPs that show a significant difference in genotype frequencies (p < 0.01; online suppl. fig. 1; for all online suppl. material, see www.karger.com/doi/10.1159/000330149); these SNPs were removed from further analysis. For the other control groups, we identified an additional 15 SNPs that showed significant deviation in genotype frequency in at least one control group (p < 1 × 10−7; online suppl. fig. 1). Notably, we found that the 211 controls from the Study of Irish Amyotrophic Lateral Sclerosis (SIALS; phs000127v1) show a strong deviation from the null hypothesis on a quantile-quantile plot (online suppl. fig. 1). Therefore, we chose to remove these 211 controls from the final analysis. This resulted in a final dataset of 263 cases and 5,416 total controls at 267,109 markers.
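The idea behind this control-versus-control filter can be sketched with a simpler test than the one used here. The study used logistic regression adjusted for principal components; the sketch below substitutes an unadjusted Pearson χ2 test on allele counts, which ignores population structure and is shown only to illustrate the filtering logic. The function names and the input layout are hypothetical.

```python
from statistics import NormalDist


def allelic_chi2_p(minor_a, total_a, minor_b, total_b):
    """P value of a 1-df Pearson chi-square test on a 2x2 allele table.

    `minor_*` are minor allele counts and `total_*` are total allele
    counts (twice the number of genotyped individuals) in groups a and b.
    """
    n = total_a + total_b
    minor = minor_a + minor_b
    major = n - minor
    if minor == 0 or major == 0 or total_a == 0 or total_b == 0:
        return 1.0  # monomorphic SNP or empty group: nothing to test
    a, b = minor_a, total_a - minor_a
    c, d = minor_b, total_b - minor_b
    chi2 = n * (a * d - b * c) ** 2 / (total_a * total_b * minor * major)
    # For 1 df, P(chi-square > x) = 2 * Phi(-sqrt(x)).
    return 2.0 * NormalDist().cdf(-(chi2 ** 0.5))


def flag_discordant_snps(counts, threshold=0.01):
    """Return IDs of SNPs whose allele frequency differs between groups.

    `counts` maps snp_id -> (minor_a, total_a, minor_b, total_b).
    """
    return [snp for snp, c in counts.items() if allelic_chi2_p(*c) < threshold]
```

SNPs flagged by such a comparison would be candidates for removal before the case-control analysis, since a frequency difference between two groups of controls is more likely to reflect technical or ancestry differences than disease association.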
To test for association between disease phenotype and SNPs, we used logistic regression as implemented in PLINK. When we do not consider population substructure, logistic regression is used without covariate adjustment; otherwise, significant principal components were used as covariates to adjust for population substructure.
We used PLINK's estimate for the genomic control parameter λ, which is a measure of test statistic inflation due to effects such as population stratification. PLINK reports λ (based on median χ2) in the .log file. To test control:case ratios of 1:1, 5:1, 10:1, and 20:1, we selected appropriate subsets of the additional controls to add to the MSKCC case-control dataset.
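For readers who wish to reproduce the median-based estimate outside PLINK, a minimal sketch is given below. It assumes 1-df test statistics recovered from two-sided p values and is not PLINK's own code; the function name is hypothetical. The estimate is the median observed χ2 divided by the median of a 1-df χ2 distribution (about 0.455).

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 df, i.e. the squared
# 75th percentile of the standard normal (approximately 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_lambda(p_values):
    """Median-based genomic control parameter from 1-df test p values.

    Each p value must lie in (0, 1]; it is converted back to a chi-square
    statistic before the median is taken.
    """
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value near 1 indicates that the bulk of the test statistics follow the null distribution; values well above 1 indicate inflation.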
All MSKCC DNA samples were first amplified using the Illustra GenomiPhi v2 DNA amplification kit (GE Healthcare), following the manufacturer's recommendations. The reaction was then diluted by adding 120 μl reduced TE buffer. Prior to use in genotyping, we performed an additional 2-fold dilution to improve assay performance. One SNP, rs2236479, was genotyped using the TaqMan allelic discrimination genotyping assay (Applied Biosystems). Genotyping was conducted according to the manufacturer's instructions as follows: a master mix consisting of 1.375 μl water, 2.5 μl 2× TaqMan master mix, and 0.125 μl SNP assay (probe + primers) for each individual was prepared. Four microliters were aliquoted into each well of a 384 well plate, and 1 μl of amplified and diluted DNA was added. PCR was performed in an ABI Gene Amp 9700 machine under the following conditions: 95°C for 10 min followed by 48 cycles of 92°C for 15 s and 60°C for 1 min. Plates were read on an ABI Prism 7900HT fast real-time PCR system, and genotype calling was performed using the ABI Sequence Detection System software version 2.3. The genotype concordance rate was computed using 346 individuals who were genotyped both with TaqMan and on the Illumina arrays.
The large number of control individuals currently available in dbGaP and other databases raises the question of limiting returns. In other words, at what point is the improved power obtained through additional controls small enough that it is no longer worth adding controls? We therefore investigated the shape of the curve of power as a function of control:case ratio with a constant number of cases. As expected, the power increases with increasing number of cases, GRR and DAF. The maximum power is achieved when the control:case ratio increases to 10:1; beyond that, the power plateaus (fig. 1). For example, at a GRR of 1.6, a DAF of 20%, and a significance level of 10−7, little increase in power is observed after a control:case ratio of 10:1. Therefore, we consider a 10:1 control:case ratio ideal for using additional controls in a GWAS.
The present study was motivated by our desire to combine data from common controls with data from case individuals ascertained at MSKCC in New York. We were concerned that population stratification could become a significant problem in such a study, even if we restrict our analysis to self-identified ‘white’ individuals, because of subtle genetic differences among different European populations [8,15,16]. The history of immigration to the United States suggests that a larger proportion of white Americans of Ashkenazi Jewish or southern European (e.g. Italian) ancestry would be found in the New York metropolitan area compared to the country as a whole. If this were the case, combining additional controls with our New York-based population would result in the detection of alleles that mark geographic ancestry within Europe rather than disease risk. To investigate whether this concern was well-founded, we performed PCA on 263 cases and 202 controls from the MSKCC pancreatic cancer study combined with 5,416 individuals selected as additional controls from 6 different studies available in dbGaP (table 1). When we examine the first and third principal components in our samples from New York, we observe many individuals along a single gradient which has been previously suggested to represent a cline extending from northwest to southeast Europe (fig. 2). The separate cluster of individuals has been previously suggested to be individuals of Ashkenazi Jewish ancestry; all participants in our study who self-identified as Ashkenazi Jewish cluster in this group, supporting the contention that this cluster represents the Ashkenazi Jewish population (fig. 2). When we compare this PCA plot with one for the controls from dbGaP, we observe marked differences in the distribution of individuals on the plot, suggesting a different distribution of geographic ancestry within Europe. 
Notably, 18% of the individuals in our study cluster in the ‘Ashkenazi Jewish’ group, compared with 1.7% in the dbGaP control group. These differences could potentially lead to high test statistic inflation when cases and additional controls are analyzed together. Therefore, we conclude that population stratification may be a serious issue when using additional controls with a New York-based case dataset and must be addressed.
We next asked if stratification between our New York-based case dataset and controls from dbGaP results in false positives and if PCA can properly correct for it. We limited the data to those SNPs that were in common among all studies. As all studies were conducted using the Illumina platform, there were 272,796 overlapping SNPs. The full dataset results in a control:case ratio of 20:1, twice as much as we would recommend based on the analytical power calculations. Using an independent set of markers (all pair-wise LD r2 < 0.1), we determined the significant principal components using EIGENSTRAT. The top principal components were used as covariates in a logistic regression model. As can be seen on the quantile-quantile plot, there is an immense inflation of the test statistic when we do not correct for population structure; we interpret this to be due to stratification rather than any true positive finding (fig. 3). When we correct for population structure by adjusting for the top 21 eigenvectors, the quantile-quantile plot follows the distribution expected for the null hypothesis much more closely (fig. 3), even though there is a little inflation near the tail. Therefore, simple adjustment for principal components can largely correct for population stratification introduced when using additional controls.
The presence of 6 SNPs at the genome-wide significance threshold of 10−7 concerned us as such highly significant associations should have been found in the previously reported pancreatic cancer GWAS [11,12]. When we examined the previously reported GWAS of pancreatic cancer in dbGaP, none of these 6 SNPs was significant (all p > 0.05) (table 2). This failure to replicate raises the possibility that the significant results in our study may represent false positives even after following QC steps used in regular case-control GWAS. We next asked if SNPs that lead to false positives could be detected by comparing the MSKCC controls with the additional controls from dbGaP using logistic regression. The quantile-quantile plot of this comparison shows no inflation of test statistics when correcting for 11 principal components (genomic inflation factor λ = 1.01). Five out of six potential false-positive SNPs showed a nominally significant difference (p < 0.01) in allele frequency between control groups (table 2). We then examined the normalized intensity plots for the sixth SNP, rs1975920, in the data we generated (online suppl. fig. 3). While the plot shows distinct clusters, we noticed that this SNP was monomorphic in the samples we genotyped on the Illumina CNV370-Quad array, while it was polymorphic in the larger number of samples genotyped using the Illumina CNV370-Duo array. As only 20 controls were genotyped using the Illumina CNV370-Quad array, we were not able to detect this artifact through the control group comparison. However, 84 out of 263 cases were genotyped on the CNV370-Quad, presumably driving the signal seen in the case-control analysis. Thus, we introduce an additional QC step by removing 2,863 SNPs that show significant difference (p < 0.01) in allele frequencies between the MSKCC control group and additional controls. We extended this analysis to the other control groups, comparing each group with all other control groups. 
We excluded 15 markers with significant differences in genotype frequency (p < 1 × 10−7). We also visually inspected the quantile-quantile plot of each test for excess test statistic inflation (online suppl. fig. 1). Notably, we found that the 211 controls from the SIALS (phs000127v1) show deviation from the null hypothesis in the quantile-quantile plot. Thus, we removed these 211 controls from the final analysis. We reanalyzed 263 pancreatic cancer cases with 5,416 additional controls after performing this additional QC step and found that most of the SNPs with an extremely low p value were removed except one (rs2236479). We genotyped rs2236479 in our cohort using a different technology (TaqMan). The concordance rate between the two technologies (TaqMan and Illumina) for rs2236479 was 85%, suggesting that false positives may still be present due to genotyping error. Therefore, we conclude that careful QC using a small control group genotyped simultaneously with cases can effectively reduce false-positive findings when using additional controls by identifying SNPs that show different genotype frequencies between control groups.
We next analyzed how test statistic inflation is influenced by the number and choice of sets of additional controls. We used the genomic control parameter λ as an estimate of the test statistic inflation. We measured λ in both the original case-control dataset (no additional controls) and with the addition of various additional controls from dbGaP. We observe that λ is near 1 when no additional controls are used (table 3), indicative of no test statistic inflation. As the control:case ratio is increased by adding data from different sources, λ increases, suggesting the existence of population stratification and/or other technical artifacts. In this analysis, λ is maximal at 1.81 when data from all 6 different studies are added for a control:case ratio of 20:1 (table 3). When all significant principal components from PCA were used to correct for population stratification, λ reduces to nearly 1 (range 1.01–1.03; table 3). Thus, as expected from our quantile-quantile plot analysis, PCA-based correction can properly account for the population stratification that results when using additional controls.
We next turned to the question of whether the use of additional controls in GWAS will enable new discoveries. To investigate this question, we asked whether we would have been able to discover the 4 recently reported pancreatic cancer susceptibility SNPs [11,12] in our data combined with additional controls. We asked what rank and p value are observed for each of these 4 SNPs both in our original cohort and as we add more additional controls. Theoretically, the power to detect each of these SNPs doubles as the control:case ratio increases from 1:1 to 20:1 (table 4). We found that rank and p value of the 4 pancreatic cancer-associated SNPs improved after adding additional controls in a manner that appears to correlate with the computed power. There is a two-fold increase in power for each of the 4 SNPs when the control:case ratio is increased from 1:1 to 20:1. SNP rs9543325 has the highest increase in power and largest improvement in rank and p value. There is some fluctuation in rank and p value for all 4 SNPs when we compare control:case ratios of 10:1 and 20:1. We assume this is due to sampling variability rather than a difference in power as power plateaus out beyond a 10:1 control:case ratio. These results demonstrate that using additional controls in GWAS can help bring true positive hits towards the top of the list, though in this case none of the true positives reached genome-wide significance. These powers should be compared to the power of the original PanScan study, which had 99% power to detect these 4 SNPs at α = 0.05, and reasonable power at α = 10−7, suggesting that our inability to find these true positives at genome-wide significance was to be expected.
We also asked if, for a given number of additional controls, the choice of dataset(s) from which the additional controls are taken influences our ability to detect association with these 4 SNPs. Using additional controls from 4 different studies of approximately equal size, we asked what rank and p value are observed for each of the 4 known pancreatic cancer risk SNPs. We observed variability in both the rank and p value for each of these 4 SNPs depending on the choice of control samples. As no control group is consistently the best for all 4 SNPs, we attribute this variability to sampling variation rather than intrinsic factors in any of the control groups (table 5).
One choice that must be made is how many principal components are included as covariates in the model. If one simply asks which principal components are significant using Tracy-Widom statistics, the number of covariates to use increases as additional sources of control individuals are added (table 3). For instance, in our example with a 20:1 control:case ratio there are 21 significant principal components to include. To ask whether these many covariates are necessary, we varied the number of top principal components used as covariates and measured test statistic inflation using the genomic control parameter λ (fig. 4). We find that λ decreases drastically with the first principal component and decreases somewhat more as the next 3 are added (fig. 4). While this suggests that all 21 principal components are not needed as covariates, it does not tell us whether including extra principal components as covariates decreases the power of the test. When we examine the 4 known pancreatic risk SNPs, we find that the ranks of the 4 SNPs do not change dramatically as more principal components are added after the first few (table 6). This suggests that while only 4 principal components may be needed in this situation to correct for population stratification, the risk of decreased power through adding additional principal component covariates is minimal. To address the question of what these 21 significant principal components may represent, we first asked if any of the principal components appear to associate with membership in specific studies. Visual inspection of plots of 2 principal components at a time, with studies color-coded, does not reveal any striking correlation between principal components and study membership. Regression analysis revealed that only the top 4 principal components, for which we recommend adjusting in the GWAS, are associated with study membership (data not shown). 
We next repeated the PCA with a more stringent r2 threshold for LD-based SNP pruning. When the r2 threshold for pruning is lowered from 0.1 to 0.05, the number of significant eigenvectors (Tracy-Widom p < 0.05) drops from 21 to 11.
Therefore, we conclude that using additional controls can increase the power of relatively small GWAS after strict QC steps and properly correcting population stratification.
In this article, we have performed a practical evaluation of using additional controls from publicly available databases to conduct GWAS. This approach can result in improved power by increasing the number of controls without any extra cost of genotyping. By using data from our small pancreatic cancer GWAS, we evaluated this approach through comparison with results from the recently published PanScan GWAS [11,12]. When we analyzed our pancreatic cancer data with additional controls and properly accounted for population stratification, we found improvement in the rank and p value for all 4 known pancreatic cancer SNPs relative to an analysis of our case-control dataset alone. However, while 3 of the 4 SNPs were significantly associated with pancreatic cancer in our analysis with p < 0.05, these results cannot be considered an independent replication of the PanScan results, as a large subset of our cases and controls were included in PanScan.
While statistical theory argues that the power of a GWAS increases as the control:case ratio increases for a fixed number of cases, no clear guidelines exist to determine the maximum number of added controls after which there is little or no added benefit. Using analytical power calculations, we show that power increases rapidly as the control:case ratio moves from 1:1 to 10:1 and then plateaus out. Through our analysis of the pancreatic cancer data, we see improved power with a 20:1 control:case ratio relative to a 10:1 ratio. Based on these data, it appears that when designing a GWAS using additional controls, obtaining at least 10 controls for every case is extremely important, though additional benefit could be had by obtaining up to 20 controls for every case.
It is apparent that the QC steps of GWAS in the context of additional controls obtained from public data sources are different from those of typical case-control GWAS. Recently, Pluzhnikov et al. reported a method to estimate genotyping errors from raw signal intensity data when using GWAS control samples from existing public databases. This method can only be used when the raw signal intensity data are available, which is not always the case. As an alternative approach to deal with errors introduced from genotype data with different origins, we propose including some controls to be genotyped along with the cases. By removing SNPs that show different frequencies between our controls and the additional controls, we effectively reduced the false-positive findings. We consider this step crucial in controlling false positives, especially when raw intensity data are not available.
Besides genotyping errors arising from different data sources, our results illustrate that population stratification is also a potential problem with additional controls. If there is different underlying genetic ancestry in the populations from which cases and controls are taken, an inflated type I error will result. This is clearly observed in our example, where disproportionately more self-reported white cases from the New York metropolitan area are of southern European or Ashkenazi Jewish ancestry than self-reported white controls from other parts of the United States. This stratification results in artificially high test statistics if we combine data without correcting for population structure. Using simulation studies, it has been demonstrated that correction for population stratification can be achieved successfully by using various methods like multidimensional scaling or PCA [9,10,20,21,22]. We used the popular PCA software EIGENSTRAT to identify principal components in our data and then corrected for these components in logistic regression. Adjusting for the significant principal components substantially reduces the genomic inflation factor in every additional controls dataset we tested.
The proper number of principal components to consider in correcting for population substructure remains unclear. Notably, the number of significant principal components computed using the Tracy-Widom test statistic increased when we increased the control:case ratio by adding data from different sources. With a control:case ratio of 20:1, 21 significant principal components were identified. We explored the effect of including different numbers of principal components in our analysis and found that once 4 principal components are included, no additional benefit is gained by including more. Intriguingly, in a GWAS of Alzheimer's disease, Harold et al. similarly found no additional improvement in λ after accounting for 4 principal components. Because lowering the r² threshold used to select independent markers for the PCA calculation reduced the number of significant principal components, we hypothesize that many of the 21 principal components may be picking up local LD patterns in the data rather than population substructure. Therefore, including these additional principal components is not necessary for the analysis.
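One simple way to formalize "no additional benefit" is to look for the point at which λ stops improving as principal components are added. The helper below is a hypothetical sketch of that idea: the tolerance `tol` and the λ values are made up for illustration and are not results from this study.

```python
def pcs_until_plateau(lambda_by_k, tol=0.01):
    """Smallest number of PCs after which adding more never lowers lambda by more than tol.

    lambda_by_k[k] is the genomic inflation factor obtained with k PCs as covariates.
    """
    if not lambda_by_k:
        raise ValueError("need at least one lambda value")
    for k, current in enumerate(lambda_by_k):
        # Accept k if every larger model improves lambda by at most tol.
        if all(current - later <= tol for later in lambda_by_k[k + 1:]):
            return k

# Made-up lambda values for 0, 1, 2, ... PCs (illustration only).
toy_lambdas = [1.30, 1.15, 1.08, 1.05, 1.02, 1.02, 1.019]
chosen_k = pcs_until_plateau(toy_lambdas)
```

The criterion is deliberately crude. It ignores sampling noise in λ, so in a real analysis the tolerance would need to be justified rather than assumed.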
We acknowledge that the additional controls approach is limited by choice of genotyping platform, as it requires the same SNP to be genotyped in all samples. To maximize overlap between SNPs, we restricted our analysis to projects that used Illumina chips for genotyping and further restricted analysis to only SNPs in common among all studies. Alternatively, imputation techniques have been used to integrate genotype data from different platforms, though how such an approach will perform when different platforms are used to genotype the cases and controls remains unclear.
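Restricting the analysis to SNPs typed in every study amounts to a set intersection over the per-study marker lists. The sketch below illustrates this with hypothetical study names and placeholder SNP IDs.

```python
def shared_snps(snps_by_study):
    """SNP IDs genotyped in every study (set intersection across studies)."""
    sets = [set(ids) for ids in snps_by_study.values()]
    if not sets:
        return set()
    return set.intersection(*sets)

# Hypothetical studies and placeholder SNP IDs, for illustration only.
toy_studies = {
    "cases_and_inhouse_controls": ["snp1", "snp2", "snp3", "snp4"],
    "public_controls_A": ["snp2", "snp3", "snp4", "snp5"],
    "public_controls_B": ["snp1", "snp3", "snp4"],
}
common = shared_snps(toy_studies)
```

Each dataset added can only shrink the shared set, so the choice of public datasets trades sample size against marker coverage.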
Besides these technical issues, there are conceptual limitations to this approach. Using additional controls works best in consideration of genetic effects alone. While in theory gene-environment interaction can be considered if appropriate environmental data is present in dbGaP, in practice this information is often found in only some datasets and details of the collection of this data likely vary between studies.
Based on these results, it appears that applying this approach with only several hundred cases to a disease typical of those studied with GWAS will place the true disease loci near the top of the list of SNPs without their reaching genome-wide significance. Therefore, we propose that additional controls will work best in the context of a large case-control study. In this context, a subset of cases and controls would be selected for genome-wide genotyping, and these data would be combined with additional controls. The top 10³ to 10⁴ SNPs from this analysis would then be genotyped in the full case-control study, both to increase power and to remove false positives. In other words, additional controls may work best when included in stage 1 of a two-stage GWAS design [24,25,26,27]. Standard downstream analyses, including independent replication and fine mapping, would then be conducted on SNPs that pass the second stage. Thus, the use of additional controls is a promising way to increase sample size, and hence study power, without additional cost.
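The two-stage design outlined above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the p-values are made up, and the Bonferroni threshold in stage 2 is one possible choice rather than something specified in the text.

```python
import heapq

def select_stage1_snps(stage1_pvalues, n_top):
    """SNP IDs with the n_top smallest stage-1 p-values, most significant first."""
    return heapq.nsmallest(n_top, stage1_pvalues, key=stage1_pvalues.get)

def stage2_hits(stage2_pvalues, carried_forward, alpha=0.05):
    """Carried-forward SNPs passing a Bonferroni threshold over the SNPs tested in stage 2.

    The Bonferroni correction is an assumption of this sketch.
    SNPs without a stage-2 p-value are treated as non-significant.
    """
    if not carried_forward:
        return []
    threshold = alpha / len(carried_forward)
    return [snp for snp in carried_forward
            if stage2_pvalues.get(snp, 1.0) < threshold]

# Made-up p-values for illustration only.
stage1 = {"snp_a": 1e-6, "snp_b": 0.2, "snp_c": 1e-4, "snp_d": 0.03}
stage2 = {"snp_a": 1e-5, "snp_c": 0.04}

top = select_stage1_snps(stage1, n_top=2)  # stage 1: cases vs. combined controls
hits = stage2_hits(stage2, top)            # stage 2: full case-control study
```

Because stage 2 tests far fewer SNPs than stage 1, the multiple-testing burden there is much lighter. This lets a stage-1 analysis that lacks power for genome-wide significance still serve as a screen.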
Quantile-quantile (Q-Q) plot for each control group versus all of the other control groups combined. Base 10 logarithms are used on each axis. The figures extend over 4 pages.
Principal component analysis plot comparing pairs of eigenvectors for the top 20 principal components. Each individual is color-coded by sample source, as indicated in the Figures. All figures use the same coding.
Allele intensity scatter plot for rs1975920. Normalized probe intensities, as rederived from the log R ratio and B allele frequency, are shown for cases and controls genotyped using the Illumina CNV370-Duo and Illumina CNV370-Quad arrays.
We are grateful to Heriberto Moran for technical assistance, Jason Willis for extracting the allele intensity data, and Jason Willis and Xing Xu for helpful comments on the manuscript. This work was supported by the Geoffrey Beene Cancer Research Center at MSKCC, the Emerald Foundation, and NIH R03 CA141524 (to R.J.K.). We are extremely grateful to all of the investigators and funding agencies responsible for the data deposited in dbGaP that were used in the study; detailed acknowledgements can be found in the online supplementary Acknowledgements.