For genome-wide association studies (GWAS) in case-control data with stratification, a commonly used association test is the generalized Armitage (GA) trend test implemented in the software EIGENSTRAT. The GA trend test uses principal component analysis to correct for population stratification. It usually assumes an additive disease model and can have high power when the underlying disease model is additive or multiplicative, but may have relatively low power when the underlying disease model is recessive or dominant. The purpose of this paper is to provide a test procedure for GWAS with increased power over the GA trend test under the recessive and dominant models while maintaining the power of the GA trend test under the additive and multiplicative models.
We extend a Hardy-Weinberg disequilibrium (HWD) trend test for a homogeneous population to account for population stratification, and then propose a robust association test procedure for GWAS that incorporates information from the extended HWD trend test into the GA trend test.
Results and Conclusions
Our simulation studies and application of our method to a GWAS data set indicate that our proposed method can achieve the purpose described above.
generalized sequential Bonferroni procedure; genome-wide association studies; Hardy-Weinberg trend test; robust test; recessive model
DNA sequence data are now being used to study the ancestral history of human populations. Existing methods for such coalescence inference use recursion formulas to compute the data probabilities. These methods are useful in practical applications but computationally complicated. Here we first investigate the asymptotic behavior of such inference; the results indicate that, broadly, the estimated coalescent time converges to a finite limit. Then we study a relatively simple computation method for this analysis and illustrate how to use it.
For genome-wide association studies (GWAS) with case-control designs, one of the most widely used association tests is the Cochran-Armitage (CA) trend test assuming an additive mode of inheritance. The CA trend test often has higher power than other association tests under additive and multiplicative disease models. However, it can have very low power under a recessive disease model in GWAS. Although tests (such as MAX3) robust to different genetic models have been developed, they often have relatively lower power than the CA trend test under additive and multiplicative models. The goal of this study is to propose an efficient method that not only has higher power than the CA trend test under dominant and recessive models but also maintains the power of the CA trend test under additive and multiplicative models.
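As an illustration of the test discussed above, the following is a minimal sketch of the standard Cochran-Armitage trend statistic for a single SNP, written in its textbook form rather than taken from the paper. The function name, the genotype ordering (aa, Aa, AA), and the default additive scores (0, 1, 2) are assumptions made for this example; passing scores (0, 0, 1) or (0, 1, 1) would give the recessive or dominant versions.

```python
import math


def cochran_armitage_trend(cases, controls, scores=(0.0, 1.0, 2.0)):
    """Return (z, two-sided p) for the Cochran-Armitage trend test.

    cases, controls: genotype counts ordered (aa, Aa, AA); this ordering is
    an assumption of the sketch. scores: genotype scores, additive by default.
    """
    n_cases = sum(cases)
    n_controls = sum(controls)
    n_total = n_cases + n_controls
    col_totals = [r + s for r, s in zip(cases, controls)]

    # Trend statistic: U = sum_i x_i * (S * r_i - R * s_i)
    u = sum(x * (n_controls * r - n_cases * s)
            for x, r, s in zip(scores, cases, controls))

    # Null variance: (R * S / N) * (N * sum x_i^2 n_i - (sum x_i n_i)^2)
    sx = sum(x * n for x, n in zip(scores, col_totals))
    sxx = sum(x * x * n for x, n in zip(scores, col_totals))
    var = n_cases * n_controls / n_total * (n_total * sxx - sx ** 2)

    z = u / math.sqrt(var)
    # Two-sided p value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

A call such as `cochran_armitage_trend((30, 50, 20), (50, 40, 10))` returns the z statistic and its two-sided p value; the sketch assumes both genotype columns are not degenerate (nonzero variance).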
We employed the generalized sequential Bonferroni (GSB) procedure of Holm to incorporate information from a Hardy-Weinberg disequilibrium (HWD) test into the CA trend test based on estimating weights from the p values of the HWD test. We proposed to smooth the weights to reduce possible noise.
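The abstract does not say how the weights are estimated or smoothed, so the sketch below only illustrates the general shape of a weighted sequentially rejective Bonferroni (weighted Holm) procedure. The function name and the idea of passing precomputed positive weights are assumptions for illustration; in the paper the weights are derived from the HWD test p values.

```python
def weighted_holm(p_values, weights, alpha=0.05):
    """Weighted sequentially rejective Bonferroni (weighted Holm) procedure.

    p_values: p values of the primary tests (here, trend tests).
    weights:  positive weights, one per hypothesis (hypothetical input; the
              paper estimates them from HWD p values, which is not shown).
    Returns a list of booleans marking the rejected hypotheses.
    """
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")

    # Visit hypotheses in increasing order of the weighted p value p_i / w_i.
    order = sorted(range(len(p_values)),
                   key=lambda i: p_values[i] / weights[i])
    remaining_weight = sum(weights)
    rejected = [False] * len(p_values)

    for i in order:
        # Reject while p_i / w_i <= alpha / (total weight still in play).
        if p_values[i] / weights[i] <= alpha / remaining_weight:
            rejected[i] = True
            remaining_weight -= weights[i]
        else:
            break  # stop at the first non-rejection
    return rejected
```

With equal weights this reduces to the ordinary Holm procedure; larger weights make the corresponding hypotheses easier to reject at the expense of the others.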
Results and Conclusions
Results from extensive simulation studies showed that the proposed GSB procedure can achieve the goal described above.
Generalized sequential Bonferroni procedure; Genome-wide association studies; Hardy-Weinberg disequilibrium; Multiple testing; Smoothed weights
Androgens and inflammation have been implicated in the etiology of several cancers, including prostate cancer. Serum androgens have been shown to correlate with markers of inflammation and expression of inflammation-related genes.
In this report, we evaluated associations between 9,932 single nucleotide polymorphisms (SNPs) marking common genetic variants in 774 inflammation-related genes and four serum androgen levels (total testosterone [T], bioavailable T [BioT]; 5α-androstane-3α, 17β-diol glucuronide [3αdiol G], and 4-Androstene-3,17-dione [androstenedione]), in 560 healthy men (median age 64 years) drawn from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. Baseline serum androgens were measured by radioimmunoassay. Genotypes were determined as part of the Cancer Genetic Markers of Susceptibility Study genome-wide scan. SNP-hormone associations were evaluated using linear regression of hormones adjusted for age. Gene-based p-values were generated using an adaptive rank truncated product method.
Suggestive associations were observed for two inflammation-related genes and circulating androgen levels (false discovery rate [FDR] q-value<0.1) in both SNP and gene-based tests. Specifically, T was associated with common variants in MMP2 and CD14, with the most significant SNPs being rs893226G>T in MMP2 and rs3822356T>C in CD14 (FDR q-value=0.09 for both SNPs). Other genes implicated in either SNP or gene-based tests were IK with T and BioT, PRG2 with T, and TNFSF9 with androstenedione.
These results suggest possible cross-talk between androgen levels and inflammation pathways, but larger studies are needed to confirm these findings and to further clarify the interrelationship between inflammation and androgens and their effects on cancer risk.
Inflammation; Androgens; Genes; Testosterone; Polymorphism; Single Nucleotide
Host genetic factors might affect the risk of progression from infection with carcinogenic human papillomavirus (HPV), the etiologic agent for cervical cancer, to persistent HPV infection, and hence to cervical precancer and cancer.
We assessed 18,310 tag single nucleotide polymorphisms (SNPs) from 1113 genes in 416 cervical intraepithelial neoplasia 3 (CIN3)/cancer cases, 356 women with persistent carcinogenic HPV infection (median persistence of 25 months) and 425 randomly selected women (non-cases and non-HPV persistent) from the 10,049 women from the Guanacaste, Costa Rica HPV natural history cohort. For gene and SNP associations, we computed age-adjusted odds ratio and p-trend. Three comparisons were made: 1) association with CIN3/cancer (compared CIN3/cancer cases to random controls), 2) association with persistence (compared HPV persistence to random controls), and 3) progression (compared CIN3/cancers with HPV-persistent group). Regions statistically significantly associated with CIN3/cancer included genes for peroxiredoxin 3 PRDX3, and ribosomal protein S19 RPS19. The single most significant SNPs from each gene associated with CIN3/cancer were PRDX3 rs7082598 (Ptrend<0.0001), and RPS19 rs2305809 (Ptrend=0.0007), respectively. Both SNPs were also associated with progression.
These data suggest involvement of two genes, RPS19 and PRDX3, or other SNPs in linkage disequilibrium, with cervical cancer risk. Further investigation showed that they may be involved in both the persistence and progression transition stages. Our results require replication but, if true, suggest a role for ribosomal dysfunction, mitochondrial processes, and/or oxidative stress, or other unknown function of these genes in cervical carcinogenesis.
Genome-wide association studies have identified 8q24 region variants as risk factors for prostate cancer. In the Agricultural Health Study, a prospective study of licensed pesticide applicators, we observed increased prostate cancer risk with specific pesticide use among those with a family history of prostate cancer. Thus, we evaluated the interaction between pesticide use, 8q24 variants and prostate cancer risk. We estimated odds ratios (ORs) and 95% confidence intervals (CIs) for interactions between 211 8q24 variants, 49 pesticides and prostate cancer risk in 776 cases and 1,444 controls. The ORs for a previously identified variant, rs4242382, and prostate cancer increased significantly (p<0.05) with exposure to the organophosphate insecticide fonofos after correction for multiple testing: per allele ORnonexposed=1.17 (95% CI: 0.93, 1.48), per allele ORlow=1.30 (95% CI: 0.75, 2.27), per allele ORhigh=4.46 (95% CI: 2.17, 9.17), p-interaction=0.002, adjusted p-interaction=0.02. Similar effect modification was observed for three other organophosphate insecticides (coumaphos, terbufos, and phorate) and one pyrethroid insecticide, permethrin. Among ever users of fonofos, subjects with 3 or 4 risk alleles at rs7837328 and rs4242382 had approximately 3 times the risk of prostate cancer (OR=3.14, 95% CI: 1.41, 7.00) compared with subjects who had zero risk alleles and never used fonofos. We observed a significant interaction between variants on chromosome 8q24, pesticide use, and risk of prostate cancer. Insecticides, particularly organophosphates, were the strongest modifiers of risk, although the biologic mechanism is unclear. This is the first report of effect modification between 8q24 and an environmental exposure on prostate cancer risk.
Prostate cancer; pesticides; 8q24; single nucleotide polymorphism; interaction
Genome-wide association studies have identified multiple independent regions on chromosome 8q24 that are associated with cancers of the prostate, breast, colon, and bladder.
To investigate their biological basis, we examined the possible association between 164 single nucleotide polymorphisms (SNPs) in the 8q24 risk regions, spanning 128,101,433–128,828,043 bp, and serum androgen (testosterone, androstenedione, 3αdiol G, and bioavailable testosterone) and sex hormone-binding globulin levels in 563 healthy, non-Hispanic, Caucasian men (55–74 years old) from a prospective cohort study, the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Age-adjusted linear regression models were used to determine the association between the SNPs in an additive genetic model and log-transformed biomarker levels.
Three adjacent SNPs centromeric to prostate cancer risk-region 2 (rs12334903, rs1456310, and rs980171) were associated with testosterone (P<1.1×10−3) and bioavailable testosterone (P<6.3×10−4). Suggestive associations were seen for a cluster of 9 SNPs in prostate cancer risk region 1 and androstenedione (P<0.05).
These preliminary findings require confirmation in larger studies, but raise the intriguing hypothesis that genetic variations in the 8q24 cancer risk regions may correlate with androgen levels.
These results may provide some clues for the strong link between 8q24 and prostate cancer risk.
8q24; genetic polymorphisms; serum androgens
For comparison of multiple outcomes commonly encountered in biomedical research, Huang et al. (2005) improved O’Brien’s (1984) rank-sum tests through the replacement of the ad hoc variance by the asymptotic variance of the test statistics. The improved tests control the Type I error rate at the desired level and gain power when the differences between the two comparison groups in each outcome variable are in the same direction. However, they may lose power when the differences are in different directions (e.g., some are positive and some are negative). These tests and the popular Bonferroni correction failed to show an important significant difference when applied to compare heart rates from a clinical trial evaluating the effect of a procedure to remove the cardioprotective solution HTK. We propose an alternative test statistic, the maximum of the individual rank-sum statistics, which controls the Type I error rate and maintains satisfactory power regardless of the directions of the differences. Simulation studies show that the proposed test has higher power than other tests in certain regions of the alternative parameter space of interest. Furthermore, when used to analyze the heart rate data, the proposed test yields more satisfactory results.
Autism spectrum disorder; Behrens-Fisher problem; Cardioprotective solution; Case-control studies; Growth hormones; Multiple outcomes; Non-parametrics; Rank-sum statistics
Background. Estrogen plays a major role in endometrial carcinogenesis, suggesting that common variants of genes in the sex hormone metabolic pathway may be related to endometrial cancer risk. In support of this view, variants in CYP19A1 [cytochrome P450 (CYP), family 19, subfamily A, polypeptide 1] have been associated with both circulating estrogen levels and endometrial cancer risk. Associations with variants in other genes have been suggested, but findings have been inconsistent. Methods. We examined 36 sex hormone-related genes using a tagging approach in a population-based case–control study of 417 endometrial cancer cases and 407 controls conducted in Poland. We evaluated common variation in these genes in relation to endometrial cancer risk using sequential haplotype scan, variable-sized sliding window and adaptive rank-truncated product (ARTP) methods. Results. In our case–control study, the strongest association with endometrial cancer risk was for AR (androgen receptor; ARTP P = 0.006). Multilocus analyses also identified boundaries for a region of interest in AR and in CYP19A1 around a previously identified susceptibility locus. We did not find evidence for consistent associations between previously reported candidate single-nucleotide polymorphisms in this pathway and endometrial cancer risk. Discussion. In summary, we identified regions in AR and CYP19A1 that are of interest for further evaluation in relation to endometrial cancer risk in future haplotype and subsequent fine mapping studies in larger study populations.
Most current genetic association studies, including genome-wide association studies, look for the single nucleotide polymorphisms (SNPs) with a relatively large minor allele frequency (MAF) (e.g. >5%) in the search for genetic loci underlying the susceptibility for complex diseases. The strategy of focusing on common SNPs in genetic association studies is very effective under the common-disease-common-variant (CDCV) hypothesis, which claims that common diseases are caused by common variants that have relatively small to moderate effects. Although the CDCV hypothesis has become the dogma guiding the conduct of association studies over the past decade, growing evidence from recent empirical data and simulations suggests that the causal genetic polymorphisms, including SNPs and copy number variants (CNVs), for common diseases have a wide spectrum of MAFs, ranging from rare to common. Unlike the analysis for common genetic variants, statistical approaches for the analysis of rare variants receive very little attention. Methods developed for common variants usually rely on their asymptotic properties, which can be inaccurate for the study of the rare variants with limited sample size. Although Fisher's exact test can be used for such a scenario, it is usually conservative and thus its usefulness is diminished to some extent. Here we propose two novel approaches for the analysis of rare genetic variants. Simulation studies and two real examples demonstrate the advantages of the proposed methods over the existing methods.
Association test; CDRV; Rare polymorphisms
Biliary tract cancers, encompassing cancers of the gallbladder, extrahepatic bile ducts, and ampulla of Vater, are rare but highly fatal. Gallstones represent the major risk factor for biliary tract cancer, and share with gallbladder cancer a female predominance and an association with reproductive factors and obesity. While estrogens have been implicated in earlier studies of gallbladder cancer, there are no data on the role of androgens. Since intracellular androgen activity is mediated through the androgen receptor (AR), we examined associations between AR CAG repeat length [(CAG)n] and the risk of biliary tract cancers and stones in a population-based study of 331 incident cancer cases, 837 gallstone cases, and 750 controls from Shanghai, China, where the incidence rates for biliary tract cancer are rising sharply. Men with (CAG)n>24 had a significant 2-fold risk of gallbladder cancer (odds ratio [OR]=2.00; 95% confidence interval [CI] 1.07–3.73), relative to those with (CAG)n≤22. In contrast, women with (CAG)n>24 had reduced gallbladder cancer risk (OR=0.69, 95% CI 0.43–1.09) relative to those with (CAG)n≤22; P-interaction sex=0.01), which was most pronounced for women aged 68–74 (OR=0.48, 95% CI 0.25–0.93; P-interaction age=0.02). No associations were found for bile duct cancer or gallstones. Reasons for the heterogeneity of genetic effects by gender and age are unclear but may reflect an interplay between AR and the levels of androgen as well as estrogen in men and older women. Further studies are needed to confirm these findings and clarify the mechanisms involved.
Biliary Tract Cancer; Gallstones; Androgen Receptor; Gallbladder Neoplasms
The cost-efficient two-stage design is often used in genome-wide association studies (GWASs) to search for genetic loci underlying susceptibility to complex diseases. Replication-based analysis, which considers data from each stage separately, often suffers from loss of efficiency. Joint tests that combine data from both stages have been proposed and are widely used to improve efficiency. However, existing joint analyses are based on test statistics derived under an assumed genetic model, and thus might not perform robustly when the assumed genetic model is not appropriate.
In this paper, we propose joint analyses based on two robust tests, MERT and MAX3, for GWASs under a two-stage design. We developed computationally efficient procedures and formulas for significance level evaluation and power calculation. The performance of the proposed approaches is investigated through extensive simulation studies and a real example. Numerical results show that the joint analysis based on the MAX3 test statistic has the best overall performance.
MAX3 joint analysis is the most robust procedure among the considered joint analyses, and we recommend using it in a two-stage genome-wide association study.
Large comparative clinical trials usually target a wide range of patient populations in which subgroups exist according to certain patient characteristics. Often, scientific knowledge or existing empirical data support the assumption that patients’ improvement is larger in certain subgroups than in others. Such information can be used to design a more cost-effective clinical trial.
The goal of the article is to use such information to design a more cost-effective clinical trial.
A two-stage sample-enrichment design strategy is proposed that begins with enrollment from certain subgroup of patients and allows the trial to be terminated for futility in that subgroup.
Simulation studies show that the two-stage sample-enrichment strategy is cost-effective if indeed the null hypothesis of no treatment improvement is true, as is also illustrated with data from a completed trial of calcium to prevent preeclampsia.
Feasibility of the proposed enrichment design relies on the knowledge prior to the start of the trial that certain patients can benefit more than others from the treatment.
The two-stage sample-enrichment approach borrows strength from treatment heterogeneity among target patients in a large-scale comparative clinical trial, and is more cost-effective if the treatments do not differ.
Sample size and power; stopping for futility; subgroup analysis; treatment heterogeneity
In the genetic study of complex traits, especially behavior related ones, such as smoking and alcoholism, usually several phenotypic measurements are obtained for the description of the complex trait, but no single measurement can quantify fully the complicated characteristics of the symptom because of our lack of understanding of the underlying etiology. If those phenotypes share a common genetic mechanism, rather than studying each individual phenotype separately, it is more advantageous to analyze them jointly as a multivariate trait in order to enhance the power to identify associated genes. We propose a multilocus association test for the study of multivariate traits. The test is derived from a partially linear tree-based regression model for multiple outcomes. This novel tree-based model provides a formal statistical testing framework for the evaluation of the association between a multivariate outcome and a set of candidate predictors, such as markers within a gene or pathway, while accommodating adjustment for other covariates. Through simulation studies we show that the proposed method has an acceptable type I error rate and improved power over the univariate outcome analysis, which studies each component of the complex trait separately with multiple-comparison adjustment. A candidate gene association study of multiple smoking-related phenotypes is used to demonstrate the application and advantages of this new method. The proposed method is general enough to be used for the assessment of the joint effect of a set of multiple risk factors on a multivariate outcome in other biomedical research settings.
Generalized estimating equation; Genetic association study; Model selection; Multiple-comparison adjustment; Tree-based model
The primary circulating form of vitamin D, 25-hydroxy-vitamin D [25(OH)D], is associated with multiple medical outcomes, including rickets, osteoporosis, multiple sclerosis and cancer. In a genome-wide association study (GWAS) of 4501 persons of European ancestry drawn from five cohorts, we identified single-nucleotide polymorphisms (SNPs) in the gene encoding group-specific component (vitamin D binding) protein, GC, on chromosome 4q12-13 that were associated with 25(OH)D concentrations: rs2282679 (P = 2.0 × 10−30), in linkage disequilibrium (LD) with rs7041, a non-synonymous SNP (D432E; P = 4.1 × 10−22) and rs1155563 (P = 3.8 × 10−25). Suggestive signals for association with 25(OH)D were also observed for SNPs in or near three other genes involved in vitamin D synthesis or activation: rs3829251 on chromosome 11q13.4 in NADSYN1 [encoding nicotinamide adenine dinucleotide (NAD) synthetase; P = 8.8 × 10−7], which was in high LD with rs1790349, located in DHCR7, the gene encoding 7-dehydrocholesterol reductase that synthesizes cholesterol from 7-dehydrocholesterol; rs6599638 in the region harboring the open-reading frame 88 (C10orf88) on chromosome 10q26.13 in the vicinity of ACADSB (acyl-Coenzyme A dehydrogenase), involved in cholesterol and vitamin D synthesis (P = 3.3 × 10−7); and rs2060793 on chromosome 11p15.2 in CYP2R1 (cytochrome P450, family 2, subfamily R, polypeptide 1, encoding a key C-25 hydroxylase that converts vitamin D3 to an active vitamin D receptor ligand; P = 1.4 × 10−5). We genotyped SNPs in these four regions in 2221 additional samples and confirmed strong genome-wide significant associations with 25(OH)D through meta-analysis with the GWAS data for GC (P = 1.8 × 10−49), NADSYN1/DHCR7 (P = 3.4 × 10−9) and CYP2R1 (P = 2.9 × 10−17), but not C10orf88 (P = 2.4 × 10−5).
HPV infrequently persists and progresses to cervical cancer. We examined host genetic factors hypothesized to play a role in determining which subset of individuals infected with oncogenic human papillomavirus (HPV) have persistent infection and further develop cervical pre-cancer/cancer compared to the majority of infected individuals who will clear infection.
We evaluated 7140 tag single nucleotide polymorphisms (SNPs) from 305 candidate genes hypothesized to be involved in DNA repair, viral infection and cell entry in 416 cervical intraepithelial neoplasia 3 (CIN3)/cancer cases, 356 HPV persistent women (median: 25 months), and 425 random controls (RC) from the 10,049 women Guanacaste Costa Rica Natural History study. We used logistic regression to compute odds ratios and p-trend for CIN3/cancer and HPV persistence in relation to SNP genotypes and haplotypes (adjusted for age). We obtained pathway and gene-level summary of associations by computing the adaptive combination of p-values. Genes/regions statistically significantly associated with CIN3/cancer included the viral infection and cell entry genes 2′,5′ oligoadenylate synthetase gene 3 (OAS3), sulfatase 1 (SULF1), and interferon gamma (IFNG); the DNA repair genes deoxyuridine triphosphate (DUT), dosage suppressor of mck 1 homolog (DMC1), and general transcription factor IIH, polypeptide 3 (GTF2H4); and the EVER1 and EVER2 genes (p<0.01). From each region, the single most significant SNPs associated with CIN3/cancer were OAS3 rs12302655, SULF1 rs4737999, IFNG rs11177074, DUT rs3784621, DMC1 rs5757133, GTF2H4 rs2894054, and EVER1/EVER2 rs9893818 (p-trends≤0.001). SNPs for OAS3, SULF1, DUT, and GTF2H4 were associated with HPV persistence whereas IFNG and EVER1/EVER2 SNPs were associated with progression to CIN3/cancer. We note that the associations observed were less than two-fold. We identified variations in DNA repair and viral binding and cell entry genes associated with CIN3/cancer. Our results require replication but suggest that different genes may be responsible for modulating risk in the two critical transition steps important for cervical carcinogenesis: HPV persistence and disease progression.
Although several genes (including a strong effect in the human leukocyte antigen (HLA) region) and some environmental factors have been implicated to cause susceptibility to rheumatoid arthritis (RA), the etiology of the disease is not completely understood. The ability to screen the entire genome for association to complex diseases has great potential for identifying gene effects. However, the efficiency of gene detection in this situation may be improved by methods specifically designed for high-dimensional data. The aim of this study was to compare how three different statistical approaches, multifactor dimensionality reduction (MDR), random forests (RF), and an omnibus approach, worked in identifying gene effects (including gene-gene interaction) associated with RA. We developed a test set of genes based on previous linkage and association findings and tested all three methods. In the presence of the HLA shared-epitope factor, other genes showed weaker effects. All three methods detected SNPs in PTPN22 and TRAF1-C5 as being important. But we did not detect any new genes in this study. We conclude that the three high-dimensional methods are useful as an initial screening for gene associations to identify promising genes for further modeling and additional replication studies.
It is increasingly recognized that pathway analyses—a joint test of association between the outcome and a group of single nucleotide polymorphisms (SNPs) within a biological pathway—could potentially complement single-SNP analysis and provide additional insights for the genetic architecture of complex diseases. Building upon existing P-value combining methods, we propose a class of highly flexible pathway analysis approaches based on an adaptive rank truncated product (ARTP) statistic that can effectively combine evidence of associations over different SNPs and genes within a pathway. The statistical significance of the pathway-level test-statistics is evaluated using a highly efficient permutation algorithm that remains computationally feasible irrespective of the size of the pathway and complexity of the underlying test-statistics for summarizing SNP- and gene-level associations. We demonstrate through simulation studies that a gene-based analysis, that treats the underlying genes, as opposed to the underlying SNPs, as the basic units for hypothesis testing, is a very robust and powerful approach to pathway-based association testing. We also illustrate the advantage of the proposed methods using a study of the association between the nicotinic receptor pathway and cigarette smoking behaviors.
Pathway analysis; genetic association study; permutation procedure
For comparing the distribution of two samples with multiple endpoints, O’Brien (1984) proposed rank-sum-type test statistics. Huang et al. (2005) extended these statistics to the general nonparametric Behrens-Fisher hypothesis problem and obtained improved test statistics by replacing the ad hoc variance with the asymptotic variance of the rank-sum statistics. In this paper we generalize the work of O’Brien (1984) and Huang et al. (2005) and propose a weighted rank-sum statistic. We show that the weighted rank-sum statistic is asymptotically normally distributed, permitting the computation of power, p-values and confidence intervals. We further demonstrate via simulation that the weighted rank-sum statistic is efficient in controlling the type I error rate and, under certain alternatives, is more powerful than the statistics of O’Brien (1984) and Huang et al. (2005).
Asymptotic normality; Behrens-Fisher problem; Case-Control; Clinical trials; Multiple endpoints; Rank-sum statistics; Weights
Population stratification (PS) can lead to an inflated rate of false positive findings in genome-wide association studies (GWAS). A commonly used approach is to adjust for a fixed number of principal components (PCs) in GWAS, but this approach could have a deleterious impact on power when the cases and controls are equally distributed along selected PCs, or when certain covariates already included in the association analyses, such as self-identified ethnicity or recruitment center, correctly map to major axes of genetic heterogeneity. We propose a computationally efficient procedure, PC-Finder, to identify a minimal set of PCs while permitting an effective correction for PS. A general pseudo F statistic, derived from a non-parametric multivariate regression model, can be used to assess whether PS exists or has been adequately corrected by a set of selected PCs. Empirical data from two GWAS conducted as part of the Cancer Genetic Markers of Susceptibility (CGEMS) project demonstrate the application of the procedure. Furthermore, simulation studies show the power advantage of the proposed procedure in GWAS over currently used PS correction strategies, particularly when the PCs with substantial genetic variation are distributed similarly in cases and controls and therefore do not induce PS.
Determination of the relevance of both demanding classical epidemiologic criteria for control selection and robust handling of population stratification (PS) represents a major challenge in the design and analysis of genome-wide association studies (GWAS). Empirical data from two GWAS in European Americans of the Cancer Genetic Markers of Susceptibility (CGEMS) project were used to evaluate the impact of PS in studies with different control selection strategies. In each of the two original case-control studies nested in corresponding prospective cohorts, a minor confounding effect due to PS (inflation factor λ of 1.025 and 1.005) was observed. In contrast, when the control groups were exchanged to mimic a cost-effective but theoretically less desirable control selection strategy, the confounding effects were larger (λ of 1.090 and 1.062). A panel of 12,898 autosomal SNPs common to both the Illumina and Affymetrix commercial platforms and with low local background linkage disequilibrium (pair-wise r2<0.004) was selected to infer population substructure with principal component analysis. A novel permutation procedure was developed for the correction of PS that identified a smaller set of principal components and achieved a better control of type I error (to λ of 1.032 and 1.006, respectively) than currently used methods. The overlap between sets of SNPs in the bottom 5% of p-values based on the new test and the test without PS correction was about 80%, with the majority of discordant SNPs having both ranks close to the threshold. Thus, for the CGEMS GWAS of prostate and breast cancer conducted in European Americans, PS does not appear to be a major problem in well-designed studies. A study using suboptimal controls can have acceptable type I error when an effective strategy for the correction of PS is employed.
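The inflation factor λ reported above is commonly computed as the median of the observed 1-degree-of-freedom chi-square statistics divided by the median of the chi-square distribution with 1 degree of freedom (about 0.455). The sketch below assumes that definition and assumes the input is a list of per-SNP p values strictly between 0 and 1 (a p value of exactly 1 is also accepted); the function name is hypothetical and this is not the permutation procedure described in the abstract.

```python
from statistics import NormalDist, median

# Median of a chi-square(1) variable: the squared 75th percentile of N(0, 1).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2  # about 0.4549


def inflation_factor(p_values):
    """Estimate the genomic-control inflation factor from two-sided p values.

    Each p value is mapped back to a 1-df chi-square statistic through the
    normal quantile function, and the observed median is compared with the
    median expected under the null hypothesis.
    """
    nd = NormalDist()
    chi2_stats = [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

Values near 1 indicate little inflation of the test statistics; values above 1 indicate that the p values are smaller than expected under the null, as can happen with uncorrected population stratification.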