Related Articles

1.  Adjusting for Population Stratification in a Fine Scale with Principal Components and Sequencing Data 
Genetic epidemiology  2013;37(8):10.1002/gepi.21764.
Population stratification is of primary interest in genetic studies to infer human evolution history and to avoid spurious findings in association testing. Although it is well studied with high-density single nucleotide polymorphisms (SNPs) in genome-wide association studies (GWASs), next-generation sequencing brings both new opportunities and challenges to uncovering population structures in finer scales. Several recent studies have noticed different confounding effects from variants of different minor allele frequencies (MAFs). In this paper, using a low-coverage sequencing dataset from the 1000 Genomes Project, we compared a popular method, principal component analysis (PCA), with a recently proposed spectral clustering technique, called spectral dimensional reduction (SDR), in detecting and adjusting for population stratification at the level of ethnic subgroups. We investigated the varying performance of adjusting for population stratification with different types and sets of variants when testing on different types of variants. One main conclusion is that principal components based on all variants or common variants were generally most effective in controlling inflations caused by population stratification; in particular, contrary to many speculations on the effectiveness of rare variants, we did not find much added value with the use of only rare variants. In addition, SDR was confirmed to be more robust than PCA, especially when applied to rare variants.
doi:10.1002/gepi.21764
PMCID: PMC3864649  PMID: 24123217
1000 Genomes Project; Association testing; Common variants; Principal component analysis; Rare variants; Spectral analysis
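The PCA adjustment described in the abstract above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the simulated genotypes, and all parameter values are assumptions made purely for illustration. It standardizes a matrix of 0/1/2 allele counts, takes the top principal component scores from a singular value decomposition, and checks how strongly the first component tracks a simulated subpopulation label.

```python
import numpy as np

def top_principal_components(genotypes, k=2):
    """Scores of the top-k PCs for an (individuals x SNPs) matrix of 0/1/2 allele counts."""
    G = np.asarray(genotypes, dtype=float)
    G = G - G.mean(axis=0)               # center each SNP
    sd = G.std(axis=0)
    keep = sd > 0                        # drop monomorphic SNPs
    G = G[:, keep] / sd[keep]            # scale each SNP to unit variance
    U, S, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] * S[:k]              # PC scores, one row per individual

def simulate_two_populations(n_per_pop=50, n_snps=500, seed=0):
    """Toy data (hypothetical): two subpopulations with differing allele frequencies."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
    geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    labels = np.repeat([0, 1], n_per_pop)
    return np.vstack([geno_a, geno_b]), labels

genotypes, labels = simulate_two_populations()
pcs = top_principal_components(genotypes, k=2)
# In an association model, the columns of `pcs` would be added as covariates.
corr = np.corrcoef(pcs[:, 0], labels)[0, 1]
print(f"|corr(PC1, subpopulation)| = {abs(corr):.2f}")
```

In this toy setting the structure is deliberately strong, so the first component should separate the two groups almost perfectly. Real sequencing data, especially rare variants, are far noisier, which is the situation the abstract examines.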
2.  Conditional Random Fields for Fast, Large-Scale Genome-Wide Association Studies 
PLoS ONE  2011;6(7):e21591.
Understanding the role of genetic variation in human diseases remains an important problem to be solved in genomics. An important component of such variation consists of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs to phenotypes has been confounded by hidden factors such as the presence of population structure, family structure or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors, including linear mixed-effect models (LMMs) and methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. Our method scales in runtime quadratic in the number of individuals being studied with only a modest loss in statistical power as compared to LMM-based and PCA-based methods when testing on synthetic data that was generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime relative to LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science.
doi:10.1371/journal.pone.0021591
PMCID: PMC3134455  PMID: 21765897
3.  Principal-component-based population structure adjustment in the North American Rheumatoid Arthritis Consortium data: impact of single-nucleotide polymorphism set and analysis method 
BMC Proceedings  2009;3(Suppl 7):S108.
Population structure occurs when a sample is composed of individuals with different ancestries and can result in excess type I error in genome-wide association studies. Genome-wide principal-component analysis (PCA) has become a popular method for identifying and adjusting for subtle population structure in association studies. Using the Genetic Analysis Workshop 16 (GAW16) NARAC data, we explore two unresolved issues concerning the use of genome-wide PCA to account for population structure in genetic association studies: the choice of single-nucleotide polymorphism (SNP) subset and the choice of adjustment model. We computed PCs for subsets of genome-wide SNPs with varying levels of LD. The first two PCs were similar for all subsets and the first three PCs were associated with case status for all subsets. When the PCs associated with case status were included as covariates in an association model, the reduction in genomic inflation factor was similar for all SNP sets. Several models have been proposed to account for structure using PCs, but it is not yet clear whether the different methods will result in substantively different results for association studies with individuals of European descent. We compared genome-wide association p-values and results for two positive-control SNPs previously associated with rheumatoid arthritis using four PC adjustment methods as well as no adjustment and genomic control. We found that in this sample, adjusting for the continuous PCs or adjusting for discrete clusters identified using the PCs adequately accounts for the case-control population structure, but that a recently proposed randomization test performs poorly.
PMCID: PMC2795879  PMID: 20017972
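The genomic inflation factor mentioned in the abstract above is conventionally computed as the median observed 1-df chi-square association statistic divided by its expected median under the null (about 0.455). The sketch below shows that calculation from a list of p-values using only the Python standard library. The function name and the example p-values are assumptions for illustration, not taken from the article.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a 1-df chi-square distribution (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2

def genomic_inflation_factor(p_values):
    """Lambda: median observed 1-df chi-square statistic over its null median.

    Each p-value must lie in (0, 1]; a value of exactly 0 has no finite statistic.
    """
    # A p-value p from a 1-df chi-square test corresponds to the statistic z**2,
    # where z is the standard normal quantile at 1 - p/2.
    stats = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN

# Hypothetical inputs: evenly spread p-values (no inflation) versus p-values
# shifted toward zero (inflated test statistics).
uniform_p = [i / 1000 for i in range(1, 1000)]
inflated_p = [p ** 2 for p in uniform_p]
print(genomic_inflation_factor(uniform_p))
print(genomic_inflation_factor(inflated_p))
```

A value near 1 indicates little inflation. Values well above 1 suggest confounding such as uncorrected population structure, and the reduction in this factor after adding PCs as covariates is the comparison the abstract reports.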
4.  ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction 
BMC Bioinformatics  2013;14:61.
Background
Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification.
Results
We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual’s continental and sub-continental ancestry. To predict an individual’s continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control’s λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver.
Conclusions
ETHNOPRED is a novel technique for producing classifiers that can identify an individual’s continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.
doi:10.1186/1471-2105-14-61
PMCID: PMC3618021  PMID: 23432980
5.  Ethnicity, Body Mass, and Genome-Wide Data 
Biodemography and social biology  2010;56(2):123-136.
This article combines social and genetic epidemiology to examine the influence of self-reported ethnicity on body mass index (BMI) among a sample of adolescents and young adults. We use genetic information from more than 5,000 single nucleotide polymorphisms in combination with principal components analysis to characterize population ancestry of individuals in this study. We show that non-Hispanic white and Mexican-American respondents differ significantly with respect to BMI and differ on the first principal component from the genetic data. This first component is positively associated with BMI and accounts for roughly 3% of the genetic variance in our sample. However, after controlling for this genetic measure, the observed ethnic differences in BMI remain large and statistically significant. This study demonstrates a parsimonious method to adjust for genetic differences among individual respondents that may contribute to observed differences in outcomes. In this case, adjusting for genetic background has no bearing on the influence of self-identified ethnicity.
doi:10.1080/19485565.2010.524589
PMCID: PMC3155265  PMID: 21387985
6.  Methods for adjusting population structure and familial relatedness in association test for collective effect of multiple rare variants on quantitative traits 
BMC Proceedings  2011;5(Suppl 9):S35.
Because of the low frequency of rare genetic variants in observed data, the statistical power of detecting their associations with target traits is usually low. The collapsing test of collective effect of multiple rare variants is an important and useful strategy to increase the power; in addition, family data may be enriched with causal rare variants and therefore provide extra power. However, when family data are used, both population structure and familial relatedness need to be adjusted for the possible inflation of false positives. Using a unified mixed linear model and family data, we compared six methods to detect the association between multiple rare variants and quantitative traits. Through the analysis of 200 replications of the quantitative trait Q2 from the Genetic Analysis Workshop 17 data set simulated for 697 subjects from 8 extended families, and based on quantile-quantile plots under the null and receiver operating characteristic curves, we compared the false-positive rate and power of these methods. We observed that adjusting for pedigree-based kinship gives the best control for false-positive rate, whereas adjusting for marker-based identity by state slightly outperforms in terms of power. An adjustment based on a principal components analysis slightly improves the false-positive rate and power. Taking into account type-1 error, power, and computational efficiency, we find that adjusting for pedigree-based kinship seems to be a good choice for the collective test of association between multiple rare variants and quantitative traits using family data.
doi:10.1186/1753-6561-5-S9-S35
PMCID: PMC3287871  PMID: 22373066
7.  Marbled Inflation From Population Structure in Gene-Based Association Studies With Rare Variants 
Genetic epidemiology  2013;37(3):286-292.
Accurate genetic association studies are crucial for the detection and the validation of disease determinants. One of the main confounding factors that affect accuracy is population stratification, and great efforts have been expended over the past decade to detect and to adjust for it. We now have efficient solutions for population stratification adjustment for single-SNP (where SNP is single-nucleotide polymorphism) inference in genome-wide association studies, but it is unclear whether these solutions can be effectively applied to rare variation studies and in particular gene-based (or set-based) association methods that jointly analyze multiple rare and common variants. We examine here, both theoretically and empirically, the performance of two commonly used approaches for population stratification adjustment, genomic control and principal component analysis, when used on gene-based association tests. We show that, different from single-SNP inference, genes with diverse composition of rare and common variants may suffer from population stratification to varying extents. The inflation in gene-level statistics could be impacted by the number and the allele frequency spectrum of SNPs in the gene, and by the gene-based testing method used in the analysis. As a consequence, using a universal inflation factor as a genomic control should be avoided in gene-based inference with sequencing data. We also demonstrate that caution needs to be exercised when using principal component adjustment because the accuracy of the adjusted analyses depends on the underlying population substructure, on the way the principal components are constructed, and on the number of principal components used to recover the substructure.
doi:10.1002/gepi.21714
PMCID: PMC3716585  PMID: 23468125
sequencing studies; gene-based association test; genomic control; principal component analysis; C-alpha test; burden test
8.  Assessing the impact of global versus local ancestry in association studies 
BMC Proceedings  2009;3(Suppl 7):S107.
Background
To account for population stratification in association studies, principal-components analysis is often performed on single-nucleotide polymorphisms (SNPs) across the genome. Here, we use Framingham Heart Study (FHS) Genetic Analysis Workshop 16 data to compare the performance of local ancestry adjustment for population stratification based on principal components (PCs) estimated from SNPs in a local chromosomal region with global ancestry adjustment based on PCs estimated from genome-wide SNPs.
Methods
Standardized height residuals from unrelated adults from the FHS Offspring Cohort were averaged from longitudinal data. PCs of SNP genotype data were calculated to represent each individual's ancestry either 1) globally using all SNPs across the genome or 2) locally using SNPs in adjacent 20-Mbp regions within each chromosome. We assessed the extent to which there were differences in association studies of height depending on whether PCs for global, local, or both global and local ancestry were included as covariates.
Results
The correlations between local and global PCs were low (r < 0.12), suggesting variability between local and global ancestry estimates. Genome-wide association tests without any ancestry adjustment demonstrated an inflated type I error rate that decreased with adjustment for local ancestry, global ancestry, or both. A known spurious association was replicated for SNPs within the lactase gene, and this false-positive association was abolished by adjustment with local or global ancestry PCs.
Conclusion
Population stratification is a potential source of bias in this seemingly homogeneous FHS population. However, local and global PCs derived from SNPs appear to provide adequate information about ancestry.
PMCID: PMC2795878  PMID: 20017971
9.  Interrogating local population structure for fine mapping in genome-wide association studies 
Bioinformatics  2010;26(23):2961-2968.
Motivation: Adjustment for population structure is necessary to avoid bias in genetic association studies of susceptibility variants for complex diseases. Population structure may differ from one genomic region to another due to the variability of individual ancestry associated with migration, random genetic drift or natural selection. Current association methods for correcting population stratification usually involve adjustment of global ancestry between study subjects.
Results: We suggest interrogating local population structure for fine mapping to more accurately locate true causal genes by better adjusting for the confounding effect due to local ancestry. By extensive simulations on genome-wide datasets, we show that adjusting for global ancestry may lead to false positives when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can effectively prevent false positives due to local population structure and thus can improve fine mapping for disease gene localization. We applied the local and global adjustments to the analysis of datasets from three genome-wide association studies, including European Americans, African Americans and Nigerians. Both European Americans and African Americans demonstrate greater variability in local ancestry than Nigerians. Adjusting for local ancestry successfully eliminated the known spurious association between SNPs in the LCT gene and height due to the population structure existing in European Americans.
Contact: xiaofeng.zhu@case.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq560
PMCID: PMC2982153  PMID: 20889494
10.  Clustering by genetic ancestry using genome-wide SNP data 
BMC Genetics  2010;11:108.
Background
Population stratification can cause spurious associations in a genome-wide association study (GWAS), and occurs when differences in allele frequencies of single nucleotide polymorphisms (SNPs) are due to ancestral differences between cases and controls rather than the trait of interest. Principal components analysis (PCA) is the established approach to detect population substructure using genome-wide data and to adjust the genetic association for stratification by including the top principal components in the analysis. An alternative solution is genetic matching of cases and controls that requires, however, well defined population strata for appropriate selection of cases and controls.
Results
We developed a novel algorithm to cluster individuals into groups with similar ancestral backgrounds based on the principal components computed by PCA. We demonstrate the effectiveness of our algorithm in real and simulated data, and show that matching cases and controls using the clusters assigned by the algorithm substantially reduces population stratification bias. Through simulation we show that the power of our method is higher than adjustment for PCs in certain situations.
Conclusions
In addition to reducing population stratification bias and improving power, matching creates a clean dataset free of population stratification which can then be used to build prediction models without including variables to adjust for ancestry. The cluster assignments also allow for the estimation of genetic heterogeneity by examining cluster specific effects.
doi:10.1186/1471-2156-11-108
PMCID: PMC3018397  PMID: 21143920
11.  Choice of population structure informative principal components for adjustment in a case-control study 
BMC Genetics  2011;12:64.
Background
There are many ways to perform adjustment for population structure. It remains unclear what the optimal approach is and whether the optimal approach varies by the type of samples and substructure present. The simplest and most straightforward approach is to adjust for the continuous principal components (PCs) that capture ancestry. Through simulation, we explored the issue of which ancestry informative PCs should be adjusted for in an association model to control for the confounding nature of population structure while maintaining maximum power. A thorough examination of selecting PCs for adjustment in a case-control study across the possible structure scenarios that could occur in a genome-wide association study has not been previously reported.
Results
We found that when the SNP and phenotype frequencies do not vary over the sub-populations, all methods of selection provided similar power and appropriate Type I error for association. When the SNP is not structured and the phenotype has large structure, then selection methods that do not select PCs for inclusion as covariates generally provide the most power. When there is a structured SNP and a non-structured phenotype, selection methods that include PCs in the model have greater power. When both the SNP and the phenotype are structured, all methods of selection have similar power.
Conclusions
Standard practice is to include a fixed number of PCs in genome-wide association studies. Based on our findings, we conclude that if power is not a concern, then selecting the same set of top PCs for adjustment for all SNPs in logistic regression is a strategy that achieves appropriate Type I error. However, standard practice is not optimal in all scenarios and to optimize power for structured SNPs in the presence of unstructured phenotypes, PCs that are associated with the tested SNP should be included in the logistic model.
doi:10.1186/1471-2156-12-64
PMCID: PMC3150322  PMID: 21771328
12.  Matching on Race and Ethnicity in Case-Control Studies as a Means of Control for Population Stratification 
Some investigators argue that controlling for self-reported race or ethnicity, either in statistical analysis or in study design, is sufficient to mitigate unwanted influence from population stratification. In this report, we evaluated the effectiveness of a study design involving matching on self-reported ethnicity and race in minimizing bias due to population stratification within an ethnically admixed population in California. We estimated individual genetic ancestry using structured association methods and a panel of ancestry informative markers, and observed no statistically significant difference in distribution of genetic ancestry between cases and controls (P=0.46). Stratification by Hispanic ethnicity showed similar results. We evaluated potential confounding by genetic ancestry after adjustment for race and ethnicity for 1260 candidate gene SNPs, and found no major impact (>10%) on risk estimates. In conclusion, we found no evidence of confounding of genetic risk estimates by population substructure using this matched design. Our study provides strong evidence supporting the race- and ethnicity-matched case-control study design as an effective approach to minimizing systematic bias due to differences in genetic ancestry between cases and controls.
doi:10.4172/2161-1165.1000101
PMCID: PMC3966291
Population stratification; Genetic susceptibility; Case-control; Matching
13.  Effect of population stratification analysis on false-positive rates for common and rare variants 
BMC Proceedings  2011;5(Suppl 9):S116.
Principal components analysis (PCA) has been successfully used to correct for population stratification in genome-wide association studies of common variants. However, rare variants also have a role in common disease etiology. Whether PCA successfully controls population stratification for rare variants has not been addressed. Thus we evaluate the effect of population stratification analysis on false-positive rates for common and rare variants at the single-nucleotide polymorphism (SNP) and gene level. We use the simulation data from Genetic Analysis Workshop 17 and compare false-positive rates with and without PCA at the SNP and gene level. We found that SNPs’ minor allele frequency (MAF) influenced the ability of PCA to effectively control false discovery. Specifically, PCA reduced false-positive rates more effectively in common SNPs (MAF > 0.05) than in rare SNPs (MAF < 0.01). Furthermore, at the gene level, although false-positive rates were reduced, power to detect true associations was also reduced using PCA. Taken together, these results suggest that sequence-level data should be interpreted with caution, because extremely rare SNPs may exhibit sporadic association that is not controlled using PCA.
doi:10.1186/1753-6561-5-S9-S116
PMCID: PMC3287840  PMID: 22373282
14.  Allowing for Population Stratification in Association Analysis 
In genetic association studies, it is necessary to correct for population structure to avoid inference bias. During the past decade, prevailing corrections often only involved adjustments of global ancestry differences between sampled individuals. Nevertheless, population structure may vary across local genomic regions due to the variability of local ancestries associated with natural selection, migration, or random genetic drift. Adjusting for global ancestry alone may be inadequate when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can more effectively prevent false-positives due to local population structure. To more accurately locate disease genes, we recommend adjusting for local ancestries by interrogating local structure. In practice, locus-specific ancestries are usually unknown and cannot be accurately inferred when ancestral population information is not available. For such scenarios, we propose employing local principal components (PC) to represent local ancestries and adjusting for local PCs when testing for genotype–phenotype association. With an acceptable computation burden, the proposed algorithm successfully eliminates the known spurious association between SNPs in the LCT gene and height due to the population structure in European Americans.
doi:10.1007/978-1-61779-555-8_21
PMCID: PMC3589145  PMID: 22307710
Genome-wide association studies; Local ancestries; Local principal components; Migration; Random genetic drift; Natural selection; Genomic inflation factor; Genomic control; Local ancestry principal components correction; Fine mapping
15.  Does High C-reactive Protein Concentration Increase Atherosclerosis? The Whitehall II Study 
PLoS ONE  2008;3(8):e3013.
Background
C-reactive protein (CRP), a marker of systemic inflammation, is associated with risk of coronary events and sub-clinical measures of atherosclerosis. Evidence in support of this link being causal would include an association robust to adjustments for confounders (multivariable standard regression analysis) and the association of CRP gene polymorphisms with atherosclerosis (Mendelian randomization analysis).
Methodology/Principal Findings
We genotyped 3 tag single nucleotide polymorphisms (SNPs) [+1444T>C (rs1130864); +2303G>A (rs1205) and +4899T>G (rs3093077)] in the CRP gene and assessed CRP and carotid intima-media thickness (CIMT), a structural marker of atherosclerosis, in 4941 men and women aged 50–74 (mean 61) years (the Whitehall II Study). The 4 major haplotypes from the SNPs were consistently associated with CRP level, but not with other risk factors that might confound the association between CRP and CIMT. CRP, assessed both at mean age 49 and at mean age 61, was associated both with CIMT in age and sex adjusted standard regression analyses and with potential confounding factors. However, the association of CRP with CIMT attenuated to the null with adjustment for confounding factors in both prospective and cross-sectional analyses. When examined using genetic variants as the instrument for serum CRP, there was no inferred association between CRP and CIMT.
Conclusions/Significance
Both multivariable standard regression analysis and Mendelian randomization analysis suggest that the association of CRP with carotid atheroma indexed by CIMT may not be causal.
doi:10.1371/journal.pone.0003013
PMCID: PMC2507732  PMID: 18714381
16.  Self-reported Ethnicity, Genetic Structure and the Impact of Population Stratification in a Multiethnic Study 
Human genetics  2010;128(2):165-177.
It is well-known that population substructure may lead to confounding in case-control association studies. Here, we examined genetic structure in a large racially and ethnically diverse sample consisting of 5 ethnic groups of the Multiethnic Cohort study (African Americans, Japanese Americans, Latinos, European Americans and Native Hawaiians) using 2,509 SNPs distributed across the genome. Principal component analysis on 6,213 study participants, 18 Native Americans and 11 HapMap III populations revealed 4 important principal components (PCs): the first two separated Asians, Europeans and Africans, and the third and fourth corresponded to Native American and Native Hawaiian (Polynesian) ancestry, respectively. Individual ethnic composition derived from self-reported parental information matched well to genetic ancestry for Japanese and European Americans. STRUCTURE-estimated individual ancestral proportions for African Americans and Latinos are consistent with previous reports. We quantified the East Asian (mean 27%), European (mean 27%) and Polynesian (mean 46%) ancestral proportions for the first time, to our knowledge, for Native Hawaiians. Simulations based on realistic settings of case-control studies nested in the Multiethnic Cohort found that the effect of population stratification was modest and readily corrected by adjusting for race/ethnicity or by adjusting for top PCs derived from all SNPs or from ancestry informative markers; the power of these approaches was similar when averaged across causal variants simulated based on allele frequencies of the 2,509 genotyped markers. The bias may be large in case-only analysis of gene by gene interactions but it can be corrected by top PCs derived from all SNPs.
doi:10.1007/s00439-010-0841-4
PMCID: PMC3057055  PMID: 20499252
AIMs; African American; Native Hawaiian; Latino; admixture; principal component analysis
17.  Osteopontin and Systemic Lupus Erythematosus Association: A Probable Gene-Gender Interaction 
PLoS ONE  2008;3(3):e1757.
Osteopontin (SPP1) is an important bone matrix mediator found to have key roles in inflammation and immunity. SPP1 genetic polymorphisms and increased osteopontin protein levels have been reported to be associated with SLE in small patient collections. The present study evaluates association between SPP1 polymorphisms and SLE in a large cohort of 1141 unrelated SLE patients [707 European-American (EA) and 434 African-American (AA)], and 2009 unrelated controls (1309 EA and 700 AA). Population-based case-control association analyses were performed. To control for potential population stratification, admixture adjusted logistic regression, genomic control (GC), structured association (STRAT), and principal components analysis (PCA) were applied. Combined analysis of the 2 ethnic groups showed the minor allele of 2 SNPs (rs1126616T and rs9138C) significantly associated with higher risk of SLE in males (P = 0.0005, OR = 1.73, 95% CI = 1.28–2.33), but not in females. Indeed, significant gene-gender interactions in the 2 SNPs, rs1126772 and rs9138, were detected (P = 0.001 and P = 0.0006, respectively). Further, haplotype analysis identified rs1126616T-rs1126772A-rs9138C, which demonstrated significant association with SLE in general (P = 0.02, OR = 1.30, 95% CI 1.08–1.57), especially in males (P = 0.0003, OR = 2.42, 95% CI 1.51–3.89). Subgroup analysis with single SNPs and haplotypes also identified a similar pattern of gender-specific association in AA and EA. GC, STRAT, and PCA results within each group showed consistent associations. Our data suggest SPP1 is associated with SLE, and this association is especially strong in males. To our knowledge, this report serves as the first association of a specific autosomal gene with human male lupus.
doi:10.1371/journal.pone.0001757
PMCID: PMC2258418  PMID: 18335026
18.  Principal components ancestry adjustment for Genetic Analysis Workshop 17 data 
BMC Proceedings  2011;5(Suppl 9):S66.
Statistical tests on rare variant data may well have type I error rates that differ from their nominal levels. Here, we use the Genetic Analysis Workshop 17 data to estimate type I error rates and powers of three models for identifying rare variants associated with a phenotype: (1) by using the number of minor alleles, age, and smoking status as predictor variables; (2) by using the number of minor alleles, age, smoking status, and the identity of the population of the subject as predictor variables; and (3) by using the number of minor alleles, age, smoking status, and ancestry adjustment using 10 principal component scores. We studied both quantitative phenotype and a dichotomized phenotype. The model with principal component adjustment has type I error rates that are closer to the nominal level of significance of 0.05 for single-nucleotide polymorphisms (SNPs) in noncausal genes for the selected phenotype than the model directly adjusting for population. The principal component adjustment model type I error rates are also closer to the nominal level of 0.05 for noncausal SNPs located in causal genes for the phenotype. The power for causal SNPs with the principal component adjustment model is comparable to the power of the other methods. The power using the underlying quantitative phenotype is greater than the power using the dichotomized phenotype.
doi:10.1186/1753-6561-5-S9-S66
PMCID: PMC3287905  PMID: 22373457
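The third model in entry 18 adds 10 principal component scores as covariates. A rough sketch of that kind of adjustment for a quantitative phenotype is given below, assuming NumPy is available. The data are synthetic, the helper names (`top_pcs`, `snp_effect`) are hypothetical, and age and smoking status are omitted for brevity, so this is not the authors' analysis.

```python
import numpy as np


def top_pcs(genotypes, k=10):
    """Top-k principal component scores of an (n_subjects x n_snps) matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


def snp_effect(y, snp, covariates=None):
    """Least-squares coefficient of the minor-allele count, given covariates."""
    cols = [np.ones(len(y)), snp]
    if covariates is not None:
        cols.append(covariates)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


# Synthetic data (assumption): two populations that differ in allele
# frequencies and in mean phenotype; no SNP has a true effect.
rng = np.random.default_rng(0)
n, m = 400, 200
pop = np.repeat([0, 1], n // 2)
p0 = rng.uniform(0.1, 0.5, m)
p1 = rng.uniform(0.1, 0.5, m)
freqs = np.where(pop[:, None] == 0, p0, p1)
G = rng.binomial(2, freqs).astype(float)
y = 1.0 * pop + rng.normal(size=n)

# Test the SNP whose allele frequency differs most between populations.
j = int(np.argmax(np.abs(p1 - p0)))
unadjusted = snp_effect(y, G[:, j])
adjusted = snp_effect(y, G[:, j], top_pcs(G, k=10))
```

In this toy setting the unadjusted coefficient picks up the population difference, while the PC-adjusted coefficient should sit much closer to zero.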
19.  Trans-population Analysis of Genetic Mechanisms of Ethnic Disparities in Neuroblastoma Survival 
Background
Black patients with neuroblastoma have a higher prevalence of high-risk disease and worse outcome than white patients. We sought to investigate the relationship between genetic variation and the disparities in survival observed in neuroblastoma.
Methods
The analytic cohort was composed of 2709 patients. Principal components were used to assign patients to genomic ethnic clusters for survival analyses. Locus-specific ancestry was calculated for use in association analysis. The shorter spans of linkage disequilibrium in African populations may facilitate the fine mapping of causal variants in regions previously implicated by genome-wide association studies conducted primarily in patients of European descent. Thus, we evaluated 13 single nucleotide polymorphisms known to be associated with susceptibility to high-risk neuroblastoma from genome-wide association studies and all variants with highly divergent allele frequencies in reference African and European populations near the known susceptibility loci. All statistical tests were two-sided.
Results
African genomic ancestry was associated with high-risk neuroblastoma (P = .007) and lower event-free survival (P = .04, hazard ratio = 1.4, 95% confidence interval = 1.05 to 1.80). rs1033069 within SPAG16 (sperm associated antigen 16) was determined to have higher risk allele frequency in the African reference population and statistically significant association with high-risk disease in patients of European and African ancestry (P = 6.42×10⁻⁵, false discovery rate < 0.0015) in the overall cohort. Multivariable analysis using an additive model demonstrated that the SPAG16 single nucleotide polymorphism contributes to the observed ethnic disparities in high-risk disease and survival.
Conclusions
Our study demonstrates that common genetic variation influences neuroblastoma phenotype and contributes to the ethnic disparities in survival observed and illustrates the value of trans-population mapping.
doi:10.1093/jnci/djs503
PMCID: PMC3691940  PMID: 23243203
20.  Effect of population stratification on the identification of significant single-nucleotide polymorphisms in genome-wide association studies 
BMC Proceedings  2009;3(Suppl 7):S13.
The North American Rheumatoid Arthritis Consortium case-control study collected case participants across the United States and control participants from New York. More than 500,000 single-nucleotide polymorphisms (SNPs) were genotyped in the sample of 2000 cases and controls. Careful adjustment for the confounding effect of population stratification must be conducted when analyzing these data; the variance inflation factor (VIF) without adjustment is 1.44. In the primary analyses of these data, a clustering algorithm in the program PLINK was used to reduce the VIF to 1.14, after which genomic control was used to control residual confounding. Here we use stratification scores to achieve a unified and coherent control for confounding. We used the first 10 principal components, calculated genome-wide using a set of 81,500 loci that had been selected to have low pair-wise linkage disequilibrium, as risk factors in a logistic model to calculate the stratification score. We then divided the data into five strata based on quantiles of the stratification score. The VIF of these stratified data is 1.04, indicating substantial control of stratification. However, after control for stratification, we find that there are no significant loci associated with rheumatoid arthritis outside of the HLA region. In particular, we find no evidence for association of TRAF1-C5 with rheumatoid arthritis.
PMCID: PMC2795903  PMID: 20017996
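The stratification-score procedure in entry 20 (principal components as risk factors in a logistic model, then five strata from quantiles of the score) can be sketched as follows, assuming NumPy is available. The Newton-Raphson fit, the function names, and the simulated principal components are illustrative assumptions; the abstract does not state how the model was fitted.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        # A tiny ridge term keeps the Hessian invertible (an added safeguard).
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def stratification_strata(pcs, is_case, n_strata=5):
    """Return (score, stratum index) where score is the fitted case probability."""
    X = np.column_stack([np.ones(len(is_case)), pcs])
    beta = fit_logistic(X, is_case)
    score = 1.0 / (1.0 + np.exp(-X @ beta))
    cuts = np.quantile(score, np.linspace(0, 1, n_strata + 1)[1:-1])
    return score, np.searchsorted(cuts, score, side="right")


# Simulated principal components and case status, for illustration only.
rng = np.random.default_rng(1)
n = 500
pcs = rng.normal(size=(n, 10))
true_p = 1.0 / (1.0 + np.exp(-0.8 * pcs[:, 0]))
is_case = (rng.random(n) < true_p).astype(float)

score, strata = stratification_strata(pcs, is_case)
counts = np.bincount(strata, minlength=5)
```

Association tests would then be run within strata and combined; that step is left out of this sketch.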
21.  European Population Genetic Substructure: Further Definition of Ancestry Informative Markers for Distinguishing among Diverse European Ethnic Groups 
Molecular Medicine  2009;15(11-12):371-383.
The definition of European population genetic substructure and its application to understanding complex phenotypes is becoming increasingly important. In the current study using over 4,000 subjects genotyped for 300,000 single-nucleotide polymorphisms (SNPs), we provide further insight into relationships among European population groups and identify sets of SNP ancestry informative markers (AIMs) for application in genetic studies. In general, the graphical description of these principal components analyses (PCA) of diverse European subjects showed a strong correspondence to the geographical relationships of specific countries or regions of origin. Clearer separation of different ethnic and regional populations was observed when northern and southern European groups were considered separately, and the PCA results were influenced by the inclusion or exclusion of different self-identified population groups, including Ashkenazi Jewish, Sardinian, and Orcadian ethnic groups. SNP AIM sets were identified that could distinguish the regional and ethnic population groups. Moreover, the studies demonstrated that most allele frequency differences between different European groups could be controlled effectively in analyses using these AIM sets. The European substructure AIMs should be widely applicable to ongoing studies to confirm and delineate specific disease susceptibility candidate regions without the necessity of performing additional genome-wide SNP studies in additional subject sets.
doi:10.2119/molmed.2009.00094
PMCID: PMC2730349  PMID: 19707526
22.  Application of Bayesian network structure learning to identify causal variant SNPs from resequencing data 
BMC Proceedings  2011;5(Suppl 9):S109.
Using single-nucleotide polymorphism (SNP) genotypes from the 1000 Genomes Project pilot3 data provided for Genetic Analysis Workshop 17 (GAW17), we applied Bayesian network structure learning (BNSL) to identify potential causal SNPs associated with the Affected phenotype. We focus on the setting in which target genes that harbor causal variants have already been chosen for resequencing; the goal was to detect true causal SNPs from among the measured variants in these genes. Examining all available SNPs in the known causal genes, BNSL produced a Bayesian network from which subsets of SNPs connected to the Affected outcome were identified and measured for statistical significance using the hypergeometric distribution. The exploratory phase of analysis for pooled replicates sometimes identified a set of involved SNPs that contained more true causal SNPs than expected by chance in the Asian population. Analyses of single replicates gave inconsistent results. No nominally significant results were found in analyses of African or European populations. Overall, the method was not able to identify sets of involved SNPs that included a higher proportion of true causal SNPs than expected by chance alone. We conclude that this method, as currently applied, is not effective for identifying causal SNPs that follow the simulation model for the GAW17 data set, which includes many rare causal SNPs.
doi:10.1186/1753-6561-5-S9-S109
PMCID: PMC3287832  PMID: 22373088
23.  Understanding the Population Structure of North American Patients with Cystic Fibrosis 
Clinical genetics  2011;79(2):136-146.
Rationale
It is generally presumed that the Cystic Fibrosis (CF) population is relatively homogeneous, and predominantly of European origin. The complex ethnic make-up observed in the CF patients collected by the North American CF Modifier Gene Consortium has brought this assumption into question, and suggested the potential for population substructure in the three CF study samples collected from North America. It is well appreciated that population substructure can result in spurious genetic associations.
Objectives
To understand the ethnic composition of the North American CF population, and to assess the need for population structure adjustment in genetic association studies with North American CF patients.
Methods
Genome-wide single-nucleotide polymorphisms on 3076 unrelated North American CF patients were used to perform population structure analyses. We compared self-reported ethnicity to genotype-inferred ancestry, and also examined whether geographic distribution and CFTR mutation type could explain the structure observed.
Main Results
Although largely Caucasian, our analyses identified a considerable number of CF patients with admixed African-Caucasian, Mexican-Caucasian and Indian-Caucasian ancestries. Population substructure was present and comparable across the three studies of the consortium. Neither geographic distribution nor mutation type explained the population structure.
Conclusion
Given the ethnic diversity of the North American CF population, it is essential to carefully detect, estimate and adjust for population substructure to guard against potential spurious findings in CF genetic association studies. Other Mendelian diseases that are presumed to predominantly affect single ethnic groups may also benefit from careful analysis of population structure.
doi:10.1111/j.1399-0004.2010.01502.x
PMCID: PMC2995003  PMID: 20681990
ethnicity; principal component analysis; population substructure; population stratification
24.  Correcting population stratification in genetic association studies using a phylogenetic approach 
Bioinformatics  2010;26(6):798-806.
Motivation: The rapid development of genotyping technology and extensive cataloguing of single nucleotide polymorphisms (SNPs) across the human genome have made genetic association studies the mainstream for gene mapping of complex human diseases. For many diseases, the most practical approach is the population-based design with unrelated individuals. Although this design has the advantages of easier sample collection and greater power than family-based designs, unrecognized population stratification in the study samples can lead to both false-positive and false-negative findings and might obscure the true association signals if not appropriately corrected.
Methods: We report PHYLOSTRAT, a new method that corrects for population stratification by combining phylogeny constructed from SNP genotypes and principal coordinates from multi-dimensional scaling (MDS) analysis. This hybrid approach efficiently captures both discrete and admixed population structures.
Results: By extensive simulations, the analysis of a synthetic genome-wide association dataset created using data from the Human Genome Diversity Project, and the analysis of a lactase-height dataset, we show that our method can correct for population stratification more efficiently than several existing population stratification correction methods, including EIGENSTRAT, a hybrid approach based on MDS and clustering, and STRATSCORE, in terms of requiring fewer random SNPs for inference of population structure. By combining the flexibility and hierarchical nature of phylogenetic trees with the advantage of representing admixture using MDS, our hybrid approach can capture the complex population structures in human populations effectively.
Software Availability: Codes can be downloaded from http://people.pcbi.upenn.edu/~lswang/phylostrat/
Contact: mingyao@upenn.edu; lswang@upenn.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq025
PMCID: PMC2832820  PMID: 20097913
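Entry 24 relies on principal coordinates from multi-dimensional scaling (MDS). As a hedged illustration of that one ingredient only, the sketch below computes classical MDS coordinates from a pairwise distance matrix, assuming NumPy is available. It is not the PHYLOSTRAT code; the function name, the toy genotypes, and the choice of Euclidean distance between allele-count vectors are assumptions made for the example.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Principal coordinates from a symmetric (n x n) distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]       # indices of the k largest
    vals_k = np.clip(vals[order], 0.0, None) # guard against tiny negatives
    return vecs[:, order] * np.sqrt(vals_k)


# Toy allele counts (0, 1, 2) for four subjects at four SNPs, illustration only.
G = np.array([[0, 0, 1, 2],
              [0, 1, 1, 2],
              [2, 2, 0, 0],
              [2, 1, 0, 0]], dtype=float)

# Euclidean distance between subjects (an assumption for this sketch).
dist = np.sqrt(((G[:, None, :] - G[None, :, :]) ** 2).sum(axis=2))

coords = classical_mds(dist, k=2)
```

With Euclidean input distances the leading coordinates match those of a PCA on the centered genotypes up to sign, which is why the first axis here separates subjects 1 and 2 from subjects 3 and 4.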
25.  Associations of VEGF-C Genetic Polymorphisms with Urothelial Cell Carcinoma Susceptibility Differ between Smokers and Non-Smokers in Taiwan 
PLoS ONE  2014;9(3):e91147.
Background
Vascular endothelial growth factor (VEGF)-C is associated with lymphangiogenesis, pelvic regional lymph node metastasis, and an antiapoptotic phenotype in urothelial cell carcinoma (UCC). Knowledge of potential roles of VEGF-C genetic polymorphisms in susceptibility to UCC is lacking. This study was designed to examine associations between VEGF-C gene variants and UCC susceptibility and evaluate whether they are modified by smoking.
Methodology/Principal Findings
Five single-nucleotide polymorphisms (SNPs) of VEGF-C were analyzed by a TaqMan-based real-time polymerase chain reaction (PCR) in 233 patients with UCC and 520 cancer-free controls. A multivariate logistic regression was applied to model associations between genetic polymorphisms and UCC susceptibility, and to determine if the effect was modified by smoking. We found that, after adjusting for other covariates, carriers of at least one A allele at VEGF-C rs1485766 had 1.742-fold (entire population) and 1.834-fold (the 476 non-smokers) the risk of developing UCC compared with wild-type (CC) carriers. Among the 277 smokers, we found that VEGF-C rs7664413 T (CT+TT) and rs2046463 G (AG+GG) allelic carriers were more prevalent in UCC patients than in non-cancer participants. Moreover, UCC patients who smoked and had at least one T allele of VEGF-C rs7664413 were at higher risk of developing larger tumors (p = 0.021) than CC homozygotes.
Conclusions
Our results suggest that the involvement of VEGF-C genotypes in UCC risk differs between smokers and non-smokers in the Taiwanese population. The genetic polymorphism of VEGF-C rs7664413 might be a predictive factor for the tumor size of UCC patients who have a smoking habit.
doi:10.1371/journal.pone.0091147
PMCID: PMC3946732  PMID: 24608123
