Motivation: Admixed populations offer a unique opportunity for mapping diseases that have large disease allele frequency differences between ancestral populations. However, association analysis in such populations is challenging because population stratification may lead to association with loci unlinked to the disease locus.
Methods and results: We show that local ancestry at a test single nucleotide polymorphism (SNP) may confound the association signal, and that ignoring it can lead to spurious association. We demonstrate theoretically that adjustment for local ancestry at the test SNP is sufficient to remove the spurious association regardless of the mechanism of population stratification, whether due to local or global ancestry differences among study subjects; however, global ancestry adjustment procedures may not be effective. We further develop two novel association tests that adjust for local ancestry. Our first test is based on a conditional likelihood framework which models the distribution of the test SNP given disease status and flanking marker genotypes. A key advantage of this test lies in its ability to incorporate different directions of association in the ancestral populations. Our second test, which is computationally simpler, is based on logistic regression, with adjustment for local ancestry proportion. We conducted extensive simulations and found that the Type I error rates of our tests are under control; however, the global adjustment procedures yielded inflated Type I error rates when stratification is due to local ancestry differences.
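The confounding described here can be illustrated with a small worked example. The sketch below is not the conditional likelihood or logistic regression test proposed in this abstract; it uses a simpler stand-in, a Mantel-Haenszel odds ratio stratified by local ancestry, and all allele counts are hypothetical. Within each local-ancestry stratum the allele is unrelated to disease, yet the pooled (unadjusted) table shows a strong association because cases are enriched for the ancestry in which the allele is common.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 allele table.

    a, b: alternate/reference allele counts in cases
    c, d: alternate/reference allele counts in controls
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds ratio pooled across strata (one 2x2 table per stratum)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


# Hypothetical allele counts, one table per local-ancestry stratum at the test SNP.
# Allele frequency is 0.8 in stratum 1 and 0.2 in stratum 2, for cases and controls alike.
strata = {
    "local_ancestry_1": (640, 160, 160, 40),   # mostly cases
    "local_ancestry_2": (40, 160, 160, 640),   # mostly controls
}

# Ignoring local ancestry: collapse the strata into a single table.
pooled = tuple(sum(table[i] for table in strata.values()) for i in range(4))
crude_or = odds_ratio(*pooled)

# Adjusting for local ancestry: combine the stratum-specific tables.
adjusted_or = mantel_haenszel_or(list(strata.values()))

print(f"crude OR    = {crude_or:.3f}")
print(f"adjusted OR = {adjusted_or:.3f}")
```

With these counts the crude odds ratio is well above 1 while the ancestry-stratified odds ratio equals 1, which is the pattern of spurious association that local ancestry adjustment is meant to remove.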
Contact: email@example.com; firstname.lastname@example.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification.
We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual’s continental and sub-continental ancestry. To predict an individual’s continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control’s λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver.
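A toy sketch of how an ensemble of disjoint decision trees can vote on ancestry and tolerate missing genotypes is given below. It is an illustration under stated assumptions, not the ETHNOPRED implementation: the SNP identifiers, split rules, population labels and the rule that a tree abstains when one of its SNPs is missing are all hypothetical.

```python
from collections import Counter

# Each toy tree uses its own SNPs (the trees are disjoint) and returns a label,
# or None when a genotype it needs is missing. Genotypes are coded 0, 1 or 2.


def tree_a(genotype):
    x = genotype.get("snp_1")
    if x is None:
        return None
    if x == 0:
        return "pop_X"
    y = genotype.get("snp_2")
    if y is None:
        return None
    return "pop_Y" if y == 2 else "pop_Z"


def tree_b(genotype):
    x = genotype.get("snp_3")
    if x is None:
        return None
    return "pop_X" if x >= 1 else "pop_Y"


def tree_c(genotype):
    x = genotype.get("snp_4")
    if x is None:
        return None
    return "pop_X" if x == 2 else "pop_Z"


TREES = [tree_a, tree_b, tree_c]


def predict(genotype, trees=TREES):
    """Majority vote over the trees that can reach a leaf; None if all abstain."""
    votes = [label for label in (tree(genotype) for tree in trees) if label is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


complete = {"snp_1": 0, "snp_2": 1, "snp_3": 2, "snp_4": 2}
partial = {"snp_1": 1, "snp_3": 1, "snp_4": 2}  # snp_2 is missing, so tree_a abstains

print(predict(complete))
print(predict(partial))
```

Because the trees share no SNPs, a missing genotype silences at most one tree in this sketch, and the remaining trees can still produce a majority vote.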
ETHNOPRED is a novel technique for producing classifiers that can identify an individual’s continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.
Motivation: Adjustment for population structure is necessary to avoid bias in genetic association studies of susceptibility variants for complex diseases. Population structure may differ from one genomic region to another due to the variability of individual ancestry associated with migration, random genetic drift or natural selection. Current association methods for correcting population stratification usually involve adjustment of global ancestry between study subjects.
Results: We suggest interrogating local population structure for fine mapping to more accurately locate true causal genes by better adjusting for the confounding effect of local ancestry. By extensive simulations on genome-wide datasets, we show that adjusting for global ancestry may lead to false positives when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can effectively prevent false positives due to local population structure and thus can improve fine mapping for disease gene localization. We applied the local and global adjustments to the analysis of datasets from three genome-wide association studies, including European Americans, African Americans and Nigerians. Both European Americans and African Americans demonstrate greater variability in local ancestry than Nigerians. Adjusting for local ancestry successfully eliminated the known spurious association between SNPs in the LCT gene and height due to the population structure that exists in European Americans.
Supplementary information: Supplementary data are available at Bioinformatics online.
Genome-wide association studies in cohorts of European descent have identified novel genomic regions as associated with lipids, but their relevance in African Americans remains unclear.
Methods and Results
We genotyped 8 index SNPs and 488 tagging SNPs across 8 novel lipid loci in the Jackson Heart Study, a community-based cohort of 4605 African Americans. For each trait, we calculated residuals adjusted for age, sex, and global ancestry and performed multivariable linear regression to detect genotype-phenotype association with adjustment for local ancestry. To explore admixture effects, we conducted stratified analyses in individuals with a high probability of 2 African ancestral alleles or at least 1 European allele at each locus. We confirmed 2 index SNPs as associated with lipid traits in African Americans, with suggestive association for 3 more. However, the effect sizes for 4 of the 5 associated SNPs were larger in the European local ancestry subgroup compared to the African local ancestry subgroup, suggesting that the replication is driven by European ancestry segments. Through fine-mapping, we discovered 3 new SNPs with significant associations, two with consistent effect on triglyceride levels across ancestral groups: rs636523 near DOCK7/ANGPTL3 and rs780093 in GCKR. African LD patterns did not assist in narrowing association signals.
We confirm that 5 genetic regions associated with lipid traits in European-derived populations are relevant in African Americans. To further evaluate these loci, fine-mapping in larger African American cohorts and/or resequencing will be required.
lipids; genetics; epidemiology; risk factors
Populations of the Americas were founded by early migrants from Asia, and some have experienced recent genetic admixture. To better characterize the native and non-native ancestry components in populations from the Americas, we analyzed 815,377 autosomal SNPs, mitochondrial hypervariable segments I and II, and 36 Y-chromosome STRs from 24 Mesoamerican Totonacs and 23 South American Bolivians.
Results and Conclusions
We analyzed common genomic regions from native Bolivian and Totonac populations to identify 324 highly predictive Native American ancestry informative markers (AIMs). As few as 40–50 of these AIMs perform nearly as well as large panels of random genome-wide SNPs for predicting and estimating Native American ancestry and admixture levels. These AIMs have greater New World vs. Old World specificity than previous AIMs sets. We identify highly-divergent New World SNPs that coincide with high-frequency haplotypes found at similar frequencies in all populations examined, including the HGDP Pima, Maya, Colombian, Karitiana, and Surui American populations. Some of these regions are potential candidates for positive selection. European admixture in the Bolivian sample is approximately 12%, though individual estimates range from 0–48%. We estimate that the admixture occurred ~360–384 years ago. Little evidence of European or African admixture was found in Totonac individuals. Bolivians with pre-Columbian mtDNA and Y-chromosome haplogroups had 5–30% autosomal European ancestry, demonstrating the limitations of Y-chromosome and mtDNA haplogroups and the need for autosomal ancestry informative markers for assessing ancestry in admixed populations.
Admixture; Ancestry Informative Markers (AIMs); Native Americans; Bolivian; Totonac; Positive selection
Genome-wide association studies have recently identified genetic polymorphisms associated with common, etiologically complex diseases, for which direct-to-consumer genetic testing with provision of absolute genetic risk estimates is marketed by commercial companies. Polymorphisms associated with atrial fibrillation (AF) have shown relatively large risk estimates but the robustness of such estimates across populations and study designs has not been studied.
A systematic literature review with meta-analysis and assessment of between-study heterogeneity was performed for single nucleotide polymorphisms (SNPs) in the six genetic regions associated with AF in genome-wide or candidate gene studies.
Data from 18 samples of European ancestry (n=12,100 cases; 115,702 controls) were identified for the SNP on chromosome 4q25 (rs220733), 16 samples (n=12,694 cases; 132,602 controls) for the SNP on 16q22 (rs2106261) and 4 samples (n=5,272 cases; 59,725 controls) for the SNP in KCNH2 (rs1805123). Only the discovery studies were identified for SNPs on 1q21 and in GJA5 and IL6R, so no meta-analyses were performed for those SNPs. In overall random-effects meta-analyses, association with AF was observed for both SNPs from genome-wide studies on 4q25 (OR 1.67, 95% CI=1.50–1.86, p=2×10⁻²¹) and 16q22 (OR 1.21, 95% CI=1.13–1.29, p=1×10⁻⁸), but not the SNP in KCNH2 from candidate gene studies (p=0.15). There was substantial effect heterogeneity across case-control and cross-sectional studies for both polymorphisms (I²=0.50–0.78, p<0.05), but not across prospective cohort studies (I²=0.39, p=0.15). Both polymorphisms were robustly associated with AF for each study design individually (p<0.05).
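For readers unfamiliar with the quantities reported above, the following sketch computes a random-effects pooled estimate and the I² heterogeneity statistic. It assumes the common DerSimonian-Laird estimator, which the abstract does not name, and the two input studies are hypothetical rather than taken from this meta-analysis.

```python
import math


def dersimonian_laird(effects, std_errors):
    """Random-effects meta-analysis (DerSimonian-Laird) on log odds ratios.

    effects: per-study log odds ratios
    std_errors: their standard errors
    Returns the pooled log OR, its standard error, Cochran's Q, tau^2 and I^2.
    """
    w = [1.0 / se ** 2 for se in std_errors]          # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # between-study variance
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0     # share of variation due to heterogeneity
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return {"pooled": pooled, "se": se_pooled, "q": q, "tau2": tau2, "i2": i2}


# Two hypothetical studies with log odds ratios 0.0 and 1.0, each with standard error 0.5.
result = dersimonian_laird([0.0, 1.0], [0.5, 0.5])
low = math.exp(result["pooled"] - 1.96 * result["se"])
high = math.exp(result["pooled"] + 1.96 * result["se"])
print(f"pooled OR = {math.exp(result['pooled']):.2f} (95% CI {low:.2f} to {high:.2f})")
print(f"I2 = {result['i2']:.2f}, tau2 = {result['tau2']:.2f}")
```

An I² of 0.50 in this toy input means that about half of the observed variation between the two study estimates is attributed to heterogeneity rather than sampling error.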
In meta-analyses including up to 150,000 individuals, polymorphisms in two genetic regions were robustly associated with AF across all study designs but with substantial context-dependency of risk estimates.
atrial fibrillation; genetics; genome-wide; prediction; SNP; meta-analysis
Genomewide association studies (GWAS) routinely apply principal component analysis (PCA) to infer population structure within a sample to correct for confounding due to ancestry. GWAS implementation of PCA uses tens of thousands of SNPs to infer structure, despite the fact that only a small fraction of such SNPs provides useful information on ancestry. The identification of this reduced set of ancestry-informative markers (AIMs) from a GWAS has practical value; for example, researchers can genotype the AIM set to correct for potential confounding due to ancestry in follow-up studies that utilize custom SNP or sequencing technology. We propose a novel technique to identify AIMs from genomewide SNP data using sparse principal component analysis (sparse PCA). The procedure uses penalized regression methods to identify those SNPs in a genomewide panel that significantly contribute to the principal components while encouraging SNPs that provide negligible loadings to vanish from the analysis. We found that sparse PCA leads to negligible loss of ancestry information compared to traditional PCA analysis of genomewide SNP data. We further demonstrate the value of sparse PCA for AIM selection using real data from the International HapMap Project and a genomewide study of Inflammatory Bowel Disease. We have implemented our approach in open-source R software for public use.
Ancestry-informative markers; Genome-wide association studies; Population stratification; Principal component analysis; Variable selection
We investigated the ability of several principal components analysis (PCA)-based strategies to detect and control for population stratification using data from a multi-center study of epithelial ovarian cancer among women of European-American ethnicity. These include a correction based on an ancestry informative markers (AIMs) panel designed to capture European ancestral variation and corrections utilizing un-thinned genome-wide SNP data; case-control samples were drawn from four geographically distinct North-American sites. The AIMs-only and genome-wide first principal components (PC1) both corresponded to the previously described North or Northwest-Southeast axis of European variation. We found that the genome-wide PCA captured this primary dimension of variation more precisely and identified additional axes of genome-wide variation of relevance to epithelial ovarian cancer. Associations evident between the genome-wide PCs and study site corroborate North American immigration history and suggest that undiscovered dimensions of variation lie within Northern Europe. The structure captured by the genome-wide PCA was also found within control individuals and did not reflect the case-control variation present in the data. The genome-wide PCA highlighted three regions of local LD, corresponding to the lactase (LCT) gene on chromosome 2, the human leukocyte antigen system (HLA) on chromosome 6 and to a common inversion polymorphism on chromosome 8. These features did not compromise the efficacy of PCs from this analysis for ancestry control. This study concludes that although AIMs panels are a cost-effective way of capturing population structure, genome-wide data should preferably be used when available.
Accurate genetic association studies are crucial for the detection and the validation of disease determinants. One of the main confounding factors that affect accuracy is population stratification, and great efforts have been expended over the past decade to detect and adjust for it. We now have efficient solutions for population stratification adjustment for single-SNP (single-nucleotide polymorphism) inference in genome-wide association studies, but it is unclear whether these solutions can be effectively applied to rare variation studies and in particular to gene-based (or set-based) association methods that jointly analyze multiple rare and common variants. We examine here, both theoretically and empirically, the performance of two commonly used approaches for population stratification adjustment, genomic control and principal component analysis, when used on gene-based association tests. We show that, unlike in single-SNP inference, genes with diverse compositions of rare and common variants may suffer from population stratification to varying extents. The inflation in gene-level statistics can be affected by the number and the allele frequency spectrum of SNPs in the gene, and by the gene-based testing method used in the analysis. As a consequence, using a universal inflation factor as a genomic control should be avoided in gene-based inference with sequencing data. We also demonstrate that caution needs to be exercised when using principal component adjustment because the accuracy of the adjusted analyses depends on the underlying population substructure, on the way the principal components are constructed, and on the number of principal components used to recover the substructure.
sequencing studies; gene-based association test; genomic control; principal component analysis; C-alpha test; burden test
In genetic association studies, it is necessary to correct for population structure to avoid inference bias. During the past decade, prevailing corrections often only involved adjustments of global ancestry differences between sampled individuals. Nevertheless, population structure may vary across local genomic regions due to the variability of local ancestries associated with natural selection, migration, or random genetic drift. Adjusting for global ancestry alone may be inadequate when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can more effectively prevent false-positives due to local population structure. To more accurately locate disease genes, we recommend adjusting for local ancestries by interrogating local structure. In practice, locus-specific ancestries are usually unknown and cannot be accurately inferred when ancestral population information is not available. For such scenarios, we propose employing local principal components (PC) to represent local ancestries and adjusting for local PCs when testing for genotype–phenotype association. With an acceptable computation burden, the proposed algorithm successfully eliminates the known spurious association between SNPs in the LCT gene and height due to the population structure in European Americans.
Genome-wide association studies; Local ancestries; Local principal components; Migration; Random genetic drift; Natural selection; Genomic inflation factor; Genomic control; Local ancestry principal components correction; Fine mapping
Genetic variants that contribute to asthma susceptibility may be present at varying frequencies in different populations, which is an important consideration and advantage for performing genetic association studies in admixed populations.
To identify asthma-associated loci in African Americans.
We compared local African and European ancestry estimated from dense single nucleotide polymorphism (SNP) genotype data in African American adults with asthma and non-asthmatic controls. Allelic tests of association were performed within the candidate regions identified, correcting for local European admixture.
We identified a significant ancestry association peak on chromosome 6q. Allelic tests for association within this region identified a SNP (rs1361549) on 6q14.1 that was associated with asthma exclusively in African Americans with local European admixture (OR=2.2). The risk allele is common in Europe (42% in the HapMap CEU) but absent in West Africa (0% in the HapMap YRI), suggesting the allele is present in African Americans due to recent European admixture. We replicated our findings in Puerto Ricans and similarly found that the signal of association is largely specific to individuals who are heterozygous for African and non-African ancestry at 6q14.1. However, we found no evidence for association in European Americans or in Puerto Ricans in the absence of local African ancestry, suggesting that the association with asthma at rs1361549 is due to an environmental or genetic interaction.
We identified a novel asthma-associated locus that is relevant to admixed populations with African ancestry, and highlight the importance of considering local ancestry in genetic association studies of admixed populations.
asthma; population structure; genome-wide association study; admixture mapping; ancestry association testing; admixed populations; African Americans; Puerto Ricans
Some investigators argue that controlling for self-reported race or ethnicity, either in statistical analysis or in study design, is sufficient to mitigate unwanted influence from population stratification. In this report, we evaluated the effectiveness of a study design involving matching on self-reported ethnicity and race in minimizing bias due to population stratification within an ethnically admixed population in California. We estimated individual genetic ancestry using structured association methods and a panel of ancestry informative markers, and observed no statistically significant difference in distribution of genetic ancestry between cases and controls (P=0.46). Stratification by Hispanic ethnicity showed similar results. We evaluated potential confounding by genetic ancestry after adjustment for race and ethnicity for 1260 candidate gene SNPs, and found no major impact (>10%) on risk estimates. In conclusion, we found no evidence of confounding of genetic risk estimates by population substructure using this matched design. Our study provides strong evidence supporting the race- and ethnicity-matched case-control study design as an effective approach to minimizing systematic bias due to differences in genetic ancestry between cases and controls.
Population stratification; Genetic susceptibility; Case-control; Matching
A 58kb region on chromosome 9p21.3 has consistently shown strong association with coronary artery disease (CAD) in multiple genome-wide association studies in populations of European and East Asian ancestry. In this study we sought to further characterize the role of genetic variants in 9p21.3 in African American individuals.
Methods and Results
Apparently healthy African American siblings (n=548) of patients with documented CAD <60 years of age were genotyped and followed for incident CAD for up to 17 years. Tests of association for 86 SNPs across the 9p21.3 region in a GEE logistic framework under an additive model adjusting for traditional risk factors, family, follow-up time, and population stratification were performed. A single SNP within the CDKN2B gene met stringent criteria for statistical significance, including permutation-based evaluations. This variant, rs3217989, was common (minor allele [G] frequency 0.242), conveyed protection against CAD (OR=0.19, 95% CI: 0.07 to 0.50, p=0.0008) and was replicated in a combined analysis of two additional case/control studies of prevalent CAD/MI in African Americans (n=990, p=0.024, OR= 0.779, 95% CI: 0.626-0.968).
This is the first report of a CAD association signal in a population of African ancestry with a common variant within the CDKN2B gene, independent from previous findings in European and East Asian ancestry populations. The findings demonstrate a significant protective effect against incident CAD in African American siblings of persons with premature CAD, with replication in a combination of two additional African American cohorts.
African American; CDKN2B; Coronary Artery Disease; Genetics; 9p21
While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.
This paper presents improved methodologies for the analysis of genome-wide association studies in admixed populations, which are populations that came about by the mixing of two or more distant continental populations over a few hundred years (e.g., African Americans or Latinos). Studies of admixed populations offer the promise of capturing additional genetic diversity compared to studies over homogeneous populations such as Europeans. In admixed populations, correlation between genetic variants exists both at a fine scale in the ancestral populations and at a coarse scale due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered either one or the other type of correlation, but not both. In this work we develop novel statistical methods that account for both types of genetic correlation, and we show that the combined approach attains greater statistical power than that achieved by applying either approach separately. We provide analysis of simulated and real data from major studies performed in African-American men and women to show the improvement obtained by our methods over the standard methods for analyzing association studies in admixed populations.
The haplotypes of the X chromosome are accessible to direct count in males, whereas the diplotypes of the females may be inferred knowing the haplotype of their sons or fathers. Here, we investigated: 1) the possible large-scale haplotypic structure of the X chromosome in a Caucasian population sample, given the single-nucleotide polymorphism (SNP) maps and genotypes provided by Illumina and Affymetrix for Genetic Analysis Workshop 14, and 2) the performance of widely used programs in reconstructing haplotypes from population genotypic data, given their known distribution in a sample of unrelated individuals.
All possible unrelated mother-son pairs of Caucasian ancestry (N = 104) were selected from the 143 families of the Collaborative Study on the Genetics of Alcoholism pedigree files, and the diplotypes of the mothers were inferred from the X chromosomes of their sons. The marker set included 313 SNPs at an average density of 0.47 Mb. Linkage disequilibrium between pairs of markers was computed by the parameter D', whereas for measuring multilocus disequilibrium, we developed here an index called D*, and applied it to all possible sliding windows of 5 markers each. Results showed a complex pattern of haplotypic structure, with regions of low linkage disequilibrium separated by regions of high values of D*. The following programs were evaluated for their accuracy in inferring population haplotype frequencies: 1) ARLEQUIN 2.001; 2) PHASE 2.1.1; 3) SNPHAP 1.1; 4) HAPLOBLOCK 1.2; 5) HAPLOTYPER 1.0. Performances were evaluated by Pearson correlation (r) coefficient between the true and the inferred distribution of haplotype frequencies.
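As a reminder of how the pairwise measure D' is obtained when haplotypes can be counted directly, as on male X chromosomes, here is a minimal sketch for two biallelic markers. The haplotype counts are hypothetical, and the multilocus index D* developed in this study is not reproduced because its definition is not given here.

```python
def d_prime(n_ab_upper, n_a_upper_b_lower, n_a_lower_b_upper, n_ab_lower):
    """Lewontin's D' for two biallelic markers from phased haplotype counts.

    Arguments are counts of haplotypes AB, Ab, aB and ab, in that order.
    Returns a signed value in [-1, 1]; 0.0 if either marker is monomorphic.
    """
    total = n_ab_upper + n_a_upper_b_lower + n_a_lower_b_upper + n_ab_lower
    p_ab = n_ab_upper / total                              # frequency of haplotype AB
    p_a = (n_ab_upper + n_a_upper_b_lower) / total         # frequency of allele A
    p_b = (n_ab_upper + n_a_lower_b_upper) / total         # frequency of allele B
    d = p_ab - p_a * p_b                                   # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max if d_max > 0 else 0.0


# Hypothetical haplotype counts (AB, Ab, aB, ab) from 100 chromosomes.
print(d_prime(50, 0, 0, 50))    # only two haplotypes present: complete disequilibrium
print(d_prime(25, 25, 25, 25))  # all four haplotypes equally frequent: equilibrium
print(d_prime(40, 10, 10, 40))  # intermediate disequilibrium
```

The normalisation by the largest value of D attainable at the observed allele frequencies is what makes D' comparable across marker pairs with different allele frequencies.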
The SNP haplotypic structure of the X chromosome is complex, with regions of high haplotype conservation interspersed among regions of higher haplotype diversity. All the tested programs were accurate (r = 1) in reconstructing the distribution of haplotype frequencies in case of high D* values. However, only the program PHASE realized a high correlation coefficient (r > 0.7) in conditions of low linkage disequilibrium.
Population structure occurs when a sample is composed of individuals with different ancestries and can result in excess type I error in genome-wide association studies. Genome-wide principal-component analysis (PCA) has become a popular method for identifying and adjusting for subtle population structure in association studies. Using the Genetic Analysis Workshop 16 (GAW16) NARAC data, we explore two unresolved issues concerning the use of genome-wide PCA to account for population structure in genetic association studies: the choice of single-nucleotide polymorphism (SNP) subset and the choice of adjustment model. We computed PCs for subsets of genome-wide SNPs with varying levels of LD. The first two PCs were similar for all subsets and the first three PCs were associated with case status for all subsets. When the PCs associated with case status were included as covariates in an association model, the reduction in genomic inflation factor was similar for all SNP sets. Several models have been proposed to account for structure using PCs, but it is not yet clear whether the different methods will result in substantively different results for association studies with individuals of European descent. We compared genome-wide association p-values and results for two positive-control SNPs previously associated with rheumatoid arthritis using four PC adjustment methods as well as no adjustment and genomic control. We found that in this sample, adjusting for the continuous PCs or adjusting for discrete clusters identified using the PCs adequately accounts for the case-control population structure, but that a recently proposed randomization test performs poorly.
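The genomic inflation factor used above as the yardstick for residual stratification can be sketched in a few lines. The version below assumes the usual median-based definition for 1-df chi-square association statistics, and the test statistics are hypothetical rather than taken from the NARAC data.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Lambda: observed median test statistic over its expected value under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Deflate every statistic by lambda; lambda below 1 is treated as 1 (no deflation)."""
    lam = max(1.0, genomic_inflation_factor(chi2_stats))
    return [stat / lam for stat in chi2_stats]


# Hypothetical 1-df chi-square statistics from five SNPs.
stats = [0.1, 0.5, 0.9, 2.0, 6.0]
lam = genomic_inflation_factor(stats)
corrected = genomic_control(stats)
print(f"lambda = {lam:.3f}")
print(f"lambda after correction = {genomic_inflation_factor(corrected):.3f}")
```

A value of lambda close to 1 indicates little inflation, which is why the reduction in lambda after including PCs as covariates serves as a summary of how well the adjustment worked.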
Peripheral arterial disease (PAD) is associated with significant morbidity and mortality, and has a higher prevalence in African Americans than Caucasians. Ankle arm index (AAI) is the ratio of systolic blood pressure in the leg to that in the arm, and, when low, is a marker of PAD. We used an admixture mapping approach to search for genetic loci associated with low AAI. Using data from 1040 African-American participants in the observational, population-based Health, Aging, and Body Composition Study who were genotyped at 1322 single nucleotide polymorphisms (SNPs) that are informative for African versus European ancestry and span the entire genome, we estimated genetic ancestry in each chromosomal region and then tested the association between AAI and genetic ancestry at each locus. We found a region of chromosome 11 that reaches its peak between 80 and 82 Mb associated with low AAI (p<0.001 for rs12289502 and rs9665943, both within this region). 753 African-American participants in the observational, population-based Cardiovascular Health Study were genotyped at rs9665943 to test the reproducibility of this association, and this association was also statistically significant (odds ratio (OR) for homozygous African genotype 1.59 (95% confidence interval (CI) 1.12–2.27)). Another candidate SNP (rs1042602) in the same genomic region was tested in both populations, and was also found to be significantly associated with low AAI in both populations (OR for homozygous African genotype 1.89 (95% CI 1.29–2.76)). This study identifies a novel region of chromosome 11 representing an area with a potential candidate gene associated with PAD in African Americans.
peripheral vascular disease; genetics; African-American
Survival of patients with pancreatic adenocarcinoma is limited and few prognostic factors are known. We conducted a two-stage genome-wide association study (GWAS) to identify germline variants associated with survival in patients with pancreatic adenocarcinoma.
We analyzed overall survival in relation to single nucleotide polymorphisms (SNPs) among 1,005 patients from two large GWAS datasets, PanScan I and ChinaPC. Cox proportional hazards regression was used in an additive genetic model with adjustment for age, sex, clinical stage and the top four principal components of population stratification. The first stage included 642 cases of European ancestry (PanScan), from which the top SNPs (P ≤ 10⁻⁵) were advanced to a joint analysis with 363 additional patients from China (ChinaPC).
In the first stage of cases of European descent, the top-ranked loci were at chromosomes 11p15.4, 18p11.21, and 1p36.13, tagged by rs12362504 (P = 1.63×10⁻⁷), rs981621 (P = 1.65×10⁻⁷), and rs16861827 (P = 3.75×10⁻⁷), respectively. One hundred thirty-one SNPs with P ≤ 10⁻⁵ were advanced to a joint analysis with cases from the ChinaPC study. In the joint analysis, the top-ranked SNP was rs10500715 (minor allele frequency, 0.37; P = 1.72×10⁻⁷) on chromosome 11p15.4, which is intronic to the SET binding factor 2 (SBF2) gene. The hazard ratio (95% CI) for death was 0.74 (0.66–0.84) in PanScan I, 0.79 (0.65–0.97) in ChinaPC, and 0.76 (0.68–0.84) in the joint analysis.
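As a rough consistency check on the reported hazard ratios, the two study-level estimates can be combined by fixed-effect inverse-variance pooling on the log scale. This is not the joint Cox analysis the study reports; it is a swapped-in approximation that recovers standard errors from the published 95% confidence intervals, so the pooled value should only land near the reported joint estimate. The function names are hypothetical.

```python
import math


def log_hr_and_se(hr, ci_low, ci_high, z=1.96):
    """Recover log(HR) and its standard error from a reported 95% CI."""
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return log_hr, se


def pool_fixed_effect(estimates, z=1.96):
    """Inverse-variance weighted pooling of (log_hr, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled_log = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z * pooled_se),
        math.exp(pooled_log + z * pooled_se),
    )


# Hazard ratios and 95% CIs as reported in the abstract.
panscan = log_hr_and_se(0.74, 0.66, 0.84)
chinapc = log_hr_and_se(0.79, 0.65, 0.97)

pooled_hr, pooled_low, pooled_high = pool_fixed_effect([panscan, chinapc])
print(f"pooled HR {pooled_hr:.2f} ({pooled_low:.2f}-{pooled_high:.2f})")
```

Small differences from the published joint estimate are expected, because a joint regression on individual-level data is not identical to pooling two summary estimates.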
Germline genetic variation in the SBF2 locus was associated with overall survival in patients with pancreatic adenocarcinoma of European and Asian ancestry. This association should be investigated in additional large patient cohorts.
Pancreatic cancer; GWAS; single nucleotide polymorphism; SET binding factor 2
Chronic kidney disease (CKD) is an increasing global public health concern, particularly among populations of African ancestry. We performed an interrogation of known renal loci, genome-wide association (GWA), and IBC candidate-gene SNP association analyses in African Americans from the CARe Renal Consortium. In up to 8,110 participants, we performed meta-analyses of GWA and IBC array data for estimated glomerular filtration rate (eGFR), CKD (eGFR <60 mL/min/1.73 m²), urinary albumin-to-creatinine ratio (UACR), and microalbuminuria (UACR >30 mg/g) and interrogated the 250 kb flanking region around 24 SNPs previously identified in European Ancestry renal GWAS analyses. Findings were replicated in up to 4,358 African Americans. To assess function, individually identified genes were knocked down in zebrafish embryos by morpholino antisense oligonucleotides. Expression of kidney-specific genes was assessed by in situ hybridization, and glomerular filtration was evaluated by dextran clearance. Overall, 23 of 24 previously identified SNPs had direction-consistent associations with eGFR in African Americans, 2 of which achieved nominal significance (UMOD, PIP5K1B). Interrogation of the flanking regions uncovered 24 new index SNPs in African Americans, 12 of which were replicated (UMOD, ANXA9, GCKR, TFDP2, DAB2, VEGFA, ATXN2, GATM, SLC22A2, TMEM60, SLC6A13, and BCAS3). In addition, we identified 3 suggestive loci at DOK6 (p-value = 5.3×10⁻⁷) and FNDC1 (p-value = 3.0×10⁻⁷) for UACR, and KCNQ1 with eGFR (p = 3.6×10⁻⁶). Morpholino knockdown of kcnq1 in the zebrafish resulted in abnormal kidney development and filtration capacity. We identified several SNPs in association with eGFR in African Ancestry individuals, as well as 3 suggestive loci for UACR and eGFR. Functional genetic studies support a role for kcnq1 in glomerular development in zebrafish.
Chronic kidney disease (CKD) is an increasing global public health problem and disproportionately affects populations of African ancestry. Many studies have shown that genetic variants are associated with the development of CKD; however, similar studies are lacking in African ancestry populations. The CARe consortium consists of more than 8,000 individuals of African ancestry; genome-wide association analysis for renal-related phenotypes was conducted. In cross-ethnicity analyses, we found that 23 of 24 previously identified SNPs in European ancestry populations have the same effect direction in our samples of African ancestry. We also identified 3 suggestive genetic variants associated with measurement of kidney function. We then tested these genes in zebrafish knockdown models and demonstrated that kcnq1 is involved in kidney development in zebrafish. These results highlight the similarity of genetic variants across ethnicities and show that cross-species modeling in zebrafish is feasible for genes associated with chronic human disease.
A wealth of genomic information is available in public and private databases. However, this information is underutilized for uncovering population-specific and functionally relevant markers underlying complex human traits. Given the huge amount of SNP data available from the annotation of human genetic variation, data mining is a fast and cost-effective approach for investigating the number of SNPs that are informative for ancestry. In this study, we present AncestrySNPminer, the first web-based bioinformatics tool specifically designed to retrieve Ancestry Informative Markers (AIMs) from genomic data sets and link these informative markers to genes and ontological annotation classes. The tool includes an automated and simple “scripting at the click of a button” functionality that enables researchers to perform various population genomics statistical analysis methods, with user-friendly querying and filtering of data sets across various populations through a single web interface. AncestrySNPminer can be freely accessed at https://research.cchmc.org/mershalab/AncestrySNPminer/login.php.
Ancestry; Ancestry informative markers; AIMs; Bioinformatics; AncestrySNPminer; Data mining; Admixture; Admixture mapping
Principal components analysis (PCA) has been successfully used to correct for population stratification in genome-wide association studies of common variants. However, rare variants also have a role in common disease etiology. Whether PCA successfully controls population stratification for rare variants has not been addressed. Thus we evaluate the effect of population stratification analysis on false-positive rates for common and rare variants at the single-nucleotide polymorphism (SNP) and gene level. We use the simulation data from Genetic Analysis Workshop 17 and compare false-positive rates with and without PCA at the SNP and gene level. We found that SNPs’ minor allele frequency (MAF) influenced the ability of PCA to effectively control false discovery. Specifically, PCA reduced false-positive rates more effectively in common SNPs (MAF > 0.05) than in rare SNPs (MAF < 0.01). Furthermore, at the gene level, although false-positive rates were reduced, power to detect true associations was also reduced using PCA. Taken together, these results suggest that sequence-level data should be interpreted with caution, because extremely rare SNPs may exhibit sporadic association that is not controlled using PCA.
There are many ways to perform adjustment for population structure. It remains unclear what the optimal approach is and whether the optimal approach varies by the type of samples and substructure present. The simplest and most straightforward approach is to adjust for the continuous principal components (PCs) that capture ancestry. Through simulation, we explored the issue of which ancestry informative PCs should be adjusted for in an association model to control for the confounding nature of population structure while maintaining maximum power. A thorough examination of selecting PCs for adjustment in a case-control study across the possible structure scenarios that could occur in a genome-wide association study has not been previously reported.
We found that when the SNP and phenotype frequencies do not vary over the sub-populations, all methods of selection provided similar power and appropriate Type I error for association. When the SNP is not structured and the phenotype has large structure, then selection methods that do not select PCs for inclusion as covariates generally provide the most power. When there is a structured SNP and a non-structured phenotype, selection methods that include PCs in the model have greater power. When both the SNP and the phenotype are structured, all methods of selection have similar power.
Standard practice is to include a fixed number of PCs in genome-wide association studies. Based on our findings, we conclude that if power is not a concern, then selecting the same set of top PCs for adjustment for all SNPs in logistic regression is a strategy that achieves appropriate Type I error. However, standard practice is not optimal in all scenarios and to optimize power for structured SNPs in the presence of unstructured phenotypes, PCs that are associated with the tested SNP should be included in the logistic model.
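The confounding that PC adjustment is meant to remove can be illustrated with a toy stratified case-control table. The sketch below does not use PCs or logistic regression; it swaps in a Mantel-Haenszel stratified odds ratio with two known subpopulation labels, which is a simpler stand-in for the same idea. All counts and function names are made up for illustration.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = case carriers, case non-carriers;
    c, d = control carriers, control non-carriers."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio across strata of (a, b, c, d) counts."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical counts: the SNP is unassociated with disease inside each
# subpopulation, but allele frequency and case fraction both differ.
strata = [
    (320, 80, 80, 20),   # subpopulation A: 80% carriers, mostly cases
    (20, 80, 80, 320),   # subpopulation B: 20% carriers, mostly controls
]

pooled_table = tuple(sum(col) for col in zip(*strata))
crude_or = odds_ratio(*pooled_table)      # ignores population structure
adjusted_or = mantel_haenszel_or(strata)  # accounts for population structure

print(f"crude OR {crude_or:.2f}, stratified OR {adjusted_or:.2f}")
```

Here both the SNP and the phenotype are structured, so the crude odds ratio is far from 1 even though no within-population association exists; adjusting for structure removes the spurious signal.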
Advances in genotyping technologies have contributed to a better understanding of human population genetic structure and improved the analysis of association studies. To analyze patterns of human genetic variation in Brazil, we used SNP data from 1129 individuals: 138 from the urban population of Sao Paulo, Brazil, and 991 from 11 populations of the HapMap Project. Principal components analysis was performed on the SNPs common to these populations to identify the composition and the number of SNPs needed to capture their genetic variation. Both admixture and local ancestry inference were performed in individuals of the Brazilian sample. Individuals from the Brazilian sample fell between Europeans, Mexicans, and Africans. Brazilians are suggested to have the highest internal genetic variation of the sampled populations. Our results indicate, as expected, that the Brazilian sample analyzed descends from Amerindian, African, and/or European ancestors, but intermarriage between individuals of different ethnic origin had an important role in generating the broad genetic variation observed in the present-day population. The data support the notion that the Brazilian population, due to its high degree of admixture, can provide a valuable resource for strategies that use admixture as a tool for mapping complex traits in humans.
genetic structure; Brazilian; admixture mapping; admixture
Identifying ancestry along each chromosome in admixed individuals provides a wealth of information for understanding the population genetic history of admixture events and is valuable for admixture mapping and identifying recent targets of selection. We present PCAdmix (available at https://sites.google.com/site/pcadmix/home), a Principal Components-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased single-nucleotide polymorphism (SNP) genotypes of admixed individuals. We compare our method to HAPMIX on simulated data from two ancestral populations, and we find high concordance between the methods. Our method also has better accuracy than LAMP when applied to three-population admixture, a situation as yet unaddressed by HAPMIX. Finally, we apply our method to a data set of four Latino populations with European, African, and Native American ancestry. We find evidence of assortative mating in each of the four populations, and we identify regions of shared ancestry that may be recent targets of selection and could serve as candidate regions for admixture-based association mapping.
Admixture; Principal Components Analysis (PCA); Local Ancestry Deconvolution; Haplotype-Based; Forward-Backward Algorithm
Recent studies in populations of European ancestry have shown that 30% to 50% of heritability for human complex traits such as height and body mass index, and common diseases such as schizophrenia and rheumatoid arthritis, can be captured by common SNPs and that genetic variation attributed to chromosomes is in proportion to their length. Using genome-wide estimation and partitioning approaches, we analysed 49 human quantitative traits, many of which are relevant to human diseases, in 7,170 unrelated Korean individuals genotyped on 326,262 SNPs. For 43 of the 49 traits, we estimated a nominally significant (P<0.05) proportion of variance explained by all SNPs on the Affymetrix 5.0 genotyping array. On average across the 47 of the 49 traits for which this estimate is non-zero, common SNPs explain approximately one-third (range of 7.8% to 76.8%) of narrow sense heritability.
The estimate of the variance explained by all SNPs is highly correlated with the proportion of SNPs with association P<0.031 (r² = 0.92). Longer genomic segments tend to explain more phenotypic variation, with a correlation of 0.78 between the estimate of variance explained by individual chromosomes and their physical length, and 1% of the genome explains approximately 1% of the genetic variance. Despite the fact that there are a few SNPs with large effects for some traits, these results suggest that polygenicity is ubiquitous for most human complex traits and that a substantial proportion of the “missing heritability” is captured by common SNPs.
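The relationship between segment length and variance explained can be sketched with a Pearson correlation and a simple proportional expectation. The segment lengths and variance values below are toy numbers invented for illustration, not data from the study, and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def proportional_shares(lengths, total_variance):
    """Variance each segment would explain if variance tracked length exactly."""
    total_length = sum(lengths)
    return [total_variance * length / total_length for length in lengths]


# Toy inputs (assumed): segment lengths in Mb and variance explained per segment.
segment_lengths_mb = [250.0, 200.0, 150.0, 100.0, 50.0]
variance_explained = [0.050, 0.038, 0.032, 0.018, 0.012]

r = pearson_r(segment_lengths_mb, variance_explained)
expected = proportional_shares(segment_lengths_mb, sum(variance_explained))

print(f"r = {r:.2f}")
for length, observed, share in zip(segment_lengths_mb, variance_explained, expected):
    print(f"{length:6.0f} Mb  observed {observed:.3f}  proportional {share:.3f}")
```

Under a strictly polygenic model with evenly spread causal variants, observed values would scatter around the proportional column, which is the pattern the "1% of the genome explains approximately 1% of the variance" statement describes.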
The “missing heritability” problem has been intensely debated for the last few years. Possible explanations include the existence of many genetic variants each with a small effect, rare variants with large effects, and over-estimation of heritability. Previous studies using whole-genome estimation have demonstrated that for human complex traits such as height, body mass index, and intelligence, a large portion of the heritability can be captured by all the common SNPs on the current genotyping arrays. These studies, however, focused on only a few traits. In this study, we analysed 49 quantitative traits in a sample of ∼7,000 unrelated Korean individuals. We found that, on average over all the traits, common SNPs on the Affymetrix 5.0 genotyping array explain approximately a third of the heritability, that genetic variants are widely distributed across the whole genome with longer chromosomes explaining more phenotypic variation, and that any 1% of the genome explains approximately 1% of the heritability. Despite examples where a few variants explain a substantial amount of variation, all these results are consistent with polygenicity being ubiquitous for most complex traits.