Genome-wide association studies (GWAS) test for disease-trait associations and estimate effect sizes at tag single-nucleotide polymorphisms (SNPs), which imperfectly capture variation at causal SNPs. Sequencing studies can examine potential causal SNPs directly; however, sequencing the whole genome or exome can be prohibitively expensive. Costs can be limited by using a GWAS to detect the associated region(s) at tag SNPs followed by targeted sequencing to identify and estimate the effect size of the causal variant. Genetic effect estimates obtained from association studies can be inflated because of a form of selection bias known as the winner’s curse. Conversely, estimates at tag SNPs can be attenuated compared to the causal SNP because of incomplete linkage disequilibrium. These two effects oppose each other. Analysis of rare SNPs further complicates our understanding of the winner’s curse because rare SNPs are difficult to tag and analysis can involve collapsing over multiple rare variants. In two-stage analysis of Genetic Analysis Workshop 17 simulated data sets, we find that selection at the tag SNP produces upward bias in the estimate of effect at the causal SNP, even when the tag and causal SNPs are not well correlated. The bias similarly carries through to effect estimates for rare variant summary measures. Replication studies designed with sample sizes computed using biased estimates will be under-powered to detect a disease-causing variant. Accounting for bias in the original study is critical to avoid discarding disease-associated SNPs at follow up.
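The selection bias described above can be illustrated with a minimal simulation sketch. It is not the Genetic Analysis Workshop 17 analysis: the true effect size, standard error, selection threshold, and the function name `simulate_winners_curse` are all assumptions chosen for illustration, and each hypothetical study is reduced to a single normally distributed effect estimate.

```python
import random
import statistics


def simulate_winners_curse(true_beta=0.1, se=0.05, z_threshold=3.0,
                           n_studies=10000, seed=0):
    """Return (mean of all estimates, mean of estimates passing selection).

    All parameter values are illustrative assumptions, not values from the
    abstract above.
    """
    rng = random.Random(seed)
    # Each hypothetical study yields an unbiased but noisy effect estimate.
    estimates = [rng.gauss(true_beta, se) for _ in range(n_studies)]
    # Selection step: keep only estimates whose z statistic passes the threshold.
    selected = [b for b in estimates if b / se > z_threshold]
    return statistics.mean(estimates), statistics.mean(selected)


mean_all, mean_selected = simulate_winners_curse()
print(f"all studies: {mean_all:.3f}, selected studies only: {mean_selected:.3f}")
```

Under these assumptions the average over all studies stays close to the true effect, while the average over the selected studies must exceed it, because only estimates above `z_threshold * se` survive selection. A replication sample size computed from the selected estimate would therefore be too small.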
The approach to molecular genetic studies of complex phenotypes has evolved considerably in recent years. The candidate gene approach, restricted to analysis of a few single nucleotide polymorphisms (SNPs) in a modest number of cases and controls, has been supplanted by the unbiased approach of genome-wide association studies (GWAS), wherein a large number of tag SNPs are typed in a large number of individuals. GWAS, which are designed upon the common disease-common variant (CD-CV) hypothesis, have identified a large number of SNPs and loci for complex phenotypes. However, alleles identified through GWAS are typically not causative but rather in linkage disequilibrium (LD) with the true causal variants. The common alleles, which may not capture the uncommon and rare variants, account for only a fraction of the heritability of complex traits. Hence, the focus is shifting to the rare variant-common disease (RV-CD) hypothesis, which surmises that rare variants exert large effects on the phenotype. In conjunction with this conceptual shift, technological advances in DNA sequencing have dramatically enhanced whole genome or whole exome sequencing capacity. The sequencing approach affords identification of not only the rare but also the common variants. The approach, whether used to complement GWAS or as a stand-alone approach, could define the genetic architecture of complex phenotypes. Robust phenotyping and large-scale sequencing studies are essential to extract the information content of the vast number of DNA sequence variants (DSVs) in the genome. To garner meaningful clinical information and link the genotype to a phenotype, identification and characterization of a very large number of causal fields beyond the information content of DNA sequence variants would be necessary. This review provides an update on the current progress and limitations in identifying DSVs that are associated with phenotypic effects.
Genome-wide association studies (GWAS) are a widely used study design for detecting genetic causes of complex diseases. Current studies provide good coverage of common causal SNPs, but not rare ones. A popular method to detect rare causal variants is haplotype testing. A disadvantage of this approach is that many parameters are estimated simultaneously, which can mean a loss of power and slower fitting to large datasets.
Haplotype testing effectively tests both the allele frequencies and the linkage disequilibrium (LD) structure of the data. LD has previously been shown to be mostly attributable to LD between adjacent SNPs. We propose a generalised linear model (GLM) which models the effects of each SNP in a region as well as the statistical interactions between adjacent pairs. This is compared to two other commonly used multimarker GLMs: one with a main-effect parameter for each SNP; one with a parameter for each haplotype.
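As a rough sketch of how the predictors of such a model could be laid out, the code below builds one design-matrix row per individual with a main-effect column for each SNP and an interaction column for each adjacent pair. Coding genotypes as minor-allele counts (0, 1, 2) and coding the interaction as the product of adjacent counts are assumptions made here for illustration; the abstract does not specify the coding, and the function names are hypothetical.

```python
def adjacent_interaction_row(genotypes):
    """Build predictors for one individual.

    genotypes: minor-allele counts (0, 1 or 2) for the SNPs in a region,
    ordered by position. Returns the main-effect terms followed by the
    adjacent-pair interaction terms (assumed here to be simple products).
    """
    main_effects = list(genotypes)
    interactions = [a * b for a, b in zip(genotypes, genotypes[1:])]
    return main_effects + interactions


def design_matrix(genotype_rows):
    """Apply adjacent_interaction_row to every individual."""
    return [adjacent_interaction_row(row) for row in genotype_rows]


# Three individuals typed at four SNPs (made-up genotypes).
example = [[0, 1, 2, 0], [1, 1, 0, 2], [2, 0, 0, 1]]
for row in design_matrix(example):
    print(row)
```

With m SNPs this layout has 2m - 1 predictors (m main effects plus m - 1 adjacent interactions), which sits between the m predictors of a main-effects model and the up to 2^m possible haplotypes of m biallelic SNPs.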
We show the haplotype model has higher power for rare untyped causal SNPs, the main-effects model has higher power for common untyped causal SNPs, and the proposed model generally has power in between the two others. We show that the relative power of the three methods is dependent on the number of marker haplotypes the causal allele is present on, which depends on the age of the mutation. Except in the case of a common causal variant in high LD with markers, all three multimarker models are superior in power to single-SNP tests.
Including the adjacent statistical interactions results in lower inflation in test statistics when a realistic level of population stratification is present in a dataset.
Using the multimarker models, we analyse data from the Molecular Genetics of Schizophrenia study. The multimarker models find potential associations that are not found by single-SNP tests. However, multimarker models also require stricter control of data quality since biases can have a larger inflationary effect on multimarker test statistics than on single-SNP test statistics.
Analysing a GWAS with multimarker models can yield candidate regions which may contain rare untyped causal variants. This is useful for increasing prior odds of association in future whole-genome sequence analyses.
In this paper, we develop a powerful test for identifying SNP-sets that are predictive of survival with data from genome-wide association studies (GWAS). We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially non-linear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum p-value based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.
Cox model; genetic studies; gene-based analysis; kernel machine; multi-locus test; score test; single nucleotide polymorphism
The human genome contains millions of common single nucleotide polymorphisms (SNPs), and these SNPs play an important role in understanding the association between genetic variation and human disease. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), so it is not necessary to genotype all SNPs for an association study. Many algorithms have been developed to find a small subset of SNPs, called tag SNPs, that are sufficient to infer all the other SNPs. Algorithms based on the r2 LD statistic have gained popularity because r2 is directly related to statistical power to detect disease associations. Most existing r2-based algorithms use pairwise LD. Recent studies show that multi-marker LD can help further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming. They cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
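For reference, the pairwise r2 statistic mentioned above can be computed from phased haplotypes as D^2 / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA pB. The sketch below shows only this standard pairwise statistic, not the multi-marker tagging rules; the 0/1 allele coding, the function name, and the choice to return 0 for a monomorphic SNP are assumptions of the example.

```python
def pairwise_r2(haplotypes, i, j):
    """Pairwise LD r2 between SNPs i and j.

    haplotypes: list of phased haplotypes, each a sequence of 0/1 alleles.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # r2 is undefined when either SNP is monomorphic; this sketch returns 0.
        return 0.0
    d = p_ab - p_a * p_b                             # LD coefficient D
    return d * d / denom


# Made-up haplotypes: SNP 0 and SNP 1 always co-occur, SNP 2 is independent of SNP 0.
haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(pairwise_r2(haps, 0, 1), pairwise_r2(haps, 0, 2))
```

A pairwise tagging scheme would treat SNP 1 as captured by SNP 0 whenever this value meets the chosen r2 threshold.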
We propose an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger uses several techniques to reduce running time and memory consumption. Our experiment results show that FastTagger is several times faster than existing multi-marker based tag SNP selection algorithms, and it consumes much less memory at the same time. As a result, FastTagger can work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
FastTagger also produces smaller sets of tag SNPs than existing multi-marker based algorithms, and the reduction ratio ranges from 3%-9% when length-3 tagging rules are used. The generated tagging rules can also be used for genotype imputation. We studied the prediction accuracy of individual rules, and the average accuracy is above 96% when r2 ≥ 0.9.
Generating multi-marker tagging rules is a computation intensive task, and it is the bottleneck of existing multi-marker based tag SNP selection methods. FastTagger is a practical and scalable algorithm to solve this problem.
Association studies hold great promise for the elucidation of the genetic basis of diseases. Studies based on functional single nucleotide polymorphisms (SNPs) or on linkage disequilibrium (LD) represent two main types of designs. LD-based association studies can be comprehensive for common causative variants, but they perform poorly for rare alleles. Conversely, functional SNP-based studies are efficient because they focus on the SNPs with the highest a priori chance of being associated. Our poor ability to predict the functional effect of SNPs, however, hampers attempts to make these studies comprehensive. Recent progress in comparative genomics, and evidence that functional elements tend to lie in conserved regions, promises to change the landscape, permitting functional SNP association studies to be carried out that comprehensively assess common and rare alleles. SNP genotyping technologies are already sufficient for such studies, but studies will require continued genomic sequencing of multiple species, research on the functional role of conserved sequences and additional SNP discovery and validation efforts (including targeted SNP discovery to identify the rare alleles in functional regions). With these resources, we expect that comprehensive functional SNP association studies will soon be possible.
functional SNPs; association studies; human disease
OBJECTIVE— Variants in ADIPOQ have been inconsistently associated with adiponectin levels or diabetes. Using comprehensive linkage disequilibrium mapping, we genotyped single nucleotide polymorphisms (SNPs) in ADIPOQ to evaluate the association of common variants with adiponectin levels and risk of diabetes.
RESEARCH DESIGN AND METHODS— Participants in the Framingham Offspring Study (n = 2,543, 53% women) were measured for glycemic phenotypes and incident diabetes over 28 years of follow-up; adiponectin levels were quantified at exam 7. We genotyped 22 tag SNPs that captured common (minor allele frequency >0.05) variation at r2 > 0.8 across ADIPOQ plus 20 kb 5′ and 10 kb 3′ of the gene. We used linear mixed effects models to test additive associations of each SNP with adiponectin levels and glycemic phenotypes. Hazard ratios (HRs) for incident diabetes were estimated using an adjusted Cox proportional hazards model.
RESULTS— Two promoter SNPs in strong linkage disequilibrium with each other (r2 = 0.80) were associated with adiponectin levels (rs17300539; Pnominal [Pn] = 2.6 × 10−8; Pempiric [Pe] = 0.0005 and rs822387; Pn = 3.8 × 10−5; Pe = 0.001). A 3′-untranslated region (3′UTR) SNP (rs6773957) was associated with adiponectin levels (Pn = 4.4 × 10−4; Pe = 0.005). A nonsynonymous coding SNP (rs17366743, Y111H) was confirmed to be associated with diabetes incidence (HR 1.94 [95% CI 1.16–3.25] for the minor C allele; Pn = 0.01) and with higher mean fasting glucose over 28 years of follow-up (Pn = 0.0004; Pe = 0.004). No other significant associations were found with other adiposity and metabolic phenotypes.
CONCLUSIONS— Adiponectin levels are associated with SNPs in two different regulatory regions (5′ promoter and 3′UTR), whereas diabetes incidence and time-averaged fasting glucose are associated with a missense SNP of ADIPOQ.
Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets.
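The entropy-maximization idea can be sketched in a few lines. The example below is a deliberate simplification: it estimates entropy from the empirical frequencies of observed haplotype patterns and searches all subsets exhaustively, whereas the approach described here relies on a probabilistic model of haplotype sequences. The function names and the toy haplotypes are assumptions for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def tag_set_entropy(haplotypes, tag_positions):
    """Shannon entropy (in bits) of the haplotype patterns seen at the tag positions."""
    patterns = Counter(tuple(h[p] for p in tag_positions) for h in haplotypes)
    n = len(haplotypes)
    return -sum((c / n) * math.log2(c / n) for c in patterns.values())


def best_tag_set(haplotypes, k):
    """Exhaustively pick the k positions with maximal empirical entropy.

    Only feasible for small regions; used here purely for illustration.
    """
    n_snps = len(haplotypes[0])
    return max(combinations(range(n_snps), k),
               key=lambda positions: tag_set_entropy(haplotypes, positions))


# Made-up haplotypes in which SNP 0 and SNP 2 carry identical information.
haps = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
chosen = best_tag_set(haps, 2)
print(chosen, tag_set_entropy(haps, chosen))
```

In this toy data a tag set containing both SNP 0 and SNP 2 is redundant and has lower entropy than a set pairing either of them with SNP 1, which is the behaviour the entropy criterion is meant to reward.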
Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection.
Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. In addition, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software that implements our approach is available.
Genome-wide association studies (GWASs) aim to genotype enough single nucleotide polymorphisms (SNPs) to effectively capture common genetic variants across the genome. Even though the number of SNPs genotyped in such studies can exceed a million, there is still interest in testing association with SNPs that were not genotyped in the study sample. Analyses of such untyped SNPs can assist in signal localization, permit cross-platform integration of samples from separate studies, and can improve power – especially for rarer SNPs. External information on a larger collection of SNPs from an appropriate reference panel, comprising both SNPs typed in the sample and the untyped SNPs we wish to test for association, is necessary for an untyped variant analysis to proceed. Linkage disequilibrium patterns observed in the reference panel are then used to infer the likely genotype at the untyped SNPs in the study sample. We propose here a novel statistical approach for testing untyped SNPs in case-control GWAS, based on an efficient score function derived from a prospective likelihood, that automatically accounts for the variability in the process of estimating the untyped variant. Computationally efficient methods of phasing can be used without affecting the validity of the test, and simple measures of haplotype sharing can be used to infer genotypes at the untyped SNPs, making our approach computationally much faster than existing approaches for untyped analysis. At the same time, we show, using simulated data, that our approach often has performance nearly equivalent to hidden Markov methods of untyped analysis. The software package ‘untyped’ is available to implement our approach.
Genotype imputation; Genome-wide association study; Efficient score; Case-control study
Multiple sclerosis (MS) is a complex disease with underlying genetic and environmental factors. Although alleles within the major histocompatibility complex (MHC) are known to exert strong effects on MS risk, much remains to be learned about the contributions of loci with more modest effects identified by genome-wide association studies (GWASs), as well as loci that remain undiscovered. We use a recently developed method to estimate the proportion of variance in disease liability explained by 475,806 single nucleotide polymorphisms (SNPs) genotyped in 1,854 MS cases and 5,164 controls. We reveal that ~30% of MS genetic liability is explained by SNPs in this dataset, the majority of which is accounted for by common variants. These results suggest that the unaccounted-for proportion could be explained by variants that are in imperfect linkage disequilibrium with common GWAS SNPs, highlighting the potential importance of rare variants in the susceptibility to MS.
We show that the statistical power of a single single-nucleotide polymorphism (SNP) score test for genetic association reflects the cumulative effect of all causal SNPs that are correlated with the test SNP. Statistical significance of a score test can sometimes be explained by the collective effect of weak correlations between the test SNP and multiple causal SNPs. In a finite population, weak but significant correlations between the test SNP and the causal SNPs can arise by chance alone. As a consequence, when a single-SNP score test shows significance, the causal SNPs contributing to the power of the test are not necessarily located near the test SNP, nor do they have to be in linkage disequilibrium with the test SNP. These findings are confirmed with the Genetic Analysis Workshop 17 mini-exome data. The findings of this study highlight the often overlooked importance of long-range and weak linkage disequilibrium in genetic association studies.
Background: The haplotype H1 of the tau gene, MAPT, is highly associated with progressive supranuclear palsy (PSP) and corticobasal degeneration (CBD).
Objective: To investigate the pathogenic basis of this association.
Methods: Detailed linkage disequilibrium and common haplotype structure of MAPT were examined in 27 CEPH trios using validated HapMap genotype data for 24 single nucleotide polymorphisms (SNPs) spanning MAPT.
Results: Multiple variants of the H1 haplotype were resolved, reflecting a far greater diversity of MAPT than can be explained by the H1 and H2 clades alone. Based on this, six haplotype tagging SNPs (htSNPs) that capture 95% of the common haplotype diversity were used to genotype well characterised PSP and CBD case–control cohorts. In addition to strong association with PSP and CBD of individual SNPs, two common haplotypes derived from these htSNPs were identified that are highly associated with PSP: the sole H2 derived haplotype was underrepresented and one of the common H1 derived haplotypes was highly associated, with a similar trend observed in CBD. There were powerful and highly significant associations with PSP and CBD of haplotypes formed by three H1 specific SNPs. This made it possible to define a candidate region of at least ∼56 kb, spanning sequences from upstream of MAPT exon 1 to intron 9. On the H1 haplotype background, these could harbour the pathogenic variants.
Conclusions: The findings support the pathological evidence that underlying variations in MAPT could contribute to disease pathogenesis by subtle effects on gene expression and/or splicing. They also form the basis for the investigation of the possible genetic role of MAPT in Parkinson's disease and other tauopathies, including Alzheimer's disease.
The calculation of the power and sample size required for association studies is essential, particularly for follow-up of genome-wide association studies, where much genotyping is required to replicate the original finding and identify the true disease susceptibility mutation.
In this paper, we derive equations for estimating sample sizes for the transmission disequilibrium test (TDT) and for case-control studies in the presence of allelic heterogeneity and indirect association, where the genotyped tagging SNP is in linkage disequilibrium (LD) with the true mutation. Using data from NOD2 and PTPN22, we show that the sample sizes required to detect association may be incorrect when calculated under the assumption of a single mutation and complete LD with the genotyped marker.
The true sample sizes may be lower when allelic heterogeneity acts in a recessive model across mutations, or increased when mutations lie on different alleles of a common tagging SNP.
Calculating power and sample size under a range of realistic models of LD and allelic heterogeneity is essential to ensure that association studies have sufficient power to detect mutations.
Allelic association; Case-control study; Heterogeneity; Linkage disequilibrium; Mutations; Sample size; Transmission disequilibrium test; Allelic heterogeneity
High-throughput genotyping chips have contributed greatly to genome-wide association (GWA) studies that identify novel disease susceptibility single nucleotide polymorphisms (SNPs). High-density chips are designed using two different kinds of SNP selection approach: the direct gene-centric approach, and the indirect quasi-random SNP or linkage disequilibrium (LD)-based tagSNP approaches. Although all these approaches can provide high genome coverage and ascertain variants in genes, it is not clear to what extent they capture the common genic variants. It is also important to characterize and compare the differences between these approaches.
In our study, using both the Phase II HapMap data and the disease variants extracted from OMIM, a gene-centric evaluation was first performed to assess the ability of the approaches to capture the disease variants in the Caucasian population. The distribution patterns of SNPs were then characterized in genic regions, evolutionarily conserved introns and nongenic regions, ontologies and pathways. The results show that, no matter which SNP selection approach is used, the current high-density SNP chips provide very high coverage in genic regions and can capture most of the known common disease variants under the HapMap framework. The results also show that the differences between the direct and the indirect approaches are relatively small. Both have similar SNP distribution patterns in these gene-centric characteristics.
This study suggests that the indirect approaches not only have the advantage of high coverage but also are useful for studies focusing on various functional SNPs either in genes or in the conserved regions that the direct approach supports. The study and the annotation of characteristics will be helpful for designing and analyzing GWA studies that aim to identify genetic risk factors involved in common diseases, especially variants in genes and conserved regions.
By assaying hundreds of thousands of single nucleotide polymorphisms, genome wide association studies (GWAS) allow for a powerful, unbiased review of the entire genome to localize common genetic variants that influence health and disease. Although it is widely recognized that some correction for multiple testing is necessary, in order to control the family-wide Type 1 Error in genetic association studies, it is not clear which method to utilize. One simple approach is to perform a Bonferroni correction using all n single nucleotide polymorphisms (SNPs) across the genome; however this approach is highly conservative and would "overcorrect" for SNPs that are not truly independent. Many SNPs fall within regions of strong linkage disequilibrium (LD) ("blocks") and should not be considered "independent".
We proposed to approximate the number of "independent" SNPs by counting 1 SNP per LD block, plus all SNPs outside of blocks (interblock SNPs). We examined the effective number of independent SNPs for Genome Wide Association Study (GWAS) panels. In the CEPH Utah (CEU) population, by considering the interdependence of SNPs, we could reduce the total number of effective tests within the Affymetrix and Illumina SNP panels from 500,000 and 317,000 to 67,000 and 82,000 "independent" SNPs, respectively. For the Affymetrix 500 K and Illumina 317 K GWAS SNP panels we recommend using 10−5, 10−7 and 10−8 and for the Phase II HapMap CEPH Utah and Yoruba populations we recommend using 10−6, 10−7 and 10−9 as "suggestive", "significant" and "highly significant" p-value thresholds to properly control the family-wide Type 1 error.
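The counting rule and the resulting correction can be sketched as follows. The family-wide error rate of 0.05 and the function names are assumptions made for illustration; the sketch shows the arithmetic of an LD-adjusted Bonferroni correction and is not intended to reproduce the rounded thresholds recommended above.

```python
def effective_number_of_tests(n_ld_blocks, n_interblock_snps):
    """Count one 'independent' SNP per LD block plus every interblock SNP."""
    return n_ld_blocks + n_interblock_snps


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold for a family-wide error rate alpha (assumed 0.05)."""
    return alpha / n_tests


# Panel sizes and effective test counts taken from the text above.
panels = [
    ("Affymetrix, all SNPs", 500_000),
    ("Affymetrix, LD-adjusted", 67_000),
    ("Illumina, all SNPs", 317_000),
    ("Illumina, LD-adjusted", 82_000),
]
for label, n_tests in panels:
    print(f"{label}: {bonferroni_threshold(n_tests):.2e}")
```

Because the effective number of tests is smaller than the number of genotyped SNPs, the LD-adjusted threshold is less stringent than the naive Bonferroni threshold for the same panel.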
By approximating the effective number of independent SNPs across the genome we are able to 'correct' for a more accurate number of tests and therefore develop 'LD adjusted' Bonferroni corrected p-value thresholds that account for the interdependence of SNPs on well-utilized commercially available SNP "chips". These thresholds will serve as guides to researchers trying to decide which regions of the genome should be studied further.
OBJECTIVE—A recent meta-analysis demonstrated a nominal association of the ectonucleotide pyrophosphatase phosphodiesterase 1 (ENPP1) K→Q missense single nucleotide polymorphism (SNP) at position 121 with type 2 diabetes. We set out to confirm the association of ENPP1 K121Q with hyperglycemia, expand this association to insulin resistance traits, and determine whether the association stems from K121Q or another variant in linkage disequilibrium with it.
RESEARCH DESIGN AND METHODS—We characterized the haplotype structure of ENPP1 and selected 39 tag SNPs that captured 96% of common variation in the region (minor allele frequency ≥5%) with an r2 value ≥0.80. We genotyped the SNPs in 2,511 Framingham Heart Study participants and used age- and sex-adjusted linear mixed effects (LME) models to test for association with quantitative metabolic traits. We also examined whether interaction between K121Q and BMI affected glycemic trait levels.
RESULTS—The Q allele of K121Q (rs1044498) was associated with increased fasting plasma glucose (FPG), A1C, fasting insulin, and insulin resistance by homeostasis model assessment (HOMA-IR; all P = 0.01–0.006). Two noncoding SNPs (rs7775386 and rs7773477) demonstrated similar associations, but LME models indicated that their effects were not independent from K121Q. We found no association of K121Q with obesity, but interaction models suggested that the effect of the Q allele on FPG and HOMA-IR was stronger in those with a higher BMI (P = 0.008 and 0.01 for interaction, respectively).
CONCLUSIONS—The Q allele of ENPP1 K121Q is associated with hyperglycemia and insulin resistance in whites. We found an adiposity-SNP interaction, with a stronger association of K121Q with diabetes-related quantitative traits in people with a higher BMI.
Though multiple interacting loci are likely involved in the etiology of complex diseases, early genome-wide association studies (GWAS) have depended on the detection of the marginal effects of each locus. Here, we evaluate the power of GWAS in the presence of two linked and potentially associated causal loci for several models of interaction between them and find that interacting loci may give rise to marginal relative risks that are not generally considered in a one-locus model. To derive power under realistic situations, we use empirical data generated by the HapMap ENCODE project for both allele frequencies and LD structure. The power is also evaluated in situations where the causal SNPs may not be genotyped, but rather detected by proxy using a tag SNP (tSNP) in linkage disequilibrium (LD). A common simplification for such power computations assumes that the sample size necessary to detect the effect at the tSNP is the sample size necessary to detect the causal locus directly divided by the LD measure r2 between the two. This assumption, which we call the “proportionality assumption”, is a simplification of the many factors that contribute to the strength of association at a marker, and has recently been criticized as unreasonable [Terwilliger and Hiekkalinna 2006], in particular in the presence of interacting and associated loci. We find that this assumption does not introduce much error in single locus models of disease, but may do so in certain two-locus models.
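The proportionality assumption itself is a one-line calculation, sketched below. The function name and the example numbers are hypothetical, and rounding up to a whole number of individuals is a choice made here; the sketch encodes the simplification under discussion, not a claim that it holds in two-locus models.

```python
import math


def tag_snp_sample_size(n_causal, r2):
    """Sample size at the tag SNP under the proportionality assumption.

    n_causal: sample size needed to detect the causal locus directly.
    r2: LD measure between the tag SNP and the causal locus (0 < r2 <= 1).
    """
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in the interval (0, 1]")
    # N_tag = N_causal / r2, rounded up to a whole number of individuals.
    return math.ceil(n_causal / r2)


# Hypothetical example: a study needing 1,000 individuals at the causal SNP.
for r2 in (1.0, 0.5, 0.25):
    print(r2, tag_snp_sample_size(1000, r2))
```

Halving r2 doubles the required sample size under this rule, which is why weakly tagged causal variants are expensive to detect by proxy.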
Genetic Predisposition to Disease; Genome, Human; genetics; Genotype; Humans; Linkage Disequilibrium; Models, Genetic; Polymorphism, Single Nucleotide; Quantitative Trait Loci; genetics; linkage disequilibrium; genome-wide; tagSNPs
Predisposition to psoriasis is known to be affected by genetic variation in HLA-C, IL12B and IL23R, but other genetic risk factors also exist. We recently reported three psoriasis-associated single nucleotide polymorphisms (SNPs) in the 5q31 locus, a region of high linkage disequilibrium laden with inflammatory pathway genes. The aim of this study was to assess whether other variants in the 5q31 region are causal to these SNPs or make independent contributions to psoriasis risk by genotyping a comprehensive set of tagging SNPs in a 725 kb region bounded by IL3 and IL4 and testing for disease association. Ninety SNPs, capturing 86.4% of the genetic diversity, were tested in one case–control sample set (467 cases/460 controls) and significant markers (Pallelic < 0.05) (n = 9) were then tested in two other sample sets (981 cases/925 controls). All nine SNPs were significant in a meta-analysis of the combined sample sets. Pair-wise conditional association tests showed rs1800925, an intergenic SNP located just upstream of IL13 (Mantel–Haenszel Pcombined = 1.5 × 10−4, OR = 0.77 [0.67–0.88]), could account for observed significant association of all but one other SNP, rs11568506 in SLC22A4 [Mantel–Haenszel Pcombined = 0.043, OR = 0.68 (0.47–0.99)]. Haplotype analysis of these two SNPs showed increased significance for the two common haplotypes (rs11568506–rs1800925: GC, Pcombined = 5.67 × 10−6, OR = 1.37; GT, Pcombined = 6.01 × 10−5, OR = 0.75; global haplotype P = 8.93 × 10−5). Several 5q31-region SNPs strongly associated with Crohn's disease (CD) in the recent WTCCC study were not significant in the psoriasis sample sets tested here. These results identify the most significant 5q31 risk variants for psoriasis and suggest that distinct 5q31 variants contribute to CD and psoriasis risk.
Multiple loss-of-function (LOF) alleles at the same gene may influence a phenotype not only in the homozygote state when alleles are considered individually, but also in the compound heterozygote (CH) state. Such LOF alleles typically have low frequencies and moderate to large effects. Detecting such variants is of interest to the genetics community, and relevant statistical methods for detecting and quantifying their effects are sorely needed. We present a collapsed double heterozygosity (CDH) test to detect the presence of multiple LOF alleles at a gene. When causal SNPs are available, which may be the case in next generation genome sequencing studies, this CDH test has overwhelmingly higher power than single SNP analysis. When causal SNPs are not directly available, such as in current GWA settings, we show the CDH test has higher power than standard single SNP analysis if tagging SNPs are in linkage disequilibrium with the underlying causal SNPs to at least a moderate degree (r2>0.1). The test is implemented for genome-wide analysis, using a sliding window approach, in the publicly available software package GenABEL. We provide the proof of principle by conducting a genome-wide CDH analysis of red hair color, a trait known to be influenced by multiple loss-of-function alleles, in a total of 7,732 Dutch individuals with hair color ascertained. The association signals at the MC1R gene locus from CDH were uniformly more significant than traditional GWA analyses (the most significant P for CDH = 3.11×10−142 vs. P for rs258322 = 1.33×10−66). The CDH test will contribute towards finding rare LOF variants in GWAS and sequencing studies.
Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build an RF.
We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetics models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.
Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.
Copy number variants (CNVs) within humans can have both adaptive and deleterious effects. Because of their phenotypic significance, researchers have attempted to find single nucleotide polymorphisms (SNPs) in high linkage disequilibrium (LD) with CNVs to use in genomewide association studies. However, studies have found that CNVs are less likely to be in strong LD with flanking markers. We hypothesized that this “taggability gap” can be explained by duplication events that place paralogous sequences far apart. In support of our hypothesis, we find that duplications are significantly less likely than deletions to have a “tag” SNP, even after controlling for CNV length, allele frequency, and availability of appropriate flanking SNPs. Using a novel likelihood method, we are able to show that many complex CNVs—those due to multiple duplication or deletion polymorphisms—are made up of two loci with little LD between them. Additionally, we find that many polymorphic duplications detected in a recent clone-based study are located far from their parental loci. We also examine two other common hypotheses for the taggability gap, and find that recurrent mutation of both deletions and duplications appears to have an effect on LD, but that lower SNP density around CNVs has no effect. Overall, our results suggest that a substantial fraction of CNVs caused by duplication cannot be tagged by markers flanking the parental locus because they have changed genomic location.
copy number variation; population genetics; association studies
Association studies can focus on candidate gene(s), a particular genomic region, or adopt a genome-wide association approach, each of which has implications for marker selection. The strategy for marker selection will affect the statistical power of the study to detect a disease association and is a crucial element of study design. The abundant single nucleotide polymorphisms (SNPs) are the markers of choice in genetic case-control association studies. The genotypes of neighbouring SNPs are often highly correlated (‘in linkage disequilibrium’ – LD) within a population, a property that is utilised for selecting specific ‘tagSNPs’ to serve as proxies for other nearby SNPs in high LD. General guidelines for SNP selection in candidate genes/regions and genome-wide studies are provided in this protocol, along with illustrative examples. Publicly available web-based resources are utilised to browse and retrieve data, and software such as Haploview and Goldsurfer2 is applied to investigate LD and to select tagSNPs.
gene; genetic marker; SNP; case-control study; association; design
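The idea of choosing tagSNPs as proxies for nearby SNPs in high LD can be illustrated with a simple greedy pairwise-tagging sketch. This is a generic illustration under stated assumptions, not necessarily the procedure used by Haploview or Goldsurfer2: LD is approximated by the squared Pearson correlation of unphased 0/1/2 genotype dosages (an assumption; haplotype-based r² may differ), and the function names, SNP identifiers and genotypes are hypothetical.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Pick tagSNPs so every SNP has r^2 >= threshold with some tag.

    genotypes: dict mapping SNP id -> list of 0/1/2 dosages (same sample order).
    Returns the tags in the order they were chosen.
    """
    snps = list(genotypes)
    # Each SNP covers itself plus every SNP it is in high LD with.
    covers = {s: {s} for s in snps}
    for s, t in combinations(snps, 2):
        if r_squared(genotypes[s], genotypes[t]) >= threshold:
            covers[s].add(t)
            covers[t].add(s)

    untagged = set(snps)
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs.
        best = max(snps, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical genotypes for six individuals at three SNPs.
toy_genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],  # perfectly correlated with snpA
    "snpC": [2, 0, 1, 1, 0, 2],  # uncorrelated with snpA
}
selected_tags = greedy_tag_snps(toy_genotypes, threshold=0.8)
```

In this toy example one of the two perfectly correlated SNPs serves as a proxy for the other, while the uncorrelated SNP must be typed directly. Lowering the threshold trades coverage accuracy for a smaller tag set.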
The mitochondrion, conventionally thought to be an organelle specific to energy metabolism, is in fact multi-functional and implicated in many diseases, including cancer. To evaluate whether mitochondria-related genes are associated with increased risk for prostate cancer, we genotyped 24 single nucleotide polymorphisms (SNPs) within the mitochondrial genome (mtSNPs) and 376 tagSNPs localized to 78 nuclear-encoded mitochondrial genes. The tagSNPs were selected to achieve ≥80% coverage based on linkage disequilibrium. We compared allele and haplotype frequencies in ~1000 prostate cancer cases with ~500 population controls. An association with prostate cancer was not detected for any of the mtSNPs individually or for 10 common mitochondrial haplotypes when evaluated using a global score statistic. For the nuclear-encoded genes, none of the tagSNPs were significantly associated with prostate cancer after adjusting for multiple testing. Nonetheless, we evaluated unadjusted p-values by comparing our results with those from the CGEMS phase I data set. Seven tagSNPs had unadjusted p-values ≤ 0.05 in both our data and in CGEMS (two SNPs were identical and five were in strong linkage disequilibrium with CGEMS SNPs). These seven SNPs (rs17184211, rs4147684, rs4233367, rs2070902, rs3829037, rs7830235, and rs1203213) are located in genes MTRR, NDUFA9, NDUFS2, NDUFB9 and COX7A2, respectively. Five of the seven SNPs were further included in the CGEMS phase II study; however, none of the findings for these were replicated. Overall, these results suggest that polymorphisms in the mitochondrial genome and those in the nuclear-encoded mitochondrial genes evaluated are not substantial risk factors for prostate cancer.
mitochondria; prostate cancer; genetic polymorphism; cancer risk
Published genome-wide association studies (GWASs) have identified few variants in the known biological pathways involved in lung cancer etiology. To mine the possibly hidden causal single nucleotide polymorphisms (SNPs), we explored all SNPs in the extrinsic apoptosis pathway from our published GWAS dataset for 1154 lung cancer cases and 1137 cancer-free controls. In an initial association analysis of 611 tagSNPs in 41 apoptosis-related genes, we identified only 10 tagSNPs associated with lung cancer risk with a P value < 10⁻², including four tagSNPs in DAPK1 and three tagSNPs in TNFSF8. Unlike DAPK1 SNPs, TNFSF8 rs2181033 tagged four other predicted functional but untyped SNPs (rs776576, rs776577, rs31813148 and rs2075533) in the promoter region. Therefore, we further tested the binding affinity of these four SNPs by performing the electrophoretic mobility shift assay. We found that only the rs2075533 T allele modified levels of nuclear proteins bound to DNA, leading to significantly decreased expression of luciferase reporter constructs by 5- to 10-fold in H1299, HeLa and HCT116 cell lines compared with the C allele. We also performed a replication study of the untyped rs2075533 in an independent Texas population but did not confirm the protective effect. We further performed a mini meta-analysis for SNPs of TNFSF8 obtained from four other published lung cancer GWASs with 12 214 cases and 47 721 controls, and we found that only rs3181366 (r² = 0.69 with the untyped rs2075533) was associated with lung cancer risk (P = 0.008). Our findings suggest a possible role of novel TNFSF8 variants in susceptibility to lung cancer.
The HapMap project aimed to catalog millions of common single nucleotide polymorphisms (SNPs) in the human genome in four major populations, in order to facilitate association studies of complex diseases. To examine the transferability of HapMap data from Han Chinese in Beijing to the Southern Han Chinese in Shanghai, we performed comparative analyses of genotypes from over 4,500 SNPs in a 21 Mb region on chromosome 1q21-q25 in 80 unrelated Shanghai Chinese and 45 HapMap Chinese individuals.
Three thousand and forty-two SNPs were analyzed after removal of SNPs that failed quality control and those not in the HapMap panel. We compared the allele frequency distributions, linkage disequilibrium patterns, haplotype frequency distributions and transferability of tagging SNP sets between the HapMap population and the Shanghai Chinese population. Among the four HapMap populations, Beijing Chinese showed the best correlation with the Shanghai population on allele frequencies, linkage disequilibrium and haplotype frequencies. Tagging SNP sets selected from the four HapMap populations at different thresholds were evaluated in the Shanghai sample. Under a threshold of r² equal to 0.8 or 0.5, both HapMap Chinese and Japanese data showed better coverage and tagging efficiency than Caucasian and African data.
Our study supports the applicability of HapMap Beijing Chinese SNP data to the study of complex diseases in the southern Chinese population.
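Evaluating how well a tagging SNP set chosen in a reference population transfers to a target sample amounts to asking what fraction of the target's SNPs are captured by at least one tag at a given r² threshold. The sketch below illustrates that coverage calculation under simplifying assumptions: r² is approximated by the squared Pearson correlation of unphased 0/1/2 dosages, the function names are hypothetical, and the genotypes are invented rather than drawn from HapMap or the Shanghai sample.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic in this sample
    return sxy * sxy / (sxx * syy)


def tag_coverage(target_genotypes, tags, threshold=0.8):
    """Fraction of target-sample SNPs captured by a tag set chosen elsewhere.

    A SNP counts as captured if it is itself a tag or has r^2 >= threshold
    with at least one tag, with r^2 computed in the target sample.
    """
    if not target_genotypes:
        return 0.0
    typed_tags = [t for t in tags if t in target_genotypes]
    captured = 0
    for snp, dosages in target_genotypes.items():
        if snp in typed_tags or any(
            r_squared(dosages, target_genotypes[t]) >= threshold
            for t in typed_tags
        ):
            captured += 1
    return captured / len(target_genotypes)


# Hypothetical target-sample genotypes; "snpA" is the tag chosen elsewhere.
target = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpD": [0, 1, 2, 0, 2, 1],  # r^2 with snpA is 9/16 in this sample
}
coverage_strict = tag_coverage(target, ["snpA"], threshold=0.8)
coverage_relaxed = tag_coverage(target, ["snpA"], threshold=0.5)
```

In this toy sample the tag captures only itself at the 0.8 threshold but captures both SNPs at 0.5, which shows why coverage is reported separately for each threshold.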