Genome-wide association (GWA) studies have proved extremely successful in identifying novel genetic loci contributing effects to complex human diseases. In doing so, they have highlighted the fact that many potential loci of modest effect remain undetected, partly due to the need for samples consisting of many thousands of individuals. Large-scale international initiatives, such as the Wellcome Trust Case Control Consortium, the Genetic Association Information Network, and the database of genetic and phenotypic information, aim to facilitate discovery of modest-effect genes by making genome-wide data publicly available, allowing information to be combined for the purpose of pooled analysis. In principle, disease or control samples from these studies could be used to increase the power of any GWA study via judicious use as “genetically matched controls” for other traits. Here, we present the biological motivation for the problem and the theoretical potential for expanding the control group with publicly available disease or reference samples. We demonstrate that a naïve application of this strategy can greatly inflate the false-positive error rate in the presence of population structure. As a remedy, we make use of genome-wide data and model selection techniques to identify “axes” of genetic variation which are associated with disease. These axes are then included as covariates in association analysis to correct for population structure, which can result in increases in power over standard analysis of genetic information from the samples in the original GWA study. Genet. Epidemiol. 34: 319–326, 2010. © 2010 Wiley-Liss, Inc.
genome-wide association study; expanded control group; population structure; multidimensional scaling; model selection
Genome-wide association (GWA) studies are a powerful approach for identifying novel genetic risk factors associated with human disease. A GWA study typically requires the inclusion of thousands of samples to have sufficient statistical power to detect single nucleotide polymorphisms (SNPs) that are associated with only modest increases in risk of disease, given the heavy burden of the multiple-testing correction that is necessary to maintain valid statistical tests. Low statistical power and the high financial cost of performing a GWA study remain prohibitive for many scientific investigators anxious to perform such a study using their own samples. A number of remedies have been suggested to increase statistical power and decrease cost, including the utilization of free publicly available genotype data and multi-stage genotyping designs. Herein, we compare the statistical power and relative costs of alternative association study designs that use cases and screened controls to study designs that are based only on, or additionally include, free public control genotype data. We describe a novel replication-based two-stage study design, which uses free public control genotype data in the first stage and follow-up genotype data on case-matched controls in the second stage, that preserves many of the advantages inherent when using only an epidemiologically matched set of controls. Specifically, we show that our proposed two-stage design can substantially increase statistical power and decrease cost of performing a GWA study while controlling the type I error rate that can be inflated when using public controls due to differences in ancestry and batch genotype effects.
Case-Control; Association Study; Genome-wide; Two-stage; Power
Aitkin recently proposed an integrated Bayesian/likelihood approach that he claims is general and simple. We have applied this method, which does not rely on informative prior probabilities or large-sample results, to investigate the evidence of association between disease and the 16 variants in the KDR gene provided by Genetic Analysis Workshop 17. Based on the likelihood of logistic regression models and considering noninformative uniform prior probabilities on the coefficients of the explanatory variables, we used a random walk Metropolis algorithm to simulate the distributions of deviance and deviance difference. The distribution of probability values and the distribution of the proportions of positive deviance differences showed different locations, but the direction of the shift depended on the genetic factor. For the variant with the highest minor allele frequency and for any rare variant, standard logistic regression showed a higher power than the novel approach. For the two variants with the strongest effects on Q1 under a type I error rate of 1%, the integrated approach showed a higher power than standard logistic regression. The advantages and limitations of the integrated Bayesian/likelihood approach should be investigated using additional regions and considering alternative regression models and collapsing methods.
Identification of population structure can help trace population histories and identify disease genes. Structured association (SA) is a commonly used approach for population structure identification and association mapping. A major issue with SA is that its performance greatly depends on the informativeness and the number of ancestral informative markers (AIMs). Present major AIM selection methods mostly require prior individual ancestry information, which is usually not available or uncertain in practice. To address this potential weakness, we herein develop a novel approach for AIM selection based on principal component analysis (PCA), which does not require prior ancestry information of study subjects. Our simulation and real genetic data analysis results suggest that, with equivalent AIMs, PCA-based selected AIMs can significantly increase the accuracy of inferred individual ancestries compared with traditionally randomly selected AIMs. Our method can easily be applied to whole genome data to select a set of highly informative AIMs in population structure, which can then be used to identify potential population structure and correct possible statistical biases caused by population stratification.
population structure; principal component analysis; ancestral informative markers
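The abstract above does not spell out the selection rule, so the following is only a minimal sketch of one plausible reading: rank SNPs by the absolute value of their loading on the first principal component and keep the top k. The function names, the use of power iteration, and the restriction to a single component are all assumptions made for illustration, not the authors' procedure.

```python
import math

def top_pc_loadings(genotypes, n_iter=200):
    """First-principal-component SNP loadings via power iteration.

    genotypes: list of rows (individuals), each a list of 0/1/2 allele counts.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center each SNP column so the component reflects variation, not mean frequency.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Start vector with distinct entries, to avoid starting orthogonal
    # to the leading component (hypothetical choice).
    v = [1.0 / (j + 1) for j in range(m)]
    for _ in range(n_iter):
        scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]          # X v
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def select_aims(genotypes, k):
    """Indices of the k SNPs with the largest absolute first-PC loading."""
    loadings = top_pc_loadings(genotypes)
    ranked = sorted(range(len(loadings)), key=lambda j: abs(loadings[j]), reverse=True)
    return sorted(ranked[:k])
```

On a toy panel in which two SNPs differ sharply in frequency between two groups and the remaining SNPs do not, the two differentiated SNPs receive the largest loadings. No ancestry labels are passed to the function, which is the property the abstract emphasises.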
Summary: Recent technological developments in measuring genetic variation have ushered in an era of genome-wide association studies which have discovered many genes involved in human disease. Current methods to perform association studies collect genetic information and compare the frequency of variants in individuals with and without the disease. Standard approaches do not take into account any information on whether or not a given variant is likely to have an effect on the disease. We propose a novel method for computing an association statistic which takes into account prior information. Our method improves both power and resolution by 8% and 27%, respectively, over traditional methods for performing association studies when applied to simulations using the HapMap data. Advantages of our method are that it is as simple to apply to association studies as standard methods, the results of the method are interpretable as the method reports p-values, and the method is optimal in its use of prior information in regards to statistical power.
Availability: The method presented herein is available at http://masa.cs.ucla.edu
Motivation: As many complex disease and expression phenotypes are the outcome of intricate perturbation of molecular networks underlying gene regulation resulting from interdependent genome variations, association mapping of causal QTLs or expression quantitative trait loci must consider both additive and epistatic effects of multiple candidate genotypes. This problem poses a significant challenge to contemporary genome-wide-association (GWA) mapping technologies because of its computational complexity. Fortunately, a plethora of recent developments in the biological network community, especially the availability of genetic interaction networks, make it possible to construct informative priors of complex interactions between genotypes, which can substantially reduce the complexity and increase the statistical power of GWA inference.
Results: In this article, we consider the problem of learning a multitask regression model while taking advantage of the prior information on structures on both the inputs (genetic variations) and outputs (expression levels). We propose a novel regularization scheme over multitask regression called jointly structured input–output lasso based on an ℓ1/ℓ2 norm, which allows shared sparsity patterns for related inputs and outputs to be optimally estimated. Such patterns capture multiple related single nucleotide polymorphisms (SNPs) that jointly influence multiple-related expression traits. In addition, we generalize this new multitask regression to structurally regularized polynomial regression to detect epistatic interactions with manageable complexity by exploiting the prior knowledge on candidate SNPs for epistatic effects from biological experiments. We demonstrate our method on simulated and yeast eQTL datasets.
Availability: Software is available at http://www.sailing.cs.cmu.edu/.
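As a rough illustration of the ℓ1/ℓ2 norm mentioned in the Results, the sketch below computes the mixed norm over non-overlapping coefficient groups and the group soft-thresholding step that proximal methods typically use with it. It shows only the generic building block. The jointly structured input–output penalty described above combines group structure on both inputs and outputs, which is not reproduced here, and the function names are hypothetical.

```python
import math

def l1_l2_penalty(coefs, groups):
    """Sum over groups of the Euclidean norm of each group's coefficients."""
    return sum(math.sqrt(sum(coefs[j] ** 2 for j in g)) for g in groups)

def group_soft_threshold(vec, lam):
    """Proximal operator of lam * ||vec||_2: shrink the whole group toward zero."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm <= lam:
        # The entire group is zeroed out together; this is what yields
        # shared sparsity patterns across the members of a group.
        return [0.0] * len(vec)
    scale = 1.0 - lam / norm
    return [scale * c for c in vec]
```

Because a group is either shrunk as a unit or removed as a unit, related SNPs (or related expression traits) grouped together enter or leave the model jointly.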
We are moving to second-wave analysis of genome-wide association studies (GWAS), characterized by comprehensive bioinformatical and statistical evaluation of genetic associations. Existing biological knowledge is very valuable for GWAS and may help improve their detection power, particularly for disease susceptibility loci of moderate effect size. However, a challenging question is how to utilize available resources that are very heterogeneous to quantitatively evaluate statistical significance.
We present a novel knowledge-based weighting framework to boost power of the GWAS and insightfully strengthen their explorative performance for follow-up replication and deep sequencing. Built upon diverse integrated biological knowledge, this framework directly models both the prior functional information and the association significances emerging from GWAS to optimally highlight single nucleotide polymorphisms (SNPs) for subsequent replication. In the theoretical calculation and computer simulation, it shows great potential to achieve over 15% extra power to identify an association signal of moderate strength or to reach similar power with hundreds fewer whole-genome subjects. In a case study on late-onset Alzheimer disease (LOAD) for a proof of principle, it highlighted some genes, which showed positive association with LOAD in previous independent studies, and two important LOAD related pathways. These genes and pathways could otherwise have been ignored because the SNPs involved have only moderate association significance.
With user-friendly implementation in an open-source Java package, this powerful framework will provide an important complementary solution to identify more true susceptibility loci with modest or even small effect size in current GWAS for complex diseases.
Sequencing studies have been discovering a large number of rare variants, allowing the identification of the effects of rare variants on disease susceptibility. As a method to increase the statistical power of studies on rare variants, several groupwise association tests that group rare variants in genes and detect associations between genes and diseases have been proposed. One major challenge in these methods is to determine which variants are causal in a group, and to overcome this challenge, previous methods used prior information that specifies how likely each variant is causal. Another source of information that can be used to determine causal variants is the observed data because case individuals are likely to have more causal variants than control individuals. In this article, we introduce a likelihood ratio test (LRT) that uses both data and prior information to infer which variants are causal and uses this finding to determine whether a group of variants is involved in a disease. We demonstrate through simulations that LRT achieves higher power than previous methods. We also evaluate our method on mutation screening data of the susceptibility gene for ataxia telangiectasia, and show that LRT can detect an association in real data. To increase the computational speed of our method, we show how we can decompose the computation of LRT, and propose an efficient permutation test. With this optimization, we can efficiently compute an LRT statistic and its significance at a genome-wide level. The software for our method is publicly available at http://genetics.cs.ucla.edu/rarevariants.
rare variants; association studies; SNPs; genetics; statistics
The genetic basis of bipolar disorder has long been thought to be complex, with the potential involvement of multiple genes, but methods to analyze populations with respect to this complexity have only recently become available. We have carried out a genome-wide association study of bipolar disorder by genotyping over 550,000 SNPs in two independent case-control samples of European origin. The initial association screen was performed using pooled DNA; selected SNPs were confirmed by individual genotyping. While DNA pooling reduces power to detect genetic associations, there is a substantial cost savings and gain in efficiency. A total of 88 SNPs representing 80 different genes met the prior criteria for replication in both samples. Effect sizes were modest: no single SNP of large effect was detected. Of 37 SNPs selected for individual genotyping, the strongest association signal was detected at a marker within the first intron of DGKH (p = 1.5 × 10⁻⁸, experiment-wide p < 0.01, OR = 1.59). This gene encodes diacylglycerol kinase eta, a key protein in the lithium-sensitive phosphatidyl inositol pathway. This first genome-wide association study of bipolar disorder shows that several genes, each of modest effect, reproducibly influence disease risk. Bipolar disorder may be a polygenic disease.
mania; DAG; polygenic; lithium; DFNB31; whirlin; Wnt; pooling; HapMap
Genome-wide association studies (GWASs) have been used to identify genes that increase risk of psychiatric diseases. However, much of the variation in disease risk is still unexplained, suggesting that there are genes still to be discovered. Functional annotation of genetic variants may increase the power of GWASs to identify disease genes by providing prior information that can be used in Bayesian analysis or in reducing the number of tests. Genetic mapping of expression quantitative trait loci (eQTLs) is helping us to reveal novel functional effects of thousands of single nucleotide polymorphisms (SNPs). The published brain eQTL studies are reviewed here, and major methodological issues and their possible solutions are discussed. We emphasize the frequently-ignored problems of batch effects, covariates, and multiple testing, all of which can lead to false positives and false negatives. The future application of eQTL data to the GWAS analysis is also discussed.
genome-wide association study; brain; psychiatric diseases; eQTL; genetics; SNP
Quantitative traits (QT) are an important focus of human genetic studies both because of interest in the traits themselves, and because of their role as risk factors for many human diseases. For large-scale QT association studies including genome-wide association studies (GWAS), investigators usually focus on genetic loci showing significant evidence for SNP-QT association, and genetic effect size tends to be overestimated as a consequence of the winner’s curse. In this paper, we study the impact of the winner’s curse on QT association studies in which the genetic effect size is parameterized as the slope in a linear regression model. We demonstrate by analytical calculation that the overestimation in the regression slope estimate decreases as power increases. To reduce the ascertainment bias, we propose a three-parameter maximum likelihood method and then simplify this to a one-parameter method by excluding nuisance parameters. We show that both methods reduce the bias when power to detect association is low or moderate, and that the one-parameter model generally results in smaller variance in the estimate.
quantitative trait; winner’s curse; ascertainment bias; genome-wide association study; linear regression; maximum likelihood
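The relationship between power and overestimation can be illustrated with a simple normal approximation. The sketch below assumes the slope estimate is distributed as N(beta, se²) and that only estimates passing a two-sided test at level alpha are reported; it then evaluates the expected reported estimate relative to the true slope using the mean of a two-sided truncated normal. This is an illustrative calculation under those assumptions, not the three-parameter or one-parameter likelihood method proposed above, and the function names are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()

def power_two_sided(mu, alpha=0.05):
    """Power of a two-sided z-test when the test statistic has mean mu."""
    c = _STD.inv_cdf(1 - alpha / 2)
    return (1 - _STD.cdf(c - mu)) + _STD.cdf(-c - mu)

def inflation_ratio(beta, se, alpha=0.05):
    """E[beta_hat | test significant] / beta, assuming beta_hat ~ N(beta, se^2)."""
    mu = beta / se
    c = _STD.inv_cdf(1 - alpha / 2)
    power = power_two_sided(mu, alpha)
    # Mean of Z ~ N(mu, 1) conditional on |Z| > c.
    cond_mean_z = mu + (_STD.pdf(c - mu) - _STD.pdf(c + mu)) / power
    return cond_mean_z / mu
```

Evaluating `inflation_ratio` at increasing values of `beta / se` shows the ratio falling toward 1 as power rises, consistent with the analytical result stated in the abstract.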
Lack of power and reproducibility are caveats of genetic association studies of common complex diseases. Indeed, the heterogeneity of disease etiology demands that causal models consider the simultaneous involvement of multiple genes. Rothman’s sufficient-cause model, which is well known in epidemiology, provides a framework for such a concept. In the present work, we developed a three-stage algorithm to construct gene clusters resembling Rothman’s causal model for a complex disease, starting from finding influential gene pairs followed by grouping homogeneous pairs.
The algorithm was trained and tested on 2,772 hypertensives and 6,515 normotensives extracted from four large Caucasian and Taiwanese databases. The constructed clusters, each featuring a major gene interacting with many other genes and identifying a distinct group of patients, were reproduced in both ethnic populations and across three genotyping platforms. We present the 14 largest gene clusters, which were capable of identifying 19.3% of hypertensives in all the datasets and 41.8% if one dataset was excluded for lack of phenotype information. Although a few normotensives were also identified by the gene clusters, they usually carried less risky combinatory genotypes (insufficient causes) than the hypertensive counterparts. After establishing a cut-off percentage for risky combinatory genotypes in each gene cluster, the 14 gene clusters achieved a classification accuracy of 82.8% for all datasets and 98.9% if the information-short dataset was excluded. Furthermore, not only 10 of the 14 major genes but also many other contributing genes in the clusters are associated with either hypertension or hypertension-related diseases or functions.
We have shown with the constructed gene clusters that a multi-causal pie-multi-component approach can indeed improve the reproducibility of genetic markers for complex disease. In addition, our novel findings including a major gene in each cluster and sufficient risky genotypes in a cluster for disease onset (which coincides with Rothman’s sufficient cause theory) may not only provide a new research direction for complex diseases but also help to reveal the disease etiology.
Genetic causal mechanism; Sufficient cause; Data-mining; Young-onset hypertension; Complex disease
Genome-wide association studies are a promising new tool for deciphering the genetics of complex diseases. To choose the proper sample size and genotyping platform for such studies, power calculations that take into account genetic model, tag SNP selection, and the population of interest are required.
The power of genome-wide association studies can be computed using a set of tag SNPs and a large number of genotyped SNPs in a representative population, such as available through the HapMap project. As expected, power increases with increasing sample size and effect size. Power also depends on the tag SNPs selected. In some cases, more power is obtained by genotyping more individuals at fewer SNPs than fewer individuals at more SNPs.
Genome-wide association studies should be designed thoughtfully, with the choice of genotyping platform and sample size being determined from careful power calculations.
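A minimal sketch of the kind of power calculation referred to above, for a single directly genotyped SNP tested by comparing allele frequencies between cases and controls. It assumes a per-allele (multiplicative) odds ratio, a normal approximation with unpooled variance, and a genome-wide significance level of 5 × 10⁻⁸ as the default. It ignores tag SNP coverage and linkage disequilibrium, which the abstract identifies as important inputs, so it should be read as a starting point only. The function name and parameters are hypothetical.

```python
from statistics import NormalDist

def allelic_test_power(n_cases, n_controls, control_freq, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided allelic z-test for one SNP."""
    std = NormalDist()
    p0 = control_freq
    # Case allele frequency implied by a per-allele odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    # Each individual contributes two alleles to the frequency estimate.
    se = (p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = abs(p1 - p0) / se
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)
```

Calling the function over a grid of sample sizes and odds ratios reproduces the expected qualitative behaviour: power rises with both sample size and effect size.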
For most complex trait association studies using next-generation sequencing, in addition to the primary phenotype of interest, many clinically important secondary traits are also available, which can be analyzed to map susceptibility genes. Owing to high sequencing costs, most studies use selected samples, and the sampling mechanisms of these studies can be complicated. When the primary and secondary traits are correlated, analyses of secondary phenotypes can cause spurious associations in selected samples and existing methods are inadequate to adjust for them. To address this problem, a likelihood-based method, MULTI-TRAIT-ASSOCIATION (MTA) was developed. MTA is flexible and can be applied to any study with known sampling mechanisms. It also allows efficient inferences of genetic parameters. To investigate the power of MTA and different study designs, extensive simulations were performed under rigorous population genetic and phenotypic models. It is demonstrated that there are great benefits for analyzing secondary phenotypes in selected samples. In particular, using case–control samples and samples with extreme primary phenotypes can be more powerful than analyzing random samples of equivalent size. One major challenge for sequence-based association studies is that most data sets are not of sufficient size to be adequately powered. By applying MTA, data sets ascertained under distinct mechanisms or targeted at different primary traits can be jointly analyzed to map common phenotypes and greatly increase power. The combined analysis can be performed using freely available data sets from public repositories, for example, dbGaP. In conclusion, MTA will have an important role in dissecting the etiology of complex traits.
multiple phenotypes; next-generation sequencing; rare variants; pleiotropy; secondary trait; selective sampling
The genetic association analysis using haplotypes as basic genetic units is anticipated to be a powerful strategy towards the discovery of genes predisposing human complex diseases. In particular, the increasing availability of high-resolution genetic markers such as the single-nucleotide polymorphisms (SNPs) has made haplotype-based association analysis an attractive alternative to single marker analysis.
We consider haplotype association analysis under the population-based case-control study design. A multinomial logistic model is proposed for haplotype analysis with unphased genotype data, which can be decomposed into a prospective logistic model for disease risk as well as a model for the haplotype-pair distribution in the control population. Environmental factors can be readily incorporated and hence the haplotype-environment interaction can be assessed in the proposed model. The maximum likelihood estimation with unphased genotype data can be conveniently implemented in the proposed model by applying the EM algorithm to a prospective multinomial logistic regression model and ignoring the case-control design. We apply the proposed method to the hypertriglyceridemia study and identify 3 haplotypes in the apolipoprotein A5 gene that are associated with increased risk for hypertriglyceridemia. A haplotype-age interaction effect is also identified. Simulation studies show that the proposed estimator has satisfactory finite-sample performance.
Our results suggest that the proposed method can serve as a useful alternative to existing methods and a reliable tool for the case-control haplotype-based association analysis.
Statistical power calculations inform the design and interpretation of genetic association studies, but few programs are tailored to case-control studies of single nucleotide polymorphisms (SNPs) in unrelated subjects.
We have developed the "Power for Genetic Association analyses" (PGA) package, which comprises algorithms and graphical user interfaces for sample size and minimum detectable risk calculations using SNP or haplotype effects under different genetic models and study constraints. The software accounts for linkage disequilibrium and statistical multiple comparisons. The results are presented in graphs or tables and can be printed or exported in standard file formats.
PGA is user friendly software that can facilitate decision making for association studies of candidate genes, fine-mapping studies, and whole-genome scans. Stand-alone executable files and a Matlab toolbox are available for download at:
Systemic lupus erythematosus (SLE) is a serious prototype autoimmune disease characterized by chronic inflammation, auto-antibody production and multi-organ damage. Recent association studies have identified a long list of loci that were associated with SLE with relatively high statistical power. However, most of them only established the statistical associations of genetic markers and SLE at the DNA level without supporting evidence of functional relevance. Here, using publicly available datasets, we performed integrative analyses (gene relationship across implicated loci analysis, differential gene expression analysis and functional annotation clustering analysis) and combined them with expression quantitative trait loci (eQTLs) results to dissect functional mechanisms underlying the associations for SLE. We found that 14 SNPs, which were significantly associated with SLE in previous studies, have cis-regulation effects on four eQTL genes (HLA-DQA1, HLA-DQB1, HLA-DQB2, and IRF5) that were also differentially expressed in SLE-related cell groups. The functional evidence, taken together, suggested the functional mechanisms underlying the associations of 14 SNPs and SLE. The study may serve as an example of mining publicly available datasets and results in validation of significant disease-association results. Utilization of public data resources for integrative analyses may provide novel insights into the molecular genetic mechanisms underlying human diseases.
The systematic comparison of transcriptional responses of organisms is a powerful tool in functional genomics. For example, mutants may be characterized by comparing their transcript profiles to those obtained in other experiments querying the effects on gene expression of many experimental factors including treatments, mutations and pathogen infections. Similarly, drugs may be discovered by the relationship between the transcript profiles effectuated or impacted by a candidate drug and by the target disease. The integration of such data enables systems biology to predict the interplay between experimental factors affecting a biological system. Unfortunately, direct comparisons of gene expression profiles obtained in independent, publicly available microarray experiments are typically compromised by substantial, experiment-specific biases. Here we suggest a novel yet conceptually simple approach for deriving ‘Functional Association(s) by Response Overlap’ (FARO) between microarray gene expression studies. The transcriptional response is defined by the set of differentially expressed genes independent from the magnitude or direction of the change. This approach overcomes the limited comparability between studies that is typical for methods that rely on correlation in gene expression. We apply FARO to a compendium of 242 diverse Arabidopsis microarray experimental factors, including phyto-hormones, stresses and pathogens, growth conditions/stages, tissue types and mutants. We also use FARO to confirm and further delineate the functions of Arabidopsis MAP kinase 4 in disease and stress responses. Furthermore, we find that a large, well-defined set of genes responds in opposing directions to different stress conditions and predict the effects of different stress combinations. This demonstrates the usefulness of our approach for exploiting public microarray data to derive biologically meaningful associations between experimental factors. 
Finally, our results indicate that FARO is more powerful in associating mutants in common pathways than existing methods such as co-expression analysis.
To understand how stroke risk factors mechanistically contribute to stroke, the genetic components regulating each risk factor need to be integrated and evaluated with respect to biological function and through pathway-based algorithms. This resource will provide information to researchers studying the molecular and genetic causes of stroke in terms of genomic variants, genes, and pathways.
Reported genetic variants, gene structure, phenotypes, and literature information regarding stroke were collected and extracted from publicly available databases describing variants, genome, proteome, functional annotation, and disease subtypes. Stroke related candidate pathways and etiologic genes that participate significantly in risk were analyzed in terms of canonical pathways in public biological pathway databases. These efforts resulted in a relational database of genetic signals of cerebral stroke, SigCS base, which implements an effective web retrieval system.
The current version of SigCS base documents 1943 non-redundant genes with 11472 genetic variants and 165 non-redundant pathways. The web retrieval system of SigCS base consists of two principal search flows, including: 1) a gene-based variant search using gene table browsing or a keyword search, and, 2) a pathway-based variant search using pathway table browsing. SigCS base is freely accessible at http://sysbio.kribb.re.kr/sigcs.
SigCS base is an effective tool that can assist researchers in the identification of the genetic factors associated with stroke by utilizing existing literature information, selecting candidate genes and variants for experimental studies, and examining the pathways that contribute to the pathophysiological mechanisms of stroke.
Bayesian clinical trial designs offer the possibility of a substantially reduced sample size, increased statistical power, and reductions in cost and ethical hazard. However, when prior and current information conflict, Bayesian methods can lead to higher than expected Type I error, as well as the possibility of a costlier and lengthier trial. This motivates an investigation of the feasibility of hierarchical Bayesian methods for incorporating historical data that are adaptively robust to prior information that reveals itself to be inconsistent with the accumulating experimental data. In this paper, we present several models that allow for the commensurability of the information in the historical and current data to determine how much historical information is used. A primary tool is elaborating the traditional power prior approach based upon a measure of commensurability for Gaussian data. We compare the frequentist performance of several methods using simulations, and close with an example of a colon cancer trial that illustrates a linear models extension of our adaptive borrowing approach. Our proposed methods produce more precise estimates of the model parameters, in particular conferring statistical significance to the observed reduction in tumor size for the experimental regimen as compared to the control regimen.
Adaptive Designs; Bayesian; Colorectal Cancer; Clinical Trials; Power Priors
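For orientation, the traditional power prior that the abstract builds on raises the historical-data likelihood to a power a0 between 0 and 1 before combining it with the current data. The sketch below works this out for the simplest case, a Gaussian mean with known standard deviation and a flat initial prior, with a0 fixed by the user. The adaptive, commensurability-driven choice of borrowing described above is not implemented here, and the function name and arguments are hypothetical.

```python
def power_prior_posterior(current, historical, sigma, a0):
    """Posterior mean and variance of a Gaussian mean under a fixed power prior.

    a0 = 0 ignores the historical data; a0 = 1 pools it fully with the current data.
    Assumes known sigma and a flat initial prior on the mean.
    """
    if not 0.0 <= a0 <= 1.0:
        raise ValueError("a0 must lie in [0, 1]")
    n, n0 = len(current), len(historical)
    ybar = sum(current) / n
    ybar0 = sum(historical) / n0
    # Historical observations count as a0 * n0 "effective" observations.
    effective_n = n + a0 * n0
    mean = (n * ybar + a0 * n0 * ybar0) / effective_n
    variance = sigma ** 2 / effective_n
    return mean, variance
```

With a fixed a0, conflicting historical data pull the posterior mean toward the historical estimate while still shrinking the variance, which is the behaviour that motivates letting the amount of borrowing depend on how commensurate the two data sources are.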
Association studies can focus on candidate gene(s), a particular genomic region, or adopt a genome-wide association approach, each of which has implications for marker selection. The strategy for marker selection will affect the statistical power of the study to detect a disease association and is a crucial element of study design. The abundant single nucleotide polymorphisms (SNPs) are the markers of choice in genetic case-control association studies. The genotypes of neighbouring SNPs are often highly correlated (‘in linkage disequilibrium’ – LD) within a population, a property that is utilised for selecting specific ‘tagSNPs’ to serve as proxies for other nearby SNPs in high LD. General guidelines for SNP selection in candidate genes/regions and genome-wide studies are provided in this protocol, along with illustrative examples. Publicly available web-based resources are utilised to browse and retrieve data, and software such as Haploview and Goldsurfer2 is applied to investigate LD and to select tagSNPs.
gene; genetic marker; SNP; case-control study; association; design
Motivation: A two-stage association study is the most commonly used method among multistage designs to efficiently identify disease susceptibility genes. Recently, some SNP studies have utilized more than two stages to detect disease genes. However, there are few available programs for calculating statistical powers and positive predictive values (PPVs) of arbitrary n-stage designs.
Results: We developed programs in the R language for multistage case–control association studies. Input parameters include the numbers of samples and candidate loci, the genome-wide false positive rate, and the proportions of samples and loci to be selected at the k-th stage (k=1,…, n). The programs output statistical powers, PPVs and numbers of typings in arbitrary n-stage designs. The programs can contribute to prior simulations under various conditions in planning a genome-wide association study.
Availability: The R programs are freely available for academic users and can be downloaded from http://www.med.niigata-u.ac.jp/eng/resources/informatics/gwa.html
Supplementary information: Supplementary data are available at Bioinformatics online.
Genome-wide association studies (GWAS) test hundreds of thousands of single-nucleotide polymorphisms (SNPs) for association with a trait, treating each marker equally and ignoring prior evidence of association for specific regions. Typically, promising regions are selected for further investigation based on p-values obtained from simple tests of association. However, loci that have only a weak, low-penetrance effect on the trait, and therefore produce modest evidence of association, are not detectable in the context of a GWAS. Implementing prior knowledge of association in GWAS could increase power, help distinguish between false and true positives, and identify better sets of SNPs for follow-up studies.
Here we performed a GWAS on rheumatoid arthritis (RA) patients and controls (Problem 1, Genetic Analysis Workshop 16). To include prior information in the analysis, we applied four methods that treat markers in candidate genes separately in the context of GWAS. SNPs were divided into a random and a candidate subset, and we then applied empirical correction by permutation, false-discovery rate, false-positive report probability, and posterior odds of association using different prior probabilities. We repeated the same analyses on two different sets of candidate markers, defined on the basis of previously reported association with RA following two different approaches. The four methods showed similar relative behavior when applied to the two sets, with the proportion of candidate SNPs ranked among the top 2,000 varying from 0 to 100%. The use of different prior probabilities changed the stringency of the methods, but not their relative performance.
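The way a prior probability changes the stringency of such a method can be seen in a short calculation of the false-positive report probability (FPRP). The sketch below uses the commonly cited definition, FPRP = α(1 − π) / (α(1 − π) + (1 − β)π), where π is the prior probability of a true association, α the significance level and 1 − β the power; the abstract does not state the exact formula used, so this form and all numerical values are assumptions for illustration only.

```python
def fprp(alpha, power, prior):
    """False-positive report probability for a result significant at level alpha.

    alpha: significance level (or observed p-value used as the threshold)
    power: probability of reaching that level if the association is real
    prior: prior probability that the association is real (0 < prior < 1)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    false_positive_mass = alpha * (1.0 - prior)
    true_positive_mass = power * prior
    return false_positive_mass / (false_positive_mass + true_positive_mass)


# Hypothetical priors: a candidate-gene SNP versus a randomly chosen SNP.
for label, prior in (("candidate SNP", 0.1), ("random SNP", 0.001)):
    print(f"{label}: FPRP = {fprp(alpha=0.05, power=0.8, prior=prior):.3f}")
```

For the same test result, a marker with a higher prior probability receives a lower FPRP, which is one way candidate SNPs can be ranked ahead of random SNPs.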
Single Nucleotide Polymorphism (SNP) analysis only captures a small proportion of associated genetic variants in Genome-Wide Association Studies (GWAS), partly due to small marginal effects. Pathway-level analysis incorporating prior biological information offers another way to analyze GWAS of complex diseases, and promises to reveal the mechanisms leading to complex diseases. Biologically defined pathways are typically composed of numerous genes. If only a subset of genes in the pathways is associated with disease, then a joint analysis including all individual genes would result in a loss of power. To address this issue, we propose a pathway-based method that allows us to test for joint effects by using a pre-selected gene subset. In the proposed approach, each gene is considered as the basic unit, which reduces the number of genetic variants considered and hence reduces the degrees of freedom in the joint analysis. The proposed approach can also be used to investigate the joint effect of several genes in a candidate gene study.
We applied this new method to a published GWAS of psoriasis and identified 6 biologically plausible pathways, after adjustment for multiple testing. The pathways identified in our analysis overlap with those reported in previous studies. Further, using simulations across a range of gene numbers and effect sizes, we demonstrate that the proposed approach enjoys higher power than several other approaches to detect associated pathways.
The proposed method could increase the power to discover susceptibility pathways and to identify associated genes using GWAS. In our analysis of genome-wide psoriasis data, we have identified a number of relevant pathways for psoriasis.
Genome-wide association studies (GWAS) are now feasible for studying the genetics underlying complex diseases. For many diseases, a list of candidate genes or regions exists and incorporation of such information into data analyses can potentially improve the power to detect disease variants. Traditional approaches for assessing the overall statistical significance of GWAS results ignore such information by inherently treating all markers equally.
We propose the prioritized subset analysis (PSA), in which a prioritized subset of markers is pre-selected from candidate regions, and the false discovery rate (FDR) procedure is carried out in the prioritized subset and its complementary subset, respectively.
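A minimal sketch of this two-subset idea is given below. It assumes the Benjamini-Hochberg step-up procedure as the FDR method and the same FDR level `q` in both subsets; neither detail is specified in the abstract, and the function names and p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvalues, q=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))   # largest rank that passes
        reject[order[:k + 1]] = True
    return reject


def prioritized_subset_analysis(pvalues, prioritized, q=0.05):
    """Run the FDR procedure separately in the prioritized subset and the rest."""
    p = np.asarray(pvalues, dtype=float)
    mask = np.asarray(prioritized, dtype=bool)
    reject = np.zeros(p.size, dtype=bool)
    reject[mask] = benjamini_hochberg(p[mask], q)
    reject[~mask] = benjamini_hochberg(p[~mask], q)
    return reject


# Hypothetical p-values; the first two markers lie in candidate regions.
demo_pvalues = [0.02, 0.5, 0.6, 0.7, 0.8, 0.9, 0.55, 0.65, 0.75, 0.85]
demo_prioritized = [True, True] + [False] * 8

print("whole-genome FDR:", benjamini_hochberg(demo_pvalues).nonzero()[0])
print("PSA:             ",
      prioritized_subset_analysis(demo_pvalues, demo_prioritized).nonzero()[0])
```

In this toy example the first marker is not declared significant when all ten markers are adjusted together, but it is when the small prioritized subset is adjusted on its own, because the multiple-testing burden within that subset is lighter.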
The PSA is more powerful than the whole-genome single-step FDR adjustment for a range of alternative models. The degree of power improvement depends on the fraction of associated SNPs in the prioritized subset and their nominal power, with a higher fraction of associated SNPs and higher nominal power leading to greater improvement. The power improvement can be substantial; for disease loci not included in the prioritized subset, the power loss is almost negligible.
The PSA has the flexibility of allowing investigators to combine prior information from a variety of sources, and will be a useful tool for GWAS.
Association analysis; False discovery rate; HapMap