|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Although Genome Wide Association Studies (GWAS) genotype a very large number of single nucleotide polymorphisms (SNPs), the data are often analyzed one SNP at a time. The low predictive power of single SNPs, coupled with the high significance threshold needed to correct for multiple testing, greatly decreases the power of GWAS.
Results: We propose a procedure in which all the SNPs are analyzed in a multiple generalized linear model, and we show its use for extremely high-dimensional datasets. Our method yields P-values for assessing significance of single SNPs or groups of SNPs while controlling for all other SNPs and the family wise error rate (FWER). Thus, our method tests whether or not a SNP carries any additional information about the phenotype beyond that available by all the other SNPs. This rules out spurious correlations between phenotypes and SNPs that can arise from marginal methods because the ‘spuriously correlated’ SNP merely happens to be correlated with the ‘truly causal’ SNP. In addition, the method offers a data driven approach to identifying and refining groups of SNPs that jointly contain informative signals about the phenotype. We demonstrate the value of our method by applying it to the seven diseases analyzed by the Wellcome Trust Case Control Consortium (WTCCC). We show, in particular, that our method is also capable of finding significant SNPs that were not identified in the original WTCCC study, but were replicated in other independent studies.
Availability and implementation: Reproducibility of our research is supported by the open-source Bioconductor package hierGWAS.
Supplementary information: Supplementary data are available at Bioinformatics online.
Genome-wide association studies (GWAS) have enjoyed increasing success and popularity in recent years, due mostly to the thousands of genetic variants found to be significantly associated with complex traits (Welter et al., 2014). The two common designs are case–control studies, which look for associations between SNPs and disease, and population-based studies which focus on finding associations between SNPs and continuous traits (McCarthy et al., 2008). The larger goal of these studies is to function as hypothesis-generating machines, resulting in sets of loci that require further analysis. Thus GWAS are an important first step in the gene identification process (Cantor et al., 2010). The findings from these studies provide preliminary genetic information, which need additional analysis and follow-up experiments to be validated. However, many studies have found only a few common SNPs per trait, and these SNPs have generally low predictive power, explaining only a small percentage of the variance (Manolio et al., 2009).
Often, SNPs are tested individually for association with the phenotype, using the Armitage Trend Test. Because genome-wide scans analyze hundreds of thousands or even millions of markers, the multiple testing issue is resolved by applying a stringent significance threshold—most commonly (Panagiotou and Ioannidis, 2012)—to the P-values. This method is successful only if the study is well-powered, such that the associations are strong enough to pass the stringent threshold. However, even if that is the case, this type of analysis has several limitations, which have been addressed in the literature (He and Lin, 2011; Hoggart et al., 2008; Li et al., 2011; Rakitsch et al., 2013; Schork, 2001). Here we focus on two of them. First, single SNPs tend to have small effect sizes. We can increase the explanatory power by looking at the joint effect of multiple SNPs. Second, when we test a SNP individually, we ignore the effects of all other SNPs. If we analyze marginally two sufficiently correlated SNPs, out of which only one is causal for the disease, both may show an association. This leads to higher false positive rates.
Joint modeling of all SNPs is challenging. Since in most GWAS the number of SNPs is much larger than the number of samples, the data cannot be analyzed using standard multivariable approaches. An established method in the field is the Genome-wide Complex Trait Analysis (GCTA), which is based on linear mixed models (Yang et al., 2011, 2014) and enables some joint analysis of SNPs. It allows for statistical significance tests of single SNPs (as fixed effects) while all SNPs other than the considered single SNP are built into the model as a simultaneous random effect. We would classify the obtained statistical significance of the SNPs as a hybrid between marginal (with only one or a few SNPs as fixed effects) and joint (since all the SNPs are in the model) modeling. It also enables to assess the combined effect of all SNPs which quantifies the heritable component of phenotype variation explained jointly by all the genotyped SNPs (Yang et al., 2010). Another solution to the high-dimensionality of the problem is the use of penalized regression, which constrains the magnitude of the regression coefficients, and allows them to be estimated. The two most widely used penalization methods are the Lasso (Tibshirani, 1996) and Ridge regression (Hoerl and Kennard, 1970). The Lasso penalizes the sum of the absolute values of the regression coefficients. It is a sparse estimator, meaning that it sets some regression coefficients to zero, while keeping others non-zero. Ridge regression penalizes the sum of squared regression coefficients, but it does not reduce the number of parameters in the model. In Abraham et al. (2013) it has been shown in the context of GWAS that penalization decreases the false positive rate and increases the probability of detecting the causal SNPs. There are several papers which consider a joint analysis. Methods which apply a penalized model include: the Bayesian Lasso (Li et al., 2011), a two-stage procedure using single regression followed by a Lasso selection (Shi et al., 2011), stability selection in the context of GWAS (Alexander and Lange, 2011), the so-called ISIS (Iterative Sure Independence Screening) combined with stability selection to select significant SNPs (He and Lin, 2011), a combination of Lasso and linear mixed models (Rakitsch et al., 2013), Lasso for screening (Wu et al., 2010) or ridge regression (Malo et al., 2008). None of the proposals (Alexander and Lange, 2011; He and Lin, 2011; Li et al., 2011; Rakitsch et al., 2013; Shi et al., 2011) compute P-values for SNPs. Wu et al. (2010) aims to control the type I error rate, the approaches using stability selection aim to control the expected number of false positive selections (Meinshausen and Bühlmann, 2010), while Shi et al. (2011) controls the False Discovery Rate (FDR).
Our goal is to construct valid P-values for SNPs in a (joint) multiple generalized linear model together with a computationally efficient and powerful way to address the issue of massive multiple statistical hypothesis testing. The problem is challenging due to the complex setting with hundreds of thousands of SNPs. Our method relies on a hierarchical procedure from Mandozzi and Bühlmann (2015) which we apply here for the first time to GWAS with very high-dimensional datasets. It provides P-values for multiple (joint) regression modeling of SNPs in high-dimensional settings. We compute P-values not only for individual SNPs, but also for groups of SNPs. The idea is to adapt to the strength of the signal present in the data: if the signal is too weak or the SNPs exhibit too high correlation, we might still detect a significant group of SNPs, instead of single SNP markers. Additionally, we compute the explained variance for every such group in a high-dimensional generalized linear model.
We demonstrate our method on the WTCCC data (The Wellcome Trust Case Control Consortium, 2007), due to the fact that strong associations have been found for some phenotypes in this dataset, and many of their findings have been replicated in subsequent studies. However, our method’s advantages are also evident for phenotypes with weak associations: for biologically distant phenotypic traits, the goal is rather to find regions of the genome that are strongly associated with the phenotype. Our proposed method makes it statistically and computationally possible to assess the significance of the parameters in a multiple (generalized) linear model, for large scale GWAS problems with millions of SNP markers. The interpretation of the parameters in a multiple (generalized) linear model is markedly different from marginal association and also from GCTA (Yang et al., 2011). In fact, under some assumptions, we can link the (joint) multiple linear model to causal inference (see Section 2.1). Thus, it as an important step to perform the statistical inference in a multiple generalized linear model.
Consider the following setting and notation. There are n samples (e.g. persons in a study), and each of them is indexed with . A response variable Yi for the ith sample point (e.g. the ith person in the study) encodes the status of a phenotype of interest. For example the binary status of a disease with , the continuous value of a survival time with or the continuous degree of an exposure or (log-) concentration with . The regressor Xi is a (long) vector which encodes the SNP profile for the ith sample point: is the value of the jth SNP for sample point i, taking three possible values corresponding to the number of minor alleles per person. Typically, the number of SNPs (regressors) is , while the number of samples is at least one order of magnitude smaller. A model measuring multivariable association is introduced next.
A well-established model for relating the phenotype (response variable) and the SNPs (regressors) is a generalized linear model (McCullagh and Nelder, 1989).
The easiest form thereof is a linear model for continuous (-valued) responses:
where are independent and identically distributed noise terms with expectation , finite variance and which are uncorrelated with the regressors .
For binary responses with , representing case (=1) or control (=0), we consider a logistic regression model:
Here, πi represents the probability of individual i having a case status given its SNPs Xi. There is no additional noise term and the stochastic nature of the model comes from the probability πi.
In both models, β0 denotes the intercept, and the coefficients βj are the (logistic) regression coefficients which measure the association of the jth SNP with the response. Such models, which take into account all SNPs, have two features. First, the (generalized) regression coefficients have the following (well-known) interpretation: βj measures the association effect of on Yi which is not explained by all other variables . Thus, a large βj, in absolute value, has the very powerful interpretation that SNP j has a strong association to the phenotype given all other SNPs or controlling for all other SNPs. This is in sharp contrast to marginal correlation between SNP j and the phenotype Y which can easily be of spurious nature and caused by another SNP k having a strong correlation with the phenotype and with SNP j.
Furthermore, the regression models are predictive in the sense that for a new sample point (e.g. person) with a given SNP profile we obtain a prediction for the corresponding phenotype (e.g. disease status) or , where . Note that this prediction is likely to be more informative or precise than a prediction that is based merely on marginal correlations because the general linear model applied here enables us to use the whole new SNP profile, and not just single SNPs, for predictive purposes.
Our main goal is to infer statistical significance of a single SNP or of a possibly large group of correlated SNPs for a given phenotype. More precisely, we aim for P-values, adjusted for multiple testing, when testing the following hypotheses:for single SNP j
or for a group of SNPs
The obtained P-values are with respect to a regression model and hence, they share the interpretation with the regression parameters described above. In particular, they are markedly different from a marginal or linear mixed model approach: the differences are also illustrated in simulation studies in Section 3.1.
A link to causal inference. If we assume (i) that the model is correct and that beyond the measured SNPs there are no hidden confounding variables—a condition that might be somewhat less problematic when having a million or more SNP markers—and (ii) that the causes point from the SNPs to the phenotype Y, the parameters can be given a causal interpretation. This link to causal inference shows again that a (joint) multiple regression model is very different from a marginal model. In a structural equation model the assumption that the causes point from the SNPs to the phenotype Y means that the arrows in a directed acyclic graph, that encode the causal influence diagram, point to Y and never point away from Y, i.e. Y is childless. Such an assumption says that some SNPs might be the cause for a phenotype, but the phenotype cannot be a cause for the SNPs, which seems a very reasonable assumption. Under these conditions, the following holds: if , then there must be a directed edge in the causal influence diagram of a linear structural equation model from SNP j to the phenotype Y with non-zero edge weight, i.e. there exists a non-zero direct causal effect from SNP j to the phenotype Y. This statement is not true with marginal associations (i.e. if SNP j is only marginally associated with Y) since adjusting for all other SNPs (different from SNP j) is crucial for causal statements. The details are given in Proposition S1.1 in the Supplementary Material Section S1.
The difficulty with a regression type analysis is the sheer high-dimensionality of the problem. The number of SNPs is massively larger than sample size n, which is at least one order of magnitude smaller. In such scenarios, standard statistical inference methods fail. Recent progress based on new methods such as multiple sample splitting, has allowed us to obtain statistical significance measures for regression parameters βj (Bühlmann, 2013; Meinshausen et al., 2009; Zhang and Zhang, 2014, cf.) or groups thereof (Mandozzi and Bühlmann, 2015). We rely here on this method (Mandozzi and Bühlmann, 2015), which shows reliable performance over a wide range of simulation settings (Dezeure et al., 2015), and enjoys the property of being computationally vastly more efficient than procedures which operate on the entire dataset. We extend the procedure from Mandozzi and Bühlmann (2015) from linear to logistic regression, and we show here for the first time how it performs for extremely high-dimensional GWAS data. The entire statistical procedure is schematically summarized in Figure 1.
In view of the high-dimensional nature of GWAS, it is rather unlikely to detect single SNPs which are significant when controlling for all other SNPs. Thus, it is a-priori more likely to detect (large) significant groups of SNPs with respect to the group hypotheses in a regression model. The construction of such groups is achieved by clustering the SNPs, as explained next.
Our goal is to perform significance testing on single SNPs (the hypotheses ) as well as arbitrarily large groups of SNPs (the hypotheses ). We do this hierarchically since this allows for powerful multiple testing adjustment as well as for efficient computation (see Section 2.4).
We first discuss the hierarchical clustering of SNPs. The hierarchy can be constructed in different ways. One option is to use specific domain knowledge to group the SNPs, for instance by clustering them first into genes, and then into functional pathways. Another option is to use standard hierarchical clustering methods which rely on a distance measure between the SNPs.
Here we adopt the second approach, which is similar to the construction of haplotype maps (Barrett et al., 2005). We use hierarchical clustering with average linkage (Jain and Dubes, 1988) which can be represented as a cluster tree, denoted by . The method requires a distance or dissimilarity measure between SNPs. We consider the distance between two SNPs as one minus their linkage disequilibrium (LD) value, where LD refers to the statistical dependency of the DNA content at nearby locations of the chromosome. One of the most common measures of LD is the square of the Pearson correlation coefficient (Hill and Robertson, 1968), which quantifies the linear dependence between two loci. Thus, two SNPs will have an LD equal to one if they are perfectly correlated, or an LD equal to zero if they are uncorrelated. Since LD has a tendency to decay with the distance of the studied loci, close-by SNPs are typically in high LD. This means that SNPs belonging to the same gene, or more generally, neighboring SNPs will end up in the same cluster. Often, LD is studied within each chromosome separately. Therefore, we construct separate cluster trees for each chromosome (in addition to providing a biological interpretation, clustering each chromosome separately results in substantial computational gains for problems with SNPs), and we then join these into one tree which contains all the SNPs in the study, as shown in Figure 2.
A cluster, as described in Section 2.3, is denoted by the generic letter G which encodes a subset of of single SNPs. We explain here how to test a null-hypothesis for a group in (4) or for a single SNP in (3).
In Section 2.4.2 we will show how one can construct valid P-values for the hypotheses and . On the basis of valid P-values, our hierarchical approach proceeds as follows:
In such a hierarchical testing procedure, which belongs to the scheme of sequential multiple hypothesis testing, the multiple testing adjustment is resolution dependent. To guarantee that the familywise error, i.e. the probability for at least one false rejection of the hypotheses among the multiple tests, is smaller than or equal to α for some pre-specified , e.g. , the hypothesis tests must be performed at different significance levels, depending on where one is in the hierarchy. The more we descend in the hierarchy, the more the multiple testing adjustment increases because we do more tests. It is important to keep in mind that even though the procedure controls the type I error simultaneously over all levels of the hierarchy, the adjustment for larger clusters does not depend on whether one will test their subclusters or not. While there is an ordering of the clusters, due to the nature of the hierarchical clustering procedure, and testing of subclusters stops once the null hypothesis of the parent cluster is accepted, the adjustment applied to the P-value of any cluster does not depend on the number of tests that have already been performed, but only (essentially) on the size of that particular cluster, see Section 2.4.2. The details have been developed by Meinshausen (2008). In other words: the final output of the method are P-values for significant groups with the interpretation, that these P-values control the familywise error rate for multiple testing (if we collect all groups with a P-value smaller or equal to α, then the probability for making one or more false rejections among all considered tests is less or equal to α). Furthermore, due to the hierarchical structure of the procedure, we can massively reduce the number of computations: if the final clusters or groups are relatively high up in the hierarchy of the cluster tree, we only need to compute relatively few hypothesis tests.
The hierarchical inference procedure above assumes that one has a method that constructs P-values which are valid (a P-value P is valid for a null-hypothesis H0 if for any , where denotes the probability assuming that H0 is true).
Due to the high-dimensionality with , obtaining a P-value for the hypotheses or in (3) or (4) is a non-trivial problem. We rely here on a multiple sample splitting approach from Meinshausen et al. (2009), and we follow exactly the method from Mandozzi and Bühlmann (2015). The idea is as follows. For repetitions:
The sample splitting in step 1 is made to avoid being over-optimistic when performing variable selection and P-value construction on the same dataset. The repeated sample splitting in step 4 helps to achieve much more reliable results which do not depend in a sensitive way on how we split the sample (Meinshausen et al., 2009, cf.). More details about the assumptions which guarantee control of the familywise error rate are provided in Supplementary Material Section S2.
This multi-sample splitting method is computationally fast since Lasso in step 2 is rather cheap to perform and step 3 requires classical P-value computations in low-dimensional models with fewer than n regressors only. In terms of accuracy for type I error control, i.e. avoiding false rejections of hypotheses, the multi-sample splitting approach has been found very reliable in extensive simulations relative to other methods. This reliability comes at the price of being slightly inferior in terms of power to detect true underlying positive findings (Dezeure et al., 2015), see also Mandozzi and Bühlmann (2015). However, this slightly more conservative scheme has the advantage of limiting false positives.
We used the WTCCC Crohn’s disease genotype data to create semi-synthetic datasets. To generate the new genotype matrix, we kept all the samples (n=4682), but selected a block of 500 consecutive SNPs from each of the 22 autosomal chromosomes, having in total 11 000 SNPs. The phenotype data was generated from a logistic regression model with the probability of having the disease as the dependent variable and 10 causal SNPs as the independent variables. We considered three designs for choosing the causal SNPs:
To ensure that the number of cases and controls are not too different, we required the ratio between cases and controls to be within the interval . We kept the genotype matrix constant, and generated 100 simulation runs for each design, using new coefficients for every simulation run.
We chose to compare our method to three other algorithms. One is the classic bivariate testing, implemented in PLINK (Purcell et al., 2007). The other two are mixed model approaches: the FaST-LMM (Lippert et al., 2011) and the GCTA (Yang et al., 2011) algorithms. Both calculate a genetic relationship matrix (GRM) to control for the effect of the other SNPs. There are many ways of computing the GRM. One can use all the SNPs, or just a particular subset. An approach that is computationally efficient is leave-one-chromosome-out (LOCO) (Yang et al., 2014). With this option when one tests the SNPs in a particular chromosome, the SNPs from all the other chromosomes besides the one being tested are used to compute the GRM. The main difference between mixed models and our method is the way in which the effects of other SNPs are modeled. While the mixed model uses a random component to account for all the other SNPs, our method considers each SNP as a fixed effect, and includes all of them in the model.
Our goal was to assess how good these methods are at detecting the causal variants (which are known for simulated data), while limiting the number of false positives: SNPs that are not truly causal (but perhaps correlated with the causal variants). Assessing the performance of the methods was done by considering several criteria. The first is the FWER, which we expect to be controlled at level α. This is equivalent to expecting 100 false discoveries, when performing 100 simulations. As a less conservative criterion, we also consider the k-FWER, a generalized version of the FWER. The k-FWER is defined as , where V is the total number of false rejections. For our simulations, we are interested in the value of k, under which the k-FWER is controlled at level . In the case of our method, a rejection is considered false only if the cluster does not contain any of the true causal SNPs. The third assessment criteria is the power of the method. Also here we consider two variants. The first is a ‘naive’ version, which considers all findings where the true causal SNP is present in a cluster, irrespective of the cluster size. The second metric penalizes the size of the group with respect to the causal variants. This is computed in the following way:
where Gsign is the set of groups declared significant by our method, and S0 is the set of true causal SNPs. For the three comparison methods, we declare significant SNPs that have a P-value below , and compute the power, FWER and k-FWER using this set.
The results shown in Table 1 are in line with our expectations. In the first 2 designs, our method has a lower power compared to the other three methods that behave almost identically. The cost of a larger power is however a significant increase in the number of false positives. While our method fails to control the FWER due to the very complex correlation structure in the data, it does however control the 2-FWER at level α. This means that the probability of making more than 2 false rejections is below α. In comparison, the other three methods do on average at least one order of magnitude more false rejections. This has to do with the fact that they infer marginal associations, or associations which are partially adjusted by random effects: many significant findings are spurious because of high correlation between some of the SNPs. The lower power of our hierGWAS procedure can be explained by the fact that it aims to infer associations which are adjusted for other SNPs: our method would detect some of them individually, some of them as groups, and if the correlation among some SNPs is too strong (or the signal too weak), it would miss some. This becomes apparent also when we consider the two power measures: POWadaptive is always smaller than POW, because some of the causal SNPs will be grouped into clusters, due to their correlation structure. In the case of the third design, our method has the same power as the other three to detect the causal SNPs. If we assume that this is close in spirit to the real life case of having one causal SNP surrounded by many others that are in LD with the causal one, our method has the power to detect it just as well as the marginal methods, while providing a greatly reduced set of false positives, and a much stronger interpretation of the findings. While the FWER is the highest in the third design, due to the fact that there are many more confounders around each SNP compared to the other two designs, our method still performs much better, by controlling the 3-FWER at level α. We note that GCTA and FaST-LMM have a slight disadvantage regarding the control of false positives, because we used the LOCO approach to compute the GRM. However, since this is an established approach, and the strategy of eliminating only the SNP being tested, and using all other SNPs to construct the GRM is computationally infeasible (Yang et al., 2014), we believe that this situation reflects reality.
We validate our method on data from The Wellcome Trust Case Control Consortium (2007). The Wellcome Trust Case Control Consortium study used 3000 subjects and 2000 shared controls from the British population to examine 7 major diseases: bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D) and type 2 diabetes (T2D). The subjects were genotyped using the Affymetrix GeneChip 500K Mapping Array Set. Though The Wellcome Trust Case Control Consortium (2007) reported all SNPs with a P-value , the threshold for strong association was set to . Using the standard marginal analysis, the WTCCC study identified 21 new SNPs strongly associated to the phenotype. For BD rs420259 on chromosome 16, for CAD rs1333049 on chromosome 9, for CD, the WTCCC study identified 9 SNPs strongly associated to the phenotype: rs11805303 on chromosome 1, rs10210302 on chromosome 2, rs9858542 on chromosome 3, rs17234657 and rs1000113 on chromosome 5, rs10761659 and rs10883365 on chromosome 10, rs17221417 on chromosome 16 and finally rs2542151 on chromosome 18. 2 SNPs were found for RA: rs6679677 on chromosome 1 and rs6457617 on chromosome 6. T1D was strongly associated to 5 SNPs: rs6679677 on chromosome 1, rs9272346 on chromosome 6, rs11171739 and rs17696736 on chromosome 12 and rs12708716 on chromosome 16. Finally, for T2D 3 associations were found: rs9465871 on chromosome 6, rs4506565 on chromosome 10 and rs9939609 on chromosome 16. For HT, the WTCCC did not find any SNP strongly associated to the phenotype.
Before applying our analysis, we have preprocessed the data, by excluding some SNPs and samples, as well as imputing the missing SNPs. Details about this procedure are given in the Supplementary Material Section S4.1.
The output of our method is a list of SNP groups of different sizes. These represent the smallest jointly significant groups in the hierarchical tree of SNPs. We create a distinction between small (<10 SNPs) and large groups, and present the corresponding results separately. The number 10 is somewhat arbitrary and determined by notational simplicity to list at most 10 SNPs per group. We identified small groups for 5 of the 7 diseases, and present them below. Large groups have been identified for all of the diseases, however we chose to present in detail the results for BD only. We chose BD because it is the disease for which the WTCCC found a single strongly associated SNP, which we did not identify using our method. The large groups for the other six diseases are detailed in the Supplementary Material Section S4. It is important to note that these groups are not overlapping. For example, if in a specific chromosome we find a small group of four SNPs, as well as two large groups both containing thousands of SNPs, these three groups do not share common SNPs and they belong to different regions of the chromosome. This happens because our method finds the smallest group of SNPs for which the null hypothesis can be rejected. Such a result means that one region of the chromosome exhibits a strong signal, while there are other regions exhibiting weaker signal. Thus, the size of the group reflects the strength of associations: the weaker these associations, the larger the significant groups.
Tables 2 reports on individual SNPs or small clusters of SNPs selected by our method for the seven diseases we analyzed. We found a total of 20 such clusters, out of which 16 are individual SNPs.
Twelve out of the 20 clusters contain at least one SNP that was found to be strongly associated to the phenotype in the original WTCCC study. The remaining eight clusters contain SNPs that are either in LD with the ones identified by the WTCCC, belong to the same gene or genomic region, or have been identified as having a significant effect in other studies. While it is informative to see if our findings have been previously reported in other studies, it is important to remember the distinction in terms of interpretation. Our method makes the significance of previous findings much stronger, because it does not simply compute the marginal correlation, but it instead tests whether the effect of a SNP or a group is still significant after we have taken into account the effect of all other SNPs. In the case of the small clusters, it restricts the confounders to a reduced number of SNPs that are either introns in the same gene, or in close proximity to each other. Besides these small significant groups, our method also identified larger groups. Again, it is important to keep in mind that these larger groups are not overlapping with the smaller ones, and they are in other regions of the chromosome. These clusters contain many of the SNPs that were identified to have moderate associations in the original study. Because of their size, they are given lower weight in terms of power, however, they reflect the assumption that these diseases are highly polygenic, and associations appear in many places throughout the genome. In the following we will describe in more detail our findings for each disease.
We replicated rs1333049, an intergenic SNP on chromosome 9, the only finding from the WTCCC study. Our result however has a much stronger interpretation compared to the original finding, because we control for all possible confounders. Thus, rs1333049 shows an association with the phenotype, even after taking into account the effects of all other SNPs.
On chromosome 1 we identified a small significant cluster of five SNPs: rs11805303, rs2201841, rs11209033, rs12141431 and rs12119179. Two of them: rs11805303 and rs2201841 are introns in the IL23R gene, while the last three SNPs are up to 22-kb downstream from IL23R. Though rs11805303 showed strong association in the WTCCC study (The Wellcome Trust Case Control Consortium, 2007), our result has a different and much stronger interpretation. Our finding is a group of five SNPs that are jointly significant, though none of them is significant individually. Because this significance results from a joint model of all SNPs, it means that our group is jointly significant while controlling for all other SNPs in the study. The interpretation of this finding is that we limit the set of confounding SNPs to four other SNPs. When a SNP is declared significant by computing the marginal correlation, like in The Wellcome Trust Case Control Consortium (2007), the set of possible confounders that produce this correlation is the set of all other SNPs. Thus, if the correlation turns out to be a spurious one, there could be hundreds of other SNPs that produce it. In contrast, our method not only drastically reduces the number of confounders, but gives a small set of much more plausible ones, that are in a narrow region of the chromosome, often clustered around a gene. On chromosome 2 we identified an individual SNP, rs10210302, which showed strong association in the WTCCC paper (The Wellcome Trust Case Control Consortium, 2007). This SNP has by far the lowest P-value () in CD, and it also explains a relatively large proportion of the variance attributed to chromosome 2: 0.014 compared to 0.05 explained by all the selected SNPs in the chromosome. On chromosome 5 we identified a group of 9 SNPs: rs6871834, rs4957295, rs11957215, rs10213846, rs4957297, rs4957300, rs9292777, rs10512734 and rs16869934. They are all intergenic, 85kb apart and located in the 40.4M region of the chromosome. rs16869934 is 4 kb downstream from the SNP rs17234657 showing strong association to CD in The Wellcome Trust Case Control Consortium (2007). On chromosome 10 we found 2 significant SNPs. The first is rs10883371, a 2-kb upstream variant both for LINC01475 and NKX2-3. rs10883365, found to be strongly associated to CD in the WTCCC study is a 2-kb upstream variant in LINC01475. Our second finding on chromosome 10 is rs10761659, a non-coding intergenic SNP mapping 14-kb telomeric to gene ZNF365 and was identified first by the WTCCC (The Wellcome Trust Case Control Consortium, 2007), followed by a meta-analysis (Franke et al., 2010) and later a study of a southern european population by Julia et al. (2013). On chromosome 16, we found an individual SNP, rs2076756, which is an intron in NOD2. Interestingly, this SNP was not found to be significant in the original WTCCC study, while our approach shows that it is even significant when we control for all other SNPs. This SNP has been confirmed by several studies (Franke et al., 2010; Julia et al., 2013; Kenny et al., 2012; Rioux et al., 2007). Finally, on chromosome 18, we identified a single intergenic SNP: rs2542151. This finding was reported by The Wellcome Trust Case Control Consortium (2007), as well as by Parkes et al. (2007).
We identified two SNPs, both individually significant. The first, rs6679677 is located on chromosome 1 and is a 2-kb upstream variant in the PHTF1 gene. This finding was reported by The Wellcome Trust Case Control Consortium (2007). The second SNP, rs9272346, is located on chromosome 6 and is also a 2-kb upstream variant in the HLA-DQA1 gene. This SNP belongs to the MHC region, just like the WTCCC finding.
Eight individual SNPs were declared significant by our method. Five of these are the five associations found in The Wellcome Trust Case Control Consortium (2007). These are: rs6679677, a 2-kb upstream variant in the PHTF1 gene on chromosome 1, rs9272346, a 2-kb upstream variant in the HLA-DQA1 gene on chromosome 6, rs11171739, an intergenic SNP on chromosome 12, rs17696736, an intron in the NAA25 gene on chromosome 12 and rs12924729, an intron in the CLEC16A gene on chromosome 16. Additionally, our method identified 3 new associations. Two of them are located on chromosome 6: rs9272723 is an intron in the HLA-DQA1 gene and rs2523691 is intergenic. The third new finding, rs17388568, is located on chromosome 4 and is an intron in the ADAD1 gene. It did not reach the genome wide significance threshold in the WTCCC study, however it showed moderate association with a P-value of . It also showed moderate association in an independent study by Plagnol et al. (2011).
We identified two small SNP clusters, one on chromosome 10 and the other on chromosome 16. The first cluster contains 10 SNPs: rs4074720, rs10787472, rs7077039, rs11196208, rs11196205, rs10885409, rs12243326, rs4132670, rs7901695, rs4506565, all introns in the TCF7L2 gene, spanning a 62 kb region. One of them, rs4506565, was originally identified by The Wellcome Trust Case Control Consortium (2007), while rs7901695 showed a significant association in a replication study by Zeggini et al. (2007). The second cluster is comprised of four SNPs: rs9926289, rs7193144, rs8050136, rs9939609, all introns in the FTO gene spanning 10 kb. rs9939609 was significantly associated to the phenotype in The Wellcome Trust Case Control Consortium (2007). Additionally, rs8050136 was found to have strong significance in Zeggini et al. (2007) and Scott et al. (2007).
For BD, the WTCCC identified only one SNP strongly associated to the phenotype: rs420259. While we did not identify it in a small group, it is present in the large group found to be significant on chromosome 16. Furthermore, as can be seen in Table 3, we found clusters in many of the chromosomes. Table 3 shows the group size, both in terms of number of SNPs, as well as in terms of percentage of the total SNPs in that particular chromosome. Additionally, we investigate whether the SNPs identified using the standard analysis with PLINK (Purcell et al., 2007) map into our groups. The size of the group is in a way inversely proportional to the strength of associations. If a certain chromosome contains SNPs with large effects, we will be able to find them in very small clusters, or maybe even individually. If however the signal is weak, we can only identify larger regions. For example, on chromosomes 4, 6, 7, 8, 10 and 15 the signal is so weak that we can only report that the joint effect of all SNPs in these chromosomes is significant, but we cannot further localize the signal. On the other hand, on chromosome 3 we were able to identify a much smaller group containing only 6% of the SNPs.
Our method returns the smallest number of SNPs for which we can find a significant effect, while controlling for all other SNPs. The fact that we cannot disaggregate the signal to small clusters, or single SNPs does not mean that genetics plays no role in BD, but rather that the signal is very dispersed and the effect sizes are very small. This explains why our groups are so large, and why we cannot attribute the signal to narrower regions. Figure 3 shows the variance in bipolar disorder explained by the SNPs on individual chromosomes. We only consider the SNPs selected by the Lasso, in step 2 of Section 2.4.2, as these SNPs are a proxy for the truly relevant SNPs. The total variance explained by all the SNPs is 0.5, and Figure 3 describes how this variation is distributed across the chromosomes. The fitted line corresponds to a linear model, where the predictor is the chromosome length, and the response is the explained variance. The plot gives weight to our previous statement, as the variance is homogeneously distributed across the chromosomes. The plot shows an excellent fit (), meaning that the length of a chromosome is a very good predictor for the amount of variance that a particular chromosome explains.
Hypertension was the only disease where The Wellcome Trust Case Control Consortium (2007) did not find any strongly associated SNPs. We also haven’t found small clusters or individual SNPs, but we did find larger clusters on 13 of the chromosomes. The results are shown in the Supplementary Material Section S4.7.
We have presented a new method for assigning statistical significance in GWAS. Our approach goes beyond the bivariate testing of individual SNPs that looks only at marginal associations. Instead, we use a multivariable approach which includes all the SNPs and controls the familywise error rate. We propose to assign P-values in a hierarchical manner: first for chromosomes, and then in a top-down fashion from larger to smaller groups of SNPs. Such an approach addresses several issues. First, since regression parameters of individual SNPs are typically very small, due to their interpretation and meaning in the model, it is much more likely to detect significant groups of SNPs. Second, because we proceed hierarchically, the problem of multiple testing is much less severe than for the classical one-SNP-at-a-time approach: roughly speaking, one has to adjust only for the number of tests which are considered, and this number is typically much smaller than the entire number of SNPs in the study. Our method is data-driven in the sense that its resolution for the groups of SNPs depends on the strength of the signal present in the data: how much we proceed in the hierarchy and refine the clusters of SNPs depends on how strong the associations are. If the signal is strong and well-localized, we find small clusters or individual SNPs, whereas if the signal is weak, we identify larger regions.
We demonstrate our method on the WTCCC data (The Wellcome Trust Case Control Consortium, 2007), where we analyze the seven diseases. Though it is interesting to conceptually validate our findings by comparing them with a measure of marginal association, our method is different and allows for a more powerful interpretation of the findings than testing only marginal association between a SNP and the phenotype. This is because we test whether or not SNPs in a cluster carry any additional information about the phenotype, beyond that available through all the other SNPs. That is, we adjust for the effect of all other SNPs that are not part of this cluster, which translates to a very strong interpretation of the significant clusters. This can be related to causal statements when making additional assumptions (see last paragraph in Section 2.1). Due to the fact that we control for all other SNPs, often we can reduce the number of possible confounders from hundreds or thousands of SNPs to less than 10. Moreover, our possible confounders are desirable candidates, as they are usually part of the same functional unit. This is a favorable outcome because in most cases it is unclear which is the causal SNP, and in many contexts the gene might be the more meaningful biological unit. Even for phenotypes with weaker and more dispersed signal, such as BD and HT, we could still identify larger regions. While these clusters might be too large to identify specific genes, we can still gain insights into the joint influence of all selected SNPs, or the distribution of the variance across the chromosomes. This case is the one which motivated our approach. For distant, non-disease related phenotypes it is perhaps more useful to identify the chromosomes, or the regions that drive the signal, and their contribution to the total explained variance. In such cases identifying single SNPs is most likely impossible, and due to their low predictive power, not very useful.
It is difficult to directly compare our results to other marginal methods because we assign significance with respect to a generalized multiple regression parameter, and not only for individual but also for groups of SNPs. Nevertheless, we performed a small simulation study in which we compared the results for our method to the standard marginal approach, as well as two mixed model algorithms. The findings were in line with our expectations: while our method had slightly reduced power in two of the settings, it compensated by producing a significantly reduced number of false positive selections. In the third design our method had the same power as the mixed model and marginal testing approaches, while still having a superior control of the false positives.
One direction for improving the method would be to change the clustering, which could be performed through the use of more in-depth biological knowledge. For instance, if we would cluster the SNPs into genes, and then into pathways, even for a weak signal we would potentially identify larger pathways, which would be useful in terms of biological meaning.
This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.
E.F. and L.B. gratefully acknowledge financial support from the European Research Council (grant 295642, The Foundations of Economic Preferences, FEP). D.S. gratefully acknowledges financial support from the German National Science Foundation (DFG, grant SCHU 2828/2-1, Inference statistical methods for behavioral genetics and neuroeconomics). A.N. gratefully acknowledges support from the Instituto de Salud Carlos III (grants RD12/0032/0011 and PT13/0001/0026) and the Spanish Government Grant (BFU2012-38236) and from FEDER.
Conflict of Interest: none declared.