|Home | About | Journals | Submit | Contact Us | Français|
Genome-wide association studies (GWAS) have had limited success when applied to complex diseases. Analyzing SNPs individually requires several large studies to integrate the often divergent results. In the presence of epistasis, multivariate approaches based on the linear model (including stepwise logistic regression) often have low sensitivity and generate an abundance of artifacts.
Recent advances in distributed and parallel processing spurred methodological advances in nonparametric statistics. U-statistics for structured multivariate data (μStat) are not confounded by unrealistic assumptions (e.g., linearity, independence).
By incorporating knowledge about relationships between SNPs, μGWAS (GWAS based on μStat) can identify clusters of genes around biologically relevant pathways and pinpoint functionally relevant regions within these genes.
With this computational biostatistics approach increasing power and guarding against artifacts, personalized medicine and comparative effectiveness will advance while subgroup analyses of Phase III trials can now suggest risk factors for ad verse events and novel directions for drug development.
Almost a decade after the completion of the Human Genome Project , the scientific and medical advances hoped for from genome-wide association studies (GWAS) have not yet been realized. After early successes with diseases where a single haplotype confers all or most risk , the same statistical approaches have often produced ambiguous results when applied to complex diseases [3,4]. Increasing the sample size (to tens of thousands of subjects as suggested ) is impractical for rare disease forms, and also greatly increases the duration and cost of data collection. Improving accrual by broadening the inclusion criteria increases variance and thus requires yet larger samples; a vicious cycle. Moreover, increasing sample size in a nonrandomized study may, somewhat paradoxically, increase the risk of false positives [6,7].
Several mutations within a gene may contribute to the risk of common diseases and several SNPs may have become associated with the same mutation over time. One risk factor’s contribution may depend on the presence of others and sets of mutations may confer more risk if they affect both chromosomes (compound heterozygosity). Hence, any statistical approach based on p-values derived one SNP at a time (ssGWAS) is ill-suited to identify the short-range epistasis involved  (following Fisher , the term ‘epistasis’ will be used for any deviation from independence, be it between neighboring SNPs, intragenic regions or genes). Analyzing diplotypes (sets of neighboring SNPs with unknown phase) comprehensively would be preferable , yet traditional multivariate methods  including linear/logistic regression (lr) assume independence and additivity/multiplicativity of risk factors to yield computationally simple algorithms. Making unrealistic assumptions, such as linearity, may easily lead to meaningful nonlinear relationships being overlooked (false negatives). More importantly, random errors, not subject to biological constraints, may occasionally fulfill these assumptions, so that the most ‘significant’ results are often false positives.
Association studies, in general, are exploratory ‘selection procedures’  to generate rather than confirm hypotheses. Even though the same algorithms are used as in confirmatory tests, ‘p-values’ merely serve to sort candidates, so that a sufficiently large selection of candidate genes will include the most interesting genes with high power. Even minor differences in the composition of the study population can result in different subsets of genes being selected , and each could help with understanding a different aspect of the disease etiology when confirmed using mouse studies or clinical trials. Hence, the challenge in improving GWAS is to reduce artifacts caused by applying oversimplifying approaches to complex diseases (analyzing one SNP at a time, assuming independence and additivity of effects) while incorporating more knowledge to increase the sensitivity for detecting biologically relevant subsets of the genes involved.
With the advent of mainframe computers, more complex calculations (e.g., factor analysis) became feasible. More recently, personal computers triggered the development of resampling methods. Now we are, again, entering an era of advances in computational biostatistics, where massively parallel computing has spurred the methodological advances making wide-locus GWAS based on a nonparametric approach (μGWAS, based on u-statistics for structured multivariate data) feasible . Below, we introduce two novel concepts.
First, several ‘tag’ sets of ‘genetically indistinguishable’ SNPs  are typically scattered across a linkage disequilibrium (LD) block, yet traditional methods cannot differentiate between ‘permuted’ diplotypes containing members of the same tag sets in different order. μGWAS draw on the spatial structure of SNPs within a diplotype and expected LD from HapMap  to improve the resolution of GWAS to intragenic regions. Second, we apply the concept of ‘information content of multivariate data’ (μIC)  at several stages of the ana lysis to guard against artifacts. With these methodological advances, disease-relevant genes and intragenic regions can now be suggested from a single study, often of only a few hundred narrowly defined cases, rather than from a variety of large studies, turning GWAS from a technique to identify isolated SNPs into a powerful tool to generate plausible and testable hypotheses about the etiology of complex diseases.
It is often reasonable to assume that risk conferred by a heterozygous SNP lies somewhere between baseline risk and a homozygous SNP (having two risk alleles) that is, between the risk of a recessive and dominant allele, respectively. U-statistics (including the Wilcoxon/Mann–Whitney U test ) treat SNPs as ordinal (wild-type = xx < xX < XX = homozygous), but do not require the degree of dominance to be known. Treating diplotypes as multivariate data then avoids the need for assumptions about independence and relative importance of the SNPs , yet the theory was never broadly developed owing to prohibitively high computational demand . With GWAS, for instance, the number of ‘polarities’ (combinations of −1 = bad; 0 = irrelevant; +1 = good) increases exponentially with diplotype length, yet with massively parallel computing we were now able to include diplotypes up to length six.
Traditionally, one would have more confidence in a ‘significant’ locus if neighboring loci also show association  and add recombination information to the data displayed. Here we integrate the concepts behind this intuitive visual inspection into the statistical approach itself. Recently, U-scores for multivariate data (μ-scores) have been extended to reflect structures among variables with applications including sports , policy-making  and medicine . The proposed GWAS-specific structure is based on the notion that neighboring disease loci may have similar effects and that a disease locus may be in LD with both adjacent SNPs, unless the SNPs are separated by a recombination hotspot (boundary between LD blocks; Figure 1).
μGWAS starts with computing matrices representing the partial order of each SNP, combining pairs of these matrices into matrices representing the intervals and, finally, combining SNP and interval matrices into a diplotype matrix from which the μ-scores are computed [14,22]. As diplotype profiles are built from intervals around and between neighboring SNPs, diplotypes where members Xi, Yi and Zi of the tag sets (X), (Y) and (Z) appear in different order (permuted diplotypes), such as (X1, Y1 and Z1) versus (Y2, X2 and Z2) can be distinguished. This novel approach to incorporate knowledge of neighborhood relationships between SNPs increases power over merely combining all SNPs within a diplotype in a single step , yet avoids the need for assumptions about dependencies and relative importance required when using linear combinations (weighted sums) of univariate scores. With GWAS based on lr (lrGWAS), one could work towards a similar goal by adding sequential interaction terms. Hence, we will compare μGWAS not only with ssGWAS for dominant, linear trend , and recessive effects, but also with stepwise logistic regression with and without sequential interaction terms.
Childhood absence epilepsy (CAE) , formerly known as ‘petit mal’, is characterized by frequent, short episodes of ‘daydreaming’. Through trial and error of different combinations of valproic acid and various ion channel blockers, these absences can be controlled in approximately 75% of affected children . For adult patients, etiracetam, an IL-1β inhibitor  was approved in November 1999, and a caspase 1 inhibitor (VRT-765) is under going a controlled Phase IIb study . CAE does not follow a simple Mendelian pattern of inheritance, although recurrence of epilepsy in families is high. A high concurrence in monozygotic twins and the absence of known exogenic factors make CAE an ideal model for studying the genetics of complex diseases and approaches to unravel their genetic risk factors to better match patients to existing drugs and identify new drug targets for patients who do not respond to existing drugs.
The 185 CAE patients in this study were predominantly Caucasian (83%) and white Hispanic (10%) with the well-known female preponderance (115 female vs 70 male patients). Average age of onset for absence seizures was 5.7 years. Patients were required to be seizure free on antiepileptic medication. Controls were selected from a publicly available database ; see Supplementary Material at www.futuremedicine.com/doi/suppl/10.2217/pgs.13.28 for details.
As is typical for ssGWAS, especially with small samples, only two SNPs reached the customary s = −log10(p) > 7.5 level of significance with univariate tests (Figure 2, black foreground), one in a noncoding region (chromosome 1, lr only), the other in the pseudogene EE1A1P12 (chromosome 2).
Since ssGWAS was inconclusive and sequential interaction terms created an abundance of likely false positives with lrGWAS (Supplementary Figure 1), even with regularization (AIC ), the following discussion focuses on μGWAS versus traditional lrGWAS. In the spirit of conducting a selection procedure [12,28], rather than confirmatory tests, p-values were used solely for the purpose of ranking the loci and at any given level, lrGWAS had more ‘significant’ results in general, including many likely false positives. Hence, methods were compared using similar arbitrary numbers of top regions (first comparison used only the top 6, second comparison used ~20 and third comparison used ~40; see Supplementary Table 1), the latter cutoffs adjusted for display purposes (Figure 2) to match commonly used s-values (μ: 7.5/7.0; lr: 8.0/7.5).
Only one of the top six genes in lrGWAS (RBFOX1) ranks higher than rμ = 73rd in μGWAS (5th), while the other four among the top six regions in μGWAS are also among the top 22 in lrGWAS (the above elongtion factor pseudo gene EEF1A1P12; synapsin III, SYN3; FAT4; and CREB5; Supplementary Table 1). Of the top 17 μGWAS regions (s > 7.5), 14 (82%) are known to be in genes directly related to the NOD/ID–axonal guidance signaling/ataxin pathway (Figure 3), including PANX1, SEC16B, the Rho GTPase activating proteins OPHN1/ARHGAP41 and RICS/ARHGAP32, ABCC8, the potassium channel KCNJ5, BRE, NLRP3 and RASSF8, compared with only eight genes (36%) of the top 22 (s > 8), including KCNB2, DOK6 and MYO16, or 16 (40%) of the top 40 lrGWAS (s > 7.5) regions.
Epilepsy is commonly seen as a channelopathy, and lrGWAS identify both postsynaptic (KCNB2, rlr = 3rd; DOK6, rlr = 10th) and presynaptic (MYO16, 13th) membrane processes. μGWAS adds KCNJ15 (rμ = 14th), confirms CNTNAP2  and CNTNAP4 (27th, and 48th, respectively), and hints at two targets for approved antiepileptic drugs, the ion channels SCN4A and GABRB3 (43rd and 57th, respectively) . Both methods implicate SYN3 (rμ/rlr = 3rd/22nd), a presynaptic vesicle-associated protein . μGWAS adds OPHN1 and ABCC8 (8th and 12th, respectively).
Two approved antiepileptic drugs, topiramate and levetiracetam, and the investigational drug VRT-765 target the NOD-like receptor signaling pathway . While both approaches suggest genetic variations in PANX1 (rμ/rlr = 13th/16th), μGWAS adds the TNFRSF1A modulator BRE (15th) as involved and NLRP3 as a risk factor (16th). Hence, VRT-765 might be particularly effective for patients with a ‘gain-of-function’ mutation in NLRP3.
RHOA was upregulated in patients with intractable epilepsy , yet the mechanism involved is unknown. Two genes known to regulate RHOA, OPHN1 (also known as ARHGAP41) and ARHGAP32 are among the top ten genes with μGWAS, but rank only 99th and 58th, respectively, in lrGWAS. The risk of epilepsy is increased in children with intellectual disability (ID), where ARHGAP32 has been implicated. Binding between ARHGAP32 and ATXN1 has been implicated in inherited ataxias . OPHN1 is known to affect X-linked ID  and thus might explain the preponderance of CAE among girls. μGWAS adds a pair of binding partners downstream of RAC1 to the picture, RASSF8 (17th) and PARD3 (26th). Finally, the ‘pseudogene’ EEF1A1P12, being among the top ten regions in both approaches, hints at an involvement of EEF1A1, which regulates CDC42. Hence, μGWAS uniquely provides a testable hypothesis about the mechanism by which RHOA is upregulated in some forms of epilepsy.
Ataxias and epilepsy share genetic risk factors [36,37], including OPHN1 [38,39], and both methods implicate two genes binding ataxins, RBFOX1 (rμ = rlr = 5th) and FAT4 (rμ/rlr = 6th/17th). μGWAS also hints at the calcium transporter ATPB2 (39th) and the calcium channel ITPR1 (42nd) as potential drug targets.
The effectiveness of valproic acid in treating epilepsies hints at a role of nucleosome assembly in epilepsies and, in fact, μGWAS implicates mutations in CREB5 and SEC16B (4th and 10th, respectively).
Among the genes involved in cytoskeleton dynamics, ARHGAP32, with known direct interactions with many of the key players, ranked 11th in μGWAS, but only 58th in lrGWAS. Moreover, it had two separate ‘peaks’ in μGWAS, one in the promoter region.
The most significant SNPs in ARHGAP32 by ssGWAS (s = 4.3–4.7) are all members of tag set a (Figure 4e). The two μGWAS peaks, separated by a clear trough (Figure 4C), pinpoint two loci where the effects of different haplotypes converge, centered within 4 kb of exon 10 and the promoter region (exon 0), respectively. Both regions contain a set c SNP as a distant member (≈20 kb), indicating a common ‘background’ risk factor, and two members of region specific tag sets (exon 10: a/b, exon 0: g/h). Both regions belong to a recently identified alternative splice variant, which is expressed during neural development and involved in axon and dendrite extension [40,41]. lrGWAS results are also elevated, yet without discriminating intragenic regions (Figure 4D, insert).
No case or control subject had more than four risk alleles among the three relevant SNPs in either region, although homozygous variants for each SNP are present. Hence, the unobserved combinations must have been selected against, for example, because of a more severe phenotype.
As approximately one-third of all subjects with the ARHGAP32 genetic risk factor lack the phenotype, other genetic cofactors are yet to be identified. Figure 3 suggests the possibility of epistasis in trans between regulatory and functional factors; that is, between the plasma membrane (NGF/NRG–RAS) and the cytoplasm (RAC–RHOA/CDC42).
In this analysis, we have reduced the potential for false-positive results by taking advantage of the novel internal validation features made possible with μGWAS. During data preparation, we used a data quality μ-score based on a comprehensive assessment of missing data, Hardy–Weinberg equilibrium, short-range LD, and expected LD from HapMap information. During ana lysis, we have drawn on polarity conflict and lower than expected μIC (Supplementary Figure 2). Finally, we utilized μIC to indicate highly significant results with low μIC. Notably, none of the pathway related genes flagged as potentially unreliable are related to the genes downstream of RAC1 (Figure 2).
Larger genes are both more likely to carry mutations and to have false positives. Still, although several of the genes identified are among the largest 5% (>200 kb) in the human genome, only two of the top 11 unique genes in μGWAS (CREB5 and BRE) and three of the top 13 unique genes in lrGWAS (DYSF, DOK6 and TMCO7) are ‘direct hits’ within the coding region (Supplementary Table 1). ARHGAP32 and OPHN1 were implicated by ‘hits’ in the stop or promoter regions, respectively, and thus are not at an increased risk for being false positives owing to their size.
The results on ARHGAP32 (Figure 4) are supported by further evidence. First, each of the six SNPs included in the two diplotypes is in high LD with several other SNPs (Figure 4e), for which the probe sequences differ and thus are not subject to the same calling errors. Second, only the two pairs of diplotypes having the highest association with disease risk by μ-scores were in high LD between the intragenic regions (Figure 4C, horizontal dashed arrows), while lower risk diplotypes were unrelated. Not only is it highly unlikely for each of these results to occur by chance alone, it is virtually impossible that they could occur together, and in both independent populations. While this cannot rule out a false-positive result due to association with factors beyond the etiology of epilepsy, these findings validate the ability of μGWAS to detect intragenic regions of biologically relevant epistatic patterns. Finally, the diplotype with the highest overall (exon 10 and promoter region [E,P]) score μ(E,P) is clearly over-represented among cases, with a prevalence of 14.1% (26 out of 185) and 6.5% (23 out of 354) in cases and controls, respectively, compared with 3.8% (7 out of 185) and 6.2% (22 out of 354) for the diplotypes with the lowest μ-scores, confirming that μ-scores are, in fact, reflecting disease risk.
As one would expect, μ and lr scores (Figure 4, right border) are correlated. The subjects with the pair of diplotypes having the highest μ(E,P)-scores (Figure 4D) also share a diplotype with a high lr-score (Figure 4e), but the subjects scoring even higher in lr-scores comprise four different diplotypes. Interestingly, the largest of these groups differs only in the first SNP from a diplotype with low lr- and μ-scores (vertical arrows in Figure 4D), consistent with the sensitivity of linear model results to outliers. As the partial ordering underlying μ-scores, which directly reflects an underlying functional model, results in more genetic uniformity among subjects with extreme scores, these more homogeneous subpopulations could then be selected for identification of functional variations through sequencing.
With GWAS of complex diseases, only a few solitary SNPs typically stand out from the noise, especially in small studies, and this study is no exception. Different compositions of rare disease variants across studies almost inevitably result in different SNPs being ‘significant’, so that validation in independent ssGWAS requires many large studies until a testable hypothesis emerges. μGWAS, by contrast, related approved and experimental drugs to functional clusters of genes along a known pathway in a study of 185 well-characterized cases only.
ssGWAS can efficiently screen for loci, where a single haplotype confers all or most of the risk (EEF1A1P12). lrGWAS has advantages when the effects of SNPs are at least approximately independent and additive (as they might be in some transporters and ion channels). With more complex processes, however, like the interactions of ARHGAP32 with its various binding and activation partners, not constraining results by making overly simplistic assumptions leads to biologically relevant hypotheses about functionally related genes clustered around biologically relevant pathways.
Pathway-based approaches  and gene set enrichment analyses  combine results of univariate statistics using assumptions regarding the relative importance of genes and prior declarations of relatedness among genes instead of observed interactions. How ever, this ana lysis suggests that few, if any, pathway genes themselves may carry mutations in common diseases, unless they are members of redundant complexes (NLRP3, SYN3 and PARD3; Figure 3), in which case multiple genes may need to be knocked out to produce a phenotype .
Wide-locus GWAS aim at accounting for compound heterozygosity, different haplotypes carrying the same mutation and epistasis between nearby disease loci. Hence, functional regions can be identified more easily, even when the contribution of individual SNPs would be difficult – if not impossible – to detect. Many traditional statistical methods, however, have deficiencies for relevant types of epistasis. ARHGAP32, which ranked 10th among μGWAS genes and was validated through the distinct epistatic pattern among the highest-risk allelotypes confirmed in sequencing (Figure 4), did not even appear among the top 50 lrGWAS regions.
μGWAS requires neither Hardy–Weinberg equilibrium nor independence or additivity/multiplicativity of genetic effects, thereby improving sensitivity for nonlinear effects (including evolutionary selection; Figure 4, horizontal dashed arrows and Supplementary Table 1). Adding sequential interactions and recombination hotspots improves resolution, rather than creating artifacts. Together with OPHN1 (also unique to μGWAS at rank 8), this study provides a plausible hypothesis why expression of RHOA is upregulated in some forms of epilepsy .
Increased expression of RHOA was recently associated with some epilepsies . Both OPHN1 and ARHGAP32 interact with both RHOA and PI3K (Figure 3), a drug target currently being investigated in cancer  and inflammatory diseases . Wortmannin, an inhibitor of PI3K, attenuates the effects of seizures in rats  and PX-866 (a oral drug derivative of wortmannin in a Phase II prostate cancer trial ), targets PI3K. If our results are confirmed and hold for patients with other epilepsies as well, this might lead to novel therapeutic approaches to treat patients whose seizures do not respond to drugs targeting ion channels, the inflammasome or the nucleosome. As this study included only patients whose seizures were controlled by valproic acid and/or ion channel blockers, these genes may play an even larger role in other populations.
A particular advantage of μGWAS is the ability to guide the interpretation of data patterns in terms of biological function. Sorting diplotypes by the overall risk they confer (Figure 4C), rather than by linear weight scores lacking direct biological interpretation (Figure 4D) provided compelling evidence for intragenic epistasis (Figure 4C), facilitated validation (Figure 4e), and generated testable hypotheses regarding the function of underlying mutations. By utilizing the order of neighboring SNPs and HapMap information about their expected LD, μGWAS can often identify functional intragenetic regions, whereas the resolution of lrGWAS, irrespective of sample size, is typically limited to an LD block as a whole. For instance, this ana lysis suggests that the combinations of diplotypes with the highest μ-score, in either of the ARHGAP32 regions, have been selected for because they partially compensate for each other. Epistasis might also explain why knocking out the entire ARHGAP32 gene produced no obvious phenotype in mice .
In summary, our results show that genetic risk factors for complex diseases cannot be adequately addressed with ssGWAS alone and that the computationally simple lrGWAS approach may be insensitive to complex forms of epistasis. Reducing artifacts by avoiding models motivated by computational convenience, rather than biological plausibility, reduces the need for independent studies to guard against false-positive results from model misspecifications. For comparative effectiveness research and personalized diagnostics to live up to their expectations, cases and controls need to be closely matched to the population or patient involved. Adequately controlling for genetic and environmental confounders when selecting appropriate cases and controls is essential to tease out predictive factors. This goal is much easier to achieve with only a few hundred subjects, rather than several thousands to be matched. Finally, subset analyses of Phase III trials and published epidemiological studies could rapidly reveal novel insights for drug development.
The Ras pathway is known to be involved in both cancers and many developmental disorders , so the findings here suggest that identifying genetic risk factors modulating this pathway may help in better using information from sequencing patients when targeting pharmacological interventions not only in cancers, but also in other neurodevelopmental diseases other than CAE, including ID and autism spectrum disorders .
With more appropriate statistical methods and more powerful computational tools becoming available, the focus in screening for genetic risk factors of complex diseases can now shift from individual SNPs scattered across the genome to clusters of genes around biologically meaningful pathways. With further advances in computational resources, μGWAS can be extended from epistasis across recombination hotspots (Figure 1) to epistasis between intragenic regions (Figure 4), and between genes (Figure 3).
As μGWAS can provide therapeutically relevant information from substantially smaller sample sizes, decisions in personalized medicine and comparative effectiveness research can be based on samples fine-tuned to the particular patient or population, respectively.
As a few hundred subjects experiencing adverse events or lack of a treatment effect and matched controls from the same population suffice to determine genetic risk factors, data from previous or upcoming Phase III trials can now be effectively mined to determine subpopulations at risk of adverse events and identify directions for development of drugs with a broader target population.
The authors would like to acknowledge S Wrigley, D Politis and R Cauley for collecting the samples, K Olden, LE Thorpe, M Dornbaum and F Steen of the City University of New York School of Public Health for providing access to computational resources for the grid operation, D Kaplun, B Zhou and E Schiffmiller for help with data inspection and ana lysis, WH Greer and GS Zhang for implementing GPU cloud instances, E Horn for editorial assistance, and the reviewers and editors for many helpful suggestions.
KM Wittkowski, V Sonakya and T Song are in part funded by grant #2 UL1 RR024143 from the US National Center for Research Resources (NCRR) Clinical and Translational Science Award (CTSA) and #8 UL1 TR000043 from the National Center for Research Resources and the National Center for Advancing Translational Sciences (NCATS). M Durner and V Sonakya are in part funded by grant #2 R01NS037466 from the US National Institute of Neurological Disorders and Stroke (NINDS). MP Seybold was in part funded by the German National Academic Foundation and the German Academic Exchange Agency.
Financial & competing interests disclosure
The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.
Ethical conduct of research
The authors state that they have obtained appropriate institutional review board approval or have followed the principles outlined in the Declaration of Helsinki for all human or animal experimental investigations. In addition, for investigations involving human subjects, informed consent has been obtained from the participants involved.
Papers of special note have been highlighted as:
■ of interest
■■ of considerable interest