The initial maelstrom of GWAS began in 2005/2006. Fig. (
) shows that since 2006 the number of published SNPs exceeding genome-wide significance in GWAS has risen linearly. The National Human Genome Research Institute (NHGRI) database of published results contains 5118 entries to date, covering over 500 traits (http://www.genome.gov/gwastudies). Fig. (
) shows the top 30 diseases with the greatest number of entries. More than 90 cancer susceptibility loci have been identified [18], along with over 180 loci for height [19], 39 for type 2 diabetes [20] and 71 for Crohn’s disease [21]. With commercial arrays it is now commonplace for published studies to analyse over 500,000 SNPs.
Fig. (2) Summary statistics for entries added to the NHGRI database since 2005 (http://www.genome.gov/gwastudies). a) shows the number of studies added per year (regression r2 = 0.97), b) shows the distribution of reported -log10 P-values (median 7.2).
Number of entries in the NHGRI database for the top 30 traits from 2005-2011.
Despite these successes, much of the heritability or genetic variance estimated to exist remains unaccounted for [22]. Complex traits are often associated with high heritability, yet there is mounting empirical evidence from GWAS results that there are few common variants of large effect. Fig. (
) shows that the median odds ratio for the NHGRI database is only 1.1. Height is often used as an example: although 180 loci have been found above the genome-wide significance level, these variants explain only around 12% of the genetic variance [19]. Studies rarely seem to find variants explaining more than 1% of sibling recurrence risk. There are of course exceptions, such as variants of large effect found for Crohn’s disease with an odds ratio of 3.99 [25] and for activated partial thromboplastin time, where three common variants explain 18% of the phenotypic variance [26]. It is increasingly likely, however, that most common variants of large effect have now been discovered. The crucial question is therefore how to capture the remaining variation, often dubbed the missing heritability or the dark matter of the genome. Furthermore, if we have already discovered the low-hanging fruit, are there diminishing returns to be expected from further GWAS?
Table 1. Examples of Number of Discovered Loci and Percentage of Heritability (h2) Explained 
Although a proportion of the missing heritability may be due to inflation of estimates of additive variance by other non-linear sources of variation such as epistasis, epigenetics and gene-by-environment interaction [2], this is, in general, not supported by theoretical and empirical data [28]. Even if epistasis were widespread, its detection would be challenging due to the number of tests involved, lack of power and the difficulty of setting appropriate significance thresholds. These difficulties are reflected in how little evidence large genome-wide association studies have produced for interlocus interactions. It is highly probable that there are a large number of loci with effects too small to achieve significance at the stringent thresholds set. It is also possible that the remaining variants are rare and therefore poorly tagged by current arrays, which use common tag SNPs unlikely to be in sufficient LD with rarer variants. More powerful methodology can capture this hidden genetic variation by using all SNP information regardless of significance, thus avoiding the problem of the stringent significance thresholds set in GWAS. Yang et al
] used ~295K SNPs on ~4K individuals and explained 10-fold more variation in height than previously reported (i.e. the ~295K SNPs explained about 45% of variance). Furthermore, if the incomplete LD caused by differences between the minor allele frequency distributions of SNPs and causal variants is accounted for, 80% of the variation in height can be explained, in line with literature estimates of narrow-sense heritability. The approach of Yang et al
] can be readily extended to partition the genetic variance across designated regions (often per chromosome) by using a linear mixed model to fit all SNPs in a region simultaneously. This provides a mechanism for locating important regions containing markers or groups of markers that may not meet significance thresholds individually.
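As a rough illustration of the first step in fitting all SNPs in a region simultaneously, the sketch below builds one genomic relationship matrix per region from standardised allele counts. It is not the implementation used in the work cited above: the function names, the simulated genotypes and the three-region labelling are hypothetical, and the mixed-model (variance component) fit that would use these matrices is not shown.

```python
import numpy as np

def genomic_relationship_matrix(genotypes):
    """Return an n x n relationship matrix from an n x m matrix of 0/1/2 allele counts."""
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                      # sample allele frequency per SNP
    keep = (p > 0.0) & (p < 1.0)                  # drop monomorphic SNPs (zero variance)
    x, p = x[:, keep], p[keep]
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # standardise each SNP
    return z @ z.T / z.shape[1]                   # average over SNPs

def regional_grms(genotypes, region_labels):
    """Build one relationship matrix per region (e.g. per chromosome)."""
    x = np.asarray(genotypes)
    labels = np.asarray(region_labels)
    return {r: genomic_relationship_matrix(x[:, labels == r])
            for r in np.unique(labels).tolist()}

# Hypothetical data: 50 individuals, 300 SNPs split evenly over 3 regions.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=300)
geno = rng.binomial(2, freqs, size=(50, 300))
regions = np.repeat([1, 2, 3], 100)
grms = regional_grms(geno, regions)
```

Each matrix in `grms` would then enter the linear mixed model as the covariance structure of one random effect, so that the variance attributed to each region can be estimated jointly.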
Single-marker association models can also be extended to incorporate combinations of markers or haplotypes. If the allele frequency distribution of CTL differs from that of markers, haplotype information could capture associations with rarer variants that elude single-SNP analysis, because allele frequencies at the CTL and at the ‘virtual’ marker created by the haplotype will be better matched [30]. Combinations of multiple SNPs or haplotypes could potentially capture greater proportions of genetic variation than single SNPs [32].
A further explanation for the lack of genetic variation captured by GWAS is that the allelic architecture underlying complex traits is not described accurately by the CD/CV hypothesis. The common disease rare variant (CD/RV) hypothesis states that there are many low-frequency variants of large effect segregating in the population and that each phenotype is due to the combined effect of a few of these variants [33]. Under mutation-selection balance theory, the expectation would be an inverse correlation between how deleterious a SNP is and its minor allele frequency, with a probable upper limit of 1% applying to deleterious alleles [36]. Recent evidence for the CD/RV theory includes a rare variant in MYH6 associated with sick sinus syndrome or slow heart rate [37]. The risk allele has a frequency of 0.38% in the Icelandic population but an associated odds ratio of 12.53. Lifetime risk in the population is 6%, whereas for carriers of the risk allele it is 50%. Interestingly, common variants of the gene modulate cardiac conduction.
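The relationship between an odds ratio and absolute risk can be checked with a few lines of arithmetic. The sketch below assumes that the 6% population lifetime risk can stand in for the risk in non-carriers (reasonable only because the allele is rare) and that the odds ratio applies directly to lifetime odds; both are simplifying assumptions, and the function name is hypothetical.

```python
def risk_from_odds_ratio(baseline_risk, odds_ratio):
    """Convert a baseline absolute risk and an odds ratio into the carrier's absolute risk."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    carrier_odds = baseline_odds * odds_ratio
    return carrier_odds / (1.0 + carrier_odds)

# Figures quoted in the text: 6% lifetime risk and an odds ratio of 12.53.
carrier_risk = risk_from_odds_ratio(0.06, 12.53)
print(f"approximate carrier lifetime risk: {carrier_risk:.2f}")
```

Under these assumptions the carrier risk comes out at about 0.44, of the same order as the 50% quoted above. The point is that a large odds ratio on a low baseline risk still leaves roughly half of carriers unaffected.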
The CD/RV theory is also consistent with variation due to widespread allelic heterogeneity. There is increasing evidence that multiple independent signals exist within a locus for a number of complex traits. Some known genes, such as BRCA1 and MLH1, carry a large number of disease-causing allelic variants. Haiman et al
] found multiple independent regions associated with prostate cancer within the 8q24 locus and Lango Allen et al
] found that out of 180 loci significant for height at least 19 loci had multiple independently associated variants. Overall, this suggests that previously discovered loci are strong candidates for harbouring further missing genetic variation.
The underlying distribution of effects in complex diseases may have a huge impact on how the information discovered to date can be applied. One of the greatest hopes for GWAS was that it could be used to detect disease-related CTL. Furthermore, it was hoped that identifying the genes involved would enable the prediction and prevention of disease and reveal new potential drug targets or therapeutics. If we were able to estimate accurately the effects of sufficient loci to explain half of the known genetic variance, then genomic profiles for most common diseases would achieve sufficient discriminative ability to be of clinical validity. Even if accurately estimated loci explained only one quarter of the genetic variance, for rare diseases (i.e. low prevalence) the genomic profile would be a more useful predictor of risk than self-reported family history [39]. These profiles can be used from birth, enabling susceptible individuals to avoid environmental exposure to risk and thereby reducing absolute risk of disease.
The development of risk predictors for complex diseases has been slower than anticipated due to the small number of loci identified by current GWAS. For complex traits, individual variants rarely explain enough variation to be used as risk predictors; however, profiles based on many of these variants could potentially be used [40]. There is a mounting body of evidence that whole genome prediction methods, developed over decades to estimate livestock breeding values, offer the opportunity to increase the accuracy of prediction of disease risk [43]. Challenges include how to select and estimate the predictors of a model so as to minimise the mean squared error of prediction of the phenotype. The genetic architecture of the trait determines the best strategy for model selection and the accuracy of the prediction model. Issues for model selection and expected accuracy include whether models that fit a subset of the available SNPs will perform better than models that fit all available SNPs simultaneously, how sensitive both approaches are to misspecification of the genetic architecture of the trait [46], and the best strategy for shrinking the estimates of the effects to prevent over-fitting of the models [48]. Furthermore, in populations with high LD many loci will be correlated with each other, affecting model choice and assumptions about the prior distribution. Some of these issues are reviewed by Daetwyler et al., who give deterministic formulae for assessing the accuracy of genomic prediction [50].
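To make the idea of fitting all SNPs simultaneously with shrinkage concrete, the following is a minimal sketch using ridge regression, one simple member of the family of whole genome prediction methods. It is an illustration under stated assumptions rather than any of the cited methods: the data are simulated, the heritability is assumed known, and the shrinkage parameter m(1 - h2)/h2 assumes standardised genotypes with equal effect variance at every locus.

```python
import numpy as np

def ridge_snp_effects(z, y, shrinkage):
    """Jointly estimate all SNP effects; z is n x m standardised genotypes, y is centred."""
    m = z.shape[1]
    lhs = z.T @ z + shrinkage * np.eye(m)     # the ridge penalty shrinks every effect
    return np.linalg.solve(lhs, z.T @ y)

# Hypothetical simulated data: 600 individuals, 200 SNPs, heritability 0.5.
rng = np.random.default_rng(1)
n, m, h2 = 600, 200, 0.5
geno = rng.binomial(2, rng.uniform(0.1, 0.5, size=m), size=(n, m)).astype(float)
z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
g = z @ rng.normal(size=m)
g *= np.sqrt(h2 / g.var())                    # scale genetic values to variance h2
y = g + rng.normal(scale=np.sqrt(1.0 - h2), size=n)

half = n // 2                                 # first half trains, second half validates
effects = ridge_snp_effects(z[:half], y[:half] - y[:half].mean(),
                            shrinkage=m * (1.0 - h2) / h2)
predicted = z[half:] @ effects
accuracy = np.corrcoef(predicted, y[half:])[0, 1]
print(f"validation correlation between predicted and observed phenotype: {accuracy:.2f}")
```

Fitting a subset of SNPs instead, or replacing the single shrinkage parameter with locus-specific priors, changes only the estimation step; the validation step stays the same, which makes it straightforward to compare strategies on the same data.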
The area under the receiver operating characteristic (ROC) curve (or its equivalent, the C-index) can be used to assess the discriminative ability of a prediction model [51] and has been used to assess the performance of genetic predictors and genomic profiling [52]. ROC analysis is a technique for visualising, organising and selecting classifiers based on their performance and is often employed by the medical decision-making community for diagnostic testing [54]. The performance of a diagnostic classifier can be examined over a range of thresholds to identify the threshold at which the classifier is most accurate.
For disease prediction, the ROC curve represents the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity), and the area under the ROC curve (AUC) is the probability that, for a randomly selected case and control, the case will be ranked higher by the prediction model than the control. This is equivalent to the Mann–Whitney–Wilcoxon test statistic [55]. ROC curves are a useful measure as they are not affected by the skewness of the data, i.e. they are not affected by the proportion of cases and controls (other than through sampling error), which might vary from one data set to the next [51]. Wray et al
] give parameter estimates for 17 common complex diseases. They show that it is theoretically possible for a genomic profile for complex disease to exceed the threshold of discriminative ability of 0.75 that could, arguably, be considered clinically useful. They also show that an AUC of 0.75 can correspond to anywhere from 0.1 to 0.74 of the genetic variance being explained; care should therefore be taken when interpreting this statistic without some knowledge of the parameters used.
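The probabilistic definition of the AUC given above can be computed directly by comparing every case with every control. The sketch below does this with hypothetical risk scores; counting ties as one half is a common convention and is an assumption here.

```python
import numpy as np

def auc_by_pairs(case_scores, control_scores):
    """Probability that a random case scores higher than a random control (ties count half)."""
    cases = np.asarray(case_scores, dtype=float)[:, None]
    controls = np.asarray(control_scores, dtype=float)[None, :]
    wins = (cases > controls).sum()
    ties = (cases == controls).sum()
    return (wins + 0.5 * ties) / (cases.size * controls.size)

# Hypothetical predicted risks for 3 cases and 4 controls.
case_risk = [0.9, 0.8, 0.4]
control_risk = [0.7, 0.3, 0.2, 0.4]
auc = auc_by_pairs(case_risk, control_risk)

# Doubling the number of controls without changing their score distribution
# leaves the AUC unchanged, illustrating its independence from the case:control ratio.
auc_more_controls = auc_by_pairs(case_risk, control_risk * 2)
```

For large samples the same quantity is usually obtained from ranks rather than from all pairs, which avoids building the full case-by-control comparison matrix.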
To explore this further we simulated data for prostate, breast and colorectal cancer to investigate the effect of various parameters on the prediction of risk. Ten thousand cases and 10,000 controls were simulated under a liability threshold model. Allele frequencies were drawn from a beta distribution and additive genetic effect sizes from a normal distribution. The data were randomly assigned to two equal-sized groups, a training set and a validation set. Three scenarios were examined: a) prediction when the effects of all loci are known (i.e. we used the simulated effects), b) prediction when the effects of all loci are estimated, and c) prediction using the 15 most significant loci. In particular, for prostate cancer the discriminative ability across all ages (0.85) was higher than for the over 65s (0.79). This is counter-intuitive for a disease primarily of late onset, but can be explained by the fact that the prevalence across all ages is lower; those whose genetic risk factors lead to clinical diagnosis are therefore likely to be at the extremes of the distribution in the population, making the probability of discriminating accurately higher. Accuracy of prediction is very dependent on the underlying genetic architecture of the trait. Fig. (
) shows the discriminative ability for breast cancer (BC), colorectal cancer (CRC) and prostate cancer (PC), either across all ages or in the over 65s, comparing models with 500 or 10,000 underlying additive loci. When 10,000 loci were simulated, the discriminative ability when using the 15 most significant effects ranged from 0.53 to 0.57. Whilst these figures are low and do not appear to offer a hopeful prognosis for clinical utility, a recent study of 32 loci associated with BMI [57] found that, although the AUC for prediction of obesity was only 0.57, this was still an increase over using family and environmental information alone, which yielded an accuracy of only 0.51. When we simulated a prevalence of 0.004, a heritability of 0.42 and 500 loci underlying the trait, the discriminative ability of a model that estimated the effects of all loci was 0.93, equivalent to using the known simulated values for the loci. Accuracy using the 15 most significant estimated effects was 0.75. In general, the discriminative ability of prediction models using the estimates of genetic effects is as good as that obtained using the actual simulated values. The maximum AUCs of 0.92, 0.87 and 0.91 are similar to estimates from Wray et al
] who estimated 0.90, 0.89 and 0.96 for prostate, breast and colon cancer respectively using a threshold liability model.
Fig. (4) Discriminative ability (AUC) for prostate (PC), breast (BC) and colorectal cancer (CRC) for all ages and for over 65 years of age.
Either 500 or 10,000 loci were simulated, with estimates or real/actual values for all loci or the top 15 significant results.
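The kind of liability threshold simulation described above can be sketched as follows. This is a simplified illustration and not the simulation code behind the reported results: it draws a random population sample rather than 10,000 cases and 10,000 controls, the beta shape parameters, number of loci and prevalence are hypothetical, and only the scenario in which the true genetic values are known is evaluated.

```python
import numpy as np
from statistics import NormalDist

def simulate_liability(n_people, n_loci, heritability, prevalence, seed=0):
    """Simulate additive genetic values and case status under a liability threshold model."""
    rng = np.random.default_rng(seed)
    freqs = np.clip(rng.beta(2.0, 2.0, size=n_loci), 0.01, 0.99)  # assumed shape parameters
    effects = rng.normal(0.0, 1.0, size=n_loci)
    geno = rng.binomial(2, freqs, size=(n_people, n_loci))
    genetic = geno @ effects
    # Rescale so the genetic value has variance h2 and total liability has variance 1.
    genetic = (genetic - genetic.mean()) / genetic.std() * np.sqrt(heritability)
    liability = genetic + rng.normal(0.0, np.sqrt(1.0 - heritability), size=n_people)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    return genetic, liability > threshold

def auc_from_ranks(scores, is_case):
    """Rank-based (Mann-Whitney) AUC, assuming continuous scores without ties."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_case = int(is_case.sum())
    n_control = len(scores) - n_case
    return (ranks[is_case].sum() - n_case * (n_case + 1) / 2.0) / (n_case * n_control)

genetic_value, is_case = simulate_liability(20000, 200, heritability=0.42, prevalence=0.05)
auc_known_effects = auc_from_ranks(genetic_value, is_case)
print(f"cases: {is_case.mean():.3f}, AUC with known effects: {auc_known_effects:.2f}")
```

Replacing `genetic_value` with a score built from effects estimated in a training half, or from only the most significant loci, gives the other two scenarios.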
Park et al
] used summary statistics from existing GWAS to calculate the expected distribution and number of loci that exist within the range of SNP effects observed, on a trait-by-trait basis. They estimate discoveries for future GWAS of given sample sizes by integrating power over the number of unidentified loci that probably exist, whilst accounting for the distribution of relative risk and allele frequency. Based on the assumption of a spectrum of low-penetrance common variants, the predicted total numbers of loci within the range of effects currently detected by GWAS for height, Crohn’s disease and BPC (breast, prostate, colorectal) cancers are 201, 142 and 67, explaining 16.4, 20 and 17.1% of the genotypic variance respectively. They use these predictions to estimate the AUC for Crohn’s disease and the cancers. They predict that all 142 loci would give an AUC of 79.2% for Crohn’s disease, in comparison to 72.8% from the 30 loci discovered to date. For breast cancer, the AUC of 57% given by the 5-10 loci found so far could be improved to 63.5% if all 67 loci were discovered. These results are based on modest estimates of heritability and on MAFs from published studies that are generally > 0.05, and do not reflect results from our simulation studies [59] or empirical data from Yang et al.
A recent paper by Meuwissen et al
] takes genomic profiling a step further by exploring the prospect of prediction from whole genome sequencing data. They use simulation to explore the accuracy of genomic prediction using sequence data, which has the advantage of including all polymorphisms, such as indels as well as SNPs. Furthermore, the sequence data contain the causal variants. They conclude that if there is a finite number of causal variants a Bayesian approach is most successful; however, should the distribution of effects follow an infinitesimal model with thousands of loci of small effect, BLUP methods are expected to outperform the Bayesian analysis. Statistical methods for whole genome prediction are also reviewed by de los Campos et al
]. Ultimately, however, the success of prediction methods for complex diseases will be limited by disease prevalence and heritability. Even if the predictor explains 100% of the genetic variance, the maximum AUC depends on trait heritability and prevalence. This is shown in Fig. (
) where, even if the simulated values for all loci are used, the AUC ranges from 0.87 to 0.93. Care must also be taken to note that genomic profiling infers genetic risk and not absolute risk. These profiles are a predictor of genotypic value rather than phenotype, although they can be extended to incorporate environmental risk factors such as smoking, diet or exercise. Furthermore, the discriminative ability of a model in a case-control study differs from its discriminative ability at the population level, and a model that is well calibrated for a case-control study may well perform poorly when screening the whole population. Strengths and weaknesses are further reviewed by Hand [55].
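The dependence of the maximum AUC on heritability and prevalence can be illustrated numerically. The sketch below uses a normal-theory approximation under the liability threshold model that we understand to follow Wray et al; the exact form of the expression should be treated as an assumption and checked against that paper. The heritability is assumed to be on the liability scale, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def max_auc(prevalence, heritability_liability):
    """Approximate AUC when a predictor explains all of the genetic variance."""
    std_normal = NormalDist()
    t = std_normal.inv_cdf(1.0 - prevalence)   # liability threshold
    z = std_normal.pdf(t)                      # normal density at the threshold
    i = z / prevalence                         # mean liability of cases
    v = -z / (1.0 - prevalence)                # mean liability of controls
    h2 = heritability_liability
    denominator = sqrt(h2 * ((1.0 - h2 * i * (i - t)) + (1.0 - h2 * v * (v - t))))
    return std_normal.cdf((i - v) * h2 / denominator)

# Parameters taken from the simulation described in the text.
auc_rare = max_auc(prevalence=0.004, heritability_liability=0.42)
# Hypothetical comparison: same heritability, a much more common disease.
auc_common = max_auc(prevalence=0.10, heritability_liability=0.42)
print(f"maximum AUC: prevalence 0.004 -> {auc_rare:.2f}, prevalence 0.10 -> {auc_common:.2f}")
```

With a prevalence of 0.004 and a heritability of 0.42 the approximation gives a value close to the 0.93 obtained by simulation, and the ceiling falls as prevalence rises or heritability falls.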
The advent of whole genome sequencing methods [61] provides the opportunity to increase the information available for genetic mapping studies by including all sources of genetic variation (SNPs, CNVs, indels, rearrangements, etc.) that may be causal, and allows unbiased screening of the genetic variation present in the sampled individuals. Of the 3200 Mb of DNA in the human genome, only 1.5% or ~50 Mb is functional or coding DNA, comprising approximately 20,000-25,000 genes. Large numbers of repetitive elements make up approximately 50% of the DNA. These include indels, copy number variants (CNVs), translocations, inversions, and chromosomal duplications. In a recent review of 75 cancer genes with germline mutations, 28 were reported to be mutated by genomic deletion or duplication [18].
The latest commercial arrays contain a combination of SNPs and structural variants [62]. CNVs have been associated with many traits [63], including starch digestion (AMY1) [64], HIV [62], schizophrenia [65] and Crohn’s disease [66], although in European populations most common CNVs are likely to have been tagged by SNPs. High heritability, high mutation rates and greater variation in African populations still provide compelling arguments for sequencing thousands more genomes and using CNVs to complement SNPs.
Following the HapMap project, the 1000 Genomes Project [10] promises to deliver individual sequence variation by sequencing 1000 individuals. This will give insights into structural variation such as CNVs, indels and deletions, and into regulatory elements. The genomes of 2500 individuals from 25 populations around the world will be sequenced using next-generation sequencing technologies [61]. This international collaboration is set to produce an unprecedented public catalogue of human genetic variation, including SNPs and structural variants, and their haplotype contexts, to support genome-wide association studies and other medical research. The use of this valuable resource as a reference panel for imputation in cohorts with genome-wide association data may help to screen for rarer genetic variants. However, it is as yet unclear whether rarer genetic variants will be accurately imputed using general reference panels. It may be more useful for each cohort to generate its own reference panel, either by sequencing a subset of the cohort or by genotyping that subset on a dense SNP array. Even in the latter case, the imputation quality of rare variants in low LD with common tagging SNPs is likely to be low, making re-sequencing of large numbers of samples necessary [67].
It is becoming clear that SNP-trait associations alone rarely lead to identification of the causal variant or of the context in which the gene operates. Establishing this biological context is a necessary step for the generation of new biological hypotheses and the identification of drug targets for disease. There are many intermediate phenotypes, or endophenotypes, which can be used to map complex variation. It is possible that these are more heritable and represent a more comprehensive approach in the quest for the underlying causal variants of quantitative traits.
The most widely studied intermediate phenotype is gene expression, i.e. the abundance of mRNA transcripts, measured using microarrays or RNA-Seq to determine which genes are expressed at a given timepoint in a given tissue. Expression QTL may be categorised as cis (local) or trans (distant) effects. There appears to be a current bias towards large cis effects, implying that regulatory processes are involved [68]. Associating patterns of gene expression with genotypic and phenotypic values provides insights into biological function. Genes that are differentially expressed can be used to infer regulatory networks and underlying pathways associated with traits or diseases [70].
Regulatory networks and discovered pathways in turn provide another dimension to GWAS and can themselves be used as intermediate phenotypes. Pathway enrichment analysis involves assigning SNPs to genes, and genes to pathways, in order to find associations at the pathway level, providing greater insight into biological function. Enrichment scores for known pathways can be obtained to investigate whether genes in any one pathway are over-represented among those associated with the phenotype [72]. The 180 loci discovered for height [19] are enriched for genes that are connected in biological pathways and underlie skeletal growth defects. Pathway analysis increasingly shows that pathways reputedly underlying common diseases are often shared across diseases. It is difficult to ascertain whether this is due to annotation bias or whether there is genuine pleiotropy across the mechanisms underlying these diseases. It could be hypothesised that some level of pleiotropy must exist, given that there are approximately 21,000 genes and millions of traits. A recent study of coronary artery disease found that 5 out of 13 loci showed strong association with various other diseases or traits [76]. Interleukin receptor genes are linked with several common diseases such as Crohn’s disease, lupus and rheumatoid arthritis, implicating common underlying mechanisms or pathways [75]. The prediction of biological mechanisms seems at best tenuous, with many potential biases arising not only from limitations of annotation but also from the setting of significance thresholds, the assignment of variants to genes or pathways, the algorithm or method used to obtain enrichment scores and, fundamentally, the type of biological data selected [77]. It is possible that, by using a statistical approach, we are much closer to the accurate prediction of phenotype from molecular data, which is almost a black-box whole genome approach, than we are to gaining real insights into the specific underlying biological mechanisms, which remain rather more elusive for most traits.
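One simple way to quantify the over-representation of pathway genes mentioned above is an upper-tail hypergeometric test. The sketch below is one common choice and not necessarily the method used in the studies cited; the pathway size and overlap are hypothetical, while the genome size of 21,000 genes and the 180 associated genes are borrowed from figures quoted in the text, assuming for illustration one gene per locus.

```python
from math import comb

def enrichment_p_value(n_genes, n_in_pathway, n_associated, n_overlap):
    """Probability of at least n_overlap pathway genes among n_associated genes drawn at random."""
    upper = min(n_in_pathway, n_associated)
    total = comb(n_genes, n_associated)
    tail = sum(comb(n_in_pathway, k) * comb(n_genes - n_in_pathway, n_associated - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical example: a pathway of 100 genes, of which 5 are among 180 associated genes.
p_value = enrichment_p_value(n_genes=21000, n_in_pathway=100, n_associated=180, n_overlap=5)
print(f"enrichment p-value: {p_value:.4f}")
```

The result is sensitive to exactly the choices listed above: which genes count as associated, how variants are assigned to genes, and which gene universe is used as the background.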
Finally, an important source of heritable variation not seen in coding sequence is epigenetic modification. This includes promoter methylation, histone tail modifications and altered expression of non-coding RNAs that associate with chromatin modifying complexes. These modifications contribute to gene regulation in normal development, particularly in foetal growth, to gene expression in tumorigenesis and have been shown to mediate the influence of environment on gene expression. Technologies are now sufficiently advanced to carry out methylation profiling. Gibbs et al
] find abundant QTL for DNA CpG methylation across the genome in brain tissue. Methylation studies are likely to play a wider role in the post-genomic era. There are arguments, yet to be proven, for epigenetic variation as a driving force for development, evolutionary adaptation and disease [80].