Genome-wide investigations for identifying the genes for complex traits are considered to be agnostic in terms of prior assumptions for the responsible DNA alterations. The agreement of genome-wide association studies (GWAS) and genome-wide linkage scans (GWLS) has not been explored to date. In this study, a genomic convergence approach of GWAS and GWLS was implemented for the first time in order to identify genomic loci supported by both methods. A database with 376 GWLS and 102 GWAS for 19 complex traits was created. Data regarding the location and statistical significance for each genetic marker were extracted from articles or web-based databases. Convergence was quantified as the proportion of significant GWAS markers located within linked regions. Convergence was variable (0–73.3%) and was found to be significantly higher than expected by chance only for two of the 19 phenotypes. Seventy-five loci of interest were identified which, being supported by independent lines of evidence, could merit prioritization in future investigations. Although convergence is supportive of genuine effects, the lack of agreement between GWLS and GWAS also indicates that these studies are designed to answer different questions and are not equally well suited for deciphering the genetics of complex traits.
Despite extensive efforts, the genetic basis of common, multifactorial disorders remains poorly understood. Few genetic regions and genes can reliably be considered as true positives. The majority of findings from the two basic tools for deciphering the genetics of complex traits (genome-wide linkage scans (GWLS) and candidate gene studies) have shown inconsistency and non-reproducibility [Altmuller et al., 2001; Lohmuller et al., 2003; Ioannidis, 2007; Morgan et al., 2007; Lasky-Su et al., 2008]. The advent of the first wave of genome-wide association studies (GWAS) provided substantial advances in uncovering DNA variations influencing the risk of common diseases and clinical phenotypes. More than 100 loci for 40 traits have been identified [Pearson & Manolio, 2008]. Although concerns relevant to the analysis, interpretation and translation of GWAS results have been expressed [Hardy & Singleton, 2009; Shriner et al., 2007; McCarthy et al., 2008], publication trends for GWAS across several fields are rapidly increasing [Yu et al., 2008].
The multifactorial etiology of common disorders involving complex epistatic and gene-environment interactions reduces the likelihood that a single method could provide definitive answers. Alternatively, an integration approach combining different kinds of genetic data analysis ('genomic convergence') has been proposed for the prioritization of promising results from the bulk of exponentially accumulating data [Hauser et al., 2003]. Evidence for the implication of a genetic locus can be considered as more valid when supported by independent research methodologies, such as linkage, association or expression profiling studies.
Since the majority of genetic associations claimed by hypothesis-driven candidate-gene studies are eventually refuted [Lohmuller et al., 2003; Morgan et al., 2007], the 'hypothesis-free' methods of genetic analysis represent an area of major interest. The GWAS and GWLS can be considered as the only agnostic approaches, free from underlying assumptions regarding the location, number or functionality of causal genetic variants [Altmuller et al., 2001; Pearson & Manolio, 2008; Hunter et al., 2008]. Although GWAS have revealed many loci and variants of unprecedented biological implication [Pearson & Manolio, 2008; Hunter et al., 2008], no previous study has examined the agreement of these association findings with previous linkage analysis results. When the independent findings stemming from 'hypothesis-free' methods are compared, GWLS and GWAS can be considered as complementary rather than competing strategies [Bourgain et al., 2007]. In this study, we combined for the first time the strengths of GWLS and GWAS to identify genetic loci for complex traits supported by both types of studies, using a systematic literature search and data mining from web-based genetic databases.
Applying a genomic convergence approach, data from GWLS and GWAS for complex traits were compared and examined for agreement. Eligible studies were retrieved by a systematic search of the PubMed database from its inception through August 2008, for all available GWLS and GWAS in humans. Combinations of search terms such as 'genome-wide' OR 'linkage' OR 'association' OR 'gene' OR 'polymorphism' were used. Eligible phenotypes included disorders and traits for which multifactorial etiology has been proposed. To ensure accumulation of sufficient analyzable information and to examine the most extensively studied phenotypes, only phenotypes with at least three published GWAS and at least three published GWLS were considered. Finally, a database including 376 GWLS and 102 GWAS for 19 phenotypes was created, available at our website (http://biomath.med.uth.gr).
From each study, the following information was extracted: publication details (author, year of study, journal); phenotypic traits; details of study population (ethnic background, sample-size); genotyping methods (type and number of markers, type of genotyping platform); statistical methods; results obtained (marker accession numbers, genetic and physical location, genetic annotation, statistical significance).
Within each GWAS, the statistically significant Single Nucleotide Polymorphisms (SNPs) were recorded. In order to maximize the number of potentially promising findings from each study, a critical p-value at the 10⁻⁶ level was initially used [Hindorff et al., 2008], instead of the commonly used but more stringent 0.5×10⁻⁸ level [McCarthy et al., 2008]. For each SNP, the physical location and genetic annotation were identified through the dbSNP database [Sherry et al., 2001]. For SNPs exhibiting strong linkage disequilibrium (r² > 0.8) on the basis of HapMap data B35 [The International HapMap Project, 2003] within a single study, only the most significant representative marker of the locus was considered. The extracted GWAS data were cross-validated with the available GWAS data in HapMap B36, the Catalog of Published Genome-Wide Association Studies [Hindorff et al., 2008] and the HuGE Navigator database [Yu et al., 2008].
From the available GWLS for the corresponding phenotypes, the markers reaching a genome-wide significant or suggestive linkage score (i.e. LOD score > 2) in their respective main analyses were recorded [Sawcer et al., 1997]. For each marker, the respective linkage confidence intervals were determined either by the 1.0-LOD-unit-down method when LOD score curves and markers were available, or by using the ±15 centiMorgan interval around the peak marker in the absence of the above information, as previously described [Chen et al., 2007; Zintzaras et al., 2007]. These data were supplemented by linkage intervals identified in GWLS meta-analyses, and for five phenotypes (Diabetes Mellitus Type I, Obesity, Bipolar Disorder, Schizophrenia, Alzheimer's Disease) by their respective online linkage databases [Hulbert et al., 2007; Bertram et al., 2007; Rankinen et al., 2006; Konneker et al., 2008]. The extent of genomic coverage by linkage intervals for each phenotype was calculated by adding the non-overlapping intervals over the Marshfield genetic map [Broman et al., 1998] (Table 1). Comparisons of GWLS and GWAS results were mediated by translations of genetic and physical map distances using the MapOMat software [Kong et al., 2004], the Marshfield genetic map [Broman et al., 1998] and the Uni-STS database.
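The interval construction and coverage calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the clipping of intervals at chromosome ends and the chromosome lengths in the example are all assumptions made for the sketch, and positions are taken to be already expressed in centiMorgans on a single genetic map.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_cM, end_cM) on one chromosome


def peak_interval(peak_cm: float, chrom_length_cm: float,
                  half_width_cm: float = 15.0) -> Interval:
    """Fallback interval of +/- half_width_cm around a linkage peak.

    Clipping to [0, chrom_length_cm] is an assumption of this sketch.
    """
    return (max(0.0, peak_cm - half_width_cm),
            min(chrom_length_cm, peak_cm + half_width_cm))


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping intervals so that no stretch is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(intervals_by_chrom: Dict[str, List[Interval]],
                      chrom_lengths_cm: Dict[str, float]) -> float:
    """Proportion of the genetic map covered by non-overlapping intervals."""
    covered = sum(end - start
                  for ivs in intervals_by_chrom.values()
                  for start, end in merge_intervals(ivs))
    return covered / sum(chrom_lengths_cm.values())


# Hypothetical two-chromosome genome (lengths in cM are made up).
lengths = {"1": 200.0, "2": 100.0}
intervals = {
    "1": [peak_interval(10.0, lengths["1"]), peak_interval(30.0, lengths["1"])],
    "2": [peak_interval(55.0, lengths["2"])],
}
print(coverage_fraction(intervals, lengths))
```

In this toy example the two chromosome 1 intervals overlap and merge into a single 45 cM stretch, so the overlapping region contributes to the coverage only once.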
Genomic convergence for each phenotype was defined as the proportion of statistically significant SNPs that fell within linkage intervals. Convergence was estimated when at least one significant SNP in at least one GWAS was detected for the specific phenotype. To examine whether the observed convergence exceeded or fell below the convergence expected by chance, we used a two-sided, one-sample z-test of proportions at the alpha = 0.05 level of significance. The convergence expected by chance for each phenotype was set equal to the proportion of genomic coverage by linkage intervals. For example, for a phenotype where 40% of the genome is covered by linkage intervals, each SNP has a 40% probability of falling within a linkage interval and producing a convergent result. Exact binomial test probabilities were calculated to test for agreement with the z-test. Correlation of the convergence proportions with the proportions of linkage coverage was also examined.
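The two tests can be illustrated with a short sketch using only the Python standard library. The function name and the example counts are hypothetical, and the two-sided exact binomial p-value is computed here by summing the probabilities of all outcomes that are no more likely than the observed one; the text does not state which two-sided definition was used, so that choice is an assumption.

```python
import math
from typing import Tuple


def convergence_tests(k: int, n: int, p0: float) -> Tuple[float, float, float, float]:
    """Test observed convergence k/n against the chance expectation p0.

    k  : number of significant SNPs falling inside linkage intervals
    n  : total number of significant SNPs for the phenotype
    p0 : proportion of the genome covered by linkage intervals (0 < p0 < 1)
    Returns (observed proportion, z statistic, z-test p-value, exact p-value).
    """
    p_hat = k / n
    # One-sample z-test of proportions, standard error taken under the null.
    z = (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)
    p_z = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area

    # Exact binomial test: sum P(X = i) over outcomes no more likely than k.
    pmf = [math.comb(n, i) * p0 ** i * (1.0 - p0) ** (n - i) for i in range(n + 1)]
    p_exact = sum(q for q in pmf if q <= pmf[k] * (1.0 + 1e-9))
    return p_hat, z, p_z, min(1.0, p_exact)


# Hypothetical phenotype: 12 of 20 significant SNPs lie in linkage
# intervals that cover 40% of the genome.
p_hat, z, p_z, p_exact = convergence_tests(12, 20, 0.40)
print(f"convergence={p_hat:.3f}  z={z:.3f}  p(z-test)={p_z:.4f}  p(exact)={p_exact:.4f}")
```

With few significant SNPs per phenotype the normal approximation can be poor, which is one reason to check the z-test against the exact probabilities.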
Subgroup analyses by a) study sample size and b) genomic location of SNPs were performed. In the first analysis, we explored convergence only for GWLS and GWAS with large sample sizes. Large sample sizes were defined as those exceeding the median sample size of the group of studies investigating the same phenotype. In the second analysis, convergence was examined for intergenic and genic SNPs separately (Table 1).
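The sample-size subgroup can be sketched as a median split within each group of studies. The record layout and study identifiers below are hypothetical, and computing the median separately for each study design (GWLS or GWAS) within a phenotype is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple

# (study_id, design, phenotype, sample_size); layout is hypothetical.
Study = Tuple[str, str, str, int]


def select_large_studies(studies: List[Study]) -> List[str]:
    """Return ids of studies whose sample size exceeds their group's median."""
    groups: Dict[Tuple[str, str], List[Study]] = defaultdict(list)
    for study in studies:
        groups[(study[1], study[2])].append(study)  # group by design and phenotype

    large: List[str] = []
    for members in groups.values():
        cutoff = median(s[3] for s in members)
        large.extend(s[0] for s in members if s[3] > cutoff)
    return sorted(large)


# Made-up studies for a single phenotype.
example = [
    ("gwas1", "GWAS", "obesity", 1000),
    ("gwas2", "GWAS", "obesity", 5000),
    ("gwas3", "GWAS", "obesity", 3000),
    ("gwls1", "GWLS", "obesity", 200),
    ("gwls2", "GWLS", "obesity", 800),
    ("gwls3", "GWLS", "obesity", 400),
    ("gwls4", "GWLS", "obesity", 600),
]
print(select_large_studies(example))
```

Because the comparison is strict, a study sitting exactly at the median is not counted as large.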
In order to place the convergent findings in our database on exact chromosomal locations and define unique convergent loci, the linkage intervals harboring GWAS SNPs and showing extensive overlap with one another (>80%) were merged. Additionally, from SNPs across different GWAS that showed strong linkage disequilibrium or were located in physical proximity (less than 100 kb apart), only the most significant 'hit' was recorded. The physical distance between the most highly associated SNP and the linkage peak marker in the interval was also recorded. In the final step, more stringent criteria for genome-wide linkage (LOD score > 3.2) and association (p-value < 0.5×10⁻⁸) were applied to this dataset of unique convergent findings, to identify signals with the most compelling evidence of convergence (Supplementary Table 1).
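The proximity rule for SNPs can be illustrated with a greedy pass that visits SNPs from most to least significant and keeps one only if no already retained SNP lies within 100 kb on the same chromosome. The greedy strategy, the record type and the marker names are assumptions of this sketch, and the linkage disequilibrium criterion is not modelled here.

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    name: str       # marker label (hypothetical names in the example)
    chrom: str
    pos_bp: int     # physical position in base pairs
    p_value: float


def clump_by_distance(snps: List[Snp], window_bp: int = 100_000) -> List[Snp]:
    """Keep only the most significant SNP among those closer than window_bp."""
    kept: List[Snp] = []
    for snp in sorted(snps, key=lambda s: s.p_value):  # most significant first
        is_near_kept = any(
            k.chrom == snp.chrom and abs(k.pos_bp - snp.pos_bp) < window_bp
            for k in kept
        )
        if not is_near_kept:
            kept.append(snp)
    return kept


# Made-up hits pooled across several GWAS of one phenotype.
hits = [
    Snp("snpA", "1", 1_000_000, 1e-7),
    Snp("snpB", "1", 1_050_000, 1e-9),  # 50 kb from snpA and more significant
    Snp("snpC", "1", 1_500_000, 5e-7),
    Snp("snpD", "2", 1_020_000, 2e-7),  # different chromosome, so kept
]
print([s.name for s in clump_by_distance(hits)])
```

Here snpA is dropped in favour of snpB, while snpC and snpD are retained because they are respectively far away and on another chromosome.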
The number of linkage intervals, the proportion of genomic coverage by linkage intervals and the number of significant GWAS SNPs for each phenotype are shown in Table 1. Overall, 1221 linkage intervals were constructed and 387 (269 genic/118 intergenic) SNPs were identified. Convergence estimates ranged from 0% in the case of amyotrophic lateral sclerosis and colorectal cancer to 73.3% for BMI-obesity (median 30.8, interquartile range (IQR) 12.0). The genomic convergence proportions correlated with the proportions of linkage genomic coverage (Spearman correlation coefficient r = 0.67). However, convergence was significantly higher than expected by chance for only two phenotypes (LDL cholesterol concentration and restless leg syndrome).
In the subgroup analysis for studies with large sample sizes, the convergence ranged from 0% to 75.0% (median 22.2, IQR 28.6). Convergence was now statistically significant for six out of the 19 phenotypes (Alzheimer's disease, BMI-obesity, Crohn's disease, LDL cholesterol concentration, restless leg syndrome and rheumatoid arthritis).
When intergenic SNPs were considered separately, the observed convergence did not exceed that expected by chance for any of the phenotypes (median 10.0, IQR 33.3). On the contrary, the convergence for rheumatoid arthritis was significantly lower than expected (convergence = 0%, p < 0.05). For SNPs located within genes, the convergence estimates ranged from 0% to 100% (median 35.0, IQR 35.7) and were significantly higher than expected for four phenotypes (Alzheimer's disease, diabetes mellitus type I, LDL cholesterol concentration and rheumatoid arthritis).
A detailed description of convergent SNPs and their harbouring linkage intervals is provided in Supplementary Table 1. Seventy-five unique convergent loci were identified. A graphical depiction of convergent findings is provided on a chromosomal map, available at our website (http://biomath.med.uth.gr). The stringent criteria for genome-wide linkage and association were met for 16 out of these 75 unique loci (21.3%). When considering all 389 SNPs initially identified from GWAS, 35 of them (9.0%) provided strong evidence of convergence (Supplementary Table 1).
Convergent findings can be considered as landmarks of loci and variants meriting priority in replication studies, in an effort to limit the plethora of false positive results in genetic epidemiology [NCI-NHGRI Working Group on Replication in Association Studies et al., 2007; Zintzaras & Lau, 2008]. By comparing the results from two agnostic methodologies for unraveling the genetics of complex traits, we identified a subset of genomic loci with concomitant evidence of implication.
In our main analysis, genomic convergence was variable and was found to be significantly higher than the convergence expected by chance for only two of the 19 phenotypes. In the subgroup analysis for studies with large sample sizes, the number of phenotypes with statistically significant convergence increased to six. This pattern of results suggests that studies with small sample sizes may have inflated the inconsistency between GWLS and GWAS.
Consistently replicated GWAS findings that uncovered previously unsuspected pathophysiology, such as the 8q24 locus for prostate cancer or the 9p21.3 locus for coronary artery disease, have shown convergence in our database (Supplementary Table 1). Such signals, located in “gene deserts” (intergenic regions), would never have been considered in a candidate gene study. The fact that intergenic signals had been previously implicated by linkage analysis and remained largely unnoticed until the emergence of GWAS is of particular importance. However, our subgroup analysis for convergence of intergenic SNPs did not identify any phenotype with significantly increased convergence. In contrast, convergence of genic SNPs showed significant results for four phenotypes.
The use of prior linkage information in the analysis of GWAS data has been proposed as a strategy to increase statistical power [Roeder et al., 2006]. By applying a false-discovery rate that involves weighting the hypotheses on the basis of prior, informative linkage data, the power of the GWAS is expected to improve considerably. On the other hand, methodologies for enriching linkage results on the basis of the commonality of functional annotation have been proposed, in order to identify candidate genes expected to participate in a common pathway or process [Shriner et al., 2008]. Additionally, gene-prioritization bioinformatic tools have been developed, which exploit linkage and other types of prior information (such as gene expression profiles or protein-protein interactions) [Hutz et al., 2008]. These promising methodologies could help prioritize for further research those genes and SNPs that provide evidence of implication across different venues of research.
Nevertheless, our findings warrant cautious interpretation in the light of inherent weaknesses of genomic epidemiology investigations. Our study was based on the assumption that variants detectable by linkage designs are the same as those detectable by association, and vice versa. However, susceptibility loci in the multiplex families used in GWLS might not overlap with susceptibility loci in population-based samples, such as those used in GWAS [Roeder et al., 2006].
The imprecision and limited power to detect multiple loci with modest contributions are well-recognized limitations of GWLS [Altmuller et al., 2001]. The current design of GWAS makes them more suitable in the context of the common disease-common variant hypothesis [Hunter et al., 2008; McCarthy et al., 2008]. Consequently, significant SNPs located outside linkage regions can still mark true positive associations, since the corresponding GWLS possibly lacked the power to uncover linkage in the surrounding region. Some of the most convincing associations that have emerged in the GWAS era, like the IL23R gene for Crohn's disease or the KCNJ11 gene for type 2 diabetes, were not supported by linkage data. Furthermore, the imprecision of GWLS results in extended linked regions containing hundreds of genes. The inability to narrow down the linkage intervals might have resulted in some spurious convergent findings in our database.
Apart from the limitation of GWLS in detecting low-penetrance genes, GWAS do not capture the effect of SNPs with minor allele frequencies of less than 5%. However, according to an alternative hypothesis (the common disease-rare variant hypothesis), complex traits are caused collectively by multiple rare variants with moderate to high penetrance [Iyengar & Elston, 2007]. These rare variants, with frequencies lying somewhere between the limits of deleterious mutations and polymorphic variations (i.e. 0.1–1%), would be detectable only after extensive resequencing of carefully chosen candidate genes in selected cases [Bodmer & Bonilla, 2008].
In conclusion, genomic convergence techniques do not provide direct validation of findings, nor do they represent a substitute for high-quality replication studies. Interpreted with caution, evidence from convergence could provide an additional piece of information (together with strong statistical evidence or biological plausibility) [McCarthy et al., 2008; NCI-NHGRI Working Group on Replication in Association Studies et al., 2007] in determining the design of replication studies. The convergent findings that emerged from two exploratory methodologies for complex traits are supportive of genuine effects that could merit prioritization in future studies. Lack of convergence is not indicative of false positive findings or methodological flaws; genome-wide linkage and association studies are probably designed to answer different questions and are not equally well suited for deciphering the genetics of complex diseases.
Convergent Findings from GWAS and GWLS.
Georgios Kitsios is Pfizer-Tufts Medical Center Research Fellow in Clinical Research.
Scientific support for this project was provided through the Tufts Clinical and Translational Science Institute (Tufts CTSI) under funding from the National Institutes of Health/National Center for Research Resources (UL1 RR025752). Points of view or opinions in this paper are those of the authors and do not necessarily represent the official position or policies of the Tufts CTSI.
Web Resources The URLs for data presented herein are as follows: