This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The number of genome-wide association studies (GWAS) is growing rapidly leading to the discovery and replication of many new disease loci. Combining results from multiple GWAS datasets may potentially strengthen previous conclusions and suggest new disease loci, pathways or pleiotropic genes. However, no database or centralized resource currently exists that contains anywhere near the full scope of GWAS results.
We collected available results from 118 GWAS articles into a database of 56,411 significant SNP-phenotype associations and accompanying information, making this database freely available here. In doing so, we encountered a number of challenges to creating an open access database of GWAS results, which we describe here. Through preliminary analyses and characterization of available GWAS, we demonstrate the potential to gain new insights by querying a database across GWAS.
Using a genomic bin-based density analysis to search for highly associated regions of the genome, positive control loci (e.g., MHC loci) were detected with high sensitivity. Likewise, an analysis of highly repeated SNPs across GWAS identified replicated loci (e.g., APOE, LPL). At the same time we identified novel, highly suggestive loci for a variety of traits that did not meet genome-wide significance thresholds in prior analyses, in some cases with strong support from the primary medical genetics literature (SLC16A7, CSMD1, OAS1), suggesting these genes merit further study. Additional adjustment for linkage disequilibrium within most regions with a high density of GWAS associations did not materially alter our findings. Having a centralized database with standardized gene annotation also allowed us to examine the representation of functional gene categories (gene ontologies) containing one or more associations among top GWAS results. Genes relating to cell adhesion functions were highly over-represented among significant associations (p < 4.6 × 10⁻¹⁴), a finding which was not perturbed by a sensitivity analysis.
We provide access to a full gene-annotated GWAS database which could be used for further querying, analyses or integration with other genomic information. We make a number of general observations. Of reported associated SNPs, 40% lie within the boundaries of a RefSeq gene and 68% are within 60 kb of one, indicating a bias toward gene-centricity in the findings. We found considerable heterogeneity in information available from GWAS suggesting the wider community could benefit from standardization and centralization of results reporting.
The number of genome-wide association studies (GWAS) is growing nearly exponentially, heralding an era of unprecedented discovery. Numerous novel genetic loci underlying disease susceptibility have been discovered using the unbiased GWAS approach, and many of these associations hold up to rigorous standards for replication. Journal editors and scientists are increasingly calling for full disclosure of aggregate research results to accompany publication of GWAS in the form of published appendices or public websites. Under the recently implemented National Institutes of Health data-sharing policy (http://grants.nih.gov/grants/guide/notice-files/NOT-OD-07-088.html), powerful opportunities now exist for the conduct of research using GWAS datasets due to the availability of increasing numbers of participant-level datasets. Analytic and computational approaches that further probe the results of individual studies or combine results from multiple GWAS datasets may strengthen previous conclusions, suggest novel loci or pathways, contribute to more calibrated effect estimates, suggest pleiotropy, refine the localization of association signals, or highlight likely functional variants. A key variable for the capacity to conduct such analyses is the extent of access to full versus selective results as well as the nature and relative standardization of the information content.
While a centralized GWAS database, dbGaP, exists at NCBI, inclusion of data and results is voluntary and many GWAS have chosen not to participate, choosing instead not to release results, or to release results at a journal or independent web site. A review of GWAS associations by the NHGRI has been published that grouped associations in specific disease categories, and a companion data table does provide a centralized resource for accessing some top GWAS results, but at the time of this submission was limited to 334 SNPs with minimal annotation (see http://www.genome.gov/26525384/). The overall objective of this study was to create an open access, centralized database of significant published GWAS results, and to provide basic informatics standardization of these results in the format of the current genome build with updated gene annotations. We furthermore sought to characterize and analyze this initial GWAS database to assess data availability, data quality and annotations across all phenotypes, and to identify key genomic characteristics of GWAS associations and opportunities and obstacles to further analysis of this potentially vast genetic data space. With this objective, we collected and analyzed GWAS results compiled from a series of 118 GWAS studies published through March 1, 2008, all of which tested trait associations with > 50,000 markers. In doing so, we identified genomic characteristics of associated loci in GWAS, facilitated new analyses and highlighted limitations in available data sources (study characteristics of the GWAS included are detailed [see Additional file 1]). Our initial analyses suggest novel candidate regions may be identified for further biological validation and that straightforward density analyses of associations across GWAS may be an effective way of highlighting candidate loci for further targeted analysis. Recent independent analyses have replicated genetic associations for loci suggested by our analysis (see Discussion). 
However, we also found reporting inconsistencies across GWAS and gaps in current reporting, suggesting substantial barriers to future analyses. To encourage further scientific cross-study exploration of published GWAS, we make our database fully available as an online supplement [see Additional file 2].
One hundred eighteen GWAS articles published before March 1, 2008 and their associated supplemental information were collected. The articles were identified through PubMed searches (GWAS, GWA, WGAS, WGA, genome-wide, genomewide, whole genome, all terms +association or +scan), by scanning the citations within each article, and through direct searches of journal websites where GWAS were previously published. Citation information for all included articles and data sources is provided [see Additional file 3]. All GWAS tested > 50,000 SNPs. When available via open web access, additional GWAS data were collected, except where the additional data required an application process. Some papers included results for scans of multiple phenotypes or population groups. Thus, results included here reflect partial aggregate data from more than 400 individual genotype-phenotype GWAS datasets.
For each article, we scanned all available text, tables, figures and accessible supplemental data to extract the most statistically significant phenotype association described in the article per SNP, meeting the following minimum criteria: 1) the SNP had an identifiable ID or verifiable genomic position, 2) a statistical p-value for association was reported, 3) the p-value was less than or equal to 0.001 (allowing for rounding) if the association was from a raw, unadjusted scan, 4) the p-value was less than or equal to 0.05 if the association was derived from replication, fine mapping or re-sequencing efforts, or if it was identified as belonging to a locus or region that was specifically identified as an a priori candidate by the authors. In many cases, due to the large amount of available data or the non-uniform data format, a custom Perl program was written to facilitate the processing of the associations. We did not collect full disclosure association results for scans with density < 200,000 SNPs, even though these full disclosure association results were available in some cases. The primary reason for this analytic decision was that the wide availability of many trait results for lower density scans would result in an extremely large meta-dataset biased toward lower density genotyping results, which have less power to detect true associations. Likewise, the discovery scan p-value threshold was set to p ≤ 0.001 to create a set of significant GWAS results of manageable size in which the representation of significant results from studies that released limited results would not be dwarfed by results from studies that released most, or all, results. Information specific to each GWAS and to each SNP-phenotype association meeting our criteria was collected in a single table [see Additional file 2]. This table represents a large, open access database of GWAS results (also presented as a Microsoft Access database [see Additional file 4]). 
An extended description of each data field and how it was derived is provided [see Additional file 3]. Genome-wide plots of all included associations are shown in Figure 1, and plots of those associations with p-values above the threshold of 5 × 10⁻⁸ are shown in Figure 2.
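The inclusion criteria above can be illustrated with a short sketch. It is written in Python rather than the Perl used for the original extraction, and the record layout (`Association` and its field names) is a hypothetical structure chosen for illustration, not the layout of Additional file 2. The "allowing for rounding" provision is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DISCOVERY_P_MAX = 1e-3  # raw, unadjusted scan results
FOLLOWUP_P_MAX = 0.05   # replication, fine mapping, re-sequencing, a priori candidates


@dataclass
class Association:
    """Hypothetical record for one reported SNP-phenotype association."""
    study: str
    snp_id: Optional[str]                # identifiable SNP ID, if reported
    position: Optional[Tuple[str, int]]  # (chromosome, base position), if reported
    phenotype: str
    p_value: Optional[float]
    followup: bool                       # True for replication or candidate-region results


def meets_criteria(a: Association) -> bool:
    """Apply the four minimum criteria described in the text."""
    if a.snp_id is None and a.position is None:
        return False  # criterion 1: SNP must be identifiable
    if a.p_value is None:
        return False  # criterion 2: a p-value must be reported
    limit = FOLLOWUP_P_MAX if a.followup else DISCOVERY_P_MAX
    return a.p_value <= limit  # criteria 3 and 4


def most_significant_per_snp(records):
    """Keep the most significant qualifying association per SNP per study."""
    best = {}
    for a in records:
        if not meets_criteria(a):
            continue
        key = (a.study, a.snp_id or a.position)
        if key not in best or a.p_value < best[key].p_value:
            best[key] = a
    return list(best.values())
```

Keying on the study and the SNP together mirrors the "most significant mention per SNP per study" rule, so a SNP reported for several phenotypes in one article contributes a single row.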
We independently verified the quality of the extracted GWAS results database in a blinded fashion. Three independent reviewers extracted information from the same 2 GWAS articles in parallel to ensure they applied our inclusion criteria in the same manner. Twelve of the 118 GWAS articles were randomly selected. These articles were assigned 4 each to 3 reviewers. Each reviewer independently generated information from the GWAS article according to the guidelines above and then compared their results to the original extracted results.
Information from the GWAS papers spanned at least 3 human genome Builds and 12 dbSNP builds, resulting in SNP positions and SNPids that have shifted in some cases. Additionally, some papers gave only genome coordinates without SNPids or supplied only commercial chip IDs. Thus, in order to maximize the analysis of available GWAS results in the current genome build context and retrieve current SNP annotation it was necessary to apply multiple strategies to update SNP coordinates and SNPids. For some markers old genome coordinates were translated into current coordinates using the UCSC Genome Browser LiftOver conversion tool in order to discover missing SNPids. When only commercial chip IDs were given these were translated into rsIDs using the most appropriate annotation files from the corresponding company. For some associations we were unable to establish SNP identification based on the information provided in the original report. Although this was only the case for a handful of associations, it does suggest more vigilance is required by journals in order to standardize the reporting of genetic variants (e.g., SNP identifiers and precise genome coordinates).
To facilitate retrieval of current SNP annotation information we wrote a Perl program, GRASP (Genome-wide Retrieval of Annotation for SNPs Program, available upon request). Current coordinates were retrieved from the dbSNP table "b128_SNPChrPosOnRef_36_2". Alias SNP IDs were retrieved from the "RsMerge128Arch" table and used to find current coordinates when necessary. SNPs that mapped to multiple genome locations were noted and further gene annotation was not included for these. The GRASP program integrated UCSC human genome browser annotation tracks for RefSeq genes and UCSC Known genes, yielding standardized annotations for overlapping and nearby genes for all GWAS SNP associations [see Additional file 2].
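As a rough illustration of the alias-resolution step, the sketch below follows merged SNP IDs to their current ID and then looks up a position. It is not the GRASP program itself: the two dictionaries are hypothetical in-memory stand-ins for the dbSNP merge and position tables named above, and the function names are our own.

```python
def resolve_current_id(snp_id, merge_map):
    """Follow merge records (old ID -> newer ID) until a current ID is reached.

    merge_map is a hypothetical dict standing in for a merge-history table.
    The 'seen' set guards against a malformed, circular merge chain.
    """
    seen = set()
    while snp_id in merge_map and snp_id not in seen:
        seen.add(snp_id)
        snp_id = merge_map[snp_id]
    return snp_id


def annotate_position(snp_id, merge_map, positions):
    """Return (current_id, position), with position None if not uniquely mapped.

    positions is a hypothetical dict: current ID -> list of (chromosome, base).
    SNPs mapping to several genome locations are flagged by returning None,
    matching the text's decision not to annotate them further.
    """
    current = resolve_current_id(snp_id, merge_map)
    hits = positions.get(current, [])
    if len(hits) == 1:
        return current, hits[0]
    return current, None
```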
The main GWAS database contains 56,411 unique SNP-phenotype entries [see Additional file 2]. The database represents results from a heterogeneous set of studies with varied amounts and types of data available. Thus, we did not attempt to conduct formal statistical meta-analyses. Rather, our primary aim was to use this database to make observations that either strengthen prior associations or highlight them in a new way (e.g., in relation to additional phenotypes), or are suggestive of regions for future investigations. Using Perl programs, we: 1) enumerated and ranked repeated occurrences of individual SNPs across GWAS studies, 2) split the genome into 100 kb bins and counted SNP-phenotype associations within each bin, and 3) determined the average pairwise LD within each 100 kb bin based on the HapMap CEU data (release #23a). After standardizing the gene annotations for all associations, we applied High-Throughput GOminer analysis software to search for gene ontologies that are statistically over-represented among significant GWAS associations. For this approach, SNP associations directly within genes nominated those genes as positively associated with a given trait or set of traits. GOminer tests for the over-enrichment of gene ontologies in large gene sets, using an FDR approach based on repeated random sampling to account for multiple testing. To test the sensitivity of the gene ontology findings to the inclusion of specific data we ran further analyses without the WTCCC and DGI results, and within specific disease subsets ([see Additional file 3] for a more detailed description of the approach and the disease subsets).
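The bin-counting step (item 2 above) is simple enough to sketch directly. The following Python fragment is an illustration under our own assumptions, not the original Perl code: associations are represented as hypothetical (chromosome, position) pairs, and bins are fixed, non-overlapping 100 kb windows indexed by integer division.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, as described in the text


def bin_counts(associations, bin_size=BIN_SIZE):
    """Count SNP-phenotype associations per (chromosome, bin index).

    Each association is a (chromosome, base_position) pair; one SNP linked
    to several phenotypes should appear once per phenotype.
    """
    counts = Counter()
    for chrom, pos in associations:
        counts[(chrom, pos // bin_size)] += 1
    return counts


def top_bins(counts, n=10):
    """Return the n densest bins as ((chromosome, bin index), count) pairs."""
    return counts.most_common(n)
```

Only bins containing at least one association appear in the counter, so any genome-wide percentile calculation would also need to account for the empty bins.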
Full disclosure genotype-phenotype association results were publicly available for every SNP tested for only a minority of the GWAS scans. In 45% (n = 53) of GWAS articles, fewer than 40 SNP-specific association results were made publicly available, and in many studies results for very few loci and SNPs were disclosed (25 studies reported results for 10 or fewer SNPs). In thirty-one (26%) articles, the authors disclosed the complete set of associations, and in the remaining articles (n = 34), they disclosed only a moderate number of top-ranked associations (defined as ≥ 40 associations for ≥ 4 distinct loci). There was also substantial heterogeneity in the format and type of results data available from GWAS studies. In many studies, information regarding SNP strand, alleles and direction of effect, sample sizes passing quality control for individual SNPs, and genetic model was unavailable, thus precluding or limiting the conduct of formal meta-analyses. Despite the heterogeneity and limited data availability, we extracted a minimal redundancy database of 56,411 statistically significant SNP-phenotype associations across all studies by use of custom computer programming to facilitate further analysis. Briefly, we retained the most significant mention per SNP per study, and included only SNPs with unadjusted genome-wide p-values for association ≤ 1 × 10⁻³, or which were significant in replication or further analysis at p ≤ 0.05. (A full description of the criteria for inclusion is found in the Methods; for full results [see Additional file 2].) We validated the completeness and accuracy of the extracted SNP database by a re-extraction of a random selection of 10% of the studies conducted by a panel of three reviewers. We found no detectable errors with regard to the total number and identity of SNPs that were included in the final dataset.
Currently available GWAS results span 3 builds of the human genome (34–36) and at least 12 builds of dbSNP (118–129). Since SNPids are being modified and merged over time, and the relative positions of SNPs often shift between human genome builds, there are substantial informatics challenges to GWAS meta-data accumulation, analysis and viewing. Using current dbSNP information, including mapping of alias SNPids, we migrated all reported SNPs from GWAS associations in Additional file 2 into the current framework of human genome Build 36.2 positions. Relying on these positions, we then re-annotated all associations with protein-coding gene information (see Methods) and compared current annotations with those originally described in GWAS results reports. In contrast to the original annotations from the GWAS articles and datasets, in which 23.3% of associated SNPs were reported to be in or near genes ([see Additional file 2], column V), when we applied standardized annotation we found 40.0% of associated SNPs are within the transcript boundaries of a RefSeq gene, indicating a relative under-estimation of the proximity to genes of loci in initial GWAS reports. Furthermore, from our database, we found that most top GWAS associations are relatively gene-centric, with 65.7% of the associated SNPs located either in or within 60 kb of a RefSeq gene (Figure 3). Significantly associated SNPs showed a trend toward being more gene-centric than all SNPs present on the arrays used in most studies (seen in the contrast between Figure 3 and Figure 4). We compared studies that employed either Affymetrix only or Illumina only arrays and we found little difference in the proportion of associated SNPs located within genes (Figure 3, Affymetrix: 39.5%, Illumina: 40.8%). 
When we considered associated SNPs in or near (within 60 kb of) a RefSeq gene, there was a modestly increased proportion of gene-centric associations within "Illumina only" studies (Figure 3, Affymetrix: 64.7%, Illumina: 70.6%).
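The gene-centricity classification used here (within a gene, within 60 kb of a gene, or more distant) can be sketched as follows. This is a minimal illustration, assuming a hypothetical list of (chromosome, transcript start, transcript end) tuples in place of the RefSeq annotation track; a linear scan is used for clarity, whereas a genome-scale run would want an interval index.

```python
NEAR_WINDOW = 60_000  # 60 kb flanking window, as used in the text


def distance_to_gene(pos, gene_start, gene_end):
    """Distance in bases from a position to a gene; 0 if inside the transcript."""
    if gene_start <= pos <= gene_end:
        return 0
    return min(abs(pos - gene_start), abs(pos - gene_end))


def classify(chrom, pos, genes, window=NEAR_WINDOW):
    """Label a SNP 'within', 'near' or 'distant' relative to the nearest gene.

    genes is a hypothetical list of (chromosome, start, end) tuples.
    """
    nearest = None
    for g_chrom, start, end in genes:
        if g_chrom != chrom:
            continue
        d = distance_to_gene(pos, start, end)
        if nearest is None or d < nearest:
            nearest = d
    if nearest is None:
        return "distant"  # no annotated gene on this chromosome
    if nearest == 0:
        return "within"
    return "near" if nearest <= window else "distant"
```

The share of SNPs labeled "within", and the combined share labeled "within" or "near", correspond to the two kinds of proportion reported above.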
Using our standardized GWAS results database [see Additional file 2], we conducted a density analysis to find the densest regions of association in the genome, using 100 kb bins across the genome and including all SNP associations regardless of the magnitude of statistical signals. To account for LD, which could confound our analysis by inflating the density of associations in regions of high LD, we ran a parallel analysis where we adjusted for the average pairwise LD (r²) in the same regions for the HapMap CEU samples. Both analyses identified many previously reported, strongly replicated loci within regions of the genome showing the highest density of previous associations (Table 1). In Figure 1, we provide a view of GWAS associations across a diverse set of phenotypes, including many regions of the genome where SNP associations exceed the common genome-wide significance threshold (P < 5 × 10⁻⁸), clearly highlighting both the density and magnitude of association signals at many replicated loci. Figure 2 shows a restricted view of GWAS results from 5 × 10⁻⁸ ≤ P ≤ 1 × 10⁻⁴, making apparent a number of clusters that approach the genome-wide significance threshold (P < 5 × 10⁻⁸). Across the genome, the 99th percentile cutpoint based on density of associations included bins with 13 or more GWAS associations within a span of 100 kb or less. The MHC class II loci contained the densest bins. There are numerous phenotypes associated with the MHC loci, consistent with significant pleiotropy of this region, but there is also evidence for pleiotropy for a previously replicated Alzheimer's disease locus (MAPT, KIAA1267, STH), which displays a signal for Crohn's disease, and a second replicated Alzheimer's locus (DAPK1), which shows evidence for Type II diabetes and related traits across multiple studies.
We considered regions that were not highlighted in the original GWAS articles but that nonetheless reveal a high density of associations in our analysis (Table 2). Although associations were noted in more than two studies for all of these regions, none of the single SNP associations was considered to be significant on a genome-wide level. A dense cluster of significant associations for Crohn's disease and HDL cholesterol is located in the monocarboxylate transporter 2 gene (SLC16A7, also known as MCT2), a ubiquitously expressed transporter that imports and exports lactate and pyruvate. Other clusters of interest include: a complement related factor, CSMD1, mainly for association with HIV-1 viral load; and associations with Crohn's disease and Type I diabetes at OAS1, an enzyme that degrades viral RNA and has previously been associated with Type I diabetes, multiple sclerosis and SARS infection. Accounting for LD in the density analysis changed the rankings of the top regions (Tables 1 and 2), but all of the unadjusted bins remained within the top 5% of all bins for known, replicated regions (Table 1) and within the top 10% of all bins for those presented in Table 2.
After accounting for aliases used for identical SNPs, we counted the frequency of occurrence of individual SNPs among top GWAS associations to search for SNPs associated repeatedly across traits. Among all SNP associations (n = 56,411), 52,554 unique SNPs were observed, and the bulk of these SNPs were associated with a single phenotype once (n = 49,313). Examining the most redundant SNPs across GWAS associations revealed a set of known, replicated loci, validating this approach (Table 3). For example, a single SNP located 3' of APOC1 (rs4420638) was associated 11 times across GWAS, including association with Alzheimer's disease, lipid-related traits and coronary artery disease (CAD). Some replicated loci contained multiple SNPs with repeated associations, which may be due to LD and differences in representation on arrays (APOB, LPL, TCF7L2, SORT1, CELSR2, PSRC1, and FTO in Table 3). By searching for repeated SNP associations, a number of new suggestive loci were also observed, each of which was associated independently with five traits, with none reaching genome-wide significance. These included: GABRG3 (rs968671), which showed association with BMI-related traits and hypertension and is located in a cluster of GABA receptor subunits, notable for the role of GABA signaling in sympathetic vasomotor tone; RAPGEF1 (rs7034356, rs4740294), an exchange factor involved in cell signaling, which was associated with BMI-related traits; PIGU (rs2889849), a subunit of glycosylphosphatidylinositol transamidase, which was associated with lipid- and BMI-related traits; and SPAG16 (rs10498015), a sperm-associated protein, which was associated with height, weight and lipid traits (Table 3).
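The repeated-SNP tally can be expressed compactly. The sketch below is illustrative only: it assumes SNP IDs have already been normalized to their current form, takes hypothetical (snp_id, study, phenotype) triples as input, and counts each distinct study-phenotype pair once per SNP.

```python
from collections import defaultdict


def repeated_snps(associations, min_count=2):
    """Rank SNPs by the number of distinct (study, phenotype) associations.

    associations is an iterable of (snp_id, study, phenotype) triples with
    alias IDs already resolved. Returns (snp_id, count) pairs sorted by
    descending count, then by SNP ID for a stable order.
    """
    by_snp = defaultdict(set)
    for snp_id, study, phenotype in associations:
        by_snp[snp_id].add((study, phenotype))
    ranked = [
        (snp, len(pairs)) for snp, pairs in by_snp.items() if len(pairs) >= min_count
    ]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

Using a set per SNP means a duplicated row in the input does not inflate the count, which matters when the same result is reported in both a main table and a supplement.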
Using our standardized RefSeq gene annotations of GWAS associations, we identified all protein-coding genes containing one or more association among top GWAS results (n = 5,966). We explored whether genes with specific types of biological function are over-represented across significant GWAS results using GOminer, software originally designed for microarray analysis. Genes relating to cell adhesion functions were highly over-represented (P < 4.6 × 10⁻¹⁴) across the meta-dataset, as were genes related to signal transduction (P < 9.7 × 10⁻¹¹), transport activity (P < 1.1 × 10⁻⁹), and protein phosphorylation (P < 2.4 × 10⁻⁷) (Table 4). To test the sensitivity of these findings to the inclusion of specific datasets, we repeated the analysis after removing data from two of the largest data contributors (WTCCC, DGI), leaving a subset of associations within 2,888 genes. In the repeat analysis, the distribution and statistical significance of associations among the top biological function categories were not significantly altered (Table 4). In an analysis stratified by major disease categories ([see Additional file 3] for the specific studies included in each set), we found that genes relating to cell adhesion were significantly over-represented in every disease set and ion transport related genes were significantly over-represented in every disease set except lipid-related traits [see Additional file 5]. Examining significantly associated protein-coding gene categories with FDR < 0.05 in each disease set revealed a positive control for this approach, the "antigen processing and presentation" gene category in the rheumatoid arthritis set (P < 2.4 × 10⁻⁹) [see Additional file 5]. 
A number of other over-represented categories were also concordant with the expected specific disease contexts: "nervous system development" (ALS, Alzheimer's disease, Weight/BMI), "synaptic transmission" (ALS), "metal ion/sodium/calcium transport" (CAD, Hypertension), "phospholipid transport" (Type II Diabetes), and "response to nutrient levels" (Weight/BMI).
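To make the notion of category over-representation concrete, the sketch below computes a one-sided hypergeometric tail probability for a single gene category. This is a generic enrichment test offered as an illustration; it is not necessarily the exact procedure GOminer applies, and it omits the resampling-based FDR step described in the Methods. All argument names are our own.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """Probability of observing k or more category genes among n associated genes.

    N: total genes considered
    K: genes in the category among those N
    n: genes flagged as associated
    k: flagged genes that fall in the category
    Exact integer arithmetic is used, which is fine for small examples but
    slow for genome-scale inputs.
    """
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total
```

A small tail probability indicates that the category holds more associated genes than expected if associated genes were drawn at random from all genes considered.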
In our evaluation of a comprehensive GWAS results database across diverse phenotypes, we confirm the potential benefit of open access to GWAS results data by a series of observations. After re-annotation of all reported results, we determined that more than two-thirds of associations are in or within reasonably limited physical and genetic distance from a protein-coding gene, with a significant minority of associations more distant from a protein-coding gene. While intentionally hypothesis generating, the results of our analyses (Tables 1, 2, 3, 4, and [see Additional file 5]) suggest there are a number of novel associated loci, pleiotropic effects of known loci, and newly emphasized functional gene categories in human diseases. Using standardized gene annotations of top GWAS associations, we further undertook an ontology-based functional analysis, revealing a striking over-representation of cell adhesion-related genes implicated in GWAS studies encompassing a diversity of diseases (P < 4.6 × 10⁻¹⁴ for all diseases). We make the compiled results fully available in supplemental files, [see Additional file 2] or [see Additional file 4], and also provide input files that can be used to visualize all associations included here, or from specific studies, using UCSC Genome Graphs [see Additional file 6].
Using a straightforward bin clustering analysis of all GWAS results we identified known, replicated loci, but also observed high density clustering of associations in gene regions that were not previously highlighted in the primary GWAS studies, but displayed significance in two or more GWAS (Table 2). The densest cluster of such associations was observed for Crohn's disease and HDL cholesterol in the 3' region of a monocarboxylate transporter, SLC16A7, also known as MCT2. Notably, a related monocarboxylate transporter, MCT1, was shown to be decreased in expression in the inflamed colonic mucosa of patients with ulcerative colitis and Crohn's disease relative to controls. The next densest cluster was primarily associated with HIV-1 viral set point in CSMD1, a gene which encodes a soluble protein that can block the classical complement activation pathway. This is of particular interest since a characteristic of HIV-1 infection and persistence is the active evasion of the host humoral response, a key component of which is complement activation.
The preceding examples, and others in Tables 2 and 3 (RAPGEF1, PIGU, SPAG16, PFKFB3, COL4A1/2, A2BP1), suggest novel candidate genetic loci that require further replication, but we also noted GWAS associations of interest in at least one locus with previous evidence for association. A gene encoding 2',5' oligoadenylate synthetase 1 (OAS1) is stimulated by interferon, plays an important role in innate immunity and was previously shown to be genetically associated with Type I Diabetes, multiple sclerosis, SARS and hepatitis C persistent infection. Here we report signals in GWAS results for both Type I Diabetes and Crohn's disease which, given prior associations, suggests this locus may harbor at least one functional allele that impacts a range of immune-related etiologies. Arguably, this example may demonstrate that previous candidate gene centered associations can be replicated via in silico analysis of GWAS results. During the review of this article, published and unpublished studies came to our attention which provide some additional validation for the results we present. We noted in Table 3 the highly repeated association of SNPs in genes including RAPGEF1 and PIGU across multiple GWAS and suggested these as potential novel candidate genes for further study and replication. Recently, a genome scan for melanoma reported that the most significant association, which was replicated, was found in PIGU (p < 1.0 × 10⁻¹⁵). The genome-wide significant SNP from Brown and colleagues is in significant LD with the SNP present in Table 3 (D' 1.0, r² 0.57). This genomic region (20q11.22) also ranked relatively high in our bin-based analysis (density rank = 78, LD-adjusted rank = 1,338) as a previously unreplicated region that contained a high density of GWAS associations for diverse diseases. 
In an analysis of 222 candidate genes for association with diabetes and related traits, extending previously published GWAS analyses, Gaulton and colleagues report a RAPGEF1 SNP (rs4740283) as the most statistically significant associated SNP with Type II Diabetes among all SNPs and genes they analyzed. This SNP is nearby and in complete LD (D' 1.0, r² 1.0) with a RAPGEF1 SNP, rs7034356, that we report here in Table 3. These newly reported and replicated results for PIGU and RAPGEF1, as well as some as yet unpublished but replicated GWAS results for other genes we highlight, strongly suggest that the availability and analysis of GWAS results across diverse traits may be useful in predicting and supporting functional loci for further biological study.
Creation of a standardized results database allowed us to conduct a functional gene category analysis. The over-representation of cell adhesion genes was strongest among weight- and BMI-associated traits (P < 7.1 × 10⁻²⁰). This expands on an earlier report on the over-representation of cell adhesion genes in significant addiction-related GWAS results. The finding was not sensitive to the inclusion of data from specific studies, suggesting either a broad impact of genetic variability in cell adhesion genes on diverse disease etiologies or a systematic bias toward these genes on commercial genotyping arrays. A previous analysis of relative ontology representation of SNPs on major commercial genotyping arrays indicated that genes relating to biological adhesion account for relatively few arrayed genes (~2%). Current evidence does support roles for cell adhesion molecules in a number of major diseases, and notably an ontology-based analysis of the Phase II data from the HapMap project indicated that cell adhesion genes are among the gene groups with the most evidence for recombination in recent human history, suggesting potential selective pressures on this group. It is notable that expected gene ontologies were over-represented for specific disease categories (e.g., antigen processing and presentation in rheumatoid arthritis, CNS development and synaptic transmission in ALS, metal ion/sodium/calcium transport in CAD). This finding may be consistent with the hypothesis that multiple loci in related physiological pathways and processes, each with a relatively small magnitude of effect, may make a significant aggregate contribution to genetic risk of complex diseases.
Consistent, widespread standards of reporting and annotation of full disclosure results may facilitate hypothesis-generation and extend discovery that is already occurring from GWAS and their follow-up studies. While GWAS have resulted in the discovery of new and strongly replicated genetic associations relevant to human disease, there continues to be a substantial challenge to discovering meaningful genotype-phenotype associations among a surfeit of data. The typical staged approach to GWAS discovery consists of ranking statistical associations and replication testing in large follow-up sample cohorts; however, some "true" associations are found positioned relatively low on the initial p-value ranked list. A recent follow-up meta-analysis across Type II Diabetes GWAS resulted in the identification and replication of additional loci that did not meet genome-wide significance thresholds in any primary GWAS analysis, highlighting the benefit of combining GWAS results from multiple studies. Other studies following initial GWAS data releases have employed pathway-based analyses, multilocus association testing and in silico comparisons across multiple GWAS for related phenotypes, for example to find SNPs associated with both LDL cholesterol and CAD. As more data become available, further analyses become feasible, including the possibility of using Bayesian inference to weight SNPs with a priori evidence for association for use in the analysis of new trait scans [3,26,27]. Weighting of SNPs could be conducted based on a variety of parameters including a priori linkage, or functional evidence such as prior gene expression GWAS. Our results (Figure 3) suggest that weighting schemes incorporating gene centricity and tagging of gene regions may be relevant, as previously demonstrated.
A growing number of GWAS investigative teams, including the Diabetes Genetic Initiative, the Wellcome Trust Case Control Consortium and the Framingham Heart Study, are leading efforts for the early and widespread dissemination of aggregate results from GWAS to enable further scientific research. Informatics initiatives including the National Center for Biotechnology Information's (NCBI) database of Genotype and Phenotype (dbGaP) have a core goal of systematically making available GWAS participant-level data and aggregate results for future analysis. However, our analysis suggests that the extent and quality of further analyses of GWAS results will largely depend upon the extent of SNP results to which researchers have access and the quality of data annotation. We found that a substantial portion of GWAS results are currently unavailable even through an application process, and further that available results are largely presented in a non-uniform manner among disparate databases and web clients, and often lack even the most basic gene and SNP annotation. Shifts in SNP genome positions and SNP IDs over time, and unavailable full SNP lists for some platforms and custom arrays, also hamper attempts to harmonize results from different studies or genotyping platforms. Further complicating the move to widely distribute aggregate results is the report that the identity of individual research participants may be revealed from large numbers of aggregate genotype-phenotype research results. Estimation methods have been reported, using simple allele frequencies or genotype counts, which make it possible to accurately determine whether specific individuals with known high-density SNP profiles are participants in a complex genomic DNA mixture, such as the case or control groups from publicly available aggregate datasets.
In response to this report, aggregate genotype data for GWAS on dbGaP and other GWAS portals have been removed from public access and made available through controlled access processes that require the user to receive approval from a data access committee. Taken together, these substantial obstacles to further analysis suggest a need to establish and adopt standards for GWAS reporting.
A previous working group paper suggested criteria for establishing and evaluating GWAS reports and replication, and that report highlights the types of information that would be central to a GWAS data standard. The centralization of GWAS results in a standardized repository, containing information similar to that presented in the database here and periodically updated from the literature, could provide a platform for further analysis by the research community with many potential benefits, including integration with other informatics resources and the ability to iteratively access, search and conduct additional analyses as new scan data become available. The establishment of GWAS reporting standards is beyond the scope of this article and requires a dialogue throughout the community. The adoption of MIAME standards for microarray gene expression studies has enabled substantial advances in that field and more systematic bioinformatics analysis of results. In an ideal scenario, journals would require authors to make a submission that meets or exceeds a GWAS reporting standard before accepting a paper for publication (Table 5). While the disclosure of genotype results, even when appropriately de-identified and subject to other research protections, raises potential dilemmas, ethical and otherwise, the disclosure of association p-values and basic experimental and SNP annotation information may be less problematic. To protect the interests of invested researchers who may have ongoing projects following an initial GWAS analysis, we suggest that any minimal standard allow for a lag period before the disclosure of full association results.
We provide a comprehensive open access database of available GWAS results, along with general observations and first analyses. We observed substantial heterogeneity in the amount and type of information currently reported in GWAS articles. After substantial data collection and informatics integration efforts, our first pass analysis across GWAS indicates there may be substantial benefits to centralizing and opening access to GWAS results. We found support for potential pleiotropy of known, replicated loci, as well as the suggestion of new, interesting candidate genes and functional categories that require further validation and study. The creation of an open access resource for GWAS results should encourage and facilitate new genetic and genomic analysis, and provides a potential resource for easier participation in results sharing among interested researchers.
The authors declare that they have no competing interests.
The project was designed and implemented by ADJ, with significant input from CO. Both ADJ and CO wrote and edited the manuscript.
The pre-publication history for this paper can be accessed here:
Summary information on the 118 GWAS studies included in this study. Information on GWAS including genotyping arrays, phenotype descriptions, discovery and replication samples, analytic strategies, data availability, URLs, publication date and contact information. Data fields are described in detail in Additional file 3.
56,411 GWAS genotype-phenotype associations and annotation. The database of significant GWAS associations and additional gene and SNP annotations used in this paper. Data fields are described in detail in Additional file 3.
Supplemental text. Supplemental text for the paper providing detailed descriptions of how data fields were ascertained for Additional files 1, 2 and 4, as well as a description of gene ontology analysis, full citation information for 118 GWAS and identification of studies included in disease groups for ontology analysis presented in Additional file 5.
Microsoft Access 2007 database of 56,411 GWAS genotype-phenotype associations and annotation. The database of significant GWAS associations and additional gene and SNP annotations used in this paper. Data fields are similar to Additional file 2 and are described in detail in Additional file 3.
GOminer gene ontology analysis results for GWAS in disease sub-categories. GO categories significantly enriched among significant disease groupings of GWAS results. Studies included in disease groups are identified in Additional file 3.
Formatted files for more than 400 GWAS analyses that can be used to upload and browse results in UCSC Genome Browser using Genome Graphs. This archive file contains Genome Graph files for all GWAS associations contained in Additional files 2 and 4. The files within the archive can be used to visualize GWAS associations described here using UCSC Genome Graphs http://genome.ucsc.edu/cgi-bin/hgGenome at regional, chromosomal and whole genome levels. A file "README.txt" describes file naming conventions. The file "JohnsonODonnell_ALLgwas_graph.txt" contains a single Genome Graph file containing all associations.
We would like to acknowledge the many thousands of international, anonymous participants of the GWAS included here for their contribution to human genetics. We thank Wolfgang Lieb, MD and Raghava Velagaleti, MD for their assistance as QC reviewers of the dataset. ADJ was supported by an intramural NHLBI IRTA fellowship award. This project was supported in part by the NHLBI's Framingham Heart Study (NIH/NHLBI Contract N01-HC-25195). We thank Teri Manolio, MD PhD for her comments on the manuscript.