

Pharmacogenomics. Author manuscript; available in PMC 2009 December 1.
Published in final edited form as:
PMCID: PMC2714942

Collaborative genome-wide association studies of diverse diseases: programs of the NHGRI’s office of population genomics


In the past 3 years, genome-wide association (GWA) studies have revolutionized the discovery of genetic variants associated with complex diseases. These studies present unique challenges in their conduct, particularly in the need for meticulous quality control of genotyping and for sample sizes large enough to withstand the severe penalty for multiple comparisons necessitated by testing hundreds of thousands of SNPs. They also present unique opportunities in the unprecedented detail with which they characterize an individual’s genome and the potential for relating that information to any trait consistent with that person’s informed consent. Such data exceed the abilities of any single group of investigators to mine them fully and by NIH policy are distributed to qualified investigators agreeing to specified terms of use. This report describes collaborative programs of the National Human Genome Research Institute’s Office of Population Genomics for facilitating collection, analysis, interpretation, and dissemination of these data so that their research value can be maximized.

Keywords: collaboration, DNA, genome-wide association studies, gene, NIH

Genome-wide association (GWA) studies have arguably transformed the science of discovering genetic variants that influence common, complex diseases [1]. Prior to the advent of the GWA approach in 2005, only a handful of robustly replicated associations such as CARD15 in Crohn’s disease [2], PPARG and KCNJ11 in diabetes [3], and APOE in Alzheimer’s disease [4] had been identified, while the vast majority of putative candidate gene–disease associations had failed to replicate [5]. This situation has changed dramatically in the past 3 years, with over 230 published studies reporting over 400 unique SNPs associated with over 75 common diseases and traits [6,101].

Most of these investigations have been conducted as isolated, single studies of a specific disease or trait, typically using a case–control design [6]. Notable collaborations such as those in diabetes [7–10] and height [11,12] have brought together diverse groups of investigators and samples for study of the same or similar traits, but only one consortium to date has published results of GWA studies conducted in multiple diseases under a shared genotyping and analytic framework [13]. These pioneering efforts of the Wellcome Trust Case Control Consortium (WTCCC) in seven diverse diseases and a common set of controls provided a highly successful model for the conduct of collaborative GWA studies, one that the Wellcome Trust recently extended to an additional 27 conditions in over 120,000 additional people [14].

Despite the numerous logistical and organizational challenges in coordinating multiple GWA studies in a single consortium, the WTCCC contributed to many important methodologic advances, including demonstration of the robustness of a single control group, the value of using cases of some diseases as controls for others, the greater power provided by increased sample size (numbers of subjects) rather than increased genomic coverage (numbers of SNPs), the importance of manual review of automated genotyping calls, and the feasibility of broad distribution of individual-level data to investigators agreeing to careful controls on their use.
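The trade-off between sample size and number of SNPs can be illustrated with a rough calculation. The sketch below is not the WTCCC analysis. It assumes a Bonferroni-corrected two-sided z-test, a hypothetical standardized per-subject effect, and a test statistic whose mean grows with the square root of the number of subjects. It shows only the multiple-comparison penalty of testing more SNPs, not any gain in coverage of causal variants. All function names and numbers are illustrative assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def bonferroni_alpha(n_snps, family_alpha=0.05):
    """Per-SNP significance level after Bonferroni correction."""
    return family_alpha / n_snps


def approx_power(n_subjects, n_snps, effect=0.05, family_alpha=0.05):
    """Approximate power of a two-sided z-test for one truly associated SNP.

    `effect` is a hypothetical standardized effect per subject; the mean of
    the test statistic is assumed to be effect * sqrt(n_subjects).
    """
    alpha = bonferroni_alpha(n_snps, family_alpha)
    # Critical value; computed from the lower tail for numerical precision.
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2)
    ncp = effect * n_subjects ** 0.5
    # Probability that the statistic falls beyond either critical value.
    return _STD_NORMAL.cdf(ncp - z_crit) + _STD_NORMAL.cdf(-ncp - z_crit)


if __name__ == "__main__":
    scenarios = [(5_000, 500_000), (5_000, 1_000_000), (10_000, 500_000)]
    for n_subjects, n_snps in scenarios:
        power = approx_power(n_subjects, n_snps)
        print(f"subjects={n_subjects:>6} SNPs={n_snps:>9} power={power:.3f}")
```

Under these assumptions, doubling the number of subjects raises power far more than doubling the number of SNPs tested lowers it, which is consistent with the direction of the WTCCC observation but is not a substitute for a design-specific power analysis.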

Building upon this foundation, the National Human Genome Research Institute (NHGRI) and other Institutes of the NIH have initiated GWA collaborative studies using the case–control design, as well as more complex designs including cohort studies, randomized clinical trials, and biorepositories with phenotypes defined by electronic medical records. In addition, NHGRI’s Office of Population Genomics, under its mandate to promote the application of genomic technologies to population-based studies [102], has also led efforts to develop standard phenotype and exposure measures for GWA studies, characterize putative causal variants identified by GWA in well-characterized cohorts, and encourage fine mapping, sequencing, functional studies and pilot translational studies through the Genes, Environment, and Health Initiative (GEI) of the Department of Health and Human Services [103]. This special report describes these efforts, the gaps in scientific knowledge and methods they were designed to address, the challenges they have encountered, and the opportunities they present for extending the promise of GWA research.

Collaborative GWA studies

Genetic Association Information Network

The Genetic Association Information Network (GAIN) was initiated in late 2005 as a public/private partnership to investigate the genetic basis of common diseases through a series of collaborative GWA studies. Six studies involving a total of 18,000 DNA samples were selected on the basis of scientific merit, potential for genome-wide genotyping to provide valuable new insights, and public health significance of the traits proposed for study [15]. Because of the desire for transparency in public/private partnerships involving NIH, GAIN design and implementation were directed by a series of guiding principles that have become useful foundations for other GWA studies (Box 1). Foremost among these principles was the commitment to release the resulting data as broadly and rapidly as possible, with equal opportunity for access by all users who agree to protect the confidentiality of study participants and to respect the intellectual investment of investigators contributing data and samples to GAIN.

Box 1. Guiding principles of GAIN

GAIN will use the most rigorous scientific approaches and maintain the highest ethical standards as guided by the following principles:

  • The greatest public benefit will be achieved if GAIN results are made immediately available for research use by any interested and qualified investigator or organization, within the limits of providing appropriate protection of research participants.
  • Discovery of genetic variants related to health and disease and their translation into effective diagnostic, therapeutic and preventive strategies should be expedited.
  • The best available human studies of diseases and traits, chosen to achieve programmatic balance among diseases, should be used for this discovery process.
  • Relevance and applicability to all population subgroups and segments of society of the findings supported or enabled by GAIN should be ensured.
  • Investigators granted access to GAIN data should ensure confidentiality of study participants and follow any limitations specified by their informed consent.
  • Intellectual contributions and efforts of investigators submitting samples should be appropriately recognized by any user of GAIN data, consistent with the principles that guide the use of other community resource projects within the genomics field.
  • Access to GAIN data should be made available to GAIN partners, contributing investigators, and other users at the same time and through the same access approval mechanisms.

Adapted from [14].

GAIN: Genetic Association Information Network.

At the time GAIN was initiated, there was no NIH database capable of receiving and disseminating the genotype–phenotype data, nor were there NIH policies on access and proper use of this information. The commitment to make GWA genotyping data from GAIN widely available, together with 100K and 500K GWA scans from the Framingham Heart Study [16], stimulated the National Center for Biotechnology Information to develop the Database of Genotypes and Phenotypes (dbGaP) [17]. This database, which also functions as a site for investigators to request access, NIH staff to review these requests, and approved users to download data securely, was up and running by late 2006. Implementation of data access mechanisms for GAIN, Framingham, and other GWA studies led the NIH to develop its ‘Policy for Sharing of Data Obtained in NIH Supported or Conducted GWA Studies’ after a lengthy period of public comment [104]. The policy was finalized and implemented in late 2007 and applies to all applications for funding of GWA studies submitted to NIH after January 2008.

Collaborative activities within GAIN have focused on genotyping quality control and data analysis, particularly in relation to controlling for population stratification and sources of genotyping error. The potential for joint analysis of cases and controls has been limited by restrictions on data use in the consent forms from the original studies, which in many cases did not permit their use for phenotypes unrelated to the primary purpose for which the samples were collected. GAIN investigators and their collaborators are actively engaged in comparing methods for imputation of untyped SNPs and improved calling algorithms, and in replicating their initial results in additional samples and pooling their findings with other GWA studies. Involvement of the scientific community has been encouraged through a series of annual analysis workshops at which initial results and analytic challenges have been presented in depth [105].

Gene–environment association studies

The Genetic Association Information Network has provided a valuable foundation on which other collaborative GWA studies were built. Chief among these has been the Gene–Environment Association Studies (GENEVA) component of the GEI, which began with eight GWA studies and recently expanded to a total of 14 studies [106]. GENEVA added a coordinating center to the original GAIN model, allowing for a more standardized approach to data submission and data cleaning, with guidelines and procedures for data management under development for dissemination to the scientific community. GENEVA in turn provides an important foundation for other components of the GEI genetics program, including efforts in replication and fine mapping of GWA findings, functional studies, pilot translational efforts, and development of analytic methods. Integration of GWA studies with improved technologies for measuring environmental exposures (behavioral and lifestyle factors, toxins, pollutants and so on) being developed in GEI’s companion exposure biology program is anticipated once those measures become available. Components of GEI and the solicitations that provide their support are listed in Table 1.

Table 1
Components of the GEI that build on GWA studies

GWA studies in biorepositories

The Electronic Medical Records and Genomics (eMERGE) network is a five-member consortium formed to develop, disseminate and apply approaches to research that combine DNA biorepositories with electronic medical record (EMR) systems for large-scale, high-throughput genomic research [107]. EMRs provide great potential for defining phenotypes and exposures in a cost- and time-efficient manner, and their integration into healthcare delivery presents important opportunities for translating genomic findings into improved care. eMERGE will define the extent and complexity of data extraction required to obtain clinical information needed for genome-wide research, and the compatibility of existing EMR formats with each other, with various standardized language systems, and with databases such as dbGaP. Concerns related to widespread data sharing for research use outside a given healthcare system will also be assessed. eMERGE will investigate the acceptability of such research to biorepository participants, their clinicians, biorepository investigators, and biorepository Institutional Review Boards as well as participants’ concerns and preferences regarding return of individual results of genetic studies.

Electronic Medical Records and Genomics will also compare various approaches for conducting genome-wide research in biorepositories so as to maximize the scientific value of the biorepository for identifying genes related to complex diseases. For proposed genome-wide studies deemed scientifically advantageous, it will investigate genetic variants associated with specific complex diseases and traits derived from EMR data. In accordance with NIH data-sharing policies, resulting GWA data will be made rapidly available for research use through dbGaP. By working collaboratively to share experience and expertise across participating biorepositories, the network will develop guidelines for dissemination to other biorepositories on consultation and consent for such research, and on collecting, documenting and depositing phenotype and exposure data to databases such as dbGaP.

GWA studies of treatment response

A collaborative program currently in planning [108] will extend GWA studies to randomized clinical trials, with the intent of identifying genetic variants associated with response to treatments for conditions of clinical or public health significance. This program is needed because identification of variants related to treatment response to date has largely relied on candidate gene studies, much as did research in the genetics of complex diseases before the advent of GWA studies. Early pharmacogenomic studies have also placed an understandable emphasis on severe adverse drug reactions, which are generally too uncommon to permit accrual of the large series of cases needed for GWA studies. By contrast, clinical improvement (or lack thereof) in response to treatment is a frequently occurring outcome in clinical trials, so that large numbers of responders and nonresponders can readily be collected. When treatment response can be assessed as a continuous trait, the power of GWA studies is likely to be even greater.

Leveraging existing clinical trial resources for genome-wide research also provides opportunities to explore the implications of generating genetic information in the context of delivering a controlled intervention. Analysis and release of genomic data in such settings, for example, may be complicated by the need for masking of treatment assignment until the trial is completed. Incorporation of point-of-care models, where prevention or treatment strategies are selected based on a patient’s genotype, has not been widely explored. Empirical data are needed on the expectations of and reactions from study participants regarding such approaches, and the impact of including such studies on participation rates in clinical trials. Collaborating investigators will share expertise and experience across studies to develop best practices for incorporating genome-wide studies in clinical trials. The resulting data and best practices will be widely shared as research tools with the scientific community.

Building upon & extending GWA research

In addition to the GEI-supported programs described above that build directly upon GWA studies, and dbGaP that grew directly out of them, NHGRI’s Office of Population Genomics has also established a data resource describing published GWA studies in its ‘Catalog of Published Genome-Wide Association Studies’ [101]. It provides characteristics such as author, trait, sample size and genotyping platform on all GWA studies that attempt to assay at least 100K SNPs; studies are identified through PubMed literature searches and reports from the media. SNP-trait associations significant at p < 10⁻⁶ and not previously reported are listed with characteristics such as rs number, genomic region, risk allele and its frequency, odds ratios and p-values. The site is interactive and can be searched by journal, first author, disease/trait and other characteristics, and a downloadable database is available in Excel format.
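The catalog's inclusion rule for SNP-trait associations can be sketched in a few lines. The example below is a minimal illustration only: it does not read the actual catalog file, the record fields and function names are hypothetical, and the three records are made-up placeholders rather than published findings. The only detail taken from the text is the threshold of p < 10⁻⁶, applied here as a strict inequality.

```python
from dataclasses import dataclass

P_THRESHOLD = 1e-6  # threshold stated in the text; strict "<" is assumed


@dataclass(frozen=True)
class Association:
    """One SNP-trait association (hypothetical, simplified record layout)."""
    trait: str
    rs_number: str
    risk_allele: str
    odds_ratio: float
    p_value: float


def catalog_eligible(records, threshold=P_THRESHOLD):
    """Keep only associations with p-value strictly below the threshold."""
    return [r for r in records if r.p_value < threshold]


def rs_numbers_by_trait(records):
    """Group rs numbers by trait, preserving input order."""
    grouped = {}
    for r in records:
        grouped.setdefault(r.trait, []).append(r.rs_number)
    return grouped


# Placeholder records for illustration; these are not real catalog entries.
EXAMPLE_RECORDS = [
    Association("trait A", "rsEXAMPLE1", "A", 1.20, 3e-9),
    Association("trait A", "rsEXAMPLE2", "G", 1.10, 1e-6),  # not below threshold
    Association("trait B", "rsEXAMPLE3", "T", 1.35, 4e-8),
]

if __name__ == "__main__":
    kept = catalog_eligible(EXAMPLE_RECORDS)
    print(rs_numbers_by_trait(kept))
```

The same filter could be applied to rows loaded from the downloadable spreadsheet once its actual column names are known.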

The advantages of pooling GWA studies to increase sample size and diversity have been widely recognized as described above, but can be impeded by lack of standard measures of phenotypes and exposures across studies. This is being addressed by the PhenX project on consensus measures for phenotypes and exposures [109], which is developing a toolkit of roughly 15 standardized measures in each of 20 domains (for a total of roughly 300 measures) related to complex diseases and environmental exposures. Once an individual’s genome has been comprehensively characterized by GWA genotyping, it can potentially be related to any trait, not just the primary trait initially proposed for the GWA study, as long as these additional uses are consistent with the informed consent provided by study participants. The value of such an approach is demonstrated by the 18 GWA publications arising from the Framingham Study [16] and from an array of metabolic traits in the Northern Finnish Birth Cohort [18]. Although cross-study analyses have been conducted for secondary phenotypes that are readily standardized, such as height, BMI and serum lipids [6], few GWA studies published to date have included phenotypic and exposure data on a sufficiently wide variety of traits and diseases to enable such additional analysis, particularly across multiple studies. Instead, most current GWA studies utilize a case–control or family design focusing on a single disease or group of related traits. More importantly, the potential for cross-study comparisons is restricted by the lack of standardized or comparable phenotypic and environmental measures, despite the many risk factors, such as smoking, dietary intake and low socioeconomic status, which are common to multiple diseases. PhenX will identify readily standardized and implemented phenotype and exposure measures for use in GWA studies. 
Success in this effort will facilitate more efficient use of GWA data to understand genetic influences on incidence of, and morbidity and mortality from, common diseases; on trait variation; and on responses to environmental exposures, including drugs or other therapies and lifestyle factors.

Finally, it is essential to recognize that large-scale GWA studies are a crucial first step in the identification of genetic variants related to complex diseases, but they are only a first step. Robustly replicated findings from GWA studies must be investigated in free-living human populations for their potential functional and public health implications, but once such potentially causal variants are isolated, investigation often shifts away from human populations to the laboratory. Much remains to be learned, however, from well-characterized human population samples in which potentially causative variants have been, or could be, assayed. The Population Architecture using Genomics and Epidemiology (PAGE) program is designed to investigate, in four collaborating, well-characterized population studies, genetic variants identified as potentially causally associated with complex diseases in GWA and other genetic studies. The resulting population-based descriptive and association data will be widely shared through dbGaP and other user-friendly informatic systems to accelerate the understanding of genes related to complex diseases. PAGE forms a natural complement to past and ongoing GWA studies, which tend to examine hundreds of thousands of variants in relation to a handful of phenotypes, while this program will relate a small number of strongly-associated variants to a wide variety of traits in well-characterized cohort studies. It will define the ‘epidemiologic architecture’ of these putative causal variants – their population prevalence; prevalence in race/ethnic subgroups; relative risk of rigorously-defined, incident disease; consistency of association across subgroups defined by age, sex, race/ethnicity or exposures; and potential modifiability of associated risk. This information will help to determine the health implications of a given variant and the priority it should receive for identifying and testing interventions to reduce its associated risk.
This information may also be quite valuable in exploring gene function, since the epidemiologic approach of genetic investigation, starting from observed phenotypic characteristics and moving more proximally to gene pathways and sequence variants, complements well the laboratory approach of moving from DNA sequence to function to phenotype. The collaborative nature of the program will facilitate the development and dissemination of efficient approaches for epidemiologic characterization of putative causal variants in these and other cohorts. Detection and interpretation of important differences between these data and initial GWA discovery reports will help inform the design of future GWA studies.

Future perspective

Genome-wide association studies have already identified a large number of loci reproducibly associated with complex diseases, but in general the risk associated with these variants is small, suggesting that the true causal variants have yet to be found or that multiple other factors, including other variants and environmental exposures, combine to produce inherited disease risk. Of particular importance may be environmental factors known to be related to multiple common diseases, such as smoking, alcohol use, dietary intake, physical activity and other behavioral factors. Determining the true functional consequences of these variants is likely to be challenging but will be critical to developing interventions to modify disease risk. Defining the role of structural variation, such as insertions, deletions, duplications and inversions that are poorly captured by SNPs, will also be a major challenge in the coming years but may hold promise for identifying other loci related to disease. High-throughput, low-cost sequencing technologies will permit identification of rarer sequence variants that may be causative or additive in the disease associations identified in GWA studies. Population-based studies will be needed to determine the population prevalence and risk associated with putative causal variants in unbiased and diverse population samples, and to estimate the increment in risk over established risk factors provided by GWA-defined variants.

Executive summary

  • Genome-wide association (GWA) studies have revolutionized the discovery of genetic variants associated with complex diseases.
  • The Office of Population Genomics of the National Human Genome Research Institute was established to promote the application of genomic technologies such as GWA genotyping to population studies including case–control and cohort studies, randomized clinical trials and biorepositories with phenotypes defined by electronic medical records.

Collaborative GWA studies

  • GWA studies are challenging to conduct properly and benefit from collaborative approaches.
  • The large sample sizes required for detection of small effects and replication of findings promote collaboration across studies with similar phenotypes.
  • Complexities in quality control of genotyping, control for multiple comparisons, and imputation of untyped SNPs encourage collaboration even among studies with diverse phenotypes.
  • The richness of GWA datasets and the cost involved in producing them have led the NIH to develop the Database of Genotypes and Phenotypes (dbGaP) for deposition of these data, and the ‘Policy for Sharing of Data Obtained in NIH Supported or Conducted GWA Studies’ governing access and distribution of genotype–phenotype data to qualified users.

Building upon & extending GWA research

  • The unprecedented detail with which GWA studies characterize an individual’s genome provides unique opportunities for relating GWA data to any trait consistent with that person’s informed consent.
  • GWA-related efforts of the Office of Population Genomics focus on developing standard phenotype and exposure measures for GWA studies, characterizing putative causal variants identified by GWA in well-characterized cohorts, and encouraging fine mapping, sequencing, functional studies and pilot translational studies through the Genes, Environment, and Health Initiative.
  • These collaborative programs are designed to share expertise and experience across studies to develop best practices for incorporating genome-wide studies in population studies, with the resulting best practices to be widely shared as research tools with the scientific community.


Financial & competing interests disclosure

The author is an employee of the US Government. The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.


Papers of special note have been highlighted as:

• of interest

•• of considerable interest

1. Hunter DJ, Kraft P. Drinking from the fire hose – statistical issues in genomewide association studies. N. Engl. J. Med. 2007;357:436–439. Review of analytic challenges in genome-wide association (GWA) studies, particularly the problem of multiple comparisons and correction for potential inflated type I error.
2. Hugot JP, Chamaillard M, Zouali H, et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease. Nature. 2001;411(6837):599–603.
3. McCarthy MI. Progress in defining the molecular basis of Type 2 diabetes mellitus through susceptibility-gene identification. Hum. Mol. Genet. 2004;13(Spec No 1):R33–R41.
4. Corder EH, Saunders AM, Strittmatter WJ, et al. Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer’s disease in late onset families. Science. 1993;261:921–923.
5. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet. Med. 2002;4:45–61. Classic review of candidate gene association studies showing only six of 600 reported associations replicated consistently.
6. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. J. Clin. Invest. 2008;118:1590–1605.
7. Scott LJ, Mohlke KL, Bonnycastle LL, et al. A genome-wide association study of Type 2 diabetes in Finns detects multiple susceptibility variants. Science. 2007;316:1341–1345.
8. Saxena R, Voight BF, Lyssenko V, et al. Genome-wide association analysis identifies loci for Type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336.
9. Zeggini E, Weedon MN, Lindgren CM, et al. Replication of genome-wide association signals in UK samples reveals risk loci for Type 2 diabetes. Science. 2007;316:1336–1341.
10. Zeggini E, Scott LJ, Saxena R, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for Type 2 diabetes. Nat. Genet. 2008;40:638–645.
11. Weedon MN, Lango H, Lindgren CM, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 2008;40:575–583.
12. Sanna S, Jackson AU, Nagaraja R, et al. Common variants in the GDF5-UQCC region are associated with variation in human height. Nat. Genet. 2008;40:198–203.
13. Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. Seminal GWA study report with extensive methodologic explanations and illustrations of the effect of genotyping error and population stratification, plus systematic evaluation of available information on every identified.
14. Donnelly P. Progress and challenges in genome-wide association studies in humans. Nature. 2008;456(7223):728–731.
15. GAIN Collaborative Research Group. Manolio TA, Rodriguez LL, et al. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat. Genet. 2007;39:1045–1051. Description of the Genetic Association Information Network (GAIN) collaborative network, plans and policies for data sharing and data access, and initial genotyping quality control metrics.
16. Cupples LA, Arruda HT, Benjamin EJ, et al. The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Med. Genet. 2007;8 Suppl. 1:S1.
17. Mailman MD, Feolo M, Jin Y, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 2007;39:1181–1186. Description of the design and implementation of Database of Genotypes and Phenotypes (dbGaP) including data flow, policies for submitting and requesting data, and plans for further development.
18. Sabatti C, Service SK, Hartikainen AL, et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 2008;41(1):35–46.


101. Junkins HA, Hindorff LA, Manolio TA. National Human Genome Research Institute. A catalog of published genome-wide association studies. [Accessed 12/14/08].
102. National Human Genome Research Institute Office of Population Genomics.
103. Genes, Environment, and Health Initiative (GEI)
104. Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS)
105. Genetic Association Information Network (GAIN)
106. Gene Environment Association Studies (GENEVA)
107. Electronic Medical Records and Genomics (eMERGE) Network.
108. RFA HG-08–004, Genome-Wide Association Studies of Treatment Response in Randomized Clinical Trials – Study Investigators.
109. PhenX: Consensus Measures for Phenotypes and eXposures.