Genome-wide association (GWA) studies have arguably transformed the science of discovering genetic variants that influence common, complex diseases. Prior to the advent of the GWA approach in 2005, only a handful of robustly replicated associations such as CARD15 in Crohn’s disease, PPARG and KCNJ11 in diabetes, and APOE in Alzheimer’s disease had been identified, while the vast majority of putative candidate gene–disease associations had failed to replicate. This situation has changed dramatically in the past 3 years, with over 230 published studies reporting over 400 unique SNPs associated with over 75 common diseases and traits [6,101].
Most of these investigations have been conducted as isolated, single studies of a specific disease or trait, typically using a case–control design. Notable collaborations such as those in diabetes [7–10] and height [11,12] have brought together diverse groups of investigators and samples for study of the same or similar traits, but only one consortium to date has published results of GWA studies conducted in multiple diseases under a shared genotyping and analytic framework. These pioneering efforts of the Wellcome Trust Case Control Consortium (WTCCC) in seven diverse diseases and a common set of controls provided a highly successful model for the conduct of collaborative GWA studies, one that the Wellcome Trust recently extended to an additional 27 conditions in over 120,000 additional people.
Despite the numerous logistical and organizational challenges in coordinating multiple GWA studies in a single consortium, the WTCCC contributed to many important methodologic advances, including demonstration of the robustness of a single control group, the value of using cases of some diseases as controls for others, the greater power provided by increased sample size (numbers of subjects) rather than increased genomic coverage (numbers of SNPs), the importance of manual review of automated genotyping calls, and the feasibility of broad distribution of individual-level data to investigators agreeing to careful controls on their use.
Building upon this foundation, the National Human Genome Research Institute (NHGRI) and other Institutes of the NIH have initiated GWA collaborative studies using the case–control design, as well as more complex designs including cohort studies, randomized clinical trials, and biorepositories with phenotypes defined by electronic medical records. In addition, NHGRI’s Office of Population Genomics, under its mandate to promote the application of genomic technologies to population-based studies, has also led efforts to develop standard phenotype and exposure measures for GWA studies, characterize putative causal variants identified by GWA in well-characterized cohorts, and encourage fine mapping, sequencing, functional studies and pilot translational studies through the Genes, Environment, and Health Initiative (GEI) of the Department of Health and Human Services. This special report describes these efforts, the gaps in scientific knowledge and methods they were designed to address, the challenges they have encountered, and the opportunities they present for extending the promise of GWA research.
The Genetic Association Information Network (GAIN) was initiated in late 2005 as a public/private partnership to investigate the genetic basis of common diseases through a series of collaborative GWA studies. Six studies involving a total of 18,000 DNA samples were selected on the basis of scientific merit, potential for genome-wide genotyping to provide valuable new insights, and public health significance of the traits proposed for study. Because of the desire for transparency in public/private partnerships involving NIH, GAIN design and implementation were directed by a series of guiding principles that have become useful foundations for other GWA studies (Box 1). Foremost among these principles was the commitment to release the resulting data as broadly and rapidly as possible, with equal opportunity for access by all users who agree to protect the confidentiality of study participants and to respect the intellectual investment of investigators contributing data and samples to GAIN.
Box 1. Guiding principles of the Genetic Association Information Network (GAIN).
GAIN will use the most rigorous scientific approaches and maintain the highest ethical standards as guided by a set of principles [list of principles not recovered].
GAIN: Genetic Association Information Network.
At the time GAIN was initiated, there was no NIH database capable of receiving and disseminating the genotype–phenotype data, nor were there NIH policies on access and proper use of this information. The commitment to make GWA genotyping data from GAIN widely available, together with 100K and 500K GWA scans from the Framingham Heart Study, stimulated the National Center for Biotechnology Information to develop the Database of Genotypes and Phenotypes (dbGaP). This database, which also functions as a site for investigators to request access, NIH staff to review these requests, and approved users to download data securely, was up and running by late 2006. Implementation of data access mechanisms for GAIN, Framingham, and other GWA studies led the NIH to develop its ‘Policy for Sharing of Data Obtained in NIH Supported or Conducted GWA Studies’ after a lengthy period of public comment. The policy was finalized and implemented in late 2007 and applies to all applications for funding of GWA studies submitted to NIH after January 2008.
Collaborative activities within GAIN have focused on genotyping quality control and data analysis, particularly in relation to controlling for population stratification and sources of genotyping error. The potential for joint analysis of cases and controls has been limited by restrictions on data use in the consent forms from the original studies, which in many cases did not permit their use for phenotypes unrelated to the primary purpose for which the samples were collected. GAIN investigators and their collaborators are actively engaged in comparing methods for imputation of untyped SNPs and improved calling algorithms, and in replicating their initial results in additional samples and pooling their findings with other GWA studies. Involvement of the scientific community has been encouraged through a series of annual analysis workshops at which initial results and analytic challenges have been presented in depth.
The Genetic Association Information Network has provided a valuable foundation on which other collaborative GWA studies were built. Chief among these has been the Gene–Environment Association Studies (GENEVA) component of the GEI, which began with eight GWA studies and recently expanded to a total of 14 studies. GENEVA added a coordinating center to the original GAIN model, allowing for a more standardized approach to data submission and data cleaning, with guidelines and procedures for data management under development for dissemination to the scientific community. GENEVA in turn provides an important foundation for other components of the GEI genetics program, including efforts in replication and fine mapping of GWA findings, functional studies, pilot translational efforts, and development of analytic methods. Integration of GWA studies with improved technologies for measuring environmental exposures (behavioral and lifestyle factors, toxins, pollutants and so on) being developed in GEI’s companion exposure biology program is anticipated once those measures become available. Components of GEI and the solicitations that provide their support are listed in Table 1.
The Electronic Medical Records and Genomics (eMERGE) network is a five-member consortium formed to develop, disseminate and apply approaches to research that combine DNA biorepositories with electronic medical record (EMR) systems for large-scale, high-throughput genomic research. EMRs provide great potential for defining phenotypes and exposures in a cost- and time-efficient manner, and their integration into healthcare delivery presents important opportunities for translating genomic findings into improved care. eMERGE will define the extent and complexity of data extraction required to obtain clinical information needed for genome-wide research, and the compatibility of existing EMR formats with each other, with various standardized language systems, and with databases such as dbGaP. Concerns related to widespread data sharing for research use outside a given healthcare system will also be assessed. eMERGE will investigate the acceptability of such research to biorepository participants, their clinicians, biorepository investigators, and biorepository Institutional Review Boards, as well as participants’ concerns and preferences regarding return of individual results of genetic studies.
Electronic Medical Records and Genomics will also compare various approaches for conducting genome-wide research in biorepositories so as to maximize the scientific value of the biorepository for identifying genes related to complex diseases. For proposed genome-wide studies deemed scientifically advantageous, it will investigate genetic variants associated with specific complex diseases and traits derived from EMR data. In accordance with NIH data-sharing policies, resulting GWA data will be made rapidly available for research use through dbGaP. By working collaboratively to share experience and expertise across participating biorepositories, the network will develop guidelines for dissemination to other biorepositories on consultation and consent for such research, and on collecting, documenting and depositing phenotype and exposure data to databases such as dbGaP.
A collaborative program currently in planning will extend GWA studies to randomized clinical trials, with the intent of identifying genetic variants associated with response to treatments for conditions of clinical or public health significance. This program is needed because identification of variants related to treatment response to date has largely relied on candidate gene studies, much as did research in the genetics of complex diseases before the advent of GWA studies. Early pharmacogenomic studies have also placed an understandable emphasis on severe adverse drug reactions, which are generally too uncommon to permit accrual of the large series of cases needed for GWA studies. By contrast, clinical improvement (or lack thereof) in response to treatment is a frequently occurring outcome in clinical trials, so that large numbers of responders and nonresponders can readily be collected. When treatment response can be assessed as a continuous trait, the power of GWA studies is likely to be even greater.
Leveraging existing clinical trial resources for genome-wide research also provides opportunities to explore the implications of generating genetic information in the context of delivering a controlled intervention. Analysis and release of genomic data in such settings, for example, may be complicated by the need for masking of treatment assignment until the trial is completed. Incorporation of point-of-care models, where prevention or treatment strategies are selected based on a patient’s genotype, has not been widely explored. Empirical data are needed on the expectations of and reactions from study participants regarding such approaches, and the impact of including such studies on participation rates in clinical trials. Collaborating investigators will share expertise and experience across studies to develop best practices for incorporating genome-wide studies in clinical trials. The resulting data and best practices will be widely shared as research tools with the scientific community.
In addition to the GEI-supported programs described above that build directly upon GWA studies, and dbGaP that grew directly out of them, NHGRI’s Office of Population Genomics has also established a data resource describing published GWA studies in its ‘Catalog of Published Genome-Wide Association Studies’. It provides characteristics such as author, trait, sample size and genotyping platform on all GWA studies, identified through PubMed literature searches and reports from the media, that attempt to assay at least 100K SNPs. SNP-trait associations significant at p < 10⁻⁶ and not previously reported are listed with characteristics such as rs number, genomic region, risk allele and its frequency, odds ratios and p-values. The site is interactive and can be searched by journal, first author, disease/trait and other characteristics, and a downloadable database is available in Excel format.
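The catalog's listing rule described above (an association significant at p < 10⁻⁶ and not previously reported) can be sketched as a simple filter. This is an illustration only, not the catalog's actual software: the `Association` record, its field names, and the three sample rows are hypothetical, and only the threshold and the "not previously reported" condition come from the text.

```python
from dataclasses import dataclass

# Significance threshold stated in the text for catalog listing.
P_THRESHOLD = 1e-6


@dataclass
class Association:
    """Hypothetical record for one SNP-trait association."""
    rs_number: str
    trait: str
    p_value: float
    previously_reported: bool


def catalog_eligible(records):
    """Keep associations with p < 1e-6 that were not previously reported."""
    return [
        r for r in records
        if r.p_value < P_THRESHOLD and not r.previously_reported
    ]


# Made-up rows for illustration; these are not real catalog entries.
records = [
    Association("rs0000001", "trait A", 3e-8, False),  # passes both conditions
    Association("rs0000002", "trait A", 5e-5, False),  # p-value too large
    Association("rs0000003", "trait B", 1e-9, True),   # already reported
]

eligible = catalog_eligible(records)
print([r.rs_number for r in eligible])
```

Only the first row satisfies both conditions; the other two are excluded for the reasons noted in the comments.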
The advantages of pooling GWA studies to increase sample size and diversity have been widely recognized as described above, but can be impeded by lack of standard measures of phenotypes and exposures across studies. This is being addressed by the PhenX project on consensus measures for phenotypes and exposures, which is developing a toolkit of roughly 15 standardized measures in each of 20 domains (for a total of roughly 300 measures) related to complex diseases and environmental exposures. Once an individual’s genome has been comprehensively characterized by GWA genotyping, it can potentially be related to any trait, not just the primary trait initially proposed for the GWA study, as long as these additional uses are consistent with the informed consent provided by study participants. The value of such an approach is demonstrated by the 18 GWA publications arising from the Framingham Study and from an array of metabolic traits in the Northern Finnish Birth Cohort. Although cross-study analyses have been conducted for secondary phenotypes that are readily standardized, such as height, BMI and serum lipids, few GWA studies published to date have included phenotypic and exposure data on a sufficiently wide variety of traits and diseases to enable such additional analysis, particularly across multiple studies. Instead, most current GWA studies utilize a case–control or family design focusing on a single disease or group of related traits. More importantly, the potential for cross-study comparisons is restricted by the lack of standardized or comparable phenotypic and environmental measures, despite the many risk factors, such as smoking, dietary intake and low socioeconomic status, which are common to multiple diseases. PhenX will identify readily standardized and implemented phenotype and exposure measures for use in GWA studies.
Success in this effort will facilitate more efficient use of GWA data to understand genetic influences on incidence of, and morbidity and mortality from, common diseases; on trait variation; and on responses to environmental exposures, including drugs or other therapies and lifestyle factors.
Finally, it is essential to recognize that large-scale GWA studies are a crucial first step in the identification of genetic variants related to complex diseases, but they are only a first step. Robustly replicated findings from GWA studies must be investigated in free-living human populations for their potential functional and public health implications, but once such potentially causal variants are isolated, investigation often shifts away from human populations to the laboratory. Much remains to be learned, however, from well-characterized human population samples in which potentially causative variants have been, or could be, assayed. The Population Architecture using Genomics and Epidemiology (PAGE) program is designed to investigate, in four collaborating, well-characterized population studies, genetic variants identified as potentially causally associated with complex diseases in GWA and other genetic studies. The resulting population-based descriptive and association data will be widely shared through dbGaP and other user-friendly informatic systems to accelerate the understanding of genes related to complex diseases. PAGE forms a natural complement to past and ongoing GWA studies: those studies tend to examine hundreds of thousands of variants in relation to a handful of phenotypes, while this program will relate a small number of strongly associated variants to a wide variety of traits in well-characterized cohort studies. It will define the ‘epidemiologic architecture’ of these putative causal variants – their population prevalence; prevalence in race/ethnic subgroups; relative risk of rigorously defined, incident disease; consistency of association across subgroups defined by age, sex, race/ethnicity or exposures; and potential modifiability of associated risk. This information will help to determine the health implications of a given variant and the priority it should receive for identifying and testing interventions to reduce its associated risk.
This information may also be quite valuable in exploring gene function, since the epidemiologic approach of genetic investigation, starting from observed phenotypic characteristics and moving more proximally to gene pathways and sequence variants, complements well the laboratory approach of moving from DNA sequence to function to phenotype. The collaborative nature of the program will facilitate the development and dissemination of efficient approaches for epidemiologic characterization of putative causal variants in these and other cohorts. Detection and interpretation of important differences between these data and initial GWA discovery reports will help inform the design of future GWA studies.
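Two of the quantities named in the ‘epidemiologic architecture’ above, population prevalence of a variant and relative risk of incident disease, can be illustrated with a minimal sketch. The function names and the cohort counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not PAGE data, and a real analysis would also adjust for covariates and report confidence intervals.

```python
def carrier_prevalence(n_carriers, n_noncarriers):
    """Proportion of the cohort carrying the variant."""
    return n_carriers / (n_carriers + n_noncarriers)


def relative_risk(cases_carriers, n_carriers, cases_noncarriers, n_noncarriers):
    """Incidence among carriers divided by incidence among noncarriers."""
    risk_carriers = cases_carriers / n_carriers
    risk_noncarriers = cases_noncarriers / n_noncarriers
    return risk_carriers / risk_noncarriers


# Hypothetical cohort: 2,000 carriers with 60 incident cases,
# 8,000 noncarriers with 160 incident cases.
prevalence = carrier_prevalence(2000, 8000)
rr = relative_risk(60, 2000, 160, 8000)

print(f"carrier prevalence: {prevalence:.2f}")
print(f"relative risk: {rr:.2f}")
```

Repeating the same calculation within subgroups defined by age, sex, race/ethnicity or exposure is one simple way to examine the consistency of association that the text describes.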
Genome-wide association studies have already identified a large number of loci reproducibly associated with complex diseases, but in general the risk associated with these variants is small, suggesting that the true causal variants have yet to be found or that multiple other factors, including other variants and environmental exposures, combine to produce inherited disease risk. Of particular importance may be environmental factors known to be related to multiple common diseases, such as smoking, alcohol use, dietary intake, physical activity and other behavioral factors. Determining the true functional consequences of these variants is likely to be challenging but will be critical to developing interventions to modify disease risk. Defining the role of structural variation, such as insertions, deletions, duplications and inversions that are poorly captured by SNPs, will also be a major challenge in the coming years but may hold promise for identifying other loci related to disease. High-throughput, low-cost sequencing technologies will permit identification of rarer sequence variants that may be causative or additive in the disease associations identified in GWA studies. Population-based studies will be needed to determine the population prevalence and risk associated with putative causal variants in unbiased and diverse population samples, and to estimate the increment in risk over established risk factors provided by GWA-defined variants.
Financial & competing interests disclosure
The author is an employee of the US Government. The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.