For 20 years, genetic linkage combined with positional cloning has offered a rational and increasingly straightforward route to finding gene mutations that lead to monogenic disease, such as cystic fibrosis and Huntington’s disease (see the Glossary). With a few important exceptions, these searches have led to mutations that alter the amino acid sequence of a protein and that enormously increase the risk of disease.
During the past few years, genomewide association studies have identified a large number of robust associations between specific chromosomal loci and complex human disease, such as type 2 diabetes and rheumatoid arthritis1 (Fig. 1). This approach relies on the foundation of data produced by the International Human HapMap Project and the fact that genetic variance at one locus can predict with high probability genetic variance at an adjacent locus, typically over distances of 30,000 base pairs of DNA2 in the human genome, which contains about 3×10^9 base pairs. This haplotypic structure of the human genome means that it is possible to survey the genome for common variability associated with the risk of disease simply by genotyping approximately 500,000 judiciously chosen markers in the genome of several thousand case subjects and control subjects.3 Consequently, it is now routine to identify common, low-risk variants (i.e., those that are present in more than 5% of the population) that confer a small risk of disease, typically with odds ratios of 1.2 to 5.0.4
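As a rough illustration of what an odds ratio in this range means at the level of a single marker, the following minimal sketch computes an allelic odds ratio and an approximate 95% confidence interval from a 2×2 table of allele counts. The function name and the counts are hypothetical and are not taken from any study cited here; the interval uses the standard log-odds (Woolf) approximation, which is one common choice among several.

```python
import math


def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Return (odds ratio, lower 95% bound, upper 95% bound) for a risk allele.

    Inputs are allele counts from a 2x2 case-control table. The interval uses
    the log-odds (Woolf) approximation and assumes no cell is zero.
    """
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: 2000 case subjects and 2000 control subjects,
# two alleles each, with the risk allele at 30% in cases and 25% in controls.
odds_ratio, ci_lower, ci_upper = allelic_odds_ratio(1200, 2800, 1000, 3000)
print(f"odds ratio = {odds_ratio:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```

With these illustrative counts the odds ratio is about 1.29, near the lower end of the range quoted above, which is why several thousand subjects are needed to separate such an effect from chance.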
The platform that is used to genotype markers in genomewide association studies and related approaches has uncovered a startling degree of structural genomic variation. Although such variants were known to be causes of rare monogenic disorders,5,6 the extent of structural genomic variation among persons was largely unanticipated, and there is increasing interest in understanding how such variants may confer a risk of common diseases.7,8
The initial contention surrounding the viability of genomewide association studies has largely subsided. However, discussion has centered on evaluating how far such studies will take us in understanding the risks and causes of disease — and thus the time and resources that should be invested in genotyping more case subjects with any one disease to garner what many see as diminishing genetic returns. These issues are discussed in three Perspective articles in this issue of the Journal.9–11 Nonetheless, the current phase of rapid discovery is a remarkable change that ends a long period of frustration, when the investigation of the genetic causes of complex diseases could boast few successes. The data from genomewide association studies and emerging sequencing techniques offer a route to the dissection of genetic causes of human disease (Table 1).12–21 Here we describe this route and some of its challenges.
Genomewide association studies identify loci and not genes per se and cannot easily identify loci at which there are many rare risk alleles in any given population.22 Rather, this approach is designed to find loci that fit the common disease–common variant hypothesis of human disease23,24 (Table 2). Refinement of susceptibility loci and the identification of causal variants may be achieved through fine mapping (see the Glossary).
One observation that has taken many observers by surprise is that most loci that have been discovered through genomewide association analysis do not map to amino acid changes in proteins. Indeed, many of the loci do not even map to recognizable protein open reading frames but rather may act in the RNA world by altering either transcriptional or translational efficiency. They are thus predicted to affect gene expression. Effects on expression may be quite varied and include temporal and spatial effects on gene expression that may be broadly characterized as those that alter transcript levels in a constitutive manner, those that modulate transcript expression in response to stimuli, and those that affect splicing.
Therefore, there are two clear and immediate tasks: to develop an understanding of the genetics of gene expression and to identify disease-linked variants that are too rare to be picked up by association methods and yet have risk alleles of sufficient “strength” to allow detection with the use of linkage strategies (see the Glossary for descriptions of genetic association and genetic linkage). Meeting these challenges will serve efforts to better understand environmental influences on the causes of disease and may facilitate a systems-based understanding of disease, in which we come to understand the full, molecular network that is perturbed in disease.
It is perhaps not surprising that many variants conferring a low risk of a complex disease effect a change of quantity in gene expression, because many of these diseases can be thought of as quantitative traits themselves, with disease diagnosis being made when a clinical threshold is surpassed (as is the case with hypertension and Alzheimer’s disease). This frequent observation from genomewide association studies was presaged by the observation that genetic variability in the insulin-gene promoter is associated with an increased risk of type 1 diabetes.25
Genetic variability in gene expression may occur at many stages: transcription, messenger RNA (mRNA) stability, and splicing or translation efficiency. In each of these instances, the underlying variability would be expected to occur in different DNA elements that may have element-specific sequence motifs and should be distinguishable by their effects on cellular RNA species. Tissue-specific genomewide analyses of gene expression offer a starting point for parsing the various possibilities.26 This approach has been particularly helpful in understanding the effect of susceptibility variants in immune-mediated disease, such as asthma, because the lymphocyte (which is pivotal to the pathological analysis of such disease) is easily accessible,27 although human fat tissue28 and human brain tissue29 obtained at autopsy have also been used. Three examples of this approach illustrate its power: a haplotype associated with asthma also shows an association with lymphoblastoid expression of the proteins ORMDL3 and GSDML,27,30 genetic variants that are associated with obesity are also associated with the expression of their cognate mRNAs in adipose tissue,28 and a variant of MAPT (encoding tau) that is associated with progressive supranuclear palsy is also associated with MAPT mRNA expression.29,31,32
Genetic variability can also result in differences in translational efficiency through changes in the mRNA sequence or in the level or sequence of regulatory RNAs.33 Both modes can be queried through high-throughput transcriptomic sequencing, which enumerates the number of times that any individual RNA species is present in preparations from that tissue. Unlike chip technologies, such sequencing does not depend on the relevant RNA being represented on an array; it can also provide a survey of all RNA species, not just mRNA. The eventual goal of the recently announced Genotype-Tissue Expression (GTEx) project is to create a whole-body map of haplotypic expression so that any risk haplotype for any disease can be easily checked for its effect on genomewide and tissuewide RNA expression (Fig. 2).
The present array techniques do not enable hypothesis-free means of identifying high-risk variants with one exception: that of structural genomic variation.5 The process for reaching the goal of systematic identification of rare high-risk variants is clear, both from candidate-gene studies and from emerging techniques of high-throughput sequencing, which will soon permit the routine and complete sequencing of the human genome.19 Existing but imperfect intermediate techniques toward that goal are transcriptome sequencing and exome sequencing.34 The latter uses array techniques to pull exonic DNA from genomic DNA, which is then sequenced to give full representation of the coding genome. All coding polymorphisms in a subject will therefore be identifiable. Candidate-gene studies have already suggested the power of this type of approach. For example, Cohen and colleagues35 sequenced several genes encoding cholesterol-metabolizing proteins in patients with low plasma levels of high-density lipoprotein (HDL) cholesterol and found that rare variants were more common in case subjects who had low levels of HDL cholesterol than in control subjects. This example, in which a limited number of candidate genes were sequenced in a large number of subjects (an approach called deep resequencing), shows the power of testing a specific hypothesis. One can imagine testing a hypothesis with the use of genomewide data, in which case the usual criterion of having sufficient power to overcome the limitation of multiple testing and the low prevalence of rare variants would apply.
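The multiple-testing constraint mentioned above can be made concrete with a small sketch. It assumes a Bonferroni correction, which is only one conservative way of controlling the family-wise error rate and is not prescribed by the studies discussed here; the marker count of 500,000 is borrowed from the genotyping platforms described earlier, and the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error
    rate at or below alpha when n_tests independent tests are performed."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed values: a conventional alpha of 0.05 and 500,000 tested markers.
threshold = bonferroni_threshold(0.05, 500_000)
print(f"per-marker P value threshold: {threshold:.1e}")
```

Under these assumptions each marker must reach a P value of roughly 1×10^-7. For rare variants, few carriers are observed in any sample, so reaching such a threshold requires either very large samples or a large effect.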
There is increasing focus on the idea of networks that are composed of genes and proteins. Although the complex interplay of macromolecules is a certainty, there is benefit in taking a reductionist approach when envisioning common molecular routes toward disease. Indeed, for many diseases, different genetic loci must impinge on a common pathway to pathogenesis. This means that as a risk allele at a genetic locus comes into focus, it provides clues to other risk loci and mechanisms by which variability at the same locus or on the same pathway can contribute to disease.
This point is well illustrated in the case of coronary artery disease, in which cholesterol metabolism has long been thought to be a pathogenic pathway to disease.36 Although few pathogenic pathways are as well delineated as cholesterol metabolism, huge amounts of data pertaining to protein and pathway interactions have been obtained with the use of yeast, roundworms (Caenorhabditis elegans), and fruit flies (Drosophila melanogaster). Studies of these creatures have informed and continue to inform human genetic studies. For example, two of the genes that are involved in recessive parkinsonism, PARK2 and PINK1, have recently been shown to be involved in the same mitochondrial pathway through work in drosophila.37,38
The vast majority of success in defining genetic risk in disease has been a result of traditional gene-hunting efforts to find mutations that underlie monogenic diseases. In this approach, our understanding of disease revolves around the idea of normal and abnormal variation, with the latter greatly increasing the risk of disease. In considering the genetics of complex disease and particularly the role of common variants that affect expression, a more nuanced perspective is useful. The difference in genetic effect between rare high-risk variants and common low-risk variants is quantitative and not qualitative, as illustrated in Parkinson’s disease: point mutations within the α-synuclein gene39 and genomic multiplications containing this gene6 lead to monogenic disease, whereas a common haplotype of the α-synuclein gene moderates the risk of sporadic disease.40
Parsimony would suggest that there is probably a graded influence of genetic variation in gene expression because for any gene many elements contribute to the control of its expression, and genetic variability in any one of these elements is likely to result in a change in expression. In this model, at any locus there are multiple variants, which can exist across a single haplotype block or in multiple haplotype blocks proximal to the affected transcript. Thus, there is no single haplotype for disease risk and no single protective haplotype but, rather, a collection of haplotypes that confer a graded risk of disease. The variant with the highest population attributable risk (a combination of allele frequency and relative risk) is likely to be the first at the locus to be detected as a risk factor, and further dissection of the same locus will yield other risk alleles of smaller effect. Although such dissection is proving to be a tough task, there are already examples of success. After the identification of a risk allele for macular degeneration, a polymorphism that causes the substitution of histidine for tyrosine at position 402 in complement factor H (CFH),41–43 several additional and independent risk variants, including noncoding alterations, have been detected in and around the CFH gene, and none of these variants in isolation accounts for all the risk attributed to this locus.44
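To show how allele frequency and relative risk combine, the sketch below uses Levin's formula for the population attributable risk, PAR = p(RR − 1) / [1 + p(RR − 1)]. Treating the risk-allele frequency as the exposure prevalence p is a simplifying assumption (carrier frequency under a specified genetic model would be more precise), and the two example variants are hypothetical.

```python
def population_attributable_risk(frequency, relative_risk):
    """Levin's population attributable risk (as a fraction of cases).

    frequency     -- proportion of the population exposed; here the risk-allele
                     frequency is used as a simplifying stand-in.
    relative_risk -- risk in the exposed divided by risk in the unexposed.
    """
    excess = frequency * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical variants at one locus.
common_low_risk = population_attributable_risk(frequency=0.30, relative_risk=1.2)
rare_high_risk = population_attributable_risk(frequency=0.001, relative_risk=5.0)

print(f"common, low-risk variant: {common_low_risk:.1%} of cases")
print(f"rare, high-risk variant:  {rare_high_risk:.1%} of cases")
```

In this illustration the common variant with a modest relative risk accounts for a larger share of cases in the population than the rare variant with a fivefold relative risk, which is consistent with the expectation that it would be the first to be detected at the locus.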
The evolution of genetic analysis of traits has revealed the power of testing markers across the whole genome to identify novel factors involved in disease and has shown that large samples are required to determine true biologic associations. This, in turn, has underscored the desirability of accessible resources and data, such as the human genome sequence and the haplotype map from the HapMap project, for these and future techniques. The generation of population-control data for genomewide association studies by the Wellcome Trust and other groups, while initially expensive, has been useful to many independent research groups and proved to be an economical approach. A similarly useful resource will be the 1000 Genomes Project, a large international effort that aims to identify all single-nucleotide polymorphisms (SNPs) with a prevalence of 1% or more in the human genome. This effort will focus on resequencing samples from the initial and extended HapMap populations from around the world. Even with new sequencing techniques, this is a monumental effort. However, it is still likely to be only a first installment. To reliably determine the pathogenicity of rare variants as they are identified, we will probably need reference sequences from tens or hundreds of thousands of subjects, coupled with a better understanding of the biologic effects of SNPs. Housing and making accessible such data will be a considerable challenge, especially when one considers that the data will include variants pertaining to both SNPs and structural genomic variability.
To state that most complex diseases are caused by an interaction between genome and environment is a cliché. Such interactions, while likely, have for the most part not been demonstrated, and we should be cautious about universally subscribing to this belief without evidence. Since the quantification of environmental influences is notoriously difficult, it is likely that such a demonstration will remain a formidable challenge. At least, the definition of gene-based pathways for disease will provide a framework for the systematic investigation of exogenous influences. This is one of the goals of the recently announced Genes, Environment, and Health Initiative of the National Institutes of Health. There is increasing interest in genomewide assessments of epigenetic modification brought about by a greater understanding of the ubiquitous nature of such modifications and the availability of genome-scale sequences, which makes such investigation tenable from a practical perspective. It is hoped that greater understanding of the epigenome, particularly in the context of genetic variation and gene expression, will offer a direct and quantifiable link between putative environmental influences and pathways relevant to pathogenesis.
The jigsaw puzzle of understanding the causes of disease lies before us: we now have the edges and corners in place. The identification of monogenic disease loci and the common genetic variability that contribute to disease risk is now a tractable problem. The techniques that are necessary for genomewide identification of the rare variants that contribute to disease risk are quickly being refined. There is an enormous amount of filling in to do (including the dissection of the interactions among different genes), and there are formidable challenges, which increased bioinformatic data will help to address. Undoubtedly, there will be surprises, but the boundaries of the task ahead have already been drawn.
Supported in part by the Intramural Research Program of the National Institute on Aging of the National Institutes of Health (project number, 1-Z01-AG000949-02) and by the Medical Research Council.
We thank Dr. Katrina Gwinn for her constructive comments on the manuscript.
No potential conflict of interest relevant to this article was reported.