Genome-wide association studies are increasingly being applied to search for novel genes that might underlie cardiovascular diseases. In this article, we briefly review the principles that underlie modern genetic analyses, and provide several illustrations from the SardiNIA Study of genome-wide association studies for cardiovascular risk factor traits.
“Variability is the law of life. As no two faces are the same, so no two bodies are alike, and no two individuals behave alike in the abnormal conditions we know as disease. This is fundamental to the education of the physician”
-Sir William Osler
Cardiovascular diseases remain the leading causes of mortality in industrialized countries (Rosamond et al. 2008). Spectacular advances in our understanding of the pathophysiology of these diseases have provided the cornerstone for developing and implementing numerous pivotal preventive and therapeutic strategies, which are credited with remarkable improvements in prognosis and quality of life. However, it appears that the mortality rates from these diseases are no longer declining or, alarmingly, may even be on the rise (Hardoon et al. 2008). Rapid and well publicized advances in the field of genetics (reviewed in Lusis 2003, Nabel 2003, Damani and Topol 2007, Kullo and Ding 2007) raise the hope that imminent breakthroughs could be achieved with newfound genetic tools (see glossary of terms). In this article, we briefly review the principles that underlie modern genetic analyses, and provide several examples from our own experience with genome-wide association studies (GWAS) of cardiovascular risk factors in the SardiNIA Study.
The field of Mendelian-based genetics has traditionally been rooted in the “one gene/one trait hypothesis”. In the earliest clinical genetic studies, investigators focused on the study of rare diseases in which genetic variants found in multiple members of large families were examined to determine whether they played a causative role. Using this approach, researchers succeeded in identifying several single gene mutations that underlie predominantly autosomal dominant diseases and conditions, such as mutations in the LDL-Receptor, and APO-B genes which are associated with hypercholesterolemia (Hobbs et al. 1987, Soria et al. 1989, Kullo and Ding 2007). These exciting discoveries provided great insights into the pathophysiology of the diseases being studied. However, the successes have largely been restricted to rare, monogenic diseases.
Most common heritable conditions are “complex traits”, in that they are likely influenced by the aggregate effects of several genetic variations (or polymorphisms) that are commonly found in the population, are located in multiple distinct genes, and individually contribute only a small effect to the trait. According to the “common disease, common variant hypothesis” (Reich et al. 2001, Bodmer and Bonilla 2008), the polymorphisms or risk alleles that underlie heritable traits accumulate randomly during evolution if they do not individually cause debilitating disease. Some may even confer an evolutionary advantage to the health of the organism under particular conditions. The phenotypic manifestation of the trait is influenced, in part, by the interaction of these polymorphisms with each other, and is modulated by environmental factors, gene-environment interactions, epigenetic factors, and chance.
Early investigations into the genetics of complex cardiovascular traits focused on “candidate genes”. Selecting these genes requires an a priori understanding of the underlying pathophysiology of the trait. However, there is growing recognition that studies restricted to candidate genes are woefully inadequate to elucidate the genetic underpinning of complex traits, in part because of our incomplete understanding of all the factors that regulate and influence the manifestation of a given trait. It was hoped that this could be remedied by a broader search for polymorphisms across the entire genome. Because each polymorphism is only responsible for a small proportion of the overall variability in the trait of interest, large populations are needed to gain the statistical power required to identify these polymorphisms. The feasibility of conducting genome-wide studies with sufficient resolution in these large populations, in turn, required progress on three intertwined fronts: 1) refinement of genetic markers, 2) utilization of linkage disequilibrium principles, and 3) advances in DNA sequencing technology.
The major challenges facing genetic investigators are to identify the genes of interest and pinpoint the location of the culprit polymorphisms among the billions of nucleotides that form the human genome. To achieve this, investigators have depended heavily on genetic markers. Initially, these markers were quite crude, and relied on characteristic patterns derived from cleavage of DNA by specific restriction enzymes that generated distinctive patterns known as Restriction Fragment Length Polymorphisms (RFLP). The subsequent use of microsatellite markers (e.g. variable number tandem repeats and short tandem repeats) improved the precision of these markers. More recently, completion of the Human Genome Project along with improvements in sequencing speed and analytic efficiency allowed the ultimate refinement of these markers into single nucleotide polymorphisms (SNPs).
SNPs are specific sites on chromosomal DNA where single nucleotides are known to vary among individuals. These SNPs are identified by a “reference SNP” (abbreviated rs) number (http://www.ornl.gov/sci/techresources/Human_Genome/faq/snps.shtml). Approximately 12–15 × 10⁶ SNPs have been identified (http://www.ncbi.nlm.nih.gov/SNP) to date, and additional SNPs are being identified constantly. On average, SNPs are found every 100–300 base pairs in the human genome, although their distribution is not homogeneous.
While it is possible for an individual SNP to be any one of the four nucleotides (adenine, cytosine, guanine, thymine), each specific location in the genomic code is typically only one of two possible nucleotides. The nucleotide most frequently present in a population is called the “major” allele, while the other one is called the “minor” allele. SNPs can occur in both coding and non-coding regions of the gene. SNPs can be “silent”, or they can result in alterations in gene expression or protein structure/function. In general, polymorphisms are distinguished from mutations in that they are present in >1% of the population.
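The major/minor terminology can be made concrete with a short calculation. The following is a minimal Python sketch, not part of the SardiNIA analysis pipeline: the function name and the ten genotypes are hypothetical, and it assumes a strictly biallelic site with genotypes written as two-letter strings.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return (major_allele, minor_allele, minor_allele_frequency) for one SNP.

    `genotypes` is a list of two-letter strings such as "AG", one per
    individual. The site is assumed to be biallelic.
    """
    # Each individual carries two alleles, so count every letter.
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a biallelic site")
    (major, major_n), (minor, minor_n) = counts.most_common(2)
    return major, minor, minor_n / (major_n + minor_n)


# Hypothetical genotypes for ten individuals (illustrative data only).
sample = ["AA", "AG", "AA", "GG", "AG", "AA", "AA", "AG", "AA", "AA"]
major, minor, maf = allele_frequencies(sample)

# The >1% convention mentioned above for calling a variant a polymorphism.
is_polymorphism = maf > 0.01
```

In this toy sample, A is the major allele and G the minor allele. In practice the frequency would be estimated from a much larger, ideally unrelated, reference sample.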
A basic tenet of genetic analysis is that two alleles located on the same chromosome remain linked together unless they are separated during meiotic recombination (Figure 1A). Therefore, the likelihood that two alleles (or a marker and a gene of interest) would segregate from one another is proportional to their physical distance from one another and the probability that a meiotic crossover site is located between them.
Linkage disequilibrium (LD) denotes a situation in which two SNPs are not inherited independently from each other; instead they remain “linked” to one another with a greater frequency than would be expected from random recombination events. LD is influenced by several factors, including physical distance, genetic linkage, rates of recombination, rate of mutation, random drift, and population structure (reviewed in Reich et al. 2001, Abecasis et al. 2005) (Figure 1B). At one extreme, two SNPs may fall within a region that does not undergo internal recombination during meiosis. This region is referred to as a haplotype block. Each block is composed of thousands of nucleotides (the number can vary widely), and usually remains as a single unit during meiosis, because recombination preferentially occurs at the borders that delimit the block, not within the block. There are approximately 250,000–300,000 haplotype blocks (http://hapmap.org/whatishapmap.html.en) in the human genome. These have been characterized and categorized by the “HapMap Project” (http://www.hapmap.org/thehapmap.html.en). An important corollary is that a single genetic marker (“tag SNP”) could be used to represent this haplotype block in genetic studies. The tag SNP and the culprit mutation will co-segregate if they are located in the same haplotype block. The tag SNP can thus serve as a guiding “marker” to the true location of the causative gene or polymorphism.
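The strength of LD between two SNPs is commonly summarized by the statistics D, D′ and r², which compare the observed frequency of a two-SNP haplotype with the frequency expected if the alleles were independent. The sketch below is illustrative only: the function name and the haplotype counts are hypothetical, and it assumes phased haplotype counts for two biallelic SNPs (alleles A/a at the first site, B/b at the second).

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) from phased haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # observed frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at the first SNP
    p_B = (n_AB + n_aB) / total    # frequency of allele B at the second SNP

    # D: departure of the observed haplotype frequency from independence.
    d = p_AB - p_A * p_B

    # D': |D| scaled by the largest value it could take given the allele
    # frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the alleles at the two sites.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical counts: the two SNPs always travel together (complete LD).
complete = linkage_disequilibrium(50, 0, 0, 50)
# Hypothetical counts: partial LD.
partial = linkage_disequilibrium(40, 10, 10, 40)
# Hypothetical counts: the two SNPs are independent (no LD).
independent = linkage_disequilibrium(25, 25, 25, 25)
```

An r² close to 1 is what makes a tag SNP a useful proxy: genotyping the tag then carries nearly the same information as genotyping the untyped variant.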
With the rapid increase in the number of identified SNPs, quick and efficient genotyping technology was needed. This became feasible with the advent of high-throughput SNP-based multiplex systems. In one such approach, SNP microarray platforms called “chips” are made from small pieces of glass to which a large number of synthetic, single-stranded DNA oligonucleotide probes are chemically attached, which selectively bind to complementary DNA molecules (adenine to thymine, cytosine to guanine). Since each SNP commonly has two alternative alleles, each SNP will have two probes represented on the chip, one that binds to DNA with the major allele and one with the minor allele (although there can technically be any of four nucleotide variants at a site, only two probes per nucleotide, representing the most common variants, are usually placed on the chip). A fluorescent marker is attached to each probe, which lights up when DNA binds to it. This is detected by a microarray reader that examines both the color and the precise location of the fluorescent signal on the microarray. Rapid advances in technology have allowed the development of chips with an ever-expanding number of SNPs. The costs of the chips have been steadily dropping, which is allowing investigators to genotype an increasing number of subjects.
The aforementioned advances have allowed the introduction of “genome wide” analysis in which researchers search for phenotype-genotype associations across the entire genome, using one of two strategies: linkage analysis or association studies.
In linkage analysis, genetic markers are sequenced in members of large families to identify regions that are associated with a disease or trait of interest more often than would be expected by chance. These regions of DNA can be quite large and numerous, so further testing is often needed to identify the specific causative genes. Nonetheless, in contrast to the candidate gene approach described above which required a priori knowledge of the relationship of the gene to the trait or disease process, linkage scans were the first mapping technique capable of assessing the entire genome, thereby allowing evaluation of all SNPs in an unbiased or “agnostic” fashion (Riley et al. 2000). Linkage analysis has been credited with the discovery of many important genetic relationships (Wang et al. 2003, Helgadottir et al. 2005) and is still useful in situations where large family units are being studied (Hodge 1993). The limitations of this approach include limited precision, dependence on family structure, and difficulty in identifying genes with small effect sizes (Lusis 2003).
Association studies consist of examining the correlation of a phenotype (a disease or a quantitative trait) with a genotype. When this is performed for a large number of SNPs across the genome, it is termed a genome-wide association study (GWAS) (Hirschhorn and Daly 2005). Unlike linkage analysis, in GWAS large family units are not required (in fact, excessive family structure can bias the results and must be corrected for); and as a corollary, large numbers of individuals can be studied, which improves the likelihood that SNPs with modest effect size can be identified.
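For a quantitative trait, the single-SNP test at the heart of a GWAS is often an additive regression of the trait on the number of copies of one allele (0, 1 or 2) carried by each individual. The following is a minimal sketch under simplifying assumptions: unrelated individuals, no covariates, and ordinary least squares. It is not the method used in the SardiNIA analyses, which must account for family structure; the function name and data are hypothetical.

```python
import math


def additive_association(dosages, trait):
    """Regress a quantitative trait on allele count (0, 1 or 2).

    Returns (beta, standard_error, t_statistic), where beta is the
    estimated change in the trait per additional copy of the allele.
    """
    n = len(dosages)
    if n != len(trait) or n < 3:
        raise ValueError("need matched inputs with at least 3 individuals")
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))

    beta = sxy / sxx
    intercept = mean_y - beta * mean_x

    # Residual variance with n - 2 degrees of freedom.
    rss = sum((y - (intercept + beta * x)) ** 2 for x, y in zip(dosages, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se if se > 0 else math.inf
    return beta, se, t_stat


# Hypothetical data: allele counts and a trait value for six individuals.
allele_counts = [0, 0, 1, 1, 2, 2]
trait_values = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
beta, se, t_stat = additive_association(allele_counts, trait_values)
```

A genome-wide scan repeats this test for every genotyped SNP, which is why the multiple-testing burden discussed later becomes important.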
The SardiNIA Study recruited individuals from a circumscribed region on the island of Sardinia, Italy, who constitute a genetically isolated founder population by virtue of their geographic isolation and ethnic homogeneity. Founder populations are rare populations that arise from a delimited group living in a defined region for many centuries (founders) with minimal admixture from outside populations and reduced environmental variance. The high degree of relatedness in these populations and their environmental homogeneity make it likely that one or a few genes predominate in the causation of a complex phenotype. Thus the study of such populations facilitates detecting and refining the localization of the loci responsible for complex traits, if detailed genealogical information is available and extended multigenerational pedigrees are reconstructed (Varilo et al. 2000). In the SardiNIA study, 6,148 men and women over the age of 14 were recruited from a cluster of four towns in the Lanusei Valley in the Ogliastra province of the island, which has a total population of 11,000 (Pilia et al. 2006). These subjects underwent extensive phenotypic characterization of 98 quantitative traits, including cardiovascular risk factors (LDL, HDL, blood pressure, heart rate, CRP, etc.), anthropometrics (height, weight, waist circumference, BMI, etc), and arterial structure/function (pulse wave velocity, intimal medial thickness, etc).
As a first step, heritability analyses were performed for all the quantitative traits measured in SardiNIA (Pilia et al. 2006, Chen and Abecasis 2007). Heritability analyses evaluate the proportion of phenotypic variance that is shared among related individuals and estimate the contributions of genetic factors (Susser 1985). In the SardiNIA study, it was estimated that genetic effects explained 25% of the variance for 20 measures of cardiovascular risk factors (including systolic blood pressure, diastolic blood pressure, pulse wave velocity, and intimal medial thickness), 51% for 5 anthropometric measures (including weight, waist circumference, body mass index, and height), and 40% for 38 blood tests (including C-reactive protein, LDL, HDL, triglycerides, uric acid) (Pilia et al. 2006).
In SardiNIA, genotyping was performed with the Affymetrix 500K and 10K GeneChip arrays. GWAS in this population have yielded insights into the genetic basis of such complex traits as obesity (Scuteri et al. 2007), uric acid (Li et al. 2007), glucose (Chen et al. 2008), lipids (Willer et al. 2008), height (Sanna et al. 2008), and fetal hemoglobin (Uda et al. 2008). A major aim of the SardiNIA study remains the genetic investigation of aging-related traits. With few recent exceptions (e.g. Tarasov et al. 2009) these analyses are still ongoing, particularly in the context of meta-analyses in large multinational consortia. In the following sections, we will briefly summarize the SardiNIA GWAS experience for five separate CAD risk factors, to illustrate some of the strengths and limitations of GWAS.
Lipoproteins (LDL, HDL, and triglycerides) have a well-established role in CAD (Grundy et al. 2004, Rosamond et al. 2008). While smoking, diet, physical activity, and other environmental factors affect the lipid levels of individual patients, family studies suggest that about half of the overall variation in these levels is genetically determined (Pollin et al. 2004, Pilia et al. 2006). Previous genetic studies using candidate genes and linkage analysis identified several genes with rare mutations that have large effects on lipid levels; e.g. LDL receptor (Goldstein et al. 1974), apolipoprotein B-100 (Vega and Grundy 1986). However, the vast majority of individuals afflicted by hypercholesterolemia do not have any of these rare mutations. Recently, several groups, including the Framingham Heart Study (Kathiresan et al. 2007), the SardiNIA/DGI/FUSION Consortium (Willer et al. 2008), and the DGI group (Kathiresan et al. 2008) conducted GWAS that identified SNPs which are associated with lipid levels across the continuum of cholesterol levels.
However, only a subset of these SNPs are believed to be functionally significant, i.e. they are located either in coding regions of genes known or suspected of influencing lipid levels, or in regions known to regulate these genes. Other SNPs were noted to be mapping SNPs in candidate genes, i.e. they are acting as simple markers, because they are non-coding SNPs located within or near genes that are known or suspected of influencing lipid levels. Lastly, some SNPs were found to be in or near newly discovered genes, or genes that were not previously known or suspected of influencing lipid levels.
An example of a SNP that may be functionally significant is rs780094, which is strongly associated with triglyceride levels (p=6.1×10⁻³²) (Willer et al. 2008) and is located within a coding region of the glucokinase regulatory (GCKR) gene. It may, therefore, affect the level of glucose conversion to triglycerides by affecting the phosphorylation of glucose to glucose 6-phosphate in the liver by glucokinase leading to the formation of glycogen. Whether rs780094 is itself the causative genetic variant responsible for the association with triglyceride levels awaits additional studies that will evaluate the specific effects of this particular SNP on the activity of the GCKR gene and on triglyceride levels.
An example of a “mapping” SNP in a candidate gene is one in the cholesterol ester transfer protein (CETP) gene (rs3764261), which is strongly associated with HDL levels (p<2.3 × 10⁻⁵⁷) (Willer et al. 2008). Although CETP is well known to play a role in cholesterol ester transport, it is not likely that rs3764261 is the causative variant in the CETP gene that is responsible for the association with HDL level, because this SNP is located 2 kilobases upstream of the actual gene. It is possible that this SNP is located within an enhancer or other regulatory region acting from a distance on the CETP gene. However, it is more likely that rs3764261 is only a marker for an as-yet undiscovered SNP that shares the same haplotype block as rs3764261 within the CETP gene.
Examples of SNPs in new unexpected genes are rs599839 (Willer et al. 2008) and rs646776 (Kathiresan et al. 2008), which are associated with LDL levels (p=6.11 × 10⁻³³ and p=3×10⁻²⁹ respectively). Both of these SNPs are located in a genetic region that was not previously known to be involved in lipid metabolism, specifically in the CELSR2-PSRC1-SORT1 locus on chromosome 1. Importantly, SNPs at this same locus were enriched in a cohort of subjects with known CAD compared to controls. This exciting finding suggests that further elucidation of the relationship between this locus and CAD may provide a potentially novel target for the prevention or treatment of CAD.
Obesity is a cardiovascular risk factor whose prevalence is increasing at such an alarming rate that it is feared it may negate the significant strides made in decreasing cardiovascular morbidity and mortality (Rosamond et al. 2008). Although obesity is thought to be related in large part to environmental factors, it is believed to also have a genetic component, as studies have shown that 60–70% of the variation in obesity-related phenotypes is heritable (Comuzzie and Allison 1998). For example, three independent groups conducting GWAS (Frayling et al. 2007, Dina et al. 2007, Scuteri et al. 2007) found a very robust association between obesity and the FTO gene. In the Sardinian population, a cluster of SNPs in several genes (Figure 2) were found to be associated with multiple markers of obesity, with SNPs in FTO exhibiting the most robust relationships [association of rs9930506 with BMI p = 8.6 × 10⁻⁷, with hip circumference p = 3.4 × 10⁻⁸, and with weight p = 9.1 × 10⁻⁷]. This example reinforces the importance of GWAS in uncovering previously unsuspected genes that underlie traits, but also illustrates one of the main limitations of GWAS, namely that the function of the uncovered gene (FTO) is currently unknown. Thus additional experiments are needed to delineate the function of this gene and to determine how it influences obesity, in order to translate this newfound knowledge into clinically useful information. This is no easy task, as illustrated in the next example.
Elevated serum uric acid (UA) is an independent risk factor for cardiovascular diseases (CVD) in patients with hypertension, heart failure, or diabetes (Alderman and Aiyer 2004), even though the mechanisms underlying this relationship have remained elusive. In the SardiNIA study a GWAS found several SNPs in the GLUT9 gene to be associated with levels of uric acid (Li et al. 2007), including rs6855911 (p=1.8×10⁻¹⁶) and rs7442295 (p=2.57×10⁻¹²). While GLUT9 was known to function as a glucose transporter and to be expressed highly in the kidney, the link to uric acid was unexpected. A series of genetic experiments were undertaken to elucidate the role of these SNPs in connecting GLUT9 to uric acid levels. At the outset, sequence analysis of the coding regions as well as the exon/intron boundaries revealed that rs6855911 and rs7442295 were located in introns and were therefore unlikely to be independently responsible for changes in uric acid levels. Instead, they were suspected to be marker SNPs in high linkage disequilibrium with the culprit SNP, which, in turn would be located in an exonic/coding region. However, sequencing of the GLUT9 gene and putative promoter regions in a subset of the SardiNIA cohort has thus far not revealed the coding SNP that is associated with UA levels. Thus additional studies are needed that would explore novel paradigms. For example, some studies have postulated that the changes in UA levels may occur as a result of alternative splicing that modifies fructose transport which is known to stimulate increased hepatic production of uric acid (Doring et al. 2008, Vitart et al. 2008).
Elevated fasting glucose is an early diagnostic sign heralding the presence or impending arrival of diabetes mellitus, and is a risk factor for coronary artery disease. In GWAS of fasting glucose levels in non-diabetic members of the SardiNIA and FUSION studies, a strong association (p = 5.3 × 10⁻⁹) was found with SNP rs563694 (Chen et al. 2008). This association was strengthened further (p = 6.4 × 10⁻³³) in a large meta-analysis that included 24,046 subjects from the SardiNIA, FUSION, METSIM, Caerphilly, BWWHS, and Inter99 studies (Chen et al. 2008). In spite of this very robust statistical association, it is important to point out that this SNP only accounts for approximately 1% of the total variation in fasting glucose. Each copy of the major allele of this SNP is associated with approximately 0.01–0.16 mM higher fasting glucose levels. The majority of published GWAS to date have identified SNPs that only exert a modest effect on the trait of interest.
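The gap between a very small p-value and a small share of trait variance can be illustrated with a standard approximation: under an additive model with Hardy-Weinberg genotype proportions, a SNP with allele frequency p and per-allele effect β contributes a variance of 2p(1 − p)β² to the trait. The sketch below applies that formula. The allele frequency, effect size and trait standard deviation are hypothetical values chosen only for illustration, not estimates from Chen et al. (2008).

```python
def variance_explained(allele_freq, beta, trait_sd):
    """Fraction of trait variance attributable to one additive SNP.

    Assumes Hardy-Weinberg proportions and a purely additive effect:
    genetic variance = 2 * p * (1 - p) * beta**2.
    """
    if not 0.0 < allele_freq < 1.0:
        raise ValueError("allele frequency must lie strictly between 0 and 1")
    if trait_sd <= 0:
        raise ValueError("trait standard deviation must be positive")
    genetic_var = 2.0 * allele_freq * (1.0 - allele_freq) * beta ** 2
    return genetic_var / trait_sd ** 2


# Hypothetical inputs (assumptions, not published estimates):
#   allele frequency 0.35, effect of 0.065 mM per allele copy,
#   trait standard deviation 0.5 mM.
share = variance_explained(allele_freq=0.35, beta=0.065, trait_sd=0.5)
```

With inputs of this order, the SNP accounts for less than 1% of the trait variance, even though an effect of that size can be detected with great statistical confidence once tens of thousands of subjects are pooled.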
Central arterial stiffening is one of the hallmarks of arterial aging. Carotid-femoral pulse wave velocity (PWV) is the preferred non-invasive measure of central arterial stiffness (Laurent et al. 2006). PWV is increased in patients with hypertension, diabetes, metabolic syndrome, and atherosclerosis. Furthermore, PWV is an independent predictor (Sutton-Tyrrell et al. 2005) of mortality and of incident cardiovascular disease. A GWAS in SardiNIA identified the non-synonymous SNP rs3742207 in the COL4A1 gene on chromosome 13 to be significantly associated with PWV, and this was successfully replicated in an independent population of European ancestry, the Old Order Amish, leading to an overall p=5.16×10⁻⁸ (Tarasov et al. 2009). Interestingly, unlike collagen types 1 and 3, which are found abundantly in the tunica media of the arterial wall and are important determinants of the tensile strength of the artery, collagen type 4 is a major structural component of the intimal basement membrane. If the association of rs3742207 with PWV is replicated in other studies, it suggests that previously unrecognized cell-matrix interactions may exert an important, and genetically-based, role in regulating arterial stiffness.
While the power of GWAS to identify common genetic variants with small individual effect-sizes has provided a wealth of new information, there are several issues that are worthy of note. First, while large GWAS have provided the statistical power needed to identify the common genetic variants involved in several complex traits, the individual effect of each of these variants is often quite small. Common diseases with a genetic basis usually do not become manifest without the presence and/or interaction of multiple genetic variants as well as the modulating influence of environmental factors. Thus, knowledge of an individual SNP usually cannot predict whether disease will ensue; rather, it provides information on the predisposition to disease. Second, statistical methods are only now beginning to address SNP-SNP interactions, which are thought to underlie many diseases and traits. While most researchers are eager to look at the interaction of multiple genes in causing disease (i.e. a minor allele for a single SNP may be harmless but a combination of minor alleles at two or more locations or a specific combination of minor/major alleles may lead to disease), the statistical tools and algorithms that are needed are still in the early stages of development (Kotti et al. 2007). Third, in many instances findings from initial linkage analysis failed to replicate in subsequent studies in other populations. The reasons for this are multifactorial, and likely include differing genetic pedigrees, linkage disequilibrium, weak associations, or simple chance (Abecasis et al. 2005). Nonetheless, this experience, coupled with the known risks for false positive findings in GWAS due to multiple testing, has led to a growing insistence on replication of SNP association findings in external populations prior to publication.
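The multiple-testing risk mentioned above is often handled with a Bonferroni correction, in which the conventional significance level is divided by the number of SNPs tested. The sketch below is a simplified illustration: the SNP names and p-values are hypothetical, the count of 500,000 tests is an assumption loosely based on the size of a 500K array, and the correction is conservative because it treats SNPs in LD as if they were independent tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def genome_wide_hits(p_values, alpha=0.05, n_tests=None):
    """Return the SNPs whose p-value passes the corrected threshold.

    `p_values` maps SNP identifiers to p-values. If `n_tests` is not
    given, the number of supplied p-values is used.
    """
    n = n_tests if n_tests is not None else len(p_values)
    threshold = bonferroni_threshold(alpha, n)
    return {snp: p for snp, p in p_values.items() if p < threshold}


# Assumed number of SNPs tested (illustrative, not a SardiNIA figure).
N_SNPS_TESTED = 500_000

# Hypothetical association results for three SNPs.
example_p = {"snp_a": 3.0e-9, "snp_b": 4.0e-6, "snp_c": 0.03}

threshold = bonferroni_threshold(0.05, N_SNPS_TESTED)
hits = genome_wide_hits(example_p, n_tests=N_SNPS_TESTED)
```

Under these assumptions the threshold is about 1 × 10⁻⁷, so a nominal p-value of 0.03, or even 4 × 10⁻⁶, would not be taken as evidence of association without replication.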
Currently, we are witnessing a dramatic increase in the number of GWAS being conducted, and in the size of the microarrays that are being utilized. Also, given the modest effect size of individual SNPs, it is becoming increasingly evident that ever larger study populations are needed to attain the statistical power needed to conduct the GWAS of many common diseases and traits. Investigators in the field have responded by forming collaborative consortia composed of many individual study cohorts. Furthermore, funding institutions such as NIH are proactively requesting that groups rapidly make their GWAS data publicly available so it can be used most effectively: (http://www.ncbi.nlm.nih.gov/entrez/query/Gap/gap_tmpl/about.html). As these results become increasingly available to all investigators, it is hoped that the full potential of the promises of genetic studies can be harnessed and used towards preventing or alleviating the burden of cardiovascular diseases and their risk factors.
We thank Monsignore Piseddu, Bishop of Ogliastra; the Mayors of Lanusei, Ilbono, Arzana, and Elini; the head of the local Public Health Unit ASL4; and the residents of the towns for volunteering and for their cooperation. In addition, we are grateful to the Mayor and the administration in Lanusei for providing and furnishing the clinic site. We thank the team of physicians and nurses who carried out the physical examinations, and the recruitment personnel who enrolled the volunteers.
The SardiNIA (“ProgeNIA”) team was supported by Contract NO1-AG-1-2109 from the National Institute on Aging.
This research was supported (in part) by the Intramural Research Program of the NIH, National Institute on Aging.