Cardiovascular disease encompasses a range of conditions extending from myocardial infarction to congenital heart disease, most of which are heritable. Enormous effort has been invested in understanding the genes and specific DNA sequence variants responsible for this heritability. Here, we review the lessons learned for monogenic and common, complex forms of cardiovascular disease. We also discuss key challenges that remain for gene discovery and for moving from genomic localization to mechanistic insights, with an emphasis on the impact of next-generation sequencing and the use of pluripotent human cells to understand the mechanism by which genetic variation contributes to disease.
Cardiovascular disease (CVD) is a leading health problem, affecting over 80,000,000 individuals in the United States alone. CVD encompasses a broad range of disorders including diseases of the vasculature, the myocardium, the heart’s electrical circuit, and congenital heart disease (Roger et al., 2012). For nearly all of these disorders, inherited DNA sequence variants play a role in conferring risk for disease. For example, in the general population, a history of premature atherosclerotic CVD in a parent confers ~3.0-fold increase in CVD risk to offspring (Lloyd-Jones et al., 2004). The precise magnitude of the role of inheritance, however, varies by disease and by other factors such as age of disease onset and subtype of disease.
Over the past century, a key goal of biomedical research has been to correlate genotype with phenotype, i.e., to identify the specific genes and DNA sequence variants responsible for trait variation in humans. What is the principal reason to pursue this goal? Naturally occurring genetic variation has the unique potential to reveal causal biologic mechanisms in humans. This is particularly important as some diseases like myocardial infarction (MI) are poorly modeled in non-human species.
In this review, we consider the approaches used to discover genes for human CVD, the lessons learned from the study of Mendelian and of common, complex forms of CVD, and take a look forward at research driven by next generation techniques, including sequencing and modeling human genetic disease in cells.
To discover genes for CVD and its risk factors in humans, two major approaches - linkage analysis and genetic association - have been utilized. The choice of approach has depended on the pattern of segregation, whether consistent with the ratios described by Mendel or more complex. Some forms of CVD exhibit a simple pattern of inheritance suggestive of a single causal gene that confers a large effect on phenotype. For many of these Mendelian forms of CVD, direct DNA sequencing and/or linkage analysis has successfully yielded the causal gene and mutation. In 1985, Lehrman and colleagues directly sequenced the low-density lipoprotein receptor (LDLR) gene in a patient with homozygous familial hypercholesterolemia and uncovered a 5 kilobase deletion that eliminated several exons, representing the first demonstration of a mutation for Mendelian CVD (Lehrman et al., 1985). In 1989, linkage analysis was used to localize the chromosomal position of a causal gene for hypertrophic cardiomyopathy and in the subsequent year, mutations in the beta cardiac myosin heavy chain were discovered as causal for the phenotype (Geisterfer-Lowrance et al., 1990; Jarcho et al., 1989). Other prominent examples in the CVD field include long QT syndrome, severe hypercholesterolemia, Mendelian forms of hypertension, Marfan’s syndrome, and several forms of congenital heart disease including septal defects and valve defects (Abifadel et al., 2003; Basson et al., 1997; Berge et al., 2000; Curran et al., 1995; Dietz et al., 1991; Garcia et al., 2001; Garg et al., 2003; Garg et al., 2005; Lifton et al., 2001; Schott et al., 1998; Soria et al., 1989; Tartaglia et al., 2001).
However, most CVD traits such as MI or concentrations of plasma LDL cholesterol show complex inheritance, suggestive of an interplay between multiple genes and non-genetic factors. Mapping gene loci associated with complex traits requires substantial levels of information and analysis, but since 2007 approaches to accomplish this goal have matured, and genetic mapping for complex traits in humans has become a reality. The intellectual foundations that enabled a systematic genome-wide screen of common variants (termed genome-wide association study or GWAS) and results from this approach were recently reviewed (Altshuler et al., 2008; O’Donnell and Nabel, 2011). The tools and methods included catalogs of polymorphisms, techniques to genotype these DNA sequence variants, and the analytical framework to distinguish true association signal from false positives. The initial focus has been on utilizing common DNA sequence variants (variants with allele frequency > 1:20) as a discovery tool, largely because it was practical to do so; recent advances in DNA sequencing and genotyping technology will allow interrogation of less frequent variants and will be considered below. The National Human Genome Research Institute hosts a catalog of published GWASs and as of January 14, 2012, the catalog includes 1143 publications and 5585 single nucleotide polymorphisms (SNPs) with association evidence at P < 10^−5 (Hindorff et al., 2009).
What have we learned from these gene discovery efforts? Below, we discuss lessons that have emerged from genotype to phenotype correlation studies for Mendelian diseases followed by lessons from studies of common, complex diseases.
Since Mendelian diseases are rare in the population, there was initial skepticism about whether the genes and mechanisms that cause these diseases would inform our understanding of common forms of CVD. Linkage studies involve identifying and recruiting families with unique, often severe phenotypes, isolating a chromosomal segment that tracks with disease status in the family, and then pinpointing the causal gene and mutation in the linked segment. For a range of conditions, the genes identified by these linkage studies have transformed our understanding of CVD. Selected examples of Mendelian diseases, the responsible genes, and the gleaned biological and clinical insights are detailed in Table 1. Of particular significance is monogenic severe hypercholesterolemia where the six responsible genes have led to fundamental new biological concepts and supported the development of new therapies.
Although there are cases where single gene mutations lead to straightforward genotype-phenotype associations, other, more complex relationships exist as well. This complexity can arise from three distinct genetic phenomena: pleiotropy, penetrance, and expressivity. Sometimes, mutations in a single gene can influence multiple phenotypic traits (i.e., pleiotropy). In 1995, Wang, Keating and colleagues identified the alpha subunit of the type V voltage-gated sodium channel (SCN5A) as the cause of inherited long QT syndrome type 3 (Wang et al., 1995). Since then, mutations in the same gene have been demonstrated to cause Brugada syndrome (right precordial ST-segment elevation and increased risk for ventricular arrhythmias), cardiac conduction system disease, and dilated cardiomyopathy (Chen et al., 1998; McNair et al., 2004; Schott et al., 1999). This range of disease phenotypes may reflect the underlying functionality of the channel.
Among carriers of a Mendelian mutation in a given family, some may exhibit the condition and others may not. Penetrance refers to the proportion of individuals with a given genotype who exhibit the phenotype associated with the genotype. In many Mendelian cardiovascular conditions inherited in an autosomal dominant manner, there is evidence for incomplete penetrance. For example, Hobbs and colleagues reported that in a pedigree with familial hypercholesterolemia due to a point mutation in LDLR, only 12 of 18 heterozygotes had high LDL cholesterol (>95th percentile) whereas some of the remaining 6 had LDL cholesterol as low as 28th percentile for the population (Hobbs et al., 1989). The lack of a high cholesterol phenotype given the same genotype may be due to modifier genes or environmental influences.
Individuals with the same Mendelian genotype can also show different degrees of the same phenotype. Expressivity is the degree to which trait expression differs among individuals. Marfan’s syndrome is a multi-system Mendelian disorder that can include a range of signs and symptoms involving the skeletal system (pectus excavatum, increased arm span to height ratio, craniofacial alterations), ocular system (eye lens dislocation, flat cornea), and cardiovascular system (aortic aneurysm, dissection of the ascending aorta, mitral valve prolapse), among others (Canadas et al., 2010). Dietz and colleagues identified mutations in the FBN1 gene encoding the extracellular matrix protein fibrillin 1 as responsible for Marfan’s syndrome (Dietz et al., 1991). When a specific mutation in the fibrillin 1 (FBN1) gene causes Marfan’s syndrome in a family, carriers of the same mutation can display variable clinical manifestations (Faivre et al., 2007).
Pleiotropy, penetrance, expressivity, and non-genetic factors conspire to ensure that even in a single gene disorder, genotype does not “equal” a specific phenotype. This complexity has several consequences. First, gene discovery is more difficult as genotype may not segregate perfectly with phenotype, thereby reducing the power of linkage. Second, there is intense interest in identifying modifiers - genetic or environmental - that may modulate the relationship between genotype and phenotype. Finally, because of this complexity, in many Mendelian diseases, it has been difficult to develop genotype-specific prognostic or treatment recommendations.
For Marfan’s syndrome, the path from the discovery of FBN1 as the causal gene to a breakthrough in the molecular understanding of the disease has spanned more than two decades. Historically, Marfan’s syndrome had been viewed as a structural disease due to a defect in elastic fibers (Lindsay and Dietz, 2011). The identification of mutations in an extracellular matrix protein seemed to confirm this view. However, more recent studies suggest that microfibrils normally bind the large latent complex of the cytokine transforming growth factor β (TGF-β) and that failure of this event to occur results in increased TGF-β activation and signaling. Now, investigators are exploring the hypothesis that blocking TGF-β signaling will ameliorate the growth of aortic aneurysms in Marfan’s syndrome. For further examples of therapeutic approaches derived from the study of Mendelian disorders, we refer the reader to a recent review on this topic (Dietz, 2010).
Variants associated with common, complex traits range in frequency from common (>1:20 frequency), to low-frequency (1:1000 to 1:20), to very rare (< 1:1000). In other words, genetic heterogeneity from variants across the frequency spectrum may be the rule. Consider the example of plasma triglycerides, a phenotype that marks triglyceride-rich lipoproteins including very-low density lipoprotein particles, chylomicrons, and remnant products of their metabolism. Roughly 50% of the inter-individual variability in plasma triglycerides is estimated to be on the basis of DNA sequence variants. Johansen, Hegele, and colleagues studied individuals from the extremes of the plasma triglyceride distribution (438 individuals with high triglycerides (mean triglycerides = 14.2 mmol/l) and 327 individuals with low triglycerides (mean triglycerides = 1.2 mmol/l)) (Johansen et al., 2010) using both GWAS and resequencing of selected genes. In the GWAS, common variants at seven loci were associated with plasma triglycerides, and in the re-sequencing study, there was an excess of rare, non-synonymous variants across four genes in individuals with high triglycerides when compared with those with low triglycerides. A comprehensive logistic regression model including clinical variables and both common and rare genetic variants explained 42% of total variation in hypertriglyceridemia diagnosis: clinical variables explained 20%, common genetic variants in seven loci explained 21%, and rare genetic variants in four loci explained 1%. The genetic architecture for triglycerides in the population appears to be that of a mosaic comprised of large-effect variants rare in frequency, small-effect variants common in frequency, and environmental influences.
More generally, the concept of a mosaic model is supported by the fact that for many cardiovascular traits and diseases, there is strong overlap between the genes mapped using GWAS and those identified earlier through Mendelian families. Nineteen genes have been identified as monogenic causes of extremely low or high levels of LDL cholesterol, high-density lipoprotein (HDL) cholesterol, and triglycerides; loci harboring 16 of these genes were also mapped using GWAS (Figure 1) (Teslovich et al., 2010). Rare mutations in FBN1 cause the thoracic aortic aneurysms and dissections seen in Marfan’s syndrome, whereas common SNPs in the introns of FBN1 are the top association result in a GWAS for spontaneous, non-syndromic thoracic aortic aneurysm and dissection (Lemaire et al., 2011). Rare mutations in SCN5A, KCNQ1, KCNH2, KCNE1, and KCNJ2 cause monogenic long QT syndrome, whereas common SNPs in these five genes are associated with QT interval measured on electrocardiograms in the population (Newton-Cheh et al., 2009).
Plasma lipids, platelets, and sickle cell disease represent three fields where there has been progress toward new biology based on GWAS. GWASs for plasma LDL cholesterol, HDL cholesterol and triglycerides have evaluated >100,000 participants and mapped 95 distinct loci associated with at least one of these traits at a stringent statistical threshold (P < 5 × 10^−8) (Kathiresan et al., 2008; Kathiresan et al., 2009b; Pollin et al., 2008; Teslovich et al., 2010; Willer et al., 2008). Approximately one-third of the loci harbored genes previously appreciated to play a role in lipoprotein metabolism including five targets of lipid-modifying therapies: HMGCR (statins), NPC1L1 (ezetimibe), APOB (mipomersen), CETP (anacetrapib, dalcetrapib, and evacetrapib), and PCSK9 (therapies in development by several pharmaceutical companies) (Figure 1).
Of note, the proportion of overall phenotypic variance explained by a genetic variant may have little correlation with the ultimate therapeutic or biological value of the gene mapped by the variant. Phenotypic variance explained by a variant is a function of two key parameters: allele frequency and effect size. For Mendelian diseases, the causal variants typically confer large effects but explain a small proportion of trait variance due to their rare frequencies. Variants from GWAS are common but explain a small proportion of trait variance due to modest effects. Nevertheless, variants that explain a small proportion of phenotypic variance may provide substantial biological or therapeutic insights. This has been highlighted for Mendelian genes in Table 1. Two examples for common SNPs include variants in the introns of HMGCR and NPC1L1 (Teslovich et al., 2010). These SNPs confer a small effect on plasma LDL cholesterol at 3 mg/dl and 2 mg/dl, respectively; however, targeting of these genes with statins or ezetimibe, respectively, has a much more dramatic effect on LDL cholesterol. And to date, there have been no rare, large-effect Mendelian mutations described in HMGCR, presumably because such mutations are highly deleterious and not tolerated. Thus, some disease genes may only be discoverable through common, small-effect variants.
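The dependence of variance explained on allele frequency and effect size can be made concrete with a small calculation. The sketch below assumes a biallelic variant with a purely additive effect and genotypes in Hardy-Weinberg proportions, under which the variant contributes 2p(1 − p)β² to trait variance. The function name, the trait standard deviation of 35 mg/dl, and the two example variants are illustrative assumptions, not values taken from the studies cited above.

```python
def variance_explained(allele_freq, effect_per_allele, trait_sd):
    """Fraction of trait variance attributable to one additive biallelic variant.

    Assumes Hardy-Weinberg proportions and a purely additive effect, so the
    variant contributes 2 * p * (1 - p) * beta**2 to the trait variance.
    """
    genetic_variance = 2.0 * allele_freq * (1.0 - allele_freq) * effect_per_allele ** 2
    return genetic_variance / trait_sd ** 2


# Hypothetical inputs (assumptions for illustration only).
TRAIT_SD = 35.0  # assumed population SD of plasma LDL cholesterol, mg/dl

# A rare allele with a large effect versus a common allele with a small effect.
rare_large = variance_explained(allele_freq=0.0005, effect_per_allele=100.0, trait_sd=TRAIT_SD)
common_small = variance_explained(allele_freq=0.40, effect_per_allele=3.0, trait_sd=TRAIT_SD)

print(f"rare, large-effect variant:   {rare_large:.2%} of variance")
print(f"common, small-effect variant: {common_small:.2%} of variance")
```

Under these assumed inputs, both variants account for less than 1% of trait variance, one because the allele is rare and the other because its effect is small, which is why variance explained says little about the biological importance of the gene.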
Approximately two-thirds of the 95 loci discovered for plasma lipid traits harbored genes not previously appreciated to play a role in the biology of lipoproteins. Several genes at the “novel” loci have now been manipulated in the mouse and this manipulation led to plasma lipid changes analogous to that suggested by the human genetics. An example is the 8q24 locus containing the tribbles homolog 1 (TRIB1) gene (Burkhardt et al., 2010). DNA sequence variants downstream of the TRIB1 gene were initially associated with plasma lipids, with the minor allele having association with lower plasma triglycerides, lower plasma LDL cholesterol, and higher plasma HDL cholesterol. Given this pattern of plasma lipid changes, minor allele carriers would be expected to have lower risk for coronary heart disease and several groups have confirmed this expectation (Varbo et al., 2011). Targeted deletion of Trib1 in mice led to elevated levels of plasma triglycerides and cholesterol, whereas liver-specific overexpression of Trib1 reduced levels of plasma triglycerides and cholesterol. SORT1, GALNT2, PPP1R3B and TTC39B are four other genes implicated in lipoprotein regulation by GWAS and confirmed in a similar manner using mouse models (Musunuru et al., 2010b; Teslovich et al., 2010).
Analogous to plasma lipids, well-powered GWAS and experimental follow-up in cells and model organisms have identified new regulators of thrombopoiesis and/or erythropoiesis. A GWAS for blood platelet count and volume mapped 58 loci (Gieger et al., 2011). For genes at 11 mapped loci, gene silencing in either the zebrafish or Drosophila led to alterations in thrombopoiesis and/or erythropoiesis. The precise mechanisms by which these genes alter blood cell formation remain to be defined through additional studies; however, we now have a plethora of new genes for experimental follow-up, all with a foundation of relevance to human phenotypes.
The identification of BCL11A as a transcriptional regulator of fetal hemoglobin synthesis and as a potential therapeutic target for sickle cell disease represents more biology learned from GWAS. Sickle cell disease results from the substitution of a valine for glutamic acid in the beta-globin chain of adult hemoglobin. The mutated hemoglobin (HbS) undergoes conformational change and polymerization upon deoxygenation, leading to hemolysis, red blood cell deformation, and pathology due to microvascular occlusion. Hereditary persistence of fetal hemoglobin (HbF) decreases the severity of sickle cell disease and the level of HbF in adults is inherited as a quantitative trait. As a result, induction of HbF in adults has been a long-standing goal of therapies for sickle cell disease. In 2007, GWAS for the HbF phenotype revealed a single strong locus, with sequence variants in the intron of a transcription factor, B-cell CLL/lymphoma 11A (BCL11A) (Menzel et al., 2007). A series of elegant studies led by Orkin and colleagues has now established that BCL11A represses HbF expression in erythroid cells (Sankaran et al., 2008; Sankaran et al., 2009; Xu et al., 2010) and demonstrated that inactivation of BCL11A in a mouse model of sickle cell disease leads to pancellular induction of HbF and corrects the hematologic and pathologic defects of the disease (Xu et al., 2011).
Despite these successes, for the vast majority of mapped loci, it has been a challenge to move from genomic localization to biologic mechanism. Why? First, as discussed earlier in the context of discovery of Mendelian disease genes, inferring new biology from human genetics takes time and only about five years have elapsed since the initial GWAS publications. Second, gene mapping and experimental follow-up require unique skill sets and expertise. In the examples highlighted above, collaborations between researchers focused on human genetics and experimentalists have been crucial to making progress, yet skepticism regarding the validity and value of discoveries made from GWAS may have dampened the enthusiasm for such collaborations (McClellan and King, 2010).
Third, genetic mapping by association gives us gene regions and not necessarily specific causal variants or causal genes. For each SNP and locus mapped by GWAS, we can generally conclude that within ~100,000 bases of the locus, there exists a causal gene. Ideally, at each discovered locus, we need to understand: 1) the causal variant; 2) the causal gene; 3) the mechanism by which the variant affects the gene; and 4) the mechanism by which the gene affects the phenotype. At the 1p13 locus for LDL cholesterol and MI, fine-mapping, re-sequencing and manipulation of positional candidate genes in cell culture and model organisms have addressed these key questions (Musunuru et al., 2010b). However, at most loci, the answers remain a mystery.
Several features of genetic mapping using common variants have contributed to the difficulties. Local correlation (linkage disequilibrium) and weak effect size of common variants have made it difficult to identify causal variants. Mapped SNP variants have weak or modest effects (odds ratios of 1.05 to 1.40 for a dichotomous trait and <1% of variance explained for a continuous trait) and these weak effects mean that in order to statistically distinguish between two correlated variants, the sample sizes need to be inordinately large. Nearly all of the mapped SNP variants are non-coding and our ability to interpret the non-coding portion of the genome remains limited. Mechanisms by which a non-coding variant may affect a nearby gene include affecting a transcription factor or microRNA binding site and regulating the local chromatin state, among other possibilities.
In only a few instances has there been direct demonstration that a specific non-coding GWAS SNP affects the relevant gene through one of these mechanisms (Musunuru et al., 2010b). The example of chromosome 9p21 and risk for MI exemplifies the challenge. In 2007, several independent GWASs identified SNPs on chromosome 9p21 as associated with MI or coronary artery disease, with ~50% of the population carrying a risk allele and each copy of the risk allele conferring ~29% increase in risk for MI/coronary artery disease (Helgadottir et al., 2007; McPherson et al., 2007; Samani et al., 2007). The most strongly associated SNPs were non-coding and more than 100 kilobases downstream of the nearest protein-coding genes CDKN2B (cyclin-dependent kinase inhibitor 2B encoding the protein p15INK4b) and CDKN2A (cyclin-dependent kinase inhibitor 2A encoding the proteins p16INK4a and p14ARF). Re-sequencing and fine-mapping studies in the gene region have identified a set of SNPs with indistinguishable statistical evidence, but not one causal variant (Shea et al., 2011). By what mechanism does the risk allele alter the relevant gene at the locus and what is the gene responsible for atherosclerosis susceptibility? Answers to these two questions have been particularly elusive. Several mechanisms by which the non-coding variant might impact phenotype have been explored, including altering a non-coding RNA in the region (antisense noncoding RNA in the INK4 locus (ANRIL), also known as CDKN2B antisense RNA (CDKN2BAS)), serving as an enhancer of regional gene expression, and disrupting the binding of the transcription factor STAT1 (Harismendy et al., 2011; Holdt and Teupser, 2012). No definitive answers have emerged.
Finally, a non-coding gene region homologous to the human associated interval on 9p21 was deleted in mice and this targeted deletion led to near absence of CDKN2A and CDKN2B expression in several tissues, increased rate of tumors, and increased proliferation of aortic smooth muscle cells (Visel et al., 2010). However, these mice did not develop increased atherosclerosis on a diet rich in fat and cholesterol. For the 9p21 locus, the path from genomic localization to functional insights has not been a straightforward one, likely due to our limited ability at present to interpret the non-coding portions of the human genome. However, as recently reviewed (Raychaudhuri, 2011), integration of GWAS findings with expression quantitative trait loci (eQTL; eQTLs are genetic variants that correlate with the transcript level of a gene) and genome-wide maps of chromatin state dynamics may improve our ability to interpret GWAS findings and understand how non-coding variation regulates genes (Ernst et al., 2011; Schadt et al., 2008).
Hypotheses concerning etiologic agents for complex diseases have often initially come from observational epidemiology. In 1961, in a paper entitled “Factors of Risk in the Development of Coronary Heart Disease”, Dr. William Kannel and colleagues at the Framingham Heart Study established an association of plasma total cholesterol with future risk for coronary heart disease (Kannel et al., 1961). Since then, hundreds of soluble biomarkers have similarly been associated with risk for coronary artery disease. How many of these biomarkers directly cause coronary artery disease and how many simply reflect other causal processes and why is this question important? Both causal and non-causal biomarkers may be helpful in terms of predicting risk for future disease. However, only a causal biomarker may be appropriate as a target of therapy. The ultimate proof of causality in humans is a randomized controlled trial testing whether a treatment that alters the biomarker will affect disease risk. However, as clinical trials are expensive and time-consuming, it would be helpful to have evidence in humans prior to engaging in a clinical trial.
In a technique termed Mendelian randomization, DNA variants are used to address the question of whether an epidemiological association between a risk factor and disease reflects a causal influence of the former on the latter (Davey Smith and Ebrahim, 2003; Gray and Wheatley, 1991; Katan, 1986). In principle, if a DNA variant is known to directly affect an intermediate phenotype (e.g., a variant in the promoter of a gene encoding a biomarker, affecting its expression) and the intermediate phenotype truly contributes to the disease, then the DNA variant should be associated with the disease to the extent predicted by (a) the size of the effect of the variant on the phenotype and (b) the size of the effect of the phenotype on the disease (Musunuru and Kathiresan, 2010). If in an adequately powered sample, the predicted association between the variant and disease were not observed, it would argue against a purely causal role for the intermediate phenotype in the pathogenesis of the disease. The study design is akin to a prospective randomized clinical trial in that the randomization for each individual occurs at the moment of conception: genotypes of DNA variants are randomly “assigned” to gametes during meiosis, a process that should be impervious to the typical confounders observed in observational epidemiological studies. For example, a parent’s disease status or socioeconomic status should not affect which of the parent’s two alleles at a given SNP is passed to a child, with each allele having an equal (50%) chance of being transmitted via the gamete to the zygote. Thus, Mendelian randomization should be unaffected by confounding or reverse causation.
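The prediction described above, that the variant-disease association should follow from (a) and (b), amounts to multiplying the two effects on the log-odds scale. The sketch below illustrates that arithmetic under simplifying assumptions: a linear effect of the variant on the biomarker, a log-linear effect of the biomarker on disease odds, and no other pathway from variant to disease. The function names and all numeric inputs are hypothetical and are not drawn from the studies cited here.

```python
import math


def predicted_variant_odds_ratio(variant_effect_on_biomarker, odds_ratio_per_unit_biomarker):
    """Odds ratio for disease expected per allele if the biomarker is causal.

    (a) variant_effect_on_biomarker: change in biomarker per allele (biomarker units)
    (b) odds_ratio_per_unit_biomarker: disease odds ratio per 1-unit higher biomarker
    """
    log_odds_ratio = variant_effect_on_biomarker * math.log(odds_ratio_per_unit_biomarker)
    return math.exp(log_odds_ratio)


def implied_odds_ratio_per_unit(variant_odds_ratio, variant_effect_on_biomarker):
    """Biomarker-disease odds ratio implied by the observed genetic associations."""
    return math.exp(math.log(variant_odds_ratio) / variant_effect_on_biomarker)


# Hypothetical example: an allele that lowers the biomarker by 15 units, with an
# observational odds ratio of 1.01 per 1-unit higher biomarker (assumed values).
expected = predicted_variant_odds_ratio(-15.0, 1.01)
print(f"expected odds ratio per allele: {expected:.3f}")

# If the observed per-allele odds ratio matched the expectation, the implied
# per-unit odds ratio would recover the observational estimate.
print(f"implied odds ratio per unit:    {implied_odds_ratio_per_unit(expected, -15.0):.3f}")
```

An observed per-allele odds ratio close to the expected value is consistent with a causal biomarker; an observed value near 1.0 in a well-powered sample argues against it.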
Mendelian randomization has potential shortcomings, including (a) the technique is only as reliable as the robustness of the estimates of the effect sizes of the variant on the phenotype and of the phenotype on disease, and (b) it assumes that the DNA variant does not influence the disease by means other than the intermediate phenotype being studied (pleiotropy), which may not be true. Nevertheless, Mendelian randomization has the potential to be as informative as a traditional randomized clinical trial.
Several Mendelian randomization studies have confirmed a causal relationship between LDL cholesterol and coronary artery disease. Nonsense variants in the PCSK9 gene that significantly reduce plasma LDL cholesterol concentrations were observed to be associated with reduced incidence of coronary artery disease in an African American cohort (Cohen et al., 2006). Similarly, in European Americans a common missense variant in PCSK9 associated with lower LDL cholesterol levels was also found to be associated with lower risk of MI. These observations suggested that lower LDL cholesterol is sufficient to provide protection against coronary artery disease. Similar to LDL cholesterol, several recent genetic studies have confirmed prior observations that plasma lipoprotein(a) causally relates to coronary artery disease (Clarke et al., 2009; Kamstrup et al., 2009; Wensley et al., 2011).
Unlike the results with plasma LDL cholesterol concentrations and plasma lipoprotein(a), three recent, large Mendelian randomization studies of C-reactive protein gene (CRP) variants that affect plasma CRP concentrations, performed in thousands of individuals, did not show an association between these variants and ischemic vascular disease or coronary artery disease (Elliott et al., 2009; Wensley et al., 2011; Zacho et al., 2008). It is unlikely that these studies were confounded or affected by pleiotropy, since the tested variants were within the CRP gene itself rather than being in other loci that might secondarily affect CRP levels. Although these studies cannot definitively rule out a causal role of CRP in cardiovascular disease, they strongly suggest that high CRP levels are indirectly rather than directly related to coronary artery disease.
A parallel line of genetic evidence also casts doubt on the notion that inflammatory biomarkers such as CRP are critical mediators of MI and coronary artery disease. Of 33 loci most highly associated with MI and coronary artery disease (Table 2), nine are related to plasma LDL cholesterol or lipoprotein(a), arguing for a strong causal relationship between LDL [or a modified LDL particle such as lipoprotein(a)] and disease (Schunkert et al., 2011). None of the other 24 loci are clearly related to inflammation. Overall, with the recent explosion in our ability to measure both soluble biomarkers (including metabolites and proteins) and genetic variation, Mendelian randomization will likely be an increasingly utilized approach to distinguish causal biomarkers from non-causal ones.
As described above, considerable progress has been made in correlating genotype to phenotype for both Mendelian and common, complex diseases over the past several decades. Over the next decade, we are in a unique position to “finish the job.” For Mendelian diseases, the goal is nothing short of attempting to solve all Mendelian diseases not yet mapped. For common, complex diseases, the goal is to extend the range of genetic variation evaluated from common (>1:20 frequency) to low-frequency (1:1000 - 1:20 frequency) and very rare (<1:1000 frequency).
Dramatic advances in sequencing and genotyping make such ambitious goals feasible. Next-generation sequencing platforms have markedly decreased the cost of DNA sequencing when compared with Sanger sequencing. Hybridization approaches have enabled selection of the portion of the genome that is protein-coding (roughly 1% of the 3.2 billion bases), the so-called “exome.” Sequencing just the exome (rather than the entire genome) is well justified in the search for genetic causes of rare inherited disorders (Ng et al., 2010). This is because the majority of alleles responsible for Mendelian disorders disrupt protein-coding sequence and a large fraction of rare missense mutations in the human genome are predicted to be deleterious (Kryukov et al., 2007; Stenson et al., 2009).
Over the past three years, whole exome and whole genome sequencing have been successfully utilized to identify new genes for several Mendelian forms of CVD. Most studies have sequenced the exomes of one or a few individuals affected with the disorder. Variants seen in these individuals are typically compared with those from reference individuals (unaffected individuals who are family members or unrelated). Variants that are shared by affected individuals and not present in the unaffected population are considered causal candidates. With this strategy, investigators identified ANGPTL3 mutations as a cause of familial combined hypolipidemia, BAG3 mutations for dilated cardiomyopathy, NT5E mutations for arterial calcification, KCNJ5 mutations for hereditary hypertension due to aldosterone-producing adenomas, and KLHL3 or CUL3 mutations for hypertension and hyperkalemia (Boyden et al., 2012; Choi et al., 2011; Musunuru et al., 2010a; Norton et al., 2011; St Hilaire et al., 2011).
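The filtering logic described here (keep variants shared by affected individuals, discard any seen in unaffected individuals) can be sketched with set operations. This is a deliberately minimal illustration: the function name, sample names, and variant labels are hypothetical placeholders, and real pipelines add further filters (for example on variant type, frequency, and call quality) that are not shown.

```python
def candidate_variants(affected, unaffected):
    """Return variants carried by every affected exome and by no unaffected exome.

    Both arguments map a sample name to the set of variant labels called in that sample.
    """
    if not affected:
        return set()
    shared_by_affected = set.intersection(*(set(v) for v in affected.values()))
    seen_in_unaffected = set().union(*(set(v) for v in unaffected.values()))
    return shared_by_affected - seen_in_unaffected


# Hypothetical variant calls (placeholders, not real data).
affected_exomes = {
    "case_1": {"variant_A", "variant_B", "variant_C"},
    "case_2": {"variant_B", "variant_C", "variant_D"},
}
unaffected_exomes = {
    "relative_1": {"variant_C", "variant_E"},
}

candidates = candidate_variants(affected_exomes, unaffected_exomes)
print(sorted(candidates))
```

Note that this strict filter assumes full penetrance: a causal variant that also appears in an unaffected carrier would be removed from the candidate set.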
The discovery of KLHL3 as a cause of pseudohypoaldosteronism type II, a Mendelian form of hypertension, is an elegant example of the value of exome sequencing (Boyden et al., 2012). The gene was identified by directly sequencing the exomes of 11 unrelated index cases with pseudohypoaldosteronism type II and contrasting with exome sequences from 699 population-based, unrelated controls without hypertension. Of 22 case chromosomes, 5 harbored a rare, protein-altering mutation in KLHL3 whereas only 2 of 1398 control chromosomes did so, leading to a low probability that this observation was due to chance (P = 1 × 10^-8). Such successes have raised expectations for the exome sequencing approach.
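As a rough check on the chromosome counts quoted above, a one-sided Fisher's exact test can be computed directly from the hypergeometric distribution. The sketch below is an assumption about how such a comparison might be made, not necessarily the test used in the original study; the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(case_mut, case_total, ctrl_mut, ctrl_total):
    """One-sided Fisher's exact P value.

    Probability, with all table margins fixed, of observing at least
    `case_mut` mutated chromosomes among the `case_total` case chromosomes.
    """
    n_total = case_total + ctrl_total
    n_mut = case_mut + ctrl_mut
    upper = min(n_mut, case_total)
    favorable = sum(
        comb(n_mut, k) * comb(n_total - n_mut, case_total - k)
        for k in range(case_mut, upper + 1)
    )
    return favorable / comb(n_total, case_total)


# 5 of 22 case chromosomes versus 2 of 1398 control chromosomes.
p = fisher_one_sided(5, 22, 2, 1398)
print(f"P = {p:.1e}")
```

With these counts the result is on the order of 10^-8, in line with the reported value.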
However, the true yield of exome sequencing in solving Mendelian disorders is difficult to know at present, as negative results have not been routinely reported (Bamshad et al., 2011). It can be difficult to arrive at a single causal mutation after exome sequencing for several reasons. First, the causal variant may not be protein-coding. Second, the causal variant may be protein-coding but the relevant gene may not have been successfully captured and sequenced. Approximately 5-10% of all exons may be poorly sequenced due to genomic features such as high GC content. Third, if the causal mutation is not fully penetrant, it will be present in both phenotypically affected and unaffected individuals and thus may be “filtered” out. Fourth, the segregation of a phenotype in a family may be due to non-genetic factors rather than a single gene of large effect. This problem is particularly relevant for CVD conditions that are common in the population, like MI and atrial fibrillation. If one observes a multi-generation family where multiple individuals are affected with MI, is this pattern due to a new Mendelian gene or to poor lifestyle habits shared by the family? Fifth, in contrast to methods for single nucleotide substitutions, methods for calling small insertion-deletions and copy number changes from short-read sequence data are in need of improvement. If the causal mutation is a one or two base-pair insertion or deletion, it might be overlooked. Finally, exome sequencing and filtering are more efficient at reducing the number of candidate causal mutations for recessive than for dominant disorders. After exome sequencing and filtering for a dominantly inherited condition, the number of mutations compatible with causing disease can be large.
What about exome or whole genome sequencing to identify rare variants that confer a large effect on common, complex traits and diseases? The optimal study design for complex traits will depend on the frequency of the genetic variant(s) that are the source of the association signal. Though the term “rare” is used in the literature to refer to variants less than 1:20 frequency, we find it useful to distinguish between two types of “rare” variants, those that range in frequency from 1:1000 to 1:20 and denoted as “low-frequency” and those that are <1:1000 frequency and denoted as “very rare” (Figure 2). This categorical distinction is important because the approach to discover and replicate the two signals can differ.
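The frequency classes used in this review can be written as a small helper. This is a minimal sketch; the function name `frequency_class` is hypothetical, and the handling of values that fall exactly on a boundary (1:20 and 1:1000 are treated as low-frequency) is an assumption based on the ranges given above.

```python
def frequency_class(allele_frequency):
    """Classify a variant by allele frequency using the review's three bins."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    if allele_frequency > 1 / 20:
        return "common"
    if allele_frequency >= 1 / 1000:
        return "low-frequency"
    return "very rare"


for freq in (0.20, 0.01, 0.0001):
    print(freq, frequency_class(freq))
```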
For low-frequency variants, it is now possible to catalogue nearly all coding variants segregating in a population at a frequency of 1:1000 or greater, directly genotype these variants, and test for association with disease (Figure 2A). For example, data from 12,000 human exome sequences has been mined and all coding sites where the alternate allele is seen at least three times (twice for nonsense and splice) and in two separate studies have been catalogued. Investigators have designed a custom genotyping array - the ‘exome array’ - to directly genotype these ~300,000 variants at low cost. This genotyping array will allow for comprehensive testing of low-frequency coding variants for association with cardiovascular traits and diseases.
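The site-selection rule described above can be expressed as a simple predicate. The sketch below is one reading of that rule, with assumptions: the function name `include_on_array` and the annotation labels are hypothetical, and "in two separate studies" is interpreted as "in at least two studies".

```python
def include_on_array(alt_allele_count, n_studies, annotation):
    """Decide whether a coding site qualifies for the genotyping array.

    alt_allele_count: times the alternate allele was seen across all exomes
    n_studies: number of separate studies in which it was observed
    annotation: functional class, e.g. "missense", "nonsense", "splice"
    """
    # Nonsense and splice variants need only two observations; others need three.
    min_count = 2 if annotation in {"nonsense", "splice"} else 3
    return alt_allele_count >= min_count and n_studies >= 2


print(include_on_array(3, 2, "missense"))   # True
print(include_on_array(2, 2, "missense"))   # False
print(include_on_array(2, 2, "nonsense"))   # True
```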
In contrast, a signal emerging from a burden of very rare variants can best be discovered by sequencing and can often only be replicated by sequencing (Figure 2B). The promise and challenges of this approach are evident in the candidate gene sequencing studies completed over the past few years. In seminal work, Cohen, Hobbs, and colleagues tested the hypothesis that rare, coding variation in candidate genes contributes to plasma HDL cholesterol variation in the population (Cohen et al., 2004). Three genes that cause Mendelian forms of low HDL cholesterol (ABCA1, APOA1, and LCAT) were sequenced in 128 individuals ascertained from below the 5th percentile (mean, 31 mg/dl) and 128 individuals from above the 95th percentile (mean, 91 mg/dl) of the HDL cholesterol distribution in a population-based cohort study. The investigators aggregated rare variants that met three criteria: 1) located within a gene; 2) changed an amino acid; and 3) present exclusively in either the low or the high group. Twenty such rare alleles were found in the low group and two such rare alleles were found in the high group (Fisher’s exact P=0.00006). This observation was replicated in an independent study and, additionally, cells from mutation carriers had reduced cholesterol efflux rates, a key function of ABCA1. This study established that rare non-synonymous mutations in ABCA1 with large phenotypic effects contribute to low HDL cholesterol in the population.
The challenge now is to use sequencing to enable de novo discovery of new genes that contribute to complex traits. It is now practical to extend the experiment described above from the sequencing of 3 genes to ~20,000 genes. Exomes can be rapidly sequenced and single nucleotide substitutions can be accurately called; however, will such sequencing for complex traits readily yield new discoveries? The key barriers at present are statistical. DNA sequence variants that are very rare (e.g., a singleton, a mutation seen in a single person) cannot be tested individually for association with phenotype. Instead, they need to be aggregated with similar rare variants and tested collectively for association with phenotype (Figure 2B).
This requirement immediately raises several questions. Do we aggregate variants only within a single gene, or should aggregation be extended to collections of genes in a pathway? Do we group only variants of a certain annotation (non-synonymous, nonsense, synonymous, splice-site, etc.)? Should we impose a frequency threshold to define ‘rare’ (e.g., aggregate only variants that are less than 1% frequency) and if so, what should this frequency threshold be? The answers to many of these questions are in development. We refer the reader to several recent publications that address these issues (Li and Leal, 2008; Madsen and Browning, 2009; Neale et al., 2011; Price et al., 2010).
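The choices listed above (unit of aggregation, annotation classes, frequency threshold) can be made explicit as parameters of a simple collapsing step. The sketch below is illustrative only and does not reproduce any of the cited methods; the `Variant` class, the function `burden_counts`, the gene name, and the default annotation set and 1% threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    annotation: str       # e.g. "missense", "nonsense", "synonymous"
    frequency: float      # allele frequency in the study sample
    carriers: frozenset   # identifiers of individuals carrying the variant


def burden_counts(variants, gene, group_a, group_b,
                  annotations=frozenset({"missense", "nonsense", "splice"}),
                  max_frequency=0.01):
    """Count individuals in each group carrying any qualifying variant in `gene`."""
    carriers = set()
    for v in variants:
        if (v.gene == gene and v.annotation in annotations
                and v.frequency < max_frequency):
            carriers |= v.carriers
    return len(carriers & group_a), len(carriers & group_b)


low_group = {"p1", "p2", "p3"}
high_group = {"p4", "p5", "p6"}
variants = [
    Variant("GENE_A", "missense", 0.001, frozenset({"p1"})),
    Variant("GENE_A", "nonsense", 0.002, frozenset({"p2", "p4"})),
    Variant("GENE_A", "synonymous", 0.001, frozenset({"p5"})),  # excluded by annotation
    Variant("GENE_A", "missense", 0.20, frozenset({"p6"})),     # excluded by frequency
]
print(burden_counts(variants, "GENE_A", low_group, high_group))  # (2, 1)
```

The resulting pair of carrier counts is what would then be passed to a statistical test; changing `annotations` or `max_frequency` changes which variants contribute, which is exactly the sensitivity the questions above point to.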
Another key question is statistical power. What sample size will be required to identify a new disease gene based on a burden of rare coding mutations? An early computer simulation study conducted by Kryukov, Sunyaev, and colleagues suggested that 10,000 individuals (5000 drawn from below the 5th percentile and 5000 from above the 95th percentile) will be needed to discover another gene with a mutation burden similar to ABCA1 (Kryukov et al., 2009). The principal reason for this large sample size is the stringent statistical threshold that needs to be imposed when testing 20,000 genes. A first approximation of the appropriate statistical threshold would be Bonferroni correction for all the genes sequenced (P value of 2.5 × 10^-6). Our extrapolation of effect sizes and frequencies from published studies shows (Figure 3) that thousands of individuals are required to reach acceptable statistical power.
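The Bonferroni threshold quoted above follows from dividing the overall significance level by the number of genes tested. The short sketch below assumes an overall significance level of 0.05, which is not stated in the text but is consistent with the quoted threshold; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Assumed overall alpha of 0.05 across ~20,000 genes.
threshold = bonferroni_threshold(0.05, 20_000)
print(f"{threshold:.1e}")  # 2.5e-06
```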
Statistical power would improve dramatically if we were able to distinguish rare mutations that affect protein function from those that do not. The bulk of discovered rare mutations are missense. Of all missense mutations, it is estimated that ~1/5 are as deleterious as nonsense mutations, ~1/2 are mildly deleterious, and ~1/4 do not affect protein function (Kryukov et al., 2007). If one restricted mutation counts to only functional mutations, noise would be removed. At present, the functional consequence of variants can be predicted based on comparative sequence analysis and protein structure analysis. Several software tools are available, but the accuracy of these methods remains a subject of debate (Sunyaev et al., 2001).
An alternative method is experimental evaluation of whether a given mutation is functional. Such experimental evaluation is possible for genes with known function and where that function can be readily evaluated with moderate or high throughput approaches. However, such evaluation of mutations assumes that the parameter evaluated is an appropriate surrogate for the mechanism by which the gene impacts the human phenotype. For complex phenotypes such as MI that arise after decades of pathology, it is often difficult to know if this is the case. Nevertheless, distinguishing functional versus neutral missense mutations will be a key to improving statistical power.
As an increasing number of genetic variants become associated with human disease, it will be essential to develop effective and accurate model systems to understand the mechanism of disease. This step is a prerequisite to translating genetic findings into new targets for therapy. Animal models have traditionally been used to mimic human mutations through gene knock-in approaches or study of null mutations (Bruneau et al., 2001; Lindsay et al., 2001). While this approach has been of value and led to many discoveries, it is more common that mouse models fail to effectively recapitulate the human phenotype. This is true even for Mendelian traits, where heterozygous deletion of human disease genes is often well-tolerated in mice, while homozygous deletions can have catastrophic consequences, failing to model the human condition. For many human disease genes, this has been a bottleneck, limiting the value of the gene-hunting approaches described above.
The recent revolution in cellular reprogramming technology provides a potential solution to the need for disease modeling (Figure 4). The striking observation by Yamanaka and colleagues that human adult somatic cells could be reprogrammed into a pluripotent state by expressing four transcription factors not only revolutionized the field of stem cell biology, but also has major consequences for the study of human genetics (Takahashi et al., 2007). Investigators can now reprogram skin or blood cells from patients with defined genetic variants/mutations into induced pluripotent stem (iPS) cells, and then differentiate the iPS cells into the specific cell type affected by disease. This approach provides an unprecedented opportunity to investigate the consequences of human genetic variation on cellular phenotypes that may contribute to disease. There have been several reports of partial modeling of CVD using human iPS cells in the last two years, although success to date has been limited to diseases with discrete cell-autonomous cellular defects. We will consider a few examples below and highlight the advantages and limitations of this approach.
Autosomal dominant long QT syndrome (LQTS) that predisposes to cardiac arrhythmias and sudden death involves single gene mutations of sodium or potassium channels that provide a discrete and quantitative phenotype in iPS-derived cardiomyocytes. In LQTS, repolarization of cardiomyocytes is delayed resulting in a prolonged action potential duration. Several groups have reported successful generation of iPS-derived cardiomyocytes from patients with LQTS, and an elongation of the action potential duration has been reported for a subset of genetically-induced LQTS (Itzhaki et al., 2011; Moretti et al., 2010; Yazawa et al., 2011). LQTS can also be drug-induced in selected individuals, and pluripotent stem cell-derived cardiomyocytes appear to be an effective model system to detect prolongation of action potentials due to cardiotoxic drugs (Braam et al., 2010). It is exciting to consider that identification of common genetic variants that predispose to drug-induced LQTS might be modeled in iPS-derived cardiomyocytes, thereby allowing a system in which to screen for this toxic side effect prior to in vivo use of new drugs.
Despite these successes, certain types of LQTS have been more difficult to model because iPS-derived cardiomyocytes mature incompletely and do not consistently express the appropriate ion channels. It is possible that recent direct reprogramming strategies involving transdifferentiation, which do not involve a progenitor intermediate, will allow modeling of diseases where a mature cell type is required (Ieda et al., 2010). The inability to expand mature, transdifferentiated cells is currently a rate-limiting factor for this approach, but future improvements in efficiency of reprogramming may overcome this hurdle.
Another example of iPS-based modeling recently came from the study of Noonan and LEOPARD syndromes, autosomal dominant conditions that involve pulmonary valve stenosis and hypertrophic cardiomyopathy in a subset of patients (Gelb and Tartaglia, 2011). The genetic cause of Noonan and LEOPARD syndromes involves activating mutations in members of the Ras pathway. Correspondingly, iPS-derived cardiomyocytes from patients with cardiomyopathy associated with LEOPARD syndrome exhibit higher levels of Ras signaling that appear to contribute to excessive growth of cardiac cells (Carvajal-Vergara et al., 2010). While such findings successfully model the pathway known to be affected, the ultimate goal will be to identify new therapeutics that can disrupt abnormal pathways and restore normal physiology. This has remained elusive thus far, in part because most iPS-based disease models lead to broad variability in phenotype, whereas high-throughput screening assays typically require a consistent phenotype with a narrow range of variability. As protocols for generating iPS cells improve and directed differentiation methods become more consistent, “disease in a dish” models promise to alter the landscape of drug discovery (Figure 4).
Besides modeling the Mendelian syndromes described above, another potential use of iPS technology will be to annotate the function of SNPs identified by GWAS or rare mutations discovered by sequencing. iPS cells are currently being generated from large populations with defined genetic variants and CVD in the hope that certain aspects of the phenotype could be modeled in iPS-derived cells. It may prove difficult to identify consistent phenotypic differences associated with genetic variants of small effect; however, it is possible that the genetic background present in affected individuals will be sufficient to reveal a cellular defect.
The contribution of individual sequence variants in the context of a complex genetic background can be interrogated through the use of recently developed DNA-correction approaches in human pluripotent stem cells. The use of zinc-finger nucleases and transcription activator-like effector nuclease (TALEN) technology to specifically introduce or remove a sequence variant in stem cells will reveal the contribution of individual variants to cellular phenotypes (Yusa et al., 2011). Addition of environmental stresses to the cellular system may be necessary to provoke phenotypes and will be an important tool in disease modeling in iPS cells. Such studies are currently underway and we look forward to the challenges and opportunities offered by this new approach to modeling human genetic variants.
The tools to decode human genetic variants on a large population scale have finally arrived and should begin to reveal the genetic contribution to disease at an exponential pace. Rigorous analysis of the functional consequences of sequence variants associated with disease will be critical, and new approaches leveraging stem cell technologies may facilitate study of human variants in relevant human cell types. Ultimately, the convergence of human genetics with functional biology will reveal new therapeutic targets for cardiovascular disease and a new generation of drug discovery efforts.
The authors wish to thank Drs. Ron Do, Nathan O. Stitziel, Gina M. Peloso, and Kiran Musunuru for assistance with analyses and critical review of the manuscript. S.K. was supported by grants from NHLBI/NIH, Foundation Leducq, and the Howard Goodman Fellowship from Massachusetts General Hospital; D.S. was supported by grants from the NHLBI/NIH, the California Institute for Regenerative Medicine, the Younger Family Foundation, the Roddenberry Foundation and the Whittier Foundation.