The clinical significance of variants in genes associated with inherited cardiomyopathies can be difficult to determine due to uncertainty regarding population genetic variation and a surprising amount of tolerance of the genome even to loss of function variants. We hypothesized that genes associated with cardiomyopathy might be particularly resistant to the accumulation of genetic variation.
We analyzed the rates of single nucleotide genetic variation in all known genes from the exomes of >5,000 individuals from the National Heart, Lung, and Blood Institute’s Exome Sequencing Project (ESP), as well as the rates of structural variation from the Database of Genomic Variants. Most variants were rare, with over half unique to one individual. Cardiomyopathy associated genes exhibited a rate of nonsense variants 96.1% lower than other Mendelian disease genes. We tested the ability of in-silico algorithms to distinguish between a set of variants in MYBPC3, MYH7, and TNNT2 with strong evidence for pathogenicity and variants from the ESP data. Algorithms based on conservation at the nucleotide level (GERP, PhastCons) did not perform as well as amino acid level prediction algorithms (Polyphen-2, SIFT). Variants with strong evidence for disease causality were found in the ESP data at prevalence higher than expected.
Genes associated with cardiomyopathy carry very low rates of population variation. The existence in population data of variants with strong evidence for pathogenicity suggests that even for Mendelian disease genetics, a probabilistic weighting of multiple variants may be preferred over the ‘single gene’ causality model.
New DNA sequencing technologies are poised to transform the genetic evaluation of patients. Soon the availability of genetic information will no longer be a barrier to our understanding of the genetic basis of disease. Rather, it will be our ability to understand and interpret the data that will be paramount. The interpretation of clinical genetic testing is a complex process that requires an appreciation of factors establishing causality as well as a detailed understanding of the ‘tolerated’ genetic variation present in human genomes of different ethnicities. Until recently, much of the genetic variation in human populations was unknown. With large scale population sequencing projects such as the 1000 Genomes Project [1], the true extent of this variation is now becoming clear. Indeed, recent analyses indicate a surprising prevalence of tolerated genetic variation [2-4].
Clinical genetic testing is increasingly available for conditions such as hypertrophic cardiomyopathy, where it is used for predictive family testing, and long QT syndrome, where it may alter management as well as impact family screening [5-7]. The yield from genetic testing, however, can be variable. Evidence for or against a variant’s role is assembled from previous reports in the literature, co-segregation, the likelihood that the variant disrupts the reading frame (weighted more towards nonsense variants, small insertion-deletion variants, or splice site variants) and algorithmic predictions based on conservation, constraint, or protein motif disruption. Despite such resources, a large number of variants found through clinical genetic testing remain of unclear significance. Greatly lacking is knowledge of the population genetic variation in these and other genes, which is needed for the interpretation of variants not just in Mendelian diseases, but also for common disease risk assessment [8,9] and pharmacogenomics [10-12].
One recent project to catalog population scale single nucleotide variant (SNV) data has been the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP) [13,14]. This large scale effort is aimed at sequencing the exome, consisting of the protein coding regions (exons) of the human genome, from members of several different cohorts followed throughout the country for the purpose of defining the genetic components of complex diseases. In contrast with the 1000 Genomes study, which has low coverage of hundreds of genomes, the NHLBI exomes study has high coverage (average >100x), high quality sequencing data for >5000 individuals of Caucasian and African American ethnicity. Thus, it represents a valuable comparison dataset for variants thought to cause monogenic Mendelian disease. One limitation of both of these datasets, however, is the absence of structural variants. These may be particularly important because of their tendency to disrupt the reading frame. The Database of Genomic Variants (DGV) [15] is a curated repository of structural variation (consisting of insertions, deletions, and copy-number variants) which serves a similar purpose for structural variation.
Using these sources of population genetic variation, we sought to characterize the tolerance of the human genome to variation in genes associated with Mendelian diseases with a specific focus on those that have been associated with inherited cardiomyopathy.
Data from the NHLBI ESP5400 dataset was accessed on December 12th, 2011 and downloaded for analysis. This data is the accumulation of called variants from the exomes of 5,379 individuals from multiple cohorts of the ESP, including the Women’s Health Initiative, Framingham Heart Study, Jackson Heart Study, Multi-Ethnic Study of Atherosclerosis, Atherosclerosis Risk in Communities, Coronary Artery Risk Development in Young Adults, Cardiovascular Health Study, Genomic Research on Asthma in the African Diaspora, Lung Health Study, Pulmonary Arterial Hypertension population, Acute Lung Injury cohort, and the Cystic Fibrosis cohort. The primary purpose of the ESP is to sequence the exomes of a large number of individuals selected for the extremes of primarily complex traits from these cohorts. While these exomes may not represent a true sample of the general population, they do represent a phenotyped cohort that is unlikely to be enriched for Mendelian disease, with the possible exception of cystic fibrosis. Resulting SNV calls were filtered for depth and base call thresholds and were annotated for quality using a support vector machine algorithm by the NHLBI ESP data analysis group. Only calls that passed all quality filters were used for downstream analysis. Further information regarding alignment, variant calls, and filtering, as well as the entirety of this data is available at http://evs.gs.washington.edu/EVS/.
For evaluation of structural variation, the November 2010 data release from the Database of Genomic Variants (http://projects.tcag.ca/variation), aligned to the hg19 version of the human genome, was accessed. This includes data from 42 separate studies evaluating for structural variation involving segments of DNA greater than 1kb, as well as smaller insertions/deletions (indels). The data is collected from small individual genome and population level studies without known enrichment for disease.
Gene annotation data was accessed from the Online Mendelian Inheritance in Man database [16] (http://www.ncbi.nlm.nih.gov/omim). All genes as annotated in the NCBI Reference Sequence Database (RefSeq) [17] via the University of California, Santa Cruz (UCSC) Genome Browser [18] (including alternate isoforms) were divided into subgroups by OMIM annotation and/or literature review according to their known association with: i) cardiomyopathy, specifically HCM or DCM, ii) any other Mendelian disease, or iii) neither of the above. After accounting for alternate isoforms, there were 120 isoforms of 46 separate genes associated with inherited cardiomyopathy, 5,764 isoforms of 2,831 separate genes with other Mendelian disease association, and 25,437 isoforms of 16,102 separate genes without known Mendelian disease association for which variant data from the ESP5400 dataset was available.
Variants from the ESP5400 dataset were grouped by gene into the three categories previously described and minor allele frequencies for each variant were extracted. Variant subtypes were analyzed by predicted functional effect (synonymous, missense, nonsense, splice) and the sum of minor allele frequencies across known isoforms was used to derive a raw expected number of variants per type per transcript. For synonymous, missense, and nonsense variants, this number was then normalized for transcript length based on data from RefSeq. For splice site variants, this number was normalized by the number of known exons per transcript.
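The normalization described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used (the analysis was carried out in R); the function name, the record layout, and the transcript identifiers `TX_A` and `TX_B` with their lengths are hypothetical.

```python
from collections import defaultdict

def variant_rate_per_mb(variants, coding_length_bp):
    """Sum minor allele frequencies per (transcript, effect) and scale the
    total to expected variants per megabase of coding sequence per chromosome.

    variants: iterable of (transcript, effect, minor_allele_frequency)
    coding_length_bp: dict mapping transcript -> coding length in base pairs
    """
    totals = defaultdict(float)
    for transcript, effect, maf in variants:
        totals[(transcript, effect)] += maf
    return {
        (transcript, effect): total / coding_length_bp[transcript] * 1_000_000
        for (transcript, effect), total in totals.items()
    }

# Hypothetical records, for illustration only.
variants = [
    ("TX_A", "missense", 0.001),
    ("TX_A", "missense", 0.003),
    ("TX_A", "nonsense", 0.0001),
    ("TX_B", "missense", 0.02),
]
coding_length_bp = {"TX_A": 4000, "TX_B": 2000}

rates = variant_rate_per_mb(variants, coding_length_bp)
for key in sorted(rates):
    print(key, round(rates[key], 3))
```

For splice site variants the same summation would be divided by the number of exons per transcript instead of the coding length.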
In order to evaluate the distribution of small indels (1-50 bp), which were notably absent from the public release of the ESP5400 dataset, the subset of called indels from the 1000 Genomes Phase 1 March 2012 release was retrieved and annotated using ANNOVAR [19] software against the NCBI RefSeq database [17] to determine the subset in coding regions of genes with any disease association as above.
We manually curated a set of variants in MYH7, MYBPC3, and TNNT2 with strong evidence for causing cardiomyopathy. This set comprised missense variants seen in patients at the Stanford Center for Inherited Cardiovascular Disease from September 2010 to December 2011 and considered likely or very likely disease causing. To supplement this list, we selected variants from a publicly available repository of sarcomeric variants [20] with the highest number of independent citations. These variants were then manually curated and any variants we considered likely or very likely disease causing were included in our high confidence set. Curation relied on published data, cases from our clinical cohort, and case or control data from commercial genetic testing laboratories. Classification was based on segregation data, presence in multiple unrelated cases, absence in controls, and availability of compelling animal model or in vitro data. Variants were considered very likely disease causing only if strong segregation data and/or animal model data was available.
All missense variants from the NHLBI ESP5400 dataset as well as variants from our curated list of known pathogenic variants in HCM were scored using GERP [21], a measure of evolutionary constraint at a nucleotide base level utilizing a rejected substitution score, and PhastCons [22], another measure of evolutionary conservation at the nucleotide base level utilizing multiple sequence alignment, using the SeattleSeq SNP Annotation server (http://snp.gs.washington.edu/SeattleSeqAnnotation). Polyphen-2 [23] (http://genetics.bwh.harvard.edu/pph2/) and SIFT [24] (http://sift.jcvi.org/) scores, both predictions of pathogenicity of missense variants based on the effects of the predicted resulting amino acid substitution, were obtained from their respective servers.
Structural variants from DGV were grouped on the basis of Mendelian disease association. The average number of structural variants per gene was computed. Due to the varying size of both structural variants and the transcripts they affect, we normalized by evaluating only structural variants affecting protein coding regions of genes and calculating the percent of each gene’s coding region based on transcript length affected by a deletion in DGV.
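As an illustration of the coverage measure described above, the following sketch computes the percent of a gene's coding region overlapped by reported deletions. It assumes half-open `[start, end)` coordinates on a single chromosome and merges overlapping deletions so that no base is counted twice; the function names and the example coordinates are hypothetical and are not taken from DGV.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def percent_coding_deleted(coding_exons, deletions):
    """Percent of coding bases covered by at least one deletion."""
    exons = merge_intervals(coding_exons)
    dels = merge_intervals(deletions)
    coding_bp = sum(end - start for start, end in exons)
    if coding_bp == 0:
        return 0.0
    covered = 0
    for exon_start, exon_end in exons:
        for del_start, del_end in dels:
            # Length of the overlap between this exon and this deletion.
            covered += max(0, min(exon_end, del_end) - max(exon_start, del_start))
    return 100.0 * covered / coding_bp

# Hypothetical gene with three 100 bp coding exons and three reported deletions.
exons = [(100, 200), (300, 400), (500, 600)]
deletions = [(150, 350), (340, 360), (590, 700)]
print(percent_coding_deleted(exons, deletions))  # 40.0
```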
All data analysis was carried out using the R Statistical Programming Language. Tests for statistical significance between groups were non-parametric tests without assumption of the underlying distribution. These included the Wilcoxon rank-sum test for direct comparison between two groups, the Kruskal-Wallis test for comparison across more than two groups, and Spearman’s rank-order coefficient for correlation. Given that most genes are not in linkage with each other, linkage between genes does not affect the results of the Kruskal-Wallis test significantly.
For the analysis of the exonic distribution of pathogenic and ESP variants, Fisher’s exact test was used. While Fisher’s test does assume independence of events which may not necessarily be true for the distribution of variants in a gene due to linkage disequilibrium, given the overall rarity of most variants analyzed (almost all less than 1% minor allele frequency and the majority being unique) it is unlikely that a rare variant in one exon significantly affects the probability of a variant in another exon.
Most variants in the population data were not shared between many individuals. Private variants, those that were found only in one person, were abundant. Out of the 9,974 total variants called in the NHLBI exomes distributed amongst 46 separate cardiomyopathy associated genes, 9,103 (91%) had minor allele frequencies less than 1%. Of these rare variants, 5,448 (60%) were private. This predominance of rare variants was almost identical in other genes, whether or not they were associated with Mendelian disease. Common variants (minor allele frequency > 5%) comprised only 5% of all genetic variation in the coding regions of human genes.
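The proportions quoted above follow directly from the reported counts, as this short check shows. It uses only the three counts given in the text; the variable names are illustrative.

```python
total_variants = 9974    # variants called in the 46 cardiomyopathy genes
rare_variants = 9103     # minor allele frequency below 1%
private_variants = 5448  # seen in only one individual

rare_fraction = rare_variants / total_variants
private_among_rare = private_variants / rare_variants
private_among_all = private_variants / total_variants

print(f"rare: {rare_fraction:.1%}")
print(f"private among rare: {private_among_rare:.1%}")
print(f"private among all: {private_among_all:.1%}")
```

The last ratio shows that private variants made up more than half of all variants in these genes, not only of the rare ones.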
We found many genes for which a large amount of genetic variation was not only expected, but likely serves a critical purpose. Among Mendelian disease associated genes, the five with the highest rates of missense variation were all HLA loci (Supplementary Table 2), where high rates of polymorphism are thought to be selectively maintained [25]. Another well-recognized gene locus with very high missense variation was the ABO blood group locus. Among non-Mendelian disease associated genes, those with the most variation included many of the olfactory receptor genes, consistent with the survival advantage of a sophisticated sensing system for environmental odorant molecules [26].
Missense and nonsense variant rates were only weakly correlated across all genes (Spearman’s rho=0.36). This remained true within the subset of genes with Mendelian disease association and within the subset without Mendelian disease association.
We found significantly lower levels of variation in genes associated with Mendelian disease as compared to genes without a known association (Table 1). In general, this reduction was much stronger for types of genetic variation that would be predicted to have more impact on the resulting protein product, such as splice site or nonsense variants. Mendelian disease genes were noted to have a 67.3% lower rate of nonsense variants as compared to genes without known disease association (p=9.6×10^-6). Nonsense variants were even rarer in cardiomyopathy associated genes (Figure 1), which exhibited a 98.7% lower nonsense variant rate as compared to non-disease associated genes and a 96.1% lower rate as compared to the remaining Mendelian disease associated genes (p=5.7×10^-7). Similarly lower variant rates were seen for missense and splice site variants. Interestingly, this was reversed with respect to synonymous variation, with cardiomyopathy specific genes having slightly higher rates of variation (116.4 variants per megabase of coding region per chromosome in cardiomyopathy genes vs. 90.8 and 95.1 variants per megabase of coding region per chromosome for non-OMIM and OMIM genes, respectively, p=2.7×10^-3).
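The three nonsense-rate reductions quoted above can be cross-checked against one another. The sketch below assumes that the 67.3% figure applies to the same group of "remaining" Mendelian disease genes used in the 96.1% comparison, which the text does not state explicitly, so it is a rough consistency check and nothing more.

```python
def relative_rate(percent_lower):
    """Convert 'X% lower than the reference' into a rate ratio."""
    return 1 - percent_lower / 100

mendelian_vs_non_disease = relative_rate(67.3)  # Mendelian / non-disease
cardio_vs_non_disease = relative_rate(98.7)     # cardiomyopathy / non-disease

# Dividing the two ratios gives cardiomyopathy / Mendelian.
cardio_vs_mendelian = cardio_vs_non_disease / mendelian_vs_non_disease
implied_percent_lower = (1 - cardio_vs_mendelian) * 100

print(f"implied reduction vs. Mendelian genes: {implied_percent_lower:.1f}%")
```

The implied reduction of roughly 96% agrees with the reported 96.1% to within rounding of the inputs.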
Single nucleotide variants thought to have the most effect on protein function are ones that result in a premature stop codon, i.e. nonsense variants. We looked at cardiomyopathy-associated genes in the NHLBI exome data to evaluate for the overall prevalence of this type of variation in a population without known inherited cardiomyopathy. Overall, we found that nonsense variants were extremely rare in these genes. In fact, in the subset of genes that are routinely sequenced for clinical purposes in HCM, we found only one nonsense variant each in MYH7 and MYBPC3. Nonsense variants were completely absent in the sarcomeric genes ACTC1, TNNT2, TNNI3, MYL2, MYL3, and TPM1. While the nonsense variant in MYH7 has not been reported previously, the nonsense variant found in MYBPC3 (p.Trp1214Ter) has been associated with hypertrophic cardiomyopathy in one published report in an Asian Indian population [27].
Among cardiomyopathy-associated genes, the gene with the greatest number of nonsense variants in the ESP5400 exomes data was the very large gene titin (TTN), which has been implicated in familial DCM. This may be largely due to its immense size, as the coding region of titin consists of upwards of 100 kilobases. In total we noted 23 predicted nonsense variants in titin in the NHLBI exome data. The majority of these nonsense variants seemed to be distributed evenly throughout the length of the gene, although there were two notable clusters of nonsense variants near the 5′ end of the gene (Figure 2). This is in direct contrast to a recent report of a high burden of variants in the A band of the titin protein (corresponding to a group of exons near the 3′ end of the transcript) associated with dilated cardiomyopathy (DCM) [28]. Both clusters of nonsense variants in our analysis were in exons that are specific to the novex alternate splice isoforms of titin, the first in the terminal exon (exon 46) of the novex-3 isoform (NM_133379) and the other in exon 44 of the novex-2 isoform (NM_133437). Neither of these is a major cardiac isoform of titin, which may explain why nonsense variants in these regions may be more tolerated.
On the other hand, DMD, which has been implicated in Duchenne and Becker muscular dystrophy as well as X-linked familial cardiomyopathy [29,30], was noted to manifest an extremely low rate of nonsense variants despite its enormous size. Of all human genes, DMD spans the largest region of the genome, encompassing 2.4 million bases, with a coding region consisting of about 14 kb spread over more than 70 exons. The NHLBI dataset, however, contained only one predicted nonsense variant within this gene.
We collected 46 variants, 40 of which were missense, with particularly strong evidence of causality from three genes most often found to be causal in HCM (MYBPC3, MYH7, and TNNT2) (Supplemental Table 3). Given a large amount of ambiguity over the effects of missense variants in the genome, we compared the missense variants from this pathogenic list to missense variants from the NHLBI exome data within the same genes. These 40 pathogenic missense variants were generally located in regions within these three genes that were notable for very low variant frequencies in the population data, suggesting that these are regions with vital functions that do not tolerate high rates of variation (Figure 3).
Furthermore, 10/26 pathogenic missense variants in MYH7 and 6/10 of the pathogenic missense variants in TNNT2 were found in exons that were notable for a complete absence of non-synonymous likely benign variation (Supplemental Table 4). These exons in MYH7 (exons 6, 7, 9, 13, and 19 of NM_000257) and in TNNT2 (exon 10 of NM_000364) thus likely encode critical functional domains in the resultant peptide. In support of this, the above noted exons in MYH7 all encode portions of the functional head and neck domains [31]. In addition, the above mentioned exon in TNNT2 encodes a portion of a tropomyosin binding site, with induced variants in this exon previously shown to strongly reduce binding efficacy [32,33]. In general, exonic distribution was strikingly different between the pathogenic variants and ESP5400 variants in MYH7 (p=.0059) and TNNT2 (p=.013). This difference was not statistically significant in MYBPC3, which may be due to the low number of pathogenic missense variants in this gene in our collection, consistent with reports that the majority of disease-causing variants in this gene tend to be frameshift, splice, or nonsense variants rather than missense [34,35].
Of note, 4 of the 46 variants with good evidence of pathogenicity were present in the NHLBI exome data. The individual incidences of these variants were very low, with almost all found in only 1 individual each, except for one variant in TNNT2, p.Arg278Cys that was found in 6 individuals in the NHLBI exome cohort. No phenotype information was available to us for these individuals. These variants were removed from the NHLBI ESP variant list for any further analysis.
We used widely accepted variant classification algorithms to predict pathogenicity of missense variants. We found the evolutionary constraint based algorithms GERP and PhastCons to be poorly predictive of variant pathogenicity in this data. Notably, GERP scores appeared on the whole to be higher in the NHLBI ESP variant set (Figure 4), the opposite of what would be expected. While PhastCons predicted scores of > 0.95 (max score of 1) for all the variants in our curated causative variant list, the majority of presumably tolerated missense variants (67%) from the NHLBI exome data set were also noted to have a similarly high PhastCons score, resulting in a c-statistic for classification of 0.52, akin to no discriminatory power (Figure 5).
The use of algorithms based on amino acid substitution gave much better results. SIFT had modest discriminatory power with a c-statistic of 0.70. Polyphen-2, which also utilizes information about peptide structure and interaction, performed the best with a c-statistic of 0.77. It should be noted however that Polyphen-2 is based on a machine-learning algorithm that was trained on variants that may have included some of those from our curated list.
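For readers unfamiliar with the c-statistic used above, it is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen population variant, with ties counted as one half. A minimal sketch follows; the scores are made up for illustration and are not taken from the study, and the function name is hypothetical.

```python
def c_statistic(pathogenic_scores, benign_scores):
    """Fraction of (pathogenic, benign) pairs ranked correctly.

    Higher scores are assumed to indicate pathogenicity; ties count 0.5.
    A value of 0.5 corresponds to no discriminatory power, 1.0 to perfect.
    """
    wins = 0.0
    for p in pathogenic_scores:
        for b in benign_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(pathogenic_scores) * len(benign_scores))

# Hypothetical scores, for illustration only.
pathogenic = [0.99, 0.95, 0.80, 0.60]
benign = [0.90, 0.60, 0.30, 0.10]
print(c_statistic(pathogenic, benign))  # 0.84375
```

For a predictor such as SIFT, where lower scores indicate a more damaging substitution, the scores would need to be negated (or the two arguments swapped) before applying this function.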
We attempted to recapitulate these findings in other types of genetic variation by evaluating the distribution of small indels in data from the 1000 Genomes Project. There were notably only 5,969 indels from this dataset in coding regions, of which 868 were in Mendelian disease associated genes and 26 were in cardiomyopathy associated genes. This gave total rates of 17 indels per 1,000 exons in non-Mendelian disease genes, 10 indels per 1,000 exons in Mendelian disease genes, and 9 indels per 1,000 exons in cardiomyopathy genes. However the overall low number of these types of variants in this data limited any further statistical analysis.
We then used data from DGV to query on a per gene basis the number of all structural variants that have been reported as well as the overall extent of the coding region of genes that are covered by known structural variants. We found that the total number, per gene, of all structural variants and only structural variants affecting coding regions did not differ between genes associated with Mendelian disease and those that are not (Table 2). However we did note a 53% reduction of coding region covered by reported deletion type structural variants in genes that are specifically associated with cardiomyopathy as compared to genes without Mendelian disease association (p-value=0.02).
Recent studies have suggested a surprising rate of tolerance to genetic variation within the human genome. Here, we show that this tolerance does not extend to genes associated with cardiomyopathy, especially structural and sarcomere genes. This observation fits with a systems model of organism function where some genes are disproportionately intolerant of variation because their function has less redundancy. In addition, in describing population variation data for these genes, we note the presence of a surprising number of disease-associated variants in a population without enrichment for cardiomyopathy.
In contrast to the high rate of genetic variation found in genes dependent on diversity for effective function such as the olfactory receptor loci, we found that population genetic variation, especially variation expected to affect protein function, was rare in Mendelian disease associated genes. We hypothesized that genes essential for cardiac function might be among the genes most intolerant of variation. Not only was this the case, but the strength of these associations was also found to be dependent on the severity of the predicted alteration of protein function, exemplified by the extreme rarity of nonsense variants in cardiomyopathy-specific genes. These findings extended to structural variants as well, specifically in regards to the percent of the coding transcript that is involved in deletion type structural variants in individuals without disease.
One strength of our study is in the practical application to clinical genetic testing, which relies on data from unaffected individuals to judge the likely pathogenicity of novel variants. As our understanding of human genetic variation has improved, it has become clear that even rare genetic variation can be normal and well tolerated, representing a challenge in linking genotypes to phenotypes. One recent study has estimated, using 1000 Genomes data, that the average person has as many as 100 loss of function variants per genome [2]. This population level of variation has implications for the interpretation of results of clinical genetic testing. However, our results indicate that this variation is not evenly distributed and genes for which associations with Mendelian disease have been established have much lower levels of such variation, likely representing the effects of purifying selection.
Why genes associated with cardiomyopathy show even lower rates of genetic variation than other Mendelian disease associated genes is not self-evident, but several explanations are possible. One study has suggested that Mendelian disease genes may not necessarily be the hubs of gene networks [36] (because to manifest disease a variant cannot be fatal). However, genes associated with cardiomyopathy may be an exception given their essential functions within the sarcomere and the heart’s unique position in serving all other organs. Variants in these highly structured peptides with molecular motor functions that operate constantly throughout life would be expected to be heavily selected against in the general population. The finding of a slight increase in synonymous variants in cardiomyopathy-associated genes is unexpected. It is possible that this represents a decrease in codon use bias in these genes relative to others, which may in turn reflect a decreased need for efficient translation of these structural proteins, but why this may be the case is not evident.
One intriguing finding in cardiomyopathy genetics is the contrast between disease causing variants found in MYBPC3 and those in MYH7, the two genes with the highest number of HCM-causing variants. Indeed, the high rate of nonsense pathogenic variants found in MYBPC3 [37] is in contrast with the almost universal missense nature of those found in MYH7. The extreme rarity of nonsense variants in cardiomyopathy genes in the data presented here suggests that a high probability for pathogenicity for such variants found in MYBPC3 in patients would be appropriate. The absence of disease-causing nonsense variants in MYH7 is curious. It may be that MYH7 haploinsufficiency is not tolerated at all. We do note that predisposition of genes towards one type of variation versus another is not uncommon given the poor correlation between rates of different types of variation noted in our data, which may be driven by the resulting effects of such variants (dominant-negative effects in missense variants versus haploinsufficiency states in nonsense variants).
Missense variants remain among the most difficult to interpret in a clinical context. Without a large number of affected and unaffected family members to show co-segregation of variant with disease, it is often difficult to determine if a missense variant truly is pathogenic. Much has been made of the use of measures of evolutionary conservation to prioritize missense variants. Our analysis shows that while these measures can help exclude variants at positions in the genome that do not show conservation, they are unable to efficiently discriminate between likely causative and non-causative variants. While evolutionary conservation at the nucleotide base level appears to be a necessary characteristic of a pathogenic variant, it is not sufficient in and of itself to classify a variant as causative. Algorithms using the predicted effects of the resulting amino acid substitution showed much better classification potential although this may in part reflect the use of cardiomyopathy causative variants as training data for these classifiers.
Our analysis also confirms recent evidence that the overwhelming majority of variation in the human genome is rare (i.e. affecting < 1% of the population). Interestingly, more than half of variants analyzed were private (found in only one person). In fact, taking all 8 commonly sequenced genes for HCM together (ACTC1, TNNT2, TNNI3, MYL2, MYL3, TPM1, MYH7, MYBPC3), we found 159 private missense variants, 3 private splice site variants, and 2 private nonsense variants for a total of 164 private variants that would have the potential to affect the resulting protein. Assuming that none of these variants was found in the same person, this would imply that 3% of a general population sample who were to be sequenced today would have candidate variants not seen previously on a small HCM disease genetic testing panel. This highlights the continued importance of co-segregation and other supporting data in deciding whether or not a novel variant is causative of disease.
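The 3% figure above is simple arithmetic on the reported counts, under the stated assumption that no individual carries more than one of these private variants (which makes it an upper bound on the number of carriers). The variable names below are illustrative.

```python
# Private, potentially protein-altering variants in the 8 HCM panel genes.
private_counts = {"missense": 159, "splice_site": 3, "nonsense": 2}
individuals = 5379

total_private = sum(private_counts.values())
# Assumes each private variant sits in a different individual.
carrier_fraction = total_private / individuals

print(total_private)               # 164
print(f"{carrier_fraction:.1%}")   # 3.0%
```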
It was also surprising to find 4 of 46 “gold standard” pathogenic variants present in this population sample, with a total pathogenic allele count of 9 among 5,379 individuals. These data would imply a background prevalence of variants believed causative of HCM of approximately 0.2% (based on only 46 variants in 3 genes, and thus likely a substantial underestimate). This is much higher than expected: the prevalence of HCM itself is estimated to be 0.2% in multiple populations [38-40], and the yield of genetic testing in HCM is far from 100%. This is consistent with other recently published studies finding higher than expected prevalence of genetic variants associated with other Mendelian cardiovascular diseases such as familial DCM [14] and long QT syndrome [41], though the burden of evidence of pathogenicity for variants in these studies was variable.
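The comparison in this paragraph can be made explicit with a short calculation. The helper `expected_carrier_frequency` is a hypothetical simplification: it assumes full penetrance and one allele per carrier, and treats the expected frequency as the product of disease prevalence, genetic testing yield, and the share of test-positive cases explained by the curated variants. The text does not give values for the last two quantities, so the sketch evaluates only the most generous case, in which both equal 1.

```python
def expected_carrier_frequency(prevalence, testing_yield, share_explained):
    """Expected population frequency of carriers of a given variant set,
    assuming full penetrance (a deliberate simplification)."""
    return prevalence * testing_yield * share_explained

pathogenic_alleles = 9
individuals = 5379
hcm_prevalence = 0.002  # 0.2%, as cited in the text

observed = pathogenic_alleles / individuals
# Most generous case: every HCM case is test-positive and carries one of
# the 46 curated variants.
upper_bound = expected_carrier_frequency(hcm_prevalence, 1.0, 1.0)

print(f"observed carrier frequency: {observed:.2%}")
print(f"observed / upper bound: {observed / upper_bound:.2f}")
```

Even under this most generous case the observed frequency is already more than 80% of the bound; any realistic testing yield or share of cases explained by these 46 variants would push the expected value well below what was observed.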
While it remains possible that some individuals within these cohorts may harbor undiagnosed HCM given that phenotype data for these individuals is not publicly available, the genetic prevalence rate would still be expected to be much lower than that observed in this data. Based on these variant prevalence data, published estimates of HCM prevalence would have to be too low by a factor of at least 2 for our current models of HCM disease inheritance to hold. Given that these estimates of HCM disease prevalence were based on multi-modality screening in diverse populations, it seems likely that some proportion of the variants thought to be causal of HCM under a single gene model cannot be. Alternatively, we posit that the idea of a single gene disorder with variable penetrance is likely an artifact of a limited genomic window, and that what has commonly been perceived as a single gene disorder may in fact be the result of a combination of multiple genetic variants each contributing a portion of the variance, with variants contributing differently in different individuals. Just as some have suggested that a number of rare variants with strong effect size may be the driver of the inherited component in many common diseases [8,42,43], so too might this be the case for what have historically been perceived as monogenic disorders.
Our study has limitations. No individual phenotype data for the cohorts in NHLBI-ESP, 1000 Genomes, or DGV is publicly available, so it is not possible for us at this time to determine if those individuals with variants from our curated set may have features of an undiagnosed cardiomyopathy. While the accumulated set of variants from these 5,379 individuals is available, individual exomes cannot be reconstructed, so it is not possible to determine which variants may be shared on the same chromosome. The family structure of the individuals within the NHLBI ESP data was also unknown. It is thus possible that a rare variant could be overrepresented if many members of the same family were sequenced.
In conclusion, using publicly available exome-wide sequencing data from thousands of individuals, we found that genes associated with Mendelian diseases show much lower rates of protein-altering genetic variation than other genes, including missense, nonsense, and splice-site variation, with an extreme intolerance of variation noted in cardiomyopathy-associated genes. Cardiomyopathy-associated genes showed intolerance to structural variation as well. Nonsense variants in genes that have been recurrently linked to hypertrophic cardiomyopathy were extremely rare, and our results suggest that such variants in these genes found on clinical testing have a very high likelihood of being pathogenic. In contrast, novel missense variants were present in at least 3% of individuals, and thus the careful interpretation of missense variants found on clinical genetic testing is critical. Current in silico classification schemes for predicting the pathogenicity of missense variants unfortunately have low power in classifying cardiomyopathy variants. Finally, we note a much higher than expected prevalence of variants with strong evidence for pathogenicity. This suggests that, using the power of genome sequencing, a new framework for heterogeneous Mendelian disorders such as inherited cardiomyopathies needs to be developed, in which variants found in patients and family members are viewed probabilistically on a spectrum from unlikely to likely contributors of variable individual magnitude. While this model challenges the classic ‘single variant in a single gene disorder’ view, it may also begin to explain some of the significant variability in disease expression found in family members with the same ‘causal’ variant.
Recent studies have revealed a high degree of genetic variation in human populations, some of which would be predicted to cause loss of function of the genes encoded. These studies provide a challenge to the careful interpretation of the results from genetic testing and raise concerns about our ability to distinguish between benign and pathogenic variants. We analyzed data from more than 5,000 participants in the NHLBI Exome Sequencing Project. To study tolerance of genetic variation in different genes, we derived rates of genetic variation within: 1) genes without Mendelian disease association, 2) genes associated with Mendelian disease, and 3) genes associated with inherited cardiomyopathies. We found that genes associated with Mendelian diseases exhibit markedly lower rates of genetic variation. This was even more marked for genes associated with cardiomyopathy. Nonsense variants were extremely rare in most cardiomyopathy genes, suggesting that when such variants are found, they are likely to be pathogenic. We also compared known pathogenic variants in MYH7, MYBPC3, and TNNT2 with those in the population data. We found neither rarity nor nucleotide evolutionary conservation helpful in distinguishing benign from pathogenic variants in these genes. However, the exon distribution of pathogenic and benign variants in MYH7 and TNNT2 was significantly different. Rates of pathogenic variants in population data were higher than would be anticipated, suggesting that a single gene/variant model may not be sufficient to explain many cases of inherited cardiomyopathy. These findings highlight the continued importance of co-segregation and other supporting data in determining variant pathogenicity.
Supplemental Table 1: List of genes determined to have association with cardiomyopathy. Associations were noted from the Online Mendelian Inheritance in Man (OMIM) database or through literature review.
Supplemental Table 2: Top 20 genes in each category with highest variant rate by subtype.
Supplemental Table 3: Manually curated high confidence pathogenic variants. a: total number of unrelated individuals with the variant from published data, our clinical cohort, and clinical laboratory data provided in the genetic test report. b: strength of segregation data based on the largest number of affected individuals with the variant within a single kindred (>5, strong; 4-5, moderate; 2-3, weak). c: total number of controls in which the variant was not observed, from published data and clinical laboratory data.
Supplemental Table 4: Distribution of missense variants amongst the exons of MYBPC3, MYH7, and TNNT2. The canonical isoform was used in the case of multiple isoforms. P-values represent results of a Fisher’s Exact Test for independence of distributions between pathogenic variants and the variants from the Exome Sequencing Project (ESP) dataset.
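As an illustration of the kind of test named in the Supplemental Table 4 caption, the following is a minimal sketch of a two-sided Fisher's Exact Test on a 2x2 table, applied one exon at a time (this exon versus all other exons, pathogenic versus ESP variants). The per-exon 2x2 layout is a simplifying assumption and is not necessarily the table layout used in this study; the function names and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def exon_enrichment(pathogenic_counts, population_counts, exon):
    """P-value for whether `exon` holds a different share of pathogenic
    missense variants than of population missense variants."""
    a = pathogenic_counts.get(exon, 0)
    b = sum(pathogenic_counts.values()) - a
    c = population_counts.get(exon, 0)
    d = sum(population_counts.values()) - c
    return fisher_exact_two_sided(a, b, c, d)


if __name__ == "__main__":
    # Hypothetical per-exon missense counts, for illustration only.
    pathogenic = {"exon_1": 1, "exon_2": 9, "exon_3": 2}
    population = {"exon_1": 6, "exon_2": 2, "exon_3": 5}
    for exon in pathogenic:
        print(exon, round(exon_enrichment(pathogenic, population, exon), 4))
```

Testing each exon separately involves multiple comparisons, so per-exon p-values from a sketch like this would need correction before being interpreted.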
The authors would like to thank the NHLBI GO Exome Sequencing Project and its ongoing studies which produced and provided exome variant calls for comparison: the Lung GO Sequencing Project (HL-102923), the WHI Sequencing Project (HL-102924), the Broad GO Sequencing Project (HL-102925), the Seattle GO Sequencing Project (HL-102926) and the Heart GO Sequencing Project (HL-103010).
Funding Sources: Stephen Pan is supported by NIH grant 5T15LM007033. This work was also supported in part by NIH grants DP2OD004613, R01HL105993, UL1RR029890 (Euan Ashley).
Conflict of Interest Disclosures: Euan Ashley reports equity and consulting in relation to Personalis Inc.