This year marks the fifth anniversary of the first large, well-designed GWAS that employed a dense SNP chip, and opinions vary on how much this study design has contributed to a better understanding of common disease susceptibility [32], [33],
[34]. Although over 2,000 loci have been robustly associated with one or more complex traits, GWAS have not accounted for much of any individual trait's heritability. They have generated reproducible hits that often lie far from known genes, which has limited the immediate translation of GWAS results into a mechanistic understanding of phenotypic variation. Using genetic variants detected by GWAS of CAD and related traits, we sought to test whether unprocessed GWAS findings would support genetic pleiotropy across several commonly co-existing morbidities, and the extent to which the clinical and epidemiological notion of shared etiology could be reproduced. We report 44 genes identified by GWAS as being directly implicated in phenotypic variation, located upstream or downstream of the associated SNPs, that were shared across at least two CVD-related traits. These overlapping genes recreated the established pathophysiological relationship between obesity, diabetes, hyperlipidemia, hypertension, kidney disease and CAD ( and ). However, this was only true if studies representing non-European cohorts were also included.
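The notion of a gene being "shared across at least two CVD-related traits" can be illustrated with a minimal sketch. The records, gene names (`GENE_A` and so on) and the function `shared_genes` below are hypothetical placeholders chosen for illustration; they are not the study's data or code.

```python
from collections import defaultdict

# Hypothetical toy records of (phenotype, positional gene).
# These are placeholders, not associations from the GWAS catalog.
records = [
    ("lipids", "GENE_A"),
    ("lipids", "GENE_A"),          # second SNP mapping to the same gene and trait
    ("CAD", "GENE_A"),
    ("blood pressure", "GENE_B"),
    ("CKD", "GENE_B"),
    ("diabetes", "GENE_C"),
]

def shared_genes(records, min_traits=2):
    """Return genes linked to at least `min_traits` distinct phenotypes."""
    traits_by_gene = defaultdict(set)
    for phenotype, gene in records:
        # A set keeps one instance of a gene per phenotype,
        # however many SNPs map to it.
        traits_by_gene[gene].add(phenotype)
    return {
        gene: sorted(traits)
        for gene, traits in traits_by_gene.items()
        if len(traits) >= min_traits
    }

pleiotropic = shared_genes(records)
```

In this toy input, `GENE_A` and `GENE_B` each map to two phenotypes and are retained, whereas `GENE_C` maps to one and is dropped.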
Previous reports have mined the GWAS catalog to test the idea of a shared genetic basis across multiple phenotypes, such as pancreatic cancer, immune-mediated diseases, and hematologic traits [20], [21], [22], [23], [24]. In contrast, we focused our analyses on the cardiovascular disease domain, including intermediate risk factor phenotypes and co-morbidities, to evaluate the immediate interpretability of unprocessed GWAS results in this area of research.
We found that genes within or flanking the reported SNPs independently replicated the clinical, epidemiological and pathophysiological notion of cardiovascular risks in the ethnicity-pooled dataset. However, when only studies of European populations were included, some of the relationships were lost (Figure S1). The fact that only 24% of the eligible studies were conducted in non-Europeans emphasizes the value of genetic studies in diverse ethnic/racial groups. When only studies of African ancestry cohorts were included in the analyses, we found that only 3 connections, between the blood pressure, lipids and chronic kidney disease phenotypes, were reproduced (Figure S2).
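The comparison between the pooled and ancestry-restricted analyses can be sketched as follows: two phenotypes are treated as connected when they share at least one gene, and the connections are recomputed after restricting the input to one ancestry group. The records, gene names, ancestry labels and the function `trait_connections` are hypothetical and serve only to illustrate the idea.

```python
from itertools import combinations

# Hypothetical toy records of (phenotype, gene, study ancestry).
records = [
    ("lipids", "GENE_A", "European"),
    ("CAD", "GENE_A", "European"),
    ("blood pressure", "GENE_B", "African"),
    ("CKD", "GENE_B", "European"),
    ("diabetes", "GENE_C", "Asian"),
    ("obesity", "GENE_C", "European"),
]

def trait_connections(records, ancestries=None):
    """Return phenotype pairs that share at least one gene.

    If `ancestries` is given, only records from those groups are used.
    """
    genes_by_trait = {}
    for phenotype, gene, ancestry in records:
        if ancestries is None or ancestry in ancestries:
            genes_by_trait.setdefault(phenotype, set()).add(gene)
    return {
        (a, b)
        for a, b in combinations(sorted(genes_by_trait), 2)
        if genes_by_trait[a] & genes_by_trait[b]   # non-empty gene overlap
    }

pooled = trait_connections(records)
european_only = trait_connections(records, ancestries={"European"})
lost = pooled - european_only   # connections that depend on non-European studies
```

In this toy input, two of the three pooled connections disappear once non-European records are excluded, mirroring in miniature the kind of loss described above.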
Several genes showed an overlap between at least three CVD-related traits (). Notably, one region on chromosome 2p24.1, in the proximity of two genes, apolipoprotein B (APOB) and kelch-like protein 29 (KLHL29), revealed the most extensive pleiotropy across multiple phenotypes, such as blood pressure, lipids, CKD and CAD. While APOB is well known for coding the main apolipoprotein of chylomicrons and low density lipoproteins, KLHL29's function is largely unknown. Consistent signals pointing to that genomic area included a KLHL29 intronic SNP and several intergenic SNPs positioned 19–2,323 bp from either gene, both imputed and typed, suggesting that both genes may be involved. Notably, KLHL29 was only mapped to this genomic region in studies of African ancestry. This observation could suggest ethnic/racial differences in LD in the region or an ethnic-specific variant associated with multiple risks. Recently, there have been reports of variants associated with risk in specific ancestral populations, particularly in African-ancestry populations, for diabetes [35], [36], kidney disease [37], and hypertension [38]. These discoveries could explain in part the increased prevalence of certain diseases in particular ancestral populations. However, whilst racial disparities or population differences in disease prevalence may correlate with differences in allele frequency resulting in different association signals, we were unable to effectively control for differences in allele frequency because the unit of analysis in our study was the gene rather than the SNP. Additionally, as the majority of GWAS-derived SNPs are not causal and population differences may be attributed to differences in LD structure, a lack of association in non-European populations may reflect ascertainment bias of GWAS markers rather than true population differences. Targeted re-sequencing should be conducted in multiple populations in an attempt to identify potential functional variants that generated the observed association signals.
We used pathway-based analysis as implemented in GRAIL to identify subsets of positional pleiotropic genes, shared across at least 2 phenotypes, involved in similar biological processes. GRAIL uses abstracts from the entirety of the published scientific literature to look for relatedness among genes within disease regions that may represent key pathways. We undertook this approach to capture both clearly established close gene relationships and potentially undocumented or distant ones. We found that the strongest connections, which contributed significantly to the GRAIL results, were between genes involved in lipid transport and metabolism, such as PCSK9, LDLR, LPL, APOB and APOC1 (). Among other significant connections was a link between two growth factors, PDGFD and VEGFA, which belong to the platelet-derived growth factor/vascular endothelial growth factor (PDGF/VEGF) family implicated in a variety of functions in vertebrates, especially angiogenesis. Defects in VEGFA have been shown to be associated with diabetic retinopathy, diabetic nephropathy leading to end-stage renal disease and diabetic neuropathy. These genes were also connected to the lipid genes by GRAIL.
It is of interest that many pleiotropic GWAS loci had no relationship to each other or to genes with an established functional connection, regardless of how current the reference data were. Incomplete gene function annotation and limited knowledge of biological pathways could potentially explain this finding. Despite considerable advances in expression quantitative trait loci (eQTL) research, there are questions about the completeness of the eQTL databases. Most human eQTL studies to date have analyzed cell types in blood, because these are the most readily available tissues, and have only recently moved to a wider variety of tissues such as cortical, adipose and liver tissues [39], [40], [41]. This reality prevented us from formally evaluating the contribution of eQTLs to genetic pleiotropy. Further studies will help elucidate pathways whose relevance to a particular disease or trait was previously unsuspected.
We sought to analyze the data in as unbiased a way as possible. To this end, we did not include metabolic syndrome as a CVD-related phenotype, as its definition encompasses two or more of the included phenotypes and would thus positively bias any overlapping gene lists. We also excluded catalog genes that were reported by investigators, who may have had pre-existing notions about disease causality, and relied only on positional information provided by the catalog. Of note, the established relationships were not completely reproduced when the analysis was limited to associations that reached genome-wide significance (P < 1×10⁻⁷) or was based on author-reported genes instead of positional genes.
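The effect of restricting the analysis to a significance threshold can be sketched in the same spirit. The records, P values, gene names and the function `genes_shared` are hypothetical; only the threshold of 1×10⁻⁷ is taken from the text.

```python
# Hypothetical toy records of (phenotype, gene, association P value).
records = [
    ("lipids", "GENE_A", 3e-12),
    ("CAD", "GENE_A", 4e-6),            # does not pass the stricter threshold
    ("blood pressure", "GENE_B", 2e-9),
    ("CKD", "GENE_B", 5e-8),
]

def genes_shared(records, p_max=None, min_traits=2):
    """Return genes linked to at least `min_traits` phenotypes.

    If `p_max` is given, only associations with P < p_max are counted.
    """
    traits_by_gene = {}
    for phenotype, gene, p_value in records:
        if p_max is None or p_value < p_max:
            traits_by_gene.setdefault(gene, set()).add(phenotype)
    return {g for g, traits in traits_by_gene.items() if len(traits) >= min_traits}

all_hits = genes_shared(records)            # every catalogued association
strict = genes_shared(records, p_max=1e-7)  # threshold stated in the text
```

Here `GENE_A` is no longer shared under the stricter threshold because one of its two associations is filtered out, which illustrates how thresholding can remove overlaps between phenotypes.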
This study has several limitations. We mined data exclusively from the NHGRI GWAS catalog, which includes data on published GWAS meeting pre-specified criteria. The catalog does not include variants derived from candidate gene or linkage studies; as such, variants discovered through these means that may exhibit pleiotropy were not included in our analysis. Similarly, we could only assess pleiotropy among phenotypes that have already been studied; thus, the absence of pleiotropy may denote insufficient data rather than true absence [42]. Conversely, it is possible that the degree of pleiotropy observed is artifactual because the implicated diseases have been explored in greater depth [20]. Furthermore, we could not control for gene size, which may affect the likelihood of observing statistically significant associations, as this inherited bias is present from the ascertainment of markers on GWAS arrays through to the reporting of association results in the GWAS catalog. Nevertheless, we limited adding to this bias by including only one instance of any gene that could be represented by multiple SNPs per phenotype in the analysis. Additionally, it is rare for causal variants to be identified by GWAS and, in many cases, variants in LD with the true causal variant are recorded in the catalog. These may in turn have been mapped to alternate genes in our analysis and may have affected the observed pleiotropy. It is possible that we included GWAS that used the same samples to study different phenotypes. Also, consistent with other studies examining pleiotropy in the GWAS catalog
[21], we did not address the directionality of the reported associations, nor did we consider their level of statistical significance (other than in the sub-analysis at the more stringent threshold) or their effect sizes. The goal of this study was to determine whether it is possible to replicate an indisputable notion of commonly co-occurring CVD-related conditions using crude GWAS-derived genomic regions. Future studies will be needed to determine whether these genetic risks act independently or in synchrony, or whether antagonistic pleiotropy exists between these phenotypes. The choice of genotyping platform could have biased our results. Nevertheless, the top pleiotropic region, APOB-KLHL29, has been detected through the imputed and typed SNPs available from both Affymetrix and Illumina genotyping platforms. Furthermore, although variability in the phenotypic characterization of CAD and related traits across GWAS may have affected our results, it has been shown that differences in phenotype definition in CAD have a small effect on between-study heterogeneity [43]. Another challenge of our study was that genes clearly implicated in the pleiotropy were not fully annotated with respect to function. Specifically, KLHL29, a gene within our most substantive pleiotropic region, as well as 5 other pleiotropic genes, including the most widely replicated CAD locus on 9p21, were not found in the GRAIL databases; therefore, we could not examine whether their function is connected to that of other pleiotropic genes. For these genes, greater efforts will be required to chart new paths that could eventually lead to the most novel and important insights.