Understanding causal relationships among large numbers of variables is a fundamental goal of the biomedical sciences and can be facilitated by Directed Acyclic Graphs (DAGs), where directed edges between nodes represent the influence of components of the system on each other. In an observational setting, some of the directions are often unidentifiable because of Markov equivalency. Additional exogenous information, such as expert knowledge or genotype data, can help establish directionality among the endogenous variables. In this study, we use principal component analysis to extract information across the genome in order to generate a robust statistical causal network among phenotypes, the variables of primary interest. The method is applied to 590,020 SNP genotypes measured on 1,596 individuals to generate the statistical causal network of 13 cardiovascular disease risk factor phenotypes. First, principal component analysis was used to capture information across the genome. The principal components were then used to identify a robust causal network structure, GDAG, among the phenotypes. Analyzing a robust causal network over risk factors reveals the flow of information along direct and alternative paths, as well as identifying predictors and good targets for intervention. For example, the analysis identified BMI as influencing multiple other risk factor phenotypes and as a good target for intervention to lower disease risk.
Interindividual variation in disease susceptibility is influenced by genetic variants, which can be organized into defined biologic pathways or data-driven associative networks [1–3]. By identifying variables correlated with the primary endpoint of interest, we are able to classify individuals and predict future disease. Going beyond partial correlations to evaluate causal relationships among variables is an essential step in risk prediction, thereby promoting more efficacious treatment of current disease and prevention of future disease. By changing the level of a causal variable (e.g. LDL-cholesterol levels), we are able to change the risk of future disease (e.g. coronary heart disease), which may not be the case for merely associated variables (e.g. HDL-cholesterol levels) [4]. In the case of a randomized intervention, such as a clinical trial, identification of causation is conceptually straightforward. However, in observational studies, which represent the majority of large-scale epidemiologic studies, causal inference is more complex. In most applications, especially “big data” applications, causal inference is embodied in Directed Acyclic Graphs (DAGs), where any inference is based on an estimated graph (i.e. nodes and edges). DAGs are illustrations of causal relationships among the variables. Mendelian randomization is an established approach to identifying causal relationships [5–8], and it is natural in a biomedical setting to integrate genomic and phenotypic information to help establish directionality within a network of phenotypes. We apply this technique to large data sets from different granularities to achieve robust causal graphs (i.e. DAGs). In the present context, granularities are defined as hierarchical levels of different natures between which the causal relationship is known, e.g. levels reflecting different layers of biologic organization and measurement (genomic and phenotypic, [4]).
In the application shown here, we use data from a deeper granularity, the genome, to generate a robust statistical causal network among 13 risk factor phenotypes. Inclusion of genotypes in the analysis of phenotypes (e.g. plasma glucose levels) provides two advantages: first, genotypes are assumed to be measured without error, and second, there is a natural order between these granularities (genome variation → phenotype variation; G→P) and this knowledge helps identify robust directionality in the upper granularity.
Using genome information is a promising approach to identifying directionality that is less susceptible to confounding. Previous applications in data integration using gene expression data and genotypes have followed a similar logic [9–12]. For example, Mehrabian et al. [9] integrated genotypic and phenotypic data in a segregating mouse population to generate causal relationships. Aten et al. [11] introduced an algorithm to estimate directionality among nodes in a DAG by applying information from selected single nucleotide polymorphisms (SNPs). In this study, we apply the concept of granularity in a comprehensive manner and extract information from a deeper granularity, here the genome, to achieve a robust causal network among the variables of interest at the upper level of granularity, here cardiovascular risk factor phenotypes. To go beyond using a sample of SNPs, which is incomplete and may introduce instability into the study results [13], the method of principal components is used to extract information across the genome. Integrating genome information embedded in the deeper granularity and captured using principal component analysis with phenotype information in the upper granularity yields a robust causal network among the phenotypes; we call this algorithm the granularity directed acyclic graph (GDAG).
We first briefly review the theory of graphical causal inference and introduce the granularity framework and the GDAG algorithm. The utility of this approach is introduced by application to a data set including 13 cardiovascular disease risk factors and 590,020 SNP genotypes measured on 1,596 individuals and then the estimated structure is further interpreted. Use of information from the genome level of granularity allowed us to robustly generate the statistical causal network among the phenotypes. A discussion of the GDAG algorithm and the results is provided.
Assume a DAG D = (v,ε), where v is a set of nodes with p elements corresponding to a set of p random variables and ε is a set of edges connecting the nodes, each edge representing the partial correlation between the two corresponding variables. The existence of a directed edge between two nodes represents a causal relationship between the corresponding variables. Assume P is a joint probability distribution over the variables corresponding to the nodes in DAG D = (v,ε). The underlying assumption for a DAG is the Markov condition over D and P [14]. D and P must satisfy the Markov condition: every variable Y_{i}, i ∈ v, is independent of any subset of its predecessors conditional on the set of variables corresponding to the parents (immediate causes) of node i,

Y_{i} ⊥ Y_{k} | Y_{pa(i)},

where Y_{k} occurs before Y_{i} and the parental set pa(i) = pa_{D}(i) denotes the set of parents of node i relative to the underlying structure of DAG D. For j ∈ pa(i), we denote j → i.
The topology, or skeleton, of a DAG is the graph without directions and is obtained by identifying conditional (in)dependencies; see the section “Identification of the Topology of Nodes” below. Identifying directions, however, is a challenging problem because of the Markov equivalence property of observational data. Analysis of data in the upper granularity can identify only v-structures: two nonadjacent nodes pointing inward toward a third node. A complete assessment of directionality (i.e. statistical causal relationships) usually cannot be determined from such data alone, resulting in Markov equivalent DAGs [15–16]. Different DAGs on the same set of nodes are Markov equivalent (ME DAGs) if and only if they have the same topology and the same v-structures [17]. When the number of nodes grows, the number of ME DAGs can grow super-exponentially [18]. Complete determination of directionality over the corresponding set v is therefore not possible in most cases.
Identifying robust and complete directionality and showing the flow of information is a difficult task, but it can be facilitated by the integration of different data types (i.e. granularities) where we know the direction of effect is from one granularity to the other. Assume we are seeking a DAG between two phenotypes Y_{1} and Y_{2}. For this example, assume genome-wide information related to the set (Y_{1}, Y_{2}) is captured in the variable X_{1}. Based on the results of an analysis assessing conditional independencies, we find that X_{1} is correlated with Y_{1} and is independent of Y_{2} given Y_{1}, in notation Y_{2} ⊥ X_{1} | Y_{1}. Since genome sequence variation is a causal factor in phenotypic differences (and not the other way around), the direction of the effect is from X_{1} to Y_{1}, as shown in DAG A in Figure 1. Knowing the relationship between X_{1} and Y_{1} helps generate the directionality between Y_{1} and Y_{2} based on the property Y_{2} ⊥ X_{1} | Y_{1}, and the direction shows that the flow of information is from Y_{1} to Y_{2}, as shown in DAG B in Figure 1. If instead the analysis of the data yields X_{1} ⊥ Y_{2} and X_{1} ⊥̸ Y_{2} | Y_{1}, then the direction of effect would be from Y_{2} to Y_{1}, as shown in DAG C in Figure 1, which represents a v-structure at Y_{1}.
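The orientation logic just described can be illustrated numerically. The sketch below (a minimal illustration with simulated data, not the authors' code; the helper functions `partial_corr` and `fisher_z_pvalue` are assumptions for this example) simulates the setting of DAG B in Figure 1 and recovers the independence pattern that orients the edge from Y_{1} to Y_{2}:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulate the setting of DAG B in Figure 1: X1 -> Y1 -> Y2
x1 = rng.normal(size=n)              # genome variable (deeper granularity)
y1 = 0.8 * x1 + rng.normal(size=n)
y2 = 0.8 * y1 + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after regressing out c from both."""
    res_a = a - np.polyval(np.polyfit(c, a, 1), c)
    res_b = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(res_a, res_b)[0, 1]

def fisher_z_pvalue(r, n, k):
    """Two-sided p-value for a partial correlation of order k via Fisher's Z."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    return math.erfc(abs(z) / math.sqrt(2))

p_marginal = fisher_z_pvalue(np.corrcoef(x1, y2)[0, 1], n, k=0)
p_given_y1 = fisher_z_pvalue(partial_corr(x1, y2, y1), n, k=1)

# X1 and Y2 are dependent marginally but independent given Y1; since
# X1 -> Y1 is known a priori (G -> P), this orients the edge Y1 -> Y2.
print(p_marginal < 0.01, p_given_y1 > 0.01)
```

The marginal dependence together with the conditional independence given Y_{1} is exactly the pattern of DAG B; under DAG C the two tests would come out reversed.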
To identify the directions among three variables in ME DAGs, we need at least two variables from the genome (i.e. a lower level of granularity, where G→P) influencing Y_{1} and Y_{2}, or one variable from the genome influencing Y_{3}. By integrating multi-omics data from different granularities, we are able to derive causal inferences that are less susceptible to confounding and, as a result, estimate causal networks robustly and uniquely. Partial information from a deeper granularity creates weak instrumental variables and may result in unstable structures in the upper granularity [13], and we may not be able to find a genome variable strongly associated with every phenotype under study [19]. Therefore, we go beyond the inclusion of a sample of SNP marker genotypes and extract comprehensive information across the genome by applying principal component analysis (PCA), reducing the dimensionality of the data while retaining most of the variation in the data set. Since PCA is an unsupervised approach, it avoids the inflation of false discoveries that can arise from using the same data twice. The steps of the GDAG algorithm are summarized as follows:
The GDAG Algorithm: Steps to identify a Granularity Directed Acyclic Graph (GDAG) over a set of variables of interest, Y, using data from a deeper granularity, X

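The steps can be sketched in code. The following is a simplified stand-in based on the description in this section, not the authors' implementation: conditional independence testing is truncated at order one, a 90% variance cutoff is assumed for the principal components, and the helper functions (`fisher_z`, `pcorr`, `gdag`) are illustrative:

```python
import numpy as np

def fisher_z(r, n, k):
    """Fisher's Z statistic for a (partial) correlation of order k."""
    return 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)

def pcorr(i, j, cond, C):
    """Partial correlation of variables i and j given the index set cond,
    computed from the joint correlation matrix C by matrix inversion."""
    idx = [i, j] + list(cond)
    P = np.linalg.inv(C[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def gdag(Y, X, z_crit=2.58):
    """Simplified GDAG sketch: (1) summarize the deeper granularity X by
    principal components, (2) estimate the skeleton over Y with partial
    correlation tests up to conditioning order one (a truncated
    PC-algorithm), (3) orient i -> j when some component is linked to i
    and separated from j by i (the logic of Figure 1)."""
    n, p = Y.shape
    # Step 1: principal components capturing ~90% of the variation in X.
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.90)) + 1
    pcs = U[:, :keep] * s[:keep]
    C = np.corrcoef(np.hstack([Y, pcs]), rowvar=False)
    # Step 2: skeleton over the phenotypes Y.
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        for j in range(i + 1, p):
            dep = abs(fisher_z(C[i, j], n, 0)) > z_crit
            if dep:  # drop the edge if any single phenotype separates i, j
                for k in range(p):
                    if k not in (i, j) and \
                       abs(fisher_z(pcorr(i, j, [k], C), n, 1)) <= z_crit:
                        dep = False
                        break
            adj[i, j] = adj[j, i] = dep
    # Step 3: orientation using the components (indices p..p+keep-1).
    direction = np.zeros((p, p), dtype=bool)
    for i in range(p):
        for j in range(p):
            if not adj[i, j]:
                continue
            for g in range(p, p + keep):
                linked = abs(fisher_z(C[g, i], n, 0)) > z_crit
                dep_marg = abs(fisher_z(C[g, j], n, 0)) > z_crit
                separated = abs(fisher_z(pcorr(g, j, [i], C), n, 1)) <= z_crit
                if linked and dep_marg and separated:
                    direction[i, j] = True  # flow of information: i -> j
                    break
    return adj, direction
```

On a simulated chain X → Y_{1} → Y_{2}, this sketch recovers both the edge and its direction; a full implementation would test conditioning sets of all orders, as the PC-algorithm does.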
To make the algorithm feasible, we assume the underlying network is sparse. A sparse network is a network with fewer links than the maximum possible number of links within the same network [21]. Under the sparsity assumption, the runtime of the algorithm is reduced to polynomial time and, as a result, the number of nodes can grow with the sample size. This assumption is reasonable and is commonly made in biomedical applications [22–25].
When applying the GDAG algorithm, we are primarily interested in the causal relationships among nodes in the upper granularity, not among nodes in the deeper granularity or in the relationships between the deeper and upper granularities. In the current application, genome information summarized by principal components is used to identify a robust statistical causal network structure among cardiovascular risk factor phenotypes in the upper granularity. In genetics and epidemiology, PCA is frequently applied to summarize genome information [e.g. 26–29].
In this manuscript, generating the basic topology among nodes and then assessing directionality are carried out by finding conditional independencies in the framework of data integration. One statistical approach to estimating conditional independencies under a Gaussian assumption is assessing partial correlations [30–32]. Conditioning on only one variable, the partial correlation is defined as

${\rho }_{{y}_{i}{y}_{j}\cdot {y}_{k}}=\frac{{r}_{{y}_{i}{y}_{j}}-{r}_{{y}_{i}{y}_{k}}{r}_{{y}_{j}{y}_{k}}}{\sqrt{(1-{r}_{{y}_{i}{y}_{k}}^{2})(1-{r}_{{y}_{j}{y}_{k}}^{2})}},$

where ${r}_{{y}_{i}{y}_{j}}=cov({y}_{i},{y}_{j})/\sqrt{var({y}_{i})\text{var}({y}_{j})}$ is the Pearson product-moment correlation coefficient. Fisher’s Z transform is used to assess the statistical significance of the sample correlation coefficient, r. If the partial correlation between two variables Y_{i} and Y_{j} given the variables corresponding to a subset s ⊆ v\{i,j} is not determined to be significantly different from zero at some significance threshold, then the corresponding nodes i and j are not connected with an edge. On the other hand, there is an edge between nodes i and j if and only if, given all subsets s ⊆ v\{i,j}, Y_{i} and Y_{j} are significantly correlated (see [33], prop. 5.2). Assessing all partial correlations in the multivariate normal case to estimate conditional independencies is computationally infeasible. Therefore, a sparsity assumption is employed, meaning that each node is connected to only some, but not all, of the nodes in the network.
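The first-order partial correlation can be checked against an equivalent residual-based computation. A small sketch with simulated data (the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Illustrative trio: yk drives both yi and yj, so they correlate
# marginally but are independent given yk in the population.
yk = rng.normal(size=n)
yi = 0.6 * yk + rng.normal(size=n)
yj = 0.6 * yk + rng.normal(size=n)

r = np.corrcoef(np.column_stack([yi, yj, yk]), rowvar=False)
rij, rik, rjk = r[0, 1], r[0, 2], r[1, 2]

# First-order partial correlation from the formula above
p_formula = (rij - rik * rjk) / np.sqrt((1 - rik**2) * (1 - rjk**2))

# The same quantity as the correlation of residuals after regressing out yk
res_i = yi - np.polyval(np.polyfit(yk, yi, 1), yk)
res_j = yj - np.polyval(np.polyfit(yk, yj, 1), yk)
p_residual = np.corrcoef(res_i, res_j)[0, 1]

print(np.isclose(p_formula, p_residual))  # True: the two agree exactly
```

The agreement is an algebraic identity, not a sampling coincidence; the formula is simply the closed form of the residual correlation.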
To evaluate the properties of the GDAG algorithm, we used simulated data. We estimated the simulated DAG without genomic information and, separately, with the GDAG algorithm incorporating the genomic information. We then compared the frequency of false discoveries (FD), the number of wrong directions, and non-discoveries (ND), the number of non-directed edges, estimated by each method. Since the performance of the algorithms depends on the number of v-structures in the underlying causal graph, we considered DAGs with different numbers of v-structures. The underlying models of the simulations are depicted in Figure 2.
To obtain genotype data with a realistic linkage disequilibrium structure, we generated 10,000 SNPs for 2,000 individuals on the basis of a coalescent model that mimics the linkage disequilibrium (LD) pattern (i.e. the non-random association of alleles at different loci), local recombination rate, and population history of African Americans and European Americans, using a previously validated demographic model [34].
Phenotype values were generated using the structural model

${Z}_{i}={\sum }_{j}{\beta }_{ij}{Z}_{j}+{\sum }_{k}{\alpha }_{ik}{X}_{k}+{\varepsilon }_{i}$

for j < i, where the phenotypes (Z_{j}) in the model are compatible with the graphs in Figure 2 and the X_{k} are the selected SNP genotypes. For each scenario, SNPs were selected randomly from the larger set of generated SNPs. The ε_{i} in the model were assumed to be Gaussian with unit variance. The values of the nonzero genome effects were randomly sampled from U(−0.9, −0.5) and U(0.5, 0.9), and the values of the nonzero phenotype effects were sampled from U(−1.9, −1.0) and U(1.0, 1.9). The values of these end points were based on preliminary analyses. While other studies, such as [32], considered only positive effects, we considered both negative and positive effect sizes.
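A simulation in this spirit can be sketched as follows; the three-node chain, the additive SNP coding, and the helper `draw_effect` are assumptions for illustration, with effect sizes drawn from the stated uniform ranges:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

def draw_effect(low, high):
    """Sample an effect uniformly from (-high, -low) U (low, high)."""
    return rng.choice([-1, 1]) * rng.uniform(low, high)

# Hypothetical three-node chain Z1 -> Z2 -> Z3, one random SNP per phenotype
snps = rng.binomial(2, 0.3, size=(n, 3)).astype(float)   # additive coding
a = [draw_effect(0.5, 0.9) for _ in range(3)]            # genome effects
b12, b23 = draw_effect(1.0, 1.9), draw_effect(1.0, 1.9)  # phenotype effects

eps = rng.normal(size=(n, 3))                            # unit-variance errors
z1 = a[0] * snps[:, 0] + eps[:, 0]
z2 = b12 * z1 + a[1] * snps[:, 1] + eps[:, 1]
z3 = b23 * z2 + a[2] * snps[:, 2] + eps[:, 2]
```

Each phenotype depends linearly on its parents plus a SNP term and Gaussian noise, matching the structural model above for a particular choice of graph.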
The simulated data were analyzed considering only the phenotypic data using the PC-algorithm [35], which is implemented with polynomial complexity in the high-dimensional sparse setting [32]. The simulated data were also analyzed using the GDAG algorithm based on both phenotypic and genomic data. To apply the GDAG algorithm, we extended the PC-algorithm to analyze data from different granularities. We extracted information from the generated SNPs using principal component analysis and selected the first 110 principal components, accounting for almost 90% of the variation, to form the set X in the GDAG algorithm. The comparison of the performance of the PC and GDAG algorithms based on fifty repetitions is summarized in Figure 3, which shows the number of false discoveries (FD) and non-discoveries (ND) under different scenarios.
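The component-selection step can be sketched as follows on a stand-in genotype matrix (the real data are far larger; the matrix dimensions and allele frequencies below are assumptions), choosing the smallest number of components that accounts for 90% of the variation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in genotype matrix (individuals x SNPs); real data: 2,000 x 10,000
freqs = rng.uniform(0.05, 0.5, size=300)              # allele frequencies
G = rng.binomial(2, freqs, size=(500, 300)).astype(float)

Gc = G - G.mean(axis=0)                               # center each SNP
U, s, _ = np.linalg.svd(Gc, full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)            # cumulative variance
k = int(np.searchsorted(explained, 0.90)) + 1         # smallest k reaching 90%
pcs = U[:, :k] * s[:k]                                # retained components

# Sample principal components are mutually uncorrelated by construction,
# so no partial correlations among the Xs need to be estimated later.
print(k, explained[k - 1] >= 0.90)
```

The exact orthogonality of the retained components is what allows the running-time reduction discussed later: no dependencies among the Xs need to be tested.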
The GDAG algorithm has fewer FDs and NDs compared with a simple DAG application without the genome information. As can be seen in Figure 3, the performance of the PC-algorithm is improved dramatically by adding information from a deeper granularity. There have been other attempts to improve the PC-algorithm’s characteristics. For example, de Campos et al. [36] improved the PC-algorithm by employing three types of structural restrictions, and Shojaei et al. [25] achieved better performance using a penalization approach.
The GDAG algorithm can generate directionality robustly because it leverages information from a deeper granularity. Here, we examine the performance of the GDAG algorithm for different numbers of observations and two significance levels. We simulated different numbers of individuals, with 40 replicates for each sample size, for an underlying network with directionality. Using data from the deeper granularity, the GDAG algorithm was used to generate the topology and directionality for each replicate. Because the GDAG algorithm directs the edges, reporting either the false discovery rate (FDR) or the true discovery rate is sufficient when assessing performance across the different scenarios. The mean FDRs across the 40 replicates for each sample size were calculated. A smoothed line over the mean FDRs is depicted in Figure 4; the red line shows the mean FDRs at significance level α = 0.01 and the black line at α = 0.001.
Examination of the FDR in Figure 4 indicates unsatisfactory rates of false discovery at small sample sizes (e.g. <600 for α = 0.001 and <1,200 for α = 0.01) and satisfactory rates at larger sample sizes at both significance levels. For sample sizes between 600 and 1,200, α = 0.001 provides more reliable results than α = 0.01.
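The per-replicate FDR averaged in Figure 4 can be written out explicitly. A minimal sketch with hypothetical counts (the tuples below are illustrative, not the study's results):

```python
import numpy as np

def fdr(n_false, n_directed):
    """Share of wrongly directed edges among all directed edges."""
    return n_false / n_directed if n_directed else 0.0

# Hypothetical (false, total directed) counts from five replicates
replicates = [(1, 15), (0, 15), (2, 15), (1, 14), (0, 15)]
mean_fdr = np.mean([fdr(f, total) for f, total in replicates])
print(round(mean_fdr, 3))
```

Averaging the per-replicate rates, rather than pooling counts, matches the description of the mean FDRs computed across the 40 replicates per sample size.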
As an application of the GDAG algorithm, we identified causal relationships among 13 chronic disease risk factor phenotypes: BMI (body mass index), SB (systolic blood pressure), DB (diastolic blood pressure), FG (fasting glucose), FS (fasting insulin), HDL (high density lipoprotein), LD (low density lipoprotein), TRIG (triglycerides), TC (total cholesterol), FIBR (fibrinogen), PLT (platelet count), as well as the electrocardiogram measurements QT and QRS. The data set includes 590,020 measured genotypes in a sample of 1,596 non-Hispanic white individuals from the National Heart, Lung, and Blood Institute (NHLBI) GO ESP, which is an ancillary study of the Atherosclerosis Risk in Communities (ARIC) study [37], the Cardiovascular Health Study (CHS) [38], and the Framingham Heart Study [39]. The data were obtained from dbGaP [40], and this analysis is part of ongoing studies having local Institutional Review Board approval. The following steps were undertaken in order:
The resulting GDAG among the risk factor phenotypes is shown in Figure 5.
As was shown with the simulated data, the use of information from the genome embedded in the principal components allowed us to estimate the causal network among the phenotypes. Some of the relationships in Figure 5 are expected, such as those among the lipid phenotypes. The analysis identified a causal link between BMI and HDL-cholesterol and an indirect effect of BMI on triglycerides. The relationship between fibrinogen and platelets underscores the important role of fibrinogen in platelet aggregation and function [42]. It is important to note the effect of BMI throughout the network.
DAGs are illustrations of causal relationships among a set of related variables. To definitively identify causal relationships, interventions are required. However, interventions, even in some parts of the graph, are not possible in most human observational studies. Data analysis alone does not robustly identify causal relationships, except in very special cases for non-Gaussian distributions [43]. As shown here, application of domain expert knowledge and data from another granularity is helpful for identifying causal networks, including the directions of arrows, and for estimating the magnitudes of effect sizes. In the granularity framework, we take advantage of genotype information to identify a robust statistical causal network structure among phenotypes (i.e. a GDAG), which provides a high degree of certainty about the identified causal relationships. Any algorithm for DAGs can be extended in the granularity framework to achieve causal inference that is less susceptible to confounding by hidden variables and, as a result, to estimate robust statistical causal networks that are well anchored to domain knowledge. In previous applications, a priori biologic candidate gene variation has been used to analyze phenotypes [11], but a comprehensive approach to the concept of granularity has not been used. Leveraging eQTLs identified from previous association studies to reduce the number of Markov equivalence classes among phenotypes is well established [2–3, 12] but distinct from the concept of granularity. To the best of our knowledge, there is no report using principal components derived from genome-wide SNP information to identify the causal network structure among phenotypes. In the proposed granularity framework, the domain knowledge that “genome variation causes phenotypic differences” is used along with objective dependencies in the data to estimate causal relationships.
The concept of granularity can be applied to reduce the running time of the GDAG algorithm. Since the primary interest is generating a causal network structure over the variables Y in the upper level of granularity, the topology of the causal network can be estimated only over the variables Y. After the topology is established, the GDAG algorithm seeks partial correlations between any two variables, one from X and the other from Y. Since the variables in set X, the genomewide principal components, are independent of one another, the GDAG algorithm does not require estimating the partial correlations between the Xs. This results in further reduction of the running time.
To implement the granularity framework, we extended the Peter and Clark (PC) algorithm because it is computationally feasible and often fast for sparse problems with many nodes (variables) [32]. This method can be applied to generate network structures among many variables and reveal patterns in complex systems. A robust statistical causal network reveals patterns in the underlying structure, thereby identifying good targets for intervention and prediction. The total effects among phenotypes can be estimated by structural equations when a sufficient set of confounders, identified graphically, is in the model [44]. We applied the GDAG algorithm to 13 risk factor phenotypes and genome-wide principal components as the deeper granularity. Visualization of the phenotype GDAG shown in Figure 5 provides opportunities for improved disease prediction and for identifying targets for risk factor intervention. Nodes with a high in-degree (i.e. number of arrows pointing into a node) correspond to variables influenced by multiple other risk factors. These nodes may be good predictors of disease since they capture information from multiple risk factors. On the other hand, nodes with a high out-degree (i.e. number of arrows pointing out of a node) correspond to variables influencing multiple other risk factors. These nodes may be good intervention targets to lower risk and influence clinical outcomes. For example, according to the GDAG in Figure 5, a good disease predictor may be fibrinogen (FIBR), since it is influenced by multiple other risk factors. BMI may be a good intervention target since it has a high impact on the other risk factor levels, such as fibrinogen, HDL, glucose, and insulin. Changes in BMI are predicted to yield changes in the majority of the network of risk factors. In conclusion, by generating a robust statistical causal network among risk factor phenotypes and using its directionality, we are able to identify good candidates for future intervention.
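The in-degree/out-degree reasoning can be made concrete with a toy adjacency matrix (the edges below are a hypothetical subset loosely following the description of Figure 5, not the estimated GDAG itself):

```python
import numpy as np

# Hypothetical directed adjacency over a few risk factors: A[i, j] = 1 means i -> j
nodes = ["BMI", "HDL", "TRIG", "FIBR", "FG"]
A = np.array([
    [0, 1, 0, 1, 1],   # BMI -> HDL, FIBR, FG
    [0, 0, 1, 0, 0],   # HDL -> TRIG
    [0, 0, 0, 1, 0],   # TRIG -> FIBR
    [0, 0, 0, 0, 0],   # FIBR (sink)
    [0, 0, 0, 1, 0],   # FG -> FIBR
])

out_degree = A.sum(axis=1)   # high out-degree: candidate intervention target
in_degree = A.sum(axis=0)    # high in-degree: candidate disease predictor

print(nodes[int(np.argmax(out_degree))])  # BMI
print(nodes[int(np.argmax(in_degree))])   # FIBR
```

In this toy graph BMI has the highest out-degree and fibrinogen the highest in-degree, mirroring the interpretation of Figure 5 given above.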
Azam Yazdani is supported in part by a training fellowship from the Keck Center for Interdisciplinary Bioscience Training of the Gulf Coast Consortia (CPRIT Grant No. RP140113). Funding for GO ESP was provided by the National Heart, Lung, and Blood Institute (NHLBI) grants RC2 HL103010 (HeartGO) and RC2 HL102924 (WHISP). The exome sequencing was performed through NHLBI grants RC2 HL102925 (BroadGO) and RC2 HL102926 (SeattleGO).
Disclaimer: The content is solely the responsibility of the authors and does not necessarily represent the official views of the Cancer Prevention and Research Institute of Texas.
Author Disclosure Statement: The authors declare that no conflict of interests exists.