Variation in genomic DNA can affect function in multiple ways, most typically by altering the expressed quantity or sequence content of local transcripts. This premise has motivated extensive studies over the last decade cataloging the influence of human genetic variants on gene expression, most often in cis. Local gene expression level is formally considered a quantitative trait that is directly modified by allelic variation in regulatory elements (3). Such modifications of transcriptional regulation have been documented to affect health-related traits as diverse as asthma (5) and low-density lipoprotein (LDL) cholesterol concentration (6).
Yet, for a large fraction of single-nucleotide polymorphisms (SNPs) with well-supported associations to disease phenotypes (7) that are neither coding nor linked to coding SNPs in cis, no cis-regulatory effects have been reported in studies conducted thus far. A compelling biological hypothesis is that such a SNP does change the transcriptome state or program in order to exert its phenotypic impact, and that this regulation is mediated by a transcript in cis, but that, in the particular tissue examined, the changes to the transcription level of the mediator gene are too minute to guarantee detection in small association cohorts. This hypothesis leads to an approach for mapping expression quantitative trait loci (eQTLs) that focuses on the downstream effects of a regulatory SNP across multiple genes in trans, rather than on the cis-transcript that may mechanistically mediate the effect. A related approach has been successful in simpler organisms (8), motivating this work.
Data on both gene expression and SNP variation across multiple individuals, often termed genetic genomics, have facilitated the identification of thousands of expression single-nucleotide polymorphisms (eSNPs) (9). Approaches that combine these two types of data with additional factors, including previously inferred biological network structure (11), modularity of gene expression (12), pathway analysis (13) and enzymatic activity (14), have been proposed. However, tying genetic variation at specific loci to phenotypes is still an active field of research.
In this study, we focus on the modularity of gene regulatory networks, a major organizing principle of biological systems (15). A module is the fundamental unit of a biological network, consisting of a set of elements (e.g. genes) that work jointly to fulfill a distinct function. Several studies have used this property to gain a better understanding of the regulatory mechanisms (16) that are affected by genetic variation. Litvin et al. characterized how genetic variants in multiple loci combine to influence the expression of clusters of co-expressed genes in yeast. Ghazalpour et al. used co-expression networks to study the genetics of complex physiological traits relevant to the metabolic syndrome. Schadt et al. used previously reconstructed regulatory networks of genes in mouse and human (17) to support existing genome-wide association study (GWAS) results. Known pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) were used by Zhong et al. for the same purpose. Common to all these studies are three steps. The first two are independent: (i) construction of a network from gene expression data; and (ii) detection of association between genetic variants and expression traits. The final step is (iii) integration of the genetic associations into the network.
However, it is artificial to separate network construction, which is based on expression data only, from the mapping of individual SNP–transcript associations. Ideally, one would combine information from multiple transcripts with genetics in a unified analysis. This motivates complementary approaches to the analysis of eSNPs. Specifically, our premise is that the modular organization of gene regulation can be used to pinpoint eSNPs that affect multiple genes rather than a single gene. Therefore, we developed a method that focuses on groups of transcripts (modules) that are each associated with a single genetic variant.
We present a novel approach that analyzes modules of transcripts, each associated with a single genetic variant. These modules are constructed from both available types of data: transcript expression and genotypes. We combine transcripts into modules whose members share an associated SNP, which we denote the ‘main’ SNP of that module. This step exploits the modular organization of gene regulation. We then filter the modules according to a confidence score, which allows us to identify groups of transcripts that are associated with a SNP even if their individual associations are not genome-wide significant. Next, we examine the topology of the modules, accounting for independent co-association that is not merely the result of co-expression. This step allows us to infer the flow of causality between the main SNP and the transcripts in the module: we distinguish direct SNP–transcript associations from indirect ones that pass through an intermediate transcript whose expression level is co-associated with the same SNP. The main SNP can have cis- or trans-effects on the transcripts in the module. A local cis-effect on a transcript, whether or not that transcript is included in the module, can in turn have a modular trans-regulatory effect on the other transcripts in the module by virtue of its changed expression level or altered protein product (e.g. a mutation in a transcription factor).
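One way to picture the distinction between direct and indirect associations is through partial correlation: if the association between a SNP and a transcript vanishes once an intermediate transcript is accounted for, the association is plausibly indirect. The sketch below uses the standard first-order partial correlation formula as a stand-in; it is an assumption made for illustration, not necessarily the test applied in this work, and the names `pearson` and `partial_corr` as well as the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (rxy - rxz * ryz) / denom


# Toy chain SNP -> T1 -> T2 over eight samples (all values invented).
# The noise vectors are chosen to be orthogonal to the genotype and to
# each other, so the chain structure is exact rather than approximate.
snp = [0, 0, 1, 1, 1, 1, 2, 2]
noise1 = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
noise2 = [0.0, 0.0, 0.5, 0.5, -0.5, -0.5, 0.0, 0.0]
t1 = [g + e for g, e in zip(snp, noise1)]   # driven directly by the SNP
t2 = [a + e for a, e in zip(t1, noise2)]    # driven only through T1

marginal = pearson(snp, t2)                 # clearly non-zero
indirect = partial_corr(snp, t2, t1)        # vanishes given T1
direct = partial_corr(snp, t1, t2)          # persists given T2
```

In this constructed example the SNP–T2 association disappears when conditioning on T1, while the SNP–T1 association survives conditioning on T2, which is the pattern expected when T1 mediates the effect on T2.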
Regulatory effects can be categorized as cis- or trans-effects. The cis-effects of eSNPs are often due to changes within the promoter, enhancer or other regulatory regions of a gene that may change the expression of that gene. Trans-effects of the main SNP on module transcripts can be the outcome of two potentially overlapping scenarios. First, a cis main SNP that is located within or close to the coding region of one of the genes in the module can alter the produced protein. The altered protein may then have a trans-regulatory effect on the other transcripts in the module by virtue of its differential expression level, despite the protein itself being potentially unmodified. Second, a trans main SNP that is located within or close to the coding region of a gene that is not part of the module can alter the produced protein. This distant altered protein may then have a trans-effect on the transcripts in the module by virtue of its modified sequence, despite potentially maintaining its expression level.
All previously introduced methods group transcripts by a shared associated marker and determine intra-cluster interactions using the correlation of gene expression levels. To our knowledge, this is the first work in which a confidence score is assigned to each module and direct/indirect interactions are determined between pairs of transcripts within a module, reflecting the dependence/independence of their expression levels conditioned on the main SNP. We are thus able to go beyond traditional clustering-related methods that are based on expression only and examine the joint association and the topology of the modules, not merely their content. For completeness, we further search for a hierarchical regulatory structure within each module: we examine SNPs whose association with transcript levels in a module is conditioned on the main SNP, and denote those as ‘secondary’ SNPs. This step is illustrated as a decision tree in which the samples in each module are split first by the genotype of the main SNP and then by the genotype of the secondary SNP. We applied our method to genotype and gene expression data from the liver across 371 samples. These data have previously been analyzed by other means (11). We observe known relationships from the literature between a module and its associated genetic variants, thereby providing support for our methodology.
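The two-level split by main and then secondary SNP genotype can be sketched as follows. This is a minimal illustration under assumed inputs: the function name `split_by_genotypes`, the genotype coding and the toy values are hypothetical, and the leaf summary (mean expression) is chosen only for readability.

```python
from collections import defaultdict
from statistics import mean


def split_by_genotypes(main, secondary, levels):
    """Split samples by main SNP genotype, then by secondary SNP genotype.

    Returns {main_genotype: {secondary_genotype: mean expression}}.
    """
    leaves = defaultdict(lambda: defaultdict(list))
    for m, s, y in zip(main, secondary, levels):
        leaves[m][s].append(y)
    return {
        m: {s: mean(ys) for s, ys in sorted(by_secondary.items())}
        for m, by_secondary in sorted(leaves.items())
    }


# Toy data for eight samples (all values invented): the secondary SNP
# changes expression only among carriers of main genotype 1.
main_snp = [0, 0, 0, 0, 1, 1, 1, 1]
secondary_snp = [0, 1, 0, 1, 0, 1, 0, 1]
module_expression = [1.0, 1.0, 1.0, 1.0, 2.0, 4.0, 2.0, 4.0]

tree = split_by_genotypes(main_snp, secondary_snp, module_expression)
```

Here the leaves under main genotype 0 have identical means, while those under main genotype 1 differ by secondary genotype, which is the kind of association that only becomes visible after conditioning on the main SNP.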