While understanding the mechanisms of RNA expression is in itself important for understanding biological processes, the ultimate use of this information is identifying the relationship between variation in expression levels and disease phenotypes in an organism of interest. Microarray experiments are commonly used to explore differential expression between disease and normal tissue samples or between samples from different disease subtypes. These studies are designed to detect association between gene expression and disease-associated traits, which in turn can lead to the identification of biomarkers of disease or disease subtypes. However, in the absence of supporting experimental data, these data alone are not able to distinguish genes that drive disease from those that respond. As discussed above, eQTL mapping can aid traditional clinical trait QTL (cQTL) mapping by narrowing the set of candidate genes underlying a given cQTL peak and by identifying expression traits that are causally associated with the clinical traits.
Expression traits detected as significantly correlated with a clinical phenotype may reflect a causal relationship between the traits, either because the expression trait contributes to, or is causal for, the clinical phenotype, or because the expression trait is reactive to, or a marker of, the clinical phenotype. However, correlation may also exist when the two traits are not causally associated. Two traits may appear correlated due to confounding factors such as tight linkage of causal mutations (Schadt et al. 2005), or the correlation may arise because both traits depend independently on a common genetic source. The Ay mouse provides an example of correlations between eumelanin RNA levels and obesity phenotypes induced by an allele that acts independently on these different traits, causing both decreased levels of eumelanin RNA and an obesity phenotype. More generally, the clinical and expression traits for a particular gene may depend on the activity of a second gene in such a manner that, conditional on the second gene, the clinical and expression traits are independent.
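The conditional independence described above can be illustrated with a small simulation. This is a minimal sketch and is not taken from the cited studies: it assumes a hypothetical F2-like genotype at a single locus (coded 0, 1, or 2) that acts independently on an expression trait and a clinical trait, with arbitrary effect sizes and Gaussian noise. All function names and parameter values are illustrative assumptions. Under these assumptions the two traits are correlated marginally, while their partial correlation given the genotype should be close to zero.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate_independent_model(n=5000, seed=0):
    """Simulate one locus acting independently on two traits (hypothetical effects)."""
    rng = random.Random(seed)
    # F2-like genotype: 0, 1, or 2 copies of one parental allele (1:2:1 ratio).
    genotype = [rng.choice((0, 1, 1, 2)) for _ in range(n)]
    # Each trait depends on the genotype plus its own independent noise.
    expression = [g + rng.gauss(0, 1) for g in genotype]
    clinical = [g + rng.gauss(0, 1) for g in genotype]
    return genotype, expression, clinical


genotype, expression, clinical = simulate_independent_model()
marginal = pearson(expression, clinical)
conditional = partial_corr(expression, clinical, genotype)
print(f"marginal correlation: {marginal:.3f}")
print(f"partial correlation given genotype: {conditional:.3f}")
```

The partial correlation used here captures only linear dependence, which is sufficient for this additive toy model but would not be a general test of conditional independence.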
Correlation data alone cannot indicate which of the possible relationships between gene expression traits and a clinical trait are true. For example, given two expression traits and a clinical trait detected as correlated in a population of interest, there are 112 ways to order the traits with respect to one another. If we consider the traits as nodes in a network, then there are five possible ways any two traits (or nodes) can be connected: (1) connected by an undirected edge, (2) connected by a directed edge moving left to right, (3) connected by a directed edge moving right to left, (4) connected by a directed edge moving right to left and a directed edge moving left to right, and (5) not connected by an edge. Since there are three pairs of nodes, there are 5 × 5 × 5 = 125 possible graphs. However, because we start with the assumption that the traits are all correlated with one another, we exclude the 12 graphs in which exactly one node is not connected to either of the other two nodes, as well as the graph in which none of the nodes are connected, leaving us with 112 possible graphs (Fig. A). The joint trait distributions induced by these different graphs are often statistically indistinguishable from one another (i.e., they are Markov equivalent, so that their distributions are identical), making it nearly impossible in most cases to infer the true relationship. On the other hand, when the two traits are at least partially controlled by the same genetic locus and when more complicated methods of control (e.g., feedback loops) are ignored, the number of relationships between the QTLs and the two traits of interest can be reduced to three basic models illustrated graphically in Fig. B. The dramatic reduction in the number of possible graphs to consider is mainly driven by the fact that changes in DNA drive changes in phenotypes and not vice versa. That is, although it is conceivable that changes in RNA or protein lead to changes in DNA at a high enough frequency to detect associations between germ-line transmitted DNA changes and phenotype in segregating populations, this seems extremely unlikely.
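The count of 112 graphs can be checked by brute-force enumeration. The sketch below is only an illustration of the counting argument above; the node labels and the names of the edge states are arbitrary choices made for this example.

```python
from itertools import product

# The five ways a pair of nodes can be connected, as listed in the text.
EDGE_STATES = ("undirected", "left_to_right", "right_to_left", "both_directions", "none")

# Two expression traits and one clinical trait (labels are arbitrary).
NODES = ("expr1", "expr2", "clinical")
PAIRS = (("expr1", "expr2"), ("expr1", "clinical"), ("expr2", "clinical"))


def count_graphs():
    """Return (all graphs, graphs in which every node touches at least one edge)."""
    total = 0
    kept = 0
    for states in product(EDGE_STATES, repeat=len(PAIRS)):
        total += 1
        touched = set()
        for (a, b), state in zip(PAIRS, states):
            if state != "none":
                touched.update((a, b))
        # Keep the graph only if no node is left unconnected.
        if len(touched) == len(NODES):
            kept += 1
    return total, kept


total, kept = count_graphs()
print(total, kept, total - kept)
```

The excluded graphs are those with at most one connected pair: 3 pairs × 4 connected states = 12 graphs with a single connected pair, plus the one graph with no edges.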
It is important to note here that when we use the term causality, it is meant in a sense that may be less familiar to most researchers in the life sciences. In the molecular biology or biochemistry setting, claiming a causal relationship between, say, two proteins usually means that one protein has been determined experimentally to physically interact with or to induce processes that directly affect another protein and that in turn leads to a phenotypic change of interest. In such instances, the causal factors relevant to this activity are understood, and careful experimental manipulation of these factors subsequently allows for the identification of genuine causal relationships. However, in the present setting, the term “causal” is used from the standpoint of statistical inference: statistical associations between changes in DNA, changes in expression (or other molecular phenotypes), and changes in complex phenotypes like disease are examined for patterns of statistical dependency that allow directionality to be inferred among these variables, and the directionality then provides the source of causal information (highlighting putative regulatory control as opposed to physical interaction). The graphical models (networks) described here, therefore, are necessarily probabilistic structures that use the available data to infer the correct structure of relationships among genes and between genes and clinical phenotypes (Schadt and Lum 2006). In a single experiment with one time point measurement, these methods cannot easily model more complex regulatory structures that are known to exist, like negative feedback control. However, the methods can be useful in providing a broad picture of correlative and causal relationships, and while the more complex structures may not be explicitly represented in this setting, they are captured nevertheless given that they represent observed states that are reached as a result of more complicated processes like feedback control.
Distinguishing proximal (“cis”) eQTL effects from distal (“trans”)
All genes expressed in living systems are cis-regulated at some level and so are under the control of various cis-acting elements such as promoters and TATA boxes (Fig. ). In this context, expression as a quantitative trait for eQTL mapping presents a unique situation in quantitative trait genetics because the expression trait corresponds to a physical location in the genome (the structural gene that is transcribed, giving rise to the expression trait). The transcription process operates on the structural gene, and so DNA variations in the structural gene that affect transcription will be identified as eQTLs in the mapping process. In such cases eQTLs would be identified as cis-acting, given that the most reasonable explanation for seeing an eQTL coincident with the physical location of the gene will be that variations within the gene region itself give rise to variations in its expression (Doss et al. 2005). However, because we cannot guarantee that the eQTL is truly cis-acting (i.e., it could arise from variation in a gene that is closely linked to the gene expression trait in question), it is more accurate to refer to such eQTLs as proximal, given that they are close to the gene corresponding to the expression trait. Because the cis-regulated components of expression traits are among the most proximal traits in a biological system with respect to the DNA (given that RNA is transcribed from DNA), we might expect that true cis-acting genetic variance components of expression traits are among the easiest components to detect via QTL analysis, if they exist. This indeed has been observed in a number of studies in which proximal (presumably cis-acting) eQTLs have been identified that explain unprecedented proportions of a trait’s overall variance (several published studies highlight examples where greater than 90% of the overall variation was explained by a single cis-acting eQTL) (Brem et al. 2002; Cervino et al. 2005; Cheung et al. 2005; Lum et al. 2006; Monks et al. 2004; Schadt et al. 2003).
Fig. 3 Mapping proximal and distal eQTLs for gene expression traits. The white rectangles represent genes that are controlled by transcriptional units. The ellipses represent the transcriptional control units, which could be transcription regulatory sites, other ...
Variations in expression levels induced by DNA variations in or near the gene itself may in turn induce changes in the expression levels of other genes (Fig. ). In a population of interest, these other genes may not harbor any DNA variation in their own structural genes, so they do not give rise to true cis-acting eQTLs, but they nevertheless give rise to eQTLs that link to the gene region inducing the changes in their expression. Individual variation in gene expression can therefore be of two fundamental types. The first, termed proximal, often results from DNA variations of a gene that directly influence transcript levels of that gene. The second, termed trans-acting or distal, does not involve DNA variations of the gene in question but rather is secondary to alterations of other true cis-acting genetic variations (Fig. ). In reality, variation in expression traits may be due to variation in cis-acting elements and/or one or multiple trans-acting elements. In addition, master regulators of transcription, which affect the expression of many traits in trans (Fig. ), may exist, though the evidence for them is not yet conclusive in all species, given the limited number of studies and the small sample sizes of the studies published to date.
In most cases it is not possible to infer the true regulatory effects (i.e., cis or trans) of an eQTL without complex bioinformatics study (GuhaThakurta et al. 2006) and experimental validation. As a result, eQTLs have been categorized into proximal and distal types based on the distance between the eQTL and the location of the structural gene. If these are on different chromosomes, the eQTL is distal; if they fall on the same chromosome, the eQTL is considered proximal only if the distance between the structural gene and the eQTL does not exceed some threshold. The exact threshold is a function of the number of meioses and the extent of recombination in a given population data set. In a completely outbred population where LD mapping has been used to fine-map the eQTLs, it has been reasonable to require the distance between the proximal eQTL and structural gene to be less than 1 Mb (Cheung et al. 2005). However, in an F2 intercross population constructed from two inbred lines of mice, the extent of LD will be extreme given that all animals are descended from a single F1 founder, with only two meiotic events separating any two mice in the population. In such cases the resolution of linkage peaks is quite low, requiring the threshold of peak-to-physical gene distance to be more relaxed, so that eQTLs within 20 or 30 Mb could potentially be cis-acting (Doss et al. 2005; Schadt et al. 2003). While the proximal eQTLs provide an easy path to making causal inference, given that the larger effect sizes commonly associated with proximal eQTLs make them easier to detect (Brem et al. 2002; Cervino et al. 2005; Cheung et al. 2005; Lum et al. 2006; Monks et al. 2004; Schadt et al. 2003), the methods discussed above work for distal as well as proximal eQTLs. In fact, if a given gene sits more centrally in a gene network that drives disease, it may capture a larger percentage of the genetic variation associated with the disease (Fig. B), making the gene easier to identify and associate with disease. This was the case in one of the first studies to explicitly leverage DNA and RNA changes to map genes for obesity (Schadt et al. 2005). In that study three genes (including C3ar1 and Zfp90) were identified and validated as causal for obesity, and in all three cases the QTLs that facilitated identification of the causal association were distally acting with respect to the expression traits.
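The distance-based categorization of eQTLs into proximal and distal types described in this section can be sketched in a few lines. This is a minimal illustration, not the procedure used in any of the cited studies: the `Locus` type, the `classify_eqtl` function, and the example coordinates are hypothetical, and the threshold is left as a parameter because, as noted above, it depends on the population design (on the order of 1 Mb for LD mapping in an outbred population, 20 to 30 Mb for an F2 intercross).

```python
from dataclasses import dataclass

MB = 1_000_000  # base pairs per megabase


@dataclass(frozen=True)
class Locus:
    """A genomic position (hypothetical helper type for this sketch)."""
    chromosome: str
    position_bp: int


def classify_eqtl(gene: Locus, eqtl: Locus, threshold_bp: int) -> str:
    """Label an eQTL as 'proximal' or 'distal' relative to its structural gene."""
    # An eQTL on a different chromosome from the structural gene is distal.
    if gene.chromosome != eqtl.chromosome:
        return "distal"
    # On the same chromosome, proximal only if the distance does not exceed the threshold.
    distance_bp = abs(gene.position_bp - eqtl.position_bp)
    return "proximal" if distance_bp <= threshold_bp else "distal"


# Hypothetical example: a gene at 50 Mb and an eQTL peak at 62 Mb on the same chromosome.
gene = Locus("2", 50 * MB)
peak = Locus("2", 62 * MB)
print(classify_eqtl(gene, peak, threshold_bp=1 * MB))   # outbred-style threshold
print(classify_eqtl(gene, peak, threshold_bp=20 * MB))  # F2-intercross-style threshold
```

The same peak is labeled differently under the two thresholds, which reflects the point made above that "proximal" is a statement about mapping resolution in a given population and not proof of cis-acting regulation.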