For an individual organism, DNA has the useful feature that it is usually a static variable, meaning that it is fixed and will not change with changing RNA levels, protein levels, phenotypes, or environmental conditions. By performing designed crosses of genetically distinct inbred or isogenic lines, one can randomize the genotypes of an organism from two or more genetic backgrounds, thereby producing independent realizations of DNA content from offspring to offspring [6]. At the same time, one may measure gene expression, or any other molecular or clinical phenotype of interest, on each resulting recombinant line.

We have developed Trigger as an approach for inferring regulatory relationships among all pairs of genes at the genome-wide level, based on these genetic cross experiments in which high-throughput expression profiling is also performed (Figure ). However, one may also incorporate any other molecular or clinical phenotype of interest into the algorithm.

Probabilities of transcriptional regulation

Suppose that there are *m* genes with transcription levels measured on recombinant offspring from an experimental genetic cross. (In the yeast experiment we consider, *m* = 6,216.) The goal is to use the data from such an experiment to estimate the probability that the transcription of gene *i* has a causal regulatory effect on the transcription of any other gene *j*, which we denote by *P*_{ij}. Here, 'causal regulatory effect' means that a change in the transcription level of gene *i* results in a predictable change in the level of gene *j*. This is not necessarily through a direct molecular interaction; however, if we directly modulate the transcriptional level of gene *i*, then this should result in a corresponding change in the transcriptional level of gene *j*. Trigger provides a conservative estimate of these probabilities, denoted by $\hat{P}_{ij}$ for *i* = 1, ..., *m* and *j* = 1, ..., *m*.

These estimated regulatory probabilities can be used to build a regulatory network based on a directed graph. The probability that a directed edge exists from gene *i* to gene *j* in the network is estimated by $\hat{P}_{ij}$. One can directly threshold the entries, essentially setting those not meeting the threshold equal to zero. For example, one could remove all potential edges with $\hat{P}_{ij}$ < 90% while including those with $\hat{P}_{ij}$ ≥ 90%. A directed edge would then be drawn from gene *i* to gene *j* if and only if $\hat{P}_{ij}$ ≥ 90% (Figure ). The resulting network has an easily quantified and interpretable FDR, and each directed edge has an estimated probability that it is true (see Materials and methods [below] and Additional data file 1).
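The thresholding step can be sketched in a few lines of Python. This is a minimal illustration, not the Trigger implementation: the function name `threshold_network` and the toy 3 × 3 matrix are hypothetical, and the FDR is assumed here to be estimated as the average of one minus the estimated probability over the retained edges, which is a common way to read posterior probabilities; the estimator actually used is described in Materials and methods.

```python
def threshold_network(prob, cutoff=0.90):
    """Keep directed edges i -> j whose estimated probability meets the cutoff.

    prob is a square list-of-lists; prob[i][j] is the estimated probability
    that gene i regulates gene j. Self-edges (i == j) are ignored.
    Returns the retained edges and an FDR estimate for that edge set,
    computed (by assumption) as the mean of 1 - probability.
    """
    edges = [
        (i, j, p)
        for i, row in enumerate(prob)
        for j, p in enumerate(row)
        if i != j and p >= cutoff
    ]
    fdr = sum(1.0 - p for _, _, p in edges) / len(edges) if edges else 0.0
    return edges, fdr


# Hypothetical toy matrix for three genes (not data from the yeast experiment).
toy = [
    [0.00, 0.95, 0.50],
    [0.20, 0.00, 0.91],
    [0.99, 0.10, 0.00],
]
edges, fdr = threshold_network(toy, cutoff=0.90)
for i, j, p in edges:
    print(f"gene {i} -> gene {j} (estimated probability {p:.2f})")
print(f"estimated FDR of the network: {fdr:.3f}")
```

Raising the cutoff shrinks the network and lowers the estimated FDR; lowering it does the opposite.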

In addition to constructing a regulatory network from these estimated probabilities, each gene *i* can be examined as a putative regulator, and hence a quantitative trait gene or 'quantitative trait transcript' [34]. Specifically, the probability that a specific gene *i* is a regulator for each other gene *j* is estimated as $\hat{P}_{ij}$. A threshold can be applied to these estimated probabilities to obtain the FDR of the significant genes (see Materials and methods [below] and Additional data file 1). This particular application of Trigger allows one to move beyond identifying QTL of expression traits to identifying a specific underlying causal quantitative trait transcript.

Causal models of transcriptional regulation

Trigger is based on a mathematical framework that we developed for using randomized genetic backgrounds and genome-wide expression to test rigorously for causality among transcription levels. The approach starts with a pair of transcripts and a locus to which both are linked. Let *L* be the locus, *T*_{i} transcript *i*, and *T*_{j} transcript *j*.

The goal is to identify triplets (*L*, *T*_{i}, *T*_{j}) such that *L* → *T*_{i} → *T*_{j}, where the arrow '→' means causation. The definition of 'causal' has been a topic of much interest [18,19]. Although definitions of causality differ slightly among the many articles published on this topic, in essence *T*_{i} → *T*_{j} means that the ideal manipulation of *T*_{i} will change the distribution of *T*_{j}, whereas the ideal manipulation of *T*_{j} will not disturb the distribution of *T*_{i}. 'Ideal manipulation' of a variable means to change the variable in a manner that leaves every other variable unchanged, at the moment when the manipulation occurs [35]. This framework also applies to causality among random variables.

With the genetic cross experimental design, the genotype at a fixed locus *L* is a random variable, whose random outcome occurs before and independently of the subsequently measured expression values. For example, in the yeast experiment analyzed below, two haploid parental strains (BY and RM) were crossed to produce 112 recombinant haploid segregant strains. Because of the random segregation of chromosomes during meiosis, the inheritance of *L* = BY or *L* = RM is random. Therefore, when measuring the alleles at a single locus *L* across 112 segregants, we observe 112 genotypes being generated from some probability distribution. (See Materials and methods [below] for explicit details on the assumptions we make about the randomized genotypes among the loci.)

Because the randomization of *L* takes place before the expression levels of *T*_{i} are measured, linkage of *T*_{i} to locus *L* implies that *L* → *T*_{i}. This property follows from the well-established statistical principle that an association between two variables, when one of them is properly randomized, implies causation [19,20]. Additionally, the randomization of *L* is carried through to the variation in *T*_{i} whenever *L* → *T*_{i}. If *L* → *T*_{i}, then segregants with *L* = BY have a different mean expression for *T*_{i} than segregants with *L* = RM. Therefore, the randomization of *L* provides a randomization of the mean level of expression for *T*_{i}. Figure shows the transcriptional levels for a given gene, and Figure shows a case in which it is linked to some locus *L*. Because the inherited allele *L* = BY or *L* = RM is random for each segregant, the mean level of expression for *T*_{i} is random when *L* → *T*_{i}.
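A small simulation may help make this point concrete. The sketch below is illustrative only: the function names, the effect size of 2.0, the unit-variance noise, and the equal allele probabilities are all assumptions chosen for the example, not values from the yeast data. Each of 112 simulated segregants inherits BY or RM at random, and the expression of a linked transcript is drawn around a genotype-specific mean.

```python
import random


def simulate_linkage(n=112, effect=2.0, seed=0):
    """Simulate n segregants: a random allele at L and a linked expression level.

    The allele (BY or RM) is drawn with equal probability, mimicking random
    segregation. Expression is Gaussian noise around a mean that depends on
    the allele (0 for BY, `effect` for RM); all values are hypothetical.
    """
    rng = random.Random(seed)
    genotypes = [rng.choice(["BY", "RM"]) for _ in range(n)]
    expression = [
        (effect if g == "RM" else 0.0) + rng.gauss(0.0, 1.0) for g in genotypes
    ]
    return genotypes, expression


def mean_by_genotype(genotypes, expression):
    """Return the mean expression within each genotype class."""
    means = {}
    for allele in sorted(set(genotypes)):
        values = [e for g, e in zip(genotypes, expression) if g == allele]
        means[allele] = sum(values) / len(values)
    return means


genotypes, expression = simulate_linkage()
means = mean_by_genotype(genotypes, expression)
print({allele: round(m, 2) for allele, m in means.items()})
```

The two group means differ because the randomized allele shifts the mean, while the spread within each group is the part of the variation that the locus does not explain.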

Importantly, some of the variation in *T*_{i} will not be explained by *L*, specifically the random fluctuations of the transcription levels within each genotype (Figure ). Therefore, it is not possible to conclude that *T*_{i} → *T*_{j} whenever *T*_{i} and *T*_{j} are significantly associated with *L*. This follows because there could be a common hidden variable affecting both *T*_{i} and *T*_{j}. (Note that if *T*_{i} were perfectly randomized, then there would be no causal hidden variable for *T*_{i}, which demonstrates the power of randomization.) Suppose that a hidden variable *H* is such that *H* → *T*_{i} and *H* → *T*_{j}. Because of this common hidden causal variable, any association between *T*_{i} and *T*_{j} would not allow us to conclude that *T*_{i} → *T*_{j}, even though *T*_{i} has been partially randomized. In other words, the partial randomization of *T*_{i} caused by *L* is now confounded by the effect of *H*. The common causal hidden variable *H* does not prevent *T*_{i} → *T*_{j} from occurring; rather, we are simply unable to draw any conclusion when this is the case, unless we are willing to model common hidden causal variables. Modeling common hidden causal variables has been shown to be particularly challenging in this high-dimensional setting [36], and doing so would require much additional work.

If there is a common causal hidden variable *H* that affects both *T*_{i} and *T*_{j}, then the Trigger method is designed not to draw any conclusions about causality. However, if there is no common hidden causal variable, then it is possible, in a straightforward manner, to determine whether *T*_{i} → *T*_{j}. The following new theorem identifies three conditions that are equivalent to the case in which both *L* → *T*_{i} → *T*_{j} and no common causal hidden variable affects both *T*_{i} and *T*_{j}. (See Materials and methods [below] for a mathematical proof.)

Causality equivalence theorem

The causal relationship *L* → *T*_{i} → *T*_{j} exists and there are no hidden variables causal for both *T*_{i} and *T*_{j} if and only if the following three conditions hold: *L* → *T*_{i}, *L* → *T*_{j}, and *L* ⊥ *T*_{j} | *T*_{i}.

This theorem is used in the following manner. If *L* → *T*_{i}, *L* → *T*_{j}, and *L* ⊥ *T*_{j} | *T*_{i}, then we may conclude that *L* → *T*_{i} → *T*_{j} exists and there are no hidden variables causal for both *T*_{i} and *T*_{j}. The fact that 'there are no hidden variables causal for both *T*_{i} and *T*_{j}' is not an assumption. Rather, it is a verified fact that follows when the three properties are true, as we show in the proof given in Materials and methods (below). We would prefer to detect all cases where *L* → *T*_{i} → *T*_{j}; however, as explained above, it is not yet possible to do so in the presence of common causal hidden variables.

Figure provides a graphical representation of the three properties that must be satisfied. The last condition, *L* ⊥ *T*_{j} | *T*_{i}, denotes that *T*_{j}, conditioned on the information in *T*_{i}, is independent of *L*. The first two conditions ensure that both transcripts are subjected to a common randomization. The third condition is the key one for inferring causality based on these randomizations: it determines whether the causal effect of *L* on *T*_{j} can be captured entirely by *T*_{i}. If so, then *T*_{i} is indeed a causal factor for variation in *T*_{j}, with no hidden variables.
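The role of the third condition can be illustrated with a simulation. The sketch below is not the Trigger test, which is nonparametric; as a simplifying assumption it uses a linear partial correlation as a stand-in for conditional independence, and all function names, effect sizes, and sample sizes are hypothetical. Two scenarios are simulated: a pure chain *L* → *T*_{i} → *T*_{j}, and the same chain with a common hidden variable *H* acting on both transcripts.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly adjusting both for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(n, hidden, seed):
    """Draw (L, T_i, T_j) with T_i -> T_j; optionally add a common hidden H."""
    rng = random.Random(seed)
    locus, t_i, t_j = [], [], []
    for _ in range(n):
        l = rng.randint(0, 1)                       # randomized genotype (0 = BY, 1 = RM)
        h = rng.gauss(0.0, 1.0) if hidden else 0.0  # common hidden variable H
        ti = 2.0 * l + h + rng.gauss(0.0, 1.0)      # L -> T_i (plus H -> T_i if hidden)
        tj = ti + h + rng.gauss(0.0, 1.0)           # T_i -> T_j (plus H -> T_j if hidden)
        locus.append(l)
        t_i.append(ti)
        t_j.append(tj)
    return locus, t_i, t_j


for label, hidden in [("chain only", False), ("chain plus hidden H", True)]:
    locus, t_i, t_j = simulate(20000, hidden, seed=1)
    print(
        label,
        "| corr(L, T_i):", round(pearson(locus, t_i), 2),
        "| corr(L, T_j):", round(pearson(locus, t_j), 2),
        "| partial corr(L, T_j | T_i):", round(partial_corr(locus, t_j, t_i), 2),
    )
```

In both scenarios the first two conditions hold, because both transcripts are associated with the locus. Only in the pure chain does the association between *L* and *T*_{j} vanish once *T*_{i} is accounted for; with a common hidden variable, conditioning on *T*_{i} leaves a residual dependence, so the third condition fails and no causal conclusion would be drawn.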

For computational and statistical efficiency, we limit *L* to be the locus of gene *i* (see Additional data file 1), which we denote as *L*_{i}. We call *L*_{i} → *T*_{i} the 'primary *cis* linkage' and *L*_{i} → *T*_{j} for any other gene *j* the 'secondary linkage'. Because Pr(*T*_{i} → *T*_{j}) ≥ Pr(*L* → *T*_{i} → *T*_{j}), we can obtain a conservative estimate of *P*_{ij} by estimating Pr(*L* → *T*_{i} → *T*_{j}). From the causality equivalence theorem it follows that:
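The displayed product is not present in this text. A reconstruction, assuming only the chain rule of probability applied to the three conditions of the theorem in the order in which the theorem states them, would read as follows. By the theorem, the left-hand side is the probability that *L* → *T*_{i} → *T*_{j} holds with no hidden variable causal for both transcripts.

```latex
\Pr\bigl(L \to T_i,\; L \to T_j,\; L \perp T_j \mid T_i\bigr)
  = \Pr(L \to T_i)\,
    \Pr(L \to T_j \mid L \to T_i)\,
    \Pr\bigl(L \perp T_j \mid T_i \;\big|\; L \to T_i,\; L \to T_j\bigr)
```

Read from left to right, the three factors correspond to the primary linkage, the secondary linkage given the primary linkage, and the conditional independence given both linkages.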

The Trigger algorithm conservatively estimates *P*_{ij} by estimating each probability in the above product from left to right and taking their product. (See Materials and methods [below] and Additional data file 1.)

Application to yeast

We applied the Trigger algorithm to the yeast experiment (Materials and methods [below]) and found several interesting characteristics of the resulting regulatory probability matrix. Table 1 lists the overall significance results with different probability thresholds, and Additional data file 2 contains the entire regulatory probability matrix. For example, at a probability threshold of 90%, we found 4,394 significant regulatory relationships among 2,145 genes, 127 of which are causal. Figure shows a regulatory network drawn from the Trigger results at this threshold, where a directed edge is drawn from gene *i* to gene *j* if and only if *P*_{ij} ≥ 90%. It can be seen from Figure that the network is highly interconnected and has a clear 'hub structure'.

| **Table 1** Overall significance of the regulatory probability matrix at different probability thresholds |

We examined in detail four genes as putative regulators: *CNS1* on chromosome 2, *ILV6* on chromosome 3, *SAL1* on chromosome 14, and *NAM9* on chromosome 14. Each was highly significant for *cis* linkage, and the locus of each putative regulator had many significant secondary linking genes. At a 90% posterior probability cut-off (FDR = 6%), 144, 51, and 36 genes were significant for being regulated by *CNS1*, *ILV6*, and *SAL1*, respectively. At an 80% posterior probability cut-off (FDR = 11%), 14 genes were significant for being regulated by *NAM9*. The significant genes, posterior probabilities, and other relevant information for each putative regulator can be found in Additional data file 3. Note that each of these putative regulators is also a significant quantitative trait gene (or quantitative trait transcript) for each expression trait that it significantly regulates. Figure shows heat maps of the four putative regulators and their corresponding significantly regulated genes. It can be seen that each significant gene is both linked to the locus of the putative regulator and has correlated expression with the regulator within each genotype, both of which are necessary but not sufficient for causality.

To determine whether the genes that are significant for each putative regulator show a coherent functional relationship, we employed the Gene Ontology (GO) database [37]. For each putative regulator, we queried the database with all significant genes and the regulator itself. This approach draws on independently performed experiments and synthesizes the information obtained from them. The GO searches allowed us to test specifically whether common processes, functions, and components are present among each set of genes. Indeed, we found many significantly enriched GO terms for each set of genes corresponding to a putative regulator.
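The text does not state which statistical test underlies the GO enrichment. A common choice for this kind of question is the upper tail of a hypergeometric distribution, and the sketch below shows that calculation under this assumption. The function name `hypergeom_tail` is hypothetical, and apart from the background of 6,216 genes taken from the experiment, the counts in the example are illustrative rather than results from the analysis.

```python
from math import comb


def hypergeom_tail(population, annotated, drawn, observed):
    """P(X >= observed) for a hypergeometric random variable X.

    population: number of genes in the background
    annotated:  background genes carrying the GO term
    drawn:      size of the queried gene set
    observed:   queried genes carrying the GO term
    """
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, x) * comb(population - annotated, drawn - x)
        for x in range(observed, upper + 1)
    )
    return favourable / total


# Illustrative numbers: a queried set of 37 genes, 6 of which carry a term
# that annotates a hypothetical 150 of the 6,216 background genes.
p_value = hypergeom_tail(population=6216, annotated=150, drawn=37, observed=6)
print(f"enrichment P value: {p_value:.2e}")
```

In practice such P values would also need adjustment for the many GO terms tested, a step this sketch leaves out.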

Figure shows the results of GO analysis for the putative regulator *NAM9*, which is a component of the small subunit of the mitochondrial ribosome and whose deletion is inviable [38]. It is annotated as a structural constituent of ribosome, as involved in translation, and as part of the mitochondrial small ribosome subunit [39-41]. Of the 14 genes significant at an 80% posterior probability threshold (FDR = 11%), 13 are known to be in the same or a similar pathway as *NAM9*. The other significant gene is heretofore uncharacterized. Translation, structural constituent of ribosome, and mitochondrial small ribosome subunit are all highly significant terms in the GO tree.

Additional data file 1 (Figure S1) shows the results for the putative regulator *CNS1*, which is an essential tetratricopeptide repeat (TPR)-containing co-chaperone, deletion of which is inviable [42]. It binds both heat shock protein 82p (Hsp82p) and Ssa1p (Hsp70), and stimulates the ATPase activity of *SSA1*. *CNS1* is involved in the protein binding process, and its cellular component is associated with cytoplasm [42-45]. Of the 144 genes significant at the 90% joint posterior probability cut-off (FDR = 6%), a substantial subset is involved in transferase activity and ribosome biogenesis and assembly, which coincides with the key role played by *CNS1* in yeast. Many of the 144 genes were also found to be in the same pathway as *CNS1*; for example, *TRM8* and *CNS1* are both involved in a pathway for protein binding [46,47].

Additional data file 1 (Figure S2) shows the significant GO results for *ILV6* and its 51 genes under statistically significant regulation. *ILV6* is a regulatory subunit of acetolactate synthase, which catalyzes the first step of branched-chain amino acid biosynthesis [48,49]. Amino acid biosynthesis and its associated pathways are significantly enriched GO terms with *P* values below 10^{-10}. Cyclohydrolase activity and lyase activity are among the other significant terms identified by GO analysis.

The putative regulator *SAL1* is a probable transporter and a member of the calcium-binding subfamily of the mitochondrial carrier family, with two EF-hand motifs. It is annotated with transporter activity and calcium ion binding [50], and its cellular component is the mitochondrial inner membrane [51]. From the GO analysis (Additional data file 1 [Figure S3]), we can see that a number of the 36 genes significantly regulated by *SAL1* are associated with the mitochondrion and membrane GO terms. Six of the 36 significantly regulated genes are annotated to the mitochondrial inner membrane with high statistical significance (*P* < 10^{-8}), a trend that is consistent with previous findings [50,51].

It should be noted that in the case of *SAL1* no polymorphism exists in the 500-base regions immediately upstream or downstream of the *SAL1* open reading frame. The linkage peaks occur approximately 13 kilobases and 21 kilobases on either side. This illustrates that linkage does not have to be due to an unequivocally *cis*-acting regulatory polymorphism in order for Trigger to work. On the contrary, there must simply be some locus to which both expression traits *T*_{i} and *T*_{j} are linked. We justified limiting the locus *L* to the 50-kilobase region of *T*_{i} on the basis of gains in computational and statistical efficiency (Additional data file 1).

In addition to these four well-characterized putative regulators, we noticed that expression levels of a number of genes with relatively unknown function (for instance, *YSW1* and *PHM7*) were predicted to regulate a number of genes, with significant GO terms appearing for each set. Therefore, our results can potentially be used to predict properties of relatively unknown genes as well. Furthermore, several transcription factors significantly regulated a number of genes, including *HAP1* [52,53] and *RAD16* [54,55]. In previous work it was found that mutations in *GPA1* and *AMN1* lead to expression changes in genes whose expression exhibits linkage to each respective locus [14]. Missense mutations (leading to amino acid changes in the protein product) were identified in both *GPA1* and *AMN1* that appear to be the cause of the expression changes in the linking genes. In work to be reported in the future we examine the *GPA1* and *AMN1* cases in detail, showing that there appear to be common causal hidden variables involved. The Trigger approach is extended to take into account these common causal hidden variables, allowing us to recapitulate the previous findings regarding *GPA1* and *AMN1*.

Comparison with other approaches

Mendelian randomization

Recently, 'Mendelian randomization' was proposed as a technique in genetic epidemiology to study the environmental determinants of disease [27,28]. Trigger builds upon this concept in the sense that it also employs the randomization of genotypes as a starting point to infer causality. Essentially, we have extended this idea by deriving precise conditions under which the causality of one trait on another can be confirmed and by providing a statistical technique for estimating the probability that one trait is causal for another, among potentially thousands of traits.

Model selection approaches

The concepts of 'causality' and 'regulation' have been utilized in different ways in previous reports concerning the construction of biologic networks [29,30,32,56-60]. Among those using the more rigorous definition of causality [35,61], most published approaches choose among the best-fitting causal models by partial correlation or by model selection. The difference between our work and most previous work is that we explicitly test for and quantify each causal relationship of interest by using the randomization of genetic backgrounds built into the genetic cross experimental system. Furthermore, we assess the significance of each causal relationship by estimating the probability that the causal relationship is true, so that it can be considered in a straightforward manner alongside millions of other potential causal relationships.

We have made some simple comparisons between Trigger and the model selection and correlation-based approaches (Figure ). In addition to Trigger showing different significance rankings relative to these approaches, it offers an increase in specificity. Most of the papers employing model selection have used the 'Akaike information criterion' (AIC) or derivatives thereof [29,31,32]. Among the approximately 38 million triplets (*L*_{i}, *T*_{i}, *T*_{j}), the AIC model selection method [62] classifies about 15.4 million as causal, whereas Trigger identifies about 4,400 causal relationships with probability exceeding 90%. For the putative regulator *CNS1*, about 2,800 genes are classified as having a causal relationship with *CNS1* by model selection, as opposed to the 144 that Trigger found to be significant with probability exceeding 90%. The advantages that Trigger has over AIC and other model selection criteria are as follows: there is no generally applicable method to obtain an interpretable measure of significance based on these criteria (which is especially problematic when considering thousands of traits); and these approaches force one to model directly all possible hidden variables, typically making unverifiable assumptions about their underlying model [11].

Extensions to other data types

We have presented Trigger within the context of inferring regulatory relationships based on gene expression data from organisms with randomized genetic backgrounds. However, this method may be applied to a much broader class of data types. Because the estimation is done in a nonparametric and scale-free manner (Materials and methods [below] and Additional data file 1), it is possible to use any combination of expression, proteomic, metabolomic, and phenotypic data as the variables among which causal relationships are inferred. These may be considered separately or simultaneously, allowing one to discover regulatory relationships, say, among protein levels and transcription levels. The general requirement is that one must acquire organisms with random genetic backgrounds that are essentially stable as the expression levels and other potential traits are measured. The computational approach and statistical principles underlying the method remain the same for all of these data types.