2D Linkage Scan
Initially, we applied an exhaustive 2D linkage scan in order to identify expression traits that are significantly linked to pairs of loci or that are significant for epistasis. In performing these significance tests, we considered a linear model that fully parameterizes the quantitative trait in terms of all four possible genotypes. This model can be written as
Traditionally, genetic linkage is said to exist between the trait and a pair of loci if any of the locus effects are significantly different than zero, but not necessarily all of them [24
]. Epistasis exists only if the locus1 × locus2 interaction is significantly different from zero. This linear model approach to identifying QTLs is well justified [24
] and has been shown to be especially useful when the markers are densely sampled [18
The test for pair-wise linkage was performed as follows. For each expression trait, a linear model was fit by least squares to each pair of loci. The locus pair with the largest F-statistic comparing the full model to the baseline model was selected for that trait. For the test of epistasis, a similar procedure was performed, except an F-statistic was computed that compared the full model to the purely additive model, which directly assesses the contribution of the interaction term. The significance of each locus pair selected for linkage was computed using a standard permutation technique against the null hypothesis of no linkage to either locus [25
]. For the test of epistasis, we used a similar permutation technique to assess the significance of the locus1 × locus2 interaction [26
]. The end-product of these tests is a p-
value and a pair of loci for each expression trait. Using the false discovery rate (FDR) quantity to correct for multiple comparisons [27
], there were 3,540 traits significantly linked to a locus pair at the 5% FDR threshold. Note that in this case, a “false discovery” must be defined as a trait where both selected loci are false positives, and thus many of the “true discovery” traits could have one false-positive locus. In the test for epistasis, no significant results were obtained; the number of significant tests at each cut-off mirrored the number found under the null permutations.
The exhaustive 2D search proved to be unsatisfactory for a number of reasons. Most obviously, the number of multiple-locus models that have to be considered is computationally and statistically challenging for pairs of loci, and prohibitive for three or more loci. With 3,312 markers and 6,216 expression traits, one has to consider more than 18 million single-locus models to simply test for linkage between every expression trait and locus. More than 27 billion tests have to be performed to consider all two-locus models for every expression trait, and more than 27 trillion tests to consider all three-locus models for every expression trait. In addition, it is likely that by searching over so many models, the statistical power to detect linkage is severely attenuated because of the multiple comparison problem. Secondly, when employing an exhaustive 2D scan, there is no statistically rigorous method to test for joint linkage, which exists only if both loci have nonzero terms in the full model. In other words, the significance of an individual locus selected for an expression trait is confounded with the overall significance of the pair of loci. Since p-values are calculated against the null hypothesis of no linkage, a highly significant result may be due to only one locus being truly linked while the other locus is included by chance. This confounding is especially problematic when considering thousands of traits simultaneously. Since pairs of loci that show large marginal effects are preferentially selected when testing thousands of traits for linkage, one cannot examine marginal effects of individual loci among the most significant linkages in an unbiased fashion. Therefore, by chance it may appear that both loci explain a large proportion of variance of the trait. Ideally, a measure of significance for each locus would be available, and then a joint measure of significance for all loci would be calculated. In our case, we did not want to call a trait significantly linked if either of the loci were a false positive. Finally, a decision must be made as to which traits to call significantly linked. If the goal is to avoid any false-positive loci when calling a trait significant, then there is no simple p-value that can be formed for this purpose. This follows because the null hypothesis consists of multiple scenarios, and there is no readily available null distribution to describe this. Therefore, a more sophisticated method must be used to assess the significance of thousands of multi-locus models.
Stepwise Search More Powerful than 2D Search
One potential way to improve the exhaustive 2D scan is to use another method for selecting pairs of loci. In particular, one can select loci in a sequential manner, cutting down the number of models considered to 2 × 3,312, instead of more than 5 million. One readily available method for selecting loci in a sequential manner is forward stepwise regression. Here, one selects a primary locus that shows the most significant one-dimensional (1D) linkage, i.e., the one that has the largest LOD score. This is equivalent to identifying the locus that yields the smallest residual sum of squares when regressing the expression trait on the inheritance pattern at that locus [24
]. Next, a secondary locus is chosen that yields the largest LOD score conditional on the primary locus being linked. Again, this is equivalent to choosing a locus that minimizes the residual sum of squares when regressing the expression trait on both the secondary locus and the primary locus contained in the same model. A tertiary locus may be selected by including the previous two loci in the regression, and so on.
One can use this forward stepwise regression technique simply as a way to select pairs of loci for each trait, and then the significance analysis can be repeated as before. It has been hypothesized that failing to consider all possible two-locus models through an exhaustive 2D search (e.g., selecting loci in a sequential fashion) may lead to a loss in power or to missing important interactions between loci [29
]. The yeast expression dataset presents a rare opportunity to test this idea; in a typical study, only one quantitative trait is measured and only one scan is performed, which makes it difficult to compare different multi-locus selection methods. Simulation-based comparisons make a number of assumptions that may not always be true. Here, we are able to make a direct comparison based on thousands of related scans, all conditional on the same set of meioses. Thus, we compared the exhaustive 2D search to a simple sequential search method by repeating every step exactly as above, except that the sequential search method was used to select pairs of loci (see Materials and Methods
). At FDR cut-offs of 1% and 5%, there are 2,780 and 4,271 significant pair-wise linkages, respectively. At these same cut-offs, the 2D search yielded substantially fewer: 1,715 and 3,540 significant linkages. A shows the number of significant pair-wise linkages over a range of p-
value thresholds, where it can be seen that the sequential scan consistently finds a greater number. Since any given p-
value threshold results in the same number of expected false positives for each type of search, this is empirical evidence that the sequential search is more powerful. A 2D search could still produce more biologically meaningful results. In order to assess this, we measured the overlap in selected locus pairs among the traits corresponding to the 3,000 most significant linkages identified by the 2D method. The locus pairs selected by the two methods for each given trait were considered to be equivalent if the same two chromosomes were identified and the respective chromosomal locations of the loci were within 50 kilobases (kb) of each other. Under this definition, 90% of the locus pairs were found to be equivalent between the two methods.
A Power Comparison of the 2D Locus Pair Search and the Sequential Search
The sequential search was also more powerful for identifying epistasis relative to the exhaustive 2D scan. B shows the number of traits called significant for epistasis over a range of p-
value thresholds, where it can be seen that the sequential search is again more powerful. Neither search method yielded a trait with high significance for epistasis. However, from the sequential method we are able to estimate that at least 14% of the traits are operating under epistasis, whereas due to a lack of power the 2D search estimate is 0%. This estimate is obtained by the following reasoning, which has been rigorously developed elsewhere [27
]. If the locus pair identified for each transcript were a false positive, the distribution of p-
values across all transcripts would be flat and uniformly distributed between zero and one; thus, the shape of the observed distribution of p-
values can be used to estimate the total proportion of false positives. The more powerful a set of statistical tests are, the more this flatness can be distinguished from the signal. We did not see much overlap in locus pairs selected among the two search methods, but this is not surprising given that the 2D search apparently produces only noise.
Therefore, for this particular experiment, the sequential search is more powerful than the exhaustive 2D search in identifying pair-wise linkage and detecting epistasis. The sequential search also appears to extract a biological signal that is similar to that from the 2D search. However, it is not possible to conclude whether these properties would hold in other experiments or for different sample sizes. Also, the comparison was made based on significance assessed against the null hypothesis of no linkage, which is not a solution to the problem of detecting joint linkage. The sequential approach as implemented above still suffers from the problem that significance can be driven by a single locus while the other locus is a false positive. However, sequentially selecting loci allows their individual significance to be assessed, which we show is crucial in detecting true joint linkage. We discuss how to assess individual and joint significance for the sequential approach below; we note that the same methods would not work without a number of potentially unjustifiable assumptions for the exhaustive 2D search.
We developed a method to overcome the following problems associated with existing approaches: a prohibitively large number of multi-locus models are considered, a clear measure of significance among individual loci is not available, and the desired alternative hypothesis that all selected loci are linked for each trait is not tested. The method can be summarized in four steps.
Step 1. For each expression trait,L loci are identified through a sequential locus selection procedure, as above.
Step 2. At each stage of the sequential search, a Bayesian technique is employed to calculate the probability that the locus is linked to the expression trait, conditional on the assumption that the previously chosen loci are also linked.
Step 3. The locus-specific probabilities are combined to form the probability that all loci are simultaneously linked to the expression trait.
The overall probabilities of linkage from Step 3 provide a ranking of the traits from most significant to least significant. It is then necessary to select a set of traits, each of which has a high probability of being linked to all loci simultaneously. In order to guide this choice, we propose a method to assess the statistical significance of a given set of traits.
Step 4. A significant expression trait is called a false discovery if any of the loci selected for that trait is a false positive. That is, a true discovery is an expression trait where all selected loci are truly linked. A new approach for estimating the FDR among a set of significant traits is employed that directly utilizes the probabilities calculated in Step 3.
The starting point for the method is to define a multi-locus model that may include varying numbers of loci, where it is clear how one modifies the model to include an additional locus. Here, we continue to use the fully parameterized model. For zero, one, and two loci, the model may be written, respectively, as
where, for example, “locus1” is the main effect for the primary locus, and “locus1 ×
locus2” is the epistatic interaction between the primary and secondary loci (Materials and Methods
). A sequential search can then be performed to identify the top linked locus for each expression trait, which involves finding the locus that offers the greatest improvement in goodness of fit when comparing model M1 to model M0. This primary locus is then included in models M1 and M2, and a secondary locus is identified that provides the greatest improvement in goodness of fit when comparing model M2 to model M1. Continuing this process, an ordered set of L
loci for each expression trait can be identified. Here we consider only L
= 2 loci, but the method can be applied to larger numbers of loci.
The Bayesian posterior probability that the primary locus for each trait shows linkage can be written as Pr(locus 1 linked |Data). Since the secondary locus is identified conditional on the presence of the primary locus, the probability that it is linked is calculated conditionally on the primary locus being linked: Pr(locus 2 linked | locus 1 linked, Data). Note that probabilities may be formed analogously for L loci, with the final probability beingPr(locus L linked |loci 1,...,L-1 linked, Data). These conditional probabilities are conceptually consistent with the procedure used to select loci. For example, the secondary locus is not called significant unless the primary locus is also called significant, since it was used in identifying the secondary locus.
The above probabilities give a measure of significance to each locus. However, one would also like to know the joint significance of the loci. The probability that all loci are linked to the expression trait is simply the product of the locus-specific probabilities. For example,
These joint-linkage probabilities can be used to select traits that are significant for having all loci jointly linked by calling all traits significant that have a joint-linkage probability exceeding some threshold. For example, all traits with Pr
(loci 1 and 2 are linked | Data) ≥ 0.90 may be called significant. This threshold is equivalent to ranking the traits for significance by the size of the joint-linkage probability. (Variations on this ranking procedure are possible, depending on the goals of the study; e.g., one may want to consider only traits that have all locus-specific linkages probabilities at 0.95 or greater—see Materials and Methods
.) In order to decide on a reasonable threshold, an error rate associated with the thresholding rule is assessed. For example, how reliable is the list of traits that have a joint-linkage probability of 0.90 or greater? The FDR concept is attractive in this case since thousands of traits are simultaneously being assessed, and we would like to select several without incurring too many mistakes. The FDR is typically defined and estimated in terms of multiple hypothesis tests [27
]. However, the “null hypothesis” here is complicated because it includes any scenario where one or both of the loci are not truly linked. However, when assuming a Bayesian model, the FDR has been shown to be equal to a Bayesian posterior probability [30
], and the FDR concept and estimation methodology can be extended to accommodate our situation.
A trait is defined to be a “false discovery” for joint linkage if any of its selected loci is a false positive. In standard multiple hypothesis testing situations, the false discovery has been estimated as the ratio of the estimated number of false positive divided by the observed number of tests called significant. For a given threshold we place on the traits, it is straightforward to count how many are called significant, but it is not as easy to estimate the expected number of false positives because the null distribution of the joint-linkage probabilities is not available. However, when identifying pairs of loci for each trait, the probability a trait is a false discovery is 1 −Pr(loci 1 and 2 are linked| Data). Therefore, the overall expected number of false discoveries is the sum of the 1 −Pr(loci 1 and 2 are linked| Data) over all traits called significant for two-locus linkage. An estimate of the proportion of false discoveries among significant linkages is then
where again the summation is taken over all traits called significant for a two-locus linkage. This estimate can be justified in the context of Bayesian representations of the FDR, but it also has connections to p-
value based estimates ([30
]; Materials and Methods
). For example, in our study there are 72 traits that have two-locus joint-linkage probabilities of 0.90 or greater. Summing all 72 corresponding quantities 1−Pr
(loci 1 and 2 are linked | Data), there are 4.8 expected false discovery two-locus linkages among these. Therefore, the FDR estimate of this particular threshold is 4.8/72 = 6.7%.
In practice, the locus-specific and joint-linkage probabilities must be estimated. Due to the massive amount of available data, we form nonparametric estimates of the probabilities rather than making assumptions about their distributions. At each stage of the locus selection, the strength of linkage is quantified by a standard F-statistic used to compare two models (M1 versus M0 or M2 versus M1). The statistics associated with the primary and secondary loci for each trait are the maximal F-statistics among all loci. Since these maximal statistics do not have a known null distribution, the null distributions are simulated. The quantitative trait values are permuted and the maximal statistics are recomputed [25
] to give permutation null statistics. Note that when the null statistics are simulated for the secondary loci, the fact that the primary loci are assumed to be truly linked is taken into account [26
]. That is, the null statistics corresponding to the secondary loci are calculated conditionally on the genotypes of the primary loci. We performed five permutations to yield sets of 6,216 ×
5 simulated null statistics corresponding to the primary and secondary locus selections. The observed statistics and null statistics corresponding to the primary loci are used to estimate the linkage probabilities Pr
(locus 1 linked | Data) for each trait. Similarly, the observed and null statistics corresponding to the secondary loci are used to estimate Pr
(locus 2 linked | locus 1 linked, Data) for each trait.
A key aspect of our proposed approach is that loci are selected one at a time for each given expression trait. In the traditional approach, pairs of loci are selected together so that among these locus pairs, zero, one, or two loci may be truly linked. Therefore, it is not possible to model all three cases without making a number of assumptions. However, since we select only one locus at a time, there are only two possible outcomes at each selection step: the locus is either linked or not. The statistics calculated at each locus selection stage are a mixture of the two distributions corresponding to these linked and unlinked loci. The permutation null statistics represent one component of this mixture and can be used in conjunction with the observed statistics to conservatively estimate the locus-specific linkage probabilities. shows a plot of the strategy used to form these estimates. The solid black line is an empirical probability density (i.e., smoothed histogram) of the 6,216 observed statistics calculated from the secondary loci. This density is a mixture of a null density corresponding to statistics of unlinked loci, and an alternative density corresponding to statistics of linked loci. The solid grey line is an empirical probability density of the permutation null statistics, which has been drawn to reflect its relative contribution to the black mixture density of observed statistics. The observed statistics and permutation null statistics can be used to conservatively estimate the proportion of true linkages among all secondary loci ([27
]; Materials and Methods
), which is the mixing proportion of the null density and the prior probability of linkage. In order to calculate the locus-specific linkage probability, the ratio of the null density to the mixture density must also be estimated. This ratio is estimated by adaptively considering the relative frequency of null statistics to observed statistics in small intervals around each possible value of the observed statistics (Materials and Methods
). Once the locus-specific linkage probabilities are estimated, these quantities are plugged into the above proposed procedure to obtain a set of significant two-locus linkages.
An Example of the Locus-Specific Linkage Probability Estimation Applied to the Secondary Loci
Two-Locus Joint Linkage Applied to Gene expression Traits in Yeast
We applied the proposed method for two-locus linkage analysis to the S. cerevisiae experiment. Based on the joint-linkage probabilities, we estimate that 2,300 traits (approximately 37%) are jointly linked to two loci, although we cannot identify all of these with high confidence. Of these 2,300 traits, 170 can be identified at a FDR of 10%. Among these 170 traits, the primary locus FDR is less than 0.2%. Therefore, we expect that at most 17 of these joint linkages include a single false-positive locus, and about zero include two false-positive loci.
Recall that when a more liberal definition of two-locus linkage was used, where only one locus was required to be linked, about 4,000 linkages were called significant at a FDR of 5%. However, in that situation it was not clear whether both loci or just a single locus were truly linked. Because we identify only 170 significant joint linkages at a FDR of 10%, it appears that many of the 4,000 significant linkages from the other approach were due to only a single locus being truly linked. When comparing our method to a traditional 1D linkage scan where the top two linkage peaks are taken as significant, we find 3.3 to 8.7 times more linkages at FDR cut-offs ranging from 1% to 10% (Materials and Methods
). These observations indicate that our proposed approach provides a new and statistically rigorous framework for distinguishing between genetic models.
To better understand the molecular mechanisms underlying the observed linkages for these traits, we searched for cis
-acting effects. Here a cis
linkage is said to occur if one of the two linkage peaks coincides with the position of the encoding gene corresponding to the expression trait. In total, 58 traits demonstrate a cis
linkage (Table S1
), which has two important implications. First, the observation of a cis
linkage immediately suggests a candidate QTL that can be experimentally tested. Second, these results demonstrate that variation in the expression level for a given trait cannot simply be dichotomized into either cis
effects, as both can simultaneously contribute to variation in gene expression levels.
Several previous linkage analyses of gene expression levels in yeast and other organisms have shown that linkages are nonrandomly distributed throughout the genome and tend to cluster into specific locations [5
,31]. In order to get a broad view of the distribution of joint linkages throughout the genome, we first divided the genome into 550-kb bins and counted the number of jointly significant traits at a FDR ≤
10% in each pair-wise bin (A), where simulations demonstrate that the number of bins expected to have three or more two-locus linkages by chance is less than one. A indicates that the genomic distribution of joint linkages does not solely follow simple patterns, which is evidence that the “joint” linkage here is meaningful. To test whether similar observations would extend to pairs of linkages on a finer scale, we further divided the genome into 50-kb bins and counted the number of significant joint linkages occurring in each bin. We observed 10 pair-wise bins with three or more traits (B), where the number expected by chance is much less than one. This suggests that the same pair of QTL or closely linked QTL contribute to variation in the gene expression levels among all traits falling into any given 50-kb pair-wise bin. Not surprisingly, groups of traits defined by linkage bins possess similar biological functions (Table S2
). For example, the 12 traits that jointly link to nearly identical positions on Chromosomes 3 and 8 are predominantly involved in the mating response. The linkage peak on Chromosome 3 maps to the precise location of the yeast mating type locus MAT
. The parental strains are of opposite mating type, and mating type segregates in the cross. We show elsewhere that variation in the expression of genes in this group is indeed explained by inheritance at MAT
on Chromosome 3 and at the pheromone response gene GPA1
on Chromosome 8 [32
]. Common biological themes can be assigned to the majority of the remaining clusters including amino acid and mitochondria metabolism (13 traits defined by linkage to regions on Chromosomes 3 and 13), mitochondrial tricarboxylic acid cycle (ten traits defined by linkage to regions on opposite ends of Chromosome 15), and response to stress (five traits defined by linkage to Chromosomes 6 and 10). Table S2
provides a complete list of genes and putative biological functions for the ten pair-wise bins with three or more traits.
A Plot of the Locus Pair Positions Corresponding to the 170 Traits Significant for Joint Linkage
Another interesting observation that emerges from the spatial distribution of joint linkages is that distinct groups are connected by a common linkage peak. For example, of the 10 pair-wise bins with three or more linked traits, there are three that share a common linkage to the exact same position on Chromosome 15 (Table S2
). Many of the genes in these three groups are localized to the mitochondria, suggesting an important QTL on Chromosome 15 that mediates expression levels for numerous mitochondria related genes. An attractive candidate QTL for this region is IRA2
, which is a regulator of the RAS-cAMP pathway [33
] that is located in both the cytoplasm and mitochondria. More generally, these results intimate that multiple locus mapping of gene expression levels may be useful in reconstructing regulatory networks by identifying shared linkages across traits.