|Home | About | Journals | Submit | Contact Us | Français|
It is being increasingly recognized that many important phenotypic traits, including various diseases, are governed by a combination of weak genetic effects and their interactions. While the detection of epistatic interactions that involve a non-additive effect of two loci on a quantitative trait is particularly challenging this interaction type is fundamental for the understanding of genome organization and gene regulation. However, current methods that detect epistatic interactions typically rely on the existence of a strong primary effect, considerably limiting the sensitivity of the search. To fill this gap, we developed a new method, SEE (Symmetric Epistasis Estimation), allowing the genome-wide detection of epistatic interactions without the need for a strong primary effect. We applied our approach to progeny crosses of the human malaria parasite P. falciparum and S. cerevisiae. We found an abundance of epistatic interactions in the parasite and a much smaller number of such interactions in yeast. The genome of P. falciparum also harboured several epistatic interaction hotspots that putatively play a role in drug resistance mechanisms. The abundance of observed epistatic interactions might suggest a mechanism of compensation for the extremely limited repertoire of transcription factors. Interestingly, epistatic interactions hotspots were associated with elevated levels of linkage disequilibrium, an observation that suggests selection pressure acting on P. falciparum, potentially reflecting host-pathogen interactions or drug-induced selection.
Complex traits, including human diseases, often involve epistatic interactions [1,2], a property defined as a non-additive/non-independent effect of two loci on a quantitative trait [3,4,5]. The detection of interactions mulit-locus linkage in general and epistatic interaction in particular is a very challenging task since the large number of traits causes a massive multiple testing issue and demands considerable computational resources. Therefore, exhaustive search methods for detecting loci interactions are usually cumbersome . To deal with such constrains, current methods often pre-filter locus pairs and zoom in on a candidate set of interactions that are expected to be enriched with such interacting locus pairs. For example, Storey et al. proposed a step-wise strategy for multiple loci linkage analysis , requiring that one locus had a significant, marginal effect on the underlying trait. In a recent study, Hannum et al. designed a bi-clustering procedure for the identification of interacting locus pairs and applied their approach to uncover functional links between protein complexes in yeast . Litvin et al. developed a three-stage method, which additionally utilized gene expression clustering to detect interactions in yeast . While these methods differ in the type of detected interactions they commonly relied on the step-wise identification of primary and secondary loci. However, interactions of gene products do not necessarily depend on an outstanding influence of a single factor. In fact, transcriptional regulation of most genes often involves multiple regulators that have small individual effects . Furthermore, such interactions are difficult to detect by the stepwise method unless the set of progenies is extremely large, prompting the need for approaches that are independent of strong primary locus effects.
Generally, methods to detect epistatic interactions feature two main characteristics: (i) To mitigate a potential multiple testing issue methods limit the number of upcoming statistical tests, for example by identifying strong primary loci  or leveraging modularity [8,11]. (ii) As a second key characteristic approaches employ different statistical models to detect certain types of interactions. For example, Hannum et al.  detected pairs of interacting loci without focusing on epistasis. Storey et al.  and Zhang et al.  first uncovered loci that jointly controlled a trait without demanding that their impact was non-additive/non-independent . Subsequently, epistatic locus pairs were identified by testing the significance of the interaction term.
Complementing existing methods, we designed a novel method, SEE (Symmetric Epistasis Estimation), allowing us to detect epistatic interactions in relatively small sets of progenies without the identification of primary loci. Our approach relied on a graph-based filtering step specifically designed to retain a large fraction of symmetric, candidate epistatic interactions. Compared to previous methods we also adopted the classical, more rigid, definition of epistasis, allowing us to rigorously check the significance of candidate interactions.
We applied our approach to uncover epistatic interaction in the malaria parasite P. falciparum. In this organism, genetic crosses have yielded few recombinant clones, contributing to statistical challenges. While the first eQTL study of P. falciparum by Gonzales et al.  suggested a potential role of epistatic regulation of gene expression , a broader understanding of transcriptional regulatory mechanisms in P. falciparum remained unclear.. Applying our approach to a HB3 × Dd2 parasite cross  we found more than 1,500 putative epistatic interactions between locus pairs on different chromosomes and identified several epistatic interaction hotspots of biological significance. Interestingly, we found that the level of linkage disequilibrium (LD) between locus pairs was correlated with the number of genes whose expression was influenced by the corresponding epistatic interaction. Such disequilibrium provides an additional level of interactions between the loci.
We also applied our method to an eQTL dataset of S. cerevisiae . Surprisingly, we found much fewer epistatic interactions and no epistatic interaction hotspots. After ruling out that our results were statistical artefacts caused by the small number of P. falciparum progenies, we hypothesized that selection pressure acting on P. falciparum contributed to observed epistatic interactions and elevated LD, potentially reflecting host-pathogen interactions or drug induced selection.
Adopting the classical perspective on epistasis  we defined that two loci l’ and l” have an epistatic interaction effect on gene g, if the expression y of gene g is significantly better explained by a widely used synergistic/epistatic interaction model:
than the additive/independent model:
where x’ and x” are the genotypes of loci l’ and l”, and ε is a noise term. We designed an efficient algorithm, SEE (Symmetric Epistasis Estimation), which allows us to uncover epistatic interactions without enumerating all possible combinations of locus pairs and genes or relying on a strong primary locus (Fig. 1a). In a filtering step, we assumed that the genotype of each locus was either 0 or 1, representing the corresponding parent. Since each differentially expressed gene was either represented as up or down regulated we obtained 16 possible expression phenotype-genotype configurations (Figure 1b). Specifically, we identified eight patterns that suggested a synergistic, (i.e. epistatic) effect, referred to as interacting configurations. Subsequently, we identified triplets, containing one gene and two loci, which were compatible with interacting configurations using a graph theoretical algorithm (see Methods for details). Using the formal definition of epistasis we separately fitted the interaction (eq.1) and additive (eq.2) models to the data. Subsequently, we applied a statistical test to check if the interaction model provided a significantly better fit than the additive model (see Methods for details). Finally, we corrected obtained p-values with an empirical null distribution, using the q-value approach  to select triplets with a given FDR threshold. Additionally, we applied our statistical framework to randomly sampled triplets. Comparing to such a random set, we confirmed that SEE detected candidate triplets that were significantly enriched with epistatic interactions (Fig. S1 in Additional File 1). In the algorithm, we introduced a graph based measure of support for each candidate triplet. Increasing the support threshold resulted in a smaller number of predicted interactions at lower FDR.
To avoid biases related to linkage disequilibrium (LD) in P. falciparum, we only considered epistatic interactions between two loci from different chromosomes. In the eQTL data set, encompassing 5,150 genes and 329 marker loci in 34 progeny strains of P. falciparum, we identified 3,796 triplets with a corrected p-value threshold ≤ 0.001 (FDR < 0.003). Since linkage disequilibrium in the P. falciparum cross was high, gene expression traits were often mapped to multiple locus pairs such as () and (), where and were close on different chromosomes. Therefore, we corrected p-values in an additional way that accounted for the effects of LD (see Methods for details). In particular, we obtained 1,713 epistatic interactions between 1,111 locus pairs and 323 probes (315 genes) (Table S1), suggesting that up to 6% of all parasite genes might be regulated by epistatic interactions.
Interestingly, we found that epistatic interactions were enriched in triplets that were best described by configurations 7 and 10. Commonly, these two configurations involved the expression of the target gene on one level when the interacting loci are inherited from the same parental strain (genotype 00 or 11) while, at the same time there is the opposite expression level when loci were inherited from different parental strains (genotype 01 or 10). In Boolean terms such configurations are best described by the ‘XOR’ function.
To assess the advantage of our approach, we compared the performance of SEE to the results obtained by a step-wise approach. Since the original step-wise method  used a less rigorous epistasis model, we implemented a modified version. While we applied the same filtering step to find strong primary loci, we checked for the presence of epistatic interactions with secondary loci using our statistical framework. We found a relatively small overlap with our original results. Given the small number of progenies such an observation is rather expected since our method has difficulties to uncover interacting loci with a strong primary effect. In turn, the step-wise method is known to miss interactions in the absence of such effects.
Since our method emphasized synergistic effects of locus pairs on gene expression, we observed large differences in the genome-wide distributions of one-locus eQTLs  and epistatic interacting locus pairs (Fig. 2a). In Fig. 2b, we showed the genome-wide profile of the number of partner loci for any given locus. In addition, we observed that the number of a locus’ target genes is significantly correlated to the number of its interacting partner loci (Pearson correlation rp = 0.82, p < 10−10, Fig. 2c).
We defined a pair of loci as an epistatic interaction hotspot if the two loci synergistically co-regulated at least 10 target genes. Applying this criterion, we found 14 such epistatic interaction hotspots with p-value < 10−10 (see Methods for details). Since some hotspots were close to each other on the genome, and their sets of target genes often significantly overlapped, we grouped hotspots into four hotspot regions (see Methods). We found that the epistatic interaction hotspots contained loci from subtelomeric regions on chromosome 3, 7, 9, 12 and 14 (Table S2 in Additional File 1, Fig. 3a), which also harboured many previously detected one-locus hotspots . In addition, the majority of epistatically interacting locus pairs had a relatively small number of target genes while a minority accumulated a large number of targets (Fig. 3b).
We wondered if the pairs of genetic regions that were involved in epistatic interaction hotspots had any linkage disequilibrium pattern. While we found a total of 49,638 locus pairs that were on different chromosomes, only 388 locus pairs had a LD measure r2 > 0.2, where r2 was calculated using genotype data of progeny strains. Two out of 14 epistatic interaction hotspots had r2 > 0.2 (p < 0.006). Computing Pearson correlation rP between the numbers of target genes of two epistatically interacting loci and r2 between the two loci, we found a weak but significant correlation of rP ≈ 0.15 (p < 10−6, Fig. 3c). This observation indicated that high LD is associated with a large number of target genes, suggesting that high LD between epistatically interacting loci might have been maintained for regulatory functions.
First, we focused our functional analysis on hotspot (3_8.6, 7_2.9) that had highest LD among all epistatic interaction hotspots (r2 = 0.24). Utilizing GO-specific functional gene annotations from GeneDB  we observed that the set of target genes was enriched with genes carrying a methyl transferase domain (The list of target genes of this hotspot is provided in Table S2). Two out of 56 genes that appeared in the “methyl-transfer pathway” also were among the target genes of this hotspot (p < 0.01, hypergeometric test). Both of these genes use the same methyl donor S-adenosylmethionine (SAM) and are homologous to rRNA-methylating enzymes, suggesting complex and tight regulation of this pathway. SAM is a ubiquitous methyl donor in many biochemical pathways, ranging from methylation of proteins, lipids and nucleic acids to serving as a precursor in polyamine biosynthesis. Therefore, perturbation of SAM levels would indeed be expected to have a wide range of consequences that could be consistent with the dramatically different drug selection histories of the parents of the HB3 × Dd2 cross. For example, a recent study  contrasted expression divergence following drug pressure in a laboratory line, demonstrating the differential expression of the SAM dependent methyltransferase gene PFE1115c, located in a chromosome 5 locus that was previously linked to multidrug resistance. Furthermore, an earlier study  reported a higher sensitivity of the chloroquine/pyrimethamine resistant strain K1 to polyamine analogs (i.e. pathway inhibitors) as compared to the drug sensitive strain NF54.
As defined in GeneDB, our set of target genes also was enriched with components of the proteasome degradation machinery and their timed transcription, with 3 out of 85 such genes appearing in the set of 14 genes (p < 0.002). Other interesting target genes in this epistatic interaction hotspot included PF07_0038, a gene located close to the chloroquine resistance transporter (Pfcrt) , and a general transcription factor PFB0290c that could be widely involved in trans-regulation. Expression levels of this transcription factor in the eQTL experiment showed a significant correlation to Pfcrt in both chloroquine-resistant and sensitive progenies (average correlation ), suggesting a possible collaborative role in the endogenous function of the transporter Pfcrt. Considering results of previously mentioned studies such observations indicate a potential importance of this hotspot in the context of drug resistance.
Focusing on another hotspot (7_20.2, 8_48.9) that harbours 10 target genes, we found 3 erythrocyte-targeting genes. While PF11_0486 is a merozoite adhesive erythrocyte binding protein and PFE1625c is a pseudo gene for erythrocyte membrane protein 1 (PfEMP1), PFL2575c is a Plasmodium exported protein that carries an erythrocyte trafficking motif on the second exon. Considering that more than 400 erythrocyte-targeting genes  exist, the probability of randomly detecting three such genes in this hotspot combination was p < 0.03, indicating a potential regulatory signal emerging from this hotspot locus pair.
Interestingly, 3 of 4 members of the pyruvate dehydrogenase complex that plays an important role in redox reactions appeared to be epistatically regulated. Redox reactions are essential when P. falciparum degrades hemoglobin in human red blood cells. Specifically, we found PF11_0256, pyruvate dehydrogenase E1 alpha subunit, regulated by hotspot (7_20.2, 8_48.9). While PF08_0066 is a lipoamide dehydrogenase, regulated by hotspot (3_2.9, 7_20.2) we noticed PF10_0407, dihydrolipoamide acyltransferase component E2, among target genes of hotspot (12_28.6, 14_0).
We considered possible mechanisms for the regulatory basis of the transcription of the target genes for the epistatic hotspot between loci 3_8.6 and 7_2.9. We confined the search for candidate regulators in each of these loci by considering only genes located within the regions physically bounded by each marker (3_8.6 or 7_2.9) and their flanking markers. In the simplest case, epistatic transcriptional regulatory loci would contain transcription factors or their regulators. Although we did not identify direct transcription regulators in either locus, it may be relevant that locus 3_8.6 contains 3 ribosomal protein genes (PFC0290w, PFC0295c, PFC0300c) while 7_2.9 contains a SAM-dependent tRNA m5C-methyltransferase (PF07_0015); consequently, both loci carry critical components of translation. Furthermore, the target genes for the (3_8.6, 7_2.9) epistatic interactions are enriched with SAM-dependent methyltransferases, consistent with this regulatory theme. Recent studies specify that tRNAs can be cleaved into small RNAs  and that tRNAs are protected from this cleavage by cytosine methylation . Moreover, methylation of tRNA serves a regulatory role in stem cell differentiation. Yet, there is no evidence as to whether this acts directly through control of mRNA translation or indirectly through small RNAs derived from tRNAs . Accordingly, our study suggests that epistasis between these two loci indirectly mediates methylation of tRNA.
We identified several cases where two direct transcription regulators were within epistatically interacting locus pairs. In one case, the genomic region of epistatically interacting loci 3_0 and 7_11.5 contained RNA polymerase subunits ABC23 (PFC0115c) and ABC10 (PF07_0027). In the other case, the epistatically interacting loci 3_0 and 5_88.9 coded for transcription factor II H p52 subunit (PFL2125c) and a homolog of the transcription elongation factor SPT6 (PF14_0059), respectively. The p52 subunit is essential for the phosphorylation of the carboxy terminal domain (CTD) of RNA polymerase II while SPT6 plays an important role in this reaction for the transition from transcription initiation to elongation.
We also observed an epistatically interacting locus pair (14_77.5, 4_66) that contained members of the recently identified ApiAP2 family of putative transcriptional regulators (PF14_0271 and PFD0985w), where the target genes were two heat shock proteins PF11_0216 and PF13_0021. Since position specific matrices were available for PFD0985w only , we scanned the upstream regions of the two target genes with Matrix-scan . Indeed, we found potential binding sites within 1,000 bp upstream of genes PF11_0216 and PF13_0021 for domain D2 of ApiAP2 protein PFD0985w (p < 10−5), providing additional evidence for the potential role of the newly discovered ApiAP2 family in regulating gene expression.
To capture properties that might be specific to parasite crosses, we applied SEE to yeast progenies. Using yeast eQTL data  from 112 progenies, we observed 110 epistatic interactions (Table S3). Even with FDR < 0.7 this set of epistatic interactions in yeast is considerably smaller than the corresponding set in P. falciparum, ruling out that the observed difference between the two species was an artefact of the small number of strains in the parasite crosses (see Methods). In addition, such a small number was comparable to the set of 170 two-locus interactions (including non-epistatic interactions) obtained by the original step-wise method in .
To gain further insight into the differences between Plasmodium and yeast crosses, we computed LD patterns in the yeast cross and in a recently generated P. falciparum cross 7G8 × GB4 that has 281 genetic markers and 32 progeny clones . High LD was also a feature of this P. falciparum cross although we observed a different LD pattern than in the HB3 × Dd2 cross (Table 1). In comparison, LD in yeast was much lower than in P. falciparum, suggesting that both epistatic interaction hotspots and elevated linkage disequilibrium could reflect selection in the generation and expansion of the P. falciparum progeny clones after isolation from the chimp blood.
Understanding the scale and the role of multiple loci linkage in general and epistasis in particular is an important challenge of today’s association studies and eQTL analysis. The large number of possible expression traits and pairs of loci causes a considerable multiple testing problems. Therefore, a key component of computational approaches for the detection of interactions is a filtering strategy to reduce the number of performed test. Previously reported step-wise methods limit the number of candidate interactions by requiring the presence of a primary locus that has a significant marginal effect on the underlying trait. However, such approaches fail to detect interactions with small individual effects. In contrast, our new method SEE determines candidate interactions in a way that increases the likelihood to detect epistatic interactions between loci without requiring a significant contribution of any individual locus. Especially in small sets of progenies, however, the design of the initial filtering step may limit SEE’s ability to detect epistatic interactions that involve a locus with a relatively large primary effect. In a recently proposed, alternative approach Zhang et al. built on expression modularity to detect locus interactions . Defining modules as a set of genes and markers, this approach searches for modules where genes are linked to loci. Specifically, modules that are linked to two loci were considered candidates for possible epistatic effects, allowing the detection of three epistatic interaction modules that contained a total of only 39 genes in yeast. Such a module centric approach can uncover interactions that might be too weak to be detected on gene-by-gene basis but might miss interactions that do not involve a coherent module.
In comparison, SEE differs in the way the search space is reduced as well as in the statistical way the significance of an epistatic interaction is assessed. In particular, Zhang et al. used t-statistics to test if the interaction term in eq. 1 is non-zero , where the trait is modelled as the first principle component of a module’s gene expression matrix. Recently, however, Wang et al. suggested that the non-epistatic interaction model (a statistical framework without the interaction term in eq. 1) might be sufficient to explain most biologically relevant interactions in human . We therefore adopted a more stringent statistical test of epistasis than previous methods in which we assessed the impact of the epistatic interaction term we fitted the expression of individual genes by both the epistatic (eq. 1) and additive (eq. 2) model, allowing us to check if the epistatic model provided a significantly better fit than the additive model. This statistical test provides a higher confidence that uncovered interactions were indeed non-additive than a test for the significance of the interaction term. Only five out of 29 interactions defined by two epistatic interaction modules discovered by Zhang et al.  and only 19 out of 170 interactions reported in Storey et al.  passed this test. Notably, we generally found that all methods provided small sets of epistatic interactions in yeast, while overlaps between sets of predicted interactions were small as well. Such a result is hardly surprising given the difference in the statistical models, and the small number of epistatic interactions that were detected by each method.
Providing the first view on the structure of epistatic interactions in P. falciparum we revealed extensive epistasis and epistatic interaction hotspots. Since one of the Dd2 parental strain in the HB3 × Dd2 cross is multi-drug resistant while HB3 is generally drug sensitive, we conclude our findings may be valuable leads for studies of the parasite’s drug resistance. Interestingly, epistatic interactions hotspots were associated with elevated levels of linkage disequilibrium, suggesting selection pressure acting on P. falciparum. Currently, we cannot fully ascertain if this observed selection pressure is independent of the host or occurs in the context of a specific chimpanzee host. A new P. falciparum cross 7G8 × GB4 also showed high LD with a pattern that differed from the HB3 × Dd2 cross, indicating that the observed effect might be strain dependent. Indeed, Dd2 has been a subject of intensive Chloroquine and Sulfacoxine-Pyrimethamine related pressure in nature and in the lab. Since GB4 and 7G8 also carry different origins of Chloroquine resistance difference in drug selection might impact the pattern of linkage disequilibrium in this cross.
Specific mechanisms of interactions between multiple loci in the parasite are yet to be discovered, but various mechanisms could be envisioned. For example, interactions may exist among loci, harbouring interacting members of the transcriptional machinery, and we indeed found two cases that supported such a scenario. Epistatic regulatory effects can also arise when two loci encode transcription factors that regulate expression of one or more genes in an indirect way. Consistent with such a mechanism, we found an interacting locus pair encoding potential ApiAP2 proteins. Members of this family of putative transcription factors have been demonstrated to bind to very closely related DNA motifs in the genome of P. falciparum . Such proteins may bind similar regulatory motifs, potentially contributing to competitive binding, leading to the observed epistasis. Indeed, we found potential binding sites in the promoter region of both target genes of the corresponding interacting locus pair, suggesting that such transcriptional regulation at interacting loci could allow the pathogen to fine tuning gene regulation.
We have developed a new computational approach, SEE (Symmetric Epistasis Estimation), to identify epistatic interactions. Mitigating the multiple testing problem our method significantly reduces the number of tests using a novel filtering approach. Unlike most of previous methods, SEE treats both interacting loci in a symmetric way, complementing existing approaches. Thanks to this important property of our approach, we performed the first study of epistatic interactions in P. falciparum eQTL, allowing us to obtain more than 1,700 epistatic interactions. Illustrating a genome wide epistatic interaction landscape in the malaria pathogen we pinpoint epistatic interaction hotspots whose properties suggest importance for drug resistance adaptations. Of particular interest is a weak but significant correlation between epistatic interaction hotspots and patterns of linkage disequilibrium, emerging from our analysis. Such linkage disequilibrium suggests a selection pressure acting on P. falciparum that potentially reflects host-pathogen interactions or drug-induced selection and thus provides additional line of evidence of supporting relation of hotspots to drug resistance.
We used the data of the single locus eQTL study by Gonzales et al. which utilized 34 progeny strains with 329 markers obtained from a HB3 × Dd2 cross in P. falciparum . Genome-wide expression levels were measured 18 hours after the parasite invaded human erythrocytes (RBCs), using 7,665 probes that represented 5,150 ORFs. For the yeast eQTL analysis, we used data of a BY × RM cross , covering 112 progenies, 5,747 genes and 2,956 genetic markers. We binned adjacent markers if the Hamming distance between their genotype data (without considering missing data) was less than 8 , reducing the set of markers to 558, close to the number of genetic markers in P. falciparum.
Building on our previous work  we first constructed an eQTL association graph with three partitions of nodes, corresponding to strains (S), genes (G), and loci (L) and then search for tripartite cliques in this graph. This step is similar as in  and here we summarize it very briefly. The way we utilize these cliques, described in the subsequent sections, is novel. Each gene was represented by two nodes gu and gd, corresponding to g either being up- or down-regulated. Specifically, we considered the expression value of g in a strain up(down)-regulated if it was equal or larger(smaller) than the mean expression value of g across all strains plus (minus) one standard deviation. An edge between gu/gd and a progeny strain s indicated that g’s expression is up-regulated or down-regulated in strain s. An edge between l0/l1 and a progeny strain s indicated that the genotype of l was 0/1 in strain s (Fig. 1a). In the eQTL association graph we detected eQTL maximal association cliques used a previously described approach . Association clique is defined as a maximal clique that contains a subset of loci L’, a subset of progeny strains P’, and a subset of genes G’. In an association clique, each locus in L’ has the same genotype (0 or 1) among all progeny strains in P’, and each gene is up-/down-regulated among all progeny strains in P. Vertices in L’ are fully connected to vertices in P’, and vertices in G’ are fully connected to vertices in P’ as well. In other words, each eQTL association clique is a maximal subgraph that cannot be extended further, maintaining full connectivity (Fig. 1a).
We scanned such maximum eQTL association cliques in order to find candidate triplets. We defined 16 possible combinations of joint genotypes and expression states that can be assigned to a triplet that consisted of two loci l’, l” and gene g (Fig 1b). Possible joint genotype combinations for l’ and l” are 00, 01, 10, and 11. For the purpose of defining configurations, we considered two possible states (up or down) for the expression of a gene g. Accordingly we defined 24 = 16 possible configurations. Out of these 16 configurations, we identified eight interacting configurations, while the remaining configurations were considered non-interacting. Intuitively, a interacting configuration can only be explained by the synergistic interaction model, while a configuration is defined non-interacting otherwise. For example, configuration 7 is an interacting configuration (Fig. 1b), where the joint genotypes 00 and 11 are accompanied by a down-regulated gene g, while 01 and 10 coincides with an up-regulated g. If we simply encode down-/up-regulation of gene expression as 0/1, configuration 7 can be considered a XOR function that can be captured by our synergistic model using y = x’ +x” −2x’x” + ε. In contrast, configuration 4 is non-interacting where the underlying gene g is up-regulated when locus l’ = 1 and vice versa. The absence of a clear impact of l” on the expression of g suggests that the additive model is sufficient to explain this configuration.
Configurations in a given triplet might not be always consistent for all strains. Therefore, while selecting candidate triplets, we demanded that each triplet had an interacting configuration that was consistent across a large enough number of strains. Specifically, we computed the support for the consistency of the triplet l’, l” with a configuration c, follows. We defined the supporting subgraph as an association clique where the underlying triplet appeared. Next, we denoted by Sh (h [1,…,4]) the progeny set in such supporting subgraph. In Fig. 1a, we show a triplet represented by four subgraphs that fit configuration 7 where S1 = (s2,s6), S2 = (s3,s4 ), S3 = (s1) and S4 = (s5 ). We defined . Since a triplet may also fit non-interacting configurations, we defined , measuring the support for interaction (IC) and non-interaction (NIC) configurations of a given triplet. To qualify as a candidate triplet, we demanded that the corresponding support for interaction configurations (IC) reached a certain level and was larger than for non-interaction configurations (NIC). In particular, we defined for each triplet. For P. falciparum, we required that each association clique contained at least five strains, and . For yeast, we required that each association clique had at least 12 strains, and . Intuitively, the requirement that association cliques contained more strains improved sensitivity defined as the percentage of the number of significant interactions over the number of candidate triplets but decreased the number of candidate triplets (See Additional File 1).
After obtaining candidate triplets, we estimated their significance using residual empirical thresholds (RET) [30,31]. For each candidate triplet l’, l”, g, we first obtained a residual value by removing the main effects of two loci from the expression value. Instead of discretized values we used the original expression values to compute the residual and evaluated the probability that the synergistic interaction model fitted the data better than the additive model. Computing the residual for the hth expression value of g in the strain where the genotype of li and lj was 00, we applied the following equation
where was the hth expression value of gene g in the strain where the genotypes of l’ and l” were 0. was the mean expression of g in strains where the genotype of l’(l”) was 0, and was the mean expression of gene g across all strains. Similarly, we computed the residual for g’s expression values in strains where the genotype of l’ and l” were 01, 10 and 11. We fitted the residual with the synergistic model, where the p-value for the interaction term in the regression was taken as the nominal p-value for comparing the synergistic to the additive model.
To correct nominal p-values obtained in the previous step we generated a null p-value distribution from randomized eQTL data. To control for linkage disequilibrium and any skewed inheritance patterns we kept the genotype data intact in the randomization procedure. Specifically, we randomly permuted strain labels and kept gene specific expression profiles unchanged. We applied SEE to such randomized data utilizing exactly the same thresholds and parameters as used for the real data. We repeated this process 10 times and calculated a nominal p-value for each triplet as described previously. Forming a ‘global’ null distribution by pooling all nominal p-values from random data we defined a corrected p-value as the fraction of random p-values at or below the p-value of the given candidate triplet. After we confirmed that the distribution of corrected p-values of random triplets was flat (data not shown), we calculated a FDR for candidate triplets . Applying a corrected p-value threshold ≤ 0.001 (FDR < 0.003) we obtained a set of 3,796 triplets.
To estimate the number of false positives in an alternative way, we computed how many “epistatic interactions” were produced by the previously described random generation of eQTL data. Specifically, we applied SEE to 10 random eQTL datasets and computed p-values in exactly the same way as for the real-data. Averaging over all 10 random runs, we obtained 12 “epistatic interactions”, suggesting that our FDR estimate was indeed accurate (3,796 × 0.003 ≈ 11).
Our procedure of correcting nominal p-values using the global null distribution dis-accounted for the possibility that a gene expression trait might be mapped to multiple, genomically close locus pairs. Since linkage disequilibrium in P. falciparum is high, two neighboring loci on the same chromosome tend to have similar genotypes. To further select the most significant epistatic interactions, we corrected nominal p-values in an alternative way. We generated a null distribution for a group of candidate triplets sharing the same gene, where locus pairs of triplets in a group were close to each other. In particular, locus pairs , and , were considered close to each other if there was at most one locus between and . For each candidate triplet in a group of size n, we randomly permuted the residual and fitted the permuted residual with the synergistic (i.e. epistatic interaction) model to obtain a null p-value. Repeating this process m/n times for each candidate triplet we obtained a null distribution of m p-values for each group and defined a corrected p-value as the fraction of values in the distribution that was less than the given p. Setting m = 1,000 and requiring that selected epistatic interactions had a corrected p-value ≤ 0.001 we found 1,713 out of 3,796 initially obtained epistatic interactions, involving 1,111 locus pairs and 323 probes (315 genes).
We defined a pair of loci as an epistatic interaction hotspot if the two loci synergistically co-regulated at least 10 target genes. We grouped overlapping hotspots into hotspot regions by binning two epistatic interactions with locus pairs , and , if there was at most one locus between and , and one locus between and . Recently Breitling et al.  proposed a permutation method to accurately estimate the clustered distribution of false eQTLs (i.e. the size of false eQTL hotspots) on the genome. Their method randomly permuted columns in the genotype matrix, where rows corresponded to loci, and columns represented progeny strains. In addition, the gene expression matrix was kept unchanged and the integrity of the correlation structure of gene expression matrix was maintained, a major source of false eQTL hotspots. To detect eQTL hotspots, eQTL mapping was subsequently applied to randomized data. Following this procedure and subsequently applying SEE with the same parameters used for real data, we found that the average maximum number of target genes was < 2 (p = 2 × 10−11, one-sided t-test), indicating that 10 was a reasonable threshold for defining real hotspots.
Since the number of progenies of P. falciparum was relatively small, we used yeast eQTL data  to investigate the impact of different numbers of progenies on the performance of SEE. We randomly selected five sets of 34 progenies together with their associated genotype and gene expression profiles. Using the parasite specific parameter setting we applied SEE to the five yeast progeny sets. On average we found less than one candidate triplet with corrected p-values < 0.05, suggesting that no spurious epistatic interactions were detected in yeast when only 34 progenies were used. Considering limited numbers of detected epistatic interactions and their high FDR obtained from the full yeast eQTL dataset, we ruled out the possibility that a large number of epistatic interactions in P. falciparum was an artefact of a small set of progenies, producing a large number of false positives.
Additional Files Additional File 1: Supplementary online materials (including Fig. S1 and Table S2, S3).
Table S1: List of 1,713 epistatic interactions in P. falciparum detected by SEE.