Search tips
Search criteria 


Logo of bmcbioiBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Bioinformatics
BMC Bioinformatics. 2011; 12: 458.
Published online 2011 November 25. doi:  10.1186/1471-2105-12-458
PMCID: PMC3323860

Multiple-input multiple-output causal strategies for gene selection



Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. If these techniques may be effective in generalization accuracy they often do not reveal direct causes. The latter is essentially related to the fact that high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting.


We show in synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition we show, in a meta-analysis study of six publicly available breast cancer microarray datasets, that the improvement occurs also in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection.


Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.


Supervised analysis of genomic datasets (gene expression microarray or comparative genomic hybridization array for instance) with a large number of features and a respectively small number of samples requires the adoption of either regularization or feature selection strategies [1]. The most common feature selection strategies select or rank the variables according to a relevance score. In ranking, for instance, the score of each variable is the univariate association with the target returned by a measure of relevance, like mutual information, correlation, or p-value. If on one hand the ranking is widely used for its simple implementation and its low complexity, on the other hand it suffers from well-known limitations. A drawback is that ranking relies on univariate terms and as such it cannot take into consideration higher-order interaction terms or redundancy between features [2]. Another limitation is that ranking techniques are not able to distinguish between causes and effects. This is due to the fact that univariate correlation (or relevance) does not imply causation [3]. This problem is not solved in multivariate feature selection approaches since their cost function typically takes into consideration accuracy but disregards causal aspects. Nowadays the importance of bringing causality into play when designing feature selection methods is more widely acknowledged in the bioinformatics and the machine learning communities [4,5]. This is typically the case in microarray classification, where the goal is, for example, to distinguish between tumor classes or predict the effects of therapies on the basis of gene expression profiles [6]. In these settings the number of input variables, represented by the number of gene probes, is huge (typically several thousands) while the number of samples, represented by the patients' tumors, is very limited (a few hundreds) making the selection of relevant genes a challenging task. Moreover the inference of causal relationships between variables plays a major role in the context of genomic studies since more and more biologists and medical doctors expect data analysis to provide not only accurate prediction models (for prognostic purposes) but also insights into the mechanisms associated with disease and appropriate therapeutic targets.

It is well established that the detection of causal patterns cannot be carried out in a bivariate (single-input single-output) context and that at least a trivariate setting has to be considered [7]. This is put into evidence by the literature on graphical models where arc orientation relies on notions of conditional independence (requiring at least three terms) [8] and by the work on information theoretic methods for network inference [9]. In particular this paper will focus on the notion of feature interaction, a three-way mutual information that differs from zero when group of attributes are complementary [10]. The role of interaction in feature selection has already been discussed in the machine learning literature. Jakulin proposed an heuristic based on interaction for selecting attributes within the naive Bayesian classifier [11]. Meyer et al. proposed a filter algorithm which relies on the maximization of an information theoretic criterion, denoted Double Input Symmetrical Relevance (DISR), which implicitly takes into account the interaction, or complementarity between variables, in the choice of the features [12]. Watkinson et al. used a notion of synergy related to feature interaction to assign a score to a pair of genes and then measured the degree of confidence that one of the genes regulates the other [9]. A causal filter algorithm which computes interaction between inputs has been recently proposed in [5]. However it is unclear whether these techniques are capable of recovering the set of features that are both relevant and causal, in high-dimensional problems, such as in microarray analysis.

The contributions of this paper can be summarized as follows. First we introduce a new causal filter based on the interaction information and we show how to estimate this quantity in a multiple-input multiple-output setting. Second we assess the capacity of such filter to prioritize causal variables by using a synthetic case study. Third we measure from an accuracy and a biological point of view the performance of such causal filter in a number of prognostic studies in breast cancer. We advocate that a multiple-input multiple-output approach is particularly relevant in clinical studies where it is common that more than a single target variable is collected. This is the case of prognostic studies of breast cancer patients where several clinical indices, including patients' tumor size and histological grade, are collected together with the survival of the patients and the gene expressions of their tumor. It is worth to note that, in spite of their availability, these additional phenotypes are usually not taken into consideration since statistical studies focus on survival prediction and adopt single-output methods.

This paper describes an original multiple-input multiple-output score which combines a conventional relevance term with a causal term. This additional term quantifies the causal role of the features and allows the prioritization of causal variables in the resulting ranking. We carried out a synthetic study, where the set of causal dependencies is known, which shows that causal variables are highly ranked once this score is adopted. We performed a meta-analysis of six publicly available breast cancer microarray datasets to assess the improvement of using our causal relevance score in terms of accuracy over the conventional ranking. The related discussion shows also that it is possible to carry out a biological interpretation of the role of selected variables which allows to discriminate between potentially causal and relevant, yet non causal, features. The source code, documentation and data are open-source and publicly available from and


Mutual information and interaction

Let us consider a multiple-input multiple-output (MIMO) classification problem characterized by n input variables X = {xi, i = 1,..., n} and m targets Y = {yj, j = 1,..., m} where xiX is continuous and yjYj={cj1,,cjC}. Let us denote y1 as the primary target and the remaining m - 1 outputs as secondary targets. We make this distinction since, though we assume that the goal of classification is to predict y1, we want to take advantage of the causal information which can be extracted by multiple targets. We begin by reviewing some notions of information theory by considering three random (boldface) variables, notably two inputs x1, x2 and the primary target y1. The mutual information [13] between the continuous variables x1 and x2 is defined in terms of their probabilistic density functions p(x1), p(x2) and p(x1, x2) as


where H is the entropy and the convention 0log00=0 is adopted. This quantity measures the amount of stochastic dependence between x1 and x2 and is also called two-way interaction [11]. Note that, if x1 and x2 are Gaussian distributed the following relation holds


where ρ is the Pearson correlation coefficient.

Let us now consider the target y1, too. The conditional mutual information I(x1; x2|y1) [13] between x1 and x2, once y1 is given, is defined by


The conditional mutual information is null iff x1 and x2 are conditionally independent given y1. The change of dependence between x1 and x2 due to the knowledge of y1 is measured by the three-way interaction information defined in [14] as


This measure quantifies the amount of mutual dependence that cannot be explained by bivariate interactions. When it is different from zero, we say that x1, x2 and y1 three-interact. A non-zero interaction can be either negative, which denotes a synergy or complementarity between the variables, or positive, which indicates redundancy. Because of the symmetry of the H operator in (3), we have


By (4) we derive


Since the joint information of x1 and x2 to y1 can be written as


it follows that by adding I(x2; y) to both sides of (5) we obtain


Note that the above relationships hold also when either x1 or x2 are vectorial random variables.

Feature selection, causality and interaction

Consider a multiple-class classification problem where x [set membership] X [subset or is implied by] Rn is the n-variate input and y1Y is the primary target variable. Let A = {1,..., n} be the set of indices of the n inputs. Let us formulate the feature selection problem as the problem of finding the subset X* of v > 0 variables such that

X*= argmaxSA:|S|=vI(XS;y1)= argmaxSA:|S|=v(XS)

where the score (XS) of a subset XS of variables is given by the mutual information it brings to the target. In other words, for a given number v of variables the optimal feature set is the one that maximizes the information about the target. Note that this formulation of the feature selection problem, also known as Max-Dependency [12,15], is classifier-independent.

If we want to carry out the maximization (7), both an estimation of I and a search strategy in the space of subsets of X are required. As far as the search is concerned, according to the Cover and Van Campenhout theorem [16], to be assured of finding the optimal feature set of size v, all feature subsets should be assessed. Given the infeasibility of exhaustive approaches for large n, we will consider here only forward selection search approaches. Forward selection starts with an empty set of variables and incrementally updates the solution by adding the variable that is expected to bring the best improvement (according to a given criterion). The hill-climbing search selects a subset of v <n variables in v steps by exploring only i=0v(n-1) configurations. For this reason the forward approach is commonly adopted in filter approaches for classification problems with high dimensionality [17,18].

If v = 1 the optimal set returned by (7) is composed of the most relevant variable, that is the one carrying the highest mutual information to y. For v > 1, we need to provide an incremental solution to (7) in order to obtain, given a set of d variables, the (d + 1)th feature which maximizes the increase of the dependency

xd+1*= argmaxxkX-XS((XS,xk))

where (XS, xk) stands for the set of variables resulting from the union of XS and xk. Since for large d the term ((XS,xk)) requires the computation of multivariate mutual information, its estimation is often prone to ill-conditioning and large variance. This led to the adoption of low variate approximations in literature, like the univariate approximation

xd+1*= argmaxxkX-XS(xk)= argmaxxkX-XSI(xk,y1)

which leads to a ranking of the variables according to their mutual information with the target. More advanced approaches rely on bivariate decompositions [12] like

xd+1*= argmaxxkX-XS1dxiXS((xi,xk))

where ((xi,xk)) quantifies the amount of information that xi and xk contain jointly about y1.

However a feature selection procedure targeting the Max-Dependency is not able in general to discriminate between causal and non causal dependencies. For instance in a selection procedure applied to a dataset derived from a causal process like the one in Figure Figure1,1, the effect x4 could be more highly ranked than the direct causes x1 and x2.

Figure 1
Single-output case with different causal patterns: (i) common effect or explaining away effect configuration involving x1, x2 and y1; (ii) spouse configuration involving x5 and y1; (iii) common cause configuration involving y1, x3, x4; and (iv) causal ...

Here we propose to modify the conventional score (X) into a causal score c(X) able to keep into consideration the causal information returned by the adoption of a multiple output configuration. This is made possible by integrating in the score an interaction term which is strictly related to the notion of causal dependency.

Interaction and causal dependency

This section aims to establish the link between information theory and causality. Causality is at the same time an essential and imprecise notion in scientific discovery. In order to avoid any ambiguity, here we adopt the formalism of causal Bayesian network which is a sound and convenient framework for reasoning about causality between random variables [8]. This means that all causal dependencies between variables are expressed by a directed acyclic graph where the existence of an oriented edge from a node xi to a node xj means that xi directly causes xj. In formal terms we assume that the Causal Markov condition, the Causal Faithfulness and the Causal Sufficiency conditions hold [4]. Several works in literature showed that the structure of a causal graph can, to some extent, be inferred from observational data. The vast majority of these works rely on statistical tests of conditional independence [19]. Here we present a way to reason about causality which do not use independence tests but estimate an information theory score to prioritize potential causes.

Let us consider a triplet made of two inputs xi, xj and one target y1. As discussed in [4] six possible configurations of directed acyclic graphs involving three variables can occur. One configuration is trivial and corresponds to a completely unconnected graph. One configuration corresponds to a single arrow chain (for example only xi and xj are linked) and it is well known in literature that for a system of two variables the causal structure is not distinguishable. Another configuration corresponds to a fully connected graph and in this case the lack of independencies implies that the direction of the arrows cannot be determined. The remaining configurations can be illustrated and detected by studying the relationship [5] between the sign of I(xi; xj; y1) and causal patterns of the triplet, like the ones sketched in Figure Figure11.

A negative interaction I(xi; xj; y1) means that the knowledge of the value y1 increases the amount of dependence between xi and xj; this situation occurs in the presence of a collider. According to the label of the collider we can have two cases: i) the common effect configuration (for example the pattern involving x1, x2 and y1, also known as the explaining-away effect) and ii) the spouse configuration (the pattern involving x3, x5 and y1 in Figure Figure11 where x3 is the common descendant of y1 and x5). This is a consequence of the fact that, if we assume Causal Faithfulness, the graph structure entails that the two parents are independent (null mutual information) but conditionally dependent (conditional mutual information bigger than zero). Note also that both configurations are characterized by the presence of a collider.

On the contrary a positive interaction I(xi; xj; y1) between xi and xj means that the knowledge of y1 decreases the amount of dependence. This situation occur in two cases: i) the common cause configuration (for example, two dependent effects x3 and x4 become independent once the value of the common cause y1 is known as illustrated in Figure Figure1)1) and ii) the causal chain configuration where one of the variables (let say, x1) is the cause and the other (let say, x4) is the effect of y1. This is due to the fact that the graph entails the dependence between xi and xj as well as their conditional independence (null conditional mutual information).

So far we have considered a single-output configuration. However causal patterns can be better identified if we consider a multiple-output configuration, for instance the two output configuration sketched in Figure Figure2.2. If y1 and y2 are two outputs representing different observations of the same phenomenon (for example a disease) we expect that the causal configurations concerning the first output appear also for the second one. This is a reasonable assumption in breast cancer clinical studies where the measured phenotypes (size and histological grade of the tumor for instance) can be considered as different manifestations of the state of the tumor.

Figure 2
Two-output case with different causal patterns: (i) common effect configuration involving x3, y1 and y2; (ii) spouse configuration involving y2 and x6; (iii) common cause configuration involving x1, y1 and y2; and (iv) causal chain configuration involving ...

Let us consider for instance the inputs x1 and x2 and the two targets y1 and y2: the common effect configurations between x1 and x2 and y1 holds also for the triplet x1 and x2 and y2. The same happens for the common cause pattern involving both the triplet x3, x4, y1 and x3, x4, y2. The presence of multiple outputs can therefore make more robust the identification of a causal pattern, especially in data configurations characterized by a very large number of variables.

In the following we will take advantage of these considerations to design a causal filter able to extract from observed data causal dependencies between variables.

The MIMO causal filter

The link between causality and interaction discussed in the previous section suggests that, if we want to detect causality without estimating large variate dependencies, we may search for patterns like the one sketched in Figure Figure3.3. This dependency pattern is characterized by two causal inputs and two outputs and can be detected when the following two conditions are satisfied:

Figure 3
Two-inputs two-outputs causal pattern.

1 the interaction I(x1; x2; y1) is negative

2 the interaction I(x1; x2; y2) is negative

In what follows we implement this idea into a MIMO causal filter where input variables belonging to causal patterns like the one in Figure Figure33 are prioritized.

For the pair of inputs x1 and x2 and the pair of outputs y1 and y2, we define a structural score


which is composed of two multiple-input interaction terms. The magnitude of this score depends on whether x1 and x2 jointly play a joint causal role on y1 and y2, or in other words, the pattern in Figure Figure33 is encountered. This means that the higher the term C(x1, x2), the higher is the evidence that the pair x1, x2 be a cause of y1 ad y2. This score plays a similar role to the score that is maximized in structural identification of Bayesian networks [20]. If in that case the score measures the likelihood of the data for a given graph structure, here the quantity C(x1, x2) measures the likelihood of the data for a structural pattern where the pair x1, x2 has a causal role.

In the case of bivariate output (m = 2) we propose then a causal version c of the univariate score An external file that holds a picture, illustration, etc.
Object name is 1471-2105-12-458-i25.gif which accounts both for the relevance and the causal role of a pair of input variables x1 and x2


where λ > 0 stands for the degree of causality imposed to the selection. If we adopt the filter approximation (10) the incremental formula takes the form

xd+1*= argmaxxkX-XS1dxiXSc((xi;xk))== argmaxxkX-XSI(xk;y1)+λdxiXSC(xi;xk)== argmaxxkX-XSI(xk;y1)-λ2dxiXS(I(xi;xk;y1)+I(xi;xk;y2))

In other terms this formulation suggests to add at the (d + 1)th step, among all the remaining variables, the one which has the better combination of relevance and causality, where the causal term is obtained by averaging over the selected variables and the considered outputs. Note that in the case of m > 2 targets the structural score (11) is obtained by averaging the interaction terms over the m variables.

Similarly to what is done in regularization approaches [21] where specific configurations (typically those with higher complexity) are penalized by adding a complexity term to the one measuring the error, the causality parameter λ in (13) is expected to penalize input variables with no causal role (positive interaction). Note that for λ = 0 the selection rule (13) boils down to the rule (9). The following section will study the impact of the causality term on the accuracy and the stability of a filter algorithm implementing the rule (13).


In this section we perform two experiments to assess the role of the causation term in the feature selection process. The first one is based on a number of synthetic datasets generated by simulating a causal Bayesian network while the second relies on public microarray breast cancer datasets to assess the approach in a real data setting.

Synthetic data

This experiment focuses on the prioritization of causes in a set of classification tasks defined on the basis of simulated data generated by the causal structure depicted in Figure Figure4.4. Note that this causal structure aims to represent in a very simplified manner a stochastic dependency characterized by a number of indirect (nodes 1-3) and direct causes (nodes 4-8), a latent non measurable variable (node 9), one observable primary target (node 10), two secondary targets (nodes 11-12), a set of additional effects (nodes 13-29) and a number of independent and irrelevant variables (nodes 30-40). In order to set up an analogy with the real data experiments of the following subsection, we could make the assumption that the latent variable represents the cancer progression, the three targets denote a set of observable measures depending on the cancer state (patients' prognosis, size and histological grade of the tumor for instance), and that all other variables represent the expression of genes whose activity could play a causal role, be determined as an effect of the disease or be completely irrelevant. It is worth to note that also in the presence of a hidden variable the interaction between marginally independent causes given an effect is negative. This is due to the fact that conditioning on the hidden variable or on one of his children is equivalent in terms of d-separation between the variables [8] and consequently is equivalent in terms of the sign of the interaction. We simulate a number of multivariate datasets from the causal structure in Figure Figure44 and for each of them we rank the inputs of the MIMO classification problem by using the conventional ranking approach based on mutual information (Equation (9)) and our novel approach based on causality (Equation (13)). The stochastic dependency between parents and descendants of the network is modeled by a linear regression where the parameters are uniformly sampled in [-2, 2] and the noise distribution is Gaussian with zero mean and standard deviation σ. We carry out a series of experiments, each characterized by 150 datasets and an increasing noise standard deviation ranging between 0.01 and 0.4. All the variables are continuous apart from the variables 10, 11, and 12, which correspond to the targets y1, y2, and y3 of the classification task and are discretized to two binary values. Note that all measures are centered and scaled in order to have a zero mean and unit standard deviation; this allows for a better understanding of the impact of the noise amplitude on the ranking.

Figure 4
Bayesian causal network used for synthetic experiment. The green node 9 denotes the non observable variable. The three red nodes denote the targets of the multiple-output classification problem. The isolated node (30-40) represents a set of 11 independent ...

The quality of our causal prioritization strategy is assessed by measuring the average ranking of the direct causes (nodes 4-8) and the percentage of time that the direct causes are ranked among the first 5 variables. These two measures (together with a 90% confidence interval) for different values of λ are shown in Figure Figure55 and and66 respectively. These plots show that by increasing the value of λ, the average ranking position of direct causes decreases (direct causes are better prioritized) and that the percentage of correct selection increases (among the first ranked variables we find the direct causes with higher probability). The improvement occurs in a consistent manner for different values of the noise standard deviation though the detection of causal terms become less accurate as the noise increases. Note also that the very bad performance of the ranking (λ = 0) strategy (0% rate of correct selection) derives from the very large number of effects which tend to be ranked before the real causes.

Figure 5
Synthetic data experiment: average ranking of direct causes for different values of λ as a function of the noise standard deviation. Dotted lines are used to denote the 90% confidence interval estimated on the basis of 150 trials.
Figure 6
Synthetic data experiment: probability of selecting a direct cause among the first 5 ranked variables for different values of λ as a function of the noise standard deviation. Dotted lines are used to denote the 90% confidence interval estimated ...

Real expression data

The real data experiment consists of 6 public microarray datasets derived from breast cancer clinical studies (Table (Table1)1) in order to compare the generalization accuracy of the selection returned by the conventional ranking approach based on mutual information (Equation (9)) with the accuracy of the selection returned by our novel approach based on causality (Equation (13)).

Table 1
Affymetrix microarray datasets and related clinical study where the gene expression have been originally published

All the microarray studies analyzed hereafter are characterized by the collection of gene expression data (the inputs X representing n = 13,091 unique genes), the survival data (the primary target y1) and 2 additional clinical (secondary) variables about the state of the tumor, namely the histological grade and the tumor size. These clinical variables are well known by clinicians to be highly relevant for prognosis since large tumors of high grade are usually aggressive and lead to poor prognosis. Each experiment was conducted in a meta-analytical [22] and cross-validation [23] framework, that is the set of variables are selected by relying on the samples of several datasets and the validation is performed on a set of samples not used for the selection. In order to adopt a classification framework, the survival of the patients was transformed in a binary class such as low or high risk of the patients given their clinical outcome at five years as in [24]. We conducted two sets of meta-analysis validation experiments to compare the conventional ranking approach (λ = 0 case) and our causal version for different values of λ:

• Holdout: we carried out 100 training-and-test repetitions where for each repetition the training set is composed of half of the samples of each dataset and the test is composed of the remaining ones.

• Leave-one-dataset-out where for each dataset the features used for classification are selected without considering the patients of the dataset itself. Once the selection is over, 100 holdout repetitions are used to assess the generalization power of the selected set of features.

All the mutual information terms are computed by using the Gaussian approximation (2). This allows the meta-analysis integration at the correlation level by means of the weighted estimation approach proposed by [22]. All the experiments were repeated for three sizes of the gene signature (number of selected features): v = 20, 50, 100.

The quality of the selection is represented by the accuracy of a Naive Bayes classifier measured by four different criteria: the Area Under the ROC curve (AUC), the Root Mean Squared Error (RMSE), the SAR (Squared error, Accuracy, and ROC score introduced by [25]) and the precision-recall F score measure [26]. Table Table22 reports for the holdout experiment the value of the four performance criteria for different values of v and λ. Table Table33 refers to the leave-one-dataset-out experiments for v = 20, v = 50, and v = 100, respectively. Note that the W-L (Win-Loss) line reports the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the ranking filter according both to the McNemar test [27] (p-value < 0.05 adjusted for multiple testing by Holm's method [28]) and the Wilcoxon paired test on squared errors (p-value < 0.05 adjusted for multiple testing by Holm's method).

Table 2
Holdout: accuracy criteria (to be maximized) for different numbers v of variables and different values of λ
Table 3
Leave-one-dataset-out: accuracy criteria (to be maximized) for different numbers v of variables and different values of λ


In the previous section we reported the accuracy results of the traditional ranking approach and our novel method based on a causal relevance score. Here we discuss the added value of our causal approach both from a quantitative and qualitative perspective.

The performance measured in cross-validation suggests that the incorporation of a causal term leads to a significant improvement of classification accuracy. This improvement is observed for different validation configurations and different sizes of the prognostic gene signature. From these results we can conclude that (i) causal feature selection is interesting also for a prediction perspective and (ii) relevant (prognostic) information is contained into secondary output variables (in our case tumor size and histological grade). Although the absolute improvement is only moderate (3% to 6% depending on the validation configurations and performance estimates), the use of our causal ranking strategy in more sophisticated modeling approach for prognosis, such as in [29], may help develop more clinically relevant prognostic classifiers in breast cancer.

The other advantage of our approach is that the introduction of a causality term leads to an interpretation of the causal role of the selected genes. We illustrate this characteristic in Figure Figure77 by comparing, through Gene Ontological (GO) terms, gene rankings with increasing degree of causality using a pre-ranked gene set enrichment analysis (GSEA) [30]. By quantifying how the causal rank of genes diverges from the conventional one (λ = 0) with respect to λ we can identify the gene sets that are potential causes or effects of breast cancer.

Figure 7
Most enriched GO terms with respect to λ according to a pre-ranked gene set enrichment analysis (GSEA): (A) GO terms enriched in the conventional ranking and having a high degree of causality for tumorigenesis; (B) GO terms increasingly enriched ...

Genes that remains among the top ranked ones for increasing λ can be considered as relevant (they contain predictive information about survival) and causal. Genes whose rank increases for increasing λ are putative causes: they have less relevance than other genes (for example, those being direct effects) but they are potentially causal. These genes would have been missed by conventional ranking, where they would appear as false negatives if we interpret the outcome of conventional ranking in causal terms. Genes whose rank decreases for increasing λ are putative effects in the sense that they are relevant but probably not causal. This set of genes could be erroneously considered as causal, and represent false positives if we interpret the outcome of conventional ranking in causal terms.

Since genes are not acting in isolation but rather in pathways, we analyzed the gene rankings in terms of gene set enrichment. As described in [30], the normalized enrichment score (NES) computed in GSEA enables quantification of the strength of association of a gene set (GO term) with a phenotye of interest, here poor or good prognosis (survival). In more details, given a list of genes L ranked by their prognostic relevance and an a priori defined set of genes S (for example genes sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom; gene sets associated with the prognosis phenotype tend to show the latter distribution. NES reflects the degree to which a gene set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when a gene in S is encountered and decreasing it when genes not in S are encountered. The magnitude of the increment depends on the statistic used to rank the genes in L. In our study the statistic of a gene is simply its rank (the most relevant genes have the largest ranks) and its sign depends on the association of its expression with survival: positive sign if over-expression is associated with poor survival and inversely. The score is the maximum deviation from zero encountered in the "walk"; it corresponds to a weighted Kolmogorov-Smirnov-like statistic [30,31]. Finally the score is normalized for each gene set to account for the size of the gene set, yielding a NES.

We computed NES for multiple genome-wide rankings generated with increasing values of λ, and displayed in Figure Figure77 the score of the 3 most enriched GO terms which are identified as being potentially (A) both causes and effects, (B) causes, and (C) effects of breast tumorigenesis (GSEA results for all the GO terms are provided in Additional File 1, 2 and 3). The first group of GO terms that show similar enrichment scores independently of their level of causality are implicated in cell movement and division, cellular respiration and regulation of cell cycle (Figure (Figure7A).7A). The first GO term involves genes encoding for the Rho family of GTPases proteins that are among key regulators of actin and microtubule cytoskeleton [32] and are often over-expressed in human breast cancers [33]. Bromberg et al. showed that, when affected by RNF5, this family of proteins may cause dysregulation of cell proliferation to promote tumor progression [34]. The second GO term represents the co-enzyme metabolic process which includes proteins showed to be early indicators of breast cancer [35]; perturbation of these co-enzymes might cause cancers by compromising the structure of important enzyme complexes implicated in mitochondrial functions [35]. Genes involved in the third GO term "regulation cyclin-dependent protein kinase activity" are key players in cell cycle regulation and inhibition of such kinases proved to block proliferation of human breast cancer cells [36]. Moreover, Moore et al. recently highlighted the role of cyclin-dependent kinases as progesterone activators that could give raise to tumors and sustain their progression in breast cancer [37].

Figure Figure7B7B displays the GO terms that are increasingly enriched with their degree of causality, involving genes that are putative causes of the tumorigenesis affecting patients' survival; these genes might have been missed by the conventional ranking approach (λ = 0). Counterintuitively, the three GO terms in this category are related to the immune system what is sought to be more an effect of the tumor growth as lymphocytes strike cancer cells as they proliferate. However, several independent research groups showed that frequent usage of aspirin significantly decrease the long-term risk of cancer death by correcting immune system dysfunction [38,39], findings that have been confirmed in breast cancer [40], what supports that the immune system might have a causal role in tumorigenesis. There is strong evidence of interplay between immune system and tumors since solid tumors are commonly infiltrated by immune cells; in contrast to infiltration of cells responsible for chronic inflammation, the presence of high numbers of lymphocytes, especially T cells, has been reported to be an indicator of good prognosis in many cancers [41], what concours with the sign of the enrichment (negative enrichment; Figure Figure7B).7B). We and others have reported that gene expression signatures representing the immune response process were associated with a better prognosis in particular subtypes of breast cancer [29,42,43].

The last group of GO terms are less enriched when the degree of causality increases and the vast majority of the corresponding genes are related to cell-cycle and proliferation (Figure (Figure7C).7C). Cell-cycle and proliferation-related genes, such as for example Ki67, have been used for many decades to describe breast tumors: High levels of Ki67 have been correlated with worse prognosis and are also known to be associated with high tumor grade and negativity of estrogen receptor status [44,45]. We and others have shown that a quantitative measurement of proliferation genes using mRNA gene expression could provide an accurate assessment of prognosis of breast cancer patients [43,46,47]. We also have shown that only one of those genes, AURKA, which is significantly enriched in this case in the M phase GO term, was sufficient to recapitulate the prognostic performance of different prognostic signatures [48]. However the enrichment of these proliferation-related genes seems to be a downstream effect of the breast tumorigenesis instead of its cause.

Our approach allows to identify biological processes that may be direct causes of cancer. These processes are likely to be missed by conventional methods. Given the promising performance of our approach, we plan to integrate our method in analytical frameworks combining efficiently the available clinical data and a priori biological knowledge, potentially retrieved from biomedical literature [49] or pathway database [50], in order to unravel gene sets or network of genes causal of cancer patients' survival.


It is well known in statistics that correlation does not imply causation or, in more general terms, that features that are relevant or strongly relevant for predicting a target are not necessarily direct causes. Direct effects are typical examples of variables that provide information about a target without having any causal role. In a data-driven approach to gene selection it is therefore more and more important to discriminate not only between relevant and non-relevant variables but also, within the subset of relevant variables, to discriminate between direct or indirect causes and effects. This paper proposes a computationally affordable strategy to infer causal patterns that take advantage of multiple outputs. Experimental results in terms of accuracy and clinical interpretation show the added value deriving from the inclusion of a causal term into conventional ranking.


AUC: Area Under the ROC Curve; DISR: Double Input Symmetrical Relevance; GO: Gene Ontology; GSEA: Gene Set Enrichment Analysis; MIMO: multiple-input multiple-output; NES: Normalized Enrichment Score; RMSE: Root Mean Squared Error; ROC: Receiver Operating Characteristics; SAR: Squared error, Accuracy, and ROC score; W-L: Win-Loss.

Authors' contributions

GB and BHK were responsible for the design and execution of the study, data analysis and interpretation. CD and CS participated to the data analysis and interpretation. GB and BHK were responsible for writing the manuscript; JQ supervised the study. All authors read and approved the final manuscript.

Supplementary Material

Additional file 1:

Spreadsheet containing the normalized enrichment scores with respect to increasing λ as computed by preranked GSEA (gsea_res_all.csv).

Additional file 2:

Archive containing the output files computed by the preranked GSEA for λ [set membership] {0.1,0.2,0.3,0.4,0.5} (

Additional file 3:

Archive containing the output files computed by the preranked GSEA for λ [set membership] {0.6,0.7,0.8,0.9,1.0,2.0} (


This work was supported by the ARC project "Discovery of the molecular pathways regulating pancreatic beta cell dysfunction and apoptosis in diabetes using functional genomics and bioinformatics" funded by the Communauté Française de Belgique (GB), the US National Institutes of Health (NCI/NIH/DHHS: 5U19CA148065-02, BHK and JQ), by the Belgian National Foundation for Research FNRS (CD, CS), the MEDIC Foundation (CS).


  • Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507–2517. doi: 10.1093/bioinformatics/btm344. [PubMed] [Cross Ref]
  • Guyon I, Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research. 2003;3:1157–1182.
  • Shipley B. Cause and Correlation in Biology. Cambridge University Press; 2000.
  • Guyon I, Aliferis C, Elisseeff A. Computational Methods of Feature Selection. Chapman and Hall; 2007. pp. 63–86. chap. Causal Feature Selection.
  • Bontempi G, Meyer P. Proceedings of the 27th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA; 2010. Causal filter selection in microarray data.
  • Xing EP, Jordan MI, Karp RM. Proceedings of the 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA; 2001. Feature Selection for High-Dimensional Genomic Microarray Data; pp. 601–608.
  • Papineau D. Causal asymmetry. British Journal of Philosophy of Science. 1985;36:273–289. doi: 10.1093/bjps/36.3.273. [Cross Ref]
  • Koller D, Friedman N. Probabilistic graphical models. The MIT Press; 2009.
  • Watkinson J, Liang K, Wang X, Zheng T, Anastassiou D. Inference of regulatory gene interactions from expression data using three-way mutual information. Annals of NY Academy of Sciences. 2009;1158:302–313. doi: 10.1111/j.1749-6632.2008.03757.x. [PubMed] [Cross Ref]
  • Freitas AA. Understanding the Crucial Role of Attribute Interaction in Data Mining. Artficial Intelligence Review. 2001;6:177–199.
  • Jakulin A. PhD thesis. University of Ljubliana, Faculty of Computer and Information Science; 2005. Machine Learning Based on Attribute Interactions.
  • Meyer P, Schretter C, Bontempi G. Information-Theoretic Feature Selection in Microarray Data using Variable Complementarity. IEEE Journal of Selected Topics in Signal Processing. 2008;2:261–274.
  • Cover TM, Thomas JA. Elements of Information Theory. New York: John Wiley; 1990.
  • McGill WJ. Multivariate information transmission. Psychometrika. 1954;19
  • Peng H, Long F, Ding C. Feature Selection Based On Mutual Information: Criteria of Max-Dependency,Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;27(8):1226–1238. [PubMed]
  • Devroye L, Györfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition. Springer Verlag; 1996.
  • Fleuret F. Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research. 2004;5:1531–1555.
  • Peng H, Long F. An efficient max-dependency algorithm for gene selection. Proceedings of the 36th Symposium on the Interface: Computational Biology and Bioinformatics. 2004.
  • Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos X. Local causal and Markov Blanket induction for causal discovery and feature selection for classification. Part I. JMLR. 2010;11:171–234.
  • Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Pres; 2000.
  • Engl HW. Regularization of inverse problems. Kluwer Academic Publishers Group; 1996.
  • Hedges L, Olkin I. Statistical Methods for Meta-Analysis. Journal of the American Statistical Association. 1987;82(397):350–351. doi: 10.2307/2289186. [Cross Ref]
  • Stone M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B. 1974;36:111–147.
  • van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhiven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a. [PubMed] [Cross Ref]
  • Caruana R, Niculescu-Mizil A. Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria. ROCAI'04. 2004. pp. 9–18.
  • van Rijsbergen CJ. Information Retrieval. Butterworth; 1979.
  • Dietterich GT. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 1998;10:1895–1923. doi: 10.1162/089976698300017197. [PubMed] [Cross Ref]
  • Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 1979;6:65–70.
  • Haibe-Kains B, Desmedt C, Rothe F, Piccart M, Sotiriou C, Bontempi G. A fuzzy gene expression-based computational approach improves breast cancer prognostication. Genome Biology. 2010;11(2):R18. doi: 10.1186/gb-2010-11-2-r18. [PMC free article] [PubMed] [Cross Ref]
  • Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102. [PubMed] [Cross Ref]
  • Hollander M, Wolfe DA. Nonparametric statistical inference. New York: John Wiley and Sons; 1973.
  • Jaffe AB, Hall A. Rho GTPases: biochemistry and biology. Annu Rev Cell Dev Biol. 2005;21:247–69. doi: 10.1146/annurev.cellbio.21.020604.150721. [PubMed] [Cross Ref]
  • Burbelo P, Wellstein A, Pestell RG. Altered Rho GTPase signaling pathways in breast cancer cells. Breast Cancer Res Treat. 2004;84:43–8. doi: 10.1023/B:BREA.0000018422.02237.f9. [PubMed] [Cross Ref]
  • Bromberg KD, Kluger HM, Delaunay A, Abbas S, DiVito KA, Krajewski S, Ronai Z. Increased expression of the E3 ubiquitin ligase RNF5 is associated with decreased survival in breast cancer. Cancer Res. 2007;67(17):8172–9. doi: 10.1158/0008-5472.CAN-07-0045. [PMC free article] [PubMed] [Cross Ref]
  • Heikal AA. Intracellular coenzymes as natural biomarkers for metabolic activities and mitochondrial anomalies. Biomark Med. 2010;4(2):241–63. doi: 10.2217/bmm.10.1. [PMC free article] [PubMed] [Cross Ref]
  • Yenugonda VM, Deb TB, Grindrod SC, Dakshanamurthy S, Yang Y, Paige M, Brown ML. Fluorescent cyclin-dependent kinase inhibitors block the proliferation of human breast cancer cells. Bioorg Med Chem. 2011;19(8):2714–25. doi: 10.1016/j.bmc.2011.02.052. [PubMed] [Cross Ref]
  • Moore NL, Weigel NL. Regulation of progesterone receptor activity by cyclin dependent kinases 1 and 2 occurs in part by phosphorylation of the SRC-1 carboxyl-terminus. Int J Biochem Cell Biol. 2011. [PMC free article] [PubMed]
  • Rothwell PM, Fowkes FGR, Belch JFF, Ogawa H, Warlow CP, Meade TW. Effect of daily aspirin on long-term risk of death due to cancer: analysis of individual patient data from randomised trials. Lancet. 2011;377(9759):31–41. doi: 10.1016/S0140-6736(10)62110-1. [PubMed] [Cross Ref]
  • De Santo C, Serafini P, Marigo I, Dolcetti L, Bolla M, Del Soldato P, Melani C, Guiducci C, Colombo MP, Iezzi M, Musiani P, Zanovello P, Bronte V. Nitroaspirin corrects immune dysfunction in tumor-bearing hosts and promotes tumor eradication by cancer vaccination. Proc Natl Acad Sci USA. 2005;102(11):4185–90. doi: 10.1073/pnas.0409783102. [PubMed] [Cross Ref]
  • Brasky TM, Bonner MR, Moysich KB, Ambrosone CB, Nie J, Tao MH, Edge SB, Kallakury BVS, Marian C, Goerlitz DS, Trevisan M, Shields PG, Freudenheim JL. Non-steroidal anti-inflammatory drugs (NSAIDs) and breast cancer risk: differences by molecular subtype. Cancer Causes Control. 2011. [PMC free article] [PubMed]
  • Hsu DS, Kim MK, Balakumaran BS, Acharya CR, Anders CK, Clay T, Lyerly HK, Drake CG, Morse MA, Febbo PG. Immune Signatures Predict Prognosis in Localized Cancer. Cancer Invest. 2010;28(7):765–773. doi: 10.3109/07357900903095755. [PubMed] [Cross Ref]
  • Teschendorff A, Miremadi A, Pinder S, Ellis I, Caldas C. An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome Biology. 2007;8(8):R157. doi: 10.1186/gb-2007-8-8-r157. [PMC free article] [PubMed] [Cross Ref]
  • Desmedt C, Haibe-Kains B, Wirapati P, Buyse M, Larsimont D, Bontempi G, Delorenzi M, Piccart M, Sotiriou C. Biological Processes Associated with Breast Cancer Clinical Outcome Depend on the Molecular Subtypes. Clin Cancer Res. 2008;14(16):5158–5165. doi: 10.1158/1078-0432.CCR-07-4756. [PubMed] [Cross Ref]
  • Barnard NJ, Hall PA, Lemoine NR, Kadar N. Proliferative index in breast carcinoma determined in situ by Ki67 immunostaining and its relationship to clinical and pathological variables. J Pathol. 1987;152(4):287–95. doi: 10.1002/path.1711520407. [PubMed] [Cross Ref]
  • Locker AP, Birrell K, Bell JA, Nicholson RI, Elston CW, Blamey RW, Ellis IO. Ki67 immunoreactivity in breast carcinoma: relationships to prognostic variables and short term survival. Eur J Surg Oncol. 1992;18(3):224–9. [PubMed]
  • Haibe-Kains B, Desmedt C, Piette F, Buyse M, Cardoso F, van't Veer L, Piccart M, Bontempi G, Sotiriou C. Comparison of prognostic gene expression signatures for breast cancer. BMC Genomics. 2008;9:394. doi: 10.1186/1471-2164-9-394. [PMC free article] [PubMed] [Cross Ref]
  • Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, Haibe-Kains B, Desmedt C, Ignatiadis M, Sengstag T, Schutz F, Goldstein D, Piccart M, Delorenzi M. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research. 2008;10(4):R65. doi: 10.1186/bcr2124. [PMC free article] [PubMed] [Cross Ref]
  • Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics. 2008;24(19):2200–2208. doi: 10.1093/bioinformatics/btn374. [PubMed] [Cross Ref]
  • Heidorn PB, Palmer CL, Wright D. Biological information specialists for biological informatics. J Biomed Discov Collab. 2007;2:1. doi: 10.1186/1747-5333-2-1. [PMC free article] [PubMed] [Cross Ref]
  • Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27. [PMC free article] [PubMed] [Cross Ref]
  • McCall MN, Bolstad BM, Irizarry RA. Frozen robust multiarray analysis (fRMA) Biostatistics. 2010;11(2):242–53. doi: 10.1093/biostatistics/kxp059. [PMC free article] [PubMed] [Cross Ref]
  • Miller LD, Smeds J, George J, Vega VB, Vergara L, Pioner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. PNAS. 2005;102(38):13550–13555. doi: 10.1073/pnas.0506230102. [PubMed] [Cross Ref]
  • Pawitan Y, Bjohle J, Amler L, Borg A, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedren S, Bergh J. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Research. 2005;7(6):953–964. doi: 10.1186/bcr1325. [PMC free article] [PubMed] [Cross Ref]
  • Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, van Gelder MEM, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA. Gene-Expression Profiles to Predict Distant Metastasis of Lymph-Node-Negative Primary Breast Cancer. Lancet. 2005;365(9460):671–679. [PubMed]
  • Minn AJ, Gupta GP, Padua D, Bos P, Nguyen DX, Nuyten D, Kreike B, Zhang Y, Wang Y, Ishwaran H, Foekens JA, van de Vijver M, Massague J. Lung metastasis genes couple breast tumor size and metastatic spread. Proceedings of the National Academy of Sciences. 2007;104(16):6740–6745. doi: 10.1073/pnas.0701138104. [PubMed] [Cross Ref]
  • Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M. Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J Natl Cancer Inst. 2006;98(4):262–272. doi: 10.1093/jnci/djj052.;98/4/262 [PubMed] [Cross Ref]
  • Schmidt M, Bohm D, von Torne C, Steiner E, Puhl A, Pilch H, Lehr HA, Hengstler JG, Kolbl H, Gehrmann M. The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer. Cancer Res. 2008;68(13):5405–5413. doi: 10.1158/0008-5472.CAN-07-5206. [PubMed] [Cross Ref]
  • Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JG, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C. Strong Time Dependence of the 76-Gene Prognostic Signature for Node-Negative Breast Cancer Patients in the TRANSBIG Multicenter Independent Validation Series. Clin Cancer Res. 2007;13(11):3207–3214. doi: 10.1158/1078-0432.CCR-06-2765. [PubMed] [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central