In this work we propose a novel method to compare microarray experiments at the pathway level. Our work addressed three major tasks: (i) we defined a metric, the Pathway Signature, that measures the probability that a set of pathways is related to a clinical or biological variable of interest, together with the relative importance of that set in the context of the biological problem; (ii) we proposed a method that uses these pathway signatures to assess the relative similarity of experiments; and (iii) we demonstrated the validity of this method by applying it to two different, well-defined biological problems on two independent collections of microarray experiments.
As a proof of concept we selected datasets representing well-known stimuli in well-known biological systems. One of the datasets was from S. cerevisiae, the most extensively studied model organism, which has a well-characterized genome in which all the genes are represented on the array and is widely used to test bioinformatics methods. We selected the Rosetta compendium of deletion mutants because it measures the steady-state response to a very precise alteration, the deletion of a given gene, so that it is relatively simple to associate a phenotype with a particular pathway profile. As the second dataset we decided to use data from Homo sapiens, because the majority of the experiments stored in GEO and ArrayExpress are from human samples. We chose transcriptomic datasets from dendritic cells (DCs) because they have some clear advantages with respect to other fields. Firstly, the cell type is well defined, which enables the study of the alterations in gene expression following stimulation. Secondly, it is possible to perform prediction and hypothesis-driven functional genomics studies aimed at reconstructing the networks of molecular interactions that characterize specific DC differentiation programs.
The aim of our work was to propose a procedure to assess similarity between microarray experiments at the pathway level by generating “Pathway signatures” (PS) for a set of experiments, and to use these signatures to interrogate microarray databases.
The need for better methods to identify similarities in microarray data sets arises from the fact that, although analysis techniques have constantly improved over the past years, one of the biggest hurdles remains the comparability among distinct data sets produced by different researchers and laboratories, which results in lists of genes that do not overlap, or overlap only to a very limited extent. A possible reason lies in the different assumptions about the data made by different statistical methods. This is a strong limitation, because identifying biologically similar samples would increase the power and the reproducibility of a study, especially since in most studies the number of samples can be a limiting factor, which could be compensated for by using already published experiments.
Also, as the number of publicly available data sets increases (at the time of writing, GEO and ArrayExpress host a total of 221,815 and 110,356 hybridizations, respectively), it is important to have a reliable method to compare microarray data from different sources.
The improvements observed when comparing different experiments at the pathway level are consistent with the assumption that genes never act alone in a biological system but participate in a cascade of networks, an aspect overlooked by gene-based analyses.
The selection of the statistical method used to measure enrichment is central to our approach.
Different metrics have been proposed to integrate the probability of alteration of a sector of the cellular network (pathway) and the relative importance of that pathway in the context of the biological problem, such as the probability vector and the impact factor (IF). Other methods have been devised for the identification of regulatory modules and their regulation program by integrating genome-wide location and expression data. However, to our knowledge, these methods have not been employed to compare a large number of experiments assembled from different microarray data sets.
Fisher's Exact Test (FET) p-values had to be transformed to improve interpretation: p-values express a probability, and the smaller they are, the more significant the result, whereas from a conceptual point of view it is better to express pathway enrichment as a number that is either categorical or grows with significance. For this reason we first took the logarithm of the reciprocal of the p-value, which expresses the measure on a scale that avoids interpretation problems; the Pathway Enrichment Factor (PEF) was then obtained by multiplying this value by a sign representing the “direction” of the pathway, following a previously described approach. Our results showed that clustering PEFs grouped samples according to their biological similarity. In yeast, the transcriptional profiles of the ssn6 and tup1 deletion mutants fell in the same cluster; these genes form a co-repressor complex that is responsible for repressing a large number of S. cerevisiae genes, including glucose-responsive genes, DNA damage genes and oxygen utilization genes. As ssn6 and tup1 act together, deletion of either gene is expected to yield similar behavior. On the yeast dataset the exact topology of the cluster depended on the metric used. This is understandable: sets of PEFs can be noisy, since they span the full range of p-values, and identifying which pathways were “significant” and which “not significant” was not always straightforward. This can influence the clustering, so certain metrics prove more useful than others; for example, Pearson's correlation was the metric that most correctly grouped ssn6 and tup1 in our yeast data. This result could be biologically relevant: although ssn6 and tup1 are part of a complex, tup1 also has a function on its own, and indeed the patterns of significantly altered pathways (shown by their sBEFs) in the two deletion mutants exhibit differences. This biological difference could result in changes in the direction in which some pathways are affected by one deletion or by the other. Alternatively, this result could reflect a greater sensitivity of some of the metrics used to technical “noise” in the data.
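The signed transformation of FET p-values into PEFs described above can be illustrated with a minimal sketch. The base of the logarithm and the rule used to assign the sign are not specified in this section, so the base-10 logarithm, the majority-vote sign rule and the names `pathway_enrichment_factor`, `n_up`, `n_down` and `floor` are all assumptions made for illustration only.

```python
import math


def pathway_enrichment_factor(p_value, n_up, n_down, floor=1e-300):
    """Signed PEF: log10(1/p) multiplied by a direction sign.

    Assumptions (not taken from the paper): base-10 logarithm, and a
    direction given by whether up-regulated pathway genes are at least
    as numerous as down-regulated ones.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    p = max(p_value, floor)            # guard against log of zero
    magnitude = math.log10(1.0 / p)    # larger means more significant
    sign = 1 if n_up >= n_down else -1
    return sign * magnitude


# A pathway with p = 0.001 and mostly up-regulated genes scores about +3;
# the same p-value with mostly down-regulated genes scores about -3.
up_score = pathway_enrichment_factor(0.001, n_up=5, n_down=1)
down_score = pathway_enrichment_factor(0.001, n_up=1, n_down=5)
```

With this convention, magnitude carries significance and sign carries direction, so two samples that alter the same pathway in opposite directions receive opposite scores.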
Biclustering of PEFs on the dendritic cell data sets gave concordant and biologically relevant results: the responses elicited by stimulation with S. cerevisiae follow the same downstream signalling as those elicited by the fungus A. fumigatus, and their PEFs clustered next to each other with all the metrics used.
On the other hand, sBEFs produced consistent clustering results for both the yeast and the DC datasets, independently of the clustering method, which made the identification of similar datasets very easy.
We selected sBEFs to produce the PS because they can easily be used as a “barcode” that can be uniquely attached to a sample, facilitating the “querying” of a database of PSs.
The observation that sBEFs, PEFs and PSs identify biologically meaningful similarities between samples indicates that the categorical transformation of the p-values does not affect the observed result. We therefore conclude that the observed similarity has a biological meaning rather than resulting from a manipulation of the data.
When testing different methods to compare experiments at the pathway level, we initially chose clustering because of its flexibility and robustness in representing biological data. The type of clustering used was biclustering with support trees over 100 or 1000 iterations, as implemented in TMeV, because the clustering results obtained with this method are statistically more sound, as they do not depend on the original order of the genes and the pathways. Also, clustering has been applied successfully to gene expression studies. From an ontological point of view the sBEFs can be considered categorical phenotypes; clustering is therefore a legitimate approach to classify “omics” data, as it has been used on datasets of both continuous and categorical data, such as human haplotypes in population studies and phenotypic observations such as Fisher's Iris data set, probably one of the most widely used data sets in clustering and pattern recognition studies.
Yet since sBEFs are categorical rather than continuous, and biclustering may not be completely suited to this type of data, we investigated the use of an alternative metric, the Jaccard index. Our results showed excellent agreement with the clustering for the DC data but some inconsistencies for the yeast data, in agreement with the clustering of the PEFs for this dataset. This could reflect genuine biological differences, or alternatively could be related to noise deriving from technological issues, as the yeast dataset is a nine-year-old two-color array. Another possibility is that measures like the Jaccard index may not be robust enough to capture the inherent complexity of “omics” data. In any case, our future work will be aimed at improving the grouping metrics for sBEFs.
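To make the use of the Jaccard index on categorical signatures concrete, the following is a minimal sketch rather than the implementation used in this study. It assumes that each signature is stored as a mapping from pathway name to a call of -1, 0 or +1 (an encoding chosen here for illustration), and that two samples share a pathway only when they agree on its direction. The function names and the pathway and sample names in the usage lines are hypothetical.

```python
def jaccard_index(sig_a, sig_b):
    """Jaccard index between two categorical pathway signatures.

    Each signature maps a pathway name to -1, 0 or +1 (an assumed
    encoding). Only non-zero calls are compared, and a pathway counts
    as shared only when both samples agree on its direction.
    """
    set_a = {(name, call) for name, call in sig_a.items() if call != 0}
    set_b = {(name, call) for name, call in sig_b.items() if call != 0}
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty signatures
    return len(set_a & set_b) / len(union)


def rank_by_similarity(query, database):
    """Return (sample_id, score) pairs sorted by decreasing Jaccard index."""
    scores = [(sample_id, jaccard_index(query, sig))
              for sample_id, sig in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical signatures used only to exercise the functions.
query = {"glycolysis": 1, "tca_cycle": -1, "ribosome": 0}
database = {
    "sample_A": {"glycolysis": 1, "tca_cycle": -1, "ribosome": 0},
    "sample_B": {"glycolysis": 1, "tca_cycle": 1, "ribosome": 0},
    "sample_C": {"glycolysis": 0, "tca_cycle": 0, "ribosome": -1},
}
ranking = rank_by_similarity(query, database)
```

Treating a pathway altered in opposite directions as a mismatch is one possible design choice; ignoring direction would instead compare only which pathways are altered.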
Results with GSEA on the dendritic cell data did not match our expectations, owing to the different numbers of experiments present in the different classes and the difficulty of assigning a data set to one class or another. In fact, the grouping observed with FET scores on DC data was absent when using GSEA data. There are two possible reasons for this inconsistency. First, GSEA compares two groups (treated versus untreated), as opposed to the single-sample analysis performed with FET. Thus, the effect of inter-donor variability (all data sets have at least two biological replicates for each sample) is noticeable and is not corrected by the pathway approach. The second reason is the heterogeneity of the samples themselves, which are subject to different treatments. As a result, GSEA reduces everything to the lowest common denominator, presenting an “average” profile in which individual differences are smoothed out. Also, the number of control and treated samples differs between data sets (this is usually the norm when comparing different data sets), so there is an imbalance among the various GSEA analyses.
We can conclude that the ability to investigate at the single-sample level makes Fisher's Exact Test conceptually more appropriate for searching for similarities among experiments in microarray databases, which contain a number of hybridizations that should be interrogated without necessarily specifying their membership of one data set or another. Lastly, GSEA requires dividing the samples into two distinct phenotypic classes in order to operate, and division into classes is not necessarily straightforward. As a matter of fact, this type of analysis is not suitable for two-color microarrays (which present data as a ratio between treatment and control), and therefore we were not able to use it with our yeast data set.
Overall, our results show that using pathway signatures in conjunction with hierarchical clustering with support trees is a useful technique to compare experiments produced by different people and laboratories, with greater power than traditional analysis techniques. The results are also easier to interpret and make it easier to discover biologically meaningful implications, which makes this approach a good candidate for analyzing data from different sources. PSs generated using sBEFs can be useful as “barcodes” to classify experiments in microarray databases, and clustering of sBEFs can be a useful way to query experiments in databases according to their similarity at the pathway level. Thus, we propose to store PSs as an additional experimental annotation in microarray databases, and to implement methods that use pathway signatures to query experiments in public databases and to analyze subsets of experiments concurrently.
Despite the effectiveness of the method, there are still some drawbacks that will need to be addressed in the future. First, the results of Fisher's Exact Test depend on the lower and upper cut-offs for expression, which are currently defined by the user. We were well aware of the limitations of this method, but also of its robustness when an appropriate threshold is used, in agreement with the findings of Bussemaker and Boorsma, who showed that Fisher's Exact Test outperforms other metrics when the threshold is appropriately selected. Thus, we calculated the appropriate cut-offs automatically, based on the 2σ interval of the binomial distribution of the expression values, and we will implement this method directly in a future version of Eu.Gene. Secondly, both FET and GSEA assume that the genes in a pathway are independent of each other, which is clearly not the case in a real biological system. An alternative system that keeps track of the interdependence of genes, and that has the potential to dramatically improve the results of pathway analysis, has recently been proposed. However, its use is currently limited to signal transduction pathways from KEGG, and it requires an accurate description of causality among the events, which is not available for the vast majority of the pathways present in public resources. We have partially incorporated this idea of direction by calculating signs attached to the p-values. We plan to increase the sophistication of the method by implementing proper directionality of the events, using curated pathway sets with a consensus on the number, type and order of the connections between the members of the pathway. With the ability to include directionality, the application of methods for module network analysis using improved PSs instead of genes holds the promise of unravelling the hierarchical structure in the control of cellular pathways and of reconstructing the modular structure that determines changes in a transcriptional profile.
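The dependence of the FET result on the expression cut-offs, discussed at the start of this paragraph, can be illustrated with a short sketch. It is not the Eu.Gene implementation: the cut-offs are approximated here as the mean plus or minus two standard deviations of the observed log-ratios, which is a simplifying assumption, and the enrichment p-value is a one-sided Fisher's Exact Test computed as a hypergeometric tail. The names `two_sigma_cutoffs` and `fisher_enrichment_p` and their parameters are hypothetical.

```python
import math
import statistics


def two_sigma_cutoffs(log_ratios):
    """Lower and upper cut-offs at mean -/+ 2 standard deviations.

    Assumption: an empirical 2-sigma interval stands in for the
    distribution-based interval mentioned in the text.
    """
    mean = statistics.fmean(log_ratios)
    sd = statistics.pstdev(log_ratios)
    return mean - 2.0 * sd, mean + 2.0 * sd


def fisher_enrichment_p(n_genes, n_altered, n_pathway, n_overlap):
    """One-sided Fisher's Exact Test p-value for over-representation.

    Probability of seeing at least n_overlap altered genes in a pathway
    of n_pathway genes, given n_altered altered genes among n_genes.
    """
    total = math.comb(n_genes, n_altered)
    upper = min(n_pathway, n_altered)
    tail = sum(
        math.comb(n_pathway, k) * math.comb(n_genes - n_pathway, n_altered - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Genes outside the cut-offs are called altered; the count of altered
# genes falling in a pathway is then tested for over-representation.
low, high = two_sigma_cutoffs([-1.0, 0.0, 1.0])
p_value = fisher_enrichment_p(n_genes=10, n_altered=3, n_pathway=4, n_overlap=2)
```

Because the number of altered genes changes with the cut-offs, the same pathway can move in or out of significance as the thresholds are tightened or relaxed, which is why an automatic and reproducible choice of cut-offs matters.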