Given a functional category of interest such as GO Biological Process term or a biochemical pathway and a set of microarray experiments our task is to select a subset of experiments that is optimal at differentiating the genes in that functional category from the remaining ones – which we shall call background genes. Since the chosen subset of experiments should be constituted by experiments that most perturb the genes in the functional category of interest, we shall refer to these experiments as the relevant experiments.
Our idea is to choose a feature that, if an experiment is relevant, would be able to discriminate between the genes in the category of interest and the background genes. The set of relevant experiments can then be found by maximizing the discriminatory ability of such a feature. The feature we chose is Pearson's correlation coefficient and we used a t-test to measure its discriminatory ability – whether the mean of the correlation coefficients for the genes of interest is significantly higher than that of the background.
Clearly, an exhaustive search of the space of possible subsets of experiments is computationally intractable for large microarray datasets (the number of possible subsets of a set of n experiments is 2n). Therefore, a ‘brute force’ approach that analyzes every possible combination of experiments would not be feasible for typical microarray collections containing a large number of experiments. For example, the set of 44 Arabidopsis microarray experiments that we have used in this work would require analyzing over 17,000 billion combinations. Therefore, this paper presents an efficient greedy heuristic which was able to select a set of experiments with high discriminatory ability while retaining a quadratic complexity. Here we give an informal description of the procedure while the pseudo-code of the algorithm is presented in .
Our analysis assumes that we are given a certain functional category and a set of
n microarray experiments, each comprising several time-points or conditions. The procedure begins by performing a
t-test for every experiment in the microarray collection assessing whether the correlation between gene pairs where both genes in the pair belong to the functional category of interest (denoted by A in ) is greater than correlation between gene pairs where only one gene belongs to the functional category of interest (denoted by B in ). Note that we do not consider correlations where neither genes in the pair belong to the functional category (the rationale for this is discussed in
Supplementary Information S3.
We then select a fixed number of
seed experiments with the best
p-values from the
t-tests. The algorithm builds experiment lists iteratively starting from these seed experiments. For a given list, at each iteration, an experiment is selected at random among those not already contained in the list and this experiment is tentatively added to the existing list. As before, a
t-test is then performed to check whether this expanded list of experiments exhibits a distribution of correlations between gene pairs where both genes belong to the functional category of interest (A in ) which is greater than correlation between gene pairs where only one gene belongs to the functional category of interest (B in ). If the
p-value is smaller than a pre-defined threshold, the experiment is permanently added to the list; otherwise it is removed. This iterative procedure terminates when all experiments have been considered for every seed experiment for every list. Once the lists have all been created, the list with the overall final best
p-value is kept as the optimal list of experiments that the algorithm returns. Finally, it should be pointed out that a
t-test requires that the values being tested be independent samples from a Gaussian distribution. In our case, the values being tested are the pair-wise correlations in a set of genes. Unfortunately, such correlations are neither independent nor Gaussian. Thus, the
p-values computed by our algorithm are not guaranteed to be accurate. Nevertheless, they are still very useful for choosing experiments. This issue is more fully discussed in the
Discussion section.
Although this algorithm cannot guarantee that the selected set of experiments is optimal, in practice we found that this heuristic selected sets of experiments with high discriminatory ability while providing computational tractability. Indicating with n the number of experiments in the dataset and with K the number of seed experiments, the number of t-tests our algorithm needs to consider at most is given by:
This quadratic complexity allowed us to complete one run of the algorithm for any of the experiments presented here in a few minutes on a regular desktop machine.
The algorithm has only two parameters: the significance level of the
t-tests (denoted by L in the pseudocode) and the number of seed experiments (K). When testing our algorithm we set the significance level to the standard value of 0.05. Importantly, we found that our algorithm is quite insensitive to the number of seed experiments – in the experiment presented in the sequel, in which we tested the procedure on different species and different sets of microarray experiments, a value of K

=

25±15 gave similar results.
Compared to large collections of microarrays, smaller subsets of experiments may lead to higher correlation values purely because of the shorter length of the vectors. In all our analyses we account for this bias by filtering the correlation by a p-value threshold. This ensured that only statistically significant correlations are considered.
We tested our algorithm on publicly available microarray data collections. Here we present results obtained using 44 individual experiments in
Arabidopsis thaliana from the NASCAarray collection
[9] and 31 individual experiments in
Saccharomyces cerevisiae from the M3D collection
[10]. A full list and details of the microarray experiments can be found in
Supplementary Information S4. Our experiments on both yeast and Arabidopsis prove that our procedure is also species-independent. To prove that our selection procedure is independent of the functional classification system adopted, we applied our algorithm for selecting experiments relevant for GO Biological Process terms and MIPS FunCat terms.
In the following sections, we will prove the effectiveness of our procedure by showing that the selected sets of experiments: result in higher correlations between genes in the same functional category; improve the performance of GBA-based gene function prediction; provide a discriminatory ability for a given functional term which increases with the term specificity; lead to a better reconstruction of gene regulatory networks. Furthermore, in the discussion we shall highlight how the selected experiments can also be explained in terms of the literature.
(a) Selected experiments improve overall correlation between genes in the same functional category
As discussed earlier, for effective gene expression-based functional analyses, it is essential that genes belonging to the same functional category exhibit high correlation. However, we observed that this is not necessarily true when large microarray collections are used for calculating the correlation (). The experiments selected by our algorithm uncover significantly higher correlation. For genes which belong to the same functional category, we compared the distribution of correlation coefficients obtained using the experiments selected by the algorithm with the distribution obtained using all the experiments in the collection. shows representative histograms of the distributions for both Arabidopsis and yeast GO terms. As expected, the correlation distribution obtained from using all experiments is populated with low correlations with only a greatly reduced population with higher correlation values. However, when experiments selected by the algorithm are used, the distribution is enriched with higher positive and negative correlations. We performed a t-test between the distributions of the absolute values of the correlations in order to check whether the distribution obtained using selected experiments is greater than when all experiments are used. The low t-test p-values () confirm that the distribution obtained with selected experiments is significantly higher than when all experiments are used.
(b) Quantifying the effectiveness of the selected experiments at improving the correlation in a functional category
In order to evaluate the effectiveness of the selected experiments, we formulated a classification problem where pairs of genes are classified into two classes: the first class contains the pairs were both genes belong to the category of interest; the second class contains those pairs in which one gene belongs to the category of interest while the other one belongs to the background set. This classification is performed using the Pearson correlation between the genes in the pair as the only feature. This allowed us to compare the effect of using the selected experiments vs. using all experiments by comparing the performance of the classifiers – since the classifiers use only correlation to distinguish between the two types of gene pairs, comparing them allows us to assess the quality of the correlations.
Receiver Operating Characteristic (ROC) curves are widely used in the machine learning literature for comparing classifiers. They are the plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) for different values of the classifier decision threshold. The greater the area under the ROC curve (AUC), the better is the performance of the classifier.
For calculating the ROC curves, gene pairs in which both genes belong to the functional category of interest were considered the Positive Set; and gene pairs, in which only one gene belongs to the functional category of interest, were considered the Negative Set. To evaluate the performance of the selected experiments, we performed a 10-fold cross-validation: genes were first randomly divided into 10 parts and at each round, 9 parts were used for training and one for testing. At each round, our algorithm was used to select the experiments using only the training set and these were then used for calculating the ROC curves for the testing set. shows the average ROC curves for four different GO functional categories, two from Arabidopsis and two from yeast (the procedure for averaging ROC curves can be found in
[11]). We can see that the ROC curves for selected experiments (shown in green) have a greater AUC compared to all experiments (shown in red). Following common practice, we also present the average (1-AUC) for both selected and non-selected datasets over the 10-folds (). The average ROC curves and their corresponding (1-AUC) for further twelve examples of different GO functional categories can be found in
Supplementary Information S5.
We can see that the average (1-AUC) is remarkably lower for the selected set of experiments. Moreover, we performed a t-test between the ten (1-AUC) values from the 10-fold cross-validation obtained using the selected experiments and those obtained using all experiments – p values are also reported in . This proves the clear difference in performance between classifiers that use all experiments and classifiers that use experiments selected by the algorithm and, consequently, the improved quality of the correlations obtained when selecting experiments.
The superior performance of the selected set of experiments was observed for both Arabidopsis and yeast GO Biological Process terms (, ). Importantly, we were able to show that this effect is true also for MIPS FunCat terms, thus indicating that our procedure is effective independently of the functional categorization adopted. shows the average ROC curves and their corresponding (1-AUC) together with their
p values for 2 MIPS functional categories – further twelve examples of MIPS categories can be found in
Supplementary Information S5. In general we found that for MIPS FunCat terms the difference in performance between the selected experiments and all experiments was smaller compared to GO Biological Process terms. This could be due to the broad functional classification found in MIPS FunCat when compared to GO (we discuss the relation between specificity of annotation and performance of the selected set of experiments in the following section (d)).
(c) Selected experiments improve GBA-based gene function prediction
A central goal of GBA-based analysis of transcriptomics data is to predict gene function. Therefore an important test of the efficacy of our method is to check whether the correlations obtained by the selected experiments are a better feature for predicting gene function than the correlations obtained using the entire set of experiments.
We framed this problem as a classification problem between two classes of genes: those in the category of interest and those in the background. This classification is performed using very simple GBA-inspired classifiers that use only the Pearson correlation between the genes. The simplest possible classifier of this kind is one that classifies a gene using the sum of the correlations between that gene and the genes in the training set that belong to the category of interest: if this sum is above a certain threshold, it classifies the gene as belonging to the category of interest; otherwise it assigns it to the background.
As before, to evaluate the performance of the classifiers we performed a 10-fold cross-validation and calculated the average ROC curves. Our aim is to compare the performance of classifiers that employ correlations from the selected experiments with the performance of classifiers that employ correlations from the entire set of experiments. shows the average ROC curves for four different GO functional categories, two from Arabidopsis and two from yeast – the categories are the same ones that we used in . We can see that the ROC curves for selected experiments (shown in green) have a greater AUC compared to all experiments (shown in red). The (1-AUC) for both selected and non-selected datasets over the 10-folds is shown in , together with the
p values obtained by the
t-test of the (1-AUC) values obtained in the 10-fold cross-validation. As before, this proves the clear difference in performance between classifiers that use all experiments and classifiers that use experiments selected by the algorithm and, consequently, the improved quality of the correlations obtained when selecting experiments. The average ROC curves and their corresponding (1-AUC) for further twelve examples of different GO functional categories can be found in
Supplementary Information S6. Results for yeast also show the same effect (, ).
We also repeated our experiments for MIPS FunCat terms obtaining consistent results (see – further twelve examples of MIPS categories can be found in
Supplementary Information S7). Again, this indicates that our experiment selection procedure is effective at improving GBA-based gene function prediction independently of the functional categorization adopted.
(d) The effectiveness of the selected experiments increases with annotation specificity
Another way to prove the effectiveness of our experiment selection procedure is to show that the performance of the selected experiments for the classification task outlined in section (b), increases as the functional category becomes more specific. This is based on the fact that the overall correlation among genes in a functional category is expected to increase as the category becomes more specific. Consequently, a set of relevant experiments should be more effective in differentiating genes belonging to the functional category of interest from all other functional categories. To evaluate this, we measured the effectiveness of the selected experiments at every level of specificity of annotation starting from a leaf node up to the root node.
For the GO BP term GO:0009861 “Jasmonic acid and ethylene dependent systemic resistance”, we ran our algorithm for terms found at each level of the tree leading up to the root term. The effectiveness of the selected experiments was evaluated using the classification problem framework outlined in section (b). The performance of the classifier was evaluated by plotting ROC curves and the average (1-AUC) score in 10-fold cross-validation was recorded. Here, we expect that if the performance of the selected experiments were the same as using all experiments then the difference in their average (1-AUC) scores would be zero. This difference for every term in the hierarchy from GO:0009861 up to the root node is shown in . From the figure, it is clear that the effectiveness of the selected experiments is dependent on the specificity of the functional annotation.
(e) Selecting relevant experiments: Implications on Pathway reconstruction
An important application of the GBA principle is the elucidation of putative members of biological pathways. Identifying experiments relevant to the pathway of interest can be crucial for pathway reconstruction methods where the objective is to identify potential members of a pathway. The same reasoning we applied earlier for selecting experiments relevant to specific functional categories can also be applied for selecting experiments relevant to given biological pathways. In this case, the background set is constituted by all the genes belonging to pathways different from the pathway of interest, and the set of relevant experiments are the ones which best discriminate genes in the pathway of interest from the background. The results we present here show that the selected experiments can uncover greater correlation among genes belonging to the biological pathway and that this correlation is a better predictor of the membership of a gene in the pathway of interest.
To begin with, we obtained the “Alpha linolenic acid metabolic pathway” (KEGG ID: ath00592) from the KEGG Pathway Database
[12]. Alpha linolenic acid is a precursor of a class of fatty acid derived regulators called the Jasmonates. The biosynthetic derivative of alpha linolenic acid Jasmonic acid (JA) is known to be an important mediator of defence response and other stress related signalling in plants
[13],
[14]. The KEGG annotation of the pathway in
A.thaliana consists of 30 genes of which 26 were found in our microarray collection. To demonstrate that the correlation obtained from the selected set of experiments is a better predictor of pathway membership, we framed this as a classification problem similar to the one presented in Section (c). We applied the same GBA-based classifier presented in section (c) to classify genes as either members of the pathway of interest or of the background. As before, the performance of the classifier was evaluated by 10-fold cross-validation and average ROC curves were calculated over the ten folds. We compared the performance of the classifier when using correlations from the selected experiments and all experiments in the collection. The average ROC curves () and the average (1-AUC) bar plots () clearly show that the classifier using correlations from the selected set outperforms the classifier using correlations from all experiments in the collection. This result clearly highlights the potential of the experiment selection algorithm in pathway modelling and reconstruction approaches.