Microarray data and the Gene Ontology
In this study, we used two microarray data sets: one from the Botany Array Resource at the University of Toronto [22
], and the other from the AtGenExpress Consortium [23
], archived at NASCArrays [24
]. These data sets include over 1000 expression-level experiments for Arabidopsis
, and using all of them would give a data set with dimensionality over 1000. Since the performance of statistical and machine-learning methods tends to decrease with dimensionality, we chose only those experiments that are specifically stress-related. Even so, the covariance matrix of the resulting data set is singular, which is a problem for many of the machine-learning methods. The singularity is probably due to dependencies between the expression levels under control conditions, since removing the controls from the data sets solved the problem. To compensate, we tried applying the learning algorithms to expression-level ratios (i.e
., ratios of experimental to control conditions). However, we found that the results were better when ratios were not used (data not shown). This is probably because the classifiers look for genes that respond similarly to the known stress-associated genes, so it is not so important to include the controls. In addition, since many of the features are time-courses, there is still a "time zero" control included for the values, providing a baseline measurement. The results reported in this article are therefore based on absolute expression levels without controls.
From the Toronto data set, we selected 54 features corresponding to experiments conducted primarily to study plant environmental and stress physiology, plant physiology, plant-microbe and plant-insect interactions. From the AtGenExpress data set, we selected 236 features, including various abiotic stresses (e.g., osmotic stress, heat stress, cold stress, salt stress, drought stress, UV-B stress, wounding stress, water-deprivation stress and oxidative stress). We combined the selected features into a single data set. The resulting data set consists of gene expression levels for 22,746 genes under 54 + 236 = 290 different experimental conditions.
We used terms from the Gene Ontology for Biological Processes (GOBP) to represent gene function. For example, the GOBP term GO:0006950 [response to stress]
refers to genes that respond to stress. In general, the Gene Ontology (GO) provides a dynamic controlled vocabulary for describing genes and gene products in any organism [26
]. "Biological Process" is one of three broad GO categories (the other two being "Molecular Function" and "Cellular Component"). GOBP terms are organized into a directed acyclic graph (DAG) to reflect the hierarchical relationships between the terms. Parent GOBP terms are subdivided into increasingly specific child GOBP terms.
Since our study focussed on stress, we were concerned with gene functions at or below the term GO:0006950 [response to stress] in the GOBP hierarchy. This GOBP term has 19 child terms, such as GO:0009409 [response to cold], GO:0009408 [response to heat], and GO:0009414 [response to water deprivation]. Since gene function becomes more and more specific as we move down the GOBP hierarchy, fewer and fewer genes have any given annotation. The result is that for specific types of stress, our data set contains many negatives and few positives. In the best case, for the term GO:0009613 [response to pest, pathogen or parasite], over 97% of the training data consists of negatives. The typical case is even worse. In fact, looking at all 19 types of stress, 5 types have no positives at all, and of the remaining 14 types, the median number of negatives is 99.2% of the training data. This highly unbalanced data made accurate prediction of gene function difficult. For this reason, we narrowed our study to the top stress term, GO:0006950 [response to stress]. To get positive training samples for this term, we propagated all genes in its offspring upward to it in the hierarchy. After up-propagation, the top stress term has 1,031 genes, or almost 9% of the total genes in the training data. The training data therefore contains 9% positives and 91% negatives.
Using GOBP terms to annotate all genes in A. thaliana
is an ongoing project started in 2002 by TAIR [27
]. The gene annotations (updated weekly) can be downloaded from TAIR [27
]. The predictions reported in this paper are based on the version for March 10, 2007. Using these annotations, we categorized the genes into annotated
genes and unannotated
genes. The annotated genes are those which have at least one GOBP annotation; the unannotated genes are those which have no GOBP annotations. In addition, a gene was treated as unannotated if its only annotation is the top GOBP category, GO:0008150 [biological process]
, since the function of such a gene is unknown. The result was 11,553 annotated genes and 11,193 unannotated genes in our data set.
The annotated genes formed the training data, in which a gene was called positive if it is annotated as a stress gene, and negative otherwise. The unannotated genes formed the prediction data. It should be noted that this approach probably introduces some false negatives into the training data, because genes not known to have a particular function are considered to be negative, even though future experiments could reveal them to have that function. That is to say, "unknown" is treated as "negative". However, the number of such false negatives should be small, since only a small number of genes participate in any given biological process. That is, most negatives are true negatives.
Predicting gene function using basic learning methods
Using a variety of basic learning methods, we trained a number of classifiers to distinguish between genes that do and do not respond to stress, based on their patterns of gene expression in the training data. We then applied each classifier to the prediction data to estimate the function of the unannotated genes. In addition, we used cross validation to evaluate the performance of each classifier and to estimate the precision of each prediction.
We used five supervised learning methods: Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes (NB) and K-Nearest Neighbors (KNN) [21
] (see Methods). These methods were chosen because they are representative of the most basic supervised learning methods, the goal being to explore simple methods first. These methods are widely understood, take little computation time, and the results provide a benchmark against which more sophisticated methods can be compared. Moreover, as we show below, the results provided by these methods are good enough to enable biologists to conduct targeted laboratory experiments.
Each of the five methods is discriminative. That is, the classifiers learned by the methods assign a real number (called a discriminant value) to each gene, reflecting the classifier's certainty that the gene responds to stress. Formally, a discriminative classifier is a function,
, from genes to discriminant values. In our case, each gene is represented as a 290-dimensional vector, x
, whose components are the expression levels of the gene under the 290 experimental conditions. Thus, if x
is a vector representing a gene, then dv
) is the discriminant value assigned to the gene by the classifier. Finally, a decision threshold, τ
, is chosen, and the gene is predicted to respond to stress if and only if dv
Unsupervised, semi-supervised and transductive learning
In addition to these supervised learning methods, we preprocessed the gene expression data using Principal Components Analysis (PCA), a form of unsupervised learning, to reduce the dimensionality of the data (see Methods). For this purpose, we combined the expression-level measurements for all genes (both annotated and unannotated) into one large data set, and applied PCA to the entire set. We are therefore doing a form of semi-supervised learning [29
], in which unsupervised learning uses the entire data set (ignoring annotations), and then supervised learning uses the annotated data. This increases the effectiveness of learning by increasing the amount of training data used in the unsupervised phase [29
]. In our case, the unannotated data is also the prediction data, which means that information about the prediction data is used during (unsupervised) training. This is possible because we know all the prediction data in advance. That is, we know the expression levels for all the genes in Arabidopsis
whether they are annotated or not. We are therefore doing a form of transductive learning [29
], in which the entire prediction set is known during training and is exploited to predict its annotations. This has the added computational advantage of simplifying the way PCA is done during cross validation (see Methods).
Estimating classifier performance
To evaluate the performance of discriminative classifiers, it is common to use receiver operating characteristic (ROC) curves [32
]. A ROC curve plots the true positive rate (TP) of a classifier against the false positive rate (FP) for various decision thresholds. It therefore shows the quality of a classifier not at one threshold, but at many, and provides more information than a simple miss-classification rate (as in [33
] for example). In practice, however, biologists are not usually interested in having more than a few dozen false positives, especially in unbalanced data such as ours, in which the number of false positives can rapidly overwhelm the number of true positives. We therefore use so-called ROC50
], a variant of ROC curves in which the horizonal axis only goes up to 50 false positives. The area under a ROC50
curve is the ROC50
], and is a measure of the overall usefulness of a classifier.
To estimate ROC50 curves for our classifiers, we used 20-fold cross-validation (see Methods). Because cross-validation relies on a random split of the training data into folds (20 folds in our case), there is a certain randomness to the estimated ROC50 curve. To provide more accurate results, we performed cross-validation ten times, each time with a different (randomly selected) 20-fold split of the data (see Methods). Each 20-fold split results in a slightly different ROC50 curve. In some cases, we plot all ten of these curves, to give an idea of the uncertainty in classifier performance (Figure ). In cases where this would result in overly cluttered graphs, we simply present the average of the ten ROC50 curves (Figures to , each of which show several average ROC50 curves).
ROC50 curves. Estimated ROC50 curves of the combined classifier (WV), showing ten different estimates (dashed curves) and their average (solid curve).
Logistic Regression (LR). Seven ROC50 curves for Logistic Regression with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC50 score.
Comparison of methods. The ROC50 curve (purple) for the combined classifier using weighted voting (WV), and the best ROC50 curves from each of Figures 2 to 6. In the legend, p is the PCA-reduced dimension of the data, and s is the ROC50 score.
Linear Discriminant Analysis (LDA). Seven ROC50 curves for Linear Discriminant Analysis with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC50 score.
Quadratic Discriminant Analysis (QDA). Seven ROC50 curves for Quadratic Discriminant Analysis with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC50 score.
Naive Bayes (NB). Seven ROC50 curves for Naive Bayes with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC50 score.
We generated ROC50 curves for each supervised learning method combined with various amounts of dimensionality reduction. Using PCA, we reduced the original 290 dimensions to 5, 10, 15, 20, 40 and 100 dimensions, respectively. In this way, the original data set was transformed into six separate data sets of various dimensions. Each basic learning method (except KNN) was applied to the original data set and to each of the six reduced data sets. Thus, for each basic learning method (except KNN), we trained and tested seven different classifiers. In the case of KNN, we used only the original, unreduced data, but with five different values of K. Altogether, we trained and tested a total of 4 × 7 + 5 = 33 different classifiers. Figures to show the estimated performance of these basic classifiers. Each figure shows a number of ROC50 curves, each derived using cross-validation averaged over a number of random splits of the data, as described above. Unlike traditional ROC curves, the axes of these curves give the number of true and false positives, instead of the proportion. The red dash-dot line near the bottom of each figure shows the expected performance of a random classifier (i.e., a classifier that ignores the expression data and guesses whether or not a gene responds to stress by essentially flipping a coin). The ROC50 scores for the curves are shown in the legend of each figure.
K-Nearest Neighbours (KNN). Five ROC50 curves for K-Nearest Neighbours for various values of K. The legend gives the ROC50 score, s, for each value of K.
As the figures show, in some cases the classifiers perform not much better than random, but in most cases they perform significantly better. The figures also show that the performance of each classification method depends heavily of the amount of dimensionality reduction used. Notice in particular that in some cases, the classifier trained on the reduced data has a much higher ROC50 score than the classifier trained on the original, unreduced data. This is especially true for NB and QDA. For instance, the classifiers trained on the original data have low ROC50 scores of 182.3 for NB and 115.2 for QDA. This is comparable to the random classifier, whose ROC50 score is 122.5. However, reducing the dimensionality of the data to 15 increases their ROC50 scores to 1373.1 and 1651.0, respectively. This shows the importance of dimensionality reduction. In contrast, KNN performs well for all the values of K that we used.
Figure compares the basic classification methods by plotting the best performance of each. That is, for each of the basic classification methods, the ROC50 curve with the highest ROC50 score is reproduced in Figure . In addition, the figure shows the performance of a classification method that uses a weighted voting scheme (WV) to combine the 33 basic classifiers into a single, composite classifier. Notice that this composite classifier performs best of all. The next section describes how this composite classifier is constructed.
Improving prediction accuracy by combining classifiers
Combining different classifiers in prediction can be thought of as combining different opinions in decision making. The advantage is that a group opinion is better than a single opinion if the single opinions are correctly weighted and combined. In machine-learning systems, classifiers are often combined by weighted voting, in which the discriminant value of the combined classifier is a linear combination of the discriminant values of the individual classifiers. Formally, given a set of basic classifiers,
, and a set of weights, w1
, …, wM
, the combined classifier,
, is defined by the equation
. In our case, M
= 33, as described above.
By judiciously choosing the weights, w1
, …, wM
, the performance of the combined classifier can be maximized. Various methods are available for doing this, such as model averaging and stacking [21
]. Using these methods on our data sets, we found that the ROC curve of the combined classifier was usually better than the ROC curves of the basic classifiers, as expected. Unfortunately, we also found that the ROC50
curve of the combined classifier was usually worse (data not shown). We hypothesized that this is because our data sets are highly unbalanced. Intuitively, model averaging and stacking try to choose weights so as to correctly classify as much data as possible. In our case, this means trying to correctly classify the vast number of negative samples in our data sets, even if this means misclassifying the small number of positives. In other words, these methods try to minimize the total number of false positives, even though we only care about the first fifty.
To choose appropriate weights for our combined classifier, we used the heuristic that classifiers that perform well should be given more weight than classifiers that perform poorly. In our case, since we want to maximize the ROC50
score of the combined classifier, we want to give high weight to classifiers with high ROC50
scores. There are many ways to do this, but we found that it was sufficient to estimate and normalize the ROC50
score of each basic classifier, and use this as its weight. That is, we used
is an estimate of the ROC50
score of classifier fm
. Note that with these weights, if each
) is a number between 0 and 1 (as with our classifiers), then so is
). Also, this method automatically gives low weight to classifiers that use an inappropriate amount of dimensionality reduction, since such classifiers have low ROC50
scores. In this way, the combined classifier incorporates not only the best combination of supervised learning methods, but also the best amounts of dimensionality reduction for each method.
To train and evaluate the combined classifier, we used two sets of validation data. After the basic classifiers were trained, one validation set was used to estimate their ROC50 scores. The combined classifier was then constructed using these scores, and the second validation set was used to estimate its ROC50 curve. Thus, the validation data for the basic classifiers is part of the training data for the combined classifier. To do this in a cross-validation setting, we used what amounts to nested cross-validation (see Methods). As shown in Figure , the resulting combined classifier has a higher ROC50 score than any of the basic classifiers from which it is made.
Figure gives another view of the performance of the combined classifier. Here, the thin dashed lines are a superposition of ten different curves, where each one is a different estimate of the combined classifier's true ROC50 curve. As described earlier, each estimate of a classifier's ROC50 curve includes some randomness, due to the random choice of folds during cross-validation. The ten dashed curves in Figure are derived from ten different cross-validations, each one using a different set of folds. The thick solid line in the figure is the average of the other ten curves. Because averaging reduces variance, the average curve is a more accurate estimate of the true ROC50 curve (i.e., has lower variance) than any of the other ten curves. The diagonal dash-dot line near the bottom of the plot shows the expected performance of a random classifier.
ROC and ROC50
curves plot the number of true positives against the number of false positives. However, in applications such as ours, the precision
is also of interest. Precision is the proportion of true positives (TP) among the predicted positives (PP). (It is also the complementary false discovery rate, 1-FDR [35
].) Precision is important since each prediction is a potential experiment, and as a matter of economics, a biologist needs an estimate of how many of the experiments will succeed. This is especially important in situations, such as ours, where the number of real negatives is much greater than the number of real positives, and so there is a real possibility of having a huge number of failed experiments.
Figure plots estimated precision against the number of predictions for the first hundred predictions. Notice that as the number of predictions increases (i.e., as the classifier's decision threshold is lowered), the precision decreases, meaning that fewer of the predictions are expected to be true. As in Figure , the thin dashed lines are a superposition of ten different curves, each one an estimate of the true precision curve, and the thick solid line is their average. Also, the horizontal dash-dot line near the bottom of the plot is the expected precision of a random classifier, and its height is equal to the ratio of the number of positives (i.e., stress genes) to the total number of samples (i.e., genes) in the training data. Since all the estimated precision curves are well above the horizontal dash-dot line, the performance of the combined classifier for the first hundred predictions is significantly better than random. Also, since Figures and show small variance, and since the variance of the average curves will be even less, the combined classifier should have stable prediction performance.
Precision curves. Estimated precision curves of the combined classifier (WV), showing ten different estimates (dashed curves) and their average (solid curve).
We trained the combined classifier on our Arabidopsis data set, using all 22,746 genes for Principal Components Analysis, and the 11,553 annotated genes for supervised learning, as described above. We then applied the classifier to the 11,193 unannotated genes, to get a set of 11,193 predictions (see Methods). Table shows the top fifty predictions. Each row in the table is a prediction: the first (leftmost) entry is the rank of the prediction (1 being the top prediction); the second entry identifies a gene; the third entry is a discriminant value (measuring the likelihood that the gene responds to stress); and the fourth entry is the estimated precision of the prediction and all predictions above it (i.e., the fraction of these predictions expected to be true). As an example, consider the 23rd row of the table, the row for gene At1g09950. Since the estimated precision in this row is given as 0.7044, we expect that about 70% of the top 23 genes respond to stress, i.e., 16 genes.
The top 50 predictions of the combined classifier ordered by discriminant value
Figures and provide visual evidence supporting these predictions. Each figure shows a heat map. These maps, known as "electronic Northerns" (or e-Northerns), were generated using the Expression Browser tool of the Botany Array Resource (BAR) and the AtGenExpress Stress Series (shoot) data set[23
]. The program contains expression data for more than 22,000 genes across more than 1000 samples collected from NASCArrays, AtGenExpress Consortium, and the Department of Botany at the University of Toronto [22
]. Each row in an e-Northern is a gene, and each column is an experiment. The colour at a point represents the relative expression level of the gene during the experiment. More specifically, the colour represents the log2
of the ratio of the average of replicate treatments relative to the average of corresponding controls. Yellow means that under the experimental conditions, the gene had the same expression level as the control. (The wide, yellow vertical stripes are the controls.) Red means that the gene had a higher expression level than the control (up-regulation), and blue means it had a lower expression level (down-regulation). A gene that shows significant up-regulation (or down-regulation) under stress conditions is likely to be involved in response to stress. Thus, unlike cross validation, electronic Northerns provide a means of evaluating the quality of predictions based on the prediction data, not just the training data. The e-Northerns of Figures and , for instance, are based entirely on prediction data. In these e-Northerns, the experiments exposed the plant to various stress conditions, such as heat, cold, drought, UV-B radiation, etc. Figure is the e-Northern for the top-50 predictions of our combined classifier, i.e
., for the 50 genes predicted to most likely to respond to stress. For comparison, Figure is the e-Northern for 50 genes chosen at random from the prediction set. Note that there is much more colour in Figure than in Figure , especially red. This suggests that our combined classifier has indeed extracted meaningful gene expression patterns for genes that respond to stress.
Electronic Northern analysis. E-Northern of the top 50 predictions.
Electronic Northern analysis. E-Northern of 50 randomly selected genes.
Gene knockout experiments
From the predictions of the combined classifier, three genes were chosen for biological analysis using gene knockout experiments. Here, we present the results for one of these genes, At1g16850, which show it to be necessary for the normal response to temperature and NaCl. Our results also confirm that the other two genes, At1g11210 and At4g39675, are involved in a variety of stress responses (data not shown).
The criteria used to choose candidate genes for subsequent biological analysis were: 1) the gene must be expressed in either root or shoot, 2) gene expression should be strongly increased in response to abiotic stress, such as cold, drought, osmotic and salt stresses, 3) T-DNA knockout lines – in which a given gene's expression has been eliminated – should available from the Salk Institute [37
], and 4) the gene should not have an annotated function nor be present in any patent database. Further bioinformatics analysis was performed using Athena for promoter motif prediction [38
], Expression Angler for co-expressed gene analysis [22
] and eFP browser for electronic representation of gene expression patterns [39
The increased presence of anthocyanin levels in plants lacking a functional copy of the At1g16850 gene during cold stress of 4C indicates that this gene is involved in cold stress response (Figure ). The same effect is seen at 30C, indicating that this gene is also associated with response to heat stress (Figure ). Interestingly, At1g16850 is normally expressed during the later stages of seed maturation, towards seed dessication, and hence may play a role in seed dormancy. This sort of bifunctionality is seen with other stress response genes, which have documented roles in the cold, heat and salt stress pathways, e.g. RD29A (Response to Desiccation) and LEA (Late Embryogensis Abundant) protein [40
]. These proteins have also been found to accumulate during seed maturation [40
] and are in fact co-expressed with At1g16850 under stress conditions and during seed maturation, as determined using the Expression Angler algorithm [22
Figure 11 Gene knockout experiments. 10 day old wild-type and mutant plants after exposure for 7 days at 14. (a) The mutant cotyledons appear darker than wild-type due to increased anthocyanin levels. (b) mutant and wild-type seeds 24 h after sowing on agar plates. (more ...)
In addition to modulating a response to temperature, plants lacking a functional At1g16850 exhibit a defective root growth phenotype under increasing salt concentrations (Figure ). This phenotype, combined with previous microarray studies [42
], which found At1g16850 induction at 250 mM NaCl, gives clear indication that At1g16850 is also part of the salt stress response pathway.
Figure 12 Gene knockout experiments. Root growth on 50 mM NaCl, relative to growth on 0 mM NaCl, on 10 day old wild-type and mutant plants transferred to 50 mM NaCl medium. Error bars indicate the standard error of 5 replicates. n = 25 measurements per treatment (more ...)