Data from microarrays and other high-throughput molecular profiling platforms are clearly revolutionizing biological and biomedical research. However, interpretation of the data remains a challenge to the field and a bottleneck that limits the formulation and exploration of new hypotheses. In particular, it has been difficult to link gene expression profiles to functional phenotypic signatures such as those of disease or response to therapy. A number of partial bioinformatic solutions have been proposed. The most mature and promising of those algorithms analyze the data from the perspective of categories of related genes, such as those defined by the Gene Ontology (GO) or by the Kyoto Encyclopedia of Genes and Genomes [1]. Gene categories group genes into nonexclusive sets by linking genes of common function, pathway, or physical location within the cell.
Gene categories introduce an independent representation of the underlying biology into the analysis of complex datasets and therefore serve to guide the algorithms toward conclusions congruent with conventional knowledge of biological systems. Algorithms that take such an approach have often demonstrated a higher level of functional interpretation than did earlier, single-gene statistical analyses. However, most gene category based methods still perform the analysis on a gene-by-gene, univariate basis, failing to capture complex nonlinear relationships that may exist among the category's genes. If, for example, upregulation of gene A influenced a drug sensitivity signature only if gene B in the category were downregulated and gene C upregulated, then that relationship would be missed. Here, we introduce a novel gene category based approach, the Learner of Functional Enrichment (LeFE) algorithm, to the interpretation of microarray (and similar) data. LeFE captures that type of complex, systems-oriented information for prediction of functional signatures.
The input to LeFE consists of the following components: signature vector, microarray (or analogous) data, and a predefined set of categories and the genes within them. The 'signature vector' describes the biological behavior, process, or state to be predicted for each experimental sample. The signature vector either classifies samples (for example, as normal or diseased) or assigns each sample a continuous value (for example, relative drug sensitivity). That is, the signature can be nominal or continuous. A discrete signature vector is handled as though it were continuous.
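For concreteness, the three inputs can be sketched as follows; this is a minimal Python illustration, and the variable names and toy values are ours rather than part of LeFE's specification.

    import numpy as np

    # Expression matrix: rows = genes (variables), columns = experimental samples.
    expression = np.array([
        [2.1, 0.3, 1.8, 0.2],   # gene A
        [0.5, 1.9, 0.4, 2.2],   # gene B
        [1.0, 1.1, 0.9, 1.2],   # gene C
    ])
    gene_names = ["A", "B", "C"]

    # Signature vector: one entry per sample, either nominal classes ...
    signature_classes = np.array(["normal", "diseased", "normal", "diseased"])
    # ... or continuous values (for example, relative drug sensitivity).
    signature_continuous = np.array([0.12, 0.85, 0.07, 0.91])

    # Predefined gene categories: nonexclusive sets of biologically related genes.
    categories = {
        "pathway_X": ["A", "B"],
        "complex_Y": ["B", "C"],
    }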
The goal of LeFE or any other gene category based algorithm is to determine which categories (for instance, molecular subsystems) are most strongly associated with the biological states described by the signature vector. Toward that end, most previously published methods, for example Gene Set Enrichment Analysis (GSEA) [2], assign each gene category a score based on nonparametric statistics, t-statistics, or correlations that reflect the relationships between individual genes and the signature vector. The gene categories most enriched with those strong single-gene associations are said to be related to the signature. The degree of enrichment is usually represented by a P value or false discovery rate using, for example, a Fisher's exact test [3], a weighted Kolmogorov-Smirnov test [2], or comparison with a χ², binomial [6], or hypergeometric [7] distribution. Although those approaches have proved useful, they neglect the fact that gene products generally function in complicated pathways or complexes whose expression patterns may not be reflected in the summation of univariate associations between single genes and the biological activity [8].
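For comparison with the approach taken by LeFE, the sketch below shows how a conventional overrepresentation P value might be computed for a single category; it illustrates the generic Fisher's exact/hypergeometric idea rather than the exact procedure of any of the cited methods, and the counts are hypothetical.

    from scipy.stats import fisher_exact

    # Hypothetical counts for one gene category versus the rest of the genome.
    genes_in_category = 40      # category size
    category_hits = 12          # category genes individually associated with the signature
    genome_size = 10000         # genes measured on the array
    genome_hits = 500           # all genes individually associated with the signature

    # 2x2 contingency table: (associated, not associated) x (in category, not in category).
    table = [
        [category_hits, genes_in_category - category_hits],
        [genome_hits - category_hits,
         genome_size - genes_in_category - (genome_hits - category_hits)],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"enrichment P value for the category: {p_value:.3g}")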
To address that shortcoming, LeFE uses a machine learning algorithm to model the genome's complex regulatory mechanisms, determining for each category whether its genes are more important as predictors (variables) than are a set of randomly sampled negative control genes. Although any of several different machine learning algorithms could be used in LeFE, we chose the Random Forest algorithm [12] because it has features (discussed below) that make it particularly apt for this application. The power of Random Forest has been successfully demonstrated in numerous bioinformatic and chemoinformatic applications [13]. As per the 'no free lunch' dictum [17], no single machine learning algorithm can be optimal for all datasets and applications, but Random Forest appears to be an appropriate choice as an engine for LeFE.
The Random Forest algorithm builds an ensemble of decision trees using the Classification and Regression Tree (CART) method [18]. Random Forest is therefore included among the general class of 'ensemble learning' algorithms. The algorithm injects diversity into the tree creation process by building each tree on an independently bootstrapped (resampled with replacement) subset of the samples. Further diversity among the trees is generated by basing each tree-split decision in each tree on a different, randomly chosen subset of the variables. After the entire forest of slightly different decision trees has been built, it can be applied to new, unseen data by running each new sample down each tree. Just as in CART, each tree's ultimate decision is determined by a vote on sample class (for classification) or by the median value of the training samples reaching the terminal node (for regression on continuous variables). The aggregate forest's output is then determined by averaging the trees' regression values or by a weighted voting process that identifies the most common class decision reached by the trees. The power of random forests derives from the low bias and low variance they achieve by aggregating an 'ensemble' of low-bias, high-variance decision trees.
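The two sources of diversity described above, bootstrapped samples and random variable subsets at each split, can be sketched as follows; the simulated data and scikit-learn trees are used purely for illustration and are not the implementation used in LeFE.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    n_samples, n_vars = 60, 30
    X = rng.normal(size=(n_samples, n_vars))          # samples x variables (genes)
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n_samples)

    n_tree = 400
    trees = []
    for _ in range(n_tree):
        # Diversity source 1: build each tree on a bootstrap (resampling with replacement).
        boot = rng.integers(0, n_samples, n_samples)
        # Diversity source 2: consider only a random subset of the variables at each split.
        tree = DecisionTreeRegressor(max_features=n_vars // 3,
                                     random_state=int(rng.integers(10**9)))
        trees.append(tree.fit(X[boot], y[boot]))

    # For regression, the aggregate output is the average of the individual trees' predictions.
    forest_prediction = np.mean([t.predict(X) for t in trees], axis=0)
    print(forest_prediction[:5])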
At the simplest level, the Random Forest algorithm has only two tunable parameters: mTry, the number of variables tried in each tree-split decision, and nTree, the number of trees grown. Typically in Random Forests, nTree is set to 500, but we used nTree = 400 because that choice showed no appreciable decline in the algorithm's accuracy and achieved a modest increase in efficiency. The values of mTry suggested by the literature [14] are nv/3 for regression on a signature vector with continuous values and √nv if the signature data contain class information, where nv is the number of predictor variables (genes) in the input matrix. We used those values, so there were no parameters that we tuned. The algorithm is therefore simple to deploy, and over-parameterization is relatively rare. The Random Forest algorithm also has two other properties that make it especially apt for use within LeFE. The first is that it includes an internal cross-validation procedure that estimates the forest's predictive performance without the need for explicit a priori separation of the testing and training samples. That feature is particularly important in this application because microarray experiments are often run on limited numbers of samples. Because each tree is constructed on a bootstrapped sample representing 1 - e⁻¹, or approximately two-thirds, of the samples, about one-third of the samples are not used to build any given tree. Those unused 'out-of-bag' (OOB) samples are unseen in training and therefore can be used to determine the predictive performance of the tree. After the forest is built, each sample serves as a test case for the approximately one-third of the trees for which it was OOB. That procedure provides an estimate of the forest's error in the prediction for each individual sample. The OOB error of each sample is averaged over all samples to estimate the total error of the model. Fivefold cross-validation and the internal performance assessment using OOB samples have been shown to yield quite similar results [14].
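The sketch below illustrates both the 1 - e⁻¹ in-bag fraction and the use of OOB samples to estimate performance without an explicit train/test split; scikit-learn's oob_score_ serves here as a stand-in for the OOB estimate described above, and the data are simulated.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)

    # Fraction of distinct samples in one bootstrap of size n is about 1 - e^-1 (~0.632).
    n = 1000
    bootstrap = rng.integers(0, n, n)
    print("in-bag fraction:", len(np.unique(bootstrap)) / n)   # roughly 0.632

    # OOB-based performance estimate without an explicit separation of test and training sets.
    X = rng.normal(size=(80, 200))                             # 80 samples, 200 genes
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=80)
    forest = RandomForestRegressor(n_estimators=400, max_features=200 // 3,
                                   oob_score=True, random_state=0).fit(X, y)
    print("OOB R^2 estimate:", forest.oob_score_)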
The second useful property of random forests is that they can determine the importance placed on each variable in the model. Each variable's importance is assessed by randomizing the variable's association with the samples (permuting the elements of the variable's row) and then reassessing the model's error by OOB cross-validation. The Random Forest software package, which we used for the computations, performs one such randomization by default, and its documentation states that additional randomizations do not appreciably improve the stability of the calculated importance scores. The resulting loss of model accuracy is normalized by the performance of the unpermuted, intact model to give an 'importance score' for each gene in a category. When Random Forest is applied to a classification problem, the model's error is a weighted classification accuracy; in the regression context, it is the mean squared error. The greater the decrease in normalized performance, the more instrumental the variable (gene) was in achieving the forest's predictive performance. See Materials and methods (below) for a detailed description of the importance score.
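The permutation idea can be sketched as follows; for simplicity the example measures the error increase on a held-out set rather than on the OOB samples, so it illustrates the principle rather than reproducing the exact importance score of the Random Forest package.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    X = rng.normal(size=(120, 50))
    y = 2.0 * X[:, 0] + X[:, 3] * X[:, 7] + rng.normal(scale=0.3, size=120)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    forest = RandomForestRegressor(n_estimators=400, random_state=0).fit(X_tr, y_tr)
    baseline = mean_squared_error(y_te, forest.predict(X_te))

    importance = []
    for j in range(X.shape[1]):
        X_perm = X_te.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])        # break variable j's association
        loss = mean_squared_error(y_te, forest.predict(X_perm))
        importance.append((loss - baseline) / baseline)     # normalized increase in error

    print("most important variables:", np.argsort(importance)[::-1][:5])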
The steps in the LeFE algorithm (shown schematically in Figure 1) are described more formally in the Materials and methods section (below). Here, we summarize the basic elements conceptually. For each category, LeFE builds a random forest to model the signature vector on the basis of a composite matrix consisting of the genes in the category and a proportionately sized set of randomly selected negative control genes that are not in the category. On that basis, the random forest determines the importance score of each gene (variable) in the multivariate model. The distribution of importance scores of the genes in the category is then compared with the distribution of importance scores of the negative control genes. The expectation is that the two distributions will be similar when that comparison is made for a category that is biologically unrelated to the signature vector. However, if the category includes biologically relevant genes or gene combinations, then Random Forest is expected to assign higher importance scores to at least some of the genes. A one-sided permutation t-test is used heuristically to compare the distribution of importance scores of the genes in the category with those of the negative control genes. Because the test compares the calculated t-scores with the distribution of such t-scores obtained after permuting the sample labels (instead of comparing them with a parametric t-distribution), it is nonparametric. To ensure diversity in the sampling of negative control genes, that process is repeated nr times, each time with the same gene category and a different set of randomly selected negative control genes. As nr becomes large, the random gene sets asymptotically reflect the overall covariance of the dataset. The median of the permutation t-test P values from the nr iterations is taken as an index of the degree of association between the gene category and the signature vector. After LeFE has been applied to each gene category, the categories are ranked according to those median P values.
The LeFE algorithm illustrated schematically for a category of two genes. See Materials and methods for further details and Table 4 for a description of the steps (keyed to the circled letters). LeFE, Learner of Functional Enrichment.
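A condensed sketch of that per-category procedure is shown below; the helper functions, the use of scikit-learn's default importance scores, and the simplified permutation test are our own stand-ins for illustration, not the published implementation.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def permutation_t_pvalue(cat_scores, ctrl_scores, n_perm=1000, rng=None):
        # One-sided permutation test: are category importance scores higher than control scores?
        rng = np.random.default_rng() if rng is None else rng
        pooled = np.concatenate([cat_scores, ctrl_scores])
        n_cat = len(cat_scores)
        def t_stat(a, b):
            return (a.mean() - b.mean()) / np.sqrt(
                a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b) + 1e-12)
        observed = t_stat(cat_scores, ctrl_scores)
        perm = []
        for _ in range(n_perm):
            shuffled = rng.permutation(pooled)               # permute group membership
            perm.append(t_stat(shuffled[:n_cat], shuffled[n_cat:]))
        return (np.sum(np.array(perm) >= observed) + 1) / (n_perm + 1)

    def lefe_category_pvalue(expression, gene_names, category_genes, signature,
                             n_repeat=5, rng=None):
        # Median permutation P value over n_repeat draws of negative control genes.
        # Continuous signature assumed; a Random Forest classifier would be used for classes.
        rng = np.random.default_rng(0) if rng is None else rng
        index = {g: i for i, g in enumerate(gene_names)}
        cat_idx = np.array([index[g] for g in category_genes])
        cat_set = set(cat_idx.tolist())
        others = np.array([i for i in range(len(gene_names)) if i not in cat_set])
        pvals = []
        for _ in range(n_repeat):
            # Composite matrix: category genes plus an equally sized random negative control set.
            ctrl_idx = rng.choice(others, size=len(cat_idx), replace=False)
            composite = np.vstack([expression[cat_idx], expression[ctrl_idx]]).T  # samples x genes
            forest = RandomForestRegressor(n_estimators=400, random_state=0).fit(composite, signature)
            scores = forest.feature_importances_    # stand-in for the permutation-based importance
            pvals.append(permutation_t_pvalue(scores[:len(cat_idx)], scores[len(cat_idx):], rng=rng))
        return np.median(pvals)

Categories would then be ranked by the median P value returned for each.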
LeFE is different from the other category-based algorithms listed previously [2] in that it assesses gene importance within the context of a multivariate model. That enables LeFE to access the gene information contained in complex biological interrelationships, rather than relying on the summation of univariate relationships within a category. For example, if two genes in a category were related to the samples' biological process or state by an 'exclusive OR' association, then LeFE could capture that relationship, whereas category-based summations of univariate associations would be likely to overlook it.
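The 'exclusive OR' point can be made concrete with a small simulation on entirely synthetic data: each gene's marginal association with the signature is negligible, yet a multivariate Random Forest can exploit the two interacting genes jointly.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(3)
    n_samples, n_genes = 500, 20
    X = rng.normal(size=(n_samples, n_genes))

    # The class signature depends on genes 0 and 1 only through an XOR-like interaction:
    # a sample is 'positive' when exactly one of the two genes is upregulated.
    signature = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

    # Univariate screen: the marginal correlation of each gene with the signature is near zero,
    # including for genes 0 and 1, so a summation of single-gene associations finds nothing.
    marginal = [abs(np.corrcoef(X[:, j], signature)[0, 1]) for j in range(n_genes)]
    print("marginal |r| of genes 0 and 1:", round(marginal[0], 2), round(marginal[1], 2))

    # A multivariate Random Forest model can exploit the two genes jointly.
    forest = RandomForestClassifier(n_estimators=400, random_state=0).fit(X, signature)
    print("genes ranked by forest importance (highest first):",
          np.argsort(forest.feature_importances_)[::-1][:5])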