Defining multifunctionality and predicting without data
Using GO as our source of functional annotations, we define the multifunctionality of a single gene as
is the number of genes within GO group i, and Num_outi
is the number of genes outside GO group i. If we ignore the weighting by the size of the groups, this score is simply the number of GO terms a gene has. The weighting has the effect of counting membership in a GO group by how much the gene contributes to that GO group. Weighting by Num_in has the effect of giving a gene which, e.g., is one out of five in a GO group a contribution of 1/5. Weighting by Num_out provides a corresponding weighting to genes not within the GO group; that is, being the only gene outside a large GO group subtracts as much from that one gene's score as being the only gene within a GO group would add to another gene's score. The particular form of this definition was not chosen arbitrarily, as we now explain.
We arrived at our definition of multifunctionality by considering that the greater the multifunctionality of a gene, the greater the degree to which it should be a good candidate for having any function (averaged over all functions). Thus, a single ranked list of genes which best captures candidacy across all functions is equivalent to a list of genes ranked by multifunctionality. Intuitively, if one is forced to choose a single ranking, the gene with the most GO annotations could be predicted as being in all GO categories. This is because if one gene is in 100 GO categories (high multifunctionality), and another is in only one (low multifunctionality), by placing the former gene ahead of the latter gene in a fixed ranking, we make a correct prediction more often across all GO categories. Because the ranking of genes is optimized in this way, we expect that when we use it as a “predictor” of GO category membership, we should get values of the AUC of over 0.5 for many GO terms. We were nonetheless surprised that this single ranking of genes gives a mean AUC of 0.90 across all GO terms tested (). By itself, this is clearly not a meaningful result for “gene function prediction” since the ranking is obtained in a “circular” way by optimizing for GO. A main point of this paper is that there are properties of real data that are correlated with this optimal ranking, and this fact can explain much of the apparent learnability of gene function by “guilt-by-association”.
A single ranking of genes can predict GO group membership.
The definition above is the optimal list (see Data S1
, Section 1) when the optimization is for the area under the curve (AUC) of the receiver-operator curve (ROC).
Small differences in AUC can hide larger differences in other measures such as positive predictive value (PPV). Maximizing AUC provides a particularly intuitive form for the multifunctionality scores (number of functions weighted by involvement) as well as simple analytic calculation of the optimal list, but the same principle can be applied to other metrics. The AUC has the added benefit of being widely used and not dependent on choosing a specific threshold. Using a different metric would suggest using a different optimal list. In later sections we show that our conclusions are not sensitive to the choice of performance measure.
It is naturally of interest to know which genes are most multifunctional. As ranked by our definition, the most multifunctional human genes include a mixture of genes that most biologists would recognize immediately as "centrally important" (e.g. tumor necrosis factor (TNF), transforming growth factor beta 1 (TGFB1)) and others that might seem more unexpected (e.g. forkhead box J1 (FOXJ1), ATPase, Cu++
transporting, alpha polypeptide (ATP7A); the full lists are available as Data S1
). We leave as a topic for future work the question of whether these top-ranked genes really are especially multifunctional, or whether they merely appear to be, due to biases in research into gene function or of patterns of annotation. In addition to providing a single ranking of genes by multifunctionality, we also obtain a ranking of GO groups by how well they are predicted by this ranking (Data S1
). Highly-ranked GO groups are the “most multifunctional” – the genes they contain tend to be highly multifunctional.
Thus far we have merely provided a novel definition of multifunctionality and shown how it can yield surprisingly strong performance when used for functional prediction. As mentioned above, while this exposes some important features about the distribution of GO annotations amongst genes and gives insight into which genes are most “multifunctional”, by itself it has no implication for gene function prediction because it uses GO in its construction (it is obviously “overfit”; we are not proposing this ranking is of any utility for gene function prediction). We now show that the ranking of genes by multifunctionality is consistently reflected in real data. The basic implication is that gene function predictions based on such data can be attributed to multifunctionality.
The relationship between multifunctionality and network node degree
We hypothesized that multifunctionality plays a role in the prediction of gene function from genomics data. In particular, we wished to examine whether multifunctionality is reflected in other properties of genes which are used in data interpretation. This is potentially important because it is usually assumed that when genes are assigned a function, it is due to either a valid prediction or a false positive due to “promiscuity” or other issues with the data. We suspected that in fact multifunctionality can explain a substantial amount of the way functions are assigned. How could this happen? As we showed in , ranking genes by multifunctionality would be a good way to get good performance from a gene function prediction algorithm. Thus, if the data used for prediction is in some way a proxy for (or correlated with) multifunctionality, and the algorithm used for classification can exploit this, very good prediction performance can result. Put another way, algorithms which assign new functions to genes which are already highly multifunctional will, on average, be rewarded by appearing to yield good performance. This is true no matter what measure of performance is used, though as mentioned above the optimal ranking will be somewhat different for different performance measures.
Because gene function prediction uses the structure of gene networks (or association matrices), it is natural to consider how those networks are generated and how this relates to multifunctionality. A key statistic about networks is the node degree of each gene – how many connections each gene has. Because we define each function at the molecular level by the set of genes interacting in a particular context, it might be expected that a greater number of interaction partners overall (higher node degree) would reflect membership in a greater number of sets of genes (higher multifunctionality); this is the hub view of high node degree. Alternatively, functional studies might not translate well to interaction studies, or a higher node degree could represent only a less specifically defined functional role for the gene for no greater number of functions; this is the promiscuous protein view of high node degree.
In the following we consider a model to assess the specificity of functional information in protein interaction networks (and find that node degree ranks genes by their probability of random interaction). Previous workers have noted that in protein interaction networks, the probability of two genes being associated in a network is strongly correlated with the product of their individual node degrees (the number of connections they have) 
. It has further been shown that protein interaction networks can be surprisingly accurately reconstructed by a model in which each gene i
has an inherent and fixed probability pi
of being an interactor, without any specificity in the interactions; in our model the probability of an interaction is determined only by the product of these probabilities.
An “interaction” matrix generated from this model takes the form pipj
for each pair of genes i
– the independent product of probabilities representing the probability of both proteins appearing in the test and thereby their being labeled as associated. If we treat a real interaction matrix as if it were generated from the model, we can reconstruct the ranking of the original values for the p's
by noting that the sums across the columns of the matrix are
Because the sum forming the multiplicand is a constant, this product is proportional to the value of pi, so the result is the desired ranking. With a real interaction matrix, we compute A simply as the (weighted) sum of each row of the matrix. If the original network is unweighted, A is essentially equivalent to the node degree. Under this null model, node degree is more than just an interesting statistic about a gene; it explains the structure of networks.
Our interest in node degree is spurred by the idea that node degree might be strongly related to multifunctionality. As mentioned in the introduction, this seems intuitive and has some support in the literature. Here we show that there is an unequivocal relationship between node degree and multifunctionality, in a wide range of networks.
We first computed node degree for a human gene coexpression network and compared it to the ranking provided by the degree of annotations (number of GO terms) (). The rank correlation is modest (0.28) but statistically significant (p<0.001). Importantly, the effect is distributed across the entire range of node degrees – it is not just a property of genes with higher node degrees. This general relationship held for a variety of other networks we examined (), to varying degrees. Thus there is a relationship between multifunctionality and node degree, but by this measure the effect appears modest.
Gene function prediction performance.
As hinted above, the results shown in suggest that simply ranking genes by how much coexpression they have (regardless of with which other gene) would yield good results in classification tasks because it would approximately rank genes by their multifunctionality. In this approach, the gene with the highest node degree (most widely coexpressed) is predicted as the best candidate for membership in all GO categories, and the gene with the second-highest node degree is the next best candidate, and so on. The results of doing this for coexpression data are shown in . The mean ROC is 0.58, which is significantly different from 0.5 (p<0.01, permutation test), and there are many GO terms for which performance is quite good (over 0.70). We performed a similar analysis on several other networks (), and the performance is consistently significant (p<0.01). The performance is particularly strong in the case of the human protein-protein interaction network (). Clearly these prediction performance results are an artifact, and are due to interactions between the input data and multifunctionality.
The results in and show that while there is a detectable relationship between node degree and multifunctionality, the actual correlation between the rankings is modest and varies from data set to data set. However, when we look at the impact on the ability to predict gene function from node degree alone, the variance is almost entirely explainable by multifunctionality. To help show this, we assessed the performance of node degree at predicting gene function in forty-seven human coexpression networks. As expected, performance varies from data set to data set; these values essentially expand the data in . Overall, the mean AUC we obtain with a given network using node degree alone to rank genes is highly predicted by the degree to which the network's node degree ranking correlates with multifunctionality (Spearman correlation
0.96). Thus node degree performance is a proxy for multifunctionality in determining gene interactions. In subsequent sections, we will see more clearly that node degree performance is central to determining overall performance in gene function prediction from networks, so the high correlation between node degree performance and the degree to which node degree reflects multifunctionality is critical.
Thus far we have measured multifunctionality using the Gene Ontology, which is arguably the best single source of gene annotations. One could ask whether the same ranking generalizes for other groupings of genes one might want to predict. As shown in , the GO-based multifunctionality ranking provides very good predictions for sets of candidate disease genes identified in genetics studies, including Alzheimer's disease 
, schizophrenia 
, Parkinson's disease 
and autism 
. Indeed, across 4069 sets of disease genes from OMIM 
, the average ROC is 0.76 (Figure S1
). To give a point of reference, the multifunctionality ranking performs better than a sophisticated algorithm, GeneMANIA 
, using protein interaction data to make predictions on these same groups (). This finding suggests biases in how these candidate gene groups are established, how association with disease indirectly influences the GO annotations a gene receives, and/or the degree to which multifunctional genes contribute to disease.
Disease candidate gene prediction.
In this section we have shown that using node degree alone to predict gene function works strikingly well and that numerous real-life rankings of genes seem to reflect multifunctionality more than any more specific biological principle. But because node degree is not actually what is used by biologists to predict function, one might still ask whether “real” analyses are independent of these effects. We address this issue in the next section.
Do gene networks provide new information?
While the above results show that simply ranking genes by node degree (and thus by multifunctionality) provides some predictability of gene function, it is still possible that a “real” analysis of the original network with a sophisticated machine learning algorithm does much better and that multifunctionality plays only a small role in real studies. Unfortunately we find this is not the case. Earlier we mentioned the finding that node degree has been shown to explain much of the structure of protein interaction networks. We now show that node degree (and thus multifunctionality) underlies a large fraction of the performance of gene function prediction methods. In fact is it impossible to say for certain that any gene function prediction is not simply due to the influence of multifunctionality.
As a first indication of this problem, we found that the ability to predict a gene function group based on node degree performance is strongly predictive of how well that group will be predicted in the “real” analysis (, ). That is, node degree generates its best prediction performance in the same gene groups as a real analysis. In fact, the correlation understates this trend, since scatter around an AUC of 0.5 is expected. Second, predictions from node degree are surprisingly good compared to the “real” analysis (). For example, while average “real” prediction performance of GO groups is high when using protein interactions (mean AUC 0.7), the node degree ranking mean AUC is 0.63 – far above the 0.5 expected by chance. Coexpression yielded very similar results (). Thus, association matrices reflecting strong multifunctionality produce better performance, and particularly the performance that could be predicted by multifunctionality alone (without knowledge of specific gene-gene associations).
Node degree performance is highly correlated with prediction from the original data.
The histograms in show another important feature of node degree performance. Not only is node degree only slightly worse than real performance, the distributions have similar variances. This suggests that if the statistical significance of the predictions made from the real data were calibrated by the performance from node degree, the results would be drastically different from the more usual approach of using a permuted matrix. In other words, not only is “real” performance largely similar to multifunctionality performance, multifunctionality explains much of the variation for one group compared to others.
While node degree ranking is surprisingly powerful for predicting gene function, it is still not clear from the above analysis the extent to which node degree underlies the performance obtained when network structure is used. To quantify how much is actually gained by using the network and show the manner in which multifunctionality can affect network structure through node degree, we constructed networks entirely according to the model proposed earlier (in which a gene's appearance in a given test is taken to be due to a higher prevalence, pi
). As earlier, our interaction matrix takes the form of a self-outer product, and as shown in the supplement (Data S1, Section 2), the vector which provides the best approximation (in the least-squares sense) of the original association matrix under this constraint is the node degree vector. This means that the best approximation (under these conditions) is a network constructed from our proxy for multifunctionality. We call the network formed from the self outer product of the node degree vector an “individual property network” or IPN, since it uses only information about individual genes (i.e., node degree), not relationships between genes. This network lacks any meaningful association information but can be analyzed using the same algorithms as the original network (see Figure S2
for a schematic of IPN construction). Thus, IPNs reflect associations predicted from multifunctionality alone, that genes with many functions are more likely to interact, and if these interactions show gene function prediction performance, it can be ascribed to multifunctionality alone.
As with the simple node degree ranking, we found that IPNs perform very well in gene function prediction tasks compared to results from the original association matrix. Groups that perform well in the “real” analysis can be largely explained by variation in the performance of the IPN. Two examples are given in . Prediction of KEGG pathways from protein interactions using GeneMANIA yields the highest average performances of all tests (), but only slightly better than the IPN (). In contrast, predicting GO from coexpression using the SVM (evaluated using corrected classification rate, CCR; see Methods
) does substantially better than the IPN, but “real” performance is not as good and there are many groups which are not learnable from either network (, cluster of points around 0.5, ). To confirm that these results are not influenced by the use of any specific algorithm, we performed a similar analysis using only the semantic similarity of gene pairs in the original data and the corresponding IPN to show that functional overlap in prediction is primarily due to multifunctionality bias (Figure S3
Individual property network (IPN) performance.
One prediction of the model we propose is that if performance is due to the node degree, varying the data will have a much more significant effect than varying the method, performance metric, or even the exact learning task. In , the same task is performed with two different high- performing association matrices, and there is little similarity in performance across all groups. However, the top performing groups are the ones in which the average node degree is high in both data sets (measured as the product of the average node degrees; black points in ). In contrast, two different methods yield very similar results even on rather different tasks, so long as the same underlying data is used (), and again the best performance is for the groups with the highest node degree.
To further address variability across data sets, we analyzed coexpression in the 47 microarray studies underlying our human gene coexpression network. We also examined 77 mouse microarray studies on the Affymetrix MOE430A platform, analyzing genes intersecting with the mouse Golden Path list (mm9) and the genes analyzed by Su et al. 
. The Su dataset is of interest as it was used in a large evaluation of gene function prediction methods 
. For computational efficiency, we only used the GO-slim categories in this analysis, which may also forestall concerns that our estimates of performance of node degree are enhanced by redundancy in GO (Figure S4
). We found that overall, human dataset performance is strongly predicted by node degree (correlation of 0.94) with a variable fraction (approximately half to two-thirds) of the performance accountable by node degree performance (including in the meta-analysis dataset and protein-interaction dataset) (). A similar effect is present for mouse, with the Su dataset being an apparent outlier ().
Since node degree predicts gene function using multifunctionality without association information, its prediction performance could be used as the null distribution in computing p-values for the true association matrix, instead of using a permuted matrix as the null distribution. When we do this, it drastically reduces the number of significantly predictable GO groups in even the best of cases (). However, we have seen that different gene groups perform differently depending on the node degree of the genes within them (and that node degree of genes predicts performance). Thus, the null distribution must more properly be specific to the gene group, depending on the node degrees of the genes within the group. We developed a simple algorithm to generate random association matrices where the node degree for each gene is maintained (in contrast to usual permutation methods where the node degree distribution and network structure are maintained, but node degree for any given gene is allowed to change). Generating 1000 such matrices and testing prediction performance for the gene group on each gives a distribution of AUCs that is an estimate of performance under the null hypothesis of performance solely from multifunctionality effects. As can be seen in for two GO groups, some groups can have nearly all of their performance due to true guilt-by-association, though performance is not very good (GO:0019825, AUC
0.67, p<0.001), whereas high-performing groups may simply have high average node degree (GO:0044419, AUC
0.78, p~0.035). Finally, in , we plotted average performance across all human datasets for each GO-slim category to show that each gene function prediction category has a unique degree of learnability from multifunctionality. Most groups lie slightly above the identity line, indicating some information gained from using the original network, but deviation from the line is low.
In this section we showed that much of the performance of guilt-by-association methods can be explained by multifunctionality. Data sets which perform well in gene function prediction are the ones most strongly influenced by multifunctionality. Isolating those predictions which are not explained by multifunctionality leaves a gloomier picture of the feasibility of gene function prediction from networks. In the next section we consider the generality of our findings.
Assessing the generality of multifunctionality effects
A potential criticism of our results thus far is that we rely on the AUC or the CCR for evaluation. Focusing on the AUC, a common complaint is that the AUC takes into account true positives that are not near the top of the ranking, but low-ranking predictions are of no interest from a biological validation standpoint. Thus is it reasonable to ask whether the genes with the highest rankings in our predictions are multifunctional. Because the trend in which multifunctionality is predicted by node degree appears to be true not just for "hubs" but across all node degrees (), we would expect that predictions based on node degree would still account for a significant fraction of performance when focusing on the top of the list.
We re-assessed node degree effects in the human protein interaction by three alternative methods. First, we employed the ROC50, a variant of the ROC that examines only the rankings up to the 50th
false positive example 
. ROC50 performance averaged across GO groups was 0.64, while ROC50 using only node degree was 0.58 (significantly different from 0.5), confirming the prediction. ROC50 is not very widely used and emphasizes details of the order of the genes in the top of the rankings, which might not be of much interest. An alternative measure is positive predictive value (PPV), the fraction of true positives above a chosen threshold. To give a point of reference, using a threshold of 50 genes, GeneMANIA gave a mean PPV across GO groups of 3.8%, a 14-fold improvement over that expected by chance. We then generated a ranking of genes based on multifunctionality, using PPV instead of AUC as the optimization criterion (see Methods
). This single ranking gives an average PPV of 4.7% averaged across GO groups. The final measure we examined is the area under the precision-recall curve (AUP). The multifunctionality-ranked gene list gives a mean AUP of 0.037, considerably lower than GeneMANIA's mean performance of 0.10. The difference between this result and the PPV result suggests that GeneMANIA tends to rank true positives higher than the multifunctionality ranking, though clearly both tend to rank positives more highly than expected by chance. However, this does not mean that using AUP gets around the problem of multifunctionality. Because AUP strongly rewards a single true positive at the top of the ranking, overall performance across GO categories is highly sensitive to the impact of a correct prediction for a highly multifunctional gene. To demonstrate this we constructed a gene network that contains only 100 edges (among 188 genes), with edges chosen based on the number of shared functions between the nodes. This yields a mean AUP of 0.07 (at 10 fold cross-validation). It turns out that 21 of these edges are in the real network. A closer examination of these results suggests that most of the GO categories that have high AUPs with GeneMANIA can be accounted for by the effects of just a handful of highly multifunctional genes. A full exposition of this effect is beyond the scope of this paper, but we also note that AUP suffers from having different expected values for each size of GO group and, unlike AUC, is also sensitive to the evaluation setup in sparse networks, improving at higher numbers of cross-validation splits (i.e., 10-fold vs. 3-fold), complicating its interpretation when attempting to detect trends. Altogether,these results strongly argue that the effects we report reflect a real tendency of multifunctional genes to account for predictions made, regardless of evaluation metric.
As a further test of generality, we examined another “gold standard” set of seven yeast gene interaction networks, some of which are aggregated from other (potentially overlapping) network studies data 
. Importantly, these are networks built using a variety of methods including genetic interactions, coexpression, protein interactions and other approaches. Cumulatively, these interaction networks and the data underlying them have been cited over 6000 times (based on Google Scholar). We used yet another measure of “performance”, the Dice-Jaccard semantic similarity (e.g., 
) to directly assess the quality of the network connections without performing machine learning. The links within the real networks are highly significant individually or in aggregate, possessing much higher semantic similarity than random links (). However, as predicted by our previous results, a large fraction of this performance may be explained purely by multifunctionality biases. This is true even in the case of YeastNet 
which, unlike the other networks we tested, was tuned using information from GO to improve functional relevance. The aggregate network node degree predicts all Gene Ontology categories in yeast with an average ROC of 0.63, comparable to what we observe with the human data. Moreover, the aggregate network IPN accounts for roughly three-quarters of the performance of the true yeast network connections (61% higher in the IPN than random). In the individual networks (e.g.,Figure S5
), the distribution of scores for individual links can be seen to be strongly shifted to higher values both for the real and IPN networks; using the IPN to estimate false discovery rates would substantially increase false discovery estimates. There is no gold standard method to assess network performance, but the Dice-Jaccard index attempts to correct for prevalence/multifunctionality (it is normalized to number of terms), and thus it tends to reduce IPN contribution over other methods, such as GO term overlap. These results show that the structure of these networks strongly reflects multifunctionality, and that gene function predictions stemming from them are heavily influenced by this fact. They furthermore show that our findings extend to all networks we have examined, and are again not merely a function of assessing the networks using machine learning and predictive performance as outcome measures.
Can the impact of multifunctionality be reduced?
Node degree has been previously recognized as potentially influencing functional predictions. Indeed, it is important to note that the first step in GeneMANIA's operation is to attempt to correct for node degree (downweighting each edge by node degree). As discussed in the introduction, a variety of methods have been suggested to minimize the importance of node degree. Above (i.e., ), we've considered one of the most general approaches, which is constraining node degree to take a much more restricted range of values by top overlap. However, this has little effect because it is the AUC produced by the node degree rank that matters (which reflects the influence of multifunctionality), and top overlap tends to preserve node degree ranks (r>0.76 in coexpression). Here we consider other possible corrections for node degree biases and show that their interpretation is still enhanced by consideration of the effects of multifunctionality.
We first considered the role of sparsification. Sparsification is a necessary step for methods like GeneMANIA, and protein interaction networks are inherently sparse, but coexpression networks begin as a complete distance matrix. We hypothesized that that sparsification induces network structures that affects node degree/multifunctionality issues. In particular, sparsification is likely to result in high node degree for genes with higher variability (even with similar means). In un-sparsified data, variation in mean interactions across many genes may dominate over the heaviness of the tails. Therefore we assessed the impact of multifunctionality on non-sparsified coexpression data. We used a simple nearest-neighbor gene function prediction algorithm which could work rapidly with unsparse data.
The unsparse coexpression data exhibits low node degree bias (the mean AUC based on node degree ranking is 0.52) but very good performance using the prediction algorithm (mean AUC of 0.71, ). Thus this method might appear to remove the influence of node degree, and thus multifunctionality. However, there is still a very strong dependence on node degree (). The more extreme the performance of a GO group is based on node degree, the better the performance in the nearest-neighbor analysis. Thus we can see that GO groups are predicted if their multifunctionality is either very high or very low (that is, the genes in the group are largely mono-functional).
Potential node degree corrections do not remove node degree influence.
In we show results for another correction for node degree, using sparse data made up by aggregating multiple coexpression networks. Each coexpression matrix in the aggregated coexpression matrix was first normalized so that edges are retained only if the nodes exhibit a correlation above that predicted by their node degree alone (e.g., as in 
). After aggregation, this matrix exhibits high performance (mean AUC
0.70) and modest node degree bias (mean AUC
0.55). However, it again exhibits the triangular dependence on multifunctionality – highly predictable groups tend to either contain unusually multifunctional genes or unusually mono-functional genes.
As a final test for the influence of node degree, we examined in more detail the highest performing individual data set (excluding protein interaction aggregate data) across all methods and data-types. This was an unsparsified coexpression network from a large multi-tissue coexpression experiment (GSE7307). It yields a mean AUC of 0.68 across all GO groups and no node degree bias in favor of multifunctionality (mean AUC
0.48). However, even in this case, the same triangular distribution of node degree AUCs is apparent (). Thus, the set of high performing functional groups generally consist of two quite different classes, both understandable with the consideration of multifunctionality.
Taken together, the results in this section demonstrate that attempt to reduce the effect of node degree bias do not have the desired effect, nor are data sets which appear to have no node degree bias actually free of its effects.