A variety of recently available high throughput data sets, such as protein-protein interaction networks, microarray data and genome sequences, offer important insights into the mechanisms leading to the accomplishment of a protein's function. However, the complexity of analyzing these data sets manually has motivated the development of numerous computational approaches for predicting protein function [1
]. For a recent comprehensive survey on this topic, see Pandey et al
One of the most popular methods used for predicting protein function from biological data is classification [4
]. Under the traditional classification framework, each protein is represented by a set of features, such as the expression profile of its corresponding gene or the set of proteins it interacts with. Now, for each functional class, a model is constructed using the feature sets of the proteins annotated with this class. This model is then used to decide if an unannotated query protein should be annotated with this class. The key premise underlying this methodology for predicting protein function is that proteins belonging to the same functional class have "similar" biological attributes.
Standard classification or predictive modeling techniques for function prediction rely on positive and negative examples from functional classification schemes, such as the Gene Ontology [7
] or FunCat [8
], and typically treat each functional class separately. However, this standard approach fails to capture one of the key properties of such classification schemes: most schemes not only provide annotations to functional classes, but also capture inter-relations between the functional classes. For example, the Gene Ontology (GO) is arranged as a directed acyclic graph in which the GO terms form a hierarchy capturing everything from relatively general functions (e.g. metabolism) to specific biological roles (e.g. nucleotide excision repair). Such an organization of classes, particularly in the case of GO, poses two important challenges for predictive modeling techniques for function prediction. First, several studies [9
] have concluded that proteins in inter-related, or similar
GO functional classes tend to have similar biological attributes. This limits the applicability of the key premise of classification-based function prediction discussed above, since distinguishing between such similar GO classes becomes hard.
The second important issue that arises is that it is often hard to construct reliable classification models for several functional classes from a given data set due to complex issues including noise in the data, low relevance of the data set for some functional classes, and an insufficient number of training examples for building accurate classification models [12
]. These issues, particularly the last one, are expected to most severely affect functional classes having few members, which also include classes located deep in a functional hierarchy. This difficulty of constructing reliable classification models is illustrated in Figure for two classes from the GO biological process ontology, which have 383 and 54 member proteins in Mnaimneh et al
's gene expression data set [13
], representing a large class (GO:0051252) and a class of median size (GO:0006352) respectively. The histograms in this figure show, for each protein in these classes, how many proteins in its neighborhood belong to the same class. Neighborhood is restricted to the twenty proteins with the most similar expression profiles to the query protein, using correlation as the similarity measure. These plots show that for most of the proteins in both the large, as well as the much smaller class, only a limited number of similar proteins in the same class are available. For instance, in the large class, 243 of the 383 member proteins have less than three similar proteins in the same class, while two is the maximum number of neighboring proteins of the same class for proteins in the smaller class. In fact, 40 of the 54 proteins in the smaller class have no proteins of the same class in their neighborhood. This lack of enough training examples having characteristics similar to the query protein, which occurs due to the issues discussed above, illustrates the difficulty of building classification models for functional classes, particularly for the small ones.
Distribution of the number of nearest neighbors of the same class in a neighborhood of size 20 for the member proteins of two functional classes of varying sizes in Mnaimneh et al's data set.
However, the availability of the same well-defined structure of relationships between functional classes in the form of Gene Ontology raises the following key question: "Can the performance of standard classification algorithms for function prediction be improved by incorporating these inter-relationships into them?". In this paper, we address this question using an approach shown visually in Figure . As illustrated by this figure, our approach uses evidence in neighboring proteins belonging to similar classes to bolster the evidence for annotation of the query protein with the target class. Evidence for the abundance of proteins belonging to classes similar to the target class in the neighborhood of a query protein, and hence the applicability of such an approach is presented in Figure for the target class of median size (GO:0006352) discussed in Figure . Here, the semantic similarity of all the classes with the target class, calculated using Lin's measure [14
], is plotted against the average number of nearest neighbors of the corresponding class in the nearest neighborhood of a protein belonging to the target class. As can be seen from this scatter plot, even though the average frequency of the target class (similarity = 1) is very small (less than 0.5), there are several classes, such as GO:0006366, GO:0051252 and GO:0016072, that are more abundant, and have a substantial semantic similarity with the target class (over 0.4). This similarity can be used to enrich the information available in the neighborhood of proteins being tested for a target class. Our approach is based on this principle.
Figure 2 Conceptual visualization of our approach for incorporating functional inter-relationships into function predictions algorithms, and empirical evidence to show the abundance of proteins carrying labels similar to the target class in the neighborhood of (more ...)
More specifically, using Lin's measure [14
] for evaluating the semantic similarity between classes in an ontology, we incorporate such functional interrelationships into the k
-nearest neighbor classifier [4
]. We evaluate our algorithm on two large microarray data sets [13
], a recent protein interaction data set [16
] and a combination of interaction and microarray data sets, each of which is used for the modeling and prediction of over hundred classes from the GO Biological Process ontology. The results show that, compared to the base k-NN classifier, this incorporation produces more accurate predictions for many of the functional classes considered, and also that the classes benefitted most by this approach are those containing the fewest members. We also illustrate how the proposed framework can be used for integrating information in the entire GO Biological Process ontology to improve the accuracy of prediction over a set of target classes. Finally, we provide qualitative and quantitative evidence that this incorporation of functional inter-relationships enables the discovery of interesting biology in the form of novel functional annotations for several yeast proteins, such as Sna4, Rtn1 and Lin1.
Note that since the rest of the discussion in this paper will be concerned with classification within the context of GO, the terms (functional) class, (GO) term, node (in an ontology) and label will be used interchangeably in the rest of the text.