Machine learning approaches to classification either learn the pattern of each class in the feature space or learn a boundary that separates the feature space into classes. The features of the data instances are usually available, and only the class labels are unknown. For example, when classifying text documents into topic categories, the words in the documents are the features and are readily available, whereas the topic is what is predicted. In some domains, however, obtaining features is resource-intensive, so not all features may be available. One example is protein-protein interaction (PPI) prediction, where not only the labels ('interacting' or 'non-interacting') but also some of the features are unavailable. It may be possible to obtain at least some of the missing features by carrying out a few experiments, as permitted by the available resources. If only a few experiments can be carried out to acquire missing features, which proteins should be studied, and which features of those proteins should be determined? From the perspective of machine learning for PPI prediction, it is desirable to acquire those features that, when used to train the classifier, improve its accuracy the most. That is, the utility of a feature acquisition is measured by how much the acquired features contribute to improving the accuracy of the classifier. Active feature acquisition (AFA) is a strategy for preselecting such instance-feature combinations (i.e., protein and experiment combinations) for maximum utility. The goal of AFA is to create an optimal training set that would result in the best classifier, not to determine the best classification model itself.
We present a heuristic method for active feature acquisition that calculates the utility of acquiring a missing feature. The heuristic takes into account the change in belief of the classification model induced by acquiring the feature under consideration. Compared with random selection of the proteins to study and the type of experiment to perform, the heuristic method reduces the number of experiments to as few as 40%. The most notable characteristic of this method is that it does not require retraining the classification model on every possible (instance, feature, feature-value) tuple. For this reason, our method is far less computationally expensive than previous AFA strategies.
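The idea of scoring an acquisition by the change in model belief, without retraining, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the method as published: the utility is taken to be the expected absolute change in the predicted probability of 'interacting' when a missing feature is filled with each candidate value, and `toy_predict_proba`, its weights, and the candidate value distributions are hypothetical stand-ins for an already trained classifier and real feature statistics.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Instance = List[Optional[float]]  # None marks a feature that has not been acquired


def toy_predict_proba(x: Sequence[Optional[float]]) -> float:
    """Hypothetical, already trained model: a logistic score that skips missing features."""
    weights = [2.0, -1.0, 0.5]  # illustrative values only
    bias = -0.5
    z = bias + sum(w * v for w, v in zip(weights, x) if v is not None)
    return 1.0 / (1.0 + math.exp(-z))


def belief_change_utility(
    predict_proba: Callable[[Sequence[Optional[float]]], float],
    x: Instance,
    feature: int,
    value_probs: Dict[float, float],
) -> float:
    """Expected absolute change in the predicted probability if `feature` were acquired.

    The model is only queried, never retrained. `value_probs` maps each
    candidate feature value to its assumed probability.
    """
    baseline = predict_proba(x)
    utility = 0.0
    for value, prob in value_probs.items():
        filled = list(x)
        filled[feature] = value
        utility += prob * abs(predict_proba(filled) - baseline)
    return utility


def rank_acquisitions(
    predict_proba: Callable[[Sequence[Optional[float]]], float],
    instances: List[Instance],
    value_probs: Dict[int, Dict[float, float]],
) -> List[Tuple[float, int, int]]:
    """Score every missing (instance, feature) pair and sort by utility, highest first."""
    scored = []
    for i, x in enumerate(instances):
        for f, v in enumerate(x):
            if v is None:
                u = belief_change_utility(predict_proba, x, f, value_probs[f])
                scored.append((u, i, f))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored


if __name__ == "__main__":
    # Two protein pairs with three binary features each; None = experiment not yet run.
    pairs = [[None, 1.0, None], [1.0, None, 0.0]]
    uniform = {0.0: 0.5, 1.0: 0.5}  # assumed distribution over candidate values
    for utility, i, f in rank_acquisitions(toy_predict_proba, pairs, {0: uniform, 1: uniform, 2: uniform}):
        print(f"instance {i}, feature {f}: utility {utility:.3f}")
```

In this sketch the cost of scoring one candidate is one model query per candidate feature value, which is the kind of saving over retraining-based strategies that the text describes. A real application would replace the toy model with the trained PPI classifier and estimate the value distributions from the observed data.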
The results show that our heuristic method for AFA creates an optimal training set with far fewer acquired features than random acquisition. This shows the value of active feature acquisition as an aid to protein-protein interaction prediction, where feature acquisition is costly. Compared with previous methods, the proposed method reduces computational cost while also achieving a better F-score. The proposed method is valuable because it points AFA in a direction of far lower computational expense by removing, for the first time, the need to train a classifier for every (instance, feature, feature-value) tuple, which would be impractical in several domains.