Our study represents a first attempt to examine the predictive power of PPI network properties, in combination with an extensive set of structural and functional features, for the identification of cancer genes. Compared to OMIM disease genes and non-cancer genes, cancer genes have more interaction partners and higher network density in their neighborhood, and are more closely related to other cancer genes in the PPI network. These observations agree with the notion that cancer genes play a central role in the cellular network and exert their functions in an interdependent, modular fashion. One common concern regarding the analysis of PPI networks is that the observed higher connectivity of certain groups of genes could result from a bias in the PPI network, since these genes may have been investigated in more detail by the research community. To address this concern, it was previously argued that the higher number of known interaction partners for cancer genes is likely a consequence of a higher frequency of promiscuous domains (which interact with a variety of different domains) in cancer genes rather than of an obvious bias in the PPI network [8]. Based on a probability density function from the Pfam domain population [8], many of the top Pfam domains enriched in cancer genes vs. non-cancer genes in our study showed significantly higher-than-expected interaction promiscuity in terms of the number of different domains they interact with, such as the protein kinase domain, Ets domain and Homeobox domain (Table ). In addition, there is a significant difference in connectivity and clustering coefficient between cancer and OMIM genes (Figure ; see additional file 1: Supplementary Table S1), even though cancer genes and OMIM genes both represent heavily studied gene sets. Approximately 90% of both the cancer genes from the Cancer Gene Census and the disease genes from the OMIM database were included in the PPI network. Furthermore, the analyses were conducted using the subset of well-annotated human genes that were assigned GO terms and Pfam domains. As a result, the less well-studied genes were filtered out of the non-cancer gene group.
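To make the two topological quantities discussed above concrete, the following is a minimal sketch of how connectivity (degree) and the local clustering coefficient can be computed for a node of an undirected interaction network. The function names, the edge-list input format and the toy network are illustrative assumptions and are not taken from the study; the clustering coefficient uses the standard definition (links among a node's neighbors divided by the number of possible links among them).

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree(adjacency, node):
    """Connectivity: the number of distinct interaction partners of a node."""
    return len(adjacency.get(node, ()))


def clustering_coefficient(adjacency, node):
    """Fraction of neighbor pairs of `node` that interact with each other."""
    neighbors = adjacency.get(node, set())
    k = len(neighbors)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbors; reported as 0 here
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network, for illustration only.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
toy_network = build_adjacency(toy_edges)

for protein in sorted(toy_network):
    print(protein, degree(toy_network, protein),
          round(clustering_coefficient(toy_network, protein), 3))
```

In this toy network, node A has three partners, of which only one pair (B and C) interacts, so its clustering coefficient is one third.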
Our study showed that cancer genes have functional, sequence and evolutionary characteristics distinct from those of COSMIC, OMIM and non-cancer genes. COSMIC genes and OMIM genes in turn differ from each other and from non-cancer genes. It should be noted that the OMIM gene set in our study is specific to the comparison with cancer genes, as we excluded from it those genes shared between the OMIM database and the Cancer Gene Census or the COSMIC database. COSMIC genes showed relatively more similarity to cancer genes in many properties, and in fact many COSMIC genes have been found to be involved in cancer although they are not included in the Cancer Gene Census database [15]. Therefore, it is beneficial to separate the COSMIC and OMIM gene groups from non-cancer genes when training a classifier to predict cancer genes.
SVM classifiers on average perform slightly better than Naïve Bayes and logistic regression. Naïve Bayes performs the worst in our study, probably because our features are not independent of one another, which violates the basic assumption of Naïve Bayes models. The theoretical advantage of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; maximizing the margin mitigates over-fitting to the training data, which is of particular importance when dealing with a large number of features.
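For reference, the trade-off between empirical error and margin can be written in the standard soft-margin form. The notation here is conventional and is an assumption for illustration, not taken from the study: $\mathbf{x}_i$ is the feature vector of gene $i$, $y_i \in \{-1, +1\}$ its class label, $\xi_i$ a slack variable, and $C$ a penalty parameter. The exact formulation and kernel used in the study may differ.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
  y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 - \xi_i,
  \quad \xi_i \ge 0, \quad i = 1, \dots, n.
```

The first term is small when the margin width $2 / \lVert \mathbf{w} \rVert$ is large, and the second term penalizes training examples that fall inside the margin or on the wrong side of it, so $C$ controls the balance between the two goals.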
PPI topological features alone have relatively strong predictive power for the identification of cancer genes. Like PPI features, GO and Pfam annotations are strong predictors compared to sequence and conservation features. Combining all these features maximizes the predictive power (Table ). With the accumulation of more protein-protein interaction datasets, our approach of integrating PPI topological features is likely to become more powerful.
The SVM classifier provides a probability score for prioritizing candidate cancer genes, which can be followed up by experimental studies such as siRNA knockdown and cell viability assays. Preliminary siRNA studies on predicted cancer genes showed promising leads for further investigation. Interestingly, COSMIC genes with somatic mutations in cancer samples have higher scores than other genes in the COSMIC database (Figure ). As COSMIC genes were held out from the training set and no mutation information was included in the training features, this observation indicates that our approach is consistent with the large-scale systematic re-sequencing efforts and can serve as a useful complementary approach for identifying cancer genes.