Functional annotation of genes is a fundamental problem in computational and experimental biology. The problem can be addressed at various levels of resolution, ranging from identifying the high-level processes with which a given protein might be associated to discovering the cell-specific protein-ligand interaction targets of a protein under different biological conditions. The most established and reliable methods for protein function prediction are based on sequence similarity, using BLAST [1] and profile methods such as PFAM [2] and PSI-BLAST [1]. Other, still evolving methods, too numerous to list in full, include gene fusion information [3] and phylogenetic profiling [4], [5]. Emergent methods that elucidate function from a variety of high-throughput experimental screens have become particularly attractive recently owing to the reduced cost of conducting genome-wide functional screens. Genomic and proteomic data sets, including gene expression and protein-protein interaction (PPI) data, are becoming increasingly available for a growing array of organisms. Driven by the hypothesis that co-expressed genes might participate in related biological processes, clustering of gene expression profiles across diverse conditions can be used to assign protein function [6]–[8]. The use of PPI data to assign protein function has also been extensively studied. These algorithms are often based on the “guilt by association” principle, which suggests that interacting neighbors in PPI networks might also share a function [9]–[11]. Since such genome-wide data sets are inherently noisy, and each type of data captures only one aspect of cellular activity (e.g., gene expression data measure mRNA levels of transcriptionally induced genes, and PPI data suggest a feasible physical interaction between proteins), it is appealing to combine such heterogeneous data in an effort to improve the coverage and accuracy of protein function prediction.
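As a rough illustration of the “guilt by association” idea, the following minimal Python sketch scores candidate functions for a protein by the fraction of its annotated PPI neighbors that carry each function. It is not the method of any of the cited works; the function name `predict_by_neighbors`, the protein identifiers, and the function labels are hypothetical and chosen only for this example.

```python
from collections import Counter


def predict_by_neighbors(ppi_edges, annotations, protein):
    """Score functions for `protein` by the fraction of its annotated
    PPI neighbors that carry each function (hypothetical helper)."""
    # Collect direct interaction partners, ignoring self-interactions.
    neighbors = set()
    for a, b in ppi_edges:
        if a == protein and b != protein:
            neighbors.add(b)
        elif b == protein and a != protein:
            neighbors.add(a)

    # Only neighbors with at least a known annotation entry can vote.
    annotated = [n for n in neighbors if n in annotations]
    if not annotated:
        return {}

    # Count how many annotated neighbors carry each function.
    counts = Counter(f for n in annotated for f in annotations[n])
    return {f: c / len(annotated) for f, c in counts.items()}


# Toy network (made-up identifiers): P1 interacts with P2, P3, and P4.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
known = {"P2": {"transport"}, "P3": {"transport", "signaling"}}
scores = predict_by_neighbors(edges, known, "P1")
```

In this toy network, P4 has no annotation and therefore does not vote. A simple vote of this kind uses only one data type and treats all edges as equally reliable, which is one motivation for integrating further evidence.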
Bayesian network methodologies for data integration have been explored [12]–[14] in a number of systems for predicting protein-protein interactions and protein function similarity. These approaches calculate the posterior probability that each pair of genes, i and j, has a functional relationship, given the various types of genome-wide data. These algorithms output a functional linkage graph [3], [15] in which an edge between two nodes (genes) represents functional similarity, with a reliability score (probability) assigned to each edge. However, using these probabilistic networks to produce a functional assignment remains a hard computational problem. For instance, one approach to protein function annotation based on Markov random fields (MRFs) has been previously investigated [10]. An integrated MRF approach that includes network structures (PPI network and co-expression network) and protein domain information to predict protein function has also been proposed [16]. There, the authors used Gibbs sampling to estimate the probability that a protein has a particular function. Machine learning methods based on support vector machines have been investigated in several projects [17], [18]. In fact, if we treat the prediction of function based on each modality as an expert, then any of the popular classification methods (decision trees, boosting, and weighted majority) can in principle be used for “integration” of these predictions. However, given the currently sparse data, using complex representations for prediction might lead to overfitting.
In this paper, our contribution is twofold. First, we propose a simple and relatively transparent probabilistic model for protein function prediction that allows us to efficiently calculate the posterior probability that each gene has a particular function, given various types of genome-wide data. Second, we analyze the effect of combining the heterogeneous data sources in a substantially more comprehensive manner than has been done to date, with the goal of better understanding which types of genes benefit most from the integration of which types of data sources. In particular, we develop a relatively simple yet useful method to integrate functional linkage graphs with categorical information. The functional linkage graphs are constructed from PPI data and gene expression data. As usual, the assumption here is that physically interacting proteins or co-expressed genes are more likely to share protein functions than a randomly selected pair of proteins [10]. Categorical features for each protein, including protein motifs, knockout phenotype, and localization information, are captured based on predictive sources of evidence available from the MIPS database [19]. Using a Bayesian network framework, this categorical information is then combined with the functional linkage graphs to generate functional predictions. Our method is applied to the functional prediction of proteins in yeast (Saccharomyces cerevisiae). Our methodology combines PPI data, gene expression data, protein motif information, mutant phenotype data, and protein localization data, while using Gene Ontology (GO) “biological processes” terms [20] as the basis for functional annotation. The long-term goal of this research is to develop a probabilistic language to specify which proteins might be active in a given biological process based on the type of interacting partners they have, protein motifs, or transcriptional profiles.
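To make the notion of a posterior probability given several types of evidence concrete, the sketch below combines a prior with one likelihood ratio per data source using Bayes' rule in odds form. It assumes that the sources are conditionally independent given the function (a naive Bayes simplification introduced here for illustration only); it is not the model developed in this paper, and the function name `posterior` and all numerical values are hypothetical.

```python
def posterior(prior, likelihood_ratios):
    """Return P(function | evidence) from a prior P(function) and one
    likelihood ratio P(e | function) / P(e | not function) per source.

    Assumes conditional independence of the sources given the function
    (an assumption of this sketch, not a claim about the paper's model).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")

    # Bayes' rule in odds form: posterior odds = prior odds * product of ratios.
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio

    # Convert odds back to a probability.
    return odds / (1.0 + odds)


# Made-up ratios standing in for, e.g., neighbor, motif, and localization evidence.
p = posterior(0.1, [3.0, 2.0, 1.5])
```

With these made-up numbers, a prior of 0.1 and ratios of 3, 2, and 1.5 give posterior odds of 1, so `p` is approximately 0.5. A ratio below 1 lowers the posterior, which is one way a noisy data source can degrade rather than improve a combined prediction.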
By combining five types of data, the number of correctly recovered known gene-term associations is increased by 18% at the same precision (50%), compared to using PPI data alone. In our analysis, we focus on specific points on the ROC curve that we believe are feasible for experimental follow-up of the predictions. We show that adding different types of genome-wide data newly recovers different GO terms, specific to the type of information added. In addition, by analyzing the robustness of the integration model to PPI edge removal, we provide a novel perspective on the amount of PPI data necessary for the integration model to achieve high prediction accuracy. In that analysis, we find some conditions under which integration hurts performance rather than improving it. Our method assigns plausible functions to 463 currently unannotated proteins, and we discuss some of these novel assignments.