Protein complexes are important molecular entities in cellular organizations. With large amounts of protein interactions produced by high-throughput experimental techniques [1
], protein complexes are able to be automatically identified from genome-scale interaction networks by computational approaches. Generally, proteins in a complex share more interactions among themselves than with other proteins [3
]. Many algorithms, based on graph theory, have been proposed to identify protein complexes by detecting dense regions in PPI networks, such as MCODE [4
], MCL [5
], and CFinder [6
]. However, their performance is affected by the false-positive interactions in the network. In some experiments, the proportion of false-positive interactions generated by high-throughput techniques is estimated to be up to 50% [7
]. It is reasonable to make use of biological information to measure the reliability of interaction pairs or predicted complexes. For example, protein function annotation datasets are used in RNSC [8
] and DECAFF [9
] to filter complexes with low functional homogeneity or reliability.
GO annotation is a useful information resource to measure the reliability of protein interaction pairs. The GO project maintains three structured controlled vocabularies, which describe gene products in terms of their associated biological processes, cellular components, and molecular functions [10
]. The ontology of each domain is structured as a directed acyclic graph (DAG), which organizes terms by their relationships. The similarity of two gene products based on GO annotations can be considered as the similarity of two sets of GO terms. The semantic similarity of GO terms can be measured by the topological information in the ontology structure.
In this paper, we attempt to make use of GO annotations and the ontology structure of GO terms to measure semantic similarity of GO terms and proteins. The similarity of two GO terms is measured based on their average distance to their lowest common ancestors in the ontology structure. Semantic similarity between proteins is computed as the similarity of two sets of GO terms, which annotate the two proteins respectively. PPIs in the network are then weighted by the similarity of interacting proteins for the filtering and clustering steps. As far as we know, most approaches filter the predicted complexes with low density or statistical significance in post processes [4
], which still introduce some unreliable interactions in the results. In our method, however, the low-weight interactions are filtered first, followed by a cluster-expanding algorithm to identify high quality complexes consisting of only reliable interactions. Considering the core-attachment structure revealed by Gavin et al. [13
], which reflects the inherent organization of protein complexes, we propose a network clustering algorithm to identify the core and attachment proteins of complexes successively. Firstly, cliques in the filtered network are detected. Highly overlapping cliques are merged to form cores of complexes. Secondly, we add attachment proteins to the cores, making use of the cluster-expanding strategy in RRW algorithm [11
], which is appropriate for expanding clusters consisting of multiple nodes in weighted networks. By applying the clustering algorithm on the purified PPI network, our method identifies complexes with high biological significance and functional homogeneity.