Proteomic data and more specifically PPIs data are of great scientific interest through their connection with important cellular functions such as extra and intra cellular signaling, cell communication etc [
1]. Moreover, multi protein complexes reveal insights of the functional and topological organization of the protein networks. In the past years, new high throughput methods for identifying pairwise PPIs have been developed that generate enormous datasets. Depending on the method used, different kinds of protein interactions are recorded. This is the reason why there exist differences on the generated datasets from different methods. The most popular ones are yeast two hybrid systems [
2], mass spectrometry [
1], tandem affinity purification [
3], microarrays [
4] and phage display [
5].
Each method has its strengths and weaknesses; however every method has a certain error rate for the detection of a protein-protein interaction. The main basic errors are under-prediction and over-prediction (false positive) of protein interactions [
6]. Besides that, we currently don't know the real "truth" in these datasets, due to the fact that most of the protein complexes are experimentally not yet determined [
7].
Usually, the aggregation of the PPIs of an organism is modeled as an undirected graph, symbolized as G = -(V, E), where nodes (V) represent the proteins and edges (E) the pairwise PPIs. The graph model makes it easy for many computational methods derived from the graph theory to be applied on these noisy datasets to extract functional modules such as protein complexes. The goal of those approaches is to detect highly connected subgraphs which are protein complex candidates.
Each algorithmic strategy relies on a very different approach. The best known one is the Molecular complex detection algorithm (Mcode) [
8]. Another algorithm, that has been characterized for its efficiency [
9], is the MCL (Markov Clustering) algorithm [
10]. Besides that, King
et al suggested the RNSC algorithm [
11] which uses a cost local search algorithm based loosely on a tabu search meta – heuristic. Another algorithm of the local search approach is the Local Clique Merging Algorithm (LCMA) [
12] which first locates cliques in a graph and then tries to expand them. Two algorithms that use the hierarchical approach are the Highly Connected Subgraph method (HCS) [
13] and the SideS algorithm [
14]. The main concept of these methods is the use of numerous graph min cuts until the stopping criterion of each algorithm is satisfied.
In this paper, we have developed a new clustering tool called GIBA that offers the ability to detect important protein modules such as protein complexes. GIBA implements a two step strategy, where in the first one the whole protein – protein interaction graph is divided into clusters and in the second step these clusters are filtered and only the ones considered important are kept. Extensive experiments were performed on 6 different datasets of yeast organism which are either derived from individual experiments (Tong [
15], Krogan [
16] and Gavin [
1,
17]) or from online databases (DIP [
18] and MIPS [
19]). These datasets vary on the number of proteins as well as the number of interactions composing either sparse (Tong dataset) or relatively dense (MIPS and DIP datasets) graphs. Moreover, by using the recorded yeast protein complexes of the MIPS database, we compared the results obtained from GIBA with 4 other algorithms: Mcode, HCS, SideS and RNSC and examined the derived results based on 5 different metrics. Selecting appropriate combinations between clustering algorithms and filtering methods, GIBA proved its superiority compared to the remaining methods. The undertaken experiments and their results are presented in detail in the Results and Discussion section. Finally, an evaluation of the filter methods has been performed to test how these methods affect the final results and to decide, as accurately as possible, the most effective set of filter parameters that produce the best results.
The remaining of the paper is organized as follows: in the next section, we present the algorithms and the filter methods that are hosted in GIBA tool. In Methods section, the properties of GIBA are presented and the evaluation procedure is presented. In Results and Discussion section we performed extensive experiments on datasets with different properties. Results and Discussion section also contains a discussion about the parameters and the methods that compose the filter of GIBA tool and how these approaches affect the final results. Finally, the conclusions of our work are quoted and the main directions for future work are suggested.