The regulation of diverse biological processes is possible only because distinct components interact [1], maintaining the homeostasis of the system. A major challenge in the biological sciences is to understand how these components interact with each other in order to perform their functions. Because proteins do not act alone but within a network of interactions [2], mapping the interrelationships among proteins is an important step toward understanding their functions and global cell behavior [3]. In recent years, new high-throughput technologies have made it possible to measure the expression profiles of thousands of genes simultaneously. Because of the large amount of transcriptome data available, the inference of gene networks (GNs) from expression data has emerged as an approach to the study of systems biology [4]. The assumption is that if two elements interact (e.g. protein-protein, transcription factor-DNA), their expression profiles should also be related. However, two genes may have similar expression profiles just by coincidence. Thus, the challenge is to recover GNs while reducing the number of false positives. Expression data can be sampled at successive time points (time-series or time-course data) or under different biological conditions (steady-state data). The data can also be produced by distinct technologies, such as microarrays [5], SAGE [6] and RNA-Seq [7].
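The co-expression assumption described above can be illustrated with a minimal sketch. It assumes Pearson correlation as the relatedness measure and an arbitrary cutoff of 0.9; the function names, the threshold and the toy profiles are hypothetical and are not taken from any of the cited works.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / math.sqrt(vx * vy)

def coexpressed_pairs(profiles, threshold=0.9):
    """Return (gene, gene, r) for pairs whose |r| reaches the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    genes = sorted(profiles)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(profiles[g], profiles[h])
            if abs(r) >= threshold:
                pairs.append((g, h, r))
    return pairs

# Toy data (made up): geneA and geneB rise together, geneC does not follow.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [5.0, 1.0, 4.0, 2.0],
}

for g, h, r in coexpressed_pairs(toy):
    print(g, h, round(r, 3))
```

With only a handful of samples per gene, unrelated profiles can also cross such a threshold by chance, which is the false-positive problem noted above.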
The so-called "curse of dimensionality" [8] is a phenomenon in which the number of training samples required for satisfactory classification grows exponentially with the size of the feature space. In many applications, and especially in systems biology, the number of training samples is generally much smaller than the dimension of the feature space. Thus, despite the availability of high-throughput data, the inference of GNs remains limited: the number of genes (features) is much larger than the number of time points (samples). For example, the expression dataset of P. falciparum has 7,745 oligos and only 48 time points.
To address this problem, biological information other than expression data has been included in order to reduce the estimation error [9]. Several types of biological data have recently been produced: (a) interaction data: protein-protein interaction [11] and protein-DNA [14]; (b) function and ontology: KEGG [15] and Gene Ontology [16]; (c) others, such as phylogenetic profiles [17] and Rosetta Stone fusion proteins [18]. In a recent work [20], pairwise relationships obtained from gene expression, phenotypic profiles, KEGG pathways, transitive homology of protein sequences and protein-protein interaction were used to increase the positive predictive value (PPV) [21] and to predict gene function. Each dataset comprises pairwise relationships between genes, and each pair has an associated similarity measure (except in the protein-protein dataset). A PPV was calculated for each dataset at each similarity value, using the yeast Gene Ontology annotation as the gold standard. The PPVs were then combined into an equation (the Biological Score), with a weight associated with each PPV. Gene pairs were grouped according to the Biological Score using the KNN cluster algorithm [22]. Gene function was assigned to genes according to their group.
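As a rough illustration of the PPV step described above, the following sketch computes PPV = TP / (TP + FP) for the gene pairs whose similarity reaches a given value, using a set of gold-standard pairs. The function name, the data layout and the toy values are assumptions made for illustration; the weighting scheme of the Biological Score in [20] is not reproduced here.

```python
def ppv_at_threshold(scored_pairs, gold_standard, threshold):
    """PPV = TP / (TP + FP) among pairs with similarity >= threshold.

    scored_pairs:  list of ((gene1, gene2), similarity) tuples
    gold_standard: set of frozensets, one per known related pair
    Returns None when no pair reaches the threshold.
    """
    predicted = [frozenset(pair) for pair, score in scored_pairs
                 if score >= threshold]
    if not predicted:
        return None
    true_positives = sum(1 for pair in predicted if pair in gold_standard)
    return true_positives / len(predicted)

# Toy data (made up): four scored pairs, two of them in the gold standard.
scored = [
    (("g1", "g2"), 0.9),
    (("g1", "g3"), 0.8),
    (("g2", "g4"), 0.6),
    (("g3", "g4"), 0.3),
]
gold = {frozenset(("g1", "g2")), frozenset(("g2", "g4"))}

for t in (0.85, 0.5, 0.2):
    print(t, ppv_at_threshold(scored, gold, t))
```

Storing each pair as a frozenset makes the comparison independent of the order in which the two genes are listed.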
Although integrating several data sources increases the PPV, the result relates only to Gene Ontology, and the gain from adding each type of information remains unclear. The work in [20] is based on the assumption that if two genes are related in an information dataset, they should share a common GO term. In other words, the weights show how much each type of information contributes to recovering Gene Ontology relationships, but they do not make clear how much each contributes to recovering relationships of its own type. Another study [3] assessed the limits of data integration for predicting protein-protein relationships. Using a Bayesian classifier, the relationship between the number of features (prior biological information) and the improvement in predictive power was evaluated. An improvement in accuracy and coverage was achieved by integrating the four strongest features: (a) functional similarity based on GO, (b) functional similarity based on the MIPS Functional Catalog database, (c) coessentiality and (d) correlation between expression data. The MIPS Functional Catalog is a database of protein function provided by the Munich Information Center for Protein Sequences (MIPS) [23]. The MIPS terms are arranged hierarchically according to classes (e.g. 01:Metabolism, 01:04 phosphate metabolism, 01:04:04 regulation of phosphate metabolism, 02.Energy, etc.). Data from GO and MIPS are important because proteins that belong to the same biological process are more likely to interact [3].
Statistical dependence between features (types of biological information) was also analyzed. The absence of statistical dependence between the available features was another important finding.
Although the improved performance in predicting protein-protein interactions is significant, some other important aspects must be highlighted: (a) the prediction was not done in the context of the inference of GNs from expression data, (b) the gain from each type of information is unclear, and (c) protein-protein data itself was not evaluated as prior information.
Another important aspect is that several data integration approaches are based on the correlation between the expression profiles of gene pairs in biological data (such as protein-protein networks), as in [20]. However, it has been shown that, in contrast to prokaryotic organisms, in eukaryotes the correlation of related pairs is similar to that of random networks [24]. Also, [25] showed that transient protein complexes correlate only weakly with expression profiles. Another work [26] observed that, although co-expression and protein-protein interaction are related in S. cerevisiae and bacteriophage T7, self-interactions, which represent a considerable proportion of the samples, cannot be tested by correlation between expression profiles. In this work we address this problem by using an approach based on mean conditional entropy. Previous works [9] in data integration have contributed to decreasing the estimation error. However, some questions are still unclear:
What is the gain of adding different types of biological information to the inference of GNs?
In a previous work [28], we described the gain of adding protein-protein data to recover the GN of Plasmodium falciparum. Also, added data of distinct types are normally evaluated against a common gold standard related to a single feature (functional, physical contact, etc.). The gain in inferring data of the same type as the data added is not clear. For example, what is the gain of adding protein-protein information to recover a protein-protein network?
An important aspect of adding biological information is the heterogeneity of data.
For example, protein-protein networks obtained from yeast two-hybrid (Y2H) experiments are in vitro verifications of physical interaction. Rosetta Stone fusion pairs are predictions of protein-protein interactions obtained indirectly through sequence comparison. KEGG data can indicate whether two components interact in the same pathway. Gene Ontology (GO) information is useful for obtaining the cellular localization, biological process or molecular function. In general, it does not make clear whether genes sharing the same GO term interact directly with each other. Thus, we propose classifying biological information data into two types: (a) direct physical interaction data (e.g. protein-protein, protein-DNA) and (b) feature data (e.g. biological process, cellular localization, the signaling pathway in which a gene participates).
Thus, another important question is: what is the relative gain of distinct types of biological information? It is not clear whether they behave in the same way.
In this work we developed an algorithm to integrate biological information data for the inference of GNs, and evaluated and compared the relative gain of four biological information datasets for the organism P. falciparum: (a) protein-protein interaction, (b) Rosetta Stone fusion proteins, (c) KEGG and (d) a combined KEGG and GO dataset.
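The mean conditional entropy criterion mentioned earlier in this section can be sketched as follows. This is a minimal illustration that assumes the standard definition, H(Y|X) = sum over x of P(X = x) H(Y | X = x), applied to quantized (discrete) expression values; it is not the algorithm developed in this work, and the function names and toy series are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mean_conditional_entropy(predictor, target):
    """H(target | predictor): entropy of the target within each observed
    predictor state, weighted by how often that state occurs.

    Predictor entries may be tuples to condition on several genes at once.
    Lower values mean the predictor explains the target better.
    """
    groups = defaultdict(list)
    for x, y in zip(predictor, target):
        groups[x].append(y)
    n = len(target)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

# Toy quantized time series (made up): the gene alternates between states.
series = [0, 1, 0, 1, 0, 1]

# Assumption for illustration: a self-interaction is scored by using the
# gene's state at time t to predict its own state at time t + 1.
self_score = mean_conditional_entropy(series[:-1], series[1:])
print(self_score)
```

Unlike a correlation between two profiles, this criterion remains defined when the predictor and the target come from the same gene, which is one way the self-interaction case raised above could be scored.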