This method combines information on the interactions and functions of the proteins that belong to a domain of biological interest (e.g., a cellular organelle or a biological process), with the goal of converting a functionally annotated PPI network into a PG, i.e., a compact and coherently structured representation of dependencies among biological functions. The goal is challenging, as available information about the protein-to-function relations does not guarantee that a protein under examination does indeed participate directly in the annotated function. As edges between functions are based on the PPI among the proteins that these functions annotate, it follows that redundant protein-to-function assignment inevitably produces redundant edges among the corresponding FN. Thus, throughout the study, it has been our main concern to ensure that a direct edge between two FN could be established, only if intermediate functions were unlikely to occur. Otherwise, the resulting PG would be a mere assembly of coupled functions and not a coherent and compact representation of the way functions cooperate in supporting complex biological activities. In addition, a redundant PG would be of limited usefulness for planning the smallest set of experimental interventions that can be made on a function, when one desires to impact on target functions. To achieve compact representations, we took the following considerations into account. First, FN are expected to map onto a PPI network the correspondence between proteins and annotations. Second, such mapping is expected to represent the most extensive coverage of the PPI network with the least degree of overlap between FN, provided that one can exclude the annotations of those proteins that support only indirectly the annotated functions. Third, molecules more typically contribute to biological functions as highly inter-connected (or ‘modular’) assemblies, rather than as unconnected elements [
12]. Within PPI networks, for instance, functional and topological modules display significant overlap [
10,
24]. Thus, based on these considerations, the algorithm we have devised introduces a topologically-driven prioritization that selects only plausible inclusions of a protein into a FN, as quantified by its PMS, i.e., a score that reflects the ability of a FN to discriminate among the topological patterns of the PPI network, to which the protein belongs.
The method has been applied to two cellular components (i.e., the peroxisome and the cell bud) and one biological process (i.e., cell budding) in
S. cerevisiae, which are well characterized domains and thus suitable for validation purposes. On one hand, well characterized causal dependencies among functions (e.g., dependence of peroxisome matrix on membrane assembly) have confirmed that the method specifically highlights important relations. On the other hand, less obvious dependencies (e.g., those among different fission-related mechanisms in peroxisomes and mitochondria) have revealed the heuristic power of this method and its usefulness in formulating testable hypotheses. It should be noted that the peroxisome-centered PPI network has been extended to the first neighbors of the peroxisome core proteins, because we wanted to highlight the wider biological landscape that ideally surrounds the organelle. The inclusion of non-peroxisomal proteins is justified by the observation that almost half of the core proteins are annotated (in the cellular compartment vocabulary of GO) with terms related not only to the peroxisome but also to mitochondrion, ER and nucleus (Additional file
1), the organelles that interact functionally with the peroxisome [
25,
26]. Clearly, some multiple annotations of the same protein simply refer to the existence of sub-cellular distinct (and functionally unrelated) pools, such as the peroxisomal and nuclear pools of the dynamin Dyn2p. Nevertheless, other multiple annotations suggest that a protein may change its sub-cellular location, at least under specific conditions. For instance, Pex11p relocates from the ER to the peroxisome, when peroxisomes are induce to proliferate in response to oleate [
27].
Numerous studies have analyzed the relation between molecules and functions. In particular, one of the major aims of many bioinformatics studies has been to infer the function of uncharacterized genes based on comparisons with characterized genes, such as sequence similarity [
28], co-occurrence in genomic clusters [
29], co-evolution in different species [
30] and co-expression patterns [
31]. Also the PPI have been used to infer the function of uncharacterized proteins, based on the most frequent annotations of their protein interactors [
4,
32-
34]. Like these ‘guilt by association’ methods, also our approach builds on the assumption that proteins often interact mutually to contribute to the same function. Furthermore, all these studies (including ours) deploy a non-directional annotated network as input, sometimes designed ‘functional linkage network’ [
35], in which nodes correspond to molecules, while edges correspond to different types of functional connection between molecules [
36].
Few studies, however, have moved beyond the immediate aim of inferring protein-function binary associations to the ultimate aim of inferring structured dependencies among functions, which can be displayed in a markov graph of connected functions. An earlier study established a link between a given pair of functions, anytime a PPI had been detected between the proteins annotated with the two functions [
4]. A more recent study has elaborated on this method, by selecting statistically enriched pairs of functions, as defined by the probability that two sets of proteins (annotated with two distinct functions) establish more PPI between themselves than it can be expected by chance [
5]. Our method differs considerably from these earlier studies, as it retains any annotation that is shared by two interacting proteins within the PPI network, leaving to the topological analysis the task to define the FN, whose relationships may satisfy the markov property. This way, the selected and prioritized protein assignments to the FN are expected to refer truly to the functions that are directly impacted by the protein. As an example of how our approach differs from the previous studies, consider a 3-node sub-graph composed of FN related to the GO functional annotations of protein import (GO:0017038), PMP insertion into the peroxisome membrane (GO:0045046) and peroxisome organization (GO:0007031). While a previous study linked the three nodes with three edges [
5], our algorithm establishes only two edges (one between nodes 17038 and 45046 and another between nodes 45046 and 7031), but not the edge between nodes 17038 and 7031, because it would violate the markov assumption, given that protein import (node 17038) contributes to peroxisome organization (node 7031) only indirectly, i.e., by enabling PMP insertion (node 45046). Furthermore, our study also contrasts with an important implication of the earlier study, which advocated re-engineering the GO database by complementing the GO hierarchy with the links inferred from the functional linkage graphs [
5]. Alternatively, we propose to adapt the semantics of the GO annotations to the level of detail that characterizes the domain of interest, mostly based on the real protein content of the FN. A more detailed comparison of graphs that represent similar domains but are obtained with different methods can be found in Additional file
8.
Clearly, our method can be applied to different domains of biological interest in different model organisms, even though some words of caution should be added. First and foremost, inaccurate and/or defective datasets of protein interactions and functions will certainly affect the quality of the PG representation. In our experience with S. cerevisiae, even after revising carefully the PG, we found just a limited number of false positives and false negatives. Nevertheless, we cannot exclude that results might be less accurate, should the algorithm be applied to other organisms that are not so extensively and accurately characterized as the budding yeast. Second, other features of the starting PPI network should be taken into account, including the choice of a cell type-specific repertoire of proteins (in the case of multi-cellular organisms) and the size of the PPI (to ensure computational tractability). Third, it should be pointed out that labor-intensive analysis is required to verify the consistency of the PG with current biological knowledge and to define the causal directionality of its undirected edges. It should also be taken into account that just a minor fraction of the physical interactions that are reported in the PPI databases have an annotation of biochemical directionality (e.g., kinase-dependent phosphorylation of a substrate). For instance, out of 106,230 PPI reported as ‘physical interactions’ in the 3.1.83 release of BIOGRID, more than 94% of PPI (100,388 PPI) has no annotation. Only less than 6% of the remaining PPI has the annotation ‘phosphorylation’ or other types of modifications. Furthermore, in many of these cases, the annotated modification has been detected in a biochemical assay without functional characterization. Finally, many PPI refer to physical interactions that are non-directional in nature (e.g., interactions among structural proteins).
In conclusion, with all the caveats related to incomplete knowledge, the herein reported data suggest that, even when the PPI structures that underlie a function are only partially known, it is nevertheless possible to regard functions as black boxes with only known inputs and outputs, to obtain non-redundant graphical representations of complex biological systems. In addition, our efforts indicate that the graph we obtain can be helpful in carefully designing experimental studies, provided that specific manipulation and measurement of the portrayed functions are feasible.