Integration of orthology, GO annotation and gene expression
The basis of the prediction of protein-protein interactions in Arabidopsis thaliana performed in this study resides in the detection of interologs. Interologs are defined as protein-protein interactions that are conserved between two species and that can be detected through the identification of the orthologs in the target organism of the proteins known to interact in the source organism (see Fig. ). Applying this approach, we assume that protein-protein interactions occurring in yeast, C. elegans, Drosophila and/or human, are conserved and consequently occur in Arabidopsis as well.
Figure 1 Interolog detection and filtering. Proteins in similar color belong to the same orthologous group as identified by OrthoMCL. Connecting lines indicate predicted interactions between proteins. First, all possible combinations between proteins in the two (more ...)
When identifying interologs, we need to deal with the high number of duplicated genes in the target organism Arabidopsis thaliana
(see Methods; Fig. ). Inferring all combinations of co-orthologous proteins can cause a high number of false positive predictions. To augment the confidence in the predicted interactions, we integrate data from several so-called genomic features, namely the GO biological process similarity, the GO cellular component similarity and the Pearson correlation coefficient (see Methods). Accordingly, three main conditions are set up, namely the co-localization of interacting proteins, a similar biological role for interacting proteins and the co-expression of genes encoding interacting proteins. These conditions were verified by comparing these different properties in positive and negative protein pairs (see Fig. ). As positives, protein-protein interactions experimentally shown to interact are considered, while a negative set of protein pairs was built by making random combinations between all proteins, which is an approximation of true negative protein pairs (see Methods for details). One has to take into account that the size of the positive dataset can be relatively low and as a consequence, caution in the assessment of false positive rates needs to be taken (see Table ; Additional file 1
). For each genomic feature, the frequency distributions of positive (experimentally identified interactions) and negative (random gene pairs) datasets were compared (see Fig. ) to choose a threshold value for the genomic features (see methods).
Assessment of genomic features. Frequency distributions of A) GO Cellular Component similarity scores, B) GO Biological Process similarity scores and C) Pearson Correlation Coefficients for real and random protein-protein interactions.
Experimentally identified proteins and protein-protein interactions
To decide if two proteins co-localize and/or function in the same biological process, both the GO cellular component (CC) and the GO biological process (BP) annotation of the interacting proteins were evaluated. To measure the similarity between the GO annotations of two proteins, we calculated the maximum depth of the common ancestor of all pairs of GO terms assigned to both proteins (see Methods; Additional file 2
; Fig. ). Similarly, the co-expression in positive and negative datasets was investigated by calculating the Pearson correlation coefficient, describing the global similarity between expression profiles for interacting and non-interacting proteins (see Fig. ). When inspecting the distributions, a smaller difference between positive and negative datasets can be observed for expression correlation than for GO annotations. Therefore, it is impossible to define a threshold in Pearson correlation coefficient that identifies a considerable number of true positives while having few false positives (Fig. ). Different genomic features need to be combined to maximize the coverage of the prediction (see Additional file 3
). Through this combination of genomic features, thresholds may be decreased while maintaining an acceptable number of false positives (see Additional file 1
for an estimation of the false positive rates). The use of a threshold of 5 for the GO biological process similarity score in combination with a threshold of 0.3 for the Pearson correlation coefficient is acceptable (see Additional file 3
; Additional file 1
). Through this combination a lower threshold can be chosen for both genomic features. Nevertheless, one should keep in mind that the co-expression measure takes into account expression data from 86 microarray experiments. It can be expected that the number and type of experiments influences the overall Pearson correlation coefficients for all pairs of genes. Networks of experimentally identified protein-protein interactions in yeast and human, although containing false positive interactions as well, show an average Pearson correlation coefficient of 0.3 and 0.241 respectively, supporting our choice of the Pearson correlation coefficient threshold [36
]. In addition, the GO biological process similarity score (threshold score of 8) and GO cellular component similarity score (threshold score of 5) were also used independently. These relatively high thresholds result in inferences based on very detailed GO annotations, mostly not inferred based on computational analyses (ISS)). The depth of GO terms such as ubiquitin-dependent protein catabolic process (BP ≥ 8) or peroxisome (CC ≥ 5) is high, pointing to specific processes or localizations that are mostly assigned based on experimental evidence (e.g. inferred from direct assay
). Systematic assessment of the evidence codes of the GO terms used for the interaction filtering (BP8 and CC5) shows that a minority of terms are inferred from sequence or structural similarity (ISS) (see Additional file 4
). GO terms used in the BP5+PCC0.3 filter are more often inferred from sequence or structural similarity. However, as in this case the BP score is combined with the PCC, we believe that these ISS terms do not affect the quality of our predictions considerably. Therefore, we decided to include ISS-based GO annotations throughout our analysis and only removed IPI, ND, NR, NAS and IEA-based annotations (see methods). Finally, we added up the different sets of predicted protein-protein interactions resulting from the application of different thresholds and different source organisms. We opted to add up the different sets rather than to take the intersection. On the one hand, our randomization studies show that the individual filters result in acceptable numbers of false positives. On the other hand, we encompass the relative high amount of missing functional data in cases where, for instance, BP annotation but no CC annotation is available. However, using this approach it is possible that a protein-protein interaction passed the BP8 filter, while it does not pass the CC5 filter (contradictory CC information – annotation for BP and CC available). Our systematic assessment shows that this possibility occurs for only a minority of interactions for BP8, while more often for BP5 which is used in combination with PCC0.03 (see Additional file 5
). Moreover, only half of these contradictory CC annotations is inferred from experimental evidence (see Additional file 5
In summary, 52.6% of the experimentally identified protein-protein interactions (767 out of the 1457 interactions) meet the conditions of the genomic features (GO biological process, GO cellular component, co-expression) (see Table and Figs. and Additional file 3
), without taking into account orthology (see next paragraph).
Prediction of protein-protein interactions in Arabidopsis thaliana
Starting from the protein-protein interactions in the source organisms yeast, C. elegans, Drosophila and human, protein-protein interactions were predicted in Arabidopsis using the above-mentioned genomic features. We downloaded the OrthoMCL database containing orthologous groups (OG) of 87 species. From this dataset, we extracted the evolutionary relationships between Arabidopsis, yeast, C. elegans, Drosophila and human genes. The approach taken to build this database has the advantage that in-paralogs (duplicates arisen after speciation) and thus orthologous groups rather than one-to-one orthologs can be identified. Employing these orthologous relationships between yeast, C. elegans, Drosophila and human genes on the one hand, and Arabidopsis genes on the other hand, interologs were detected (see Methods and Fig. for details). The original numbers of interologs are drastically reduced when applying the genomic feature filters (see Table ). The set of protein-protein interactions that remains after applying the genomic feature filters will further be referred to as the "filtered interactome" and accounts for 18,674 protein-protein interactions among 2233 proteins (see methods on the reliability of the predictions). However, we do not discard all interologs that do not meet the requirements of the genomic features. As we would like to study the idenfied interologs in an evolutionary context, we choose to extend the filtered interactome. For every combination of orthologous groups present in the filtered interactome, we take all other (non-filtered) interactions to build the "predicted interactome" (see Fig. ). This predicted interactome accounts for 51,885 protein-protein interactions among 3014 proteins (see Table ). The filtered (see Additional file 6
) and predicted interactome (see Additional file 7
) are provided as supplementary material at http://bioinformatics.psb.ugent.be/supplementary_data/stbod/athPPI/
Predicted protein-protein interactions
The protein-protein interactions detected in the filtered and predicted interactome were compared to experimentally shown and previously predicted protein-protein interactions. Fig. and Additional file 8
depict the overlap between the different datasets. Overall, a small overlap is found between our filtered interactome and the experimentally shown interactions reported by the TAIR, MINT and IntAct databases (see Fig. ; Additional file 8
, panel A1). Similarly, a small overlap of the experimentally identified interactions with the previously predicted interactions of Geisler-Lee et al. [45
] and Cui et al. [46
] is observed (see Fig. ; Additional file 8
, panel A2). Similar to our approach, Geisler-Lee et al. [45
] identified interologs using worm, fly, human and yeast as source organisms. In this study, a confidence value is calculated taking into account the number of times a protein-protein interaction is found as interolog and/or supported by genomic features such as co-expression based on the Pearson correlation coefficient and localization based on the Arabidopsis Subcellular Database (SUBA) [47
]. The predicted protein-protein interaction dataset of Cui et al. [46
] was constructed based on a Naive Bayesian Classifier. This method integrates different predictive data sources such as ortholog information, GO biological process, co-expression, gene fusion, gene neighborhood, phylogenetic profiles and domain architecture, to build a model to predict novel protein-protein interactions. A comparison of our filtered and predicted interactome with these two sets of previously predicted protein-protein interactions is shown in Additional file 8
(panel B1 and B2). Although a similar approach was taken by Geisler-Lee et al. [45
], a considerable number of new interactions (17624) as well as experimentally identified interactions (75) not recovered by Geisler-Lee et al. [45
] is found in our study (see Fig. ). Differences between the two approaches that may cause the relatively small overlap are most probably the use of different protein-protein interaction databases for the source organisms (BIND, MIPS, BIOGRID, and DIP were used in Geisler-Lee et al. [45
]) and the use of different confidence measures. These observations corroborate previous reports on the low coverage of current protein-protein interaction datasets and detection strategies.
Figure 3 Overlap between datasets of protein-protein interactions. Overlap between experimentally identified protein-protein interactions available in public databases (see methods) and predicted sets from this study and the Geisler-Lee et al. study . Filtered (more ...)
Delineation of protein clusters in the predicted interactome
In an attempt to reveal the functional repertoire of the predicted interactome, we have delineated highly interconnected regions in the protein-protein interaction network (hereafter called protein clusters). In addition, we tried to assign a function to the identified protein clusters. Finally, we investigated the evolutionary conservation of the protein clusters.
Identification of protein clusters is performed using the CAST clustering algorithm (see Methods). This clustering procedure employs the connectivity of the proteins. Overall, 1802 proteins taking part in 16,498 interactions could be identified in protein clusters, accounting for the majority of originally identified interactions (see Additional file 9
). The biological roles of the identified protein clusters were studied through identification of overrepresented GO categories (biological process, molecular function and cellular component) (see Methods). To judge the validity of the protein clusters, we inspected clusters involved in particular biological processes together. Using this approach of clustering and subsequent GO enrichment analysis, we can elegantly pinpoint protein complexes, the relationships between them and the encompassing biological processes they are involved in (see Supplementary Data site and more details below).
As could be expected, well-conserved proteins and functions, such as those involved in transcription, translation, and proteolysis, are overrepresented. Typically, proteasome and ribosomal proteins are identified as highly connected (CAST clusters 1 and 3; see Supplementary data site). A number of protein-protein interaction networks involved in particular processes such as organelle organization and biogenesis, lipid metabolism, and ATP binding as well as the biological processes mentioned below can be viewed through our Supplementary data site. In addition, a considerable number of protein-protein interactions with a role in transmembrane transport, membrane receptor activity and vesicle trafficking were detected as previously reported by Geisler-Lee et al. [45
] (see Supplementary data-Transmembrane activity). For example, a link was found between protein clusters of interacting VAMPs (Vesicle associated membrane proteins) and SNAREs (CAST clusters 10 and 70) with vacuolar H+
pumping ATPases (CAST clusters 9, 34 and 120) and cation/H+
exchangers (CAST clusters 100, 145 and 185). Although not connected to the above-mentioned clusters, a protein cluster containing components of the translocase inner membrane complex (CAST cluster 13) associated with carrier proteins (CAST cluster 94), was retrieved as well. Several links between cell cycle control, protein degradation and related processes are captured in the protein clusters enriched for GO categories containing 'cell cycle' (CAST clusters 3, 12, 18 and 61; see Supplementary data). Whereas ubiquitin-mediated proteolysis regulates the activity of cyclins during the progression through the different phases of the cell cycle, B-type cyclins interact with several microtubule-related components during cytokinesis. Similarly, A-type cyclins, DNA replication proteins (CDC, MCM and ORC subunits) and RBR/WEE proteins are associated during the G1/S transition [48
] (see Supplementary data - Cell cycle + DNA repair + DNA replication). Overall, we observe a high similarity in expression patterns for the genes encoding proteins in the 'cell cycle' clusters most probably due to the tight regulation of proteins involved in DNA replication (green edges, see Fig. , Supplementary data). Moreover, the interactions in these clusters are often supported by the GO biological process measure (solid edges, see Supplementary data). These two properties occur mainly within protein clusters. In contrast, genes encoding interacting proteins involved in transmembrane activity, described above, show an overall lower similarity in expression patterns, even within protein clusters (yellow edges, see Supplementary data). This difference in the degree of co-expression is probably due to the more specific expression patterns of transmembrane activity genes resulting in transient interactions that cannot be identified using a global co-expression measure. However, most of the interactions in the transmembrane activity network are supported through the GO biological process and/or GO cellular component feature and thereby were identified through our approach (solid and thick edges, see Supplementary data). Reasonably, localization is very important in transmembrane activity-related processes such as ion transmembrane transport and protein excretion through vesicular exocytosis.
Protein-protein interactions involved in 'DNA replication'. Green edges represent PCC values > 0.3 (see Figure 5 for more details). For grey edges, no co-expression information is available.
More detailed analysis of the 'response to' interaction network identified many proteins functioning in, for instance, response to DNA damage or response to oxidative stress caused by the accumulation of reactive oxygen species (see Supplementary data site). Proteins active in these stress responses are DNAJ heat shock proteins, calcium-dependent and MAP kinases (CAST cluster 1), DNA repair proteins (RAD1, RAD5, RAD50, RAD51, RAD54 – CAST 18), superoxide dismutases (CAST 49), glutathione peroxidases (CAST 74) and glutathione S-transferases (CAST 15). These proteins are involved in very diverse biological processes, such as response to heat, response to toxins and response to chemical stimulus, which is reflected in the dashed edges in the network (see Supplementary data). In order to verify that these predicted interactions also reflect actual conserved biological stress responses, we compared the expression patterns of all 500 Arabidopsis genes in these 'response to' protein clusters using a recently compiled comparative stress response matrix [49
]. Whereas 11% of all stress-induced Arabidopsis gene families shows a conserved stress response in human or yeast (44/390 families in the complete matrix), the 'response to' set (representing 137 gene families) shows a more than five-fold enrichment for conserved stress response (16/25 gene families with responsive Arabidopsis genes are also responsive in human or yeast). These findings confirm that several components as well as protein-protein interactions in Arabidopsis indeed function in diverse responses and that this response is evolutionary conserved in eukaryotes.
Besides interactions functioning in well-conserved biological processes, we could also identify the recruitment of well-conserved proteins and protein-protein interactions in plant-specific processes. In the following examples, the regulatory mechanisms such as chromatin remodeling and actin polymerization are common to different lineages. However, these mechanisms are put into play in highly lineage-specific biological processes, such as seed and trichome development.
In a first example, we predict that the proteins FERTILIZATION INDEPENDENT ENDOSPERM (FIE), CURLY LEAF (CLF), SWINGER (SWN), MULTICOPY SUPPRESSOR OF IRA1 (MSI1), and MEDEA (MEA) interact and that this protein complex functions in flowering and embryonic development (see Fig. and Supplementary data). These proteins are homologous to the Polycomb (PcG), ESC, E(Z), p55, and Su(Z)12 genes in animals and take part in the Polycomb Repressive Complex 2 (PRC2) [50
]. This PRC2 complex mediates the epigenetic control of developmental patterning through chromatin modification. Genetic evidence suggests that there are different PcG complexes in plants, regulating different developmental pathways. In Arabidopsis, the FERTILIZATION INDEPENDENT SEED 2 (FIS2) repression complex is described to be involved in seed and flower development and includes FIE, MEA, MSI1, and FIS2. The VERNALIZATION2 (VRN2) complex consisting of VRN2, FIE, CLF, and SWN is involved in vernalization response. Except for VRN2, we have predicted interactions between all these components. In addition, we have uncovered additional, to our knowledge new interactors such as PHOTOPERIOD-INDEPENDENT EARLY FLOWERING 1 (PIE1), BUSHY (BSH) and SPLAYED (SYD), probable subunits of SWI/SNF chromatin remodeling complexes [51
]. Moreover, we detect interactions with associated proteins such as histone deacytelases and putative homeotic regulators. Interesting to note is that we could also identify the interaction of MSI1 with RETINOBLASTOMA RELATED 1 (RBR1), just recently confirmed by biomolecular fluorescence complementation [54
] (see Fig. ). This complex is overall very well supported by the genomic features used in this study (green, solid and thick edges in Fig. ). Moreover, several of the identified interactions were already shown experimentally (black edges in Fig. ).
Figure 5 Protein-protein interactions involved in plant-specific processes. A. Protein-protein interactions involved in imprinting genes. B. ARP2/3 complex. Edges in black depict previous experimentally identified interactions. For grey edges, no co-expression (more ...)
In a second example, we have predicted interactions between different actin-related proteins (ARPs), constituting the ARP2/3 complex involved in leaf development (see Fig. and Supplementary data). ARP2 or WURM, ARP3 or DISTORTED TRICHOMES 1, ARPC1, ARPC1A, ARPC2A or DISTORTED TRICHOMES 2, ARPC3 and ARPC5 or CROOKED, as well as a number of other non-actin-related proteins, were predicted to interact. The Arp2/3 complex has been shown to be composed of two actin-related proteins (Arps), a seven-bladed beta propeller (ARPC1), and four other subunits (ARPC2-5). Only ARPC4 is missing from our predictions, which is due to the fact that ARPC4 is not present in the input dataset of orthologous relationships. In addition to the ARP2/3 complex, we could identify a complex of three proteins (KLUNKER, GNARLED and ARAC4), involved in leaf development as well. At least two of these proteins of the SCAR complex are thought to regulate the ARP2/3 complex. The ARP2/3 complex functions as a central player in the precise regulation of both the initiation of actin polymerization and the organization of the resulting filaments, common to all organisms used in this study. However, disruption of the ARP2/3 complex has distinct effects in different organisms, going from growth defects in yeast, defects in eye and axon development in worm and fly, migration of cancer cells in human, and epidermal cell shape determination in Arabidopsis (Goley and Welch 2006). For the ARP2/3 protein complex, the power of the co-expression measure was limited. Nevertheless, similarity in localization and biological process showed to be very valuable. This observation points to the complementarity of the different genomic features employed in this study.