Protein phosphorylation catalyzed by kinases plays crucial regulatory roles in intracellular signal transduction. The growing number of experimental phosphorylation sites identified by mass spectrometry-based proteomics has motivated the exploration of networks of protein kinases and their substrates. Manning et al. have identified 518 human kinase genes, which provide a starting point for comprehensive analysis of protein phosphorylation networks. In this study, a knowledgebase is developed to integrate experimentally verified protein phosphorylation data and protein–protein interaction data for constructing the protein kinase–substrate phosphorylation networks in human. A total of 21110 experimentally verified phosphorylation sites within 5092 human proteins are collected. However, only 4138 phosphorylation sites (~20%) are annotated with catalytic kinases in the public domain. To fully investigate how protein kinases regulate intracellular processes, a published kinase-specific phosphorylation site prediction tool, KinasePhos, is incorporated for assigning the potential kinase. The web-based system, RegPhos, lets users input a group of human proteins and then explore the phosphorylation network associated with protein subcellular localization. Additionally, time-coursed microarray expression data are used to represent the degree of similarity in the expression profiles of network members. A case study demonstrates that the proposed scheme not only identifies the correct network of insulin signaling but also detects a novel signaling pathway that may cross-talk with the insulin signaling network. The system is freely available at http://RegPhos.mbc.nctu.edu.tw.
Protein phosphorylation is the most widespread and well-studied post-translational modification in eukaryotic cells. It has been estimated that one-third to one-half of all proteins in a eukaryotic cell are phosphorylated (1). Phosphorylation can regulate almost every property of a protein and is involved in all fundamental cellular processes. In addition, protein phosphorylation catalyzed by kinases plays crucial regulatory roles in intracellular signal transduction, which is carried out by networks of proteins and small molecules that transmit information from the cell surface to the nucleus, where they ultimately effect transcriptional changes (2). A full understanding of the mechanism of intracellular signal transduction remains a major challenge in cellular biology. Mass spectrometry (MS)-based proteomics has enabled the large-scale mapping of in vivo phosphorylation sites (3). Several databases store experimentally verified phosphorylation sites with catalytic kinases, such as Phospho.ELM (4), PhosphoSite (5), UniProtKB/Swiss-Prot (6), Phosphorylation Site Database (7) and PHOSIDA (8). PhosPhAt (9) is a database of phosphorylation sites in Arabidopsis thaliana. PhosphoPOINT (10) provides robust annotation for kinases, their down-stream substrates and their interacting (phospho)-proteins, which should accelerate the functional characterization of kinome-mediated signaling.
Manning et al. (11) have identified 518 human kinase genes, the so-called ‘kinome’, which provide a starting point for comprehensive analysis of protein phosphorylation networks. To explore the protein kinase–substrate phosphorylation networks, the experimentally verified kinase-specific phosphorylation sites can be collected from public resources. However, only 20% of the experimentally verified phosphorylation sites are annotated with catalytic kinases. Recently, with the exponential increase in protein phosphorylation sites identified by MS, many tools have been developed to identify kinase-specific phosphorylation sites, including NetPhosK (12), Scansite 2.0 (13), GPS (14,15), PPSP (16) and KinasePhos (17–19). A summary of the previously developed phosphorylation site prediction methods is given in Supplementary Table S1. In particular, Linding et al. (20) have proposed an excellent method, NetworKIN, that augments motif-based predictions with the network context of kinases and phosphoproteins.
Although the proposed resources can be utilized to construct the phosphorylation network between kinase and substrate proteins, the experimental data need to be combined by systems biology analysis, which translates the separate, large-scale datasets into signaling networks (21). Many studies have been proposed to model signaling networks using various approaches (22–26). Additionally, Steffen et al. (2) have developed a computational approach for generating static models of signal transduction networks. It utilizes protein-interaction maps generated from large-scale two-hybrid screens and DNA microarrays expression profiles. However, it is still insufficient to discover signaling networks in a gene group that have similar microarray expression profiles. To fully investigate how protein kinases regulate the intracellular processes, it is necessary to accurately identify the catalytic kinases for phosphoproteins. In this study, a knowledgebase named RegPhos is developed to integrate experimentally verified protein phosphorylation data and protein–protein interaction data for constructing the protein kinase–substrate phosphorylation networks in human. A graph searching algorithm, Breadth-first search (BFS) (27), is applied to explore the intracellular phosphorylation network starting from receptor kinases to transcription factors, associated with the information of protein subcellular localization. Supplementary Figure S1 demonstrates the concept of RegPhos. This effective system can let users input a group of human proteins; consequently, the phosphorylation network associated with the protein subcellular localization can be explored.
For the phosphoproteins without annotated catalytic kinases, KinasePhos (17–19) is combined with protein associations for assigning the potential kinase. A case study demonstrates that RegPhos not only identifies the correct network of insulin signaling but also detects a novel signaling pathway that may cross-talk with the insulin signaling network. Additionally, time-coursed microarray expression data are used to represent the degree of similarity in the expression profiles of network members.
The system flow of RegPhos is shown in Figure 1, mainly including the collection of experimentally verified phosphorylation sites, identification of experimentally confirmed kinase–substrate interactions and construction of intracellular phosphorylation networks. To fully investigate how protein kinases regulate the intracellular processes, a published method, KinasePhos (17–19), is combined with protein associations for identifying kinase-specific phosphorylation sites. Time-coursed microarray expression data is then used to validate the degree of similarity in the expression profiles of network members.
The experimentally verified phosphorylation sites are extracted from dbPTM (28), which has integrated version 8.0 of Phospho.ELM (4), release 55.0 of UniProtKB/Swiss-Prot (29) and version 1.0 of PHOSIDA (8). As shown in Table 1, Phospho.ELM, Swiss-Prot and PHOSIDA contain 21542, 24628 and 6600 experimentally verified phosphorylation sites within 6520, 8606 and 2244 phosphoproteins, respectively. Additionally, the Human Protein Reference Database (HPRD) (30), which integrates a wealth of information relevant to the function of human proteins in health and disease, is integrated in this work. Release 7.0 of HPRD contains a total of 16972 PTMs within 2830 protein entries, of which 7438 are phosphorylation sites within 1774 proteins. Furthermore, data pertaining to thousands of protein–protein interactions, post-translational modifications, enzyme/substrate relationships, disease associations, tissue expression and subcellular localization were extracted from the literature for a non-redundant set of 25661 human proteins. Because this study aims to construct the human phosphorylation network, the phosphorylation sites collected in human proteins are listed separately in Table 1. After removing the redundant data among these databases, the numbers of human phosphorylation sites and phosphoproteins are 21110 and 5092, respectively.
The human kinase annotations extracted from KinBase (11) are used to unify the kinase names among the external phosphorylation site databases, which contain various names for the same kinase. To unify the heterogeneous data of kinases and phosphoproteins, the kinase names in KinBase and the phosphoproteins in public resources are both mapped to the UniProtKB/Swiss-Prot ID and accession number. Following the classification of Manning et al. (11), the 518 kinases are categorized by their annotated family or subfamily, giving a total of 221 kinase families. The 518 kinases are the major nodes in the construction of human phosphorylation networks. Several representative kinase families are listed in Supplementary Table S2; for instance, the family of protein kinase B (PKB) consists of three kinase members: AKT1, AKT2 and AKT3. With the integration of experimental phosphorylation sites from public resources, a total of 89 phosphorylation sites in 63 human phosphoproteins are catalyzed by the PKB kinase family. The knowledgebase contains 21110 experimentally verified phosphorylation sites within 5092 human proteins, of which 4138 phosphorylation sites (~20%) are annotated with catalytic kinases. From the annotations of these 4138 experimentally confirmed kinase-specific phosphorylation sites, a total of 1306 experimental kinase–substrate interactions are identified.
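The step from site-level annotations to kinase–substrate interactions can be illustrated with a minimal Python sketch. It assumes that each annotation is a (kinase, substrate, site) record and that name unification is a simple lookup; the function name, the alias table and the toy records are hypothetical and stand in for the KinBase and UniProtKB/Swiss-Prot mapping described above.

```python
def kinase_substrate_pairs(site_annotations, aliases):
    """Collapse site-level kinase annotations into unique kinase-substrate pairs."""
    pairs = set()
    for kinase, substrate, _site in site_annotations:
        # Map every kinase synonym to one canonical name before deduplicating.
        canonical = aliases.get(kinase, kinase)
        pairs.add((canonical, substrate))
    return pairs


# Hypothetical alias table and toy records, for illustration only.
aliases = {"InsR": "INSR", "Akt1": "AKT1"}
site_annotations = [
    ("InsR", "IRS1", "Y1229"),
    ("INSR", "IRS1", "Y1229"),   # same site reported under another kinase name
    ("IGF1R", "IRS1", "Y46"),
    ("IGF1R", "IRS1", "Y896"),   # second site, same kinase-substrate pair
    ("Akt1", "IRS1", "S1145"),
]
pairs = kinase_substrate_pairs(site_annotations, aliases)
print(sorted(pairs))
```

Because several sites can share one kinase–substrate pair, the number of interactions is smaller than the number of annotated sites, which is consistent with 4138 sites yielding 1306 interactions.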
Manning et al. (11) have identified 518 human kinase genes, which provide a starting point for investigating protein phosphorylation networks. With the identification of experimentally confirmed kinase–substrate interactions, the intracellular phosphorylation networks can be reconstructed. A graph-based method is adopted to formalize the construction of the intracellular phosphorylation network as a path search problem in graph theory. The intracellular protein phosphorylation networks are represented as a directed graph G=(V, E), where x, y ∈ V and (x, y) ∈ E. Let x and y represent kinase and substrate proteins, respectively, and let (x, y) ∈ E represent a phosphorylation interaction in which kinase x phosphorylates substrate y. However, the intracellular phosphorylation networks (signaling networks) contain not only kinase cascades or kinase–substrate interactions, but also protein–protein interactions or protein complexes, as in the insulin signaling network (31). To make the construction of signaling networks feasible, the experimental protein–protein interactions and protein complexes in human are integrated from DIP (32,33), MINT (34), IntAct (35) and HPRD (30), as shown in Supplementary Table S3. In this work, V refers to all human proteins in UniProtKB (36) and E refers to all experimental interactions in the knowledgebase, including experimentally verified kinase–substrate interactions and experimental protein–protein interactions.
Moreover, the cellular localization of proteins is used to constrain the search of the phosphorylation network. Supplementary Table S4 lists the public databases of protein subcellular localization, including LOCATE (37), DBSubLoc (38), Organelle DB (39) and PSORTdb (40). According to the annotations of these localization databases, 84 cell membrane-associated kinases serve as the start points of the phosphorylation networks. According to TRANSFAC version 11.0 (41), there are 1364 transcription factors in human. To identify the phosphorylation networks starting from a membrane receptor and ending at a transcription factor (TF) in the nucleus, the graph-based definition can be refined as follows: given a directed weighted graph G=(V, E) with n nodes and m edges, a set S of start nodes (receptors) and a set T of end nodes (TFs), find, for each node s in S, an acyclic path p=(s, c1,…, ck, t) of length k that starts at s, passes through cytoplasmic proteins c1,…, ck and ends at a node t in T (Supplementary Figure S2). A graph searching algorithm, BFS (27), is applied to explore the intracellular phosphorylation network associated with the information of protein subcellular localization. BFS is one of the basic schemes for searching a subgraph or a path in a graph. Given a graph G=(V, E), where V represents the set of proteins and E is the set of physical interactions between proteins, and a distinguished source vertex s, BFS systematically explores the edges of G to discover every vertex that is reachable from s. We restrict attention to simple paths in which the order of occurrence of proteins is constrained, with a defined path length of 8 (2).
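The constrained path search can be sketched in Python as a breadth-first enumeration of simple paths. This is a minimal illustration and not the RegPhos implementation: the function name, the adjacency-dictionary input and the toy edges are assumptions, edge weights are omitted for brevity, and the path length of 8 is interpreted here as a maximum number of edges.

```python
from collections import deque


def find_signaling_paths(edges, receptors, tfs, cytoplasmic, max_edges=8):
    """Enumerate simple paths receptor -> cytoplasmic proteins -> TF by BFS."""
    paths = []
    # Each queue entry is a partial path that starts at a receptor.
    queue = deque((s,) for s in sorted(receptors))
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_edges:
            continue  # extending would exceed the allowed path length
        for nxt in sorted(edges.get(path[-1], ())):
            if nxt in path:
                continue  # keep the path acyclic
            if nxt in tfs:
                paths.append(path + (nxt,))   # a path ends at the first TF
            elif nxt in cytoplasmic:
                queue.append(path + (nxt,))   # intermediates must be cytoplasmic
    return paths


# Toy edges loosely following the insulin signaling example (illustrative only).
edges = {
    "INSR": ["IRS1"],
    "IRS1": ["GRB2"],
    "GRB2": ["SOS1", "IRS1"],   # back edge, skipped by the acyclic check
    "SOS1": ["HRAS"],
    "HRAS": ["RAF1"],
    "RAF1": ["MEK"],
    "MEK": ["ERK1"],
    "ERK1": ["ELK1"],
}
receptors = {"INSR"}
tfs = {"ELK1"}
cytoplasmic = {"IRS1", "GRB2", "SOS1", "HRAS", "RAF1", "MEK", "ERK1"}

paths = find_signaling_paths(edges, receptors, tfs, cytoplasmic)
for p in paths:
    print(" > ".join(p))
```

Storing whole partial paths in the queue uses more memory than plain BFS, but it makes the acyclic and localization constraints easy to check at every extension step.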
Systematically exploring the intracellular phosphorylation networks, starting from a membrane receptor and ending at a transcription factor in the nucleus, may produce many false positive networks. Clustering genes with similar profiles into a group is a proven method for grouping functionally related genes (21). Therefore, the identified signaling networks are further examined for the degree of similarity in the expression profiles of network members. The time-coursed gene expression samples from the Affymetrix GeneChip Human Genome U133 Array Set HG-U133A platform (GPL96) (42), which consists of 22283 probe sets for 12678 genes, are used to explore the co-expression of kinase and substrate genes. Gene expression data, including Esophageal cell response to low pH (GSE2144), Lung cancer cell line response to motexafin gadolinium (GSE2189), Cyanobacterial metabolite apratoxin A cytotoxic effect on colon adenocarcinoma cells (GSE2742), Interleukin 13 effect on bronchial cell line (GSE3183), Endotoxin effect on leukocytes (GSE3284), Blood response to various beverages (GSE3846), Androgen receptor modulator effect (GSE4636), Glucocorticoid receptor activation effect on breast cancer cells (GSE4917) and Epidermal growth factor effect on cervical carcinoma cell line (GSE6783), were quantified by the Robust Multichip Average (RMA) algorithm (43). RMA quantification was performed with the justRMA function of the Bioconductor Affy package in the R programming language using raw data (Affymetrix CEL files). The Pearson correlation coefficient is then used to measure the similarity in trend between two expression profiles.
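For reference, the Pearson correlation coefficient between two expression profiles can be computed as in the following Python sketch. The function name and the toy profiles are hypothetical, and returning 0.0 for a flat profile is an assumption made here to avoid division by zero.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


# Toy time-course profiles (hypothetical RMA values at four time points).
kinase_profile = [7.1, 7.9, 8.6, 9.4]
substrate_profile = [5.0, 5.5, 6.1, 6.4]
r = pearson(kinase_profile, substrate_profile)
print(round(r, 3))
```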
After the integration of public phosphorylation resources, most of the experimentally verified phosphorylation sites (~80%) still lack annotation of catalytic kinases. To fully investigate how protein kinases regulate intracellular processes, it is necessary to accurately link the experimental phosphorylation sites to catalytic kinases. Following the approach of NetworKIN (20), a published kinase-specific phosphorylation site prediction tool, KinasePhos (17–19), is combined with protein associations for assigning the potential kinase. The association context for each kinase–substrate pair is investigated using protein–protein interactions, functional associations (physical protein interactions, curated pathways, co-occurrence in literature abstracts, mRNA co-expression studies and genomic context) and cellular co-localization. A public SVM library, LibSVM (44), is adopted to train the kinase-specific predictive models, covering more than 100 kinase families, with the encoded amino acid sequences and structural features, such as secondary structure (SS), accessible surface area (ASA) and disorder region (DIS). The radial basis function (RBF) is selected as the kernel function of the SVM. Each model is evaluated for its power to discriminate between phosphorylated and non-phosphorylated sites, based on five-fold cross-validation.
To investigate the possibility of using association context to enhance the identification of kinase-specific substrates, the constructed SVM models are combined with protein associations including protein–protein interactions, functional associations and subcellular localization. This work extracts human protein–protein interactions from DIP (32,33), MINT (34), IntAct (35) and HPRD (30), as shown in Supplementary Table S3. Moreover, to capture the complete biological context of a substrate, the functional associations extracted from the STRING database (45) are integrated. To identify the direct and indirect connections between kinase and substrate, a graph searching algorithm, BFS, is also adopted.
The eukaryotic cell is a composite system internally subdivided into membrane-enveloped compartments that perform particular functions (46). The proteins, which are involved in similar biological functions, are closely located in the same subcellular localization. Therefore, knowing the localization of every protein is important for elucidating its interactions with other molecules and for understanding its biological function. In order to accurately identify the interaction of kinase–substrate phosphorylation, the information of subcellular localization is used to evaluate the co-localization between kinases and phosphoproteins. Supplementary Table S4 shows the list of integrated databases of protein subcellular localization, including LOCATE (37), DBSubLoc (38), Organelle DB (39) and PSORTdb (40).
Logistic regression has been adopted to evaluate the confidence value of protein–protein (kinase–substrate) interactions (25). In this study, a modified version of the method of Sharan et al. (47) is utilized to evaluate the confidence values of the discovered kinase–substrate interactions (see Supplementary Figure S3). The logistic regression model incorporates four sets of variables for a given interaction set: (i) the prediction score of the kinase-specific SVM model, (ii) the observed depth of interaction between kinase and substrate, (iii) the confidence score of the STRING functional association and (iv) the binary (0/1) protein subcellular localization data of the interacting pair. The computationally identified kinase–substrate interactions can be included in the construction of intracellular phosphorylation networks, which may make the discovered networks more feasible. When exploring the protein phosphorylation networks, each edge carries a weight from 0 to 1: the weight is 1 for an experimentally verified kinase–substrate interaction and the logistic regression probability for a computationally identified kinase–substrate interaction.
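The edge weighting can be sketched in Python as follows. This is only an illustration of the logistic form with the four variables listed above: the function names, the coefficient values and the intercept are hypothetical placeholders, not the fitted parameters of the published model.

```python
from math import exp

# Hypothetical coefficients for illustration; the fitted values are not given here.
COEF = {"svm": 2.0, "depth": -0.8, "string": 1.5, "coloc": 1.0}
INTERCEPT = -1.0


def interaction_confidence(svm_score, depth, string_score, colocalized,
                           coef=COEF, intercept=INTERCEPT):
    """Logistic regression probability for a predicted kinase-substrate pair."""
    z = (intercept
         + coef["svm"] * svm_score          # (i) kinase-specific SVM score
         + coef["depth"] * depth            # (ii) interaction depth
         + coef["string"] * string_score    # (iii) STRING confidence score
         + coef["coloc"] * (1 if colocalized else 0))  # (iv) co-localization 0/1
    return 1.0 / (1.0 + exp(-z))


def edge_weight(experimentally_verified, **features):
    """Weight 1 for verified edges, logistic probability for predicted edges."""
    if experimentally_verified:
        return 1.0
    return interaction_confidence(**features)


predicted = edge_weight(False, svm_score=0.9, depth=1,
                        string_score=0.7, colocalized=True)
verified = edge_weight(True)
print(round(predicted, 3), verified)
```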
The aim of this work is to develop an effective system, namely RegPhos, for exploring the protein kinase–substrate phosphorylation networks in human. The information of subcellular localization is utilized to construct the intracellular phosphorylation network starting from membrane receptor to transcription factor in nucleus. In order to enhance the identification of kinase–substrate interactions, the protein associations (protein–protein interaction, functional association and subcellular localization) between kinases and phosphoproteins are carefully investigated.
From the annotations of 4138 experimentally confirmed kinase-specific phosphorylation sites in human, a total of 1306 experimental kinase–substrate interactions are identified. As presented in Supplementary Figure S4, 1039 of these kinase–substrate pairs have been annotated as protein–protein interactions in the collection of protein interactions from the DIP (32,33), MINT (34), IntAct (35) and HPRD (30) databases. According to the annotations in the four integrated interaction databases, a total of 1801 phosphoproteins interact directly with 430 human kinases. Furthermore, the indirect links between kinases and their substrates are also taken into account. Such non-obvious relationships would be very difficult to predict by manually inspecting the available sequence motifs. To investigate the interacting distance of indirect connections between kinases and substrates, the number of substrates interacting with a specific kinase family is counted at each interacting distance. Table 2 lists the numbers of interacting substrates of the PKA, PKC, CK2, CDK, Src, EGFR and INSR families at various interacting distances. For instance, the PKA family, consisting of the PKACa, PKACb and PKACg kinases, has 123 (63%) directly interacting substrates; about 37% of PKA-specific substrates are indirectly connected to PKA kinases. Based on the statistics of interacting distance between kinases and their substrates, most of the substrates (~95%) are connected to kinases within a distance of three interacting nodes (proteins). Both direct and indirect protein associations are adopted to help the identification of kinase–substrate interactions.
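The interacting distance between a kinase and its substrates can be obtained with a breadth-first search over the protein–protein interaction graph, as in the Python sketch below. The function name and the toy interaction list are assumptions; the chain PKCE–BRAF–SFN–IRS1 only mirrors an indirect link of distance three, and the order of the two intermediate proteins is assumed.

```python
from collections import Counter, deque


def interaction_distances(ppi, source):
    """Shortest interaction distance from `source` to every reachable protein."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in ppi.get(u, ()):
            if v not in dist:        # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def undirected(pairs):
    """Build a symmetric adjacency dictionary from interaction pairs."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Toy interactions for illustration only.
ppi = undirected([("PKCE", "BRAF"), ("BRAF", "SFN"), ("SFN", "IRS1"),
                  ("PKCE", "RAF1")])
dist = interaction_distances(ppi, "PKCE")

# Count hypothetical substrates of the kinase at each interacting distance.
substrates = ["IRS1", "RAF1"]
by_distance = Counter(dist[s] for s in substrates if s in dist)
print(dict(by_distance))
```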
To simplify the categorization of subcellular localization for kinases and substrates, substrates are classified mainly as nuclear or cytoplasmic. We mapped localizations from UniProtKB/Swiss-Prot to the kinase-specific substrates, which resulted in 3863 phosphoproteins that are described as localizing to either the cytoplasm or the nucleus. The statistics of the substrate localization preference of kinase families are listed in Table 3. Statistically significant (P<0.05) localization preferences of kinase families are marked in bold. Based on these statistics, we found 33 kinase groups that show a statistically significant preference for either cytoplasmic or nuclear substrates. The kinase groups that are primarily localized in the nucleus (ATM, DNAPK, RSK, CK2, CDK, CDC2 and Aurora) show about a two-fold preference for nuclear over cytoplasmic targets. In contrast, GRK, ROCK, BARK, CaMK2 and CK1 have a strong preference for cytoplasmic substrates. The PKA, PKC, PKB, Abl, IKK and MAP2K families are all fairly pleiotropic kinases, which in the phosphorylation network show a slight preference for cytoplasmic substrates. Among the membrane-associated kinase families, EGFR, INSR, JAK, Src, FYN, LCK, LYN and SYK have a high preference for cytoplasmic substrates.
To fully investigate how protein kinases regulate intracellular processes, this work proposes a computational model for assigning the potential kinase to each experimental phosphorylation site that lacks annotation of a catalytic kinase. Following NetworKIN (20), which augments motif-based predictions with the functional association context of kinases and phosphoproteins, we adopt a similar data set to evaluate the performance of the proposed method. Using only the SVM-based model (KinasePhos), the predictive accuracies are 84, 89.6, 91.5 and 81.9% for PKC, CDK, PIKK and INSR, respectively (Supplementary Table S5). The cross-classifying specificities among the PKC, CDK, PIKK and INSR families are listed in Supplementary Table S6. The specificities (Sp) of the CDK, PIKK and INSR sets with respect to the PKC model are 81.9, 89.1 and 83.3%, respectively. Similarly, the cross-specificity values among PKC, CDK, PIKK and INSR are generally higher than 80%. However, the specificity of the INSR model is slightly weak when differentiating PKC substrates from INSR substrates. The higher the specificity in the cross-validation, the fewer the incorrect predictions of phosphorylation sites in other groups. By incorporating contextual information on protein association, the prediction accuracy improves to 84.1, 91.6, 91.9 and 91.9% for PKC, CDK, PIKK and INSR, respectively, because of the improvement in specificity (Supplementary Figure S5). However, there are slight drops in predictive sensitivity. These results highlight the importance of including contextual information in identifying kinase–substrate relationships for experimentally verified phosphorylation sites without annotated catalytic kinases. The computationally identified kinase–substrate interactions can make the construction of intracellular phosphorylation networks more feasible.
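The relationship between sensitivity, specificity and accuracy can be made concrete with a short Python sketch. The confusion-matrix counts below are hypothetical and are chosen only to show how a gain in specificity can outweigh a slight drop in sensitivity; they are not the values behind Supplementary Table S5.

```python
def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # fraction of true sites recovered
    specificity = tn / (tn + fp)            # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts: sequence-based model alone, then with association context.
svm_only = metrics(tp=80, fn=20, tn=85, fp=15)
with_context = metrics(tp=78, fn=22, tn=95, fp=5)

for name, (sn, sp, acc) in [("SVM only", svm_only), ("with context", with_context)]:
    print(f"{name}: Sn={sn:.3f} Sp={sp:.3f} Acc={acc:.3f}")
```

In this toy example, sensitivity falls from 0.80 to 0.78 while specificity rises from 0.85 to 0.95, so accuracy increases from 0.825 to 0.865.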
Insulin receptor substrate 1 (IRS1), which mediates the control of various cellular processes by insulin (48), is used to illustrate the effectiveness of the computational identification of kinase-specific phosphorylation sites. According to the annotations of Phospho.ELM (4) and UniProtKB/Swiss-Prot (29), IRS1 has a total of 32 experimentally verified phosphorylation sites. However, some of these experimental phosphorylation sites lack annotation of catalytic kinases. Based on the trained threshold of the logistic regression probability score in each kinase group, these phosphorylation sites were annotated with potential catalytic kinases. As illustrated in Figure 2, seven kinase-specific phosphorylation sites with their protein associations are identified. For instance, the tyrosine phosphorylation sites ‘Y612’ and ‘Y632’ were potentially catalyzed by Janus kinase 1 (JAK1), through an indirect protein–protein interaction linked by v-erb-b2 erythroblastic leukemia viral oncogene homolog 2 (ErbB2). The tyrosine phosphorylation sites ‘Y46’ and ‘Y896’ were catalyzed by Insulin-like Growth Factor I Receptor (IGF1R), with a direct functional association annotated in the STRING database. Phosphoserine ‘S636’ was catalyzed by the Mitogen-Activated Protein Kinase (MAPK) group, and a functional association shows that Mitogen-Activated Protein Kinase 1 (MAPK1 or Erk2) is directly linked to IRS1. Phosphotyrosine ‘Y1229’ was catalyzed by insulin receptor (InsR), with a direct protein–protein interaction (DIP:429E) in the DIP database. Some phosphorylation sites were assigned to multiple kinases; for example, phosphoserine ‘S1145’ was potentially catalyzed by v-akt murine thymoma viral oncogene homolog 1 (Akt1), with a direct functional association, or by protein kinase C epsilon (PKCe), with an indirect link at a distance of three protein–protein interactions passing through Stratifin (SFN) and B-Raf proto-oncogene serine/threonine-protein kinase (BRAF).
To facilitate the investigation of protein kinases and their substrates, a web-based system, named RegPhos, was implemented for users to efficiently browse the protein kinases and their substrate proteins in a user-friendly manner. Three major functions are provided in the proposed system: browsing kinases or substrates (see Supplementary Figure S7), constructing phosphorylation networks and microarray expression analysis (see Supplementary Figure S8). The JMol viewer (49) is adopted for the visualization of PDB (50) structures of kinases and substrates. Users can input a group of gene/protein names, and the phosphorylation network associated with protein subcellular localization is constructed automatically. To fully investigate how protein kinases control intracellular processes, the experimentally verified kinase–substrate phosphorylations and the computationally discovered kinase–substrate interactions are incorporated to explore the phosphorylation networks starting from membrane-associated receptor kinases and ending at transcription factors located in the nucleus. However, a phosphorylation-driven signal transduction pathway is not always a pure phosphorylation cascade. Some protein–protein interactions are involved in the signal transduction pathway, such as the IRS1–GRB2, GRB2–SOS1, SOS1–HRAS and HRAS–RAF1 interactions in the insulin signaling pathway (31). Supplementary Figure S9 shows the insulin signaling network as an example of phosphorylation network construction. A group of proteins associated with the insulin signaling pathway is given as input to construct the network from membrane-associated proteins to nuclear proteins.
To demonstrate the effectiveness of the proposed method, the discovered phosphorylation networks associated with the insulin signaling pathway are represented in Figure 3. Insulin regulates both metabolism and gene expression; the insulin signal passes from the plasma membrane receptor to insulin-sensitive metabolic enzymes and finally to the nucleus, where it stimulates the transcription of specific genes (31). The well-known insulin signaling pathway, INSR → IRS1 — Grb2 — SOS1 — RAS — Raf1 → MEK → ERK1 → Elk1, is successfully identified by the presented graph-based phosphorylation network searching method (‘→’ stands for phosphorylation and ‘—’ stands for protein–protein interaction). Because protein–protein interactions are allowed in the network search, numerous insulin receptor (INSR)-related signaling pathways are discovered, comprising about 2000 pathways with a length of eight proteins. After validation with time-coursed microarray data, the discovered INSR-related phosphorylation networks are reduced to about 50 pathways. Some of the well-known signaling networks are discovered and marked with red lines in Figure 3. RegPhos not only identifies the correct network of insulin signaling but also detects a potentially novel signaling pathway that may cross-talk with the insulin signaling network. For instance, Qin-induced kinase (QIK) phosphorylates ‘Ser-794’ of IRS1 in insulin-stimulated adipocytes, potentially modulating the efficiency of insulin signal transduction (51); SHC-transforming protein 1 (SHC1) is a signaling adapter that couples activated growth factor receptors to signaling pathways (48); and GRB2-associated-binding protein 1 (GAB1) is probably involved in EGF and insulin receptor signaling (52). Phosphoregulators such as QIK, IRS1, SHC1 and GAB1 are considerably involved in cross-talk between signaling cascades (53).
To identify statistically significant syn-expressed pairs of kinase and substrate genes, a background distribution of correlation between gene pairs is required. However, calculating the correlation for all pairs of genes is time-consuming. Therefore, random sampling is adopted to extract 100000 gene pairs as the background set for estimating the distribution of Pearson correlation coefficients of background gene pairs (see Supplementary Figure S10). The distribution of the Pearson correlation coefficient for pairs of specific kinases and their substrates is also investigated. Supplementary Figure S11 shows the distribution of the correlation coefficient of PKA-substrate pairs, CDC2-substrate pairs and EGFR-substrate pairs, based on 98 microarray series. About 40% of the PKA-substrate pairs show a low positive correlation (0<r<0.4), with an average correlation coefficient of 0.08. In particular, about 65% of CDC2-substrate pairs have a positive correlation, with ~20% showing a high positive correlation (r>0.7). The average correlation coefficient of CDC2-substrate pairs is 0.14. For EGFR-substrate pairs, the distribution of the correlation coefficient is similar to the distribution of all kinase–substrate pairs, and the average correlation coefficient is 0.028.
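The random-sampling step can be sketched in Python as follows. The sketch draws random gene pairs, computes their Pearson correlation coefficients as a background, and compares an observed coefficient against that background. The function names, the synthetic expression values and the empirical P-value are assumptions added for illustration; the text above specifies only that 100000 pairs are sampled to estimate the background distribution.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 is returned for a flat profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def background_correlations(expression, n_pairs, seed=0):
    """Correlation coefficients of randomly sampled pairs of distinct genes."""
    rng = random.Random(seed)
    genes = sorted(expression)
    values = []
    for _ in range(n_pairs):
        a, b = rng.sample(genes, 2)      # two different genes per pair
        values.append(pearson(expression[a], expression[b]))
    return values


def empirical_p_value(observed_r, background):
    """Fraction of background pairs at least as correlated as the observed pair."""
    hits = sum(1 for r in background if r >= observed_r)
    return (hits + 1) / (len(background) + 1)


# Synthetic expression profiles (50 genes, 6 time points), for illustration only.
data_rng = random.Random(42)
expression = {f"gene{i}": [data_rng.gauss(8.0, 1.0) for _ in range(6)]
              for i in range(50)}

background = background_correlations(expression, n_pairs=1000)
p = empirical_p_value(0.7, background)
print(len(background), round(p, 4))
```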
Moreover, the distribution of the Pearson correlation coefficient for pairs of specific kinases and their substrates is investigated based on time-coursed microarray data. Supplementary Figure S11 shows the distribution of the correlation coefficient of PKA-substrate pairs, CDC2-substrate pairs and EGFR-substrate pairs based on nine time-coursed microarray series (described in the ‘Materials and methods’ section). The average correlation coefficient of PKA-substrate pairs rises to 0.12, and the proportion of PKA-substrate pairs with a low positive correlation (0<r<0.4) increases from 40 to 45%. For EGFR-substrate pairs, the average correlation coefficient rises from 0.028 to 0.08, and the proportion of pairs with a high positive correlation (r>0.6) approaches 16%. However, based on time-coursed microarray data, the average correlation coefficient of CDC2-substrate pairs decreases to 0.10. In general, the experimentally confirmed kinase–substrate pairs have higher Pearson correlation coefficients when time-coursed microarray expression data are used. Thus, the time-coursed microarray data of the Affymetrix GeneChip Human Genome U133 Array Set HG-U133A platform (GPL96) are used to test the degree of similarity in the expression profiles of network members.
The increasing number of identified in vivo phosphorylation sites has motivated the mapping of the network of protein kinases and substrates. The experimental kinase-specific substrates ultimately need to be combined by systems biology analysis, which translates the separate, large-scale datasets into signaling networks. Therefore, this study has combined the experimentally verified kinase–substrate interactions with experimental protein–protein interactions to construct the intracellular phosphorylation network starting from receptor kinases and ending at transcription factors, using the information of subcellular localization. After the integration of public phosphorylation resources, most of the experimentally verified phosphorylation sites (~80%) still lack annotation of catalytic kinases. A published kinase-specific phosphorylation site prediction tool, KinasePhos (17–19), is combined with protein associations (protein–protein interaction, functional association and protein subcellular localization) for assigning the potential kinase. The evaluation shows that the proposed method improves the predictive power and highlights the importance of kinase–substrate interactions in the specificity of protein phosphorylation within cells. Moreover, experimental expression evidence, such as gene microarray data, was adopted to validate the syn-expression of the discovered phosphorylation network with statistical significance. To facilitate the investigation of protein kinases and their substrates, a web-based system, named RegPhos, was implemented for users to efficiently browse the protein kinases and their substrate proteins in a user-friendly manner. A case study demonstrates that RegPhos not only identifies the correct network of insulin signaling but also detects a novel signaling pathway that may cross-talk with the insulin signaling network.
In future work, protein phosphatases, which act in opposition to protein kinases, need to be considered in the construction of protein phosphorylation networks. Protein kinases and phosphatases regulate the phosphorylation status of the protein complement of a cell and, in turn, regulate the activity of their target phosphoproteins in cellular processes. Defining the entire complement of these proteins gives us an opportunity to view the system as a whole.
The RegPhos database will be continuously maintained and updated. All the experimentally verified data on protein phosphorylation and protein–protein interaction will be updated quarterly. The time-coursed microarray expression data collected from Gene Expression Omnibus (GEO) will also be updated quarterly. The resource is now freely available at http://RegPhos.mbc.nctu.edu.tw.
Supplementary Data are available at NAR Online.
National Science Council of the Republic of China (Contract Numbers NSC 98-2627-B-009-005, NSC 99-2320-B-155-001, NSC 99-2627-B-009-003, NSC 98-2311-B-009-004-MY3, NSC 99-2621-B-006-001-MY2 and NSC 99-2628-B-006-016-MY3); National Research Program for Genomic Medicine (NRPGM), Taiwan.
Conflict of interest statement. None declared.