The Human Protein Reference Database (HPRD) is a rich resource of experimentally proven features of human proteins. Protein information in HPRD includes protein–protein interactions, post-translational modifications, enzyme/substrate relationships, disease associations, tissue expression, and subcellular localization. Although protein-protein interaction data from HPRD have been widely used by the scientific community, its phosphoproteome data have not been exploited to their full potential. HPRD is one of the largest collections of human phosphoprotein data in the public domain. Currently, phosphorylation data in HPRD comprise 95,016 phosphosites mapped onto 13,041 proteins. Additionally, enzyme-substrate reactions responsible for 5,930 phosphorylation events have been curated. Significant improvements in technologies and high-throughput platforms in biomedical investigations have led to an exponential increase in biological data, including phosphoproteomic data, in recent years. Human Proteinpedia, a community annotation portal developed by us, has also contributed to a significant increase in phosphoproteomic data in HPRD. A large number of phosphorylation events have been mapped onto reference sequences available in HPRD and Human Proteinpedia along with associated protein features. This will provide a platform for systems biology approaches to determine the role of protein phosphorylation in protein function, cell signaling and biological processes, and its implications in human diseases. This review aims to provide a composite view of phosphoproteomic data pertaining to human proteins in HPRD and Human Proteinpedia.
Post-translational modifications (PTMs) are crucial for protein stability, sorting and function. Commonly found PTMs include phosphorylation, acetylation, glycosylation, methylation, proteolytic cleavage, GPI anchoring, palmitoylation, sumoylation and ubiquitination. Phosphorylation of proteins is mediated by a class of enzymes called protein kinases, which transfer phosphate groups from ATP to the hydroxyl group of serine, threonine, or tyrosine residues of their protein substrates 1. Protein phosphorylation is a reversible and dynamic process that plays a critical role in the growth, function and development of living cells 2. Protein phosphorylation and dephosphorylation events regulate several receptor-mediated signal transduction pathways and act as on/off switches for enzymatic activity, protein stability, protein-protein interactions (PPIs) and subcellular localization 3-5. Protein kinases and phosphatases recognize their substrates by means of sets of conserved residues in substrate sequences called phosphomotifs 6. There are more than 518 protein kinases in humans, including approximately 400 serine/threonine kinases and ~90 protein tyrosine kinases 7-9. Similarly, there are an estimated 30-40 serine/threonine phosphatases and 109 tyrosine phosphatases 9-11. Dual-specificity kinases or phosphatases can act on both serine/threonine and tyrosine residues within the same substrate 12, 13. Because phosphorylation and dephosphorylation events regulate the activity of several enzymes and mediate signaling pathways, a manually curated phosphorylation data resource would allow 'systems level' discoveries by facilitating analyses of protein networks and signaling pathways in different biological contexts, including human diseases 14.
This is especially relevant because dysregulation of protein phosphorylation events and signaling pathways has already been implicated in several diseases such as cancer, neurodegenerative disorders including Alzheimer's disease, dementias, and disorders of learning and memory 15-21.
Traditional methods used for identification of phosphorylated proteins involved radioactive labeling of proteins with 32P-labeled ATP followed by SDS-PAGE or high-performance liquid chromatography followed by Edman sequencing to determine the site of phosphorylation 22. However, this technique is cumbersome and requires large amounts of pure protein samples. Western blotting using phosphosite-specific antibodies is a method that is gaining popularity but remains limited because of the restricted availability of such antibodies 23. High-resolution mass spectrometry (MS) has evolved as the method of choice for analysis of phosphoproteome in complex protein samples owing to its sensitivity and high-throughput nature 24. Such experiments often involve protein isolation followed by phosphoproteome enrichment 25-27. Subsequently, the enriched fractions are resolved by liquid chromatography and analyzed by MS 28-30.
Several high-throughput studies have substantially increased the data pertaining to the human phosphoproteome in this decade. Olsen et al. reported 6,600 phosphorylation sites from 10,000 phosphopeptides on 2,244 proteins in HeLa cells stimulated with epidermal growth factor (EGF). The quantitative proteomics approach followed in this study combined SILAC for quantitation with titanium dioxide (TiO2) chromatography for phosphopeptide enrichment 31. Nagaraj et al. identified 5,359 and 6,845 phosphosites by TiO2 affinity-based enrichment of HeLa S3 cell lysates with two different modes of fragmentation, collision-induced dissociation and high-energy collision-induced dissociation, respectively 32. In another study, 2,225 non-redundant phosphosites from 1,023 proteins were obtained from human liver by enrichment of phosphopeptides using immobilized metal ion affinity chromatography (IMAC) microspheres 33. Oyama and coworkers performed tyrosine phosphoproteome analysis of EGF-stimulated A431 cells and identified 3,730 phosphopeptides 34. Li and coworkers used pTyr-100 and 4G10 phosphotyrosine antibodies to enrich the phosphotyrosine proteome in human Hep3B and MHCC97H cell lines 35. Because the data obtained in these large-scale studies are voluminous, the phosphopeptide information usually ends up in supplementary files of the research articles. It is difficult for an individual investigator to benefit from these data because they are hard to access and are presented in non-uniform ways and file formats. For a systems-level analysis, it would be helpful if phosphoproteome data derived from diverse biological systems and experimental platforms were curated uniformly in a centralized resource. The essential information for a phosphoproteome may include the residues and sites of phosphorylation, the name and reference sequence of the protein onto which these phosphosites were mapped, and the experimental methods for enrichment and analysis, with links to the research articles.
HPRD and Human Proteinpedia, as community resources, serve this function. In HPRD, a large number of phosphorylation and dephosphorylation reactions were included from targeted low-throughput studies in addition to the phosphoproteome data derived from high-throughput analyses. These resources also allow researchers to perform complex queries that combine phosphorylation with other protein features. Phosphorylation data from these databases can be freely downloaded. The information in these resources can be used to undertake systems approaches to study phosphorylation, especially in the light of newer initiatives to understand human proteins and biological systems.
HPRD (http://www.hprd.org) is a comprehensive resource of several types of protein information. It allows for the integration of data from diverse experimental platforms pertaining to different protein features in the context of an individual protein. HPRD links each curated annotation to the original research article in which it is described. It documents various protein features such as PPIs, PTMs, enzyme-substrate relationships, disease associations, subcellular localization, tissue expression, and biological motifs/domains derived from a variety of experimental platforms. HPRD contains curated information for a non-redundant set of 30,047 human proteins, including 19,653 proteins and 10,394 additional isoforms as reported in the RefSeq database 36. Curated information in HPRD currently comprises 39,194 unique PPIs, 109,513 PTMs, 112,158 tissue expression annotations, and 22,490 subcellular localization entries, which are linked to 59,194 published research articles. It also contains annotations pertaining to isoform-specific subcellular localization (1,117) and tissue expression (5,499). The graphical user interface of HPRD displays an image for each protein sequence, which depicts biological motifs and protein domains along with the PTMs in the context of that sequence. HPRD has implemented Gene Ontology in describing subcellular localization, molecular function and biological process. PPI data in HPRD have been used in the standardization of Proteomics Standard Initiative-Molecular Interaction (PSI-MI) vocabularies 37. Data annotated in HPRD are freely accessible to the academic community in PSI-MI, XML and tab-delimited formats.
HPRD follows a manual curation strategy to annotate phosphorylation and dephosphorylation data described in low- and high-throughput research articles. Each phosphosite is mapped to the reference sequence provided in HPRD using the phosphopeptide sequence data provided in the article. Only unique phosphosites that map to a single human protein are considered for inclusion. Data are entered using software called BioBuilder, developed previously by our group 38. BioBuilder has built-in consistency checks to prevent redundant entries and the annotation of wrong residue and site information. The annotated information is displayed on the HPRD protein page for visualization.
Figure 1 shows the molecule page of insulin receptor in HPRD, which provides links to research articles describing phosphorylation sites and residues. For example, the phosphorylated residues marked in the peptides DGGSSpLGFK and APESpEELMEFEDMNVPLDR were reported as Ser-1275 and Ser-1309 in the respective literature 39, 40. However, when we mapped these peptides to the corresponding RefSeq protein sequence (accession # NP_000199.2), these positions matched Ser-1314 and Ser-1348. These sequence-matched sites have been annotated in HPRD. In addition, the enzymes directly responsible for phosphorylation or dephosphorylation reactions are also documented. Three proteins, protein kinase C alpha (PRKCA), epsilon (PRKCE) and delta (PRKCD), have been shown to phosphorylate the opioid receptor (OPRD1) at Ser-344 41. Elongation factor 2 kinase (EEF2K) and phosphatase 2A, catalytic subunit, alpha (PPP2CA) have been reported to mediate phosphorylation and dephosphorylation reactions, respectively, on Thr-57 and Thr-59 of elongation factor 2 (EEF2) 42.
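The peptide-to-reference mapping described above can be illustrated with a minimal sketch. It is not the BioBuilder implementation. It assumes that a lowercase "p" directly follows each phosphorylated residue (as in DGGSSpLGFK), and it uses two short, made-up sequences with hypothetical accessions (TOY_A, TOY_B) instead of real RefSeq entries. A peptide is accepted only if it matches exactly one position in exactly one reference protein.

```python
import re


def parse_marked_peptide(marked_peptide):
    """Split a marked peptide into its plain sequence and 0-based site offsets.

    Assumption: a lowercase 'p' directly follows each phosphorylated residue.
    """
    residues = []
    offsets = []
    for char in marked_peptide:
        if char == "p":
            if not residues:
                raise ValueError("'p' marker has no preceding residue")
            offsets.append(len(residues) - 1)
        else:
            residues.append(char)
    return "".join(residues), offsets


def map_phosphosites(marked_peptide, references):
    """Return (accession, [(residue, 1-based position), ...]) or None.

    None is returned when the peptide is absent or matches more than once,
    mirroring the rule that only uniquely mapped phosphosites are kept.
    """
    plain, offsets = parse_marked_peptide(marked_peptide)
    # A lookahead finds every occurrence, including overlapping ones.
    pattern = re.compile("(?=" + re.escape(plain) + ")")
    hits = [
        (accession, match.start())
        for accession, sequence in references.items()
        for match in pattern.finditer(sequence)
    ]
    if len(hits) != 1:
        return None
    accession, start = hits[0]
    sites = [(plain[offset], start + offset + 1) for offset in offsets]
    return accession, sites


# Hypothetical toy sequences, not real RefSeq entries.
TOY_REFERENCES = {
    "TOY_A": "MKTAYIAKQRDGGSSLGFKWWAPESEELMEFEDMNVPLDR",
    "TOY_B": "MSTPLDRGGSSLGAAK",
}

print(map_phosphosites("DGGSSpLGFK", TOY_REFERENCES))  # ('TOY_A', [('S', 15)])
print(map_phosphosites("GGSSpLG", TOY_REFERENCES))     # None (matches both proteins)
```

Reporting the position relative to the reference sequence, not the numbering used in the source article, is what allows sites from different publications to be compared directly.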
Improvements in experimental platforms in biology have resulted in a great increase in biological information including phosphorylation. Table 1 provides a list of large-scale studies that cataloged a large number of phosphopeptides in humans along with a summary of information regarding the cell line/tissue used for the experiments, type of mass spectrometer employed, methods used for enrichment and the number of identified phosphosites.
Phosphorylation data were annotated based on the sites of modification and the modified residues. Figure 2A depicts the distribution of phosphorylation data in HPRD derived from in vivo and in vitro experiments. HPRD has 95,016 annotated phosphorylation sites, of which 88,250 were identified by in vivo analysis alone, 2,678 by in vitro experiments alone and 4,088 by both methods. Among the annotated phosphorylation sites, 66,050 (69%) are on serine, 20,652 (22%) on threonine and 8,314 (9%) on tyrosine.
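As a quick consistency check on the figures quoted above, the following short sketch confirms that both breakdowns add up to the stated total and recomputes the residue shares to one decimal place. The variable names are illustrative only.

```python
# Counts taken from the text above.
total_sites = 95016
by_evidence = {"in vivo only": 88250, "in vitro only": 2678, "both": 4088}
by_residue = {"serine": 66050, "threonine": 20652, "tyrosine": 8314}

# Both breakdowns should account for every annotated phosphosite.
evidence_total = sum(by_evidence.values())
residue_total = sum(by_residue.values())

# Percentage share of each residue, rounded to one decimal place.
residue_share = {
    residue: round(100 * count / total_sites, 1)
    for residue, count in by_residue.items()
}

print(evidence_total == total_sites, residue_total == total_sites)
print(residue_share)  # {'serine': 69.5, 'threonine': 21.7, 'tyrosine': 8.8}
```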
While annotating phosphorylation data in HPRD, we have also documented the enzymes responsible for 5,930 phosphorylation events. Enzymes have so far been determined for only a minor fraction of all phosphorylation events (Figure 2B). The annotation of enzymes for these PTMs has also resulted in the documentation of 2,118 unique enzyme-substrate relationships. We found 447 substrates that are known to be phosphorylated by more than one enzyme and 517 substrates modified by only one enzyme. Overall, in HPRD, 13,014 proteins are annotated with at least one phosphorylation site. Figure 2C shows an exponential increase in HPRD phosphorylation data over the last four releases, which is mainly due to an increase in high-throughput phosphoproteomic analysis of proteins by mass spectrometry. Since HPRD release 7 in 2007, we have added 84,165 phosphosites on human proteins, comprising 58,738 phosphoserine, 18,888 phosphothreonine and 6,539 phosphotyrosine sites. The distribution of phosphorylation events on phosphoproteins with respect to serine, threonine and tyrosine is illustrated in Figure 2D. Of these, only 1,368 phosphorylation reactions have been defined in the context of human signaling pathways, as curated in NetPath (http://www.netpath.org), indicating that the large majority of phosphosites have still not been explored for their role in cell signaling. Thus, phosphorylation data contained in HPRD should provide a platform for biomedical investigators to further evaluate the role of these phosphosites and proteins in cell signaling and in diseases.
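The distinction between substrates with one annotated enzyme and those with several can be derived from a plain list of enzyme-substrate pairs. The sketch below is a hedged illustration on toy data. The pairs reuse the kinase examples mentioned earlier in this review (PRKCA, PRKCE and PRKCD acting on OPRD1, and EEF2K acting on EEF2), and the list layout is an assumption, not the HPRD export format.

```python
from collections import defaultdict

# Toy (kinase, substrate) pairs; the duplicate entry mimics the same
# relationship being reported by more than one article.
pairs = [
    ("PRKCA", "OPRD1"),
    ("PRKCE", "OPRD1"),
    ("PRKCD", "OPRD1"),
    ("EEF2K", "EEF2"),
    ("PRKCA", "OPRD1"),
]

# Collect the distinct enzymes annotated for each substrate.
enzymes_by_substrate = defaultdict(set)
for enzyme, substrate in pairs:
    enzymes_by_substrate[substrate].add(enzyme)

multi_enzyme = sorted(s for s, e in enzymes_by_substrate.items() if len(e) > 1)
single_enzyme = sorted(s for s, e in enzymes_by_substrate.items() if len(e) == 1)
unique_relationships = sum(len(e) for e in enzymes_by_substrate.values())

print(multi_enzyme)          # ['OPRD1']
print(single_enzyme)         # ['EEF2']
print(unique_relationships)  # 4
```

Storing the enzymes in a set collapses repeated reports of the same pair, so the relationship count reflects unique enzyme-substrate relationships and not the number of supporting articles.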
Curation of heterogeneous proteomic data from multiple sources into a uniform format is a challenging task. Large-scale phosphorylation data are usually provided as supplementary files with no uniformity in data presentation. If most investigators deposited their data into a centralized resource in a predefined format, the data could be used more easily by the broader biology community. Therefore, we developed distributed annotation system protocols to annotate human protein data, giving proteomic investigators a platform to submit their data directly to a single resource, which we named Human Proteinpedia (http://www.humanproteinpedia.org/). Users can submit their data by uploading batch files or through FTP or email. Human Proteinpedia uses protein information in HPRD as a scaffold and maps all proteomic data entered by users to HPRD. The data submitted to Human Proteinpedia can be viewed through HPRD in the context of an individual protein molecule. It also allows laboratories to share human proteomic data obtained from multiple experimental platforms and increases the speed of data dissemination. Users can share both unpublished and published human proteomic data. Data thus submitted remain linked to the investigators and the laboratories. This resource has entries pertaining to 34,500 PPIs, 24,152 PTMs, 2,900 subcellular localization annotations and 192,557 cell line/tissue expression annotations. Figure 3 displays the HPRD molecule page of microtubule associated protein 4 (MAP4), which promotes microtubule assembly.
Users of Human Proteinpedia have submitted 60 novel phosphorylation sites of MAP4 in HPRD, and the PTM data can be viewed in a separate table on the PTMs page in HPRD. Human Proteinpedia provides more insight into phosphorylation data from mass spectrometry by giving the option of adding peptide score, experimental description, peptide identification data, precursor mass, charge state, sequence identifier, algorithm, MS/MS spectrum, ionization methods, fragmentation methods and the mass tolerance used for database searching. All peptides are linked to their respective MS/MS spectra, which can be visualized using Spectrum Viewer 43. Phosphopeptides derived from multiple experimental platforms and enrichment strategies will help researchers determine which combination of strategies maximizes discovery in phosphoproteomics. There are over 1,960,352 phosphopeptides deposited in Human Proteinpedia. It has thus become an important source of phosphoproteome information and can drive meta-analysis of the human phosphoproteome. Several phosphopeptides in Human Proteinpedia have multiple lines of evidence, as they have been submitted by more than one group. In addition to enriching phosphopeptide data in the public domain, multiple sources of evidence gathered from various laboratories on different mass spectrometry platforms validate these phosphopeptides.
Biological databases pertaining to various post-translational modifications are important for investigating biological and pathway information in cells. Several public databases collect published phosphorylation data disseminated in the scientific literature and provide researchers access to their curated datasets. These usually reference the original publication and the experimental method that determined each individual phosphorylation. HPRD is richer in human phosphosites than other publicly available repositories for human proteins. For instance, Phospho.ELM version 9.0 (http://phospho.elm.eu.org/) contains data on 8,718 substrate proteins from different species, covering 3,370 tyrosine, 31,754 serine and 7,449 threonine phosphorylation instances 44. Phospho3D version 2 (http://www.phospho3d.org/) provides three-dimensional structures of phosphorylation sites 45, 46. PHOSIDA (http://www.phosida.com) is an archive of 24,262 phosphosites on 8,283 proteins from Homo sapiens 47. LymPHOS (http://www.lymphos.org) describes 342 phosphorylation sites in human T-lymphocytes 48, and RESID (http://www.ebi.ac.uk/RESID/) has a collection of structures for modified proteins 50. PhosphoSitePlus (http://www.phosphosite.org) has reported 97,589 non-redundant sites from all organisms with phospho-antibody information. dbPTM 2.0 (http://dbptm.mbc.nctu.edu.tw/) has 36,466 PTM sites from Swiss-Prot, Phospho.ELM, O-GLYCBASE and UbiProt, categorized by PTM type, of which 22,363 are phosphorylation sites 50. SysPTM (http://www.biosino.org.cn/SysPTM/) contains 117,349 experimentally determined PTM sites on 33,421 proteins involving nearly 50 PTM types from many species, curated from public resources 51. One potential use of human protein phosphorylation resources will be in analyzing top-down and middle-down data from mass spectrometry.
ProSight PTM 2.0 is a web-based tool that allows characterization of intact proteins using a top-down tandem mass spectrometry approach. This tool uses proteomic databases annotated with known or predicted PTM information, including phosphorylation, to identify and characterize intact proteins. The algorithm makes use of PTM annotation to calculate intact protein mass by considering each PTM in isolation as well as in combination with others, in order to characterize different protein forms. Each of the databases described above has its own unique features, with large variations in the type and depth of annotation.
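The idea of considering annotated modifications in isolation and in combination can be illustrated for phosphorylation alone. The sketch below is not the ProSight PTM algorithm. It simply enumerates every subset of a set of annotated phosphosites and adds the monoisotopic mass of one phosphate group (HPO3, 79.96633 Da) per occupied site. The unmodified mass and the site labels are hypothetical values chosen for illustration.

```python
from itertools import combinations

# Monoisotopic mass added by one phosphorylation (HPO3), in daltons.
PHOSPHO_SHIFT = 79.96633


def candidate_masses(unmodified_mass, annotated_sites):
    """Map each combination of occupied sites to its intact protein mass."""
    forms = {}
    for occupied in range(len(annotated_sites) + 1):
        for combo in combinations(annotated_sites, occupied):
            forms[combo] = unmodified_mass + occupied * PHOSPHO_SHIFT
    return forms


# Hypothetical protein mass and site labels.
forms = candidate_masses(25000.0, ["S15", "S25", "T40"])
for combo, mass in forms.items():
    print(combo or "unmodified", round(mass, 4))
```

With n annotated sites there are 2^n candidate forms, so a database with accurate site annotation keeps the search space far smaller than allowing a modification on every serine, threonine and tyrosine.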
Several software tools and algorithms are available that can predict phosphorylation motifs in protein sequences. These include NetPhos (http://www.cbs.dtu.dk/ws/ws.php?entry=NetPhos) 52, GPS: group-based phosphorylation site 53, Scansite (http://scansite.mit.edu) 54, PredPhospho (http://www.ngri.re.kr/proteo/PredPhospho.htm) 55, KinasePhos (http://kinasephos.mbc.nctu.edu.tw/) 56, NetworKIN (http://www.networkin.info/search.php) 57, 58, DISPHOS 1.3 (http://www.ist.temple.edu/disphos/), ELM (http://www.elm.eu.org/) 59, PhoScan (http://bioinfo.au.tsinghua.edu.cn/phoscan/) 60, pkaPS (http://mendel.imp.ac.at/sat/pkaPS/) 61, PPSP (Prediction of PK-specific Phosphorylation site, http://ppsp.biocuckoo.org/) 62 and NetPhorest (http://netphorest.info/) 63. Motif-X (http://motif-x.med.harvard.edu/) also identifies novel and known phosphorylation motifs 64. However, these tools do not simply map literature-based, experimentally proven phosphomotifs onto a specific protein query. We developed software called PhosphoMotif Finder to address this gap and integrated it into HPRD. It currently contains 324 phosphorylation-based motifs described in the literature 65. We categorized phosphomotifs into phosphorylation-based substrate motifs and phosphorylation-based binding motifs. The former category of motifs is recognized by kinases and phosphatases, and the latter provides a scaffold for other proteins to bind to phosphorylated motifs. PhosphoMotif Finder has 170 serine/threonine kinase substrate motifs, 5 serine/threonine phosphatase substrate motifs, 17 serine/threonine binding motifs, 50 tyrosine kinase substrate motifs, 19 tyrosine phosphatase substrate motifs and 63 binding motifs.
Figure 4 shows 52 known tyrosine phosphorylation motifs in vascular endothelial growth factor receptor 3 (VEGF receptor 3). It displays the position of each motif in the query sequence and its features as described in the existing literature. This compendium is useful for the biomedical research community to identify potential phosphosites in a hypothetical protein, novel phosphorylation motifs, enzyme associations and potential protein interactors of a phosphorylated protein. PhosphoMotif Finder can be a useful tool for designing experiments to characterize phosphorylation reactions of a set of proteins. PhosphoMotif Finder will be updated regularly as new phosphomotifs are described.
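Mapping a catalog of known motifs onto a query sequence, as described above, amounts to pattern matching against the sequence. The following is a minimal sketch of that idea and not the PhosphoMotif Finder code. The two patterns are simplified, widely cited consensus forms used purely for illustration, they are not taken from the PhosphoMotif Finder catalog, and the query sequence is made up.

```python
import re

# Illustrative motif catalog: name -> regular expression.
# The named group 'site' marks the residue that would be phosphorylated.
MOTIFS = {
    "basophilic R-R-x-[S/T] (illustrative)": r"RR.(?P<site>[ST])",
    "proline-directed [S/T]-P (illustrative)": r"(?P<site>[ST])P",
}


def scan_motifs(sequence, motifs=MOTIFS):
    """Return (motif name, 1-based start, matched text, 1-based site) tuples."""
    hits = []
    for name, pattern in motifs.items():
        # Wrapping the pattern in a lookahead reports overlapping matches too.
        for match in re.finditer(f"(?=(?P<motif>{pattern}))", sequence):
            hits.append(
                (
                    name,
                    match.start("motif") + 1,
                    match.group("motif"),
                    match.start("site") + 1,
                )
            )
    return hits


# Hypothetical query sequence.
for hit in scan_motifs("MARRLSPEKTPQ"):
    print(hit)
```

A match only indicates that the sequence is compatible with a described motif. Whether the site is phosphorylated in a given context still has to be established experimentally.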
The phosphorylation data from HPRD have been used by the scientific community to develop new prediction strategies, to compare experimental datasets, to enrich other databases and to perform targeted bioinformatics analyses. SysPTM (http://www.sysbio.ac.cn/SysPTM) has used PTM data from HPRD along with other databases for PTMBlast, PTMPhylog, PTMCluster and PTMPathway 51. Yang et al. introduced PhosphoPOINT, which integrates data from HPRD as well as from other resources 67. RegPhos (http://regphos.mbc.nctu.edu.tw/) has used HPRD phosphorylation sites to predict phosphorylation networks with respect to subcellular localization 67. HPRD data have been used to create a resource of protein tyrosine phosphatases, which includes tyrosine-specific and dual-specificity phosphatases 68. The Pathway Palette (http://blaispathways.dfci.harvard.edu/Palette.html) generates a PPI network from mass spectrometry-derived data based on information curated in HPRD and other databases 69. Software for finding short linear motifs (SLiMs) (http://bioware.ucd.ie/~slimdisc/slimfinder/conmasking/), functional microdomains that are important in many biological processes, used data from HPRD 70. Yachie et al. obtained serine kinase and phosphatase motif data from HPRD to evaluate the relationship between the amino acid sequences of phosphoproteins and the positions of phosphosites 71. Bioinformatics analyses of phosphoproteome data from HPRD have led to many useful biological interpretations in protein signaling. Cao et al. found 80 tyrosine, 83 serine and 19 threonine phosphorylation sites in stimulated Jurkat cells while studying the signaling pathways involved in T cells. They used HPRD data to identify novel phosphorylation sites 72. Amanchy et al. identified 23 novel c-Src kinase substrates involved in PDGF signaling using SILAC 73. Tang et al. used HPRD data while investigating phosphoproteins in Wnt signaling 74.
Taken together, these examples show that HPRD helps researchers use curated data to maximize discovery in proteomics.
HPRD and Human Proteinpedia serve as repositories of diverse features of human proteins. The major fraction of the published human phosphoproteome has already been incorporated into these resources through our ongoing curation efforts. Human Proteinpedia provides a list of phosphopeptides identified by mass spectrometry. PhosphoMotif Finder allows mapping of phosphomotifs in a given protein. Together, these datasets can help in designing phospho-specific antibodies and peptide arrays. Datasets obtained from different experimental methods and platforms can be used to analyze the efficiency of these technologies and to decide on the methodology that best suits a given investigation in phosphoproteomics. Only a minority of the phosphorylation reactions and phosphosites described in HPRD have been studied for their relevance to cell signaling and protein activity. Targeted studies can be designed to explore the biological role of these phosphorylation events. Phosphoproteomic data available in these resources will provide baseline data for further biomedical investigations.
We thank the Department of Biotechnology (DBT), Government of India for research support to the Institute of Bioinformatics. Harsha Gowda is a Wellcome Trust/DBT India Alliance Early Career Fellow. A.P. was supported by NIH Roadmap grant “Technology Center for Networks and Pathways” (U54 RR 020839) and W81XWH-06-1-0428 from the Department of Defense for Proteomic and Functional Analysis of Fibroblasts in Breast Cancer. Dr. T. S. Keshava Prasad is supported by research grant on “Development of Infrastructure and a Computational Framework for Analysis of Proteomic Data” from DBT, Government of India, New Delhi, India.