The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact email@example.com
The Human Protein Reference Database (HPRD) (http://www.hprd.org) was developed to serve as a comprehensive collection of protein features, post-translational modifications (PTMs) and protein–protein interactions. Since the original report, this database has grown to >20000 protein entries and has become the largest database for literature-derived protein–protein interactions (>30000) and PTMs (>8000) for human proteins. We have also introduced several new features in HPRD including: (i) protein isoforms, (ii) enhanced search options, (iii) linking of pathway annotations and (iv) integration of a novel browser, GenProt Viewer (http://www.genprot.org), developed by us, that allows integration of genomic and proteomic information. With the continued support and active participation of the biomedical community, we expect HPRD to become a unique source of curated information for the human proteome and to spur biomedical discoveries based on integration of genomic, transcriptomic and proteomic data.
The Human Protein Reference Database (HPRD) is a protein information resource that provides extensive information pertaining to human proteins including domain architecture, protein functions, protein–protein interactions, post-translational modifications (PTMs), enzyme–substrate relationships, subcellular localization, tissue expression and disease association of genes (1–3). In order to make HPRD a more comprehensive resource, we have greatly expanded the number of protein entries, protein–protein interactions and PTMs. We have also incorporated additional query (e.g. BLAST) and browse options and provided explanatory pages for motifs found in the proteins cataloged in HPRD. Some of the new features include protein isoforms, links to signal transduction pathways and integration of GenProt Viewer, a novel browser that we have recently developed. HPRD currently contains over 20000 protein entries including 1587 protein isoforms and has grown significantly in size over the last 3 years (Figure 1a).
A crucial aspect of any proteomic analysis is the elucidation of interacting proteins—the interactome (4). HPRD currently has 33710 unique protein–protein interactions. The experimental evidence for the interactions is derived from in vivo experiments for 19175 interactions, in vitro for 11114 interactions and yeast two-hybrid for 1813 interactions. Figure 1b shows the distribution of the protein–protein interactions annotated in HPRD. Table 1 shows the overall statistics as of September 15, 2005.
PTMs can alter both structure and function of proteins. In recent years, several large-scale studies have been carried out to characterize PTMs using proteomic methods. For instance, 2002 phosphorylation sites were identified using mass spectrometry from HeLa cell nuclear extract in a single experiment (5). A total of 5011 phosphorylation events and 1132 glycosylation events are among the 8409 recorded PTMs in HPRD (Figure 1c). Updated annotations involving subcellular localization include 489 nucleolar proteins (6,7) and 270 secreted proteins (8). Similarly, tissue expression data have been added to a number of entries including those encoded by KIAA cDNAs (9).
One of the important additions to HPRD is the inclusion of protein isoforms. To be included as an isoform, a RefSeq database (10) entry must have a CDS (coding sequence) that differs from that of other entries for the same gene. Thus, alternate splice forms are considered only when the splicing involves the coding region and not merely the 5′ or 3′ untranslated regions. All annotations are displayed for all isoforms by default except when isoform-specific data regarding subcellular localization, PTMs, domain architecture or tissue expression are available. Mainly because of a lack of data, isoform-specific annotations for protein–protein interactions, substrates and disease involvement are not currently provided; these annotations are instead shared by all isoforms of a gene.
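The inclusion criterion can be illustrated with a minimal sketch. This is not the HPRD curation pipeline: the function name `select_isoforms`, the gene labels, the accessions and the toy sequences are all hypothetical, and the sketch assumes that each transcript record already carries its coding sequence as a string.

```python
from collections import defaultdict


def select_isoforms(records):
    """Keep one accession per distinct CDS for each gene.

    `records` is an iterable of (gene, accession, cds) tuples.
    Transcripts that differ only outside the coding region share
    the same CDS string and therefore collapse into one isoform.
    """
    by_gene = defaultdict(dict)  # gene -> {cds: first accession seen}
    for gene, accession, cds in records:
        by_gene[gene].setdefault(cds, accession)
    return {gene: sorted(cds_map.values()) for gene, cds_map in by_gene.items()}


# Hypothetical toy data (not real RefSeq entries).
records = [
    ("GENE_A", "NM_000001", "ATGGCCTAA"),
    ("GENE_A", "NM_000002", "ATGGCCTAA"),     # same CDS: UTR-only variant
    ("GENE_A", "NM_000003", "ATGGCCGGGTAA"),  # different CDS: separate isoform
    ("GENE_B", "NM_000004", "ATGTTTTAA"),
]
isoforms = select_isoforms(records)
```

In this toy input, the two GENE_A transcripts with identical coding sequences count as a single isoform, while the transcript with an altered coding region is kept as a second one.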
HPRD can be queried through gene symbols or a variety of database accession numbers such as RefSeq (10), OMIM (11), Swiss-Prot (12), HPRD and Entrez Gene (13). A multiple search option is included in the updated query system that allows the database to be queried by simultaneously specifying several different parameters. Because accession numbers, gene symbols or protein names might still not yield the protein being searched for, we have now also included a BLAST option as a search tool.
To help users visualize and identify the potential function of a protein in the context of a large signaling network and its interaction partners, we have curated a number of pathways. These pathways are accessible through the ‘Pathways’ tab. The pathway data are diagrammatically represented using GenMAPP (Gene Microarray Pathway Profiler) (14), a computer application designed for viewing and analyzing data in the context of biological pathways (Figure 2). These pathways are a collection of literature-derived information, usually describing events downstream of ligand-receptor interactions. In addition to information about protein–protein interactions, pathways include reactions involving PTMs, shuttling of proteins between subcellular compartments, activation or inhibition of enzymatic activity and up- or down-regulation of mRNAs.
From the molecule page of each protein, HPRD links to GenProt Viewer, which maps the protein onto the genome along with transcript and SNP information (Figure 2). The GenProt Viewer, a browser (http://www.genprot.org) developed by our group, provides an integrated genomic, transcriptomic and proteomic view of the human genome. Genomic annotations include the mapping of single nucleotide polymorphisms and of blocks of homology between the human and mouse genomes. Transcriptomic data include RefSeq annotations and the categorization of protein-coding transcripts into untranslated regions and open reading frames. The viewer also integrates ‘Haploview’ for investigating population haplotype patterns (15). Experimentally derived peptide sequences obtained by mass spectrometry that have been deposited in the PeptideAtlas (http://www.peptideatlas.org) (16) and PRIDE (17) repositories are also mapped onto the genomic sequence in GenProt Viewer. The peptides are linked to the sequence pages in these two repositories. Finally, the BLAST option allows users to query the genome using protein or nucleotide sequences.
HPRD data are available for download in XML as well as tab-delimited file formats. Regular full releases of all the data are available in a compressed format through the ‘Download’ tab (http://www.hprd.org/FAQ?selectedtab=DOWNLOAD+REQUESTS). Interaction datasets in PSI-MI (18) format are provided as individual files for each protein as well as a single combined file for the entire dataset. PSI-MI is an evolving data format that was originally released as level 1.0. We are currently formatting the protein–protein interaction data in HPRD to stay compliant with the latest version of this specification (PSI-MI level 2.5).
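As an illustration of how a tab-delimited interaction file might be summarized by type of experimental evidence, consider the following minimal sketch. The column names (`interactor_a`, `interactor_b`, `experiment_types`), the semicolon separator and the sample rows are assumptions made for this example; they are not the documented layout of the HPRD download files, which should be checked against the release notes before any real use.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in an assumed three-column layout.
SAMPLE = (
    "interactor_a\tinteractor_b\texperiment_types\n"
    "PROT1\tPROT2\tin vivo;in vitro\n"
    "PROT1\tPROT3\tyeast 2-hybrid\n"
    "PROT2\tPROT3\tin vivo\n"
)


def tally_evidence(handle):
    """Count interactions per evidence type in a tab-delimited file.

    An interaction supported by several experiment types is counted
    once under each type, so the per-type counts need not sum to the
    number of interactions.
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        for kind in row["experiment_types"].split(";"):
            counts[kind.strip()] += 1
    return counts


evidence_counts = tally_evidence(io.StringIO(SAMPLE))
```

To read a real file, an opened file handle would replace `io.StringIO(SAMPLE)`, with the column name adjusted to match the header of the downloaded release.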
We wish to develop a Protein Distributed Annotation System, which will enable laboratories throughout the world to annotate valuable proteomic information including PTMs, tissue expression, protein–protein interactions and enzyme–substrate relationships in the context of HPRD data. We hope to link any data obtained by mass spectrometry directly to such annotations in HPRD. We are also in the process of integrating transcriptomic data into HPRD, which will allow gene expression patterns to be visualized in normal and diseased states. Based on user input over the last 3 years, we also hope to include a list of genes regulated by the major transcription factors.
Our strategy of involving the biomedical community has already proved successful: users provide feedback on individual entries through the ‘Comments’ button, and interested researchers are designated as a ‘Molecule Authority’ and listed under the ‘Credits’ tab. To make the best use of HPRD and to understand the annotation procedure and philosophy, we strongly encourage all users to visit the ‘FAQs’ page (http://www.hprd.org/FAQ). We hope that this community involvement will continue to intensify over the coming years. Our aim is to make HPRD a knowledgebase of human proteins that assists biomedical discoveries by serving as a complete resource of genomic, transcriptomic and proteomic information and by providing an integrated view of sequence, function and protein networks in health and disease.
The HPRD was developed with funding from the National Institutes of Health and the Institute of Bioinformatics. Funding to pay the Open Access publication charges for this article was provided by a grant from the NIH (RR020839).
Conflict of interest statement. A.P. serves as Chief Scientific Advisor to the Institute of Bioinformatics. A.P. is entitled to a share of licensing fees paid to the Johns Hopkins University by commercial entities for use of the database. The terms of these arrangements are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.