Sharing proteomic data with the biomedical community through a unified proteomic resource, especially in the context of individual proteins, is a challenging prospect. We have developed a community portal, designated as Human Proteinpedia (http://www.humanproteinpedia.org/), for sharing both unpublished and published human proteomic data through the use of a distributed annotation system designed specifically for this purpose. This system allows laboratories to contribute and maintain protein annotations, which are also mapped to the corresponding proteins through the Human Protein Reference Database (HPRD; http://www.hprd.org/). Thus, it is possible to visualize data pertaining to experimentally validated posttranslational modifications (PTMs), protein isoforms, protein–protein interactions (PPIs), tissue expression, expression in cell lines, subcellular localization and enzyme substrates in the context of individual proteins. With enthusiastic participation of the proteomics community, the past 15 months have witnessed data contributions from more than 75 labs around the world including 2710 distinct experiments, >1.9 million peptides, >4.8 million MS/MS spectra, 150 368 protein expression annotations, 17 410 PTMs, 34 624 PPIs and 2906 subcellular localization annotations. Human Proteinpedia should serve as an integrated platform to store, integrate and disseminate such proteomic data and is inching towards evolving into a unified human proteomics resource.
Proteomics is the large-scale analysis of proteins whose major goals are to characterize features of gene products such as posttranslational modifications (PTMs), protein isoforms, subcellular localization, protein–protein interactions (PPIs) and tissue expression (1, 2). Many proteomic experiments make use of high-throughput technologies such as yeast two-hybrid, mass spectrometry (MS), protein/peptide arrays or fluorescence microscopy techniques to yield multi-dimensional datasets. These datasets, which are often quite large, are not usually published in their entirety or are published as supplementary information that is not easily searchable. Given the heterogeneous nature of proteomic data, integration of such diverse data from multiple sources is quite difficult. Without a system for standardizing and sharing such data in place, it is less fruitful for the biomedical community to contribute these types of data to centralized repositories. Even more difficult is the annotation and display of pertinent information in the context of the corresponding protein. In light of these issues, we have developed Human Proteinpedia (http://www.humanproteinpedia.org/) as a portal that overcomes many of these obstacles to provide a unified view of the human proteome.
Researchers contributing data to Human Proteinpedia belong to various areas of biomedical research including biochemistry, molecular biology, bioinformatics, genetics, cell biology and pathology. Different types of MS datasets are included from mass spectrometers ranging from ion traps to quadrupole time-of-flight to Fourier transform instruments. To aid comparison and interpretation, all of the datasets are annotated with the sample, method of isolation and experimental platform-specific information. For MS experiments, this includes the labeling method, protease used, fractionation mode and ionization mode as well as database search annotations such as search algorithm and peptide score; for antibody-based experiments, it includes the primary antibody source organism, primary antibody company or catalog information and primary antibody titration. In addition to this range of data types, more than 80 human tissues and cell types, including liver, serum, brain, plasma, B lymphocytes, saliva, platelets and pancreas, are represented. With the recent advancements in proteomics technologies and large-scale proteomics initiatives like Human Proteome Organization (HUPO) and Cancer Genome Anatomy Project (CGAP), the volume of proteomic data is certain to grow rapidly. We aim to make Human Proteinpedia an exhaustive platform to store and disseminate diverse proteomic data.
The distributed annotation system (DAS) is a set of defined protocols that is used to share biological data. This system was originally developed to share annotations for genomic data (i.e. DNA information) of various organisms and comprises three components: a genome server for hosting genome maps, sequences and sequencing information; an annotation viewer, which is basically a computer used to access a genome browser; and one or more annotation servers. Resources that already have DAS implementations include WormBase (3), FlyBase (4) and Ensembl (5), allowing integration and visualization of data in a browser. A major limitation of DAS is that the specifics of the protocol mainly relate to DNA (6) or, more recently, mRNA data (7). Furthermore, significant technical expertise is required for any laboratory to participate in the data sharing process. Thus, given the complexity of proteomic data, the specifics of DAS must be altered before it can be used to share data pertaining to proteins.
Even though initial attempts to extend the specifics of the DAS protocol to suit protein data have been made by SPICE (7), Dasty (8), UniProt (9) and ProteinDAS (5), the inherent properties of proteins such as PTMs, protein isoforms, PPIs, tissue expression, subcellular localization, functional annotations and enzyme substrates have not yet all been consolidated into a single annotation system. Our annotation system follows DAS standards but is extended to fit protein data that characterize protein functions. Human Proteinpedia simplifies data sharing by allowing contributors to provide data in four different ways: (i) annotation of data on individual proteins along with experimental evidence through the use of web forms; (ii) upload of data via the web in a batch mode; (iii) sending data through FTP/e-mail to the Human Proteinpedia team, which carries out data processing, formatting, mapping and upload of data; and (iv) DAS servers set up by the contributing labs for upload of data. Supplementary Figure 1 shows the web interface for annotation of PTMs and tissue expression using simple pull-down menus and pop-up windows for specifying and mapping sites of modification or tissue expression using standardized vocabulary terms. We have included these simpler options because, although laboratories can set up their own annotation servers, the technical expertise required would otherwise preclude many investigators from contributing their data. In addition, we have also developed a semi-automatic author tracking system for capturing proteomic data. Here, we scan the issues published to date in all of the major proteomics journals for datasets that could be included in Human Proteinpedia. Validity checks are carried out to minimize logical errors (e.g. typographic errors).
For instance, if a user submits a PTM of serine located at position 162 for a given protein accession number, NP_0123456.1, the automated validation system will check whether the accession provided belongs to a human protein and whether there is indeed a serine residue at the given position. Once verified, the data are uploaded into Human Proteinpedia.
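The check described above can be illustrated with a minimal Python sketch. It is not the validation code used by Human Proteinpedia: the `ProteinRecord` type, the `validate_ptm_site` function, the in-memory `TOY_DB` lookup and its accession and sequence are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProteinRecord:
    """Hypothetical record: the organism and amino acid sequence of one accession."""
    organism: str
    sequence: str


# Toy lookup table (assumption): a real system would query a sequence database.
TOY_DB: Dict[str, ProteinRecord] = {
    "NP_TOY.1": ProteinRecord(organism="Homo sapiens", sequence="MKTAYSAGLS"),
}


def validate_ptm_site(
    db: Dict[str, ProteinRecord], accession: str, position: int, residue: str
) -> List[str]:
    """Return a list of validation errors; an empty list means the site passed.

    `position` is 1-based, as residue numbers are normally reported.
    """
    record = db.get(accession)
    if record is None:
        return [f"unknown accession: {accession}"]

    errors: List[str] = []
    # Check 1: the accession must belong to a human protein.
    if record.organism != "Homo sapiens":
        errors.append(f"{accession} is not a human protein ({record.organism})")
    # Check 2: the position must exist and carry the claimed residue.
    if not 1 <= position <= len(record.sequence):
        errors.append(f"position {position} is outside 1..{len(record.sequence)}")
    elif record.sequence[position - 1] != residue:
        found = record.sequence[position - 1]
        errors.append(f"expected {residue} at position {position}, found {found}")
    return errors
```

With the toy record, `validate_ptm_site(TOY_DB, "NP_TOY.1", 6, "S")` returns an empty list because residue 6 is a serine, whereas position 5 (a tyrosine) yields one error. Returning a list of messages rather than raising on the first failure lets a submission form report every problem at once.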
All of the data submitted to Human Proteinpedia can be viewed through Human Protein Reference Database (HPRD) (10, 11) in the context of an individual protein molecule, which provides an integrated view of the existing literature curated data along with the deposited annotations. The peptide data in Human Proteinpedia has been mapped onto the Ensembl human genome browser and can be viewed as separate tracks.
Human Proteinpedia is the portal for querying, browsing and downloading the contributed datasets (Supplementary Figure 2). All meta-annotations pertaining to experiments (e.g. description of experiment, experimental platform, publication details, etc.) along with the protein annotations (e.g. identified peptides, autoradiographs, microscopy images, etc.) are stored in the server hosting Human Proteinpedia. On the other hand, storage and dissemination of the MS data is more challenging because it poses unique issues pertaining to hardware, file formats and file transfer. For this purpose, we have taken advantage of ProteomeCommons.org, which is a public repository for digital content relating to proteomics and incorporates most of the tools for interchanging file formats. The Tranche file-sharing network supporting ProteomeCommons.org (http://www.proteomecommons.org/dev/dfs) is a secure transactional system for proteomics data that incorporates public key/private key encryption techniques and a peer-to-peer distributed file system, which provides considerable flexibility for file transfer issues. The system is designed to support both raw proteomics data and metadata independent of file format. Currently, over 16 servers (~50 TB of aggregate capacity) host the MS datasets in triplicate including two servers (in Japan and USA) set up specifically for this initiative.
All of the raw and processed MS datasets referenced by Human Proteinpedia are deposited in Tranche, which delivers the datasets as required. The data can be accessed either by clicking a link to download a file via web browser or by a link that will launch the Tranche downloader. Tranche can also act as a repository for reference datasets for other metadatabases and includes the contents of Peptide Atlas (12), TheGPMdb (13), Open Proteomics Database (OPD) (14) and PRoteomics IDEntifications database (PRIDE) (15, 16). Tranche is file format independent so there are no barriers to importing data in standardized file formats (e.g. XML-based) or in proprietary file formats. Support for file conversion within Tranche is provided by the IO Framework (17), which supports all major standardized and proprietary MS file formats.
Protein features that can be annotated are briefly described below along with relevant examples that illustrate the functionality and utility of Human Proteinpedia.
Human Proteinpedia facilitates annotation and visualization of PTMs by mapping them onto the modified amino acid residue of the corresponding modified protein. This information is not readily available even in the published literature as either amino acid residue numbers or a short peptide sequence are usually reported as sites of modifications, both of which are disconnected from the protein level information. An example of a novel phosphorylation site on vimentin protein, annotated through the use of Human Proteinpedia, is shown in Figure 1. Vimentin is a cytoskeletal protein that is important for maintaining the integrity of cytoplasm and is involved in processes such as wound healing. It has recently also been shown to be secreted into the extracellular space by activated macrophages, an event that is dependent on phosphorylation of vimentin on serine and threonine residues (18). Thirteen novel phosphorylation sites on vimentin shared using Human Proteinpedia should provide more insight into the role of phosphorylation events that regulate its function and subcellular localization. As shown in the figure, additional details regarding the experiment pertaining to the phosphorylation data are shown under the heading ‘Human Proteinpedia’. The tandem MS (MS/MS) spectrum for one of the phosphopeptides is displayed through a spectrum viewer, obtained from PRIDE.
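When a modification site is reported only relative to a short peptide, mapping it onto the protein amounts to locating the peptide in the protein sequence and converting the offset. The Python sketch below illustrates that idea under simple assumptions (exact substring matching, 1-based numbering); the function name `map_site_to_protein` and the toy sequences are hypothetical and do not describe Human Proteinpedia's actual implementation.

```python
from typing import List


def map_site_to_protein(protein_seq: str, peptide: str, site_in_peptide: int) -> List[int]:
    """Map a 1-based site within a peptide to 1-based protein residue numbers.

    Every exact occurrence of the peptide in the protein is reported, so an
    ambiguous (repeated) peptide yields more than one candidate position.
    """
    if not 1 <= site_in_peptide <= len(peptide):
        raise ValueError("site_in_peptide must lie within the peptide")

    positions: List[int] = []
    start = protein_seq.find(peptide)  # 0-based index of the peptide start
    while start != -1:
        # 0-based start + 1-based offset gives the 1-based protein position.
        positions.append(start + site_in_peptide)
        start = protein_seq.find(peptide, start + 1)
    return positions


# Toy example: the peptide "TAYS" occurs twice, so the serine at peptide
# position 4 maps to two candidate residues in the protein.
toy_protein = "MKTAYSAGLSKTAYSR"
candidate_sites = map_site_to_protein(toy_protein, "TAYS", 4)  # [6, 15]
```

An empty result means the peptide does not occur in the given sequence, which may indicate a different isoform or an incorrect accession.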
Cataloging sites of expression for human proteins is an important task and one that cannot easily be tackled by a single laboratory or research group. The data pertaining to organ-based localization has grown considerably given several HUPO (e.g. plasma, liver, brain) and other initiatives [e.g. Human Protein Atlas project (19)]. Figure 2 shows the tissue expression data obtained by mass spectrometric analysis and immunohistochemical labeling from other laboratories for vimentin. This includes documentation of expression in human plasma from PeptideAtlas, in B cell, liver, platelets and serum from four different mass spectrometry laboratories and in lung, ovary, spleen, thyroid and tonsil from Human Protein Atlas.
Because a large majority of biochemical and cell biology experiments are carried out in cell lines, investigators can also deposit data pertaining to proteins in cell lines, such as proteins identified in a mass spectrometric analysis. Knowing which proteins are expressed in which cell lines, investigators will be able to plan their experiments in a wider variety of cell lines than would otherwise be possible. It could also cut costs, as investigators would not need to test a large battery of cell lines for expression themselves.
PPIs are crucial to most cellular events and represent an important component of current high-throughput screening modalities for therapeutic development. High-throughput proteomic techniques, such as identification of immunoprecipitated complexes, have recently become popular for systematically cataloging physical interactions between proteins on a large scale. Human Proteinpedia allows annotation of PPIs obtained from a number of platforms including yeast two-hybrid experiments, pulldown assays and protein complex identification by mass spectrometry. In all of these cases, standardized vocabulary is used to describe the experiments and the detection methods and has direct links to the published citation, if any.
Human Proteinpedia permits annotation of information pertaining to subcellular localization of proteins. The vocabulary used for this purpose is in compliance with cellular component of Gene Ontology terms. Also annotated are details about the experiment, fluorescence microscopy images (where available) and links to contributing laboratories. Figure 3 shows how a protein of unknown function that is present in the databases simply as a hypothetical protein (accession numbers MGC33867 or KIAA2013), was localized to the endoplasmic reticulum and the Golgi apparatus by two independent groups. While the report describing localization to the endoplasmic reticulum is only preliminary, the localization to the Golgi complex is supported by three lines of evidence: first, the investigators identified the proteins from a Golgi membrane preparation of rat liver tissue; second, they showed localization to the Golgi using a GFP fusion protein; and last, they raised an antibody against the human protein and demonstrated localization of the endogenous protein as well to the Golgi. Integration of all this information in one place is likely to spur further research on such proteins and exemplifies how new discoveries could be enabled by Human Proteinpedia.
The transient nature of an enzyme/substrate reaction makes it difficult to be captured by many of the standard high-throughput experiments. The instances where such information is actually available are quite valuable. Human Proteinpedia allows enzymatic reactions to be annotated along with details of the residue that is modified. The experimental type (e.g. in vitro experiment) and the detection method (e.g. phosphopeptide mapping or mass spectrometry) are also annotated.
Most of the proteomic repositories in the public domain are restricted to one or two experimental platforms. PeptideAtlas and PRIDE are specialized public repositories of peptides/proteins identified by tandem mass spectrometric experiments. PRIDE, in addition, also accepts two-dimensional gel data. Human Proteinpedia differs significantly from these repositories in its scope, data types accepted, annotation features and various options provided for submission of data.
Standardization of data formats alone is not sufficient for facilitating data exchange. It is equally important to use standardized vocabulary before different types of data can be freely exchanged, queried or retrieved. eVOC (20), a set of controlled terms organized into hierarchical vocabularies, is used to standardize human tissue nomenclature. Gene Ontology (21) terms for cellular component are used to designate subcellular compartments, while RESID (22) and Proteomics Standard Initiative-Molecular Interaction (PSI-MI) (23) vocabularies are used to standardize PTMs and experiment types, respectively. Proteomics Standard Initiative-Mass Spectrometry (PSI-MS) vocabularies are used to standardize MS-based experimental annotations.
A large number of potential uses of the data deposited in Human Proteinpedia can be envisaged. The following is a brief list of some of the important uses that can be envisaged in the near future:
In this effort, the data contributed by the proteomics community include over 34 000 PPIs obtained from yeast two-hybrid and co-immunoprecipitation assays, more than 17 000 PTMs obtained from mass spectrometric analyses, over 150 000 sites of protein tissue expression from immunohistochemistry and mass spectrometric analyses, 2906 protein subcellular localizations obtained from fluorescence microscopy and mass spectrometric analyses, over 1.9 million peptides and more than 4.8 million MS/MS spectra (Table 1). We have also imported mass spectrometry data from several HUPO initiatives, including the human plasma proteome project (HPPP), brain proteome project (HBPP) and liver proteome project (HLPP), that were deposited in two public repositories of MS data: PRIDE and PeptideAtlas. All of the data present in Human Proteinpedia are freely available to the community for downloading at http://www.humanproteinpedia.org/. We anticipate that ready availability of such diverse datasets will spur research in many new areas of human biology including signaling networks, biomarkers and cellular and developmental studies.
Active participation by members of the proteomics community in sharing data through Human Proteinpedia has resulted in a prolific increase in the number of protein annotations. Manual curation of scientific literature over the span of four years resulted in more than 228 800 protein annotations in HPRD. The enthusiastic participation of the proteomics community over the last 15 months has almost doubled the number of protein annotations in a much shorter time. This initial effort with 75 contributing laboratories enabled sharing of this large amount of data. We hope to include more human proteomic data in the near future as more investigators generating large proteomics datasets participate in this initiative. This would not be feasible without the continued participation of the community, and we encourage all investigators to share their published and unpublished proteomic datasets with Human Proteinpedia. We anticipate that contribution of experimental data to a public repository will be made an essential criterion for publication, as is already the case for nucleotide sequences, gene expression profiles and protein structures. With the participation of the scientific community, we foresee Human Proteinpedia serving as a unified resource for proteomics research.
Supplementary Data are available at NAR Online.
National Institutes of Health (U54 RR020839); Roadmap Initiative for Technology Centers for Networks and Pathways, partially. Funding for open access charge: National Institutes of Health (U54 RR020839).
Conflict of interest statement. None declared.