The availability of human genome sequence has transformed biomedical research over the past decade. However, an equivalent map for the human proteome with direct measurements of proteins and peptides does not exist yet. Here, we present a draft map of the human proteome using high resolution Fourier transform mass spectrometry. In-depth proteomic profiling of 30 histologically normal human samples including 17 adult tissues, 7 fetal tissues and 6 purified primary hematopoietic cells resulted in identification of proteins encoded by 17,294 genes accounting for ~84% of the total annotated protein-coding genes in humans. A unique and comprehensive strategy for proteogenomic analysis enabled us to discover a number of novel protein-coding regions, which includes translated pseudogenes, non-coding RNAs and upstream ORFs. This large human proteome catalog (available as an interactive web-based resource at http://www.humanproteomemap.org) will complement available human genome and transcriptome data to accelerate biomedical research in health and disease.
Analysis of the complete human genome sequence has thus far led to the identification of ~20,687 protein-coding genes1 although the annotation still continues to be refined. Mass spectrometry has revolutionized proteomics studies in a manner analogous to the impact of next generation sequencing on genomics and transcriptomics2–4. Several groups, including ours, have employed mass spectrometry to catalog complete proteomes of unicellular organisms5–7 and to explore proteomes of higher organisms including mouse8 or human9,10. To develop a draft map of the human proteome by systematically identifying and annotating protein-coding genes in the human genome, we carried out proteomic profiling of 30 histologically normal human tissues and primary cells using high resolution mass spectrometry. We generated tandem mass spectra corresponding to proteins encoded by 17,294 genes, accounting for ~84% of the annotated protein-coding genes in the human genome – the largest coverage of the human proteome reported thus far. This includes mass spectrometric evidence for proteins encoded by 2,535 genes that have not been previously observed as evidenced by their absence in large community-based proteomic datasets - PeptideAtlas11, GPMDB12 and neXtProt13 (which includes annotations from Human Protein Atlas14).
A general limitation of current proteomics methods is their dependence on predefined protein sequence databases for identifying proteins. To overcome this, we also employed a comprehensive proteogenomic analysis strategy to identify novel peptides/proteins that are currently not part of annotated protein databases. This approach revealed novel protein-coding genes in the human genome that are missing from current genome annotations in addition to evidence of translation of several annotated pseudogenes as well as non-coding RNAs. As discussed below, we provide evidence for revising hundreds of entries in protein databases based on our data. This includes novel translation start sites, gene/exon extensions and novel coding exons for annotated genes in the human genome.
To generate a baseline proteomic profile in humans, we studied 30 histologically normal human cell and tissue types, including 17 adult tissues, 7 fetal tissues, and 6 hematopoietic cell types (Fig. 1a). Pooled samples from three individuals per tissue type were processed and fractionated at the protein level by SDS-PAGE and at the peptide level by basic RPLC and analyzed on high resolution Fourier transform mass spectrometers (LTQ-Orbitrap Elite and LTQ-Orbitrap Velos) (Fig. 1b). To generate a high quality dataset, both precursor ions and HCD-derived fragment ions were measured using the high resolution and high accuracy Orbitrap mass analyzer. Approximately 25 million high resolution tandem mass spectra, acquired from >2,000 LC-MS/MS runs, were searched against NCBI’s RefSeq15 human protein sequence database using MASCOT16 and SEQUEST17 search engines. The search results were rescored using the Percolator18 algorithm and a total of ~293,000 non-redundant peptides were identified at a q value <0.01 with a median mass measurement error of ~260 parts per billion (Extended Data Fig. 1a). The median numbers of peptides and corresponding tandem mass spectra identified per gene were 10 and 37, respectively, while the median protein sequence coverage was ~28% (Extended Data Fig. 1 b, c). It should be noted, however, that false positive rates for subgroups of peptide-spectrum matches can vary with the nature of the peptides, such as size, charge state of the precursor peptide ions or missed enzymatic cleavage (Extended Data Fig. 1d–f and Supplementary Information).
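The mass accuracy quoted above is a relative error between observed and theoretical precursor masses. As a minimal sketch of how such a figure is computed, the snippet below converts observed/theoretical pairs to parts per million and reports the median absolute error in parts per billion. The function names and the example values are illustrative assumptions, not taken from the study's software.

```python
from statistics import median


def mass_error_ppm(observed, theoretical):
    """Signed relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def median_abs_error_ppb(pairs):
    """Median absolute mass error in parts per billion.

    `pairs` is an iterable of (observed, theoretical) values, for
    example precursor m/z values of identified peptides (hypothetical
    input format).
    """
    return median(abs(mass_error_ppm(o, t)) for o, t in pairs) * 1000.0


# Made-up example values, chosen only to illustrate the units.
example_pairs = [(500.0001, 500.0), (500.00013, 500.0), (500.0002, 500.0)]
print(round(median_abs_error_ppb(example_pairs), 1))
```

A 0.00013 unit deviation at 500 corresponds to 0.26 ppm, that is, 260 ppb.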
We compared our dataset with two of the largest human peptide-based resources – PeptideAtlas and GPMDB. These two databases contain curated peptide information that has been collected from the entire proteomics community over the last decade. Strikingly, almost half of the peptides we identified were not deposited in either one of these resources. Also, the novel peptides in our dataset constitute 37% of the peptides in PeptideAtlas and 54% of peptides in the case of GPMDB (Extended Data Fig. 1g, h). This dramatic increase in the coverage of human proteomic data was made possible by the breadth and depth of our analysis as most of the cells and tissues that we have analyzed have not previously been studied using similar methods. The depth of our analysis enabled us to identify protein products derived from two-thirds (2,535 out of 3,844) of proteins designated as ‘missing proteins’19 for lack of protein-based evidence. Several hypothetical proteins that we identified have a broad tissue distribution indicating the inadequate sampling of the human proteome thus far (Extended Data Fig. 2a).
Based on gene expression studies, it is clear that several genes involved in basic cellular functions are constitutively expressed in almost all cells/tissues. Although the concept of ‘housekeeping genes’ as genes that are expressed in all tissues and cell types is widespread among biologists, there is no readily available catalog of such genes. Moreover, the extent to which these transcripts are translated into proteins remains unknown. We detected proteins encoded by 2,350 genes across all human cells/tissues, with these highly abundant ‘housekeeping proteins’ constituting ~75% of total protein mass based on spectral counts (Extended Data Fig. 2b). The large majority of these highly expressed housekeeping proteins are histones, ribosomal proteins, metabolic enzymes and cytoskeletal proteins. One of the caveats of tissue proteomics is the contribution of vasculature, blood and hematopoietic cells. Thus, proteins designated as housekeeping proteins based on analysis of tissue proteomes could be broadly grouped into two categories: those that are truly expressed in every single cell type and those that are found in every tissue because they derive from cell types present in all tissues (e.g. endothelial cells). Another caveat to be noted here is that some proteins that are indeed expressed in all tissues might not be detected in some of the tissues because of inadequate sampling by mass spectrometry. Thus, this list of housekeeping proteins will continue to be refined as additional in-depth analyses are carried out.
We used a label-free method based on spectral counting to quantitate protein expression across cells/tissues. Although more variable as compared to label-based methods, this method is readily applicable to analysis of a large number of samples8 and has been shown to be reproducible20. Supervised hierarchical clustering showed proteins encoded by some genes to be expressed in only a few cells/tissues, while others were more broadly expressed (Fig. 2a). Some proteins detected in only one sample were encoded by well-known genes like CD19 in B cells, SCN1A in frontal cortex and GNAT1 in the retina, while others were encoded by ill-characterized genes. For example, C8orf46 was expressed in adult frontal cortex while C9orf9 was expressed in adult ovary and testis. Overall, we detected proteins encoded by 1,537 genes only in one of 30 human samples examined in this study (Extended Data Fig. 2c). These may or may not be tissue-specific genes because of the limit of detection of mass spectrometry and because this analysis did not sample every human cell or tissue type. Because methods based on antibody-based detection can be more sensitive, we performed Western blotting experiments to confirm the tissue-restricted nature of expression of some proteins against which appropriate antibodies were available. Of 32 proteins tested, eight proteins exhibited a tissue-specific expression in agreement with mass spectrometry-derived data (Extended Data Fig. 3a). Four proteins exhibited a more widespread expression although in each of these cases extra bands were detected (Extended Data Fig. 3b). In eighteen cases, the antibody did not recognize a protein in the expected size range at all while no band was detectable in the remaining two cases.
A number of proteins are expressed during development in fetal tissues but not in normal adult tissues. Although earlier studies have focused on a few fetal tissues like fetal brain21 or liver22, our study provides the first general survey of the fetal proteome. We detected proteins encoded by 735 genes that are expressed >10-fold in fetal samples as compared to adult tissues/cells. A heat map highlights the expression level of fetal tissue-restricted genes across various fetal tissues (Fig. 2b). The list includes the well-known oncofetal antigens, alpha fetoprotein (AFP) and insulin-like growth factor-2 binding protein-3/IGF-2 mRNA binding protein-3 (IGF2BP3). High levels of AFP in serum and cerebrospinal fluid are clinically used as biomarkers for neural tube defects, teratoma and yolk sac tumors. Some of the proteins expressed during development in ovary and testis can serve as potential biological markers for identifying cancers of different lineages in the future.
In the past, gene expression profiles across various experimental conditions or tissues have been utilized to investigate the likelihood of co-expressed genes to physically interact at the protein level23,24. We hypothesized that protein expression patterns should be a better predictor of protein-protein interactions than gene expression measured at the mRNA level. We correlated the expression level for each available protein pair across all 30 cells/tissues in our data using Pearson correlation and compared this to known protein-protein interactions. We then repeated this analysis using correlations obtained from 111 published gene expression data sets. ROC curve analysis clearly shows that data from the human proteome map outperforms that from gene expression profiles for predicting protein-protein interactions, even if all the gene expression datasets are combined and used as a single predictor (Fig. 2c). It should be pointed out, however, that although the use of protein expression data is useful for predicting protein-protein interactions, it is unlikely by itself to be sufficient for such predictions.
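The comparison described above can be sketched in a few lines: compute a Pearson correlation for every protein pair across samples, then ask how well those correlations separate known interacting pairs from the rest using the area under the ROC curve. This is only an illustrative sketch under stated assumptions. The protein names, expression values and the set of known interactions are invented, and the AUC is computed with a simple pairwise-ranking formula rather than whatever software was used in the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, treat as 0
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(expression):
    """Map each unordered protein pair to its correlation across samples."""
    return {
        (a, b): pearson(expression[a], expression[b])
        for a, b in combinations(sorted(expression), 2)
    }


def roc_auc(scores, labels):
    """Probability that a positive pair scores above a negative pair."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical spectral-count profiles across four samples.
expression = {"P1": [1, 2, 3, 4], "P2": [2, 4, 6, 8], "P3": [4, 3, 2, 1]}
known_interactions = {("P1", "P2")}  # invented reference set

corr = pairwise_correlations(expression)
pairs = sorted(corr)
auc = roc_auc([corr[p] for p in pairs],
              [p in known_interactions for p in pairs])
print(auc)
```

The same `roc_auc` call can be repeated with correlations derived from mRNA data to compare the two predictors on an equal footing.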
Many proteins interact with different partners in different tissues or at different stages of development although they are not traditionally studied in this fashion. To investigate tissue-restricted expression of protein complex subunits, we evaluated 938 complexes with three or more subunits from the CORUM database25 and found 679 protein complex subunits showing differential expression in at least one of the 30 tissues. In contrast, there were 34 complexes where all the subunits were expressed concurrently in all tissues. Interestingly, there were 201 instances where differential expression of subunits of complexes was observed across the adult and fetal tissues suggesting that these complexes are dynamic and likely have distinct composition during ontogeny. This dynamic composition is probably related to developmental stage-specific processes in which these complexes are involved. For example, MCM complex components are highly expressed in fetal liver in contrast to FARP2-NRP1-PlexinA1 complex members, which are highly expressed in the adult liver (Fig. 2d).
Alternative splicing gives rise to a large number of splice variants at the RNA level, some of which can encode distinct protein isoforms. Multiple protein isoforms are contributed by only one-third of annotated genes, while the remaining two-thirds generate only a single protein product according to the RefSeq database15 (Extended Data Fig. 2d). Although our primary goal was not to obtain complete coverage of splice isoforms, we identified isoform-specific peptides for 2,861 protein isoforms derived from 2,450 genes. For example, we detected isoform 1 of Fyn protein tyrosine kinase in brain and isoform 2 in hematopoietic cells (Fig. 3a). This is significant because although we did not detect the third known isoform of FYN, the two isoforms that we identified are known to possess distinct functional properties.
We have developed a portal for the Human Proteome Map (http://www.humanproteomemap.org), which makes it possible to test and generate hypotheses regarding gene families, protein complexes, signaling pathways, biomarkers, therapeutic targets, immune function and human development. As an illustration, one can explore the protein components of the 20S constitutive proteasome and immunoproteasome complexes (Fig. 3b and Extended Data Fig. 2e). Three of the subunits in 20S constitutive proteasome (PSMB5, PSMB6 and PSMB7; colored red in Fig. 3b left panel) are known to be replaced by three other subunits (PSMB8, PSMB9 and PSMB10; colored green) in the 20S immunoproteasome26. As shown in the heat map from Human Proteome Map, PSMB8, PSMB9 and PSMB10 are highly expressed in immune cell types.
The evidence for protein-coding potential is still largely driven by gene prediction algorithms or cDNAs and does not routinely include direct detection/measurement of proteins/peptides. Following protein database search of our large-scale proteome analysis, we extracted ~16 million MS/MS spectra that did not match currently annotated proteins. These unmatched spectra were searched using a unique proteogenomic analysis strategy developed by our group against conceptually translated human reference genome, RefSeq transcript sequences, non-coding RNAs and pseudogenes. In addition, the data were searched against theoretical protein N-termini and predicted signal sequences (Fig. 4a). As a result, we confirmed translation of 17,294 annotated protein-coding genes that include 4,105 protein N-termini, 223,385 exons and 66,947 splice junctions (Fig. 4b). More importantly, we discovered 808 novel annotations of the human genome including translation of 140 pseudogenes, 44 novel ORFs, 106 novel coding regions/exons within annotated gene structures and 110 gene/protein/exon extension events in addition to 198 novel protein N-termini and 201 novel signal cleavage sites (Fig. 4b and Supplementary Table 1). Although the peptides were identified at <1% FDRs with ppb median mass measurement error (Extended Data Fig. 1i), we carried out an additional level of manual inspection of the MS/MS spectra to reduce false positives in this type of analysis27. In addition, because these novel findings are based on matches of tandem mass spectra to translated genomic/transcriptomic sequences, we experimentally confirmed a subset of the identifications from various proteogenomic categories through matching of MS/MS spectra from 98 synthetic peptides with those obtained from analysis of human cells/tissues (Supplementary Data).
We identified 28 genes with upstream ORFs (uORFs) where peptides mapped to 5’ UTRs of known human RefSeq transcripts. For example, we identified peptides that mapped to 5’ UTR of solute carrier family 35, member A4 transcript (SLC35A4). Although we did not identify any unique peptides from the protein encoded by SLC35A4, which is probably related to its low abundance or cell/tissue-specific expression, we identified four peptides derived from the protein encoded by the uORF in 25 of 30 samples (Extended Data Fig. 5a). This uORF encodes a 103 amino acid protein that contains a transmembrane region within a putative BNIP3 domain (BCL2/adenovirus E1B 19 kDa protein-interacting protein 3) that is required for mitochondrial localization, suggesting that this protein may play a role in mitochondrial function.
We identified eight cases where we observed peptides that mapped to an ORF located in an alternate reading frame within coding regions of annotated genes. For example, we identified peptides that mapped to a novel ORF of 159 amino acids within the C11orf48 gene. The protein encoded by the C11orf48 gene was identified only in the adult retina, whereas we identified three peptides encoded by the novel ORF in 17 different cells/tissues. We also identified peptide matches to seven ORFs located within 3’UTRs. As an example, a novel ORF comprising 524 amino acids in the 3’UTR of the CHTF8 gene was identified on the basis of multiple peptides. The translation initiation site of this novel ORF overlaps the stop codon of the CHTF8 gene (Extended Data Fig. 4a). Remarkably, the protein encoded by this novel ORF was observed in hematopoietic cells where we did not detect the CHTF8 protein. In addition, this novel gene product was expressed at higher levels in fetal ovary and adult testis than the protein encoded by CHTF8. These observations suggest that the translational control for these two proteins encoded by the same gene structure is likely different. We also identified a peptide encoded by an ORF within a human endogenous retrovirus (Extended Data Fig. 5b). Domain analysis revealed the presence of a signal peptide at the N-terminus along with other domains including Furin-like repeats. In fact, during preparation of this manuscript, a report was published in which this protein was designated as suppressyn and shown to inhibit cell-cell fusion in trophoblast cells28.
We detected several peptides that were not mapped to RefSeq proteins but corresponded to nine transcripts annotated as non-coding RNAs. As shown in Extended Data Fig. 4b, five peptides (including one across an exon-exon junction) were mapped to a region on Chr. 1 that was annotated as a non-coding RNA. Although long non-coding RNAs are, by definition, not supposed to encode proteins29, our data suggest that there are additional translated non-coding RNAs, which can be discovered by deep proteomic profiling of cells/tissues.
A few studies have recently examined the transcription of pseudogenes on a global scale and found ~2,000 pseudogenes to be transcribed either ubiquitously or in a tissue/cancer-specific manner30,31. In our dataset, we identified >200 peptides that are encoded by 140 pseudogenes. These 140 pseudogenes originate from 110 parental genes. In order to derive this unambiguous set, we filtered out several peptides for which the sequence change could be explained by SNPs reported for the corresponding genes in dbSNP. When we looked at the tissue distribution of the translated pseudogenes, we observed that roughly half of the pseudogenes were translated in a cell/tissue-restricted manner while a small minority was expressed globally (Fig. 5a), a pattern similar to that described for pseudogene transcripts30,31. For instance, VDAC1P7 was found to be translated globally (22 of 30 cells/tissues analyzed) while MAGEB6P1 was detected only in adult testes although its parental gene, EIF4B, is widely expressed (Fig. 5a). Extended Data Fig. 5c shows two peptides that map to the MAGEB6P1 pseudogene along with an alignment of the identified peptides with the corresponding peptides in the parental gene. It should be noted that there is still a small chance that some SNPs are not represented in our analysis since we did not analyze the genomes of these individuals. However, this is unlikely to impact our results greatly because it has been estimated that ~98% of SNPs with 1% frequency have been detected by the 1,000 Genomes Project32.
We confirmed 4,105 annotated N-termini in RefSeq (from 4,132 protein-coding genes) to generate the largest catalog of experimentally validated translational start sites in humans (~20% of annotated genes). Annotation of start sites in proteins can be imprecise because initiation AUG codons are generally predicted by computation and are prone to errors. For instance, the optimal Kozak consensus sequence (CCACC[AUG]G), which is widely believed to surround the true initiation codon, occurs only in half of the human genes33. By searching unmatched MS/MS spectra against customized databases, we identified 3 cases that map upstream and 195 cases that map downstream (within 100 amino acids) of the currently annotated translational initiation sites (Fig. 5b). The nucleotide sequences that surround the AUG codon of these novel translational start sites exhibit the same pattern as bona fide translational start sites (Extended Data Fig. 6) strongly suggesting that we have identified true novel or alternative start sites. We also confirmed 201 annotated signal peptide cleavage sites and revised the predictions in 128 other cases (Supplementary Information).
Here, we present a high coverage map (covering ~84% of human protein-coding genes) of the human proteome using high resolution mass spectrometry. This demonstrates the feasibility of comprehensively exploring the human proteome on an even larger scale in the near future, such as the initiative recently announced by the Human Proteome Organization including the Chromosome-centric Human Proteome Project34–36. Our findings also highlight the need for using direct protein sequencing technologies like mass spectrometry to complement genome annotation efforts. This process can be facilitated by workflows suitable for proteogenomic analyses. Even greater depth and sequence coverage can be achieved by employing methodologies such as multiple proteases, direct capture of N-termini, enrichment of post-translationally modified peptides and exhaustive fractionation, and by incorporating alternative sequencing technologies such as ETD and top-down mass spectrometry. Finally, these strategies can be further complemented by sampling of individual cell types of human tissues/organs to develop a ‘human cell map.’ With the availability of both genomic and proteomic landscapes, integrating the information from both resources is likely to accelerate basic as well as translational research in the years to come through a better understanding of gene-protein-pathway networks in health and disease.
Human tissues were harvested postmortem as part of a rapid autopsy program from three adult donors for each tissue by Dr. Chris Iacobuzio-Donahue or obtained from National Disease Research Interchange (NDRI, PA, USA). All tissues were histologically confirmed to be normal prior to analysis. Fetal tissues were obtained by Dr. Candace Kerr. Hematopoietic cells were isolated from leukopaks obtained from healthy volunteers participating in routine platelet pheresis. This study was approved by the Johns Hopkins University’s Institutional Review Board for use of human tissues and informed consent was obtained from all subjects from whom blood samples were obtained for isolation of hematopoietic cells. Hematopoietic cell populations were isolated sequentially using Miltenyi Biotec magnetic beads from leukopaks in the following order; CD14+ Monocytes, pan CD56+ NK Cells, CD20+ B cells, CD8+ T cells, pan CD4+ T cells as per manufacturer’s instructions. Briefly, peripheral blood mononuclear cells (PBMCs) were enriched by centrifugation over Ficoll gradient at 700 g for 15 minutes. Cells were incubated with magnetic beads in modified MACS buffer (0.05% BSA and 2mM EDTA) and separated using autoMACS (Miltenyi Biotec). Purity was assessed by flow cytometry. Isolated cells were washed with PBS and stored at −80°C until use. Platelets were obtained from platelet rich plasma from healthy donors. Platelets were pelleted by centrifugation, resuspended and washed with modified Tyrode’s buffer before lysis.
The samples were lysed in lysis buffer containing 4% SDS, 100 mM DTT and 100 mM Tris pH 7.5. Tissues were homogenized and sonicated in lysis buffer. The protein concentration of the cleared lysates was estimated using BCA assay and equal amounts of protein from each donor were pooled for further fractionation. Proteins from SDS lysates were separated on SDS-PAGE and in-gel digestion was carried out as described earlier37. Briefly, the protein bands were destained, reduced and alkylated and subjected to in-gel digestion using trypsin. The peptides were extracted, vacuum dried and stored at −80°C until further analysis. For peptide-level fractionation, the samples (~450 µg proteins) were reduced, alkylated and digested using trypsin overnight at 37°C. The peptide digests were desalted using Sep-Pak C18 columns (Waters Corporation, Milford, MA) and lyophilized. The lyophilized samples were reconstituted in solvent A (10 mM tetraethylammonium bicarbonate, pH 8.5) and loaded onto an XBridge C18, 5 µm 250 × 4.6 mm column (Waters, Milford, MA, USA). The digests were resolved by the bRPLC38 method using a gradient of 0 to 100% solvent B (10 mM tetraethylammonium bicarbonate in acetonitrile, pH 8.5) in 50 minutes. The total fractionation time was 60 minutes. A total of 96 fractions were collected, which were then concatenated to 24 fractions, vacuum dried and stored at −80°C until further LC-MS analysis.
Peptide samples from in-gel digestion and bRPLC fractionation were analyzed on LTQ-Orbitrap Velos or LTQ-Orbitrap Elite mass spectrometers (Thermo Electron, Bremen, Germany) interfaced with Easy-nLC II nanoflow liquid chromatography systems (Thermo Scientific, Odense, Southern Denmark). The peptide digests from each fraction were reconstituted in solvent A (0.1% formic acid in water) and loaded onto a trap column (75 µm × 2 cm) packed in-house with Magic C18 AQ (Michrom Bioresources, Inc., Auburn, CA, USA) (5 µm particle size, pore size 100Å) at a flow rate of 5 µl/min with solvent A. Peptides were resolved on an analytical column (75 µm × 10 cm on Velos and 20 cm on Elite) at a flow rate of 350 nl/min using a linear gradient of 7–30% solvent B (0.1% formic acid in 95% acetonitrile) over 60 minutes. Mass spectrometry analysis was carried out in a data dependent manner with full scans (350–1800 m/z) acquired using an Orbitrap mass analyzer at a mass resolution of 60,000 at 400 m/z in Velos and 120,000 at 400 m/z in Elite. The twenty most intense precursor ions from a survey scan were selected for MS/MS in each duty cycle and detected at a mass resolution of 15,000 at m/z of 400 in Velos and 30,000 at m/z of 400 in Elite. All the tandem mass spectra were produced by the higher-energy collision dissociation (HCD) method. Dynamic exclusion was set for 30 seconds with a 10 ppm mass window. The automatic gain control for full FT MS was set to 1 million ions and for FT MS/MS was set to 0.05 million ions, with maximum ion injection times of 100 ms and 200 ms, respectively. Internal calibration was carried out using lock-mass from ambient air (m/z 445.1200025)39.
MS/MS data obtained from all LC-MS analyses were searched against the Human RefSeq database (version 50 containing 33,833 entries along with common contaminants) using SEQUEST and MASCOT (version 2.2) search algorithms through the Proteome Discoverer 1.3 platform (Thermo Scientific, Bremen, Germany). The parameters used for data analysis included trypsin as the protease with a maximum of two missed cleavages allowed. Carbamidomethylation of cysteine was specified as a fixed modification and oxidation of methionine, acetylation of protein N-termini and cyclization of N-terminal glutamine and alkylated cysteine were included as variable modifications. The minimum peptide length was specified to be 6 amino acids. The mass error was set to 10 ppm for precursor ions and 0.05 Da for fragment ions. The data were also searched against a decoy database and the results used to estimate q values using the Percolator algorithm within the Proteome Discoverer suite. Peptides were considered identified at a q value <0.01 with a median mass measurement error of ~260 parts per billion (Extended Data Fig. 1a). The mass spectrometry data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE40 partner repository with the dataset identifier PXD000561. For assessing the effect of parameters such as size, missed enzymatic cleavage or charge state of precursor peptide ions on FDRs, all PSMs identified at 1% FDR were collated from all fetal tissues and subjected to additional analyses presented in Extended Data Fig. 1d–f (discussed in Supplementary Information).
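The enzyme settings above (trypsin, up to two missed cleavages, minimum length of 6 residues) define which peptides a search engine considers for each protein. The following sketch enumerates that candidate set for a single sequence. It assumes the common convention that trypsin cleaves C-terminal to K or R except when the next residue is P; search engines differ on this detail, and the function name and example sequence are illustrative only.

```python
def tryptic_peptides(protein, max_missed=2, min_len=6):
    """Enumerate tryptic peptides of `protein`.

    Assumption: cleavage occurs after K or R unless followed by P.
    Peptides spanning up to `max_missed` uncleaved sites are included,
    and peptides shorter than `min_len` residues are dropped.
    """
    # Positions where the sequence is cut (start, internal sites, end).
    sites = [0]
    for i, residue in enumerate(protein):
        has_next = i + 1 < len(protein)
        if residue in "KR" and has_next and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    last = len(sites) - 1
    for i in range(last):
        # j - i - 1 is the number of missed cleavages in the peptide.
        for j in range(i + 1, min(i + 1 + max_missed, last) + 1):
            peptide = protein[sites[i]:sites[j]]
            if len(peptide) >= min_len:
                peptides.append(peptide)
    return peptides


# Toy sequence with two cleavage sites.
toy = "AAAAAAKGGGGGGRLLLLLL"
print(tryptic_peptides(toy, max_missed=0))
print(len(tryptic_peptides(toy)))
```

Raising `max_missed` grows the candidate set quickly, which is one reason the setting is usually kept small in large searches.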
Non-redundant peptides sequences were mapped onto RefSeq protein sequences and the corresponding RefSeq genes. When a peptide was mapped uniquely to a gene but to multiple protein isoforms, the peptide was considered ‘gene-specific.’ When a peptide was mapped uniquely to a single protein product of a gene, the peptide was considered ‘isoform-specific.’ Otherwise, peptides were considered ‘shared.’ To explore gene expression, gene-specific peptides were used. To map the human proteome, all the identified peptides were used. For the isoform expression analysis, isoform-specific peptides were used when multiple isoforms were annotated in RefSeq protein database (version 50).
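The three peptide categories defined above translate directly into a small decision rule. The sketch below assumes each peptide comes with the list of (gene, protein isoform) entries it matches; that input format and the identifiers in the example are hypothetical.

```python
def classify_peptide(hits):
    """Classify a peptide from its (gene, isoform) matches.

    Returns 'isoform-specific' when the peptide maps to a single
    protein product of a single gene, 'gene-specific' when it maps to
    several isoforms of one gene, and 'shared' otherwise.
    """
    if not hits:
        raise ValueError("peptide has no database matches")
    genes = {gene for gene, _ in hits}
    isoforms = {isoform for _, isoform in hits}
    if len(genes) == 1 and len(isoforms) == 1:
        return "isoform-specific"
    if len(genes) == 1:
        return "gene-specific"
    return "shared"


# Hypothetical identifiers, for illustration only.
print(classify_peptide([("GENE_A", "GENE_A.iso1")]))
print(classify_peptide([("GENE_A", "GENE_A.iso1"), ("GENE_A", "GENE_A.iso2")]))
print(classify_peptide([("GENE_A", "GENE_A.iso1"), ("GENE_B", "GENE_B.iso1")]))
```

Note that a peptide from a gene with only one annotated isoform is trivially 'isoform-specific' under this rule, which is why the isoform analysis in the text is restricted to genes with multiple annotated isoforms.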
Spectral counts per gene per experiment (e.g. SDS-PAGE) were first summed from all peptides mapped to each gene. Total acquired tandem mass spectra were used to normalize between experiments and then spectral counts per gene were averaged across multiple experiments (e.g. SDS-PAGE and bRPLC fractionation) per sample (e.g. pancreas). Final averaged spectral counts per gene per sample were used to plot the white-to-red gradient heat map for genes of interest, while summed spectral counts per peptide were used to depict the heat map for all peptides identified and mapped to a gene of interest.
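The quantitation steps above (sum per gene, normalize by total acquired spectra, average across experiments) can be sketched as follows. The text does not give the exact normalization formula, so this example assumes a simple scaling of each gene's count by the experiment's total spectra times a constant; the `scale` parameter, function names and numbers are assumptions made for illustration.

```python
from collections import defaultdict


def sum_counts_per_gene(peptide_counts, peptide_to_gene):
    """Sum peptide-level spectral counts into gene-level counts."""
    gene_counts = defaultdict(int)
    for peptide, count in peptide_counts.items():
        gene_counts[peptide_to_gene[peptide]] += count
    return dict(gene_counts)


def average_normalized_counts(experiments, scale=1_000_000):
    """Average normalized gene counts across experiments of one sample.

    `experiments` is a list of (gene_counts, total_spectra) tuples.
    Assumed normalization: count / total_spectra * scale. A gene that
    is absent from an experiment contributes zero to the average.
    """
    genes = set()
    for gene_counts, _ in experiments:
        genes.update(gene_counts)
    averaged = {}
    for gene in genes:
        normalized = [
            gene_counts.get(gene, 0) / total * scale
            for gene_counts, total in experiments
        ]
        averaged[gene] = sum(normalized) / len(normalized)
    return averaged


# Made-up numbers: one SDS-PAGE and one bRPLC experiment for a sample.
sds_page = sum_counts_per_gene({"PEPA": 6, "PEPB": 4, "PEPC": 5},
                               {"PEPA": "G1", "PEPB": "G1", "PEPC": "G2"})
brplc = {"G1": 30}
result = average_normalized_counts([(sds_page, 1000), (brplc, 2000)],
                                   scale=1000)
print(result)
```

Treating missing genes as zero rather than skipping them keeps the average comparable between genes seen in one experiment and genes seen in both.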
Our data were compared with the neXtProt genes (http://nextprot.org/; release date: 07–16–2013) with expression evidence at protein level. All the peptides from our study were also compared with the peptide databases, GPMDB (http://gpmdb.thegpm.org/; release date: 02–17–2012) and PeptideAtlas (http://www.peptideatlas.org; release: 201308). Processed Illumina Human Body Map 2.0 data for 10 tissues (adrenal glands, ovary, testis, prostate, lung, colon, liver, heart, brain and kidney) were downloaded as individual GTF files from Broad Institute (ftp://ftp.broadinstitute.org/). All the proteins with isoform-specific peptides corresponding to the 10 tissues were mapped to respective mRNA accessions from the RefSeq database. The mRNA accessions were used to get the FPKM values of each isoform across tissues from the GTF files. These expression values were tabulated and used for comparison with normalized spectral counts of proteins. A normalized spectral count of zero was taken to mean that the protein was not identified in the tissue. Similarly, an FPKM of zero was taken to mean that the transcript was not expressed.
Equal amounts of adult tissue protein (25 µg) were separated by sodium dodecyl sulfate-polyacrylamide gel electrophoresis in 4%−12% NuPAGE® Bis-Tris Precast Gels (Novex, Life Technologies Corporation) and transferred onto nitrocellulose membranes (Whatman, GE Healthcare, Life Science, Pittsburgh, PA, USA) using a semi-dry transfer unit Hoefer TE 70 (Amersham Bioscience). The membranes were blocked with 5% low-fat milk dissolved in phosphate buffered saline containing 0.05% Tween (PBST) for 1 hour at room temperature and incubated overnight at 4°C with primary antibodies. After washing with PBST three times each for 10 min, the membranes were further incubated with the corresponding horseradish peroxidase-conjugated secondary antibodies for 1 hour at room temperature. After washing with PBST three times each for 10 min, antibody-bound protein bands were detected with ECL™ Western Blotting Detection Reagents (RPN2106V1 and RPN2106V2, GE Healthcare Life Science, Pittsburgh, PA, USA) and photographed with Amersham Hyperfilm ECL autoradiography film (GE Healthcare Life Science, Pittsburgh, PA, USA). Anti-GAPDH antibody was used for a loading control.
To enable identification of novel peptides and correction of existing gene annotations in the human genome, we searched six different databases generated in-house using Python scripts for proteogenomic analysis. The databases used were: 1) a six frame translated human genome database, 2) three frame translated RefSeq mRNA sequences, 3) a three frame translated pseudogene database with sequences derived from NCBI and Gerstein’s pseudogene database, 4) three frame translated non-coding RNAs from NONCODE, 5) an N-terminal peptide database derived from RefSeq mRNA sequences from NCBI and 6) a signal peptide database from SignalP and HPRD. The sequences of common contaminants, including trypsin, were appended to each database. A decoy database was created for each database by reversing the sequences from the target database. Peptide identification was carried out using X!Tandem41. The following parameters were common to all searches: 1) precursor mass error set at 10 ppm, 2) fragment mass error set at 0.05 Da, 3) carbamidomethylation of cysteine defined as a fixed modification, 4) oxidation of methionine defined as a variable modification and 5) only tryptic peptides with up to 2 missed cleavages were considered. Unmatched MS/MS spectra generated from the protein database search were then searched against these databases using the X!Tandem search engine installed locally. Specific details about generation of each database, search parameters and post-processing of the search outcome are provided below.
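As an illustration of two of the steps above (reversed-sequence decoys and the 10 ppm precursor tolerance), a minimal sketch follows. The function names, the `REV_` prefix and the toy sequences are assumptions made for this example; the actual in-house scripts and the X!Tandem configuration are not reproduced here.

```python
def reverse_decoys(targets, prefix="REV_"):
    """Build a decoy entry for every target by reversing its sequence.
    `targets` maps an accession to a protein sequence; `prefix` is hypothetical."""
    return {prefix + name: seq[::-1] for name, seq in targets.items()}


def within_ppm(observed_mz, theoretical_mz, tol_ppm=10.0):
    """True if the precursor mass error is within the tolerance in parts per million."""
    error_ppm = abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return error_ppm <= tol_ppm


# Toy target database (made-up accessions and sequences).
targets = {"TOY1": "MKTAYIAKQR", "TOY2": "MSLLTEVETPIR"}
decoys = reverse_decoys(targets)

# Concatenated target + decoy database, as commonly used for FDR estimation.
database = {**targets, **decoys}
```

Reversal keeps the amino acid composition and length distribution of the target database, which is why it is a common choice for decoy generation.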
The human reference genome assembly hg19 was downloaded from NCBI and translated in six reading frames using in-house Python scripts. The six frame translation included stop codon to stop codon translation of the template sequence. Translated peptide sequences shorter than 7 amino acids were not included in the database. A three frame translated mRNA sequence database was created from mRNA sequences downloaded from NCBI RefSeq (RefSeq version 56 containing 33,580 sequences) and translated in three reading frames. Pseudogene sequences downloaded from NCBI (11,160 sequences) and Gerstein’s pseudogene database (16,881 sequences from http://pseudogene.org/, version 68) were translated in three reading frames. Similarly, non-coding RNA sequences from NONCODE (91,687 sequences from http://www.noncode.org/NONCODERv3/, version 3) and mRNA sequences from RefSeq version 56 (33,580 sequences) were translated in three reading frames. A custom N-terminal tryptic peptide database was created by fetching all the peptide sequences that begin with methionine and end with K/R from RefSeq mRNA sequences (RefSeq 56). 5’-UTRs of known mRNAs were translated in three reading frames and all possible peptides starting with a methionine and extending to the next tryptic cleavage site were generated. Peptides starting with a methionine and extending to the next tryptic cleavage site in the translated protein were also included in the database. Only peptide sequences with up to one missed cleavage and 6–25 amino acids in length were considered. The ‘Quick acetyl’ search option was enabled in X!Tandem searches because it allows clipping of up to 2 N-terminal amino acids. The signal peptide database was created by using signal peptide annotations for 15,257 proteins from HPRD (www.hprd.org) and SignalP version 4.0 to fetch sequences from the starting amino acid of the protein extended to the next tryptic cleavage site.
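The stop codon to stop codon six frame translation can be sketched as follows. This is a minimal illustration under stated assumptions, not the in-house script: it uses the standard genetic code, translates unknown codons (for example those containing N) as X, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string in frame 0; '*' marks stops, 'X' unknown codons."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_segments(dna, min_len=7):
    """Return (strand, frame, peptide) for every stop-to-stop segment that is
    at least `min_len` residues long, over all six reading frames."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    segments = []
    for strand, seq in strands.items():
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    segments.append((strand, frame, segment))
    return segments


# Toy sequence (not genomic data): encodes MKTAYIAK followed by a stop codon.
toy_dna = "ATGAAAACCGCGTATATTGCGAAATAA"
toy_segments = six_frame_segments(toy_dna)
```

For a whole chromosome, a production script would stream the sequence and record genomic coordinates for each segment; both are omitted here for brevity.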
Peptide sequences identified from each of the alternate database searches were filtered and only peptides passing a 1% FDR score threshold were considered. Spectra that matched to multiple sequences with equal score were not considered for further analysis. Unique lists of genome search specific peptides (GSSPs), pseudogene specific peptides and transcriptome search specific peptides were generated by comparing the unique peptide data with the protein database. The genomic regions to which these peptides mapped were further analyzed to identify novel protein coding regions or corrections/alterations to existing annotations. The peptides identified were categorized as 1) mapping to intergenic regions, 2) overlapping annotated genes, 3) mapping to the intronic regions of existing gene models or 4) overlapping annotated genes but translated in an alternate reading frame. For peptides identified uniquely from the pseudogene database search, the amino acid change compared to the parent sequence was checked against dbSNP to rule out SNP cases. Peptides identified with at least one mismatch and retained after comparison with dbSNP were considered for further analysis. The filtered novel peptides were further manually verified to check the confidence of spectral assignment by the search engines. A list of novel peptides identified from this study is available (Supplementary Table 1).
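A 1% FDR filter of this kind is commonly implemented with a target-decoy calculation. The sketch below assumes that higher scores are better and estimates FDR as decoys divided by targets above a score cutoff; the text does not state the exact formula used, so this is one reasonable convention rather than the authors' implementation. The tuple layout and function name are hypothetical.

```python
def filter_by_fdr(psms, max_fdr=0.01):
    """Keep target PSMs above the deepest score cutoff where the estimated
    FDR (decoys / targets) is still <= max_fdr.

    psms: list of (score, is_decoy, peptide); a higher score is assumed better.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to retain
    for i, (_, is_decoy, _) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p[1]]


# Toy PSM list (made-up scores): 150 confident targets, then a mixed tail.
toy_psms = [(100 - 0.1 * i, False, f"PEP{i}") for i in range(150)]
toy_psms += [(50.0, True, "DEC1"), (49.0, False, "LOW1"),
             (48.0, True, "DEC2"), (47.0, False, "LOW2")]
accepted = filter_by_fdr(toy_psms)
```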
The UCSC genome browser42 (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start) was used as a visualization tool for manual genome annotation using these novel peptides. Novel peptides, ESTs, ncRNA transcripts, Ensembl RNASeq models and gene prediction models from GENSCAN tracks were enabled on the genome browser. For manual genome annotation, genomic regions where novel peptides mapped were examined and novel genes or changes in gene structures were determined.
Peptide sequences were randomly chosen from several categories of proteogenomic analysis and synthesized (JPT Peptide Technologies, Berlin, Germany). Dried peptides were diluted using 0.1% formic acid and ~1 pmol of each peptide was separately subjected to LC-MS/MS analysis on the LTQ-Orbitrap mass spectrometers. Fragmentation patterns of 98 peptides were validated manually by comparing them with those identified from proteogenomic analysis (Supporting Information).
The Human Proteome Map (HPM) portal was developed using the 3-tier web architecture with presentation, application and persistence layers. A MySQL database was used as a persistence layer. PHP and HTML were used for the application and presentation layers, respectively, and AJAX was used for dynamic functionality. Protein and peptide identifications obtained from SEQUEST and MASCOT search engines were converted into MySQL tables. NCBI RefSeq56 annotations were used as standard reference for fetching additional information about the genes from various public resources. Normalized spectral counts were used to represent expression of proteins and peptides. For each peptide, a high resolution MS/MS spectrum from the best scoring identification is shown on the spectrum viewer page using the Lorikeet JQuery plugin (https://code.google.com/p/lorikeet/). The download page allows the user to download the data in standard formats.
PPI networks were generated for each tissue by filtering known interactions from the iRefWeb43 database (http://wodaklab.org/iRefWeb/search/index) based on tissue spectral count data. If two proteins physically interact and both have spectral count data in a tissue, then the protein interaction is included in the corresponding tissue specific network. First, proteins identified were mapped to UniProt protein accessions using data from the UniProt, HGNC, and GeneCards databases. 87,027 pairwise and experimental PPIs with confidence score > 0 (iRefWeb MINT-like score) were downloaded from iRefWeb. These interactions were used to generate 30 tissue specific PPI networks. We compared our tissue specific interaction networks to the CORUM database. There are 938 complexes with three or more subunits in our protein expression data. We evaluated the differential expression of these complexes within corresponding tissue types of adult and fetal samples.
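The tissue filtering rule can be sketched as below. The sketch assumes that "having spectral count data in a tissue" means a spectral count greater than zero, which is an interpretation rather than something the text states; the function name, identifiers and toy values are hypothetical.

```python
def tissue_networks(interactions, spectral_counts):
    """Build one interaction set per tissue.

    interactions: iterable of (protein_a, protein_b) known physical interactions.
    spectral_counts: {tissue: {protein: spectral_count}}.
    An interaction is kept for a tissue only if both partners are detected there
    (assumed here to mean spectral count > 0).
    """
    networks = {}
    for tissue, counts in spectral_counts.items():
        detected = {protein for protein, count in counts.items() if count > 0}
        networks[tissue] = {
            frozenset((a, b))
            for a, b in interactions
            if a in detected and b in detected
        }
    return networks


# Toy inputs with made-up identifiers (not data from iRefWeb or the study).
toy_interactions = [("P1", "P2"), ("P2", "P3")]
toy_counts = {
    "liver": {"P1": 5.0, "P2": 2.0, "P3": 0.0},
    "heart": {"P2": 1.0, "P3": 4.0},
}
toy_networks = tissue_networks(toy_interactions, toy_counts)
```

Storing each edge as a `frozenset` treats interactions as undirected, so (P1, P2) and (P2, P1) collapse to a single edge.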
Gene expression datasets were downloaded from GeneMANIA44. Only gene expression data sets without any missing values were used. The average Pearson correlation over multiple (111) gene expression data sets was calculated using Fisher’s Z transformation,

$$z_n = \frac{1}{2}\ln\left(\frac{1 + r_n}{1 - r_n}\right)$$

where $r_n$ is the Pearson correlation between two genes for the $n$th experiment. Then, an appropriate estimate of the true mean is calculated as,

$$\bar{z} = \frac{1}{N}\sum_{n=1}^{N} z_n$$

where $N$ is the total number of experiments. Then, by inversion, average correlation is calculated as,

$$\bar{r} = \frac{e^{2\bar{z}} - 1}{e^{2\bar{z}} + 1}$$
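The Fisher Z averaging procedure can be sketched in a few lines. The transformation is the inverse hyperbolic tangent and its inversion is the hyperbolic tangent, so the standard library suffices. The clipping constant `eps` is an assumption added here to avoid an infinite value when a correlation is exactly ±1; it is not described in the text.

```python
import math


def average_correlation(correlations, eps=1e-12):
    """Average Pearson correlations via Fisher's Z transformation.

    Each r_n is mapped to z_n = atanh(r_n) = 0.5 * ln((1 + r_n) / (1 - r_n)),
    the z_n are averaged, and the mean is mapped back with tanh.
    """
    zs = []
    for r in correlations:
        r = max(-1.0 + eps, min(1.0 - eps, r))  # keep atanh finite (assumption)
        zs.append(math.atanh(r))
    z_mean = sum(zs) / len(zs)
    return math.tanh(z_mean)
```

Averaging in z space rather than averaging the raw correlations reduces the bias caused by the skewed sampling distribution of r near ±1.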
Human proteome data was filtered by removing all genes with zero spectral count in more than 26 tissue samples. Pearson correlation values lie within the range [−1, +1]. They were converted to fall between [0, 1] using a logistic function for both human proteome and gene expression datasets. A positive set of 5,523 protein-protein interactions was downloaded from iRefWeb with MINT-like confidence score > 0.9 and where human proteome and gene expression information was available for both interaction participants. A negative set of an equal number of randomly selected protein pairs not part of the complete iRefWeb interaction set was created to compute ROC statistics.
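The filtering, rescaling and ROC steps can be sketched as follows. The exact logistic parametrization is not given in the text, so the `steepness` parameter is an assumption (with a steepness of 1, correlations in [−1, +1] map to roughly [0.27, 0.73], and larger values spread the scores closer to 0 and 1). The AUC is computed with the rank-based (Mann-Whitney) formulation; all function names are hypothetical.

```python
import math


def keep_gene(spectral_counts, max_zero_samples=26):
    """Keep a gene unless it has zero spectral count in more than 26 samples."""
    zeros = sum(1 for count in spectral_counts if count == 0)
    return zeros <= max_zero_samples


def logistic(r, steepness=1.0):
    """Map a correlation to the open interval (0, 1); `steepness` is assumed."""
    return 1.0 / (1.0 + math.exp(-steepness * r))


def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve: the probability that a random positive pair
    scores higher than a random negative pair (ties count as one half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))
```

With equal-sized positive and negative sets, as described above, an AUC of 0.5 corresponds to a score that cannot separate known interactions from random pairs.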
Hierarchical Clustering Explorer (version 3.5, http://www.cs.umd.edu) was used to create tissue-supervised clustering using genes whose protein products were identified. Normalized/averaged spectral counts across cells/tissues were loaded into the software and spectral counts per gene were further normalized by rescaling from 0 to 1 (0 being non-detected; 1 being the highest spectral count). For clustering, average linkage and Pearson’s correlation were used. A color gradient was used, with grey being non-detected, yellow being the lowest and red being the highest expression.
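The per-gene rescaling and the similarity measure can be sketched as below. Because 0 is stated to mean non-detected, the sketch divides by the per-gene maximum rather than applying min-max scaling; this is an interpretation of the description, not the software's documented behavior, and the function names are hypothetical.

```python
import math


def rescale_gene(spectral_counts):
    """Rescale one gene's counts so that 0 stays non-detected and the
    highest count across cells/tissues becomes 1."""
    top = max(spectral_counts)
    if top == 0:
        return [0.0] * len(spectral_counts)
    return [count / top for count in spectral_counts]


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

Note that `pearson` is undefined for a constant profile (for example a gene detected nowhere), so such genes would need to be excluded before clustering.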
GeneSpring 12.6 was used to visualize the GSSPs against the human reference genome hg19. The GSSPs were loaded as a region list into GeneSpring. RefSeq annotation was downloaded from the Agilent server. All GTF files were loaded onto GeneSpring and the tracks were visualized in a genome browser view using a custom coloring scheme. Individual images were exported in JPEG format.
We would like to acknowledge the National Development and Research Institutes for some of the tissues. We acknowledge the assistance of Varot Sandhya and Vinuth Puttamallesh at the Institute of Bioinformatics, Udayan Guha at the National Cancer Institute and Bob Cole at the Johns Hopkins University School of Medicine for help with analysis of some of the samples. We thank Drs. Lydie Lane and Amos Bairoch for their assistance with the list of missing genes. This work was supported by an NIH roadmap grant for Technology Centers of Networks and Pathways (U54GM103520), NCI’s Clinical Proteomic Tumor Analysis Consortium initiative (U24CA160036), a contract (HHSN268201000032C) from the National Heart, Lung and Blood Institute and the Sol Goldman Pancreatic Cancer Research Center. The authors acknowledge the joint participation by the Adrienne Helis Malvin Medical Research Foundation and the Diana Helis Henry Medical Research Foundation through its direct engagement in the continuous active conduct of medical research in conjunction with The Johns Hopkins Hospital and the Johns Hopkins University School of Medicine and the Foundation’s Parkinson’s Disease Programs. The analysis work was partially supported by the National Resource for Network Biology (P41GM103504). A.M., S.K.S., P.S. and T.S.K.P. are supported by DBT Program Support on Neuroproteomics (BT/01/COE/08/05) to IOB and NIMHANS. H.G. is a Wellcome Trust-DBT India Alliance Early Career Fellow. We thank the Council of Scientific and Industrial Research, the University Grants Commission and the Department of Science and Technology, Government of India for research fellowships for S.M.P., R.S.N., A.R., M.K., G.J.S., S.C., P.R., J.S., S.S.M., D.S.K., S.R., S.K.S., K.K.D., Y.S., A.S., S.D.Y., N.S., S.A. and G.D.
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
Author Contributions A.P., H.G., R.C., M.-S.K. designed the study; A.P., H.G., M.-S.K. managed the study; D.G., C.L.K., C.A.I.-D., K.R.M. collected human cells/tissues; M.-S.K., R.C., D.G. developed the pipeline of experiment and analysis; D.G., M.-S.K., S.M.P., K.M., R.C., S.R., J.Z., X.W., P.G.S., M.S.Z., T.C.H. prepared peptide samples for LC-MS/MS; M.-S.K., R.S.N., S.M.P., R.C., D.S.K., S.R., G.J.S. performed LC-MS/MS; M.-S.K., S.M.P., S.P., S.S.M., C.J.M., J.A. and A.K.M. processed MS data and managed data; A.K.M., S.S.M., B.G., A.H.P., Y.S., M.-S.K. performed comparison analysis with PeptideAtlas, neXtProt and GPMDB; R.I., S.J., G.D.B. performed interaction and complex analysis; M.-S.K., S.M.P., S.S.M., P.K., A.K.M., N.A.S., R.S.N., L.B., L.D.N.S., D.S.K., V.N., A.R., T.S., M.K., S.K.S., G.D., A.M., R.R., S.C., K.K.D., A.S., S.D.Y., S.J., P.R., A.H.P., B.G., J.S., N.S., R.G., G.J.S., A.A.K., S.A., D.F., T.S.K.P., H.G., A.P. performed proteogenomic analysis; A.C., H.L., R.S., J.T.S., K.K.M., S.S., A.M., S.K.S., P.S., S.D.L., C.G.D., A.M., M.K.H., R.H.H., C.L.K., C.A.I.-D. assisted with analysis of the data; M.-S.K., S.M.P., T.C.H., P.L.R. performed Western blot experiments; M.-S.K., J.K.T., A.K.M., B.M., S.P., S.M.P. designed the HPM web portal; M.-S.K., A.K.M., J.K.T. generated SRM database; M.-S.K., K.M., G.D., S.M.P., S.S.M. illustrated figures with help of other authors; A.P., M.-S.K., H.G. wrote the manuscript with inputs from other authors.
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the dataset identifier PXD000561.
The authors declare no competing financial interests.