The relatively small numbers of proteins and fewer possible posttranslational modifications in microbes provide a unique opportunity to comprehensively characterize their dynamic proteomes. We have constructed a Peptide Atlas (PA) for 62.7% of the predicted proteome of the extremely halophilic archaeon Halobacterium salinarum NRC-1 by compiling approximately 636,000 tandem mass spectra from 497 mass spectrometry runs in 88 experiments. Analysis of the PA with respect to biophysical properties of constituent peptides, functional properties of parent proteins of detected peptides, and performance of different mass spectrometry approaches has helped highlight plausible strategies for improving proteome coverage and selecting signature peptides for targeted proteomics. Notably, discovery of a significant correlation between absolute abundances of mRNAs and proteins has helped identify low abundance of proteins as the major limitation in peptide detection. Furthermore, we have discovered that iTRAQ labeling for quantitative proteomic analysis introduces a significant bias in peptide detection by mass spectrometry. Therefore, despite identifying at least one proteotypic peptide for almost all proteins in the PA, a context-dependent selection of proteotypic peptides appears to be the most effective approach for targeted proteomics.
A complete genome sequence presents a one-dimensional perspective of the physiological potential of an organism. It is the temporally and spatially coordinated expression of genes into functional protein networks that yields emergent behavior that is unique to each species. Therefore, to fully understand how cells function at a systems level it is imperative to measure, assimilate and simultaneously analyze changes that occur at all levels of genetic information processing1. The transcriptome is dynamic and relatively easy to monitor comprehensively using whole genome microarrays, providing insight into which genes respond and assist in adaptation of the organism to a particular environment2. However, much more information on regulatory processes remains locked within the proteome. There exist important differences between the transcriptome and the proteome that stem from a variety of post-transcriptional processes, such as regulated degradation and posttranslational modifications, thus elevating the importance of comprehensive analysis of dynamic changes in the proteome in response to various environmental perturbations3, 4.
However, comprehensive detection of the proteome is fraught with technical challenges, especially with regard to proteins that are present in low abundance, integral to the membrane, or uniquely expressed in an environment-specific manner. Even within the same protein some peptides are more tractable than others using mass spectrometry-based approaches. While there are several existing hypotheses regarding the underlying reasons that make some peptides more tractable than others (i.e. biophysical properties such as isoelectric point (pI), hydrophobicity, and length)5, certain properties such as protease accessibility, protein structure, and protein modifications complicate any attempt to make accurate predictions of peptide tractability by mass spectrometry (MS) using purely theoretical approaches.
The PeptideAtlas (PA) project was initiated to map the proteome of a given organism, cell type or tissue as experimentally detected by the mass spectrometer6. The PA is technology agnostic and can make use of data from a variety of MS proteomic approaches such as qualitative proteomics surveys, tandem MS of immunoprecipitated complexes, and quantitative proteomics (e.g. ICAT and iTRAQ). Once constructed, the PA can be used as a reference for designing targeted proteomic strategies such as multiple reaction monitoring (MRM) as well as absolute protein quantification7. PeptideAtlas databases have been constructed for the human8, human plasma9, fly10, and yeast11 proteomes.
Here we report a PA for Halobacterium salinarum NRC-1, an obligate halophilic archaeon that evolved unique adaptations, such as increased surface negative charges on folded proteins for survival in its extreme environment of 4.5 M salt12. H. salinarum NRC-1 has a completely sequenced and easily manipulable genome and as such has been used as a model system for constructing a predictive model of cellular responses13 to a diverse array of routine and stressful environmental changes2, 4, 14-17. The PA represents the product of integration and re-processing of data from a wide array of proteomics experiments (surveys of fractionated proteomes, enrichment of complexes by immunoprecipitation, and ICAT- and iTRAQ-based quantitative analysis of proteomic changes) in these environmental response studies. This exercise has verified the expression of 63% of the predicted proteome of H. salinarum NRC-1 including previously undetected and potentially new members of a diverse array of physiological processes. Through extensive analysis of peptides in the PA in context of function, biophysical properties, and abundance, we have identified several factors that might have contributed to our inability to detect 37% of the proteome. Notably by demonstrating a significant correlation between absolute abundance of proteins and transcripts we have identified low abundance of proteins as the main limiting factor in peptide detection by mass spectrometry. We have also conducted a comparative analysis of all the various proteomics approaches that contributed data to the construction of the PA to craft strategies for improving coverage and using proteotypic reference peptides for targeted proteomics.
All details regarding cell culturing, protein preparation, and mass spectrometry conditions are discussed in the corresponding publications on H. salinarum NRC-1 for each of the mass spectrometry proteomics methods included in the PA4, 17-20. These methods include iTRAQ, ICAT, cell fractionation, enrichment by immunoprecipitation, and gel band extracted proteins. However, to aid clarity in the present study, we have delineated pertinent details regarding these procedures in Table 1.
A PA is created by identifying the peptides in MS/MS spectra, calculating the genomic coordinates of the peptides, and storing the datasets and derived information in a database for subsequent data mining10. The H. salinarum NRC-1 PA was constructed from 88 experiments [immunoprecipitation (IP), quantitative proteomic analysis using isotopic reagents (ICAT and iTRAQ), and proteome surveys via fractionation into soluble and membrane fractions] comprising a total of 636,000 MS/MS spectra from multiple spectrometer vendors [Sciex QStar (Applied Biosystems, Foster City, CA), Micromass QTOF (Waters, Milford, MA) and LCQ (ThermoFinnigan, Waltham, MA)] (Table 2). For each experiment, the vendor format MS/MS spectra were converted to mzXML format21 and assigned to peptides using SEQUEST22 and the complete set of H. salinarum NRC-1 protein sequences derived from the original genome annotation12 and the National Center for Biotechnology Information (NCBI) and SwissProt sequence databases. The peptide identifications were scored using PeptideProphet23 and filtered to retain only those with P ≥ 0.9, which corresponds to a spectrum identification false discovery rate of 1.1%. After all experiments were processed, the peptides were aligned to the reference proteome. The chromosomal coordinates of peptides from this analysis were verified against NCBI's Generic Feature Format (GFF) files and manually curated data maintained at ISB (http://baliga.systemsbiology.net/halobacterium) in the Systems Biology Experimental Analysis Management System (SBEAMS), a Relational Database Management System (RDBMS) (http://www.sbeams.org), which was also used to archive all of the PA results. We generated a complete library of tryptic peptides by performing an in silico digest of the entire H. salinarum NRC-1 predicted proteome, allowing for one missed cleavage. A measure of peptide observability, the Empirical Observability Score (EOS) (E. Deutsch, personal communication), was calculated for each peptide using the following equation: EOS = Nsamples(peptide) / Nsamples(protein), where Nsamples is the number of samples in which the peptide or protein was observed. For example, if a protein was seen in 10 different samples, and one of its constituent peptides was seen in 5 of those samples, the EOS of that peptide would be 0.5.
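The in silico digest and the EOS calculation described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the cleavage rule used here (cut after every K or R, with no exception for a following proline) is an assumption, since the exact digestion rules are not stated in the text.

```python
def tryptic_digest(sequence, missed_cleavages=1):
    """Return the set of tryptic peptides with up to `missed_cleavages` missed sites.

    Assumption: trypsin cleaves after every K or R (no proline exception).
    """
    # Cut the sequence into fully cleaved fragments.
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])

    # Join up to `missed_cleavages` + 1 neighbouring fragments.
    peptides = set()
    n = len(fragments)
    for i in range(n):
        for j in range(i, min(i + missed_cleavages, n - 1) + 1):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides


def empirical_observability_score(peptide_samples, protein_samples):
    """EOS = Nsamples(peptide) / Nsamples(protein).

    Only samples in which the parent protein was seen are counted for the peptide.
    """
    protein_set = set(protein_samples)
    if not protein_set:
        raise ValueError("the protein was not observed in any sample")
    return len(set(peptide_samples) & protein_set) / len(protein_set)
```

With one missed cleavage allowed, each fully cleaved fragment is reported both on its own and joined to its immediate C-terminal neighbour, which is why the library is roughly twice the size of a strict digest.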
To calculate transcript abundance for each of the 1,646 genes whose cognate proteins were detected in the PA (Table 2), we computed the arithmetic mean intensity for that gene across 215 microarray conditions (Supplementary Table ST-1). These intensities were then log10-transformed. Cultures prepared for these microarray experiments were treated identically to those used for the proteomics experiments included in the PA (Table 1, Supplementary Table ST-1; conditions included gamma radiation stress4, UV radiation24, oxygen transitions20, and genetic knockouts17, 18, 24). To calculate sequence coverage per protein (Fig. 3A), the numbers of amino acids in the peptides corresponding to a given protein were summed, then divided by the total number of amino acids in that protein. If peptides were detected with partially overlapping amino acid sequences, each residue in the overlapping region was counted only once. To calculate the spectral counts (Fig. 3B), we computed the arithmetic mean of the number of spectra counted per protein which corresponded to peptides with a confidence value of P ≥ 0.9. To calculate the concordance between transcript abundance and cumulative proteome coverage, the average mRNA signal intensities for each gene were organized into 100 bins with 100 intensities per bin (i.e. bin 1 = intensity 0−99; bin 2 = 100−199; etc.) (Fig. 3C). The total range of intensities for this analysis was 0 to 50,000. Cumulative proteome coverage was calculated by adding the total number of proteins detected per transcript intensity bin as each successively higher bin was added to the analysis. The p-value of the correlation between mRNA and protein abundance was computed by counting the number of times that a set of randomly-permuted mRNA and protein levels had a correlation coefficient that was greater than or equal to the reported (unpermuted) correlation.
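The sequence coverage calculation, in which overlapping residues are counted only once, amounts to merging peptide intervals before summing their lengths. The following Python sketch shows one way to do this; the function name and the 0-based, end-exclusive coordinate convention are assumptions made for illustration.

```python
def sequence_coverage(protein_length, peptide_spans):
    """Fraction of a protein's residues covered by at least one detected peptide.

    `peptide_spans` is a list of (start, end) positions on the protein,
    0-based with `end` exclusive (an assumed convention).
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(peptide_spans):
        if current_end is None or start > current_end:
            # No overlap with the running interval: close it and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent peptide: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / protein_length
```

For a 100-residue protein with peptides at (0, 10), (5, 20) and (50, 60), the first two peptides merge into a single 20-residue stretch, so 30 residues are covered in total.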
A total of 636,000 tandem mass (MS/MS) spectra from 88 proteomic experiments in 497 individual runs representing at least three types of approaches and three types of mass spectrometers (Materials and Methods) were converted to a common file format (mzXML) (Table 2). Using SEQUEST and PeptideProphet, 76,212 MS/MS spectra or ~12% of all MS/MS spectra had significant matches (P ≥ 0.9) to peptides from 1,646 predicted proteins in H. salinarum NRC-1 (Table 2), resulting in a false discovery rate of 1.1%. This represents 1,461 non-redundant proteins or 63% of the predicted proteome, thus improving coverage by 1.7-fold over a previous report that made use of a two-dimensional separation approach for protein cataloguing25. To facilitate further analysis, the PA module has been integrated with the H. salinarum NRC-1 protein annotation module in SBEAMS, a relational database system for managing systems biology data (http://baliga.systemsbiology.net/halobacterium/).
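The text does not state how the 1.1% false discovery rate was derived from the P ≥ 0.9 cutoff. One common model-based estimate, shown here purely as an assumption, treats each PeptideProphet probability as the chance that an identification is correct, so the expected fraction of false identifications among accepted spectra is the mean of (1 − P). The function name is hypothetical.

```python
def estimated_fdr(probabilities, threshold=0.9):
    """Model-based FDR estimate for identifications with probability >= threshold.

    Assumption: each probability is the chance the identification is correct,
    so the expected number of false hits is the sum of (1 - p) over accepted hits.
    """
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)
```

Because most accepted identifications typically score well above the cutoff, the estimated error rate for the accepted set can be far lower than the 10% that the threshold alone might suggest.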
Although gene finding algorithms such as GLIMMER can identify protein-coding genes with relatively low error rates26, until verified experimentally these genes are considered putative. This is an especially important concept considering that over a third of all genes predicted from almost all completely sequenced genomes do not match experimentally characterized orthologs. The identification of a peptide verifies the expression of the parent protein predicted from the genome sequence. As such, we have verified the expression of 1,461 non-redundant proteins predicted in the H. salinarum genome. Of these, 1,029 proteins (9,330 peptides) had significant matches to PFAM signatures (e-value < 0.001)27; 1,157 proteins (10,490 peptides) had significant matches to Clusters of Orthologous Groups (COGs) (e-value < 0.001)28; 902 proteins (9,012 peptides) matched manually-curated functional annotations12, 29; and 838 proteins (12,410 peptides) mapped to distinct enzymatic steps within 77 metabolic pathways in Kyoto Encyclopedia of Genes and Genomes (KEGG)30. In summary, we have experimentally verified the expression of at least 989 proteins (37.6% of the H. salinarum NRC-1 proteome) with some putative functional annotation, which represents a 2.3-fold improvement over the 16.2% verification in a previous proteomic survey25. More importantly, we have verified the expression of at least 300 proteins with no significant matches to experimentally characterized proteins. Below we provide some highlights from this analysis.
Consistent with previous H. salinarum NRC-1 high-throughput proteomics studies, we observed a high degree of coverage for proteins involved in essential cellular functions (Table 3, Supplemental Table ST-2; Supplemental Figure SF-1). For example, with regard to genetic information processing, unique peptides were detected from five of the six predicted DNA polymerase proteins or subunits (PolA1 was not detected). We have detected unique peptides from 10 of the 12 predicted putative RNA polymerase (RNAP) subunits. In addition, we detected a putative 7 kDa RNAP subunit, Rpc10 (COG1996), which is not co-transcribed with any of the other known RNA polymerase subunits and was not detected in any previous H. salinarum NRC-1 proteomic surveys. With regard to protein synthesis, secretion and degradation, unique peptides from all 55 ribosomal proteins, all 20 aminoacyl-tRNA synthetases, elongation factors EF-1α and β, and EF-2, 11 translation initiation factors, 6 putative sec-dependent secretion proteins, 5 putative twin-arginine translocation proteins, and 5 proteases were detected. Also as expected, proteins involved with cellular motility and relocation including 9 chemotaxis proteins, 8 flagellar proteins, and 11 gas vesicle biogenesis proteins were detected. At least 53 out of the 75 predicted membrane ABC transport system subunits have been detected.
We have detected 20 of the 26 components of the four unique modes of energy production in H. salinarum NRC-1, including oxidative phosphorylation, arginine fermentation, phototrophy and dimethyl sulfoxide (DMSO) respiration (Table 3, Supplemental Table ST-2). Although all proteins from the arginine deiminase pathway, including ArcRABC, were previously detected25, here we have newly detected previously elusive components of the DMSO and phototrophic respiratory pathways DmsC and the regulator Bat (Table 3). Specifically, we were able to detect 183 unique peptides from 13 proteins involved in energy production via bacteriorhodopsin-mediated phototrophy by enriching for the purple membrane (Table 1: fractionation, ICAT, and bacteriorhodopsin IP gel band extraction experiments)31. Interestingly, in cells which overexpress the purple membrane (Table 1: ICAT experiments), we also detected VNG1459H, a protein of unknown function. VNG1459H co-localizes in the genome and is significantly co-expressed under relevant environmental conditions with other known phototrophy genes13, 32, 33. While the exact function remains to be tested, these data support the prediction that this protein may be involved in phototrophy, an extension of a process that was considered well understood.
Fractionation and subsequent solubilization with detergents also improved the detection of other membrane-associated proteins, allowing detection of 188 out of 550 proteins with predicted transmembrane domains19, 34 (Fig. 2D). We have also verified the expression of a large number of transcription factors despite their supposed low abundance in the cell: 68% of predicted transcription factors (88 out of 130) are included in the PA, which represents a significant improvement over previous proteomic surveys of H. salinarum NRC-1 and other organisms, which detected at most 44% of all predicted TFs35. For example, we detected several general transcription factors (e.g. TFBa, TFBd, TFBe, TBPc, and TBPd) only upon enrichment by immunoprecipitation (Fig. 1, Table 1).
Despite the unprecedented proteome coverage of the H. salinarum NRC-1 PA, it is significant that nearly 37% of all predicted proteins were not detected. In fact we observe from a cumulative plot of numbers of distinct peptides detected as a function of individual experiment type that we have reached an apparent threshold that was previously predicted6, 8, 11 but not observed until now (Fig. 1). To explore the possibility of improving coverage, we examined the influences of several parameters on proteome detection. Using these metrics as a guide, we discuss possible solutions below for improving detections in high-throughput analysis of the proteome.
As expected, we observed better detection of peptides with an increase in molecular weight up to 1,500 Da (Fig. 2A). This is explained by a combination of peptide sequence uniqueness with an increase in length and the detection limits of the mass spectrometer. Protein size is also an important factor at play with regard to proteome coverage considering that lower molecular weight proteins tend to be underrepresented in total protein surveys36. However, it is noteworthy that we have detected at least one peptide from each of 406 (~42%) out of the total 963 predicted proteins with calculated molecular weights less than 20 kDa, which is a slight improvement over the 380 proteins detected in a recent study specifically designed to enrich these proteins36.
The isoelectric point of a peptide influences its enrichment depending on the type of fractionation columns used for enriching peptides (or proteins) during sample preparation. Most proteins in H. salinarum NRC-1 have a relatively higher number of acidic residues and the resulting surface negative charge is believed to help circumvent protein aggregation and precipitation in a hypersaline cytoplasm37. Consequently most peptides in H. salinarum NRC-1 PA are also acidic with a median isoelectric point of 4.4 (Fig. 2B). It is interesting that despite the predominant use of cation exchange chromatography for sample processing in most of the experiments within the H. salinarum NRC-1 PA, a significant fraction of basic peptides were not detected.
Peptides of very low hydrophobicity were poorly detected. This is expected because of the property of the LC column used in most of our experiments34. Low hydrophobicity peptides are washed off from these columns before the mass spectrometer has a chance to analyze them. Also, as expected, peptides with hydrophobicity greater than ~30 were detected at a relatively lower frequency, perhaps due to their low solubility (Fig. 2C).
Despite enrichment of membrane proteins in some experiments, this fraction of the proteome is poorly represented in the PA (Fig. 1). This was evident in the observation that over 90% of all detected peptides originated from proteins predicted to be soluble (Fig. 2D). This bias in detection has been discussed previously4.
Abundance of proteins in a population can significantly influence the time a mass spectrometer spends analyzing each unique protein species38. Since there is no independent approach to measure absolute abundance of proteins on a systems scale we evaluated the use of mRNA signal intensity from microarray-based transcription profiling experiments as a proxy for the same. First we investigated whether mRNA and protein abundance were indeed proportional. The dynamic quantitative relationships between transcription and translation can be assessed at the level of absolute abundance or relative changes in the outputs of mRNA and protein. Although comparisons of relative changes across protein and mRNA concentrations have yielded variable relationships in some studies3, 39-43, we have previously demonstrated that given sufficient numbers of temporal measurements for both RNA and protein level changes over time scales of minutes, for most genes there exists significant time-lagged correlation between relative changes in transcript and protein abundance2, 16.
However, a significant correlation value between absolute mRNA and protein across the entire genome has not yet been reported40, 44. To assess this relationship we compared mRNA signal intensities from 215 microarray experiments4, 20, 24 (Supplementary Table ST-1) to average spectral counts (over all peptides) per protein from 497 mass spectrometric runs. This comparison yielded significant correlation across the two datasets (Spearman correlation ~0.5; P < 10^-6) indicating that the abundance of most proteins is proportional to the abundance of their corresponding transcripts (Fig. 3C). In addition, we found that this relationship is not biased by protein length (Supplemental Fig. SF-2). Further, from our analysis we found that for lower abundance transcripts there is a dramatic increase in proteome coverage with small increases in mRNA signal intensity (Fig. 3C). This may be attributable to the observation that a significant fraction of the transcriptome (>60%) in H. salinarum NRC-1 appears to be present in low abundance (300−1500 intensity units) (Fig. 3C). Regardless, we find that more peptides tend to be detected from proteins whose transcripts are present in higher abundance, as reflected in better sequence coverage and spectral counts per individual protein with an increase in mRNA abundance (Fig. 3A, B). We conclude from this analysis that although targeted enrichment will help detect peptides with certain biophysical properties, approaches to enrich low abundance proteins and higher sensitivity mass spectrometers will yield higher proteome coverage.
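A rank correlation with a permutation-based p-value of the kind described in Materials and Methods can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the number of permutations and the random seed are assumptions, and in practice a library routine such as SciPy's spearmanr would normally be used for the correlation itself.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(mrna, protein, n_permutations=1000, seed=0):
    """Return (observed rho, fraction of permutations with rho >= observed)."""
    rng = random.Random(seed)
    observed = spearman(mrna, protein)
    shuffled = list(protein)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the pairing between mRNA and protein
        if spearman(mrna, shuffled) >= observed:
            hits += 1
    return observed, hits / n_permutations
```

Note that a p-value estimated this way can be no smaller than one over the number of permutations unless it is zero, so a bound such as 10^-6 implies at least a million permutations without a single hit.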
As the PA becomes increasingly comprehensive we can use it for designing strategies for high throughput approaches to rapidly characterize the proteome, both qualitatively and quantitatively. A tangible approach to accomplish this is via the use of “proteotypic peptides”5, 45, which are peptides that map uniquely to one protein and are likely to be observed with LC-MS/MS if the protein is present. We selected proteotypic peptides as those that (i) receive a PeptideProphet score of P > 0.9, (ii) were detected in more than one experiment, and (iii) have an Empirical Observability Score (EOS) > 0.3 (Materials and Methods). These peptides can be used as beacons for tracking specific proteins in high-throughput experiments for the targeted analysis of the proteome, greatly reducing mass-spectrometer time and improving proteome coverage at the same time. Proteotypic peptides can also aid in the QCAT approach7, in which known quantities of labeled synthetic peptides are spiked in and used as reference for absolute quantification of proteins.
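The three selection criteria listed above translate directly into a simple filter. In the Python sketch below, the record layout, field names and function name are hypothetical and the example values in the usage note are made up for illustration; only the three thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class PeptideRecord:
    sequence: str
    probability: float   # PeptideProphet probability (P)
    n_experiments: int   # number of experiments in which the peptide was seen
    eos: float           # Empirical Observability Score


def is_proteotypic(record, min_probability=0.9, min_eos=0.3):
    """Apply criteria (i) P > 0.9, (ii) seen in more than one experiment, (iii) EOS > 0.3."""
    return (
        record.probability > min_probability
        and record.n_experiments > 1
        and record.eos > min_eos
    )


def select_proteotypic(records):
    """Return the sequences of all records passing the three criteria."""
    return [r.sequence for r in records if is_proteotypic(r)]
```

For example, a made-up record with P = 0.99, 4 experiments and EOS = 0.6 would pass, whereas one seen in a single experiment would be rejected regardless of its probability or EOS.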
Using the criteria listed above, we have identified proteotypic peptides for 1,505 proteins or 57.3% of the proteome (Table 4). In other words, we can now, in principle, tune the mass spectrometer to specifically search for 1,505 mass spectra instead of a possible ~76,212 (Table 4), which represents a 50-fold reduction in mass spectrometer time to get information on the same number of proteins. However, the selection of proteotypic peptides for practical applications is more complicated since the PA represents a diverse array of experimental techniques (ICAT, iTRAQ, immunoprecipitation, etc.) designed to address different scientific problems. Each of these approaches could have an inherent bias in peptide detection as suggested by the observation that a majority of proteins were observed in a small number of runs; for example, 633 (43%) of the 1,461 observed non-redundant proteins were observed in 5 or fewer of the 497 runs (Fig. 4). We conducted a comparative analysis to determine possible biases and efficiencies of peptide detection by each individual approach. We caution that such a comparison can be confounded by the significant biases in protein and peptide populations that were intentionally enriched in several of these approaches. For example, there was very little overlap between cysteine-specific ICAT labeling approach and any other proteomics method (Fig. 5). While the genetic and environmental perturbations could partly explain the reason for the bias, this is more likely an outcome of the poor per-protein cysteine content in H. salinarum NRC-1 (<65%)18.
Despite the potential for these inherent biases, we were able to make a fair and statistically significant comparison of performance across two of the most information rich datasets that shared significant numbers of detected proteins: iTRAQ vs. all other shotgun proteomics methods (Fig. 5, Table 5). Specifically, we considered proteotypic peptides for 25 proteins that were observed in the largest number of LC-MS runs in iTRAQ experiments vs. all other approaches (data for two proteins are shown in Table 5; data for the remaining proteins are given in Supplemental Table ST-3). Notably, of the 180 proteotypic peptides for these 25 proteins, only 40 were detected reliably by both approaches. A likely explanation is that iTRAQ introduces a significant bias in the types of peptides that are detected. Therefore, the choice of an appropriate proteotypic peptide will clearly vary depending on the application. Until we have a clearer understanding of the reasons for these observed biases, an empirical approach remains the best option for proteotypic peptide selection. For example, searching for the glutamate dehydrogenase protein GdhB in the H. salinarum NRC-1 PA yields 26 distinct peptides, which were detected a total of 208 times. Of these, the peptide CAVMDLPFGGAK (PA accession: PAp00363211) has an Empirical Observability Score (EOS) of 0.59, indicating it was detected reliably. Further investigation into this peptide reveals that it was detected in both iTRAQ and ICAT experiments a total of 57 times and is therefore a reliable candidate for future use as a proteotypic peptide for both of these experimental approaches. However, this peptide was not detected in any of the experiments for targeted enrichment of transcription factors. A second peptide, VVQVSVPVER (PAp00368363), with a lower EOS of 0.41, on the other hand, was detected only 12 times but observed in at least three of the targeted enrichment experiments. Clearly, this second peptide would make for a better proteotypic peptide for these targeted enrichment applications. Using this approach, one can compile a custom list of proteotypic peptides for a specific application of interest. Further explorations of this type are possible at the H. salinarum NRC-1 proteome annotation webpage (http://baliga.systemsbiology.net/halobacterium).
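The context-dependent selection described above can be expressed as a lookup over per-method observation counts. The Python sketch below is a hypothetical illustration: the function name, the data layout, the placeholder peptide names and the counts in the usage note are all assumptions and are not taken from the PA.

```python
def best_peptide_for_context(observations, context):
    """Pick the peptide observed most often with a given experimental approach.

    `observations` maps peptide -> {approach: number of detections}.
    Returns None if no peptide was ever detected with that approach.
    """
    counts = {
        peptide: by_approach.get(context, 0)
        for peptide, by_approach in observations.items()
        if by_approach.get(context, 0) > 0
    }
    if not counts:
        return None
    # Break ties on the peptide string so the choice is deterministic.
    return max(counts, key=lambda peptide: (counts[peptide], peptide))
```

With made-up counts such as {"PEPTIDE_A": {"iTRAQ": 40, "ICAT": 17}, "PEPTIDE_B": {"iTRAQ": 2, "IP": 3}}, the function would return PEPTIDE_A for iTRAQ experiments but PEPTIDE_B for immunoprecipitation experiments, mirroring the reasoning applied to the two GdhB peptides.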
We thank Christopher Bare and David Campbell for help with programming, database construction and false discovery rate calculation, and Ning Zhang for help with generating proteotypic peptide scores. This work was supported by grants from NIH (P50GM076547 and 1R01GM077398-01A2), DoE (MAGGIE: DE-FG02-07ER64327), NSF (EF-0313754, EIA-0220153, MCB-0425825, DBI-0640950) and NASA (NNG05GN58G), and by Institute for Systems Biology institutional support to NSB. Postdoctoral fellowships from NSF to MTF (DBI 0400598) and KW (0443746), and from NIH (5F32GM078980-02) to AKS, are also acknowledged.
SUPPORTING INFORMATION AVAILABLE
A comparative analysis revealed possible similarities between spectral counts and genome organization, a finding discussed in the Supporting Information. Supporting figures (Figure SF-1, Figure SF-2) and Supplementary Tables (ST-1 and ST-2) are freely available online at http://pubs.acs.org.