Mass spectrometry has become the method of choice for proteome characterization, including multi-component protein complexes (typically tens to hundreds of proteins) and total protein expression (up to tens of thousands of proteins), in biological samples. Qualitative sequence assignment based on MS/MS spectra is relatively well-defined, while statistical metrics for relative quantification have not completely stabilized. Nonetheless, proteomics studies have progressed to the point whereby various gene-, pathway-, or network-oriented computational frameworks may be used to place mass spectrometry data into biological context. Despite this progress, the dynamic range of protein expression remains a significant hurdle, and impedes comprehensive proteome analysis. Methods designed to enrich specific protein classes have emerged as an effective means to characterize enzymes or other catalytically active proteins that are otherwise difficult to detect in typical discovery mode proteomics experiments. Collectively, these approaches will facilitate identification of biomarkers and pathways relevant to diagnosis and treatment of human disease.
Mass spectrometry has transitioned from an esoteric, low-throughput analytical method typically utilized only by highly specialized labs to the technique of choice for systematic identification and quantification of proteins studied in the context of numerous biological systems. Progress in large-scale protein sequencing has been driven largely by technological developments in mass spectrometry, protein/peptide separations, biochemical enrichment methods and data analysis algorithms. One long-term goal of efforts directed at comprehensive proteome sequencing is the development of models based on primary proteomics measurements that support prediction of cellular activity, clinical outcomes, or other biological responses “personalized” to individuals or clinical cohorts.
We will focus here on bottom-up analysis of proteins, which deals with proteolytic fragments of proteins (or ‘peptides’) of approximately 3,000 Da or less, and is the most widely practiced technique in proteomics today. Several mass spectrometers are well-suited to bottom-up analysis of proteins. The quadrupole ion trap, which is capable of mass-selected ion storage, can function as a stand-alone mass spectrometer for MS and MS/MS measurements [6, 7] or be coupled to a high resolution mass analyzer in a hybrid geometry [8–10]. Time-of-flight mass analyzers have been coupled to matrix assisted laser desorption [11–16] or electrospray [17–20] ionization sources for MS and MS/MS analyses of proteins and peptides.
The fundamental observable in a mass spectrometer is analyte mass-to-charge ratio (denoted m/z). In protein digests of whole cell lysates, hundreds of thousands of distinct peptides may be generated, many of which will have relative m/z differences of 0.1 Da or less, owing to the fact that highly variable peptide sequences can be constructed using the 20 common amino acids which are either isomeric (i.e., same molecular formulae) or nominally isobaric (i.e., same integral mass, but different molecular formulae) to other peptides. Drawing from combinatorial theory, He et al.  showed that of all possible di- and tri- peptides, assuming only the 20 naturally occurring amino acids, only 52% and 19%, respectively, were compositionally distinct; moreover, 29% and 53% of each sub-class were isomers. The consequence of this pseudo-degeneracy is that mass alone may not be sufficient for peptide identification, but can be used to limit the number of candidate peptides during sequence assignment of MS/MS spectra, illustrating the value of high resolution mass spectrometry platforms.
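The combinatorial argument can be checked with a short enumeration. The sketch below assumes that "compositionally distinct" means the number of distinct amino acid compositions (unordered multisets) relative to the number of ordered sequences; that reading is an assumption, though it gives roughly 52% and 19% for di- and tripeptides. The function name `composition_fraction` is illustrative, and the isomer fractions quoted above are not addressed here.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 common amino acids


def composition_fraction(length):
    """Return (n_compositions, n_sequences, fraction) for peptides of a given length.

    A composition is an unordered multiset of residues; every ordered sequence
    built from the same residues shares one composition (and hence one mass).
    """
    n_sequences = len(AMINO_ACIDS) ** length
    n_compositions = sum(
        1 for _ in combinations_with_replacement(AMINO_ACIDS, length)
    )
    return n_compositions, n_sequences, n_compositions / n_sequences


for k in (2, 3):
    comps, seqs, frac = composition_fraction(k)
    print(f"length {k}: {comps} compositions / {seqs} sequences ({frac:.3f})")
```

Counting compositions rather than elemental formulae is the simpler of the two degeneracies; collapsing residues with identical formulae (such as Leu and Ile) would reduce the number of distinct masses further.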
A number of concepts related to resolution are useful to introduce here. Monoisotopic mass is calculated using the most abundant isotope for each atom in an ion. Average mass is calculated using the abundance-weighted average mass of each atom. Both definitions are based on the unified atomic mass unit, or ‘u’, which defines 1 u as 1/12 the mass of carbon-12, and is equivalent to 1 Da. Resolution (RZ) is defined by IUPAC as:

$$R_Z = \frac{m}{\Delta m_Z}$$
where m is the m/z ratio at a peak’s maximum and ΔmZ is the peak width at Z percent of the maximum intensity. Z is commonly assumed to be 50%, and this value will be used herein. In most time-of-flight instruments, R can approach 5,000 without a reflectron [24, 25] and 15,000 with a reflectron [24, 26], although the latest generation time-of-flight mass spectrometers can achieve R = 50,000  for small molecules and R = 40,000 for peptides . Radio frequency ion trap mass spectrometers have about 1.0 Da resolution for ions up to 2000 m/z in a typical implementation  (or R ≤ 2,000), though R = 20,000 is attainable with slower scan rates . Fourier transform instruments can achieve significantly higher values, such as the Orbitrap (R = 150,000 ) and FTICR (R = 3,300,000 ) mass spectrometers.
Mass accuracy is important for mass spectrometry-based proteomics because accurate measurement of a peptide’s mass greatly reduces the putative number of sequence matches in the associated database of proteins. Zubarev et al. showed that a mass accuracy of 1 ppm can eliminate 99% of nominally isobaric peptides. Mass accuracy is commonly reported in two ways: parts per million or Da. Parts per million (ppm) specifies error of the measured mass relative to a peptide’s theoretical mass. For example, consider a peptide with a theoretical m/z of 1570.58 Da and a measured m/z of 1570.56 Da. The mass accuracy is calculated as:

$$\text{mass accuracy} = \frac{\left| m_{\text{measured}} - m_{\text{theoretical}} \right|}{m_{\text{theoretical}}} \times 10^{6} = \frac{\left| 1570.56 - 1570.58 \right|}{1570.58} \times 10^{6} \approx 12.7\ \text{ppm}$$
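The ppm calculation described above, and its use as a precursor mass filter, can be sketched in a few lines. The function names `ppm_error` and `mass_window` are illustrative choices, and the signed error convention (measured minus theoretical) is an assumption.

```python
def ppm_error(measured, theoretical):
    """Signed mass error of a measurement in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def mass_window(theoretical, tol_ppm):
    """Lower and upper mass bounds for a +/- tol_ppm tolerance."""
    delta = theoretical * tol_ppm / 1e6
    return theoretical - delta, theoretical + delta


# Worked example from the text: theoretical 1570.58, measured 1570.56
print(round(ppm_error(1570.56, 1570.58), 1))

# Candidate peptides would be kept only if their mass falls inside this window
low, high = mass_window(1570.58, tol_ppm=10)
print(round(low, 3), round(high, 3))
```

Expressing the tolerance in ppm rather than Da keeps the filter proportional to mass, which is how the search space is narrowed on high resolution instruments.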
Alternatively, the mass accuracy can be specified relative to the theoretical mass in absolute units, commonly Daltons. For an ion trap mass spectrometer, the accuracy is typically 0.6 Da for MS precursor peaks or MS/MS fragment peaks. For a time-of-flight instrument, equipped with a reflectron ion mirror, the mass accuracy is 0.1–0.25 Da for MS or MS/MS. Fourier transform-based instruments typically have mass accuracies at or below 10 ppm for both MS and MS/MS.
Peptide sequence information can be derived from a variety of MS/MS data, such as that from low energy collisionally activated dissociation (CAD) in an ion trap  or collision cell [18, 33, 34] configuration, electron transfer dissociation (ETD) [35, 36] and electron capture dissociation (ECD) [37, 38]. In CAD, peptide ions are fragmented by low-energy (≤200 eV) collisions with a neutral seed gas to form b- and y-type ions (Fig. 2a). In ETD and ECD, peptide ions undergo non-ergodic fragmentation upon incorporation of a thermalized electron, either directly (ECD) or via transfer from an electron-donor anion (ETD), to form c/z ions [38, 39] (Fig. 2a). CAD will be the focus of the following sections.
Typical rates of data generation of state-of-the-art mass spectrometers require computational database search tools such as Mascot , SEQUEST , Phenyx [42, 43], and X!Tandem  that provide for high-throughput, automated spectral sequence assignment. The details of the underlying search algorithms are described in several recent reviews [45–47]. In most cases a theoretical ‘peptidome’ resulting from enzymatic digestion (e.g., using trypsin) is readily generated in silico based on the known genome sequence of the organism from which the sample was derived. Computational algorithms match peaks in each MS/MS spectrum with fragment ions derived from the mass-filtered set of theoretical peptides in accordance with the accuracy of the mass spectrometer. Sequences are scored based on the numbers of matched peaks, often with a penalty for non-matching signals. In this way, thousands of spectra can be rapidly sequenced and mapped back to their source proteins.
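The matching step can be illustrated with a deliberately simplified score: generate singly charged b and y ions for a candidate peptide and count the observed peaks that fall within the instrument tolerance. This is a sketch under stated assumptions, not the scoring function of Mascot, SEQUEST, Phenyx or X!Tandem. The names `fragment_ions` and `count_matches`, the 0.6 Da tolerance, and the peak list and candidate sequences are illustrative.

```python
# Monoisotopic residue masses (Da) for the 20 common amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # mass of a proton (Da)
WATER = 18.010565  # monoisotopic mass of H2O (Da)


def fragment_ions(peptide):
    """Singly charged b and y ion m/z values for every backbone cleavage."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b_i (N-terminal fragment)
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y_(n-i) (C-terminal fragment)
    return ions


def count_matches(peaks, peptide, tol_da=0.6):
    """Number of observed peaks within tol_da of any theoretical fragment."""
    theoretical = fragment_ions(peptide)
    return sum(any(abs(p - t) <= tol_da for t in theoretical) for p in peaks)


peaks = [246.3, 1325.2, 1396.4]  # illustrative observed fragment peaks
for candidate in ("EGVNDNEEGFFSAR", "EGVNDNEEGFFSAK"):
    print(candidate, count_matches(peaks, candidate))
```

Real search engines add intensity weighting, penalties for unmatched signals and statistical calibration on top of this kind of peak counting.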
Although database-driven sequencing has become an industry standard in proteomics [40, 48–50], and has identified tens of thousands of peptides for a broad range of biomedical applications , large fractions of MS/MS spectra remain unidentified . This is partly driven by experimental factors , but computational limitations also play a role. Non-quantitative amino acid modifications (i.e., chemical or post-translational modifications which occur on specific side-chains, but often with low stoichiometry), for example, require searching both modified and unmodified forms of all amino acids targeted by the specific modification in a protein database. This effectively increases the search space and the chance of random matches; hence in practice one must compromise between the number of simultaneous post-translational modifications that adequately capture the nature of the sample being analyzed and an acceptable rate of false positive identifications .
In light of such considerations, we provide a brief tutorial on manual sequencing of MS/MS spectra. This approach is labor intensive and time-consuming, but is not restricted by modifications, genome annotation, or enzymatic cleavage specificity, and therefore is very useful for targeted validation and confirmation of subsets of peptide hits.
Sample MS and MS/MS spectra are shown in Fig. 1A and 1B, respectively. These data were generated on an LTQ-Orbitrap hybrid mass spectrometer with online ESI, where [M+2H]2+ and [M+3H]3+ ions are expected. The 785.84 Da, +2 charge state precursor will be used for manual sequencing below. The charge state is determined from the 0.5 Da peak separation in the isotope series of the precursor in the MS spectrum (Fig. 1A inset). The peak separation corresponds to a single neutron addition to a peptide atom (e.g., 13C, 15N, 33S). The ratio of the neutron rest mass (approx. 1 Da) to the peak separation is equal to the charge state of the precursor:

$$z = \frac{m_{\text{neutron}}}{\Delta (m/z)} \approx \frac{1\ \text{Da}}{0.5\ \text{Da}} = 2$$
When such a precursor is subjected to CAD, singly charged fragment ions are typically formed. As a result, calculation of the singly-charged precursor m/z is a useful first step in manual interpretation. In this example, we remove a proton (H+): [M+2H]2+ = 785.84 Da, [M+H]+ = [2 × 785.84] – 1.007825 = 1570.67 Da.
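The two precursor calculations above (charge state from isotope spacing, then conversion to the singly charged m/z) can be written as a small sketch. The function names are illustrative. The constant 1.007825 Da follows the value used in the text, which is the mass of a hydrogen atom; a bare proton is about 0.0005 Da lighter, a difference that does not matter at the precision used here.

```python
H_MASS = 1.007825  # Da, the value used in the text for the proton correction


def charge_from_isotope_spacing(spacing_da):
    """Charge state estimated from the m/z spacing of adjacent isotope peaks."""
    return round(1.0 / spacing_da)


def singly_charged_mz(mz, z):
    """Convert the m/z of an [M+zH]z+ ion to the corresponding [M+H]+ value."""
    return z * mz - (z - 1) * H_MASS


z = charge_from_isotope_spacing(0.5)
print(z, round(singly_charged_mz(785.84, z), 2))
```

The same conversion applies to triply charged precursors, where the isotope spacing is about 0.33 Da and two protons are removed.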
The MS/MS spectrum of the 785.84 Da precursor shown in Fig. 1B was generated in the LTQ ion trap mass analyzer via low-energy CAD. A peptide’s structure is heteropolymeric, consisting of n amino acid residues of the following form: –NH–CH(R)–CO–, where R is the side chain that distinguishes each amino acid.
Fragmentation in CAD typically cleaves the amide bond between adjacent amino acids in a peptide (Fig. 2A). A unified nomenclature for the resulting fragment ions was originally proposed by Roepstorff and Fohlman , and later modified by Johnson and colleagues . In this scheme, the b series builds up ordinally from b1 at the N-terminus and the y series builds up ordinally from y1 at the C-terminus, with addition of single amino acids creating a “ladder” for each ion series. The cleavage reaction is initiated by protonation of an amide nitrogen on the peptide backbone to form fragments containing the original N-terminus (b- and potentially a-type ions) and C-terminus (y-type ions) [56, 57] (Fig. 2B). The gas phase basicity of amide nitrogens is roughly equivalent, meaning that fragment ions corresponding to multiple amide bond cleavages in a peptide will be observed in the associated MS/MS spectrum. Alternative mechanisms have been proposed for generation of b- and y-type ions [58, 59], that yield different fragment ion structures. These products are isobaric to the ones shown in Fig. 2, and hence are indistinguishable by mass measurement.
The vast majority of peptides produced by digestion with trypsin, the most commonly used enzyme in proteomics, will have Lys or Arg at their carboxy-terminus. As drawn above, a single amino acid has one open valence at the N-terminus and one at the C-terminus. To calculate a y-type ion mass, the ion must first be neutralized by addition of an H atom at the N-terminus and an OH group at the C-terminus, and subsequently protonated to receive a formal charge. Addition of an H atom, a proton (H+) and one OH group yields a net change in fragment ion mass of: 3 × 1.007825 + 15.9949146 = 19.01839 Da. The resultant y1 masses for Lys and Arg are 147 Da and 175 Da, respectively. The presence of either ion is a good indicator of the identity of the C-terminal amino acid. Neither of these ions is found in Fig. 1B. The alternative method of identifying the C-terminal amino acid is to search for the complementary b ion to y1 using the equation $b_{n-m} = [\mathrm{M+H}]^{+} - y_{m} + \mathrm{H}^{+}$, where n is the length of the peptide and m refers to the ordinal number of an ion in its series. This equation can be used to calculate any b/y complementary pair, as illustrated in Figure 2B for fragmentation of a generic, doubly-charged tryptic peptide. For [M+H]+ = 1570.67 Da and an Arg y1 ion (175.12 Da):

$$b_{n-1} = 1570.67 - 175.12 + 1.01 = 1396.56\ \text{Da}$$
There is an ion of mass 1396.4 Da, indicating that Arg is the C-terminal amino acid. Further, there is an ion of similar intensity at 1378.4 Da, corresponding to H2O loss from bn−1, indicating the presence of Ser, Thr, Glu and/or Asp in the peptide sequence. In this exercise, the b – H2O peaks are more intense overall than their b series counterparts.
To extend the y series, a fragment must be located that corresponds to addition of an amino acid residue mass to the y1 ion. There is a peak at 246.3 Da, which is 71 Da above the y1 ion for Arg, corresponding to Ala. Using the theoretical y2 mass (175.12 + 71.04 = 246.16 Da), the bn−2 complementary ion is calculated as:

$$b_{n-2} = 1570.67 - 246.16 + 1.01 = 1325.52\ \text{Da}$$
There is an ion at 1325.2 Da, corroborated by a 1307.3 Da H2O loss ion, which confirms the assignment of the 246.3 Da peak as y2. C-terminal sequencing proceeds in this fashion through the entire amino acid backbone.
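The complementary-ion bookkeeping used in these first two sequencing steps can be sketched as follows. The function name `complementary_b` and the rounded residue and ion masses are illustrative; theoretical y-ion masses are used, and 1.007825 Da is the proton correction used in the text.

```python
H_MASS = 1.007825  # Da, proton correction as used in the text
WATER = 18.010565  # Da, monoisotopic mass of H2O

MH = 1570.67       # [M+H]+ of the precursor, from the text
Y1_ARG = 175.12    # theoretical y1 ion for C-terminal Arg
ALA = 71.04        # Ala residue mass


def complementary_b(mh, y_mass):
    """b ion complementary to a given y ion: b(n-m) = [M+H]+ - y(m) + H+."""
    return mh - y_mass + H_MASS


y_series = [Y1_ARG, Y1_ARG + ALA]  # y1, y2
for m, y in enumerate(y_series, start=1):
    b = complementary_b(MH, y)
    # The water-loss partner is listed because b - H2O peaks are prominent here
    print(f"y{m} = {y:.2f}  b(n-{m}) = {b:.2f}  b(n-{m}) - H2O = {b - WATER:.2f}")
```

Each further residue adds one entry to `y_series`, so the same loop extends the ladder toward the N-terminus; the calculated values are then compared with observed peaks within the ion trap tolerance.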
The final ion series assignments and partial peptide sequence are shown below. Some of the low mass b ions and high mass y ions could not be assigned. Spectra often do not offer complete ion series because fragmentation favors some ions over others and, as is evident in Fig. 1B, the intensity distribution is uneven across the m/z range, particularly at the high and low ends of the spectrum. Still, an incomplete de novo sequence tag can be used to find reasonable-probability matches for completion of the sequence and spectral comparison [61, 62].
The sequence above, in one-letter code, is NDNEEGFFSAR. This sequence tag was submitted to a species-independent online protein blast search of the NCBI non-redundant protein database at http://blast.ncbi.nlm.nih.gov/Blast.cgi. This resulted in a list of 21 hits with 100% sequence tag coverage, producing two candidate peptides: QGVNDNEEGFFSAR and EGVNDNEEGFFSAR. The [M+H]+ for the precursor to Fig. 1B, determined to within 10 ppm accuracy in the Orbitrap (about 0.015 Da for [M+H]+ = 1570.7 Da), was calculated above as 1570.67 Da. Using the monoisotopic masses in Table 1, the [M+H]+ masses for QGVNDNEEGFFSAR and EGVNDNEEGFFSAR are 1569.68 Da and 1570.66 Da, respectively, uniquely identifying EGVNDNEEGFFSAR as the correct sequence for Fig. 1B, with n = 14. Figure 3 shows the assigned MS/MS spectrum.
There are two strategies for mass-spectrometry based quantification of protein expression, generally referred to as “label” [63–71] and “label-free” [72, 73]. The former utilizes a single analysis of mixed (and isotopically-labeled) samples and the latter relies on comparison of extracted ion chromatograms (XIC) or spectral counting for samples analyzed in a series of LC-MS/MS acquisitions. Ultimately the data are represented as fold changes (i.e., relative ratios) of peptide or reporter ion signal intensities across several samples/conditions. Increasingly these measurements are accompanied by confidence intervals and other relevant statistics [74–77].
In label-based methods, the method of label incorporation and the type of mass spectrometer used for detection may vary significantly. In SILAC [63, 64], for example, cells are cultured in a growth medium containing either a ‘light’ amino acid, with the natural isotopic abundance of each atom, or a heavy analog, enriched with multiple heavy atoms (e.g., 13C, 15N). The growing cells utilize these amino acids in protein synthesis. When differentially labeled cells are combined, MS analysis of protein digests from these cells reveals a user-defined peptide mass shift, allowing the relative quantitation of different biological samples. In AQUA , heavy amino acid analogs of peptides from biological samples are synthesized, combined with cell lysate, and digested. In addition to amino acid-based labeling approaches, chemical labels which target amino acid functional groups are also common. ICAT [66, 67] is used for MS-level peptide quantitation. It can be used for cysteine peptide enrichment, which confers the added benefit of reduced mixture complexity, as cysteine is a relatively rare amino acid . iTRAQ [68, 69] and TMT [70, 71] provide quantification at the MS/MS level. The iTRAQ label contains a reporter group that fragments easily during MS/MS. Once combined, the differentially labeled peptides co-elute during LC/MS, and are quantified by well-defined reporter ion masses (e.g., spanning 114–117 Da for the four-plex reagent) in a single MS/MS spectrum. TMT reagents have a similar design, but the reporter ions fall in a higher-mass region of an MS/MS spectrum (126–131 Da for the six-plex reagent).
Challenges in quantification arise from the nature of specific labeling chemistries, in addition to fundamental differences in the operation of mass spectrometry detectors and how these translate into measurement variance and significance thresholds for observed peptide and protein ratios. These issues are exacerbated for analysis of post-translational modifications, in which it is necessary to quantify sites on individual peptides, rather than aggregating evidence across multiple peptides from the same protein. Although models have been proposed for specific platforms [76, 77], a universal solution has not yet gained widespread acceptance. In all cases, computation of changes at the protein level involves taking the mean or median of the associated peptide ratios.
Reversible protein phosphorylation is an important and tightly-regulated catalyst of many cellular functions, including growth, proliferation and transcription. Phosphorylation often occurs with low stoichiometry or in low-abundance proteins, making it difficult to identify and quantify phosphorylation sites en masse from complex biological systems. Strategies to circumvent this problem generally fall into three categories: chemical affinity tag derivatization, selective chromatographic enrichment of phosphorylated side chains and linked-scan mass spectrometer acquisition methods that rely on diagnostic ions specific to phosphorylated peptides.
Chemical derivatization techniques exploit the reactivity of the phosphate functional group. For example, high pH conditions cause β-elimination of phosphoserine and phosphothreonine, yielding dehydroalanine or β-methyldehydroalanine, respectively. The products of β-elimination can be modified through Michael addition to introduce affinity handles (e.g., biotin) for isolation or stable isotopes to afford quantitation across biological states. Although this approach demonstrated enrichment of phosphopeptides from whole yeast tryptic digest, deleterious side reactions can be problematic. An alternative strategy uses carbodiimide-catalyzed thiolation of phosphorylated amino acid side chains, followed by solid phase capture using an iodoacetyl functionalized resin. Following the initial report, a simplified method coupling modified phosphopeptides to a dendrimer support was described, and successfully applied to profile phosphorylation in T-cells.
Chromatographic methods for phosphopeptide enrichment involve selective and reversible binding of the phosphate group directly. Immobilized metal affinity chromatography (IMAC) relies on the high affinity of phosphate toward various metal ions. The most widely used IMAC resins typically employ Fe3+ or Ga3+ chelated to iminodiacetic acid (IDA) or nitrilotriacetic acid (NTA) to selectively retain phosphorylated molecules [86–88]. Chemical derivatization of carboxyl groups is often required to obtain high specificity for phosphopeptides with Fe3+-IDA [89–91], whereas Fe3+-NTA can achieve >95% specificity from complex mixtures without any derivatization . A seminal report in 2004 demonstrated enrichment of phosphopeptides using titanium dioxide , spurring rapid development of metal oxide affinity chromatography (MOAC) in the field of phosphoproteomics. Subsequent reports revealed that small molecule chemical competitors like 2,5-dihydroxybenzoic acid  and other acids [95, 96] increased the specificity significantly. Since these initial reports, several metal oxides including zirconium  and niobium , amongst others , have also been shown to enrich phosphopeptides. Ultimately, the different oxides appear to have unique selectivities that may provide complementary phosphoproteome coverage.
Additional modes of chromatography for phosphopeptide enrichment are selective because the acidic phosphate group will bear a formal negative charge at all but very low pH values , enabling charge-based separation of non-phosphorylated peptides from their more anionic phosphopeptide counterparts. In strong cation exchange chromatography, phosphopeptides tend to appear in the flowthrough or elute early in a salt concentration gradient at low pH [101, 102]. In strong anion exchange chromatography, phosphopeptides bind more tightly and elute only at lower pH values . In hydrophilic interaction chromatography (HILIC), the polarity of phosphopeptides makes them less hydrophobic, resulting in later elution in a normal phase gradient [104, 105]. ERLIC, an extension of HILIC, can isolate phosphopeptides by polarity under normal phase conditions , but exhibits different elution properties from HILIC due to electrostatic repulsion effects between peptides and their charge-matched stationary phase. These techniques have been used with [102, 104, 105, 107] or without [101, 103] other phosphopeptide enrichment methods described above, producing as many as 14,000 phosphorylation site identifications in a single experiment .
Bioaffinity separations also play an important role in phosphoproteomics, especially for profiling tyrosine phosphorylation. This subset of phosphorylation is particularly important in regulating cell growth and differentiation , but is difficult to detect as it represents <1% of total cellular phosphorylation . Protein  and peptide  level phosphotyrosine immunoprecipitation have been used to study the signaling of several receptor tyrosine kinases .
Mass spectrometry – based phosphopeptide identification methods capitalize on unique features of the ion chemistry of gas phase phosphopeptides. For example, PO2− (63 Da) and PO3− (79 Da) are fragmentation products derived from the phosphate moiety and can be used to detect the associated phosphopeptide precursors in the context of linked scans on a triple quadrupole mass spectrometer . Another common fragmentation feature of phosphopeptides is the H3PO4 neutral loss product, which can be observed indirectly as a precursor mass shift of (−98 Da/|Z|) in MS/MS spectra, where |Z| is the magnitude of the peptide charge state. Frequently the neutral loss product dominates many fragmentation pathways in MS/MS, which makes it a convenient and intense ion for detection of phosphorylation [89, 90, 114], but also eliminates structurally informative ions for MS/MS sequence assignment. As a solution, MS/MS/MS (or neutral loss product fragmentation) has been used to identify these product ions [101, 115]. Neutral loss can be especially problematic for basic and/or multiply phosphorylated peptides, and hence these are particularly amenable to sequence identification by non-ergodic dissociation techniques such as ECD  or ETD .
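The expected position of the neutral loss product follows directly from the (−98 Da/|Z|) relationship above. The sketch below uses the monoisotopic mass of H3PO4 (about 97.977 Da); the function name `neutral_loss_mz`, the `n_losses` parameter and the example precursor are illustrative.

```python
H3PO4 = 97.9769  # Da, monoisotopic mass of phosphoric acid


def neutral_loss_mz(precursor_mz, z, n_losses=1):
    """m/z of the product ion after loss of n_losses H3PO4 from the precursor.

    The charge is retained on the peptide, so the observed shift is the
    neutral mass divided by the magnitude of the charge state.
    """
    return precursor_mz - n_losses * H3PO4 / abs(z)


# Hypothetical phosphopeptide precursor at m/z 800.00 in three charge states
for z in (1, 2, 3):
    print(z, round(neutral_loss_mz(800.0, z), 2))
```

A dominant peak at the computed position is the usual trigger for a follow-up MS/MS/MS scan on the neutral loss product.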
Coupling multidimensional separations [104, 116] with various ion dissociation techniques , improved scoring , site localization methods [118, 119] and the quantitative methods described above, thousands of phosphopeptides are routinely profiled across multiple biological states. Further technological advances in instrumentation, as well as methodology for enrichment and fractionation of phosphopeptides will provide for more complete phosphoproteome coverage, and serve to enrich our understanding of the role of cell signaling in normal physiology and human disease.
Although the computational approaches for quantitative proteomics data are not fully resolved, it is nonetheless possible to organize data based on similarity of response (e.g., “upregulated or downregulated”) irrespective of the underlying confidence for a given ratio measurement. Recently the concept of multiplexing has enabled the comparison of up to eight biological conditions simultaneously [120, 121]. In such cases, peptides are clustered based on similar quantitative profiles across all the conditions. The underlying assumption is that peptides exhibiting similar behavior within a cluster will likely share a common biological pathway or response. For example, Wolf-Yadlin et al. used a method called self-organizing maps to cluster 62 phosphopeptides across 4 time points and 4 conditions to identify the upstream stimulus, such as EGF stimulation or HER2 activation, responsible for the regulation of phosphorylation activity . Tang et al. also used this approach with quantitative phosphoproteomic data to generate 5 clusters based on a classical k-means algorithm to identify groups of effectors, positive regulators, and negative regulators in the Wnt signaling pathway . Clustering not only separates quantitative mass spectrometry data into meaningful groups but is also used to identify recurring patterns that may yield further biological insight.
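Clustering of quantitative profiles can be illustrated with a bare-bones k-means (Lloyd's algorithm) in plain Python. This is a minimal sketch, not the procedure used in the studies cited above: the initialization simply takes the first k profiles, and the log2 ratio profiles are invented for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, n_iter=50):
    """Cluster profiles into k groups; returns (labels, centroids)."""
    centroids = [list(p) for p in profiles[:k]]  # naive deterministic start
    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical log2 fold changes of four phosphopeptides over four time points
profiles = [
    [0.1, 1.0, 1.8, 2.1],     # rising
    [0.0, -1.1, -1.9, -2.0],  # falling
    [0.0, 0.9, 2.0, 2.2],     # rising
    [0.1, -0.8, -1.7, -2.2],  # falling
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

In practice the result depends on the initialization and on the choice of k, so multiple random restarts and some cluster-validity criterion are normally used.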
Biological insight can be gleaned from high-throughput data by integrating prior knowledge stored in publicly available databases. These biological annotation databases include: (i) Gene Ontology (GO) for mining protein function, localization, and process , (ii) literature-curated pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway database  and BioCarta , in addition to (iii) experimental data warehouses such as BioGRID  that contains protein interaction data from high-throughput experiments and the Human Protein Reference Database (HPRD)  that includes curated enzyme-substrate data.
GO is divided into three categories: (i) molecular function, (ii) cellular component, and (iii) biological processes. Molecular function distinguishes proteins based on their functional role in the cell and includes terms such as “MAP kinase activity.” The cellular component category groups proteins based on cellular localization and, at a higher resolution, complexes such as “IkappaB kinase complex.” Finally, biological processes include terms such as “MAPKKK cascade” that consist of proteins participating in the particular biological event. Similar to GO biological process, pathway databases such as KEGG Pathway and BioCarta assemble groups of proteins that belong to the same signaling or metabolic pathway. Unlike GO, which only contains lists of functional protein groups, the aforementioned pathway databases also store images of how the proteins interact within each pathway. The typical use-case of these annotation databases in automated analysis is to quantify the enrichment of a particular functional group of proteins within an experimental dataset. Statistical methods such as Fisher’s Exact Test calculate a significance level for the degree of enrichment of a particular protein group such as a pathway, complex, or enzyme class within an experimentally determined list of proteins (Figure 4). Given the role of mass spectrometry in interrogating signaling pathways via phosphoproteomic experiments, identifying members of a complex , or targeting enzyme classes via activity-based profiling assays, enrichment meta-analyses can facilitate the assessment of the experimental result. For example, Ficarro et al. used KEGG Pathway enrichment analysis to identify cell adhesion, junction, and matrix proteins as potentially involved in embryonic stem cell self-renewal . Several online tools exist for conducting meta-analyses given an input list of experimentally determined proteins [131–133].
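The enrichment calculation can be sketched with the one-sided form of Fisher's Exact Test, which is the upper tail of the hypergeometric distribution. The function name `enrichment_p_value`, its parameter names and the example counts are illustrative assumptions; the tools cited above may use different variants or multiple-testing corrections.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_hits, n_overlap):
    """Probability of seeing at least n_overlap pathway members by chance.

    n_universe: proteins in the background (e.g., all annotated proteins)
    n_pathway:  background proteins annotated to the pathway or category
    n_hits:     proteins in the experimentally determined list
    n_overlap:  proteins in the list that belong to the pathway
    """
    total = comb(n_universe, n_hits)
    upper = min(n_pathway, n_hits)
    favorable = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 40 of 300 detected proteins fall in a 200-member
# pathway drawn from a background of 10,000 annotated proteins
print(enrichment_p_value(10_000, 200, 300, 40))
```

When many pathways or GO terms are tested against the same protein list, the resulting p-values should be adjusted for multiple comparisons before any term is called enriched.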
Cell signaling involves a complex network of protein-protein interactions and enzymatic reactions that form feedforward and feedback loops, along with a significant degree of “crosstalk” between various cascades [134–137]. Enrichment analyses treat pathways as segregated entities, ignoring the effect of shared protein members. In addition, only a fraction of proteins relative to the entire proteome are assigned to pathways in the public databases, such that many of the proteins detected in a large-scale experiment are ignored in subsequent network or pathway analyses. The same problem occurs with other annotations such as GO categories where proteins are often excluded due to incomplete curation. In such scenarios it is often necessary to gain a higher resolution picture of the experimental data. Networks where nodes represent proteins and edges represent biochemical or physical interaction offer a solution and provide a convenient way to integrate publicly available interaction data with experimental results. Protein-protein interaction (PPI) data now account for more than 100,000 entries in databases such as BioGRID. These interactions are derived from high-throughput assays such as Yeast two-hybrid (Y2H) and affinity purification (AP-MS), with the former contributing a majority of the pairwise interactions in these databases. PPI or physical binding interactions are represented as undirected edges in protein networks (Figure 5). Biochemical interaction data such as enzyme-substrate relationships found in curated databases such as HPRD can also be integrated into protein networks as directed edges, where an arrow indicates reactivity between enzyme and substrate. Directed edges may also be used to distinguish reversible post-translational modification or activation versus inhibition of a substrate (Figure 5). 
Although PPI data outnumber biochemical interaction data by two orders of magnitude, directed edges are nonetheless more appropriate for interrogation of activity-based profiling data (e.g., phosphorylation or other enzyme-substrate relationships).
Network diagrams are built upon detected proteins as opposed to generic curated pathway maps, thus providing a context-specific and high-resolution representation of experimental proteomics data. Furthermore, network representations lend themselves to in-depth analysis based on decades of graph theory research that can be directly applied to protein networks. For example, Breitkreutz et al. use the graph theoretical concept of characteristic path length to illustrate the robustness of the kinase-kinase physical interaction network in yeast . Networks are also widely used to visualize the local and global dynamics of entire datasets to elucidate condition-dependent recruitment of binding partners or relative activity of pathway components (Figure 5) .
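As an illustration of the graph measures mentioned above, the characteristic path length of an undirected interaction network is the average shortest-path length over all connected node pairs. The sketch below computes it with breadth-first search in plain Python; the function names and the three-node example network are hypothetical and are not taken from any interaction database.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search distances (in edges) from source to reachable nodes."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def characteristic_path_length(edges):
    """Mean shortest-path length over all connected pairs of distinct nodes."""
    adjacency = {}
    for a, b in edges:  # undirected edges, e.g. physical binding interactions
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    total, pairs = 0, 0
    for node in adjacency:
        for other, d in shortest_path_lengths(adjacency, node).items():
            if other != node:
                total += d
                pairs += 1
    return total / pairs


# Hypothetical linear chain of three interacting kinases
edges = [("kinase_A", "kinase_B"), ("kinase_B", "kinase_C")]
print(characteristic_path_length(edges))
```

Robustness analyses typically recompute such a measure after removing nodes or edges and compare the result with the intact network.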
Unlike pathway enrichment analyses, which are robust to missing proteins, high-resolution networks can become unstable if key network proteins are not present within the experimental data. In these cases many detected proteins become orphans in the network with no edges connecting them to other network members. To account for this problem, Huang and Fraenkel applied another graph theory algorithm called the Prize-Collecting Steiner Tree (PCST) to connect detected proteins by selectively introducing hidden nodes (undetected proteins) in the network . The authors were thus able to identify hidden signaling components in the yeast pheromone response pathway by integrating LC-MS/MS and publicly available interaction data.
Although the PCST algorithm was used to predict new signaling pathway members, the aforementioned clustering and data integration methods are mostly used to reveal biologically significant and recurring patterns in mass spectrometry data. However, with continued development of proteomic methods for consistently and simultaneously quantifying proteins, computational scientists have been able to apply machine learning methods to build predictive models. Kumar et al. used a machine learning method called partial least squares regression (PLSR) to predict HER2-mediated changes in migration and proliferation from phosphoproteomic data . As a side effect of constructing a predictive model that could use new data to predict cell behavior, the authors identified a small set of 9 phosphorylation sites that maintained high fidelity with respect to predictive power. Woolf et al. used a graphical machine learning method called Bayesian Networks to construct a model of 28 signaling proteins governing embryonic stem cell fate decisions . The model was trained on 49 quantitative phosphorylation measurements per protein. Output of the machine learning method is a causal network where nodes are variables, which include phosphorylation levels of proteins and phenotypic measurements such as differentiation rate, and directed edges represent causal relationships between variables. This model can be used to infer or predict new values of a subset of variables, such as the differentiation rate given a certain level of STAT3 phosphorylation. Similar to PLSR, constructing a Bayesian network model has the added benefit of identifying causal relationships between proteins based on the network structure. Woolf et al. not only identified known signaling cascades such as the classical RAF-MEK-ERK cascade but also proposed a new prediction that MEK3/6 causally affects MEK1/2 activity.
In this section we will discuss how several of the aforementioned tools and methods were used to interrogate phosphorylation dynamics during early differentiation of human embryonic stem cells (hESCs). Van Hoof et al. identified 1399 phosphoproteins with 3067 phosphosites, of which 1091 were regulated across three different time points. The regulated phosphoproteins belonged to several pathways including BMP, PI3K, WNT and JNK signaling pathways, as determined by enrichment analysis described above. To better understand temporal behavior, the regulated phosphopeptides were grouped using k-means clustering. Clustering revealed that the most dramatic phosphorylation changes occurred during the first hour of differentiation. Interestingly, different phosphosites on certain hyperphosphorylated proteins, such as tumor suppressor p53-binding protein 1, exhibited different temporal profiles and therefore segregated into different clusters, suggesting that certain key proteins act as platforms for integrating signals from several kinases. In order to gain higher resolution, the authors used the NetworKIN algorithm to predict kinases upstream of the regulated phosphosites and thus generated the first kinase-substrate database for hESCs. CDK1/2 was predicted to be the most active upstream kinase, accounting for 26% of the hESC phosphosites. Focusing on the regulated phosphosites on kinases, the authors noted that the phosphonetwork, connecting upstream predicted kinases to experimentally determined phosphorylated kinases, expanded over time during differentiation.
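The clustering step can be sketched with a minimal k-means (Lloyd's algorithm) in plain Python. The six profiles below are invented log2 ratios at three time points, and the starting centroids are fixed by hand so that the result is reproducible. Neither the data nor these settings come from Van Hoof et al.

```python
from typing import List, Sequence, Tuple

Profile = Tuple[float, ...]


def sq_dist(a: Profile, b: Profile) -> float:
    """Squared Euclidean distance between two temporal profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Profile]) -> Profile:
    """Coordinate-wise mean of a non-empty set of profiles."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points: Sequence[Profile], init: Sequence[Profile],
           max_iter: int = 100) -> Tuple[List[int], List[Profile]]:
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = list(init)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = centroid(members)
    return labels, centroids


# Hypothetical log2(fold change) values at three time points for six phosphosites.
PROFILES: List[Profile] = [
    (1.8, 1.9, 2.0), (2.1, 2.0, 1.7),        # early, sustained increase
    (0.1, 0.2, 1.9), (0.0, 0.3, 2.2),        # late increase
    (-1.9, -2.0, -1.8), (-2.2, -1.7, -2.0),  # early decrease
]

labels, centers = kmeans(PROFILES, init=[PROFILES[0], PROFILES[2], PROFILES[4]])

if __name__ == "__main__":
    print(labels)
```

Because each phosphosite is clustered as its own profile, two sites on the same protein can fall into different clusters, as reported above for tumor suppressor p53-binding protein 1.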
Although important in generating hypotheses about upstream phosphorylation cascades, kinase-substrate prediction algorithms face several challenges as noted by the authors of the NetworKIN algorithm, which combines context specificity from the STRING database with phosphosite motif scanning algorithms such as NetPhosK and Scansite. First, the algorithm is prone to errors from the phosphopeptide data, incorrect kinase family assignment to phosphorylation motifs, and errors in the probabilistic context association network. Second, predictive algorithms have poor coverage of the kinome (<22%) and seem to focus on pleiotropic kinase families, which are potentially responsible for most phosphosites, and consequently ignore the more specific and selectively expressed kinases. Given that modern phosphoproteomic data analyses rely heavily on prediction of upstream kinases, greater availability of accurate kinase-substrate data is required not only to increase our knowledge of the global kinase-substrate network but also to improve prediction algorithms.
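The general idea of combining motif scanning with contextual evidence can be illustrated with a deliberately simplified Python sketch. It does not reproduce NetworKIN, NetPhosK, or Scansite. The two motifs are textbook consensus patterns written as regular expressions, whereas real tools use trained scoring models. The substrate name, sequence, and context scores are hypothetical placeholders for association probabilities of the kind a database such as STRING provides.

```python
import re
from typing import Dict, List, Tuple

# Simplified consensus motifs: (pattern, offset of the phospho-acceptor in the pattern).
MOTIFS: Dict[str, Tuple[re.Pattern, int]] = {
    "CDK": (re.compile(r"[ST]P.[KR]"), 0),  # proline-directed: S/T-P-x-K/R
    "PKA": (re.compile(r"RR.[ST]"), 3),     # basophilic: R-R-x-S/T
}

# Hypothetical context scores for (kinase, substrate) pairs; values are invented.
CONTEXT: Dict[Tuple[str, str], float] = {
    ("CDK", "SUB1"): 0.8,
    ("PKA", "SUB1"): 0.1,
}


def predict_kinases(substrate: str, sequence: str, site: int) -> List[Tuple[str, float]]:
    """Rank kinases whose motif aligns with the phosphosite, ordered by context score."""
    hits: List[Tuple[str, float]] = []
    for kinase, (pattern, offset) in MOTIFS.items():
        start = site - offset  # where the motif must begin to place S/T at `site`
        if start < 0 or pattern.match(sequence, start) is None:
            continue
        hits.append((kinase, CONTEXT.get((kinase, substrate), 0.0)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical substrate sequence; the serine at 0-based index 4 is the phosphosite.
SEQUENCE = "GRRASPVKAG"

if __name__ == "__main__":
    print(predict_kinases("SUB1", SEQUENCE, 4))
```

In this example the serine matches both motifs, and only the context score separates the two candidate kinases. The sketch also shows the error sources listed above: a wrong motif assignment or a wrong context score changes the ranking directly.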
Although traditional shotgun proteomics experiments generate large catalogs of proteins, and can also provide relative quantification data, they do not directly interrogate the catalytic activity of any protein class. This information is an essential piece of the systems biology puzzle as proteins are the major elements of cellular catalysis. As an example, consider proteases, which are synthesized as inactive zymogens. An increased amount of protease in this form does not translate into increased activity until specific processing or co-factor binding events activate the enzyme. In addition, many kinases are regulated by phosphorylation within the activation loop, an event that can yield an increase in catalytic activity of up to four orders of magnitude. So although protein profiling is important to identify relevant enzymes, mere changes in protein quantity do not necessarily correlate with overall activity. Moreover, many enzymes are expressed at relatively low levels, making them difficult to detect in global profiling experiments. The emergent field of activity-based protein profiling (ABPP) helps to bridge this gap, allowing for system-wide detection of active enzymes in lysates, cells, or even live animals.
ABPP relies on chemical probes that have two essential components (Figure 6A). The first is a reactive group that targets a specific enzyme class. Probes have been developed for many types of enzymes including kinases, serine hydrolases, cysteine proteases, phosphatases, glycosidases, and several others. In addition to a reactive group, the probe may have structural components that improve selectivity for a particular enzyme class. Patricelli et al. designed ADP/ATP analogues that target the ATP binding pocket of kinases and position an activated carbonyl near the epsilon amino group of conserved lysine residues in these proteins. Second, the probe must facilitate detection of its target through a reporter, such as a fluorescent tag or an affinity handle that allows isolation (e.g., biotin). Note that the choice of reporter dictates the type of workflow used to assess protein activity; for example, fluorescent tags are typically used for gel-based assays, whereas affinity handles can facilitate isolation of proteins or peptides for LC/MS-based interrogation. Many of the early studies utilized probes directly attached to reporter tags through a linker region [152, 153]. These types of probes tend to be larger in molecular weight, are not easily internalized by cells or tissue, and are typically used after lysis of cells. This approach has the disadvantage of removing proteins from their native environment, which may lead to loss of activity and hence poor recovery, depending on the probe used. To circumvent this limitation, Speers et al. developed click chemistry-based ABPP, or CC-ABPP (Figure 6B). Click chemistry is a philosophy of synthesis that advocates building target molecules through highly efficient and favorable reactions such as Huisgen’s 1,3-dipolar cycloaddition, which couples azide and alkyne moieties.
This reaction has an interesting history, and has been optimized through copper catalysis [160, 161] as well as ligands that stabilize copper in the correct oxidation state. In CC-ABPP, the probe and reporter are synthesized as discrete functional molecules that are linked through the bio-orthogonal azide/alkyne cycloaddition. In a typical experiment, probes are synthesized with reactive and binding groups that target the enzyme of interest, along with an alkyne or azide moiety. This molecule, without the reporter, is more compact, and (if properly designed) can be added to live cells or fed to animals. After reaction of the probe with target proteins, cells or tissues are lysed, and the reporter is selectively attached to the tagged protein through the click chemistry reaction. One drawback of the copper-catalyzed CC-ABPP is that reporter incorporation must occur ex vivo, due to the toxicity of copper. Bertozzi and colleagues have developed a copper-free click reaction based on strained, fluorinated cyclooctynes that allows reporters to be attached to labeled molecules in living organisms [163, 164]. Although these experiments were used to label cell surface glycans that carried azide functionalities and were not activity-based, they nonetheless demonstrated the potential to apply click chemistry for dynamic imaging of targets in living organisms.
Once target proteins are labeled, they are typically quantified using gel-based (fluorescence scanning or avidin blotting) or LC/MS-based (for affinity tagging) assays. Gel assays allow for a higher-throughput assessment of probe binding in general, and are also useful to survey differences in protein activity between cell states. LC-MS/MS experiments are utilized to determine the identity of tagged proteins, where two levels of information can be obtained. First, tryptic peptides from isolated proteins can be analyzed (Figure 7A). Second, peptides covalently conjugated to the probe, which represent the site of activity, can also be identified (Figure 7B). Although it is possible to obtain both types of information by digesting isolated proteins and subjecting their peptides to LC-MS/MS analysis, it may be difficult to identify the probe-conjugated fragments amongst the other, far more abundant, tryptic peptides. An alternative strategy (Figure 7C) is to separate tryptic fragments from probe conjugates, and analyze these subsets in separate LC-MS/MS analyses. For example, Weerapana et al. have developed a strategy where the reporter contains a TEV protease cleavage site as well as a biotin affinity handle. After proteins are labeled, the reporter module is attached in the click reaction, and labeled proteins are isolated by streptavidin. The proteins are digested on-resin, and the tryptic fragments are analyzed by LC-MS/MS. Next, a TEV digestion is employed to release only the probe-labeled peptides, which are analyzed separately.
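The separation of unmodified tryptic fragments from probe conjugates can be mimicked in silico. The Python sketch below applies the commonly used trypsin rule (cleavage C-terminal to lysine or arginine, except before proline) and then splits the peptides by whether they contain the labeled residue. The protein sequence and the labeled cysteine position are hypothetical, and missed cleavages and probe mass shifts are ignored.

```python
from typing import List, Tuple


def tryptic_digest(sequence: str) -> List[Tuple[int, str]]:
    """Return (start index, peptide) pairs; cleave after K/R unless followed by P."""
    peptides: List[Tuple[int, str]] = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append((start, sequence[start:i + 1]))
            start = i + 1
    peptides.append((start, sequence[start:]))  # C-terminal peptide
    return peptides


def split_by_label(peptides: List[Tuple[int, str]],
                   labeled_index: int) -> Tuple[List[str], List[str]]:
    """Separate the peptide carrying the labeled residue from all other peptides."""
    labeled: List[str] = []
    unlabeled: List[str] = []
    for start, peptide in peptides:
        if start <= labeled_index < start + len(peptide):
            labeled.append(peptide)
        else:
            unlabeled.append(peptide)
    return labeled, unlabeled


# Hypothetical protein; the cysteine at 0-based index 6 is assumed to carry the probe.
PROTEIN = "MAGSKLCDRPTEKAAR"
LABELED_SITE = 6

peptides = tryptic_digest(PROTEIN)
probe_peptides, tryptic_only = split_by_label(peptides, LABELED_SITE)

if __name__ == "__main__":
    print("probe-conjugated:", probe_peptides)
    print("unmodified:      ", tryptic_only)
```

In the experimental workflow the second list corresponds to the fragments released by on-resin trypsin digestion, and the first to the peptides that stay bound until TEV cleavage. Analyzing the two separately avoids searching for a single conjugated peptide among many unmodified ones.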
Variations of the methodology described above have been applied with myriad chemical probes to assess enzyme activities in both cellular systems and animal models. These studies have met with a great deal of success and have uncovered enzymes associated with human disease and allowed for the development of selective inhibitors [167–171].
Mass spectrometry-based proteomics has emerged as the method of choice for large-scale protein analyses because of technological advancements on three fronts: 1) improvements in mass spectrometer acquisition speed and data quality (e.g., mass resolution, mass accuracy, and detection limit), 2) experimental approaches that improve dynamic range through biochemical enrichment and/or chromatographic fractionation, and 3) bioinformatics approaches that reveal quantitative trends and relevant biological pathways in complex datasets. Some of the more prominent developments have been discussed in this review in the context of bottom-up proteomic analysis. Despite these advances, the wide dynamic range of protein expression along with the number and variable stoichiometry of post-translational modifications will likely keep comprehensive proteome characterization well beyond our reach. In addition, the stochastic nature of discovery or “shotgun” proteomics methods often limits the reproducibility of high-throughput studies; this scenario is particularly problematic for protein biomarker discovery efforts, where putative markers would preferably be subject to thorough analytical validation prior to advancement through the pre-clinical pipeline. Fortunately the field is making rapid progress on all fronts, from standardization of pre-analytical sample treatment procedures (e.g., storage and distribution) to the use of multiple reaction monitoring mass spectrometry (MRM-MS) for targeted and reproducible biomarker studies that span multiple laboratories [172–175]. When used in combination with the rapidly maturing enrichment/separation techniques described above, these proteomics methods may become a standard for pre-clinical biomarker discovery and validation. Activity-based protein profiling is a more recent development that will facilitate analysis of important enzymatically active protein subclasses. 
These functional data represent another important step in the application of proteomics to the field of personalized medicine, where treatments will be tailored based on patient-specific biomarkers or quantitative elucidation of associated disease pathways.
Generous support for this work was provided by the Dana-Farber Cancer Institute and the National Institutes of Health, NHGRI (P50HG004233), and NINDS (P01NS047572).