High-throughput protein identification in biological samples aids our understanding of complex cellular systems and their behavior. Mass spectrometry (MS)-based shotgun proteomics offers fast, high-throughput characterization of complex protein mixtures. Several thousand proteins may be identified in a sample using high-resolution MS/MS instruments and/or extensive biochemical fractionation (Brunner et al., 2007; Graumann et al., 2007), but standard approaches only identify a fraction of the expected proteins.
A shotgun proteomics experiment typically proceeds by MS/MS analysis of peptides from proteolytically digested proteins, followed by in silico matching of the MS/MS spectra against a database of theoretical peptide spectra derived from protein sequences. Proteins are identified using combined evidence from constituent peptides, resulting in a list in which each protein is associated with a score signifying the confidence of correct identification. We refer to this score as the MS/MS protein score, e.g. ProteinProphet's protein probability (Nesvizhskii et al., 2003). Proteins with scores that satisfy an error threshold are labeled present by the MS analysis software.
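The labeling step described above can be illustrated with a minimal sketch. It assumes the error threshold has already been translated into a fixed cutoff on the MS/MS protein score; the function name `label_present`, the cutoff of 0.9, and the protein identifiers are hypothetical and are not taken from any particular MS analysis package.

```python
from typing import Dict, List


def label_present(protein_scores: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Return the proteins whose MS/MS protein score meets the cutoff.

    Assumption: the score is a probability-like value in [0, 1] and the
    error threshold has been converted to a score cutoff beforehand.
    Results are ordered by descending score, ties broken by name.
    """
    kept = [(name, score) for name, score in protein_scores.items() if score >= threshold]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in kept]


# Hypothetical scores for three proteins.
example_scores = {"protA": 0.99, "protB": 0.42, "protC": 0.91}
present = label_present(example_scores, threshold=0.9)
```

In this toy input, `protB` has direct peptide evidence but falls below the cutoff, which is the kind of protein the remainder of this section is concerned with.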
Effective MS/MS protein identification is hindered by factors such as noisy spectra, low-concentration proteins, post-translational modifications and chemical properties that interfere with peptide ionization. For complex samples such as cell lysates, current MS search algorithms typically match only a small percentage (<20%) of all MS/MS spectra to real peptides, resulting in higher error rates and low recall at the protein level. As a result, only a fraction of the expected proteins are identified with confidence despite being present in the biological sample, and the MS/MS identification scores of many other proteins fall below acceptable confidence thresholds.
MS/MS protein identification scoring schemes, such as BioWorks (ThermoFinnegan) and ProteinProphet (Nesvizhskii et al., 2003), assume that all proteins are equally likely to be present. In reality, other information may be available and can be used to influence the inferred probability of protein presence, thereby rescuing proteins that fall below confidence thresholds.
We use gene functional networks (Marcotte et al., 1999) as an external information source to analyze proteins in a sample in the context of the biological processes that are active in the cell. Given a list of proteins identified in an MS experiment (M), we determine a more complete list (M′) by considering the proteins that are expected to be present (or absent) based on their functional linkages to proteins in M. Each protein receives a revised identification score with contributions both from direct MS-based evidence and from MS evidence of neighbors in the gene functional network. Since current gene networks can be incomplete, we intend for M′ to serve as a complement to M, rather than replace it as the authoritative list of expressed proteins.
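To make the idea of a revised score concrete, the following is a minimal sketch and not the scoring formula used in this work. It assumes, purely for illustration, that the revised score is a convex combination of a protein's own MS/MS score and the edge-weighted average of its network neighbors' MS/MS scores. The function name `revise_scores`, the mixing weight `alpha`, and the toy network are all hypothetical.

```python
from typing import Dict, List, Tuple


def revise_scores(
    ms_scores: Dict[str, float],
    edges: Dict[Tuple[str, str], float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend each protein's MS/MS score with its neighbors' scores.

    Assumption (illustrative only): revised = (1 - alpha) * own score
    + alpha * weighted mean of neighbor scores. Proteins without
    neighbors in the network keep their original score.
    """
    # Build an undirected adjacency list restricted to scored proteins.
    neighbors: Dict[str, List[Tuple[str, float]]] = {p: [] for p in ms_scores}
    for (a, b), weight in edges.items():
        if a in neighbors and b in neighbors:
            neighbors[a].append((b, weight))
            neighbors[b].append((a, weight))

    revised: Dict[str, float] = {}
    for protein, own in ms_scores.items():
        total_weight = sum(w for _, w in neighbors[protein])
        if total_weight > 0:
            neighbor_mean = sum(w * ms_scores[q] for q, w in neighbors[protein]) / total_weight
            revised[protein] = (1 - alpha) * own + alpha * neighbor_mean
        else:
            revised[protein] = own
    return revised


# Toy example: protB scores low on its own but is linked to two
# high-scoring proteins; protD scores equally low and has no linkages.
toy_scores = {"protA": 0.95, "protB": 0.40, "protC": 0.90, "protD": 0.40}
toy_edges = {("protA", "protB"): 1.0, ("protB", "protC"): 1.0}
toy_revised = revise_scores(toy_scores, toy_edges, alpha=0.5)
```

In the toy network, `protB` and `protD` start from the same MS/MS score, but only `protB` is raised because its neighbors carry strong MS evidence. The same mechanism lowers the score of a protein whose neighbors have weak evidence, matching the "present (or absent)" wording above.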
Our data integration approach has the potential to enable pathway-based interpretation of high-throughput MS/MS experiments that are otherwise run in isolation. For instance, by integrating mass spectrometry data from yeast grown in rich medium with a published yeast functional network (Lee et al., 2007), we were able to confidently identify many proteins from ribosomal complexes and proteins involved in RNA binding, processing and degradation, thereby increasing the protein coverage in several active pathways (Section 4). When our method was applied to yeast grown in minimal medium, we increased the number of proteins identified in the reductive carboxylate cycle pathway (Ogata et al., 1999). In both cases, we expect the newly identified proteins to be present in the sample, but they were not identified with confidence by the MS analysis software, despite having at least one peptide identified per protein.
We demonstrate the applicability of our method, MSNet, to data from different organisms, mass spectrometers, MS analysis pipelines, and experimental conditions. We identify 8–29% more proteins across different yeast datasets at the same error rate, and evaluate the quality of protein identifications via ROC and precision–recall plots. In yeast grown in rich medium, analyzed on a high-resolution mass spectrometer, we identify 29% more proteins than the original MS analysis, 97% of which are present in a reference set derived from independent identification experiments. We also demonstrate direct applicability to the human proteome using a human functional gene network, reporting 37% more proteins than the original MS analysis.
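As a hedged illustration of how such figures can be computed, the sketch below evaluates a list of identified proteins against a reference set and reports the relative gain over a baseline list. The function names and the toy protein sets are hypothetical, and the numbers they produce are unrelated to the results reported above.

```python
from typing import Iterable, Tuple


def precision_recall(identified: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of an identified protein list against a reference set."""
    identified_set, reference_set = set(identified), set(reference)
    true_positives = len(identified_set & reference_set)
    precision = true_positives / len(identified_set) if identified_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


def percent_more(baseline_count: int, revised_count: int) -> float:
    """Relative increase (in percent) of the revised list over the baseline list."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (revised_count - baseline_count) / baseline_count


# Toy data: four identified proteins, three of which are in the reference set.
toy_identified = {"protA", "protB", "protC", "protD"}
toy_reference = {"protA", "protB", "protC", "protE", "protF"}
toy_precision, toy_recall = precision_recall(toy_identified, toy_reference)
toy_gain = percent_more(baseline_count=100, revised_count=129)
```

Sweeping the score cutoff and recomputing these two quantities at each value yields the points of a precision–recall plot.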