High-resolution hybrid mass spectrometers and improved methods for sample preparation and chromatography have enabled routine quantitative profiling of thousands of proteins in a single sample. Typically, complex biological samples are profiled by “bottom-up” proteomics, in which proteins are proteolyzed and the resulting peptides are separated by one or more dimensions of chromatography before mass spectrometry analysis. Peptide ions are isolated and dissociated in the gas phase to yield tandem mass spectra (MS/MS), which are interpreted by algorithms to identify the fragmented peptides present in the sample. “Spectrum-to-sequence” or sequence-based approaches are commonly employed for MS/MS identification, for example database search algorithms that match MS/MS spectra to sequences in a protein database by first generating theoretical fragmentation spectra for different peptide sequences and then scoring the overlap between model and experimental spectra. In general, the models for generating theoretical MS/MS spectra use simple fragmentation rules that treat all backbone fragmentation events as equally likely. This ignores well-known residue-specific effects on backbone cleavage, which contribute to the variable fragment intensities characteristic of different peptide sequences (1). Consequently, the scoring functions used by many search algorithms discriminate poorly between valid peptide sequences and incorrect or false positive assignments. Previous studies showed that the discrimination of sequence-based approaches is improved by using a kinetic fragmentation model to evaluate the chemical plausibility of MS/MS assignments (2).
“Spectrum-to-spectrum” searching, or spectral library searching, is an alternative to sequence-based approaches in which experimental MS/MS are matched directly against a library of previously identified reference spectra, assembled from MS/MS assigned to peptides with high confidence (5). Spectral library searching has two advantages over sequence-based approaches. First, because the number of spectra in the reference libraries normally employed is small compared with the number of database peptide sequences, search times are significantly reduced. Second, true peptide MS/MS assignments are more easily distinguished from false positive assignments with the added fragment intensity information. Most spectral library search algorithms use a dot product score to measure the similarity in fragment intensity patterns between the candidate MS/MS and library spectra (5). Dot product scores are favored because they have outperformed other scores in searches of small-molecule spectral libraries (8). In contrast, scores used in spectrum-to-sequence search tools, such as Mascot (9), OMSSA (10), and MyriMatch (11), are often based on probabilistic functions that match fragment ion masses, largely ignoring fragment intensity information in the observed MS/MS. By matching peak intensities, spectrum-to-spectrum methods fully exploit the intensity patterns unique to different peptides, including those of noncanonical fragment ions that are not predicted by the simple fragmentation models often used in sequence-based searching. As a result, spectral library searching can show increased score discrimination and sensitivity over sequence-based methods (7).
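The dot product comparison described above can be illustrated with a minimal sketch. It assumes that each spectrum is a list of (m/z, intensity) pairs and that peaks are aligned by summing intensities into fixed-width m/z bins. The function names and the 1.0 m/z bin width are illustrative assumptions, not the settings of any particular search tool.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (mz, intensity) pairs.
    bin_width: assumed bin size in m/z units (illustrative only).
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def normalized_dot_product(query_peaks, library_peaks, bin_width=1.0):
    """Cosine-style similarity between two binned spectra (0 to 1)."""
    query = bin_spectrum(query_peaks, bin_width)
    library = bin_spectrum(library_peaks, bin_width)
    # Only bins populated in both spectra contribute to the dot product.
    dot = sum(query[b] * library[b] for b in query if b in library)
    norm = math.sqrt(sum(v * v for v in query.values())) * math.sqrt(
        sum(v * v for v in library.values())
    )
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical peak lists, made up for illustration.
    query = [(175.1, 100.0), (304.2, 50.0), (433.2, 25.0)]
    library = [(175.1, 80.0), (304.2, 60.0), (562.3, 10.0)]
    print(round(normalized_dot_product(query, library), 3))
```

Because the score is normalized by the magnitude of both intensity vectors, two spectra with the same relative intensity pattern score 1 regardless of absolute signal, whereas spectra with no shared bins score 0.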
Reference libraries, although their compact size provides advantages in search speed and discrimination, are far from comprehensive in their coverage of human proteins. In many human tissues and cancers, much of the proteome and its associated complement of modifications remains undiscovered and is consequently incompletely represented in reference libraries. For example, the “Human IT Library” from the National Institute of Standards and Technology (NIST) covers only 21% of amino acids in the human proteome (12). Accordingly, spectrum-to-spectrum search methods are limited to rediscovery of previously identified sequences, such as in targeted proteomics applications. For this reason, database search algorithms remain the primary peptide identification tool employed for new protein discovery.
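One plausible way to compute a residue-level coverage figure of this kind is sketched below. It assumes that coverage means the fraction of amino acid positions in the proteome that fall within at least one library peptide; how the cited figure was actually derived is not stated here, and the function name and toy sequences are hypothetical.

```python
def residue_coverage(proteins, library_peptides):
    """Fraction of proteome residues covered by at least one library peptide.

    proteins: dict mapping a protein identifier to its amino acid sequence.
    library_peptides: iterable of peptide sequences present in the library.
    """
    covered = 0
    total = 0
    for sequence in proteins.values():
        mask = [False] * len(sequence)
        for peptide in library_peptides:
            if not peptide:
                continue
            # Mark every occurrence of the peptide, including overlapping ones.
            start = sequence.find(peptide)
            while start != -1:
                for i in range(start, start + len(peptide)):
                    mask[i] = True
                start = sequence.find(peptide, start + 1)
        covered += sum(mask)
        total += len(sequence)
    return covered / total if total else 0.0


if __name__ == "__main__":
    # Toy sequences, made up for illustration.
    proteins = {"P1": "MKTAYIAKQR", "P2": "GLSDGEWQLVLNVWGK"}
    library = {"TAYIAK", "LVLNVWGK"}
    print(f"{residue_coverage(proteins, library):.1%}")
```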
We recently reported a method for addressing the limited coverage of reference libraries, which uses a kinetic gas-phase peptide fragmentation model (3) to create a library of simulated MS/MS spectra for all predicted peptides in the human proteome (13). In this way, a “proteome-wide library” can be searched using spectrum-to-spectrum search software in the same manner as smaller reference libraries, maintaining the advantages of direct intensity comparisons while extending the search to the larger numbers of peptides typically covered only by database search algorithms. However, we observed lower performance of spectrum-to-spectrum searching against proteome-wide libraries compared with conventional sequence-based tools such as Mascot.
Here we present a new strategy for searching proteome-wide spectral libraries composed of kinetically simulated MS/MS spectra, incorporated into an efficient search application, Spec2spec. We evaluate the contributions of increased search space, proteome coverage, and the quality or accuracy of spectral intensity predictions to discrimination performance in spectral library searching. We show that current limitations in spectral library search tools include the scoring functions, which are not optimized for proteome-wide libraries whose search spaces contain more than 20-fold the number of peptides in reference libraries, and the quality of the simulated spectra used to increase proteome coverage. Thus, although the high proteome coverage of simulated proteome-wide libraries increases the number of unique peptide identifications compared with reference libraries, the increased library size degrades the performance of dot product and similarity scoring. To address this limitation, we present new scoring metrics, including probabilistic scores based on a hypergeometric model of random peak matching in library spectra, and dot product scores based on peak intensity rankings. The new scores enable proteome-wide library searching with more discriminatory power, outperforming sequence-based searching with Mascot. Furthermore, we evaluate the use of target-decoy search methods for estimating false discovery rates (FDR) in spectrum-to-spectrum searching. We identify score-dependent biases that lead to underestimated FDR with smaller reference libraries, compared with proteome-wide simulated libraries. These findings demonstrate the potential for replacing traditional spectrum-to-sequence searching with spectrum-to-spectrum searching against proteome-wide simulated libraries in discovery proteomics.
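Two of the ideas named above, a hypergeometric model of random peak matching and target-decoy FDR estimation, can be illustrated with a generic sketch. It is not the Spec2spec formulation, which is not given in this section. The sketch assumes the m/z range is divided into a fixed number of bins, that a library spectrum and a query spectrum each occupy some of those bins, and that the FDR is estimated as the ratio of decoy to target matches above a score threshold. All function and parameter names are hypothetical.

```python
import math


def hypergeom_tail(n_bins, lib_peaks, query_peaks, matches):
    """P(X >= matches) when query_peaks bins are drawn at random from n_bins,
    of which lib_peaks are occupied by library peaks (hypergeometric tail)."""
    upper = min(lib_peaks, query_peaks)
    total = math.comb(n_bins, query_peaks)
    tail = sum(
        math.comb(lib_peaks, i) * math.comb(n_bins - lib_peaks, query_peaks - i)
        for i in range(matches, upper + 1)
    )
    return tail / total


def match_score(n_bins, lib_peaks, query_peaks, matches):
    """Convert the tail probability to a -log10 score (higher is better)."""
    p = hypergeom_tail(n_bins, lib_peaks, query_peaks, matches)
    # Floor the probability to avoid log10(0) for extremely unlikely matches.
    return -math.log10(max(p, 1e-300))


def estimate_fdr(psms, threshold):
    """Estimate FDR as decoys / targets among matches scoring >= threshold.

    psms: iterable of (score, is_decoy) pairs, one per spectrum match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    # Hypothetical numbers: 1000 bins, 20 library peaks, 20 query peaks.
    print(match_score(1000, 20, 20, 2), match_score(1000, 20, 20, 10))
    psms = [(12.0, False), (10.5, False), (9.8, True), (9.1, False), (3.2, True)]
    print(estimate_fdr(psms, threshold=8.0))
```

Under this model, the probability of a given number of chance matches depends on how densely the library spectrum populates the m/z range, so the score penalizes matches to peak-rich spectra. The decoy-to-target ratio is only one common FDR estimator and assumes that decoy matches model false target matches.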