mspire is a software package for working with mass spectrometry (MS) proteomics data, as outlined in Fig. 1A.
Fig. 1. (A) Overview of mspire functionality. Black arrows and gray boxes depict mspire functionality. From left to right, mspire creates randomized databases (DBs) for FIR determination. MS::MSRun is a unified model for working with LC-MS/MS datasets. The Bioworks ...
2.1 Memory usage and speed
mspire relies on several memory-saving techniques that are critical for working with large data files. Objects that are created in large quantities are implemented as Arrayclass (http://arrayclass.rubyforge.org) objects, which use memory highly efficiently (Fig. 1B) while preserving the accessor behavior common to typical Ruby objects.
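The idea of an array-backed object with ordinary accessors can be illustrated with a minimal sketch. This is not the Arrayclass API; the helper `array_backed_class` and the field names are hypothetical and serve only to show how named accessors can map onto fixed array slots.

```ruby
# Illustrative only: NOT the Arrayclass API, just the underlying idea.
# Builds an Array subclass whose named accessors read and write fixed slots.
def array_backed_class(*fields)
  Class.new(Array) do
    fields.each_with_index do |name, idx|
      define_method(name) { self[idx] }
      define_method("#{name}=") { |value| self[idx] = value }
    end
  end
end

# Hypothetical peptide-hit record with three fields.
hit_class = array_backed_class(:sequence, :xcorr, :charge)

hit = hit_class.new(3)        # three slots, all nil
hit.sequence = 'LVNELTEFAK'   # behaves like a normal attribute writer
hit.xcorr = 3.21
hit.charge = 2

hit.sequence  # => "LVNELTEFAK"
hit.to_a      # => ["LVNELTEFAK", 3.21, 2]
```

Each instance is a plain array, so no per-object instance variables are stored, yet calling code reads and writes fields by name.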
By default, spectra from MS file formats (mzXML and mzData) are decoded into memory-efficient strings and are only completely cast when spectral information is accessed. An option is also available for storing only byte indices of spectral information that can be used for fast, random access of spectra or for reading files of essentially unlimited size.
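A minimal sketch of this deferred decoding is given here. The class `LazySpectrum` is hypothetical and is not mspire's implementation. It assumes peaks stored as Base64 text holding 32-bit floats in network byte order as alternating m/z and intensity values, which is one of the encodings used in mzXML.

```ruby
# Hypothetical class (not mspire's): keeps the encoded peak string and
# casts it to floats only when peak data are first requested.
class LazySpectrum
  def initialize(encoded_peaks)
    @encoded = encoded_peaks   # compact Base64 text as read from the file
    @values = nil
  end

  # True once the string has been cast to floats.
  def decoded?
    !@values.nil?
  end

  # Assumption: 32-bit big-endian floats, alternating m/z and intensity.
  # 'm' decodes Base64; 'g*' reads big-endian single-precision floats.
  def values
    @values ||= @encoded.unpack1('m').unpack('g*')
  end

  def mzs
    values.each_slice(2).map(&:first)
  end

  def intensities
    values.each_slice(2).map(&:last)
  end
end

# Toy peak list: two (m/z, intensity) pairs, encoded the same way.
encoded = [[100.0, 10.0, 200.5, 20.0].pack('g*')].pack('m0')
spectrum = LazySpectrum.new(encoded)

spectrum.decoded?     # => false (nothing has been cast yet)
spectrum.mzs          # => [100.0, 200.5]
spectrum.intensities  # => [10.0, 20.0]
spectrum.decoded?     # => true
```

Spectra that are never inspected remain as short strings, so the decoding cost and the memory for float arrays are only incurred where peak data are actually used.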
REXML, Ruby's standard library XML parser, can be far too slow when reading large XML files generated in MS proteomics. mspire can use either XMLParser or LibXML (both of which have C/C++ bindings) for rapid parsing of large files.
Performance when reading and then accessing two spectra across thousands of mzXML files from the PeptideAtlas is shown in Fig. 1C. Lazy evaluation of spectra allows files to be read at ~20 MB/s with no file-size limit.
2.2 Reading MS proteomics data formats
mspire parses mzXML and mzData formats into a unified object model to simplify working with liquid chromatography (LC) MS and MS/MS runs. Fig. 1D shows the basic class hierarchy and Fig. 1E demonstrates a simple ‘use case’.
2.3 Bioworks SEQUEST results files (.srf)
Bioworks previously produced separate text files for each spectrum, but now outputs a single SEQUEST results file (.srf) for each set of searches. The single file increases search speed, decreases disk space usage and is much easier to handle in file system operations. Unfortunately, because the output is binary, accessing its contents can be difficult, and downstream analysis tools (outside of Bioworks) do not currently support this format.
We created a reader for .srf files using Ruby's ‘unpack’ method (String#unpack). It extracts both spectral information and SEQUEST results. The reader is fast and also works across platforms because it does not rely on any vendor software libraries.
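How fixed-width binary records can be read with String#unpack is sketched here. The record layout is invented purely for illustration and is not the real .srf layout; `RECORD_FORMAT`, `RECORD_SIZE` and `read_records` are hypothetical names.

```ruby
require 'stringio'

# Hypothetical record layout, invented for illustration (NOT the .srf layout):
#   V    4 bytes  little-endian unsigned integer  scan number
#   e    4 bytes  little-endian float             XCorr
#   C    1 byte   unsigned integer                charge
#   Z20  20 bytes NUL-padded ASCII string         peptide sequence
RECORD_FORMAT = 'VeCZ20'
RECORD_SIZE = 29   # 4 + 4 + 1 + 20 bytes

# Reads consecutive fixed-size records from any IO-like object.
def read_records(io)
  records = []
  while (bytes = io.read(RECORD_SIZE)) && bytes.bytesize == RECORD_SIZE
    scan, xcorr, charge, sequence = bytes.unpack(RECORD_FORMAT)
    records << { scan: scan, xcorr: xcorr, charge: charge, sequence: sequence }
  end
  records
end

# Round trip on an in-memory buffer standing in for a binary file.
data = [1201, 2.5, 2, 'LVNELTEFAK'].pack(RECORD_FORMAT)
read_records(StringIO.new(data))
# => [{:scan=>1201, :xcorr=>2.5, :charge=>2, :sequence=>"LVNELTEFAK"}]
```

Because the byte order is stated explicitly in the format string, the same code gives the same result on any platform, without any vendor library.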
2.4 Reading/writing spectral identification formats
Even when derived from the same upstream data source, formats for working with spectral identifications can vary widely. We designed readers and writers for common downstream spectral-identification software formats for SEQUEST-based data: pepXML files, which are used in the Trans-Proteomic Pipeline (ProteinProphet), and the .sqt format, which can be used with DTASelect and Percolator (Kall et al.
Each reader is tailored to its format, so that users can easily extract format-specific information. The readers also implement a common interface, so that users can just as easily extract the information shared across these formats.
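One way to combine format-specific readers with a shared interface is sketched here. All names (`SharedHitInterface`, `ToyPepxmlReader`, `ToySqtReader` and their methods) are hypothetical, the readers take in-memory data instead of parsing files, and mspire's actual classes may be organized differently.

```ruby
# Hypothetical shared interface: any reader that defines #hits (an array of
# hashes with :sequence and :score keys) gains these methods.
module SharedHitInterface
  def sequences_above(min_score)
    hits.select { |hit| hit[:score] >= min_score }.map { |hit| hit[:sequence] }
  end
end

# Toy stand-in for a pepXML reader with one format-specific attribute.
class ToyPepxmlReader
  include SharedHitInterface
  attr_reader :hits, :search_engine

  def initialize(hits, search_engine)
    @hits = hits
    @search_engine = search_engine
  end
end

# Toy stand-in for an .sqt reader with a different format-specific attribute.
class ToySqtReader
  include SharedHitInterface
  attr_reader :hits, :header_lines

  def initialize(hits, header_lines)
    @hits = hits
    @header_lines = header_lines
  end
end

hits = [{ sequence: 'LVNELTEFAK', score: 3.2 }, { sequence: 'AEFVEVTK', score: 1.1 }]
readers = [ToyPepxmlReader.new(hits, 'SEQUEST'), ToySqtReader.new(hits, ['H  toy header'])]

# The same call works on every reader, whatever the underlying format.
readers.map { |reader| reader.sequences_above(2.0) }
# => [["LVNELTEFAK"], ["LVNELTEFAK"]]
```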
2.5 Determining FIRs
Bioworks currently offers no support for determining FIRs, so downstream tools are necessary. mspire supports peptide FIR determination from target-decoy database searches (both the creation of decoy databases and the summary of search results), PeptideProphet and Percolator. Known biases in sample content can also be used to establish an FIR.
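The two target-decoy steps mentioned above (decoy database creation and summary of search results) can be sketched as follows. This is a minimal illustration under stated assumptions and is not mspire's implementation: decoys are made by reversing each target sequence (one common choice among several), and the FIR at a score cutoff is estimated as the number of passing decoy hits divided by the number of passing target hits. The names `reversed_decoys` and `estimated_fir` are hypothetical.

```ruby
# Assumption: decoy entries are the reversed target sequences, with a
# prefix added to each identifier so decoy hits can be recognized later.
def reversed_decoys(proteins, prefix = 'REV_')
  proteins.map { |id, seq| ["#{prefix}#{id}", seq.reverse] }.to_h
end

# Assumption: FIR at a cutoff = (decoy hits passing) / (target hits passing).
# Each hit is a hash with a :score and a boolean :decoy flag.
def estimated_fir(hits, cutoff)
  passing = hits.select { |hit| hit[:score] >= cutoff }
  decoys, targets = passing.partition { |hit| hit[:decoy] }
  return 0.0 if targets.empty?
  decoys.size.to_f / targets.size
end

reversed_decoys('P1' => 'MKTAYR')
# => {"REV_P1"=>"RYATKM"}

hits = [
  { score: 4.0, decoy: false },
  { score: 3.5, decoy: false },
  { score: 3.0, decoy: true },
  { score: 2.0, decoy: true }
]
estimated_fir(hits, 3.0)  # => 0.5 (one decoy against two targets)
estimated_fir(hits, 3.5)  # => 0.0 (no decoys pass the cutoff)
```

Other estimators exist, for example ones that account for how the target and decoy databases were searched, so the ratio used here should be read as a placeholder for whichever definition a study adopts.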