Recent advances in the speed and sensitivity of mass spectrometers and in analytical methods, the exponential acceleration of computer processing speeds, and the availability of genomic databases from an array of species and protein information databases have led to a deluge of proteomic data. The development of a lab-based automated proteomic software platform for the automated collection, processing, storage, and visualization of expansive proteomic datasets is critically important. The high-throughput autonomous proteomic pipeline (HTAPP) described here is designed from the ground up to provide essential flexibility for diverse proteomic workflows and to streamline the total analysis of a complex proteomic sample. This tool comprises software that controls the acquisition of mass spectral data along with automation of post-acquisition tasks such as peptide quantification, clustered MS/MS spectral database searching, statistical validation, and data exploration within a user-configurable lab-based relational database. The software design of HTAPP focuses on accommodating diverse workflows and providing missing software functionality to a wide range of proteomic researchers to accelerate the extraction of biological meaning from immense proteomic datasets. Although individual software modules in our integrated technology platform may have some similarities to existing tools, the true novelty of the approach described here is in the synergistic and flexible combination of these tools to provide an integrated and efficient analysis of proteomic samples.
Dramatic progress has recently been made in expanding the sensitivities, resolving power, mass accuracy, and scan rates of mass spectrometers that can fragment and identify peptides through tandem mass spectrometry (MS/MS) [1–4]. Unfortunately, this enhanced ability to acquire proteomic data has not been accompanied by increased availability of tools able to assimilate, explore, and analyze these data efficiently. The typical proteomics experiment can generate tens of thousands of spectra per hour, and the use of multidimensional LC/MS, as with the MudPIT technique, can generate even larger datasets.
Computational tools for the collection and analysis of proteomic data lag far behind analytical methods for proteomic data creation. In a typical experiment, the collection and analysis of data is a largely manual process requiring repetitive, laborious sample- and data-processing steps and much unnecessary user intervention. Proteomic datasets are expansive, yet existing systems for the initial storage of proteomic data and its relationships to data from other external protein knowledge sources are inflexible and not integrated with the software used in data acquisition.
There are two options for handling the massive and diverse workflows in the modern proteomics lab: either provide a completely integrated software platform that is malleable to the users’ needs, or provide independent software tools that require extensive user intervention to complete a total analysis of the data. Great progress has been made in providing independent software tools, each focused on a single aspect of the proteomic pipeline. However, proteomic end users are left to fend for themselves in passing data amongst the various software tools and in modifying the individual software tools to provide the processing and analysis needed for interpretation of their specific data. For example, one software tool is used for data acquisition (such as Xcalibur or Analyst). A second tool interprets tandem mass spectra (such as X!Tandem [7, 8], Mascot, SEQUEST, or OMSSA) or statistically validates database search results (such as Peptide/Protein Prophet or Ascore). A third tool provides quantitation of proteomic data (such as Xcalibur XDK or MSQuant), and a fourth provides a relational database for data warehousing (such as PRIME or PeptideAtlas) or a database graphical user interface for visual analysis of proteomic database search results (such as CPAS). An assortment of web-based protein knowledge resources such as Swiss-Prot, HPRD, GenBank, OMIM, BLAST, IPI, and STRING provide rich annotation of the proteins revealed in high-throughput proteomic experiments. However, these web-based metadata tools do not permit users to organize these external information sources relationally within the expansive proteomic datasets or to archive user observations. Although each of these tools provides essential functionality, they have not necessarily been engineered to adapt to diverse proteomic workflows or to work together efficiently.
Recent progress has been made in developing integrated systems for post-acquisition processing of data from high-throughput proteomic analysis. Notably, the Trans-Proteomic Pipeline (TPP) integrates many critical aspects of post-acquisition proteomic analysis, including user-initiated MS/MS sequence assignment, validation, quantitation, and interpretation. To further expand the concepts driving the creation of workflow automation systems for proteomics such as TPP, we have now integrated sample management, data acquisition, post-acquisition analysis, and data visualization as integral components of a fully autonomous analysis pipeline called HTAPP.
The overall scheme of HTAPP is illustrated in Figure 1. While each individual component of the integrated system can provide critical functionality independently, it is the interoperability of the components that provides a complete technology platform integrating data collection, storage, and visualization. In parallel with the development of HTAPP we have also developed a new relational database for proteomic analysis called PeptideDepot. HTAPP automatically directs the incoming data stream into PeptideDepot, where a user may then interact with the processed proteomic data.
To accelerate data processing and enhance system performance through parallel processing, the system components of HTAPP reside separately on several computers running Windows Server 2003 or Windows XP (Figure 2). Proteomic data are exchanged among these computers by HTAPP programs with a built-in TCP-based file transfer server/client. Inter-component communication is likewise achieved by HTAPP programs over TCP/IP. Most user-specific parameters, such as server IP addresses, port numbers, and file directories, are externalized to a user-configurable parameter file.
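In outline, the transfer mechanism described above amounts to a small length-prefixed file protocol plus an external parameter file. The Python sketch below is illustrative only: the HTAPP service itself is a VB6 implementation, and the wire format, parameter-file layout, and key names shown here are hypothetical.

```python
import os
import socket
import struct

def load_params(path):
    """Parse a simple key=value parameter file (this layout is hypothetical,
    standing in for HTAPP's user-configurable parameter file)."""
    params = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                params[key.strip()] = value.strip()
    return params

def send_file(host, port, filepath):
    """Push one raw data file: a 4-byte name length, the file name,
    an 8-byte payload length, then the raw bytes."""
    name = os.path.basename(filepath).encode()
    with open(filepath, "rb") as fh:
        data = fh.read()
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack(">I", len(name)) + name)
        sock.sendall(struct.pack(">Q", len(data)) + data)

def recv_file(conn, dest_dir):
    """Receive one file sent by send_file() and write it into dest_dir."""
    def read_exact(n):
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed early")
            buf += chunk
        return buf
    name_len, = struct.unpack(">I", read_exact(4))
    name = read_exact(name_len).decode()
    data_len, = struct.unpack(">Q", read_exact(8))
    data = read_exact(data_len)
    dest = os.path.join(dest_dir, name)
    with open(dest, "wb") as fh:
        fh.write(data)
    return dest
```

Because the host and port come from the parameter file rather than being hard-coded, the same client and server code can be redeployed across the PC-1 through PC-4 machines without modification.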
The modular design of HTAPP increases throughput, as each component of the analysis workflow is performed simultaneously on separate computers. Through use of a distributed system, parallel processing enables the complete analysis of a proteomic data set within the acquisition time of the next proteomic sample. For example, an experiment containing 10,000 total MS/MS spectra, of which ~1,000 spectra are high-quality (as defined by user-determined thresholds), requires 1.5 hours to acquire the raw data on the mass spectrometer coupled to PC-1, 1 hour to perform a clustered SEQUEST search on PC-2 and the database search cluster, and 1.5 hours to complete the post-processing tasks, including loading of data into the PeptideDepot relational database. Since the SEQUEST search and post-processing can be quite CPU-intensive, sequential processing of the data on a single computer requires approximately 4 hours per sample. With the distributed system, however, the overall time is reduced to 1.5 hours per sample.
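The throughput gain from this distributed design follows from simple pipeline arithmetic: once the pipeline is full, a new result emerges every time the slowest stage completes. The short sketch below (our simplified model, not part of HTAPP itself) uses the stage times quoted above.

```python
def sequential_time(stage_hours, n_samples):
    """All stages on one computer: each sample runs end to end
    before the next begins."""
    return sum(stage_hours) * n_samples

def pipelined_time(stage_hours, n_samples):
    """Each stage on its own computer: after the first sample fills the
    pipeline, a new sample finishes every max(stage_hours) hours."""
    return sum(stage_hours) + (n_samples - 1) * max(stage_hours)

# Stage times from the example above: 1.5 h acquisition (PC-1),
# 1 h clustered SEQUEST search (PC-2), 1.5 h post-processing (PC-3).
stages = [1.5, 1.0, 1.5]
```

For a single sample both models give 4 hours, but at steady state the pipelined system delivers one completed analysis every 1.5 hours, matching the per-sample figure cited in the text.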
An automated data acquisition tool developed in Microsoft Visual Basic 6.0 (VB6) runs on PC-1 to organize the predefined sample queue for analysis and to control a set of instrument manufacturers’ software packages (Figures 2 and 3D). The extensibility of this tool derives from flexible instrument control using Visual Basic SendKeys commands, allowing the autonomous operation of any instrument control software. This central component of the automated acquisition of LC/MS data controls the unmonitored separation of peptides in, at most, three dimensions of chromatography; a simplified version has been described previously.
Here, we expand this data acquisition tool by integrating it within a data analysis pipeline that includes a relational-database-organized sample queue, MS/MS database searching, validation, and quantitation, and that automatically deposits the proteomic data and associated analysis within a relational database called PeptideDepot. An ODBC connection (Figure 3E) between the sample queue in PeptideDepot (Figure 3C) and the Visual Basic data acquisition software (Figure 3D) allows retrieval of selected sample information from the PeptideDepot database (FileMaker, version 9.0v3, FileMaker Inc., Santa Clara, CA). During the run, real-time instrument status information such as HPLC pressure profiles (Figure 4D), automated evaluation of selected ion chromatogram peak areas of peptides from standard mixes (Figure 4F), and screen captures (Figure 4E) are archived in the MySQL (version 5.1.40; MySQL Inc., Cupertino, CA) component of PeptideDepot using a VB6 program. These data are available remotely through a website (Figure 4D–F) driven by Apache 2.2.4 (The Apache Software Foundation, Los Angeles, CA) and PHP 5.2.1 (http://www.php.net/).
Once a sample tagged as ‘Autoload’ is acquired on PC-1, a VB6 program running on PC-2 is notified, and the raw data files are downloaded from PC-1 via TCP/IP communication over a user-configurable port (Figure 2). MS/MS spectra are extracted from Thermo RAW files using extract_msn.exe (version 4.0; Thermo Scientific, Waltham, MA), or from mzData, mzXML, and mzML format data files using ExtractMSMS.jar (developed in-house in Java 1.6.0; Sun Microsystems, Santa Clara, CA), to generate DTA files. A SEQUEST (version 27; Thermo Scientific) or Mascot (version 2.2.1; Matrix Science) MS/MS database search is then initiated on a networked computer cluster.
After completion of SEQUEST or Mascot searching, proteomic data are pushed via HTAPP’s built-in file transfer service to a third computer, PC-3, over a user-configurable port (Figure 2). On PC-3, a variety of independent post-acquisition calculations are performed on the proteomic data. A VB6 program called “AutoLoad” orchestrates the initiation and transfer of data amongst these separate software tools. A peptide quantitation tool and a SILAC [28, 29] calculation tool are available to quantify each identified peptide. A phosphosite localization tool that calculates the Ascore as described previously is written in Java 1.6.0, and an MS/MS validation tool implementing a new user-trainable logistic regression algorithm, which more than doubles peptide identifications at a user-selected false discovery rate compared to XCorr, is implemented in R 2.4.1 (The R Foundation, http://www.r-project.org/). Once the calculations are finished, proteomic data are immediately uploaded via FileMaker script to a FileMaker/MySQL relational database called PeptideDepot hosted on the remote server PC-4. The proteomic data are then accessible from a graphical FileMaker client (version 9.0v3) running on both Mac and Windows. The database files are synchronized daily without user intervention to an offsite server for incremental backup using either the commercial software tool Retrospect 7.5 (EMC Insignia; Pleasanton, CA) or the Carbonite backup service (http://carbonite.com; Boston, MA).
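At its core, the SILAC calculation mentioned above reduces to a heavy/light peak-area ratio for each identified peptide. The following minimal Python sketch conveys the idea only; the actual HTAPP tool is a VB6 implementation built on the Xcalibur XDK, and this simplified formula omits isotope-envelope integration and normalization details.

```python
import math

def silac_ratio(light_area, heavy_area, log2=True):
    """Relative abundance of a peptide between two SILAC-labeled states,
    computed from the integrated SIC peak areas of its light (normal
    isotope) and heavy (stable-isotope-labeled) forms. Returned as a
    log2 ratio by default, so up- and down-regulation are symmetric."""
    if light_area <= 0 or heavy_area <= 0:
        raise ValueError("peak areas must be positive")
    ratio = heavy_area / light_area
    return math.log2(ratio) if log2 else ratio
```

A log2 ratio of +1 thus indicates a two-fold enrichment in the heavy-labeled state, and -1 a two-fold depletion.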
To create a robust infrastructure capable of high-throughput analysis of proteomic samples, we sought tight integration between the bioinformatic tools used in analyzing proteomic data and the software involved in acquiring mass spectral data. This system can flexibly automate projects ranging from simple LC/MS of in-gel digested proteins to more complex proteomic analyses, such as 2D nano-LC/MS experiments or protein post-translational modification analyses. To maximize the capability of controlling various vendors’ instruments, software utilizing the VB SendKeys API was built to automatically run the underlying native HPLC and mass spectrometry software. Such a design transcends limitations artificially imposed by the APIs of the instrument manufacturers’ data acquisition software. SendKeys controls are used solely for communication between HTAPP and data acquisition software such as Xcalibur and Chemstation, not for post-acquisition analysis.
A sample queue capability within the FileMaker component of the PeptideDepot relational database integrates sample creation and metadata annotation with data acquisition control and automated post-acquisition analysis (Figure 3A–D; Figure 4A, B). This system provides unparalleled flexibility to the user by 1) letting any user tailor the sample queue in FileMaker for automation of any lab-specific post-acquisition analysis task or association of any experimental metadata with the nascent proteomic data, and 2) providing an array of choices for the automated or manual interpretation of proteomic data.
The laboratory information management system (LIMS) components of the PeptideDepot database are created in the user-friendly FileMaker environment, allowing proteomic end-users to tailor the associated fields and layouts to their specific needs (Figure 3A–C). For example, users wanting to store a new piece of information within the system to be automatically associated with the analyzed proteomic data may quickly add a field for this data in FileMaker and position it precisely within user-defined layouts with FileMaker’s WYSIWYG layout tools (as illustrated for the protocol library and sample storage inventory in Figure 3A–B). With this flexibility, the end-user need not wait for a programmer or database engineer to add the desired functionality; it may be implemented directly.
Although sample metadata may vary dramatically from lab to lab, the processing of proteomic data after acquisition most commonly involves some combination of database searching, quantitation, validation of database search results, and storage of proteomic data within a relational database. A variety of software tools are used in each step of this standard analysis pipeline (summarized in Figure 1). For database searching, our automated system currently supports SEQUEST, Mascot, or any other algorithm that exports to pepXML. For quantitation, our automated system currently uses the ICIS algorithm available in the Xcalibur XDK to calculate peak areas for label-free or isotopic labeling methods such as SILAC from any Thermo Scientific Xcalibur (RAW) file, and uses the existing software tool ProteinQuant for label-free quantitation from the standard proteomic data formats mzXML and mzData. For validation, our system currently automates the analysis of reversed database searches, performs peptide validation using a recently developed logistic spectral score that more than doubles peptide yield at a fixed FDR, and performs phosphorylation site localization using the Ascore algorithm. Our relational database PeptideDepot also provides unique tools, namely SpecNote for database-integrated manual validation and annotation of spectra. Our current system provides for unmonitored import of proteomic data and proteomic analyses into our flexible PeptideDepot relational database that utilizes a FileMaker-generated user interface.
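The reversed-database validation step can be illustrated with the widely used decoy-counting estimator: hits against the reversed (decoy) database above a score threshold estimate the number of false hits against the real (target) database at that threshold. The sketch below shows this general technique only; it is not a reproduction of HTAPP's logistic spectral score.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the false discovery rate at a score threshold from a
    target/decoy (reversed database) search: the count of decoy hits
    at or above the threshold approximates the false target hits."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)

def threshold_for_fdr(target_scores, decoy_scores, max_fdr):
    """Lowest observed target score whose estimated FDR meets the
    user-selected limit, or None if no threshold qualifies."""
    for t in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None
```

Filtering peptide identifications at the threshold returned by `threshold_for_fdr` yields a result list at (approximately) the user-selected false discovery rate.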
We have also created a sample tracking database and protocol library (Figure 3A, B) that organize information about sample preparation and storage and associate this information tightly with the nascent proteomic data. This tool enhances the ability to find correlations between proteomic results and the conditions used to prepare and store samples while facilitating post-acquisition analysis by specification of data processing parameters prior to data acquisition. These tools are dynamically integrated within our data acquisition and automation tools to facilitate the automation and documentation of samples awaiting proteomic analysis. By requiring the entry of sample protocols before data acquisition, critical experimental conditions and metadata are captured, organized, and associated with complex proteomic datasets. Also, the protocol library allows assimilation of all protocols used in the lab within a lab-based relational database and provides a mechanism by which protocols can be reviewed and optionally approved by other researchers.
To promote efficient troubleshooting of fluctuations in system performance, the automated data acquisition software stores and analyzes metadata captured during spectral acquisition in a fully automated fashion. Information such as pressure profiles and chromatography gradients is automatically archived in the MySQL component of the PeptideDepot relational database, linked to the raw data and SEQUEST results, and accessible through a web-based PHP interface (Figure 4D). Selected ion chromatogram (SIC) peak areas of either bovine serum albumin (BSA) or α-casein peptides from automated standard runs, or of user-selected standard peptides incorporated into user samples, are monitored automatically. If any selected peptide falls below a user-defined threshold, the operator is optionally alerted via email or SMS, and the acquisition queue can be set to pause until the problem is resolved (Figure 4C). A user may also explore the historical BSA and α-casein SIC data acquired on the instrument in an interactive web browser layout driven by PHP (Figure 4F) or in a VB6 program to track and troubleshoot instrument sensitivity over time. Remote access capabilities allow any operator to monitor the status of the system in real time (Figure 4E) and to control the system through an encrypted network connection.
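The quality-control check described above amounts to comparing each standard peptide's SIC peak area against its user-defined minimum and alerting the operator on failure. The Python sketch below is illustrative (the HTAPP implementation is in VB6, and the peptide identifiers and addresses are hypothetical); actually sending the message and pausing the queue are left to the caller.

```python
from email.message import EmailMessage

def check_standards(peak_areas, thresholds):
    """Compare SIC peak areas of standard peptides against user-defined
    minimum thresholds; return the peptides that fall below them."""
    return [p for p, area in peak_areas.items()
            if area < thresholds.get(p, 0.0)]

def build_alert(failing, sender, recipient):
    """Compose the QC alert email for peptides that failed the check.
    The caller would deliver it with smtplib and optionally pause the
    acquisition queue until the problem is resolved."""
    msg = EmailMessage()
    msg["Subject"] = "QC alert: standard peptide signal below threshold"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Low signal for: " + ", ".join(sorted(failing)))
    return msg
```

Keeping the threshold table separate from the check itself lets a user tighten or relax QC limits per instrument without touching the monitoring code.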
Proteomic results are automatically imported into a networked relational database called PeptideDepot, which is described in detail elsewhere. Tight integration of external protein information sources is a critical aspect of this system. Once newly acquired data are deposited into the PeptideDepot database, many data-mining calculations are triggered automatically by querying locally cached copies of externally available protein information databases such as PDB, IPI, HPRD, Swiss-Prot, STRING, Phosphosite, and Scansite by peptide sequence. All possible protein names associated with a given peptide sequence are collated from the locally cached external protein information databases. This capability overcomes the limitation of alternative protein naming by allowing users to “deep search” the data sets across an index of all possible protein names in every database.
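The name-collation step can be pictured as a lookup of each peptide sequence across every cached database, accumulating all protein names whose sequence contains the peptide. The sketch below uses a naive substring search over illustrative, hypothetical database contents; a production system would precompute an index rather than scan every protein per query.

```python
def collate_protein_names(peptide, cached_dbs):
    """Return every (database, protein name) pair, across all locally
    cached databases, whose protein sequence contains the peptide.
    `cached_dbs` maps a source name (e.g. a cached copy of Swiss-Prot
    or IPI) to a dict of {protein_name: protein_sequence}."""
    names = set()
    for db_name, proteins in cached_dbs.items():
        for protein_name, sequence in proteins.items():
            if peptide in sequence:
                names.add((db_name, protein_name))
    return names
```

Because the same peptide surfaces under every alternative protein name in every source, a later "deep search" for any one of those names still finds the peptide-level evidence.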
After automated analysis and deposition of the data within PeptideDepot, users may explore the data with flexible FileMaker WYSIWYG layouts. PeptideDepot features an extensive collection of predefined data filters that enable users to limit false-discovery rates estimated by reversed database search while focusing on specific peptide qualities such as tyrosine phosphorylation. Comparative analysis views, useful in comparing peptides observed in different cellular states such as diseased versus healthy tissue, are provided to facilitate quantitative comparison among samples using either label-free or stable-isotope incorporation quantitation strategies such as SILAC.
One of the largest impediments to truly high-throughput proteomic methods is the lack of automation after spectral acquisition and the failure to capture critical acquisition-specific metadata. In addition, there is a fundamental need not only to acquire data more quickly but also to increase the quality of the data acquired. An ideal high-throughput proteomic pipeline would provide thorough documentation of a sample’s provenance: the protocol used in sample preparation, sample storage information, environmental conditions such as temperature and humidity during the analysis, and HPLC gradients and pressure profiles.
One of the fundamental goals of the work described here is to provide truly high-throughput multidimensional acquisition of spectra coupled to automated database searching, data archiving, data filtering, visualization, analysis, quantification, and statistical validation of spectra. The software described here uses an integrative approach in which all information concerning a proteomic experiment is archived automatically along with the raw data and database assignments. All components of analysis are integrated within a lab-centric relational database. Capturing a myriad of experimental metadata in addition to spectral acquisition enables the organization and documentation of complex experiments and facilitates troubleshooting. Unlike other currently available proteomic software, our integrated platform utilizes a sample queue in which post-processing parameters and user-provided proteomic sample annotation are passed directly to data acquisition control software and are associated automatically with proteomic data as it is collected and processed within a lab’s relational database. This tight integration greatly increases efficiency by automating labor-intensive post-processing tasks and reduces the chances that critical connections between newly collected proteomic data and experimental metadata will be lost.
This work provides an integrated yet extensible technology platform for the automated processing, storage, and visual analysis of expansive proteomic datasets. Instead of trying to patch together a variety of preexisting software tools that fit together awkwardly, match analytic needs only marginally, and lack critically important functionality, we have created from the ground up an optimized set of integrated tools that provides automated acquisition, processing, and visual analysis of proteomic data. Although many aspects of our software implementation are both unique and essential for a thorough analysis of these types of data, the main novelty of our approach is the direct software integration of the collection, quantitative processing, and visual analysis of proteomic data. No currently available public software tool provides this level of integration. Current proteomic end-users must either develop their own proteomic pipeline software in each lab or perform tedious data manipulation steps manually to extract biological meaning from these immense datasets.
The HTAPP software is designed to provide critical flexibility and functional extensibility for users to implement alternative proteomic workflows as needed. Although the software tool that performs automated data acquisition currently incorporates a Thermo Scientific hybrid linear ion trap – Fourier Transform mass spectrometer (LTQ-FTICR) and Agilent 1100/1200 HPLC pumps, our control software is adaptable to any mass spectrometer and chromatography system through the use of flexible Visual Basic SendKeys controls. In its current implementation, SendKeys works through Xcalibur and Chemstation software to control the automated acquisition of data. Using SendKeys controls, our software sends keyboard commands to any currently running software. By using SendKeys, control of additional mass spectrometer data acquisition software systems can be rapidly implemented to provide critically important extensibility to our automated platform.
HTAPP also supports the analysis of MS/MS data from any additional mass spectrometer, provided the data can be converted to the standard proteomic data formats, i.e., mzData, mzXML, and mzML (Figure 5). Tools to convert manufacturer-specific raw data to standard formats are publicly available (http://tools.proteomecenter.org/wiki/index.php?title=Formats:mzXML). For Thermo Scientific RAW files, the analysis pipeline is fully automated. To analyze data from other types of mass spectrometers, the user first converts the data to mzData, mzXML, or mzML format using publicly available software prior to autonomous analysis through HTAPP. We have implemented a Java program in HTAPP to convert MS/MS spectra from these standard formats and initiate autonomous data analysis. This software was successfully tested with publicly available proteomic datasets acquired on Agilent, LCQ-Deca, LTQ, and QSTAR mass spectrometers.
After data acquisition, peptide sequences are assigned through a SEQUEST or Mascot cluster, peptides are quantitated, uncertainties of peptide identification and phosphorylation site placement are assessed, and proteomic data are deposited into a networked relational database (Figure 3E). If a user’s workflow includes additional analysis tasks beyond the core functionality already available within HTAPP, these additional calculations may be automated through FileMaker scripts, which export the proteomic data in standard formats, trigger external analysis software, and import the analysis results back into the PeptideDepot database into user-defined fields displayed on user-configured layouts.
To support expansion for future software to interact with the automated pipeline, samples awaiting analysis reside in two independent flat-file formatted sample queues. The first sample queue resides on the data acquisition component (PC-1; Figure 2) while the second queue resides downstream of the database search component on the data loader (PC-3; Figure 2). By adding, removing, or altering the text formatted sample queues, a user can integrate their own software within the HTAPP pipeline (See Supplemental Data 1 for formatting details of the sample flat-file).
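Because the sample queues are plain text files, external software can enqueue or inspect work with ordinary file operations. The sketch below illustrates the idea with a tab-delimited layout; the real queue format is specified in Supplemental Data 1, and the field names used here (`sample_id`, `raw_file`, `search_db`, `status`) are purely illustrative.

```python
import csv
import os

# Hypothetical columns standing in for the layout in Supplemental Data 1.
QUEUE_FIELDS = ["sample_id", "raw_file", "search_db", "status"]

def read_queue(path):
    """Load all sample records from a tab-delimited queue file."""
    if not os.path.exists(path):
        return []
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh, fieldnames=QUEUE_FIELDS,
                                   delimiter="\t"))

def append_sample(path, sample):
    """Append one sample record, so external software can enqueue work
    simply by writing a line to the text file."""
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=QUEUE_FIELDS,
                                delimiter="\t")
        writer.writerow(sample)
```

A user's own tool would call `append_sample` (or simply write a conforming line) to inject a sample into the pipeline, and could remove or edit lines to reprioritize pending work.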
To incorporate a new database search engine such as X!Tandem for MS/MS interpretation, the proteomic researcher need only configure the database search program to export its results in the standard pepXML format and trigger the existing pepXML import scripts already available in HTAPP (Figure 5). Once imported into FileMaker, the parsed database search results are integrated into user-defined flexible layouts.
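The essential parsing step behind such an import can be sketched in a few lines: pepXML nests one or more `search_hit` elements (each carrying `peptide` and `protein` attributes) under each `spectrum_query`. The Python sketch below is a minimal illustration of reading those elements, not a reproduction of HTAPP's FileMaker import scripts, and the example document is a deliberately simplified mock.

```python
import xml.etree.ElementTree as ET

def parse_pepxml(xml_text):
    """Extract (spectrum, peptide, protein) records from a pepXML
    document. Namespace prefixes are stripped so the sketch works
    with or without a pepXML namespace declaration."""
    def local(tag):
        return tag.rsplit("}", 1)[-1]
    root = ET.fromstring(xml_text)
    hits = []
    for sq in root.iter():
        if local(sq.tag) != "spectrum_query":
            continue
        for el in sq.iter():
            if local(el.tag) == "search_hit":
                hits.append({"spectrum": sq.get("spectrum"),
                             "peptide": el.get("peptide"),
                             "protein": el.get("protein")})
    return hits
```

Any search engine that can emit this structure, as X!Tandem can, becomes importable without changes to the downstream pipeline.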
To accommodate any additional post-acquisition data analysis task, the sample queue table within FileMaker has a unique counter field that is transferred throughout the data analysis pipeline and stored with the analyzed proteomic data. Using this counter field, proteomic end users may add any post-acquisition preferences to the sample queue and optionally trigger the execution of external software tools using FileMaker scripts that export the proteomic data from PeptideDepot, trigger the external program, and import the results of the external analysis back into FileMaker for display on user-defined custom layouts. For fully automated post-acquisition analysis, the existing FileMaker data import script can optionally trigger these external calculations.
The laborious manual operation of current proteomics software distracts the proteomics investigator from the biological meaning of the data, leading to the all-too-frequent deposition of data into the scientific literature with minimal biological or clinical interpretation. Instead of treating individual steps in the proteomic pipeline as separate events whose integration depends on end-user intervention, we let the user focus on interpretation of the data through automation of routine data manipulations and caching of comparisons between newly collected proteomic data and external bioinformatic resources within a lab-based relational database. The software described here is available for nonprofit use free of charge from http://peptidedepot.com after completion of a license agreement.
We thank Samuel P. Ulin of Brown University’s Department of Molecular Biology, Cell Biology, and Biochemistry for help in the preparation of this manuscript. This work was supported by National Institutes of Health Grant 2P20RR015578 and by a Beckman Young Investigator Award.
The authors have declared no conflict of interest.
For quick evaluation of the utility of this software, we have provided a phosphoproteomic dataset from a mast cell stimulation time course and a simple BSA protein digest. These datasets are available, along with the software described here, at http://peptidedepot.com/.