Robust biomarkers are needed to improve microbial identification and diagnostics. Mass spectrometry-based proteomics methods can be used to discover novel biomarkers owing to their high sensitivity and specificity. However, a coherent pipeline connecting biomarker discovery with established approaches for evaluation and validation has been lacking. We propose such a pipeline that uses in silico methods for refined biomarker discovery and confirmation.
The pipeline has four main stages: sample preparation, mass spectrometry analysis, database searching and biomarker validation. Using the pathogen Clostridium botulinum as a model, we show that the robustness of candidate biomarkers increases with each stage of the pipeline. This is enhanced by the concordance shown between various database search algorithms for peptide identification. Further validation was performed by focusing on the peptides that are unique to C. botulinum strains and absent in phylogenetically related Clostridium species. From a list of 143 peptides, 8 candidate biomarkers were reliably identified as conserved across C. botulinum strains. To avoid discarding other unique peptides, a confidence scale has been implemented in the pipeline, giving priority to unique peptides identified by a union of algorithms.
This study demonstrates that implementing a coherent pipeline that includes intensive bioinformatics validation steps is vital for the discovery of robust biomarkers. It also emphasises the importance of proteomics-based methods in biomarker discovery.
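A minimal sketch of such a confidence scale, assuming each search engine returns a set of identified peptide sequences; the engine names and peptide sequences below are illustrative, not data from the study:

    # Rank candidate peptides by how many search engines identified them,
    # so peptides found by a union of algorithms receive higher priority.
    def confidence_scale(results_by_engine):
        counts = {}
        for engine, peptides in results_by_engine.items():
            for pep in set(peptides):
                counts[pep] = counts.get(pep, 0) + 1
        return counts

    results = {  # hypothetical per-engine identifications
        "EngineA": ["LVTDLTK", "AGFAGDDAPR", "VGDEAQSK"],
        "EngineB": ["LVTDLTK", "AGFAGDDAPR"],
        "EngineC": ["LVTDLTK"],
    }
    for pep, n in sorted(confidence_scale(results).items(), key=lambda x: -x[1]):
        print(f"{pep}: identified by {n} of {len(results)} engines")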
Although the field of mass spectrometry-based proteomics is still in its infancy, recent developments in targeted proteomic techniques have left the field poised to impact the clinical protein biomarker pipeline now more than at any other time in history. For proteomics to meet its potential for finding biomarkers, clinicians, statisticians, epidemiologists and chemists must work together in an interdisciplinary approach. These interdisciplinary efforts will have the greatest chance of success if participants from each discipline have a basic working knowledge of the other disciplines. To that end, the purpose of this review is to provide a nontechnical overview of the emerging and evolving roles that mass spectrometry (especially its targeted modes) can play in the biomarker pipeline, in the hope of making the technology more accessible to the broader community for biomarker discovery efforts. Additionally, the technologies discussed are broadly applicable to proteomic studies and are not restricted to biomarker discovery.
targeted proteomics; multiple reaction monitoring; selected reaction monitoring; biomarker; mass spectrometry
The application of “omics” technologies to biological samples generates hundreds to thousands of biomarker candidates; however, a discouragingly small number make it through the pipeline to clinical use. This is in large part due to the incredible mismatch between the large numbers of biomarker candidates and the paucity of reliable assays and methods for validation studies. We desperately need a pipeline that relieves this bottleneck between biomarker discovery and validation. This paper reviews the requirements for technologies to adequately credential biomarker candidates for costly clinical validation and proposes methods and systems to verify biomarker candidates. Models involving pooling of clinical samples, where appropriate, are discussed. We conclude that current proteomic technologies are on the cusp of significantly affecting translation of molecular diagnostics into the clinic.
Biomarker verification; Multiple reaction monitoring; Targeted proteomics
In today’s proteomics research, various techniques, instruments and bioinformatics tools are necessary to manage the large amount of heterogeneous data with automatic quality control so as to produce reliable and comparable results. Therefore, a data-processing pipeline is mandatory for data validation and comparison in a data-warehousing system. The proteome bioinformatics platform ProteinScape has been proven to cover these needs. The reprocessing of HUPO BPP participants’ MS data was performed within ProteinScape. The reprocessed information was transferred into the global data repository PRIDE.
ProteinScape as a data-warehousing system covers two main aspects: archiving relevant data of the proteomics workflow and information extraction functionality (protein identification, quantification and generation of biological knowledge). As a strategy for automatic data validation, different protein search engines are integrated. Result analysis is performed using a decoy database search strategy, which allows the measurement of the false-positive identification rate. Peptide identifications across different workflows, different MS techniques, and different search engines are merged to obtain a quality-controlled protein list.
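As a minimal sketch of the decoy-based validation strategy mentioned above, the false-positive identification rate above a score threshold is commonly estimated from the ratio of decoy to target hits; the peptide-spectrum match scores below are invented for illustration:

    # Estimate the false-positive identification rate using decoy hits:
    # psms is a list of (score, is_decoy) pairs from a target-decoy search.
    def estimated_false_positive_rate(psms, threshold):
        targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
        decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
        return decoys / targets if targets else 0.0

    psms = [(3.2, False), (2.9, False), (2.8, True), (2.5, False), (1.9, True)]
    print(f"estimated rate at score >= 2.5: {estimated_false_positive_rate(psms, 2.5):.2%}")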
The proteomics identifications database (PRIDE), as a public data repository, is an archiving system where data are finally stored and no longer changed by further processing steps. Data submission to PRIDE is open to proteomics laboratories generating protein and peptide identifications. An export tool has been developed for transferring all relevant HUPO BPP data from ProteinScape into PRIDE using the PRIDE.xml format.
The EU-funded ProDac project will coordinate the development of software tools covering international standards for the representation of proteomics data. The implementation of data submission pipelines and systematic data collection in public standards–compliant repositories will cover all aspects, from the generation of MS data in each laboratory to the conversion of all the annotating information and identifications to a standardized format. Such datasets can be used in the course of publishing in scientific journals.
Proteomics technologies have revolutionized cell biology and biochemistry by providing powerful new tools to characterize complex proteomes, multiprotein complexes and post-translational modifications. Although proteomics technologies could address important problems in clinical and translational cancer research, attempts to use proteomics approaches to discover cancer biomarkers in biofluids and tissues have been largely unsuccessful and have given rise to considerable skepticism. The National Cancer Institute has taken a leading role in facilitating the translation of proteomics from research to clinical application through its Clinical Proteomic Technologies for Cancer initiative. This article highlights the building of a more reliable and efficient protein biomarker development pipeline that incorporates three steps: discovery, verification and qualification. In addition, we discuss the merits of multiple reaction monitoring mass spectrometry, a multiplex targeted proteomics platform, which has emerged as a potentially promising, high-throughput protein biomarker measurement technology for preclinical ‘verification’.
biomarker; multiple reaction monitoring mass spectrometry; proteomics; verification
The identification and quantification of proteins using label-free liquid chromatography/mass spectrometry (LC/MS) play crucial roles in biological and biomedical research. Increasing evidence has shown that biomarkers are often low-abundance proteins. However, LC/MS systems are subject to considerable noise and sample variability, whose statistical characteristics are still elusive, making computational identification of low-abundance proteins extremely challenging. As a result, the inability to identify low-abundance proteins in a proteomic study is the main bottleneck in protein biomarker discovery.
In this paper, we propose a new peak detection method called Information Combining Peak Detection (ICPD) for high-resolution LC/MS. In LC/MS, peptides elute over a certain time period and, as a result, peptide isotope patterns are registered in multiple MS scans. The key feature of the new algorithm is that the observed isotope patterns registered in multiple scans are combined to estimate the likelihood that the peptide is present. An isotope-pattern-matching score based on this likelihood is computed and used for peak detection.
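A toy sketch of this idea of pooling isotope-pattern evidence across scans, assuming a simple Gaussian error model; the theoretical pattern, error model and intensities are assumptions for illustration, not the published algorithm:

    # Score one scan's observed isotope intensities against a theoretical pattern.
    def scan_log_likelihood(observed, theoretical, sigma=0.05):
        total = sum(observed)
        normalized = [x / total for x in observed]
        return sum(-((o - t) ** 2) / (2 * sigma ** 2)
                   for o, t in zip(normalized, theoretical))

    # Combine the evidence registered in multiple MS scans into one score.
    def combined_score(scans, theoretical):
        return sum(scan_log_likelihood(scan, theoretical) for scan in scans)

    theoretical = [0.55, 0.30, 0.11, 0.04]  # hypothetical isotope distribution
    scans = [[540, 310, 100, 35], [560, 290, 115, 40], [530, 305, 108, 38]]
    print(f"combined isotope-pattern score: {combined_score(scans, theoretical):.4f}")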
The performance of the new algorithm is evaluated based on protein standards with 48 known proteins. The evaluation shows better peak detection accuracy for low abundance proteins than other LC/MS peak detection methods.
Motivation: Liquid chromatography tandem mass spectrometry (LC-MS/MS) is the predominant method to comprehensively characterize complex protein mixtures such as samples from prefractionated or complete proteomes. In order to maximize proteome coverage for the studied sample, i.e. identify as many traceable proteins as possible, LC-MS/MS experiments are typically repeated extensively and the results combined. Proteome coverage prediction is the task of estimating the number of peptide discoveries of future LC-MS/MS experiments. Proteome coverage prediction is important to enhance the design of efficient proteomics studies. To date, there does not exist any method to reliably estimate the increase of proteome coverage at an early stage.
Results: We propose an extended infinite Markov model DiriSim to extrapolate the progression of proteome coverage based on a small number of already performed LC-MS/MS experiments. The method explicitly accounts for the uncertainty of peptide identifications. We tested DiriSim on a set of 37 LC-MS/MS experiments of a complete proteome sample and demonstrated that DiriSim correctly predicts the coverage progression already from a small subset of experiments. The predicted progression enabled us to specify maximal coverage for the test sample. We demonstrated that quality requirements on the final proteome map impose an upper bound on the number of useful experiment repetitions and limit the achievable proteome coverage.
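DiriSim itself is an infinite Markov model that accounts for identification uncertainty; as a much simpler stand-in that only illustrates the extrapolation task, one can fit a saturation curve to cumulative peptide counts from early experiments (all numbers below are invented):

    import numpy as np
    from scipy.optimize import curve_fit

    # Cumulative distinct-peptide counts after each of 8 hypothetical experiments.
    runs = np.arange(1, 9)
    cumulative = np.array([4100, 6900, 8800, 10100, 11000, 11600, 12050, 12400])

    # Saturating model: coverage approaches cmax as experiments are repeated.
    def saturation(n, cmax, rate):
        return cmax * (1.0 - np.exp(-rate * n))

    (cmax, rate), _ = curve_fit(saturation, runs, cumulative, p0=(15000, 0.3))
    print(f"estimated maximal coverage: ~{cmax:.0f} peptides")
    print(f"predicted after 20 repetitions: ~{saturation(20, cmax, rate):.0f} peptides")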
Contact: email@example.com; firstname.lastname@example.org
A typical tandem mass spectrometry (MS/MS) proteomics workflow involves a series of steps including format conversion, spectrum identification, peptide validation, protein inference, quantification, interpretation, and public repository deposition. This talk will provide an overview of the proteomic bioinformatics resources developed at the Institute for Systems Biology, covering the Trans-Proteomic Pipeline (TPP) and related tools, the PeptideAtlas public repository, and the emerging SRMAtlas resource. The TPP provides an easily-installable suite of tools to enable users to perform nearly all steps in an MS/MS analysis workflow. PeptideAtlas is a multi-species public compendium of peptide and protein identifications assembled from a large number of uniformly processed MS/MS experiments, along with tools to use the information in a variety of ways. SRMAtlas is a resource that enables the design of selected reaction monitoring (SRM) experiments based on information from several different sources. In addition, the interface of these resources with community standardization and cooperation efforts such as the Proteomics Standards Initiative and the ProteomeXchange Consortium will be presented.
The global analysis of proteins is now feasible due to improvements in techniques such as two-dimensional gel electrophoresis (2-DE), mass spectrometry, yeast two-hybrid systems and the development of bioinformatics applications. The experiments form the basis of proteomics, and present significant challenges in data analysis, storage and querying. We argue that a standard format for proteome data is required to enable the storage, exchange and subsequent re-analysis of large datasets. We describe the criteria that must be met for the development of a standard for proteomics. We have developed a model to represent data from 2-DE experiments, including difference gel electrophoresis along with image analysis and statistical analysis across multiple gels. This part of proteomics analysis is not represented in current proposals for proteomics standards. We are working with the Proteomics Standards Initiative to develop a model encompassing biological sample origin, experimental protocols, a number of separation techniques and mass spectrometry. The standard format will facilitate the development of central repositories of data, enabling results to be verified or re-analysed, and the correlation of results produced by different research groups using a variety of laboratory techniques.
Mass spectrometry (MS)-based label-free proteomics offers an unbiased approach to screen biomarkers related to disease progression and therapy-resistance of breast cancer on the global scale. However, multi-step sample preparation can introduce large variation in generated data, while inappropriate statistical methods will lead to false positive hits. All these issues have hampered the identification of reliable protein markers. A workflow, which integrates reproducible and robust sample preparation and data handling methods, is highly desirable in clinical proteomics investigations. Here we describe a label-free tissue proteomics pipeline, which encompasses laser capture microdissection (LCM) followed by nanoscale liquid chromatography and high resolution MS. This pipeline routinely identifies on average ∼10,000 peptides corresponding to ∼1,800 proteins from sub-microgram amounts of protein extracted from ∼4,000 LCM breast cancer epithelial cells. Highly reproducible abundance data were generated from different technical and biological replicates. As a proof-of-principle, comparative proteome analysis was performed on estrogen receptor α positive or negative (ER+/−) samples, and commonly known differentially expressed proteins related to ER expression in breast cancer were identified. Therefore, we show that our tissue proteomics pipeline is robust and applicable for the identification of breast cancer specific protein markers.
Breast cancer; High resolution mass spectrometry; Label-free proteomics; Data analysis; Estrogen receptor associated proteins
In bioinformatics, it is important to build extensible and low-maintenance systems that are able to deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead to problems when a pipeline is expanded, because the incorporation of new tools is often error-prone and time-consuming. Current approaches to pipeline development, such as workflow management systems, focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamism in pipeline composition is necessary when each execution requires a different combination of steps.
We propose a graph-based approach to implement extensible and low-maintenance pipelines that is suitable for pipeline applications with multiple functionalities that require different combinations of steps in each execution. Here, pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields where multiple software tools are needed to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at
http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms.
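A minimal sketch of the graph representation described above, in which file formats are nodes, conversion tools are directed edges, and a pipeline is a path found on demand; the format and tool names are hypothetical:

    from collections import deque

    # (input format, output format, tool): each tool is a directed edge.
    EDGES = [
        ("ped", "vcf", "ped2vcf"),
        ("vcf", "plink", "vcf2plink"),
        ("vcf", "hapmap", "vcf2hapmap"),
        ("hapmap", "arlequin", "hapmap2arlequin"),
    ]

    # Breadth-first search: the shortest chain of tools from src to dst
    # is the pipeline, composed automatically at execution time.
    def compose_pipeline(src, dst):
        adjacency = {}
        for a, b, tool in EDGES:
            adjacency.setdefault(a, []).append((b, tool))
        queue, seen = deque([(src, [])]), {src}
        while queue:
            fmt, tools = queue.popleft()
            if fmt == dst:
                return tools
            for nxt, tool in adjacency.get(fmt, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, tools + [tool]))
        return None

    print(compose_pipeline("ped", "arlequin"))  # ['ped2vcf', 'vcf2hapmap', 'hapmap2arlequin']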
Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.
Mass spectrometry-based protein identification methods are fundamental to proteomics. Biological experiments are usually performed in replicates and proteomic analyses generate huge datasets which need to be integrated and quantitatively analyzed. The Sequest™ search algorithm is a commonly used algorithm for identifying peptides and proteins from two dimensional liquid chromatography electrospray ionization tandem mass spectrometry (2-D LC ESI MS2) data. A number of proteomic pipelines that facilitate high throughput 'post data acquisition analysis' are described in the literature. However, these pipelines need to be updated to accommodate the rapidly evolving data analysis methods. Here, we describe a proteomic data analysis pipeline that specifically addresses two main issues pertinent to protein identification and differential expression analysis: 1) estimation of the probability of peptide and protein identifications and 2) non-parametric statistics for protein differential expression analysis. Our proteomic analysis workflow analyzes replicate datasets from a single experimental paradigm to generate a list of identified proteins with their probabilities and significant changes in protein expression using parametric and non-parametric statistics.
The input for our workflow is Bioworks™ 3.2 Sequest (or a later version, including cluster) output in XML format. We use a decoy database approach to assign probabilities to peptide identifications. The user has the option to select "quality thresholds" on peptide identifications based on the P value. We also estimate the probability of protein identification. Proteins identified with peptides at a user-specified threshold value from biological experiments are grouped as either control or treatment for further analysis in ProtQuant. ProtQuant utilizes a parametric method (ANOVA) for calculating differences in protein expression based on the quantitative measure ΣXcorr. Alternatively, ProtQuant output can be further processed using non-parametric Monte-Carlo resampling statistics to calculate P values for differential expression. Correction for multiple testing of ANOVA and resampling P values is done using Benjamini and Hochberg's method. The results of these statistical analyses are then combined into a single output file containing a comprehensive protein list with probabilities and differential expression analysis, associated P values, and resampling statistics.
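For reference, a small sketch of the Benjamini-Hochberg step applied to a set of per-protein P values; the raw values below are invented:

    # Benjamini-Hochberg correction: the adjusted P value for the i-th smallest
    # raw P value is min over j >= i of p_(j) * m / j, preserving input order.
    def benjamini_hochberg(pvalues):
        m = len(pvalues)
        order = sorted(range(m), key=lambda i: pvalues[i])
        adjusted = [0.0] * m
        running_min = 1.0
        for rank, idx in reversed(list(enumerate(order, start=1))):
            running_min = min(running_min, pvalues[idx] * m / rank)
            adjusted[idx] = running_min
        return adjusted

    raw = [0.001, 0.008, 0.039, 0.041, 0.27]
    print([f"{q:.4f}" for q in benjamini_hochberg(raw)])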
For biologists carrying out proteomics by mass spectrometry, our workflow facilitates automated, easy to use analyses of Bioworks (3.2 or later versions) data. All the methods used in the workflow are peer-reviewed and as such the results of our workflow are compliant with proteomic data submission guidelines to public proteomic data repositories including PRIDE. Our workflow is a necessary intermediate step that is required to link proteomics data to biological knowledge for generating testable hypotheses.
The advances in proteomics technologies offer an unprecedented opportunity and valuable resources to understand how living organisms execute necessary functions at the systems level. However, little work has been done to date to utilize the highly accurate spatio-temporal dynamic proteome data generated by phosphoproteomics for mathematical modeling of complex cell signaling pathways. This work proposes a novel computational framework for developing mathematical models based on proteomic datasets. Using the MAP kinase pathway as the test system, we developed a mathematical model including the cytosolic and nuclear subsystems, and applied a genetic algorithm to infer unknown model parameters. The robustness of the mathematical model was used as a criterion to select the appropriate rate constants from the estimated candidates. Quantitative information regarding the absolute protein concentrations was used to refine the mathematical model. We have demonstrated that the incorporation of more experimental data could significantly enhance both the simulation accuracy and the robustness of the proposed model. In addition, we used the MAP kinase pathway inhibited by phosphatases with different concentrations to predict the signal output influenced by different cellular conditions. Our predictions are in good agreement with the experimental observations when the MAP kinase pathway was inhibited by the phosphatases PP2A and MKP3. The successful application of the proposed modeling framework to the MAP kinase pathway suggests that our method is very promising for developing accurate mathematical models and yielding insights into the regulatory mechanisms of complex cell signaling pathways.
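A toy sketch of genetic-algorithm parameter inference of this kind, fitting a single rate constant of a deliberately simple first-order kinetic model rather than the full MAP kinase pathway; the data points and GA settings are invented:

    import math
    import random

    random.seed(0)

    # Stand-in model: first-order decay with one unknown rate constant k.
    def model(k, t):
        return 100.0 * math.exp(-k * t)

    DATA = [(0, 100.0), (1, 60.7), (2, 36.8), (3, 22.3)]  # synthetic data, k ~ 0.5

    def fitness(k):  # negative sum of squared residuals (higher is better)
        return -sum((model(k, t) - y) ** 2 for t, y in DATA)

    def genetic_algorithm(pop_size=30, generations=50, mutation=0.1):
        population = [random.uniform(0.0, 2.0) for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)
            parents = population[: pop_size // 2]  # selection of the fittest half
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                child = (a + b) / 2 + random.gauss(0, mutation)  # crossover + mutation
                children.append(max(child, 0.0))
            population = parents + children
        return max(population, key=fitness)

    print(f"estimated rate constant: {genetic_algorithm():.3f}")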
Insulin resistance in skeletal muscle tissues and diabetes-related muscle weakness are serious pathophysiological problems of increasing medical importance. In order to determine global changes in the protein complement of contractile tissues due to diabetes mellitus, mass-spectrometry-based proteomics has been applied to the investigation of diabetic muscle. This review summarizes the findings from recent proteomic surveys of muscle preparations from patients and established animal models of type 2 diabetes. The potential impact of novel biomarkers of diabetes, such as metabolic enzymes and molecular chaperones, is critically examined. Disease-specific signature molecules may be useful for increasing our understanding of the molecular and cellular mechanisms of insulin resistance and possibly identify new therapeutic options that counteract diabetic abnormalities in peripheral organ systems. Importantly, the biomedical establishment of biomarkers promises to accelerate the development of improved diagnostic procedures for characterizing individual stages of diabetic disease progression, including the early detection of prediabetic complications.
Recent advances in the speed and sensitivity of mass spectrometers and in analytical methods, the exponential acceleration of computer processing speeds, and the availability of genomic databases from an array of species and protein information databases have led to a deluge of proteomic data. The development of a lab-based automated proteomic software platform for the automated collection, processing, storage, and visualization of expansive proteomic datasets is critically important. The high-throughput autonomous proteomic pipeline (HTAPP) described here is designed from the ground up to provide critically important flexibility for diverse proteomic workflows and to streamline the total analysis of a complex proteomic sample. This tool is comprised of software that controls the acquisition of mass spectral data along with automation of post-acquisition tasks such as peptide quantification, clustered MS/MS spectral database searching, statistical validation, and data exploration within a user-configurable lab-based relational database. The software design of HTAPP focuses on accommodating diverse workflows and providing missing software functionality to a wide range of proteomic researchers to accelerate the extraction of biological meaning from immense proteomic data sets. Although individual software modules in our integrated technology platform may have some similarities to existing tools, the true novelty of the approach described here is in the synergistic and flexible combination of these tools to provide an integrated and efficient analysis of proteomic samples.
Automation; LIMS; MS/MS database search; Peptide analysis; Relational database
OmicsHub Proteomics integrates all the steps of a mass spectrometry experiment in a single platform, reducing time and data-management complexity. The data automation and data management/analysis provided by OmicsHub Proteomics solves the typical problems lab members encounter on a daily basis and makes life easier when performing tasks such as multiple-search-engine support, pathway integration or custom report generation for external customers. OmicsHub has been designed as a central data management system to collect, analyze and annotate proteomics experimental data, enabling users to automate tasks. OmicsHub Proteomics helps laboratories easily meet proteomics standards such as PRIDE or FuGE and works with controlled-vocabulary experiment annotation. The software enables lab members to take greater advantage of the unique capabilities of the Mascot and Phenyx search engines for protein identification. Multiple searches can be launched at once, allowing peak list data from several spots or chromatograms to be sent concurrently to Mascot/Phenyx. OmicsHub Proteomics works for both LC and gel workflows. The system allows users to store and compare proteomics data generated from different mass spectrometry instruments in a single platform instead of having specific software for each of them. It is a web application that installs on a single server and requires only a web browser for access. All experimental actions are user-stamped and date-stamped, allowing audit tracking of every action performed in OmicsHub. Some of the main features of OmicsHub Proteomics are protein identification, biological annotation, report customization, the PRIDE standard, pathway integration, grouping of protein results to remove redundancy, peak filtering and FDR cutoffs for decoy databases. OmicsHub Proteomics is flexible enough for parsers for new file formats to be easily imported, and it fits your budget with a very competitive price for its perpetual license.
Quantification of peptides can be performed using the iTRAQ™ reagent in conjunction with mass spectrometry. This technology yields information about the relative abundance of single peptides. A method for the calculation of reliable quantification information is required in order to obtain biologically relevant data at the protein expression level.
We present a method comprising sound error estimation and statistical analysis that allows precise abundance determination, including error calculation, at both the peptide and the protein level. This yields the relevant information that is required for quantitative proteomics. Compared with existing approaches, the error estimation of our method, named Quant, is reliable and offers information for precise bioinformatic models. Quant is shown to generate results that are consistent with those produced by ProQuant™, thus validating both systems. Moreover, the results are consistent with those of Mascot™ 2.2. The MATLAB® scripts of Quant are freely available under the GNU Lesser General Public License.
The software Quant demonstrates improvements in protein quantification using iTRAQ™. Precise quantification data can be obtained at the protein level when using error propagation and adequate visualization. Quant integrates both and additionally provides the possibility to obtain more reliable results by calculating appropriate quality measures. Peak area integration has been replaced by the sum of intensities, yielding more reliable quantification results. Additionally, Quant allows the combination of quantitative information obtained by iTRAQ™ with peptide and protein identifications from popular tandem MS identification tools. Hence, Quant is a useful tool for the proteomics community and may help improve the analysis of proteomic experimental data. In addition, we have shown that a lognormal distribution fits the data of mass spectrometry-based relative peptide quantification.
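As a minimal sketch of this strategy, reporter-ion intensities can be summed per channel before forming a protein-level log-ratio, with the spread of per-peptide log-ratios as a crude error estimate; the channel labels and intensities below are invented:

    import math

    # Reporter-ion intensities for two hypothetical iTRAQ channels per peptide.
    peptides = [
        {"114": 1200.0, "117": 2450.0},
        {"114": 980.0, "117": 1890.0},
        {"114": 1510.0, "117": 3120.0},
    ]

    # Sum of intensities per channel (instead of peak area integration).
    sum_114 = sum(p["114"] for p in peptides)
    sum_117 = sum(p["117"] for p in peptides)
    protein_log_ratio = math.log2(sum_117 / sum_114)

    # Spread of per-peptide log-ratios as a simple error estimate.
    log_ratios = [math.log2(p["117"] / p["114"]) for p in peptides]
    mean = sum(log_ratios) / len(log_ratios)
    sd = (sum((r - mean) ** 2 for r in log_ratios) / (len(log_ratios) - 1)) ** 0.5

    print(f"protein log2 ratio (117/114): {protein_log_ratio:.2f} +/- {sd:.2f}")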
Formalin-fixed paraffin-embedded (FFPE) tissue specimens comprise a potentially valuable resource for retrospective biomarker discovery studies, and recent work indicates the feasibility of using shotgun proteomics to characterize FFPE tissue proteins. A critical question in the field is whether proteomes characterized in FFPE specimens are equivalent to proteomes in corresponding fresh or frozen tissue specimens. Here we compared shotgun proteomic analyses of frozen and FFPE specimens prepared from the same colon adenoma tissues. Following deparaffinization, rehydration, and tryptic digestion under mild conditions, FFPE specimens corresponding to 200 μg of protein yielded ∼400 confident protein identifications in a one-dimensional reverse phase liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis. The major difference between frozen and FFPE proteomes was a decrease in the proportions of lysine C-terminal to arginine C-terminal peptides observed, but these differences had little effect on the proteins identified. No covalent peptide modifications attributable to formaldehyde chemistry were detected by analyses of the MS/MS datasets, which suggests that undetected, cross-linked peptides comprise the major class of modifications in FFPE tissues. Fixation of tissue for up to 2 days in neutral buffered formalin did not adversely impact protein identifications. Analysis of archival colon adenoma FFPE specimens indicated equivalent numbers of MS/MS spectral counts and protein group identifications from specimens stored for 1, 3, 5, and 10 years. Combination of peptide isoelectric focusing-based separation with reverse phase LC-MS/MS identified 2554 protein groups in 600 ng of protein from frozen tissue and 2302 protein groups from FFPE tissue with at least two distinct peptide identifications per protein. Analysis of the combined frozen and FFPE data showed a 92% overlap in the protein groups identified. Comparison of gene ontology categories of identified proteins revealed no bias in protein identification based on subcellular localization. Although the status of posttranslational modifications was not examined in this study, archival samples displayed a modest increase in methionine oxidation, from ∼17% after one year of storage to ∼25% after 10 years. These data demonstrate the equivalence of proteome inventories obtained from FFPE and frozen tissue specimens and provide support for retrospective proteomic analysis of FFPE tissues for biomarker discovery.
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
data standard; gel electrophoresis; database; ontology
Two-dimensional gel electrophoresis (2-DE) is widely applied and remains the method of choice in proteomics; however, pervasive 2-DE-related concerns undermine its prospects as a dominant separation technique in proteome research. Consequently, the state-of-the-art shotgun techniques are slowly taking over, utilising the rapid expansion and advancement of mass spectrometry (MS) to provide a new toolbox of gel-free quantitative techniques. When coupled to MS, the shotgun proteomic pipeline can fuel new routes in sensitive and high-throughput profiling of proteins, leading to high accuracy in quantification. Although label-based approaches, either chemical or metabolic, gained popularity in quantitative proteomics because of their multiplexing capacity, they are not without drawbacks. The burgeoning label-free methods are tag-independent and suitable for all kinds of samples. The challenges in quantitative proteomics are more prominent in plants owing to difficulties in protein extraction, the dominance of a few highly abundant proteins in green tissue, and the absence of well-annotated and complete genome sequences. The goal of this perspective is to present the balance between the strengths and weaknesses of the available gel-based and gel-free methods and their application to plants. The latest trends in peptide fractionation amenable to MS analysis are also discussed.
High-throughput screening of protein-protein interactions opens new systems biology perspectives for the comprehensive understanding of cell physiology in normal and pathological conditions. In this context, the yeast two-hybrid system appears to be a promising approach to efficiently reconstruct protein interaction networks at the proteome-wide scale. This protein interaction screening method generates a large amount of raw sequence data, i.e. ISTs (Interaction Sequence Tags), which urgently need appropriate tools for their systematic and standardised analysis.
We developed pISTil, a bioinformatics pipeline combined with a user-friendly web interface: (i) to establish a standardised system for analysing and annotating ISTs generated by two-hybrid technologies with high performance and flexibility, and (ii) to provide high-quality protein-protein interaction datasets for systems-level approaches. This pipeline has been validated on a large dataset comprising more than 11,000 ISTs. As a case study, a detailed analysis of ISTs obtained from yeast two-hybrid screens of Hepatitis C Virus proteins against human cDNA libraries is also provided.
We have developed pISTil, an open source pipeline made of a collection of several applications governed by a Perl script. The pISTil pipeline is intended for laboratories, with IT expertise in system administration, scripting and database management, willing to automatically process large amounts of IST data for accurate reconstruction of protein interaction networks from a systems biology perspective. pISTil is publicly available for download.
As proteomic data sets increase in size and complexity, the necessity for database-centric software systems able to organize, compare, and visualize all the proteomic experiments in a lab grows. We recently developed an integrated platform called high-throughput autonomous proteomic pipeline (HTAPP) for the automated acquisition and processing of quantitative proteomic data, and integration of proteomic results with existing external protein information resources within a lab-based relational database called PeptideDepot. Here, we introduce the peptide validation software component of this system, which combines relational database-integrated electronic manual spectral annotation in Java with a new software tool in the R programming language for the generation of logistic regression spectral models from user-supplied validated data sets and flexible application of these user-generated models in automated proteomic workflows. This logistic regression spectral model uses both variables computed directly from SEQUEST output in addition to deterministic variables based on expert manual validation criteria of spectral quality. In the case of linear quadrupole ion trap (LTQ) or LTQ-FTICR LC/MS data, our logistic spectral model outperformed both XCorr (242% more peptides identified on average) and the X!Tandem E-value (87% more peptides identified on average) at a 1% false discovery rate estimated by decoy database approach.
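A minimal sketch of a logistic-regression spectral model of this kind, trained on two SEQUEST-derived variables (XCorr and deltaCn) with manually validated labels; the feature values and labels are invented, and the published model also includes deterministic expert-criteria variables:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: [XCorr, deltaCn] for one peptide-spectrum match.
    X = np.array([
        [3.8, 0.42], [3.1, 0.35], [2.9, 0.30], [2.6, 0.28],
        [1.4, 0.05], [1.7, 0.08], [1.2, 0.03], [2.0, 0.10],
    ])
    y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = manually validated as correct

    # Fit the model on the validated set, then score a new identification.
    model = LogisticRegression().fit(X, y)
    candidate = np.array([[2.7, 0.25]])
    print(f"P(correct identification) = {model.predict_proba(candidate)[0, 1]:.2f}")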
Decoy database; Logistic regression model; SEQUEST; Software; Spectral validation
Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. However, determining protein-coding genes for most new genomes is almost completely performed by inference using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function.
We experimentally annotated the bacterial pathogen Salmonella Typhimurium 14028, using "shotgun" proteomics to accurately uncover the translational landscape and post-translational features. The data provide protein-level experimental validation for approximately half of the predicted protein-coding genes in Salmonella and suggest revisions to several genes that appear to have incorrectly assigned translational start sites, including a potential novel alternate start codon. Additionally, we uncovered 12 non-annotated genes missed by gene prediction programs, as well as evidence suggesting a role for one of these novel ORFs in Salmonella pathogenesis. We also characterized post-translational features in the Salmonella genome, including chemical modifications and proteolytic cleavages. We find that bacteria have a much larger and more complex repertoire of chemical modifications than previously thought, including several novel modifications. Our in vivo proteolysis data identified more than 130 signal peptide and N-terminal methionine cleavage events critical for protein function.
This work highlights several ways in which application of proteomics data can improve the quality of genome annotations to facilitate novel biological insights and provides a comprehensive proteome map of Salmonella as a resource for systems analysis.
gene annotation; proteomics; post-translational modifications
Applying high-throughput Top-Down MS to an entire proteome requires a yet-to-be-established model for data processing. Since Top-Down is becoming possible on a large scale, we report our latest software pipeline dedicated to capturing the full value of intact protein data in an automated fashion. For intact mass detection, we combine algorithms for processing MS1 data from both isotopically resolved (FT) and charge-state resolved (ion trap) LC-MS data, which are then linked to their fragment ions for database searching using ProSight. Automated determination of human keratin and tubulin isoforms is one result. Optimized for the intricacies of whole proteins, new software modules visualize proteome-scale data based on the LC retention time and intensity of intact masses and enable selective detection of PTMs to automatically screen for acetylation, phosphorylation, and methylation. Software functionality was demonstrated using comparative LC-MS data from yeast strains in addition to human cells undergoing chemical stress. We present these advances as a key step toward realizing Top-Down MS on a proteomic scale.
Bioinformatics; Data reduction; Deconvolution; Intact protein; Tandem MS; Top down
The Trans-Proteomic Pipeline (TPP) is a suite of software tools for the analysis of tandem mass spectrometry datasets. The tools encompass most of the steps in a proteomic data analysis workflow in a single, integrated software system. Specifically, the TPP supports all steps from spectrometer output file conversion to protein-level statistical validation, including quantification by stable isotope ratios. We describe here the full workflow of the TPP and the tools therein, along with an example on a sample dataset, demonstrating that the setup and use of the tools are straightforward and well supported and do not require specialized informatics resources or knowledge.