Motivation: Discovery of novel splicing from RNA sequence data remains a critical and exciting focus of transcriptomics, but reduced alignment power impedes expression quantification of novel splice junctions.
Results: Here, we profile performance characteristics of two-pass alignment, which separates splice junction discovery from quantification. Per sample, across a variety of transcriptome sequencing datasets, two-pass alignment improved quantification of at least 94% of simulated novel splice junctions, and provided as much as 1.7-fold deeper median read depth over those splice junctions. We further demonstrate that two-pass alignment works by increasing alignment of reads to splice junctions by short lengths, and that potential alignment errors are readily identifiable by simple classification. Taken together, two-pass alignment promises to advance quantification and discovery of novel splicing events.
Availability and implementation: Two-pass alignment was implemented here as sequential alignment, genome indexing, and re-alignment steps with STAR. Full parameters are provided in Supplementary Table 2.
Supplementary data are available at Bioinformatics online.
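The three steps named above (alignment, genome re-indexing, re-alignment) can be sketched as a small Python helper that only assembles the STAR command lines and does not run them. The options shown are standard STAR flags, but every path, the overhang value and the thread count are placeholder assumptions; the parameters actually used in the study are those in Supplementary Table 2. The function name `two_pass_commands` is hypothetical.

```python
def two_pass_commands(genome_dir, fasta, reads, out_dir, overhang=100, threads=4):
    """Build (but do not execute) the three STAR commands of a two-pass run.

    All argument values are illustrative assumptions, not the study's settings.
    """
    pass1_prefix = f"{out_dir}/pass1_"
    pass2_prefix = f"{out_dir}/pass2_"
    reindexed = f"{out_dir}/genome_pass2"  # directory for the re-built index
    # STAR writes discovered junctions to <prefix>SJ.out.tab
    junctions = pass1_prefix + "SJ.out.tab"

    # Step 1: first-pass alignment, used here only to discover splice junctions.
    align1 = ["STAR", "--runThreadN", str(threads),
              "--genomeDir", genome_dir,
              "--readFilesIn", *reads,
              "--outFileNamePrefix", pass1_prefix]

    # Step 2: re-index the genome with the junctions found in step 1.
    index2 = ["STAR", "--runMode", "genomeGenerate",
              "--runThreadN", str(threads),
              "--genomeDir", reindexed,
              "--genomeFastaFiles", fasta,
              "--sjdbFileChrStartEnd", junctions,
              "--sjdbOverhang", str(overhang)]

    # Step 3: second-pass alignment against the junction-aware index.
    align2 = ["STAR", "--runThreadN", str(threads),
              "--genomeDir", reindexed,
              "--readFilesIn", *reads,
              "--outFileNamePrefix", pass2_prefix]

    return [align1, index2, align2]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order. Keeping command construction separate from execution makes the dependency between the passes explicit: the second index is the only place where first-pass junctions enter the second alignment.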
We introduce QPROT, a statistical framework and computational tool for differential protein expression analysis using protein intensity data. QPROT is an extension of the QSPEC suite, originally developed for spectral count data, adapted for statistical significance analysis using continuously measured protein-level intensity data. QPROT offers a new intensity normalization procedure and model-based differential expression analysis, both of which account for missing data. Differential expression of each protein is determined from a standardized Z-statistic derived from the posterior distribution of the log fold change parameter, guided by the false discovery rate estimated by a well-known Empirical Bayes method. We evaluated the classification performance of QPROT using the quantification calibration data from the clinical proteomic technology assessment for cancer (CPTAC) study and a recently published E. coli benchmark dataset, with evaluation of FDR accuracy in the latter.
Differential expression; intensity; continuously normalized spectral counts; missing data
Data independent acquisition (DIA) mass spectrometry is an emerging technique that offers more complete detection and quantification of peptides and proteins across multiple samples. DIA allows fragment-level quantification, which can be considered as repeated measurements of the abundance of the corresponding peptides and proteins in the downstream statistical analysis. However, few statistical approaches are available for aggregating these complex fragment-level data into peptide- or protein-level statistical summaries. In this work, we describe a software package, mapDIA, for statistical analysis of differential protein expression using DIA fragment-level intensities. The workflow consists of three major steps: intensity normalization, peptide/fragment selection, and statistical analysis. First, mapDIA offers normalization of fragment-level intensities by total intensity sums as well as a novel alternative normalization by local intensity sums in retention time space. Second, mapDIA removes outlier observations and selects peptides/fragments that preserve the major quantitative patterns across all samples for each protein. Last, using the selected fragments and peptides, mapDIA performs model-based statistical significance analysis of protein-level differential expression between specified groups of samples. Using a comprehensive set of simulation datasets, we show that mapDIA detects differentially expressed proteins with accurate control of the false discovery rates. We also describe the analysis procedure in detail using two recently published DIA datasets generated for the 14-3-3β dynamic interaction network and the prostate cancer glycoproteome.
Data independent acquisition; Data preprocessing; Normalization; Differential expression
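As an illustration of the first workflow step, the following minimal sketch scales each sample so that its summed fragment intensity equals the mean total across samples. This is one common reading of "normalization by total intensity sums" and is an assumption; mapDIA's actual procedure, including its handling of missing values and the local retention-time variant, may differ. The function name and data layout are hypothetical.

```python
def normalize_by_total_intensity(samples):
    """Scale fragment intensities so every sample has the same total.

    samples: dict mapping sample name -> dict of fragment id -> intensity,
             with None marking a missing measurement.
    Assumes each sample has a positive total intensity.
    """
    # Total observed intensity per sample, ignoring missing values.
    totals = {
        name: sum(v for v in frags.values() if v is not None)
        for name, frags in samples.items()
    }
    # Common target: the mean of the per-sample totals.
    target = sum(totals.values()) / len(totals)

    normalized = {}
    for name, frags in samples.items():
        scale = target / totals[name]
        # Missing values stay missing; observed values are rescaled.
        normalized[name] = {
            frag: (None if v is None else v * scale)
            for frag, v in frags.items()
        }
    return normalized
```

For example, two samples with totals 400 and 200 would both be rescaled to a total of 300, so that downstream fragment-level comparisons are not dominated by differences in overall signal.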
Despite significant efforts in the past decade towards complete mapping of the human proteome, 3564 proteins (neXtProt, 09-2014) are still “missing proteins”. Over one-third of these missing proteins are annotated as membrane proteins, owing to their relatively challenging accessibility with standard shotgun proteomics. Using non-small cell lung cancer (NSCLC) as a model study, we aim to mine missing proteins from the disease-associated membrane proteome, which may still be largely under-represented. To increase identification coverage, we employed Hp-RP StageTip pre-fractionation of membrane-enriched samples from 11 NSCLC cell lines. Analysis of membrane samples from 20 pairs of tumor and adjacent normal lung tissue was incorporated to include physiologically expressed membrane proteins. Using multiple search engines (X!Tandem, Comet and Mascot) and stringent evaluation of FDR (MAYU and PeptideShaker), we identified 7702 proteins (66% membrane proteins) and 178 missing proteins (74 membrane proteins) with PSM-, peptide-, and protein-level FDR of 1%. Through multiple reaction monitoring (MRM) using synthetic peptides, we provided additional evidence for 8 missing proteins including 7 with transmembrane helix domains (TMH). This study demonstrates that mining missing proteins focused on the cancer membrane sub-proteome can contribute greatly to mapping the whole human proteome. All data were deposited into ProteomeXchange with the identifier PXD002224.
Missing Proteins; Hp-RP StageTip; Membrane Proteins; MRM; Lung Cancer
Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry-based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of next generation sequencing technologies such as RNA-Seq and dramatic improvements in the depths and throughput of mass spectrometry-based proteomics, the pace of proteogenomics research has greatly accelerated. Here I review the current state of proteogenomics methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positives in proteogenomics, and provide guidelines for analyzing the data and reporting the results of proteogenomics studies.
Trypsin is an endoprotease commonly used for sample preparation in proteomics experiments. Importantly, protein digestion is dependent on multiple factors, including the trypsin origin and digestion conditions. In-depth characterization of trypsin activity could lead to improved reliability of peptide detection and quantitation in both targeted and discovery proteomics studies. To this end, we assembled a data analysis pipeline and suite of visualization tools for quality control and comprehensive characterization of pre-analytical variability in proteomics experiments. Using these tools, we evaluated six available proteomics-grade trypsins and their digestion of a single purified protein, human serum albumin (HSA). HSA was aliquoted and then digested for 2 or 18 hours for each trypsin, and the resulting digests were desalted and analyzed in triplicate by reversed phase liquid chromatography - tandem mass spectrometry. Peptides were identified and quantified using the NIST MSQC pipeline and a comprehensive HSA mass spectral library. We performed a statistical analysis of peptide abundances from different digests, and further visualized the data using principal component analysis and quantitative protein “sequence maps”. While the performance of individual trypsins across repeat digests was reproducible, significant differences were observed depending on the origin of the trypsin (i.e., bovine vs. porcine). Bovine trypsins produced a higher number of peptides containing missed cleavages, whereas porcine trypsins produced more semi-tryptic peptides. In addition, many cleavage sites showed variable digestion kinetics patterns, evident from the comparison of peptide abundances in 2 hour vs. 18 hour digests. Overall, this work illustrates the effects of an often neglected source of variability in proteomics experiments: the origin of the trypsin.
proteomics; mass spectrometry; trypsin; digestion; endoprotease specificity; peptide abundance; variability; missed cleavages; label-free quantification; statistical analysis
Affinity purification coupled with mass spectrometry (AP-MS) is a robust technique used to identify protein-protein interactions. With recent improvements in sample preparation, and dramatic advances in MS instrumentation speed and sensitivity, this technique is becoming more widely used throughout the scientific community. To meet the needs of research groups both large and small, we have developed software solutions for tracking, scoring and analyzing AP-MS data. Here, we provide details for the installation and utilization of ProHits, a Laboratory Information Management System designed specifically for AP-MS interaction proteomics. This protocol explains: (i) how to install the complete ProHits system, including modules for the management of mass spectrometry files and the analysis of interaction data, and (ii) alternative options for the use of pre-existing search results in simpler versions of ProHits, including a virtual machine implementation of our ProHits Lite software. We also describe how to use the main features of the software to analyze AP-MS data.
Affinity purification coupled with mass spectrometry; Data analysis; Virtual machine; Statistical models; Protein-protein interactions
Robust statistical validation of peptide identifications obtained by tandem mass spectrometry and sequence database searching is an important task in shotgun proteomics. PeptideProphet is a commonly used computational tool that computes confidence measures for peptide identifications. In this paper, we investigate several limitations of the PeptideProphet modeling approach, including the use of fixed coefficients in computing the discriminant search score and selection of the top scoring peptide assignment per spectrum only. To address these limitations, we describe an adaptive method in which a new discriminant function is learned from the data in an iterative fashion. We extend the modeling framework to go beyond the top scoring peptide assignment per spectrum. We also investigate the effect of clustering the spectra according to their spectrum quality score followed by cluster-specific mixture modeling. The analysis is carried out using data acquired from a mixture of purified proteins on four different types of mass spectrometers, as well as using a complex human serum dataset. A special emphasis is placed on the analysis of data generated on high mass accuracy instruments.
Tandem Mass Spectrometry; Database searching; Peptide Identification; Statistical Modeling; Adaptive Discriminant Analysis; Mass Accuracy; Decoy Sequences
The sequence tag-based peptide identification methods are a promising alternative to the traditional database search approach. However, a more comprehensive analysis, optimization, and comparison with established methods are necessary before these methods can gain widespread use in the proteomics community. Using the InsPecT open source code base (Tanner et al., Anal Chem. 2005, 77:4626–39), we present an improved sequence tag generation method that directly incorporates multi-charged fragment ion peaks present in many tandem mass spectra of higher charge states. We also investigate the performance of sequence tagging under different settings using control datasets generated on five different types of mass spectrometers, as well as using a complex phosphopeptide-enriched sample. We also demonstrate that additional modeling of InsPecT search scores using a semi-parametric approach incorporating the accuracy of the precursor ion mass measurement provides additional improvement in the ability to discriminate between correct and incorrect peptide identifications. The overall superior performance of the sequence tag-based peptide identification method is demonstrated by comparison with a commonly used SEQUEST/PeptideProphet approach.
Proteomics; Tandem Mass Spectrometry; Peptide Identification; Database Searching; De Novo Sequencing; Algorithms; Statistical Analysis
We present LuciPHOr2, a site localization tool for generic post-translational modifications (PTMs) using tandem mass spectrometry data. As an extension of the original LuciPHOr (version 1) for phosphorylation site localization, the new software provides a site-level localization score for generic PTMs and associated false discovery rate called the false localization rate. We describe several novel features such as operating system independence and reduced computation time through multiple threading. We also discuss optimal parameters for different types of data and illustrate the new tool on a human skeletal muscle dataset for lysine-acetylation.
Availability and implementation: The software is freely available on the SourceForge website http://luciphor2.sourceforge.net.
Supplementary data are available at Bioinformatics online.
Remarkable progress continues on the annotation of the proteins identified in the Human Proteome and on finding credible proteomic evidence for the expression of “missing proteins”. Missing proteins are those with no previous protein-level evidence or insufficient evidence to make a confident identification upon reanalysis in PeptideAtlas and curation in neXtProt. Enhanced with several major new data sets published in 2014, the human proteome presented as neXtProt, version 2014-09-19, has 16 491 unique confident proteins (PE level 1), up from 13 664 at 2012-12 and 15 646 at 2013-09. That leaves 2948 missing proteins from genes classified having protein existence level PE 2, 3, or 4, as well as 616 dubious proteins at PE 5. Here, we document the progress of the HPP and discuss the importance of assessing the quality of evidence, confirming automated findings and considering alternative protein matches for spectra and peptides. We provide guidelines for proteomics investigators to apply in reporting newly identified proteins.
Human Proteome Project; HPP metrics; guidelines; high-confidence protein identifications; neXtProt; PeptideAtlas; Human Protein Atlas; Global Proteome Machine database (GPMDB); missing proteins; novel proteins
Current mass spectrometers provide a number of alternative methodologies for producing tandem mass spectra specifically for phosphopeptide analysis. In particular, generation of MS3 spectra in a data-dependent manner upon detection of the neutral loss of a phosphoric acid in MS2 spectra is a popular technique for circumventing the problem of poor phosphopeptide backbone fragmentation. The newer Multistage Activation method provides another option. Both these strategies require additional cycle time on the instrument and therefore reduce the number of spectra that can be measured in the same amount of time. Additional informatics is often required to make most efficient use of the additional information provided by these spectra as well. This work presents a comparison of several commonly used mass spectrometry methods for the study of phosphopeptide-enriched samples: an MS2-only method, a Multistage Activation method, and an MS2/MS3 data-dependent neutral loss method. Several strategies for dealing effectively with the resulting MS3 data in the latter approach are also presented and compared. The overall goal is to infer whether any one methodology performs significantly better than another for identifying phosphopeptides. On data presented here, the Multistage Activation methodology is demonstrated to perform optimally and does not result in significant loss of unique peptide identifications.
Protein phosphorylation; mass spectrometry; MS3; Multistage Activation; phosphoproteomics; bioinformatics; peptide identification; database search
Resistance to androgen deprivation therapies and increased androgen receptor (AR) activity are major drivers of castration resistant prostate cancer (CRPC). Although prior work focused on targeting AR directly, co-activators of AR signaling—which may represent new therapeutic targets—are relatively underexplored. Here we demonstrate that the mixed-lineage leukemia (MLL) complex, a well-known driver of MLL-fusion-positive leukemia, acts as a co-activator of AR signaling. AR directly interacts with the MLL complex via the menin MLL subunit. Menin expression is higher in castration resistant prostate cancer compared to hormone naïve prostate cancer and benign prostate and high menin expression correlates with poor overall survival. Treatment with a small molecule inhibitor of the menin-MLL interaction blocks AR signaling and inhibits the growth of castration resistant tumors in vivo in mice. Taken together, this work identifies the MLL complex as a critical co-activator of AR and a potential therapeutic target in advanced prostate cancer.
Androgen deprivation; prostate cancer; MLL complex; menin; androgen receptor
Due to recent improvements in mass spectrometry (MS), there is an increased interest in data independent acquisition (DIA) strategies in which all peptides are systematically fragmented using wide mass isolation windows (“multiplex fragmentation”). DIA-Umpire (http://diaumpire.sourceforge.net/), a comprehensive computational workflow and open-source software for DIA data, detects precursor and fragment chromatographic features and assembles them into pseudo MS/MS spectra. These spectra can be identified using conventional database searching and protein inference tools, allowing sensitive untargeted analysis of DIA data without the need for a spectral library. Quantification is obtained using both precursor and fragment ion intensities. Furthermore, DIA-Umpire enables targeted extraction of quantitative information based on peptides initially identified in only a subset of the samples, resulting in more consistent quantification across multiple samples. We demonstrate the performance of the method using control samples of varying complexity, and publicly available glycoproteomics and affinity purification - mass spectrometry data.
Tandem mass spectrometry (MS/MS) followed by database search is the method of choice for protein identification in proteomic studies. Database searching methods employ spectral matching algorithms and statistical models to identify and quantify proteins in a sample. In general, these methods do not utilize any information other than spectral data for protein identification. However, considering the wealth of external data available for many biological systems, analysis methods can incorporate such information to improve the sensitivity of protein identification. In this study, we present a method to utilize Global Proteome Machine Database identification frequencies and RNA-seq transcript abundances to adjust the confidence scores of protein identifications. The method described is particularly useful for samples with low-to-moderate proteome coverage (i.e., <2000–3000 proteins), where we observe up to an 8% improvement in the number of proteins identified at a 1% false discovery rate.
Tandem mass spectrometry; RNA-seq; GPMDB; integrative analysis; probability adjustment; FDR; confidence threshold
Our ability to model the dynamics of signal transduction networks will depend on accurate methods to quantify levels of protein phosphorylation on a global scale. Here we describe a motif-targeting quantitation method for phosphorylation stoichiometry typing. Proteome-wide phosphorylation stoichiometry can be obtained by a simple phosphoproteomic workflow integrating dephosphorylation and isotope tagging with enzymatic kinase reaction. Proof-of-concept experiments using CK2-, MAPK- and EGFR-targeting assays in lung cancer cells demonstrate the advantage of kinase-targeted complexity reduction, resulting in deeper phosphoproteome quantification. We measure the phosphorylation stoichiometry of >1,000 phosphorylation sites including 366 low-abundance tyrosine phosphorylation sites, with high reproducibility and using small sample sizes. Comparing drug-resistant and sensitive lung cancer cells, we reveal that post-translational phosphorylation changes are significantly more dramatic than those at the protein and messenger RNA levels, and suggest potential drug targets within the kinase–substrate network associated with acquired drug resistance.
Measuring phosphorylation stoichiometry on a proteomic scale remains a challenge. Tsai et al. develop a technique to measure the basal level of phosphorylation stoichiometry in a single human phosphoproteome and identify molecular changes associated with gefitinib resistance in lung cancer cells.
Significance Analysis of INTeractome (SAINT) is a statistical method for probabilistically scoring protein-protein interaction data from affinity purification-mass spectrometry (AP-MS) experiments. The utility of the software has been demonstrated in many protein-protein interaction mapping studies, yet the extensive testing also revealed some practical drawbacks. In this paper, we present a new implementation, SAINTexpress, with a simpler statistical model and a quicker scoring algorithm, leading to significant improvements in computational speed and sensitivity of scoring. SAINTexpress also incorporates external interaction data to compute a supplemental topology-based score to improve the likelihood of identifying co-purifying protein complexes in a probabilistically objective manner. Overall, these changes are expected to improve the performance and user experience of SAINT across various types of high quality datasets.
Affinity-purification; protein-protein interaction; probabilistic scoring
Global ‘multi-omics’ profiling of cancer cells harbours the potential for characterizing the signaling networks associated with specific oncogenes. Here we profile the transcriptome, proteome and phosphoproteome in a panel of non-small cell lung cancer (NSCLC) cell lines in order to reconstruct targetable networks associated with KRAS dependency. We develop a two-step bioinformatics strategy addressing the challenge of integrating these disparate data sets. We first define an ‘abundance-score’ combining transcript, protein and phospho-protein abundances to nominate differentially abundant proteins and then use the Prize Collecting Steiner Tree algorithm to identify functional sub-networks. We identify three modules centered on KRAS and MET, LCK and PAK1, and β-catenin. We validate activation of these proteins in KRAS-dependent (KRAS-Dep) cells and perform functional studies defining LCK as a critical gene for cell proliferation in KRAS-Dep but not KRAS-independent NSCLCs. These results suggest that LCK is a potential druggable target protein in KRAS-Dep lung cancers.
Glycogen synthase kinase 3 beta (GSK3β) is highly inactivated in epithelial cancers and is known to inhibit tumor migration and invasion. The zinc-finger-containing transcriptional repressor, Slug, represses E-cadherin transcription and enhances epithelial-mesenchymal transition (EMT). In this study, we find that the GSK3β-pSer9 level is associated with the expression of Slug in non-small cell lung cancer (NSCLC). GSK3β-mediated phosphorylation of Slug facilitates Slug protein turnover. Proteomic analysis reveals that the C-terminus of Hsc70-interacting protein (CHIP) interacts with wild-type Slug (wtSlug). Knockdown of CHIP stabilizes the wtSlug protein and reduces Slug ubiquitylation and degradation. In contrast, nonphosphorylatable Slug-4SA is not degraded by CHIP. The accumulation of nondegradable Slug may further lead to the repression of E-cadherin expression and promote cancer cell migration, invasion, and metastasis. Our findings provide evidence of a de novo GSK3β-CHIP-Slug pathway that may be involved in the progression of metastasis in lung cancer.
GSK3β; Slug; CHIP; post-translational modification
The interactions of protein kinases and phosphatases with their regulatory subunits and substrates underpin cellular regulation. We identified a kinase and phosphatase interaction (KPI) network of 1844 interactions in budding yeast by mass spectrometric analysis of protein complexes. The KPI network contained many dense local regions of interactions that suggested new functions. Notably, the cell cycle phosphatase Cdc14 associated with multiple kinases that revealed roles for Cdc14 in mitogen-activated protein kinase signaling, the DNA damage response, and metabolism, whereas interactions of the target of rapamycin complex 1 (TORC1) uncovered new effector kinases in nitrogen and carbon metabolism. An extensive backbone of kinase-kinase interactions cross-connects the proteome and may serve to coordinate diverse cellular responses.
The yeast Saccharomyces cerevisiae undergoes a dramatic growth transition from its unicellular form to a filamentous state, marked by the formation of pseudohyphal filaments of elongated and connected cells. Yeast pseudohyphal growth is regulated by signaling pathways responsive to reductions in the availability of nitrogen and glucose, but the molecular link between pseudohyphal filamentation and glucose signaling is not fully understood. Here, we identify the glucose-responsive Sks1p kinase as a signaling protein required for pseudohyphal growth induced by nitrogen limitation and coupled nitrogen/glucose limitation. To identify the Sks1p signaling network, we applied mass spectrometry-based quantitative phosphoproteomics, profiling over 900 phosphosites for phosphorylation changes dependent upon Sks1p kinase activity. From this analysis, we report a set of novel phosphorylation sites and highlight Sks1p-dependent phosphorylation in Bud6p, Itr1p, Lrg1p, Npr3p, and Pda1p. In particular, we analyzed the Y309 and S313 phosphosites in the pyruvate dehydrogenase subunit Pda1p; these residues are required for pseudohyphal growth, and Y309A mutants exhibit phenotypes indicative of impaired aerobic respiration and decreased mitochondrial number. Epistasis studies place SKS1 downstream of the G-protein coupled receptor GPR1 and the G-protein RAS2 but upstream of or at the level of cAMP-dependent PKA. The pseudohyphal growth and glucose signaling transcription factors Flo8p, Mss11p, and Rgt1p are required to achieve wild-type SKS1 transcript levels. SKS1 is conserved, and deletion of the SKS1 ortholog SHA3 in the pathogenic fungus Candida albicans results in abnormal colony morphology. Collectively, these results identify Sks1p as an important regulator of filamentation and glucose signaling, with additional relevance towards understanding stress-responsive signaling in C. albicans.
Eukaryotic cells respond to nutritional and environmental stress through complex regulatory programs controlling cell metabolism, growth, and morphology. In the budding yeast Saccharomyces cerevisiae, conditions of limited nitrogen and/or glucose can initiate a dramatic growth transition wherein the yeast cells form extended multicellular filaments resembling the true hyphal tubes of filamentous fungi. The formation of these pseudohyphal filaments is governed by core regulatory pathways that have been studied for decades; however, the mechanism by which these signaling systems are integrated is less well understood. We find that the protein kinase Sks1p contributes to the integration of signals for nitrogen and/or glucose limitation, resulting in pseudohyphal growth. We implemented a mass spectrometry-based approach to profile phosphorylation events across the proteome dependent upon Sks1p kinase activity and identified phosphorylation sites important for mitochondrial function and pseudohyphal growth. Our studies place Sks1p in the regulatory context of a well-known pseudohyphal growth signaling pathway. We further find that SKS1 is conserved and required for stress-responsive colony morphology in the principal opportunistic human fungal pathogen Candida albicans. Thus, Sks1p is part of the mechanism integrating glucose-responsive cell signaling and pseudohyphal growth, and its function is required for colony morphology linked with virulence in C. albicans.
Motivation: Multiply correlated datasets have become increasingly common in genome-wide location analysis of regulatory proteins and epigenetic modifications. Their correlation can be directly incorporated into a statistical model to capture underlying biological interactions, but such modeling quickly becomes computationally intractable.
Results: We present sparsely correlated hidden Markov models (scHMM), a novel method for performing simultaneous hidden Markov model (HMM) inference for multiple genomic datasets. In scHMM, a single HMM is assumed for each series, but the transition probability in each series depends on not only its own hidden states but also the hidden states of other related series. For each series, scHMM uses penalized regression to select a subset of the other data series and estimate their effects on the odds of each transition in the given series. Following this, hidden states are inferred using a standard forward–backward algorithm, with the transition probabilities adjusted by the model at each position, which helps retain the order of computation close to fitting independent HMMs (iHMM). Hence, scHMM is a collection of inter-dependent non-homogeneous HMMs, capable of giving a close approximation to a fully multivariate HMM fit. A simulation study shows that scHMM achieves comparable sensitivity to the multivariate HMM fit at a much lower computational cost. The method was demonstrated in the joint analysis of 39 histone modifications, CTCF and RNA polymerase II in human CD4+ T cells. scHMM reported fewer high-confidence regions than iHMM in this dataset, but scHMM could recover previously characterized histone modifications in relevant genomic regions better than iHMM. In addition, the resulting combinatorial patterns from scHMM could be better mapped to the 51 states reported by the multivariate HMM method of Ernst and Kellis.
Availability: The scHMM package can be freely downloaded from http://sourceforge.net/p/schmm/ and is recommended for use in a linux environment.
Supplementary data are available at Bioinformatics online.
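The inference step described in the Results, a forward–backward pass in which the transition probabilities are allowed to change at each position, can be sketched for a single series as follows. This is a generic non-homogeneous HMM routine in plain Python, not the scHMM implementation: the position-specific transition matrices are taken as given inputs, whereas scHMM derives them from the hidden states of other series via penalized regression. The function name and argument layout are assumptions.

```python
def forward_backward(init, trans, emis):
    """Posterior state probabilities for a non-homogeneous HMM.

    init  : list of K initial state probabilities
    trans : list of T-1 matrices; trans[t][i][j] = P(state j at t+1 | state i at t)
    emis  : list of T lists; emis[t][k] = P(observation at t | state k)
    Returns a T x K list of posterior probabilities P(state k at t | all data).
    """
    T, K = len(emis), len(init)

    # Forward pass with per-position scaling to avoid numerical underflow.
    alpha = [[0.0] * K for _ in range(T)]
    scale = [0.0] * T
    alpha[0] = [init[k] * emis[0][k] for k in range(K)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        for j in range(K):
            alpha[t][j] = emis[t][j] * sum(
                alpha[t - 1][i] * trans[t - 1][i][j] for i in range(K)
            )
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(
                trans[t][i][j] * emis[t + 1][j] * beta[t + 1][j] for j in range(K)
            ) / scale[t + 1]

    # Combine and normalize to obtain the posterior at each position.
    posterior = []
    for t in range(T):
        row = [alpha[t][k] * beta[t][k] for k in range(K)]
        total = sum(row)
        posterior.append([r / total for r in row])
    return posterior
```

Because only the lookup `trans[t]` differs from the homogeneous case, the cost per series stays at O(T·K²), which is consistent with the abstract's remark that the order of computation remains close to that of fitting independent HMMs.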
Affinity purification coupled with mass spectrometry (AP-MS) is now a widely used approach for the identification of protein-protein interactions. However, for any given protein of interest, determining which of the identified polypeptides represent bona fide interactors versus those that are background contaminants (e.g. proteins that interact with the solid-phase support, affinity reagent or epitope tag) is a challenging task. While the standard approach is to identify nonspecific interactions using one or more negative controls, most small-scale AP-MS studies do not capture a complete, accurate background protein set. Fortunately, negative controls are largely bait-independent. Hence, aggregating negative controls from multiple AP-MS studies can increase coverage and improve the characterization of background associated with a given experimental protocol. Here we present the Contaminant Repository for Affinity Purification (the CRAPome) and describe the use of this resource to score protein-protein interactions. The repository (currently available for Homo sapiens and Saccharomyces cerevisiae) and computational tools are freely available online at www.crapome.org.
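The idea of aggregating negative controls can be illustrated with a minimal sketch that reports, for each protein, the fraction of control runs in which it was detected. This is only one simple summary under assumed inputs (one set of protein identifiers per control run); it is not the scoring implemented in the CRAPome, and the function name and protein identifiers are hypothetical.

```python
from collections import Counter


def background_frequency(control_runs):
    """Fraction of negative-control runs in which each protein was detected.

    control_runs: list of iterables of protein identifiers, one per control run.
    Returns a dict mapping protein identifier -> frequency in [0, 1].
    """
    n_runs = len(control_runs)
    # Count each protein at most once per run.
    counts = Counter(protein for run in control_runs for protein in set(run))
    return {protein: count / n_runs for protein, count in counts.items()}


# Illustrative control runs pooled from several hypothetical studies.
controls = [
    {"HSP70", "TUBB"},
    {"HSP70"},
    {"HSP70", "ACTB"},
    {"TUBB"},
]
frequencies = background_frequency(controls)
```

A protein seen in most pooled controls is a likely background contaminant for that protocol, whereas a prey that is rarely or never seen in controls is a better candidate for a bona fide interactor.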
Significance Analysis of INTeractome (SAINT) is a software package for scoring protein-protein interactions based on label-free quantitative proteomics data (e.g. spectral count or intensity) in affinity purification – mass spectrometry (AP-MS) experiments. SAINT allows bench scientists to select bona fide interactions and remove non-specific interactions in an unbiased manner. However, there is no ‘one-size-fits-all’ statistical model for every dataset, since the experimental design varies across studies. Key variables include the number of baits, the number of biological replicates per bait, and control purifications. Here we give a detailed account of input data format, control data, selection of high confidence interactions, and visualization of filtered data. We explain additional options for customizing the statistical model for optimal filtering in specific datasets. We also discuss a graphical user interface of SAINT in connection to the LIMS system ProHits which can be installed as a virtual machine on Mac OS X or PC Windows computers.
Protein-protein interactions; Label-free quantitative proteomics; Affinity purification – mass spectrometry (AP-MS); Statistical model
An increasing number of studies involve integrative analysis of gene and protein expression data taking advantage of new technologies such as next-generation transcriptome sequencing (RNA-Seq) and highly sensitive mass spectrometry (MS) instrumentation. Thus, it becomes interesting to revisit the correlative analysis of gene and protein expression data using more recently generated datasets. Furthermore, within the proteomics community there is a substantial interest in comparing the performance of different label-free quantitative proteomic strategies. Gene expression data can be used as an indirect benchmark for such protein-level comparisons. In this work we use publicly available mouse data to perform a joint analysis of genomic and proteomic data obtained on the same organism. First, we perform a comparative analysis of different label-free protein quantification methods (intensity-based and spectral count-based, and using various associated data normalization steps) using several software tools on the proteomic side. Similarly, we perform correlative analysis of gene expression data derived using microarray and RNA-Seq methods on the genomic side. We also investigate the correlation between gene and protein expression data, and various factors affecting the accuracy of quantitation at both levels. We observe that spectral count-based protein abundance metrics, which are easy to extract from any published data, are comparable to intensity-based measures with respect to correlation with gene expression data. The results of this work should be useful for designing robust computational pipelines for extraction and joint analysis of gene and protein expression data in the context of integrative studies.
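A correlative analysis of this kind can be sketched with a rank-based correlation between matched gene-level and protein-level abundance estimates. The abstract does not state which correlation measure was used, so the choice of Spearman correlation here is an assumption, as are the function names; the sketch uses only plain Python and handles ties by average ranks.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, `x` would hold a transcript abundance per gene and `y` the matching protein abundance (spectral count-based or intensity-based) for the same genes, so that different quantification strategies can be compared by how well they track gene expression.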