Related Articles

1.  Mining gene functional networks to improve mass-spectrometry-based protein identification 
Bioinformatics  2009;25(22):2955-2961.
Motivation: High-throughput protein identification experiments based on tandem mass spectrometry (MS/MS) often suffer from low sensitivity and low-confidence protein identifications. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other evidence to suggest that a protein is present and confidence in individual protein identification can be updated accordingly.
Results: We develop a method that analyzes MS/MS experiments in the larger context of the biological processes active in a cell. Our method, MSNet, improves protein identification in shotgun proteomics experiments by considering information on functional associations from a gene functional network. MSNet substantially increases the number of proteins identified in the sample at a given error rate. We identify 8–29% more proteins than the original MS experiment when applied to yeast grown in different experimental conditions analyzed on different MS/MS instruments, and 37% more proteins in a human sample. We validate up to 94% of our identifications in yeast by presence in ground-truth reference sets.
Availability and Implementation: Software and datasets are available at http://aug.csres.utexas.edu/msnet
Contact: miranker@cs.utexas.edu, marcotte@icmb.utexas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp461
PMCID: PMC2773251  PMID: 19633097
2.  Revisiting the negative example sampling problem for predicting protein–protein interactions 
Bioinformatics  2011;27(21):3024-3028.
Motivation: A number of computational methods have been proposed that predict protein–protein interactions (PPIs) based on protein sequence features. Since the number of potential non-interacting protein pairs (negative PPIs) is very high both in absolute terms and in comparison to that of interacting protein pairs (positive PPIs), computational prediction methods rely upon subsets of negative PPIs for training and validation. Hence, the need arises for subset sampling for negative PPIs.
Results: We clarify that there are two fundamentally different types of subset sampling for negative PPIs. One is subset sampling for cross-validated testing, where one desires unbiased subsets so that predictive performance estimated with them can be safely assumed to generalize to the population level. The other is subset sampling for training, where one desires the subsets that best train predictive algorithms, even if these subsets are biased. We show that confusion between these two fundamentally different types of subset sampling led one study recently published in Bioinformatics to the erroneous conclusion that predictive algorithms based on protein sequence features are hardly better than random in predicting PPIs. Rather, both protein sequence features and the ‘hubbiness’ of interacting proteins contribute to effective prediction of PPIs. We provide guidance for appropriate use of random versus balanced sampling.
Availability: The datasets used for this study are available at http://www.marcottelab.org/PPINegativeDataSampling.
Contact: yungki@mail.utexas.edu; marcotte@icmb.utexas.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr514
PMCID: PMC3198576  PMID: 21908540
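The contrast between the two sampling schemes can be illustrated with a minimal Python sketch. It assumes that "balanced" sampling means each protein appears in the negative set exactly as often as it does in the positive set; the abstract does not spell out this definition, and the function names here are hypothetical rather than taken from the authors' datasets or code.

```python
import itertools
import random
from collections import Counter


def occurrence_counts(pairs):
    """Count how many pairs each protein takes part in."""
    return Counter(protein for pair in pairs for protein in pair)


def random_negatives(proteins, positives, n, rng):
    """Uniformly sample n non-interacting pairs (unbiased, suited to testing)."""
    known = {frozenset(p) for p in positives}
    candidates = [frozenset(p)
                  for p in itertools.combinations(sorted(proteins), 2)
                  if frozenset(p) not in known]
    return rng.sample(candidates, n)


def balanced_negatives(positives, rng, max_tries=1000):
    """Sample negatives in which every protein occurs as often as in the positives.

    Assumed definition of 'balanced': the partners of the positive pairs are
    shuffled until no pair is a self-pair, a known positive, or a duplicate.
    """
    known = {frozenset(p) for p in positives}
    left = [a for a, _ in positives]
    right = [b for _, b in positives]
    for _ in range(max_tries):
        rng.shuffle(right)
        pairs = [frozenset((a, b)) for a, b in zip(left, right)]
        valid = all(len(p) == 2 and p not in known for p in pairs)
        if valid and len(set(pairs)) == len(pairs):
            return pairs
    raise RuntimeError("no balanced negative set found")
```

Because the shuffle only reassigns partners, a hub protein in the positive set stays a hub in the balanced negative set, whereas uniform sampling leaves hubs over-represented among positives only.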
3.  mspire: mass spectrometry proteomics in Ruby 
Bioinformatics  2008;24(23):2796-2797.
Summary: Mass spectrometry-based proteomics stands to gain from additional analysis of its data, but its large, complex datasets make demands on speed and memory usage that require special consideration from scripting languages. The software library ‘mspire’, developed in the Ruby programming language, offers quick and memory-efficient readers for standard XML proteomics formats, converters for intermediate file types in typical proteomics spectral-identification workflows (including the Bioworks .srf format), and modules for the calculation of peptide false identification rates.
Availability: Freely available at http://mspire.rubyforge.org. Additional data models, usage information, and methods available at http://bioinformatics.icmb.utexas.edu/mspire
Contact: marcotte@icmb.utexas.edu
doi:10.1093/bioinformatics/btn513
PMCID: PMC2639276  PMID: 18930952
4.  A dynamic data structure for flexible molecular maintenance and informatics 
Bioinformatics  2010;27(1):55-62.
Motivation: We present the ‘Dynamic Packing Grid’ (DPG), a neighborhood data structure for maintaining and manipulating flexible molecules and assemblies, for efficient computation of binding affinities in drug design or in molecular dynamics calculations.
Results: DPG can efficiently maintain the molecular surface using only linear space and supports quasi-constant time insertion, deletion and movement (i.e. updates) of atoms or groups of atoms. DPG also supports constant time neighborhood queries from arbitrary points. Our results for maintenance of molecular surface and polarization energy computations using DPG exhibit marked improvement in time and space requirements.
Availability: http://www.cs.utexas.edu/~bajaj/cvc/software/DPG.shtml
Contact: bajaj@cs.utexas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq627
PMCID: PMC3008647  PMID: 21115440
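The general idea of a grid-based neighborhood structure can be sketched in a few lines of Python. This is a plain uniform hash grid written for illustration, not the authors' DPG, which adds further machinery to obtain its stated space and time guarantees; the class and method names are hypothetical.

```python
import math
from collections import defaultdict


class GridIndex:
    """Hash atoms into cubic cells so that updates touch a single cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # cell coordinates -> atom ids
        self.positions = {}             # atom id -> (x, y, z)

    def _cell(self, pos):
        return tuple(math.floor(c / self.cell_size) for c in pos)

    def insert(self, atom_id, pos):
        self.positions[atom_id] = pos
        self.cells[self._cell(pos)].add(atom_id)

    def delete(self, atom_id):
        key = self._cell(self.positions.pop(atom_id))
        self.cells[key].discard(atom_id)
        if not self.cells[key]:
            del self.cells[key]         # keep space proportional to atom count

    def move(self, atom_id, new_pos):
        self.delete(atom_id)
        self.insert(atom_id, new_pos)

    def neighbors(self, point, radius):
        """Return sorted ids of atoms within radius of point (radius <= cell_size)."""
        if radius > self.cell_size:
            raise ValueError("radius must not exceed cell_size")
        cx, cy, cz = self._cell(point)
        found = []
        # Only the 27 cells around the query point can contain a hit.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    key = (cx + dx, cy + dy, cz + dz)
                    for atom_id in self.cells.get(key, ()):
                        if math.dist(point, self.positions[atom_id]) <= radius:
                            found.append(atom_id)
        return sorted(found)
```

In this sketch a query inspects a fixed number of cells, so its cost depends on local atom density rather than on the total number of atoms.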
5.  Network-based inference from complex proteomic mixtures using SNIPE 
Bioinformatics  2012;28(23):3115-3122.
Motivation: Proteomics presents the opportunity to provide novel insights about the global biochemical state of a tissue. However, a significant problem with current methods is that shotgun proteomics has limited success at detecting many low abundance proteins, such as transcription factors from complex mixtures of cells and tissues. The ability to assay for these proteins in the context of the entire proteome would be useful in many areas of experimental biology.
Results: We used network-based inference in an approach named SNIPE (Software for Network Inference of Proteomics Experiments) that selectively highlights proteins that are more likely to be active but are otherwise undetectable in a shotgun proteomic sample. SNIPE integrates spectral counts from paired case–control samples over a network neighbourhood and assesses the statistical likelihood of enrichment by a permutation test. As an initial application, SNIPE was able to select several proteins required for early murine tooth development. Multiple lines of additional experimental evidence confirm that SNIPE can uncover previously unreported transcription factors in this system. We conclude that SNIPE can enhance the utility of shotgun proteomics data to facilitate the study of poorly detected proteins in complex mixtures.
Availability and Implementation: An implementation for the R statistical computing environment named snipeR has been made freely available at http://genetics.bwh.harvard.edu/snipe/.
Contact: ssunyaev@rics.bwh.harvard.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts594
PMCID: PMC3509492  PMID: 23060611
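A permutation test over a network neighbourhood, as described in the abstract above, might look roughly like the following Python sketch. The scoring rule (sum of case minus control spectral counts over a protein and its neighbours) and the null model (shuffling those differences across proteins) are assumptions made for illustration; they are not taken from the snipeR implementation, and all names are hypothetical.

```python
import random


def neighbourhood_score(protein, network, diff):
    """Sum case-minus-control count differences over a protein and its neighbours."""
    members = {protein} | network.get(protein, set())
    return sum(diff.get(m, 0) for m in members)


def permutation_p_value(protein, network, case, control, n_perm=1000, seed=0):
    """Estimate how often a shuffled labelling scores at least as high as observed."""
    rng = random.Random(seed)
    proteins = sorted(set(case) | set(control))
    diffs = [case.get(p, 0) - control.get(p, 0) for p in proteins]
    observed = neighbourhood_score(protein, network, dict(zip(proteins, diffs)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(diffs)  # assumed null: differences are exchangeable across proteins
        permuted = dict(zip(proteins, diffs))
        if neighbourhood_score(protein, network, permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

A protein with zero spectral counts of its own can still receive a small p-value here if its network neighbours are collectively enriched in the case sample, which is the situation the abstract describes for poorly detected transcription factors.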
6.  Inductive matrix completion for predicting gene–disease associations 
Bioinformatics  2014;30(12):i60-i68.
Motivation: Most existing methods for predicting causal disease genes rely on a specific type of evidence, and are therefore limited in terms of applicability. More often than not, the type of evidence available for diseases varies: for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms in patients. Similarly, the type of evidence available for genes varies: for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene–disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene–disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive; it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods that are transductive.
Results: Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better: it has close to a one-in-four chance of recovering a true association in the top 100 predictions, compared to the recently proposed Catapult method (second best), which has a <15% chance. We demonstrate that the inductive method is particularly effective for a query disease with no previously known gene associations, and for predicting novel genes, i.e. genes not previously linked to diseases. Thus the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature.
Availability: Source code and datasets can be downloaded from http://bigdata.ices.utexas.edu/project/gene-disease.
Contact: naga86@cs.utexas.edu
doi:10.1093/bioinformatics/btu269
PMCID: PMC4058925  PMID: 24932006
7.  Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins 
Bioinformatics  2009;25(15):1905-1914.
Motivation: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. Using partial proteomic data, it is likely that integrative transcriptomic and proteomic analysis may introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into metabolic mechanisms underlying complex biological systems.
Results: In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. A strong plateau effect occurred in this model in the regions of high mRNA values and sparse data. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. Our non-linear model consists of a set of serial regression tree models with implicit strength in variable selection. The model provides measures of relative variable importance using mean square error as the criterion. The results showed that coefficients of determination for our non-linear models ranged from 0.393 to 0.582 in both datasets, providing better results than the linear regression used in the past.
We evaluated the validity of this non-linear model using biological information of operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins.
Contact: weiwen.zhang@asu.edu; george.runger@asu.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp325
PMCID: PMC2712339  PMID: 19447782
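The validation step described above, comparing the coefficient of variation (CV) within biological groups against random groups, can be sketched as follows. This is an illustrative reconstruction under simple assumptions (sample standard deviation, random groups matched in size, no significance test); the function names are hypothetical and the authors' exact procedure may differ.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def mean_group_cv(abundance, groups):
    """Average CV of abundance over groups such as operons or pathways."""
    return statistics.mean(
        coefficient_of_variation([abundance[gene] for gene in group])
        for group in groups
    )


def random_group_cv(abundance, groups, n_rounds, rng):
    """Average CV over randomly drawn groups with the same sizes as the real ones."""
    genes = sorted(abundance)
    rounds = []
    for _ in range(n_rounds):
        shuffled = [rng.sample(genes, len(group)) for group in groups]
        rounds.append(mean_group_cv(abundance, shuffled))
    return statistics.mean(rounds)
```

If predicted abundances are biologically meaningful, `mean_group_cv` for real operons should come out lower than `random_group_cv` for size-matched random groups.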
8.  The Galaxy Framework as a Unifying Bioinformatics Solution for ‘omics’ Core Facilities 
Integration of different ‘omics data (genomic, transcriptomic, proteomic) yields novel insights into biological systems. However, integrating these datasets is a serious challenge because it requires using multiple, disparate software tools in sequence. We describe the extension of Galaxy for mass spectrometry-based proteomics software, enabling advanced multi-omic applications in proteogenomics and metaproteomics. We will demonstrate the benefits of Galaxy for these analyses, as well as its value for software developers seeking to publish new software. We will also share insights on the benefits of the Galaxy framework as a bioinformatics solution for proteomic/metabolomic core facilities. Multiple datasets were used for proteogenomics research (a 3D-fractionated salivary dataset and an oral pre-malignant lesion (OPML) dataset) and metaproteomics research (the OPML dataset and a Severe Early Childhood Caries (SECC) dataset). Software required for analytical steps such as peaklist generation, database generation (RNA-Seq derived and others), database search (ProteinPilot and X! Tandem) and quantitative proteomics was deployed, tested and optimized for use in workflows. The software is shared in the Galaxy toolshed (http://toolshed.g2.bx.psu.edu/). Use of the analytical workflows resulted in reliable identification of novel proteoforms (proteogenomics) or microorganisms (metaproteomics). Proteogenomics analysis identified novel proteoforms in the salivary dataset (51) and OPML dataset (38). Metaproteomics analysis led to microbial identification in the OPML and SECC datasets using MEGAN software. As examples, workflows for proteogenomics analysis (http://z.umn.edu/pg140) and metaproteomic analysis (http://z.umn.edu/mp65) are available at the usegalaxyp.org website. Tutorials for workflow usage within the Galaxy-P framework are also available (http://z.umn.edu/ppingp).
We demonstrate the use of Galaxy for integrated analysis of multi-omic data in an accessible, transparent and reproducible manner. Our results and experiences using this framework demonstrate the potential for Galaxy to be a unifying bioinformatics solution for ‘omics core facilities.
PMCID: PMC4162280
9.  Verification of a Parkinson's Disease Protein Signature by Multiple Reaction Monitoring 
PMCID: PMC4162281
10.  customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search 
Bioinformatics  2013;29(24):3235-3237.
Summary: Database search is the most widely used approach for peptide and protein identification in mass spectrometry-based proteomics studies. Our previous study showed that sample-specific protein databases derived from RNA-Seq data can better approximate the real protein pools in the samples and thus improve protein identification. More importantly, single nucleotide variations, short insertions and deletions, and novel junctions identified from RNA-Seq data make the protein database more complete and sample-specific. Here, we report an R package, customProDB, that enables the easy generation of customized databases from RNA-Seq data for proteomics search. This work bridges genomics and proteomics studies and facilitates cross-omics data integration.
Availability and implementation: customProDB and related documents are freely available at http://bioconductor.org/packages/2.13/bioc/html/customProDB.html.
Contact: bing.zhang@vanderbilt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt543
PMCID: PMC3842753  PMID: 24058055
11.  IsoQuant: A Software Tool for SILAC-Based Mass Spectrometry Quantitation 
Analytical chemistry  2012;84(10):4535-4543.
Accurate protein identification and quantitation are critical when interpreting the biological relevance of large-scale shotgun proteomics datasets. Although significant technical advances in peptide and protein identification have been made, accurate quantitation of high throughput datasets remains a key challenge in mass spectrometry data analysis and is a labor intensive process for many proteomics laboratories. Here, we report a new SILAC-based proteomics quantitation software tool, named IsoQuant, which is used to process high mass accuracy mass spectrometry data. IsoQuant offers a convenient quantitation framework to calculate peptide/protein relative abundance ratios. At the same time, it also includes a visualization platform that permits users to validate the quality of SILAC peptide and protein ratios. The program is written in the C# programming language under the Microsoft .NET framework version 4.0 and has been tested to be compatible with both 32-bit and 64-bit Windows 7. It is freely available to non-commercial users at http://www.proteomeumb.org/MZw.html.
doi:10.1021/ac300510t
PMCID: PMC3583527  PMID: 22519468
Quantitative proteomics; mass spectrometry; SILAC; bioinformatics
12.  The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs 
BMC Bioinformatics  2002;3:2.
Background
Comparative analysis of RNA sequences is the basis for the detailed and accurate predictions of RNA structure and the determination of phylogenetic relationships for organisms that span the entire phylogenetic tree. Underlying these accomplishments are very large, well-organized, and processed collections of RNA sequences. This data, starting with the sequences organized into a database management system and aligned to reveal their higher-order structure, and patterns of conservation and variation for organisms that span the phylogenetic tree, has been collected and analyzed. This type of information can be fundamental for and have an influence on the study of phylogenetic relationships, RNA structure, and the melding of these two fields.
Results
We have prepared a large web site that disseminates our comparative sequence and structure models and data. The four major types of comparative information and systems available for the three ribosomal RNAs (5S, 16S, and 23S rRNA), transfer RNA (tRNA), and two of the catalytic intron RNAs (group I and group II) are: (1) Current Comparative Structure Models; (2) Nucleotide Frequency and Conservation Information; (3) Sequence and Structure Data; and (4) Data Access Systems.
Conclusions
This online RNA sequence and structure information, the result of extensive analysis, interpretation, data collection, and computer program and web development, is accessible at our Comparative RNA Web (CRW) Site http://www.rna.icmb.utexas.edu. In the future, more data and information will be added to these existing categories, new categories will be developed, and additional RNAs will be studied and presented at the CRW Site.
doi:10.1186/1471-2105-3-2
PMCID: PMC65690  PMID: 11869452
13.  ASTRAL: genome-scale coalescent-based species tree estimation 
Bioinformatics  2014;30(17):i541-i548.
Motivation: Species trees provide insight into basic biology, including the mechanisms of evolution and how it modifies biomolecular function and structure, biodiversity and co-evolution between genes and species. Yet, gene trees often differ from species trees, creating challenges to species tree estimation. One of the most frequent causes for conflicting topologies between gene trees and species trees is incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent. While many methods have been developed to estimate species trees from multiple genes, some of which have statistical guarantees under the multi-species coalescent model, existing methods are too computationally intensive for use with genome-scale analyses or have been shown to have poor accuracy under some realistic conditions.
Results: We present ASTRAL, a fast method for estimating species trees from multiple genes. ASTRAL is statistically consistent, can run on datasets with thousands of genes and has outstanding accuracy—improving on MP-EST and the population tree from BUCKy, two statistically consistent leading coalescent-based methods. ASTRAL is often more accurate than concatenation using maximum likelihood, except when ILS levels are low or there are too few gene trees.
Availability and implementation: ASTRAL is available in open source form at https://github.com/smirarab/ASTRAL/. Datasets studied in this article are available at http://www.cs.utexas.edu/users/phylo/datasets/astral.
Contact: warnow@illinois.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu462
PMCID: PMC4147915  PMID: 25161245
14.  DACTAL: divide-and-conquer trees (almost) without alignments 
Bioinformatics  2012;28(12):i274-i282.
Motivation: While phylogenetic analyses of datasets containing 1000–5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale.
Methods: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000–27 643 taxa.
Results: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28 000 sequences.
Availability: DACTAL source code and results of dataset analyses are available at www.cs.utexas.edu/users/phylo/software/dactal.
Contact: tandy@cs.utexas.edu
doi:10.1093/bioinformatics/bts218
PMCID: PMC3371850  PMID: 22689772
15.  A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells 
eLife  2014;3:e01630.
Technological advances have enabled the analysis of cellular protein and RNA levels with unprecedented depth and sensitivity, allowing for an unbiased re-evaluation of gene regulation during fundamental biological processes. Here, we have chronicled the dynamics of protein and mRNA expression levels across a minimally perturbed cell cycle in human myeloid leukemia cells using centrifugal elutriation combined with mass spectrometry-based proteomics and RNA-Seq, avoiding artificial synchronization procedures. We identify myeloid-specific gene expression and variations in protein abundance, isoform expression and phosphorylation at different cell cycle stages. We dissect the relationship between protein and mRNA levels both for bulk gene expression and for more than 6000 genes individually across the cell cycle, revealing complex, gene-specific patterns. This data set, one of the deepest surveys to date of gene expression in human cells, is presented in an online, searchable database, the Encyclopedia of Proteome Dynamics (http://www.peptracker.com/epd/).
DOI: http://dx.doi.org/10.7554/eLife.01630.001
eLife digest
Cells are complex environments: at any one time, thousands of different genes act as molecular templates to produce messenger RNA (mRNA) molecules, which themselves are templates used to produce proteins. However, not all genes are active at all times inside all cells: as cells grow and divide as part of the cell division cycle, genes are switched on and off on a regular basis. Similarly, the patterns of mRNA and protein production are different in, say, immune and skin cells.
In recent years, the tools available for detecting mRNA molecules and proteins have become more powerful, allowing researchers to move beyond just measuring the total amounts of mRNA and protein in the cell to now measuring individual amounts of specific mRNA and protein molecules encoded by specific genes. However, it has been a challenge to make these measurements at different stages of the cell cycle. Most of the methods used to do this have involved artificially ‘arresting’ the cell cycle, which can lead to side effects that are difficult to account for.
Ly et al. have now overcome these problems using a combination of three methods to measure the levels of mRNA and protein molecules associated with over 6000 genes in human cancer cells derived from myeloid leukemia. Exploiting the fact that cells change size during the cell cycle, Ly et al. used a centrifugation technique to separate cells based on their size and, therefore, the stage of the cell cycle they were at, thus avoiding the need to arrest the cell cycle. An approach called RNA-Seq was then employed to measure the levels of the different mRNA molecules in the cells, and a device called a mass spectrometer was used to identify and measure the levels of many different proteins.
In addition to being able to follow the level of mRNA and protein production for a large number of genes throughout the cell division cycle, while also obtaining detailed information about how many of the proteins are modified, Ly et al. discovered that—contrary to expectations—low numbers of mRNA molecules were sometimes associated with high numbers of the corresponding protein, and vice versa. This work provides a better understanding of the complex relationship between the levels of an mRNA and its corresponding protein product, and also demonstrates how it may be possible to detect subtle but important differences between cell types and disease states, including different types of cancer.
DOI: http://dx.doi.org/10.7554/eLife.01630.002
doi:10.7554/eLife.01630
PMCID: PMC3936288  PMID: 24596151
proteomics; mass spectrometry; RNA-Seq; cell cycle; transcriptomics; human
16.  OnlineCall: fast online parameter estimation and base calling for illumina's next-generation sequencing 
Bioinformatics  2012;28(13):1677-1683.
Motivation: Next-generation DNA sequencing platforms are becoming increasingly cost-effective and capable of providing an enormous number of reads in a relatively short time. However, their accuracy and read lengths still lag behind those of the conventional Sanger sequencing method. Performance of next-generation sequencing platforms is fundamentally limited by various imperfections in the sequencing-by-synthesis and signal acquisition processes. This drives the search for accurate, scalable and computationally tractable base calling algorithms capable of accounting for such imperfections.
Results: Relying on a statistical model of the sequencing-by-synthesis process and signal acquisition procedure, we develop a computationally efficient base calling method for Illumina's sequencing technology (specifically, the Genome Analyzer II platform). Parameters of the model are estimated via a fast unsupervised online learning scheme, which uses the generalized expectation–maximization algorithm and requires only 3 s of running time per tile (on an Intel i7 machine @3.07GHz, single core), a speed-up of three orders of magnitude over existing parametric model-based methods. To minimize the latency between the end of the sequencing run and the generation of the base calling reports, we develop a fast online scalable decoding algorithm, which requires only 9 s/tile and achieves significantly lower error rates than Illumina's base calling software. Moreover, it is demonstrated that the proposed online parameter estimation scheme efficiently computes tile-dependent parameters, which can thereafter be provided to the base calling algorithm, resulting in significant improvements over previously developed base calling methods for the considered platform in terms of performance, time/complexity and latency.
Availability: A C code implementation of our algorithm can be downloaded from http://www.cerc.utexas.edu/OnlineCall/
Contact: hvikalo@ece.utexas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts256
PMCID: PMC3381969  PMID: 22569177
17.  A hybrid approach to protein differential expression in mass spectrometry-based proteomics 
Bioinformatics  2012;28(12):1586-1591.
Motivation: Quantitative mass spectrometry-based proteomics involves statistical inference on protein abundance, based on the intensities of each protein's associated spectral peaks. However, typical MS-based proteomics datasets have substantial proportions of missing observations, due at least in part to censoring of low intensities. This complicates intensity-based differential expression analysis.
Results: We outline a statistical method for protein differential expression, based on a simple Binomial likelihood. By modeling peak intensities as binary, in terms of ‘presence/absence,’ we enable the selection of proteins not typically amenable to quantitative analysis; e.g. ‘one-state’ proteins that are present in one condition but absent in another. In addition, we present an analysis protocol that combines quantitative and presence/absence analysis of a given dataset in a principled way, resulting in a single list of selected proteins with a single associated false discovery rate.
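To illustrate the presence/absence idea, here is a minimal sketch of a Binomial likelihood-ratio test that compares how often a protein's peaks are observed in two conditions. It is not the authors' published model: the function names (`binom_loglik`, `presence_lrt`), the pooled-proportion null hypothesis and the chi-square approximation with one degree of freedom are all assumptions made for illustration, and the approximation is rough for small sample counts.

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-likelihood of k presences in n samples (constant term omitted)."""
    ll = 0.0
    # Treat 0 * log(0) as 0 so that p = 0 or p = 1 is handled safely.
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def presence_lrt(k1, n1, k2, n2):
    """Likelihood-ratio test of equal presence probability in two conditions.

    k1/n1 and k2/n2 are the numbers of samples in which the protein was
    observed, out of the total samples per condition (hypothetical inputs).
    Returns (statistic, approximate p-value).
    """
    p1, p2 = k1 / n1, k2 / n2
    p0 = (k1 + k2) / (n1 + n2)  # pooled estimate under the null hypothesis
    ll_alt = binom_loglik(k1, n1, p1) + binom_loglik(k2, n2, p2)
    ll_null = binom_loglik(k1, n1, p0) + binom_loglik(k2, n2, p0)
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# A 'one-state' protein: seen in all 5 samples of one condition, none of the other.
stat, p_value = presence_lrt(5, 5, 0, 5)
print(f"LRT statistic = {stat:.3f}, approximate p-value = {p_value:.2e}")
```

A protein of this kind has no intensities to compare in the second condition, so an intensity-based test cannot be applied to it, whereas the binary test still yields a statistic.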
Availability: All R code is available at http://www.stat.tamu.edu/~adabney/share/xuan_code.zip.
Contact: adabney@stat.tamu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts193
PMCID: PMC3371829  PMID: 22522136
18.  PhosphoChain: a novel algorithm to predict kinase and phosphatase networks from high-throughput expression data 
Bioinformatics  2013;29(19):2435-2444.
Motivation: Protein phosphorylation is critical for regulating cellular activities by controlling protein activities, localization and turnover, and by transmitting information within cells through signaling networks. However, predictions of protein phosphorylation and signaling networks remain a significant challenge, lagging behind predictions of transcriptional regulatory networks into which they often feed.
Results: We developed PhosphoChain to predict kinases, phosphatases and chains of phosphorylation events in signaling networks by combining mRNA expression levels of regulators and targets with a motif detection algorithm and optional prior information. PhosphoChain correctly reconstructed ∼78% of the yeast mitogen-activated protein kinase pathway from publicly available data. When tested on yeast phosphoproteomic data from large-scale mass spectrometry experiments, PhosphoChain correctly identified ∼27% more phosphorylation sites than existing motif detection tools (NetPhosYeast and GPS2.0), and predictions of kinase–phosphatase interactions overlapped with ∼59% of known interactions present in yeast databases. PhosphoChain provides a valuable framework for predicting condition-specific phosphorylation events from high-throughput data.
Availability: PhosphoChain is implemented in Java and available at http://virgo.csie.ncku.edu.tw/PhosphoChain/ or http://aitchisonlab.com/PhosphoChain
Contact: john.aitchison@systemsbiology.org or jchiang@mail.ncku.edu.tw
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt387
PMCID: PMC3777105  PMID: 23832245
19.  PEPPI: a peptidomic database of human protein isoforms for proteomics experiments 
BMC Bioinformatics  2010;11(Suppl 6):S7.
Abstract
Background
Protein isoform generation, which may derive from alternative splicing, genetic polymorphism, and posttranslational modification, is an essential source of molecular diversity in eukaryotic cells. Previous studies have shown that protein isoforms play critical roles in disease diagnosis, risk assessment, sub-typing, prognosis, and treatment outcome predictions. Understanding the types, presence, and abundance of different protein isoforms in different cellular and physiological conditions is a major task in functional proteomics, and may pave the way to molecular biomarker discovery for human diseases. In tandem mass spectrometry (MS/MS) based proteomics analysis, peptide peaks with exact matches to protein sequence records in the proteomics database may be identified with mass spectrometry (MS) search software. However, due to limited annotation and poor coverage of protein isoforms in proteomics databases, high throughput protein isoform identifications, particularly those arising from alternative splicing and genetic polymorphism, have not been possible.
Results
Therefore, we present the PEPtidomics Protein Isoform Database (PEPPI, http://bio.informatics.iupui.edu/peppi), a comprehensive database of computationally-synthesized human peptides that can identify protein isoforms derived from either alternatively spliced mRNA transcripts or SNP variations. We collected genome, pre-mRNA alternative splicing and SNP information from Ensembl. We synthesized in silico isoform transcripts that cover all exons and theoretically possible junctions of exons and introns, as well as all their variations derived from known SNPs. With three case studies, we further demonstrated that the database can help researchers discover and characterize new protein isoform biomarkers from experimental proteomics data.
Conclusions
We developed a new tool for the proteomics community to characterize protein isoforms from MS-based proteomics experiments. By cataloguing each peptide configuration in the PEPPI database, users can study genetic variations and alternative splicing events at the proteome level. They can also batch-download peptide sequences in FASTA format to search for MS/MS spectra derived from human samples. The database can help generate novel hypotheses on molecular risk factors and molecular mechanisms of complex diseases, leading to identification of potentially highly specific protein isoform biomarkers.
doi:10.1186/1471-2105-11-S6-S7
PMCID: PMC3026381  PMID: 20946618
20.  pep2pro: the high-throughput proteomics data processing, analysis, and visualization tool 
The pep2pro database was built to support effective high-throughput proteome data analysis. Its database schema allows the coherent integration of search results from different database-dependent search algorithms and filtering of the data including control for unambiguous assignment of peptides to proteins. The capacity of the pep2pro database has been exploited in data analysis of various Arabidopsis proteome datasets. The diversity of the datasets and the associated scientific questions required thorough querying of the data. This was supported by the relational format structure of the data that links all information on the sample, spectrum, search database, and algorithm to peptide and protein identifications and their post-translational modifications. After publication, datasets are made available on the pep2pro website at www.pep2pro.ethz.ch. Further, the pep2pro data analysis pipeline also handles data export to the PRIDE database (http://www.ebi.ac.uk/pride) and data retrieval by the MASCP Gator (http://gator.masc-proteomics.org/). pep2pro will continue to be used for the analysis of additional datasets and as a data warehouse. The capacity of the pep2pro database for proteome data analysis has now also been made publicly available through the release of pep2pro4all, which consists of a database schema and a script that will populate the database with mass spectrometry data provided in mzIdentML format.
doi:10.3389/fpls.2012.00123
PMCID: PMC3371593  PMID: 22701464
database; mzIdentML; pep2pro; plant proteomics; standard format
21.  WallProtDB, a database resource for plant cell wall proteomics 
Plant Methods  2015;11:2.
Background
During the last fifteen years, cell wall proteomics has become a major research field with the publication of more than 50 articles describing plant cell wall proteomes. The WallProtDB database has been designed as a tool to facilitate the inventory, the interpretation of cell wall proteomics data and the comparisons between cell wall proteomes.
Results
WallProtDB (http://www.polebio.lrsv.ups-tlse.fr/WallProtDB/) presently contains 2170 proteins and ESTs identified experimentally in 36 cell wall proteomics studies performed on 11 different plant species. Two criteria have to be met for entry into WallProtDB. The first is related to the identification of proteins: only proteins identified in plants with available genomic or EST data are considered, to ensure unambiguous identification. The second criterion is related to the difficulty of obtaining clean cell wall fractions. Indeed, since cell walls constitute an open compartment that is difficult to isolate, numerous proteins predicted to be intracellular and/or to function inside the cell have been identified in cell wall extracts. Therefore, except for proteins predicted to be plasma membrane proteins, only proteins having a predicted signal peptide and no known intracellular retention signal are included in the database. In addition, WallProtDB contains information about the strategies used to obtain cell wall protein extracts and to identify proteins by mass spectrometry and bioinformatics. Mass spectrometry data are included when available. All the proteins of WallProtDB are linked to ProtAnnDB, another database, which contains structural and functional bioinformatics annotations of proteins as well as links to other databases (Aramemnon, CAZy, Planet, Phytozome). A list of references in the cell wall proteomics field is also provided.
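The two entry criteria described above can be read as a simple filter. The sketch below is one possible reading, not the database's actual implementation: the `ProteinRecord` fields and the function `meets_inclusion_criteria` are hypothetical names, and it assumes that predicted plasma membrane proteins are exempt from the signal peptide requirement but still need genomic or EST support.

```python
from dataclasses import dataclass

@dataclass
class ProteinRecord:
    """Hypothetical summary of the predictions available for one protein."""
    has_genomic_or_est_data: bool
    predicted_plasma_membrane: bool
    has_signal_peptide: bool
    has_intracellular_retention_signal: bool

def meets_inclusion_criteria(protein: ProteinRecord) -> bool:
    """Return True if the protein passes both criteria as read from the abstract."""
    # Criterion 1: the source plant must have genomic or EST data available.
    if not protein.has_genomic_or_est_data:
        return False
    # Criterion 2 (assumed reading): plasma membrane proteins are accepted as is.
    if protein.predicted_plasma_membrane:
        return True
    # Otherwise require a signal peptide and no intracellular retention signal.
    return (protein.has_signal_peptide
            and not protein.has_intracellular_retention_signal)

secreted = ProteinRecord(True, False, True, False)
retained = ProteinRecord(True, False, True, True)
print(meets_inclusion_criteria(secreted), meets_inclusion_criteria(retained))
```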
Conclusions
WallProtDB aims to become a cell wall proteome reference database. It can be updated at any time on request and provides support for sharing cell wall proteomics data and literature references with researchers interested in plant cell wall biology.
Electronic supplementary material
The online version of this article (doi:10.1186/s13007-015-0045-y) contains supplementary material, which is available to authorized users.
doi:10.1186/s13007-015-0045-y
PMCID: PMC4302427  PMID: 25614757
Cell wall; Mass spectrometry; Plant; ProtAnnDB; Proteomics
22.  Prophossi: automating expert validation of phosphopeptide–spectrum matches from tandem mass spectrometry 
Bioinformatics  2010;26(17):2153-2159.
Motivation: Complex patterns of protein phosphorylation mediate many cellular processes. Tandem mass spectrometry (MS/MS) is a powerful tool for identifying these post-translational modifications. In high-throughput experiments, mass spectrometry database search engines, such as MASCOT, provide a ranked list of peptide identifications based on hundreds of thousands of MS/MS spectra obtained in a mass spectrometry experiment. These search results are not in themselves sufficient for confident assignment of phosphorylation sites, as identification of characteristic mass differences requires time-consuming manual assessment of the spectra by an experienced analyst. The time required for manual assessment has previously rendered high-throughput confident assignment of phosphorylation sites challenging.
Results: We have developed a knowledge base of criteria, which replicate expert assessment, allowing more than half of cases to be automatically validated and site assignments verified with a high degree of confidence. This was assessed by comparing automated spectral interpretation with careful manual examination of the assignments for 501 peptides above the 1% false discovery rate (FDR) threshold corresponding to 259 putative phosphorylation sites in 74 proteins of the Trypanosoma brucei proteome. Despite this stringent approach, we are able to validate 80 of the 91 phosphorylation sites (88%) positively identified by manual examination of the spectra used for the MASCOT searches with an FDR < 15%.
Conclusions: High-throughput computational analysis can provide a viable second stage validation of primary mass spectrometry database search results. Such validation gives rapid access to a systems level overview of protein phosphorylation in the experiment under investigation.
Availability: A GPL licensed software implementation in Perl for analysis and spectrum annotation is available in the supplementary material and a web server can be accessed online at http://www.compbio.dundee.ac.uk/prophossi
Contact: d.m.a.martin@dundee.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq341
PMCID: PMC2922888  PMID: 20651112
23.  Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs 
BMC Bioinformatics  2014;15:121.
Background
Tandem mass spectrometry-based database searching is currently the main method for protein identification in shotgun proteomics. The explosive growth of protein and peptide databases, which is a result of genome translations, enzymatic digestions, and post-translational modifications (PTMs), is making computational efficiency in database searching a serious challenge. Profile analysis shows that most search engines spend 50%-90% of their total time on the scoring module, and that the spectrum dot product (SDP) based scoring module is the most widely used. As a general purpose and high performance parallel hardware, graphics processing units (GPUs) are promising platforms for speeding up database searches in the protein identification process.
Results
We designed and implemented a parallel SDP-based scoring module on GPUs that exploits the efficient use of GPU registers, constant memory and shared memory. Compared with the CPU-based version, we achieved a 30 to 60 times speedup using a single GPU. We also implemented our algorithm on a GPU cluster and achieved an approximately favorable speedup.
Conclusions
Our GPU-based SDP algorithm can significantly improve the speed of the scoring module in mass spectrometry-based protein identification. The algorithm can be easily implemented in many database search engines such as X!Tandem, SEQUEST, and pFind. A software tool implementing this algorithm is available at http://www.comp.hkbu.edu.hk/~youli/ProteinByGPU.html
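For readers unfamiliar with spectrum dot product (SDP) scoring, the following CPU-only sketch shows the basic computation on binned spectra. It is not the authors' GPU implementation: the function names, the fixed 1.0 m/z bin width and the example peak lists are assumptions chosen for illustration.

```python
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks is an iterable of (mz, intensity) pairs; the result maps a bin
    index to the total intensity that falls into that bin.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)

def spectrum_dot_product(experimental, theoretical, bin_width=1.0):
    """Dot product of two binned spectra over the bins they share."""
    exp_bins = bin_spectrum(experimental, bin_width)
    theo_bins = bin_spectrum(theoretical, bin_width)
    return sum(value * theo_bins[b]
               for b, value in exp_bins.items() if b in theo_bins)

# Hypothetical peak lists: (m/z, intensity).
experimental = [(100.2, 2.0), (200.7, 3.0), (300.1, 1.0)]
theoretical = [(100.5, 1.0), (200.1, 1.0), (400.0, 1.0)]
print(spectrum_dot_product(experimental, theoretical))
```

Each experimental spectrum is scored against many candidate peptides, and each score is independent of the others, which is what makes this step a natural fit for parallel hardware.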
doi:10.1186/1471-2105-15-121
PMCID: PMC4049470  PMID: 24773593
24.  Absolute quantification of microbial proteomes at different states by directed mass spectrometry 
The developed, directed mass spectrometry workflow allows the generation of consistent and system-wide quantitative maps of microbial proteomes in a single analysis. Application to the human pathogen L. interrogans revealed mechanistic proteome changes over time involved in pathogenic progression and antibiotic defense, and new insights about the regulation of absolute protein abundances within operons.
The developed, directed proteomic approach allowed consistent detection and absolute quantification of 1680 proteins of the human pathogen L. interrogans in a single LC–MS/MS experiment. The comparison of 25 extensive, consistent and quantitative proteome maps revealed new insights about the proteome changes involved in pathogenic progression and antibiotic defense of L. interrogans, and about the regulation of protein abundances within operons. The generated time-resolved data sets are compatible with pattern analysis algorithms developed for transcriptomics, including hierarchical clustering and functional enrichment analysis of the detected profile clusters. This is the first study that describes the absolute quantitative behavior of any proteome over multiple states and represents the most comprehensive proteome abundance pattern comparison for any organism to date.
Over the last decade, mass spectrometry (MS)-based proteomics has evolved as the method of choice for system-wide proteome studies and now allows for the characterization of several thousands of proteins in a single sample. Despite these great advances, redundant monitoring of protein levels over large sample numbers in a high-throughput manner remains a challenging task. New directed MS strategies have been shown to overcome some of the current limitations, thereby enabling the acquisition of consistent and system-wide data sets of proteomes with low-to-moderate complexity at high throughput.
In this study, we applied this integrated, two-stage MS strategy to investigate global proteome changes in the human pathogen L. interrogans. In the initial discovery phase, 1680 proteins (out of around 3600 gene products) could be identified (Schmidt et al, 2008) and, by focusing precious MS-sequencing time on the most dominant, specific peptides per protein, all proteins could be accurately and consistently monitored over 25 different samples within a few days of instrument time in the following scoring phase (Figure 1). Additionally, the co-analysis of heavy reference peptides enabled us to obtain absolute protein concentration estimates for all identified proteins in each perturbation (Malmström et al, 2009). The detected proteins did not show any biases against functional groups or protein classes, including membrane proteins, and span an abundance range of more than three orders of magnitude, a range that is expected to cover most of the L. interrogans proteome (Malmström et al, 2009).
To elucidate mechanistic proteome changes over time involved in pathogenic progression and antibiotic defense of L. interrogans, we generated time-resolved proteome maps of cells perturbed with serum and three different antibiotics at sublethal concentrations that are currently used to treat leptospirosis. This yielded an information-rich proteomic data set that describes, for the first time, the absolute quantitative behavior of any proteome over multiple states, and represents the most comprehensive proteome abundance pattern comparison for any organism to date. Using this unique property of the data set, we could quantify protein components of entire pathways across several time points and subject the data sets to cluster analysis, a tool that was previously limited to the transcript level due to incomplete sampling on the protein level (Figure 4). Based on these analyses, we could demonstrate that Leptospira cells adjust the cellular abundance of a certain subset of proteins and pathways as a general response to stress, while other parts of the proteome respond highly specifically. The cells furthermore react to individual treatments by ‘fine tuning’ the abundance of certain proteins and pathways in order to cope with the specific cause of stress. Intriguingly, the most specific and significant expression changes were observed for proteins involved in motility, tissue penetration and virulence after serum treatment, where we tried to simulate the host environment. While many of the detected protein changes demonstrate good agreement with available transcriptomics data, most proteins showed a poor correlation. This includes potential virulence factors, like Loa22 or OmpL1, with confirmed expression in vivo that were significantly up-regulated on the protein level, but not on the mRNA level, strengthening the importance of proteomic studies.
The high resolution and coverage of the proteome data set enabled us to further investigate protein abundance changes of co-regulated genes within operons. This suggests that although most proteins within an operon respond to regulation synchronously, bacterial cells seem to have subtle means to adjust the levels of individual proteins or protein groups outside of the general trend, a phenomenon that was recently also observed on the transcript level of other bacteria (Güell et al, 2009).
The method can be implemented with standard high-resolution mass spectrometers and software tools that are readily available in the majority of proteomics laboratories. It is scalable to any proteome of low-to-medium complexity and can be extended to post-translational modifications or peptide-labeling strategies for quantification. We therefore expect the approach outlined here to become a cornerstone for microbial systems biology.
Over the past decade, liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS) has evolved into the main proteome discovery technology. Up to several thousand proteins can now be reliably identified from a sample and the relative abundance of the identified proteins can be determined across samples. However, the remeasurement of substantially similar proteomes, for example those generated by perturbation experiments in systems biology, at high reproducibility and throughput remains challenging. Here, we apply a directed MS strategy to detect and quantify sets of pre-determined peptides in tryptic digests of cells of the human pathogen Leptospira interrogans at 25 different states. We show that in a single LC–MS/MS experiment around 5000 peptides, covering 1680 L. interrogans proteins, can be consistently detected and their absolute expression levels estimated, revealing new insights about the proteome changes involved in pathogenic progression and antibiotic defense of L. interrogans. This is the first study that describes the absolute quantitative behavior of any proteome over multiple states, and represents the most comprehensive proteome abundance pattern comparison for any organism to date.
doi:10.1038/msb.2011.37
PMCID: PMC3159967  PMID: 21772258
absolute quantification; directed mass spectrometry; Leptospira interrogans; microbiology; proteomics
25.  A Mouse to Human Search for Plasma Proteome Changes Associated with Pancreatic Tumor Development 
PLoS Medicine  2008;5(6):e123.
Background
The complexity and heterogeneity of the human plasma proteome have presented significant challenges in the identification of protein changes associated with tumor development. Refined genetically engineered mouse (GEM) models of human cancer have been shown to faithfully recapitulate the molecular, biological, and clinical features of human disease. Here, we sought to exploit the merits of a well-characterized GEM model of pancreatic cancer to determine whether proteomics technologies allow identification of protein changes associated with tumor development and whether such changes are relevant to human pancreatic cancer.
Methods and Findings
Plasma was sampled from mice at early and advanced stages of tumor development and from matched controls. Using a proteomic approach based on extensive protein fractionation, we confidently identified 1,442 proteins that were distributed across seven orders of magnitude of abundance in plasma. Analysis of proteins chosen on the basis of increased levels in plasma from tumor-bearing mice and corroborating protein or RNA expression in tissue documented concordance in the blood from 30 newly diagnosed patients with pancreatic cancer relative to 30 control specimens. A panel of five proteins selected on the basis of their increased level at an early stage of tumor development in the mouse was tested in a blinded study in 26 humans from the CARET (Carotene and Retinol Efficacy Trial) cohort. The panel discriminated pancreatic cancer cases from matched controls in blood specimens obtained between 7 and 13 mo prior to the development of symptoms and clinical diagnosis of pancreatic cancer.
Conclusions
Our findings indicate that GEM models of cancer, in combination with in-depth proteomic analysis, provide a useful strategy to identify candidate markers applicable to human cancer with potential utility for early detection.
Samir Hanash and colleagues identify proteins that are increased at an early stage of pancreatic tumor development in a mouse model and may be a useful tool in detecting early tumors in humans.
Editors' Summary
Background.
Cancers are life-threatening, disorganized masses of cells that can occur anywhere in the human body. They develop when cells acquire genetic changes that allow them to grow uncontrollably and to spread around the body (metastasize). If a cancer is detected when it is still small and has not metastasized, surgery can often provide a cure. Unfortunately, many cancers are detected only when they are large enough to press against surrounding tissues and cause pain or other symptoms. By this time, surgical removal of the original (primary) tumor may be impossible and there may be secondary cancers scattered around the body. In such cases, radiotherapy and chemotherapy can sometimes help, but the outlook for patients whose cancers are detected late is often poor. One cancer type for which late detection is a particular problem is pancreatic adenocarcinoma. This cancer rarely causes any symptoms in its early stages. Furthermore, the symptoms it eventually causes—jaundice, abdominal and back pain, and weight loss—are seen in many other illnesses. Consequently, pancreatic cancer has usually spread before it is diagnosed, and most patients die within a year of their diagnosis.
Why Was This Study Done?
If a test could be developed to detect pancreatic cancer in its early stages, the lives of many patients might be extended. Tumors often release specific proteins—“cancer biomarkers”—into the blood, a bodily fluid that can be easily sampled. If a protein released into the blood by pancreatic cancer cells could be identified, it might be possible to develop a noninvasive screening test for this deadly cancer. In this study, the researchers use a “proteomic” approach to identify potential biomarkers for early pancreatic cancer. Proteomics is the study of the patterns of proteins made by an organism, tissue, or cell and of the changes in these patterns that are associated with various diseases.
What Did the Researchers Do and Find?
The researchers started their search for pancreatic cancer biomarkers by studying the plasma proteome (the proteins in the fluid portion of blood) of mice genetically engineered to develop cancers that closely resemble human pancreatic tumors. Through the use of two techniques called high-resolution mass spectrometry and acrylamide isotopic labeling, the researchers identified 165 proteins that were present in larger amounts in plasma collected from mice with early and/or advanced pancreatic cancer than in plasma from control mice. Then, to test whether any of these protein changes were relevant to human pancreatic cancer, the researchers analyzed blood samples collected from patients with pancreatic cancer. These samples, they report, contained larger amounts of some of these proteins than blood collected from patients with chronic pancreatitis, a condition that has similar symptoms to pancreatic cancer. Finally, using blood samples collected during a clinical trial, the Carotene and Retinol Efficacy Trial (a cancer-prevention study), the researchers showed that the measurement of five of the proteins present in increased amounts at an early stage of tumor development in the mouse model discriminated between people with pancreatic cancer and matched controls up to 13 months before cancer diagnosis.
What Do These Findings Mean?
These findings suggest that in-depth proteomic analysis of genetically engineered mouse models of human cancer might be an effective way to identify biomarkers suitable for the early detection of human cancers. Previous attempts to identify such biomarkers using human samples have been hampered by the many noncancer-related differences in plasma proteins that exist between individuals and by problems in obtaining samples from patients with early cancer. The use of a mouse model of human cancer, these findings indicate, can circumvent both of these problems. More specifically, these findings identify a panel of proteins that might allow earlier detection of pancreatic cancer and that might, therefore, extend the life of some patients who develop this cancer. However, before a routine screening test becomes available, additional markers will need to be identified and extensive validation studies in larger groups of patients will have to be completed.
Additional Information.
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.0050123.
The MedlinePlus Encyclopedia has a page on pancreatic cancer (in English and Spanish). Links to further information are provided by MedlinePlus
The US National Cancer Institute has information about pancreatic cancer for patients and health professionals (in English and Spanish)
The UK charity Cancerbackup also provides information for patients about pancreatic cancer
The Clinical Proteomic Technologies for Cancer Initiative (a US National Cancer Institute initiative) provides a tutorial about proteomics and cancer and information on the Mouse Proteomic Technologies Initiative
doi:10.1371/journal.pmed.0050123
PMCID: PMC2504036  PMID: 18547137
