Related Articles

1.  Mining gene functional networks to improve mass-spectrometry-based protein identification 
Bioinformatics  2009;25(22):2955-2961.
Motivation: High-throughput protein identification experiments based on tandem mass spectrometry (MS/MS) often suffer from low sensitivity and low-confidence protein identifications. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other evidence to suggest that a protein is present and confidence in individual protein identification can be updated accordingly.
Results: We develop a method that analyzes MS/MS experiments in the larger context of the biological processes active in a cell. Our method, MSNet, improves protein identification in shotgun proteomics experiments by considering information on functional associations from a gene functional network. MSNet substantially increases the number of proteins identified in the sample at a given error rate. We identify 8–29% more proteins than the original MS experiment when applied to yeast grown in different experimental conditions analyzed on different MS/MS instruments, and 37% more proteins in a human sample. We validate up to 94% of our identifications in yeast by presence in ground-truth reference sets.
Availability and Implementation: Software and datasets are available at http://aug.csres.utexas.edu/msnet
Contact: miranker@cs.utexas.edu, marcotte@icmb.utexas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp461
PMCID: PMC2773251  PMID: 19633097
2.  Revisiting the negative example sampling problem for predicting protein–protein interactions 
Bioinformatics  2011;27(21):3024-3028.
Motivation: A number of computational methods have been proposed that predict protein–protein interactions (PPIs) based on protein sequence features. Since the number of potential non-interacting protein pairs (negative PPIs) is very high both in absolute terms and in comparison to that of interacting protein pairs (positive PPIs), computational prediction methods rely upon subsets of negative PPIs for training and validation. Hence, the need arises for subset sampling for negative PPIs.
Results: We clarify that there are two fundamentally different types of subset sampling for negative PPIs. One is subset sampling for cross-validated testing, where one desires unbiased subsets so that predictive performance estimated with them can be safely assumed to generalize to the population level. The other is subset sampling for training, where one desires the subsets that best train predictive algorithms, even if these subsets are biased. We show that confusion between these two fundamentally different types of subset sampling led one study recently published in Bioinformatics to the erroneous conclusion that predictive algorithms based on protein sequence features are hardly better than random in predicting PPIs. Rather, both protein sequence features and the ‘hubbiness’ of interacting proteins contribute to effective prediction of PPIs. We provide guidance for appropriate use of random versus balanced sampling.
Availability: The datasets used for this study are available at http://www.marcottelab.org/PPINegativeDataSampling.
Contact: yungki@mail.utexas.edu; marcotte@icmb.utexas.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr514
PMCID: PMC3198576  PMID: 21908540
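The distinction drawn in entry 2 between unbiased (random) and balanced sampling of negative PPIs can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy protein labels and the rejection-sampling strategy are all assumptions made for illustration. Random sampling draws non-interacting pairs uniformly, whereas balanced sampling forces each protein to appear in the negatives as often as it does in the positives.

```python
import random
from collections import Counter
from itertools import combinations


def degree_counts(pairs):
    """Count how many pairs each protein takes part in."""
    return Counter(protein for pair in pairs for protein in pair)


def random_negatives(proteins, positives, n, rng):
    """Uniformly sample n pairs that are not known positives (unbiased)."""
    known = {frozenset(p) for p in positives}
    candidates = [frozenset(p) for p in combinations(sorted(proteins), 2)
                  if frozenset(p) not in known]
    return rng.sample(candidates, n)


def balanced_negatives(positives, rng, max_tries=10000):
    """Sample negatives in which every protein keeps its positive-set degree.

    Hypothetical approach: shuffle the list of pair endpoints and re-pair
    them, rejecting draws with self-pairs, known positives or duplicates.
    """
    known = {frozenset(p) for p in positives}
    stubs = [protein for pair in positives for protein in pair]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        pairs = [frozenset(stubs[i:i + 2]) for i in range(0, len(stubs), 2)]
        valid = all(len(p) == 2 and p not in known for p in pairs)
        if valid and len(set(pairs)) == len(pairs):
            return pairs
    raise RuntimeError("no balanced negative set found; try more attempts")


# Toy data (made up): A and C are 'hubs' with two interactions each.
positives = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "C")]
proteins = {p for pair in positives for p in pair}
rng = random.Random(0)
unbiased = random_negatives(proteins, positives, 4, rng)
balanced = balanced_negatives(positives, rng)
```

In the balanced set the hub proteins are over-represented relative to a uniform draw, which is the bias the abstract says may help training but should not be carried into testing.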
3.  mspire: mass spectrometry proteomics in Ruby 
Bioinformatics  2008;24(23):2796-2797.
Summary: Mass spectrometry-based proteomics stands to gain from additional analysis of its data, but its large, complex datasets make demands on speed and memory usage requiring special consideration from scripting languages. The software library ‘mspire’—developed in the Ruby programming language—offers quick and memory-efficient readers for standard xml proteomics formats, converters for intermediate file types in typical proteomics spectral-identification work flows (including the Bioworks .srf format), and modules for the calculation of peptide false identification rates.
Availability: Freely available at http://mspire.rubyforge.org. Additional data models, usage information, and methods available at http://bioinformatics.icmb.utexas.edu/mspire
Contact: marcotte@icmb.utexas.edu
doi:10.1093/bioinformatics/btn513
PMCID: PMC2639276  PMID: 18930952
4.  Computational discovery of pathway-level genetic vulnerabilities in non-small-cell lung cancer 
Bioinformatics  2016;32(9):1373-1379.
Motivation: Novel approaches are needed for discovery of targeted therapies for non-small-cell lung cancer (NSCLC) that are specific to certain patients. Whole genome RNAi screening of lung cancer cell lines provides an ideal source for determining candidate drug targets.
Results: Unsupervised learning algorithms uncovered patterns of differential vulnerability across lung cancer cell lines to loss of functionally related genes. Such genetic vulnerabilities represent candidate targets for therapy and are found to be involved in splicing, translation and protein folding. In particular, many NSCLC cell lines were especially sensitive to the loss of components of the LSm2-8 protein complex or the CCT/TRiC chaperonin. Different vulnerabilities were also found for different cell line subgroups. Furthermore, the predicted vulnerability of a single adenocarcinoma cell line to loss of the Wnt pathway was experimentally validated with screening of small-molecule Wnt inhibitors against an extensive cell line panel.
Availability and implementation: The clustering algorithm is implemented in Python and is freely available at https://bitbucket.org/youngjh/nsclc_paper.
Contact: marcotte@icmb.utexas.edu or jon.young@utexas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btw010
PMCID: PMC4848405  PMID: 26755624
5.  A dynamic data structure for flexible molecular maintenance and informatics 
Bioinformatics  2010;27(1):55-62.
Motivation: We present the ‘Dynamic Packing Grid’ (DPG), a neighborhood data structure for maintaining and manipulating flexible molecules and assemblies, for efficient computation of binding affinities in drug design or in molecular dynamics calculations.
Results: DPG can efficiently maintain the molecular surface using only linear space and supports quasi-constant time insertion, deletion and movement (i.e. updates) of atoms or groups of atoms. DPG also supports constant time neighborhood queries from arbitrary points. Our results for maintenance of molecular surface and polarization energy computations using DPG exhibit marked improvement in time and space requirements.
Availability: http://www.cs.utexas.edu/~bajaj/cvc/software/DPG.shtml
Contact: bajaj@cs.utexas.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq627
PMCID: PMC3008647  PMID: 21115440
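The operations listed in entry 5 (insertion, deletion, movement and neighborhood queries) can be sketched with a plain uniform hash grid. This is a simplified stand-in, not the Dynamic Packing Grid itself: the class name `GridIndex`, its methods and the restriction that the query radius may not exceed the cell size are assumptions made for illustration. Space is linear in the number of atoms, and each query inspects a fixed block of 27 cells.

```python
from collections import defaultdict
from math import dist, floor


class GridIndex:
    """Uniform hash grid over 3D points (hypothetical, simplified sketch)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)   # cell key -> atom ids in that cell
        self.positions = {}             # atom id -> (x, y, z)

    def _key(self, point):
        return tuple(floor(c / self.cell_size) for c in point)

    def insert(self, atom_id, point):
        self.positions[atom_id] = point
        self.cells[self._key(point)].add(atom_id)

    def delete(self, atom_id):
        key = self._key(self.positions.pop(atom_id))
        self.cells[key].discard(atom_id)
        if not self.cells[key]:
            del self.cells[key]         # keep storage proportional to atoms

    def move(self, atom_id, new_point):
        self.delete(atom_id)
        self.insert(atom_id, new_point)

    def neighbors(self, point, radius):
        """Return sorted ids of atoms within radius of an arbitrary point."""
        if radius > self.cell_size:
            raise ValueError("radius must not exceed cell_size in this sketch")
        kx, ky, kz = self._key(point)
        found = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    cell = self.cells.get((kx + dx, ky + dy, kz + dz), ())
                    for atom_id in cell:
                        if dist(self.positions[atom_id], point) <= radius:
                            found.append(atom_id)
        return sorted(found)


grid = GridIndex(cell_size=2.0)
grid.insert("a", (0.0, 0.0, 0.0))
grid.insert("b", (1.5, 0.0, 0.0))
grid.insert("c", (5.0, 5.0, 5.0))
near_before = grid.neighbors((0.0, 0.0, 0.0), 2.0)
grid.move("c", (0.0, 1.0, 0.0))
grid.delete("b")
near_after = grid.neighbors((0.0, 0.0, 0.0), 2.0)
```

Query cost depends on how many atoms share nearby cells, which stays bounded for molecules because atoms cannot pack arbitrarily densely.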
6.  DACTAL: divide-and-conquer trees (almost) without alignments 
Bioinformatics  2012;28(12):i274-i282.
Motivation: While phylogenetic analyses of datasets containing 1000–5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale.
Methods: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000–27 643 taxa.
Results: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28 000 sequences.
Availability: DACTAL source code and results of dataset analyses are available at www.cs.utexas.edu/users/phylo/software/dactal.
Contact: tandy@cs.utexas.edu
doi:10.1093/bioinformatics/bts218
PMCID: PMC3371850  PMID: 22689772
7.  The Galaxy Framework as a Unifying Bioinformatics Solution for ‘omics’ Core Facilities 
Integration of different omics data (genomic, transcriptomic, proteomic) yields novel insights into biological systems. However, the use of multiple, disparate software in a sequential manner makes the integration of multi-omic data a serious challenge. We describe the extension of Galaxy for mass spectrometric-based proteomics software, enabling advanced multi-omic applications in proteogenomics and metaproteomics. We will demonstrate the benefits of Galaxy for these analyses, as well as its value for software developers seeking to publish new software. We will also share insights on the benefits of the Galaxy framework as a bioinformatics solution for proteomic/metabolomic core facilities. Multiple datasets were used for proteogenomics research (3D-fractionated salivary dataset and oral pre-malignant lesion (OPML) dataset) and metaproteomics research (OPML dataset and Severe Early Childhood Caries (SECC) dataset). Software required for analytical steps such as peaklist generation, database generation (RNA-Seq derived and others), database search (ProteinPilot and X! tandem) and for quantitative proteomics was deployed, tested and optimized for use in workflows. The software is shared in the Galaxy toolshed (http://toolshed.g2.bx.psu.edu/). Usage of analytical workflows resulted in reliable identification of novel proteoforms (proteogenomics) or microorganisms (metaproteomics). Proteogenomics analysis identified novel proteoforms in the salivary dataset (51) and OPML dataset (38). Metaproteomics analysis led to microbial identification in OPML and SECC datasets using MEGAN software. As examples, workflows for proteogenomics analysis (http://z.umn.edu/pg140) and metaproteomic analysis (http://z.umn.edu/mp65) are available at the usegalaxyp.org website. Tutorials for workflow usage within the Galaxy-P framework are also available (http://z.umn.edu/ppingp).
We demonstrate the use of Galaxy for integrated analysis of multi-omic data, in an accessible, transparent and reproducible manner. Our results and experiences using this framework demonstrate the potential for Galaxy to be a unifying bioinformatics solution for ‘omics core facilities.
PMCID: PMC4162280
8.  Verification of a Parkinson's Disease Protein Signature by Multiple Reaction Monitoring 
OBJECTIVE: Integration of different ‘omics data (genomic, transcriptomic, proteomic) yields novel insights into biological systems. However, the use of multiple, disparate software in a sequential manner makes the integration of multi-omic data a serious challenge. We describe the extension of Galaxy for mass spectrometric-based proteomics software, enabling advanced multi-omic applications in proteogenomics and metaproteomics. We will demonstrate the benefits of Galaxy for these analyses, as well as its value for software developers seeking to publish new software. We will also share insights on the benefits of the Galaxy framework as a bioinformatics solution for proteomic/metabolomic core facilities. METHODS: Multiple datasets were used for proteogenomics research (3D-fractionated salivary dataset and oral pre-malignant lesion (OPML) dataset) and metaproteomics research (OPML dataset and Severe Early Childhood Caries (SECC) dataset). Software required for analytical steps such as peaklist generation, database generation (RNA-Seq derived and others), database search (ProteinPilot and X! tandem) and for quantitative proteomics was deployed, tested and optimized for use in workflows. The software is shared in the Galaxy toolshed (http://toolshed.g2.bx.psu.edu/). RESULTS: Usage of analytical workflows resulted in reliable identification of novel proteoforms (proteogenomics) or microorganisms (metaproteomics). Proteogenomics analysis identified novel proteoforms in the salivary dataset (51) and OPML dataset (38). Metaproteomics analysis led to microbial identification in OPML and SECC datasets using MEGAN software. As examples, workflows for proteogenomics analysis (http://z.umn.edu/pg140) and metaproteomic analysis (http://z.umn.edu/mp65) are available at the usegalaxyp.org website. Tutorials for workflow usage within the Galaxy-P framework are also available (http://z.umn.edu/ppingp). CONCLUSIONS: We demonstrate the use of Galaxy for integrated analysis of multi-omic data, in an accessible, transparent and reproducible manner. Our results and experiences using this framework demonstrate the potential for Galaxy to be a unifying bioinformatics solution for ‘omics core facilities.
PMCID: PMC4162281
9.  IsoQuant: A Software Tool for SILAC-Based Mass Spectrometry Quantitation 
Analytical chemistry  2012;84(10):4535-4543.
Accurate protein identification and quantitation are critical when interpreting the biological relevance of large-scale shotgun proteomics datasets. Although significant technical advances in peptide and protein identification have been made, accurate quantitation of high throughput datasets remains a key challenge in mass spectrometry data analysis and is a labor intensive process for many proteomics laboratories. Here, we report a new SILAC-based proteomics quantitation software tool, named IsoQuant, which is used to process high mass accuracy mass spectrometry data. IsoQuant offers a convenient quantitation framework to calculate peptide/protein relative abundance ratios. At the same time, it also includes a visualization platform that permits users to validate the quality of SILAC peptide and protein ratios. The program is written in the C# programming language under the Microsoft .NET framework version 4.0 and has been tested to be compatible with both 32-bit and 64-bit Windows 7. It is freely available to non-commercial users at http://www.proteomeumb.org/MZw.html.
doi:10.1021/ac300510t
PMCID: PMC3583527  PMID: 22519468
Quantitative proteomics; mass spectrometry; SILAC; bioinformatics
10.  TIPP: taxonomic identification and phylogenetic profiling 
Bioinformatics  2014;30(24):3548-3555.
Motivation: Abundance profiling (also called ‘phylogenetic profiling’) is a crucial step in understanding the diversity of a metagenomic sample, and one of the basic techniques used for this is taxonomic identification of the metagenomic reads.
Results: We present taxon identification and phylogenetic profiling (TIPP), a new marker-based taxon identification and abundance profiling method. TIPP combines SATé-enabled phylogenetic placement, a phylogenetic placement method, with statistical techniques to control the classification precision and recall, and results in improved abundance profiles. TIPP is highly accurate even in the presence of high indel errors and novel genomes, and matches or improves on previous approaches, including NBC, mOTU, PhymmBL, MetaPhyler and MetaPhlAn.
Availability and implementation: Software and supplementary materials are available at http://www.cs.utexas.edu/users/phylo/software/sepp/tipp-submission/.
Contact: warnow@illinois.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu721
PMCID: PMC4253836  PMID: 25359891
11.  Inductive matrix completion for predicting gene–disease associations 
Bioinformatics  2014;30(12):i60-i68.
Motivation: Most existing methods for predicting causal disease genes rely on a specific type of evidence, and are therefore limited in terms of applicability. More often than not, the type of evidence available for diseases varies: for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms in patients. Similarly, the type of evidence available for genes varies: for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene–disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene–disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive; it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods that are transductive.
Results: Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better: it has close to a one-in-four chance of recovering a true association in the top 100 predictions, compared to the recently proposed Catapult method (second best) that has a <15% chance. We demonstrate that the inductive method is particularly effective for a query disease with no previously known gene associations, and for predicting novel genes, i.e. genes that were not previously linked to diseases. Thus the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature.
Availability: Source code and datasets can be downloaded from http://bigdata.ices.utexas.edu/project/gene-disease.
Contact: naga86@cs.utexas.edu
doi:10.1093/bioinformatics/btu269
PMCID: PMC4058925  PMID: 24932006
12.  Network-based inference from complex proteomic mixtures using SNIPE 
Bioinformatics  2012;28(23):3115-3122.
Motivation: Proteomics presents the opportunity to provide novel insights about the global biochemical state of a tissue. However, a significant problem with current methods is that shotgun proteomics has limited success at detecting many low abundance proteins, such as transcription factors from complex mixtures of cells and tissues. The ability to assay for these proteins in the context of the entire proteome would be useful in many areas of experimental biology.
Results: We used network-based inference in an approach named SNIPE (Software for Network Inference of Proteomics Experiments) that selectively highlights proteins that are more likely to be active but are otherwise undetectable in a shotgun proteomic sample. SNIPE integrates spectral counts from paired case–control samples over a network neighbourhood and assesses the statistical likelihood of enrichment by a permutation test. As an initial application, SNIPE was able to select several proteins required for early murine tooth development. Multiple lines of additional experimental evidence confirm that SNIPE can uncover previously unreported transcription factors in this system. We conclude that SNIPE can enhance the utility of shotgun proteomics data to facilitate the study of poorly detected proteins in complex mixtures.
Availability and Implementation: An implementation for the R statistical computing environment named snipeR has been made freely available at http://genetics.bwh.harvard.edu/snipe/.
Contact: ssunyaev@rics.bwh.harvard.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts594
PMCID: PMC3509492  PMID: 23060611
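The idea in entry 12 of summing case–control spectral-count differences over a network neighbourhood and judging the sum with a permutation test can be sketched as follows. This is not the snipeR implementation: the scoring function, the choice of null model (shuffling the differences across proteins) and all protein names and counts are assumptions made for illustration.

```python
import random


def neighbourhood_score(protein, network, diff):
    """Sum case-minus-control count differences over a protein and its neighbours."""
    members = {protein} | network.get(protein, set())
    return sum(diff.get(m, 0) for m in members)


def permutation_p_value(protein, network, diff, n_perm=2000, seed=0):
    """One-sided p-value for enrichment of the neighbourhood score.

    Assumed null model: count differences are exchangeable across proteins,
    so they are shuffled while the network is held fixed.
    """
    rng = random.Random(seed)
    observed = neighbourhood_score(protein, network, diff)
    names = sorted(diff)
    values = [diff[name] for name in names]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        permuted = dict(zip(names, values))
        if neighbourhood_score(protein, network, permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data (made up): 'TF' is undetected itself (difference 0), but its
# three neighbours all show more spectral counts in case than in control.
network = {"TF": {"N1", "N2", "N3"}}
diff = {"TF": 0, "N1": 10, "N2": 8, "N3": 9}
diff.update({f"P{i}": 0 for i in range(10)})
p_value = permutation_p_value("TF", network, diff)
```

A small p-value here flags the undetected protein because of its neighbours, which mirrors the abstract's goal of highlighting proteins that shotgun proteomics misses.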
13.  The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs 
BMC Bioinformatics  2002;3:2.
Background
Comparative analysis of RNA sequences is the basis for the detailed and accurate predictions of RNA structure and the determination of phylogenetic relationships for organisms that span the entire phylogenetic tree. Underlying these accomplishments are very large, well-organized, and processed collections of RNA sequences. This data, starting with the sequences organized into a database management system and aligned to reveal their higher-order structure, and patterns of conservation and variation for organisms that span the phylogenetic tree, has been collected and analyzed. This type of information can be fundamental for and have an influence on the study of phylogenetic relationships, RNA structure, and the melding of these two fields.
Results
We have prepared a large web site that disseminates our comparative sequence and structure models and data. The four major types of comparative information and systems available for the three ribosomal RNAs (5S, 16S, and 23S rRNA), transfer RNA (tRNA), and two of the catalytic intron RNAs (group I and group II) are: (1) Current Comparative Structure Models; (2) Nucleotide Frequency and Conservation Information; (3) Sequence and Structure Data; and (4) Data Access Systems.
Conclusions
This online RNA sequence and structure information, the result of extensive analysis, interpretation, data collection, and computer program and web development, is accessible at our Comparative RNA Web (CRW) Site http://www.rna.icmb.utexas.edu. In the future, more data and information will be added to these existing categories, new categories will be developed, and additional RNAs will be studied and presented at the CRW Site.
doi:10.1186/1471-2105-3-2
PMCID: PMC65690  PMID: 11869452
14.  PEPPI: a peptidomic database of human protein isoforms for proteomics experiments 
BMC Bioinformatics  2010;11(Suppl 6):S7.
Abstract
Background
Protein isoform generation, which may derive from alternative splicing, genetic polymorphism, and posttranslational modification, is an essential source of achieving molecular diversity by eukaryotic cells. Previous studies have shown that protein isoforms play critical roles in disease diagnosis, risk assessment, sub-typing, prognosis, and treatment outcome predictions. Understanding the types, presence, and abundance of different protein isoforms in different cellular and physiological conditions is a major task in functional proteomics, and may pave ways to molecular biomarker discovery of human diseases. In tandem mass spectrometry (MS/MS) based proteomics analysis, peptide peaks with exact matches to protein sequence records in the proteomics database may be identified with mass spectrometry (MS) search software. However, due to limited annotation and poor coverage of protein isoforms in proteomics databases, high throughput protein isoform identifications, particularly those arising from alternative splicing and genetic polymorphism, have not been possible.
Results
Therefore, we present the PEPtidomics Protein Isoform Database (PEPPI, http://bio.informatics.iupui.edu/peppi), a comprehensive database of computationally-synthesized human peptides that can identify protein isoforms derived from either alternatively spliced mRNA transcripts or SNP variations. We collected genome, pre-mRNA alternative splicing and SNP information from Ensembl. We synthesized in silico isoform transcripts that cover all exons and theoretically possible junctions of exons and introns, as well as all their variations derived from known SNPs. With three case studies, we further demonstrated that the database can help researchers discover and characterize new protein isoform biomarkers from experimental proteomics data.
Conclusions
We developed a new tool for the proteomics community to characterize protein isoforms from MS-based proteomics experiments. By cataloguing each peptide configuration in the PEPPI database, users can study genetic variations and alternative splicing events at the proteome level. They can also batch-download peptide sequences in FASTA format to search for MS/MS spectra derived from human samples. The database can help generate novel hypotheses on molecular risk factors and molecular mechanisms of complex diseases, leading to identification of potentially highly specific protein isoform biomarkers.
doi:10.1186/1471-2105-11-S6-S7
PMCID: PMC3026381  PMID: 20946618
15.  Integrative analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: a non-linear model to predict abundance of undetected proteins 
Bioinformatics  2009;25(15):1905-1914.
Motivation: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. Using partial proteomic data, it is likely that integrative transcriptomic and proteomic analysis may introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into metabolic mechanisms underlying complex biological systems.
Results: In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. A strong plateau effect in the regions of high mRNA values and sparse data occurred in this model. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. Our non-linear model consists of a set of serial regression tree models with implicit strength in variable selection. The model provides variable relative importance measures using mean square error as the criterion. The results showed that coefficients of determination for our non-linear models ranged from 0.393 to 0.582 in both datasets, providing better results than the linear regression used in the past.
We evaluated the validity of this non-linear model using biological information of operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins.
Contact: weiwen.zhang@asu.edu; george.runger@asu.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp325
PMCID: PMC2712339  PMID: 19447782
16.  A hybrid approach to protein differential expression in mass spectrometry-based proteomics 
Bioinformatics  2012;28(12):1586-1591.
Motivation: Quantitative mass spectrometry-based proteomics involves statistical inference on protein abundance, based on the intensities of each protein's associated spectral peaks. However, typical MS-based proteomics datasets have substantial proportions of missing observations, due at least in part to censoring of low intensities. This complicates intensity-based differential expression analysis.
Results: We outline a statistical method for protein differential expression, based on a simple Binomial likelihood. By modeling peak intensities as binary, in terms of ‘presence/absence,’ we enable the selection of proteins not typically amenable to quantitative analysis; e.g. ‘one-state’ proteins that are present in one condition but absent in another. In addition, we present an analysis protocol that combines quantitative and presence/absence analysis of a given dataset in a principled way, resulting in a single list of selected proteins with a single associated false discovery rate.
Availability: All R code available here: http://www.stat.tamu.edu/~adabney/share/xuan_code.zip.
Contact: adabney@stat.tamu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts193
PMCID: PMC3371829  PMID: 22522136
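The presence/absence idea in entry 16 can be illustrated with a Binomial likelihood-ratio test that asks whether a protein's peaks are observed at different rates in two conditions. This is a sketch under stated assumptions rather than the authors' R code: the function names are hypothetical, and the p-value relies on the asymptotic chi-square approximation with one degree of freedom, which is rough for small sample counts.

```python
from math import erfc, log, sqrt


def binom_loglik(k, n, p):
    """Binomial log-likelihood up to a constant (the coefficient cancels in ratios)."""
    ll = 0.0
    if k:
        ll += k * log(p)
    if n - k:
        ll += (n - k) * log(1 - p)
    return ll


def presence_absence_lrt(k1, n1, k2, n2):
    """Test equal presence probability in two conditions.

    k1 of n1 samples show the peak in condition 1, k2 of n2 in condition 2.
    Returns (statistic, approximate p-value).
    """
    pooled = (k1 + k2) / (n1 + n2)
    ll_null = binom_loglik(k1, n1, pooled) + binom_loglik(k2, n2, pooled)
    ll_alt = binom_loglik(k1, n1, k1 / n1) + binom_loglik(k2, n2, k2 / n2)
    stat = max(0.0, 2 * (ll_alt - ll_null))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return stat, erfc(sqrt(stat / 2))


# A 'one-state' protein: seen in all 8 case samples, in none of 8 controls.
one_state_stat, one_state_p = presence_absence_lrt(8, 8, 0, 8)
# A protein seen equally often in both conditions.
same_stat, same_p = presence_absence_lrt(4, 8, 4, 8)
```

An intensity-based test cannot be applied to the one-state protein, since one condition has no intensities at all, whereas the presence/absence test handles it directly.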
17.  customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search 
Bioinformatics  2013;29(24):3235-3237.
Summary: Database search is the most widely used approach for peptide and protein identification in mass spectrometry-based proteomics studies. Our previous study showed that sample-specific protein databases derived from RNA-Seq data can better approximate the real protein pools in the samples and thus improve protein identification. More importantly, single nucleotide variations, short insertions and deletions, and novel junctions identified from RNA-Seq data make the protein database more complete and sample-specific. Here, we report an R package customProDB that enables the easy generation of customized databases from RNA-Seq data for proteomics search. This work bridges genomics and proteomics studies and facilitates cross-omics data integration.
Availability and implementation: customProDB and related documents are freely available at http://bioconductor.org/packages/2.13/bioc/html/customProDB.html.
Contact: bing.zhang@vanderbilt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt543
PMCID: PMC3842753  PMID: 24058055
18.  ASTRAL: genome-scale coalescent-based species tree estimation 
Bioinformatics  2014;30(17):i541-i548.
Motivation: Species trees provide insight into basic biology, including the mechanisms of evolution and how it modifies biomolecular function and structure, biodiversity and co-evolution between genes and species. Yet, gene trees often differ from species trees, creating challenges to species tree estimation. One of the most frequent causes for conflicting topologies between gene trees and species trees is incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent. While many methods have been developed to estimate species trees from multiple genes, some of which have statistical guarantees under the multi-species coalescent model, existing methods are too computationally intensive for use with genome-scale analyses or have been shown to have poor accuracy under some realistic conditions.
Results: We present ASTRAL, a fast method for estimating species trees from multiple genes. ASTRAL is statistically consistent, can run on datasets with thousands of genes and has outstanding accuracy—improving on MP-EST and the population tree from BUCKy, two statistically consistent leading coalescent-based methods. ASTRAL is often more accurate than concatenation using maximum likelihood, except when ILS levels are low or there are too few gene trees.
Availability and implementation: ASTRAL is available in open source form at https://github.com/smirarab/ASTRAL/. Datasets studied in this article are available at http://www.cs.utexas.edu/users/phylo/datasets/astral.
Contact: warnow@illinois.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btu462
PMCID: PMC4147915  PMID: 25161245
19.  Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics 
Journal of proteome research  2016;15(11):4091-4100.
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances – a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ~20,000 primary isoforms plus contaminants to a very large database that includes almost all non-redundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. 
We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
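The two-step strategy described above (search against a simpler database, then check peptide uniqueness against a more complex one) can be illustrated with a minimal sketch. The function names, the toy sequences and the plain substring match are assumptions made for illustration only; this is not the THISP pipeline, and it ignores details such as isoleucine/leucine ambiguity and enzyme cleavage rules.

```python
def proteins_containing(peptide, database):
    """Return the sorted names of all database entries whose sequence contains the peptide."""
    return sorted(name for name, seq in database.items() if peptide in seq)


def uniqueness_report(peptides, larger_database):
    """Map each identified peptide to the entries of the larger database that contain it.

    A peptide mapping to exactly one entry is unique in that database;
    a peptide mapping to several entries is shared.
    """
    return {pep: proteins_containing(pep, larger_database) for pep in peptides}


# Hypothetical toy data: peptides identified with a small database,
# re-checked against a larger one that adds an isoform.
larger_db = {
    "PROT1": "MKTAYIAKQRQISFVK",
    "PROT1_iso2": "MKTAYIAKQRGGGSFVK",
    "PROT2": "MSHFSRQLEERLGLIEVQ",
}
report = uniqueness_report(["AYIAKQR", "QLEERLG"], larger_db)
```

In this toy case the first peptide is shared by two entries of the larger database, while the second remains unique.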
doi:10.1021/acs.jproteome.6b00445
PMCID: PMC5096980  PMID: 27577934
shotgun mass spectrometry; search databases; human
20.  A proteogenomics approach integrating proteomics and ribosome profiling increases the efficiency of protein identification and enables the discovery of alternative translation start sites 
Proteomics  2014;14(0):2688-2698.
Next-generation transcriptome sequencing is increasingly integrated with mass spectrometry to enhance MS-based protein and peptide identification. Recently, a breakthrough in transcriptome analysis was achieved with the development of ribosome profiling (ribo-seq). This technology is based on the deep sequencing of ribosome-protected mRNA fragments, thereby enabling the direct observation of in vivo protein synthesis at the transcript level. In order to explore the impact of a ribo-seq-derived protein sequence search space on MS/MS spectrum identification, we performed a comprehensive proteome study on a human cancer cell line, using both shotgun and N-terminal proteomics, next to ribosome profiling, which was used to delineate (alternative) translational reading-frames. By including protein-level evidence of sample-specific genetic variation and alternative translation, this strategy improved the identification score of 69 proteins and identified 22 new proteins in the shotgun experiment. Furthermore, we discovered 18 new alternative translation start sites in the N-terminal proteomics data and observed a correlation between the quantitative measures of ribo-seq and shotgun proteomics with a Pearson correlation coefficient ranging from 0.483 to 0.664. Overall, this study demonstrated the benefits of ribosome profiling for MS-based protein and peptide identification and we believe this approach could develop into a common practice for next-generation proteomics.
doi:10.1002/pmic.201400180
PMCID: PMC4391000  PMID: 25156699
proteogenomics; ribosome profiling; N-terminomics; bioinformatics; translation initiation
21.  Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line 
We provide a large-scale dataset on absolute protein and matching mRNA concentrations from the human medulloblastoma cell line Daoy. The correlation between mRNA and protein concentrations is significant and positive (R_s=0.46, R^2=0.29, P-value<2e-16), although non-linear. Out of ∼200 tested sequence features, sequence length, frequency and properties of amino acids, as well as translation initiation-related features are the strongest individual correlates of protein abundance when accounting for variation in mRNA concentration. When integrating mRNA expression data and all sequence features into a non-parametric regression model (Multivariate Adaptive Regression Splines), we were able to explain up to 67% of the variation in protein concentrations. Half of the contributions were attributed to mRNA concentrations, the other half to sequence features relating to regulation of translation and protein degradation. The sequence features are primarily linked to the coding and 3′ untranslated region. To our knowledge, this is the most comprehensive predictive model of human protein concentrations achieved so far.
mRNA decay, translation regulation and protein degradation are essential parts of eukaryotic gene expression regulation (Hieronymus and Silver, 2004; Mata et al, 2005), which enable the dynamics of cellular systems and their responses to external and internal stimuli without having to rely exclusively on transcription regulation. The importance of these processes is emphasized by the generally low correlation between mRNA and protein concentrations. For many prokaryotic and eukaryotic organisms, <50% of the variation in protein abundance is explained by variation in mRNA concentrations (de Sousa Abreu et al, 2009).
Given the plethora of regulatory mechanisms involved, most studies have focused so far on individual regulators and specific targets. Particularly in human, we currently lack system-wide, quantitative analyses that evaluate the relative contribution of regulatory elements encoded in the mRNA and protein sequence. Existing studies have been carried out only in bacteria and yeast (Nie et al, 2006; Brockmann et al, 2007; Tuller et al, 2007; Wu et al, 2008). Here, we present the first comprehensive analysis on the impact of translation and protein degradation on protein abundance variation in a human cell line. For this purpose, we experimentally measured absolute protein and mRNA concentrations in the Daoy medulloblastoma cell line, using shotgun proteomics and microarrays, respectively (Figure 1). These data comprise one of the largest such sets available today for human. We focused on sequence features that likely impact protein translation and protein degradation, including length, nucleotide composition, structure of the untranslated regions (UTRs), coding sequence, composition of the translation initiation site, presence of upstream open reading frames, putative target sites of miRNAs, codon usage, amino-acid composition and protein degradation signals.
Three types of tests have been conducted: (a) we examined partial Spearman's rank correlation of numerical features (e.g. length) with protein concentration, accounting for variation in mRNA concentrations; (b) for numerical and categorical features (e.g. function), we compared two extreme populations with Welch's t-test and (c) using a Multivariate Adaptive Regression Splines model, we analyzed the combined contributions of mRNA expression and sequence features to protein abundance variation (Figure 1). To account for the non-linearity of many relationships, we use non-parametric approaches throughout the analysis.
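Test (a) above, a partial Spearman's rank correlation that accounts for mRNA concentration, can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming the common first-order partial correlation formula applied to Spearman coefficients; the function names and the toy numbers are hypothetical, and the authors' actual implementation is not described in this abstract.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def partial_spearman(x, y, z):
    """Rank correlation of x and y after accounting for z (first-order partial)."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy values: a sequence feature, protein and mRNA concentrations.
feature = [1, 2, 3, 4, 5]
protein = [1, 3, 2, 5, 4]
mrna = [3, 1, 5, 2, 4]
r_partial = partial_spearman(feature, protein, mrna)
```

Here `r_partial` measures how strongly the feature tracks protein concentration once the rank association of both with mRNA concentration has been removed.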
We observed a significant positive correlation between mRNA and protein concentrations, larger than many previous measurements (de Sousa Abreu et al, 2009). We also show that the contribution of translation and protein degradation is at least as important as the contribution of mRNA transcription and stability to the abundance variation of the final protein products. Although variation in mRNA expression explains ∼25–30% of the variation in protein abundance, another 30–40% can be accounted for by characteristics of the sequences, which we identified in a comparative assessment of global correlates. Among these characteristics, sequence length, amino-acid frequencies and also nucleotide frequencies in the coding region are of strong influence (Figure 3A). Characteristics of the 3′UTR and of the 5′UTR, that is length, nucleotide composition and secondary structures, describe another part of the variation, leaving 33% of the expression variation unexplained. The unexplained fraction may be accounted for by mechanisms not considered in this analysis (e.g. regulation by RNA-binding proteins or gene-specific structural motifs), as well as expression and measurement noise.
Our combined model including mRNA concentration and sequence features can explain 67% of the variation of protein abundance in this system—and thus has the highest predictive power for human protein abundance achieved so far (Figure 3B).
Transcription, mRNA decay, translation and protein degradation are essential processes during eukaryotic gene expression, but their relative global contributions to steady-state protein concentrations in multi-cellular eukaryotes are largely unknown. Using measurements of absolute protein and mRNA abundances in cellular lysate from the human Daoy medulloblastoma cell line, we quantitatively evaluate the impact of mRNA concentration and sequence features implicated in translation and protein degradation on protein expression. Sequence features related to translation and protein degradation have an impact similar to that of mRNA abundance, and their combined contribution explains two-thirds of protein abundance variation. mRNA sequence lengths, amino-acid properties, upstream open reading frames and secondary structures in the 5′ untranslated region (UTR) were the strongest individual correlates of protein concentrations. In a combined model, characteristics of the coding region and the 3′UTR explained a larger proportion of protein abundance variation than characteristics of the 5′UTR. The absolute protein and mRNA concentration measurements for >1000 human genes described here represent one of the largest datasets currently available, and reveal both general trends and specific examples of post-transcriptional regulation.
doi:10.1038/msb.2010.59
PMCID: PMC2947365  PMID: 20739923
gene expression regulation; protein degradation; protein stability; translation
22.  A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells 
eLife  2014;3:e01630.
Technological advances have enabled the analysis of cellular protein and RNA levels with unprecedented depth and sensitivity, allowing for an unbiased re-evaluation of gene regulation during fundamental biological processes. Here, we have chronicled the dynamics of protein and mRNA expression levels across a minimally perturbed cell cycle in human myeloid leukemia cells using centrifugal elutriation combined with mass spectrometry-based proteomics and RNA-Seq, avoiding artificial synchronization procedures. We identify myeloid-specific gene expression and variations in protein abundance, isoform expression and phosphorylation at different cell cycle stages. We dissect the relationship between protein and mRNA levels for both bulk gene expression and for over ∼6000 genes individually across the cell cycle, revealing complex, gene-specific patterns. This data set, one of the deepest surveys to date of gene expression in human cells, is presented in an online, searchable database, the Encyclopedia of Proteome Dynamics (http://www.peptracker.com/epd/).
DOI: http://dx.doi.org/10.7554/eLife.01630.001
eLife digest
Cells are complex environments: at any one time, thousands of different genes act as molecular templates to produce messenger RNA (mRNA) molecules, which themselves are templates used to produce proteins. However, not all genes are active at all times inside all cells: as cells grow and divide as part of the cell division cycle, genes are switched on and off on a regular basis. Similarly, the patterns of mRNA and protein production are different in, say, immune and skin cells.
In recent years, the tools available for detecting mRNA molecules and proteins have become more powerful, allowing researchers to move beyond just measuring the total amounts of mRNA and protein in the cell to now measuring individual amounts of specific mRNA and protein molecules encoded by specific genes. However, it has been a challenge to make these measurements at different stages of the cell cycle. Most of the methods used to do this have involved artificially ‘arresting’ the cell cycle, which can lead to side effects that are difficult to account for.
Ly et al. have now overcome these problems using a combination of three methods to measure the levels of mRNA and protein molecules associated with over 6000 genes in human cancer cells derived from myeloid leukemia. Exploiting the fact that cells change size during the cell cycle, Ly et al. used a centrifugation technique to separate cells based on their size and, therefore, the stage of the cell cycle they were at, thus avoiding the need to arrest the cell cycle. An approach called RNA-Seq was then employed to measure the levels of the different mRNA molecules in the cells, and a device called a mass spectrometer was used to identify and measure the levels of many different proteins.
In addition to being able to follow the level of mRNA and protein production for a large number of genes throughout the cell division cycle, while also obtaining detailed information about how many of the proteins are modified, Ly et al. discovered that—contrary to expectations—low numbers of mRNA molecules were sometimes associated with high numbers of the corresponding protein, and vice versa. This work provides a better understanding of the complex relationship between the levels of an mRNA and its corresponding protein product, and also demonstrates how it may be possible to detect subtle but important differences between cell types and disease states, including different types of cancer.
DOI: http://dx.doi.org/10.7554/eLife.01630.002
doi:10.7554/eLife.01630
PMCID: PMC3936288  PMID: 24596151
proteomics; mass spectrometry; RNA-Seq; cell cycle; transcriptomics; human
23.  ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes 
Bioinformatics  2015;31(12):i44-i52.
Motivation: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed ‘bipartitions’. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent.
Results: We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL’s running time is O(n^2 k|X|^2), and ASTRAL-II’s running time is O(nk|X|^2), where n is the number of species, k is the number of loci and X is the set of allowed bipartitions for the search space.
Availability and implementation: ASTRAL-II is available in open source at https://github.com/smirarab/ASTRAL and datasets used are available at http://www.cs.utexas.edu/~phylo/datasets/astral2/.
Contact: smirarab@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btv234
PMCID: PMC4765870  PMID: 26072508
24.  Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs 
BMC Bioinformatics  2014;15:121.
Background
Tandem mass spectrometry-based database searching is currently the main method for protein identification in shotgun proteomics. The explosive growth of protein and peptide databases, which is a result of genome translations, enzymatic digestions, and post-translational modifications (PTMs), is making computational efficiency in database searching a serious challenge. Profile analysis shows that most search engines spend 50%-90% of their total time on the scoring module, and that the spectrum dot product (SDP) based scoring module is the most widely used. As a general purpose and high performance parallel hardware, graphics processing units (GPUs) are promising platforms for speeding up database searches in the protein identification process.
Results
We designed and implemented a parallel SDP-based scoring module on GPUs that exploits the efficient use of GPU registers, constant memory and shared memory. Compared with the CPU-based version, we achieved a 30 to 60 times speedup using a single GPU. We also implemented our algorithm on a GPU cluster and achieved an approximately favorable speedup.
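For readers unfamiliar with spectrum dot product (SDP) scoring, the idea can be sketched on the CPU in a few lines. This is a simplified illustration, not the authors' GPU implementation: the fixed-width binning, the function names, the peptide labels and the peak values are all assumptions made for the example.

```python
def bin_spectrum(peaks, bin_width=1.0):
    """Map (m/z, intensity) peaks to {bin index: summed intensity}."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def spectrum_dot_product(experimental, theoretical, bin_width=1.0):
    """Dot product of two binned spectra; only bins present in both contribute."""
    exp_bins = bin_spectrum(experimental, bin_width)
    theo_bins = bin_spectrum(theoretical, bin_width)
    return sum(v * theo_bins[idx] for idx, v in exp_bins.items() if idx in theo_bins)


def best_candidate(experimental, candidates, bin_width=1.0):
    """Score every candidate peptide and return the (peptide, score) pair with the top score."""
    scored = (
        (peptide, spectrum_dot_product(experimental, peaks, bin_width))
        for peptide, peaks in candidates.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy spectra: (m/z, intensity) pairs.
observed = [(100.2, 2.0), (200.7, 3.0), (300.1, 1.0)]
candidates = {
    "PEPA": [(100.5, 1.0), (300.4, 1.0)],
    "PEPB": [(150.0, 1.0)],
}
top = best_candidate(observed, candidates)
```

Each spectrum-candidate pair is scored independently of the others, which is one plausible reason this kind of scoring maps well onto parallel hardware.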
Conclusions
Our GPU-based SDP algorithm can significantly improve the speed of the scoring module in mass spectrometry-based protein identification. The algorithm can be easily implemented in many database search engines such as X!Tandem, SEQUEST, and pFind. A software tool implementing this algorithm is available at http://www.comp.hkbu.edu.hk/~youli/ProteinByGPU.html
doi:10.1186/1471-2105-15-121
PMCID: PMC4049470  PMID: 24773593
25.  Proteome data to explore the impact of pBClin15 on Bacillus cereus ATCC 14579 
Data in Brief  2016;8:1243-1246.
This data article reports changes in the cellular and exoproteome of B. cereus cured from pBClin15. Time-course changes of proteins were assessed by high-throughput nanoLC-MS/MS. We report all the peptides and proteins identified and quantified in B. cereus with and without pBClin15. Proteins were classified into functional groups using the information available in the KEGG classification and we reported their abundance in terms of normalized spectral abundance factor. The repertoire of experimentally confirmed proteins of B. cereus presented here is the largest ever reported, and provides new insights into the interplay between pBClin15 and its host B. cereus ATCC 14579. The data reported here are related to a published shotgun proteomics analysis regarding the role of pBClin15, “Deciphering the interactions between the Bacillus cereus linear plasmid, pBClin15, and its host by high-throughput comparative proteomics” Madeira et al. [1]. All the associated mass spectrometry data have been deposited in the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository (http://www.ebi.ac.uk/pride/), with the dataset identifiers PRIDE: PXD001568, PRIDE: PXD002788 and PRIDE: PXD002789.
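The normalized spectral abundance factor (NSAF) mentioned above is commonly defined as a protein's spectral count divided by its length, normalized by the sum of that ratio over all proteins in the sample. A minimal sketch under that standard definition follows; the function name, protein labels and counts are hypothetical and are not taken from the deposited datasets.

```python
def nsaf(spectral_counts, lengths):
    """Compute NSAF per protein: (count / length) divided by the sum of count / length."""
    saf = {protein: spectral_counts[protein] / lengths[protein] for protein in spectral_counts}
    total = sum(saf.values())
    return {protein: value / total for protein, value in saf.items()}


# Hypothetical toy input: spectral counts and sequence lengths (in residues).
counts = {"protA": 10, "protB": 20}
lengths = {"protA": 100, "protB": 400}
abundance = nsaf(counts, lengths)
```

In this toy case protA has fewer spectra than protB but a higher NSAF, because dividing by length corrects for longer proteins producing more peptides.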
doi:10.1016/j.dib.2016.07.042
PMCID: PMC4983103  PMID: 27547804
