Due to its highly reproducible and quantitative nature, and minimal requirements for sample preparation or separation, 1H nuclear magnetic resonance (NMR) spectroscopy is widely used for profiling small-molecule metabolites in biofluids. However, 1H NMR spectra contain many overlapped peaks. In particular, blood serum/plasma and diabetic urine samples contain high concentrations of glucose, which produces strong peaks between 3.2 and 4.0 ppm. Signals from most metabolites in this region are overwhelmed by the glucose background signals and become invisible. We propose a simple “Add to Subtract” background subtraction method, and show that it can reduce the glucose signals by 98% to allow retrieval of the hidden information. The procedure consists of adding a small drop of concentrated glucose solution to the sample in the NMR tube, mixing, waiting for equilibration, and acquiring a second spectrum. The glucose-free spectra are then generated by spectral subtraction using Bruker Topspin software. Subsequent multivariate statistical analysis can then be used to identify biomarker candidate signals for distinguishing different types of biological samples. The principle of this approach is generally applicable to all quantitative spectral data and should find utility in a variety of NMR-based mixture analyses as well as in metabolite profiling.
1H NMR; metabolomics; metabolite profiling; glucose; signal suppression; mixture analysis; blood; urine
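The subtraction principle described above can be illustrated numerically. The following is a minimal sketch, not the Topspin workflow the abstract refers to: it assumes the spectrum is linear in concentration, neglects the dilution caused by the added drop, and uses synthetic Lorentzian peaks. The function names, peak positions, and the choice of a glucose-only reference region for scaling are all hypothetical.

```python
import numpy as np

def lorentzian(ppm, center, width, height):
    """Single Lorentzian line shape on a ppm axis."""
    return height * width**2 / ((ppm - center) ** 2 + width**2)

def add_to_subtract(s1, s2, ref_mask):
    """Remove the spiked compound's signal from s1.

    s1: spectrum before the spike; s2: spectrum after the spike.
    ref_mask: boolean mask of points assumed to contain only the spiked compound.
    """
    # The difference is a pure spectrum of the spiked compound (dilution neglected).
    diff = s2 - s1
    # Least-squares scale k so that k * diff matches s1 inside the reference region.
    k = np.dot(diff[ref_mask], s1[ref_mask]) / np.dot(diff[ref_mask], diff[ref_mask])
    return s1 - k * diff, k

# Synthetic example: two strong "glucose" peaks hide a weak metabolite peak.
ppm = np.linspace(3.0, 4.2, 2401)
glucose = lorentzian(ppm, 3.40, 0.004, 1.0) + lorentzian(ppm, 3.90, 0.004, 0.8)
hidden = lorentzian(ppm, 3.55, 0.004, 0.02)

s1 = 50.0 * glucose + hidden   # original sample
s2 = 60.0 * glucose + hidden   # same sample after adding glucose
ref_mask = np.abs(ppm - 3.90) < 0.01  # region assumed to be glucose only

cleaned, k = add_to_subtract(s1, s2, ref_mask)
print(f"scale factor k = {k:.3f}")
print(f"largest remaining peak at {ppm[np.argmax(cleaned)]:.3f} ppm")
```

In real spectra, the added volume, small chemical-shift changes, and noise would all limit how completely the background cancels, so the scale factor and reference region would need to be chosen with care.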
PeptideProphet is a post-processing algorithm designed to evaluate the confidence in identifications of MS/MS spectra returned by a database search. In this manuscript we describe the "what and how" of PeptideProphet in a manner aimed at statisticians and life scientists who would like to gain a more in-depth understanding of the underlying statistical modeling. The theory and rationale behind the mixture-modeling approach taken by PeptideProphet is discussed from a statistical model-building perspective followed by a description of how a model can be used to express confidence in the identification of individual peptides or sets of peptides. We also demonstrate how to evaluate the quality of model fit and select an appropriate model from several available alternatives. We illustrate the use of PeptideProphet in association with the Trans-Proteomic Pipeline, a free suite of software used for protein identification.
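The way a two-component mixture model turns a search score into a confidence statement can be sketched with Bayes' rule. The example below is an illustration only: the Gaussian component densities, their parameters, and the function names are assumptions chosen for simplicity, and are not claimed to be the distributions or code used by PeptideProphet.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posterior_correct(score, pi_correct, correct=(3.0, 1.0), incorrect=(0.0, 1.0)):
    """Posterior probability that an identification with this score is correct.

    pi_correct: mixing proportion of correct identifications.
    correct, incorrect: hypothetical (mean, sd) of the two score distributions.
    """
    num = pi_correct * normal_pdf(score, *correct)
    den = num + (1.0 - pi_correct) * normal_pdf(score, *incorrect)
    return num / den

def estimated_error_rate(posteriors, threshold):
    """Expected fraction of incorrect identifications among those accepted."""
    accepted = [p for p in posteriors if p >= threshold]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)

scores = [-0.5, 0.8, 1.5, 2.9, 4.2]  # hypothetical discriminant scores
posteriors = [posterior_correct(s, pi_correct=0.5) for s in scores]
for s, p in zip(scores, posteriors):
    print(f"score {s:5.1f} -> P(correct) = {p:.4f}")
print(f"estimated error rate at P >= 0.9: {estimated_error_rate(posteriors, 0.9):.4f}")
```

The first function expresses confidence in an individual identification; the second summarizes a set of accepted identifications by averaging their probabilities of being incorrect.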
Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is widely used for quantitative proteomic investigations. The typical output of such studies is a list of identified and quantified peptides. The biological and clinical interest is, however, usually focused on quantitative conclusions at the protein level. Furthermore, many investigations ask complex biological questions by studying multiple interrelated experimental conditions. Therefore, there is a need in the field for generic statistical models to quantify protein levels even in complex study designs.
We propose a general statistical modeling approach for protein quantification in arbitrarily complex experimental designs, such as time course studies or those involving multiple experimental factors. The approach summarizes the quantitative experimental information from all the features and all the conditions that pertain to a protein. It enables both protein significance analysis between conditions, and protein quantification in individual samples or conditions. We implement the approach in an open-source R-based software package, MSstats, suitable for researchers with a limited statistics and programming background.
We demonstrate, using as examples two experimental investigations with complex designs, that a simultaneous statistical modeling of all the relevant features and conditions yields a higher sensitivity of protein significance analysis and a higher accuracy of protein quantification as compared to commonly employed alternatives. The software is available at http://www.stat.purdue.edu/~ovitek/Software.html.
Label-free LC-MS/MS; linear mixed effects models; protein quantification; quantitative proteomics; statistical design of experiments
Motivation: High-throughput perturbation screens measure the phenotypes of thousands of biological samples under various conditions. The phenotypes measured in the screens are subject to substantial biological and technical variation. At the same time, in order to enable high throughput, it is often impossible to include a large number of replicates, and to randomize their order throughout the screens. Distinguishing true changes in the phenotype from stochastic variation in such experimental designs is extremely challenging, and requires adequate statistical methodology.
Results: We propose a statistical modeling framework that is based on experimental designs with at least two controls profiled throughout the experiment, and a normalization and variance estimation procedure with linear mixed-effects models. We evaluate the framework using three comprehensive screens of Saccharomyces cerevisiae, which involve 4940 single-gene knock-out haploid mutants, 1127 single-gene knock-out diploid mutants and 5798 single-gene overexpression haploid strains. We show that the proposed approach (i) can be used in conjunction with practical experimental designs; (ii) allows extensions to alternative experimental workflows; (iii) enables a sensitive discovery of biologically meaningful changes; and (iv) strongly outperforms the existing noise reduction procedures.
Availability: All experimental datasets are publicly available at www.ionomicshub.org. The R package HTSmix is available at http://www.stat.purdue.edu/~ovitek/HTSmix.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
Evolutionary and reproductive success of angiosperms, the most diverse group of land plants, relies on visual and olfactory cues for pollinator attraction. Previous work has focused on elucidating the developmental regulation of pathways leading to the formation of pollinator-attracting secondary metabolites such as scent compounds and flower pigments. However, to date little is known about how flowers control their entire metabolic network to achieve the highly regulated production of metabolites attracting pollinators. Integrative analysis of transcripts and metabolites in snapdragon sepals and petals over flower development performed in this study revealed a profound developmental remodeling of gene expression and metabolite profiles in petals, but not in sepals. Genes up-regulated during petal development were enriched in functions related to secondary metabolism, fatty acid catabolism, and amino acid transport, whereas down-regulated genes were enriched in processes involved in cell growth, cell wall formation, and fatty acid biosynthesis. The levels of transcripts and metabolites in pathways leading to scent formation were coordinately up-regulated during petal development, implying transcriptional induction of metabolic pathways preceding scent formation. Developmental gene expression patterns in the pathways involved in scent production were different from those of glycolysis and the pentose phosphate pathway, highlighting distinct developmental regulation of secondary metabolism and primary metabolic pathways feeding into it.
Motivation: Nuclear magnetic resonance (NMR) spectroscopy is widely used for high-throughput characterization of metabolites in complex biological mixtures. However, accurate interpretation of the spectra in terms of identities and abundances of metabolites can be challenging, in particular in crowded regions with heavy peak overlap. Although a number of computational approaches for this task have recently been proposed, they are not entirely satisfactory in either accuracy or extent of automation.
Results: We introduce a probabilistic approach, Bayesian Quantification (BQuant), for fully automated database-based identification and quantification of metabolites in local regions of 1H NMR spectra. The approach represents the spectra as mixtures of reference profiles from a database, and infers the identities and the abundances of metabolites by Bayesian model selection. We show, using a simulated dataset, a spike-in experiment and a metabolomic investigation of plasma samples, that BQuant outperforms the available automated alternatives in accuracy for both identification and quantification.
Availability: The R package BQuant is available at: http://www.stat.purdue.edu/~ovitek/BQuant-Web/.
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
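The idea of representing a spectrum as a mixture of reference profiles and choosing which metabolites are present can be sketched as follows. This is not the BQuant algorithm: it substitutes ordinary least squares with an exhaustive search over reference subsets, scored by the Bayesian information criterion (BIC), as a rough stand-in for Bayesian model selection. The reference profiles, names, and noise level are hypothetical, and exhaustive enumeration is only practical for a handful of candidates.

```python
import itertools
import numpy as np

def gaussian_peak(x, center, width=0.01):
    """Simple peak shape used to build hypothetical reference profiles."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

def select_metabolites(spectrum, references):
    """Return (names, abundances) of the reference subset with the lowest BIC."""
    n = spectrum.size
    names = list(references)
    best = None
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            design = np.column_stack([references[name] for name in subset])
            coef, *_ = np.linalg.lstsq(design, spectrum, rcond=None)
            if np.any(coef < 0):
                continue  # abundances cannot be negative
            rss = float(np.sum((spectrum - design @ coef) ** 2))
            bic = n * np.log(rss / n) + size * np.log(n)
            if best is None or bic < best[0]:
                best = (bic, subset, coef)
    return best[1], best[2]

# Hypothetical references: A and B overlap at 0.50; C is absent from the sample.
x = np.linspace(0.0, 1.0, 500)
references = {
    "A": gaussian_peak(x, 0.20) + gaussian_peak(x, 0.50),
    "B": gaussian_peak(x, 0.50) + gaussian_peak(x, 0.80),
    "C": gaussian_peak(x, 0.35),
}
rng = np.random.default_rng(0)
spectrum = 2.0 * references["A"] + 1.0 * references["B"] + rng.normal(0.0, 0.01, x.size)

names, abundances = select_metabolites(spectrum, references)
estimates = dict(zip(names, abundances))
print(estimates)
```

Because the two present metabolites share a peak, neither can be quantified from that peak alone; fitting all reference profiles jointly is what resolves the overlap.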
Diagnosis of human bladder cancer in untreated tissue sections is achieved by using imaging data from desorption electrospray ionization mass spectrometry (DESI-MS) combined with multivariate statistical analysis. We use the distinctive DESI-MS glycerophospholipid (GP) mass spectral profiles to visually characterize and formally classify twenty pairs (40 tissue samples) of human cancerous and adjacent normal bladder tissue samples. The individual ion images derived from the acquired profiles correlate with standard histological hematoxylin and eosin (H&E)-stained serial sections. The profiles allow us to classify the disease status of the tissue samples with high accuracy as judged by reference histological data. To achieve this, the data from the twenty pairs were divided into a training set and a validation set. Spectra from the tumor and normal regions of each of the tissue sections in the training set were used for orthogonal projection to latent structures (O-PLS) treated partial least-squares discriminant analysis (PLS-DA). This predictive model was then validated by using the validation set and showed a 5% error rate for classification and a misclassification rate of 12%. It was also used to create synthetic images of the tissue sections showing pixel-by-pixel disease classification of the tissue, and these data agreed well with the independent classification based on histological data from a certified pathologist. This represents the first application of multivariate statistical methods for classification by ambient ionization, although these methods have been applied previously to other MS imaging methods. The results are encouraging in terms of the development of a method that could be utilized in a clinical setting through visualization and diagnosis of intact tissue.
cancer; desorption electrospray ionization; lipidomics; molecular imaging; multivariate statistics; mass spectrometry
The phosphorylation and dephosphorylation of proteins by kinases and phosphatases constitute an essential regulatory network in eukaryotic cells. This network supports the flow of information from sensors through signaling systems to effector molecules, and ultimately drives the phenotype and function of cells, tissues, and organisms. Dysregulation of this process has severe consequences and is one of the main factors in the emergence and progression of diseases, including cancer. Thus, major efforts have been invested in developing specific inhibitors that modulate the activity of individual kinases or phosphatases; however, it has been difficult to assess how such pharmacological interventions would affect the cellular signaling network as a whole. Here, we used label-free, quantitative phosphoproteomics in a systematically perturbed model organism (Saccharomyces cerevisiae) to determine the relationships between 97 kinases, 27 phosphatases, and more than 1000 phosphoproteins. We identified 8814 regulated phosphorylation events, describing the first system-wide protein phosphorylation network in vivo. Our results show that, at steady state, inactivation of most kinases and phosphatases affected large parts of the phosphorylation-modulated signal transduction machinery, and not only the immediate downstream targets. The observed cellular growth phenotype was often well maintained despite the perturbations, arguing for considerable robustness in the system. Our results serve to constrain future models of cellular signaling and reinforce the idea that simple linear representations of signaling pathways might be insufficient for drug development and for describing organismal homeostasis.
The genetic model plant Arabidopsis thaliana, like many plant species, experiences a range of edaphic conditions across its natural habitat. Such heterogeneity may drive local adaptation, though the molecular genetic basis remains elusive. Here, we describe a study in which we used genome-wide association mapping, genetic complementation, and gene expression studies to identify cis-regulatory expression level polymorphisms at the AtHKT1;1 locus, encoding a known sodium (Na+) transporter, as being a major factor controlling natural variation in leaf Na+ accumulation capacity across the global A. thaliana population. A weak allele of AtHKT1;1 that drives elevated leaf Na+ in this population has been previously linked to elevated salinity tolerance. Inspection of the geographical distribution of this allele revealed its significant enrichment in populations associated with the coast and saline soils in Europe. The fixation of this weak AtHKT1;1 allele in these populations is genetic evidence supporting local adaptation to these potentially saline impacted environments.
The unusual geographical distribution of certain animal and plant species has provided puzzling questions to the scientific community regarding the interrelationship of evolutionary and geographic histories for generations. With DNA sequencing, such puzzles have now extended to the geographical distribution of genetic variation within a species. Here, we explain one such puzzle in the European population of Arabidopsis thaliana, where we find that a version of a gene encoding for a sodium-transporter with reduced function is almost uniquely found in populations of this plant growing close to the coast or on known saline soils. This version of the gene has previously been linked with elevated salinity tolerance, and its unusual distribution in populations of plants growing in coastal regions and on saline soils suggests that it is playing a role in adapting these plants to the elevated salinity of their local environment.
Multiple reaction monitoring mass spectrometry (MRM-MS) is a technique for high-sensitivity targeted analysis. In proteomics, MRM-MS can be used to monitor and quantify a peptide based on the production of expected fragment peaks from the selected peptide precursor ion. The choice of which fragment ions to monitor in order to achieve maximum sensitivity in MRM-MS can potentially be guided by existing MS/MS spectra. However, because the majority of discovery experiments are performed on ion trap platforms, there is concern in the field regarding the generalizability of these spectra to MRM-MS on a triple quadrupole instrument. In light of this concern, many operators perform an optimization step to determine the most intense fragments for a target peptide on a triple quadrupole mass spectrometer. We have addressed this issue by targeting, on a triple quadrupole, the top six y-ion peaks from ion trap-derived consensus library spectra for 258 doubly charged peptides from three different sample sets and quantifying the observed elution curves. This analysis revealed a strong correlation between the y-ion peak rank order and relative intensity across platforms. This suggests that y-type ions obtained from ion trap-based library spectra are well-suited for generating MRM-MS assays for triple quadrupoles and that optimization is not required for each target peptide.
multiple reaction monitoring (MRM); selective reaction monitoring (SRM); triple quadrupole; ion trap; mass spectrometer; y-ions; spectral library; spectral correlation
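The comparison of y-ion rank order across platforms can be illustrated with a Spearman rank correlation. The sketch below uses invented intensities for the six y-ions of a single hypothetical peptide; it is not the authors' analysis code, and the simple ranking helper assumes there are no tied intensities.

```python
def ranks(values):
    """Rank values from 1 (most intense) downward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical intensities of the top six y-ions for one doubly charged peptide.
ion_trap = {"y4": 100.0, "y5": 80.0, "y6": 65.0, "y7": 40.0, "y8": 25.0, "y9": 10.0}
triple_quad = {"y4": 9.1e5, "y5": 9.8e5, "y6": 5.2e5, "y7": 3.0e5, "y8": 1.1e5, "y9": 0.6e5}

ions = sorted(ion_trap)
rho = spearman([ion_trap[i] for i in ions], [triple_quad[i] for i in ions])
print(f"Spearman rank correlation: {rho:.3f}")
```

In this invented example only the two most intense ions swap places, so the rank correlation stays close to 1, which is the kind of agreement that would support selecting transitions directly from library spectra.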
Metabolic profiling of urine presents challenges due to the extensive random variation of metabolite concentrations, and to dilution resulting from changes in the overall urine volume. Thus, statistical analysis methods play a particularly important role; however, appropriate choices of these methods are not straightforward. Here we investigate constant and variance-stabilization normalization of raw and peak picked spectra, for use with exploratory analysis (principal component analysis) and confirmatory analysis (ordinary and Empirical Bayes t-test) in 1H NMR-based metabolic profiling of urine. We compare the performance of these methods using urine samples spiked with known metabolites according to a Latin square design. We find that analysis of peak picked and log-transformed spectra is preferred, and that signal processing and statistical analysis steps are interdependent. While variance-stabilizing transformation is preferred in conjunction with principal component analysis, constant normalization is more appropriate for use with a t-test. Empirical Bayes t-test provides more reliable conclusions when the number of samples in each group is relatively small. Performance of these methods is illustrated using a clinical metabolomics experiment on patients with type 1 diabetes to evaluate the effect of insulin deprivation.
Metabolomics; Metabolite profiling; NMR spectroscopy; Normalization; Moderated t-test; Logarithmic transformation; Urine; Diabetes
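Part of the preprocessing chain discussed above can be sketched on a small table of peak intensities. The example assumes that "constant normalization" means scaling each spectrum to the same total intensity, which is one common reading but is an assumption here. The data, the spiked peak, and the function names are hypothetical, and variance-stabilization normalization and the Empirical Bayes t-test are not shown.

```python
import numpy as np

def constant_normalize(spectra):
    """Scale each spectrum (row) to unit total intensity to offset dilution."""
    totals = spectra.sum(axis=1, keepdims=True)
    return spectra / totals

def log_transform(spectra, offset=1e-6):
    """Natural log with a small offset to guard against zero intensities."""
    return np.log(spectra + offset)

def pca_scores(data, n_components=2):
    """Principal component scores computed from the SVD of column-centered data."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Four hypothetical urine samples sharing one profile at different dilutions.
rng = np.random.default_rng(1)
profile = rng.uniform(1.0, 10.0, size=20)
dilution = np.array([1.0, 0.5, 2.0, 0.8])
spectra = np.outer(dilution, profile)
spectra[3, 5] *= 3.0  # hypothetical spiked metabolite in the last sample

normalized = constant_normalize(spectra)
scores = pca_scores(log_transform(normalized))
print(np.round(scores, 3))
```

After normalization the three unspiked samples become identical despite their different dilutions, so the first principal component separates only the spiked sample.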
A proof-of-concept demonstration of the use of label-free quantitative glycoproteomics in a biomarker discovery workflow is presented here, using a mouse model for skin cancer as an example. Blood plasma was collected from 10 control mice, and 10 mice having a mutation in the p19ARF gene, which confers on them a high propensity to develop skin cancer after carcinogen exposure. We enriched for N-glycosylated plasma proteins, ultimately generating deglycosylated forms of the modified tryptic peptides for liquid chromatography mass spectrometry (LC-MS) analyses. LC-MS runs for each sample were then performed with a view to identifying proteins that were differentially abundant between the two mouse populations. We then used a recently developed computational framework, Corra, to perform peak picking and alignment, and to compute the statistical significance of any observed changes in individual peptide abundances. Once determined, the most discriminating peptide features were then fragmented and identified by tandem mass spectrometry with the use of inclusion lists. We next assessed the identified proteins to see if there were sets of proteins indicative of specific biological processes that correlate with the presence of disease, and specifically cancer, according to their functional annotations. As expected for such sick animals, many of the proteins identified were related to host immune response. However, a significant number of proteins were also directly associated with processes linked to cancer development, including proteins related to the cell cycle, localisation, transport, and cell death. Additional analysis of the same samples in profiling mode, and in triplicate, confirmed that replicate MS analysis of the same plasma sample generated less variation than that observed between plasma samples from different individuals, demonstrating that the reproducibility of the LC-MS platform was sufficient for this application.
These results thus show that an LC-MS-based workflow can be a useful tool for the generation of candidate proteins of interest as part of a disease biomarker discovery effort.
Skin cancer; LC-MS; Label-free protein quantification; Biomarker discovery; Systems biology; Targeted peptide sequencing; Glycoproteomics; Plasma
Immonium ions have been largely overlooked during the rapid expansion of mass spectrometry-based proteomics, mainly due to the dominance of ion trap instruments in the field. However, immonium ions are visible in hybrid quadrupole-time-of-flight (QTOF) mass spectrometers, which are now widely available. We have created the largest database to date of high-confidence sequence assignments to characterize the appearance of immonium ions in CID spectra using a QTOF instrument under “typical” operating conditions. With these data, we are able to demonstrate excellent correlation between immonium ion peak intensity and the likelihood of the appearance of the expected amino acid in the assigned sequence for phenylalanine, tyrosine, tryptophan, proline, histidine, valine, and the indistinguishable leucine and isoleucine residues. In addition, we have clearly demonstrated a positional effect whereby the proximity of the amino acid generating the immonium ion to the amino terminus of the peptide correlates with the strength of the immonium ion peak. This compositional information provided by the immonium ion peaks could substantially improve algorithms used for spectral assignment in mass spectrometry analysis using QTOF platforms.
Quantitative proteomics holds great promise for identifying proteins that are differentially abundant between populations representing different physiological or disease states. A range of computational tools is now available for both isotopically labeled and label-free liquid chromatography mass spectrometry (LC-MS) based quantitative proteomics. However, they are generally not comparable to each other in terms of functionality, user interfaces, or information input/output, and they do not readily facilitate appropriate statistical data analysis. These limitations, along with the array of choices, present a daunting prospect for biologists, and other researchers not trained in bioinformatics, who wish to use LC-MS-based quantitative proteomics.
We have developed Corra, a computational framework and tools for discovery-based LC-MS proteomics. Corra extends and adapts existing algorithms used for LC-MS-based proteomics, as well as statistical algorithms originally developed for microarray data analyses, to make them appropriate for LC-MS data analysis. Corra also adapts software engineering technologies (e.g. Google Web Toolkit, distributed processing) so that computationally intense data processing and statistical analyses can run on a remote server, while the user controls and manages the process from their own computer via a simple web interface. Corra also allows the user to output significantly differentially abundant LC-MS-detected peptide features in a form compatible with subsequent sequence identification via tandem mass spectrometry (MS/MS). We present two case studies to illustrate the application of Corra to commonly performed LC-MS-based biological workflows: a pilot biomarker discovery study of glycoproteins isolated from human plasma samples relevant to type 2 diabetes, and a study in yeast to identify in vivo targets of the protein kinase Ark1 via phosphopeptide profiling.
The Corra computational framework leverages computational innovation to enable biologists or other researchers to process, analyze and visualize LC-MS data with what would otherwise be a complex and not user-friendly suite of tools. Corra enables appropriate statistical analyses, with controlled false-discovery rates, ultimately to inform subsequent targeted identification of differentially abundant peptides by MS/MS. For the user not trained in bioinformatics, Corra represents a complete, customizable, free and open source computational platform enabling LC-MS-based proteomic workflows, and as such, addresses an unmet need in the LC-MS proteomics field.