Search tips
Search criteria

Results 1-25 (923309)

Clipboard (0)

Related Articles

1.  Assembling proteomics data as a prerequisite for the analysis of large scale experiments 
Despite the complete determination of the genome sequence of a huge number of bacteria, their proteomes remain relatively poorly defined. Beside new methods to increase the number of identified proteins new database applications are necessary to store and present results of large- scale proteomics experiments.
In the present study, a database concept has been developed to address these issues and to offer complete information via a web interface. In our concept, the Oracle based data repository system SQL-LIMS plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as 20S proteasome. Technical operations of our proteomics labs were used as the standard for SQL-LIMS template creation. By means of a Java based data parser, post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS. A minimum set of the proteomics data were transferred in our public 2D-PAGE database using a Java based interface (Data Transfer Tool) with the requirements of the PEDRo standardization. Furthermore, the stored proteomics data were extractable out of SQL-LIMS via XML.
The Oracle based data repository system SQL-LIMS played the central role in the proteomics workflow concept. Technical operations of our proteomics labs were used as standards for SQL-LIMS templates. Using a Java based parser, post-processed data of different approaches such as LC/ESI-MS, MALDI-MS and 1-DE and 2-DE were stored in SQL-LIMS. Thus, unique data formats of different instruments were unified and stored in SQL-LIMS tables. Moreover, a unique submission identifier allowed fast access to all experimental data. This was the main advantage compared to multi software solutions, especially if personnel fluctuations are high. Moreover, large scale and high-throughput experiments must be managed in a comprehensive repository system such as SQL-LIMS, to query results in a systematic manner. On the other hand, these database systems are expensive and require at least one full time administrator and specialized lab manager. Moreover, the high technical dynamics in proteomics may cause problems to adjust new data formats. To summarize, SQL-LIMS met the requirements of proteomics data handling especially in skilled processes such as gel-electrophoresis or mass spectrometry and fulfilled the PSI standardization criteria. The data transfer into a public domain via DTT facilitated validation of proteomics data. Additionally, evaluation of mass spectra by post-processing using MS-Screener improved the reliability of mass analysis and prevented storage of data junk.
PMCID: PMC2653022  PMID: 19166578
2.  An XML standard for the dissemination of annotated 2D gel electrophoresis data complemented with mass spectrometry results 
BMC Bioinformatics  2004;5:9.
Many proteomics initiatives require a seamless bioinformatics integration of a range of analytical steps between sample collection and systems modeling immediately assessable to the participants involved in the process. Proteomics profiling by 2D gel electrophoresis to the putative identification of differentially expressed proteins by comparison of mass spectrometry results with reference databases, includes many components of sample processing, not just analysis and interpretation, are regularly revisited and updated. In order for such updates and dissemination of data, a suitable data structure is needed. However, there are no such data structures currently available for the storing of data for multiple gels generated through a single proteomic experiments in a single XML file. This paper proposes a data structure based on XML standards to fill the void that exists between data generated by proteomics experiments and storing of data.
In order to address the resulting procedural fluidity we have adopted and implemented a data model centered on the concept of annotated gel (AG) as the format for delivery and management of 2D Gel electrophoresis results. An eXtensible Markup Language (XML) schema is proposed to manage, analyze and disseminate annotated 2D Gel electrophoresis results. The structure of AG objects is formally represented using XML, resulting in the definition of the AGML syntax presented here.
The proposed schema accommodates data on the electrophoresis results as well as the mass-spectrometry analysis of selected gel spots. A web-based software library is being developed to handle data storage, analysis and graphic representation. Computational tools described will be made available at . Our development of AGML provides a simple data structure for storing 2D gel electrophoresis data.
PMCID: PMC341449  PMID: 15005801
3.  NCBI GEO: archive for high-throughput functional genomic data 
Nucleic Acids Research  2008;37(Database issue):D885-D890.
The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) is the largest public repository for high-throughput gene expression data. Additionally, GEO hosts other categories of high-throughput functional genomic data, including those that examine genome copy number variations, chromatin structure, methylation status and transcription factor binding. These data are generated by the research community using high-throughput technologies like microarrays and, more recently, next-generation sequencing. The database has a flexible infrastructure that can capture fully annotated raw and processed data, enabling compliance with major community-derived scientific reporting standards such as ‘Minimum Information About a Microarray Experiment’ (MIAME). In addition to serving as a centralized data storage hub, GEO offers many tools and features that allow users to effectively explore, analyze and download expression data from both gene-centric and experiment-centric perspectives. This article summarizes the GEO repository structure, content and operating procedures, as well as recently introduced data mining features. GEO is freely accessible at
PMCID: PMC2686538  PMID: 18940857
4.  VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data 
BMC Genomics  2012;13:131.
The procedural aspects of genome sequencing and assembly have become relatively inexpensive, yet the full, accurate structural annotation of these genomes remains a challenge. Next-generation sequencing transcriptomics (RNA-Seq), global microarrays, and tandem mass spectrometry (MS/MS)-based proteomics have demonstrated immense value to genome curators as individual sources of information, however, integrating these data types to validate and improve structural annotation remains a major challenge. Current visual and statistical analytic tools are focused on a single data type, or existing software tools are retrofitted to analyze new data forms. We present Visual Exploration and Statistics to Promote Annotation (VESPA) is a new interactive visual analysis software tool focused on assisting scientists with the annotation of prokaryotic genomes though the integration of proteomics and transcriptomics data with current genome location coordinates.
VESPA is a desktop Java™ application that integrates high-throughput proteomics data (peptide-centric) and transcriptomics (probe or RNA-Seq) data into a genomic context, all of which can be visualized at three levels of genomic resolution. Data is interrogated via searches linked to the genome visualizations to find regions with high likelihood of mis-annotation. Search results are linked to exports for further validation outside of VESPA or potential coding-regions can be analyzed concurrently with the software through interaction with BLAST. VESPA is demonstrated on two use cases (Yersinia pestis Pestoides F and Synechococcus sp. PCC 7002) to demonstrate the rapid manner in which mis-annotations can be found and explored in VESPA using either proteomics data alone, or in combination with transcriptomic data.
VESPA is an interactive visual analytics tool that integrates high-throughput data into a genomic context to facilitate the discovery of structural mis-annotations in prokaryotic genomes. Data is evaluated via visual analysis across multiple levels of genomic resolution, linked searches and interaction with existing bioinformatics tools. We highlight the novel functionality of VESPA and core programming requirements for visualization of these large heterogeneous datasets for a client-side application. The software is freely available at
PMCID: PMC3364912  PMID: 22480257
5.  ISPIDER Central: an integrated database web-server for proteomics 
Nucleic Acids Research  2008;36(Web Server issue):W485-W490.
Despite the growing volumes of proteomic data, integration of the underlying results remains problematic owing to differences in formats, data captured, protein accessions and services available from the individual repositories. To address this, we present the ISPIDER Central Proteomic Database search (, an integration service offering novel search capabilities over leading, mature, proteomic repositories including PRoteomics IDEntifications database (PRIDE), PepSeeker, PeptideAtlas and the Global Proteome Machine. It enables users to search for proteins and peptides that have been characterised in mass spectrometry-based proteomics experiments from different groups, stored in different databases, and view the collated results with specialist viewers/clients. In order to overcome limitations imposed by the great variability in protein accessions used by individual laboratories, the European Bioinformatics Institute's Protein Identifier Cross-Reference (PICR) service is used to resolve accessions from different sequence repositories. Custom-built clients allow users to view peptide/protein identifications in different contexts from multiple experiments and repositories, as well as integration with the Dasty2 client supporting any annotations available from Distributed Annotation System servers. Further information on the protein hits may also be added via external web services able to take a protein as input. This web server offers the first truly integrated access to proteomics repositories and provides a unique service to biologists interested in mass spectrometry-based proteomics.
PMCID: PMC2447780  PMID: 18440977
6.  A Novel Cross-Disciplinary Multi-Institute Approach to Translational Cancer Research: Lessons Learned from Pennsylvania Cancer Alliance Bioinformatics Consortium (PCABC) 
Cancer Informatics  2007;3:255-274.
The Pennsylvania Cancer Alliance Bioinformatics Consortium (PCABC, is one of the first major project-based initiatives stemming from the Pennsylvania Cancer Alliance that was funded for four years by the Department of Health of the Commonwealth of Pennsylvania. The objective of this was to initiate a prototype biorepository and bioinformatics infrastructure with a robust data warehouse by developing a statewide data model (1) for bioinformatics and a repository of serum and tissue samples; (2) a data model for biomarker data storage; and (3) a public access website for disseminating research results and bioinformatics tools. The members of the Consortium cooperate closely, exploring the opportunity for sharing clinical, genomic and other bioinformatics data on patient samples in oncology, for the purpose of developing collaborative research programs across cancer research institutions in Pennsylvania. The Consortium’s intention was to establish a virtual repository of many clinical specimens residing in various centers across the state, in order to make them available for research. One of our primary goals was to facilitate the identification of cancer-specific biomarkers and encourage collaborative research efforts among the participating centers.
The PCABC has developed unique partnerships so that every region of the state can effectively contribute and participate. It includes over 80 individuals from 14 organizations, and plans to expand to partners outside the State. This has created a network of researchers, clinicians, bioinformaticians, cancer registrars, program directors, and executives from academic and community health systems, as well as external corporate partners - all working together to accomplish a common mission.
The various sub-committees have developed a common IRB protocol template, common data elements for standardizing data collections for three organ sites, intellectual property/tech transfer agreements, and material transfer agreements that have been approved by each of the member institutions. This was the foundational work that has led to the development of a centralized data warehouse that has met each of the institutions’ IRB/HIPAA standards.
Currently, this “virtual biorepository” has over 58,000 annotated samples from 11,467 cancer patients available for research purposes. The clinical annotation of tissue samples is either done manually over the internet or semi-automated batch modes through mapping of local data elements with PCABC common data elements. The database currently holds information on 7188 cases (associated with 9278 specimens and 46,666 annotated blocks and blood samples) of prostate cancer, 2736 cases (associated with 3796 specimens and 9336 annotated blocks and blood samples) of breast cancer and 1543 cases (including 1334 specimens and 2671 annotated blocks and blood samples) of melanoma. These numbers continue to grow, and plans to integrate new tumor sites are in progress. Furthermore, the group has also developed a central web-based tool that allows investigators to share their translational (genomics/proteomics) experiment data on research evaluating potential biomarkers via a central location on the Consortium’s web site.
The technological achievements and the statewide informatics infrastructure that have been established by the Consortium will enable robust and efficient studies of biomarkers and their relevance to the clinical course of cancer. Studies resulting from the creation of the Consortium may allow for better classification of cancer types, more accurate assessment of disease prognosis, a better ability to identify the most appropriate individuals for clinical trial participation, and better surrogate markers of disease progression and/or response to therapy.
PMCID: PMC2675833  PMID: 19455246
7.  Proposal for a Standard Representation of Two-Dimensional Gel Electrophoresis Data 
The global analysis of proteins is now feasible due to improvements in techniques such as two-dimensional gel electrophoresis (2-DE), mass spectrometry, yeast two-hybrid systems and the development of bioinformatics applications. The experiments form the basis of proteomics, and present significant challenges in data analysis, storage and querying. We argue that a standard format for proteome data is required to enable the storage, exchange and subsequent re-analysis of large datasets. We describe the criteria that must be met for the development of a standard for proteomics. We have developed a model to represent data from 2-DE experiments, including difference gel electrophoresis along with image analysis and statistical analysis across multiple gels. This part of proteomics analysis is not represented in current proposals for proteomics standards. We are working with the Proteomics Standards Initiative to develop a model encompassing biological sample origin, experimental protocols, a number of separation techniques and mass spectrometry. The standard format will facilitate the development of central repositories of data, enabling results to be verified or re-analysed, and the correlation of results produced by different research groups using a variety of laboratory techniques.
PMCID: PMC2447283  PMID: 18628860
8.  An informatic pipeline for the data capture and submission of quantitative proteomic data using iTRAQTM 
Proteome Science  2007;5:4.
Proteomics continues to play a critical role in post-genomic science as continued advances in mass spectrometry and analytical chemistry support the separation and identification of increasing numbers of peptides and proteins from their characteristic mass spectra. In order to facilitate the sharing of this data, various standard formats have been, and continue to be, developed. Still not fully mature however, these are not yet able to cope with the increasing number of quantitative proteomic technologies that are being developed.
We propose an extension to the PRIDE and mzData XML schema to accommodate the concept of multiple samples per experiment, and in addition, capture the intensities of the iTRAQTM reporter ions in the entry. A simple Java-client has been developed to capture and convert the raw data from common spectral file formats, which also uses a third-party open source tool for the generation of iTRAQTM reported intensities from Mascot output, into a valid PRIDE XML entry.
We describe an extension to the PRIDE and mzData schemas to enable the capture of quantitative data. Currently this is limited to iTRAQTM data but is readily extensible for other quantitative proteomic technologies. Furthermore, a software tool has been developed which enables conversion from various mass spectrum file formats and corresponding Mascot peptide identifications to PRIDE formatted XML. The tool represents a simple approach to preparing quantitative and qualitative data for submission to repositories such as PRIDE, which is necessary to facilitate data deposition and sharing in public domain database. The software is freely available from .
PMCID: PMC1796855  PMID: 17270041
9.  Absolute quantification of microbial proteomes at different states by directed mass spectrometry 
The developed, directed mass spectrometry workflow allows to generate consistent and system-wide quantitative maps of microbial proteomes in a single analysis. Application to the human pathogen L. interrogans revealed mechanistic proteome changes over time involved in pathogenic progression and antibiotic defense, and new insights about the regulation of absolute protein abundances within operons.
The developed, directed proteomic approach allowed consistent detection and absolute quantification of 1680 proteins of the human pathogen L. interrogans in a single LC–MS/MS experiment.The comparison of 25 extensive, consistent and quantitative proteome maps revealed new insights about the proteome changes involved in pathogenic progression and antibiotic defense of L. interrogans, and about the regulation of protein abundances within operons.The generated time-resolved data sets are compatible with pattern analysis algorithms developed for transcriptomics, including hierarchical clustering and functional enrichment analysis of the detected profile clusters.This is the first study that describes the absolute quantitative behavior of any proteome over multiple states and represents the most comprehensive proteome abundance pattern comparison for any organism to date.
Over the last decade, mass spectrometry (MS)-based proteomics has evolved as the method of choice for system-wide proteome studies and now allows for the characterization of several thousands of proteins in a single sample. Despite these great advances, redundant monitoring of protein levels over large sample numbers in a high-throughput manner remains a challenging task. New directed MS strategies have shown to overcome some of the current limitations, thereby enabling the acquisition of consistent and system-wide data sets of proteomes with low-to-moderate complexity at high throughput.
In this study, we applied this integrated, two-stage MS strategy to investigate global proteome changes in the human pathogen L. interrogans. In the initial discovery phase, 1680 proteins (out of around 3600 gene products) could be identified (Schmidt et al, 2008) and, by focusing precious MS-sequencing time on the most dominant, specific peptides per protein, all proteins could be accurately and consistently monitored over 25 different samples within a few days of instrument time in the following scoring phase (Figure 1). Additionally, the co-analysis of heavy reference peptides enabled us to obtain absolute protein concentration estimates for all identified proteins in each perturbation (Malmström et al, 2009). The detected proteins did not show any biases against functional groups or protein classes, including membrane proteins, and span an abundance range of more than three orders of magnitude, a range that is expected to cover most of the L. interrogans proteome (Malmström et al, 2009).
To elucidate mechanistic proteome changes over time involved in pathogenic progression and antibiotic defense of L. interrogans, we generated time-resolved proteome maps of cells perturbed with serum and three different antibiotics at sublethal concentrations that are currently used to treat Leptospirosis. This yielded an information-rich proteomic data set that describes, for the first time, the absolute quantitative behavior of any proteome over multiple states, and represents the most comprehensive proteome abundance pattern comparison for any organism to date. Using this unique property of the data set, we could quantify protein components of entire pathways across several time points and subject the data sets to cluster analysis, a tool that was previously limited to the transcript level due to incomplete sampling on protein level (Figure 4). Based on these analyses, we could demonstrate that Leptospira cells adjust the cellular abundance of a certain subset of proteins and pathways as a general response to stress while other parts of the proteome respond highly specific. The cells furthermore react to individual treatments by ‘fine tuning' the abundance of certain proteins and pathways in order to cope with the specific cause of stress. Intriguingly, the most specific and significant expression changes were observed for proteins involved in motility, tissue penetration and virulence after serum treatment where we tried to simulate the host environment. While many of the detected protein changes demonstrate good agreement with available transcriptomics data, most proteins showed a poor correlation. This includes potential virulence factors, like Loa22 or OmpL1, with confirmed expression in vivo that were significantly up-regulated on the protein level, but not on the mRNA level, strengthening the importance of proteomic studies. The high resolution and coverage of the proteome data set enabled us to further investigate protein abundance changes of co-regulated genes within operons. This suggests that although most proteins within an operon respond to regulation synchronously, bacterial cells seem to have subtle means to adjust the levels of individual proteins or protein groups outside of the general trend, a phenomena that was recently also observed on the transcript level of other bacteria (Güell et al, 2009).
The method can be implemented with standard high-resolution mass spectrometers and software tools that are readily available in the majority of proteomics laboratories. It is scalable to any proteome of low-to-medium complexity and can be extended to post-translational modifications or peptide-labeling strategies for quantification. We therefore expect the approach outlined here to become a cornerstone for microbial systems biology.
Over the past decade, liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS) has evolved into the main proteome discovery technology. Up to several thousand proteins can now be reliably identified from a sample and the relative abundance of the identified proteins can be determined across samples. However, the remeasurement of substantially similar proteomes, for example those generated by perturbation experiments in systems biology, at high reproducibility and throughput remains challenging. Here, we apply a directed MS strategy to detect and quantify sets of pre-determined peptides in tryptic digests of cells of the human pathogen Leptospira interrogans at 25 different states. We show that in a single LC–MS/MS experiment around 5000 peptides, covering 1680 L. interrogans proteins, can be consistently detected and their absolute expression levels estimated, revealing new insights about the proteome changes involved in pathogenic progression and antibiotic defense of L. interrogans. This is the first study that describes the absolute quantitative behavior of any proteome over multiple states, and represents the most comprehensive proteome abundance pattern comparison for any organism to date.
PMCID: PMC3159967  PMID: 21772258
absolute quantification; directed mass spectrometry; Leptospira interrogans; microbiology; proteomics
10.  BioMoby extensions to the Taverna workflow management and enactment software 
BMC Bioinformatics  2006;7:523.
As biology becomes an increasingly computational science, it is critical that we develop software tools that support not only bioinformaticians, but also bench biologists in their exploration of the vast and complex data-sets that continue to build from international genomic, proteomic, and systems-biology projects. The BioMoby interoperability system was created with the goal of facilitating the movement of data from one Web-based resource to another to fulfill the requirements of non-expert bioinformaticians. In parallel with the development of BioMoby, the European myGrid project was designing Taverna, a bioinformatics workflow design and enactment tool. Here we describe the marriage of these two projects in the form of a Taverna plug-in that provides access to many of BioMoby's features through the Taverna interface.
The exposed BioMoby functionality aids in the design of "sensible" BioMoby workflows, aids in pipelining BioMoby and non-BioMoby-based resources, and ensures that end-users need only a minimal understanding of both BioMoby, and the Taverna interface itself. Users are guided through the construction of syntactically and semantically correct workflows through plug-in calls to the Moby Central registry. Moby Central provides a menu of only those BioMoby services capable of operating on the data-type(s) that exist at any given position in the workflow. Moreover, the plug-in automatically and correctly connects a selected service into the workflow such that users are not required to understand the nature of the inputs or outputs for any service, leaving them to focus on the biological meaning of the workflow they are constructing, rather than the technical details of how the services will interoperate.
With the availability of the BioMoby plug-in to Taverna, we believe that BioMoby-based Web Services are now significantly more useful and accessible to bench scientists than are more traditional Web Services.
PMCID: PMC1693925  PMID: 17137515
11.  The development and deployment of Common Data Elements for tissue banks for translational research in cancer – An emerging standard based approach for the Mesothelioma Virtual Tissue Bank 
BMC Cancer  2008;8:91.
Recent advances in genomics, proteomics, and the increasing demands for biomarker validation studies have catalyzed changes in the landscape of cancer research, fueling the development of tissue banks for translational research. A result of this transformation is the need for sufficient quantities of clinically annotated and well-characterized biospecimens to support the growing needs of the cancer research community. Clinical annotation allows samples to be better matched to the research question at hand and ensures that experimental results are better understood and can be verified. To facilitate and standardize such annotation in bio-repositories, we have combined three accepted and complementary sets of data standards: the College of American Pathologists (CAP) Cancer Checklists, the protocols recommended by the Association of Directors of Anatomic and Surgical Pathology (ADASP) for pathology data, and the North American Association of Central Cancer Registry (NAACCR) elements for epidemiology, therapy and follow-up data. Combining these approaches creates a set of International Standards Organization (ISO) – compliant Common Data Elements (CDEs) for the mesothelioma tissue banking initiative supported by the National Institute for Occupational Safety and Health (NIOSH) of the Center for Disease Control and Prevention (CDC).
The purpose of the project is to develop a core set of data elements for annotating mesothelioma specimens, following standards established by the CAP checklist, ADASP cancer protocols, and the NAACCR elements. We have associated these elements with modeling architecture to enhance both syntactic and semantic interoperability. The system has a Java-based multi-tiered architecture based on Unified Modeling Language (UML).
Common Data Elements were developed using controlled vocabulary, ontology and semantic modeling methodology. The CDEs for each case are of different types: demographic, epidemiologic data, clinical history, pathology data including block level annotation, and follow-up data including treatment, recurrence and vital status. The end result of such an effort would eventually provide an increased sample set to the researchers, and makes the system interoperable between institutions.
The CAP, ADASP and the NAACCR elements represent widely established data elements that are utilized in many cancer centers. Herein, we have shown these representations can be combined and formalized to create a core set of annotations for banked mesothelioma specimens. Because these data elements are collected as part of the normal workflow of a medical center, data sets developed on the basis of these elements can be easily implemented and maintained.
PMCID: PMC2329649  PMID: 18397527
12.  NuChart: An R Package to Study Gene Spatial Neighbourhoods with Multi-Omics Annotations 
PLoS ONE  2013;8(9):e75146.
Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between colocalization and coregulation of genes, but these important researches are hampered by the lack of biologists-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in bed format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparisons of cells in different conditions, thus providing the possibility of novel biomarkers identification. NuChart is compliant with the Bioconductor standard and it is freely available at
PMCID: PMC3777921  PMID: 24069388
13.  A global approach to analysis and interpretation of metabolic data for plant natural product discovery† 
Natural product reports  2013;30(4):565-583.
Discovering molecular components and their functionality is key to the development of hypotheses concerning the organization and regulation of metabolic networks. The iterative experimental testing of such hypotheses is the trajectory that can ultimately enable accurate computational modelling and prediction of metabolic outcomes. This information can be particularly important for understanding the biology of natural products, whose metabolism itself is often only poorly defined. Here, we describe factors that must be in place to optimize the use of metabolomics in predictive biology. A key to achieving this vision is a collection of accurate time-resolved and spatially defined metabolite abundance data and associated metadata. One formidable challenge associated with metabolite profiling is the complexity and analytical limits associated with comprehensively determining the metabolome of an organism. Further, for metabolomics data to be efficiently used by the research community, it must be curated in publically available metabolomics databases. Such databases require clear, consistent formats, easy access to data and metadata, data download, and accessible computational tools to integrate genome system-scale datasets. Although transcriptomics and proteomics integrate the linear predictive power of the genome, the metabolome represents the nonlinear, final biochemical products of the genome, which results from the intricate system(s) that regulate genome expression. For example, the relationship of metabolomics data to the metabolic network is confounded by redundant connections between metabolites and gene-products. However, connections among metabolites are predictable through the rules of chemistry. Therefore, enhancing the ability to integrate the metabolome with anchor-points in the transcriptome and proteome will enhance the predictive power of genomics data. We detail a public database repository for metabolomics, tools and approaches for statistical analysis of metabolomics data, and methods for integrating these dataset with transcriptomic data to create hypotheses concerning specialized metabolism that generates the diversity in natural product chemistry. We discuss the importance of close collaborations among biologists, chemists, computer scientists and statisticians throughout the development of such integrated metabolism-centric databases and software.
PMCID: PMC3629923  PMID: 23447050
14.  PrestOMIC, an open source application for dissemination of proteomic datasets by individual laboratories 
Proteome Science  2007;5:8.
Technological advances in mass spectrometry and other detection methods are leading to larger and larger proteomics datasets. However, when papers describing such information are published the enormous volume of data can typically only be provided as supplementary data in a tabular form through the journal website. Several journals in the proteomics field, together with the Human Proteome Organization's (HUPO) Proteomics Standards Initiative and institutions such as the Institute for Systems Biology are working towards standardizing the reporting of proteomics data, but just defining standards is only a means towards an end for sharing data. Data repositories such as and the Open Proteomics Database allow for public access to proteomics data but provide little, if any, interpretation.
Results & conclusion
Here we describe PrestOMIC, an open source application for storing mass spectrometry-based proteomic data in a relational database and for providing a user-friendly, searchable and customizable browser interface to share one's data with the scientific community. The underlying database and all associated applications are built on other existing open source tools, allowing PrestOMIC to be modified as the data standards evolve. We then use PrestOMIC to present a recently published dataset from our group through our website.
PMCID: PMC1892544  PMID: 17553161
15.  How to Generate High Quality Protein Interaction Maps 
Affinity purification followed by mass spectrometry (AP-MS) has become a commonly used method for the identification of protein-protein interactions and protein complexes. We will start with a review of the most commonly used experimental AP-MS workflows, with an emphasis on the experimental design and data analysis challenges typically encountered in such studies. One of the foremost challenges of interactome mapping is a large number of false positive protein interactions present in unfiltered datasets. We will review computational and informatics strategies for detecting specific protein interaction partners in AP-MS experiments, with a focus on incomplete (as opposite to genome-wide) interactome mapping studies. These strategies range from standard statistical approaches, to empirical scoring schemes optimized for a particular type of data, to advanced computational frameworks. The common denominator among these methods is the use of label-free quantitative information such as spectral counts or peptide ion intensities that can be extracted from MS data. We will discuss in more detail the current state of the computational tool SAINT developed in our lab. We will present its extension to intensity-based data, and compare the two quantitative strategies (spectral counts and intensities) in the context of AP-MS studies. We will also discuss related issues such as combining multiple biological or technical replicates, and dealing with data generated using different tagging strategies. We then present a new resource – the Contaminant Repository for Affinity Purification – a central repository to store, annotate, statistically analyze and disseminate lists of background contaminants likely to be observed in AP-MS studies. We will show how the contaminant repository, coupled with statistical scoring tools such as SAINT, can significantly improve the ability of individual researchers, especially in small-scale studies, to filter out likely false interactions based on the analysis of protein abundance profiles across multiple control experiments annotated in the repository.
PMCID: PMC3635379
16.  Managing Complexity - How Many Platforms Do We Need for Metabolomics? 
Metabolomics has matured over the past 10 years. By combining different platforms, over 2,000 identified metabolites can be screened. At the UC Davis Genome Center Metabolomics Facility, two laboratories work towards advancing methods and reaching out services, the Fiehn research laboratory and the metabolomics core. We have a combined use of 11 mass spectrometers for which a range of SOPs and quality controls have been developed for (a) primary metabolism, (b) volatile metabolites, (c) lipidomics, (d) secondary metabolites and (e) metabolic polymers. Over 300 studies have been completed over the past 5 years which are stored and disseminated via the SetupX study design database and facilitated by the BinBase mass spectrometry repositories. Lipid identifications by nanoESI-ion trap MS/MS are based on Genedata's MS Refiner software and a novel cross-instrument library, the LipidBLAST tool that stores calculated MS/MS spectra of over 180,000 lipids based on fragmentation patterns of authentic standards. The FiehnLib libraries of over 1,000 primary metabolites authenticate identifications in GC-TOF platforms, in conjunction with BinBase and the Adams volatile MS library. Polymers in biofuel research are assessed by pyrolysis-GC/MS and the MIT-based SpectConnect tool. LC-ion trap, Qtrap and QTOF mass spectrometry are used for determining compounds that are not amenable by one of the above methods, such as cationic metabolites (SAM, betaine, SMM), metabolic active biomarkers (acylcarnitines) and other important metabolic classes (dietary phytochemicals, folates and glucuronides). Despite this progress, metabolomics still faces a number of analytical challenges: the need for accuracy in structural identifications and quantifications, increases in total peak capacities, improved data processing software and the need for standardized database repositories. Current efforts are presented as well as a discussion on experiences in the dual task of ‘research’ and ‘service’ for metabolomic facilities and how to meet outside expectations and financial constraints.
PMCID: PMC2918200
17.  P19-S Managing Proteomics Data from Data Generation and Data Warehousing to Central Data Repository and Journal Reviewing Processes 
In today’s proteomics research, various techniques and instrumentation bioinformatics tools are necessary to manage the large amount of heterogeneous data with an automatic quality control to produce reliable and comparable results. Therefore a data-processing pipeline is mandatory for data validation and comparison in a data-warehousing system. The proteome bioinformatics platform ProteinScape has been proven to cover these needs. The reprocessing of HUPO BPP participants’ MS data was done within ProteinScape. The reprocessed information was transferred into the global data repository PRIDE.
ProteinScape as a data-warehousing system covers two main aspects: archiving relevant data of the proteomics workflow and information extraction functionality (protein identification, quantification and generation of biological knowledge). As a strategy for automatic data validation, different protein search engines are integrated. Result analysis is performed using a decoy database search strategy, which allows the measurement of the false-positive identification rate. Peptide identifications across different workflows, different MS techniques, and different search engines are merged to obtain a quality-controlled protein list.
The proteomics identifications database (PRIDE), as a public data repository, is an archiving system where data are finally stored and no longer changed by further processing steps. Data submission to PRIDE is open to proteomics laboratories generating protein and peptide identifications. An export tool has been developed for transferring all relevant HUPO BPP data from ProteinScape into PRIDE using the PRIDE.xml format.
The EU-funded ProDac project will coordinate the development of software tools covering international standards for the representation of proteomics data. The implementation of data submission pipelines and systematic data collection in public standards–compliant repositories will cover all aspects, from the generation of MS data in each laboratory to the conversion of all the annotating information and identifications to a standardized format. Such datasets can be used in the course of publishing in scientific journals.
PMCID: PMC2291891
18.  Making proteomics data accessible and reusable: Current state of proteomics databases and repositories 
Proteomics  2015;15(5-6):930-950.
Compared to other data-intensive disciplines such as genomics, public deposition and storage of MS-based proteomics, data are still less developed due to, among other reasons, the inherent complexity of the data and the variety of data types and experimental workflows. In order to address this need, several public repositories for MS proteomics experiments have been developed, each with different purposes in mind. The most established resources are the Global Proteome Machine Database (GPMDB), PeptideAtlas, and the PRIDE database. Additionally, there are other useful (in many cases recently developed) resources such as ProteomicsDB, Mass Spectrometry Interactive Virtual Environment (MassIVE), Chorus, MaxQB, PeptideAtlas SRM Experiment Library (PASSEL), Model Organism Protein Expression Database (MOPED), and the Human Proteinpedia. In addition, the ProteomeXchange consortium has been recently developed to enable better integration of public repositories and the coordinated sharing of proteomics information, maximizing its benefit to the scientific community. Here, we will review each of the major proteomics resources independently and some tools that enable the integration, mining and reuse of the data. We will also discuss some of the major challenges and current pitfalls in the integration and sharing of the data.
PMCID: PMC4409848  PMID: 25158685
Bioinformatics; Databases; MS; Repositories
19.  Methods for visual mining of genomic and proteomic data atlases 
BMC Bioinformatics  2012;13:58.
As the volume, complexity and diversity of the information that scientists work with on a daily basis continues to rise, so too does the requirement for new analytic software. The analytic software must solve the dichotomy that exists between the need to allow for a high level of scientific reasoning, and the requirement to have an intuitive and easy to use tool which does not require specialist, and often arduous, training to use. Information visualization provides a solution to this problem, as it allows for direct manipulation and interaction with diverse and complex data. The challenge addressing bioinformatics researches is how to apply this knowledge to data sets that are continually growing in a field that is rapidly changing.
This paper discusses an approach to the development of visual mining tools capable of supporting the mining of massive data collections used in systems biology research, and also discusses lessons that have been learned providing tools for both local researchers and the wider community. Example tools were developed which are designed to enable the exploration and analyses of both proteomics and genomics based atlases. These atlases represent large repositories of raw and processed experiment data generated to support the identification of biomarkers through mass spectrometry (the PeptideAtlas) and the genomic characterization of cancer (The Cancer Genome Atlas). Specifically the tools are designed to allow for: the visual mining of thousands of mass spectrometry experiments, to assist in designing informed targeted protein assays; and the interactive analysis of hundreds of genomes, to explore the variations across different cancer genomes and cancer types.
The mining of massive repositories of biological data requires the development of new tools and techniques. Visual exploration of the large-scale atlas data sets allows researchers to mine data to find new meaning and make sense at scales from single samples to entire populations. Providing linked task specific views that allow a user to start from points of interest (from diseases to single genes) enables targeted exploration of thousands of spectra and genomes. As the composition of the atlases changes, and our understanding of the biology increase, new tasks will continually arise. It is therefore important to provide the means to make the data available in a suitable manner in as short a time as possible. We have done this through the use of common visualization workflows, into which we rapidly deploy visual tools. These visualizations follow common metaphors where possible to assist users in understanding the displayed data. Rapid development of tools and task specific views allows researchers to mine large-scale data almost as quickly as it is produced. Ultimately these visual tools enable new inferences, new analyses and further refinement of the large scale data being provided in atlases such as PeptideAtlas and The Cancer Genome Atlas.
PMCID: PMC3352268  PMID: 22524279
20.  MASPECTRAS: a platform for management and analysis of proteomics LC-MS/MS data 
BMC Bioinformatics  2007;8:197.
The advancements of proteomics technologies have led to a rapid increase in the number, size and rate at which datasets are generated. Managing and extracting valuable information from such datasets requires the use of data management platforms and computational approaches.
We have developed the MAss SPECTRometry Analysis System (MASPECTRAS), a platform for management and analysis of proteomics LC-MS/MS data. MASPECTRAS is based on the Proteome Experimental Data Repository (PEDRo) relational database schema and follows the guidelines of the Proteomics Standards Initiative (PSI). Analysis modules include: 1) import and parsing of the results from the search engines SEQUEST, Mascot, Spectrum Mill, X! Tandem, and OMSSA; 2) peptide validation, 3) clustering of proteins based on Markov Clustering and multiple alignments; and 4) quantification using the Automated Statistical Analysis of Protein Abundance Ratios algorithm (ASAPRatio). The system provides customizable data retrieval and visualization tools, as well as export to PRoteomics IDEntifications public repository (PRIDE). MASPECTRAS is freely available at
Given the unique features and the flexibility due to the use of standard software technology, our platform represents significant advance and could be of great interest to the proteomics community.
PMCID: PMC1906842  PMID: 17567892
21.  Chromosome 19 Annotations with Disease Speciation – A First Report from the Global Research Consortium 
Journal of proteome research  2012;12(1):135-150.
A first research development progress report of the Chromosome 19 Consortium with members from Sweden, Norway, Spain, USA, China and India, a part of the Chromosome-Centric Human Proteome Project (C-HPP) global initiative is presented ( From the chromosome 19 peptide-targeted library constituting 6159 peptides, a pilot study was conducted using a sub-set with 125 isotope-labeled peptides. We applied an annotation strategy with triple quadrupole, ESI-Qtrap, and MALDI mass spectrometry platforms, comparing the quality of data within, and in-between these instrumental set-ups. LC-MS conditions were outlined by multiplex assay developments, followed by MRM assay developments. SRM was applied to biobank samples, quantifying kallikrein 3 (prostate specific antigen) in plasma from prostate cancer patients. The antibody production has been initiated for more than 1200 genes from the entire chromosome 19, and the progress developments are presented. We developed a dedicated transcript microarray, to serve as the mRNA identifier by screening cancer cell lines. NAPPA protein arrays were built to align with the transcript data with the Chromosome 19 NAPPA chip, dedicated to 90 proteins, as the first development delivery. We have introduced an IT-infrastructure utilizing a LIMS system that serves as the key interface for the research teams in order to share and explore data generated within the project. The cross-site data repository will form the basis for sample processing, including biological samples, as well as patient samples from national Biobanks.
PMCID: PMC3539432  PMID: 23249167
Proteins; Genes; Antibodies; mRNA; Mass spectrometry; Bioinformatics; Protein microarray; Human Disease
22.  The central nervous system transcriptome of the weakly electric brown ghost knifefish (Apteronotus leptorhynchus): de novo assembly, annotation, and proteomics validation 
BMC Genomics  2015;16(1):166.
The brown ghost knifefish (Apteronotus leptorhynchus) is a weakly electric teleost fish of particular interest as a versatile model system for a variety of research areas in neuroscience and biology. The comprehensive information available on the neurophysiology and neuroanatomy of this organism has enabled significant advances in such areas as the study of the neural basis of behavior, the development of adult-born neurons in the central nervous system and their involvement in the regeneration of nervous tissue, as well as brain aging and senescence. Despite substantial scientific interest in this species, no genomic resources are currently available.
Here, we report the de novo assembly and annotation of the A. leptorhynchus transcriptome. After evaluating several trimming and transcript reconstruction strategies, de novo assembly using Trinity uncovered 42,459 unique contigs containing at least a partial protein-coding sequence based on alignment to a reference set of known Actinopterygii sequences. As many as 11,847 of these contigs contained full or near-full length protein sequences, providing broad coverage of the proteome. A variety of non-coding RNA sequences were also identified and annotated, including conserved long intergenic non-coding RNA and other long non-coding RNA observed previously to be expressed in adult zebrafish (Danio rerio) brain, as well as a variety of miRNA, snRNA, and snoRNA. Shotgun proteomics confirmed translation of open reading frames from over 2,000 transcripts, including alternative splice variants. Assignment of tandem mass spectra was greatly improved by use of the assembly compared to databases of sequences from closely related organisms. The assembly and raw reads have been deposited at DDBJ/EMBL/GenBank under the accession number GBKR00000000. Tandem mass spectrometry data is available via ProteomeXchange with identifier PXD001285.
Presented here is the first release of an annotated de novo transcriptome assembly from Apteronotus leptorhynchus, providing a broad overview of RNA expressed in central nervous system tissue. The assembly, which includes substantial coverage of a wide variety of both protein coding and non-coding transcripts, will allow the development of better tools to understand the mechanisms underlying unique characteristics of the knifefish model system, such as their tremendous regenerative capacity and negligible brain senescence.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1354-2) contains supplementary material, which is available to authorized users.
PMCID: PMC4424500  PMID: 25879418
De novo transcriptome; RNA-Seq; Brain; Spinal cord; Apteronotus leptorhynchus; Brown ghost knifefish; Proteomics; Non-coding RNA; Splice-junction peptides
23.  Variability of Metabolite Levels Is Linked to Differential Metabolic Pathways in Arabidopsis's Responses to Abiotic Stresses 
PLoS Computational Biology  2014;10(6):e1003656.
Constraint-based approaches have been used for integrating data in large-scale metabolic networks to obtain insights into metabolism of various organisms. Due to the underlying steady-state assumption, these approaches are usually not suited for making predictions about metabolite levels. Here, we ask whether we can make inferences about the variability of metabolite levels from a constraint-based analysis based on the integration of transcriptomics data. To this end, we analyze time-resolved transcriptomics and metabolomics data from Arabidopsis thaliana under a set of eight different light and temperature conditions. In a previous study, the gene expression data have already been integrated in a genome-scale metabolic network to predict pathways, termed modulators and sustainers, which are differentially regulated with respect to a biochemically meaningful data-driven null model. Here, we present a follow-up analysis which bridges the gap between flux- and metabolite-centric methods. One of our main findings demonstrates that under certain environmental conditions, the levels of metabolites acting as substrates in modulators or sustainers show significantly lower temporal variations with respect to the remaining measured metabolites. This observation is discussed within the context of a systems-view of plasticity and robustness of metabolite contents and pathway fluxes. Our study paves the way for investigating the existence of similar principles in other species for which both genome-scale networks and high-throughput metabolomics data of high quality are becoming increasingly available.
Author Summary
Organisms are usually exposed to changing environments and balance these perturbations by altering their metabolic state. Gaining a deeper understanding of metabolic adjustment to varying external conditions is important for the development of advanced engineering strategies for microorganisms as well as for higher plants. One tool which is particularly suited for investigating these processes is genome-scale metabolic models. These large-scale representations of the underlying metabolic networks enable the integration of experimental data and application of constrain-based mathematical approaches to estimate flux rates through the chemical reactions of the network under different environmental scenarios. However, for most of these approaches the assumption of a steady-state (flux balance) is indispensable and therefore precludes the prediction of metabolite concentrations. Here, we present a data-driven observation that relates results from a flux-centric constraint-based approach that is based on transcriptomics data to metabolite levels from the same experiments. Our observations suggest that constraint-based modeling approaches in combination with high-throughput data can be used to infer regulatory principles about the plasticity and robustness of metabolic behavior from the stoichiometry of the underlying reactions alone.
PMCID: PMC4063599  PMID: 24946036
24.  Metabolites Associated with Adaptation of Microorganisms to an Acidophilic, Metal-Rich Environment Identified by Stable-Isotope-Enabled Metabolomics 
mBio  2013;4(2):e00484-12.
Microorganisms grow under a remarkable range of extreme conditions. Environmental transcriptomic and proteomic studies have highlighted metabolic pathways active in extremophilic communities. However, metabolites directly linked to their physiology are less well defined because metabolomics methods lag behind other omics technologies due to a wide range of experimental complexities often associated with the environmental matrix. We identified key metabolites associated with acidophilic and metal-tolerant microorganisms using stable isotope labeling coupled with untargeted, high-resolution mass spectrometry. We observed >3,500 metabolic features in biofilms growing in pH ~0.9 acid mine drainage solutions containing millimolar concentrations of iron, sulfate, zinc, copper, and arsenic. Stable isotope labeling improved chemical formula prediction by >50% for larger metabolites (>250 atomic mass units), many of which were unrepresented in metabolic databases and may represent novel compounds. Taurine and hydroxyectoine were identified and likely provide protection from osmotic stress in the biofilms. Community genomic, transcriptomic, and proteomic data implicate fungi in taurine metabolism. Leptospirillum group II bacteria decrease production of ectoine and hydroxyectoine as biofilms mature, suggesting that biofilm structure provides some resistance to high metal and proton concentrations. The combination of taurine, ectoine, and hydroxyectoine may also constitute a sulfur, nitrogen, and carbon currency in the communities.
Microbial communities are central to many critical global processes and yet remain enigmatic largely due to their complex and distributed metabolic interactions. Metabolomics has the possibility of providing mechanistic insights into the function and ecology of microbial communities. However, our limited knowledge of microbial metabolites, the difficulty of identifying metabolites from complex samples, and the inability to link metabolites directly to community members have proven to be major limitations in developing advances in systems interactions. Here, we show that combining stable-isotope-enabled metabolomics with genomics, transcriptomics, and proteomics can illuminate the ecology of microorganisms at the community scale.
PMCID: PMC3604775  PMID: 23481603
25.  eXframe: reusable framework for storage, analysis and visualization of genomics experiments 
BMC Bioinformatics  2011;12:452.
Genome-wide experiments are routinely conducted to measure gene expression, DNA-protein interactions and epigenetic status. Structured metadata for these experiments is imperative for a complete understanding of experimental conditions, to enable consistent data processing and to allow retrieval, comparison, and integration of experimental results. Even though several repositories have been developed for genomics data, only a few provide annotation of samples and assays using controlled vocabularies. Moreover, many of them are tailored for a single type of technology or measurement and do not support the integration of multiple data types.
We have developed eXframe - a reusable web-based framework for genomics experiments that provides 1) the ability to publish structured data compliant with accepted standards 2) support for multiple data types including microarrays and next generation sequencing 3) query, analysis and visualization integration tools (enabled by consistent processing of the raw data and annotation of samples) and is available as open-source software. We present two case studies where this software is currently being used to build repositories of genomics experiments - one contains data from hematopoietic stem cells and another from Parkinson's disease patients.
The web-based framework eXframe offers structured annotation of experiments as well as uniform processing and storage of molecular data from microarray and next generation sequencing platforms. The framework allows users to query and integrate information across species, technologies, measurement types and experimental conditions. Our framework is reusable and freely modifiable - other groups or institutions can deploy their own custom web-based repositories based on this software. It is interoperable with the most important data formats in this domain. We hope that other groups will not only use eXframe, but also contribute their own useful modifications.
PMCID: PMC3235155  PMID: 22103807

Results 1-25 (923309)