PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-13 (13)
 

Clipboard (0)
None

Select a Filter Below

Journals
Year of Publication
1.  The HUPO proteomics standards initiative- mass spectrometry controlled vocabulary 
Controlled vocabularies (CVs), i.e. a collection of predefined terms describing a modeling domain, used for the semantic annotation of data, and ontologies are used in structured data formats and databases to avoid inconsistencies in annotation, to have a unique (and preferably short) accession number and to give researchers and computer algorithms the possibility for more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in their data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV contains a logical hierarchical structure to ensure ease of maintenance and the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
doi:10.1093/database/bat009
PMCID: PMC3594986  PMID: 23482073
2.  Library of Apicomplexan Metabolic Pathways: a manually curated database for metabolic pathways of apicomplexan parasites 
Nucleic Acids Research  2012;41(D1):D706-D713.
The Library of Apicomplexan Metabolic Pathways (LAMP, http://www.llamp.net) is a web database that provides near complete mapping from genes to the central metabolic functions for some of the prominent intracellular parasites of the phylum Apicomplexa. This phylum includes the causative agents of malaria, toxoplasmosis and theileriosis—diseases with a huge economic and social impact. A number of apicomplexan genomes have been sequenced, but the accurate annotation of gene function remains challenging. We have adopted an approach called metabolic reconstruction, in which genes are systematically assigned to functions within pathways/networks for Toxoplasma gondii, Neospora caninum, Cryptosporidium and Theileria species, and Babesia bovis. Several functions missing from pathways have been identified, where the corresponding gene for an essential process appears to be absent from the current genome annotation. For each species, LAMP contains interactive diagrams of each pathway, hyperlinked to external resources and annotated with detailed information, including the sources of evidence used. We have also developed a section to highlight the overall metabolic capabilities of each species, such as the ability to synthesize or the dependence on the host for a particular metabolite. We expect this new database will become a valuable resource for fundamental and applied research on the Apicomplexa.
doi:10.1093/nar/gks1139
PMCID: PMC3531055  PMID: 23193253
3.  The mzIdentML Data Standard for Mass Spectrometry-Based Proteomics Results 
Molecular & Cellular Proteomics : MCP  2012;11(7):M111.014381.
We report the release of mzIdentML, an exchange standard for peptide and protein identification data, designed by the Proteomics Standards Initiative. The format was developed by the Proteomics Standards Initiative in collaboration with instrument and software vendors, and the developers of the major open-source projects in proteomics. Software implementations have been developed to enable conversion from most popular proprietary and open-source formats, and mzIdentML will soon be supported by the major public repositories. These developments enable proteomics scientists to start working with the standard for exchanging and publishing data sets in support of publications and they provide a stable platform for bioinformatics groups and commercial software vendors to work with a single file format for identification data.
doi:10.1074/mcp.M111.014381
PMCID: PMC3394945  PMID: 22375074
4.  The Gel Electrophoresis Markup Language (GelML) from the Proteomics Standards Initiative 
Proteomics  2010;10(17):3073-3081.
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
doi:10.1002/pmic.201000120
PMCID: PMC3193076  PMID: 20677327
data standard; gel electrophoresis; database; ontology
5.  A DIGE study on the effects of salbutamol on the rat muscle proteome - an exemplar of best practice for data sharing in proteomics 
BMC Research Notes  2011;4:86.
Background
Proteomic techniques allow researchers to perform detailed analyses of cellular states and many studies are published each year, which highlight large numbers of proteins quantified in different samples. However, currently few data sets make it into public databases with sufficient metadata to allow other groups to verify findings, perform data mining or integrate different data sets. The Proteomics Standards Initiative has released a series of "Minimum Information About a Proteomics Experiment" guideline documents (MIAPE modules) and accompanying data exchange formats. This article focuses on proteomic studies based on gel electrophoresis and demonstrates how the corresponding MIAPE modules can be fulfilled and data deposited in public databases, using a new experimental data set as an example.
Findings
We have performed a study of the effects of an anabolic agent (salbutamol) at two different time points on the protein complement of rat skeletal muscle cells, quantified by difference gel electrophoresis. In the DIGE study, a total of 31 non-redundant proteins were identified as being potentially modulated at 24 h post treatment and 110 non redundant proteins at 96 h post-treatment. Several categories of function have been highlighted as strongly enriched, providing candidate proteins for further study. We also use the study as an example of best practice for data deposition.
Conclusions
We have deposited all data sets from this study in public databases for further analysis by the community. We also describe more generally how gel-based protein identification data sets can now be deposited in the PRoteomics IDEntifications database (PRIDE), using a new software tool, the PRIDESpotMapper, which we developed to work in conjunction with the PRIDE Converter application. We also demonstrate how the ProteoRed MIAPE generator tool can be used to create and share a complete and compliant set of MIAPE reports for this experiment and others.
doi:10.1186/1756-0500-4-86
PMCID: PMC3080311  PMID: 21443781
6.  Allele frequency net: a database and online repository for immune gene frequencies in worldwide populations 
Nucleic Acids Research  2010;39(Database issue):D913-D919.
The allele frequency net database (http://www.allelefrequencies.net) is an online repository that contains information on the frequencies of immune genes and their corresponding alleles in different populations. The extensive variability observed in genes and alleles related to the immune system response and its significance in transplantation, disease association studies and diversity in populations led to the development of this electronic resource. At present, the system contains data from 1133 populations in 608 813 individuals on the frequency of genes from different polymorphic regions such as human leukocyte antigens, killer-cell immunoglobulin-like receptors, major histocompatibility complex Class I chain-related genes and a number of cytokine gene polymorphisms. The project was designed to create a central source for the storage of frequency data and provide individuals with a set of bioinformatics tools to analyze the occurrence of these variants in worldwide populations. The resource has been used in a wide variety of contexts, including clinical applications (histocompatibility, immunology, epidemiology and pharmacogenetics) and population genetics. Demographic information, frequency data and searching tools can be freely accessed through the website.
doi:10.1093/nar/gkq1128
PMCID: PMC3013710  PMID: 21062830
7.  Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines 
Proteomics  2009;9(5):1220-1229.
Tandem mass spectrometry, run in combination with liquid chromatography (LC-MS/MS), can generate large numbers of peptide and protein identifications, for which a variety of database search engines are available. Distinguishing correct identifications from false positives is far from trivial because all data sets are noisy, and tend to be too large for manual inspection, therefore probabilistic methods must be employed to balance the trade-off between sensitivity and specificity. Decoy databases are becoming widely used to place statistical confidence in results sets, allowing the false discovery rate (FDR) to be estimated. It has previously been demonstrated that different MS search engines produce different peptide identification sets, and as such, employing more than one search engine could result in an increased number of peptides being identified. However, such efforts are hindered by the lack of a single scoring framework employed by all search engines.
We have developed a search engine independent scoring framework based on FDR which allows peptide identifications from different search engines to be combined, called the FDRScore. We observe that peptide identifications made by three search engines are infrequently false positives, and identifications made by only a single search engine, even with a strong score from the source search engine, are significantly more likely to be false positives. We have developed a second score based on the FDR within peptide identifications grouped according to the set of search engines that have made the identification, called the combined FDRScore. We demonstrate by searching large publicly available data sets that the combined FDRScore can differentiate between between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.
doi:10.1002/pmic.200800473
PMCID: PMC2899855  PMID: 19253293
proteomics; mass spectrometry; decoy database; search engine; scoring; false discovery rate
8.  FuGEFlow: data model and markup language for flow cytometry 
BMC Bioinformatics  2009;10:184.
Background
Flow cytometry technology is widely used in both health care and research. The rapid expansion of flow cytometry applications has outpaced the development of data storage and analysis tools. Collaborative efforts being taken to eliminate this gap include building common vocabularies and ontologies, designing generic data models, and defining data exchange formats. The Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) standard was recently adopted by the International Society for Advancement of Cytometry. This standard guides researchers on the information that should be included in peer reviewed publications, but it is insufficient for data exchange and integration between computational systems. The Functional Genomics Experiment (FuGE) formalizes common aspects of comprehensive and high throughput experiments across different biological technologies. We have extended FuGE object model to accommodate flow cytometry data and metadata.
Methods
We used the MagicDraw modelling tool to design a UML model (Flow-OM) according to the FuGE extension guidelines and the AndroMDA toolkit to transform the model to a markup language (Flow-ML). We mapped each MIFlowCyt term to either an existing FuGE class or to a new FuGEFlow class. The development environment was validated by comparing the official FuGE XSD to the schema we generated from the FuGE object model using our configuration. After the Flow-OM model was completed, the final version of the Flow-ML was generated and validated against an example MIFlowCyt compliant experiment description.
Results
The extension of FuGE for flow cytometry has resulted in a generic FuGE-compliant data model (FuGEFlow), which accommodates and links together all information required by MIFlowCyt. The FuGEFlow model can be used to build software and databases using FuGE software toolkits to facilitate automated exchange and manipulation of potentially large flow cytometry experimental data sets. Additional project documentation, including reusable design patterns and a guide for setting up a development environment, was contributed back to the FuGE project.
Conclusion
We have shown that an extension of FuGE can be used to transform minimum information requirements in natural language to markup language in XML. Extending FuGE required significant effort, but in our experiences the benefits outweighed the costs. The FuGEFlow is expected to play a central role in describing flow cytometry experiments and ultimately facilitating data exchange including public flow cytometry repositories currently under development.
doi:10.1186/1471-2105-10-184
PMCID: PMC2711079  PMID: 19531228
9.  The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation 
Genome Biology  2008;9(7):R116.
A proteomics analysis identifies one third of the predicted Toxoplasma gondii proteins and integrates proteomics and genomics data to refine genome annotation.
Background
Although the genomes of many of the most important human and animal pathogens have now been sequenced, our understanding of the actual proteins expressed by these genomes and how well they predict protein sequence and expression is still deficient. We have used three complementary approaches (two-dimensional electrophoresis, gel-liquid chromatography linked tandem mass spectrometry and MudPIT) to analyze the proteome of Toxoplasma gondii, a parasite of medical and veterinary significance, and have developed a public repository for these data within ToxoDB, making for the first time proteomics data an integral part of this key genome resource.
Results
The draft genome for Toxoplasma predicts around 8,000 genes with varying degrees of confidence. Our data demonstrate how proteomics can inform these predictions and help discover new genes. We have identified nearly one-third (2,252) of all the predicted proteins, with 2,477 intron-spanning peptides providing supporting evidence for correct splice site annotation. Functional predictions for each protein and key pathways were determined from the proteome. Importantly, we show evidence for many proteins that match alternative gene models, or previously unpredicted genes. For example, approximately 15% of peptides matched more convincingly to alternative gene models. We also compared our data with existing transcriptional data in which we highlight apparent discrepancies between gene transcription and protein expression.
Conclusion
Our data demonstrate the importance of protein data in expression profiling experiments and highlight the necessity of integrating proteomic with genomic data so that iterative refinements of both annotation and expression models are possible.
doi:10.1186/gb-2008-9-7-r116
PMCID: PMC2530874  PMID: 18644147
10.  Capture and Analysis of Quantitative Proteomic Data 
Proteomics  2007;7(16):2787-2799.
Whilst the array of techniques available for quantitative proteomics continues to grow, the attendant bioinformatic software tools are similarly expanding in number. The data capture and analysis of such quantitative data is obviously crucial to the experiment and the methods used to process it will critically affect the quality of the data obtained. These tools must deal with a variety of issues, including identification of labelled and unlabelled peptide species, location of the corresponding mass spectrometry scans in the experiment, construction of representative ion chromatograms, location of the true peptide ion chromatogram start and end, elimination of background signal in the mass spectrum and chromatogram, and calculation of both peptide and protein ratios/abundances. A variety of tools and approaches are available, in part restricted by the nature of the experiment to be performed and available instrumentation. Currently, although there is no single consensus on precisely how to calculate protein and peptide abundances, many common themes have emerged which identify and reduce many of the key sources of error. These issues will be discussed, along with those relating to deposition of quantitative data. At present, mature data standards for quantitative proteomics are not yet available, although formats are beginning to emerge.
doi:10.1002/pmic.200700127
PMCID: PMC2260796  PMID: 17640002
proteomics; relative quantitation; absolute quantitation; software; bioinformatics; data standards
11.  An informatic pipeline for the data capture and submission of quantitative proteomic data using iTRAQTM 
Proteome Science  2007;5:4.
Background
Proteomics continues to play a critical role in post-genomic science as continued advances in mass spectrometry and analytical chemistry support the separation and identification of increasing numbers of peptides and proteins from their characteristic mass spectra. In order to facilitate the sharing of this data, various standard formats have been, and continue to be, developed. Still not fully mature however, these are not yet able to cope with the increasing number of quantitative proteomic technologies that are being developed.
Results
We propose an extension to the PRIDE and mzData XML schema to accommodate the concept of multiple samples per experiment, and in addition, capture the intensities of the iTRAQTM reporter ions in the entry. A simple Java-client has been developed to capture and convert the raw data from common spectral file formats, which also uses a third-party open source tool for the generation of iTRAQTM reported intensities from Mascot output, into a valid PRIDE XML entry.
Conclusion
We describe an extension to the PRIDE and mzData schemas to enable the capture of quantitative data. Currently this is limited to iTRAQTM data but is readily extensible for other quantitative proteomic technologies. Furthermore, a software tool has been developed which enables conversion from various mass spectrum file formats and corresponding Mascot peptide identifications to PRIDE formatted XML. The tool represents a simple approach to preparing quantitative and qualitative data for submission to repositories such as PRIDE, which is necessary to facilitate data deposition and sharing in public domain database. The software is freely available from .
doi:10.1186/1477-5956-5-4
PMCID: PMC1796855  PMID: 17270041
12.  An analysis of extensible modelling for functional genomics data 
BMC Bioinformatics  2005;6:235.
Background
Several data formats have been developed for large scale biological experiments, using a variety of methodologies. Most data formats contain a mechanism for allowing extensions to encode unanticipated data types. Extensions to data formats are important because the experimental methodologies tend to be fairly diverse and rapidly evolving, which hinders the creation of formats that will be stable over time.
Results
In this paper we review the data formats that exist in functional genomics, some of which have become de facto or de jure standards, with a particular focus on how each domain has been modelled, and how each format allows extensions. We describe the tasks that are frequently performed over data formats and analyse how well each task is supported by a particular modelling structure.
Conclusion
From our analysis, we make recommendations as to the types of modelling structure that are most suitable for particular types of experimental annotation. There are several standards currently under development that we believe could benefit from systematically following a set of guidelines.
doi:10.1186/1471-2105-6-235
PMCID: PMC1262694  PMID: 16188029
13.  XGAP: a uniform and extensible data model and software platform for genotype and phenotype experiments 
Genome Biology  2010;11(3):R27.
XGAP, a software platform for the integration and analysis of genotype and phenotype data.
We present an extensible software model for the genotype and phenotype community, XGAP. Readers can download a standard XGAP (http://www.xgap.org) or auto-generate a custom version using MOLGENIS with programming interfaces to R-software and web-services or user interfaces for biologists. XGAP has simple load formats for any type of genotype, epigenotype, transcript, protein, metabolite or other phenotype data. Current functionality includes tools ranging from eQTL analysis in mouse to genome-wide association studies in humans.
doi:10.1186/gb-2010-11-3-r27
PMCID: PMC2864567  PMID: 20214801

Results 1-13 (13)