Biocuration has become a cornerstone for analyses in biology, and to meet demand, the number of annotations has grown considerably in recent years. However, the reliability of these annotations varies; it has thus become necessary to be able to assess the confidence in annotations. Although several resources already provide confidence information about the annotations that they produce, a standard way of providing such information has yet to be defined. This lack of standardization undermines the propagation of knowledge across resources, as well as the credibility of results from high-throughput analyses. Seeded at a workshop during the Biocuration 2012 conference, a working group has been created to address this problem. We present here the elements that were identified as essential for assessing confidence in annotations, as well as a draft ontology, the Confidence Information Ontology, to illustrate how the problems identified could be addressed. We hope that this effort will provide a home for discussing this major issue among the biocuration community.
Molecular interaction databases are essential resources that enable access to a wealth of information on associations between proteins and other biomolecules. Network graphs generated from these data provide an understanding of the relationships between different proteins in the cell, and network analysis has become a widespread tool supporting –omics analysis. Meaningfully representing this information remains far from trivial and different databases strive to provide users with detailed records capturing the experimental details behind each piece of interaction evidence. A targeted curation approach is necessary to transfer published data generated by primarily low-throughput techniques into interaction databases. In this review we present an example highlighting the value of both targeted curation and the subsequent effective visualization of detailed features of manually curated interaction information. We have curated interactions involving LRRK2, a protein of largely unknown function linked to familial forms of Parkinson's disease, and hosted the data in the IntAct database. This LRRK2-specific dataset was then used to produce different visualization examples highlighting different aspects of the data: the level of confidence in the interaction based on orthogonal evidence, those interactions found under close-to-native conditions, and the enzyme–substrate relationships in different in vitro enzymatic assays. Finally, pathway annotation taken from the Reactome database was overlaid on top of interaction networks to bring biological functional context to interaction maps.
Bioinformatics; Curation; Data visualization; Molecular interaction database; Parkinson's disease; Protein interaction network
Data standardization; Human Proteome Organisation; Proteomics Standards Initiative
The manual curation of the information in biomedical resources is an expensive task. This article argues the value of this approach in comparison with other apparently less costly options, such as automated annotation or text-mining, then discusses ways in which databases can make cost savings by sharing infrastructure and tool development. Sharing curation effort is a model already being adopted by several data resources. Approaches taken by two of these, the Gene Ontology annotation effort and the IntAct molecular interaction database, are reviewed in more detail. These models help to ensure long-term persistence of curated data and minimize redundant development of resources by multiple disparate groups.
http://www.ebi.ac.uk/intact and http://www.ebi.ac.uk/GOA/
The IntAct molecular interaction database has created a new, free, open-source, manually curated resource, the Complex Portal (www.ebi.ac.uk/intact/complex), through which protein complexes from major model organisms are being collated and made available for search, viewing and download. It has been built in close collaboration with other bioinformatics services and populated with data from ChEMBL, MatrixDB, PDBe, Reactome and UniProtKB. Each entry contains information about the participating molecules (including small molecules and nucleic acids), their stoichiometry, topology and structural assembly. Complexes are annotated with details about their function, properties and complex-specific Gene Ontology (GO) terms. Consistent nomenclature is used throughout the resource with systematic names, recommended names and a list of synonyms all provided. The use of the Evidence Code Ontology allows us to indicate for which entries direct experimental evidence is available or if the complex has been inferred based on homology or orthology. The data are searchable using standard identifiers, such as UniProt, ChEBI and GO IDs, protein, gene and complex names or synonyms. This reference resource will be maintained and grow to encompass an increasing number of organisms. Input from groups and individuals with specific areas of expertise is welcome.
Availability and implementation:
Supplementary data are available at Bioinformatics online.
The Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) was established in 2002 with the aim of defining community standards for data representation in proteomics and facilitating data comparison, exchange and verification. Over the last 10 years significant advances have been made, with common data standards now published and implemented in the fields of both mass spectrometry and molecular interactions. The 2012 meeting further advanced this work, with the mass spectrometry groups finalising approaches to capturing the output from recent developments in the field, such as quantitative proteomics and SRM. The molecular interaction group focused on improving the integration of data from multiple resources. Both groups united with a guest work track, organized by the HUPO Technology/Standards Committee, to formulate proposals for data submissions from the HUPO Human Proteome Project and to start an initiative to collect standard experimental protocols.
This paper focuses on the use of controlled vocabularies (CVs) and ontologies, especially in the area of proteomics, primarily related to the work of the Proteomics Standards Initiative (PSI). It describes the relevant proteomics standard formats and the ontologies used within them. Software and tools for working with these ontology files are also discussed. The article also examines the “mapping files” used to ensure that the correct controlled vocabulary terms are placed within PSI standards and that the MIAPE (Minimum Information about a Proteomics Experiment) requirements are fulfilled. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
► The semantic annotation using ontologies is a prerequisite for the semantic web. ► The HUPO-PSI defined a set of XML-based standard formats for proteomics. ► These standard formats allow the referencing of CV terms defined in obo files. ► The CV terms can be used to enforce MIAPE compliance of the data files. ► The mass spectrometry CV is constantly maintained in a community process.
ANDI-MS, Analytical Data Interchange format for Mass Spectrometry; AniML, Analytical Information Markup Language; API, Application Programming Interface; ASCII, American Standard Code for Information Interchange; ASTM, American Society for Testing and Materials; BTO, BRENDA (BRaunschweig ENzyme DAtabase) Tissue Ontology; ChEBI, Chemical Entities of Biological Interest; CV, Controlled Vocabulary; DL, Description Logic; EBI, European Bioinformatics Institute; HDF5, Hierarchical Data Format, version 5; HUPO-PSI, Human Proteome Organisation-Proteomics Standards Initiative; ICD, International Classification of Diseases; IUPAC, International Union for Pure and Applied Chemistry; JCAMP-DX, Joint Committee on Atomic and Molecular Physical data-Data eXchange format; MALDI, Matrix Assisted Laser Desorption Ionization; MeSH, Medical Subject Headings; MI, Molecular Interaction; MIBBI, Minimal Information for Biological and Biomedical Investigations; MITAB, Molecular Interactions TABular format; MIAPE, Minimum Information About a Proteomics Experiment; MS, Mass Spectrometry; NCBI, National Center for Biotechnology Information; NCBO, National Center for Biomedical Ontology; netCDF, Network Common Data Format; OBI, Ontology for Biomedical Investigations; OBO, Open Biological and Biomedical Ontologies; OLS, Ontology Lookup Service; OWL, Web Ontology Language; PAR, Protein Affinity Reagents; PATO, Phenotype Attribute Trait Ontology; PRIDE, PRoteomics IDEntifications database; RDF(S), Resource Description Framework (Schema); SRM, Selected Reaction Monitoring; TPP, Trans-Proteomic Pipeline; URI, Uniform Resource Identifier; XSLT, eXtensible Stylesheet Language Transformation; YAFMS, Yet Another Format for Mass Spectrometry
Proteomics data standards; Controlled vocabularies; Ontologies in proteomics; Ontology formats; Ontology editors and software; Ontology maintenance
IntAct (freely available at http://www.ebi.ac.uk/intact) is an open-source, open data molecular interaction database populated by data either curated from the literature or from direct data depositions. IntAct has developed a sophisticated web-based curation tool, capable of supporting both IMEx- and MIMIx-level curation. This tool is now utilized by multiple additional curation teams, all of whom annotate data directly into the IntAct database. Members of the IntAct team supply appropriate levels of training, perform quality control on entries and take responsibility for long-term data maintenance. Recently, the MINT and IntAct databases decided to merge their separate efforts to make optimal use of limited developer resources and maximize the curation output. All data manually curated by the MINT curators have been moved into the IntAct database at EMBL-EBI and are merged with the existing IntAct dataset. Both IntAct and MINT are active contributors to the IMEx consortium (http://www.imexconsortium.org).
The complex biological processes that control cellular function are mediated by intricate networks of molecular interactions. Accumulating evidence indicates that these interactions are often interdependent, thus acting cooperatively. Cooperative interactions are prevalent in and indispensable for reliable and robust control of cell regulation, as they underlie the conditional decision-making capability of large regulatory complexes. Despite an increased focus on experimental elucidation of the molecular details of cooperative binding events, as evidenced by their growing occurrence in literature, they are currently lacking from the main bioinformatics resources. One of the contributing factors to this deficiency is the lack of a computer-readable standard representation and exchange format for cooperative interaction data. To tackle this shortcoming, we added functionality to the widely used PSI-MI interchange format for molecular interaction data by defining new controlled vocabulary terms that allow annotation of different aspects of cooperativity without making structural changes to the underlying XML schema. As a result, we are able to capture cooperative interaction data in a structured format that is backward compatible with PSI-MI–based data and applications. This will facilitate the storage, exchange and analysis of cooperative interaction data, which in turn will advance experimental research on this fundamental principle in biology.
The IMEx consortium is an international collaboration between major public interaction data providers to share curation effort and make a non-redundant set of protein interactions available in a single search interface on a common website (www.imexconsortium.org). Common curation rules have been developed and a central registry is used to manage the selection of articles to enter into the dataset. The advantages of such a service to the user, quality control measures adopted and data distribution practices are discussed.
Background and Purpose
Because brain endothelial cells exist at the neurovascular interface, they may serve as cellular reporters of brain dysfunction by releasing biomarkers into the circulation.
We used proteomic techniques to screen conditioned media from human brain endothelial cultures subjected to oxidative stress induced by nitric oxide over 24 hours. Plasma samples from human stroke patients were analyzed by enzyme-linked immunosorbent assay.
In healthy endothelial cells, interaction mapping demonstrated cross-talk involving secreted factors, membrane receptors, and matrix components. In oxidatively challenged endothelial cells, networks of interacting proteins failed to emerge. Instead, inflammatory markers increased, secreted factors oscillated over time, and endothelial injury repair was manifested as changes in factors related to matrix integrity. Elevated inflammatory markers included heat shock protein, chemokine ligand-1, serum amyloid-A1, annexin-A5, and thrombospondin-1. Neurotrophic factors (prosaposin, nucleobindin-1, and tachykinin precursors) peaked at 12 hours, then rapidly decreased by 24 hours. Basement membrane components (fibronectin, desmoglein, profilin-1) were decreased. Cytoskeletal markers (actin, vimentin, nidogen, and filamin B) increased over time. From this initial analysis, the high-ranking candidate thrombospondin-1 was further explored in human plasma. Acute ischemic stroke patients had significantly higher thrombospondin-1 levels within 8 hours of symptom onset compared to controls with similar clinical risk factors (659±81 vs 1132±98 ng/mL; P<0.05; n=20).
Screening of simplified cell culture systems may aid the discovery of novel biomarkers in clinical neurovascular injury. Further collaborative efforts are warranted to discover and validate more candidates of interest.
biomarker; cerebral ischemia; human brain endothelial cells; oxidative stress; proteomics
The Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) specification was
created by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to
enable computational access to molecular-interaction data resources by means of a standard
Web Service and query language. Currently providing >150 million binary interaction
evidences from 28 servers globally, the PSICQUIC interface allows the concurrent search of
multiple molecular-interaction information resources using a single query. Here, we
present an extension of the PSICQUIC specification (version 1.3), which has been released
to be compliant with the enhanced standards in molecular interactions. The new release
also includes a new reference implementation of the PSICQUIC server available to the data
providers. It offers augmented web service capabilities and improves the user experience.
PSICQUIC has been running for almost 5 years, with a user base growing from only 4 data
providers to 28 (April 2013) allowing access to 151 310 109 binary interactions. The power
of this web service is shown in the PSICQUIC View web application, an example of how to
simultaneously query, browse and download results from the different PSICQUIC servers.
This application is free and open to all users with no login requirement (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml).
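As an illustration of the kind of programmatic access described above, the following is a minimal Python sketch, not the reference implementation. It assumes a REST-style search URL of the form used by the IntAct PSICQUIC service (the base URL, the `firstResult` and `maxResults` parameters, and the function names are assumptions for illustration) and a tab-separated MITAB 2.5 response with 15 columns. The sample record is made up, and no network call is made.

```python
from urllib.parse import quote

# Assumed base URL of a single PSICQUIC REST service; each provider
# registered with PSICQUIC exposes its own base URL.
ASSUMED_BASE = (
    "http://www.ebi.ac.uk/Tools/webservices/psicquic/"
    "intact/webservices/current/search"
)

# Short names (chosen here for illustration) for the 15 MITAB 2.5 columns.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication_ids",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_ids", "confidence",
]


def build_query_url(query, base=ASSUMED_BASE, first=0, max_results=50):
    """Build a search URL for one service; the same query string can be
    sent to every provider by swapping the base URL."""
    return f"{base}/query/{quote(query)}?firstResult={first}&maxResults={max_results}"


def parse_mitab_line(line):
    """Map the first 15 tab-separated fields of a MITAB line to column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    return dict(zip(MITAB25_COLUMNS, fields))


def split_field(value):
    """Split a multi-valued MITAB field on '|'; '-' marks an empty field."""
    return [] if value == "-" else value.split("|")


# Made-up record for demonstration only (identifiers are hypothetical).
SAMPLE_LINE = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2010)", "pubmed:00000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "intact-miscore:0.40",
])

if __name__ == "__main__":
    print(build_query_url("brca2"))
    record = parse_mitab_line(SAMPLE_LINE)
    print(record["id_a"], record["id_b"], record["detection_method"])
```

Keeping URL construction separate from parsing means the same query can be issued against several services and the results merged afterwards, which is the usage pattern the abstract describes.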
Controlled vocabularies (CVs), i.e. a collection of predefined terms describing a modeling domain, used for the semantic annotation of data, and ontologies are used in structured data formats and databases to avoid inconsistencies in annotation, to have a unique (and preferably short) accession number and to give researchers and computer algorithms the possibility for more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in their data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV contains a logical hierarchical structure to ensure ease of maintenance and the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
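To make the hierarchical structure of an OBO-format CV more concrete, here is a minimal Python sketch that reads `[Term]` stanzas and follows `is_a` links to a term's ancestors. It is an illustration under stated assumptions rather than a full OBO parser: only the `id`, `name` and `is_a` tags are handled, the function names are invented for this example, and the `EX:` identifiers in the sample text are hypothetical and do not come from the PSI-MS CV.

```python
def parse_obo_terms(text):
    """Return {term_id: {"name": str, "is_a": [parent ids]}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {"name": "", "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or non-Term stanzas
        tag, _, value = line.partition(":")
        value = value.split("!")[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


def ancestors(terms, term_id):
    """Collect all terms reachable from term_id through is_a links."""
    seen = []
    stack = list(terms[term_id]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(terms.get(parent, {"is_a": []})["is_a"])
    return seen


# Hypothetical sample; identifiers and names are for illustration only.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: instrument

[Term]
id: EX:0000002
name: mass analyzer
is_a: EX:0000001 ! instrument

[Term]
id: EX:0000003
name: hypothetical trap analyzer
is_a: EX:0000002 ! mass analyzer

[Typedef]
id: part_of
name: part_of
"""

if __name__ == "__main__":
    cv = parse_obo_terms(SAMPLE_OBO)
    for term_id in ancestors(cv, "EX:0000003"):
        print(term_id, cv[term_id]["name"])
```

Walking `is_a` links in this way lets software treat an annotation with a specific term as an instance of every broader term above it, which is one use of the hierarchy the abstract refers to.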
The large-conductance Ca2+-activated K+ (BK) channel and its β-subunit underlie tuning in non-mammalian sensory or hair cells, whereas in mammals its function is less clear. To gain insights into species differences and to reveal putative BK functions, we undertook a systems analysis of BK and BK-Associated Proteins (BKAPS) in the chicken cochlea and compared these results to other species. We identified 110 putative partners from cytoplasmic and membrane/cytoskeletal fractions, using a combination of coimmunoprecipitation, 2-D gel, and LC-MS/MS. Partners included 14-3-3γ, valosin-containing protein (VCP), stathmin (STMN), cortactin (CTTN), and prohibitin (PHB), of which 16 partners were verified by reciprocal coimmunoprecipitation. Bioinformatics revealed binary partners, the resultant interactome, subcellular localization, and cellular processes. The interactome contained 193 proteins involved in 190 binary interactions in subcellular compartments such as the ER, mitochondria, and nucleus. Comparisons with mice showed shared hub proteins that included N-methyl-D-aspartate receptor (NMDAR) and ATP-synthase. Ortholog analyses across six species revealed conserved interactions involving apoptosis, Ca2+ binding, and trafficking, in chicks, mice, and humans. Functional studies using recombinant BK and RNAi in a heterologous expression system revealed that proteins important to cell death/survival, such as annexinA5, γ-actin, lamin, superoxide dismutase, and VCP, caused a decrease in BK expression. This revelation led to an examination of specific kinases and their effectors relevant to cell viability. Sequence analyses of the BK C-terminus across 10 species showed putative binding sites for 14-3-3, RAC-α serine/threonine-protein kinase 1 (Akt), glycogen synthase kinase-3β (GSK3β) and phosphoinositide-dependent kinase-1 (PDK1). Knockdown of 14-3-3 and Akt caused an increase in BK expression, whereas silencing of GSK3β and PDK1 had the opposite effect. 
This comparative systems approach suggests conservation in BK function across different species in addition to novel functions that may include the initiation of signals relevant to cell death/survival.
IntAct is an open-source, open data molecular interaction database populated by data either curated from the literature or from direct data depositions. Two levels of curation are now available within the database, with both IMEx-level annotation and less detailed MIMIx-compatible entries currently supported. As of September 2011, IntAct contains approximately 275 000 curated binary interaction evidences from over 5000 publications. The IntAct website has been improved to enhance the search process and in particular the graphical display of the results. New data download formats are also available, which will facilitate the inclusion of IntAct's data in the Semantic Web. IntAct is an active contributor to the IMEx consortium (http://www.imexconsortium.org). IntAct source code and data are freely available at http://www.ebi.ac.uk/intact.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources, and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
This report summarizes the proceedings of the second workshop of the ‘Minimum Information for Biological and Biomedical Investigations’ (MIBBI) consortium held on Dec 1-2, 2010 in Rüdesheim, Germany through the sponsorship of the Beilstein-Institute. MIBBI is an umbrella organization uniting communities developing Minimum Information (MI) checklists to standardize the description of data sets, the workflows by which they were generated and the scientific context for the work. This workshop brought together representatives of more than twenty communities to present the status of their MI checklists and plans for future development. Shared challenges and solutions were identified and the role of MIBBI in MI checklist development was discussed. The meeting featured some thirty presentations, wide-ranging discussions and breakout groups. The top outcomes of the two-day workshop as defined by the participants were: 1) the chance to share best practices and to identify areas of synergy; 2) defining a series of tasks for updating the MIBBI Portal; 3) reemphasizing the need to maintain independent MI checklists for various communities while leveraging common terms and workflow elements contained in multiple checklists; and 4) revision of the concept of the MIBBI Foundry to focus on the creation of a core set of MIBBI modules intended for reuse by individual MI checklist projects while maintaining the integrity of each MI project. Further information about MIBBI and its range of activities can be found at http://mibbi.org/.
This report summarizes the proceedings of the one-day BioSharing meeting held at the Intelligent Systems for Molecular Biology (ISMB) 2010 conference in Boston, MA, USA. This inaugural BioSharing event was hosted by the Genomic Standards Consortium as part of its M3 & BioSharing special interest group (SIG) workshop. The BioSharing event included invited talks from a range of community leaders and a panel discussion at the end of the day. The panel session led to the formal agreement among community leaders to join together to promote cross-community knowledge exchange and collaborations. A key focus of the newly formed BioSharing community will be linking up resources to promote real-world data sharing (virtuous cycle of data) and supporting compliance with data policies through the creation of a one-stop-portal of information. Further information about the newly established BioSharing effort can be found at http://biosharing.org.
Short-chain dehydrogenases/reductases (SDR) constitute one of the largest enzyme superfamilies with presently over 46 000 members. In phylogenetic comparisons, members of this superfamily show early divergence where the majority have only low pair-wise sequence identity, although sharing common structural properties. The SDR enzymes are present in virtually all genomes investigated, and in humans over 70 SDR genes have been identified. In humans, these enzymes are involved in the metabolism of a large variety of compounds, including steroid hormones, prostaglandins, retinoids, lipids and xenobiotics. It is now clear that SDRs represent one of the oldest protein families and contribute to essential functions and interactions of all forms of life. As this field continues to grow rapidly, a systematic nomenclature is essential for future annotation and reference purposes. A functional subdivision of the SDR superfamily into at least 200 SDR families based upon hidden Markov models forms a suitable foundation for such a nomenclature system, which we present in this paper using human SDRs as examples.
SDR; enzymes; nomenclature; bioinformatics; hidden Markov models
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
The guest editor (AM) provides his perspective on the most recent advances on nucleoside diphosphate kinase (NDPK, otherwise known as AWD or NM23), showcasing phospho-histidine biochemistry and its impact on diverse pathology when disordered. His co-author (SO) provides state-of-the-art analyses from the European Bioinformatics Institute in an appendix to support the most recent advances made by the NDPK community. Unfortunately, those outside the field often dismiss NDPK as a tiny ‘ancient housekeeper’ protein found in marine sponges, social amoebae, worms, fruit flies, rodents and humans, but the state-of-the-art papers overviewed here show that NDPK does not act simply in mindless rote, inter-converting cellular ‘energy currencies’. That two NDPK isoforms regulate fetal erythroid lineage is a developmental case in point. Seminal Cancer Research UK support, which generated additional resources to enable the NDPK community to meet in Dundee in 2007, is gratefully acknowledged (www.dundee.ac.uk/mchs/ndpk; next meeting is planned: 2010/Mannheim-Heidelberg). The presented papers illustrate the point that when scientists are left alone ‘shut up in the narrow cell of their laboratory’ (as the philosopher Ortega once said, a sentiment echoed by Erwin Schrödinger), then progress will ultimately occur bridging the gap between specialization and translation for human benefit. To aid translation, this overview initially introduces the NDPK family to the non-specialist, who serendipitously finds these proteins in their biology. This is immediately followed by examples of the diverse biology generated by this self-aggregating group of multi-functional proteins and finally capped by an emerging idea explaining how this diversity might arise.
HAART; Drosophila; Jade Goody; Bioinformatics; Dictyostelium; Ion transport