Motivation: Data collection in spreadsheets is ubiquitous, but current solutions lack support for collaborative semantic annotation that would promote shared and interdisciplinary annotation practices, supporting geographically distributed players.
Results: OntoMaton is an open source solution that brings ontology lookup and tagging capabilities into a cloud-based collaborative editing environment, harnessing Google Spreadsheets and the NCBO Web services. It is a general purpose, format-agnostic tool that may serve as a component of the ISA software suite. OntoMaton can also be used to assist the ontology development process.
Availability: OntoMaton is freely available from Google widgets under the CPAL open source license; documentation and examples at: https://github.com/ISA-tools/OntoMaton.
The range of publicly available biomedical data is enormous and is expanding fast. This expansion means that researchers now face a hurdle in extracting the data they need from the large volume of data available. Biomedical researchers have turned to ontologies and terminologies to structure and annotate their data with ontology concepts for better search and retrieval. However, this annotation process cannot be easily automated and often requires expert curators. In addition, there is a lack of easy-to-use systems that facilitate the use of ontologies for annotation. This paper presents the Open Biomedical Annotator (OBA), an ontology-based Web service that annotates public datasets with biomedical ontology concepts based on their textual metadata (www.bioontology.org). The biomedical community can use the annotator service to tag datasets automatically with ontology terms (from UMLS and NCBO BioPortal ontologies). Such annotations facilitate translational discoveries by integrating annotated data.
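The idea of tagging textual metadata with ontology concepts can be illustrated with a minimal dictionary-based sketch. This is not the OBA implementation or its Web service API; the lexicon, the concept identifiers (EX:0001 and so on) and the function name annotate are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon: surface term -> ontology concept identifier.
# The identifiers are illustrative placeholders, not real ontology IDs.
LEXICON = {
    "melanoma": "EX:0001",
    "breast cancer": "EX:0002",
    "cancer": "EX:0003",
}


def annotate(text, lexicon=LEXICON):
    """Return sorted (start, end, term, concept_id) tuples for terms found in text.

    Longer terms are tried first so that "breast cancer" takes precedence
    over the shorter, overlapping term "cancer".
    """
    annotations = []
    taken = set()  # character offsets already covered by an annotation
    lowered = text.lower()
    for term in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            span = range(match.start(), match.end())
            if taken.isdisjoint(span):
                taken.update(span)
                annotations.append(
                    (match.start(), match.end(), term, lexicon[term])
                )
    return sorted(annotations)


if __name__ == "__main__":
    for hit in annotate("Breast cancer and melanoma samples"):
        print(hit)
```

A production annotator would additionally handle synonyms, morphological variants and hierarchy expansion, none of which this sketch attempts.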
The Gene Ontology (GO) project (http://www.geneontology.org/) provides a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (also see http://www.sequenceontology.org/). The ontologies have been extended and refined for several biological areas, and improvements to the structure of the ontologies have been implemented. To improve the quantity and quality of gene product annotations available from its public repository, the GO Consortium has launched a focused effort to provide comprehensive and detailed annotation of orthologous genes across a number of ‘reference’ genomes, including human and several key model organisms. Software developments include two releases of the ontology-editing tool OBO-Edit, and improvements to the AmiGO browser interface.
PomBase (www.pombase.org) is a new model organism database established to provide access to comprehensive, accurate, and up-to-date molecular data and biological information for the fission yeast Schizosaccharomyces pombe to effectively support both exploratory and hypothesis-driven research. PomBase encompasses annotation of genomic sequence and features, comprehensive manual literature curation and genome-wide data sets, and supports sophisticated user-defined queries. The implementation of PomBase integrates a Chado relational database that houses manually curated data with Ensembl software that supports sequence-based annotation and web access. PomBase will provide user-friendly tools to promote curation by experts within the fission yeast community. This will make a key contribution to shaping its content and ensuring its comprehensiveness and long-term relevance.
IntAct is an open source database and software suite for modeling, storing and analyzing molecular interaction data. The data available in the database originates entirely from published literature and is manually annotated by expert biologists to a high level of detail, including experimental methods, conditions and interacting domains. The database features over 126 000 binary interactions extracted from over 2100 scientific publications and makes extensive use of controlled vocabularies. The web site provides tools allowing users to search, visualize and download data from the repository. IntAct supports and encourages local installations as well as direct data submission and curation collaborations. IntAct source code and data are freely available from .
VectorBase (http://www.vectorbase.org) is an NIAID-funded Bioinformatic Resource Center focused on invertebrate vectors of human pathogens. VectorBase annotates and curates vector genomes providing a web accessible integrated resource for the research community. Currently, VectorBase contains genome information for three mosquito species: Aedes aegypti, Anopheles gambiae and Culex quinquefasciatus, a body louse Pediculus humanus and a tick species Ixodes scapularis. Since our last report VectorBase has initiated a community annotation system, a microarray and gene expression repository and controlled vocabularies for anatomy and insecticide resistance. We have continued to develop both the software infrastructure and tools for interrogating the stored data.
The MIAME and MAGE-OM standards defined by the MGED society provide a specification and implementation of a software infrastructure to facilitate the submission and sharing of data from microarray studies via public repositories. However, although the MAGE object model is flexible enough to support different annotation strategies, the annotation of array descriptions can be complex.
We have developed a graphical Java-based application (Adamant) to assist with submission of microarray designs to public repositories. Output of the application is fully compliant with the standards prescribed by the various public data repositories.
Adamant will allow researchers to annotate and submit their own array designs to public repositories without requiring programming expertise or knowledge of the MAGE-OM or XML. The application has been used to submit a number of ArrayDesigns to the ArrayExpress database.
The innate immune response is the first line of defence against invading pathogens and is regulated by complex signalling and transcriptional networks. Systems biology approaches promise to shed new light on the regulation of innate immunity through the analysis and modelling of these networks. A key initial step in this process is the contextual cataloguing of the components of this system and the molecular interactions that comprise these networks. InnateDB (http://www.innatedb.com) is a molecular interaction and pathway database developed to facilitate systems-level analyses of innate immunity.
Here, we describe the InnateDB curation project, which is manually annotating the human and mouse innate immunity interactome in rich contextual detail, and present our novel curation software system, which has been developed to ensure interactions are curated in a highly accurate and data-standards compliant manner. To date, over 13,000 interactions (protein, DNA and RNA) have been curated from the biomedical literature. We present data illustrating how InnateDB curation of the innate immunity interactome has greatly enhanced network and pathway annotation available for systems-level analysis, and discuss the challenges that face such curation efforts. Significantly, we provide several lines of evidence that analysis of the innate immunity interactome has the potential to identify novel signalling, transcriptional and post-transcriptional regulators of innate immunity. These analyses also provide insight into the cross-talk between innate immunity pathways and other biological processes, such as adaptive immunity, cancer and diabetes, and, intriguingly, suggest links to other pathways that have not yet been implicated in the innate immune response.
In summary, curation of the InnateDB interactome provides a wealth of information to enable systems-level analysis of innate immunity.
Summary: Computational methods in molecular biology will increasingly depend on standards-based annotations that describe biological experiments in an unambiguous manner. Annotare is a software tool that enables biologists to easily annotate their high-throughput experiments, biomaterials and data in a standards-compliant way that facilitates meaningful search and analysis.
Availability and Implementation: Annotare is available from http://code.google.com/p/annotare/ under the terms of the open-source MIT License (http://www.opensource.org/licenses/mit-license.php). It has been tested on both Mac and Windows.
Although policy providers have outlined minimal metadata guidelines and naming conventions, ontologies of today still display inter- and intra-ontology heterogeneities in class labelling schemes and metadata completeness. This fact is at least partially due to missing or inappropriate tools. Software support can ease this situation and contribute to overall ontology consistency and quality by helping to enforce such conventions.
We provide a plugin for the Protégé Ontology editor to allow for easy checks on compliance towards ontology naming conventions and metadata completeness, as well as curation in case of found violations.
In a requirement analysis, derived from a prior standardization approach carried out within the OBO Foundry, we investigate the needed capabilities for software tools to check, curate and maintain class naming conventions. A Protégé tab plugin was implemented accordingly using the Protégé 4.1 libraries. The plugin was tested on six different ontologies. Based on these test results, the plugin could be refined, also by the integration of new functionalities.
The new Protégé plugin, OntoCheck, allows for ontology tests to be carried out on OWL ontologies. In particular, the OntoCheck plugin helps to clean up an ontology with regard to lexical heterogeneity, i.e. enforcing naming conventions and metadata completeness, meeting most of the requirements outlined for such a tool. Found test violations can be corrected to foster consistency in entity naming and meta-annotation within an artefact. Once specified, check constraints like name patterns can be stored and exchanged for later re-use. Here we describe a first version of the software, illustrate its capabilities and use within running ontology development efforts and briefly outline improvements resulting from its application. Further, we discuss OntoCheck's capabilities in the context of related tools and highlight potential future expansions.
The OntoCheck plugin facilitates labelling error detection and curation, contributing to lexical quality assurance in OWL ontologies. Ultimately, we hope this Protégé extension will ease ontology alignments as well as lexical post-processing of annotated data and hence can increase overall secondary data usage by humans and computers.
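The kind of check described above, testing class labels against a name pattern and testing metadata for completeness, can be sketched in a few lines. This is not OntoCheck's code (the plugin is built on the Protégé 4.1 libraries and works on OWL ontologies); the sketch operates on plain dictionaries, and the naming pattern, the required field "definition" and the function name check_classes are assumptions made for illustration.

```python
import re

# Hypothetical convention: labels are lower-case words separated by single
# spaces, and every class must carry a non-empty textual definition.
LABEL_PATTERN = re.compile(r"^[a-z0-9]+( [a-z0-9]+)*$")


def check_classes(classes, pattern=LABEL_PATTERN, required=("definition",)):
    """Return a list of (class_id, problem) tuples, one per violation found."""
    violations = []
    for class_id, meta in classes.items():
        label = meta.get("label", "")
        if not pattern.match(label):
            violations.append(
                (class_id, f"label {label!r} breaks naming pattern")
            )
        for field in required:
            if not meta.get(field, "").strip():
                violations.append((class_id, f"missing {field}"))
    return violations


if __name__ == "__main__":
    example = {
        "C1": {"label": "cell membrane", "definition": "A lipid bilayer."},
        "C2": {"label": "Cell_Nucleus", "definition": ""},
    }
    for class_id, problem in check_classes(example):
        print(class_id, problem)
```

Because the pattern is passed as a parameter, a constraint specified once can be stored and reused across ontologies, mirroring the exchangeable check constraints mentioned in the abstract.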
Ictal single photon emission computed tomography (SPECT) is a powerful tool for noninvasive seizure localization, but it has been underutilized because of practical challenges, including difficulty in implementing ictal-interictal SPECT difference analysis. We previously validated a freely available utility for this purpose, ictal-interictal subtraction analysis by statistical parametric mapping (SPM) (ISAS). To further simplify and improve the difference imaging technique, we now compare a new algorithm, ISAS BioImage Suite (see http://spect.yale.edu and http://bioimagesuite.org), to the original ISAS method in 13 patients with known seizure localization. We found that ISAS BioImage Suite was in agreement with the original algorithm in all cases for which ISAS correctly identified a single unambiguous region of seizure onset. We also tested for possible effects of scan-order bias in the control group used for the analysis and found no significant effect on the results. These findings establish a simple, validated and objective method for analyzing ictal-interictal SPECT difference images for use in the care of patients with epilepsy.
Epilepsy; Single photon emission computed tomography; Surgery; Statistical parametric mapping; Localization
Our group has developed a useful shared software framework for performing, versioning, sharing and viewing Web annotations of a number of kinds, using an open representation model.
The Domeo Annotation Tool was developed in tandem with this open model, the Annotation Ontology (AO). Development of both the Annotation Framework and the open model was driven by requirements of several different types of alpha users, including bench scientists and biomedical curators from university research labs, online scientific communities, publishing and pharmaceutical companies.
Several use cases were incrementally implemented by the toolkit. These use cases in biomedical communications include personal note-taking, group document annotation, semantic tagging, claim-evidence-context extraction, reagent tagging, and curation of textmining results from entity extraction algorithms.
We report on the Domeo user interface here. Domeo has been deployed in beta release as part of the NIH Neuroscience Information Framework (NIF, http://www.neuinfo.org) and is scheduled for production deployment in the NIF’s next full release.
Future papers will describe other aspects of this work in detail, including Annotation Framework Services and components for integrating with external textmining services, such as the NCBO Annotator web service, and with other textmining applications using the Apache UIMA framework.
The IRESite (http://www.iresite.org) presents carefully curated experimental evidence of many eukaryotic viral and cellular internal ribosome entry site (IRES) regions. At the time of submission, IRESite stored >600 records. The IRESite gradually evolved into a robust tool providing (i) biologically meaningful information regarding the IRESs and their experimental background (including annotation of IRES secondary structures and IRES trans-acting factors) as well as (ii) thorough concluding remarks to stored database entries and regularly updated evaluation of the reported IRES function. A substantial portion of the IRESite data results purely from in-house bioinformatic analyses of currently available sequences, in silico attempts to repeat published cloning experiments, DNA sequencing and restriction endonuclease verification of received plasmid DNA. We also present a newly implemented tool for displaying RNA secondary structures and for searching through the structures currently stored in the database. The supplementary material contains an updated list of reported IRESs.
Motivation: Meta-analysis of large gene expression datasets obtained from public repositories requires consistently annotated data. Curation of such experiments, however, is an expert activity which involves repetitive manipulation of text. Existing tools for automated curation are few, which creates a bottleneck in the analysis pipeline.
Results: We present MageComet, a web application for biologists and annotators that facilitates the re-annotation of gene expression experiments in MAGE-TAB format. It incorporates data mining, automatic annotation, use of ontologies and data validation to improve the consistency and quality of experimental meta-data from the ArrayExpress Repository.
Availability and implementation: Source and tutorials for MageComet are openly available at goo.gl/8LQPR under the GNU GPL v3 license. An implementation can be found at goo.gl/IdCuA
Cellular processes depend on the function of intracellular molecular networks. The curation of the literature relevant to specific biological pathways is important for many theoretical and experimental research teams and communities. No current tool supports web publication or hosting of user-developed large scale annotated pathway diagrams. Sharing via web publication is needed to allow real-time access to the current literature pathway knowledgebase, either privately within a research team or publicly among the outside research community. Web publication also facilitates team and/or community input into the curation process while allowing centralized control of the curation and validation process. We have developed a new tool to address these needs. Biological Pathway Publisher (BioPP) is a software suite for converting CellDesigner Systems Biology Markup Language (CD-SBML) formatted pathways into a web viewable format. The BioPP suite is available for private use and for depositing knowledgebases into a newly created public repository.
BioPP suite is a web-based application that allows pathway knowledgebases stored in CD-SBML to be web published with an easily navigated user interface. The BioPP suite consists of four interrelated elements: a pathway publisher, an upload web-interface, a pathway repository for user-deposited knowledgebases and a pathway navigator. Users have the option to convert their CD-SBML files to HTML for restricted use or to allow their knowledgebase to be web-accessible to the scientific community. All entities in all knowledgebases in the repository are linked to public database entries as well as to a newly created public wiki which provides a discussion forum.
BioPP tools and the public repository facilitate sharing of pathway knowledgebases and interactive curation for research teams and scientific communities. BioPP suite is accessible at
The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing and sharing data about microbial biology through an advanced web-based analysis portal. CAMERA collects and links metadata relevant to environmental metagenome data sets with annotation in a semantically-aware environment allowing users to write expressive semantic queries against the database. To meet the needs of the research community, users are able to query metadata categories such as habitat, sample type, time, location and other environmental physicochemical parameters. CAMERA is compliant with the standards promulgated by the Genomic Standards Consortium (GSC), and sustains a role within the GSC in extending standards for content and format of the metagenomic data and metadata and its submission to the CAMERA repository. To ensure wide, ready access to data and annotation, CAMERA also provides data submission tools to allow researchers to share and forward data to other metagenomics sites and community data archives such as GenBank. It has multiple interfaces for easy submission of large or complex data sets, and supports pre-registration of samples for sequencing. CAMERA integrates a growing list of tools and viewers for querying, analyzing, annotating and comparing metagenome and genome data.
The Mouse Genome Database (MGD) is one component of the Mouse Genome Informatics (MGI) system (http://www.informatics.jax.org), a community database resource for the laboratory mouse. MGD strives to provide a comprehensive knowledgebase about the mouse with experiments and data annotated from both literature and online sources. MGD curates and presents consensus and experimental data representations of genetic, genotype (sequence) and phenotype information including highly detailed reports about genes and gene products. Primary foci of integration are through representations of relationships between genes, sequences and phenotypes. MGD collaborates with other bioinformatics groups to curate a definitive set of information about the laboratory mouse and to build and implement the data and semantic standards that are essential for comparative genome analysis. Recent developments in MGD discussed here include an extensive integration of the mouse sequence data and substantial revisions in the presentation, query and visualization of sequence data.
The Mouse Genome Database (MGD) (http://www.informatics.jax.org) is one component of a community database resource for the laboratory mouse, a key model organism for interpreting the human genome and for understanding human biology. MGD strives to provide an extensively integrated information resource with experimental details annotated from both literature and on-line genomic data sources. MGD curates and presents the consensus representation of genotype (sequence) to phenotype information including highly detailed information about genes and gene products. Primary foci of integration are through representations of relationships between genes, sequences and phenotypes. MGD collaborates with other bioinformatics groups to curate a definitive set of information about the laboratory mouse. Recent developments include a general implementation of database structures for controlled vocabularies and the integration of a phenotype classification system.
Summary: Payao is a community-based, collaborative web service platform for gene-regulatory and biochemical pathway model curation. The system combines Web 2.0 technologies and online model visualization functions to enable a collaborative community to annotate and curate biological models. Payao reads the models in Systems Biology Markup Language format, displays them with CellDesigner, a process diagram editor, which complies with the Systems Biology Graphical Notation, and provides an interface for model enrichment (adding tags and comments to the models) for the access-controlled community members.
Availability and implementation: Freely available for model curation service at http://www.payaologue.org. Web site implemented in Seasar Framework 2.0 with S2Flex2, MySQL 5.0 and Tomcat 5.5, with all major browsers supported.
Here, we describe the development of WikiPathways (http://www.wikipathways.org), a public wiki for pathway curation, since it was first published in 2008. New features are discussed, as well as developments in the community of contributors. New features include a zoomable pathway viewer, support for pathway ontology annotations, the ability to mark pathways as private for a limited time and the availability of stable hyperlinks to pathways and the elements therein. WikiPathways content is freely available in a variety of formats such as the BioPAX standard, and the content is increasingly adopted by external databases and tools, including Wikipedia. A recent development is the use of WikiPathways as a staging ground for centrally curated databases such as Reactome. WikiPathways is seeing steady growth in the number of users, page views and edits for each pathway. To assess whether the community curation experiment can be considered successful, here we analyze the relation between use and contribution, which gives results in line with other wiki projects. The novel use of pathway pages as supplementary material to publications, as well as the addition of tailored content for research domains, is expected to stimulate growth further.
The Mouse Genome Database (MGD, http://www.informatics.jax.org) is the international community resource for integrated genetic, genomic and biological data about the laboratory mouse. Data in MGD are obtained through loads from major data providers and experimental consortia, electronic submissions from laboratories and from the biomedical literature. MGD maintains a comprehensive, unified, non-redundant catalog of mouse genome features generated by distilling gene predictions from NCBI, Ensembl and VEGA. MGD serves as the authoritative source for the nomenclature of mouse genes, mutations, alleles and strains. MGD is the primary source for evidence-supported functional annotations for mouse genes and gene products using the Gene Ontology (GO). MGD provides full annotation of phenotypes and human disease associations for mouse models (genotypes) using terms from the Mammalian Phenotype Ontology and disease names from the Online Mendelian Inheritance in Man (OMIM) resource. MGD is freely accessible online through our website, where users can browse and search interactively, access data in bulk using Batch Query or BioMart, download data files or use our web services Application Programming Interface (API). Improvements to MGD include expanded genome feature classifications, inclusion of new mutant allele sets and phenotype associations and extensions of GO to include new relationships and a new stream of annotations via phylogenetic-based approaches.
GlycoSuiteDB is an annotated and curated relational database of glycan structures reported in the literature. It contains information on the glycan type, core type, linkages and anomeric configurations, mass, composition and the analytical methods used by the researchers to determine the glycan structure. Native and recombinant sources are detailed, including species, tissue and/or cell type, cell line, strain, life stage, disease, and if known the protein to which the glycan structures are attached. There are links to SWISS-PROT/TrEMBL and PubMed where applicable. Recent developments include the implementation of searching by 2D structure and substructure, disease and reference. The database is updated twice a year, and now contains over 7650 entries. Access to GlycoSuiteDB is available at http://www.glycosuite.com.
The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses over 40 000 rat gene records as well as human and mouse orthologs, 1771 rat and 1911 human quantitative trait loci (QTLs) and 2209 rat strains. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. A suite of tools has been developed to aid curators in acquiring and validating data objects, assigning nomenclature, attaching biological information to objects and making connections among data types. The software used to assign nomenclature, to create and edit objects and to make annotations to the data objects has been specifically designed to make the curation process as fast and efficient as possible. The user interfaces have been adapted to the work routines of the curators, creating a suite of tools that is intuitive and powerful.
Database URL: http://rgd.mcw.edu
As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles' contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed. We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality.
Today’s biological experiments often involve the collaboration of multidisciplinary researchers utilising several high throughput ‘omics platforms. The details of an experiment need to be adequately described using standardised ontologies to enable data preservation and analysis and to facilitate the export of the data to public repositories. However, there is a bewildering number of ontologies, controlled vocabularies and minimum standards available to describe experiments. There is a need for user-friendly software tools to aid laboratory scientists in capturing the experimental information.
A web application called XperimentR has been developed for use by laboratory scientists, consisting of a browser-based interface and server-side components which provide an intuitive platform for capturing and sharing experimental metadata. Information recorded includes details about the biological samples, procedures, protocols, and experimental technologies, all of which can be easily annotated using the appropriate ontologies. Files and raw data can be imported and associated with the biological samples via the interface, from either users’ computers, or commonly used open-source data repositories. Experiments can be shared with other users, and experiments can be exported in the standard ISA-Tab format for deposition in public databases. XperimentR is freely available and can be installed natively or by using a provided pre-configured Virtual Machine. A guest system is also available for trial purposes.
We present a web-based software application that helps laboratory scientists capture, describe and share details about their experiments.
Experimental annotation; Ontologies; Biological data management