Next-generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis for data interpretation. We have developed an integrated approach for end-to-end clinical NGS data analysis from variant detection to functional profiling. Robust bioinformatics pipelines were implemented for genome alignment, single nucleotide polymorphism (SNP), small insertion/deletion (InDel), and copy number variation (CNV) detection of whole exome sequencing (WES) data from the Illumina platform. Quality-control metrics were analyzed at each step of the pipeline by use of a validated training dataset to ensure data integrity for clinical applications. We annotate the variants with data regarding the disease population and variant impact. Custom algorithms were developed to filter variants based on criteria, such as quality of variant, inheritance pattern, and impact of variant on protein function. The developed clinical variant pipeline links the identified rare variants to Integrated Genome Viewer for visualization in a genomic context and to the Protein Information Resource’s iProXpress for rich protein and disease information. With the application of our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for downstream variant filtering that empowers clinicians and researchers to interpret more effectively the relevance of genomic alterations within a rare genetic disease.
bioinformatics; genetic alterations; Mendelian Genetics; protein information resources
The Protein Ontology (PRO) provides a formal, logically-based classification of specific protein classes including structured representations of protein isoforms, variants and modified forms. Initially focused on proteins found in human, mouse and Escherichia coli, PRO now includes representations of protein complexes. The PRO Consortium works in concert with the developers of other biomedical ontologies and protein knowledge bases to provide the ability to formally organize and integrate representations of precise protein forms so as to enhance accessibility to results of protein research. PRO (http://pir.georgetown.edu/pro) is part of the Open Biomedical Ontology Foundry.
Summary: We have developed a new web application for peptide matching using Apache Lucene-based search engine. The Peptide Match service is designed to quickly retrieve all occurrences of a given query peptide from UniProt Knowledgebase (UniProtKB) with isoforms. The matched proteins are shown in summary tables with rich annotations, including matched sequence region(s) and links to corresponding proteins in a number of proteomic/peptide spectral databases. The results are grouped by taxonomy and can be browsed by organism, taxonomic group or taxonomy tree. The service supports queries where isobaric leucine and isoleucine are treated equivalent, and an option for searching UniRef100 representative sequences, as well as dynamic queries to major proteomic databases. In addition to the web interface, we also provide RESTful web services. The underlying data are updated every 4 weeks in accordance with the UniProt releases.
Supplementary data are available at Bioinformatics online.
The PIRSF protein classification system (http://pir.georgetown.edu/pirsf/) reflects evolutionary relationships of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture). PIRSF families are curated systematically based on literature review and integrative sequence and functional analysis, including sequence and structure similarity, domain architecture, functional association, genome context, and phyletic pattern. The results of classification and expert annotation are summarized in PIRSF family reports with graphical viewers for taxonomic distribution, domain architecture, family hierarchy, and multiple alignment and phylogenetic tree. The PIRSF system provides a comprehensive resource for bioinformatics analysis and comparative studies of protein function and evolution. Domain or fold-based searches allow identification of evolutionarily related protein families sharing domains or structural folds. Functional convergence and functional divergence are revealed by the relationships between protein classification and curated family functions. The taxonomic distribution allows the identification of lineage-specific or broadly conserved protein families and can reveal horizontal gene transfer. Here we demonstrate, with illustrative examples, how to use the web-based PIRSF system as a tool for functional and evolutionary studies of protein families.
Domain architecture; Functional convergence; Functional divergence; Genome context; Protein family classification; Taxonomic distribution
Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large amount of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task.
A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations.
In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus. Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand crafted patterns. We demonstrate how our framework is used to develop a system which achieves state-of-the-art performance on a public benchmark corpus.
Protein phosphorylation is central to the regulation of most aspects of cell function. Given its importance, it has been the subject of active research as well as the focus of curation in several biological databases. We have developed Rule-based Literature Mining System for protein Phosphorylation (RLIMS-P), an online text-mining tool to help curators identify biomedical research articles relevant to protein phosphorylation. The tool presents information on protein kinases, substrates and phosphorylation sites automatically extracted from the biomedical literature. The utility of the RLIMS-P Web site has been evaluated by curators from Phospho.ELM, PhosphoGRID/BioGrid and Protein Ontology as part of the BioCreative IV user interactive task (IAT). The system achieved F-scores of 0.76, 0.88 and 0.92 for the extraction of kinase, substrate and phosphorylation sites, respectively, and a precision of 0.88 in the retrieval of relevant phosphorylation literature. The system also received highly favorable feedback from the curators in a user survey. Based on the curators’ suggestions, the Web site has been enhanced to improve its usability. In the RLIMS-P Web site, phosphorylation information can be retrieved by PubMed IDs or keywords, with an option for selecting targeted species. The result page displays a sortable table with phosphorylation information. The text evidence page displays the abstract with color-coded entity mentions and includes links to UniProtKB entries via normalization, i.e. the linking of entity mentions to database identifiers, facilitated by the GenNorm tool and by the links to the bibliography in UniProt. Log in and editing capabilities are offered to any user interested in contributing to the validation of RLIMS-P results. Retrieved phosphorylation information can also be downloaded in CSV format and the text evidence in the BioC format. RLIMS-P is freely available.
Chondrichthyan fishes are a diverse class of gnathostomes that provide a valuable perspective on fundamental characteristics shared by all jawed and limbed vertebrates. Studies of phylogeny, species diversity, population structure, conservation, and physiology are accelerated by genomic, transcriptomic and protein sequence data. These data are widely available for many sarcopterygii (coelacanth, lungfish and tetrapods) and actinoptergii (ray-finned fish including teleosts) taxa, but limited for chondrichthyan fishes. In this study, we summarize available data for chondrichthyes and describe resources for one of the largest projects to characterize one of these fish,
Leucoraja erinacea, the little skate. SkateBase (
http://skatebase.org) serves as the skate genome project portal linking data, research tools, and teaching resources.
BioC is a new simple XML format for sharing biomedical text and annotations and libraries to read and write that format. This promotes the development of interoperable tools for natural language processing (NLP) of biomedical text. The interoperability track at the BioCreative IV workshop featured contributions using or highlighting the BioC format. These contributions included additional implementations of BioC, many new corpora in the format, biomedical NLP tools consuming and producing the format and online services using the format. The ease of use, broad support and rapidly growing number of tools demonstrate the need for and value of the BioC format.
This article reports the use of the BioC standard format in our sentence simplification system, iSimp, and demonstrates its general utility. iSimp is designed to simplify complex sentences commonly found in the biomedical text, and has been shown to improve existing text mining applications that rely on the analysis of sentence structures. By adopting the BioC format, we aim to make iSimp readily interoperable with other applications in the biomedical domain. To examine the utility of iSimp in BioC, we implemented a rule-based relation extraction system that uses iSimp as a preprocessing module and BioC for data exchange. Evaluation on the training corpus of BioNLP-ST 2011 GENIA Event Extraction (GE) task showed that iSimp sentence simplification improved the recall by 3.2% without reducing precision. The iSimp simplification-annotated corpora, both our previously used corpus and the GE corpus in the current study, have been converted into the BioC format and made publicly available at the project’s Web site: http://research.bioinformatics.udel.edu/isimp/.
When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets.
We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7.
Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacteria genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness.
Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves the memory efficiency when dealing with large datasets. We recommend combining sequencing artifacts removal, and quality score based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies.
ngsShoRT source code, user guide and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.
Next-generation sequencing; Illumina; Trimming; De novo assembly; Reference-based assembly; Perl
Motivation: Prediction of protein–protein interaction has become an important part of systems biology in reverse engineering the biological networks for better understanding the molecular biology of the cell. Although significant progress has been made in terms of prediction accuracy, most computational methods only predict whether two proteins interact but not their interacting residues—the information that can be very valuable for understanding the interaction mechanisms and designing modulation of the interaction. In this work, we developed a computational method to predict the interacting residue pairs—contact matrix for interacting protein domains, whose rows and columns correspond to the residues in the two interacting domains respectively and whose values (1 or 0) indicate whether the corresponding residues (do or do not) interact.
Results: Our method is based on supervised learning using support vector machines. For each domain involved in a given domain–domain interaction (DDI), an interaction profile hidden Markov model (ipHMM) is first built for the domain family, and then each residue position for a member domain sequence is represented as a 20-dimension vector of Fisher scores, characterizing how similar it is as compared with the family profile at that position. Each element of the contact matrix for a sequence pair is now represented by a feature vector from concatenating the vectors of the two corresponding residues, and the task is to predict the element value (1 or 0) from the feature vector. A support vector machine is trained for a given DDI, using either a consensus contact matrix or contact matrices for individual sequence pairs, and is tested by leave-one-out cross validation. The performance averaged over a set of 115 DDIs collected from the 3 DID database shows significant improvement (sensitivity up to 85%, and specificity up to 85%), as compared with a multiple sequence alignment-based method (sensitivity 57%, and specificity 78%) previously reported in the literature.
firstname.lastname@example.org or email@example.com
Supplementary data are available at Bioinformatics online.
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.
The Protein Ontology (PRO; http://proconsortium.org) formally defines protein entities and explicitly represents their major forms and interrelations. Protein entities represented in PRO corresponding to single amino acid chains are categorized by level of specificity into family, gene, sequence and modification metaclasses, and there is a separate metaclass for protein complexes. All metaclasses also have organism-specific derivatives. PRO complements established sequence databases such as UniProtKB, and interoperates with other biomedical and biological ontologies such as the Gene Ontology (GO). PRO relates to UniProtKB in that PRO’s organism-specific classes of proteins encoded by a specific gene correspond to entities documented in UniProtKB entries. PRO relates to the GO in that PRO’s representations of organism-specific protein complexes are subclasses of the organism-agnostic protein complex terms in the GO Cellular Component Ontology. The past few years have seen growth and changes to the PRO, as well as new points of access to the data and new applications of PRO in immunology and proteomics. Here we describe some of these developments.
Organisms of the genus Clostridium are Gram-positive endospore formers of great importance to the carbon cycle, human normo- and pathophysiology, but also in biofuel and biorefinery applications. Exposure of Clostridium organisms to chemical and in particular toxic metabolite stress is ubiquitous in both natural (such as in the human microbiome) and engineered environments, engaging both the general stress response as well as specialized programs. Yet, despite its fundamental and applied significance, it remains largely unexplored at the systems level.
We generated a total of 96 individual sets of microarray data examining the transcriptional changes in C. acetobutylicum, a model Clostridium organism, in response to three levels of chemical stress from the native metabolites, butanol and butyrate. We identified 164 significantly differentially expressed transcriptional regulators and detailed the cellular programs associated with general and stressor-specific responses, many previously unexplored. Pattern-based, comparative genomic analyses enabled us, for the first time, to construct a detailed picture of the genetic circuitry underlying the stress response. Notably, a list of the regulons and DNA binding motifs of the stress-related transcription factors were identified: two heat-shock response regulators, HrcA and CtsR; the SOS response regulator LexA; the redox sensor Rex; and the peroxide sensor PerR. Moreover, several transcriptional regulators controlling stress-responsive amino acid and purine metabolism and their regulons were also identified, including ArgR (arginine biosynthesis and catabolism regulator), HisR (histidine biosynthesis regulator), CymR (cysteine metabolism repressor) and PurR (purine metabolism repressor).
Using an exceptionally large set of temporal transcriptional data and regulon analyses, we successfully built a STRING-based stress response network model integrating important players for the general and specialized metabolite stress response in C. acetobutylicum. Since the majority of the transcription factors and their target genes are highly conserved in other organisms of the Clostridium genus, this network would be largely applicable to other Clostridium organisms. The network informs the molecular basis of Clostridium responses to toxic metabolites in natural ecosystems and the microbiome, and will facilitate the construction of genome-scale models with added regulatory-network dimensions to guide the development of tolerant strains.
Gene expression; Protein-protein interaction; Transcriptional regulatory network (TRN); Transcription factor (TF); TF binding site (TFBS); Transcriptional regulator (TR)
Genomic, proteomic, and other omic-based approaches are now broadly used in biomedical research to facilitate the understanding of disease mechanisms and identification of molecular targets and biomarkers for therapeutic and diagnostic development. While the Omics technologies and bioinformatics tools for analyzing Omics data are rapidly advancing, the functional analysis and interpretation of the data remain challenging due to the inherent nature of the generally long workflows of Omics experiments. We adopt a strategy that emphasizes the use of curated knowledge resources coupled with expert-guided examination and interpretation of Omics data for the selection of potential molecular targets. We describe a downstream workflow and procedures for functional analysis that focus on biological pathways, from which molecular targets can be derived and proposed for experimental validation.
Proteomics; Genomics; Bioinformatics; Biological pathways; Cell signaling; Databases; Molecular targets; Biomarkers
Knowledge representation of the role of phosphorylation is essential for the meaningful understanding of many biological processes. However, such a representation is challenging because proteins can exist in numerous phosphorylated forms with each one having its own characteristic protein–protein interactions (PPIs), functions and subcellular localization. In this article, we evaluate the current state of phosphorylation event curation and then present a bioinformatics framework for the annotation and representation of phosphorylated proteins and construction of phosphorylation networks that addresses some of the gaps in current curation efforts. The integrated approach involves (i) text mining guided by RLIMS-P, a tool that identifies phosphorylation-related information in scientific literature; (ii) data mining from curated PPI databases; (iii) protein form and complex representation using the Protein Ontology (PRO); (iv) functional annotation using the Gene Ontology (GO); and (v) network visualization and analysis with Cytoscape. We use this framework to study the spindle checkpoint, the process that monitors the assembly of the mitotic spindle and blocks cell cycle progression at metaphase until all chromosomes have made bipolar spindle attachments. The phosphorylation networks we construct, centered on the human checkpoint kinase BUB1B (BubR1) and its yeast counterpart MAD3, offer a unique view of the spindle checkpoint that emphasizes biologically relevant phosphorylated forms, phosphorylation-state–specific PPIs and kinase–substrate relationships. Our approach for constructing protein phosphorylation networks can be applied to any biological process that is affected by phosphorylation.
Protein phosphorylation is a central regulatory mechanism in signal transduction involved in most biological processes. Phosphorylation of a protein may lead to activation or repression of its activity, alternative subcellular location and interaction with different binding partners. Extracting this type of information from scientific literature is critical for connecting phosphorylated proteins with kinases and interaction partners, along with their functional outcomes, for knowledge discovery from phosphorylation protein networks. We have developed the Extracting Functional Impact of Phosphorylation (eFIP) text mining system, which combines several natural language processing techniques to find relevant abstracts mentioning phosphorylation of a given protein together with indications of protein–protein interactions (PPIs) and potential evidences for impact of phosphorylation on the PPIs. eFIP integrates our previously developed tools, Extracting Gene Related ABstracts (eGRAB) for document retrieval and name disambiguation, Rule-based LIterature Mining System (RLIMS-P) for Protein Phosphorylation for extraction of phosphorylation information, a PPI module to detect PPIs involving phosphorylated proteins and an impact module for relation extraction. The text mining system has been integrated into the curation workflow of the Protein Ontology (PRO) to capture knowledge about phosphorylated proteins. The eFIP web interface accepts gene/protein names or identifiers, or PubMed identifiers as input, and displays results as a ranked list of abstracts with sentence evidence and summary table, which can be exported in a spreadsheet upon result validation. As a participant in the BioCreative-2012 Interactive Text Mining track, the performance of eFIP was evaluated on document retrieval (F-measures of 78–100%), sentence-level information extraction (F-measures of 70–80%) and document ranking (normalized discounted cumulative gain measures of 93–100% and mean average precision of 0.86). The utility and usability of the eFIP web interface were also evaluated during the BioCreative Workshop. The use of the eFIP interface provided a significant speed-up (∼2.5-fold) for time to completion of the curation task. Additionally, eFIP significantly simplifies the task of finding relevant articles on PPI involving phosphorylated forms of a given protein.
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.
Motivation: Identifier (ID) mapping establishes links between various biological databases and is an essential first step for molecular data integration and functional annotation. ID mapping allows diverse molecular data on genes and proteins to be combined and mapped to functional pathways and ontologies. We have developed comprehensive protein-centric ID mapping services providing mappings for 90 IDs derived from databases on genes, proteins, pathways, diseases, structures, protein families, protein interaction, literature, ontologies, etc. The services are widely used and have been regularly updated since 2006.
Recent advances in high-throughput DNA sequencing technologies have equipped biologists with a powerful new set of tools for advancing research goals. The resulting flood of sequence data has made it critically important to train the next generation of scientists to handle the inherent bioinformatic challenges. The North East Bioinformatics Collaborative (NEBC) is undertaking the genome sequencing and annotation of the little skate (Leucoraja erinacea) to promote advancement of bioinformatics infrastructure in our region, with an emphasis on practical education to create a critical mass of informatically savvy life scientists. In support of the Little Skate Genome Project, the NEBC members have developed several annotation workshops and jamborees to provide training in genome sequencing, annotation and analysis. Acting as a nexus for both curation activities and dissemination of project data, a project web portal, SkateBase (http://skatebase.org) has been developed. As a case study to illustrate effective coupling of community annotation with workforce development, we report the results of the Mitochondrial Genome Annotation Jamborees organized to annotate the first completely assembled element of the Little Skate Genome Project, as a culminating experience for participants from our three prior annotation workshops. We are applying the physical/virtual infrastructure and lessons learned from these activities to enhance and streamline the genome annotation workflow, as we look toward our continuing efforts for larger-scale functional and structural community annotation of the L. erinacea genome.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools which are useful to the communities of researchers and database curators in the biological sciences. To this end BioCreative I was held in 2004, BioCreative II in 2007, and BioCreative II.5 in 2009. Each of these workshops involved humanly annotated test data for several basic tasks in text mining applied to the biomedical literature. Participants in the workshops were invited to compete in the tasks by constructing software systems to perform the tasks automatically and were given scores based on their performance. The results of these workshops have benefited the community in several ways. They have 1) provided evidence for the most effective methods currently available to solve specific problems; 2) revealed the current state of the art for performance on those problems; 3) and provided gold standard data and results on that data by which future advances can be gauged. This special issue contains overview papers for the three tasks of BioCreative III.
The BioCreative III Workshop was held in September of 2010 and continued the tradition of a challenge evaluation on several tasks judged basic to effective text mining in biology, including a gene normalization (GN) task and two protein-protein interaction (PPI) tasks. In total the Workshop involved the work of twenty-three teams. Thirteen teams participated in the GN task which required the assignment of EntrezGene IDs to all named genes in full text papers without any species information being provided to a system. Ten teams participated in the PPI article classification task (ACT) requiring a system to classify and rank a PubMed® record as belonging to an article either having or not having “PPI relevant” information. Eight teams participated in the PPI interaction method task (IMT) where systems were given full text documents and were required to extract the experimental methods used to establish PPIs and a text segment supporting each such method. Gold standard data was compiled for each of these tasks and participants competed in developing systems to perform the tasks automatically.
BioCreative III also introduced a new interactive task (IAT), run as a demonstration task. The goal was to develop an interactive system to facilitate a user’s annotation of the unique database identifiers for all the genes appearing in an article. This task included ranking genes by importance (based preferably on the amount of described experimental information regarding genes). There was also an optional task to assist the user in finding the most relevant articles about a given gene. For BioCreative III, a user advisory group (UAG) was assembled and played an important role 1) in producing some of the gold standard annotations for the GN task, 2) in critiquing IAT systems, and 3) in providing guidance for a future more rigorous evaluation of IAT systems. Six teams participated in the IAT demonstration task and received feedback on their systems from the UAG group. Besides innovations in the GN and PPI tasks making them more realistic and practical and the introduction of the IAT task, discussions were begun on community data standards to promote interoperability and on user requirements and evaluation metrics to address utility and usability of systems.
In this paper we give a brief history of the BioCreative Workshops and how they relate to other text mining competitions in biology. This is followed by a synopsis of the three tasks GN, PPI, and IAT in BioCreative III with figures for best participant performance on the GN and PPI tasks. These results are discussed and compared with results from previous BioCreative Workshops and we conclude that the best performing systems for GN, PPI-ACT and PPI-IMT in realistic settings are not sufficient for fully automatic use. This provides evidence for the importance of interactive systems and we present our vision of how best to construct an interactive system for a GN or PPI like task in the remainder of the paper.