Explicitly identifying the genome of a host organism, including sequencing, mapping, and annotating its genetic code, has become a priority in biotechnology, with the aim of improving the efficiency and understanding of cell culture bioprocessing. Recombinant protein therapeutics, primarily produced in mammalian cells, constitute a $108 billion global market. The most common mammalian cell line used in biologic production processes is the Chinese hamster ovary (CHO) cell line, and although great improvements have been made in titer production over the past 25 years, the underlying molecular and physiological factors are not well understood. Confident understanding of CHO bioprocessing elements (e.g. cell line selection, protein production, and reproducibility of process performance and product specifications) would improve significantly with a well-understood genome. This review describes mammalian cell culture use in bioprocessing, the importance of obtaining CHO cell line genetic sequences, and the current status of sequencing efforts. Furthermore, transcriptomic techniques and gene expression tools are presented, and case studies exploring genomic techniques and applications aimed at improving mammalian bioprocess performance are reviewed. Finally, future implications of genomic advances are discussed.
Biologics; Genomics; Mammalian cell culture bioprocessing; Next-generation genomic sequencing; Proteomics; Recombinant DNA technology; Recombinant protein production; Transcriptomics
Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions, the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge, which asked the participants to annotate the corpus with their text processing solutions.
All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE than for entities categorized as DISO and SPE. The best performance across all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I.
The data sets from participants were used to generate the harmonised Silver Standard Corpus II (SSC-II), provided the participant had not made use of the annotated SSC-I data set for training purposes. The performances of the participants' solutions were again measured against the SSC-II and again showed better results for DISO and SPE than for CHED and PRGE.
The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs’ annotation solutions in comparison to the SSC-I.
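For reference, the F-measure used above is the harmonic mean of precision and recall. A minimal sketch (with invented counts, not the CALBC evaluation data): a classifier whose annotations match the silver standard with precision and recall of 0.85 reaches the reported average F-measure of 85%.

```python
def f_measure(tp, fp, fn):
    """F1: harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 85 true positives, 15 false positives,
# 15 false negatives give precision = recall = F-measure = 0.85.
print(round(f_measure(85, 15, 15), 2))  # -> 0.85
```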
Detecting uncertain and negative assertions is essential in most biomedical text mining tasks where, in general, the aim is to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus).
The corpus consists of three parts: medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist, who was also responsible for setting up the annotation guidelines and who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences considered for annotation, over 10% of which contain one or more linguistic annotations suggesting negation or uncertainty.
Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.
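The two annotation levels can be pictured with a small sketch. This is not the BioScope annotation scheme itself: the cue lists are toy examples and the scope rule (cue to end of sentence) is a deliberate simplification of the guidelines.

```python
# Toy negation/speculation cue lists; the real BioScope keyword
# inventory is far larger and context-dependent.
NEGATION_CUES = {"no", "not", "without", "absence"}
SPECULATION_CUES = {"may", "might", "suggest", "possible"}

def annotate(sentence):
    """Tag cue tokens, plus a naive scope running to the sentence end."""
    tokens = sentence.lower().split()
    annotations = []
    for i, tok in enumerate(tokens):
        if tok in NEGATION_CUES:
            annotations.append(("negation", tok, (i, len(tokens))))
        elif tok in SPECULATION_CUES:
            annotations.append(("speculation", tok, (i, len(tokens))))
    return annotations

print(annotate("These findings suggest no role for p53"))
# -> [('speculation', 'suggest', (2, 7)), ('negation', 'no', (3, 7))]
```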
Most pharmacogenomics knowledge is contained in the text of published studies and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), such rules and ontologies may not be available. Recent progress in syntactic NLP parsing, in the context of a large corpus of pharmacogenomics text, provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million Medline abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and map them to a common schema. Our extracted relationships have a precision of 70 to 87.7% and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and to provide a computable resource for knowledge discovery.
Relationship Extraction; Pharmacogenomics; Natural Language Processing; Ontology; Knowledge Acquisition; Data Integration; Biological Network; Text Mining; Information Extraction
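The mapping from parsed statements to a common schema can be sketched as follows. The lexicon entries and the relation-normalisation table are invented for illustration; the paper's actual lexicon and schema are far richer.

```python
# Hypothetical entity lexicon and relation-normalisation table.
LEXICON = {"VKORC1": "Gene", "warfarin": "Drug",
           "clotting disorder": "Phenotype"}
RELATION_MAP = {"influences": "affects", "affects": "affects",
                "predicts": "associated_with"}

def extract(triples):
    """Map (subject, verb, object) triples from a syntactic parse to
    schema relations, keeping only triples whose arguments are known
    PGx entities and whose verb normalises to a schema relation."""
    relations = []
    for subj, verb, obj in triples:
        if subj in LEXICON and obj in LEXICON and verb in RELATION_MAP:
            relations.append(
                (LEXICON[subj], subj, RELATION_MAP[verb], LEXICON[obj], obj))
    return relations

print(extract([("VKORC1", "influences", "warfarin"),
               ("VKORC1", "binds", "warfarin")]))
# -> [('Gene', 'VKORC1', 'affects', 'Drug', 'warfarin')]
```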
The threat of bioterrorism and emerging infectious diseases has prompted various public health agencies to recommend enhanced surveillance activities to supplement existing surveillance plans. The majority of emerging infectious diseases and bioterrorist agents are zoonotic. Animals are more sensitive to certain biological agents, and their use as clinical sentinels for early detection is warranted.
This article provides design methods for a local integrated zoonotic surveillance plan and materials developed for veterinarians to assist in the early detection of bioevents. Zoonotic surveillance in the U.S. is currently too limited and compartmentalized for broader public health objectives. To rapidly detect and respond to bioevents, collaboration and cooperation among various agencies at the federal, state, and local levels must be enhanced and maintained. Co-analysis of animal and human diseases may facilitate the response to infectious disease events and limit morbidity and mortality in both animal and human populations.
Because of privacy concerns and the expense involved in creating an annotated corpus, existing small annotated corpora might not contain sufficient examples for learning to statistically extract all named entities precisely. In this work, we evaluate what value may lie in automatically generated features based on distributional semantics when using machine-learning named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM) regions, and term clustering, all of which are considered distributional semantic features. Adding the n-nearest words feature to a baseline system resulted in a greater increase in F-score than adding a manually constructed lexicon. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons but also replace them. This phenomenon is observed in extracting concepts from both biomedical literature and clinical notes.
natural language processing; distributional semantics; concept extraction; named entity recognition; empirical lexical resources
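The n-nearest words feature can be sketched from first principles: represent each word by its co-occurrence counts over a context window, rank other words by cosine similarity, and feed the top n as features to the NER classifier. The two-sentence corpus and window size below are toy values, not the paper's actual setup.

```python
from collections import Counter
from math import sqrt

def cooc_vectors(sentences, window=2):
    """Build a co-occurrence count vector for each token."""
    vecs = {}
    for sent in sentences:
        toks = sent.lower().split()
        for i, t in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(t, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def n_nearest(word, vecs, n=1):
    """The n distributionally most similar words to `word`."""
    others = [(cosine(vecs[word], v), w) for w, v in vecs.items() if w != word]
    return [w for _, w in sorted(others, reverse=True)[:n]]

sents = ["aspirin reduces fever", "ibuprofen reduces fever"]
print(n_nearest("aspirin", cooc_vectors(sents)))  # -> ['ibuprofen']
```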
This paper explores the use of the resources in the National Library of Medicine's Unified Medical Language System (UMLS) for the construction of a lexicon useful for processing texts in the field of molecular biology. A lexicon is constructed from overlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurus to obtain both morphosyntactic and semantic information for terms, and the coverage of a domain corpus is assessed. Over 77% of tokens in the domain corpus are found in the constructed lexicon, validating the lexicon's coverage of the most frequent terms in the domain and indicating that the constructed lexicon is potentially an important resource for biological text processing.
Progress in the "-omic" sciences has provided deeper knowledge of many biological systems of industrial interest. This knowledge is still rarely used for advanced bioprocess monitoring and control at the bioreactor level. In this work, a bioprocess control method is presented that is designed on the basis of the metabolic network of the organism under consideration. The bioprocess dynamics are formulated using hybrid rigorous/data-driven systems, and their inherent structure is defined by the elementary modes of the metabolism.
The metabolic network of the system under study is decomposed into elementary modes (EMs), the simplest paths able to operate coherently in steady state. A reduced reaction mechanism is obtained in the form of simplified reactions connecting substrates with end-products. A dynamical hybrid system integrating material balance equations, EM reaction stoichiometries and kinetics was formulated. EM kinetics were defined as the product of two terms: a known mechanistic/empirical term and an unknown term that must be identified from data, from a process optimisation perspective. This approach allows the quantification of fluxes carried by individual elementary modes, which greatly helps to identify dominant pathways as a function of environmental conditions. The methodology was employed to analyse experimental data from recombinant Baby Hamster Kidney (BHK-21A) cultures producing a recombinant fusion glycoprotein. The identified EM kinetics showed typical glucose and glutamine metabolic responses during cell growth and IgG1-IL2 synthesis. Finally, an online optimisation study was conducted in which the optimal feeding strategies of glucose and glutamine were calculated after re-estimation of model parameters at each sampling time. This online optimisation improved the final product concentration.
The main contribution of this work is a novel bioreactor optimal control method that uses detailed information concerning the metabolism of the underlying biological system. Moreover, the method allows the identification of structural modifications in metabolism over batch time.
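The hybrid rate structure described above (a known kinetic term multiplied by a data-identified term) can be illustrated with a toy simulation. Every number below (stoichiometric coefficients, Monod constants, the identified gain, initial concentrations) is invented for illustration and is not taken from the BHK-21A model.

```python
def simulate(hours, dt=0.01, gain=1.0):
    """Euler integration of material balances driven by one elementary
    mode; the EM flux is a known Monod term times the identified gain."""
    glc, gln, product, biomass = 20.0, 4.0, 0.0, 0.5  # illustrative units
    t = 0.0
    while t < hours:
        # Hybrid EM kinetics: mechanistic term * data-identified gain.
        flux = gain * (glc / (glc + 0.5)) * (gln / (gln + 0.2))
        glc -= 0.8 * flux * biomass * dt       # glucose uptake
        gln -= 0.2 * flux * biomass * dt       # glutamine uptake
        product += 0.1 * flux * biomass * dt   # glycoprotein synthesis
        biomass += 0.05 * flux * biomass * dt  # growth
        t += dt
    return glc, gln, product, biomass
```

In the full method the gain would be re-estimated from measurements at each sampling time; here it is simply a fixed parameter.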
We present a predictive bioprocess design strategy employing cell- and molecular-level analysis of rate-limiting steps in human pluripotent stem cell (hPSC) expansion and differentiation, and apply it to produce definitive endoderm (DE) progenitors using a scalable directed-differentiation technology. We define a bioprocess optimization parameter (L; targeted cell Loss) and, with quantitative cell division tracking and fate monitoring, identify and overcome key suspension bioprocess bottlenecks. Adapting process operating conditions to pivotal parameters (single cell survival and growth rate) in a cell-line specific manner enabled adherent-equivalent expansion of hPSCs in feeder- and matrix-free defined-medium suspension culture. Predominantly instructive differentiation mechanisms were found to underlie a subsequent 18-fold expansion, during directed differentiation, to high-purity DE competent for further commitment along pancreatic and hepatic lineages. This study demonstrates that iPSC expansion and differentiation conditions can be prospectively specified to guide the enhanced production of target cells in a scale-free directed differentiation system.
bioprocess; pluripotent stem cells; differentiation; endoderm; expansion
Advanced text mining (TM) tasks such as semantic enrichment of papers, event or relation extraction, and intelligent question answering have increasingly attracted attention in the biomedical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been attempted on a large scale, apart from relatively simple term annotation.
We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (parts of speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design an annotation scheme that meets the specific requirements of text annotation, (2) to achieve biology-oriented annotation that reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to the successful completion of a large-scale annotation.
The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.
The increasing commercial demand for L-carnitine has led to a multiplication of efforts to improve its production with bacteria. The use of different cell environments, such as growing, resting, permeabilized, dried, osmotically stressed, freely suspended and immobilized cells, to maintain enzymes sufficiently active for L-carnitine production is discussed. The different cell states of enterobacteria, such as Escherichia coli and Proteus sp., which can be used to produce L-carnitine from crotonobetaine or D-carnitine as substrate, are analyzed. Moreover, the combined application of bioprocess and metabolic engineering has allowed a deeper understanding of the main factors controlling the production process, such as energy depletion and the alteration of the acetyl-CoA/CoA ratio, which are coupled to the end of the biotransformation. Furthermore, the profiles of key central metabolic activities, such as the TCA cycle, the glyoxylate shunt and acetate metabolism, are closely interrelated and affect the biotransformation efficiency. Although genetically modified strains have been obtained, new strain improvement strategies are still needed, especially in Escherichia coli as a model organism for molecular biology studies. This review aims to summarize and update the state of the art in L-carnitine production using E. coli and Proteus sp., emphasizing the importance of proper reactor design and operation strategies, together with metabolic engineering aspects and the need for feedback between wet and in silico work to optimize this biotransformation.
We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.
Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.
The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. In order to extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an important resource for the training of domain-specific information extraction (IE) systems, to facilitate semantic-based searching of documents. Correct interpretation of these events is not possible without additional information, e.g., does an event describe a fact, a hypothesis, an experimental result or an analysis of results? How confident is the author about the validity of her analyses? These and other types of information, which we collectively term meta-knowledge, can be derived from the context of the event.
We have designed an annotation scheme for meta-knowledge enrichment of biomedical event corpora. The scheme is multi-dimensional, in that each event is annotated for 5 different aspects of meta-knowledge that can be derived from the textual context of the event. Textual clues used to determine the values are also annotated. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed in the text. We report here on both the main features of the annotation scheme, as well as its application to the GENIA event corpus (1000 abstracts with 36,858 events). High levels of inter-annotator agreement have been achieved, falling in the range of 0.84-0.93 Kappa.
By augmenting event annotations with meta-knowledge, more sophisticated IE systems can be trained, which allow interpretative information to be specified as part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regards to the diversity of meta-knowledge aspects annotated for each event.
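The Kappa statistic reported above can be computed as follows; the two annotator label sequences in the example are invented, not the GENIA meta-knowledge data.

```python
def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators,
    given equal-length lists of category labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n)
                   for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Hypothetical labels for five events from two annotators.
ann1 = ["fact", "fact", "fact", "hypothesis", "hypothesis"]
ann2 = ["fact", "fact", "hypothesis", "hypothesis", "hypothesis"]
print(round(cohens_kappa(ann1, ann2), 2))  # -> 0.62
```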
The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than, and as accurate as, existing tools. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
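The core of a dictionary-based tagger like SPECIES can be sketched with greedy longest-match lookup. The real tool uses heavily optimised data structures and a taxonomic dictionary of millions of names, so the three-entry dictionary below is purely illustrative.

```python
# Toy dictionary mapping lowercased taxon names to NCBI Taxonomy IDs.
DICTIONARY = {
    "escherichia coli": "NCBI:562",
    "e. coli": "NCBI:562",
    "homo sapiens": "NCBI:9606",
}
MAX_WORDS = max(len(k.split()) for k in DICTIONARY)

def tag(text):
    """Greedy longest-match tagging of dictionary entries in text."""
    tokens = text.lower().split()
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in DICTIONARY:
                hits.append((candidate, DICTIONARY[candidate]))
                i += n
                break
        else:
            i += 1
    return hits

print(tag("Strains of Escherichia coli and Homo sapiens cells"))
# -> [('escherichia coli', 'NCBI:562'), ('homo sapiens', 'NCBI:9606')]
```

A production tagger would also normalise punctuation and handle overlapping or ambiguous names, which this sketch deliberately omits.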
Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
In recent years, bacterial inclusion bodies (IBs) have been recognised as highly pure deposits of active proteins inside bacterial cells. Such active nanoparticles are very interesting for downstream protein isolation, as well as for many other applications in the nanomedicine, cosmetic, chemical and pharmaceutical industries.
To prepare large quantities of a high quality product, the whole bioprocess has to be optimised. This includes not only the cultivation of the bacterial culture, but also the isolation step itself, which can be of critical importance for the production process.
To determine the most appropriate method for the isolation of biologically active nanoparticles, three methods for bacterial cell disruption were analyzed.
In this study, enzymatic lysis and two mechanical methods, high-pressure homogenization and sonication, were compared.
During enzymatic lysis, the enzyme lysozyme was found to attach to the surface of IBs and could not be removed by simple washing. As this represents an additional impurity in the engineered nanoparticles, we concluded that enzymatic lysis is not the most suitable method for IB isolation.
During sonication, proteins are released (lost) from the surface of IBs, and thus the surface of IBs appears more porous than with the other two methods. We also found that the acoustic output power needed to isolate the IBs from bacterial cells actually damages protein structures, thereby reducing biological activity.
High-pressure homogenization also caused some damage to IBs; however, the protein loss from the IBs was negligible. Furthermore, homogenization had no side effects on protein biological activity.
The study shows that among the three methods tested, homogenization is the most appropriate method for the isolation of active nanoparticles from bacterial cells.
Therapeutic monoclonal antibodies (mAbs) currently dominate the biologics marketplace. Development of a new therapeutic mAb candidate is a complex, multistep process and early stages of development typically begin in an academic research environment. Recently, a number of facilities and initiatives have been launched to aid researchers along this difficult path and facilitate progression of the next mAb blockbuster. Complementing this, there has been a renewed interest from the pharmaceutical industry to reconnect with academia in order to boost dwindling pipelines and encourage innovation. In this review, we examine the steps required to take a therapeutic mAb from discovery through early stage preclinical development and toward becoming a feasible clinical candidate. Discussion of the technologies used for mAb discovery, production in mammalian cells and innovations in single-use bioprocessing is included. We also examine regulatory requirements for product quality and characterization that should be considered at the earliest stages of mAb development. We provide details on the facilities available to help researchers and small-biotech build value into early stage product development, and include examples from within our own facility of how technologies are utilized and an analysis of our client base.
monoclonal antibody; preclinical development; biologics; CHO cells; cell culture
Silk-elastin-like proteins (SELPs) combining the physicochemical and biological properties of silk and elastin have a high potential for use in the pharmaceutical, regenerative medicine and materials fields. Their development is, however, restrained by low production levels. Here we describe the optimisation of batch production for a novel, recently described SELP in the pET-E. coli BL21(DE3) expression system. Both a comprehensive empirical approach examining all process variables (media, induction time and period, temperature, pH, aeration and agitation) and a detailed characterisation of the bioprocess were carried out in an attempt to maximise production with this system.
This study shows that maximum SELP volumetric production is achieved at 37°C using Terrific Broth at pH 6–7.5, a shake flask volume to medium volume ratio of 10:1 and an agitation speed of 200 rpm. Maximum induction is attained at the beginning of the stationary phase with 0.5 mM IPTG and an induction period of at least 4 hours. We show that the selection agents ampicillin and carbenicillin are rapidly degraded early in the cultivation and that plasmid stability decreases dramatically on induction. Furthermore, acetate accumulates during the bioprocess to levels shown to be inhibitory to the host cells. Using our optimised conditions, 500 mg/L of purified SELP was obtained.
We have identified the optimal conditions for the shake flask production of a novel SELP, with the final production levels obtained being the highest reported to date. While this study is focused on SELPs, we believe that it could also be of general interest to any study in which the pET (ampicillin selective marker)-E. coli BL21(DE3) expression system is used. In particular, we show that induction time is critical in this system: in contrast to what is generally believed, optimal production is obtained by induction at the beginning of the stationary phase. Furthermore, we believe that we are at or near the maximum productivity for the system used, with rapid degradation of the selective agent by plasmid-encoded β-lactamase, plasmid instability on induction and high acetate production being the principal factors limiting further improvement.
Biopolymers; Silk-elastin like polymers; pET-E. coli BL21(DE3); Batch production
Information extraction is a complex task which is necessary to develop high-precision information retrieval tools. In this paper, we present the platform MeTAE (Medical Texts Annotation and Exploration). MeTAE allows (i) the extraction and annotation of medical entities and relationships from medical texts and (ii) the semantic exploration of the produced RDF annotations.
Our annotation approach relies on linguistic patterns and domain knowledge and consists of two steps: (i) recognition of medical entities and (ii) identification of the correct semantic relation between each pair of entities. The first step is achieved by an enhanced use of MetaMap, which improves the precision obtained by MetaMap by 19.59% in our evaluation. The second step relies on linguistic patterns built semi-automatically from a corpus selected according to semantic criteria. We evaluate our system's ability to identify medical entities of 16 types. We also evaluate the extraction of treatment relations between a treatment (e.g. medication) and a problem (e.g. disease): we obtain 75.72% precision and 60.46% recall.
According to our experiments, using an external sentence segmenter and noun phrase chunker may improve the precision of MetaMap-based medical entity recognition. Our pattern-based relation extraction method obtains good precision and recall with respect to related work. A more precise comparison with related approaches remains difficult, however, given the differences in corpora and in the exact nature of the extracted relations. The selection of MEDLINE articles through queries related to known drug-disease pairs enabled us to obtain a more focused corpus of relevant examples of treatment relations than a more general MEDLINE query.
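The pattern-based second step can be sketched with a single regular expression standing in for MeTAE's linguistic patterns; the pattern, its trigger phrases, and the example sentence are invented for illustration (entities are assumed to have been recognised already in step one).

```python
import re

# Hypothetical treatment-relation pattern: a treatment entity, a trigger
# phrase, then the problem entity.
PATTERN = re.compile(
    r"(?P<treatment>\w+) (?:is used to treat|treats|relieves) "
    r"(?P<problem>[\w ]+)")

def extract_treatment(sentence):
    """Return a (relation, treatment, problem) triple, or None."""
    m = PATTERN.search(sentence)
    if m:
        return ("treats", m.group("treatment"), m.group("problem"))
    return None

print(extract_treatment("Metformin is used to treat type 2 diabetes"))
# -> ('treats', 'Metformin', 'type 2 diabetes')
```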
Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge.
To address these challenges, we constructed new resources linking text with a model pathway: the GENIA pathway corpus with event annotation and the NF-kB pathway. Through their detailed analysis, we address the untapped resource of 'bio-inference' as well as the differences between text and pathway representation. Here, we present precise comparisons of their representations and the nine classes of 'bio-inference' schemes observed in the pathway corpus.
We believe that the creation of such rich resources and their detailed analysis is a significant first step toward accelerating research on the automatic construction of pathways from text.
The integration of the rapidly expanding corpus of information about the genome, transcriptome, and proteome, engendered by powerful technological advances, such as microarrays, and the availability of genomic sequence from multiple species, challenges the grasp and comprehension of the scientific community. Despite the existence of text-mining methods that identify biological relationships based on the textual co-occurrence of gene/protein terms or similarities in abstract texts, knowledge of the underlying molecular connections on a large scale, which is prerequisite to understanding novel biological processes, lags far behind the accumulation of data. While computationally efficient, the co-occurrence-based approaches fail to characterize biological interactions (e.g., inhibition or stimulation, directionality). Programs with natural language processing (NLP) capability have been created to address these limitations; however, they are in general not readily accessible to the public.
We present an NLP-based text-mining approach, Chilibot, which constructs content-rich relationship networks among biological concepts, genes, proteins, or drugs. Among its features, Chilibot can generate suggestions for new hypotheses. Lastly, we provide evidence that the connectivity of molecular networks extracted from the biological literature follows a power-law distribution, indicating scale-free topologies consistent with the results of previous experimental analyses.
Chilibot distills scientific relationships from knowledge available throughout a wide range of biological domains and presents these in a content-rich graphical format, thus integrating general biomedical knowledge with the specialized knowledge and interests of the user. Chilibot can be accessed free of charge to academic users.
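The scale-free claim above can be checked on any extracted network by plotting the degree distribution on log-log axes and estimating its slope. The sketch below illustrates this with a tiny hypothetical co-citation network (the node names and edges are invented for illustration and are not Chilibot output):

```python
import math
from collections import Counter, defaultdict

# Hypothetical co-occurrence edges between gene/protein terms
# (illustrative data only, not an actual Chilibot network).
edges = [
    ("TP53", "MDM2"), ("TP53", "CDKN1A"), ("TP53", "BAX"),
    ("TP53", "ATM"), ("MDM2", "CDKN1A"), ("BAX", "BCL2"),
    ("ATM", "CHEK2"), ("CHEK2", "TP53"), ("BCL2", "CASP3"),
]

# Build an undirected adjacency list and compute node degrees.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)
degrees = {node: len(nbrs) for node, nbrs in adjacency.items()}

# Degree distribution P(k): fraction of nodes having degree k.
degree_counts = Counter(degrees.values())
n = len(degrees)
distribution = {k: c / n for k, c in sorted(degree_counts.items())}

# For a scale-free network, log P(k) vs. log k is roughly linear with a
# negative slope; estimate the slope by least squares on log-log points.
xs = [math.log(k) for k in distribution]
ys = [math.log(p) for p in distribution.values()]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)

print(distribution)
print(f"log-log slope estimate: {slope:.2f}")
```

On a real literature-derived network, a markedly negative slope (typically between -2 and -3) over several orders of magnitude of k is the signature of a power-law degree distribution; this toy network is far too small for a meaningful fit and serves only to show the computation.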
Publication databases in biomedicine (e.g., PubMed, MEDLINE) are growing rapidly in size every year, as are public databases of experimental biological data and annotations derived from the data. Publications often contain evidence that confirms or disproves annotations, such as putative protein functions; however, it is increasingly difficult for biologists to identify and process published evidence due to the volume of papers and the lack of a systematic approach to associate published evidence with experimental data and annotations. Natural Language Processing (NLP) tools can help address the growing divide by providing automatic high-throughput detection of simple terms in publication text. However, NLP tools are not mature enough to identify complex terms, relationships, or events.
In this paper we present and extend BioDEAL, a community evidence annotation system that introduces a feedback loop into the database-publication cycle to allow scientists to connect data-driven biological concepts to publications.
BioDEAL may change the way biologists relate published evidence with experimental data. Instead of biologists or research groups searching and managing evidence independently, the community can collectively build and share this knowledge.
The ability to computationally extract mentions of neuroanatomical regions from the literature would assist linking to other entities within and outside of an article. Examples include extracting reports of connectivity or region-specific gene expression. To facilitate text mining of neuroscience literature we have created a corpus of manually annotated brain region mentions. The corpus contains 1,377 abstracts with 18,242 brain region annotations. Interannotator agreement was evaluated for a subset of the documents, and was 90.7% and 96.7% for strict and lenient matching, respectively. We observed a large vocabulary of over 6,000 unique brain region terms and 17,000 words. For automatic extraction of brain region mentions we evaluated simple dictionary methods and complex natural language processing techniques. The dictionary methods based on neuroanatomical lexicons recalled 36% of the mentions with 57% precision. The best performance was achieved using a conditional random field (CRF) with a rich feature set. Features were based on morphological, lexical, syntactic and contextual information. The CRF recalled 76% of mentions at 81% precision; when partial matches are counted, recall and precision increase to 86% and 92%, respectively. We suspect a large amount of error is due to coordinating conjunctions, previously unseen words and brain regions of less commonly studied organisms. We found context windows, lemmatization and abbreviation expansion to be the most informative techniques. The corpus is freely available at http://www.chibi.ubc.ca/WhiteText/.
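The dictionary baseline evaluated above can be sketched as a greedy longest-match lookup over tokenized text. The lexicon entries below are illustrative stand-ins for the much larger neuroanatomical lexicons the paper actually uses:

```python
import re

# A tiny illustrative lexicon of brain region terms (the paper's real
# dictionaries are derived from neuroanatomical lexicons and are far larger).
lexicon = {
    "hippocampus", "dentate gyrus", "prefrontal cortex",
    "amygdala", "substantia nigra",
}

def find_brain_regions(text):
    """Greedy longest-match dictionary lookup over a tokenized sentence."""
    tokens = re.findall(r"\w+", text.lower())
    max_len = max(len(entry.split()) for entry in lexicon)
    mentions, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, so "dentate gyrus"
        # wins over a hypothetical single-token entry "gyrus".
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in lexicon:
                mentions.append(phrase)
                i += span
                break
        else:
            i += 1
    return mentions

text = ("Lesions of the dentate gyrus impaired memory, while the "
        "prefrontal cortex and amygdala were spared.")
print(find_brain_regions(text))
# → ['dentate gyrus', 'prefrontal cortex', 'amygdala']
```

Such a lookup misses term variants absent from the lexicon, which is consistent with the low recall (36%) reported for the dictionary methods and motivates the feature-rich CRF approach.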
text mining; neuroanatomy; natural language processing; corpus; conditional random field
Biomedical corpora annotated with event-level information represent an important resource for domain-specific information extraction (IE) systems. However, bio-event annotation alone cannot cater for all the needs of biologists. Unlike work on relation and event extraction, most of which focusses on specific events and named entities, we aim to build a comprehensive resource, covering all statements of causal association present in discourse. Causality lies at the heart of biomedical knowledge, such as diagnosis, pathology or systems biology, and, thus, automatic causality recognition can greatly reduce the human workload by suggesting possible causal connections and aiding in the curation of pathway models. A biomedical text corpus annotated with such relations is, hence, crucial for developing and evaluating biomedical text mining.
We have defined an annotation scheme for enriching biomedical domain corpora with causality relations. This schema has subsequently been used to annotate 851 causal relations to form BioCause, a collection of 19 open-access full-text biomedical journal articles belonging to the subdomain of infectious diseases. These documents have been pre-annotated with named entity and event information in the context of previous shared tasks. We report an inter-annotator agreement rate of over 60% for triggers and of over 80% for arguments using an exact match constraint. These increase significantly using a relaxed match setting. Moreover, we analyse and describe the causality relations in BioCause from various points of view. This information can then be leveraged for the training of automatic causality detection systems.
Augmenting named entity and event annotations with information about causal discourse relations could benefit the development of more sophisticated IE systems. These will further influence the development of multiple tasks, such as enabling textual inference to detect entailments, discovering new facts and providing new hypotheses for experimental work.
The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature.
We present an approach to recognize named entities in full text. The approach collects high-frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is computationally efficient and robust to the noise commonly found in full-text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text.
This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.
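The first step of the approach above, collecting high-frequency terms within a single article as candidate entity names, can be sketched as follows. This is a simplification for illustration: the article text and stopword list are invented, and the published system classifies candidates with an SVM rather than a raw frequency threshold:

```python
import re
from collections import Counter

# Invented article snippet for illustration; the real system processes
# 80,528 full-text articles.
article = """
    p38gamma was phosphorylated after stress. Activation of p38gamma
    required MKK6, and MKK6 also activated p38gamma in vitro. The
    kinase p38gamma and the kinase MKK6 were both detected.
"""

# Minimal hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "and", "was", "were", "after", "also", "in",
             "both", "a", "an", "to"}

def candidate_names(text, min_count=2):
    """Collect high-frequency non-stopword tokens as candidate entity names.

    Tokens that recur within one article are good candidates for
    article-specific protein names missing from curated lexicons.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    counts = Counter(t for t in tokens if t.lower() not in STOPWORDS)
    return {t for t, c in counts.items() if c >= min_count}

print(candidate_names(article))
```

Note that a frequency threshold alone also surfaces recurring common words (here, "kinase"); separating genuine entity names from such noise is exactly the role the SVM classifier plays in the published method.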