The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, it falls short in the context of scientific database curation. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we increased the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in AL.
active learning; text mining; neuroinformatics; biocuration; community-curated database; machine learning
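The AL loop described above, in which the classifier iteratively selects the documents that would be most informative to label, can be sketched as pool-based margin-based uncertainty sampling with a linear SVM. This is a generic illustration on synthetic feature vectors, not Virk's actual implementation; the seed-set size, query strategy, and features below are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic stand-in for document feature vectors (e.g. TF-IDF of abstracts).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

labeled = list(range(10))                 # small seed set of annotated documents
pool = [i for i in range(500) if i not in labeled]

for _ in range(5):                        # five AL iterations
    clf = SVC(kernel="linear", random_state=0)
    clf.fit(X[labeled], y[labeled])
    # Margin-based uncertainty sampling: query the pool document
    # closest to the SVM decision boundary.
    margin = np.abs(clf.decision_function(X[pool]))
    query = pool[int(np.argmin(margin))]
    labeled.append(query)                 # a curator would now label this document
    pool.remove(query)

print(len(labeled))  # 15 labeled documents after five queries
```

In a real curation setting, the `labeled.append(query)` step is where a human biocurator supplies the annotation, and the loop continues until classifier performance plateaus or the annotation budget is exhausted.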
Vaccines and drugs have contributed to dramatic improvements in public health worldwide. Over the last decade, there have been efforts to develop biomedical ontologies that represent various areas associated with vaccines and drugs. These ontologies, combined with existing health and clinical terminology systems (e.g., SNOMED, RxNorm, NDF-RT, MedDRA, VO, OAE, and AERO), could play significant roles in clinical and translational research. The first “Vaccine and Drug Ontology in the Study of Mechanism and Effect” workshop (VDOSME 2012) provided a platform for discussing problems and solutions in the development and application of biomedical ontologies in representing and analyzing vaccines/drugs, vaccine/drug administrations, vaccine/drug-induced immune responses (including positive host responses and adverse events), and similar topics. The workshop covered two main areas: (i) ontologies of vaccines, of drugs, and of studies thereof; and (ii) analysis of administration, mechanism and effect in terms of representations based on such ontologies. Six full-length papers included in this thematic issue focus on ontology representation and time analysis of vaccine/drug administration and host responses (including positive immune responses and adverse events), vaccine and drug adverse event text mining, and ontology-based Semantic Web applications. The workshop, together with the follow-up activities and personal meetings, provided a wonderful platform for researchers and scientists in the vaccine and drug communities to present research progress, share ideas, address questions, and promote collaborations for better representation and analysis of vaccine- and drug-related terminologies and clinical and research data.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
As biomedical technology becomes increasingly sophisticated, researchers can probe ever more subtle effects with the added requirement that the investigation of small effects often requires the acquisition of large amounts of data. In biomedicine, these data are often acquired at, and later shared between, multiple sites. There are both technological and sociological hurdles to be overcome for data to be passed between researchers and later made accessible to the larger scientific community. The goal of the Biomedical Informatics Research Network (BIRN) is to address the challenges inherent in biomedical data sharing.
Materials and methods
BIRN tools are grouped into ‘capabilities’ and are available in the areas of data management, data security, information integration, and knowledge engineering. BIRN has a user-driven focus and employs a layered architectural approach that promotes reuse of infrastructure. BIRN tools are designed to be modular and therefore can work with pre-existing tools. BIRN users can choose the capabilities most useful for their application, while not having to ensure that their project conforms to a monolithic architecture.
BIRN has implemented a new software-based data-sharing infrastructure that has been put to use in many different domains within biomedicine. BIRN is actively involved in outreach to the broader biomedical community to form working partnerships.
BIRN's mission is to provide capabilities and services related to data sharing to the biomedical research community. It does this by forming partnerships and solving specific, user-driven problems whose solutions are then available for use by other groups.
Genomics; statistical genetics; bioinformatics; complex traits; data; machine learning; data sharing; information integration; data mediation; data security; data management; knowledge engineering
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.
Our paper describes the construction and performance of an open-source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in stage 1. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.
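Stage (2)'s rule-based classification of text blocks into rhetorical categories might look roughly like the toy sketch below. The rules, block fields, and font-size threshold here are hypothetical stand-ins; LA-PDFText's actual rules also draw on richer layout features such as block position on the page.

```python
import re

# Hypothetical rules mapping text-block features to rhetorical categories.
RULES = [
    (lambda b: re.match(r"(?i)^abstract\b", b["text"]), "abstract"),
    (lambda b: re.match(r"(?i)^(materials and methods|methods)\b", b["text"]), "methods"),
    (lambda b: re.match(r"(?i)^(results|discussion)\b", b["text"]), "results-discussion"),
    (lambda b: b["font_size"] >= 14, "title"),
]

def classify_block(block):
    """Return the first matching rhetorical category, else 'body'."""
    for rule, category in RULES:
        if rule(block):
            return category
    return "body"

blocks = [
    {"text": "Abstract. We present...", "font_size": 10},
    {"text": "Materials and Methods", "font_size": 12},
    {"text": "A Large-Print Heading", "font_size": 16},
    {"text": "Ordinary paragraph text.", "font_size": 10},
]
print([classify_block(b) for b in blocks])
# → ['abstract', 'methods', 'title', 'body']
```

Stage (3) would then sort blocks of the same category by reading order and concatenate their text.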
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The system is available at http://code.google.com/p/lapdftext/.
Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.
We address the goal of curating observations from published experiments in a generalizable form; reasoning over these observations to generate interpretations and then querying this interpreted knowledge to supply the supporting evidence. We present web-application software as part of the 'BioScholar' project (R01-GM083871) that fully instantiates this process for a well-defined domain: using tract-tracing experiments to study the neural connectivity of the rat brain.
The main contribution of this work is to provide the first instantiation of a knowledge representation for experimental observations called 'Knowledge Engineering from Experimental Design' (KEfED) based on experimental variables and their interdependencies. The software has three parts: (a) the KEfED model editor - a design editor for creating KEfED models by drawing a flow diagram of an experimental protocol; (b) the KEfED data interface - a spreadsheet-like tool that permits users to enter experimental data pertaining to a specific model; (c) a 'neural connection matrix' interface that presents neural connectivity as a table of ordinal connection strengths representing the interpretations of tract-tracing data. This tool also allows the user to view experimental evidence pertaining to a specific connection. BioScholar is built in Flex 3.5. It uses Persevere (a noSQL database) as a flexible data store and PowerLoom® (a mature First Order Logic reasoning system) to execute queries using spatial reasoning over the BAMS neuroanatomical ontology.
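A minimal, hypothetical KEfED-style representation treats experimental variables as nodes in a dependency graph, so that any measurement can be traced back to the parameters it depends on. The class and variable names below are illustrative only and are not taken from the BioScholar code, which is written in Flex with PowerLoom reasoning.

```python
# A toy KEfED-style model: variables plus their interdependencies.
class KefedModel:
    def __init__(self):
        self.variables = {}    # name -> role ("parameter" | "measurement")
        self.depends_on = {}   # variable -> list of upstream variables

    def add_variable(self, name, role):
        self.variables[name] = role
        self.depends_on.setdefault(name, [])

    def add_dependency(self, downstream, upstream):
        self.depends_on[downstream].append(upstream)

    def context_of(self, measurement):
        """All variables a measurement transitively depends on."""
        seen, stack = set(), list(self.depends_on[measurement])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.depends_on.get(v, []))
        return seen

# Sketch of a tract-tracing experiment: labeling strength is measured
# in the context of the injection site and the tracer used.
m = KefedModel()
m.add_variable("injection-site", "parameter")
m.add_variable("tracer", "parameter")
m.add_variable("labeling-strength", "measurement")
m.add_dependency("labeling-strength", "injection-site")
m.add_dependency("labeling-strength", "tracer")
print(sorted(m.context_of("labeling-strength")))  # → ['injection-site', 'tracer']
```

This dependency structure is what lets a spreadsheet-like data interface know which parameter columns must accompany each measurement, as in the KEfED data interface described above.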
We first introduce the KEfED approach as a general approach and describe its possible role as a way of introducing structured reasoning into models of argumentation within new models of scientific publication. We then describe the design and implementation of our example application: the BioScholar software. This is presented as a possible biocuration interface and supplementary reasoning toolkit for a larger, more specialized bioinformatics system: the Brain Architecture Management System (BAMS).
This paper describes software for neuroanatomical knowledge synthesis based on neural connectivity data. This software supports a mature methodology developed since the early 1990s. Over this time, the Swanson laboratory at USC has generated an account of the neural connectivity of the sub-structures of the hypothalamus, amygdala, septum, hippocampus, and bed nucleus of the stria terminalis. This is based on neuroanatomical data maps drawn into a standard brain atlas by experts. In earlier work, we presented an application for visualizing and comparing anatomical macro connections using the Swanson third edition atlas as a framework for accurate registration. Here we describe major improvements to the NeuARt application based on the incorporation of a knowledge representation of experimental design. We also present improvements in the interface and features of the data mapping components within a unified web-application. As a step toward developing an accurate sub-regional account of neural connectivity, we provide navigational access between the data maps and a semantic representation of area-to-area connections that they support. We do so based on an approach called “Knowledge Engineering from Experimental Design” (KEfED) that is based on experimental variables. We have extended the underlying KEfED representation of tract-tracing experiments by incorporating the definition of a neuroanatomical data map as a measurement variable in the study design. This paper describes the software design of a web-application that allows anatomical data sets to be described within a standard experimental context and thus indexed by non-spatial experimental design features.
knowledge engineering; neural connectivity; tract-tracing; neuroanatomical mapping
The L-shaped anterior zone of the lateral hypothalamic area’s subfornical region (LHAsfa) is delineated by a pontine nucleus incertus input. Functional evidence suggests the subfornical region and nucleus incertus modulate foraging and defensive behaviors, although subfornical region connections are poorly understood. A high-resolution Phaseolus vulgaris-leucoagglutinin (PHAL) structural analysis is presented here of the LHAsfa neuron population’s overall axonal projection pattern. The strongest LHAsfa targets are in the interbrain and cerebral hemisphere. The former include inputs to the anterior hypothalamic nucleus, dorsomedial part of the ventromedial nucleus, and ventral region of the dorsal premammillary nucleus (defensive behavior control system components), and to the lateral habenula and dorsal region of the dorsal premammillary nucleus (foraging behavior control system components). The latter include massive inputs to the lateral and medial septal nuclei (septo-hippocampal system components), and inputs to the bed nuclei of the stria terminalis posterior division related to the defensive behavior system, the intercalated amygdalar nucleus (projecting to the central amygdalar nucleus), and the posterior part of the basomedial amygdalar nucleus. LHAsfa vertical and horizontal limb basic projection patterns are similar, although each preferentially innervates certain terminal fields. Lateral hypothalamic area regions immediately medial, lateral, and caudal to the LHAsfa each generate quite distinct projection patterns. Combined with previous evidence that major sources of LHAsfa neural inputs include the parabrachial nucleus (nociceptive information), defensive and foraging behavior system components, and the septo-hippocampal system, the present results suggest that the LHAsfa helps match adaptive behavioral responses (either defensive or foraging) to current internal motivational status and external environmental conditions.
amygdala; behavioral activation; defensive behavior; hypothalamus; lateral habenula; motivation; nucleus incertus
Annual meeting abstracts published by scientific societies often contain rich arrays of information that can be computationally mined and distilled to elucidate the state and dynamics of the subject field. We extracted and processed abstract data from the Society for Neuroscience (SFN) annual meeting abstracts during the period 2001–2006 in order to gain an objective view of contemporary neuroscience. An important first step in the process was the application of data cleaning and disambiguation methods to construct a unified database, since the data were too noisy to be of full utility in the raw form initially available. Using natural language processing, text mining, and other data analysis techniques, we then examined the demographics and structure of the scientific collaboration network, the dynamics of the field over time, major research trends, and the structure of the sources of research funding. Some interesting findings include a high geographical concentration of neuroscience research in the northeastern United States, a surprisingly large transient population (66% of the authors appear in only one out of the six studied years), the central role played by the study of neurodegenerative disorders in the neuroscience community, and an apparent growth of behavioral/systems neuroscience with a corresponding shrinkage of cellular/molecular neuroscience over the six-year period. The results from this work will prove useful for scientists, policy makers, and funding agencies seeking to gain a complete and unbiased picture of the community structure and body of knowledge encapsulated by a specific scientific domain.
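The collaboration-network construction described above can be illustrated over toy author lists: each pair of co-authors on an abstract contributes an edge, and an author's degree counts distinct collaborators. The data and counting scheme below are illustrative assumptions, not the study's actual pipeline (which first required author-name disambiguation).

```python
from collections import Counter
from itertools import combinations

# Toy author lists standing in for parsed SFN abstract bylines.
abstracts = [
    ["Smith J", "Lee K", "Garcia M"],
    ["Smith J", "Lee K"],
    ["Garcia M", "Okafor C"],
]

# Co-authorship edge weights: each pair on an abstract adds one co-authorship.
edges = Counter()
for authors in abstracts:
    for a, b in combinations(sorted(authors), 2):
        edges[(a, b)] += 1

# Degree: number of distinct collaborators per author.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(edges[("Lee K", "Smith J")])  # → 2 (co-authored two abstracts)
print(degree["Garcia M"])           # → 3 (three distinct collaborators)
```

On the real data, graph statistics over this structure (component sizes, degree distributions, year-over-year author overlap) support findings such as the 66% transient-author population.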
Anatomical studies of neural circuitry describing the basic wiring diagram of the brain produce intrinsically spatial, highly complex data of great value to the neuroscience community. Published neuroanatomical atlases provide a spatial framework for these studies. We have built an informatics framework based on these atlases for the representation of neuroanatomical knowledge. This framework not only captures current methods of anatomical data acquisition and analysis, it allows these studies to be collated, compared and synthesized within a single system.
We have developed an atlas-viewing application ('NeuARt II') in the Java language with unique functional properties. These include the ability to use copyrighted atlases as templates within which users may view, save and retrieve data-maps and annotate them with volumetric delineations. NeuARt II also permits users to view multiple levels on multiple atlases at once. Each data-map in this system is simply a stack of vector images with one image per atlas level, so any set of accurate drawings made onto a supported atlas (in vector graphics format) could be uploaded into NeuARt II. Presently the database is populated with a corpus of high-quality neuroanatomical data from the laboratory of Dr Larry Swanson (consisting of 64 highly detailed maps of PHAL tract-tracing experiments, made up of 1039 separate drawings that were published in 27 primary research publications over 17 years). Herein we take selective examples from these data to demonstrate the features of NeuARt II. Our informatics tool permits users to browse, query and compare these maps. The NeuARt II tool operates within a bioinformatics knowledge management platform (called 'NeuroScholar') either as a standalone or a plug-in application.
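The notion of a data-map as a stack of vector images, one per atlas level, can be sketched as a simple data structure. The class names and fields below are hypothetical illustrations, not NeuARt II's actual (Java) schema.

```python
from dataclasses import dataclass, field

@dataclass
class Drawing:
    atlas_level: int
    paths: list              # vector paths delineating labeled regions

@dataclass
class DataMap:
    experiment_id: str
    drawings: dict = field(default_factory=dict)   # atlas level -> Drawing

    def add(self, drawing):
        self.drawings[drawing.atlas_level] = drawing

    def levels(self):
        """Atlas levels covered by this data-map, in order."""
        return sorted(self.drawings)

# A data-map for one (hypothetical) PHAL tract-tracing experiment,
# with one vector drawing per atlas level where label was observed.
dm = DataMap("PHAL-case-01")
dm.add(Drawing(atlas_level=23, paths=["M 10 10 L 40 40"]))
dm.add(Drawing(atlas_level=25, paths=["M 5 5 L 20 30"]))
print(dm.levels())  # → [23, 25]
```

Keeping each level as an independent vector layer is what allows the viewer to overlay several data-maps on the same atlas level, or the same data-map across multiple atlases.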
Anatomical localization is fundamental to neuroscientific work and atlases provide an easily-understood framework that is widely used by neuroanatomists and non-neuroanatomists alike. NeuARt II, the neuroinformatics tool presented here, provides an accurate and powerful way of representing neuroanatomical data in the context of commonly-used brain atlases for visualization, comparison and analysis. Furthermore, it provides a framework that supports the delivery and manipulation of mapped data either as a standalone system or as a component in a larger knowledge management system.
Knowledge bases that summarize the published literature provide useful online references for specific areas of systems-level biology that are not otherwise supported by large-scale databases. In the field of neuroanatomy, groups of small focused teams have constructed medium size knowledge bases to summarize the literature describing tract-tracing experiments in several species. Despite years of collation and curation, these databases only provide partial coverage of the available published literature. Given that the scientists reading these papers must all generate the interpretations that would normally be entered into such a system, we attempt here to provide general-purpose annotation tools to make it easy for members of the community to contribute to the task of data collation.
In this paper, we describe an open-source, freely available knowledge management system called 'NeuroScholar' that allows straightforward structured markup of PDF files according to a well-designed schema to capture the essential details of this class of experiment. Although the example worked through in this paper is quite specific to neuroanatomical connectivity, the design is freely extensible and could conceivably be used to construct local knowledge bases for other experiment types. Knowledge representations of the experiment are also directly linked to the contributing textual fragments from the original research article. Through the use of this system, not only could members of the community contribute to the collation task, but input data could also be gathered for automated approaches that permit knowledge acquisition through the use of Natural Language Processing (NLP).
We present a functional, working tool to permit users to populate knowledge bases for neuroanatomical connectivity data from the literature through the use of structured questionnaires. This system is open-source, fully functional and available for download from .