The leading causes of constrictive pericarditis have changed over time, leading to a commensurate change in the indications for and complexity of surgical pericardiectomy. We evaluated our single-center experience to define the etiologies, risk factors, and outcomes of pericardiectomy in a modern cohort.
We retrospectively reviewed our institutional database for all patients who underwent total or partial pericardiectomy. Demographic, co-morbid, operative, and outcome data were evaluated. Survival was assessed by the Kaplan-Meier method. Multivariable Cox proportional hazards regression models examined risk factors for mortality.
From 1995–2010, 98 adults underwent pericardiectomy for constrictive disease. The most common etiologies were idiopathic (n=44), postoperative (n=30), and post-radiation (n=17). Total pericardiectomy was performed in 94 cases, most commonly through a sternotomy (n=93). Thirty-three cases were redo sternotomies, 34 underwent a concomitant procedure, and 34 required cardiopulmonary bypass. Overall in-hospital, 1-year, 5-year, and 10-year survival rates were 92.9%, 82.5%, 64.3%, and 49.2%, respectively. Survival differed sharply by etiology, with idiopathic, postoperative, and post-radiation 5-year survivals of 79.8%, 55.9%, and 11.0%, respectively (p<0.001). On multivariable analysis, only the need for cardiopulmonary bypass (HR: 21.2, p=0.02) was predictive of 30-day mortality, while post-radiation etiology (HR: 3.19, p=0.02) and hypoalbuminemia (HR: 0.57, p=0.03) were associated with increased 10-year mortality.
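The Kaplan-Meier survival estimates reported above come from the standard product-limit method, which can be sketched in a few lines. The implementation and data below are illustrative only (a minimal pure-Python version, not the study's actual analysis):

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time for each patient
    events -- 1 if death observed, 0 if censored at that time
    Returns a list of (time, S(t)) pairs at each observed death time.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        n = at_risk  # number at risk just before time t
        # Group all patients tied at time t
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            at_risk -= 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n  # multiply in the conditional survival
            curve.append((t, surv))
    return curve

# Toy example: deaths at t=1 and t=2, one patient censored at t=3
curve = kaplan_meier([1, 2, 3], [1, 1, 0])
```

Censored patients leave the risk set without contributing a death term, which is what distinguishes this estimator from a simple proportion surviving.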
Although survival varies significantly by etiology, pericardiectomy continues to be a safe operation for constrictive pericarditis. Post-radiation pericarditis and hypoalbuminemia are significant risk factors for decreased long-term survival.
We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.
Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.
The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
The development of specific biomarkers to aid in diagnosis and prognosis of neuronal injury is of paramount importance in cardiac surgery. Alpha II-spectrin is a structural protein abundant in neurons of the central nervous system and cleaved into signature fragments by proteases involved in necrotic and apoptotic cell death. We measured cerebrospinal fluid (CSF) alpha II-spectrin breakdown products (αII-SBDPs) in a canine model of hypothermic circulatory arrest (HCA) and cardiopulmonary bypass (CPB).
Canine subjects were exposed to either 1 hour of HCA (n=8; mean lowest tympanic temperature 18.0 ± 1.2 °C) or standard CPB (n=7). CSF samples were collected before treatment and at 8 and 24 hours after treatment. Using polyacrylamide gel electrophoresis and immunoblotting, SBDPs were isolated and compared between groups by computer-assisted densitometric scanning. Necrotic versus apoptotic cell death was indexed by measuring calpain- and caspase-3-cleaved αII-SBDPs (SBDP 145+150 and SBDP 120, respectively).
Animals undergoing HCA demonstrated mild patterns of histological cellular injury and clinically detectable neurologic dysfunction. Calpain-produced αII-SBDPs (150 kDa + 145 kDa bands, necrosis) were significantly increased 8 hours after HCA (p=0.02) compared with pre-HCA levels and remained elevated at 24 hours after HCA. In contrast, caspase-3-produced αII-SBDPs (120 kDa band, apoptosis) were not significantly increased. Animals receiving CPB did not demonstrate clinical or histological evidence of injury, with no increases in necrotic or apoptotic cellular markers.
We report the use of αII-SBDPs as markers of neurological injury following cardiac surgery. Our analysis demonstrates that calpain- and caspase-produced αII-SBDPs may be important and novel markers of neurologic injury following HCA.
Brain Injury; cardiac surgery; neuroprotection; hypothermic circulatory arrest; biomarkers
Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
The wide variety of morphological variants of domain-specific technical terms complicates natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in recent biomedical research.
In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through incorporation of several published lexical resources. It retrieves lemmas from a word lexicon and applies a set of rules that transform a word into a lemma when the word is not found in the lexicon. An innovative aspect of the BioLemmatizer is its hierarchical strategy for searching the lexicon, which enables discovery of the correct lemma even if the input part-of-speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. The contribution of the BioLemmatizer to accuracy improvement of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system.
The BioLemmatizer outperforms eight existing lemmatizers in our comparison. The BioLemmatizer is released as open-source software and can be downloaded from http://biolemmatizer.sourceforge.net.
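The hierarchical lookup strategy described above (exact word-plus-POS match first, then word-only fallback, then suffix rules) can be sketched as follows. The lexicon entries and suffix rules here are toy examples for illustration, not BioLemmatizer's actual resources:

```python
# Toy lexicon keyed on (word, POS); entries are illustrative only.
LEXICON = {
    ("analyses", "NNS"): "analysis",
    ("analyses", "VBZ"): "analyse",
    ("mice", "NNS"): "mouse",
}
# Simplified suffix-stripping rules, tried in order.
SUFFIX_RULES = [("ies", "y"), ("es", "e"), ("s", "")]

def lemmatize(word, pos):
    w = word.lower()
    # 1. Exact (word, POS) lookup.
    if (w, pos) in LEXICON:
        return LEXICON[(w, pos)]
    # 2. Fall back to a word-only search, so an inaccurate POS tag
    #    can still recover a known lemma (the hierarchical step).
    for (lex_word, _), lemma in LEXICON.items():
        if lex_word == w:
            return lemma
    # 3. Rule-based suffix stripping for out-of-lexicon words.
    for suffix, repl in SUFFIX_RULES:
        if w.endswith(suffix):
            return w[: len(w) - len(suffix)] + repl
    return w
```

For example, `lemmatize("mice", "JJ")` still returns "mouse" despite the wrong POS tag, because the word-only fallback fires before the suffix rules.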
Ubiquitin carboxyl-terminal esterase-L1 (UCHL1) is a protein highly selectively expressed in neurons and has been linked to neurodegenerative disease in humans. We hypothesized that UCHL1 would be an effective serum biomarker for brain injury, as tested in canine models of hypothermic circulatory arrest (HCA) and cardiopulmonary bypass (CPB).
Canines were exposed to CPB (n=14), 1 hour (h) of HCA (n=11), or 2 h of HCA (2h-HCA; n=20). Cerebrospinal fluid (CSF) and serum were collected at baseline, 8h, and 24h post-treatment. UCHL1 levels were measured using a sandwich enzyme-linked immunosorbent assay (ELISA). Neurological function and histopathology were scored at 24h, and UCHL1 immunoreactivity was examined at 8h.
Baseline UCHL1 protein levels in CSF and serum were similar for all groups. In serum, UCHL1 levels were elevated at 8h post-treatment for 2h-HCA subjects compared to baseline values (p<0.01), and also compared to CPB canines at 8h (p<0.01). A serum UCHL1 level above 3.9 ng/mg total protein at 8h had the best discriminatory power for predicting functional disability. In CSF, UCHL1 was elevated in all groups at 8h post-treatment compared to baseline (p<0.01). However, UCHL1 levels in CSF remained elevated at 24h only in 2h-HCA subjects (p<0.01). Functional and histopathology scores were closely correlated (Pearson's coefficient: 0.66; p<0.01), and were significantly worse in 2h-HCA animals.
This is the first report associating elevated serum UCHL1 with brain injury. The novel neuronal biomarker UCHL1 is increased in serum 8h after severe neurological insult in 2h-HCA animals compared with CPB animals. These results support the potential for use in cardiac surgery patients, and form the basis for clinical correlation in humans.
Animal Model; Cardiopulmonary bypass (CPB); Biomarker; Hypothermia/circulatory arrest; Neurology/Neurologic injury
The impact of Society of Thoracic Surgeons (STS) predicted mortality risk score on resource utilization after aortic valve replacement (AVR) has not been previously studied.
We hypothesize that increasing STS risk scores in patients having AVR are associated with greater hospital charges.
Design, Setting, and Patients
Clinical and financial data for patients undergoing AVR at a tertiary care, university hospital over a ten-year period (1/2000–12/2009) were retrospectively reviewed. The current STS formula (v2.61) for in-hospital mortality was used for all patients. After stratification into risk quartiles (Q), index admission hospital charges were compared across risk strata with Rank-Sum tests. Linear regression and Spearman’s coefficient assessed correlation and goodness of fit. Multivariable analysis assessed relative contributions of individual variables on overall charges.
Main Outcome Measures
Inflation-adjusted index hospitalization total charges
A total of 553 patients underwent AVR during the study period. Average predicted mortality was 2.9% (±3.4%) and actual mortality was 3.4%. Median charges were greater in the upper quartile of AVR patients (Q1–3: $39,949 [IQR $32,708–$51,323] vs. Q4: $62,301 [IQR $45,952–$97,103], p<0.01). On univariate linear regression, there was a positive correlation between STS risk score and log-transformed charges (coefficient: 0.06, 95% CI 0.05–0.07, p<0.01); Spearman's correlation coefficient was 0.51. This positive correlation persisted in risk-adjusted multivariable linear regression. Each 1% increase in STS risk score was associated with an additional $3,000 in hospital charges.
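A coefficient of 0.06 on log-transformed charges corresponds to roughly a 6% multiplicative increase per risk-score point, which at the median charges above is on the order of $3,000. The sketch below fits a univariate log-linear model to hypothetical data (not the study's dataset) to show how such a coefficient is obtained and read:

```python
import math

def ols(x, y):
    """Least-squares slope and intercept for a univariate fit."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical (risk score %, hospital charge $) pairs for illustration.
scores = [1.0, 2.0, 3.0, 4.0]
charges = [40000, 44000, 49000, 53000]
slope, intercept = ols(scores, [math.log(c) for c in charges])
# exp(slope) - 1 is the fractional change in charges per
# 1-point increase in risk score on the log-linear model.
pct_per_point = math.exp(slope) - 1
```

The key interpretive step is exponentiating the coefficient: on the log scale, additive changes in the predictor become multiplicative changes in the outcome.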
This study showed increasing STS risk score predicts greater charges after AVR. As competing therapies such as percutaneous valve replacement emerge to treat high risk patients, these results serve as a benchmark to compare resource utilization.
Aortic Valve Replacement; Resource Utilization; Society of Thoracic Surgeons Risk Score
The ideal solution for recovery of donor lungs remains unknown. Low potassium dextran (LPD) solution is most common, but University of Wisconsin (UW) solution is also used. The United Network for Organ Sharing (UNOS) database allows assessment of preservation solutions in a large cohort of lung transplant (LTx) patients.
We retrospectively reviewed the UNOS dataset for adult primary LTx patients (2005–2008) whose donor lungs were recovered with UW or LPD solution. Patients were stratified by UW vs. LPD, and secondarily grouped by quartiles of the lung allocation score (LAS) to examine high-risk recipients. Kaplan-Meier (KM) estimates of short-term mortality (30 days, 90 days, 1 year) and of rejection in the first year were examined for intervals with adequate follow-up. Cox proportional hazards regression using 11 variables examined all-cause 1-year mortality.
Of 4455 patients, 4161 (93.4%) received LPD-flushed lungs and 294 (6.6%) received UW-flushed lungs. During the study, 1105 (24.8%) patients died. There was no mortality difference by flush solution when all patients were examined together. However, patients in the upper two LAS quartiles (Q3: 37.8–45.4; Q4: >45.4) who received LPD lungs had greater 1-year survival (81.5% vs. 73.5%, p=0.02). On multivariable analysis, flush with UW solution was associated with an increased risk of 1-year mortality compared with LPD (hazard ratio 1.77 [1.21–2.58], p=0.003). Preservation solution did not affect rejection rates in the year after LTx. KM modeling demonstrated the impact of flush solution on survival (p=0.02).
This study is the largest modern cohort to evaluate the effect of donor lung flush solutions on survival in adult LTx. UW solution increases the risk of 1 yr mortality in high risk LTx recipients.
Lung Transplantation; UNOS; organ preservation
The United States lung allocation score (LAS) allows rapid organ allocation to higher-acuity patients. Although waitlist time and waitlist mortality have improved, the costs of lung transplantation (LTx) in these higher-acuity patients are largely unknown. We hypothesize that LTx in high-LAS recipients is associated with increased charges and resource utilization.
Methods and Materials
Clinical and financial data for LTx patients at our institution in the post-LAS era (5/2005–2009) were reviewed with follow-up through 12/2009. Patients were stratified by LAS quartiles (Q). Total hospital charges for the index admission and for all admissions within 1 year of LTx were compared between Q4 and Q1–3 using Rank-Sum and Kruskal-Wallis tests, as charge data were not normally distributed.
A total of 84 LTxs were performed during the study period; 63 (75%) patients survived 1 year and 10 (11.9%) died during the index admission. Median LAS was 37.5 (interquartile range [IQR] 34.3–44.8). LAS quartiles were Q1: 30.1–34.3 (n=21); Q2: 34.4–37.5 (n=21); Q3: 37.6–44.8 (n=21); Q4: 44.9–94.3 (n=21). Median charges for the index admission were $276,668 (IQR $191,301–$300,156) in Q4 vs. $153,995 (IQR $129,796–$176,849) in Q1–3 (p<0.001). Index admission median length of stay was greater in Q4 (35 days, IQR 23–46) than in Q1–3 (15 days, IQR 11–22; p=0.003). One-year charges were $292,247 (IQR $229,192–$421,597) in Q4 vs. $188,342 (IQR $153,455–$252,045) in Q1–3 (p=0.002). Index admission and 1-year charges in Q4 were higher than in each of the other quartiles examined individually.
This is the first study to show increased charges in high LAS patients. Charges for the index admission and hospital care in the year post-LTx were higher in the highest LAS quartile compared to patients in the lowest 75% of LAS.
Lung Transplantation; Resource Utilization
Prolonged hypothermic circulatory arrest results in neuronal cell death and neurologic injury. We have previously shown that hypothermic circulatory arrest causes both neuronal apoptosis and necrosis in a canine model. Inhibition of neuronal nitric oxide synthase reduced neuronal apoptosis, while glutamate receptor antagonism reduced necrosis in our model. This study was undertaken to determine whether glutamate receptor antagonism reduces nitric oxide formation and neuronal apoptosis after hypothermic circulatory arrest.
Sixteen hound dogs underwent 2 hours of circulatory arrest at 18°C and were sacrificed after 8 hours. Group 1 (n=8) was treated with MK-801 (0.75 mg/kg IV before arrest, followed by a 75 μg/kg/hr infusion). Group 2 dogs (n=8) received vehicle only. Intracerebral levels of excitatory amino acids and of citrulline, produced in equimolar amounts with nitric oxide, were measured. Apoptosis, identified by H&E staining and confirmed by electron microscopy, was blindly scored from 0 (normal) to 100 (severe injury), while nick-end labeling demonstrated DNA fragmentation.
Group 1 and 2 dogs had similar intracerebral levels of glutamate. However, MK-801 significantly reduced intracerebral glycine and citrulline levels as compared to HCA controls. MK-801 significantly inhibited apoptosis (7.92 ± 7.85 vs. 62.08 ± 6.28, Group 1 vs. 2, p<0.001).
Our results showed that glutamate receptor antagonism significantly reduced nitric oxide formation and neuronal apoptosis. We provide evidence that glutamate excitotoxicity mediates neuronal apoptosis in addition to necrosis after hypothermic circulatory arrest. Clinical glutamate receptor antagonists may have therapeutic benefit in ameliorating both types of neurologic injury after hypothermic circulatory arrest.
Animal Model; Apoptosis; Brain; Hypothermic Circulatory Arrest; Nitric Oxide
Background and Purpose
Impaired cardiac function can adversely affect the brain via decreased perfusion. The purpose of this study was to determine if cardiac ejection fraction (EF) is associated with cognitive performance, and whether this is modified by low blood pressure.
Neuropsychological testing evaluating multiple cognitive domains, measurement of mean arterial pressure (MAP), and measurement of EF were performed in 234 individuals with coronary artery disease. The association between level of EF and performance within each cognitive domain was explored, as was the interaction between low MAP and EF.
Adjusted global cognitive performance, as well as performance in visuoconstruction and motor speed, was directly and significantly associated with cardiac EF. This relationship was not entirely linear, with a steeper association between EF and cognition at lower levels of EF than at higher levels. Patients with both low EF and low MAP at the time of testing had worse cognitive performance than those with either condition alone, particularly for the global and motor speed cognitive scores.
Low EF may be associated with worse cognitive performance, particularly among individuals with low MAP and for cognitive domains typically associated with vascular cognitive impairment. Further care should be paid to hypotension in the setting of heart failure, as this may exacerbate cerebral hypoperfusion.
Heart failure; Cognition; Blood pressure; Brain ischemia
Little is known about the molecular mechanisms of neurologic complications after hypothermic circulatory arrest (HCA) with cardiopulmonary bypass (CPB). Canine genome sequencing allows profiling of genomic changes after HCA and CPB alone. We hypothesize that gene regulation will increase with increased severity of injury.
Dogs underwent 2-hour HCA at 18°C (n = 10), 1-hour HCA (n = 8), or 2-hour CPB at 32°C alone (n = 8). In each group, half were sacrificed at 8 hours and half at 24 hours after treatment. After neurologic scoring, brains were harvested for genomic analysis. Hippocampal RNA isolates were analyzed using canine oligonucleotide expression arrays containing 42,028 probes.
Consistent with prior work, dogs that underwent 2-hour HCA experienced severe neurologic injury. One hour of HCA caused intermediate clinical damage. Cardiopulmonary bypass alone yielded normal clinical scores. Cardiopulmonary bypass, 1-hour HCA, and 2-hour HCA groups historically demonstrated increasing degrees of histopathologic damage (previously published). Exploratory analysis revealed differences in significantly regulated genes (false discovery rate < 10%, absolute fold change ≥ 1.2), with increases in differential gene expression with injury severity. At 8 hours and 24 hours after insult, 2-hour HCA dogs had 502 and 1,057 genes regulated, respectively; 1-hour HCA dogs had 179 and 56 genes regulated; and CPB alone dogs had 5 and 0 genes regulated.
Our genomic profile of canine brains after HCA and CPB revealed that 1-hour and 2-hour HCA induced markedly increased gene regulation, in contrast to the minimal effect of CPB alone. This adds to the body of neurologic literature supporting the safety of CPB alone and its minimal effect on a normal brain, while illuminating the genomic consequences of both interventions.
We introduce a system developed for the BioCreative II.5 community evaluation of information extraction of proteins and protein interactions. The paper focuses primarily on the gene normalization task of recognizing protein mentions in text and mapping them to the appropriate database identifiers based on contextual clues. We outline a "fuzzy" dictionary lookup approach to protein mention detection that matches regularized text to similarly regularized dictionary entries. We describe several different strategies for gene normalization that focus on species or organism mentions in the text, both globally throughout the document and locally in the immediate vicinity of a protein mention. We also present the results of experimentation with a series of system variations that explore the effectiveness of the various normalization strategies, as well as the role of external knowledge sources. While our system was neither the best nor the worst performing system in the evaluation, the gene normalization strategies show promise, and the system affords the opportunity to explore some of the variables affecting performance on the BCII.5 tasks.
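The "fuzzy" dictionary lookup idea (matching regularized surface text to regularized dictionary entries) can be sketched as below. The regularization function and the dictionary entries are illustrative assumptions, not the system's actual resources:

```python
import re

def regularize(s):
    # Collapse case, punctuation, and whitespace so surface variants
    # of the same name ("TNF-alpha", "tnf alpha") map to one key.
    return re.sub(r"[\W_]+", "", s).lower()

# Toy dictionary mapping regularized names to database identifiers;
# entries are illustrative, not from the actual system's lexicon.
DICTIONARY = {
    regularize("TNF-alpha"): "UniProt:P01375",
    regularize("p53"): "UniProt:P04637",
}

def lookup(mention):
    """Return a database identifier for a mention, or None."""
    return DICTIONARY.get(regularize(mention))
```

Because both sides of the match pass through the same regularization, the lookup stays an O(1) dictionary access while tolerating hyphenation, spacing, and case variants.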
biomedical natural language processing; information extraction; gene normalization; text mining
Previous uncontrolled studies have suggested that there is late cognitive decline after coronary artery bypass grafting that may be attributable to use of the cardiopulmonary bypass pump.
In this prospective, nonrandomized, longitudinal study, we compared cognitive outcomes after on-pump coronary artery bypass surgery (n=152) with three groups: off-pump bypass surgery (n=75), nonsurgical cardiac comparison subjects (n=99), and heart-healthy comparison (HHC) subjects (n=69). The primary outcome measure was change from baseline to 72 months in the following cognitive domains: verbal memory, visual memory, visuoconstruction, language, motor speed, psychomotor speed, attention, executive function, and a composite global score.
There were no consistent differences in 72-month cognitive outcomes among the 3 groups with coronary artery disease (CAD). The CAD groups had lower baseline performance and a greater degree of decline compared to HHC subjects. The degree of change was small, with no group declining by more than 0.5 SD. None of the groups were substantially worse at 72 months compared to baseline.
Compared to subjects with no vascular disease risk factors, the CAD patients had lower baseline cognitive performance and greater degrees of decline over 72 months, suggesting that in these patients, vascular disease may have an impact on cognitive performance. We found no significant differences in the long-term cognitive outcomes among patients with various CAD therapies, indicating that management strategy for CAD is not an important determinant of long-term cognitive outcomes.
Neurocognitive deficits; Outcomes; CABG; vascular disease
Summary: The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator—an ontology-based annotation service—to make it available as a component in UIMA workflows.
Availability: This wrapper is freely available on the web at http://bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.
Self-reported cognitive and memory complaints following coronary artery bypass graft surgery (CABG) are common. Several studies have attempted to quantify the incidence of such complaints and to examine the relationship between subjective and objective cognitive functioning, but the etiology and longitudinal course of these self-reports remain unclear.
Measures of subjective memory complaints were compared in two groups (220 CABG patients and 92 nonsurgical cardiac comparison subjects) at 3 months and at 1, 3, and 6 years. At 6 years, additional measures were used to quantify memory self-assessment. The frequency of subjective complaints at each time point was determined, and associations with objective cognitive performance as well as depression were examined.
At early (3-month and/or 1-year) follow-up, subjective memory complaints were reported more often by the CABG than the nonsurgical group (45.5% vs. 17.0%, p<0.0001). By 6 years, the frequency of complaints was similar (52%) in both groups. Subjective memory ratings were significantly correlated with performance on several memory tests at 6 years. This relationship was not confounded by depression.
Subjective memory complaints are more frequent early in follow-up in patients undergoing CABG than in controls, but by 6 years they are similar. The increase in subjective complaints over time may be related to progression of underlying cerebrovascular disease. Unlike previous studies, we found that subjective memory assessments were correlated with objective performance on several memory tests. Although subjective memory complaints are more common in patients with depression, they cannot be explained by depression alone.
CABG; neurocognitive deficits; outcomes; brain
Summary: Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many well-known tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources.
The profusion of high-throughput instruments and the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data. Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data. An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.
Recent technology has made it possible to do experiments that show hundreds or even thousands of genes that play a role in a disease or other biological phenomena. Interpreting these experimental results in the light of everything that has ever been published about any of those genes is often overwhelming, and the failure to take advantage of all prior knowledge may impede biomedical research. The computer program described in this paper “reads” the biomedical literature and molecular biology databases, “reasons” about what all that information means to this experiment, and “reports” on its findings in a way that makes digesting all of this information far more efficient than ever before possible. Analysis of a large, complex dataset with this tool led rapidly to the creation of a novel hypothesis about the role of several genes in the development of the tongue, which was then confirmed experimentally.
Reliable information extraction applications have been a long-sought goal of the biomedical text mining community, a goal that if reached would provide valuable tools to benchside biologists in their increasingly difficult task of assimilating the knowledge contained in the biomedical literature. We present an integrated approach to concept recognition in biomedical text. Concept recognition provides key information that has been largely missing from previous biomedical information extraction efforts, namely direct links to well-defined knowledge resources that explicitly anchor each concept's semantics. The BioCreative II tasks discussed in this special issue have provided a unique opportunity to demonstrate the effectiveness of concept recognition in the field of biomedical language processing.
Through the modular construction of a protein interaction relation extraction system, we present several use cases of concept recognition in biomedical text, and relate these use cases to potential uses by the benchside biologist.
Current information extraction technologies are approaching performance standards at which concept recognition can begin to deliver high-quality data to the benchside biologist. Our system is available as part of the BioCreative Meta-Server project and on the internet.
We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; ). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations.
Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task, participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of methods were used, and the results varied, with the highest achieved F1 score being 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F1 score of 0.9066 is feasible, and furthermore that this best combined result makes use of even the lowest-scoring submissions.
Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline.
MutationFinder, along with a high-quality gold-standard data set and a scoring script for mutation extraction systems, has been made publicly available. Implementations, source code, and unit tests are available in Python, Perl, and Java. MutationFinder can be used as a stand-alone script or imported by other applications.
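As a rough illustration of rule-based point-mutation extraction in the spirit of MutationFinder (the patterns and normalization below are a simplified sketch, not MutationFinder's actual rules), a pair of regular expressions can capture wNm-style mentions such as "A123T" (one-letter amino acid codes) and "Ala123Thr" (three-letter codes):

```python
import re

# Three-letter amino acid codes, used both for matching and normalization.
AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
ONE_LETTER = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")
THREE_LETTER = re.compile(rf"\b({AA3})(\d+)({AA3})\b", re.IGNORECASE)

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def extract_mutations(text):
    """Return normalized (wild-type, position, mutant) tuples found in text."""
    mentions = []
    for wt, pos, mut in ONE_LETTER.findall(text):
        mentions.append((wt, int(pos), mut))
    for wt, pos, mut in THREE_LETTER.findall(text):
        mentions.append((THREE_TO_ONE[wt.upper()], int(pos), THREE_TO_ONE[mut.upper()]))
    return mentions

print(extract_mutations("The Ala42Gly substitution, like E6V, destabilizes the protein."))
# → [('E', 6, 'V'), ('A', 42, 'G')]
```

A pattern this naive over-matches strings that merely look like mutations (e.g., some gene symbols); the precision reported in the abstract depends on additional rules beyond this sketch.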
Knowledge base construction has been an area of intense activity and great importance in the growth of computational biology. However, there is little history of work on evaluating knowledge bases, either with respect to their contents or with respect to the processes by which they are constructed. This article proposes applying a metric from software engineering known as the found/fixed graph to the problem of evaluating both the processes by which genomic knowledge bases are built and the completeness of their contents.
Well-understood patterns of change in the found/fixed graph are found to occur in two large publicly available knowledge bases. These patterns suggest that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms, and that at their current rate of production, they will never be sufficient for completing the annotation of all currently available proteomes.
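The found/fixed idea can be sketched as follows (all times and counts below are invented for illustration): plot the cumulative number of annotation tasks "found" (e.g., new proteins needing curation) against the cumulative number "fixed" (annotations completed) at each sampled time, and watch whether the two curves converge or the backlog widens:

```python
from bisect import bisect_right

def found_fixed_curve(found_times, fixed_times, sample_times):
    """Return (time, cumulative_found, cumulative_fixed) triples."""
    found_times, fixed_times = sorted(found_times), sorted(fixed_times)
    return [(t, bisect_right(found_times, t), bisect_right(fixed_times, t))
            for t in sample_times]

found = [1, 1, 2, 3, 3, 4, 5, 5]  # year each annotation task appeared
fixed = [2, 3, 5]                 # year each task was completed
for t, n_found, n_fixed in found_fixed_curve(found, fixed, range(1, 6)):
    print(t, n_found, n_fixed, "backlog:", n_found - n_fixed)
```

In this toy series the backlog grows from 2 to 5, the widening-gap pattern that, in the abstract's terms, suggests curation will not complete at the current rate of production.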
Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity.
We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly, and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes, which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional approach of asserting co-occurrence whenever at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating functions with greater specificity is harder, but co-occurrence data still proves beneficial.
Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.
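One way to picture threshold-based co-occurrence (a simplified stand-in, not the paper's actual reliability metric; the protein names are invented) is to score each protein pair by its co-mention rate and keep only pairs above a cutoff, rather than asserting an edge from a single shared abstract:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(abstract_mentions, threshold=0.10):
    """abstract_mentions: list of sets of protein IDs mentioned per abstract.
    Returns {frozenset({a, b}): reliability} for pairs above the threshold."""
    mention_counts = Counter()
    pair_counts = Counter()
    for proteins in abstract_mentions:
        mention_counts.update(proteins)
        pair_counts.update(frozenset(p) for p in combinations(sorted(proteins), 2))
    edges = {}
    for pair, both in pair_counts.items():
        a, b = tuple(pair)
        # Reliability: co-mentions relative to the more frequently mentioned
        # protein, so prolific proteins need repeated co-mentions to link.
        reliability = both / max(mention_counts[a], mention_counts[b])
        if reliability >= threshold:
            edges[pair] = reliability
    return edges

abstracts = [{"YFG1", "YFG2"}, {"YFG1", "YFG2"}, {"YFG1"}, {"YFG2", "YFG3"}]
print(cooccurrence_edges(abstracts, threshold=0.5))
# keeps only the YFG1–YFG2 edge (reliability 2/3); YFG2–YFG3 falls below the cutoff
```

The contrast the abstract draws is visible here: with `threshold=0` every co-mentioned pair becomes an edge, reproducing the traditional single-abstract criterion and its false positives.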
We used exact term matching, stemming, and inclusion of synonyms, implemented via the Lucene information retrieval library, to discover relationships between the Gene Ontology and three other OBO ontologies: ChEBI, Cell Type, and BRENDA Tissue. Proposed relationships were evaluated by domain experts. We discovered 91,385 relationships between the ontologies. The matching methods varied widely in correctness. Based on these results, we recommend careful evaluation of all matching strategies before use, including exact string matching. The full set of relationships is available at compbio.uchsc.edu/dependencies.
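A toy sketch of the three matching strategies (illustrative only; the study used Lucene, and the terms, naive plural-stripping "stemmer", and synonym table here are invented):

```python
def crude_stem(word):
    # Naive plural stripping, standing in for a real stemmer such as Porter's.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def match_terms(terms_a, terms_b, synonyms=None):
    """Return (term_a, term_b, strategy) triples for proposed relationships."""
    synonyms = synonyms or {}
    matches = []
    for a in terms_a:
        # Exact matching is run over the term plus any listed synonyms.
        variants = {a.lower()} | {s.lower() for s in synonyms.get(a, [])}
        stems = {" ".join(crude_stem(w) for w in v.split()) for v in variants}
        for b in terms_b:
            b_stem = " ".join(crude_stem(w) for w in b.lower().split())
            if b.lower() in variants:
                matches.append((a, b, "exact/synonym"))
            elif b_stem in stems:
                matches.append((a, b, "stem"))
    return matches

go_terms = ["erythrocyte differentiation", "lymphocytes"]
cell_terms = ["erythrocyte differentiation", "lymphocyte"]
print(match_terms(go_terms, cell_terms))
# → [('erythrocyte differentiation', 'erythrocyte differentiation', 'exact/synonym'),
#    ('lymphocytes', 'lymphocyte', 'stem')]
```

The abstract's caution applies even to this simplest strategy: an exact string match between two ontologies does not guarantee the terms denote the same concept, which is why every proposed relationship was sent to domain experts.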