We apply the event extraction system to a 1% sample of the 2009 distribution of the PubMed literature database. The full PubMed dataset contains 17.8 million citations, which we downsampled at random to create a dataset of 177 648 citations. To assure that results are representative of the full PubMed database, we performed no document selection or filtering. Consequently, 81 516 of the citations in the sample (46%) contain only a title but no abstract, and the earliest citation in the sample is for an article published in 1867 (PMID 17230723). While many of the citations in the sample are thus likely to have limited utility for biomolecular event extraction, their inclusion assures unbiased results and a fair test of the true generalization ability of the applied methods.
In total, the system extracted 168 949 events from 29 781 citations in the PubMed sample. The number of extracted events for the nine event types is presented in . The BANNER NER system marked 365 204 gene/protein entities in 54 051 citations in the sample, averaging two mentions per citation overall and almost seven per citation for those containing at least one tagged entity. By this estimate, less than a third of PubMed citations contain gene/protein mentions. When the extracted events are broken down by publication year, a strong trend toward an increasing number of citations containing gene/protein mentions and events is visible, with 25% of citations from the last 10 years containing at least one event (). Since the extracted events are of the protein/gene-specific BioNLP'09 Shared Task types, this trend can be seen to reflect the growing prominence of molecular biology. Based on the predicted named entities and events, the present release of PubMed can be estimated to contain more than 5 million citations with gene/protein mentions, totaling over 35 million mentions overall, and nearly 3 million citations containing events of the Shared Task types, totaling over 16 million such events overall. The number of citations with events increased consistently with the growth of PubMed () until the year 2000. After this, the total number of citations grows more rapidly, perhaps reflecting PubMed's expanded coverage of life science topics since then (Benton, 1999).
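These whole-PubMed figures follow from simple scaling: the sample is a 1% random draw (177 648 of 17.8 million citations), so each sample count is multiplied by roughly 100. A sketch of the arithmetic:

```python
# Scale counts from the 1% random sample to the full 2009 PubMed release.
SAMPLE_SIZE = 177_648
FULL_SIZE = 17_800_000
scale = FULL_SIZE / SAMPLE_SIZE  # ~100.2

estimates = {
    "citations with gene/protein mentions": 54_051 * scale,   # > 5 million
    "gene/protein mentions":                365_204 * scale,  # > 35 million
    "citations with events":                29_781 * scale,   # ~ 3 million
    "events of the Shared Task types":      168_949 * scale,  # > 16 million
}
```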
Frequency of the nine event types in the output of the system on the PubMed sample
Total number of citations and citations with tagged gene/protein mentions and events in the sample by year.
While not the primary result of this study, the extraction output can be used to support analysis of some large-scale trends in PubMed. As an example, shows the number of citations per year with mentions of insulin, immunoglobulin G (IgG) and tumor necrosis factor alpha (TNF-α), the three most common named entities identified, and their associated events. We note that citations for insulin show a long-term growing trend, perhaps reflecting the considerable resources directed toward diabetes research. The decreasing number of article abstracts mentioning IgG, despite its centrality in many experimental applications, might be seen to indicate its waning as a primary subject of research, considering average gene quotation frequencies over time (Hoffmann and Valencia, 2003). The number of citations mentioning tumor necrosis factor alpha has grown explosively since the protein was first cloned and named in 1984, showing continued and growing interest in this apoptosis-related cytokine, which is centrally associated with multiple pathways and implicated in cancer.
Number of citations with tagged mentions of insulin, IgG and TNF-alpha (normalized for capitalization and hyphenization), as well as extracted events of these proteins. The counts are cumulative for every five years to smooth the curves.
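The normalization and smoothing described in this caption can be sketched as follows (the counting inputs are illustrative, not the study's data):

```python
def normalize(name: str) -> str:
    """Collapse capitalization and hyphenation variants of an entity name."""
    return name.lower().replace("-", "")

# "TNF-Alpha" and "TNFalpha" map to the same key:
assert normalize("TNF-Alpha") == normalize("TNFalpha") == "tnfalpha"

def five_year_cumulative(counts: dict, years) -> dict:
    """Smooth a per-year series by summing over a trailing five-year window."""
    return {y: sum(counts.get(y - k, 0) for k in range(5)) for y in years}

counts = {2000: 3, 2001: 5, 2002: 4, 2003: 6, 2004: 2}
assert five_year_cumulative(counts, [2004])[2004] == 20  # 3+5+4+6+2
```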
The event extraction system (without the NER component) achieves an F-score of 52.86% (precision 58.13%, recall 48.46%) on the BioNLP'09 Shared Task test set. The Shared Task data, based on the GENIA corpus, is composed of PubMed citations relevant to biological reactions concerning transcription factors in human blood cells (Kim et al.). The GENIA corpus is focused on a particular subdomain and is thus not a representative sample of the entire PubMed, the focus of this study. Additional analysis is therefore necessary to evaluate the extraction results. A particular point of interest is the ability of the system to perform on input data that, compared with the GENIA corpus, has far fewer events per sentence and thus deviates from the distribution on which the system was originally trained.
Evaluating the recall of the system would require fully annotating a sufficiently large fraction of the PubMed sample for all named entities and events. Annotating sentences for positive and negative events is, however, a time- and labor-intensive process. For example, annotating the GENIA event corpus, consisting of 9 372 sentences, required 1.5 years with five part-time annotators and two coordinators (Kim et al.). Such an annotation effort is thus not practical for this study, particularly since relevant events are very rare in a random sample of PubMed not focused on any particular subdomain. In contrast, manually inspecting the system output for errors is a comparatively easy task and allows us to determine the precision of the system output. We examine a random sample of 100 predicted named entities and 100 predicted events and determine their correctness.
In the BioNLP'09 Shared Task data, events are annotated only between genes and proteins, excluding, for example, the multitude of signaling interactions between proteins and small inorganic molecules such as Ca2+. Since our aim in this work is to recover as many of the biomolecular interactions stated in the texts as possible, we extend the criteria for what is considered a named entity and an event. For named entities, we consider as positives cells, cellular components and molecules that take part in biochemical interactions. For events, we consider an event correct if all its named entity arguments are correct and the trigger word in the text is correctly detected.
In the manual evaluation of the 100 predicted named entities, we estimate the precision of the named entity detection step (the BANNER system) to be 87%. This compares well with BANNER performance on the GENETAG corpus, with a precision of 89% (for an F-score of 86%). The precision of the 100 predicted events was 64%, a figure close to the 58% (for an F-score of 53%) established on the BioNLP'09 Shared Task data. While recall cannot be directly measured, as discussed above, the event extraction system was trained on example-rich data that favors making positive predictions, so recall can be expected not to decrease substantially from the results established on subdomain corpora.
The results of this manual analysis indicate that the performance of the named entity and event detection components does generalize from the subdomain corpus data to a representative unbiased sample of the entire PubMed. It should, however, be noted that evaluating automatically generated predictions after the fact is more prone to a positive bias than annotation of plain text with no predictions.
3.2 Event network
One of the most promising applications for large-scale event mining is the generation of interaction networks. Unlike networks constructed from binary interactions, an event network defines the types and directions of the relationships, the polarity (positive or negative) of regulatory relationships and the mechanisms involved (e.g. phosphorylation). Sufficiently accurate event graphs can be used for inferring complex regulatory relationship networks and other biologically relevant tasks.
illustrates a sample network constructed from events extracted by our system around interleukin-4. To build the network, we merge the individual predicted events into a single graph, loosely following the approach employed by Saeys et al. We determine two protein mentions to refer to the same protein if their names match after lowercasing and removal of whitespace and hyphens. All mentions of the same protein are represented in the graph by a single node, and event argument edges connect the proteins through event trigger nodes.
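The merging rules above reduce graph construction to keying protein nodes on normalized names. A minimal sketch, with hypothetical event tuples standing in for the system's predictions:

```python
def norm(name: str) -> str:
    # Two mentions name the same protein if they match after lowercasing
    # and removing whitespace and hyphens.
    return "".join(name.lower().split()).replace("-", "")

# Hypothetical predicted events: (event type, trigger word, argument names).
events = [
    ("Positive_regulation", "induces", ["IL-4", "STAT6"]),
    ("Positive_regulation", "upregulates", ["il4", "STAT6"]),
]

protein_nodes = set()
edges = []  # (protein node, event trigger node)
for i, (etype, trigger, args) in enumerate(events):
    trigger_node = f"{etype}:{trigger}#{i}"   # one trigger node per event
    for arg in args:
        node = norm(arg)          # all mentions of a protein share one node
        protein_nodes.add(node)
        edges.append((node, trigger_node))

assert protein_nodes == {"il4", "stat6"}   # "IL-4" and "il4" merged
assert len(edges) == 4
```

Keeping one trigger node per event (rather than collapsing identical triggers) preserves the type, direction and polarity of each individual statement in the text.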
Fig. 4. Extracted event network around interleukin-4. This graph shows a subset of the predicted event network, including only named entities with at least 50 extracted instances. The round event nodes include (P)ositive regulation, (N)egative regulation and (R)egulation.
The entire graph extracted from the PubMed sample has one major connected component comprising 88 477 (38%) of the total of 232 760 nodes. The remaining nodes form a large number of considerably smaller connected components, the largest of which contains a mere 95 nodes.
3.3 Topic analysis
As a final analysis of the extraction results, we studied the topics of event-containing citations. Over 90% of records in PubMed are manually indexed with a number of descriptors chosen from Medical Subject Headings (MeSH), a hierarchical thesaurus in which the descriptors are arranged primarily in a general-to-specific hierarchy. The descriptors assigned to a PubMed citation record express the main topics discussed in the respective article and allow queries within specific subtopics, reducing the variance and sparsity of simple keyword search. In the following, we investigate the connection between MeSH descriptors and event types, establishing the topical areas in PubMed likely to contain citations relevant to event extraction.
We measured the degree of dependence between a MeSH descriptor d and an event type e using pointwise mutual information, MI(d, e) = log( P(d, e) / (P(d) P(e)) ). Here P(d, e) is defined as the fraction of citations indexed by the descriptor d and containing at least one event of type e, out of all citations that contain at least one event. Similarly, P(d) (respectively P(e)) is the fraction of citations indexed by the descriptor d (respectively containing an event of type e), out of all citations that contain at least one event. The measure is thus the ratio of the joint probability of d and e
to the probability of their co-occurrence by chance. To deal with sparsity problems, we first expanded, for each citation, the set of its original MeSH descriptors indexed in PubMed with all descriptors that are more general in the MeSH hierarchy. This allows us to find more general descriptors, rather than the specific ones indexed in PubMed.
For each of the nine event types, we built a list of the five most related descriptors, that is, those with the highest pointwise MI. To avoid unnecessarily specific descriptors, we only considered descriptors present in at least 10% of citations with an event of the given type, and discarded descriptors that were hyponyms of (more specific than) another descriptor already in the list. The resulting lists are given in . They illustrate that the descriptors obtained are indeed relevant to the respective event types, except for two clearly too general descriptors, Technology, Industry, and Agriculture and Information Services, which we discarded in all subsequent analyses. Apart from validating the IE system, albeit very indirectly, these MeSH descriptor lists can be used to focus PubMed searches on citations likely to contain the relevant event types.
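The scoring and support filtering can be sketched as follows, over (descriptor set, event-type set) pairs restricted to citations with at least one event; the data structures are assumptions, not the study's implementation:

```python
import math
from collections import Counter

def top_descriptors(citations, event_type, k=5, min_support=0.10):
    """Rank MeSH descriptors by pointwise mutual information with an event
    type. citations: (descriptor set, event-type set) pairs, restricted to
    citations containing at least one event of any type."""
    n = len(citations)
    n_e = sum(1 for _, events in citations if event_type in events)
    d_count = Counter(d for descs, _ in citations for d in descs)
    de_count = Counter(d for descs, events in citations
                       if event_type in events for d in descs)
    scored = []
    for d, n_de in de_count.items():
        if n_de < min_support * n_e:   # drop unnecessarily specific descriptors
            continue
        # MI(d, e) = log( P(d, e) / (P(d) P(e)) )
        pmi = math.log((n_de / n) / ((d_count[d] / n) * (n_e / n)))
        scored.append((pmi, d))
    return [d for _, d in sorted(scored, reverse=True)[:k]]
```

The published procedure additionally discards descriptors that are hyponyms of one already in the list; that step requires the MeSH hierarchy and is omitted here.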
Top related MeSH descriptors for the nine event types
Of the 177 648 citations in the PubMed sample, 66 227 (37.3%) are indexed by at least one of the descriptors in , or one of their hyponyms (we refer to these citations as MeSH-relevant). In contrast, only 12 405 (7.3%) of the 168 949 events identified by the system were extracted from the 62.7% of citations that are MeSH-irrelevant, demonstrating that these MeSH descriptors are indeed strong predictors of citations containing relevant events.
Intuitively, MeSH-irrelevant citations can also be expected to contain a higher proportion of false positive named entities and events. To verify this hypothesis, we measure the proportions of true positive (TP) named entities and events among MeSH-relevant citations and contrast them with the proportions in MeSH-irrelevant citations. The results of this analysis, performed on the same set of 100 random events and 100 random entities introduced in Section 3.1, are presented in . For both named entities and events, the proportion of TPs (precision) is notably lower in MeSH-irrelevant citations. For named entities, the difference is 20.6 percentage points (significant with P = 0.009, two-tailed Fisher's test), and for events the difference is a full 35.9 percentage points (significant with P = 0.028, two-tailed Fisher's test). These results suggest that MeSH descriptors may provide highly predictive features for the event extraction system and could, for instance, be used to generate likely negative examples for further retraining of the extraction system. The broad manual annotation provided by MeSH descriptors can thus enhance detailed automated event extraction.
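The significance figures come from two-tailed Fisher's exact tests on 2×2 tables of TP/FP counts in MeSH-relevant versus MeSH-irrelevant citations. A self-contained sketch of the test (the counts it would be applied to are in the table; none are reproduced here):

```python
from math import comb

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def hyper(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = hyper(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed.
    return sum(p for x in range(lo, hi + 1)
               if (p := hyper(x)) <= p_obs + 1e-12)

assert abs(fisher_exact_two_tailed(5, 5, 5, 5) - 1.0) < 1e-9  # balanced table
assert fisher_exact_two_tailed(10, 0, 0, 10) < 1e-3           # extreme table
```

This is the standard "sum of no-more-likely tables" definition of the two-sided test, matching the convention used by common statistics packages.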
Comparison of named entity and event detection precision between MeSH-relevant and MeSH-irrelevant citations
3.4 Computational requirements
The computational requirements of the system components, in terms of time and space, are detailed in .
Processing requirements for different components as measured for the sample and estimated for the whole PubMed
NER using BANNER is a comparatively lightweight processing step in the pipeline. Processing the entire dataset took less than 18 h on a desktop-level computer, averaging more than three citations per second. NER tagging could thus be run for the entire PubMed database in roughly 75 days on a single machine, or a matter of days on a modest cluster.
Full dependency parsing is the most resource-intensive step in the pipeline. In our setting, the Charniak–Johnson parser takes on average 0.81 s per sentence, with an additional 0.15 s for the SD scheme conversion. Parsing all 935 186 sentences in the PubMed sample would thus take 249 processor hours (a projected 2.84 processor years for the entire PubMed). However, only sentences with at least one recognized named entity, and thus a potential target for the IE system, need to be parsed, which considerably reduces the workload, to 199 941 sentences and 53 h of parsing time (a projected 222 days for the entire PubMed).
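The parsing-time projections follow directly from the per-sentence cost and the 1% sampling rate; a sketch of the arithmetic:

```python
PER_SENTENCE_S = 0.81 + 0.15     # parsing + SD conversion, per sentence
SAMPLE_FRACTION = 0.01           # the sample is ~1% of PubMed

def processor_hours(n_sentences: int) -> float:
    return n_sentences * PER_SENTENCE_S / 3600

hours_all = processor_hours(935_186)   # all sentences in the sample
hours_ner = processor_hours(199_941)   # only sentences with a named entity

assert round(hours_all) == 249         # processor hours for the full sample
assert round(hours_ner) == 53          # processor hours, NER-filtered
# Projected to the entire PubMed (roughly x100):
assert round(hours_all / SAMPLE_FRACTION / 24 / 365, 1) == 2.8  # processor years
assert round(hours_ner / SAMPLE_FRACTION / 24) == 222           # processor days
```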
We note that the parsing process is not a straightforward technical undertaking. The Charniak–Johnson parser takes about 10 s to load its model files; for this fixed cost not to accumulate, the parsing task must be divided into large batches rather than run one abstract, or even one sentence, at a time. On the other hand, 26 sentences in the PubMed sample caused the parser to run indefinitely without producing an analysis; these cases had to be detected so the parser could be terminated and restarted. Of the 199 915 sentences parsed by the Charniak–Johnson parser, a further 37 were not successfully processed by the SD conversion tools, leaving a final count of 199 878 successfully parsed sentences. It must be stressed that this represents a highly respectable 99.97% of sentences successfully parsed, demonstrating the reliability of the current state of the art in syntactic parsing.
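The two practical issues above, amortizing model-load time and escaping hung parses, suggest a batched driver with a per-batch timeout. A minimal sketch; the parser command line and file layout are hypothetical, not the actual Charniak–Johnson invocation:

```python
import subprocess

def parse_batches(batch_files, cmd=("parser", "-model", "biomodel"),
                  timeout_s=600):
    """Run the parser once per large batch so its ~10 s model-load cost is
    amortized; terminate and record batches that hang or fail."""
    failed = []
    for path in batch_files:
        try:
            subprocess.run([*cmd, path], check=True, timeout=timeout_s)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError,
                FileNotFoundError):
            failed.append(path)  # re-split and retry, e.g. sentence by sentence
    return failed
```

Failed batches can then be re-run sentence by sentence so that a single pathological sentence does not discard an entire batch.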
Finally, the event extraction step took 27 processor hours (a projected 114 processor days for the entire PubMed), averaging roughly 2 s per citation over the 54 051 citations with at least one detected named entity. The total processing time of the pipeline was 98 processor hours (a projected 411 processor days for the entire PubMed), or roughly 2 s per PubMed citation.