|Home | About | Journals | Submit | Contact Us | Français|
Pattern-based approaches to relation extraction have shown very good results in many areas of biomedical text mining. However, defining the right set of patterns is difficult; approaches are either manual, incurring high cost, or automatic, often resulting in large sets of noisy patterns.
We propose several techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different extraction tasks, as defined in the recent BioNLP 2009 shared task. We focus on simple methods that only take into account the complexity of the pattern and the complexity of the texts the patterns are applied to. We show that our techniques, despite their simplicity, yield large improvements in all tasks we analyzed. For instance, they raise the F-score for the task of extraction gene expression events from 24.8% to 51.9%.
Already very simple filtering techniques may improve the F-score of an information extraction method based on automatically generated patterns significantly. Furthermore, the application of such methods yields a considerable speed-up, as fewer matches need to be analysed. Due to their simplicity, the proposed filtering techniques also should be applicable to other methods using linguistic patterns for information extraction.
As the amount of biomedical knowledge recorded in texts is steadily increasing, it is more and more important to develop efficient techniques to automatically discover new, hidden, or unsuspected knowledge in large text collections . Many text mining (TM) methods have been developed in order to achieve this goal . TM aims to discover and extract knowledge from unstructured text, making use of natural language processing (NLP) tools, such as tokenizers, morphological analyzers, part-of-speech-taggers (POS-taggers), and syntactic or deep parsers to process texts . TM in the biomedical field mostly focuses on two tasks: Named Entity Recognition (NER), i.e., the task of recognizing various types of biological entities (genes and proteins, chemicals, diseases etc.), and Relation Extraction (RE), i.e., the task of identifying relations among those entities.
The classical baseline approach to relation extraction is to only use information about the co-occurrence of entities in the given scope (sentence or abstract; [4,5]). Using co-occurrence is popular, because it is easy to implement and allows for efficient processing of huge amounts of texts. However, co-occurrence-based approaches fail to characterize the biological meaning of an interaction (e.g., inhibition or stimulation, directionality) and, more importantly, also generate many false positives because they cannot distinguish positive from negative pairs. The second important class of approaches is based on machine learning. Here, a statistical model is learnt from a set of positive and negative examples and then applied to unseen texts. In general, machine learning-based methods to RE perform very well for any task where sufficient and high-quality training data is available [6,7]. However, their major drawback is this need for training data, which, in the Life Sciences, is only available for the most common tasks and very costly to produce. Furthermore, machine learning-based methods are prone to overfit to the training corpus, which makes results of evaluations very hard to extrapolate to other texts .
A third class of relation extraction methods is based on pattern matching. Pattern-based approaches build on annotated text fragments (the patterns), where words/phrases are labelled with linguistic information, e.g. POS-tag, word lemma, or syntactic information. Those patterns are matched against linguistically annotated text to detect relationships. Patterns can either be learned from examples [9,10] or can be defined manually [11-13]. Pattern-based systems have shown good results in many areas of biomedical text mining, such as extraction of PPI [10,13] or the reconstruction of pathways .
Systems based on manually-defined patterns typically use relatively few patterns, leading to high precision but low recall [15,16]. In contrast, systems that learn patterns automatically often produce thousands of patterns and exhibit a better recall, but often at the cost of a decrease in precision. To circumvent this penalty, one may try to improve a given set of patterns using some optimization criteria. As an example, SPIES  proposed a pattern set optimization method based on the minimum description length (MDL) principle from information theory . Starting from the set of all patterns retrieved from the training set, iteratively two patterns are selected for merging such that (1) a merging criterion is maximized; (2) the merged pattern does not violate some task-specific heuristic constraints. This procedure has increased the F-score for PPI extraction by 6.7%, but the MDL-based algorithm is quite complex and rather cumbersome to adapt to a new problem.
In this paper, we propose much simpler ways of improving pattern-based relation extraction. We focus on simple filtering techniques that take into account the complexity of the pattern and the complexity of the texts the patterns are applied to. Our ideas are based on the observations that (1) very oftenmost of the patterns in an automatically learned pattern set are used rarely, that (2) complex patterns tend to produce more false matches, and that (3) complex or ambiguous phrases in the text tend to produce higher error rate.
We evaluate our ideas using the corpus of the BioNLP'09 Shared Task (hereafter BioNLP task)  and the pattern-based text mining system Ali Baba . While earlier shared tasks (such as BioCreative II, ) have dealt with general PPI, the BioNLP task focused on a set of more specific molecular events, such as gene expression, protein catabolism, or positive regulation and negative regulation. We hypothesized that a well-performing relationship extraction system like Ali Baba would perform well also on these tasks, because (a) all targeted events encompass the relationship between at least two constituents (either entities or trigger words), and (b) 95% of all annotated events are fully contained within a single sentence .
However, its initial performance was rather poor. However, by using simple text and pattern filters, we were able to improve F-scores considerably. For instance, the unfiltered pattern set led to an F-score of only 24.8% on gene expression events, but the score rose to 51.9% after filtering. Although we evaluated our ideas only using Ali Baba, we want to stress that our filters are defined in such a way that they also should be applicable to other tasks and to other pattern-based extraction methods.
Ali Baba is a tool which automatically parses PubMed http://www.ncbi.nlm.nih.gov/pubmed/ abstracts matching a given user query. The set of matching abstracts is further processed using a dictionary-based NER (for genes, diseases, cell lines, species, drugs, tissues etc.) and a pattern-based algorithm for the detection of protein-protein interactions (see below). Dictionaries are compiled from several biological databases (UniProt http://www.uniprot.org/, NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/, MesH http://www.ncbi.nlm.nih.gov/mesh/, KEGG http://www.genome.jp/kegg/, Drug-Bank http://www.drugbank.ca/ etc.). For matching, all entity names are converted into regular expressions that also match in the presence of slight linguistic variations (plural etc.). The resulting network of entities connected via relationships is visualized as a graph, thus presenting a quick overview over the most important information contained in the abstracts .
Ali Baba uses a pattern-based approach to extract relationships among entities. To this end, it learns typical linguistic representations of relationships and encodes them in language patterns. Patterns are task-specific, i.e., they depend on the type of relationship that should be extracted, and learning is performed in three steps. Input to the learner is a set of positive examples, i.e., a set of tuples where each tuple consists of entities that are in the sought-for relationship. For instance, for detecting PPI, the input must be a list of interacting proteins.
In the first step, Ali Baba collects a list of sentences that contain both elements of any of the input tuples plus a trigger word taken from a list of words specific to the type of relationship being searched. This step may be performed, for instance, by searching a text database such as PubMed. In the second step, all those sentences are linguistically annotated by adding word stems and part-of-speech tags. Next, the annotated sentences are reduced to their 'core', i.e., the minimal token span containing all elements of the tuple, the trigger word, and a neighbourhood of configurable size. All matched entities are replaced with a placeholder. From those core phrases, Ali Baba generates a set of patterns, each of which consists of three layers: the core phrase itself, a stemmed version of the phrase, and its sequence of POS tags (see Table Table11 for an example). In the third step, the system clusters all initial patterns based on a similarity measure that builds upon a pair-wise multi-level alignment of core patterns (see below). For each cluster, a multiple sentence alignment is computed and condensed into a profile per cluster, capturing the relative frequencies of tokens, stems and POS tags for each position of the pattern. This last step is optional; when initial patterns are directly used for matching, we usually observe very good precision but low recall. After clustering, precision usually drops, but recall increases considerably as the merged patterns are more general than the initial ones. Details on the entire process may be found in .
Ali Baba matches patterns against sentences using (as for the clustering) multi-layer sentence alignment, a procedure conceptually similar to sequence alignment, but performed on tokens, stems, and POS tags, not on single characters. The matching allows for differences between the pattern and a sentence in the same manner as sequence alignment introduces insertions, substitutions, deletions, and mismatches between DNA or protein sequences. Furthermore, the decision whether or not a term of a pattern matches a term of a phrase is not binary, but the specific term pair is assigned a score by looking up a substitution matrix . For each layer there is a dedicated substitution matrix and the overall score for each pair of terms is a linear combination of all scores for the different layers. Increasing or lowering the threshold for this score tunes the method towards higher precision or higher recall, respectively.
To focus efforts on the general problem of event extraction, we decided to use the corpus provided by the international BioNLP'09 Shared Task http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/. This task concerns the detection of complex events in biomedical text, such as gene expression, transcription, or protein catabolism. It consisted of several subtasks. The objective of the Task 1 was to identify events with their types, trigger expression in the text, and primary theme and cause arguments. The competition provided crisp task definitions, data sets for training, development, and testing, an evaluation script for the training and the development sets, and an online tool for evaluation on the test set.
In all data sets, the proteins that took part in the events were annotated in the text; i.e., NER was not part of the task. Note that the performance of event recognition would certainly drop significantly when NER were part of the evaluation procedure (see Kabiljo et al.  for a quantification of this effect on PPI extraction). For the training and development sets, the participants were also provided with the gold standard for events and their respective theme(s).
The data sets were prepared based on the GENIA event corpus . The data for training and development sets were derived from the publicly available part of GENIA, while the test data were taken from an unpublished part. Table Table22 shows statistics of the different data sets.
We applied Ali Baba's algorithm for pattern generation and matching to four event types of the BioNLP task (gene expression, phosphorylation, catabolism, transcription). We will explain the process using as example the gene expression task, which also has the highest number of examples in the corpus (1738 samples in training, 356 in development, and 722 in test). Application to the other event types followed the same strategy. Recall that proteins are already annotated in the BioNLP'09 corpus and NER thus was not part of the evaluation. If not indicated otherwise, we report on event level approximate span matching results (see the description in ) in terms of precision, recall and F-score.
Ali Baba's pattern learning method was applied with slight modifications. We decided to use only the information provided by the BioNLP organizers. The main reason for this was the expected high level of noise when we would have searched PubMed for positive examples, which, for the gene expression event type, would consist only of a gene name and a trigger word (in contrast to tasks such as PPI extraction, where positive examples would be identified by two protein names in addition to trigger). Therefore, we first collected all sentences that describe a gene expression event from the 800 training documents. Linguistic annotation and reduction to core phrases were performed as described above. We omitted the third step in Ali Baba's pattern learning (clustering of patterns) because our aim here was to study a different way of improving extraction performance, i.e., by filtering patterns and not by merging them. Furthermore, pattern clustering is highly specific to Ali Baba; including this step would make our results less applicable to other pattern-based relation-ship extraction method.
Result from applying Ali Baba in this way on development and test corpus are shown in Table Table33 (first line). The results show that there is much room for improvement; especially precision values are very low.
Based on the baseline results, we performed a detailed error analysis. This analysis revealed that the bulk of false positives stem from a surprisingly small number of error types. Furthermore, these types of error mostly can be removed by filtering certain patterns or by disregarding certain sentences. In the following sections, we present several such filters and their effect on the performance in detail. We also report on some alternative directions we followed but which did not improve performance significantly. Finally, we analyze the overall performance when several filters are combined.
We consider improvements based on two main strategies. The first is a trigger-word-based pattern filter. Trigger words are specific tokens in a sentence which indicate the occurrence of an event. Inspecting the trigger words for gene expression in the training data reveals that many textual expressions are triggers in some case but not in others (see Table Table4).4). For instance, only 44% of the occurrences of the token 'expresses' are trigger for a gene expression event, while the remaining 56% are not triggers for any annotated event. Another example is that one-third of the occurrences of the token 'overexpression' appear as gene expression event trigger, while the other two-thirds are triggers of other types of events. The second strategy to improve the quality of the pattern set is filtering patterns based on some of their properties. Filtering is based on the idea that not all patterns are equally good. We performed a separate evaluation of every individual pattern which allowed us to identify those patterns that decrease performance, and analyzed those patterns to find commonalities among them.
We first ran an experiment to test the applicability of the general algorithm behind Ali Baba on the BioNLP task data. As stated in Methods, pattern matching in our system generates a score for each pattern against each sentence. A threshold on this score determines which matches are considered as positives. Additional file 1 depicts the effect of changing this threshold. As expected, raising the threshold leads to higher precision at lower recall, while lowering it has the opposite effect. In our experiments, we decided to set the threshold such that the F-score was optimal on the development set for the baseline implementation.
We use the hit rate (HR) of a trigger word for its assessment. For a given trigger word, the HR is the number of occurrences of the word in a sentence with the respective event type divided by the total number of occurrences in the gold standard annotations (see Table Table4).4). We investigated all trigger words from the training and development sets and found that triggers with low HR degrade precision and were also found to also produce more event type mismatches (e.g. positive regulation instead of gene expression) than other patterns. We therefore sorted all the 133 trigger words according to their HR in a descending order and cumulatively evaluated the impact of using only the subsets of 10 to 100 event triggers having the best HRs. Results are shown in Figure Figure11.
Based on these results, we fixed a set of 50 event trigger words that yielded the best F-score on the development corpus. Note that all triggers words in this set are single-word triggers and either verbs, nouns, or adjectives. The impact of this simple filter is already quite large (see Table Table5).5). On the development data, precision increased by 60% (from 24.7% to 39.7%) compared to the baseline, while recall dropped only minimally (from 49.2% to 48.0%). F-score grew from 32.9% to 43.5%.
Additional file 2 depicts the length distribution of all patterns extracted from the training data. The length of most patterns is between 2 and 19 tokens, with most being between 4 and 8 tokens long. Inherently, Ali Baba's pattern matching algorithm favours longer patterns since those aggregate a score over more token matches. Consequently, even quite dissimilar phrases may still produce overly positive scores, if some tokens with large weights coincide. This is an intrinsic property of the alignment algorithm that is not easily circum-vented by normalization by pattern length. However, we hypothesized that simply using only shorter patterns could already reduce the error rate.
We evaluated this hypothesis via two experiments: First, we partitioned all patterns into subsets of the same length and evaluated every subset separately on the development corpus. We found that patterns consisting of four or five tokens yielded the best F-score, both 37.6, which is 14% better than the baseline, while long patterns perform considerably worse (see Additional file 3 and Additional file 4). As expected, the precision of each subset was boosted remarkably, while the recall of each evaluation run was much lower than when using all patterns. To prevent this decrease of recall, we then created subsets of patterns according to their maximum number of tokens. The evaluation results (see Figure Figure2)2) show that the set of patterns with the maximum length of four (141 patterns) yields the best F-score.
The impact of the pattern length filter on the performance is shown in Table Table5.5. The filtering yields a large improvement in precision (more than doubling) together with a modest decrease in recall (by 12%). The F-score increased from 32.9 to 49.0 when compared to the baseline.
The previous experiment evaluated classes of patterns based on their length. We then studied the individual performance of each of the 590 initial patterns. As usual, patterns were extracted from the training data while evaluation was performed on the development data to minimize overfitting. Because recall of an individual pattern is naturally low, we concentrated on their precision. To this end, we ran Ali Baba's matching phase once for every pattern of the entire pattern set and recorded the precision of each pattern.
Figure Figure33 shows that the differences are tremendous, covering the entire range from 100% to ~8% precision. Additionally, we found almost 400 patterns that did not match any sentence in the development corpus. Ignoring those patterns may lower the recall on a larger corpus but would increase the precision and also speeds up the processing time. We then used the same method as in the pattern length filter to evaluate subsets of patterns defined on their individual precision. That is, we reran the experiment on subsets of 10 to 200 best patterns (see Figure Figure44).
The best F-score was achieved using the 50 best patterns. Table Table55 compares the performance before and after filtering. The use of the pattern performance filter doubles the precision while it decreases the recall by a mere 2%, when compared to the baseline. In addition, the filter drastically decreases the runtime, from 43 minutes to 3.5 minutes, as fewer patterns need to be matched against sentences.
We also experimented with another filter assuming that sentences that contain many proteins produce more errors, because it should be more difficult for the matching algorithm to find the right assignment from trigger words to entities. To test this hypothesis, we sorted all sentences from development corpus according to the number of proteins they contained, and partitioned the corpus into 8 smaller corpora, where each corpus contained all sentences with exactly k proteins (k = 1,...,8). We evaluated each partition separately. Contrary to our expectation, the performance on the different partitions did not differ significantly (data are not shown).
After inspecting the performance of each filter separately, we investigated whether their combination could further improve the performance. Table Table55 summarizes our results for using combined and individual filters on the development data. The best performance was achieved by the combination of the pattern performance filter and the trigger word filter. These two filters together raised precision from 24.7% to 77.4%, i.e., the precision has tripled. At the same time, recall decreased only slightly, from 49.2% to 46.3%. The F-score almost doubled, from 32.9% to 58.0%.
For the evaluation on the test data set we used the union of training and development data as training corpus. Measuring the performance on a data set different from the training data is essential to prevent overfitting, because training and evaluating a statistical model on the same (or overlapping) data may lead to the overestimation of the method's performance . Note that the ground truth of event annotations on the test data is not available, but one can measure the performance of a method using the submission system of the BioNLP task. This opportunity is limited for one submission per user per day to prevent the server overload and fine-grained optimization against the test set. Since our pattern filtering methods presented so far require a development set against which the patterns can be optimized - which is not possible on the test set - we proceeded as follows to select a set of good patterns for the test data.
We created several train-development splits from the combined training corpus of 950 abstracts, optimized the pattern sets of each split using our pattern filters and then combined the optimized pattern sets. For the train-development splits, we randomly re-sampled the combined training corpus five times. In each split, we randomly selected 800 abstracts for training and used the remaining 150 abstracts for development. For each training set, we individually computed the optimal number of patterns with respect to F-score measured on the corresponding development set (as described in the Pattern Performance Filter Section). Then, we applied the pattern performance filter and the pattern length filter. In the first block of Table Table3,3, we report on the performance of the five optimized pattern sets measured on the corresponding development set and on the test set.
All the five optimized pattern sets consist of 40 to 60 patterns and have an individual F-score between 55.2 and 62.6 (mean 58.46, standard deviation 2.48). Their performance on the test data lies between 47.3 and 51.4 (mean 49.18, standard deviation 1.63), and is thus approximately 10% lower on average than the performance on the development data. The difference can mostly be attributed to a drop in recall, showing that the different corpora have different characteristics; it comes as no surprise, as patterns, when learned and tuned on such small corpora as available for the BioNLP task, inherently overfit. What is more interesting is that the precision remains fairly stable, which is very encouraging as it shows that well-performing patterns perform well independent of the corpus from which they are obtained.
Additionally, we combined all distinct patterns from all five experiments into a set of 104 unique patterns. We sorted the patterns according to their precision achieved on the corresponding development split. Then, we took the best 60, 70, 80, 90, and the full set of 104 patterns and evaluated each set on the test corpus. Results are also shown in Table Table33.
Again, one can see that recall of patterns is rather corpus-specific and that using only the best 60 patterns from the different experiments results in a very low recall on the test data. This implies that in each split, a large fraction of the best-performing patterns (in terms of precision) is highly corpus-specific and produce much rarer matches on the test set. This effect disappears gradually when more patterns are considered. Also as before, precision remains fairly stable (with an outlier for 70 patterns) and roughly lies in the same range as results for any of the individual experiments. The best F-measure is achieved when using all 104 unique patterns.
Recall that the gold standard annotation on the test data currently is not directly available. Evaluation on this data set is only possible by sending-in an annotated corpus to the online submission system of the shared task, and these evaluations are limited to one per day and person. Therefore, it was not feasible to find the optimal set of patterns on this data set, as this would require separate runs for each pattern (this would have taken 590 days). In turn, this means that the results practically simulate the case of applying our method on unseen text: selecting the optimal pattern set on training data and applying it on the new (test) data. Accordingly, it is not clear that the set of 104 patterns is really the best, but it is reassuring that it produces the best results from all runs we performed. Further optimizing the number of selected patterns would only make the results more corpus-specific and less applicable to unseen text.
We also applied the trigger word and pattern performance filter to three other event types of the BioNLP shared task: phosphorylation, protein catabolism, and transcription. Evaluation method was the same as for gene expression. To be able to reuse the same methods, we had to modify slightly the setting for the phosphorylation task, because this event type may encompass three slots instead of only two (as the other three event types): protein, trigger word, and phosphorylation site. However, detecting the site is very simple. We tagged all mentioning of sites using regular expressions and used the pattern matching algorithm only for detecting and associating protein and trigger word. Since this deviates from the standard procedure, we do not compare our results with those of the task, but only focus on the improvements achieved by our filtering. Finally, trigger word filtering was not tested on phosphorylation as the number of trigger words is very small for this event type. On the test data, we only evaluated the best filter according to the development data. Results are shown in Table Table6;6; the table also contains the results achieved by the best solution in the BioNLP .
The data shows that filtering drastically improves the performance in all cases. The improvement in F-measure is up to tenfold. Note that the numbers of examples for these event types are much smaller than for gene expression, which explains the general low performance of the baseline on the test data. However, pattern filtering considerably improved performance also on test data.
We described simple techniques for filtering sentences and patterns to improve pattern-based relationship extraction with an automatically learnt pattern set from the biomedical literature. A proper combination of the trigger word filter and the pattern performance filter has a major effect on the performance both on the training and on the test data for all event types we considered. For gene expression events, we were able to improve the precision from 24.7% to 77.4% and the F-score from 32.9 to 58.0 on the development corpus. The decline in recall was comparably small (from 49.2% to 46.3%). The filters also achieved a significant improvement when evaluated on the test data set, where the precision grew from 17.2% to 58.2% and the F-score from 24.8 to 51.9. Evaluation on three other types of events (phosphorylation, catabolism, and transcription) often showed even larger improvements in precision and F-measure.
In general, we expected that our approach would yield higher precision but lower recall. The evaluation confirmed this expectation for almost all event types and both data sets. Filtering inherently decreases recall as it only suppresses potential match situations, but does not add new ones: thus, only FPs can be avoided, but no new TPs can be added. On the other hand, if applied properly, filtering should increase precision by avoiding erroneous patterns or difficult sentences which have a higher chance of producing FPs. The trick is to find the right balance between decrease in recall and increase in precision. In this work, we learned optimal thresholds in terms of the hit rate of trigger words and length or performance of individual patterns by using a development corpus. However, we found that the learned parameters seem to be reasonable, as they led to considerable improvement also on the test data sets.
It is surprising that, only for gene expression, even recall improved on the test data. How-ever, this behaviour can be explained by the fact that, in the baseline, some patterns produced false positives but high scores in the matching algorithm of Ali Baba, thus overruling other patterns which would have found the right hit but produced lower scores. These patterns were removed from the filtered pattern set. This effect was not observed for the other event types, very likely because for those the number of patterns is much smaller. Our results show that it is possible to achieve good result using only a small set of 50 patterns (containing in average 4.82 tokens), which also makes the extraction very fast. We observed a similar effect in , where, for each sentence, we first sorted all patterns according to a rough estimate on their expected maximal score (based on q-gram frequencies in the pattern and the sentences) and only applied the best k patterns for the actual extraction. This yielded a 100-fold increase in extraction speed at ~3% worse F-measure. The results of our current work are orthogonal: here, we focus on improving extraction quality. Patterns are filtered globally using optimization on a development corpus instead of being selected heuristically on a per-sentence basis as in . In our current work, the observed speed-up is more of a side-product.
We are aware of only a few other works that evaluated filtering of patterns in pattern-based IE. In the Life Sciences, the SPIES project  used a pattern filter based on the Minimum Description Length Principle (MDL) for PPI extraction (see Introduction). The effect they report is much smaller than the one we observed, probably because their performance without filtering was already much higher than in our case, leaving less room for improvement. This higher performance can be explained by the fact that the SPIES system was evaluated only on a corpus that contained no sentence without a positive example. Filtering by trigger-word-based pattern evaluation was also performed by Cohen et al.  and by Buyko et al. , specifically for the BioNLP task. Cohen et al.  order trigger words by frequency and keep only the top 10-30% (depending on the event type) for pattern generation. In contrast to our filter, their solution did not take into account the ambiguity of trigger words across event type, but they allowed for some linguistic variations. Buyko et al.  tested three variations of treating trigger-word ambiguity, i.e., by frequency, by TF*IDF score, and by importance (taking into account relative frequencies in the different event types). The authors, in accordance with our results, report that the filter considering the frequencies in different event classes performed best. Note that they used these scores as input to a machine learning approach to relationship extraction, while we work with pattern matching. A system outside the Life Science' domain is DARE . DARE uses so-called seed relations, specified manually, as initial patterns. These are applied on a large corpus to generate new examples that, iteratively, are used as new seeds again. All patterns are scored according to internal and external properties before being used as new seeds. In contrast to our proposals, the pattern scoring method is highly specific to DARE and not easily transferred to other approaches.
The overall performance of our system cannot compete with the best approaches reported for the BioNLP task (see Table Table3,3, Table Table6,6, and [18,21]). However, we remark that we applied Ali Baba's method practically without any tuning to the task, while most high-scoring systems were developed specifically for this competition. It would be interesting to see how well they perform on other corpora and other tasks. Often, tuning for a specific corpus yields systems that perform considerably worse on other data; for instance, the OpenDMAP system, using a set of manually created patterns, performed best in the BioCreative II PPI task , but was found to perform worst (compared to 5 other systems, including co-occurrence as baseline) on each of five other corpora . This system also achieved exceptionally good precision results on the BioNLP task (with a new set of manually created rules) , with F-scores being sometimes better and sometimes worse than our results using filtering.
Another important difference between Ali Baba and the top-performing systems in the BioNLP task is the speed of extraction. All top-scoring systems in the task used dependency parse trees as their primary input. Such an approach requires high-cost computer equipment when applied to sizeable text collections and not only a couple of hundred sentences. For instance, Björne et al.  estimate that it would take ~230 days to parse all sentences in PubMed which contain at least one protein name, using the Charniak-Lease parser . In contrast, Ali Baba only requires sentences to be POS-tagged, which can be accomplished in less than a day on a modern quad-core machine. Whether or not systems using only POS tags can - in general - perform equally well as those using dependency parses is a question that is not yet decided; in our recent evaluation on the performance of PPI extraction systems on five PPI corpora using three different evaluation strategies , the shallow linguistic kernel based system , which uses only POS tags and tokens, performed almost equally well as the best systems using dependency parse trees.
We studied the effect of several simple pattern selection techniques on the performance of pattern-based relation extraction methods in biomedical text, where the patterns are learnt automatically from examples. In particular we investigated how the performance of Ali Baba can be improved on four event extraction problems of the BioNLP task by a careful selection of the patterns of sentences. We described the effect of several pattern selection techniques for the gene expression event in detail and also provided result for three other event types. We showed that significant improvement can be achieved in terms of precision and F-score at a slight drop of the recall. Obviously, with a smaller pattern set, the runtime of the algorithms can be also largely decreased. The main advantages of our pattern filters compared to the similar methods presented in the literature are their simplicity and portability. The filters are very easy to implement and should be applicable at basically no cost to any pattern-based relation extraction method that works with automatically determined large pattern bases matched individually against new text, such as  or the method described in . In addition, the proposed methodology offers a tool for tuning the precision-recall trade-off depending on the nature of the application task at hand.
PPI: Protein-Protein Interaction; TM: Text Mining NLP: Natural Language Processing; NER: Named Entity Recognition; RE: Relation Extraction; POS: Parts of Speech; HR: Hit Rate; MDL: Minimum Description Length; FP: false positive; TP: true positive
The authors declare that they have no competing interests.
UL conceived of and supervised the project. QLN wrote the programs and performed all evaluations. DT participated in project's design and guided the evaluation. All authors contributed to writing and approved the manuscript.
Quang Long Nguyen is a PhD student in the group of Ulf Leser. His work in text mining focuses on information extraction for life science applications, especially for Systems Biology.
Domonkos Tikk is an associate professor at Budapest University of Technology and Economics (on leave). Currently he is in Ulf Leser's group as an Alexander-von-Humboldt scholarship holder. His main research areas are text mining and recommendation systems.
Ulf Leser is a professor for Knowledge Management in Bioinformatics at the Humboldt-Universität in Berlin. His main topics of research are semantic data integration, text mining and biomedical data management.
Effect of using different match thresholds evaluated on the development set.
Distribution of patterns according to length (number of tokens).
F-score of subsets of patterns with identical length of k tokens on the development set.
Evaluation results of subsets with patterns of identical length on the development set.
QLN is supported by The Ministry of Education and Training (MOET) of Vietnam and DT is supported by the Alexander-von-Humboldt Foundation. We thank the BioNLP'09 Shared Task organizers for their efforts in data preparation and evaluation system. We also thank Illés Solt for providing hit count calculations, Conrad Plake and Jörg Hakenberg who initially developed Ali Baba's extraction system, and Philippe Thomas for critical reading of the manuscript.