The first extension we have introduced to FACTA is the ability to detect biomolecular events mentioned in MEDLINE articles, which allows the user to issue queries that include such events. For example, FACTA+ allows the user to retrieve the documents that contain the word ‘ERK2’ and also mention positive regulation events by using the query ‘ERK2 GENIA:Positive_regulation’. This extension is motivated by the fact that biomolecular events have recently received considerable attention as an important type of information in biomedical text-mining (
Ananiadou et al., 2010;
Björne et al., 2010;
Miwa et al., 2010).
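The query syntax described above can be illustrated with a minimal sketch. It assumes that terms are separated by whitespace and that event constraints carry the `GENIA:` prefix, as in the example query; the names `ParsedQuery`, `parse_query` and `matches` are hypothetical and do not describe the actual FACTA+ implementation.

```python
from dataclasses import dataclass, field

EVENT_PREFIX = "GENIA:"  # prefix seen in the example query; other prefixes are not assumed


@dataclass
class ParsedQuery:
    keywords: list = field(default_factory=list)     # plain word constraints
    event_types: list = field(default_factory=list)  # event-type constraints


def parse_query(query: str) -> ParsedQuery:
    """Split a query into plain keywords and GENIA event-type constraints."""
    parsed = ParsedQuery()
    for term in query.split():
        if term.startswith(EVENT_PREFIX):
            parsed.event_types.append(term[len(EVENT_PREFIX):])
        else:
            parsed.keywords.append(term)
    return parsed


def matches(doc_words: set, doc_events: set, parsed: ParsedQuery) -> bool:
    """A document matches if it contains every keyword and every event type."""
    return (all(k in doc_words for k in parsed.keywords)
            and all(e in doc_events for e in parsed.event_types))


query = parse_query("ERK2 GENIA:Positive_regulation")
print(query.keywords, query.event_types)
```

Under these assumptions, a document is returned only if it contains the word ‘ERK2’ and at least one recognized positive regulation event.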
In this work, our definition of biomolecular events follows that of the GENIA event corpus (
Kim et al., 2008), in which events are basically characterized by verbs or nominalized verbs. For example, the sentence ‘In Escherichia Coli, glnAP2 may be activated by NifA.’ contains one event specified by the verb ‘activated’, with its arguments being ‘In Escherichia Coli’, ‘glnAP2’ and ‘NifA’. In the GENIA event definition, every event is represented with a
trigger and its arguments. Table 1 shows some examples of the events in the corpus with the trigger words italicized. For example, ‘express’ is the trigger word for the gene expression event in the first row of the table.
| Table 1. Examples of event-describing phrases |
Since every event is represented with a trigger, what we need for event recognition is a component that can accurately detect triggers in text. Perhaps the most straightforward approach to detecting trigger words in text is to use a dictionary, but pure dictionary matching is not suitable for event recognition, because trigger words are often highly ambiguous. For example, as seen in Table 1, the word ‘express’ is used as a trigger word for the gene expression event, but ‘express’ is also a very common verb that is used with many different meanings. Therefore, including the word ‘express’ in the dictionary would produce many false-positive matches.
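The ambiguity problem can be made concrete with a small sketch of pure dictionary matching. The one-entry dictionary and both example sentences are invented for illustration and are not taken from the GENIA corpus.

```python
import re

# Hypothetical one-entry trigger dictionary (illustrative only).
TRIGGER_DICTIONARY = {"express": "Gene_expression"}


def dictionary_match(sentence: str) -> list:
    """Return (token, event_type) pairs for every token found in the dictionary."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [(tok, TRIGGER_DICTIONARY[tok]) for tok in tokens if tok in TRIGGER_DICTIONARY]


biological = "T cells express IL-2 after stimulation."
general = "The authors express concern about the small sample size."

print(dictionary_match(biological))
print(dictionary_match(general))
```

Both sentences yield the same match, although only the first describes a gene expression event. A context-free lookup cannot separate the two uses, which is the motivation for a learned, context-sensitive tagger.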
We use a machine learning-based approach to sidestep this ambiguity problem, with the GENIA event corpus (
Kim et al., 2008) as training data. More specifically, we used the data released for the BioNLP'09 shared task (
Kim et al., 2009) for training and testing our machine learning models. This shared task data is derived from the GENIA event corpus and contains annotations on nine event types concerning protein biology, which are a subset of the biomolecular events defined in the GENIA event ontology.
The machine learning models trained on the shared task data are used to recognize event triggers in text and their event types, and FACTA+ simply regards the detection of a trigger as an occurrence of the corresponding event in the abstract. This simple approach risks producing false positives, because it disregards some semantically important types of information such as modality and negation (
Garten et al., 2010;
Krallinger, 2010;
Nawaz et al., 2010). We leave the handling of such information for future work.
2.1 Related work
The most straightforward approach to detecting trigger words is to use a dictionary.
Buyko et al. (2009) created a dictionary by manually curating and extending a lexicon derived from the original GENIA corpus with the help of researchers with a background in biology. A disambiguation step is performed by considering the co-occurrence statistics between each trigger word and the event types in a training corpus. This kind of disambiguation is also used in other dictionary-based approaches [e.g.
Kilicoglu and Bergler (2009),
MacKinlay et al. (2009)].
Vlachos et al. (2009) extracted frequent triggers using a one-sense-per-term assumption, and performed soft matching (using lemmas and stems) to alleviate the problem of potential variability of trigger words.
Vlachos (2010) extended the extracted dictionary by incorporating ‘light’ and ‘ultra-light’ triggers, which represent the discriminative modifiers of triggers.
Kaljurand et al. (2009) extracted the dictionary from a training corpus, and disambiguated the trigger words by considering two kinds of co-occurrence statistics: one between each token and token considered to be a trigger and one between an event structure (event type and argument combination) and the trigger.
Kilicoglu and Bergler (2009) manually cleaned the dictionary by removing ambiguous triggers, and also added variations of prefixes and nominal forms of verbs to the dictionary.
Van Landeghem et al. (2009) built two separate, manually cleaned dictionaries: one for unary events and one for other events.
Cohen et al. (2009) selected triggers by iteratively testing manually constructed patterns.
Another popular approach is to use machine learning.
Björne et al. (2009) and
Miwa et al. (2010) used multi-class support vector machines (SVMs) to detect and disambiguate trigger words.
Morante et al. (2009) detected and disambiguated trigger words using the IB1 memory-based classifier.
MacKinlay et al. (2009) combined the outputs from a dictionary-based look-up tagger and a conditional random field (CRF)-based tagger.
Some other approaches detected events without a specialized module for trigger detection.
Riedel et al. (2009) and
Poon and Vanderwende (2010) detected events using Markov logic networks (MLNs).
Neves et al. (2009) used case-based reasoning, which finds the ‘case-solution’ of a token, including event, trigger and argument types, by retrieving the most similar and most frequent case in the training data.
Hakenberg et al. (2009) extracted the shortest link paths on parse trees in events as queries, and also created regular expression-based patterns for regulation events. They grouped similar terms together manually, and applied both queries and patterns to the development and test datasets to detect triggers and arguments.
2.2 Detecting trigger words
To detect trigger words, we use a CRF model (
Lafferty et al., 2001). CRF models are log-linear probabilistic models for predicting sequences, which are widely used in biomedical text-mining as the machine learning model for named entity recognition (
Okanohara et al., 2006;
Settles, 2004). The task of detecting trigger words can be performed with a CRF model by converting the task to a sequence prediction problem, in which the trigger sequences are represented with the ‘IOB2’ representation (
Sang and Veenstra, 1999). In this representation, the first word of a trigger is given the ‘B’ tag, any following words of the same trigger are given the ‘I’ tag, and all other words in the sentence are given the ‘O’ tag. The task of the CRF model is then to predict an ‘IOB2’ tag sequence for a given sentence. In this work, the ‘B’ and ‘I’ tags are combined with the nine types of biomolecular events defined in the BioNLP'09 shared task data (available at
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).
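The conversion of trigger annotations into an IOB2 tag sequence can be sketched as follows. The tokenization of the example sentence and the event type assigned to ‘activated’ (Positive_regulation) are assumptions made for illustration; the function name `to_iob2` is hypothetical.

```python
def to_iob2(tokens: list, trigger_spans: list) -> list:
    """Encode trigger spans as IOB2 tags.

    trigger_spans holds (start, end_exclusive, event_type) token offsets.
    """
    tags = ["O"] * len(tokens)
    for start, end, event_type in trigger_spans:
        tags[start] = f"B-{event_type}"      # first word of the trigger
        for i in range(start + 1, end):
            tags[i] = f"I-{event_type}"      # remaining words of the trigger
    return tags


# Example sentence from the text; tokenization and event type are assumed.
tokens = ["In", "Escherichia", "Coli", ",", "glnAP2", "may", "be",
          "activated", "by", "NifA", "."]
tags = to_iob2(tokens, [(7, 8, "Positive_regulation")])

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token then carries exactly one label, so trigger detection and event typing reduce to a single sequence labelling problem.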
2.3 Joint learning
In this work, we propose to use a model that performs joint learning to recognize event triggers and protein names simultaneously. The motivation for our joint learning approach is that the presence of a protein name often indicates the presence of a trigger word in its vicinity. It should be noted that, unlike the shared task setting, we cannot use the information from gold-standard annotations for protein names, because we need to process the whole MEDLINE corpus for FACTA+.
The joint CRF model uses three additional tags: ‘B-Protein’, ‘I-Protein’ and ‘Filler’. Table 2 shows an example of an IOB tag sequence for the sentence ‘CD44 activated the transcription factor AP-1’. Note that the trigger word ‘activated’ is followed by a protein name but there is a gap between them. The tags assigned to ‘CD44’, ‘AP’, ‘-’ and ‘1’ are the ones added to recognize protein names. The ‘Filler’ tags are introduced to represent the regions that reside between the protein names and trigger words belonging to the same event. The filler tags enable the CRF model to propagate information from the existence of trigger words to non-adjacent protein names. In other words, the fact that a trigger word is followed by a protein name is captured by two transition features: (i) the transition from ‘B-Positive_regulation’ to ‘Filler’ and (ii) the transition from ‘Filler’ to ‘B-Protein’.
| Table 2. Joint learning of event triggers and protein names |
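The tagging scheme for the example sentence can be reconstructed with the following sketch. The token offsets and the trigger-protein links are assumptions derived from the description above, not from the original annotation, and the function name `tag_joint` is hypothetical.

```python
def tag_joint(n_tokens: int, triggers: list, proteins: list, links: list) -> list:
    """Build a joint tag sequence with Filler tags.

    triggers: (start, end_exclusive, event_type) spans
    proteins: (start, end_exclusive) spans
    links:    (trigger_index, protein_index) pairs belonging to the same event
    """
    tags = ["O"] * n_tokens
    for start, end, event_type in triggers:
        tags[start] = f"B-{event_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{event_type}"
    for start, end in proteins:
        tags[start] = "B-Protein"
        for i in range(start + 1, end):
            tags[i] = "I-Protein"
    # Mark the gap between a trigger and a linked protein as Filler.
    for t_idx, p_idx in links:
        t_start, t_end, _ = triggers[t_idx]
        p_start, p_end = proteins[p_idx]
        lo, hi = (t_end, p_start) if t_end <= p_start else (p_end, t_start)
        for i in range(lo, hi):
            if tags[i] == "O":
                tags[i] = "Filler"
    return tags


tokens = ["CD44", "activated", "the", "transcription", "factor", "AP", "-", "1"]
tags = tag_joint(
    len(tokens),
    triggers=[(1, 2, "Positive_regulation")],
    proteins=[(0, 1), (5, 8)],
    links=[(0, 0), (0, 1)],  # assumed: both proteins take part in the event
)
transitions = list(zip(tags, tags[1:]))
print(list(zip(tokens, tags)))
```

The resulting sequence contains the two transitions named in the text, from ‘B-Positive_regulation’ to ‘Filler’ and from ‘Filler’ to ‘B-Protein’. Without the filler tags, the words in the gap would be tagged ‘O’ and a first-order CRF would have no transition feature connecting the trigger to the protein name.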
2.4 Experiments
We present experimental results to evaluate the performance of our joint learning approach. We compare our joint learning approach against two baseline approaches (models). The three CRF models used in the experiments are as follows.
- Triggers Only
A model that recognizes only trigger words. The training data contains only the annotations on trigger words. Since there are nine different types of events in the data, this model considers 19 (= 2 × 9 + 1) different possible tags for each word.
- Joint
A model to recognize protein names and trigger words jointly. However, the training data for this model does not include the Filler tag. This model considers 21 (= 19 + 2) different possible tags for each word.
- Joint + Filler
A model to recognize protein names and trigger words jointly. The training data also includes the Filler tag as described in the previous subsection. This model considers 22 (= 21 + 1) different possible tags for each word.
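The tag counts of the three models can be checked by enumerating the label sets. The sketch below uses the nine event types of the BioNLP'09 shared task; the variable names are illustrative.

```python
# The nine event types annotated in the BioNLP'09 shared task data.
EVENT_TYPES = [
    "Gene_expression", "Transcription", "Protein_catabolism",
    "Phosphorylation", "Localization", "Binding",
    "Regulation", "Positive_regulation", "Negative_regulation",
]

# Triggers Only: a B and an I tag per event type, plus the O tag.
triggers_only = ["O"] + [f"{prefix}-{event}" for event in EVENT_TYPES
                         for prefix in ("B", "I")]

# Joint: adds the two protein tags.
joint = triggers_only + ["B-Protein", "I-Protein"]

# Joint + Filler: adds the Filler tag.
joint_filler = joint + ["Filler"]

print(len(triggers_only), len(joint), len(joint_filler))
```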
We trained these CRF models using the training data (consisting of 800 MEDLINE abstracts) in the BioNLP'09 shared task corpus, and evaluated them using its development data (consisting of 150 abstracts). The corpus was preprocessed with simple rule-based scripts to perform sentence segmentation and tokenization. The feature templates used in our CRF models are shown in Table 3. The features include word n-grams, substrings of the current word, the shape of the current word, and tag transitions.
| Table 3. Feature templates used in the CRF tagger |
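The kinds of features listed above can be sketched as follows. The exact templates are those of Table 3 and are not reproduced here, so the window size, the affix lengths and the shape mapping below are assumptions chosen for illustration; tag-transition features are handled inside the CRF and are therefore not part of this per-token function.

```python
def word_shape(word: str) -> str:
    """Map upper-case letters to 'A', lower-case to 'a' and digits to '0'."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append(ch)  # keep punctuation such as '-'
    return "".join(shape)


def token_features(tokens: list, i: int, max_affix: int = 3) -> dict:
    """Word n-gram, substring and shape features for the token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    features = {
        "w0": word,
        "w-1": prev_word,
        "w+1": next_word,
        "w-1|w0": f"{prev_word}|{word}",   # word bigrams
        "w0|w+1": f"{word}|{next_word}",
        "shape": word_shape(word),
    }
    # Substring features: prefixes and suffixes of the current word.
    for n in range(1, min(max_affix, len(word)) + 1):
        features[f"prefix{n}"] = word[:n]
        features[f"suffix{n}"] = word[-n:]
    return features


tokens = ["CD44", "activated", "the", "transcription", "factor", "AP", "-", "1"]
print(token_features(tokens, 1))
print(word_shape("CD44"), word_shape("AP-1"))
```

Suffix features such as ‘ted’ let the model generalize across inflected trigger forms, while shape features such as ‘AA00’ are a common cue for protein names.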
Table 4 shows the results. The first nine rows in the table correspond to the nine types of biomolecular events defined in the corpus, and the bottom row shows the micro-average of the scores. Our proposed approach (i.e. the ‘Joint + Filler’ model) significantly outperforms the ‘Triggers Only’ model. This shows that the contextual information from the protein names is useful in detecting trigger words. It should also be noted that the performance of the ‘Joint’ model without the filler tag is worse than that of the ‘Joint + Filler’ model, suggesting that it is important to explicitly transfer the information on the neighbouring tags in a CRF model.
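For reference, micro-averaging pools the counts of all event types before computing the scores, so frequent event types contribute more than rare ones. The sketch below assumes that the reported scores are precision, recall and F-score; the counts are invented for illustration and are not results from this work.

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


def micro_average(counts: dict) -> tuple:
    """Pool (tp, fp, fn) over all event types, then compute the scores once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall_f(tp, fp, fn)


# Hypothetical (true positive, false positive, false negative) counts.
hypothetical_counts = {
    "Gene_expression": (80, 20, 20),
    "Binding": (30, 30, 40),
}
micro_p, micro_r, micro_f = micro_average(hypothetical_counts)
print(f"P={micro_p:.3f} R={micro_r:.3f} F={micro_f:.3f}")
```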
Our approach consistently improved the performance for detecting event triggers, but the performance of detecting binding and regulation events was not very high. This is because these events can take multiple arguments, and also because regulation events can take other events as arguments. Rich linguistic information is required to deal with such event structures, and such triggers are not our current focus.
Note that the performance figures presented in this table are not comparable to those reported for the BioNLP shared task, because we did not use the gold-standard information on gene/protein names. Our purpose is to evaluate the accuracy of trigger detection in a real-world setting, where no gold-standard annotations for gene/protein names are available.
The machine learning model described above (i.e. ‘Joint + Filler’) was applied to the whole MEDLINE corpus containing 20 033 079 articles, and the recognized events are indexed by FACTA+ so that it can accept queries including biomolecular events.
1 The number of articles indexed for each event type ranged from 53 262 (Protein catabolism) to 1 537 441 (Regulation).
We have also carried out a small-scale experiment to assess the quality of this indexing for the whole of MEDLINE. We asked a bioNLP researcher with a biology background to check the 10 latest articles returned by FACTA+ for each event class and judge whether they are really relevant to the target class. In other words, the abstract-level precision was manually evaluated for each event class. The result was that 86 out of the 90 abstracts (approximately 96%) were actually relevant to the target event class.
2 Although the recall of this event indexing is not completely clear, the precision is probably good enough to be used in practice.