The performance of the systems was measured by precision, recall and F1-score (i.e. balanced precision and recall). To be considered correct, a system prediction must match not only the type of the entity or event, but also both of its boundaries. For angiogenesis events, however, exact boundaries are not always crucial. For example, if the gold standard is ‘vascular endothelial cell proliferation’, then ‘endothelial cell proliferation’ is arguably a good prediction, even though its left boundary does not match the gold standard. Therefore, we used an additional measure called approximate boundary matching, which allows the spans of the predicted events to differ slightly from the gold standard. A similar measure was also adopted in the BioNLP event evaluation tasks (Kim et al., 2009).
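The evaluation scheme above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the span representation `(start, end, type)`, the token-level `tolerance`, and the function names are assumptions introduced for the example.

```python
# Illustrative sketch (not the paper's code): precision/recall/F1 for span
# predictions under exact vs. approximate boundary matching. The span
# format, the tolerance of 2 tokens, and the helper names are assumptions.

def spans_match(gold, pred, approximate=False, tolerance=2):
    """Exact matching requires identical boundaries; approximate matching
    lets each boundary differ by up to `tolerance` positions, provided
    the entity/event type is the same."""
    g_start, g_end, g_type = gold
    p_start, p_end, p_type = pred
    if g_type != p_type:
        return False
    if approximate:
        return (abs(g_start - p_start) <= tolerance
                and abs(g_end - p_end) <= tolerance)
    return g_start == p_start and g_end == p_end

def prf(gold_spans, pred_spans, approximate=False):
    """Compute precision, recall and F1, matching each gold span at most once."""
    matched = set()
    tp = 0
    for pred in pred_spans:
        for i, gold in enumerate(gold_spans):
            if i not in matched and spans_match(gold, pred, approximate):
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under exact matching, a prediction that drops the leading ‘vascular’ token scores zero on all three metrics; under approximate matching it counts as a true positive.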
Three sets of results for tagging angiogenesis terms are compared, where the CRF results were obtained by 5-fold cross-validation on the manually created gold standard data. The CRF model (Section 2.5.4) clearly outperformed DictionaryBased, which uses the manually compiled dictionary of angiogenesis terms (Section 2.2). Note that the IAA was calculated on a small set of 20 documents that were doubly annotated in the pilot annotation, and it was therefore possible for system performance to exceed the IAA.
For angiogenesis event identification, both PatternBaseline and PatternExtended exploited the manually compiled tissue and trigger vocabularies, but the former performed matches following simple surface patterns, whereas the latter applied patterns incorporating ENJU's predicate–argument relations. PatternExtended was a clear winner over PatternBaseline, which demonstrates that syntactic relations were useful. CRF and CRF-entity were supervised methods, trained and tested by 5-fold cross-validation on the manually created corpus. As mentioned, the difference between the two systems is that CRF used only contextual word and n-gram features, while CRF-entity also exploited the gold standard entity annotation. CRF-entity obtained the best results on every metric, and the performance of CRF was also promising. Nevertheless, these two methods were the most expensive to develop, as they required high-quality training data, which were laborious and time-consuming to produce, even though we significantly simplified the annotation guidelines compared with other annotation projects such as GENIA.
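The contrast between the two pattern-based strategies can be sketched as below. This is a hypothetical illustration, not the actual system: the toy `TRIGGERS` and `TISSUES` sets stand in for the manually compiled vocabularies, and the `(predicate, argument)` pair format is an assumed simplification of ENJU's predicate–argument output.

```python
# Hypothetical sketch of the two pattern strategies; the vocabularies and
# the PAS relation format are assumptions for illustration only.

TRIGGERS = {"proliferation", "migration", "sprouting"}
TISSUES = {"endothelial cell", "capillary"}

def simple_pattern(sentence):
    """PatternBaseline-style: fire whenever a tissue term and a trigger
    term co-occur in the same sentence, regardless of syntax."""
    text = sentence.lower()
    return any(t in text for t in TISSUES) and any(t in text for t in TRIGGERS)

def pas_pattern(pas_relations):
    """PatternExtended-style: additionally require a predicate-argument
    relation that links a trigger predicate to a tissue argument."""
    return any(pred in TRIGGERS and arg in TISSUES
               for pred, arg in pas_relations)
```

The sentence "Endothelial cell apoptosis and tumour migration" fires the surface pattern (a tissue term and a trigger co-occur) even though the trigger's argument is ‘tumour’, not a tissue; the PAS-based check rejects it, illustrating why syntactic relations improve precision.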
The bottom three rows present the results of the method that automatically constructs tissue and trigger vocabularies by comparing PAS language models between a domain-specific corpus and a general one (Section 2.4). We experimented with three different domain-specific foreground corpora, and the differences in performance indicate that this method is sensitive to the choice of foreground corpus. Empirically, the collection of angiogenesis review articles and the Wikipedia page obtained the best results, which correlates with the fact that this foreground corpus contained more concentrated information on angiogenesis than the others. While using the manually constructed vocabularies achieved good precision (87.71%), it suffered from poor recall (29.40%). In contrast, the automatic method using Review yielded better recall and F1 scores, indicating its ability to discover a wider range of terms from the domain-specific documents.
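The corpus-comparison idea behind the automatic vocabulary construction can be sketched as follows. This is a simplified sketch, not the paper's method: it scores bare terms by a smoothed log relative-frequency ratio between foreground and background corpora, whereas the actual approach compares PAS language models; the smoothing scheme and function names are assumptions.

```python
# Sketch (assumed details): score candidate terms by comparing their
# relative frequencies in a domain-specific foreground corpus against a
# general background corpus. High-scoring terms would seed the vocabulary.
# Add-one-style smoothing avoids division by zero for unseen terms.

import math
from collections import Counter

def domain_scores(foreground_terms, background_terms, smoothing=1.0):
    """Return a log relative-frequency ratio per foreground term; positive
    scores mean the term is more characteristic of the foreground corpus."""
    fg = Counter(foreground_terms)
    bg = Counter(background_terms)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for term, f in fg.items():
        p_fg = (f + smoothing) / (fg_total + smoothing)
        p_bg = (bg.get(term, 0) + smoothing) / (bg_total + smoothing)
        scores[term] = math.log(p_fg / p_bg)
    return scores
```

A term like ‘angiogenesis’ that is frequent in the foreground but rare in the background receives a high positive score, while function words common to both corpora score near or below zero, which is why a concentrated foreground corpus such as the review collection helps.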