In our approach to parser evaluation, we measure the accuracy of a PPI extraction system, in which the parser output is embedded as statistical features of a machine learning classifier. We run the classifier with features of every possible combination of a parser and a parse representation, by applying conversions between representations when necessary.
3.1 PPI extraction
PPI extraction is an information extraction task to identify protein pairs that are mentioned as interacting in biomedical papers. Because the number of biomedical papers is growing rapidly, it is becoming difficult for biomedical researchers to find all papers relevant to their research; thus, there is an emerging need for reliable text mining technologies, such as automatic PPI extraction from texts.
shows two sentences that include protein names: the former sentence mentions a protein interaction, while the latter does not. Given a protein pair, PPI extraction is a binary classification task; for example, the pair (IL-8, CXCR1) is a positive example, and (RBP, TTR) is a negative example. Recent studies on PPI extraction have demonstrated that syntactic/semantic relationships between target proteins are effective features for machine learning classifiers (Erkan et al., 2007; Katrenko and Adriaans, 2006; Sætre et al., 2007). For the protein pair IL-8 and CXCR1 in , a dependency parser outputs a dependency tree, partly shown in . From this dependency tree, we can extract the dependency path shown in , which appears to be a strong clue that these proteins are mentioned as interacting.
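The extraction of a dependency path between two target proteins can be sketched as a shortest-path search over the dependency tree. The following is a minimal illustration, not the implementation used in this work: the function name `dependency_path`, the toy sentence, the labels `SBJ` and `OBJ`, and the convention that an 'r' prefix marks an edge traversed from dependent to head are all assumptions made for the example.

```python
from collections import deque

def dependency_path(edges, source, target):
    """Return the shortest path between two tokens in a dependency tree.

    edges: list of (head, label, dependent) triples.
    Returns a list of (label, token) steps leading from `source` to `target`,
    or None if the tokens are not connected. Tokens are identified by their
    surface string here, which is a simplification for the example.
    """
    graph = {}
    for head, label, dep in edges:
        graph.setdefault(head, []).append((label, dep))        # head -> dependent
        graph.setdefault(dep, []).append(("r" + label, head))  # dependent -> head
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for label, neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = (node, label)
                queue.append(neighbour)
    if target not in previous:
        return None
    # Walk back from the target to the source, then reverse.
    steps = []
    node = target
    while previous[node] is not None:
        parent, label = previous[node]
        steps.append((label, node))
        node = parent
    steps.reverse()
    return steps

# Hypothetical toy parse of "IL-8 recognizes CXCR1".
toy_edges = [("recognizes", "SBJ", "IL-8"), ("recognizes", "OBJ", "CXCR1")]
path = dependency_path(toy_edges, "IL-8", "CXCR1")
print(path)
```

Because a dependency tree is connected and acyclic, the path between any two tokens is unique, so a breadth-first search is sufficient.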
We follow the PPI extraction method of Sætre et al. (2007), which is based on support vector machines with SubSet Tree Kernels (Moschitti, 2006), while using different parsers and parse representations. Two types of features are incorporated in the classifier. The first is bag-of-words features, which are regarded as a strong baseline for PPI extraction systems. Lemmas of words before, between and after the pair of target proteins are included, and a linear kernel is used for these features. This kernel is included in all our models. The second type is parser output features. For dependency-based parse representations, a dependency path is encoded as a flat tree as depicted in (prefix ‘r’ denotes reverse relations). Because a tree kernel measures the similarity of trees by counting common subtrees, the system is expected to find effective subsequences of dependency paths. For the PTB representation, we directly encode phrase structure trees.
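To make the two feature types concrete, the sketch below computes a linear kernel over bag-of-words features and a simplified subset-tree-style kernel over flat trees. It is an illustration under stated assumptions rather than the system used here: tagging each lemma with its zone (before, between, after), the root label `PATH`, the dependency labels, the decay parameter, and combining the two kernels by simple addition are all choices made for the example, and the recursion is a standard textbook form that may differ in detail from the kernel of Moschitti (2006).

```python
from collections import Counter

def bag_of_words(lemmas, i, j):
    """Count lemmas before, between and after the proteins at positions i < j."""
    features = Counter()
    for position, lemma in enumerate(lemmas):
        if position in (i, j):
            continue  # skip the target proteins themselves
        zone = "before" if position < i else "between" if position < j else "after"
        features[(zone, lemma)] += 1
    return features

def linear_kernel(a, b):
    """Dot product of two sparse count vectors."""
    return sum(count * b[key] for key, count in a.items())

def flat_tree(steps, root="PATH"):
    """Encode a dependency path as a flat tree: one child per (label, word) step."""
    return (root,) + tuple((label, word) for label, word in steps)

def production(node):
    # A node's label plus its children. Words are plain strings; child nodes
    # are wrapped in a tuple so that a word never matches a node label.
    return (node[0], tuple(c if isinstance(c, str) else (c[0],) for c in node[1:]))

def internal_nodes(tree):
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from internal_nodes(child)

def common_fragments(n1, n2, decay):
    """Weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            score *= 1.0 + common_fragments(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    return sum(common_fragments(a, b, decay)
               for a in internal_nodes(t1) for b in internal_nodes(t2))

# Hypothetical lemmatized sentences with protein placeholders.
sent_a = ["PROT1", "bind", "PROT2", "in", "vitro"]
sent_b = ["PROT1", "bind", "to", "PROT2"]
bow = linear_kernel(bag_of_words(sent_a, 0, 2), bag_of_words(sent_b, 0, 3))

tree_a = flat_tree([("rSBJ", "bind"), ("OBJ", "PROT2")])
tree_b = flat_tree([("rSBJ", "bind"), ("PMOD", "PROT2")])
similarity = bow + tree_kernel(tree_a, tree_b)  # summing kernels is an assumption
print(bow, tree_kernel(tree_a, tree_b), similarity)
```

In this toy case the two paths share only the fragment rooted at `rSBJ`, so the tree kernel rewards the common subsequence even though the full paths differ.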
We also measure the accuracy obtained by the ensemble of two parsers/representations. This experiment indicates differences or overlaps in the information conveyed by two different parsers or parse representations.
3.2 Conversion of parser output representations
It is widely believed that the choice of the representation format for parser output may greatly affect the performance of applications, although this has not been extensively investigated. We should, therefore, evaluate the parser performance in multiple parse representations. In this article, we create multiple parse representations by converting each parser's default output into other representations when possible. This experiment can also be considered to be a comparative evaluation of parse representations, thus providing an indication for selecting an appropriate parse representation for similar information extraction and text mining tasks.
lists the formats for parser output used in this work, and shows our scheme for representation conversion. Although only CoNLL is available for dependency parsers, we can create four representations for the phrase structure parsers, and five for the deep parsers. Dotted arrows in indicate imperfect conversion, in which the conversion inherently introduces errors and may therefore decrease the accuracy. Results obtained through imperfect conversion should thus be compared with caution.
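A conversion scheme of this kind can be represented as a directed graph whose edges are marked as perfect or imperfect. The sketch below determines which representations are reachable from a parser's default output and whether each can be reached through perfect conversions only. The representation names `REP_A`, `REP_B` and `REP_C` and the edges between them are placeholders invented for the example; they do not reproduce the actual scheme of this work.

```python
from collections import deque

# Placeholder conversion edges: (source, target, is_perfect).
CONVERSIONS = [
    ("REP_A", "REP_B", True),
    ("REP_B", "REP_C", False),  # an imperfect ("dotted arrow") conversion
]

def _closure(start, edges):
    """All representations reachable from `start` by following `edges`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

def reachable(start, conversions):
    """Map each reachable representation to True if it can be obtained
    through perfect conversions only, and to False otherwise."""
    all_edges = [(s, t) for s, t, _ in conversions]
    perfect_edges = [(s, t) for s, t, ok in conversions if ok]
    perfect = _closure(start, perfect_edges)
    return {rep: rep in perfect for rep in _closure(start, all_edges)}

print(reachable("REP_A", CONVERSIONS))
```

Representations flagged `False` are the ones whose results call for caution when compared with those of other parsers.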
3.3 Parser retraining with GENIA
The domain of our target text is different from the Wall Street Journal (WSJ) portion of the Penn Treebank, which is the de facto standard data for parser training. Because all the parsers described in the section on syntactic parsers were originally trained with the WSJ data (except for ENJU-GENIA), we retrain the parsers with the GENIA Treebank (8127 sentences), which is a treebank of biomedical paper abstracts annotated according to the guidelines of the Penn Treebank (Tateisi et al., 2005). Since all these parsers have programs for training with a PTB-style treebank, we use those programs for retraining with default parameter settings.
In preliminary experiments, we found that dependency parsers attain higher dependency accuracy when trained only with GENIA. We therefore use only GENIA as the training data for the retraining of dependency parsers. For the other parsers, we use the concatenation of WSJ and GENIA for the retraining; the reranker of RERANK was not retrained because of the high cost. ENJU-GENIA was likewise trained on the same concatenation of WSJ and GENIA.
Since all the parsers except NO-RERANK and RERANK require an external POS tagger, geniatagger (Tsuruoka et al., 2005) is used with these parsers.
3.4 Evaluating the relationships between parser accuracy, treebank size and PPI accuracy
In addition to investigating the impact of different parsers and different syntactic representations on PPI identification accuracy, we also examine how the parse accuracy of a single parser affects the PPI accuracy. To this end, we retrain one of the parsers (KSDEP) with varying amounts of training text, resulting in several different versions of the same parser, having different levels of accuracy. This allows us to establish a relationship between the accuracy of the parser and the amount of training data used to create the parser. When the parser is used as a component in the PPI identification system, we can determine the relationship between the size of the dataset used to train the parser, the parser's accuracy, and the overall PPI system's accuracy. This provides a rough guide for what level of accuracy to expect in the PPI task when a new parser is used, as long as the accuracy of the parser is known.
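One simple way to prepare training sets of varying size is to draw nested subsets of the treebank, so that each larger set contains every smaller one. The sketch below illustrates this idea only: the function name `nested_subsets`, the fractions, the fixed random seed, and the use of integer identifiers in place of real sentences are assumptions for the example, not the sampling procedure used in this work.

```python
import random

def nested_subsets(sentences, fractions, seed=0):
    """Return {fraction: subset}, where each subset is a prefix of one
    fixed shuffle, so smaller subsets are contained in larger ones."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    subsets = {}
    for fraction in sorted(fractions):
        size = max(1, round(len(shuffled) * fraction))
        subsets[fraction] = shuffled[:size]
    return subsets

# Sentence identifiers stand in for treebank sentences; 8127 is the size
# of the GENIA Treebank stated above, and the fractions are hypothetical.
subsets = nested_subsets(range(8127), [0.1, 0.5, 1.0])
for fraction, subset in subsets.items():
    print(fraction, len(subset))
```

Nesting the subsets means that differences between successive parser versions can be attributed to the added training data rather than to a different sample.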