Recall and precision measure in biocuration and comprehensive SVM scheme
The performance of an SVM can be evaluated using a testing set containing documents with known class labels. The commonly used evaluation metrics are recall and precision: recall = TP/(TP+FN) and precision = TP/(TP+FP), where TP, FN, and FP are the numbers of true positives, false negatives, and false positives, respectively. High precision is normally more readily achievable than high recall in SVM-based text classification [23], and high precision has been preferred over high recall in some commonly studied areas such as web page categorization. In biocuration, however, the goal is to obtain the highest recall possible while keeping the false positive rate reasonably low. If recall is not high enough, curators would need to examine all published papers for their data type to uncover false negatives. By contrast, to eliminate false positives, curators need to examine only the subset of papers identified as positives.
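The two metrics defined above can be computed directly from the confusion counts. The following is a minimal sketch; the function names and the example counts are illustrative assumptions, not values from this study.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of truly positive papers that the classifier identified."""
    total = tp + fn
    return tp / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of papers flagged as positive that are truly positive."""
    total = tp + fp
    return tp / total if total else 0.0


# Hypothetical counts for one data type (not taken from the paper).
tp, fp, fn = 90, 30, 10
print(f"recall    = {recall(tp, fn):.2f}")     # 90 / (90 + 10)
print(f"precision = {precision(tp, fp):.2f}")  # 90 / (90 + 30)
# A curator would read only the tp + fp flagged papers to remove false positives.
print(f"papers to examine = {tp + fp}")
```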
To achieve a high recall, we developed a 9-component comprehensive SVM scheme with multiple SVMs using the top 10, 25, 50, 75, 100, 150, 200, 300, and 400 Chi-square score ranked features. We then applied this scheme and calculated the final recall and precision from the combined set of papers identified by these SVMs (see Methods). This scheme increased the recall value by as much as ~10% while causing only a tolerable decrease in precision. This comprehensive SVM scheme was also utilized to increase the confidence of the identification (see Methods). Unless indicated otherwise, all the results presented here were analyzed using this comprehensive SVM scheme.
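The combination step can be sketched as follows. This sketch assumes that "combining" means taking the union of the papers flagged by any component SVM; the exact rule is given in Methods, and the paper identifiers and per-component results below are hypothetical.

```python
def combine_components(component_positives):
    """Union of the paper IDs flagged as positive by any component SVM."""
    combined = set()
    for flagged in component_positives:
        combined |= set(flagged)
    return combined


def recall_precision(flagged, true_positives):
    """Recall and precision of a set of flagged papers against known labels."""
    tp = len(flagged & true_positives)
    rec = tp / len(true_positives) if true_positives else 0.0
    prec = tp / len(flagged) if flagged else 0.0
    return rec, prec


# Hypothetical labels and predictions, keyed by the number of top-ranked features.
truth = {"p1", "p2", "p3", "p4"}
per_component = {
    10: {"p1", "p2"},
    25: {"p2", "p3", "p9"},   # "p9" is a false positive
    50: {"p1", "p4"},
}

for n_features, flagged in per_component.items():
    print(n_features, recall_precision(flagged, truth))

combined = combine_components(per_component.values())
print("combined", recall_precision(combined, truth))
```

In this toy case each component alone finds half of the positives, whereas the union finds all of them at the cost of carrying along every component's false positives, which is the trade-off described above.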
The recall and precision values of each single SVM component, as well as those of the comprehensive SVM analysis, are shown in Additional File 3, Table S2. In general, for each component SVM, the recall value is lower than the precision value, and the number of top-ranked features required to give the best recall varies among data types.
The comprehensive SVM analysis generally increased recall and decreased precision in comparison to the single component SVMs. The effects are more prominent for some data types than others. For example, in the case of RNAi data, the comprehensive SVM achieved a recall of 0.99, whereas the best recall of a single SVM component is 0.91 and the worst is 0.85. On the other hand, the increase in recall with the comprehensive SVM is less apparent for the antibody data type. The recall of the comprehensive SVM for antibody is 0.94, a slight increase over 0.91, the best recall of the single SVM components, and 0.88, the worst recall of the single SVM components.
The decrease in precision in comprehensive SVM also varies with different data types. For example, for the RNAi data type, the precision of comprehensive SVM is 0.78, which is much lower than the best precision of 0.92 of a single component SVM and is also lower than the worst precision of 0.82 of a single component. On the other hand, for the Mutant allele sequence data type, the precision of the comprehensive SVM is 0.98, not much of a decrease in comparison to both the best and the worst precision of a single component SVM, 1 and 0.98, respectively.
It is not clear whether the same single component SVM will give the highest recall in both the testing set and different batches of the validation set; we do not have sufficient validation sets for a systematic evaluation. It is thus generally more desirable to use the comprehensive SVM analysis to improve recall.
Automated data type identification for WormBase and FlyBase curation
To test our method, we applied it to ten data types (Additional File 1, Note S1A) of strong interest to WormBase. A sufficient number of papers labeled with these ten data types accumulated between 1985 and 2009 as curators read each new C. elegans paper and indexed its data types; these labels were used in constructing the training sets. Each paper underwent comprehensive SVM analysis for each of the ten data types (Table ; Additional File 4, Table S3), and the performance for each data type was evaluated using a testing set drawn from the same period as the training set, that is, from papers curated at WormBase between 1985 and 2009 (see Methods). Six of the data types were also evaluated every one to two weeks using new C. elegans papers, i.e., the validation sets, over a six-month period (07/2009 - 12/2009) (see Methods). The recall and precision values of these ten data types from the testing sets were in the range of 0.85 - 0.99 and 0.70 - 0.98, respectively. The recall and precision values from the validation sets agreed well with those from the testing sets for all the data types except gene expression and gene regulation, whose precision values decreased from 0.98 to 0.55 and from 0.88 to 0.49, respectively.
Evaluation results of ten WormBase data types using the ten testing sets
The number of papers in each batch varies depending on how many papers on C. elegans were published in the relevant time period. For example, for the five batches validated for RNAi data, the number of papers ranged from 19 to 88. The SVM performance for the RNAi data type varied little among batches, judging by the standard deviation of recall and precision: the recall of these five batches is 0.98 ± 0.04 and the precision is 0.81 ± 0.03. We also examined the precision value of SVM analyses of six batches for the gene expression data type. These six batches ranged from 21 to 44 papers, and the average precision value is 0.44 ± 0.08. The performance of a batch was not correlated with its size. For example, the batch with the highest precision (0.59) and the batch with the lowest precision (0.37) have about the same number of papers, 21 and 22, respectively. The precision of the largest batch, with 44 papers, is 0.45, close to the average.
Several factors may contribute to the decrease in precision from the validation sets for the gene expression and gene regulation data types, in comparison to the testing sets. Data type definitions may change over time, and different vocabularies may be used to describe data type-specific information as new experimental methods are invented or old experimental methods become obsolete. For example, for gene expression, Northern blotting was commonly used in the past but is now used less frequently, having been replaced by techniques such as reporter gene expression and RT-PCR.
The training papers for gene expression and gene regulation, the data types whose validation sets showed much lower precision than the testing sets, were obtained from a collection spanning the past 14 years. We do not have sufficient training papers to build large enough training sets for different periods of time to examine the time effect; this can be done more effectively later, when a significant number of newly labeled papers is available for systematic comparison.
The SVM method does not take synonym expansion into account, so a change in the vocabulary of the terms used might lead to decreased performance. This type of change may be one of the reasons that the precision of the validation sets for the gene expression and gene regulation data types is much lower than that of the testing sets. This problem can be addressed by utilizing generalized vector space models, or concept vector space models, that map terms into concepts; documents can then be categorized based on concepts that accommodate terms from different periods, instead of on terms that may change over time [34]. It has been shown that SVM precision increased significantly, especially in cases with small training sets, after WordNet concepts were incorporated for mapping the terms [34].
We also applied the comprehensive SVM method to fifteen data types from FlyBase (Additional File 1, Note S1B). Table and Additional File 5, Table S4 show the results of five of these data types with high occurrence. Their performances were similar to those of the WormBase data types, with recall in the range of 0.88 - 0.98 and precision in the range of 0.56 - 0.92.
Evaluation results of five FlyBase data types with high occurrence using the testing sets
SVM across organism-specific corpora
The same or similar types of data are often curated at different biological databases such as the model organism databases (MODs). For some data types, the training set from one MOD may not be large enough to achieve satisfactory performance. We thus explored the possibility of utilizing training papers from one MOD to help with the SVM analysis of similar data types in another MOD. Both WormBase and FlyBase label papers containing RNA interference (RNAi) data, albeit using different criteria (Additional File 1, Note S1A-B). WormBase has identified > 1400 papers indexed with 'RNAi', while FlyBase has identified only 232 'RNAi'-labeled papers.
One strategy for utilizing the large training set of C. elegans papers to identify D. melanogaster papers that contain the RNAi data type would be to remove C. elegans-specific features from the C. elegans RNAi feature list. However, while some features, such as "Fire", the surname of an author of a highly cited C. elegans RNAi reference, seemed to be likely candidates for removal, others were not so readily apparent. Thus, manually editing an existing feature list could be a difficult and time-consuming process.
We categorized the features of a data type as either organism-independent or organism-dependent. Organism-independent features found in C. elegans RNAi papers could contribute to the SVM analysis of D. melanogaster RNAi papers, whereas features found only in C. elegans RNAi papers probably would not contribute to the D. melanogaster RNAi SVM. We postulated that, by pooling the training papers from WormBase and FlyBase and then calculating the Chi-square score for their features, organism-independent features would rank more favorably than when the Chi-square score was calculated using WormBase or FlyBase training papers alone. Organism-dependent features, on the other hand, would rank less favorably than when using WormBase or FlyBase training papers alone. As shown in Additional File 2, Table S1, top-ranked organism-specific features such as "Fire" and "Timmons", both surnames of authors of a highly cited C. elegans RNAi reference, disappeared from the top 400 features list of the combined training set, whereas organism-independent features such as RNAi, dsRNA, and interference remained top-ranked.
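The effect of pooling on feature ranking can be illustrated with the standard Chi-square statistic for a 2x2 term/class contingency table. This is a minimal sketch: it assumes the textbook formula (the exact scoring used is described in Methods), and all document counts are hypothetical.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for a two-class problem.

    a: positive papers containing the term   b: negative papers containing it
    c: positive papers lacking the term      d: negative papers lacking it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


# Hypothetical counts: (positives with term, negatives with term),
# out of 100 positive and 100 negative papers per organism.
worm = {"fire": (60, 5), "dsrna": (80, 10)}
fly = {"fire": (0, 0), "dsrna": (80, 10)}


def score(counts, n_pos, n_neg):
    pos_with, neg_with = counts
    return chi_square(pos_with, neg_with, n_pos - pos_with, n_neg - neg_with)


for term in ("fire", "dsrna"):
    worm_only = score(worm[term], 100, 100)
    pooled_counts = (worm[term][0] + fly[term][0], worm[term][1] + fly[term][1])
    pooled = score(pooled_counts, 200, 200)
    print(f"{term}: worm-only {worm_only:.1f}, pooled {pooled:.1f}")
```

With these made-up counts, the organism-specific term "fire" drops from about 69 to about 56 after pooling, while the organism-independent term "dsrna" rises from about 99 to about 198, so the latter moves up the ranking relative to the former.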
As shown in Table and Additional File 6, Table S5, SVM analysis using a training set containing 170 WormBase RNAi and 170 FlyBase RNAi papers effectively increased the recall from 0.81, obtained using the FlyBase training papers alone, to 0.99, while the precision value remained as high as 0.99, indicating that this pooling strategy worked well. A large training set containing 773 WormBase RNAi papers gave a much lower recall of 0.85 but the same precision value of 0.99 for the same FlyBase testing papers.
Evaluation results of FlyBase RNAi data type using FlyBase and/or WormBase training papers
SVM results of data types of low occurrence
Table and Additional File 7, Table S6 show the SVM results of nine data types from FlyBase. Table and Additional File 8, Table S7 show the SVM results of three data types used for the text classification task at the Genomics Track of the Text Retrieval Conference 2005 (GT TREC 2005), which were originally curated by Mouse Genome Informatics (MGI) [36]. These data types have unbalanced class distributions, with positive papers accounting for ~1-10% of the total document set. It has been reported that a large negative training set can have adverse effects on performance [21], and several approaches, including modifying the data distribution, the classifier, or both, have been applied to deal with this problem [21]. We found that a large negative training set could have both positive and negative consequences: it could increase precision but also decrease recall (data not shown). An optimum ratio of positive to negative training sets (PN ratio) could be found for each data type to give the highest recall possible while keeping the false positive rate reasonably low, i.e., a reasonably low filter term (FT) value. As shown in Tables and , the recall values for these data types were in the range of 0.86 ± 0.06 to 0.98 ± 0.01 and the filter term (FT) values were between 3.4 ± 1.6% and 22.5 ± 2.3%. The use of the optimum PN ratio effectively increased the recall values of these data types from a range of 0.32 - 0.7 to a range of 0.87 - 0.97 while FT values were kept under ~20%.
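One simple way to control the PN ratio is to subsample the negative pool when assembling a training set. The sketch below is an illustration under that assumption only; how the optimum ratio was actually searched for is not described in this section, and the function name, parameter names, and paper identifiers are hypothetical.

```python
import random


def build_training_set(positives, negatives, negatives_per_positive, seed=0):
    """Keep all positive papers and randomly subsample the negatives.

    negatives_per_positive: desired number of negative papers per positive
    paper (for example, 2.0 gives a 1:2 positive-to-negative ratio).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wanted = round(len(positives) * negatives_per_positive)
    n_neg = min(len(negatives), wanted)
    return list(positives), rng.sample(list(negatives), n_neg)


# Hypothetical pools: a rare data type with 50 positives and 950 negatives.
positive_pool = [f"pos{i}" for i in range(50)]
negative_pool = [f"neg{i}" for i in range(950)]

# Candidate ratios to compare; each training set would then be used to train
# an SVM and be scored for recall and for the fraction of papers flagged.
for ratio in (1.0, 2.0, 5.0):
    pos, neg = build_training_set(positive_pool, negative_pool, ratio)
    print(f"1:{ratio:g} -> {len(pos)} positives, {len(neg)} negatives")
```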
Evaluation results of nine FlyBase data types with low occurrence using the testing sets
Evaluation results of three data types with low occurrence from MGI using the testing sets
TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most commonly used term weighting schemes in information retrieval and text mining. For the RNAi data type, we compared SVM analyses using three combinations of feature selection and term weighting: TF-IDF weighting on all features, TF-IDF weighting on Chi-square score ranked features, and Boolean weighting on Chi-square score ranked features. The results were evaluated using two testing sets and two validation sets. The two testing sets differ in the ratio of negative to positive papers, one with a 1:1 and the other with a 2:1 ratio, as do the two validation sets. Because the TF-IDF weighting scheme without feature selection is CPU-intensive with large datasets, these comparisons were done using small training and testing sets (Additional File 9, Table S8; Additional File 10, Table S9; Additional File 11, Table S10; and Additional File 12, Table S11), which were constructed by randomly selecting papers from the positive and negative labeled pools. All the schemes used the same training, testing, and validation sets.
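The difference between the two weighting schemes can be sketched on tokenized documents. This sketch assumes one common TF-IDF variant (raw term count multiplied by log(N/df)); the variant actually used is not specified in this section, and the toy documents and feature list are hypothetical.

```python
import math
from collections import Counter


def boolean_vector(tokens, features):
    """1 if the feature occurs in the document at all, otherwise 0."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]


def tfidf_vectors(docs, features):
    """Raw term frequency times log(N / document frequency) for each feature."""
    n_docs = len(docs)
    feature_set = set(features)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & feature_set)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            [tf[f] * math.log(n_docs / df[f]) if df[f] else 0.0 for f in features]
        )
    return vectors


# Hypothetical tokenized papers and a short selected-feature list.
docs = [
    ["rnai", "rnai", "dsrna"],
    ["rnai", "gene"],
    ["gene", "expression"],
]
features = ["rnai", "dsrna", "gene"]

print([boolean_vector(d, features) for d in docs])
print(tfidf_vectors(docs, features))
```

Under Boolean weighting a feature contributes the same value however often it is repeated, whereas under TF-IDF its weight also depends on how the term is distributed across the whole document collection.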
We used different ratios because we wanted to know how the ratio might affect the evaluation of results. This issue arises because, in the curation process, we need to categorize newly published papers on a frequent basis. The ratio of positive to negative papers in such a short period of time could vary from batch to batch for any data type, and it could differ from the ratio of the training set.
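Why the class ratio of a batch matters for precision can be seen from the definitions alone. If a classifier flags a fixed fraction of positive papers (its recall) and a fixed fraction of negative papers (its false positive rate), then precision = recall / (recall + false positive rate x negatives per positive). The sketch below evaluates this relation; the rates used are hypothetical, and holding them constant across batches is a simplifying assumption.

```python
def expected_precision(recall, false_positive_rate, negatives_per_positive):
    """Precision implied by fixed per-class rates and a given class ratio."""
    flagged_true = recall                                   # per positive paper
    flagged_false = false_positive_rate * negatives_per_positive
    total = flagged_true + flagged_false
    return flagged_true / total if total else 0.0


# Hypothetical classifier: recall 0.9, false positive rate 0.1.
for ratio in (1, 2, 5):
    p = expected_precision(0.9, 0.1, ratio)
    print(f"negatives:positives = {ratio}:1 -> precision {p:.2f}")
```

Precision therefore falls as negatives become more abundant in a batch, even when the classifier itself is unchanged, which is why the same scheme can score differently on 1:1 and 2:1 evaluation sets.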
As shown in Additional File 9, Table S8; Additional File 10, Table S9; Additional File 11, Table S10; and Additional File 12, Table S11, the Boolean and TF-IDF weighting schemes combined with Chi-square score ranked feature selection have similar recall, ≥ 0.9. By contrast, the TF-IDF weighting scheme using all features (without the feature selection step) has very poor recall, between 0.08 and 0.61. As shown in Additional File 9, Table S8, the TF-IDF weighting scheme combined with Chi-square feature selection has precision similar to that of the Boolean scheme on the testing set with a 1:1 ratio of negatives to positives. On the testing set with a 2:1 ratio of negatives to positives and on both validation sets, the TF-IDF weighting scheme combined with Chi-square score ranked feature selection has much lower precision than the Boolean weighting scheme combined with the same feature selection. As shown in Additional File 10, Table S9, in the validation set with a 1:1 ratio of negatives to positives, the precision of the TF-IDF scheme is 0.61 whereas that of the Boolean scheme is 0.72. As shown in Additional File 11, Table S10, in the testing set with a 2:1 ratio of negatives to positives, the precision of the TF-IDF scheme is 0.54 whereas that of the Boolean scheme is 0.64. As shown in Additional File 12, Table S11, in the validation set with a 2:1 ratio of negatives to positives, the precision of the TF-IDF scheme is 0.45 whereas that of the Boolean scheme is 0.59. The TF-IDF weighting scheme that uses all features gives precision values similar to those of the Boolean weighting scheme combined with Chi-square score ranked feature selection in all four evaluation sets.
The precision values of the SVM analysis using the TF-IDF weighting scheme are 0.10-0.15 lower than those obtained using the Boolean weighting scheme in three out of the four cases reported here. This difference may be due to the fact that the ratio of negative to positive papers in a small pool of new papers can deviate from that of the training set. TF-IDF may also cause inappropriate scaling for some features; consequently, some features with strong predictive power may be given a less favourable score than those with weak predictive power, thereby undermining the performance [40]. The ratio of negative to positive papers in each batch of new papers varies and is difficult to predict ahead of time. We think that the Boolean weighting scheme combined with Chi-square score ranked feature selection may be more suitable than the TF-IDF weighting scheme combined with the same feature selection for the categorization of experimental data types in a curation process where a small pool of new papers usually needs to be analyzed in a timely manner.
Numerous machine-learning methods have been used by the groups that participated in the text categorization task of the GT TREC 2005 challenge [8]. The methods included a regularized linear classifier [41], logistic regression [42], pattern-based learning [43], naïve Bayes learning [44], theme detection [45], K-nearest neighbor [43], a Rocchio-based classifier [45], and SVM [42], as well as others. Several groups used SVM in their studies on these data types and reported different performances. The differences in performance might arise from the use of different feature selection strategies and other procedures in their SVM analyses [36]. One of the SVM methods submitted to TREC 2005 had high overall performance in a comparison with all the other methods submitted [48]. We did a side-by-side comparison of our method and the methods submitted to GT TREC 2005 for the categorization of the Mutant Phenotype Alleles, Embryologic Expression, and Tumor Biology data types [8] originally curated by MGI. As shown in Additional File 10, Table S9, our method showed equivalent or better results for all three data types than both the best performance among the various methods and an SVM method submitted to GT TREC 2005. In comparison to the best performance among the various methods submitted to GT TREC 2005 [48], our method achieved similar recall for all three data types and a 1.3- and 2.4-fold increase in precision for the Mutant Phenotype Alleles and the Tumor Biology data types, respectively. In comparison to the SVM method submitted to GT TREC 2005 [48], our method gave a higher recall value (0.94 ± 0.04, compared to 0.82) and a similar precision value for the Embryologic Expression data type. For the other two data types, our method gave similar recall but a more than 2-fold increase in precision. Furthermore, our method is relatively simple compared to most of the methods submitted to GT TREC 2005, which involved multiple steps or required expert domain knowledge in feature selection or document preprocessing. Our method does not require any data type-specific manual input or sophisticated manipulation at any step, is completely automated, and can be readily applied to different data types.
We showed that our method can be applied to the three MGI data types with high recall (Additional File 13, Table S12), and thus might save curation time (measured by the FT value). However, a direct comparison of our method with the methods in TREC 2005 is difficult because we used a different set and number of papers for training and testing (Additional File 10, Table S9) than those used by TREC 2005 participants. As indicated earlier, the PN ratio affects the precision value. In the TREC 2005 systems, the number of negative training papers is much larger than that of positive papers; this disparity may adversely affect precision. We think that this factor may need to be taken into consideration when evaluation schemes are designed.
Previously, we developed a combinatorial Boolean keyword search using Textpresso [44] to identify papers that contained the RNAi data type (G. Schindelman, J. Chan, and P. Sternberg, unpublished results), with a recall of 0.96 and a precision of 0.61. This was obtained after eight iterations of refining the keywords in the search query and subsequent manual examination of false negative and false positive articles. This process requires expert domain knowledge for a specific data type and time-consuming manual effort, unlike the SVM method, which is completely automatic given a training set and can be readily used for different data types. Furthermore, the keyword approach may not be applicable to data types that lack a sufficient set of specific keywords.
Once documents have been classified for data type identification, a subsequent task in biocuration is extraction of the information of interest. While attempts to automate fact extraction can be undermined by high false positive rates, we have observed that the false positive rate in text extraction of Gene Ontology Cellular Component data by a category-based semi-automatic text extraction approach using Textpresso [14] is significantly decreased when extraction is performed on only those papers identified by SVM as containing gene expression data (K. Van Auken, R. Fang, J. Chan, H.-M. Müller, and P. Sternberg, unpublished results). We expect that a filtering step provided by SVM analysis will have the same effect on other text extraction methods as well.