Comparison of performance of various algorithms
Different supervised learning algorithms were used to develop various prediction models. The results showed that the model developed using SVM had the best prediction performance compared to those developed using other algorithms. This is in concordance with other studies, which frequently showed that models developed using SVM outperform those developed using other learning algorithms [20].
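For intuition about the underlying classifier, the sketch below trains a minimal linear SVM with a Pegasos-style sub-gradient method on a toy two-dimensional dataset. This is an illustrative sketch only, not the study's actual model, which was trained on term features of the abstracts:

```python
# Minimal linear SVM (Pegasos sub-gradient descent) on toy data;
# illustrative only, not the study's implementation.
def train_linear_svm(X, y, lam=0.01, epochs=200):
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # shrink w (regularisation), add hinge-loss gradient if margin < 1
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

# two linearly separable toy classes
X = [(2, 2), (3, 3), (2.5, 2), (-2, -2), (-3, -1), (-2, -3)]
y = [1, 1, 1, -1, -1, -1]
w = train_linear_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else -1 for xi in X]
```

The maximum-margin property optimized here is what typically gives SVM its robustness on high-dimensional, sparse text features.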
Comparison of performance using different types of frequencies
The prediction performances of models developed using the word occurrence, binary frequency or TF-IDF of the predictors were compared. The model developed using TF-IDF showed the best performance, followed by the model using word occurrence and lastly the model using binary frequency. In the computation of TF-IDF, terms that occurred less frequently in the training set were given higher weights and those that occurred very frequently were given lower weights. A possible reason for the better prediction performance of the model developed using TF-IDF is that words that occur less frequently play a key role in the classification of 'useful' and 'non-useful' articles. In addition, binary frequency can only provide information on the presence or absence of a term, not its occurrence frequency. Thus, the poorer performance of the model developed using binary frequency suggests that the classification of the articles may depend on the frequency of term occurrence.
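The three term representations can be sketched in a few lines on a toy corpus (the sentences below are hypothetical stand-ins for abstracts):

```python
import math
from collections import Counter

# toy corpus of three hypothetical 'abstracts'
docs = [
    "infliximab infection risk reported",
    "infliximab efficacy trial outcome",
    "infliximab dosing schedule outcome",
]
vocab = sorted({w for d in docs for w in d.split()})
N = len(docs)
df = {t: sum(t in d.split() for d in docs) for t in vocab}  # document frequency

def counts(doc):
    # word occurrence: raw term counts
    c = Counter(doc.split())
    return [c[t] for t in vocab]

def binary(doc):
    # binary frequency: presence/absence only
    s = set(doc.split())
    return [int(t in s) for t in vocab]

def tfidf(doc):
    # TF-IDF: rare terms receive higher weights, ubiquitous terms lower ones
    c = Counter(doc.split())
    return [c[t] * math.log(N / df[t]) for t in vocab]
```

Here 'infliximab' appears in every document, so its TF-IDF weight is log(3/3) = 0, whereas 'infection' appears in only one document and receives weight log(3); the binary representation discards the count information entirely.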
General automated system
A general automated system was developed using SVM and TF-IDF of general predictors. Only general predictors were used to develop the system because predictors that are specific for TNF-α blockers are usually not present in articles on other drug classes. Thus, the use of these specific predictors in an automated system will add noise to the system, which will reduce its generalizability.
The results showed that the general automated system was able to rank the articles such that approximately 70% of the 'useful' articles on TNF-α blockers were found in the first 20% of the ranked TNF-α blocker articles. The generalizability performance for classifying articles on four other drug classes was lower, which suggests that there could be other general predictors that were not present in the abstracts of articles on TNF-α blockers. Thus, there is a need to periodically update the general predictors used in the general automated system with newly classified abstracts of articles from different drug classes in order to maintain or improve its generalizability performance.
The advantage of the general automated system is that evaluators need not manually classify any articles before using it. A major limitation is that the system is trained using abstracts that were manually classified into 'useful' and 'non-useful'. Although a systematic approach (Additional file 1) was used in this study to classify the articles, the classification scheme may not be applicable to different drug classes or different risk assessment tasks. In fact, it is important to note that product risk management draws on a variety of information sources, of which the primary literature is just one. Thus, different risk assessment tasks may have slightly different definitions of 'useful' and 'non-useful' articles. This subjectivity in the classification of 'useful' articles may limit the applicability of the general automated system.
Specific automated system
The general trend in the prediction performance of specific automated systems developed using training sets of various sizes is that an increase in training set size correlated with improved performance of the automated system. The AUC of the system trained using 36 articles appeared to be an anomaly from this general trend. This could be due to its ratio of 'useful' to 'non-useful' articles, which was much more unbalanced than those in the rest of the training sets. Studies have shown that models developed from highly unbalanced training sets tend to have poorer prediction performance than those developed from balanced training sets [21].
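A toy calculation illustrates why unbalanced training sets are misleading: a degenerate model that always predicts the majority class looks accurate yet identifies no 'useful' articles at all (the 2:18 ratio below is hypothetical):

```python
# Toy illustration of the class-imbalance problem: the trivial
# majority-class model scores high accuracy but zero recall.
labels = ["useful"] * 2 + ["non-useful"] * 18      # hypothetical 2:18 ratio
predictions = ["non-useful"] * 20                  # always predict majority

accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
tp = sum(p == l == "useful" for p, l in zip(predictions, labels))
recall = tp / labels.count("useful")               # fraction of 'useful' found
```

Accuracy is 90% even though every 'useful' article is missed, which is why balanced training sets (or imbalance-aware measures) matter.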
The advantage of the specific automated system is that a new automated system will be developed for each new drug class or new risk assessment task. This resolves problems caused by the subjective definition of 'useful' and 'non-useful' articles and thus prevents the potentially poor generalizability associated with the general automated system. This study shows that manual classification of just 20 articles is sufficient to develop a specific automated system. In our experience, an experienced evaluator can manually classify 20 articles within approximately 30 minutes. Thus, the use of such a specific automated system is feasible for routine risk assessment work.
One disadvantage is that, in the selection of articles for manual classification, 'useful' articles may not be selected by the Kennard and Stone sampling method, or the selection may result in a highly unbalanced ratio of 'useful' to 'non-useful' articles in the training set, which may decrease the performance of the automated system. The lower performance of the automated system trained using 36 articles could be an example of this problem.
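The Kennard and Stone method selects samples purely by how well they cover the feature space, which is why class balance is not guaranteed. A minimal sketch of the algorithm on toy two-dimensional points (the study applied it to term-feature vectors of abstracts):

```python
# Kennard-Stone sample selection: start from the two most distant samples,
# then repeatedly add the sample farthest from its nearest selected neighbour.
def kennard_stone(points, k):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    # initial pair: the two mutually most distant samples
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist(points[p[0]], points[p[1]]))
    selected = [i0, j0]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        nxt = max(remaining,
                  key=lambda i: min(dist(points[i], points[j])
                                    for j in selected))
        selected.append(nxt)
    return selected

points = [(0, 0), (10, 0), (5, 5), (1, 1), (9, 1)]
chosen = kennard_stone(points, 3)   # indices of the selected samples
```

Note that nothing in the selection criterion looks at the class label, so a rare 'useful' article is chosen only if it also happens to be far from the already selected samples.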
A potential solution is to use the general automated system to perform the initial selection of the articles most likely to be 'useful' and those most likely to be 'non-useful' for manual classification. Although the general automated system may not have high generalizability performance, using it still increases the likelihood of achieving a more balanced ratio of 'useful' to 'non-useful' articles in the training set compared with not using it.
Potential application of automated systems in routine risk assessment work
During routine risk assessment work, evaluators will enter search terms into the system. The system will proceed to retrieve abstracts of articles from PubMed. The general automated system will then select a list of approximately 20 abstracts to present to the evaluators for manual classification. The specific automated system will then be trained using these manually classified articles. The specific automated system will categorize the retrieved articles using their abstracts and rank them according to the confidence values for the classification of the articles as 'useful'. The list of abstracts will be presented to the evaluator in decreasing order of confidence values. This work flow is summarized in Figure .
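The steps above can be sketched as a single pipeline. The function and the stub components below are hypothetical and only mirror the described work flow, not the study's implementation:

```python
# Hypothetical sketch of the proposed work flow; all names and stub
# models are illustrative assumptions, not the study's code.
def risk_assessment_workflow(abstracts, general_score, manual_label,
                             train_specific, n_manual=20):
    # 1. rank all retrieved abstracts with the general automated system
    ranked = sorted(abstracts, key=general_score, reverse=True)
    # 2. pick ~n_manual abstracts spanning likely 'useful' and 'non-useful'
    half = n_manual // 2
    to_label = ranked[:half] + ranked[-half:]
    labelled = [(a, manual_label(a)) for a in to_label]
    # 3. train the specific automated system on the manual classifications
    specific_score = train_specific(labelled)
    # 4. present all abstracts in decreasing order of 'useful' confidence
    return sorted(abstracts, key=specific_score, reverse=True)

# toy demonstration with stub components
abstracts = list(range(30))                  # stand-ins for abstracts
general = lambda a: a                        # stub general-system score
label = lambda a: 1 if a >= 15 else 0        # stub manual classification
train = lambda labelled: (lambda a: a)       # stub training, returns a scorer
ranked_out = risk_assessment_workflow(abstracts, general, label, train)
```

Selecting from both ends of the general ranking is what gives the manual set a better chance of containing both 'useful' and 'non-useful' articles.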
Use of automated system in routine risk assessment work.
Current limitations and potential for improving the automated systems
In this study, parameter selection for the SVM algorithm was guided by the AUC of the automated systems on the testing set. An implicit assumption of AUC is that false positives and false negatives are equally problematic. This is clearly not the case for risk assessment tasks: false positives only result in additional workload for the evaluators and will be identified during manual review of the results, whereas false negatives may result in important articles not being examined. Thus, future studies could explore using a customized performance measure that weighs false positives and false negatives differently to guide parameter selection.
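One standard measure of this kind is the F-beta score, where beta > 1 weighs recall (and hence false negatives) more heavily than precision. The two hypothetical systems below show how the choice of beta changes which model is preferred:

```python
def fbeta(tp, fp, fn, beta):
    # beta > 1 weighs recall more, i.e. penalises false negatives harder
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return ((1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall))

# Hypothetical systems evaluated on 10 truly 'useful' articles:
# system A misses 1 useful article but flags 21 extras (fn=1, fp=21);
# system B misses 6 useful articles but flags only 1 extra (fn=6, fp=1).
a_f1, b_f1 = fbeta(9, 21, 1, beta=1), fbeta(4, 1, 6, beta=1)
a_f2, b_f2 = fbeta(9, 21, 1, beta=2), fbeta(4, 1, 6, beta=2)
```

Under F1 the precise system B scores higher, but under F2 the high-recall system A wins, matching the risk-assessment preference for tolerating false positives over missing important articles.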
The automated systems were developed using 'useful' and 'non-useful' articles. This assumes that all 'useful' articles are equally valuable. This is an over-simplification as some articles are more 'useful' than others. However, it is difficult to determine the degree of 'usefulness' for an article and thus it is not possible to develop regression models for use in the current automated systems. Future work could attempt to address this limitation by developing methods to ease the determination of the degree of 'usefulness' for an article.
In this study, it was assumed that the important information required for classification of the articles is summarized in the abstracts. However, authors may describe safety issues in the article without including them in the abstract. In addition, several articles may discuss similar safety issues, but not all will include warnings and precautions on the use of these drugs in their abstracts. Hence, not all potentially 'useful' articles could be identified by the two automated systems, which use only the abstracts. A potential solution is to develop and apply the automated systems on full articles [23]. However, several obstacles make it challenging to use full articles in such automated systems [24]. Firstly, bulk download of full articles is difficult to automate. Moreover, copyright and fair use issues make retrieval of the full text of all entries indexed on PubMed unfeasible. In addition, retrieved documents are often in PDF or HTML format and must be converted into plain text before being used for text mining; unfortunately, this conversion is not always accurate. Furthermore, symbols (e.g. 'ε' and 'α') are frequently used in the full articles of biomedical literature and require replacement with their spelled-out names. Such replacements have usually already been made in the abstracts, which makes preparing abstracts for the automated systems less tedious.
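The symbol-replacement step can be automated with the standard library, as in this small sketch that spells out Greek letters using their Unicode names:

```python
import re
import unicodedata

def spell_out_greek(text):
    # replace Greek letters (e.g. 'alpha', 'epsilon') with spelled-out names
    def repl(match):
        # e.g. unicodedata.name('α') -> 'GREEK SMALL LETTER ALPHA'
        name = unicodedata.name(match.group(0)).lower()
        return name.rsplit(" ", 1)[-1]      # keep only 'alpha'
    return re.sub(r"[\u0370-\u03ff]", repl, text)
```

For example, `spell_out_greek("TNF-α")` yields `"TNF-alpha"`, normalizing the text so that the same token is produced whether the source used the symbol or the spelled-out form.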
A potential method to improve the performance is to explore semantic features. In linguistics, semantics refers to the study of the meanings linked to words [25]. Semantic features could be applied by using MeSH and the Unified Medical Language System (UMLS) concepts and semantic types [12]. UMLS, maintained by the National Library of Medicine, is the most extensive known database of synonyms and concept relations for biomedical and health-related terms. It can be used to map related concepts to words in text mining applications [26] and has been used to extract disease-drug knowledge from biomedical and clinical documents [27]. This addition of the meanings and concepts linked to the terms may potentially boost the performance of the model.
Another method to improve performance is to use MeSH terms in addition to the abstracts when creating the predictors. MeSH terms contain information on the topics of the articles, which may be useful for the classification of the articles by the automated systems. For example, articles related to pharmacoeconomics often focus on the decrease in cost burden to patients and society rather than on adverse drug reactions. On the other hand, articles related to the pharmacokinetics of a drug may occasionally be useful, as they may report dose-dependent adverse drug reactions. The inclusion of MeSH terms has been found to improve the prediction performance of the epitope model [12] and thus could possibly be useful for improving the performance of the automated systems.
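One simple way to combine the two sources is to prefix MeSH descriptors so they occupy their own region of the feature space alongside abstract words. The function and namespacing scheme below are an illustrative assumption, not the cited studies' method:

```python
# Hypothetical sketch: merge abstract tokens and MeSH descriptors into one
# token stream; the 'mesh:' prefix keeps the two feature types distinct.
def combined_features(abstract, mesh_terms):
    tokens = abstract.lower().split()
    tokens += ["mesh:" + t.lower().replace(" ", "_") for t in mesh_terms]
    return tokens
```

For instance, an abstract tagged with the descriptor "Drug Costs" would contribute the feature `mesh:drug_costs`, letting the classifier learn topic-level signals (such as pharmacoeconomics) independently of the abstract's wording.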