This study explores active learning algorithms as a way to reduce the requirements for large training sets in medical text classification tasks.
Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results.
Classification accuracy and area under receiver operating characteristics (ROC) curves for each algorithm at different sample sizes were generated. The performance of active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why the performance varies on different datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences.
The DIST and CMB algorithms performed better than passive learning. With a statistical significance level set at 0.05, DIST outperformed passive learning in all five datasets, while CMB was found to be better than passive learning in four datasets. We found strong correlations between the dataset diversity and the DIV performance, as well as the dataset uncertainty and the performance of the DIST algorithm.
For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.
Natural language processing (NLP) techniques coupled with machine learning algorithms are increasingly being used to automate information extraction.1–3 The efficiency of machine learning algorithms is largely dependent on the availability of high-quality training data.4 5 The creation of training data typically requires human input in the form of manual review and/or annotation. In the biomedical domain, where the human annotators need to be highly skilled professionals, these annotation tasks can become very expensive. Although vast amounts of unlabeled medical data are available, labeled data that can be used for training purposes continue to be relatively scarce.
Active learning algorithms provide a mechanism to create training datasets with reduced human annotation and have been successfully applied in domains such as text classification, image classification, etc. However, their applicability in the biomedical domain, especially in clinical NLP, has not been sufficiently investigated. Furthermore, different active learning algorithms perform differently on different datasets. There is also no guarantee that they will outperform the passive learning method. Prior research on active learning provides little guidance in the selection of active learning algorithms. In this study, we first tested three known active learning algorithms on clinical text classification tasks, and then investigated which characteristics of the dataset impacted the active learning algorithms' performance.
Text classification is the task of assigning one or more predetermined labels to a text/document using the information contained in the text.1 6 Classification requires a training phase in order to learn patterns from a training set with labeled instances for which actual class membership is known.4 7 8
To select the training set from the dataset, two approaches can be used: passive and active learning. In passive learning, a training set is set apart using random or stratified selection. Active learning, on the other hand, seeks to select the most informative cases, with the objective of reaching a performance comparable to that of passive supervised learning algorithms, using fewer training instances.6–9
In an active learning selection method, three sets of data are involved: an unlabeled pool containing the collection of all candidate instances (U), an initial training set (I) containing at least one instance of each class, and a test set (T). The instances in both the training and test set data are labeled; that is, the true class of instances is known. The learning process is shown in figure 1 and consists of an initialization phase and an iterative selection phase.
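The initialization and iterative selection phases can be sketched as a simple loop. This is an illustrative sketch only, not the paper's implementation: the names g (decision function), oracle (standing in for the human annotator), train, and evaluate are placeholders.

```python
# Sketch of the iterative active learning process described above.
# U: unlabeled pool, I: initial labeled training set, T: held-out test set.
# All names here are illustrative placeholders.

def active_learn(U, I, T, g, oracle, train, evaluate, batch_size, max_size):
    labeled = list(I)                      # initialization phase
    scores = []
    while len(labeled) < max_size and U:
        model = train(labeled)             # (re)train on current labels
        # selection phase: pick the batch the decision function g ranks best
        batch = sorted(U, key=lambda x: g(model, x, labeled))[:batch_size]
        for x in batch:
            U.remove(x)
            labeled.append((x, oracle(x))) # query the annotator for a label
        scores.append(evaluate(model, T))  # track learning-curve performance
    return labeled, scores
```

At each iteration the model is retrained on the labels collected so far, so the decision function always ranks candidates against the current classifier.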
Active learning algorithms are differentiated by the decision functions used to select instances for labeling (g in figure 1). Relevance feedback,9 an early attempt at selective sampling, was found to perform poorly due to its propensity to select instances that are very similar to instances already included in the training set.
Lewis and Gale10 proposed the uncertainty sampling method, which selects instances whose probability of belonging to a class is close to 0.5. The query-by-committee method,6 11 12 on the other hand, trains multiple classifiers to select cases with high uncertainty. These methods have been shown to considerably reduce the need for labeling.6 8 11–14
Another widely used approach in active learning combines active learning with the use of support vector machines (SVM).7 15–19 Tong and Koller19 described a simple active learning method called simple margin method that uses SVM. It ranks all instances in U by their distance to the hyperplane and selects instances at the shortest distance for labeling. Experiments in Tong and Koller19 showed that the simple margin and two variants of the method performed better than random selection.
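The simple margin selection step can be sketched as follows, assuming a linear classifier with weight vector w and bias b has already been trained; the function name and inputs are illustrative, not from Tong and Koller's implementation.

```python
# Minimal sketch of "simple margin" selection: instances closest to the
# hyperplane w.x + b = 0 are the most uncertain and are chosen for labeling.
import numpy as np

def simple_margin_select(X_unlabeled, w, b, batch_size):
    """Return indices of the batch_size instances nearest the hyperplane."""
    # Unsigned distance of each instance to the separating hyperplane
    distances = np.abs(X_unlabeled @ w + b) / np.linalg.norm(w)
    return np.argsort(distances)[:batch_size]
```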
Other types of methods seek out informative samples while maintaining the diversity of the selected samples.15 16 20 Brinker16 discusses one such method: in addition to measuring the distance of an unlabeled instance to the classification hyperplane, the decision function also measures diversity by calculating the cosine similarity between the unlabeled instance and already selected instances. Experiments on data from the UCI repository21 and the STATLOG22 collection showed that this combined method outperformed both the simple margin and random selection methods.
Although active learning algorithms have been widely used in other domains, their applications in the biomedical domain have been limited. Liu23 described a method similar to relevance feedback.9 Warmuth et al24 used a similar approach to separate active (positive side) and inactive (negative side) compounds. Methods which use the principles of probability theory to measure classification uncertainty have also been applied in the biomedical domain for classifying medical images, profiling genes, training computer-aided diagnosis systems, etc.20 25–29 In all these cases active learning methods were found to perform better than passive learning.
We explored the use of active learning to minimize the size of the required training sets. This study aims to answer the following questions:
We used five sets of data (see table 1) in the study. The first three sets were obtained from a previous study on smoking status extraction performed on patient discharge summaries from the Partners HealthCare Research Patient Data Registry.5 30 In that study, 11 000 sentences containing information about smoking status were extracted and each sentence was labeled as belonging to one of three classes: Current Smoker, Non Smoker, and Past Smoker. In our experiments with active learning performance, we wanted to limit the number of variables/factors affecting the learning performance. One such factor is multiclass versus binary classification: multiclass classification tends to perform more poorly, and a multiclass problem can always be represented as a series of binary classifications, but not vice versa. To allow binary and hierarchical classification, we derived two datasets: Smoker versus Non Smoker (SNS) and Past Smoker versus Current Smoker (PSCS). Two datasets were further derived from the SNS dataset: the first (SNS1) containing all available instances (n=11 000), and the second (SNS2) containing the same number of instances as PSCS (n=7016), randomly selected from SNS1.
The fourth dataset was obtained from a study on depression status classification (DS) at the Partners HealthCare system.31 Three psychiatrists conducted a group review of 264 outpatient psychiatric visit notes and categorized the patient depression status as: Well, Mild or Depressed. To allow binary classification, Depressed and Mild cases were merged into a single class, Depressed.
The fifth dataset, Dexter (DEXT), is a general purpose binary text classification problem from the UCI machine learning repository.21 32 It contains 600 examples and 20 000 features; 9947 features represent the frequencies of word occurrences in the text and 10 053 are distractor features not having any predictive power. We included this non-clinical dataset, because both distance and diversity-based active learning algorithms have been shown to work well outside the biomedical domain and our analysis of the relationship between algorithm performance and dataset characteristics required a more positive example of active learning performance.
To identify features for the SNS1, SNS2, and PSCS datasets, we used the simple bag of words (BoW) approach. In the active learning and classification processes, all the available features were used for SNS1, SNS2, PSCS, and DEXT (n=3938, 3057, 3218, and 20 000, respectively). The DS dataset has a relatively small fixed set of features (n=177), pre-selected by a group of clinicians who manually reviewed the unlabeled notes.
We implemented an SVM-based approach, using a combination of distance and diversity16 as the active learning method to classify our datasets. Studies have shown that SVM-based methods are suitable for text classification7 33 34 and the distance and diversity methods have shown good performance16 19 35 when applied to classification tasks in other domains.
The distance is measured using the simple margin method.19 The diversity is calculated as the angle between the vectors representing training samples.16 In the decision function for the distance-diversity combination shown in equation 1, f(xi) represents the distance between a candidate instance xi in the unlabeled pool and the classification hyperplane; cos−sim(xi,xj) represents the maximum cosine similarityi between the candidate instance vector xi in the unlabeled pool and the vectors xj of the instances already selected into the training set. S is the set of selected instances, and λ is the tuning parameter that varies the weights of the distance and diversity components. By taking the arg-min of the combined score, the function keeps the uncertainty of the selected samples high (small distance to the hyperplane) while selecting examples diverse enough to reduce information redundancy (small maximum cosine similarity to S).
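Since equation 1 is not reproduced in this text, the following is only a sketch of a Brinker-style combined decision function; the function names and the unit handling are assumptions, not the paper's code.

```python
# Sketch of a Brinker-style distance-diversity decision function.
# lam = 1 reduces to pure distance (DIST); lam = 0 to pure diversity (DIV).
# Illustrative only: names and signatures are not from the paper.
import numpy as np

def combined_score(x, w, b, selected, lam):
    """Lower is better: small margin distance, low similarity to S."""
    distance = abs(x @ w + b) / np.linalg.norm(w)  # f(x_i): margin distance
    if selected:
        # max cosine similarity between x and already selected instances
        sims = [x @ s / (np.linalg.norm(x) * np.linalg.norm(s))
                for s in selected]
        diversity_penalty = max(sims)
    else:
        diversity_penalty = 0.0
    return lam * distance + (1 - lam) * diversity_penalty

def select_next(candidates, w, b, selected, lam):
    # arg-min over the unlabeled pool U
    scores = [combined_score(x, w, b, selected, lam) for x in candidates]
    return int(np.argmin(scores))
```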
Using the decision function discussed above, we experimented with three variations of the algorithm:
We simulated the active learning process on the five datasets. The sizes of the test set and unlabeled pool are shown in table 1. Due to the small size of DS and DEXT, at each run of the algorithm the datasets were randomly divided in two: 90% of the examples were assigned to the unlabeled pool, while the remaining 10% were used to test the classifier performance.
To select the training samples iteratively, a batch size h=16 was used for the SNS1, SNS2, and PSCS datasets and h=4 for the DS and DEXT datasets. The batch sizes were chosen per recommendations from prior studies and based on the sizes of the unlabeled datasets.16 19 For the SNS1, SNS2, and PSCS datasets, we iterated until the training set had 1008 instances, and for the DS and DEXT datasets until the training set had 228 instances. To compare the performance of these active learning algorithms with passive learning, we also performed random selection of instances with passive learning (RAND) at each of these sample sizes. In addition, we implemented a variation of passive learning using stratified random sampling to compensate for class imbalance: in every training batch, the stratification strategy allocated 50% of the instances to each class, so that at each iteration half of the selected training samples belonged to each of the two classes in the binary classification.
As our base classifier, we used the SVM implementation available from Weka.36 37 A linear kernel was used; attribute normalization/standardization was turned off and classification in all the datasets was treated as a binary problem (see the Datasets section).
Each experiment was repeated 100 times on each dataset. For each sample size, the classification accuracy (ACC, or percentage of correct classifications) and the area under the ROC (receiver operating characteristic) curve (AUC) were averaged over the 100 trial runs.
To investigate why the algorithms vary in performance on the different datasets and select the best algorithm for a dataset, we analyzed the uncertainty and diversity of the datasets using relative entropy (equation (2)). Relative entropy is a measure of how different two probability mass functions p(x) and q(x) are over the same event space. Relative entropy is a non-negative number and D(p ‖ q)=0 only if p=q.38 39
To measure dataset diversity, we randomly selected 10% of the unlabeled pool and compared that, using relative entropy, to the remaining 90%. To measure dataset uncertainty, we calculated the relative entropy between the two labeled classes in the dataset.
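The two measurements above both reduce to computing D(p ‖ q) over a shared event space of feature frequencies. A minimal sketch follows; the count-based inputs and the smoothing constant are assumptions (D(p ‖ q) is undefined where q(x)=0), not details from the paper.

```python
# Sketch of the relative-entropy (Kullback-Leibler divergence) measure in
# equation 2, computed over feature-count distributions on a shared event
# space. The eps smoothing is an assumption to avoid division by zero.
import math

def relative_entropy(p_counts, q_counts, eps=1e-9):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    p_total = sum(p_counts) or 1
    q_total = sum(q_counts) or 1
    d = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total
        q = max(qc / q_total, eps)   # smooth zero-probability events
        if p > 0:
            d += p * math.log(p / q)
    return d
```

For the diversity measure, p and q would be the feature counts of the 10% and 90% random subsets; for the uncertainty measure, the feature counts of the two labeled classes.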
Diversity and uncertainty in the five datasets (PSCS, SNS1, SNS2, DS, and DEXT) were measured using an event space of 180 features so that we can compare the learning algorithms' performance on different datasets. Feature selection was conducted on PSCS, SNS1, SNS2, and DEXT to reduce the number of features and the active learning algorithms were rerun to obtain the respective performances. (This feature selection would not be necessary if we were not comparing learning performance on datasets with varying number of features.) To select the most relevant features we applied the Relief algorithm40 with a ranker method available from Weka37 using all available data as labeled data.
In order to compare the performance among the active learning algorithms and the random sampling method, we used the paired t test on the weighted means of the paired differences41 42 for the results on the accuracy measure. The paired t test was used since for each dataset the active and random sampling algorithm performances were measured at the same sample sizes using the methods described in the Algorithm testing section. We calculated the weight by using the inverse of the square of the SE for each sample size. The SE for the accuracy estimate for each sample size was available because we drew 100 samples for each sample size. The SE for the paired difference was computed using the standard errors from each algorithm (eg, angle-diversity vs random sampling) and the covariance from both algorithms in a pair. The covariance was computed using the correlation and the SEs of both algorithms. The weight used in the paired t test was the inverse of the square of SE of the paired difference for each sample size. We used the two-tailed test, and the level of significance was set at 0.05.
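The inverse-variance weighting described above can be sketched as follows. This is illustrative only: the function name and inputs are assumptions, and the resulting weighted mean would then feed the two-tailed paired t test at the 0.05 level.

```python
# Sketch of the weighted mean of paired differences: the SE of each paired
# difference combines the two algorithms' SEs and their covariance, and
# each sample size is weighted by 1 / SE_diff^2. Illustrative names only.
import math

def weighted_mean_difference(acc_a, acc_b, se_a, se_b, corr):
    """acc_a/acc_b: per-sample-size accuracies; se_a/se_b: their SEs;
    corr: per-sample-size correlation between the paired estimates."""
    num, den = 0.0, 0.0
    for a, b, sa, sb, r in zip(acc_a, acc_b, se_a, se_b, corr):
        cov = r * sa * sb                    # covariance of the pair
        se_diff = math.sqrt(sa**2 + sb**2 - 2 * cov)
        w = 1.0 / se_diff**2                 # inverse-variance weight
        num += w * (a - b)
        den += w
    return num / den
```

Precisely estimated sample sizes (small SEs) thus dominate the mean, while noisy ones contribute little.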
We correlated the diversity and uncertainty of each dataset (as measured by relative entropy) with the active learning algorithms' performance. Since CMB is a combination of DIV and DIST, the correlation analysis was focused on DIV and DIST. Specifically, we correlated the diversity with (DIV−RAND)/RAND and the uncertainty with (DIST−RAND)/RAND, using Pearson's correlation coefficient.
We tested the DIV, DIST, and CMB algorithms on five different text classification datasets (SNS1, SNS2, PSCS, DS, and DEXT). The performance of the active learning algorithms varied depending on the target ACC and AUC as well as the specific text classification task. Because the stratified random sampling results are similar to the pure random sampling results (see online appendix 1), we only include one set of passive learning results (ie, pure random sampling) in the main text.
Table 2 shows the estimated weighted accuracy for each learning method. In both ACC and AUC, the DIST and CMB algorithms were found in most cases to perform statistically better than passive learning (p<0.0001, except AUC on DS-PSCS). The DIV algorithm did worse than passive learning on all datasets except DEXT.
Figures 2–4 plot the performance of the active and passive learning methods. Some of the curves intersect, especially at smaller sample sizes. With increasing sample sizes, the active learning methods became clearly differentiated from the passive learning methods, performing either significantly better or worse.
At most ACC and AUC levels, the DIV algorithm required the same number of or more training cases than the random sampling method. The DIST algorithm performed best at most ACC and AUC levels, closely followed by the CMB combination method; both outperformed random selection. At certain ACC and AUC levels, the DIST algorithm required 30% to 50% fewer training cases than random sampling. To supplement the figures, we generated tables 3 and 4, which show the number of examples needed to achieve a given accuracy. We selected a few accuracy levels to reveal the differences among the algorithms and estimated the required sample sizes by interpolating the results observed during the iterative process. Highlighted cells show the accuracy percentages at which the active learning methods required smaller sample sizes than random sampling.
To explain the varying effectiveness of the DIV and DIST algorithms, we correlated the datasets' diversity with DIV's performance and the datasets' uncertainty with that of DIST. Strong correlations were observed in both cases (figure 5). DIV performed better on data with higher diversity, while DIST performed better on data with relatively lower uncertainty. As described in the Materials and method section, diversity was measured as the relative entropy of a random subset of data (10%) against another subset (the other 90%). A dataset with low diversity has low relative entropy between the two random subsets. Since the DIV algorithm is designed to select diverse samples to annotate for training, it is less effective on data lacking diversity (ie, more uniform). Similarly, the DIST algorithm is designed to select training samples with high uncertainty. We measured uncertainty in a dataset as the relative entropy between labeled classes, with higher values of relative entropy suggesting the samples from the classes are more distinct (ie, less uncertainty). Our result suggests that the DIST algorithm is less effective when there is high uncertainty in the dataset.
A close examination of figure 5A shows that the DS and PSCS datasets have the highest uncertainty. The dataset with the least uncertainty is DEXT. The correlation between the DIST performance and the uncertainty is strong (r=−0.76). In figure 5B, we observe that all five datasets have relative entropy values very close to zero, ranging from below 0.01 to 0.05. The most diverse dataset is again DEXT, on which DIV also performed best. The least diverse dataset is SNS1, on which DIV performed worst of all, even worse than RAND. We found a very strong correlation between the diversity and the performance differences for the DIV variation (r=0.98).
We implemented and tested three active learning algorithms on several clinical datasets. We found that these active learning algorithms do not always perform better than random sampling. The distance-based and combination algorithms (DIST and CMB) required as many as or fewer training instances than random sampling to achieve given ACC or AUC levels. DIV, on the other hand, did not perform well overall and required as many as or more training instances than random sampling.
At certain important performance benchmarks, noticeable differences were observed between the sample sizes required by the active learning and passive learning methods. For instance, in the SNS2 dataset at 90% accuracy (table 3), DIST and CMB required one third fewer instances than random sampling, while DIV required significantly larger sample sizes. Similar results were observed on the PSCS dataset, although all algorithms required larger sample sizes than on the SNS2 dataset. These results suggest that some active learning approaches can reduce the burden of annotation and speed up the development of NLP methods.
Another result of interest from our experiments is the inferior performance of DIV on our clinical datasets. Prior studies in computer science have shown that DIV methods, along with DIST and CMB, consistently outperformed random sampling. To verify our implementation of these algorithms, we ran them on a waveform dataset (21 continuous features, three classes, 5000 instances) from the UCI machine learning repository21 that has been used in previous studies of active learning algorithms.16 We observed performance on this dataset consistent with the reported performance, and all active learning methods were found to be better than random sampling; CMB performed best, while DIV performed worst. This has led us to believe that certain characteristics of the clinical datasets we used rendered DIV ineffective.
One salient characteristic is the vocabulary. Clinical notes have a more restricted sublanguage. In a study about sublanguages in the biomedical domain, Friedman et al write ‘In the grammar of a specialized sublanguage, the vocabulary is limited, only restricted combinations of words occur.’43 The sublanguage concerning a specific clinical classification task (eg, current vs past smoker) may be further restricted than others (eg, differentiating health news from non-health news). The unit of analysis is another factor: certain clinical text instances (eg, sentences or sections) are much shorter than, for example, a news article.
To illustrate the differences, we analyzed the vocabulary and the selected features of two datasets: the non-clinical dataset with the highest diversity (DEXT) and the clinical dataset with the lowest diversity (SNS1). SNS1 has only 3938 non-zero words, while DEXT has a larger vocabulary of 11 035 non-zero words. After feature selection, each dataset has around 180 unique words. The mean word frequency in DEXT is 510.87 with a median of 179, higher than the mean frequency of 37.71 with a median of 5 in SNS1. SNS1 instances are also much shorter: roughly 25 words on average, compared with 9395 in DEXT, before feature selection (6 and 756, respectively, after feature selection). The results are reported in table 5 (see online appendix 2).
In order to understand the varying performance of the active learning algorithms, we conducted experiments to analyze the informativeness of the datasets in terms of diversity and uncertainty. We observed that all our clinical datasets have very low diversity, as measured by the relative entropy between two random subsets of data, as well as poor DIV performance. The non-clinical dataset DEXT we included in the experiment has much higher diversity and much better DIV performance. While DIV aims to diversify the training data, it cannot succeed in a dataset that is inherently uniform. In such cases, the diversity of the data selected through random sampling and the DIV algorithm would be similar and DIV would not be effective. For example, in the PSCS dataset many of the instances are quite similar, while, for example, in DEXT the differences among the instances are more pronounced since news articles tend to be more varied. As a result, there is less diversity in the PSCS dataset than in DEXT and DIV performed better on DEXT.
The uncertainty of the clinical datasets was relatively low, as measured by the relative entropy between classes, although still higher than the non-clinical DEXT set. Please note that the higher relative entropy in this case indicates more distinction between the classes, and thus lower uncertainty. DIST seeks to locate the most ambiguous, that is, uncertain examples for training. When a dataset has a higher number of ambiguous cases (ie, higher uncertainty), random sampling can easily capture ambiguous cases, rendering the effect of DIST less noticeable. The clinical datasets we experimented with had a fair number of ambiguous cases, for example, ‘She quit smoking several years ago.’ and ‘She tried to quit smoking several years ago.’ We believe this is why DIST performed relatively well on the clinical datasets, but much better on DEXT.
Our study examined a limited number of datasets: the four clinical datasets used are derived from two original text datasets. Please note that no active learning algorithm had been tested on multiple datasets in the clinical NLP domain. Our results suggest that the DIST and CMB algorithms may be effective for clinical text classification, although these results may not be generalizable. On the other hand, our method of measuring diversity and uncertainty, and the correlations we found between diversity and DIV performance and uncertainty and DIST performance are generalizable to clinical and non-clinical data.
It is also worth mentioning that for ACC and AUC, all learning curves (DIST, CMB, DIV, and RAND) start close to each other, then diverge, and then converge again at larger sample sizes (see online appendix 3). This suggests that the target performance is an important factor when choosing active or passive sampling methods. We have conducted a separate study on this topic, which has recently been published.44
This study is our initial attempt to develop and apply active learning for clinical text classification. We intend to extend this study by exploring variations of the sample selection method, for example, measuring diversity between the unlabeled candidate and the correctly classified instances of the training set. We also plan to continue studying the factors that affect diversity on the datasets in the clinical domain.
This paper presents the results of our experiments in applying active learning to medical text classification. We tested three existing active learning algorithms on five datasets. Our results suggest that active learning algorithms can sometimes, but not always, effectively reduce the sample size of training sets and outperform passive selection. We also found that the effectiveness of the active learning algorithms is associated with the diversity and uncertainty of the datasets as measured by relative entropy.
The authors want to thank CONICYT (Chilean National Council for Science and Technology Research), Universidad de Concepcion, the US Department of Veterans Affairs (grants HIR 09-005 and HIR 08-204), and the NIH Roadmap for Medical Research (grant U54LM008748) for their support for this research.
Contributors: QZT and RLF conceived the study. SG and RLF designed and implemented experiments. LHN designed statistical analysis. RLF performed statistical tests and data analysis. QZT and LHN participated in study design and supervised experiments and data analysis. RLF and QZT drafted the manuscript. Both SG and QZT had full access to all of the data. QZT and EPW made critical revisions and gave final approval to the manuscript.
Funding: This study was supported by the US Department of Veterans Affairs, NIH Roadmap for Medical Research.
Competing interests: None.
Provenance and peer review: Not commissioned; externally peer reviewed.
iThe vector space model is an algebraic model that represents text documents or segments (eg, sentences) as vectors of terms or words. Each word is represented as a dimension and its occurrence frequency as the value. Cosine similarity is calculated as the dot product of two vectors divided by the product of the norms of the two vectors.