To investigate whether (1) machine learning classifiers can help identify nonrandomized studies eligible for full-text screening by systematic reviewers; (2) classifier performance varies with optimization; and (3) the number of citations to screen can be reduced.
We used an open-source, data-mining suite to process and classify biomedical citations that point to mostly nonrandomized studies from 2 systematic reviews. We built training and test sets for citation portions and compared classifier performance by considering the value of indexing, various feature sets, and optimization. We conducted our experiments in 2 phases. The design of phase I with no optimization was: 4 classifiers × 3 feature sets × 3 citation portions. Classifiers included k-nearest neighbor, naïve Bayes, complement naïve Bayes, and evolutionary support vector machine. Feature sets included bag of words, and 2- and 3-term n-grams. Citation portions included titles, titles and abstracts, and full citations with metadata. Phase II with optimization involved a subset of the classifiers, as well as features extracted from full citations, and full citations with overweighted titles. We optimized features and classifier parameters by manually setting information gain thresholds outside of a process for iterative grid optimization with 10-fold cross-validations. We independently tested models on data reserved for that purpose and statistically compared classifier performance on 2 types of feature sets. We estimated the number of citations needed to screen by reviewers during a second pass through a reduced set of citations.
In phase I, the evolutionary support vector machine returned the best recall for bag of words extracted from full citations; the best classifier with respect to overall performance was k-nearest neighbor. No classifier attained good enough recall for this task without optimization. In phase II, we boosted performance with optimization for evolutionary support vector machine and complement naïve Bayes classifiers. Generalization performance was better for the latter in the independent tests. For evolutionary support vector machine and complement naïve Bayes classifiers, the initial retrieval set was reduced by 46% and 35%, respectively.
Machine learning classifiers can help identify nonrandomized studies eligible for full-text screening by systematic reviewers. Optimization can markedly improve performance of classifiers. However, generalizability varies with the classifier. The number of citations to screen during a second independent pass through the citations can be substantially reduced.
Translation of biomedical research into practice depends in part on the production of systematic reviews that synthesize available evidence for clinicians, researchers, and policymakers. Unfortunately, remarkable growth in the number of reviews has not kept pace with growth in the number of medical trials, which are sources of evidence. The problem is even more serious because most reviews are traditional rather than systematic. What is needed is streamlined production of the latter [1, 2] to better control known threats to validity while promoting transparent and reproducible science.
To support the creation and maintenance of quality systematic reviews (also known as evidence reports or comparative effectiveness reviews), a global network of Cochrane entities and a North American network of AHRQ-funded Evidence-based Practice Centers [5, 6] exist. Even so, production is slow. For example, Tricco et al report that 19% of protocols published in the respected Cochrane Library fail to reach fruition as full reviews. Of those that are published as reviews, the average time to completion is 2.4 years with a reported maximum of 9 years, which is the ceiling imposed by the study design. Worse, these estimates ignore time spent exploring the literature to assess significance of possible review questions, and then time spent developing a protocol.
A major bottleneck occurs when teammates screen studies. In a two-step process involving independent and replicated effort, teammates first identify provisionally eligible studies by reading typically thousands of citations. Then they repeat the process by reading full texts of studies identified in the first step to select the final set of studies for inclusion in a review. In other words, to be included in a review, a study must first appear to meet eligibility criteria based on reading its citation; if so, it is eligible for full-text review and provisionally eligible for inclusion in the systematic review. However, not until the full text of its report has been carefully considered in light of the protocol is a final decision made whether to include a study.
In a best-case scenario, teammates compare their decisions and resolve their differences after each step, usually by discussion. It is worth noting that screening procedures vary. For example, some review teams will consider a study for full-text review if at least one teammate thinks the citation (title plus abstract) appears to meet eligibility criteria. In contrast, other teams work to reach consensus when screening citations before they will consider a study worth reading as full text. Presumably, the latter procedure for screening citations is more labor intensive. The point is that workflow patterns vary by review team and topic (A. McKibbon, PhD, written communication, December 2010). Furthermore, it is likely that estimates of workload for professional review teams associated with established centers are underestimates for inexperienced volunteer teams that may be conducting one-off reviews, e.g., when launching new research programs.
The research that serves as the foundation for this study was conducted by Aphinyanaphongs et al , and later extended by Kilicoglu et al . Their work entailed supervised machine learning methods and natural language processing to identify rigorous clinical trials in broad domains, such as therapy, rather than topical domains defined by review questions. Based on the work of Haynes and colleagues in a series of papers (e.g., see ), rigor was presumed if trials comparing treatments were randomized and controlled. However, identifying nonrandomized (NR) studies for inclusion in systematic reviews is an important problem because randomized controlled trials (RCTs) may be unlikely or even unethical for some research questions [11, 12]. For example, NR studies, such as case-control, cross-sectional, and cohort studies, are commonly employed to investigate exposure to environmental hazards, diagnostic test accuracy, disease etiology, human development, invasive surgery, adverse events, and rare disorders. Notably, in what is perhaps the first study to use machine learning methods to identify topically relevant trials for inclusion in systematic reviews, classification involved randomized and controlled drug trials , which is in keeping with the foundational research of .
For many review questions, the classification task involves a mix of designs because reviewers search for NR studies (if eligible) in addition to RCTs. The latter are preferred because they tend to be less biased relative to NR studies. However, when NR studies are eligible for inclusion in a systematic review, the Cochrane Non-Randomised Studies Methods Group enjoins investigators to not include design terms in their search filters . Although filters exist to reliably retrieve RCTs , filters “to identify other study types are limited” (Appendix 2 in ; see also ). This is true even though development of filters is ongoing (e.g., see [16–20]). Thus, the initial screening phase can be more labor intensive when NR studies are eligible. In response to this dilemma, some of the Cochrane Review Groups allow NR design terms when the retrieval set is so large that the review becomes impractical (e.g., see ). If we take seriously the preference for not including design terms in searches for NR studies, an informatics solution to assist review teams seems especially warranted.
Researchers interested in [semi-]automating the screening phase for systematic reviews are currently using the classifiers complement naïve Bayes (cNB)  or a Support Vector Machine (SVM) with a linear kernel [23, 24], or are developing a factorized version of cNB . The fact that these researchers are using different classifiers for their specific tasks indicates that understanding relative classifier performance is a necessary step for our task. Thus, we are interested in empirically comparing the performance of several supervised machine learning classifiers for a binary classification task using biomedical citations from extant systematic reviews. The task is binary because we want to classify primary studies as being eligible or not for further consideration by the review team. We also consider no optimization vs. optimization of features and parameters. Interestingly, a comparative study of classifiers by Colas and Brazdil  is sometimes cited as support for using a particular classifier. They found that an optimized k-nearest neighbor (k-NN) or naïve Bayes (NB) classifier could be as good as a linear SVM based on 20,000 newsgroup e-mails. However, they cautioned that their results should be validated for other document classification tasks to ensure generalizability. In sum, classifiers useful for newsgroup e-mail may not be as useful for biomedical citations. Thus, comparative studies of classifiers are warranted.
In general, our motivation for conducting this research is similar to that of other groups [13, 22–25], i.e., we want to facilitate production of systematic reviews. However, we are interested in assisting reviewers (regardless of experience or affiliation) by identifying classifiers that can reduce the number of citations that must be screened during a second independent pass through a set of citations. We interpret the usefulness of a classifier with respect to reducing the number of citations to screen rather than time spent screening because of differences in procedures, reviewer expertise, and number of teammates available for dividing the labor. In other words, valid baseline estimates of time spent screening and subsequent reductions in time depend on several variables that are not the focus of this study.
Additionally, until this relatively new area of translational informatics research matures, we assume that reviewers will insist on at least one complete cycle where human(s) screen the full set of citations. We further assume that a team consists of at least two people to ensure independent and replicated screening. In reality, more than one teammate can screen citations for the first pass as long as other people independently screen the same citations during the second pass. This procedure is meant to control random errors and bias introduced by humans. However, there are times when even two people cannot independently screen the entire set of citations. When this is the case, Cochrane suggests “a second person look at a sample [emphasis added] of the records” . This is precisely our intention, i.e., we envision a machine learning system that returns a reduced set or sample of citations to screen for the second pass. The reduced set would include most if not all of the citations labeled as eligible for full-text review, as well as a subset of those labeled as ineligible during the initial screening. Human reviewers would still have to reach consensus regarding discrepant eligibility decisions from the first pass through the entire set when compared to a second pass through the reduced set (see Figure 1). Assisting reviewers in this way would enable a more focused, independent screening of citations during the second pass. Reviewer bias and error would be controlled, in part, because of the opportunity for a second screening by different teammate(s) who could potentially identify studies overlooked by the first reviewers. Furthermore, the workload would be reduced because the disproportionately large set of citations identified as ineligible by both humans and machine would be eliminated from further consideration.
In sum, we conducted this study to investigate whether (1) machine learning classifiers can help identify NR studies eligible for full-text screening by systematic reviewers; (2) classifier performance varies with optimization of parameters and features extracted from biomedical citations; and (3) the number of citations to screen can be reduced. We did this by empirically comparing classifier performance using citations that point to mostly NR studies, varying optimization conditions, and then estimating the reduction in the number of citations to screen for the best classifier.
The citations for this study were from 2 Cochrane systematic reviews. One has to do with surgical interventions for treating ameloblastomas of the jaws and the other with vaccines for preventing influenza in the elderly. By using citations from extant systematic reviews, we capitalized on domain-specific knowledge. This is because citations were initially retrieved by Cochrane trials search coordinators who developed filters given reviewers’ knowledge of their topics.
For the ameloblastoma dataset, we had access to the entire set of citations (N=1815) retrieved from MEDLINE, EMBASE, the Cochrane Central Register of Controlled Trials, and the Cochrane Oral Health Group Trials Register. For the influenza dataset, we retrieved 5485 citations (94%) by re-running published MEDLINE and EMBASE searches. We also manually searched for 147 studies not in our retrieval set, but listed in the review as eligible for further consideration.
We managed citations in EndNote and recorded decisions as either exclude or include. Decisions were based on the consensus of at least 2 reviewers in the published author lists regarding eligibility [21, 28, 31]. From EndNote, we exported each corpus as a text file in MEDLINE format. We then created 3 text files for each citation: (1) the full citation, including title, abstract, and metadata (FULL); (2) the title and abstract (TIABS); and (3) the title (TITLE). We built training and test sets for each type of text file by randomly assigning files using a 2:1 split, respectively. To ensure comparability across training and test sets, we used the same random assignment for citation portions.
For the ameloblastoma review, the training set for each citation portion consisted of 1209 files: exclude=1133; include=76 (6.3%). The test set for each portion consisted of 606 files: exclude=567; include=39 (6.4%). For the influenza review, the training set consisted of 3679 files: exclude=3469; include=210 (5.7%); the test set consisted of 1806 files: exclude=1699; include=107 (5.9%). The citations labeled as include point to studies eligible for full-text review, as well as being provisionally eligible for eventual inclusion in the systematic review.
To extract features (processed words) and classify studies, we used the open-source, data-mining suite RapidMiner v.4.6 [32, 33] with a text plugin . We processed text to create weighted feature vectors that represent each citation portion. This involved tokenizing (splitting up) strings of text, converting to lower case, filtering out stopwords and tokens with length less than 3, Porter stemming, and pruning out tokens that occurred in at most 3 citations. Features were weighted with TFIDF weights (, p.109), which are the product of term frequencies (TF) and inverse document frequencies (IDF). Note that for citations retrieved from MEDLINE or EMBASE, the metadata include tags and indexing terms from MeSH  or EMTREE , respectively. For this study, we treated metadata as any other text without preserving the tags, such as the MeSH tags TI for title or SO for source.
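The preprocessing steps above can be sketched in plain Python. This is only an illustration, not the RapidMiner workflow: the stopword list, the helper names `tokenize` and `tfidf_vectors`, the example titles, and the particular TF-IDF variant (raw term frequency times log(N/df)) are all assumptions, and Porter stemming is omitted.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption: the real list is much longer).
STOPWORDS = {"the", "and", "for", "with", "was", "were", "that", "this"}

def tokenize(text, min_len=3):
    """Lowercase, split on non-letters, drop stopwords and tokens shorter than min_len."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS]

def tfidf_vectors(docs, prune_max_df=3):
    """Return one {token: TF * IDF} dict per document.

    Tokens that occur in at most prune_max_df documents are pruned.
    Stemming is omitted in this sketch.
    """
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        term_freq = Counter(t for t in toks if doc_freq[t] > prune_max_df)
        vectors.append({t: tf * math.log(n_docs / doc_freq[t])
                        for t, tf in term_freq.items()})
    return vectors

# Hypothetical titles; pruning is disabled because the toy corpus is tiny.
docs = ["Ameloblastoma of the jaw",
        "Influenza vaccine for the elderly",
        "Jaw surgery outcomes"]
vectors = tfidf_vectors(docs, prune_max_df=0)
```

In this toy corpus, "jaw" appears in 2 of 3 documents and so receives a lower weight than "ameloblastoma", which appears in only one.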
In general, we first trained a set of classifiers known to work well with text [26, 38] using processed features extracted from citations. Then we independently tested classifier models on a third of the data reserved for this purpose. We compared performance with respect to recall, precision, and a summary measure that overweights recall relative to precision. We chose to overweight recall because this is in keeping with the human goal of near-perfect recall when screening citations [12, 25, 39]. Human reviewers are overly inclusive during this phase in order to reduce the risk of overlooking relevant studies. This means that precision is sacrificed for recall. During their full-text review of studies identified by screening citations, reviewers effectively improve precision by eliminating studies that do not meet their inclusion criteria. Thus, for our purposes, we wanted to find classifiers with nearly perfect recall and precision good enough to reduce the number of citations to screen.
We conducted this study in two phases. Phase I involved neither optimization nor validation; phase II involved optimization of features and classifier parameters with cross-validation. In both phases, we conducted independent tests on the reserved data.
We defined best models as returning highest recall with precision good enough to reduce workload. Specifically, recall had to be at least 95% and precision had to be greater than 7% and 6% for the ameloblastoma and influenza datasets, respectively. The rationale for the cutoffs is as follows: when a model returns nearly perfect recall but poor precision, almost all of the eligible studies are identified along with many falsely identified ones. In the extreme, if precision equals the percentage eligible, the returned set of studies is as large as the entire set and no reduction in workload is possible. Thus, precision must surpass the percentage of studies identified by humans as being eligible for full-text review. Note that when recall is 95%, the machine falsely excludes 5% of the eligible studies. However, in comparing discrepant decisions, human(s) would reconsider the 5% they had identified but the machine had missed.
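The rationale for the precision cutoff can be checked with a few lines of arithmetic. The sketch below is a simplification and not the estimation procedure used in the study: it counts only the citations a classifier labels as eligible and ignores any sample of ineligible citations added to the reduced set. The function names are hypothetical; the counts are those of the ameloblastoma test set.

```python
def screening_set_size(n_eligible, recall, precision):
    """Number of citations labeled eligible: true positives divided by precision."""
    true_positives = recall * n_eligible
    return true_positives / precision

def workload_reduction(n_total, n_eligible, recall, precision):
    """Fraction of the full citation set that drops out of the second pass."""
    return 1.0 - screening_set_size(n_eligible, recall, precision) / n_total

# Ameloblastoma test set: 606 citations, 39 labeled include.
n_total, n_eligible = 606, 39
prevalence = n_eligible / n_total   # about 6.4%

# Extreme case: perfect recall, precision equal to the percentage eligible.
no_gain = workload_reduction(n_total, n_eligible, 1.0, prevalence)

# Hypothetical case: perfect recall, precision twice the percentage eligible.
half = workload_reduction(n_total, n_eligible, 1.0, 2 * prevalence)
```

With precision equal to the percentage eligible, the reduction is zero; doubling precision at the same recall halves the set to screen.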
In our experiments, we compared the following classifiers: k-NN, NB, cNB, and evolutionary support vector machine (EvoSVM). NB and cNB are probabilistic learners; EvoSVM is functional; and k-NN is a lazy learner that classifies based on similarity or distance measures. Further, NB assumes conditional and positional independence of features; thus, the immediate context of features or processed words extracted from citations is ignored. cNB is suitable for imbalanced data and presumably more appropriate for this task because the percentage of eligible studies in systematic reviews is usually relatively small. Additionally, cNB relaxes the particularly unrealistic assumptions of NB regarding independence of features extracted from text written by humans. EvoSVM uses a kernel function to find a nonlinear hyperplane that maximally separates classes of documents. EvoSVM generalizes support vector machine classifiers and can optimize non-positive semi-definite kernel functions.
We used RapidMiner default settings for classifier parameters with the following exceptions: For EvoSVM, we set C=1 instead of C=0 in phase I based on . In phase II, we set C=1, 10, or 20. The parameter C is a regularization constant that sets an upper bound for multipliers used in maximizing the margin between classes (cf. chapter 15 in ). For k-NN, we used cosine similarity measures instead of mixed Euclidean distances.
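For both phases, performance measures included recall, precision, and an overall performance measure (F3), which is a weighted harmonic mean (, p.144). The formula for F is:

$$F_{\beta} = \frac{(1+\beta^{2}) \cdot \text{precision} \cdot \text{recall}}{\beta^{2} \cdot \text{precision} + \text{recall}}$$

where beta is a non-negative number. Note that the notation F1 or F3 is short for Fbeta=1 or Fbeta=3, respectively. Thus, the formula for F3 is:

$$F_{3} = \frac{10 \cdot \text{precision} \cdot \text{recall}}{9 \cdot \text{precision} + \text{recall}}$$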
We estimated F3 rather than the traditional measure F1 that equally weights recall and precision. Although the relative weighting is more obvious when beta is expressed in terms of alpha (cf. , p. 144), the formulas presented are more common. In our opinion, F1 is inappropriate for this task because it is not in keeping with reviewer behavior during the screening phase.
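As a quick numerical check, the weighted harmonic mean can be written as F = (1 + beta^2) × precision × recall / (beta^2 × precision + recall). The sketch below (the function name `f_beta` is ours) applies it to a precision of 13.40% and a recall of 100%, values reported for the optimized EvoSVM model on the ameloblastoma test set.

```python
def f_beta(precision, recall, beta=3.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall reported for optimized EvoSVM, ameloblastoma test set.
f3 = f_beta(precision=0.1340, recall=1.0, beta=3.0)
f1 = f_beta(precision=0.1340, recall=1.0, beta=1.0)
print(f"F3 = {f3:.4f}, F1 = {f1:.4f}")   # F3 = 0.6074, F1 = 0.2363
```

The F3 value agrees with the 60.74% reported for this model, whereas F1 is dominated by the low precision and would penalize a model that behaves as human screeners do.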
The design of phase I was: 4 classifiers × 3 citation portions × 3 feature sets. We used the ameloblastoma citations and did not optimize features or classifier parameters.
Classifiers included k-NN, NB, cNB, and EvoSVM. In early analyses, LibSVM with a radial or polynomial kernel either failed or returned very poor performance. We therefore dropped LibSVM from subsequent analyses.
Citation portions included TITLE, TIABS, and FULL citations.
Feature sets included unigrams or bag of single words (BOW), and 2-term (2G) and 3-term n-grams (3G). N-gram sets are hierarchical and therefore consist of features from previous set(s). For example, a 3G set consists of contiguous triples and pairs, as well as single processed features, i.e., trigrams, bigrams, and unigrams. We had competing reasons for comparing these feature sets. On the one hand, 2G or 3G could add linguistic phrases that improve classification; on the other, BOW could reduce computational burden.
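The hierarchical structure of the n-gram sets can be illustrated as follows. This is a minimal sketch: the function name, the underscore separator, and the stemmed tokens are hypothetical and are not taken from the RapidMiner text plugin.

```python
def hierarchical_ngrams(tokens, max_n):
    """All contiguous n-grams for n = 1..max_n, joined with underscores."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append("_".join(tokens[i:i + n]))
    return features

# Hypothetical processed (stemmed) title tokens.
tokens = ["surgic", "treatment", "ameloblastoma"]
bow = hierarchical_ngrams(tokens, 1)   # unigrams only
g2 = hierarchical_ngrams(tokens, 2)    # unigrams + bigrams
g3 = hierarchical_ngrams(tokens, 3)    # unigrams + bigrams + trigrams
```

For three tokens, BOW has 3 features, 2G has 5, and 3G has 6, which shows how quickly the feature space (and the computational burden) grows with n.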
Varying feature sets and citation portions allowed comparison of their relative contribution to classification. We expected that an n-gram feature set extracted from FULL citations would improve classifier performance. We reasoned that 2G or 3G sets would preserve some of the information in the indexing terms or phrases found in the metadata of FULL citations and, therefore, this feature–citation portion combination would be associated with better performance.
In phase II, we used ameloblastoma and influenza citations. We considered 3 classifiers, 2 feature sets, and 1 citation portion. Classifiers included k-NN, cNB, and EvoSVM. Given the results from phase I, we dropped NB and used BOW extracted from FULL citations. We also developed a second feature set by adding 2G title features to the BOW. This enrichment overweighted titles and added contextual information residing in pairs of title words.
For each information gain (IG) threshold, we selected features if the absolute value of the IG weight was greater than or equal to the threshold. We set the IG threshold manually, outside of a loop for grid optimization of classifier parameters; that loop in turn contained an inner loop for 10-fold cross-validation.
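The selection rule amounts to a simple filter on absolute weights. In the sketch below, the function name and the weights are hypothetical (they are not values from the study), and `None` stands for the "no feature selection" condition.

```python
def select_by_ig(ig_weights, threshold=None):
    """Keep features whose |IG weight| >= threshold; None keeps every feature."""
    if threshold is None:
        return set(ig_weights)
    return {f for f, w in ig_weights.items() if abs(w) >= threshold}

# Hypothetical normalized IG weights for four processed features.
weights = {"ameloblastoma": 1.0, "jaw": 0.15, "surgeri": 0.05, "year": 0.0}

# Apply a few example thresholds, from no selection to a stricter cutoff.
kept = {t: select_by_ig(weights, t) for t in (None, 0.0001, 0.04, 0.08)}
```

Even a very small threshold such as 0.0001 removes features with zero weight, while larger thresholds progressively trade recall for precision.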
We used the RapidMiner operator Grid Parameter Optimization to find the best parameter set per IG threshold. This operator searches over a grid of parameter combinations to return an optimal set. Given the nature of human screening behavior, we searched for optimal sets yielding highest recall with precision greater than the cutoff. The size of the grid is determined by the levels of the parameters under consideration. For example, if one combines 2 parameters with 3 possible values each, the search is over a 3 × 3 grid with 9 cells. For each cell, an n-fold cross-validation is run. In our experiments, the total number of runs (N) for each classifier equals the number of IG thresholds × the number of cells in the grid × the number of folds in the cross-validations. For example, for the ameloblastoma data, N=480 (k-NN), 540 (EvoSVM), and 600 (cNB).
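The run counts follow directly from the grid sizes. The sketch below enumerates the grids using the parameter values and the 6 IG thresholds listed for the ameloblastoma data in this section; the dictionary keys are descriptive labels of our own, not RapidMiner parameter names.

```python
from itertools import product

N_IG_THRESHOLDS = 6   # none, 0.0001, 0.04, 0.08, 0.12, 0.16
N_FOLDS = 10

# Parameter grids per classifier (key names are illustrative labels).
grids = {
    "k-NN": {"k": [1, 3, 5, 7], "weighted_vote": [False, True]},
    "EvoSVM": {"population_size": [1, 10, 20], "C": [1, 10, 20]},
    "cNB": {"smoothing": [0.001, 0.4, 0.6, 0.8, 1.0],
            "normalize_class_weights": [False, True]},
}

def total_runs(grid, n_thresholds=N_IG_THRESHOLDS, n_folds=N_FOLDS):
    """IG thresholds x grid cells x cross-validation folds."""
    cells = list(product(*grid.values()))
    return n_thresholds * len(cells) * n_folds

runs = {name: total_runs(grid) for name, grid in grids.items()}
print(runs)   # {'k-NN': 480, 'EvoSVM': 540, 'cNB': 600}
```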
We randomly selected partitions for the cross-validations and stratified to ensure that the percentage of eligible studies was the same across partitions. Further, we used the same random seed to ensure that partitions were equivalent when comparing classifiers. For each fold in a 10-fold cross-validation, we trained a classifier on 90% of the training data given a particular combination of parameters in the optimization grid, and assessed performance on the remaining 10%. Because cross-validations are iterative, performance measures were means of 10 values.
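Stratified partitioning with a fixed seed can be sketched as below. This is one simple way to do it (shuffling each class and dealing its members round-robin into folds) and is an assumption on our part, not a description of RapidMiner's internal algorithm; the seed value is arbitrary. The class counts are those of the ameloblastoma training set.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each item a fold index so class proportions are similar per fold."""
    rng = random.Random(seed)              # fixed seed gives identical partitions
    folds = [None] * len(labels)
    for cls in sorted(set(labels)):        # sorted for a reproducible class order
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[i] = pos % n_folds       # deal class members round-robin
    return folds

# Ameloblastoma training set: 1133 exclude, 76 include.
labels = ["exclude"] * 1133 + ["include"] * 76
folds = stratified_folds(labels, n_folds=10, seed=42)
include_per_fold = [
    sum(1 for f, y in zip(folds, labels) if f == k and y == "include")
    for k in range(10)
]
```

Each fold receives 7 or 8 of the 76 eligible citations, so the percentage eligible is nearly constant across partitions, and reusing the seed reproduces the same partitions for every classifier.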
To develop a reasonable series of IG thresholds, we inspected a plot of normalized IG weights for BOW extracted from FULL ameloblastoma citations. The absolute values ranged from 0.0 to 1.0, with ameloblastoma having the largest weight; most values were less than 0.20. Thus, our series of threshold values included the following: none (no feature selection), 0.0001, 0.04, 0.08, 0.12, and 0.16. Based on the ameloblastoma results, the thresholds for the influenza data were none and 0.0001.
To ensure feasibility of the EvoSVM optimization runs, we conducted scoping analyses to select appropriate parameter values. By scoping, we mean that we conducted grid optimization with simple validation using a 1:1 split of the training set with no feature selection. However, for mutation types (Gaussian, switching, and sparsity) we followed the methods of phase I in addition to optimization with simple validation. Given the guidance of [43, 44], we considered various values for C and gamma; the default for gamma=1.0 was best. We also confirmed that the default value for epsilon=0.1 was reasonable for our data. We chose a nonlinear kernel based on our pilot study . Given these preliminary analyses, we used the following settings: radial kernel; Gaussian mutation; gamma=1.0; epsilon=0.1; population size=1, 10, 20; and C=1, 10, 20. Thus, population size and C were the input parameter values for the grid optimization in phase II.
For cNB, the input parameter values included smoothing values = 0.001, 0.4, 0.6, 0.8, 1.0 and normalized class weights=false, true, based on .
For k-NN, the input parameter values included number of neighbors k = 1, 3, 5, 7 and weighted vote = false, true. Note that when k=1, vote is not relevant.
In addition to cross-validation, we independently validated the best model for each classifier on a reserved test set. Note that the independent tests are stricter than the tests on held-out partitions during cross-validation because data for the former are not used when training classifiers. Thus, the independent tests are probably better estimates of generalizability.
For phase II, we expected that optimization would improve performance for all classifiers. We further expected that after optimization at least one classifier would return recall greater than or equal to 95% with precision greater than 7% for the ameloblastoma data and greater than 6% for the influenza data. Based on the results from phase I, we expected that enriching the feature set extracted from FULL citations with 2G title features would improve performance for cNB.
Table 1 displays the independent test results for phase I. In general, there appears to be a complex interaction between classifier, citation portion, and feature set.
Over 9 possible conditions (3 citation portions × 3 feature sets), EvoSVM returned the best recall (82.05%) for BOW extracted from FULL citations; 1-NN returned the best F3 (67.84%), also for BOW extracted from FULL citations. NB and cNB returned the worst recall (7.69%) and F3 (8.47%) for 2G and 3G extracted from FULL citations.
Over all conditions, recall was best for EvoSVM 5 of 9 times. Precision was maximal when recall was very low, e.g., precision=100% and recall=7.69% for NB, 2G, FULL. NB was the weakest classifier regarding F3 (range: 8.47% to 56.90%).
Fig. 2 (top) displays the results for recall as a function of classifier and feature set when features were extracted from the FULL citation. Using BOW appears to improve recall for EvoSVM, 1-NN, and cNB, but not for NB.
Fig. 2 (bottom) displays the results for recall as a function of classifier and citation portion when the feature set was BOW. Metadata in the FULL citation appear to improve recall for EvoSVM and 1-NN, but not for NB and cNB. (Consider that the difference between FULL and TIABS is the metadata in FULL.) However, extracting BOW from TITLE was associated with best recall for cNB and NB, and was second to FULL citations for EvoSVM.
No classifier reached the recall criterion of at least 95% for acceptable performance.
Based on phase I results, we dropped NB from further consideration. We optimized features and parameters with respect to recall and cross-validated models for k-NN, cNB, and EvoSVM using BOW extracted from FULL citations. We also cross-validated optimized models on enriched feature sets (BOW plus 2G title features). All independent tests applied the best training models from the grid optimizations with cross-validations to the reserved data. The best feature-parameter combinations per classifier were the same across ameloblastoma and influenza datasets.
Tables 2, 3, and 4 display the results for optimization with cross-validation; recall and precision are in bold for models that surpassed both cutoffs. Table 5 displays the independent test results.
Fig. 3 displays mean recall and precision as a function of IG threshold for the ameloblastoma data. The curves for both cNB and EvoSVM were inversely related, which is typical of the tradeoff between recall and precision. Two points surpassed both recall and precision criteria: when the IG threshold=none for cNB and 0.0001 for EvoSVM. For k-NN, the curves were similar, but diverged for the largest IG threshold. Although k-NN always surpassed the precision cutoff, it never met the recall criterion.
The best model for EvoSVM over all IG thresholds involved a subset of features for which the IG weight was >= 0.0001; n=1430 (40%) and n=2205 (32%) features, ameloblastoma and influenza data, respectively. The best parameter set was C=1 and population size=10 (see Tables 2 and 4).
For the independent tests with ameloblastoma data, recall was stable when compared to the best optimization results, i.e., recall=100% for both BOW and enriched BOW (see Tables 2, 3, and 5). However, with influenza data, recall degraded from 100% to 79.44% and 90.65% for BOW and the enriched BOW, respectively (see Tables 4 and 5). Enrichment boosted precision 2.4% (13.40% vs. 13.09%, ameloblastoma) and 8.5% (8.90% vs. 8.20%, influenza) (see Table 5). We computed the percentage improvement as [(.0890–.0820)/.0820] × 100=8.5%. EvoSVM surpassed both recall and precision thresholds for the ameloblastoma data, and the precision threshold for influenza data. However, it failed with respect to recall for the influenza data.
Compared to the results from phase I, recall for the optimized EvoSVM model on the ameloblastoma test set was 21.9% better (100% vs. 82.05%) and F3 was 3.8% worse (60.74% vs. 63.11%). (See Table 1, BOW/FULL and Table 5.)
The best optimized model for cNB over all IG thresholds involved the full set of features: n=3574 and n=6828, ameloblastoma and influenza data, respectively. The best parameter set was smoothing value=0.001 and normalized weights for each class (see Tables 2 and 4).
For the independent tests, recall was relatively stable when compared to the best optimization results (ameloblastoma and influenza data) (see Tables 2–5). Enrichment boosted precision 18.8% (10.95% vs. 9.22%, ameloblastoma) and 3.8% (7.58% vs. 7.30%, influenza) (see Table 5). cNB surpassed both recall and precision thresholds with influenza data (both feature sets) and ameloblastoma data (BOW). However, it just missed the recall threshold of 95% with ameloblastoma data and the enriched BOW (recall=94.87%).
Compared to the results from phase I, recall for the optimized cNB model on the ameloblastoma test set was 280% better, i.e., 3.8 times as high (97.44% vs. 25.64%); F3 was 81.3% better (49.80% vs. 27.47%). (See Table 1, BOW/FULL and Table 5.)
For k-NN, the results for the independent tests were quite mixed. For example, recall improved when compared to the best optimization results for the ameloblastoma data, but degraded for the influenza data (see Tables 2, 4, and 5). Enrichment boosted precision for the ameloblastoma data, but degraded precision for the influenza data (see Table 5). k-NN failed to meet the recall threshold for both datasets regardless of feature set, whereas it always surpassed the precision threshold.
For the ameloblastoma data, the k-NN results of the independent test for BOW extracted from FULL citations were the same as in phase I (see Table 1, BOW/FULL and Table 5). This is because the models were the same.
Following the advice of Demsar, we computed an omnibus Friedman test statistic to assess differences among mean ranks for 3 classifiers per performance measure (see Table 5). The Friedman test is a robust, nonparametric alternative to repeated measures ANOVA. When the Friedman test statistic was statistically significant (P < .05), we computed Bonferroni-Dunn tests for post hoc comparisons; we adjusted alpha for the number of comparisons to control the Type I error rate. Note that higher ranks are associated with better performance.
Mean ranks for recall were significantly different: 2.5 (EvoSVM), 2.5 (cNB), and 1.0 (k-NN); Friedman chi2 (2 df) = 6, P = .0498. Because the post hoc comparison for EvoSVM vs. k-NN was the same as for cNB vs. k-NN (2.5 vs. 1.0), alpha was not adjusted. For EvoSVM or cNB vs. k-NN, mean recall was significantly different: z = 2.12, P = .034. Thus, recall was not significantly different for EvoSVM vs. cNB, but was when each was compared to k-NN. Recall was always better for EvoSVM or cNB vs. k-NN. Overall, recall usually improved or was stable when features were enriched by overweighting titles for EvoSVM and cNB, but not for k-NN.
The mean ranks for precision were significantly different: 2.0 (EvoSVM), 1.0 (cNB), and 3.0 (k-NN); Friedman chi2 (2 df) = 8, P = .0183. Because 3 post hoc comparisons were computed, the adjusted alpha = .05/3 = .0167. For EvoSVM vs. cNB: z = 1.41, P = .1585. For EvoSVM vs. k-NN: z = −1.41, P = .1585. For cNB vs. k-NN: z = −2.83, P = .0047. Thus, precision was not significantly different for EvoSVM vs. cNB or EvoSVM vs. k-NN, but was for cNB vs. k-NN. Precision was always better for k-NN when compared to cNB. In general, precision improved for EvoSVM and cNB when features were enriched by overweighting titles, whereas results for k-NN were mixed.
The mean ranks for F3 were not significantly different: 2.5 (EvoSVM), 1.8 (cNB), and 1.8 (k-NN); Friedman chi2 (2 df) = 1.5, P = .4724. Because the omnibus test was not statistically significant, post hoc comparisons were not warranted.
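The test statistics above can be recomputed from the reported mean ranks. The sketch below uses the textbook Friedman statistic without a tie correction and the usual standard error for rank differences in the post hoc z tests. The number of blocks, N = 4 (2 datasets × 2 feature sets in Table 5), is our assumption; it is the value under which these formulas reproduce the reported statistics. The function names are ours.

```python
import math


def friedman_from_mean_ranks(mean_ranks, n_blocks):
    """Friedman chi-square (no tie correction) from the mean rank of each classifier."""
    k = len(mean_ranks)
    sum_sq = sum(r * r for r in mean_ranks)
    return 12.0 * n_blocks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variate with exactly 2 df."""
    return math.exp(-x / 2.0)


def posthoc_z(rank_a, rank_b, k, n_blocks):
    """z statistic for the difference between two mean ranks."""
    se = math.sqrt(k * (k + 1) / (6.0 * n_blocks))
    return (rank_a - rank_b) / se


def two_sided_p(z):
    """Two-sided P value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


N = 4  # assumed: 2 datasets x 2 feature sets
for measure, ranks in {
    "recall": [2.5, 2.5, 1.0],       # EvoSVM, cNB, k-NN
    "precision": [2.0, 1.0, 3.0],
    "F3": [2.5, 1.75, 1.75],         # 1.75 is reported as 1.8 after rounding
}.items():
    chi2 = friedman_from_mean_ranks(ranks, N)
    print(f"{measure}: chi2 = {chi2:.2f}, P = {chi2_sf_2df(chi2):.4f}")

z = posthoc_z(2.5, 1.0, k=3, n_blocks=N)  # EvoSVM (or cNB) vs. k-NN on recall
print(f"z = {z:.2f}, P = {two_sided_p(z):.3f}")
```

Small differences from the reported P values can arise when z is rounded to two decimals before the P value is looked up.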
To understand the implications of this research, consider the following scenario. Assume that (1) a reliable machine learning system exists to assist systematic reviewers when screening citations; (2) 3000 citations have been retrieved; (3) human reviewer(s) complete the first pass through the entire set of citations and label 180 (6%) as eligible for full-text review; and (4) two machine learning classifiers are available (EvoSVM and cNB). Given sampling variability, our best estimates for recall and precision are the averages for the independent test results on the enriched feature sets. Thus, further assume that recall and precision are 95.32% and 11.15% for EvoSVM, and 96.50% and 9.27% for cNB (based on Table 5).
The questions of concern to potential users are: How many citations will each machine learning classifier identify as eligible and how does this compare to screening the entire set once again? If the system is useful, reviewers need not consider further the disproportionately large number of citations labeled as ineligible both by human(s) and machine.
If the reviewers choose EvoSVM, the classifier correctly labels 172 citations and incorrectly labels another 1443 as eligible. Thus, a noisy set of 1615 true and false positives (172+1443=1615) is returned for the second pass through the citations by at least one more human reviewer—we refer to the size of this set as the number needed to screen (NNS). However, the NNS should be adjusted somewhat by the 8 citations overlooked by the machine, but identified by human(s). This is because recall is not perfect. Thus, the NNS for EvoSVM is 1615+8=1623, which is a 46% reduction in the size of the initial retrieval set: (3000-1623)/3000=.459.
If the reviewers choose cNB, the classifier correctly labels 174 citations and incorrectly labels another 1768 as eligible. A set of 1942 true and false positives is returned. Adding in the 6 citations overlooked by the machine, the NNS is 1948, which is a 35% reduction in the size of the initial retrieval set.
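The arithmetic of this scenario can be sketched as follows. The function name `number_needed_to_screen` is ours. True positives are derived from the assumed recall, whereas the false-positive counts (1443 and 1768) are taken directly from the text rather than re-derived from precision.

```python
def number_needed_to_screen(n_retrieved, n_eligible, recall, false_positives):
    """Return (true positives, false negatives, NNS, proportional reduction)."""
    tp = round(recall * n_eligible)   # eligible citations the classifier finds
    fn = n_eligible - tp              # eligible citations found only by human(s)
    nns = tp + false_positives + fn   # second-pass set, adjusted for misses
    reduction = (n_retrieved - nns) / n_retrieved
    return tp, fn, nns, reduction


scenario = {
    "EvoSVM": {"recall": 0.9532, "false_positives": 1443},
    "cNB": {"recall": 0.9650, "false_positives": 1768},
}

for name, s in scenario.items():
    tp, fn, nns, reduction = number_needed_to_screen(
        n_retrieved=3000, n_eligible=180, **s
    )
    print(f"{name}: TP={tp}, FN={fn}, NNS={nns}, reduction={reduction:.1%}")
```

Because the true positives and the overlooked citations always sum to the 180 eligible citations, the adjusted NNS in this scenario is simply 180 plus the number of false positives.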
Clearly, if a reliable system were in place and both classifiers were reasonably efficient, systematic reviewers would choose EvoSVM over cNB because the NNS is 1623 for EvoSVM versus 1948 for cNB. Nevertheless, until we have more citations from SRs on topics where NR studies are likely, our estimates for recall and precision may be unrealistic.
A major challenge for future research is boosting precision to reduce further the screening burden while maintaining very high recall. More than likely, we need feature sets that capitalize on both the structure of citations and the language that scientists and indexers use to describe studies. Regarding the latter, review teams outside of the United States are likely to search EMBASE, which is the European counterpart of MEDLINE. However, indexers use different terms for the same concepts, and MeSH and EMTREE terms can appear in different places in the citation. Thus, modeling structure is a challenge if we want to extract indexing terms and tag for source. In this paper, we demonstrated that adding contextual information from pairs of title words tends to boost precision modestly—suggesting that we can do a better job of modeling the format and scientific language of biomedical citations.
The results were somewhat surprising. For phase I, we had expected that without optimization, recall and overall performance would be best using 2- or 3-term n-grams extracted from complete citations. Instead, using single processed words (BOW) from FULL citations was associated with best performance. This suggests that indexing in the complete citation improves performance, even when the indexing terms are processed as single words. To improve this feature set in future work, we could preserve the MeSH and EMTREE terms (phrases), which would yield a feature set similar to the one used by Cohen and colleagues [23, 47].
Because none of the classifiers from phase I attained high enough recall to be of use, optimization in phase II was warranted.
For phase II, we had expected that all classifiers would benefit from optimization. This was generally true for EvoSVM and cNB, but not for k-NN. As it turned out, the optimized model for k-NN was the same as the one we used during phase I. Additionally, only EvoSVM benefited from selecting features based on IG. The results did support our expectation that, with optimization, one or more classifiers would return recall at least as high as 95% and precision greater than 6% or 7%, depending on the dataset. Both EvoSVM and cNB met these criteria, but generalization performance for EvoSVM was not as good as for cNB. This suggests either sampling variability or overfitting of EvoSVM during training. If the latter, the parameter C may not have been tuned well because C purportedly controls overfitting (p. 301). Additionally, a radial kernel may have been inappropriate (see below).
Additionally, we had expected that enriching the BOW from full citations by overweighting titles would improve performance for cNB. It was somewhat surprising that enrichment improved performance for both cNB and EvoSVM.
Although researchers currently favor variants of both of these classifiers [22–25], the evidence suggests that optimization is necessary to boost performance. In fact, the results for cNB were startling with an almost three-fold improvement for recall and an 81% improvement for overall performance when comparing phase I and phase II results for ameloblastoma data.
The major limitation of this study is that the citations came from just two systematic reviews. Future comparative studies of classifiers should use citations from several reviews, paying attention to phrases for NR study designs that meet inclusion criteria as specified in the protocols. Presumably, more precise classification is possible for randomized controlled trials because the indexing is better than for NR studies (see the Introduction here and in ).
Another limitation is that we wrapped feature selection around grid optimization of classifier parameters, ignoring the class imbalance problem. While using a wrapper strategy is a well-known approach, a better one could involve selecting features within the positive (include) and negative (exclude) classes before grid optimization (e.g., see [49, 50]). Recently, Le and colleagues compared other optimization methods, including stochastic gradient descent (SGD), limited memory BFGS (L-BFGS), and conjugate gradient (CG) methods. They reported that L-BFGS and CG methods outperform the favored SGD method with respect to speed and accuracy. However, their overall conclusion was that performance of the optimization method varies with the research problem.
Certainly, a more thorough comparison of parameter settings for EvoSVM is required as this classifier has quite a few parameters. In particular, a study comparing performance as a function of kernel is essential in the context of classifying biomedical citations. Because generalization performance is “dominated by the chosen kernel function” (p. 1313), researchers are developing automatic methods for learning kernel functions. A promising nonparametric approach has been described, wherein a family of simple nonparametric kernel learning (NPKL) algorithms was presented. Simple NPKL algorithms are reportedly as accurate as other NPKL methods, but more efficient and scalable. This line of research is timely inasmuch as parametric SVMs do not scale well for many applications, selection of the appropriate kernel is not obvious, and parametric kernels may be inappropriate for this task.
Although the results from our study and  suggest that EvoSVM with a nonlinear kernel is promising, the runtimes are much longer than for cNB. In the near term, cNB may be the better choice to semi-automate citation screening, especially when the number of citations is large. Finally, in our opinion, conditional random fields  and latent Dirichlet allocation  might profitably be compared to variants of cNB and SVM.
We have demonstrated that machine learning classifiers can help identify NR studies eligible for full-text screening by systematic reviewers. We have further shown that optimization can markedly improve classifier performance. In our opinion, careful comparative research is needed before a classifier is chosen to semi-automate screening citations. Further, stability of performance for optimized classifiers needs to be demonstrated over various medical review topics.
We thank the anonymous reviewers for their thoughtful remarks. Additionally, we thank Drs. Thankam Thyvalikakath and Richard Oliver for help with labeling ameloblastoma citations; Ms. Anne Littlewood, Cochrane Oral Health Group Trials Search Coordinator, and Ms. Jill Foust, Reference Librarian, University of Pittsburgh Health Sciences Library System, for help with database searches; and Mr. Richard Wilson and Mr. Eugene Tseytlin for technical assistance. This research was supported, in part, by the Pittsburgh Biomedical Informatics Training Program (5 T15 LM/DE07059) and an NIH award from the US National Library of Medicine (K99LM010943).