Evidence from health services research (HSR) is currently thinly spread through many journals, making it difficult for health services researchers, managers and policy-makers to find research on clinical practice guidelines and the appropriateness, process, outcomes, cost and economics of health care services. We undertook to develop and test search terms to retrieve from the MEDLINE database HSR articles meeting minimum quality standards.
The retrieval performance of 7445 methodologic search terms and phrases in MEDLINE (the test) was compared with a hand search of the literature (the gold standard) for each issue of 68 journal titles for the year 2000 (a total of 25 936 articles). We determined sensitivity, specificity and precision (the positive predictive value) of the MEDLINE search strategies.
A majority of the articles that were classified as outcome assessment, but fewer than half of those in the other categories, were considered methodologically acceptable (no methodologic criteria were applied for cost studies). Combining individual search terms to maximize sensitivity, while keeping specificity at 50% or more, led to sensitivities in the range of 88.1% to 100% for several categories (specificities ranged from 52.9% to 97.4%). When terms were combined to maximize specificity while keeping sensitivity at 50% or more, specificities of 88.8% to 99.8% were achieved. When terms were combined to maximize sensitivity and specificity while minimizing the differences between the 2 measurements, most strategies for HSR categories achieved sensitivity and specificity of at least 80%.
Sensitive and specific search strategies were validated for retrieval of HSR literature from MEDLINE. These strategies have been made available for public use by the US National Library of Medicine at www.nlm.nih.gov/nichsr/hedges/search.html.
With the increasing emphasis on “using evidence” and “value for money” in health services, it is essential that researchers, clinicians, health system managers and public policy-makers be able to retrieve relevant, high-quality reports of health services research (HSR). Efficiently retrieved research evidence can aid in decision-making about which services to provide and in the resource allocation decisions to support those services, reducing the need for arbitrary decisions and aiding collaboration with clinicians and consumers.1 MEDLINE is a huge and expanding bibliographic resource that is freely available to all with Internet access. Yet the volume of the literature often overwhelms both clinicians and health system decision-makers.2,3 End-users of MEDLINE and other large bibliographic databases have difficulty executing precise searches2,3 and are often unaware of what kind of information to seek, where to find it3,4 and how to judge its quality.3
HSR has been defined as the scientific study of the effect of health care delivery; the organization and management of health care access, quality, cost and financing; and the evaluation of the impact of health services and technology (Allmang NA, Koonce TY. Health services research topic searches. Bethesda [MD]: National Library of Medicine; 2000. Unpublished report). More recently, HSR has been defined as the multidisciplinary field of scientific investigation that studies how social factors, financing systems, organizational structures and processes, health technologies and personal behaviours affect access to health care, the quality and cost of health care and, ultimately, health and well-being.5 HSR articles constitute only a tiny fraction of the MEDLINE database and are spread through a large number of journals; hence, MEDLINE searching is challenging. Conversely, journal browsing is impractical as a means of retrieving all relevant studies for a given question or staying abreast of the literature. Our aim was to develop methodologic search filters for MEDLINE to enable end-users to efficiently retrieve articles of relevance to clinical practice guidelines (CPGs) and the appropriateness, process, outcomes, cost and economics of health services.
We compared the retrieval performance of methodologic search terms and phrases in MEDLINE with a manual review of each article in each issue of 68 journal titles for the year 2000 for the study categories of appropriateness, process assessment, outcome assessment, CPGs, cost and economics of care.
Candidate content and methodologic terms (text words and Medical Subject Headings [MeSH] [exploded and nonexploded], publication types) were compiled by reviewing “gold standard” articles and their MEDLINE indexing, the definitions in Table 1 and the criteria in Table 2; by consulting experts in bibliographic database searching for HSR topics (mainly health sciences librarians); and by consulting experts in studying HSR-related questions. All suggested search terms were tested. The terms are available on request from the corresponding author.
A database of journals containing relevant studies of HSR was created. We looked for journals that published adequate numbers of relevant articles, such that manual searching of these journals would provide an adequate benchmark against which the MEDLINE searches could be compared. Three independently derived lists were examined to identify appropriate journals.
The first list comprised journals that are reviewed by 4 publications: ACP Journal Club, Evidence-Based Medicine, Evidence-Based Nursing and Evidence-Based Mental Health. These publications provide synopses of the articles in 170 journals with the intent of giving health care workers an overview of new developments in medicine and nursing; the journals have been selected on the basis of their yield of studies that meet explicit criteria for methodologic merit and relevance to clinical practice.11 This list of 170 journals was reduced to 161 by including only those that were indexed in MEDLINE and by conducting hand searches of issues for the year 2000 to determine which journals had published at least 1 study concerning the appropriateness, process or outcomes of care or CPGs.
The second set of journals was derived from a survey by Elixhauser and associates12 of HSR literature for studies of the economics of health care and the Science Citation Index listing of top-rated journals in the field for the category health care sciences and services. We deleted 2 pharmacy journals from this list because we judged them too narrow in focus for our purposes. Input by a convenience sample of policy-makers led to 2 journal nominations and resulted in 10 unique HSR journals (i.e., 10 titles that were not included in the first list). The third list consisted of journals identified in a report on MEDLINE searches for HSR written by 2 National Library of Medicine associate fellows (Allmang NA, Koonce TY. Health services research topic searches. Bethesda [MD]: National Library of Medicine; 2000. Unpublished report). Three HSR experts had selected the journals in that list. The 3 journal lists were merged and duplicates deleted to yield the final list of 68 journals (for the complete merged list, see the online appendix at www.cmaj.ca/cgi/content/full/171/10/1179/DC1).
Four research assistants reviewed each issue of the 68 journal titles for the calendar year 2000. Each journal article was read independently by 2 research assistants and coded for the following HSR categories, according to definitions derived using the MeSH scope notes (Table 1): appropriateness, process assessment, outcome assessment, CPGs, cost and economics. All original research and review articles that met the category definitions were evaluated for scientific merit on the basis of the criteria in Table 2, which were based on the “Users' guides to the medical literature” articles published in the Journal of the American Medical Association.6,7,8,9 Although empirical evidence of design-related bias is not directly available for the HSR categories, research concerning diagnosis13 and treatment14 shows that studies with methodologic shortcomings may overestimate the accuracy or the effect being studied. To pass the criteria, an explicit statement relevant to each criterion had to appear in the article, and all criteria for the appropriate category had to be met. When disagreements arose between the assessments of the 2 research assistants, a third research assistant, blinded to the other assessments, reviewed the article in question. If the coding of the third appraiser agreed with the coding of 1 of the original reviewers, that coding was taken to be correct; otherwise, the article was referred to a more senior member of the research team, who reviewed all coding and determined the final classification.
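The adjudication procedure described above can be sketched in code. This is a minimal illustration only: the function name `resolve_coding` and the two reviewer callbacks are hypothetical, and the codes are shown as plain strings although the actual coding covered several categories and criteria.

```python
from typing import Callable


def resolve_coding(
    code_a: str,
    code_b: str,
    third_reviewer: Callable[[], str],
    senior_reviewer: Callable[[str, str, str], str],
) -> str:
    """Return the final classification for one article.

    code_a, code_b: independent codes from the 2 original reviewers.
    third_reviewer: hypothetical callback; it takes no arguments because
        the third reviewer is blinded to the earlier assessments.
    senior_reviewer: hypothetical callback that sees all three codes.
    """
    if code_a == code_b:
        # No disagreement: the shared code stands.
        return code_a
    code_c = third_reviewer()
    if code_c in (code_a, code_b):
        # The third appraiser sided with one of the original reviewers.
        return code_c
    # Three different codes: refer to a senior member of the team.
    return senior_reviewer(code_a, code_b, code_c)
```

Passing the reviewers as callbacks keeps the third review lazy, so it is requested only when the first 2 codes differ, which mirrors the order of steps in the text.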
The candidate search terms were treated as “diagnostic tests” for sound studies, and the manual review of the literature was treated as the “gold standard.” The concepts of diagnostic test evaluation and library science were used to determine the sensitivity, specificity and precision of MEDLINE searches as shown in Table 3. The sensitivity for a given topic was defined as the proportion of high-quality articles for that topic that were retrieved, specificity was the proportion of low-quality or nonrelevant articles that were not retrieved, and precision was the proportion of retrieved articles that were of high quality. Search performance was determined by an iterative computer program for each single term. Single terms that yielded sensitivity greater than 25% and specificity greater than 75% were used to form 2-term Boolean “or” strategy combinations. Two-term strategies that yielded sensitivity greater than 75% and specificity greater than 50% were used in 3-term Boolean “or” strategy development to optimize sensitivity. Two-term strategies that yielded sensitivity greater than 50% and specificity greater than 75% were used in 3-term Boolean “or” strategy development to optimize specificity. We did not test “and” combinations because of their predictably adverse effect on sensitivity. We also did not test “and not” combinations because, when we have tested this approach for clinical topics, the performance of the search strategies was not materially affected.
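As an illustration of these definitions and of the first combination step, the following sketch computes the 3 measures from sets of article identifiers and forms 2-term "or" strategies from single terms that pass the stated thresholds. It is not the program used in the study: the function names, the set-based representation and the toy data are all assumptions made for the example.

```python
from itertools import combinations


def performance(retrieved: set, sound: set, all_ids: set) -> dict:
    """Sensitivity, specificity and precision of one search.

    retrieved: ids returned by the search (assumed to lie within all_ids).
    sound: ids that the hand search judged to meet the criteria.
    all_ids: every article id in the database.
    """
    a = len(retrieved & sound)            # retrieved, meets criteria
    b = len(retrieved - sound)            # retrieved, does not meet criteria
    c = len(sound - retrieved)            # missed, meets criteria
    d = len(all_ids - retrieved - sound)  # correctly not retrieved
    return {
        "sensitivity": a / (a + c) if a + c else 0.0,
        "specificity": d / (b + d) if b + d else 0.0,
        "precision": a / (a + b) if a + b else 0.0,
    }


def two_term_or_strategies(term_hits, sound, all_ids,
                           min_sens=0.25, min_spec=0.75):
    """Combine qualifying single terms with Boolean "or" (set union)."""
    qualifying = []
    for term in sorted(term_hits):
        p = performance(term_hits[term], sound, all_ids)
        if p["sensitivity"] > min_sens and p["specificity"] > min_spec:
            qualifying.append(term)
    return {
        (t1, t2): performance(term_hits[t1] | term_hits[t2], sound, all_ids)
        for t1, t2 in combinations(qualifying, 2)
    }


# Toy data (hypothetical): 20 articles, 4 of them methodologically sound.
all_ids = set(range(1, 21))
sound = {1, 2, 3, 4}
term_hits = {
    "term_a": {1, 2, 5},
    "term_b": {3, 4, 6},
    "term_c": set(range(1, 15)),  # very sensitive, but too unspecific
}
pairs = two_term_or_strategies(term_hits, sound, all_ids)
```

In this toy example only `term_a` and `term_b` pass the single-term thresholds, and their union retrieves all 4 sound articles. The same function could be applied again with the 2-term thresholds given above to sketch the 3-term step.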
MEDLINE searches were conducted through Ovid (Ovid Technologies, New York; http://gateway2.ovid.com). For the defined subset of journal issues included in the database, we downloaded the full MEDLINE record, including full citation, abstract and MeSH terms. The MEDLINE records were then matched to the corresponding records in the hand-search file, by means of unique identifiers.
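The matching step can be pictured as a join on the shared identifier. The sketch below is an assumption about how such a match might be done, not a description of the software actually used; the field names `uid`, `title` and `category` are hypothetical.

```python
def match_records(medline_records, hand_search_records, key="uid"):
    """Pair each MEDLINE record with its hand-search record by identifier.

    Returns (matched, unmatched): merged dictionaries for records found in
    both files, and MEDLINE records with no hand-search counterpart.
    """
    hand_by_id = {rec[key]: rec for rec in hand_search_records}
    matched, unmatched = [], []
    for rec in medline_records:
        hand = hand_by_id.get(rec[key])
        if hand is None:
            unmatched.append(rec)
        else:
            # Keep the MEDLINE fields and add the hand-search coding.
            matched.append({**rec, **hand})
    return matched, unmatched


# Hypothetical records for illustration only.
medline = [
    {"uid": "A1", "title": "Example article one"},
    {"uid": "A2", "title": "Example article two"},
]
hand_search = [{"uid": "A1", "category": "outcome assessment"}]
matched, unmatched = match_records(medline, hand_search)
```

Returning the unmatched records separately makes it easy to check that no article in the defined journal subset was silently dropped.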
The HSR database included 25 936 articles. Of these, 994 (3.8%) met our criteria for one or more of the HSR categories (Table 4). Of the 795 articles classified as relevant to appropriateness, process assessment, outcome assessment, CPGs and economics, 318 (40.0%) met our methodologic criteria (no methodologic criteria were applied for cost studies). The numbers of articles were adequate for precise estimates of search performance for all but the appropriateness category, which had only 20 articles in total and only 5 articles that met the criteria.
In total, 7445 unique search terms were tested, of which 5330 returned results. Predictably, single search terms had lower yields than 2-term strategies, but the difference between 2- and 3-term combinations was small. As expected, combining terms increased the sensitivity over single search terms. A somewhat unexpected finding was that some combinations of terms for each category of studies also led to increases in specificity and precision. Thus, for brevity, only 3-term strategies are presented here (see Table 5 for the denominators of the data presented in the tables for these strategies, as detailed below), unless 2-term or single-term strategies performed as well as the best 3-term strategy. The best sensitivities ranged from 95% to 100% for methodologically sound articles for all categories, including appropriateness studies (Table 6), but the estimate for the latter category was imprecise, as only 5 studies in the database met this criterion. Precision was 9.5% or less for all searches, a consequence of the low prevalence of HSR even in these selected journals and the suboptimal specificities for the most sensitive searches.
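The dependence of precision on prevalence follows directly from the definitions of the 3 measures. The short sketch below works through the arithmetic; the function name and the input values are hypothetical and chosen only for illustration, not taken from the study tables.

```python
def precision_from_rates(sensitivity: float, specificity: float,
                         prevalence: float) -> float:
    """Expected precision (positive predictive value) of a search.

    prevalence: proportion of articles in the database that are sound.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    if true_pos + false_pos == 0:
        return 0.0  # nothing would be retrieved
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs: 1% of articles are sound.
sensitive_search = precision_from_rates(0.95, 0.80, 0.01)
specific_search = precision_from_rates(0.95, 0.99, 0.01)
```

With these illustrative values, the search with 80% specificity yields a precision of under 5%, whereas raising specificity to 99% brings precision close to 50%. When sound articles are rare, even a small false-positive rate produces far more false than true retrievals.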
Terms that yielded the best specificity while maintaining sensitivity of 50% or more for each HSR category are presented in Table 7. In achieving the highest specificity for combined terms, sensitivity decreased in all HSR categories while precision rose somewhat.
The combinations of terms that optimized both sensitivity and specificity while minimizing the differences between the 2 measurements for each HSR category are presented in Table 8. These strategies provide the best separation of relevant from nonrelevant retrievals, but do so without regard for whether sensitivity or specificity is affected.
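The 3 ways of choosing a "best" strategy can be contrasted in a small sketch. The 50% floors are those stated for the sensitive and specific strategies; the exact rule used for the balanced strategy is not given here, so the sketch assumes one plausible reading, namely maximizing the smaller of the 2 measures, which rewards high values and penalizes a gap between them. The function name and candidate values are hypothetical.

```python
def pick_strategies(candidates: dict, floor: float = 0.5) -> dict:
    """Choose strategies from {name: (sensitivity, specificity)}."""
    # Most sensitive strategy whose specificity stays at or above the floor.
    sens_pool = [n for n, (se, sp) in candidates.items() if sp >= floor]
    # Most specific strategy whose sensitivity stays at or above the floor.
    spec_pool = [n for n, (se, sp) in candidates.items() if se >= floor]
    return {
        "most_sensitive": max(
            sens_pool, key=lambda n: candidates[n], default=None),
        "most_specific": max(
            spec_pool, key=lambda n: candidates[n][::-1], default=None),
        # Assumed balance rule: maximize min(sensitivity, specificity).
        "balanced": max(
            candidates, key=lambda n: min(candidates[n]), default=None),
    }


# Hypothetical candidate strategies for one category.
candidates = {
    "A": (0.98, 0.55),
    "B": (0.60, 0.97),
    "C": (0.85, 0.83),
    "D": (0.99, 0.30),  # excluded from "most_sensitive": specificity < 50%
}
chosen = pick_strategies(candidates)
```

Other balance rules, such as minimizing the absolute difference subject to a minimum for each measure, would be equally defensible and could select a different strategy.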
We have documented search strategies that can help to discriminate relevant from nonrelevant articles for a number of categories of importance to those interested in HSR. Those who are interested in all articles on a given topic and who have the time to sort out irrelevant articles will be best served by the most sensitive strategies (Table 6). Those with little time who are looking for “a few good articles” on a given topic will likely be best served by the most specific strategies (Table 7). Use of these strategies is straightforward, as the National Library of Medicine has translated our most sensitive and most specific strategies for public use at www.nlm.nih.gov/nichsr/hedges/search.html. The best strategies for optimizing the trade-off between sensitivity and specificity are shown in Table 8. When sensitivity was maximized for combinations of terms, specificities rose considerably relative to those for individual search terms. For instance, for sensitive searches for high-quality appropriateness articles, the combination of terms resulted in marked increases, to near-perfect specificity. When specificity was maximized for combinations of terms, while keeping sensitivity of at least 50%, we observed very high specificities for almost all HSR categories. Search performance, including the trade-off between sensitivity and specificity, was generally comparable to that found for topics that are of more direct interest to clinicians, such as treatment, diagnosis, prognosis and etiology.15 However, the methodologic standards for clinical topics are generally much higher, so that the literature retrieved provides more robust answers.
Few search filters have been developed for HSR, and those that exist cover only a small range of topics of direct relevance to the field. A pilot project created preliminary search strategies for economics and qualitative research in the HSR literature in 2000 (Allmang NA, Koonce TY. Health services research topic searches. Bethesda [MD]: National Library of Medicine; 2000. Unpublished report) but lacked a gold standard against which to assess the quality of the searches. Search filters developed for the National Health Service Economic Evaluation Database,16 the Health Economic Evaluation Database17 and the London School of Economics (LSE) Strategy,18 which are designed to retrieve economic evaluation articles, were compared with one another, to generate a relative standard, giving estimates of sensitivity of 72% and specificity of 75% for the LSE strategy in MEDLINE.18 Our findings for economics articles appear to be somewhat better but are not directly comparable, as our gold standard was a hand search. Additional filters have been designed to retrieve articles on outcome measurement19 (just 3 strategies based on hand searches in just 2 journals) and quality of care20 (in which only precision was measured).
Our study had some limitations. First, we could not find secure methodologic features for the HSR categories of appropriateness and cost that lend themselves to retrieving the best studies. Second, the number of appropriateness articles in our database was small, giving rise to imprecise estimates of search performance for that category. Third, our database was not large enough to permit test–retest searches to validate the strategies. Fourth, we have not studied the effect of combining research filters with content terms (such as a disease, technology or type of health service) and thus cannot report on the characteristics of such searches; such a study would require considerably more resources than were available to us. Fifth, we tested only Ovid's search engine for MEDLINE; other search engines, including the PubMed search engine of the National Library of Medicine, may handle terms somewhat differently, with slightly differing results.
The best search strategies found in our research leave some room for improvement. Better search performance may require maturation of research methods for HSR, similar to those for some forms of clinical research, and better indexing. Improvements may also be possible through more sophisticated search strategies, for example, with more search terms, use of other Boolean operators (“and,” “and not”), natural language processing and multivariate statistical techniques such as logistic regression and discriminant function analysis. In our limited experience with the use of other Boolean operators and logistic regression for clinical topics such as diagnostic tests,21 we have observed trade-offs between sensitivity and specificity and no substantive improvements with more complex search strategies, but we have not attempted these approaches for HSR topics. We look forward to other researchers taking up the challenge of developing better search strategies for HSR.
This study was conducted under a contract from the National Information Center on Health Services Research and Health Care Technology (NICHSR). We thank Ione Auston for encouragement and constructive comments on study reports. Ovid Technologies Inc. relaxed its limits on search volumes. Vivek Goel, Professor in the Department of Health Policy, Management and Evaluation, University of Toronto, provided assistance concerning health services managers and their information needs, as well as a list of key journals for publication of health services research.
This article has been peer reviewed.
Contributors: Nancy Wilczynski and Brian Haynes contributed to the conception and design of the study and to the analysis and interpretation of data. Nancy Wilczynski, Ravi Ramkissoonsingh and Alexandra Arnold-Oatley contributed to the acquisition of the data. John Lavis and Ravi Ramkissoonsingh contributed to the analysis and interpretation of the data. Nancy Wilczynski, Brian Haynes, Ravi Ramkissoonsingh and Alexandra Arnold-Oatley were involved in drafting the article. All authors were involved in critically revising the article for important intellectual content and gave final approval of the version submitted to be published.
Competing interests: None declared.
Correspondence to: Dr. R. Brian Haynes, Department of Clinical Epidemiology and Biostatistics, Room 2C10B, Health Sciences Centre, McMaster University Faculty of Health Sciences, 1200 Main St. W, Hamilton ON L8N 3J5; fax 905 577-0017; bhaynes@mcmaster.ca