A peer-reviewed journal would not survive without the generous time and insightful comments of the reviewers, whose efforts often go unrecognized. Although final decisions are always editorial, they are greatly facilitated by the deeper technical knowledge, scientific insights, understanding of social consequences, and passion that reviewers bring to our deliberations. For these reasons, the Editors-in-Chief and staff of the journal warmly thank the 219 reviewers whose comments helped to shape Systematic Reviews, for their invaluable assistance with review of manuscripts for the journal in Volume 3 (2014).
Systematic reviews should build on a protocol that describes the rationale, hypothesis, and planned methods of the review; few reviews report whether a protocol exists. Detailed, well-described protocols can facilitate the understanding and appraisal of the review methods, as well as the detection of modifications to methods and selective reporting in completed reviews. We describe the development of a reporting guideline, the Preferred Reporting Items for Systematic reviews and Meta-Analyses for Protocols 2015 (PRISMA-P 2015). PRISMA-P consists of a 17-item checklist intended to facilitate the preparation and reporting of a robust protocol for the systematic review. Funders and those commissioning reviews might consider mandating the use of the checklist to facilitate the submission of relevant protocol information in funding applications. Similarly, peer reviewers and editors can use the guidance to gauge the completeness and transparency of a systematic review protocol submitted for publication in a journal or other medium.
Apply and compare two methods that identify signals for the need to update systematic reviews, using three Evidence-based Practice Center reports on omega-3 fatty acids as test cases.
Study Design and Setting
We applied the RAND method, which uses domain (subject matter) expert guidance, and a modified Ottawa method, which uses quantitative and qualitative signals. For both methods, we conducted focused electronic literature searches of recent studies using the key terms from the original reports. We assessed the agreement between the methods and qualitatively assessed the merits of each system.
Agreement between the two methods was “substantial” or better (kappa > 0.62) in three of the four systematic reviews. Overall agreement between the methods was “substantial” (kappa = 0.64, 95% confidence interval [CI] 0.45–0.83).
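For readers unfamiliar with the statistic, the following is a minimal sketch of how Cohen's kappa can be computed for two methods that classify the same set of conclusions. The function name, the labels, and the ten paired signals are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters (here, two methods) labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both methods.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both methods gave one identical label to every item
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical signals for ten conclusions (not taken from the reports).
rand_signals = ["update", "update", "valid", "valid", "valid",
                "update", "valid", "valid", "update", "valid"]
ottawa_signals = ["update", "valid", "valid", "valid", "valid",
                  "update", "valid", "valid", "update", "update"]
kappa = cohens_kappa(rand_signals, ottawa_signals)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be much lower than the simple percentage of matching labels.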
The RAND and modified Ottawa methods appear to provide similar signals for the possible need to update systematic reviews in this pilot study. Future evaluation with a broader range of clinical topics and eventual comparisons between signals to update reports and the results of full evidence review updates will be needed. We propose a hybrid approach combining the best features of both methods, which should allow efficient review and assessment of the need to update.
Evidence-based methodology; Systematic reviews; Comparative effectiveness reviews; Omega-3 fatty acids; Cardiovascular disease risk factors; Cancer; Cognitive function
The Health Information Technology for Economic and Clinical Health (HITECH) Act subsidizes implementation by hospitals of electronic health records with computerized provider order entry (CPOE), which may reduce patient injuries caused by medication errors (preventable adverse drug events, pADEs). Effects on pADEs have not been rigorously quantified, and effects on medication errors have been variable. The objectives of this analysis were to assess the effectiveness of CPOE at reducing pADEs in hospital-related settings, and examine reasons for heterogeneous effects on medication errors.
Articles were identified using MEDLINE, Cochrane Library, Econlit, web-based databases, and bibliographies of previous systematic reviews (September 2013). Eligible studies compared CPOE with paper-order entry in acute care hospitals, and examined diverse pADEs or medication errors. Studies on children or with limited event-detection methods were excluded. Two investigators extracted data on events and factors potentially associated with effectiveness. We used random effects models to pool data.
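The text does not state which random effects estimator was used. As an illustration only, the sketch below assumes the common DerSimonian-Laird approach applied to log risk ratios; the function names and the three study counts are hypothetical.

```python
import math


def log_rr_and_variance(events_t, n_t, events_c, n_c):
    """Log risk ratio and its approximate variance from 2x2 counts."""
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    variance = 1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c
    return log_rr, variance


def pool_random_effects(log_rrs, variances):
    """DerSimonian-Laird random-effects pooling (an assumed estimator)."""
    k = len(log_rrs)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_rrs)) / sum_w
    # Cochran's Q measures dispersion around the fixed-effect estimate.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_rrs))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, log_rrs)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "rr": math.exp(pooled),
        "ci_low": math.exp(pooled - 1.96 * se),
        "ci_high": math.exp(pooled + 1.96 * se),
        "tau2": tau2,
    }


# Hypothetical counts: (events with CPOE, n with CPOE, events with paper, n with paper).
studies = [(12, 400, 30, 410), (25, 900, 41, 880), (8, 150, 22, 160)]
effects = [log_rr_and_variance(*s) for s in studies]
result = pool_random_effects([e[0] for e in effects], [e[1] for e in effects])
print(result)
```

Pooling is done on the log scale and exponentiated at the end, so the confidence interval for the risk ratio is asymmetric around the point estimate.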
Sixteen studies addressing medication errors met pooling criteria; six also addressed pADEs. Thirteen studies used pre-post designs. Compared with paper-order entry, CPOE was associated with half as many pADEs (pooled risk ratio (RR) = 0.47, 95% CI 0.31 to 0.71) and medication errors (RR = 0.46, 95% CI 0.35 to 0.60). Regarding reasons for heterogeneous effects on medication errors, five intervention factors and two contextual factors were sufficiently reported to support subgroup analyses or meta-regression. Differences between commercial versus homegrown systems, presence and sophistication of clinical decision support, hospital-wide versus limited implementation, and US versus non-US studies were not significant, nor was timing of publication. Higher baseline rates of medication errors predicted greater reductions (P < 0.001). Other context and implementation variables were seldom reported.
In hospital-related settings, implementing CPOE is associated with a greater than 50% decline in pADEs, although the studies used weak designs. Decreases in medication errors are similar and robust to variations in important aspects of intervention design and context. This suggests that CPOE implementation, as subsidized under the HITECH Act, may benefit public health. More detailed reporting of the context and process of implementation could shed light on factors associated with greater effectiveness.
Medical order entry systems; Drug toxicity/prevention and control; Hospitals; Adverse drug event; Medication error
There is an increasing push for ‘evidence-based’ decision making in global health policy circles. However, at present there are no agreed upon standards or guidelines for how to evaluate evidence in global health. Recent evaluations of existing evidence frameworks that could serve such a purpose have identified details of program context and project implementation as missing components needed to inform policy. We performed a pilot study to assess the current state of reporting of context and implementation in studies of global health interventions.
We identified three existing criteria sets for implementation reporting and selected from them 10 criteria potentially relevant to the needs of policy makers in global health contexts. We applied these 10 criteria to 15 articles included in the evidence base for three global health interventions chosen to represent a diverse set of advocated global health programs or interventions: household water chlorination, prevention of mother-to-child transmission of HIV, and lay community health workers to reduce child mortality. We used a good-fair-poor/none scale for the ratings.
The proportion of criteria for which reporting was poor/none ranged from 11% to 54% with an average of 30%. Eight articles had ‘good’ or ‘fair’ documentation for greater than 75% of criteria, while five articles had ‘poor or none’ documentation for 50% of criteria or more. Examples of good reporting were identified.
Reporting of context and implementation information in studies of global health interventions is mostly fair or poor, and highly variable. The idiosyncratic variability in reporting indicates that global health investigators need more guidance about what aspects of context and implementation to measure and how to report them. This lack of context and implementation information is a major gap in the evidence needed by global health policy makers to reach decisions.
To develop and validate the Geriatric CompleXity of Care Index (GXI), a comorbidity index of medical, geriatric, and psychosocial conditions that addresses disease severity and intensity of ambulatory care for older adults with chronic conditions.
Development phase: variable selection and rating by clinician panel. Validation phase: medical record review and secondary data analysis.
Assessing the Care of Vulnerable Elders-2 study.
Six hundred forty-four older (≥75) individuals receiving ambulatory care.
Development: 32 conditions categorized according to severity, resulting in 117 GXI variables. A panel of clinicians rated each GXI variable with respect to the added difficulty of providing primary care for an individual with that condition. Validation: Modified versions of previously validated comorbidity measures (simple count, Charlson, Medicare Hierarchical Condition Category), longitudinal clinical outcomes (functional decline, survival), intensity of ambulatory care (primary, specialty care visits, polypharmacy, number of eligible quality indicators (NQI)) over 1 year of care.
The most-morbid individuals (according to quintiles of GXI) had more visits (7.0 vs 3.7 primary care, 6.2 vs 2.4 specialist), polypharmacy (14.3% vs 0% had ≥14 medications), and greater NQI (33 vs 25) than the least-morbid individuals. Of the four comorbidity measures, the GXI was the strongest predictor of primary care visits, polypharmacy, and NQI (p<.001, controlling for age, sex, function-based vulnerability).
Older adults with complex care needs, as measured by the GXI, have healthcare needs above what previously employed comorbidity measures captured. Healthcare systems could use the GXI to identify the most complex elderly adults and appropriately reimburse primary providers caring for older adults with the most complex care needs for providing additional visits and coordination of care.
ambulatory care; utilization; comorbidity
Systematic reviews are a cornerstone of evidence-based medicine but are useful only if up-to-date. Methods for detecting signals of when a systematic review needs updating have face validity, but none of the proposed methods has been assessed for predictive validity.
By 2009, the AHRQ Comparative Effectiveness Review program had produced 13 comparative effectiveness reviews (CERs), a subcategory of systematic reviews. Eleven of these were assessed in 2009 using a surveillance system to determine the degree to which individual conclusions were out of date and to assign a priority for updating each report. Four CERs were judged to be a high priority for updating, four CERs were judged to be medium priority for updating, and three CERs were judged to be low priority for updating. AHRQ then commissioned full update reviews for 9 of these 11 CERs. Where possible, we matched the original conclusions with their corresponding conclusions in the update reports, and compared the congruence between these pairs with our original predictions about which conclusions in each CER remained valid. We then classified the concordance of each pair as good, fair, or poor. We also made a summary determination of the priority for updating each CER based on the actual changes in conclusions in the updated report, and compared these determinations with the earlier assessments of priority.
The 9 CERs included 149 individual conclusions, 84% with matches in the update reports. Across reports, 83% of matched conclusions had good concordance, and 99% had good or fair concordance. The one instance of poor concordance was partially attributable to the publication of new evidence after the surveillance signal searches had been done. Both CERs originally judged as being low priority for updating had no substantive changes to their conclusions in the actual updated report. The agreement on overall priority for updating between prediction and actual changes to conclusions was Kappa = 0.74.
These results provide some support for the validity of a surveillance system for detecting signals indicating when a systematic review needs updating.
Methods; Systematic reviews; Updating
A paucity of data exists addressing the quality of care provided to women with pelvic organ prolapse (POP). We sought to develop a means to measure this quality through the development of quality-of-care indicators (QIs).
QIs were modeled after those previously described in the Assessing the Care of Vulnerable Elders (ACOVE) project. The indicators were then presented to a panel of nine experts. Using the RAND Appropriateness Method, we analyzed each indicator’s preliminary rankings. A forum was then held in which each indicator was thoroughly discussed by the panelists as a group, after which panelists individually re-rated the indicators. QIs with median scores of at least seven were considered valid.
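The validity rule described above (a median panel score of at least seven) can be sketched as follows. The indicator names and ratings are hypothetical, and a nine-member panel rating on a 1 to 9 scale is assumed for the example.

```python
from statistics import median

VALIDITY_THRESHOLD = 7  # median score required for an indicator to count as valid


def valid_indicators(panel_ratings):
    """Return the indicators whose median panel rating meets the threshold."""
    return [
        name
        for name, ratings in panel_ratings.items()
        if median(ratings) >= VALIDITY_THRESHOLD
    ]


# Hypothetical second-round ratings from nine panelists (assumed 1-9 scale).
second_round = {
    "pessary_follow_up_exam": [8, 7, 9, 7, 6, 8, 7, 5, 8],
    "routine_prolapse_screening": [3, 6, 7, 5, 4, 6, 8, 5, 6],
}
print(valid_indicators(second_round))
```

Using the median rather than the mean keeps a single outlying panelist from moving an indicator across the threshold.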
QIs were developed that addressed screening, diagnosis, work-up, and both nonsurgical and surgical management. Areas of controversy included whether screening should be performed to identify prolapse, whether pessary users should undergo a vaginal exam by a health professional every six months versus annually, and whether a colpocleisis should be offered to older women planning to undergo surgery for POP. Fourteen of 21 potential indicators were rated as valid for pelvic organ prolapse (median score ≥ 7).
We developed and rated fourteen potential quality indicators for the care of women with POP. Once these QIs are tested for feasibility they can be used on a larger scale to measure and compare the care provided to women with prolapse in different clinical settings.
Delphi Method; RAND Appropriateness method; pelvic floor disorders
To develop a means to measure the quality of care provided to women treated for urinary incontinence (UI) through the development of quality-of-care indicators (QIs).
We performed an extensive literature review to develop a set of potential quality indicators for the management of urinary incontinence. QIs were modeled after those previously described in the Assessing the Care of Vulnerable Elders (ACOVE) project. Nine experts ranked the indicators on a nine-point scale for both validity and feasibility. We analyzed preliminary rankings of each indicator using the RAND Appropriateness Method. A forum was then held in which each indicator was thoroughly discussed by the panelists as a group, after which the indicators were rated a second time individually using the same nine-point scale.
QIs were developed that addressed screening, diagnosis, work-up, and both non-surgical and surgical management. Areas of controversy included whether routine screening for incontinence should be performed, whether urodynamics should be performed before non-surgical management is initiated, and whether cystoscopy should be part of the pre-operative work-up of uncomplicated stress incontinence. Following the expert panel discussion, 27 of 40 potential indicators were determined to be valid for UI with a median score of at least seven on a nine-point scale.
We identified 27 quality indicators for the care of women with UI. Once these QIs are pilot-tested for feasibility, they will be applied on a larger scale to measure the quality of care provided to women with UI in the United States.
Quality Indicators; Outcomes; RAND Appropriateness Method; Stress Urinary Incontinence; Urge Urinary Incontinence
Continuous quality improvement (CQI) methods are foundational approaches to improving healthcare delivery. Publications using the term CQI, however, are methodologically heterogeneous, and labels other than CQI are used to signify relevant approaches. Standards for identifying the use of CQI based on its key methodological features could enable more effective learning across quality improvement (QI) efforts. The objective was to identify essential methodological features for recognizing CQI.
Previous work with a 12-member international expert panel identified reliably abstracted CQI methodological features. We tested which features met rigorous a priori standards as essential features of CQI using a three-phase online modified-Delphi process.
Primarily United States and Canada.
119 QI experts randomly assigned to four online panels.
Participants rated CQI features and discussed their answers using online, anonymous and asynchronous discussion boards. We analyzed ratings quantitatively and discussion threads qualitatively.
Main outcome measure(s)
Panel consensus on definitional CQI features.
Seventy-nine (66%) panelists completed the process. Thirty-three completers self-identified as QI researchers, 18 as QI practitioners and 28 as both equally. The features ‘systematic data guided activities,’ ‘designing with local conditions in mind’ and ‘iterative development and testing’ met a priori standards as essential CQI features. Qualitative analyses showed cross-cutting themes focused on differences between QI and CQI.
We found consensus among a broad group of CQI researchers and practitioners on three features as essential for identifying QI work more specifically as ‘CQI.’ All three features are needed as a minimum standard for recognizing CQI methods.
continuous quality improvement; quality improvement; consultants; health care organization
There are both theoretical and empirical reasons to believe that design and execution factors are associated with bias in controlled trials. Statistically significant moderator effects, such as the effect of trial quality on treatment effect sizes, are rarely detected in individual meta-analyses, and evidence from meta-epidemiological datasets is inconsistent. The reasons for the disconnect between theory and empirical observation are unclear. The study objective was to explore the power to detect study level moderator effects in meta-analyses.
We generated meta-analyses using Monte-Carlo simulations and investigated the effect of number of trials, trial sample size, moderator effect size, heterogeneity, and moderator distribution on power to detect moderator effects. The simulations provide a reference guide for investigators to estimate power when planning meta-regressions.
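A heavily simplified sketch of this kind of simulation is shown below. It is not the authors' simulation design: it assumes a binary moderator, equal trial sizes, a standardized mean difference whose sampling variance is approximated as 4/N, a known between-trial variance, and a two-sided z-test at the 5% level. All function and parameter names are hypothetical.

```python
import math
import random


def simulate_power(n_trials, n_per_trial, moderator_effect, tau2,
                   prop_high=0.5, n_sims=2000, seed=0):
    """Estimate power to detect a binary moderator by Monte Carlo simulation."""
    rng = random.Random(seed)
    # Split trials into those with and without the moderator feature.
    n_high = max(1, min(n_trials - 1, round(n_trials * prop_high)))
    n_low = n_trials - n_high
    sampling_var = 4.0 / n_per_trial      # assumed approximation for an SMD
    total_var = sampling_var + tau2       # within-trial plus residual heterogeneity
    sd = math.sqrt(total_var)
    # With equal variances, the weighted meta-regression slope for a binary
    # moderator reduces to the difference between the two group means.
    se_diff = math.sqrt(total_var * (1.0 / n_high + 1.0 / n_low))
    significant = 0
    for _ in range(n_sims):
        mean_high = sum(rng.gauss(moderator_effect, sd) for _ in range(n_high)) / n_high
        mean_low = sum(rng.gauss(0.0, sd) for _ in range(n_low)) / n_low
        if abs((mean_high - mean_low) / se_diff) > 1.96:
            significant += 1
    return significant / n_sims


# Hypothetical scenarios: 40 trials of 100 participants, moderator effect 0.2.
power_low_het = simulate_power(40, 100, 0.2, tau2=0.0)
power_high_het = simulate_power(40, 100, 0.2, tau2=0.285)
print(power_low_het, power_high_het)
```

Varying `tau2`, `n_trials`, and `prop_high` in this sketch gives a feel for how each factor listed above moves the estimated power.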
The power to detect moderator effects in meta-analyses, for example, effects of study quality on effect sizes, is largely determined by the degree of residual heterogeneity present in the dataset (noise not explained by the moderator). Larger trial sample sizes increase power only when residual heterogeneity is low. A large number of trials or low residual heterogeneity are necessary to detect effects. When the proportion of the moderator is not equal (for example, 25% ‘high quality’, 75% ‘low quality’ trials), power of 80% was rarely achieved in investigated scenarios. Application to an empirical meta-epidemiological dataset with substantial heterogeneity (I² = 92%, τ² = 0.285) estimated >200 trials are needed for a power of 80% to show a statistically significant result, even for a substantial moderator effect (0.2), and the number of trials with the less common feature (for example, few ‘high quality’ studies) affects power extensively.
Although study characteristics, such as trial quality, may explain some proportion of heterogeneity across study results in meta-analyses, residual heterogeneity is a crucial factor in determining when associations between moderator variables and effect sizes can be statistically detected. Detecting moderator effects requires more powerful analyses than are employed in most published investigations; hence negative findings should not be considered evidence of a lack of effect, and investigations are not hypothesis-proving unless power calculations show sufficient ability to detect effects.
Meta-analysis; Power; Heterogeneity; Meta-epidemiological dataset; Randomized controlled trial (RCT)
Systematic reviews (SRs) can become outdated as new evidence emerges over time. Organizations that produce SRs need a surveillance method to determine when reviews are likely to require updating. This report describes the development and initial results of a surveillance system to assess SRs produced by the Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Center (EPC) Program.
Twenty-four SRs were assessed using existing methods that incorporate limited literature searches, expert opinion, and quantitative methods for the presence of signals triggering the need for updating. The system was designed to begin surveillance six months after the release of the original review, and thenceforth every six months for any review not classified as being a high priority for updating. The outcome of each round of surveillance was a classification of the SR as being low, medium or high priority for updating.
Twenty-four SRs underwent surveillance at least once, and ten underwent surveillance a second time during the 18 months of the program. Two SRs were classified as high, five as medium, and 17 as low priority for updating. The time lapse between the searches conducted for the original reports and the updated searches (search time lapse, STL) ranged from 11 months to 62 months: the STLs for the high priority reports were 29 months and 54 months; those for medium priority reports ranged from 19 to 62 months; and those for low priority reports ranged from 11 to 33 months. Neither the STL nor the number of new relevant articles was perfectly associated with a signal for updating. Challenges of implementing the surveillance system included determining what constituted the actual conclusions of an SR that required assessing, and sometimes poor response rates of experts.
In this system of regular surveillance of 24 systematic reviews on a variety of clinical interventions produced by a leading organization, about 70% of reviews were determined to have a low priority for updating. Evidence suggests that the appropriate interval for surveillance is yearly rather than the six months used in this project.
Systematic review; Updating; Surveillance
Jill Luoto and colleagues apply different frameworks to the same body of evidence for three advocated global health interventions and compare the ratings and policy recommendations resulting from each.
Validation of process-of-care measures includes testing for a link with outcomes. We aimed to determine whether delivery of better quality of care for urinary incontinence (UI) and falls is associated with improved patient-reported outcomes.
Retrospective cohort study of older (age ≥ 75) ambulatory care participants in the Assessing Care of Vulnerable Elders-2 (ACOVE-2) study who screened positive for UI (n=133) and/or falls/fear of falling (n=328).
We measured composite quality scores (% quality indicators [QIs] passed per patient) and change in Incontinence Quality of Life (IQOL, range 0–100) scores or Falls Efficacy Scale (FES, range 10–40) before and after care was delivered (mean 10 months). Because eligibility for falls treatment QIs was dependent on the physician performing a physical exam, we calculated an alternative “Common Pathway” quality indicator (CPQI) score that assigned a failing score for falls treatment to unexamined patients.
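The two scores described above can be sketched as follows. The data layout, the `stage` labels, and the example patient are assumptions made for illustration, and the exact scoring rules used in the study may differ.

```python
def composite_score(indicators):
    """Percentage of eligible quality indicators (QIs) that the patient passed."""
    eligible = [qi for qi in indicators if qi["eligible"]]
    if not eligible:
        return None  # no eligible QIs, so no score can be computed
    return 100.0 * sum(qi["passed"] for qi in eligible) / len(eligible)


def common_pathway_score(indicators, examined):
    """Variant that counts treatment QIs as failed when no exam was performed."""
    adjusted = []
    for qi in indicators:
        if qi["stage"] == "treatment" and not examined:
            # Assumed rule: an unexamined patient fails the treatment QI
            # instead of being dropped as ineligible.
            adjusted.append({**qi, "eligible": True, "passed": False})
        else:
            adjusted.append(qi)
    return composite_score(adjusted)


# Hypothetical patient who was asked about falls but never examined.
patient = [
    {"name": "falls_history", "stage": "diagnosis", "eligible": True, "passed": True},
    {"name": "gait_exam", "stage": "diagnosis", "eligible": True, "passed": False},
    {"name": "exercise_program", "stage": "treatment", "eligible": False, "passed": False},
]
original = composite_score(patient)
common_pathway = common_pathway_score(patient, examined=False)
print(original, common_pathway)
```

In this example the common pathway variant lowers the score because the treatment indicator, skipped by the original score, now counts as a failure.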
Each 10% increment in receipt of recommended care for UI was associated with a 1.4 point improvement in IQOL score (p=.01). Falls quality was not related to FES score using the original composite score; however the CPQI score was related to FES (+.4 point FES per 10% increment in falls quality, p=.01).
Better quality of care for falls and UI was associated with measurable improvement in patient-reported outcomes in less than one year. The connection between process and outcome required consideration of the interdependence between diagnosis and treatment in the falls QIs. The process-outcome link demonstrated for UI and falls underscores the importance of improving care in these areas.
Care for falls and urinary incontinence (UI) among older patients is inadequate. One possible explanation is that physicians provide less recommended care to patients who are not as concerned about their falls and UI.
To test whether patient-reported severity for two geriatric conditions, falls and UI, is associated with quality of care.
Prospective cohort study of elders with falls and/or fear of falling (n=384) and UI (n=163).
Participants in the Assessing Care of Vulnerable Elders-2 Study (2002–3), which evaluated an intervention to improve the care for falls and UI among older (age ≥75) ambulatory care patients with falls/fear of falling or UI.
Falls Efficacy Scale (FES) and the Incontinence Quality of Life (IQOL) surveys measured at baseline, quality of care measured by a 13-month medical record abstraction.
There was a small difference in falls quality scores across the range of FES, with greater patient-perceived falls severity associated with better odds of passing falls quality indicators (OR 1.11 (95% CI 1.02–1.21) per 10-point increment in FES). Greater patient-perceived UI severity (IQOL score) was not associated with better quality of UI care.
Although older persons with greater patient-perceived falls severity receive modestly better quality of care, those with more distressing incontinence do not. For both conditions, however, even the most symptomatic patients received less than half of recommended care. Low patient-perceived severity of condition is not the basis of poor care for falls and UI.
Quality of care; Urinary Incontinence; Falls
Clinical practice guidelines are one of the foundations of efforts to improve health care. In 1999, we authored a paper about methods to develop guidelines. Since it was published, guideline development has progressed in both methods and necessary procedures, and the context for guideline development has changed with the emergence of guideline clearing houses and large scale guideline production organisations (such as the UK National Institute for Health and Clinical Excellence). It therefore seems timely to update and extend our earlier paper in a series of three articles. In this second paper, we discuss issues of identifying and synthesizing evidence: deciding what type of evidence and outcomes to include in guidelines; integrating values into a guideline; incorporating economic considerations; synthesis, grading, and presentation of evidence; and moving from evidence to recommendations.
Clinical practice guidelines are one of the foundations of efforts to improve health care. In 1999, we authored a paper about methods to develop guidelines. Since it was published, guideline development has progressed in both methods and necessary procedures, and the context for guideline development has changed with the emergence of guideline clearing houses and large scale guideline production organisations (such as the UK National Institute for Health and Clinical Excellence). It therefore seems timely to update and extend our earlier paper in a series of three articles. In this third paper we discuss: reviewing, reporting, and publishing guidelines; updating guidelines; and two emerging issues, namely enhancing guideline implementability and how guideline developers should address co-morbid conditions in the patients to whom guidelines apply.
Clinical practice guidelines are one of the foundations of efforts to improve health care. In 1999, we authored a paper about methods to develop guidelines. Since it was published, guideline development has progressed in both methods and necessary procedures, and the context for guideline development has changed with the emergence of guideline clearing houses and large scale guideline production organisations (such as the UK National Institute for Health and Clinical Excellence). It therefore seems timely to update and extend our earlier paper in a series of three articles. In this first paper we discuss: the target audience(s) for guidelines and their use of guidelines; identifying topics for guidelines; guideline group composition (including consumer involvement) and the processes by which guideline groups function; and the important procedural issue of managing conflicts of interest in guideline development.
Welcome to a new age in publishing systematic reviews. We hope the launch of Systematic Reviews will resonate with a broad spectrum of readers interested in using them in a variety of ways, such as providing comprehensive and up to date evidence for patient management, informing health policy, and developing rigorous practice guidelines. Systematic reviews are increasingly popular. Our journal is committed to publishing a wide variety of well conducted and transparently reported systematic reviews and associated research. We are open access and electronic and not confined by space and so offer scope for publishing reviews in detail and providing a modern and innovative approach to publishing. We look forward to participating in the voyage with all of our readers.
new journal; systematic reviews; open access
Prospective registration of systematic reviews promotes transparency, helps reduce potential for bias and serves to avoid unintended duplication of reviews. Registration offers advantages to many stakeholders in return for modest additional effort from the researchers registering their reviews.
This paper has two goals. First, we explore the feasibility of conducting online expert panels to facilitate consensus finding among a large number of geographically distributed stakeholders. Second, we test the replicability of panel findings across four panels of different sizes.
We engaged 119 panelists in an iterative process to identify definitional features of Continuous Quality Improvement (CQI). We conducted four parallel online panels of different sizes through three one-week phases using RAND's ExpertLens process. In Phase I, participants rated potentially definitional CQI features. In Phase II, they discussed rating results online, using asynchronous, anonymous discussion boards. In Phase III, panelists re-rated Phase I features and reported on their experiences as participants.
66% of invited experts participated in all three phases. 62% of Phase I participants contributed to Phase II discussions and 87% of them completed Phase III. Panel disagreement, measured by the mean absolute deviation from the median (MAD-M), decreased after group feedback and discussion in 36 out of 43 judgments about CQI features. Agreement between the four panels after Phase III was fair (four-way kappa = 0.36); they agreed on the status of five out of eleven CQI features. Results of the post-completion survey suggest that participants were generally satisfied with the online process. Compared to participants in smaller panels, those in larger panels were more likely to agree that they had debated each other's viewpoints.
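The disagreement measure named above, the mean absolute deviation from the median (MAD-M), can be sketched as follows. The ratings are hypothetical and only illustrate a case in which disagreement falls between phases.

```python
from statistics import median


def mad_from_median(ratings):
    """Mean absolute deviation of panel ratings from their median (MAD-M)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    centre = median(ratings)
    return sum(abs(r - centre) for r in ratings) / len(ratings)


# Hypothetical ratings of one CQI feature by seven panelists.
phase_1 = [2, 5, 9, 7, 3, 8, 6]
phase_3 = [5, 6, 7, 7, 6, 8, 6]
print(mad_from_median(phase_1), mad_from_median(phase_3))
```

A smaller MAD-M in the later phase indicates that ratings moved closer to the panel median after feedback and discussion.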
It is feasible to conduct online expert panels intended to facilitate consensus finding among geographically distributed participants. The online approach may be practical for engaging large and diverse groups of stakeholders around a range of health services research topics and can help conduct multiple parallel panels to test for the reproducibility of panel conclusions.
To systematically review the evidence for the efficacy and safety of botulinum toxin in the management of overactive bladder (OAB).
Materials and Methods
We performed a systematic review of the literature to identify articles published between 1985 and March 2009 on intravesical botulinum toxin A (BTX) injections for the treatment of refractory idiopathic overactive bladder in both men and women. Databases searched included MEDLINE, CENTRAL, and EMBASE. Data were tabulated from case series and from randomized controlled trials (RCTs). Data were pooled where appropriate.
Our literature search identified 432 titles. Twenty-three full articles were included in the final review. Three randomized placebo-controlled trials addressing the use of botulinum toxin-A were identified (99 patients total). The pooled random effects estimate of effect across all three studies was -3.88 (95% CI -6.15, -1.62), meaning that patients treated with BTX had 3.88 fewer incontinence episodes per day. UDI data revealed significant improvements in quality of life compared with placebo, with a standardized mean difference of -0.62 (95% CI -1.04, -0.21). Data from case series demonstrated significant improvements in OAB symptoms and quality of life, despite heterogeneity in methodology and case mix. However, based on the randomized controlled trials, there was a nine-fold increased risk of elevated post-void residual after BTX compared with placebo (8.55, 95% CI 3.22-22.71).
Intravesical injection of botulinum toxin resulted in improvement in medication-refractory OAB symptoms. However, the risk of elevated post-void residual and symptomatic urinary retention was significant. Several questions remain concerning the optimal administration of BTX for the OAB patient.
The evidence base for quality improvement (QI) interventions is expanding rapidly. The diversity of the initiatives and the inconsistency in labeling these as QI interventions makes it challenging for researchers, policymakers, and QI practitioners to access the literature systematically and to identify relevant publications.
We evaluated search strategies developed for MEDLINE (Ovid) and PubMed based on free text words, Medical Subject Headings (MeSH), QI intervention components, continuous quality improvement (CQI) methods, and combinations of the strategies. Three sets of pertinent QI intervention publications were used for validation. Two independent expert reviewers screened publications for relevance. We compared the yield, recall rate, and precision of the search strategies for the identification of QI publications and for a subset of empirical studies on effects of QI interventions.
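The operating characteristics named above can be sketched as follows, assuming that recall is the share of a reference set retrieved by a search and precision is the share of retrieved records judged relevant. The record identifiers and function names are hypothetical.

```python
def recall(retrieved_ids, reference_ids):
    """Share of reference publications that the search retrieved."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(reference & set(retrieved_ids)) / len(reference)


def precision(retrieved_ids, relevant_ids):
    """Share of retrieved publications that reviewers judged relevant."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        raise ValueError("retrieved set must not be empty")
    return len(retrieved & set(relevant_ids)) / len(retrieved)


# Hypothetical identifiers standing in for database record numbers.
retrieved = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
reference_set = ["r2", "r5", "r9", "r10"]          # known QI intervention articles
judged_relevant = ["r1", "r2", "r5", "r6", "r7"]   # screened as relevant to QI

yield_count = len(set(retrieved))
print(yield_count, recall(retrieved, reference_set), precision(retrieved, judged_relevant))
```

A broader search raises the yield and usually the recall, but typically at the cost of precision, which is the trade-off the comparison of strategies has to weigh.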
The search yields ranged from 2,221 to 216,167 publications. Mean recall rates for reference publications ranged from 5% to 53% for strategies with yields of 50,000 publications or fewer. The 'best case' strategy, a simple text word search with high face validity ('quality' AND 'improv*' AND 'intervention*') identified 44%, 24%, and 62% of influential intervention articles selected by Agency for Healthcare Research and Quality (AHRQ) experts, a set of exemplar articles provided by members of the Standards for Quality Improvement Reporting Excellence (SQUIRE) group, and a sample from the Cochrane Effective Practice and Organization of Care Group (EPOC) register of studies, respectively. We applied the search strategy to a PubMed search for articles published in 10 pertinent journals in a three-year period which retrieved 183 publications. Among these, 67% were deemed relevant to QI by at least one of two independent raters. Forty percent were classified as empirical studies reporting on a QI intervention.
The presented search terms and operating characteristics can be used to guide the identification of QI intervention publications. Even with extensive iterative development, we achieved only moderate recall rates of reference publications. Consensus development on QI reporting and initiatives to develop QI-relevant MeSH terms are urgently needed.