We established a program of research to improve the development, reporting and evaluation of practice guidelines. We assessed the construct validity of the items and user’s manual in the β version of the AGREE II.
We designed guideline excerpts reflecting high- and low-quality guideline content for 21 of the 23 items in the tool. We created two study packages such that the high-quality and low-quality versions of each item were randomly distributed between them, with each package containing one version of every item. We randomly assigned 30 participants to one of the two packages. Participants reviewed and rated the guideline content according to the instructions of the user’s manual and completed a survey assessing the manual.
In all cases, content designed to be of high quality was rated higher than low-quality content; in 18 of 21 cases, the differences were significant (p < 0.05). The manual was rated by participants as appropriate, easy to use, and helpful in differentiating guidelines of varying quality, with all scores above the mid-point of the seven-point scale. Considerable feedback was offered on how the items and manual of the β-AGREE II could be improved.
The validity of the items was established and the user’s manual was rated as highly useful by users. We used these results and those of our study presented in part 1 to modify the items and user’s manual. We recommend AGREE II (available at www.agreetrust.org) as the revised standard for guideline development, reporting and evaluation.
For clinical practice guidelines to achieve their full potential as tools to assist in clinical, policy-related and system-level decisions,1–3 they need to be of high quality and developed using rigorous methods.4 Thus, strategies are required to facilitate the development and reporting of guidelines and tools able to distinguish guidelines of varying quality. The AGREE Collaboration (Appraisal of Guidelines, Research and Evaluation) was the first to create a generic tool to assess the process of guideline development and reporting,5,6 and it quickly became the standard for guideline evaluation.7
As with any new assessment tool, ongoing development of the instrument was required to improve its measurement properties and advance the guideline enterprise. The AGREE Next Steps Consortium undertook a program of research to achieve these goals and create the next version of the tool, the AGREE II.8 The consortium completed two studies (parts 1 and 2). In part 1, also reported in this issue,9 we conducted an analysis of the performance of the new seven-point response scale, explored the usefulness of the AGREE items, and systematically identified ways in which the items and supporting document could be improved.
In part 2, reported here, we aimed to test the construct validity of the items and evaluate the new supporting documentation, which was intended to facilitate efficient and accurate application of the tool.
The validity of the original AGREE instrument was explored in three ways.5 Appraisers’ attitudes about the instrument’s usefulness and the helpfulness of the supporting documents (i.e., a user guide and training manual) were used as measures of face validity. The construct validity of the instrument was tested using three core hypotheses for each of the six domains; only 3 of the possible 18 tests were supported. In retrospect, whether the hypotheses were generalizable across contexts was questionable. Finally, to establish criterion validity, correlations between users’ overall global endorsement and quality ratings of individual items were calculated; whether global endorsements were a reasonable proxy gold standard was also open to question. Further, for both the construct- and criterion-validity analyses, the guidelines used in these studies were nominated by members of the research team as representing a range of quality, creating a significant risk of bias.
Together, these findings and methodological limitations illustrated the need for additional work to test and establish the instrument’s validity. In particular, the most fundamental question of construct validity had not yet been addressed: are guidelines known to be of higher quality rated more favourably using the AGREE instrument than guidelines known to be of lower quality? In addition, no study had tested specifically whether the instructions for applying the tool are perceived to be appropriate, implementable, and helpful in differentiating among guidelines of varying quality. These perceptions are important components of the face validity of the tool.
We tested two specific research questions in this study. First, do the items in β-AGREE II differentiate between guideline content of known, varying quality? Second, is the new user’s manual perceived by users as appropriate, easy to apply and helpful in differentiating good quality guidelines from poor quality guidelines?
We used a two-level factorial design. Guideline quality (i.e., high and low) was the between-subjects factor. We sought to recruit 15 participants per group, to enable a two-sided test to have 80% power to detect an advantage of as little as one point on the seven-point scale between the high-quality and low-quality groups.
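The sample-size reasoning above can be sketched numerically. A minimal check, assuming a common standard deviation of roughly 0.95 points (an assumption for illustration; the paper does not report the variance used), computes the power of a two-sided, two-sample t-test with 15 participants per group to detect a one-point difference on the seven-point scale:

```python
from scipy import stats

def two_sample_power(diff, sd, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test with equal n and equal SD."""
    df = 2 * n_per_group - 2
    # Noncentrality parameter implied by the assumed true difference
    ncp = (diff / sd) * (n_per_group / 2) ** 0.5
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that the noncentral t statistic exceeds the critical value
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# 15 per group, 1-point difference, assumed SD of 0.95 (hypothetical)
power = two_sample_power(diff=1.0, sd=0.95, n_per_group=15)
print(round(power, 2))  # roughly 0.8 under these assumptions
```

Under this assumed standard deviation, 15 participants per group yields approximately the 80% power the study targeted; a larger assumed variance would require a larger sample.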
A convenience sample of guideline developers, researchers and clinicians was recruited to participate in this study. They were recruited from the Program in Evidence-based Care of Cancer Care Ontario, the Canadian Partnership Against Cancer and international coinvestigators of the research team. We oversampled by 33% to ensure we would receive data for our targeted sample size of 30.
An existing cancer-related guideline developed by an established guidelines program10 was used as the source guideline from which we purposefully designed excerpts of content of varying quality to reflect 21 of the 23 AGREE items. This guideline was chosen because it was of mid-range quality (as determined by two independent appraisers using the original AGREE instrument [MK, JM]), which enabled us to craft both higher-quality and lower-quality content; the AGREE instrument had not been explicitly used to facilitate its development. We excluded item 16 (i.e., the different options for management of the condition are clearly presented) because the source document focused on only one effective treatment option, and we did not want to introduce a recommendation that was fictitious or not based on evidence. Item 17 (i.e., key recommendations are easily identifiable) was not manipulated, because we presented guideline excerpts related to each item one at a time rather than embedding all of the manipulated content in a whole version of a guideline; all participants therefore received the same content as in the original source guideline for item 17.
In crafting guideline content, our objective was to reflect more nuanced differences that might typically be seen between guidelines rather than extreme examples of high and low content (Table 1). For each item, a high-quality version and a low-quality version of the content was pilot-tested, reviewed and refined by three members of the team (MB, MK, ER). From there, two versions of a study package were created. Excerpts of high- and low-quality content were randomly assigned to each version of the study package using a random number generator, such that in each package, only one version (high or low) was included for each item (except item 17 as per above). Version 1 included 14 high-quality items and 7 low-quality items. Version 2 included 7 high-quality items and 14 low-quality items (i.e., the inverse of version 1 in quality).
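The package-assembly step described above can be illustrated in code. A sketch, under the assumption that each manipulated item has exactly one high-quality and one low-quality version (the item numbering follows the AGREE II; the function and labels are illustrative):

```python
import random

# The 21 manipulated AGREE II items (items 16 and 17 were not manipulated)
ITEMS = [i for i in range(1, 24) if i not in (16, 17)]

def build_packages(seed=None):
    """Randomly assign the high-quality version of each item to one of two
    study packages; the other package receives the low-quality version."""
    rng = random.Random(seed)
    package_1, package_2 = {}, {}
    for item in ITEMS:
        if rng.random() < 0.5:
            package_1[item], package_2[item] = "high", "low"
        else:
            package_1[item], package_2[item] = "low", "high"
    # Item 17 is the same (unmanipulated) source content in both packages
    package_1[17] = package_2[17] = "original"
    return package_1, package_2

p1, p2 = build_packages(seed=1)
```

Each package thus contains exactly one version of every item, and the two packages are mirror images of one another on the manipulated items, matching the design in which version 2 was the inverse of version 1 in quality.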
After obtaining ethics approval, we distributed personalized letters of invitation and then reminders via email to participant-candidates. Participants were assigned a unique identifier code and were blinded to group and purpose of the study. They were randomly assigned to one of the two versions of the study package and sent a confidential username and password to access the web-based study platform. Once logged on, participants were asked to assess the guideline content, using the β-AGREE II items and user’s manual to guide their assessment. Content relevant to each item was presented sequentially. Participants were then asked to fill out a survey to assess the usefulness of the user’s manual.
The β-AGREE II comprised an item set and a user’s manual. The set included the 23 items clustered into the six quality domains from the original AGREE instrument. However, the items were answered using the new seven-point response scale tested in part 1,9,11 which replaces the original four-point scale.5 The most significant change in the β-AGREE II is the new user’s manual, an extensively restructured revision of the original user guide and training manual that replaces the original supporting documentation. For each of the 23 items, the user’s manual provides a definition of the concept, specific examples, suggestions for where to find the information in the guideline, and clear direction (including criteria and considerations) on how to score the item.
A three-item scale was used to gather feedback on the user’s manual based on previously published measures of clinical sensibility.12 For each item represented in the manual, participants were asked to rate their agreement using a seven-point scale (i.e., 1 = strongly disagree, 7 = strongly agree) regarding item appropriateness, ease of application, and capacity to facilitate discrimination between good- and poor-quality guidelines. Participants were also asked to provide written feedback (i.e., qualitative, open-ended) on how the user’s manual could be improved.
To assess whether differences in item ratings existed between guideline content designed to be of high and low quality, and to correct for multiple comparisons, we conducted a multivariate analysis of variance (MANOVA) with the 21 manipulated items as dependent measures. We report the results of both the MANOVA and the univariate analyses. A separate analysis of variance (ANOVA) was conducted to compare scores for item 17 between the two groups, where no difference was expected.
Descriptive statistics were calculated for each of the three assessment measures of the user’s manual. For exploratory purposes, total scores were added across the AGREE items for each of the three assessment measures of the user’s manual, and a one-way ANOVA was undertaken to determine if differences in overall assessments existed between version 1 and version 2 of the study packages. Our hypothesis was that no differences would exist between the two groups.
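The exploratory comparison of total manual-assessment scores between the two study packages can be sketched as a one-way ANOVA (which, with two groups, is equivalent to a two-sample t-test). The scores below are fabricated solely for illustration; the real study data are not reproduced here:

```python
from scipy import stats

# Illustrative total assessment scores (summed across AGREE items) for each
# participant in the two package groups -- hypothetical values only
version_1_totals = [118, 125, 131, 109, 122, 127, 115, 130,
                    121, 124, 119, 128, 112, 126, 123]
version_2_totals = [120, 117, 129, 114, 125, 122, 118, 131,
                    116, 127, 121, 113, 124, 119, 126]

f_stat, p_value = stats.f_oneway(version_1_totals, version_2_totals)
# Under the study's hypothesis of no group difference, p > 0.05 is expected
```

Because the two packages were mirror images in quality, a null result here supports the interpretation that the manual-assessment measures were not driven by which package a participant happened to receive.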
Of 41 invited participants, we received data from 30 people (for a response rate of 73%), which met our requirement for sample size. One data point was missing for two of these participants. The demographic characteristics of participants are provided in Table 2. Almost three quarters of participants identified themselves as researchers, 28% engaged in clinical practice, and 83% were participants in some aspect of the guideline enterprise.
Multivariate analysis of variance yielded a significant main effect for guideline quality (p = 0.005). Univariate analyses yielded significant differences in scores for 18 of the 21 manipulated items (Table 3). In all cases, content designed to be of high quality was rated higher than content designed to be of low quality. Although their mean scores were in the expected direction, three items did not yield significant univariate differences between the high- and low-quality versions: item 10 (i.e., methods for formulating recommendations are clearly described), item 11 (i.e., health benefits, side effects and risks have been considered in formulating recommendations) and item 12 (i.e., there is an explicit link between the recommendations and the supporting evidence). As expected, item 17 (i.e., key recommendations are easily identifiable), for which participants in both groups received the same version, did not yield a significant difference between groups (p > 0.05) in the separate ANOVA.
The results of the three usability assessments of the β version of the user’s manual across each of the AGREE II items are presented in Table 4. Mean scores were high, with a range of 5.43–6.43 for the measure of appropriateness, 5.33–6.33 for that of ease of application, and 5.21–6.27 for that of ability to discriminate. No differences in total assessment scores were found between study package 1 and study package 2 (p > 0.05).
We received considerable written feedback from participants, including specific suggestions for improvements to the instrument (not presented). All feedback, in combination with feedback received in part 1, was formally discussed by the AGREE Next Steps Consortium, and final modifications were made to create the AGREE II.8
This study represents the first systematic analysis of the construct validity of the AGREE items. Our results show the capacity of the items to detect differences in guideline quality, which the instrument purports to measure. Before this study, development work on the AGREE instrument had been done on real guidelines considered by researchers to reflect a range of quality. However, those results were confounded because the AGREE instrument served both as the measurement tool to evaluate the guidelines and as the object of the study intended to assess the instrument’s capacity to evaluate guidelines. Apparent differences in quality might have been confounded with other differences (e.g., guideline topic, intervention, organization, differences among researchers on the criteria used to nominate good- and poor-quality exemplars). In this study, by removing these potential confounders, we were able to test explicitly the capacity of the AGREE items to distinguish among guideline information of known, varying quality. By manipulating the quality of guideline excerpts, we were able to determine how the scores relate to the operational definitions of the items.
Our results are encouraging, with all mean ratings falling in the intended direction, and 18 of the 21 means yielding statistically significant differences. In addition, this study established that the instructions of the β-AGREE II User Manual are appropriate, are easy to apply, and create confidence among users that good-quality guidelines will be differentiated from poor-quality guidelines.
Our study has limitations. First, in testing the β-AGREE II, participants were presented with excerpts of a guideline reflecting each item’s concept rather than an entire guideline. Whether the items would be sensitive in discriminating between differences in quality when users are presented with an entire guideline is a question for future research. Second, we chose a convenience sample of participants composed primarily of guideline developers and researchers rather than a full range of potential users of AGREE II. As such, the generalizability of the findings may be limited. However, given that most of our participants (83%) were experienced in guideline development or research, they were uniquely situated as consumers who could be critical of the value of the user’s manual. Third, although we met our sample-size goal of 30 participants for analytical purposes, this study is modestly sized, which may raise questions about the generalizability of our findings to a larger group of stakeholders. Fourth, because we used a specialist (oncologic) guideline focused on one procedure as our source document, extrapolation of our findings to other clinical areas is contestable. Finally, for many of the items, the word count of the high-quality version was larger than that of the low-quality version; word count may therefore be confounded with quality when interpreting the differences that emerged, which may also limit the generalizability of our findings.
Our study represents the first systematic assessment of the construct validity of the AGREE. Future research is warranted to reproduce these findings using a larger sample of stakeholders and including manipulated guideline content within the context of a whole report.
In combination with part 1,9 our results led to the final refinements and release of the AGREE II, the revised standard for guideline development, reporting and evaluation.8 The AGREE II is available at the website of the AGREE Research Trust (www.agreetrust.org).
The AGREE Next Steps Consortium thanks the US National Guidelines Clearinghouse for its assistance in the identification of eligible practice guidelines used in the research program of the consortium. The consortium also thanks Ms. Ellen Rawski for her support on the project as research assistant from September 2007 to May 2008.
Members of the AGREE Next Steps Consortium: Dr. Melissa C. Brouwers, McMaster University and Cancer Care Ontario, Hamilton, Ont.; Dr. George P. Browman, British Columbia Cancer Agency, Vancouver Island, BC; Dr. Jako S. Burgers, Dutch Institute for Healthcare Improvement CBO, and Radboud University Nijmegen Medical Centre, IQ Healthcare, Netherlands; Dr. Francoise Cluzeau, Chair of AGREE Research Trust, St. George’s University of London, London, UK; Dr. Dave Davis, Association of American Medical Colleges, Washington, USA; Prof. Gene Feder, University of Bristol, Bristol, UK; Dr. Béatrice Fervers, Unité Cancer et Environement, Université de Lyon – Centre Léon Bérard, Université Lyon 1, EA 4129, Lyon, France; Dr. Ian D. Graham, Canadian Institutes of Health Research, Ottawa, Ont.; Dr. Jeremy Grimshaw, Ottawa Hospital Research Institute, Ottawa, Ont.; Dr. Steven E. Hanna, McMaster University, Hamilton, Ont.; Ms. Michelle E. Kho, McMaster University, Hamilton, Ont.; Prof. Peter Littlejohns, National Institute for Health and Clinical Excellence, London, UK; Ms. Julie Makarski, McMaster University, Hamilton, Ont.; Dr. Louise Zitzelsberger, Canadian Partnership Against Cancer, Ottawa, Ont.
Competing interests: Melissa Brouwers, Francoise Cluzeau and Jako Burgers are trustees of the AGREE Research Trust. No competing interests declared by the other authors.
Contributors: Melissa Brouwers conceived and designed the study, led the collection, analysis and interpretation of the data, and drafted the manuscript. All of the authors made substantial contributions to the study concept and the interpretation of the data, critically revised the article for important intellectual content and approved the final version of the manuscript to be published.
Previously published at www.cmaj.ca
Funding: This research was supported by the Canadian Institutes of Health Research (CIHR), which had no role in the design, analysis or interpretation of the data. Michelle Kho is supported by a CIHR Fellowship Award (Clinical Research Initiative).
This article has been peer reviewed.