For clinical practice guidelines to achieve their full potential as tools to assist in clinical, policy-related and system-level decisions,1–3
they need to be of high quality and developed using rigorous methods.4
Thus, strategies are required to facilitate the development and reporting of guidelines, along with tools that can distinguish among guidelines of varying quality. The AGREE Collaboration (Appraisal of Guidelines, Research and Evaluation) was the first to create a generic tool to assess the process of guideline development and reporting,5,6
and it quickly became the standard for guideline evaluation.7
As with any new assessment tool, ongoing development of the instrument was required to improve its measurement properties and advance the guideline enterprise. The AGREE Next Steps Consortium undertook a program of research to achieve these goals and create the next version of the tool, the AGREE II.8
The consortium completed two studies (parts 1 and 2). In part 1, also reported in this issue,9
we conducted an analysis of the performance of the new seven-point response scale, explored the usefulness of the AGREE items, and systematically identified ways in which the items and supporting document could be improved.
In part 2, reported here, we aimed to test the construct validity of the items and evaluate the new supporting documentation, which was intended to facilitate efficient and accurate application of the tool.
The validity of the original AGREE instrument was explored in three ways.5
Appraisers’ attitudes about the instrument’s usefulness and the helpfulness of the supporting documents (i.e., a user guide and training manual) served as measures of face validity. Construct validity was tested using three core hypotheses for each of the six domains; only 3 of the 18 possible tests were supported, and, in retrospect, it is questionable whether the hypotheses were generalizable across contexts. Finally, to establish criterion validity, correlations between users’ overall global endorsements and their quality ratings of individual items were calculated; whether global endorsements were a reasonable proxy gold standard is also open to question. Further, for both the construct and criterion validity analyses, the guidelines studied were nominated by members of the research team as representing a range of quality, introducing a significant risk of bias.
Together, these findings and methodological limitations illustrated the need for additional work to test and establish the instrument’s validity. In particular, the most fundamental question of construct validity had not yet been addressed: are guidelines known to be of higher quality rated more favourably with the AGREE instrument than guidelines known to be of lower quality? In addition, no study to date had tested specifically whether the instructions for applying the tool are perceived to be appropriate, implementable, and helpful in differentiating among guidelines of varying quality. These perceptions are important contributors to the face validity of the tool.
We tested two specific research questions in this study. First, do the items in β-AGREE II differentiate between guideline content of known, varying quality? Second, is the new user’s manual perceived by users as appropriate, easy to apply and helpful in differentiating good-quality guidelines from poor-quality guidelines?