We found that only a small proportion of CPGs for heart disease, cancer, stroke, COPD and diabetes made risk-stratified treatment recommendations using risk assessment tools. Most of these CPGs recommended risk assessment tools that had been shown to accurately predict outcome risk in the target population of the CPGs, and most of the treatment evidence was based on RCTs and meta-analyses. For the majority of the CPGs, however, it was not explicitly explained how treatment effects on benefit and harm outcomes were estimated for patients at different risks. Perhaps most importantly, it was unclear for all but one CPG how treatment thresholds were determined to generate risk-stratified treatment recommendations.
We developed a framework for the development of risk-stratified treatment recommendations (Figure ) to systematically identify the strengths and weaknesses of current CPGs. Our findings suggest that risk assessment tools were carefully appraised and selected during the development of CPGs. For example, some CPG developers critically appraised validation studies of risk tools to judge their calibration (agreement between predicted and observed risk) and discrimination (probability that those with an event receive higher risk predictions than those without an event) [10]. Minimizing misclassification of outcome risks is important to avoid over- or under-treatment [37]. While some CPGs recommended specific risk assessment tools, one CPG suggested using the risk assessment tool that is most likely to be accurate in the specific population of interest [30]. However, the set of CPGs selected in this study may give an overly optimistic picture of the risk assessment tools proposed by guidelines. For many diseases and for geographical locations other than the US, Canada and the UK, well-calibrated and discriminative risk assessment tools may not exist. A strength of existing CPGs is that the majority of them relied on RCTs and meta-analyses of RCTs for evidence on intervention effectiveness. The CPG developers recognized limitations within this body of evidence, including insufficient evidence on treatment heterogeneity (that is, subgroup effects) and scarcity of data on harm outcomes.
We identified a number of major limitations in how CPGs develop risk-stratified treatment recommendations. It should be noted that some limitations propagated from a single prominent CPG (for example, National Cholesterol Education Program) to other CPGs that adopted its approach or even its recommendations. For example, it was often unclear how the benefit and harm outcomes were estimated for different risk profiles. Some CPGs applied estimates of relative risk reduction to absolute risks. This approach relies on the assumption that relative treatment effects are constant across the risk spectrum. The assumption may be justifiable in many instances, but it is usually difficult to verify. No alternative approaches for linking absolute risk with treatment evidence were used. Additional sensitivity analyses may sometimes be appropriate to explore the assumption of constant relative treatment effects. For example, one could obtain risk-specific treatment estimates from large trials using individual patient data [12]. Alternatively, one could employ simulation studies to estimate the probability of outcomes in the population of interest by combining observational data with treatment effects from randomized trials. It is currently unclear what the most appropriate approach is to link risk predictions with evidence from randomized trials. Nevertheless, we believe CPGs should be explicit about the method they use and acknowledge the associated advantages and limitations (for example, the assumption of constant relative risk reduction).
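The practice of applying a relative risk reduction to absolute risks can be illustrated with a minimal Python sketch. The relative risk reduction of 25% and the baseline risks below are hypothetical values chosen for illustration, not estimates from any CPG or trial, and the function names are invented for this example.

```python
def absolute_risk_reduction(baseline_risk: float, rrr: float) -> float:
    """Absolute risk reduction, assuming the relative risk reduction (rrr)
    is constant across the whole range of baseline risk."""
    if not 0.0 <= baseline_risk <= 1.0:
        raise ValueError("baseline_risk must be a probability between 0 and 1")
    if not 0.0 <= rrr <= 1.0:
        raise ValueError("rrr must be between 0 and 1")
    return baseline_risk * rrr


def events_prevented_per_1000(baseline_risk: float, rrr: float) -> float:
    """Expected number of events prevented per 1,000 patients treated."""
    return 1000.0 * absolute_risk_reduction(baseline_risk, rrr)


if __name__ == "__main__":
    HYPOTHETICAL_RRR = 0.25  # illustrative value, not taken from any trial
    for risk in (0.05, 0.10, 0.20):  # illustrative baseline outcome risks
        prevented = events_prevented_per_1000(risk, HYPOTHETICAL_RRR)
        print(f"baseline risk {risk:.0%}: {prevented:.1f} events prevented per 1,000")
```

Under these assumed inputs, the same relative effect translates into 12.5, 25 and 50 events prevented per 1,000 patients at baseline risks of 5%, 10% and 20%. This is why the validity of the constant-relative-effect assumption matters for risk-stratified recommendations.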
In our view, the greatest limitation of current CPGs is that, for most of them, it is unclear how treatment thresholds were developed. Some CPGs stated that the thresholds were determined by experts. The USPSTF guideline on aspirin [25] was the only guideline that conducted a formal quantitative assessment, comparing the expected number of benefit and harm events for patients at different levels of risk for myocardial infarction and major gastrointestinal bleeding. We believe that transparency will be enhanced by conducting quantitative benefit-harm assessments alongside more qualitative approaches, such as expert consensus about treatment thresholds.
Treatment thresholds are important because medical decision-making is discrete (to treat the patient or not). It is challenging to determine thresholds because clear cut-offs on the (commonly) continuous benefit-harm scale may not exist. In addition, there may often be substantial uncertainty about harms and heterogeneity of treatment effects as a consequence of poor reporting or a lack of evidence from primary studies. In our view, however, this should not prevent CPG developers from making risk-stratified recommendations, because health care providers still need evidence-based guidance and because variability in delivering health care may be unacceptably high in the absence of guidance. Quanstrum and Hayward [40] recently suggested an approach that acknowledges uncertainty about treatment decision thresholds and proposed two thresholds instead of one: one above which physicians should recommend treatments (benefits outweighing harms irrespective of patient preferences and uncertainties about the evidence base) and one below which physicians should recommend against treatments (harms outweighing benefits). The interval between the two thresholds represents an area where treatment could provide small benefits or harms depending on patient preferences, but also where uncertainty about the evidence precludes CPG developers from making recommendations. Alternatively, CPG developers could frame strong recommendations for or against treatment for patients at outcome risks above or below the two thresholds, respectively, and weak recommendations for patients at outcome risks between the two thresholds [41].
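The two-threshold idea can be expressed as a simple decision rule. The sketch below is an illustration only: the threshold values in the usage note are hypothetical, the function name and returned labels are invented for this example, and treating risks exactly equal to a threshold as part of the middle zone is an assumption.

```python
def two_threshold_recommendation(
    outcome_risk: float, lower_threshold: float, upper_threshold: float
) -> str:
    """Classify a patient's predicted outcome risk against two thresholds.

    Above the upper threshold: recommend treatment.
    Below the lower threshold: recommend against treatment.
    In between (boundaries included, by assumption): no firm recommendation.
    """
    if not 0.0 <= lower_threshold <= upper_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    if not 0.0 <= outcome_risk <= 1.0:
        raise ValueError("outcome_risk must be a probability between 0 and 1")
    if outcome_risk > upper_threshold:
        return "recommend treatment"
    if outcome_risk < lower_threshold:
        return "recommend against treatment"
    return "no firm recommendation"
```

For example, with hypothetical thresholds of 5% and 10%, a patient with a predicted risk of 15% would fall in the "recommend treatment" zone, a patient at 2% in the "recommend against treatment" zone, and a patient at 7% in the interval where patient preferences or uncertainty about the evidence dominate.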
One may criticize the approach used by the USPSTF, which assigned equal weight to benefit and harm outcomes to calculate events expected per 1,000 people treated over 10 years, because empirical evidence suggests that patients, on average, assign different importance to myocardial infarction, major gastrointestinal bleeding and major stroke, the major drivers of the benefit-harm balance of aspirin [42]. Nevertheless, such transparency about the relative importance of outcomes comes with several important advantages. Users of CPGs can understand and replicate how the treatment thresholds were derived and, if they do not agree with certain assumptions (for example, equal importance of myocardial infarction and major gastrointestinal bleeding), they can adjust the result to derive thresholds that suit their settings (for example, myocardial infarction considered twice as important as major gastrointestinal bleeding). This would also allow the guideline to be interpreted for an individual patient, who may weigh the various outcomes differently from the preferences assumed in the CPG.
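How a change in outcome weights shifts a treatment threshold can be shown with a small Python sketch. It is a simplified illustration, not the USPSTF calculation: it assumes a single benefit outcome and a single harm outcome, a constant relative risk reduction, and an excess harm risk that does not depend on baseline risk. All numerical inputs and function names are hypothetical.

```python
def net_benefit_per_1000(
    baseline_risk: float,
    rrr: float,
    harm_risk: float,
    benefit_weight: float = 1.0,
    harm_weight: float = 1.0,
) -> float:
    """Weighted events prevented minus weighted events caused per 1,000 treated."""
    events_prevented = 1000.0 * baseline_risk * rrr
    events_caused = 1000.0 * harm_risk
    return benefit_weight * events_prevented - harm_weight * events_caused


def threshold_risk(
    rrr: float,
    harm_risk: float,
    benefit_weight: float = 1.0,
    harm_weight: float = 1.0,
) -> float:
    """Baseline risk at which weighted benefits equal weighted harms."""
    if rrr <= 0.0 or benefit_weight <= 0.0:
        raise ValueError("rrr and benefit_weight must be positive")
    return (harm_weight * harm_risk) / (benefit_weight * rrr)


if __name__ == "__main__":
    RRR = 0.25        # hypothetical relative risk reduction for the benefit outcome
    HARM_RISK = 0.01  # hypothetical excess risk of the harm outcome on treatment
    print(f"equal weights:       {threshold_risk(RRR, HARM_RISK):.1%}")
    print(f"benefit weighted 2x: {threshold_risk(RRR, HARM_RISK, benefit_weight=2.0):.1%}")
```

With these assumed inputs, the threshold is a baseline risk of 4% under equal weights and 2% when the benefit outcome is considered twice as important as the harm outcome. A transparent quantitative assessment lets CPG users carry out this kind of adjustment themselves.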
The framework for developing risk-stratified treatment recommendations that we propose may be useful for those developing CPGs and may stimulate further research. While much research has been done on how to select and appraise evidence on treatment benefits and harms [43] and how to judge the validity of prediction models [37], it is less clear how to link risk prediction and treatment evidence, how to select a method for benefit-harm assessment to develop treatment thresholds, and how to include patient preferences. It would be useful to have empirical evidence on how the results of different approaches for linking risk prediction with treatment evidence and for defining treatment thresholds differ, and on how sensitive they are to assumptions [45]. As for patient preferences, little research has been done to find ways to include stakeholders in the process of selecting important outcomes, or to identify a benefit-harm assessment method that provides the information patients need in order to make decisions [46]. The newly founded Patient-Centered Outcomes Research Institute is likely to contribute substantially to the questions raised.
Our study has some weaknesses. We selected guidelines from five major disease categories and from one database, and focused on CPGs from the US, Canada and NICE (UK). Thus, our results may not be generalizable and are likely to give an optimistic assessment of CPGs, because we included some of the most prominent guidelines in medicine. In the fields of cardiovascular medicine and diabetes, guideline developers have a long tradition of making risk-stratified treatment recommendations. We relied on published reports, which may not reflect the true underlying development process for CPGs. We considered all background documents that were openly accessible, but we may have missed some information on the development of risk-stratified treatment recommendations.