We first discuss how we adapted the USPSTF method to our task of grading the quality of evidence that we might use in our decision model. Next, we describe how we specified a minimally acceptable, or “qualifying,” grade of evidence. If the quality of evidence was beneath this grade, we considered it to be too weak for us to use in the model. Finally, we varied our threshold for defining adequate quality of evidence to measure the effect of evidence quality on the results produced by the model.
The basic idea underlying this approach is that when a data source has poor-quality evidence, we should not use it to estimate a model parameter. Instead, we should assume that we know little about the parameter’s true value and specify it by using a probability distribution with few embedded assumptions (an “uninformative” or “wide” distribution). A probability distribution is a way of expressing what we know about a number whose true value is uncertain; it is narrow if we are certain and wide if we are uncertain. We chose uniform distributions (that is, distributions in which a parameter is equally likely to have any value within a specified range) to construct our uninformative distributions; however, other types of wide distributions (such as shallow triangular distributions) are consistent with this approach and may also be suitable. It is important to note that we used uninformative distributions only if no data source met our qualifying grade of evidence; otherwise, we based the parameter distribution on the statistical uncertainty of the qualifying data source.
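To make the contrast concrete, the short Python sketch below (ours, not part of the published analysis; all numbers are illustrative) compares draws from a narrow, informative distribution with draws from a wide, uninformative uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Informative (narrow) distribution: a parameter whose value is known
# precisely, here a probability near 0.70 (illustrative numbers).
informative = rng.normal(loc=0.70, scale=0.02, size=10_000)

# Uninformative (wide) uniform distribution: every value in the
# plausible range is treated as equally likely.
uninformative = rng.uniform(low=0.0, high=1.0, size=10_000)

print(f"informative:   mean={informative.mean():.3f}, sd={informative.std():.3f}")
print(f"uninformative: mean={uninformative.mean():.3f}, sd={uninformative.std():.3f}")
```

The much larger spread of the uniform draws is the formal expression of knowing little about the parameter’s true value.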
Adapting the Analytic Framework of Evidence-Based Medicine
As recommended by the USPSTF (4), we evaluated quality of evidence according to 3 separate domains: hierarchy of research design, internal validity, and external validity. Hierarchy of research design (which we subsequently refer to as study design) measures the extent to which a study’s design differs from a controlled experiment. The hierarchy ranges from randomized, controlled clinical trial (level 1, most favorable) to expert opinion (level 3, least favorable). Internal validity measures how well a study’s conclusions apply to the members of the study sample itself (11) and includes determinants of study quality or bias minimization, such as concealment of randomization and loss to follow-up. It is graded as good, fair, or poor. External validity measures how well a study’s inferences should apply to members of the target population (11) and includes such determinants of generalizability as whether clinical, social, or environmental circumstances in the studies could make the results differ from those expected in another clinical setting. The USPSTF did not specify a grading system for external validity, so we simply graded this attribute as high or low. The USPSTF classification scheme treats research design, internal validity, and external validity as largely independent of each other. Therefore, although some would argue that any study with poor internal validity will necessarily have poor external validity, the USPSTF approach preserves the flexibility to rate these 2 attributes independently.
Table. Strength of Evidence Hierarchy
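For readers who prefer a compact summary, the adapted scheme’s 3 domains and their grade scales can be written as a simple data structure (a sketch; each scale is ordered from strongest to weakest grade):

```python
# The 3 evidence domains of the adapted USPSTF scheme, each scale
# ordered from strongest to weakest grade.
GRADE_SCALES = {
    "study design":      ["1", "2-1", "2-2", "3"],  # hierarchy of research design
    "internal validity": ["good", "fair", "poor"],
    "external validity": ["high", "low"],
}
```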
The USPSTF developed its methods for use with experimental variables (that is, interventions). Therefore, it gives lower-quality scores to studies that rely strictly on observation. However, decision analytic models commonly include variables that are necessarily observed and impossible to measure by experiment (for example, cost, or features of the natural history of disease, such as the mortality rate due to age-, sex-, and race-related causes). For this reason, we did not want to automatically award low study design grades to studies measuring nonexperimental variables; therefore, we classified high-quality studies that measured nonexperimental variables as level 1 rather than level 2.
Two of the authors independently classified the design, internal validity, and external validity of 17 data sources (12–25; one unpublished study; expert opinion) for the DOT model by using this adapted classification scheme. They disagreed only once and quickly resolved the difference. Some investigators may prefer more robust review procedures, such as those used for meta-analyses (for example, review in duplicate with decision rules for adjudicating discrepancies, or blinded review) (26).
Table. Parameters in Computer Simulation
Specifying Qualifying Grades of Evidence
Every time we used the model, we specified a qualifying grade or grades of evidence for 1 or more of our evidence domains (study design, internal validity, and external validity). A “qualifying” grade of evidence means that we could use a data source in our model only if that source met or exceeded the qualifying grade. For example, if we specified that the model must contain only data from studies using level 1 study design, we used only data sources with level 1 design and excluded studies with weaker designs (levels 2-1, 2-2, and 3).
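As an illustration of this screening step, the hypothetical Python sketch below ranks the study design hierarchy and retains only the data sources that meet or exceed a specified qualifying grade; the source names and grades are invented:

```python
# Ranks for the USPSTF design hierarchy; a lower rank is a stronger design.
DESIGN_RANK = {"1": 0, "2-1": 1, "2-2": 2, "3": 3}

def meets_qualifying_grade(source_design: str, qualifying_design: str) -> bool:
    """True if a data source's design meets or exceeds the qualifying grade."""
    return DESIGN_RANK[source_design] <= DESIGN_RANK[qualifying_design]

# Hypothetical data sources for a single model parameter.
sources = [
    {"name": "trial A", "design": "1"},
    {"name": "cohort B", "design": "2-2"},
    {"name": "expert opinion C", "design": "3"},
]

qualifying = [s for s in sources if meets_qualifying_grade(s["design"], "1")]
print([s["name"] for s in qualifying])  # only "trial A" meets a level 1 criterion
```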
Varying the Qualifying Grade of Evidence in Sensitivity Analyses
After performing a base-case analysis in which we used all evidence (regardless of its quality), we performed 4 separate sensitivity analyses on the basis of quality of evidence. Three of them consisted of imposing the strictest evidence criteria for 1 of the 3 evidence domains while using all evidence for the other 2 domains. In the fourth analysis, we simultaneously set all 3 evidence criteria to their strictest levels.
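In configuration form, the base case and the 4 sensitivity analyses can be sketched as follows (“any” meaning that all evidence is accepted for that domain; the strict levels are the strongest grades of each scale described above):

```python
# The base case plus the 4 quality-of-evidence sensitivity analyses.
# "any" accepts all evidence for that domain; the strict levels shown
# (level 1 design, good internal validity, high external validity)
# are the strictest grades of each evidence domain.
scenarios = {
    "base case":                {"design": "any", "internal": "any",  "external": "any"},
    "strict study design":      {"design": "1",   "internal": "any",  "external": "any"},
    "strict internal validity": {"design": "any", "internal": "good", "external": "any"},
    "strict external validity": {"design": "any", "internal": "any",  "external": "high"},
    "all criteria strict":      {"design": "1",   "internal": "good", "external": "high"},
}

for name, criteria in scenarios.items():
    print(f"{name}: {criteria}")
```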
Estimation of Parameter Input Distributions
When more than 1 study met the qualifying grade of evidence, we used the data source with the most statistically precise estimate. Specifically, we used the mean and 95% CI from the study to define the corresponding parameter’s distribution. We chose this approach for its simplicity, as well as to reflect common modeling practice. However, an alternative approach would be to perform a formal meta-analysis using all data sources that met the qualifying grade of evidence (assuming that the studies were sufficiently homogeneous to justify combining them).
When data sources for a particular model parameter did not meet the qualifying grade of evidence, we did not base the parameter point estimate on those sources. Instead, we assumed a uniform distribution over a range that was sufficiently wide and inclusive to encompass the likely range of model users’ beliefs (that is, prior probability distributions) about the true value of that parameter. To minimize our dependence on assumptions, we specified uniform distributions that were neutral, thereby not favoring any particular direction of effect. For example, when we imposed a strict qualifying grade of evidence for external validity, none of the studies of the effectiveness of DOT met our criterion, so we specified this parameter by a uniform distribution centered on the null effect (DOT had no effect on adherence).
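The two rules described above, using a qualifying study’s mean and 95% CI when one exists and falling back to a null-centered uniform distribution otherwise, can be sketched as follows; the means, CIs, and ranges are hypothetical, and the sketch assumes approximate normality of the study estimate:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def from_qualifying_study(mean, ci_low, ci_high, n):
    """Approximate a parameter's distribution from a study's mean and 95% CI,
    assuming rough normality (sd ~ CI width / (2 * 1.96))."""
    sd = (ci_high - ci_low) / (2 * 1.96)
    return rng.normal(loc=mean, scale=sd, size=n)

def null_centered_uniform(half_width, n, null_value=0.0):
    """Wide uniform distribution centered on the null effect, used when no
    data source meets the qualifying grade of evidence."""
    return rng.uniform(null_value - half_width, null_value + half_width, size=n)

# Hypothetical effect of DOT on adherence (absolute change in probability).
qualifying_source_exists = False
dot_effect = (from_qualifying_study(0.15, 0.05, 0.25, n=1000)
              if qualifying_source_exists
              else null_centered_uniform(half_width=0.30, n=1000))
print(f"mean={dot_effect.mean():.3f}, sd={dot_effect.std():.3f}")
```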
Decision Analytic Model
We constructed a simple decision analytic model by using standard methods to specify parameters as probability distributions (that is, ranges of possible values in which some values may be more likely than others). In the model, individuals may or may not receive DOT, and DOT may or may not affect whether individuals take HIV treatment. HIV treatment, in turn, influences whether people live or die. We used this simple model solely to illustrate this approach; it differs completely from the HIV decision model published elsewhere by Braithwaite and colleagues (27). We deliberately made the model as simple as possible to ensure that its complexity would not be a barrier to understanding the concept of incorporating quality of evidence into a decision model, which applies to any expected-value decision model, no matter how complex.
Figure. Schematic diagram of decision analytic model
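To show how such a model produces one result per run, here is a minimal Monte Carlo sketch of a decision tree with this general shape; every distribution and number is invented for illustration, and none is an input from our model:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
N_RUNS = 1000  # the text reports 1000 runs per simulation

# Hypothetical parameter distributions; every number is illustrative only.
adherence_no_dot = rng.uniform(0.40, 0.70, N_RUNS)    # Pr(adherent) without DOT
dot_effect       = rng.uniform(0.00, 0.30, N_RUNS)    # absolute adherence gain with DOT
surv_adherent    = rng.uniform(0.85, 0.95, N_RUNS)    # Pr(survive) if adherent
surv_nonadherent = rng.uniform(0.60, 0.80, N_RUNS)    # Pr(survive) if not adherent
cost_dot         = rng.uniform(2_000, 6_000, N_RUNS)  # added cost of DOT, dollars

def expected_survival(p_adherent, s_adherent, s_nonadherent):
    # Expected value across the adherent / nonadherent branches of the tree.
    return p_adherent * s_adherent + (1 - p_adherent) * s_nonadherent

surv_with_dot    = expected_survival(np.clip(adherence_no_dot + dot_effect, 0, 1),
                                     surv_adherent, surv_nonadherent)
surv_without_dot = expected_survival(adherence_no_dot, surv_adherent, surv_nonadherent)

incremental_effect = surv_with_dot - surv_without_dot  # one effectiveness gain per run
incremental_cost   = cost_dot                          # one added cost per run
print(f"median incremental effect: {np.median(incremental_effect):.3f}")
```

Each position in the arrays corresponds to one run: a random draw for every parameter, followed by an expected-value calculation over the branches of the tree.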
We used a type of model that generates a result from each of many runs; in each run, the model draws a value at random from the probability distribution of each parameter input (that is, a Monte Carlo simulation). Because the parameter values vary from run to run, the results vary from run to run. We used a cost-effectiveness model, so each run generated a value for incremental cost and incremental effectiveness. In accord with accepted practice for presenting model results, we display the results in 2 different ways. First, we show a confidence ellipse, which indicates the portion of the cost-effectiveness plane (that is, a 2-dimensional graph of incremental cost and incremental effectiveness) in which 95% of run results are likely to occur (30). The confidence ellipse is analogous to a confidence interval: smaller ellipses correspond to more precise results, and wider ellipses to less precise results. Second, we show an acceptability curve, which shows the proportion of observations that fall beneath a range of hypothetical thresholds that society is willing to pay for health benefits (31). A range of $50 000 to $100 000 per quality-adjusted life-year (QALY) is commonly used as a de facto standard for cost-effectiveness; if a high proportion of observations falls below this range, it is evidence in favor of cost-effectiveness. We ran each simulation 1000 times.
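As one common (and simplified) formulation, an acceptability curve can be computed from the per-run results by using incremental net monetary benefit, as in this sketch with synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Synthetic stand-ins for per-run simulation results (one pair per run).
inc_cost   = rng.normal(4_000, 1_000, 1000)  # incremental cost, dollars
inc_effect = rng.normal(0.05, 0.04, 1000)    # incremental effectiveness, QALYs

def acceptability_curve(cost, effect, wtp_values):
    """Proportion of runs judged cost-effective at each willingness-to-pay
    threshold, using incremental net monetary benefit: w * effect - cost >= 0."""
    return [float(np.mean(w * effect - cost >= 0)) for w in wtp_values]

wtp_grid = np.arange(0, 200_001, 25_000)  # $/QALY; spans the $50,000-$100,000 range
for w, p in zip(wtp_grid, acceptability_curve(inc_cost, inc_effect, wtp_grid)):
    print(f"${int(w):>7,}/QALY: {p:.2f}")
```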
Role of the Funding Source
The funding source had no role in the study’s design, conduct, or analysis or in the decision to submit the manuscript for publication.