In this analysis we found that the probability of reflux resolution systematically differed among 3 Internet based calculators. For certain patients these calculators may produce predictions ranging from a low to a high probability of resolution despite identical patient characteristics. For example 1 patient was calculated to have a 2-year resolution probability ranging from 24% to 89%, while another had a probability of 7% to 48%. Perhaps more importantly the 3 calculators produced widely divergent discriminatory abilities when a variety of threshold cutoff values were used. For example if a 2-year resolution probability of 25% was determined to be a clinically meaningful threshold, 1 calculator would predict that 16 of 100 cases would be below this threshold or unlikely to resolve, while another predicted that 43 of these same cases would be unlikely to resolve. Interestingly these differences were present for each VUR grade as well.
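The threshold effect described above can be sketched in a few lines of code. All numbers below are invented for illustration (they are not the study's actual predictions); the point is only that a single clinically meaningful cutoff can classify the same patients differently depending on which calculator produced their probabilities.

```python
# Hypothetical 2-year resolution probability deemed clinically meaningful
threshold = 0.25

# Invented predictions for the same 5 hypothetical patients from 2 calculators
calc_a = [0.10, 0.20, 0.30, 0.60, 0.85]
calc_b = [0.05, 0.15, 0.22, 0.24, 0.40]

# Count patients falling below the cutoff, i.e. labeled "unlikely to resolve"
unlikely_a = sum(p < threshold for p in calc_a)
unlikely_b = sum(p < threshold for p in calc_b)
print(unlikely_a, unlikely_b)  # prints: 2 4
```

With these invented inputs one calculator labels 2 of 5 patients unlikely to resolve while the other labels 4 of the same 5, mirroring the 16 vs 43 per 100 divergence observed in the study.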
While an experienced clinician who is well versed in the current VUR literature might be expected to evaluate these differences critically, someone less familiar with pediatric VUR may be unable to do so. In particular it seems doubtful that the average parent of a child with VUR could be expected to interpret reliably the differing predictions of the calculators we used. This issue is of particular relevance in terms of surgical decision making, given that parent preferences are widely cited as a motivating factor behind early surgical intervention.9,10
Parents who are presented with a 24% probability of resolution might choose a different (and possibly more aggressive) treatment algorithm than parents of a similar child who are given an 89% probability of resolution. Similarly a 7% resolution probability might lead a clinician to recommend a different management strategy than he or she would choose for a child with a 48% chance of spontaneous resolution. However, in both of these instances the child in question is exactly the same. The only difference is the methodology behind the calculator estimates. As such, it is clear that these calculators, which are freely accessible to parents, pediatricians and specialists, could reasonably be expected to exert an influence on management choices purely based on which calculator happens to be used.
Further complicating this issue is the high level of variability in patient understanding of risks and probabilities. Patients and families often have a poor understanding of quantitative risk information, although their understanding can be improved by incorporating uncertainty into the risk estimate or by the use of graphic presentation methods.11,12
Specifically it is well documented that patients commonly underestimate or overestimate medical risks, and multiple studies have shown that patient understanding of risk estimates can be greatly influenced by the specific wording used to frame a risk estimate.11–15
Of particular relevance to probability calculators such as the ones we used is the finding that patients tend to be unrealistically optimistic in their interpretation of risk data.13,15
Although one might assume that the use of personalized risk estimates would be easier for patients to understand than generalized risk estimates, little data exist to support this idea. Similarly there is little evidence that patients receiving personalized risk data make better, or better informed, decisions. However, patients receiving risk information of any kind tend to have higher satisfaction with their medical decisions.14
Also, multiple studies have revealed that risk estimates are a useful component of decision aids, which in turn can improve patient decision making and decision satisfaction. In this sense these calculators may serve an important and useful role for patients and families, despite their variability.
Predictive models have the potential to be useful adjuncts to clinical management. However, for clinicians to use them effectively a basic understanding of how these models function is critical. In this case we found significant variation among 3 predictive models applied to an identical set of randomly generated hypothetical patients. All modeling techniques have their own unique methodological advantages and disadvantages. The precision and accuracy of these models depend on the premise that the model technique being used is the most appropriate for that particular data set.
Interestingly each of the 3 calculators we tested uses a different predictive method. The CHB calculator uses a logistic regression model, the Iowa calculator uses a neural network model, and the Q-Med calculator uses a nomogram derived from a previous systematic review and meta-analysis.1
It is important to realize that although each method has specific advantages compared to the others, none of these methods is inherently superior to any other for all situations. Generally use of model selection techniques is recommended to identify a best-fitting model to make the best statistical inference possible for a given population.16–18
Without access to the original data from which each prediction model was derived it is impossible to make a post hoc determination regarding whether a given model is the best choice for a given population. Nevertheless, model choice and type can significantly impact how well a model functions for a particular population, as our results demonstrate.
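To make concrete how methodology alone can drive divergent estimates, the sketch below applies a toy logistic regression and a toy nomogram-style lookup to the same hypothetical patient. Every coefficient and point mapping here is invented purely for illustration; none reflects the actual parameters of the CHB, Iowa or Q-Med calculators.

```python
import math

# One hypothetical patient: grade 3 VUR, 24 months old at diagnosis, unilateral
patient = {"grade": 3, "age_months": 24, "bilateral": 0}

def logistic_model(p):
    # Logistic regression: probability = 1 / (1 + e^-z) for linear predictor z.
    # The coefficients are invented for illustration.
    z = 1.0 - 0.8 * p["grade"] - 0.02 * p["age_months"] - 0.5 * p["bilateral"]
    return 1.0 / (1.0 + math.exp(-z))

def nomogram_model(p):
    # Nomogram-style estimate: each predictor contributes points, and the point
    # total maps to a probability. The point table is likewise invented.
    points = {1: 80, 2: 65, 3: 45, 4: 25, 5: 10}[p["grade"]]
    if p["bilateral"]:
        points -= 5
    return points / 100.0

p_logit = logistic_model(patient)
p_nomo = nomogram_model(patient)
print(f"logistic: {p_logit:.0%}, nomogram: {p_nomo:.0%}")
```

Both toy models are internally coherent, yet for this identical patient the logistic model returns roughly a 13% resolution probability while the nomogram returns 45%, which is precisely the kind of spread observed among the real calculators.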
Perhaps more importantly the generalizability of the model depends on the premise that the patients on whom the model is based are representative of the population to which the model will be applied. For this reason it is difficult to overstate the value of external validation for such models. It is reassuring that at least 2 of the 3 calculators we studied have undergone or are currently undergoing external validation (H. T. Nguyen, personal communication).19
Lastly it is noteworthy that this study is not intended to be, and should not be interpreted as, a methodological critique of 1 or more of the calculators we investigated, since all appear to use reasonable methods drawn from appropriately performed studies. Similarly this study was not intended to validate or otherwise judge the relative accuracy of 1 calculator vs another. Because we used a randomly generated cohort of hypothetical subjects instead of actual patients, it is impossible to assess the accuracy of any model. Rather, the goal of this study was simply to investigate the variation in results that would be obtained by a typical family or a typical clinician seeking to use a tool encountered during a cursory Internet search. While these data clearly show that different Internet based calculators will produce different probabilities of spontaneous resolution for a given child, these differences should be assumed to reflect methodological variation rather than quality variation. Unfortunately whatever the cause of these variations, their net effect is the same—for some patients using a particular calculator rather than another will result in clinically significant differences in the predicted chance of spontaneous resolution, which may in turn lead to a different treatment regimen than they might otherwise have chosen.