Study Design

We conducted a secondary analysis of data originally collected to establish the feasibility and efficacy of using a novel risk communication aid during breast oncology consultations. The study provides an appropriate dataset in which to conduct a preliminary exploration of categorical and continuous accuracy measures. The dataset is small enough to be included in the first three columns of Table 2, along with key results, so that readers can easily reproduce and verify calculations. The intervention was designed to improve accuracy, as measured by a categorical standard with clinical relevance, but the improvement was not statistically significant (p=0.125). This allowed us to explore whether continuous measures would make a material difference in hypothesis testing (i.e. achieve statistical significance), and if so whether the continuous measures were as clinically relevant as the categorical one. We now briefly describe the aspects of the primary study design that were relevant to the secondary analysis. We provide a detailed description of the primary study in another report [6].

**Table 2.** Patient and gold standard estimates of 10-year local therapy mortality risk, with categorical and continuous measures of accuracy. Highlighted numbers are undefined boundary values for Kullback-Leibler. For Kullback-Leibler calculation, the boundary value ...

Population, Setting, and Study Site

The primary study took place at the breast care center in an academic medical center in San Francisco. The center treats over 500 newly diagnosed breast cancer patients per year. The population of new breast cancer patients at this center is mostly White, college educated, affluent, and insured.

Subjects, Recruitment, and Consent

Researchers recruited a convenience sample of 20 patients consecutively referred for oncology consultations with either of two senior oncologists. Patients were eligible to participate in the study if they could speak and read English, if they had completed surgery for stage I, II, or IIIa breast cancer, if they had not initiated any form of adjuvant therapy, and if their medical charts included tumor size, tumor grade, hormone receptor status, node status, and age. Patients were not eligible if they had metastatic disease, if they needed further surgery to complete staging, or if they were unable to provide informed consent. The institutional Committee on Human Research and the funding agency’s Institutional Review Board approved the study protocol. Patients were enrolled between October 2001 and February 2002.

Outcomes and Instruments

Subjects filled out a brief survey asking them to estimate their 10-year mortality risk with and without adjuvant therapy, before and after an educational presentation of gold-standard estimates by their oncologist. The pre- and post-visit surveys took the form: “The chance that I will die from my breast cancer within the next 10 years after having [therapy] is (circle one): [response].” The response format was a list of potential responses ranging from 0% to 100% in increments of 5%. Patients were prompted to respond for local therapy (surgery and local radiation) and adjuvant therapy (systemic chemotherapy or hormone therapy) scenarios. This secondary analysis examined the accuracy of patient estimates of the risk of dying within 10 years after local therapy only, i.e. with no adjuvant therapy. We focused our analysis on this topic because local therapy prognosis is a baseline prognosis that patients should understand before they consider adding therapy.

Intervention

The intervention consisted of an oncologist reviewing a printout of the graphs from the Adjuvant! software program, which presents estimates of the patient prognosis based on patient-specific inputs consisting of age, tumor size and grade, estrogen receptor status, node status, and number of comorbidities [7]. Adjuvant! is a validated, widely used prognostic model [8, 9]. Its estimates functioned as a gold standard for patient 10-year local therapy mortality risk in this study. Oncologists presented several estimates; our analysis focuses on local therapy only.

Data Collection and Management Procedures

A research assistant transcribed the survey responses to an Excel workbook, and entered the corresponding gold standard estimates from Adjuvant! for comparison with patient estimates.

Measures of Patient Accuracy

Categorical
- An indicator of whether each patient estimate was within plus or minus 5% of the gold standard. This was a binary variable with 1 indicating an estimate within 5% and 0 indicating an estimate outside the 5% threshold. The oncologists recruiting patients to the original study felt that five percent was a clinically meaningful threshold for accuracy in this patient population, and it corresponds to a conservative estimate for the Adjuvant! model’s margin of error [8].

Continuous
- Absolute bias, defined as the magnitude of the difference between the patient and Adjuvant! estimate.
- Brier score, defined as the square of the difference between the patient and gold standard estimates.
- Kullback-Leibler divergence score, defined by

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{2} P(i) \log \frac{P(i)}{Q(i)}$$

where P and Q represent the gold standard and patient estimates, respectively. In this case the summation is over the two possible outcomes, death or survival, with P(2) = 1 − P(1) and Q(2) = 1 − Q(1), where P(1) and Q(1) represent the 10-year mortality estimates. The Kullback-Leibler divergence score is undefined for estimates of 0 or 1. We substituted 5% for 0 and 95% for 1 in the computation of the Kullback-Leibler score, since these were the closest allowable responses to 0 and 1 on the patient survey.
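The four measures can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names are hypothetical, the natural logarithm is assumed (the text does not state a base), the ±5% threshold is assumed to be inclusive, and the boundary substitution is applied to both estimates for safety.

```python
import math

def clamp_for_kl(p, lo=0.05, hi=0.95):
    """Substitute 0.05 for 0 and 0.95 for 1, as described in the text."""
    if p <= 0.0:
        return lo
    if p >= 1.0:
        return hi
    return p

def within_threshold(patient, gold, threshold=0.05):
    """Categorical measure: 1 if within +/- threshold of the gold standard, else 0.

    Assumption: the threshold is inclusive; a tiny tolerance absorbs
    floating-point noise at the boundary.
    """
    return 1 if abs(patient - gold) <= threshold + 1e-9 else 0

def absolute_bias(patient, gold):
    """Magnitude of the difference between patient and gold standard estimates."""
    return abs(patient - gold)

def brier_score(patient, gold):
    """Square of the difference between patient and gold standard estimates."""
    return (patient - gold) ** 2

def kl_divergence(patient, gold):
    """Two-outcome Kullback-Leibler divergence with P = gold standard, Q = patient.

    Assumption: natural log; boundary substitution applied to both inputs.
    """
    p1 = clamp_for_kl(gold)
    q1 = clamp_for_kl(patient)
    return p1 * math.log(p1 / q1) + (1 - p1) * math.log((1 - p1) / (1 - q1))
```

All three continuous functions return zero when the patient estimate equals the gold standard, and estimates are expressed as proportions (for example, 0.25 for 25%).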

All four measures are used in information theory; Grunwald and Dawid [10] provide a thorough discussion of these and other measures. We selected these measures because they are easy to compute and widely used.

All of the continuous measures are lowest (zero) when the patient estimate equals the gold standard and increase as the patient estimate moves away from the gold standard. Since we were interested in measuring the effect of the oncologists’ risk communication, we used the difference in these statistics, *before* minus *after* scores, as a measure of the information gained by the patient as a result of exposure to the oncologist presentation. For the continuous measures, positive values of this difference indicate improvements in patient accuracy, since they mean the *before* score was bigger (worse) than the *after* score.

Analysis Plan

The study questions were: “How sensitive and efficient was each of the categorical and continuous measures, and were there significant differences in sensitivity across these measures?”

We began by calculating the sensitivity and efficiency of each measure. As a first measure of sensitivity, we tested the null hypothesis of no change in patient accuracy from before to after the oncology visit, to see whether each measure detected a statistically significant change. For the categorical measure, because our data were paired, we used McNemar’s test to compare the number of patients who improved to within ±5% with the number who started within that margin of error but ended up outside it. We ran additional scenarios to explore the effect on statistical sensitivity, as measured by p-values, of relaxing our categorical standard of accuracy from plus or minus 5% to plus or minus 10% and 15%.
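As an illustration of the paired categorical comparison, the following sketch computes a two-sided exact (binomial) McNemar p-value from the discordant pairs. The text does not state which variant of McNemar’s test was used, so the exact form is an assumption here, and the function names are hypothetical.

```python
from math import comb

def discordant_counts(before, after):
    """Count patients who moved into (0 -> 1) or out of (1 -> 0) the accuracy margin."""
    improved = sum(1 for b, a in zip(before, after) if b == 0 and a == 1)
    worsened = sum(1 for b, a in zip(before, after) if b == 1 and a == 0)
    return improved, worsened

def mcnemar_exact(improved, worsened):
    """Two-sided exact McNemar p-value based on the discordant pairs only."""
    n = improved + worsened
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of change
    k = min(improved, worsened)
    # Under the null, each discordant pair is equally likely to go either way.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With hypothetical counts of 5 patients improving and 1 worsening, `mcnemar_exact(5, 1)` returns 0.21875. Relaxing the threshold to 10% or 15% only changes the binary indicators fed to `discordant_counts`.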

For the continuous measures, we applied the Shapiro-Wilk test for normality, transformed the distributions as needed, and then used a paired t-test to determine whether the observed differences were statistically significant. Since it was possible that the risk presentation could have led to decreased patient accuracy, through patient overload, both tests were two-sided at the 5% level of significance.
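The paired comparison for a continuous measure can be sketched as follows. This is a minimal illustration with hypothetical names: it computes the *before* minus *after* differences, the paired t statistic, and a standardized effect size d taken as the mean difference divided by the standard deviation of the differences (an assumed definition, since the text does not define d). The normality test, any transformation, and the p-value lookup from the t distribution are left to a statistics package.

```python
import math
import statistics

def paired_summary(before, after):
    """Summarize paired scores: differences are before minus after."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d = mean_diff / sd_diff                    # assumed definition of d
    return {"n": n, "df": n - 1, "mean_diff": mean_diff,
            "sd_diff": sd_diff, "t": t_stat, "d": d}
```

A positive `mean_diff` corresponds to improved accuracy after the visit, and the t statistic is compared against a t distribution with `df` degrees of freedom.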

As an index of sensitivity, we calculated the p-value for the paired comparisons defined by the categorical and three continuous measures.

For each measure we then calculated the sample size needed to have 90% power (at 5% significance level) to replicate the observed effect in a separate, independent study. See Appendix 1 for sample size calculation details. These calculations were performed using Stata 10 [11] and provided a measure of the efficiency of each of the categorical and continuous measures considered separately for this sample of patients.
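For intuition about how the required sample size depends on effect size, the sketch below uses the common normal approximation for a two-sided paired comparison, n ≈ ((z₁₋α/₂ + z_power) / d)². This is not the Appendix 1 or Stata calculation, which may use a different formula; the function name is hypothetical and d is assumed to be the standardized mean of the paired differences.

```python
import math
from statistics import NormalDist

def paired_sample_size(d, power=0.90, alpha=0.05):
    """Approximate number of pairs needed to detect standardized effect d.

    Normal approximation only; exact t-based methods give slightly larger n.
    """
    if d == 0:
        raise ValueError("effect size must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / abs(d)) ** 2)
```

Under this approximation, a measure with a larger standardized effect needs fewer patients to reach the same power, which is the sense in which it is more efficient.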

To test statistically whether the continuous measures differed significantly in their sensitivity, we tested for differences in standardized effect sizes, denoted d. We had to take into account that our data consist of a single set of patient scores. The differences in measures are due to different calculation methods rather than different samples of scores. The measures are therefore correlated, which complicates statistical comparisons. In particular, we could not use standard meta-analytic techniques for testing the homogeneity of effect sizes across different samples [12]. Instead we used a bootstrap method to calculate 95% confidence intervals for the differences between d’s across the continuous measures [13].

Specifically, we wrote a bootstrap function in the statistical programming language R to resample, with replacement, scores from our 20 patients. For each bootstrap sample we calculated the difference between the log-transformed Kullback-Leibler and Brier estimated d’s. (The d for the absolute bias is identical to that for the Brier.) This was repeated 1000 times, and the 2.5 and 97.5 percentile values were used to estimate the 95% confidence interval for the difference in d’s. A confidence interval spanning zero would indicate no statistically significant difference in the effect size across the continuous measures.
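The bootstrap procedure can be sketched as below. The original function was written in R and is not reproduced here; this Python version is an illustrative assumption with hypothetical names. It resamples patients (so the two measures stay paired within each resample), takes d as the mean of the paired differences divided by their standard deviation, and uses simple percentile cut-offs.

```python
import random
import statistics

def effect_size(diffs):
    """Standardized effect size: mean difference over SD of differences (assumed)."""
    return statistics.mean(diffs) / statistics.stdev(diffs)

def bootstrap_d_difference(diffs_a, diffs_b, n_boot=1000, seed=0):
    """Percentile 95% CI for d(measure A) - d(measure B), resampling patients."""
    rng = random.Random(seed)
    n = len(diffs_a)
    estimates = []
    for _ in range(n_boot):
        # Draw patient indices with replacement; reuse them for both measures
        # so the correlation between measures is preserved.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = [diffs_a[i] for i in idx]
        sample_b = [diffs_b[i] for i in idx]
        if statistics.stdev(sample_a) == 0 or statistics.stdev(sample_b) == 0:
            continue  # skip degenerate resamples with no variation
        estimates.append(effect_size(sample_a) - effect_size(sample_b))
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    m = len(estimates)
    lower = estimates[int(0.025 * m)]
    upper = estimates[min(m - 1, int(0.975 * m))]
    return lower, upper
```

Here `diffs_a` and `diffs_b` would hold each patient's *before* minus *after* score under two measures, in the same patient order. An interval that contains zero is read as no detectable difference in effect size between the two measures.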