Physicians’ recommendations affect patients’ treatment choices. However, most research relies on physicians’ or patients’ retrospective reports of recommendations, which offer a limited perspective and are subject to biases such as recall bias.
To develop a reliable and valid method to measure the strength of physician recommendations using direct observation of clinical encounters.
Clinical encounters (n = 257) were recorded as part of a larger study of prostate cancer decision making. We used an iterative process to create the 5-point Physician Recommendation Coding System (PhyReCS). To determine reliability, research assistants double-coded 50 transcripts. To establish construct validity, we used one-way ANOVAs to determine whether relative treatment recommendation scores differed as a function of which treatment patients received. To establish concurrent validity, we examined whether patients’ perceived treatment recommendations matched our coded recommendations.
The PhyReCS was highly reliable (Krippendorff’s alpha = .89, 95% CI [.86, .91]). The average relative treatment recommendation score for each treatment was higher for individuals who received that particular treatment. For example, the average relative surgery recommendation score was higher for individuals who received surgery versus radiation (mean difference = .98, SE = .18, p < .001) or active surveillance (mean difference = 1.10, SE = .14, p < .001). Patients’ perceived recommendations matched coded recommendations 81% of the time.
The PhyReCS is a reliable and valid way to capture the strength of physician recommendations. We believe that the PhyReCS would be helpful for other researchers who wish to study physician recommendations, an important part of patient decision making.
There has been an increasing interest in empowering patients as informed consumers of healthcare goods and services.1 As informed consumers, patients often must choose between multiple treatment options. For example, in early stage prostate cancer, patients must choose whether to receive surgery, radiation, or active surveillance. Each of these treatment options is associated with a unique profile of risks and benefits, and therefore there is not a single right treatment option for all patients.2 Shared decision making is considered by many to be the “pinnacle of patient-centered care,” a process by which patients and physicians work together to choose the best treatment based on both medical factors and patients’ individual preferences.3 As part of this process, physicians may provide patients with recommendations. It is vital to be able to accurately capture these recommendations. Even within the paradigm of shared decision making, physicians’ recommendations strongly impact patients’ treatment choices, potentially even more so than patients’ cancer severity, age, or anxiety.4–6
However, current research on physician recommendations has several limitations. Physician recommendations are frequently treated as binary, in which a physician either does or does not recommend a single treatment option.4 In reality, however, treatment recommendations are often more nuanced, and physicians can provide recommendations of varying strength for multiple treatments. Additionally, most studies have relied on patient reports of physician recommendations, which are subject to recall bias.7,8 Furthermore, motivated cognition may lead patients to misremember physician recommendations, such that their reported recommendation matches their treatment choice rather than accurately reflecting their conversation with the physician.9
In order to address these limitations, we developed the Physician Recommendation Coding System (PhyReCS), which captures the strength of physician recommendations during appointments within the context of early stage prostate cancer. The PhyReCS addresses the aforementioned limitations in the following ways: it is a continuous (rather than binary) measure, has the flexibility to capture multiple nuanced recommendations, and avoids problems associated with relying on patients’ retrospective reports of recommendations. In this article, we provide an in-depth explanation of the PhyReCS, measure its reliability, and assess its validity.
Appointments (n = 257) were recorded and transcribed as part of a larger trial in which men undergoing prostate biopsies were randomized to receive either a standard or low-literacy prostate cancer treatment decision aid prior to choosing a treatment for their early stage prostate cancer.10 The type of decision aid did not influence our measures of interest;* therefore, it is not discussed further in this article. Appointments were recorded from 2008–2012 at four geographically dispersed, academically affiliated Veterans Affairs medical centers. During each appointment, the patient and physician discussed treatment options for the patient’s newly diagnosed early stage (low or intermediate risk) prostate cancer. Patient and physician demographics are listed in Table 1. There were 47 unique physicians in our study, most of whom were residents or fellows. On average, each physician was recorded in 5.31 clinical encounters (SD = 3.77).
We identified a subset of transcripts using maximum variation purposeful sampling techniques, in which we selected transcripts that differed on variables that we expected to influence physician recommendations (e.g. age, Gleason Score).11 We then used an iterative process to develop a 5-point Physician Recommendation Scale to capture how physicians portrayed each treatment option during the clinical appointment as a whole. We defined the boundaries of each recommendation score through repeated application and discussion.
For each treatment option (surgery, radiation, and active surveillance),† recommendations were coded as follows: +2 (−2) indicated that the physician made a strong recommendation for (against) the treatment, +1 (−1) indicated that the physician made a mild recommendation for (against) the treatment, and 0 indicated that the physician recommended neither for nor against the treatment. If the physician did not mention a treatment, this was coded as “not discussed;” for the purpose of these analyses, this was treated as equivalent to a strong recommendation against the treatment option (−2), because such an omission essentially indicated that the physician did not think it was even worth mentioning the option. Such omissions occurred relatively infrequently (surgery: n = 3; radiation: n = 3; active surveillance: n = 11). Thus, for each appointment, coders assigned a recommendation score for each of the three primary treatment options (surgery, radiation, and active surveillance). Importantly, recommendation scores were independent such that a recommendation against a particular treatment did not automatically translate into a recommendation for another treatment.
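The scoring rule above can be sketched in a few lines. This is an illustrative sketch only, not part of the PhyReCS codebook; the label strings and function name are our own inventions.

```python
# Illustrative sketch of the PhyReCS numeric scoring rule described above.
# Label names are hypothetical; only the numeric values come from the text.
# "ND" (not discussed) is treated as equivalent to a strong
# recommendation against the option (-2).
SCORE_MAP = {
    "strong_against": -2,
    "mild_against": -1,
    "neutral": 0,
    "mild_for": 1,
    "strong_for": 2,
    "ND": -2,
}

def encounter_scores(coded_labels):
    """Convert one encounter's coded labels into numeric scores.

    coded_labels: dict mapping each treatment ('surgery', 'radiation',
    'active_surveillance') to one of the labels in SCORE_MAP. Scores are
    independent across treatments, as in the coding system.
    """
    return {tx: SCORE_MAP[label] for tx, label in coded_labels.items()}

scores = encounter_scores({
    "surgery": "mild_for",
    "radiation": "neutral",
    "active_surveillance": "ND",
})
# scores == {'surgery': 1, 'radiation': 0, 'active_surveillance': -2}
```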
Although recommendation scores were global judgments that considered the appointment in its entirety, there were often key statements that captured the sentiment of physicians’ feelings towards a particular treatment option. Table 2 provides examples of these types of statements. As noted above, however, keep in mind that final recommendation scores were global scores based on the entire appointment rather than any single statement in isolation; therefore, we provide an example of how recommendation scores evolved over the course of an appointment in Table 3.
The lead researcher trained five research assistants (RAs) using the finalized codebook (available in the online appendix): each RA received approximately 30 hours of guided practice over a period of 2–3 weeks until they demonstrated a thorough understanding of the PhyReCS. RAs then double-coded a random subset of 50 previously unseen transcripts, which we used to calculate reliability. Discrepancies were resolved via team discussion. Given the high reliability (see below), it was appropriate for RAs to single-code the remainder of encounters (n = 207). All coding was finished within 3 weeks, minimizing the possibility of coder drift. RAs coded the transcripts in the development set at the end of the coding period to minimize the chance of carryover from the training period. Example transcripts are available upon request.
To determine reliability, we calculated Krippendorff’s alpha for the recommendation scores for the subset of transcripts that were double-coded (n = 50). Krippendorff’s alpha offers advantages over other measures of inter-rater reliability, such as the intraclass correlation coefficient or weighted kappa, and it can be used “regardless of the number of observers, levels of measurement, sample sizes, and presence or absence of missing data.”12,13 We treated the scale as an interval variable‡ and used Hayes’ 2013 SPSS macro to calculate Krippendorff’s alpha, using 5000 bias-corrected bootstrap samples to determine 95% confidence intervals.12 A Krippendorff’s alpha value of 0 indicates that reliability was no better than would be expected by chance, whereas a value of 1 indicates perfect reliability; values of 0.67–0.8 are considered indicative of acceptable reliability and values above 0.8 indicate excellent reliability.13
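For readers unfamiliar with the statistic, the interval form of Krippendorff’s alpha reduces to a simple computation in the special case used here (two coders, no missing data). The sketch below is our own simplification of the general coincidence-matrix formulation, not the published SPSS macro, and it gives the point estimate only (no bootstrap confidence intervals).

```python
def interval_alpha(coder1, coder2):
    """Krippendorff's alpha for interval data: two coders, no missing values.

    alpha = 1 - D_o / D_e, where D_o is the observed disagreement (mean
    squared difference within units) and D_e is the expected disagreement
    (mean squared difference over all pairs of pooled values, as if the
    scores had been assigned at random).
    """
    assert len(coder1) == len(coder2), "each unit needs both coders' scores"
    n = len(coder1)
    # Observed disagreement: squared difference within each double-coded unit.
    d_o = sum((a - b) ** 2 for a, b in zip(coder1, coder2)) / n
    # Expected disagreement: squared differences over all pairs of pooled
    # values (the i == j terms contribute zero and are harmless).
    pooled = list(coder1) + list(coder2)
    m = len(pooled)
    d_e = sum((pooled[i] - pooled[j]) ** 2
              for i in range(m) for j in range(m)) / (m * (m - 1))
    return 1 - d_o / d_e

# Perfect agreement on varied scores yields alpha = 1.
print(interval_alpha([2, 1, 0, -2], [2, 1, 0, -2]))  # 1.0
```

Note that because chance agreement is baked into D_e, alpha drops sharply when coders disagree on data with little variation, which is relevant to interpreting the radiation scores reported below.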
Patient treatment choice was determined via chart review six months after the recorded appointment (data available for 216 individuals).§ Five individuals received a treatment other than active surveillance, radiation, or surgery (e.g. hormone therapy); these individuals were not included in analyses that examined which treatment patients received. Therefore, for analyses involving patient treatment choice, n = 211.
To examine the construct validity of the PhyReCS, we tested whether recommendation scores differed as a function of treatment received. Given that physicians’ recommendations are a strong determinant of patients’ treatment choices,4 we expected that, on average, physicians’ coded recommendation scores would be higher for the treatment the patient received than for the non-chosen treatment options. We calculated a relative recommendation score for each treatment option, equal to that treatment’s recommendation score minus the average of the recommendation scores for the other two treatments. For example, Active Surveillance relative = [Active Surveillance raw] – [(Surgery raw + Radiation raw) / 2]. We then used a series of three one-way ANOVAs to examine whether the average relative recommendation score for each treatment differed as a function of treatment received.** For example, we tested whether the average relative surgery recommendation score differed for individuals who received surgery versus radiation versus active surveillance. If Levene’s test indicated that we violated the assumption of homogeneity of variance, we used a Brown-Forsythe correction for the omnibus F-test and a Tamhane test for pairwise comparisons; otherwise, we used a Bonferroni correction for pairwise comparisons.
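The relative-score transformation above is straightforward; a minimal sketch (function and treatment names are our own, and the division by 2 assumes exactly three treatment options, as in this study):

```python
def relative_scores(raw):
    """Relative recommendation score for each treatment: that treatment's
    raw PhyReCS score minus the mean of the other two raw scores.

    raw: dict mapping each of the three treatment names to its raw
    score (-2..+2). Assumes exactly three treatments.
    """
    return {
        tx: score - sum(v for t, v in raw.items() if t != tx) / 2
        for tx, score in raw.items()
    }

rel = relative_scores({"surgery": 2, "radiation": 0, "active_surveillance": -1})
# surgery: 2 - (0 + -1)/2 = 2.5
# radiation: 0 - (2 + -1)/2 = -0.5
# active_surveillance: -1 - (2 + 0)/2 = -2.0
```

A positive relative score means the treatment was recommended more strongly than the alternatives on average, which is what the ANOVAs test across groups of patients.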
Patients’ perceptions of their physicians’ recommendations were determined via phone interview conducted by a professional survey company approximately 7–10 days after the recorded appointment (data available for 205 patients). Patients were asked, “Did your physician provide a recommendation?” If they indicated yes, they were then asked, “What was the recommendation?” with the answer choices of surgery, external beam radiation, brachytherapy,†† watchful waiting/active surveillance, and other. Patients who answered “other” (n = 2) were excluded from analysis; thus, for analyses involving patients’ perceived recommendations, n = 203.
To examine the concurrent validity of the PhyReCS, we examined the concordance between physicians’ recommendations as perceived by patients (“perceived recommendations”) and physicians’ recommendations as determined by coders using the PhyReCS (“coded recommendations”). Given that both patients and coders are experiencing the same conversation, if the PhyReCS is valid, there should be relatively high concordance between patients’ perceived recommendations and our coded recommendations. However, given that patients’ perceptions may be influenced by factors other than the objective occurrences during the appointment, we would not be surprised to see some differences between patients’ perceptions and our coded recommendations.
We classified the perceived versus coded recommendation as a “match” if the treatment that the patient perceived as recommended also received the highest PhyReCS recommendation score. On the other hand, we classified the perceived versus coded recommendation as a “mismatch” if the treatment that the patient perceived as recommended did not receive the highest PhyReCS recommendation score. For patients who perceived that the physician did not provide a recommendation, we classified the perceived versus coded recommendation as a “match” if more than one treatment received the highest recommendation score and as a “mismatch” in all other cases.
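The match/mismatch rule above can be sketched as follows. This is an illustrative restatement of the classification logic, with hypothetical function and treatment names; ties for the highest coded score count as a match for a perceived treatment, since that treatment did receive the highest score.

```python
def classify(perceived, coded):
    """Classify a perceived vs. coded recommendation per the rules above.

    perceived: treatment name the patient reported as recommended, or
               None if the patient perceived no recommendation.
    coded: dict mapping each treatment to its PhyReCS recommendation score.
    """
    top = max(coded.values())
    if perceived is None:
        # "No recommendation" matches only when more than one treatment
        # is tied for the highest coded score.
        n_top = sum(1 for s in coded.values() if s == top)
        return "match" if n_top > 1 else "mismatch"
    return "match" if coded[perceived] == top else "mismatch"

print(classify("surgery",
               {"surgery": 2, "radiation": 0, "active_surveillance": -1}))
# prints "match"
```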
All funding agreements ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.
This study was approved by the Institutional Review Boards at each of the participating sites; written informed consent was obtained from all patients and physicians.
Each of the 50 double-coded transcripts included three recommendation scores (one each for surgery, radiation, and active surveillance); thus, we had 150 recommendation scores with which to assess the reliability of our scoring system. The Krippendorff’s alpha for all treatments was .89 (95% CI [.86, .91]), indicating excellent reliability. The Krippendorff’s alpha for the individual treatments was as follows: active surveillance = .94 (95% CI [.92, .95]); surgery = .87 (95% CI [.82, .91]); and radiation = .64 (95% CI [.49, .80]), all of which indicate acceptable to excellent reliability. Of note, the lower reliability for the radiation scores was due to less variation in those scores; because matches are more likely to occur by chance when scores vary little, the Krippendorff’s alpha formula penalizes each individual discrepancy more heavily. Table 4 displays the observed versus expected coincidence matrices for the scores given by the two RAs coding each encounter. There were 29 total discrepancies out of 150 scores assigned. One coder was responsible for 15 of these discrepancies; she received remedial training before being allowed to continue with coding. Four of the discrepancies were significant, in which coders assigned scores that were 2 points apart on the scale. Three of these discrepancies were due to a misunderstanding of the coding rules by the coder who received remedial training. The remaining 25 discrepancies involved minor disagreements, in which coders assigned scores that differed by only one point on the scale. Importantly, coders never disagreed about whether a physician recommended for versus against a treatment option; in other words, there were no discrepancies that involved a negative versus a positive score.
A series of three separate ANOVAs revealed that, for all treatments, the relative treatment recommendation score was higher for individuals who received that treatment versus the other two treatments (Fig. 1). The relative surgery recommendation score differed as a function of treatment received [F(2,208) = 53.01, p < .001]. Specifically, the relative surgery recommendation score was higher for individuals who received surgery versus radiation [M surgery = 1.69, SE = .14 vs. M radiation = .52, SE = .20; mean difference = 1.17, SE = .24, p < .001] and surgery versus active surveillance [M surgery = 1.69, SE = .14 vs. M active surveillance = −0.27, SE = .13; mean difference = 1.96, SE = .19, p < .001]. The relative radiation recommendation score also differed as a function of treatment received [Brown-Forsythe F(2,85.78) = 23.22, p < .001]. Specifically, the relative radiation recommendation score was higher for individuals who received radiation versus surgery [M radiation = 1.08, SE = .22 vs. M surgery = .30, SE = .08; mean difference = .78, SE = .23, p = .004] and radiation versus active surveillance [M radiation = 1.08, SE = .22 vs. M active surveillance = −.25, SE = .09; mean difference = 1.33, SE = .23, p < .001]. Finally, the relative active surveillance recommendation score differed as a function of treatment received [Brown-Forsythe F(2,187.83) = 105.81, p < .001]. Specifically, the relative active surveillance recommendation score was higher for individuals who received active surveillance versus surgery [M active surveillance = .52, SE = .14 vs. M surgery = −1.99, SE = .15; mean difference = 2.51, SE = .19, p < .001] and active surveillance versus radiation [M active surveillance = .52, SE = .14 vs. M radiation = −1.61, SE = .19; mean difference = 2.12, SE = .22, p < .001].
There was a high level of concordance between patients’ perceived recommendations and the coded recommendations determined using the PhyReCS. Overall, the perceived and coded recommendation matched in 81% (164/203) of cases. Patients perceived that their physicians recommended active surveillance in 52 appointments; the perceived and coded recommendation matched in 80% (42/52) of these appointments. Patients perceived that their physicians recommended surgery in 70 appointments; the perceived and coded recommendation matched in 91% (64/70) of these appointments. Patients perceived that their physicians recommended radiation in 26 appointments; the perceived and coded recommendation matched in 85% (22/26) of these appointments. Patients perceived that their physicians provided no recommendation in 55 appointments; the perceived and coded recommendation matched in 65% (36/55) of these appointments. The perceived and coded recommendation mismatched in 19% of cases (39/203). In 44% of these cases (17/39), patients perceived no recommendation but the PhyReCS determined that the physician did recommend a particular treatment. Notably, the patient received the PhyReCS-recommended treatment in 65% of these cases (11/17).
In this paper, we demonstrated that the Physician Recommendation Coding System (PhyReCS) is a reliable and valid way to quantify the strength of physician recommendations during clinical appointments in the context of early stage prostate cancer. We showed that the scale could be applied with high reliability. We established construct validity by showing that the average relative recommendation score for each treatment was higher for individuals who received that treatment versus the other two treatment options. In addition, we demonstrated concurrent validity by showing that there was a high level of concordance between patients’ perceived recommendations and coded recommendations as determined by the PhyReCS.
We believe that the PhyReCS would be helpful for other researchers who wish to study physician recommendations, and it is flexible enough to be adapted to many clinical settings. For example, patients with early stage breast cancer must choose whether to receive breast-conserving therapy (lumpectomy plus radiation) or mastectomy. Like early stage prostate cancer, the “right” treatment choice depends upon patient preference in addition to medical factors.14 Patient-physician conversations about these treatment options are complex and the physician may recommend multiple treatment options with varying strength. There are, of course, important differences between these clinical situations, including the gender of patients; however, we believe that with proper validation the PhyReCS could help to better understand the connection between physician recommendations, patient preference, and treatment choice in clinical settings besides prostate cancer.
Given the centrality of physicians’ recommendations in the medical decision-making process, the PhyReCS will also allow researchers to answer other interesting and important questions. For example, future research could examine when and why there is discordance between coded treatment recommendations and patients’ perceived recommendations, potentially providing insights into cognitive processes such as motivated cognition and recall bias. Are there circumstances in which patients are more (vs. less) motivated to perceive treatment recommendations as consistent with their chosen treatment option? In addition, the fact that patients often received the PhyReCS-recommended treatment when they perceived no recommendation suggests that the PhyReCS may be able to capture subtle recommendations that patients do not perceive, although future research is clearly needed to more fully examine this possibility.
The PhyReCS could also help researchers examine whether patients are more versus less satisfied with their decisions and/or clinical appointments as a function of the strength of physicians’ recommendations. Patient satisfaction evaluations primarily reflect patients’ perceptions of communication with their healthcare providers;15–17 therefore, it is reasonable to assume that patient satisfaction measures will be affected by differences in physician recommendations, which could be captured with the PhyReCS. It is possible that advice may decrease decision satisfaction, as people like to feel that they are experiencing free choice,18 and receiving advice can feel like an infringement on this sense of free choice.19 Alternatively, advice may increase decision satisfaction, as advice can be an important aspect of coping and patients may feel that their physicians are emotionally supportive when they give recommendations.20 Given that recommendations change patients’ sense of responsibility for a decision and responsibility is known to impact decision satisfaction,21 the PhyReCS could also be used to examine the connection between responsibility and patient decision satisfaction.
There are limitations to the PhyReCS and our study in general. First, although the PhyReCS captures the strength of physician recommendations, it does not capture other aspects of the recommendation, such as the motivation for physicians’ recommendations, which is an important factor when trying to fully understand physician recommendations. It also does not capture whether recommendations were solicited, which is known to influence how people perceive advice.22 Second, we did not collect other measures that would help to establish concurrent validity, such as which treatment(s) physicians believed that they recommended during the appointment. Third, our study has limitations in terms of the generalizability of our results. For example, the study was conducted in the Veterans Affairs system, where patients are older, sicker, and poorer on average,23 which may impact how physicians give recommendations. The PhyReCS may need to be adjusted with other patient populations; for example, the boundaries between recommendation scores may need to be adjusted when physicians are interacting with patients of a higher socioeconomic status. In addition, all patients (and most physicians) were male; given differences in communication styles between male and female physicians,24 future research is needed to examine the reliability and validity of the PhyReCS in clinical settings with female patients and/or physicians. Finally, although we have evidence that our scale is valid, it is possible that another scale would have done an even better job of capturing physician recommendations. Future research is needed to optimize the scale.
In conclusion, although clinical interactions within early stage prostate cancer are nuanced and complex, the PhyReCS makes it possible to capture how physicians recommend multiple treatment options with high reliability and validity. We feel the PhyReCS could allow researchers to more fully examine physician recommendations, an area with significant substantive and theoretical importance.
We would like to thank Haley Miller, Margaret Oliver, Elizabeth Reiser, and Biqi Zhang for their help coding. We would also like to thank Valerie Kahn and Daniel Cannochie for their help in data collection and project management.
Financial support for this study was provided in part by the following institutions: an IIR Merit Award from the U.S. Department of Veterans Affairs (IIR 05-283) to Angela Fagerlin, a Health Policy Investigator Award from the Robert Wood Johnson Foundation to Peter A. Ubel, and Federal Grant T32 GM007171 to Karen Scherr as part of the Medical Scientist Training Program. All funding agreements ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.
The Physician Recommendation Coding System (PhyReCS) provides researchers with a reliable and valid way to capture the strength with which physicians recommend for or against treatment recommendations during clinical appointments, developed in the context of early stage prostate cancer. By quantifying physicians’ recommendations, the PhyReCS will enable researchers to answer important questions involving physicians’ recommendations, which are a key component of many clinical encounters and medical decision making processes.
For men with early stage prostate cancer, there are three main treatment options: active surveillance, surgery, and radiation (external beam and brachytherapy). Each treatment option receives an independent global treatment recommendation score, which ranges from −2 (strong recommendation against the treatment) to +2 (strong recommendation for the treatment). Recommendation scores are global scores in that coders consider the entirety of the conversation, including the order and context of all statements, when determining recommendation scores.
The coder assigns a global recommendation score between −2 and +2 for each treatment, starting from an assumed score of 0 (neutral). Global scores are intended to capture coders’ overall impression of physicians’ recommendations for the main treatment options: active surveillance, surgery, and radiation (both external beam and brachytherapy). The recommendation scores are “at the end of the day” reflections.
Global Treatment Recommendation (Rec) Score

| Score | Label | Description |
|---|---|---|
| ND | Not discussed | Physician does not mention the treatment option or provides so little information that a recommendation score cannot be assigned. |
| −2 | Strong rec against | Physician gives a clear, strong rec against the treatment option. |
| −1 | Mild rec against | Physician gives a mild, relatively subtle rec against the treatment option. |
| 0 | Neutral | Physician portrays the treatment as an option; gives no indication of whether this treatment is particularly good or bad for the patient. |
| +1 | Mild rec for | Physician gives a mild, relatively subtle rec for the treatment option. |
| +2 | Strong rec for | Physician gives a clear, strong rec for the treatment option. |
Treatment Option Recommendation Coding
Subject ID: _________ Coder: _________
ND = not discussed
*Type of decision aid did not affect the following variables: Patient treatment choice (χ2(2) = 2.93, p = .23), physician recommendation scores (e.g., for active surveillance F(1,252) = .16, p = .69), and patients’ perceived recommendations (χ2(3) = 2.29, p = .51).
†We coded brachytherapy recommendations separately from external beam radiation. However, brachytherapy was discussed in only 35% (90/257) of appointments and received the highest recommendation score in only 5 appointments. “Radiation” recommendation scores thus reflect external beam radiation recommendation scores.
‡We also calculated Krippendorff’s alpha treating the scores as ordinal variables; results were substantively similar.
§Because only 5 patients received brachytherapy, we collapsed external beam radiation and brachytherapy into a single “radiation” category. Results remain substantively similar if we conduct analyses of brachytherapy and external beam therapy as separate categories.
**We conducted a number of sensitivity analyses to examine the relationship between recommendation scores and treatment received, including using raw recommendation scores and transforming recommendation scores into a single categorical variable. Results were substantively similar.
††Only 5 patients perceived that their physicians recommended brachytherapy; therefore, we collapsed external beam radiation and brachytherapy into a single “radiation” category. Results remain substantively similar if we treat brachytherapy and external beam therapy as separate categories.
Conflict of Interest Disclosures
Peter A. Ubel is a consultant for Humana. The principal investigator and all other co-authors have no conflicts of interest.
Karen A. Scherr, Fuqua School of Business and School of Medicine, Duke University.
Angela Fagerlin, Departments of Internal Medicine and Psychology, Center for Bioethics and Social Sciences in Medicine, University of Michigan, Ann Arbor; the Ann Arbor VA Center for Clinical Management Research, Ann Arbor, Michigan.
Lillie D. Williamson, Fuqua School of Business, Duke University.
J. Kelly Davis, Fuqua School of Business, Duke University.
Ilona Fridman, Columbia Business School, Columbia University.
Natalie Atyeo, Duke University.
Peter A. Ubel, Fuqua School of Business, School of Medicine and Sanford School of Public Policy, Duke University.