This study examined the correspondence of treatment integrity ratings (adherence and competence) among community program therapists, supervisors, and observers for therapists who used motivational enhancement therapy (MET) within a National Institute on Drug Abuse Clinical Trial Network protocol. The results suggested there was reasonable agreement between the three groups of raters about the presence or absence of several fundamental MET strategies. Moreover, relative to observers, therapists and supervisors were more positive in their evaluations of the therapists' MET adherence and competence. These findings underscore the need for objective monitoring of therapists' performance when using empirically supported treatments and for adequately training therapists and supervisors to evaluate their treatment implementation in community programs, and are consistent with observations that different perspectives on the therapeutic process are not interchangeable.
The virtual requirement that community programs implement empirically supported psychotherapy treatment has led to a related need to assess the integrity of treatment implementation by therapists who work in these settings (APA Presidential Task Force, 2005; Carroll & Rounsaville, 2007; Miller, Zweben, & Johnsen, 2005). Treatment integrity refers to how much therapists have implemented psychotherapies consistent with the principles and techniques of the targeted approach (i.e., adherence) and the skill with which therapists have delivered them (i.e., competence) (Waltz, Addis, Koerner, & Jacobson, 1993). If community program therapists are to achieve the improved treatment outcomes that have been the promise of using empirically supported psychotherapies, the integrity with which they implement those treatments and the manner in which integrity is assessed become critical issues (Fixsen, Naoom, Blase, Friedman, & Wallace, 2005). This study examines the correspondence of treatment integrity ratings provided by therapists, their supervisors, and observers in a multisite clinical trial of motivational enhancement therapy (Ball et al., 2007).
Three methods within psychotherapy clinical trial studies are commonly used to assess treatment integrity. These methods tap the perspectives of therapists, supervisors, and observers, respectively. Another method, having clients provide their perception of their therapists' interventions in sessions (Elliot & Williams, 2003; Henggeler, Melton, Brondino, Scherer, & Hanley, 1997; McCarthy & Barber, in press; Silove, Parker, & Manicavasgar, 1990), is seldom used in clinical trials for treatment integrity assessment and is not the focus of this paper.
The first method involves the use of therapist checklists. In this approach, therapists complete brief checklists at the end of their sessions to indicate which interventions they delivered to clients consistent with the manual guided treatments (see Carroll, Nich, & Rounsaville, 1998). Therapist checklists take relatively little time to complete and aim to (a) help therapists learn new skills by monitoring their own performances, (b) enhance the quality of supervision by having therapists and supervisors focus on a common set of therapeutic strategies, and (c) establish the extent to which the treatment is actually used in clinical practice from the therapists' perspective. Some studies have shown that therapists' adherence ratings positively predict their clients' primary treatment outcomes (Henggeler et al., 1997; Henggeler, Pickrel, & Brondino, 1999), though this relationship has seldom been evaluated in the literature.
A second method for assessing treatment integrity is to have supervisors, who are proficient in the targeted treatment and trained in how to rate tape recorded sessions, evaluate those sessions for adherence and competence. During the trial, supervisors provide therapists with performance feedback and coach them to maintain and further develop their treatment skills (Baer et al., 2007; Bellg et al., 2004; Carroll, 1997). This approach has the advantage of relying on recorded data of what actually occurred in sessions rather than relying solely on therapist memory (Hill & Lambert, 2004). To our knowledge, the relationship between supervisors' ratings of therapists' treatment integrity and client treatment outcomes has not been examined in prior studies.
Third, many efficacy trials include ratings by observers who assess the therapists' treatment integrity from recorded sessions using psychotherapy process rating measures to establish that the targeted treatment approaches have been implemented with sufficient adherence and competence and to determine if different treatments in the trial are discriminable from each other (Perepletchikova & Kazdin, 2005). Observers are viewed as relatively independent in that they do not participate in the trials, are not informed about the treatment conditions or sessions to be evaluated, and are not aware of the hypotheses being tested. Given the extensive training observers typically receive in such trials, their ratings usually are highly reliable, especially when observers tally every occurrence of each rating item (Weiss, Marmar, & Horowitz, 1988). Moreover, in many studies observers' ratings of adherence and competence (Barber, Crits-Christoph, & Luborsky, 1996; DeRubeis & Feeley, 1990; Hoffart, Sexton, Nordahl, & Stiles, 2005; Miller, Benefield, & Tonigan, 1993; O'Malley et al., 1988; Shaw, Elkin, Yamaguchi, Olmstead, & Vallis, 1999) have been found to relate positively with client treatment outcomes. However, these relationships often are very complex (e.g., other variables such as alliance may moderate the effect of adherence on outcome) and findings have been inconsistent across studies (Barber et al., 2006; Barber, Sharpless, Klostermann, & McCarthy, 2007; Beutler et al., 2004).
Few data exist on the extent to which treatment integrity ratings derived by these three methods correspond with each other. Carroll et al. (1998) reported on the concordance of therapists' and independent observers' assessments of the therapists' use of specific cognitive behavioral therapy (CBT) interventions for cocaine abuse. They found poor overall therapist-observer agreement on dichotomous assessments (present, absent) of therapists' use of CBT interventions (e.g., 80% of the item kappas were in the poor to fair range). Therapists in that trial reported significantly greater use of specific CBT techniques relative to observers. More recently, two studies by Miller and colleagues (2001; 2004) compared post-workshop global proficiency ratings made by therapists trained in motivational interviewing with several proficiency standards derived from independent observers' evaluation of therapists' actual motivational interviewing sessions. Both studies found that the therapists self-reported higher overall proficiency in motivational interviewing than was determined by the observers. Similarly, there are very few data examining how supervisors evaluate therapists' treatment integrity. Borders and Fong (1991) found a weak and nonsignificant relationship (r = .12) between supervisors' global ratings of trainees' counseling skills and ratings of counseling competencies as determined by two trained observers evaluating audiotapes of the same sessions. No other studies have directly compared how supervisors evaluate treatment integrity relative to observers, and none have compared supervisor and therapist evaluations of the same sessions.
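Item-level kappas of the kind cited above correct raw agreement for the agreement two raters would reach by chance. The following is a minimal sketch of that statistic, assuming Cohen's kappa for two raters making a present/absent judgment. The ratings are made up for illustration and are not data from any of the cited studies, and the function and variable names are hypothetical.

```python
import math


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' dichotomous judgments (1 = present, 0 = absent)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same, non-empty set of sessions.")
    n = len(rater_a)
    # Proportion of sessions on which the two raters gave the same judgment.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's base rate of "present".
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if math.isclose(expected, 1.0):
        return float("nan")  # kappa is undefined when chance agreement is perfect
    return (observed - expected) / (1 - expected)


# Illustrative (made-up) judgments for one intervention across 10 sessions.
therapist = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
observer = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
kappa = cohen_kappa(therapist, observer)
print(round(kappa, 2))  # 0.29
```

In this toy case the raters agree on 60% of sessions, yet kappa is only about .29 because the therapist marks the intervention as present far more often than the observer does.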
Treatment integrity ratings by therapists, supervisors, and observers are all subject to bias, as each has unique perceptions about the sessions being evaluated, which might affect their ratings and contribute to the low correspondence among rater groups. For example, therapists directly experience what happens in sessions, giving them privileged information (e.g., cues and nonverbal communication, contextual factors influencing treatment) about their clients that might not be evident in recordings, yet may influence their self-assessments of treatment integrity (Hill & Lambert, 2004). Therapists also may overestimate the quantity and quality of their interventions when they know they are being assessed (Perepletchikova & Kazdin, 2005) or evaluate sessions more positively when self-assessing their performance immediately after sessions rather than when evaluating their sessions in tape-assisted reviews (Hill et al., 1994). Supervisors typically monitor the work of several therapists over time, giving them information about the relative performance of therapists. Depending on their prior assessments of these therapists, they might be prone to be more lenient or severe in their judgments of therapists' performance in subsequent sessions (Hoyt, 2000). Supervisors also may experience demand characteristics to evaluate the therapists' treatment implementation favorably when they want to show how therapists have benefited from their supervision. Independent observers also may experience biases in that they may interpret specific rating items idiosyncratically (Hill & Lambert, 2004), rate therapists they perceive as similar to themselves positively (Hill, O'Grady, & Price, 1988; Mahalik, Hill, O'Grady, & Thompson, 1993), or rate more leniently or severely across items due to general impressions not grounded in rating item definitions (Hoyt, 2000). 
On the other hand, they typically are carefully trained to interpret the meaning of rating items similarly, a process intended to encourage greater objectivity and reduce subjective perception when rating therapists' treatment performance (Hill & Lambert, 2004; Hoyt, 2000).
The extent to which therapists, supervisors, and observers agree about treatment integrity using these different methods could inform how community programs establish that therapists have implemented treatments consistent with recommended practice guidelines. If therapists consistently evaluate their treatment adherence and competence similarly to supervisors and observers, then the most simple, least costly approach for measuring treatment integrity would be to use therapist checklists. Supervisory assessment requires an additional burden of training supervisors how to rate treatment integrity and to use the ratings to further develop therapists' treatment skills. However, onsite supervisors have the advantage of providing therapists with ongoing treatment integrity monitoring and training over time (Carroll et al., 2002). Alternatively, observers often have more rating expertise than therapists and supervisors. If available to community programs, observers could provide periodic treatment integrity assessments to programs without shifting this demand to therapists or supervisors. This model has been used to foster treatment integrity when complex psychotherapies such as multisystemic therapy (Henggeler, Schoenwald, Liao, Letourneau, & Edwards, 2002), dialectical behavioral therapy (Dimeff & Koerner, 2007), and integrated dual diagnosis assertive community treatment (McHugo, Drake, Teague, & Xie, 1999) have been implemented in large systems.
In this report we describe the correspondence of treatment integrity ratings provided by therapists, their supervisors, and independent observers who participated in a multisite National Institute on Drug Abuse Clinical Trial Network effectiveness protocol (Ball et al., 2007) evaluating motivational enhancement therapy (MET), a manualized version of motivational interviewing (Miller & Rollnick, 2002) originally developed for Project MATCH (Miller, Zweben, DiClemente, & Rychtarik, 1992). MET is a brief intervention in which therapists use empathic counseling techniques (such as reflective listening) and strategies for eliciting client self-motivational statements (such as using feedback to produce discomfort with status quo behaviors) to enhance clients' intrinsic motivation for behavioral change. MET has demonstrated moderate treatment effects (.4 to .5) for substance use disorders (Burke, Arkowitz, & Menchola, 2003; Hettema, Steele, & Miller, 2005). We hypothesized that, consistent with prior reports, therapists' ratings of their own adherence to MET would be higher than observer ratings. Second, we hypothesized that supervisors trained to rate the sessions using an adherence and competence scale would have higher rates of agreement with the observers for both MET adherence and competence ratings than with the therapists' ratings.
The MET protocol was implemented within five outpatient community treatment programs that served diverse samples of substance users. Programs were located in California, Connecticut, and Pennsylvania in suburban and urban settings. Clients seeking outpatient substance abuse treatment at each site were randomized to receive either three individual MET or counseling-as-usual (CAU) sessions. Program therapists conducted the sessions during the first four weeks of treatment. Therapists who delivered MET also received supervision from program supervisors who rated the therapists' MET adherence and competence and provided them with feedback and coaching to support their implementation of MET. Observers subsequently independently evaluated the adherence and competence of MET and CAU sessions to establish MET treatment integrity in the trial. The primary findings from the MET protocol suggested that both MET and CAU reduced substance use at the end of the 4-week treatment phase; however, during the 12-week follow-up period, participants assigned to MET sustained reduced primary substance use, whereas those in CAU significantly increased use over this same phase. In addition, MET therapists used MET strategies significantly more often and with greater competence than CAU therapists (see Ball et al., 2007 and Martino, Ball, Nich, Frankforter, & Carroll, 2008 for detailed description of the study and its findings).
Participants were the therapists and supervisors at the five study sites involved in the delivery of MET, the observers who evaluated the therapists' treatment integrity, and the clients who received MET treatment.
Volunteers were drawn from the staff of the participating treatment programs based on their willingness to be randomized to deliver either MET or CAU and to have their sessions audiotaped. The 14 therapists who were trained in and implemented MET within the protocol were participants in this study. Ratings from the therapists who were randomized to deliver CAU were not included in this study because they did not rate their session performance or participate in rating-based supervision. Therapists provided either written permission or informed consent for participation depending on local Institutional Review Board requirements. Most MET therapists had no prior motivational interviewing or MET training exposure, and none had been trainers or therapists in research studies involving MET (Ball et al., 2002a). Therapists were predominantly female (60%) and Caucasian (77%). On average, therapists were 38.9 years old (sd = 11.8), employed at their agencies for a mean of 3.2 years (sd = 3.9), had been working as substance abuse counselors for 8.1 years (sd = 6.4), and completed 14.5 years of education (sd = 5.1). Forty-three percent had a master's degree, 46% were state certified substance abuse counselors, and 45% indicated they were in recovery from prior substance abuse problems.
Four supervisors evaluated MET sessions across five sites.1 Protocol supervisors were only involved in the monitoring of the MET intervention and had no equivalent involvement in the CAU condition. Supervisors provided either written permission or informed consent for participation depending on local Institutional Review Board requirements. Three of the four supervisors were female, all Caucasian, on average 40.2 years old (sd = 7.7), and had a master's degree or higher. On average, they had been providing substance abuse treatment services for 12.0 years (sd = 6.8), supervision for 8.2 years (sd = 5.6), and had 13.0 hours (sd = 8.4) of formal motivational interviewing or MET training prior to protocol involvement.
Fifteen observers evaluated the session audiotapes. On average, observers were 37.7 years old (sd = 9.7) and female (53%). Most observers had master's degrees in a clinical profession (67%) and had an average of 6.9 years (sd = 9.7) of substance abuse treatment experience, 8.3 years (sd = 7.9) of general psychotherapy experience, and 5.6 years (sd = 5.3) of clinical research experience. Sixty percent of the observers had served as independent raters in prior clinical trial studies testing the efficacy of behavioral treatments. Fifty-three percent reported prior workshop training in motivational interviewing or MET for an average of 9.0 hours (sd = 5.9).
Sixty-four clients who received treatment from the 14 MET therapists were participants in this study, representing 36% of all the clients who received MET in the original study. Clients were English-speaking, with a range of primary substance use problems (38% alcohol, 16% marijuana, 14% cocaine, 11% opiates, 6% methamphetamines, 5% alcohol and drug, and 10% other). On average they were 33.4 years old (sd = 10.6), male (73%), primarily Caucasian (44%) or African American (38%), and single (84%). Clients completed on average 12.3 years (sd = 1.8) of education. Fewer than half of the clients were employed (35.9%). These characteristics were similar to the full sample of clients in the MET condition and the protocol overall.
The ITRS (Martino et al., 2008) is a 39-item scale adapted from the Yale Adherence and Competence Scale (Carroll et al., 2000) to assess community program therapists' adherence and competence in implementing MET, strategies considered inconsistent with MET (e.g., direct confrontation) or common to drug counseling (e.g., assessing substance use), and several general characteristics of the therapists (e.g., ability to maintain the session's structure) and clients (e.g., initiation of discussions unrelated to the session). This study focused on the 10 items that were common across the scales used by the therapists, supervisors, and observers and that involved MET-specific interventions.
For each item, observers evaluated the therapists on two dimensions using a 7-point Likert scale. First, they rated the extent to which the therapist delivered the intervention (adherence; 1 = not at all, to 7 = extensively). Second, they rated the skill with which the therapist delivered the intervention (competence; 1 = very poor, to 7 = excellent). ICC estimates suggested excellent levels of interrater reliability for the 10 items for both the adherence (mean ICC = .89; range = .66 to .99) and competence dimensions (mean ICC = .85; range = .69 to .97). A confirmatory factor analysis supported a two-factor model for the 10 MET consistent items (Martino et al., 2008). Five items included MET fundamental strategies that underpin the empathic and collaborative stance of MET such as use of open-ended questions, reflective statements, and motivational interviewing style or spirit. The other five MET consistent items involved advanced or structured strategies for evoking client motivation for behavior change, such as heightening discrepancies and change planning. The fundamental and advanced MET strategies factors retained excellent inter-rater reliability, consistent with their individual components (adherence ICC: fundamental = .91; advanced = .95; competence ICC: fundamental = .89; advanced = .89). Table 1 describes the rating items that comprise each factor. A detailed description of the psychometric analysis of the ITRS is provided in another report (Martino et al., 2008).
The Therapist Session Checklist asks therapists to indicate the frequency of their use of the 10 MET consistent strategies in the ITRS. Therapists rated the extent to which they believed they had used the strategies during the session along 7-point Likert scales (1 = not at all, to 7 = extensively). Because therapists only rated their own sessions, interrater reliability was not determined. A confirmatory factor analysis using structural models with AMOS (6.0) software (Arbuckle, 2005) with maximum likelihood estimation showed the two-factor model demonstrated a good fit to the data, namely, a Root Mean Square Error of Approximation (RMSEA) of .05 or less (Browne & Cudeck, 1993) and the Normed Fit Index (NFI), Incremental Fit Index (IFI), and Comparative Fit Index (CFI) of .90 or greater (Hu & Bentler, 1995; Kline, 1998; Marsh, Balla, & McDonald, 1988; Yadama & Pandey, 1995). The respective fundamental and advanced strategies fit indices were as follows: RMSEA = .06 and .00; NFI = .98 and 1.0; IFI = .99 and 1.0; and CFI = .99 and 1.0.
The supervisors rated taped sessions using a 30-item supervisory version of the ITRS that excluded rating items involving general characteristics of the therapists and clients. Supervisors rated both adherence and competence using the 7-point Likert scales as described for the ITRS. Interrater reliability was not determined because no more than one supervisor rated each session. Confirmatory factor analysis of the MET consistent adherence items as described above showed the two factor model fit the data adequately for the respective fundamental and advanced strategies (RMSEA = .03 and .00, NFI = .96 and .99; IFI = .99 and 1.0, and CFI = .99 and 1.0).2
All MET therapists and supervisors were trained by MET experts. The MET experts were master's or doctoral degreed clinicians who had extensive experience training and supervising clinicians in motivational interviewing and MET, had completed a 3-day MET trainer's workshop, and were trained in the use of the ITRS as part of the MET protocol. The MET experts conducted two-day intensive MET workshops for the therapists and supervisors at their sites, followed by individually supervised practice cases until minimal protocol certification standards had been achieved in three sessions (i.e., at least half of the MET integrity scale items rated average or above in terms of adherence and competence). In addition, the MET experts reviewed with therapists how to complete the Therapist Session Checklist at the end of each session and with supervisors how to use the Supervisor Tape Rater Scale. No formal training criteria were used to establish the accuracy with which therapists or supervisors used these scales. After therapists were certified in MET, they began to treat randomized clients in the protocol and receive biweekly supervision from their supervisors, who provided the therapists with MET adherence and competence rating-based feedback and coaching after reviewing audiotaped client sessions. These therapist and supervisor training procedures are commonly used in clinical trials (Baer et al., 2007) and have been shown to result in significant improvements in therapists' practice of empirically supported treatments (Miller et al., 2004; Sholomskas et al., 2005).
Observers were trained to assess therapist adherence and competence using the ITRS (Martino et al., 2008). All observers attended an initial 8-hour didactic seminar in which they reviewed a detailed rating manual (Ball, Martino, Corvino, Morgenstern, & Carroll, 2002b) that specified item definitions and rating decision guidelines and practiced rating the items in both limited therapist-client transactions and in a full protocol session. Following initial training, each observer completed ratings for an identical set of 10 calibration tapes randomly selected from the larger pool of protocol tapes to include the two treatment conditions as well as different sites and sessions. Initial item reliabilities were calculated using the Shrout and Fleiss (1979) two-way mixed model ICC(3,1), with item ratings as the fixed effect and observers as the random effect. Next, a second 6-hour rater training was held to address items with lower initial reliability; the observers then completed a set of five additional tapes for final inter-rater reliability calculation. To sensitize observers to specific items in which their ratings might drift, observers were informed if any of their ratings were more than two scale points above or below ‘expert’ (SM and SB) consensus ratings on the same set of calibration tapes. Combining the didactic and calibration tape components, each rater completed about 44 hours of training. To ensure ongoing rater reliability, a randomly selected common tape was rated on five separate occasions approximately four months apart. Observers were aware of this procedure, but not its timing. Once again, individual item ratings were compared to expert ones and feedback was provided to all observers.
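The drift check described above (flagging any rating more than two scale points above or below the expert consensus) can be sketched as follows. This is an illustrative sketch only: the item names and ratings are hypothetical, and the function is not the software used in the protocol.

```python
def flag_drift(observer_ratings, expert_ratings, tolerance=2):
    """Return the items on which an observer's rating differs from the
    expert consensus rating by more than `tolerance` scale points."""
    flagged = []
    for item, expert_value in expert_ratings.items():
        difference = observer_ratings[item] - expert_value
        if abs(difference) > tolerance:
            flagged.append((item, difference))
    return flagged


# Hypothetical 7-point adherence ratings for one calibration tape.
expert = {"open_questions": 5, "reflections": 6, "affirmations": 3, "change_planning": 2}
observer = {"open_questions": 5, "reflections": 3, "affirmations": 5, "change_planning": 2}

drift = flag_drift(observer, expert)
print(drift)  # [('reflections', -3)]
```

A difference of exactly two points ("affirmations" in the example) is not flagged, because the rule in the text applies only to ratings more than two points from the consensus.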
On average, MET sessions were 45.4 minutes long (sd = 7.8) and all of them were audiotaped. Data used for this study were taken from all MET sessions in which therapists, supervisors, and observers had rated the same session. Of the 451 total MET sessions in the protocol, 117 (26%) were completed across raters. This sample included all 14 therapists who delivered MET in the trial and was balanced across sessions (48 first sessions, 37 second sessions, and 32 third sessions). The reduction from the total number of sessions occurred because the supervisors did not rate all MET sessions, and the observers only rated a subsample of MET sessions.
At the end of each session, therapists completed the Therapist Session Checklist to indicate their self-assessment of their adherence to MET-consistent strategies. They did not rate their own competence in delivering these strategies because the higher level of inference involved in making competency determinations would have required additional rating training to reduce rater bias (Hoyt, 2000; Hoyt & Kerns, 1999).
Supervisors rated an entire audiotaped session once per month, which was randomly selected by research staff for each MET therapist. These ratings formed the basis of MET performance feedback and coaching interventions to enhance the therapists' MET skills. To maintain consistency across supervisors, tapes were co-rated with MET experts on a monthly basis to promote accurate rating, supervisory feedback, and coaching of MET.
Independent observers, blind to treatment condition, session number, and site, rated entire sessions within a substantial subset (n = 425, 44%) of the protocol tapes. Given the high reliability of the rating scale and checks for rating drift during the rating period, only one observer rated each taped session.
Correspondence of adherence ratings was evaluated in terms of 1) whether an intervention occurred (e.g., categorically present, regardless of how much it occurred in a given session) and 2) the extent to which it occurred (1 – 7 rating) by rater category (therapist, supervisor, observer). Table 2 shows the percentage of sessions in which MET strategies occurred at least once in the session according to the therapists, supervisors, and observers, including the percentage of absolute agreement among these raters. In general, there was good to very good agreement for 4 of 5 items tapping fundamental MET strategies across protocols. That is, absolute agreement was above 90% for the items regarding use of open-ended questions, reflective statements, and use of an overall motivational interviewing style and 74% for the item assessing use of affirmations. However, for all items evaluating more strategic applications of MET (i.e., all advanced strategies for eliciting client self-motivational statements), both therapists and supervisors estimated these strategies were present in more of the sessions than the observers (all absolute agreements < 65%).
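As a rough sketch of the categorical comparison, the percentage of absolute agreement can be computed by dichotomizing each rater's score and counting the sessions in which all three raters agree. The sketch assumes a strategy counts as present whenever the 7-point rating exceeds 1 (1 = not at all); that threshold, the function names, and the ratings are illustrative assumptions rather than details taken from the protocol.

```python
def is_present(rating):
    """Assumed rule: a strategy is 'present' if rated above 1 (1 = not at all)."""
    return rating > 1


def absolute_agreement(therapist, supervisor, observer):
    """Percentage of sessions in which all three raters agree on presence/absence."""
    if not (len(therapist) == len(supervisor) == len(observer)) or not therapist:
        raise ValueError("All raters must rate the same, non-empty set of sessions.")
    agreements = sum(
        1
        for t, s, o in zip(therapist, supervisor, observer)
        if is_present(t) == is_present(s) == is_present(o)
    )
    return 100 * agreements / len(therapist)


# Made-up ratings of one advanced strategy across five sessions.
therapist = [5, 3, 1, 4, 2]
supervisor = [4, 2, 1, 3, 1]
observer = [3, 1, 1, 2, 1]

agreement = absolute_agreement(therapist, supervisor, observer)
print(agreement)  # 60.0
```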
For the continuous measure, we approached the analyses in two ways. First, we estimated the degree of consistency among the three groups of raters for the fundamental and advanced MET mean adherence ratings using the Shrout and Fleiss (1979) two-way mixed model intraclass correlation coefficient, ICC(3,1), with ratings as the fixed effect and raters as the random effect. We similarly calculated ICC estimates for the correspondence of fundamental and advanced MET mean competence ratings between supervisors and observers (note the therapists were not asked to rate their own competence levels). ICC estimates for the adherence ratings (i.e., agreement on how much the intervention occurred in each session) for fundamental and advanced MET strategies indicated there was poor agreement among the raters about how often on average fundamental and advanced MET strategies occurred (ICC = .34 and .33, respectively).
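For readers unfamiliar with the consistency form of the ICC, the sketch below computes the Shrout and Fleiss (1979) single-rating ICC(3,1) from a sessions-by-raters table using the two-way ANOVA mean squares, ICC(3,1) = (BMS - EMS) / (BMS + (k - 1) EMS), where BMS is the between-sessions mean square, EMS the residual mean square, and k the number of raters. The ratings are illustrative and are not data from this study. The study's own estimates came from standard statistical software, so this is a hand-rolled illustration and not the authors' procedure.

```python
def icc_3_1(ratings):
    """ICC(3,1) for a table with one row per session and one column per rater."""
    n = len(ratings)        # number of sessions (targets)
    k = len(ratings[0])     # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("Need a complete table with at least 2 sessions and 2 raters.")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_sessions = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_sessions - ss_raters
    bms = ss_sessions / (n - 1)             # between-sessions mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# Illustrative ratings: 6 sessions (rows) rated by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_3_1(ratings)
print(round(icc, 2))  # 0.71
```

Because ICC(3,1) measures consistency, a rater who scores every session uniformly higher than another does not lower the coefficient. Systematic differences in rater means of that kind are instead what the ANOVAs reported later are designed to detect.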
Second, we used generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Hoyt & Melby, 1999; Shavelson, Webb, & Rowley, 1989) to simultaneously examine the main effects of rater type (therapist, supervisor, observer), item type (fundamental, advanced) nested within rater type, and taped sessions evaluated (session 1, 2, or 3), as well as the interaction among these factors (called facets in generalizability theory) on the adherence and competence ratings. In generalizability theory, a generalizability coefficient, denoted by G and analogous to an ICC in a single-facet design (Mushquash & O'Connor, 2006), is calculated and can be used to make absolute (pass/fail) or relative decisions (e.g., examining differences between groups or individuals in MET proficiency) about the optimal level of each facet to reduce error variance and maximize reliability in measurements (Shavelson et al., 1989). For this study, relative G is most appropriate for examining the degree of consistency among the rater groups. Variance estimation and G-coefficient calculation were conducted using an SPSS program by Mushquash and O'Connor (2006).
Table 3 provides generalizability theory variance component estimates based on a two-facet nested design to assess variability in adherence ratings associated with rater type, item type, and session number. The generalizability coefficient was .60, suggesting low levels of rating reliability. The greatest amount of observed variation in adherence ratings was explained by differences among the raters (39%) rather than by differences in the type of items rated (14%) or specific sessions taped (9%). Thirty-eight percent of the total variance was attributable to residual error.
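The percentages above express each estimated variance component as a share of the total variance. A minimal sketch of that conversion follows. The component values are hypothetical placeholders, chosen only so that the shares reproduce the reported 39%, 14%, 9%, and 38%; they are not the estimates from Table 3.

```python
def percent_of_variance(components):
    """Express each variance component as a percentage of the total variance."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("Total variance must be positive.")
    return {name: 100 * value / total for name, value in components.items()}


# HYPOTHETICAL variance components (not the Table 3 estimates).
adherence_components = {
    "rater": 0.78,
    "item type": 0.28,
    "session": 0.18,
    "residual": 0.76,
}

shares = percent_of_variance(adherence_components)
for name, share in shares.items():
    print(f"{name}: {share:.0f}%")
```

By construction the shares sum to 100%, which is also true of the percentages reported for adherence (39 + 14 + 9 + 38).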
Finally, we conducted one-way ANOVAs using a Bonferroni-corrected α of .025 (.05/2) to test for the two hypothesized rater differences in mean fundamental and advanced MET adherence and competence ratings (see Table 4). One-way ANOVAs on therapist fundamental, F(2,349) = 30.3, p < .001, and advanced MET adherence scores, F(2,349) = 127.4, p < .001, indicated significant differences among the three sets of ratings. Post hoc Tukey tests showed that while therapists and supervisors estimated similar mean amounts of fundamental MET strategies occurring in sessions, their estimates were significantly higher than those provided by the observers. In addition, their estimates of advanced MET strategy frequency were significantly higher than observer estimates, though the estimates provided by the therapists were higher than those provided by the supervisors. Re-analysis of the data using only one randomly selected rated session for each client, to prevent the potential confounds of non-independence of observations, did not change these results.3
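To make the two computational steps concrete, the sketch below shows the Bonferroni adjustment (the familywise α divided by the number of tests) and the F statistic of a one-way ANOVA with its degrees of freedom. The ratings are made up and the function names are illustrative assumptions. Obtaining a p value additionally requires the F distribution, which this standard-library sketch does not implement.

```python
def bonferroni_alpha(familywise_alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return familywise_alpha / n_tests


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("Need at least two groups and more observations than groups.")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


alpha = bonferroni_alpha(0.05, 2)

# Made-up mean adherence ratings for three rater groups.
observer = [1, 2, 3]
supervisor = [2, 3, 4]
therapist = [5, 6, 7]
f_stat, df_between, df_within = one_way_anova([observer, supervisor, therapist])
print(alpha, f_stat, df_between, df_within)
```

In this toy example the group means are 2, 3, and 6, giving F(2, 6) = 13.0.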
ICC estimates for fundamental (ICC = .26) and advanced mean competence ratings (ICC = .29) showed poor agreement between supervisors and observers. Likewise, the G-coefficient (.63) and variance estimate components indicate low levels of rating reliability. As with adherence ratings, the greatest amount of observed variation was explained by differences among the raters (47%) rather than by differences in the type of items rated (13%) or specific sessions taped (12%), with 28% of the variance attributable to residual error (see Table 3). One-way ANOVAs suggested that supervisors rated the therapists' competence in fundamental MET delivery higher than how observers evaluated the therapists' performance in this area (F(1,233) = 14.6, p < .001), whereas these raters did not differ significantly from each other in their evaluation of therapists' advanced MET strategies (p = .30). This pattern of results remained the same in a re-analysis of the data using only one randomly selected taped session per client.
This study was the first to report on the correspondence of therapists, supervisors, and observers on ratings of therapist adherence and competence based on a large number of sessions drawn from a national multisite trial of MET conducted within community programs. There were three major findings. First, when therapists, supervisors, and observers evaluated sessions categorically for the presence or absence of specific MET interventions, there was good to very good agreement among them on several fundamental MET strategies that underpin the approach's empathic and supportive stance (open-ended questions, reflective statements, affirmations, motivational interviewing style). There was less agreement about whether therapists used advanced MET strategies commonly employed to elicit client statements in favor of change. Second, there was poor agreement regarding the extent to which the specific interventions occurred within sessions. Therapists and supervisors provided higher ratings of fundamental and advanced MET strategy adherence relative to the observers. Therapists also rated their advanced MET strategy adherence higher than supervisors did. Higher levels of variability in adherence ratings were attributable to differences between the groups of raters rather than to systematic differences in how raters approached rating individual items or specific sessions taped. Third, there was poor agreement between supervisors and independent observers about the therapists' competence levels, with the supervisors generally providing higher estimates than the observers for fundamental MET strategies but few differences for competency ratings of advanced MET strategies. As with the adherence ratings, more variation in competence ratings was attributable to differences between the raters than to differences in how raters responded to individual items or sessions.
These results parallel findings from prior studies (Carroll et al., 1998; Miller et al., 2004; Miller & Mount, 2001) that indicate low rates of correspondence between therapists' self-reports of their adherence and competence in delivering specific therapies and those of observers not involved in the treatment. This study extends previous research in this area by using a detailed continuous rating system of specific interventions rather than global therapist impressions of their overall proficiency (Miller & Mount, 2001; Miller et al., 2004) or only the presence or absence of strategies used within sessions (Carroll et al., 1998). In addition, this study goes beyond traditional methods for determining inter-rater agreement (e.g., absolute agreement, ICCs) by using generalizability theory to establish that a substantial proportion of the variation in adherence and competence ratings was due to actual rater differences rather than other facets. Moreover, this study's findings are relevant to clinical practice and dissemination efforts, as the therapists estimated they had used MET strategies several more times per session than the observers' ratings indicated. This difference in perspective could lead to a situation in which some therapists believe they have sufficiently adhered to MET practices when their performance, as evaluated by supervisors or observers, may not reach proficiency standards.
This study is also the first investigation to compare program supervisors' assessments of therapists' treatment adherence and competence to those of independent observers. Supervisors rated therapists as implementing fundamental and advanced MET strategies more often and with more competence for the fundamental strategies than did the observers. However, differences in adherence ratings between supervisors and observers were smaller in magnitude than those between the therapists and observers. In addition, there was more agreement between the supervisors and observers regarding their ratings of therapists' competence levels, particularly for the advanced MET strategies. While the rating training and support provided to supervisors may have developed their adherence and competence rating abilities, supervisors might benefit from additional training and practice in how to rate tapes more consistently (Hill & Lambert, 2004; Hoyt, 2000) to provide therapists with more reliable and valid assessments of their MET integrity.
This study's findings raised the question of which methods for evaluating adherence and competence should be used to assess therapist treatment integrity in clinical practice. Because therapists, supervisors, and observers typically agreed about whether several fundamental MET strategies occurred within sessions, the use of therapists' self-reports from therapeutic strategy checklists may be a reasonable and cost-effective means for determining if therapists have used these basic counseling strategies (e.g., open-ended questions, reflections, affirmations). These determinations may require less inference and, therefore, be easier for therapists to make without more extensive rating training (Hoyt, 2000; Hoyt & Kerns, 1999). However, simply establishing the presence or absence of basic therapeutic techniques may be of limited value without more fine-tuned ratings of level of adherence to more complex strategies, which were found to be areas of poorer rater correspondence in this study. For example, having twice as many reflections as questions or substantially more MET adherent strategies relative to all interventions are considered markers of proficiency in motivational interviewing (Miller & Mount, 2001; Miller et al., 2004). In this study, the observers received the most training and monitoring to interpret the meaning/occurrence of the items similarly. Training included alerting observers to potential rating biases (e.g., leniency/severity, halo effects), carefully reviewing each item's definition and rating decision rules, practicing rating items across a range of MET sessions, therapists, clients, and sites, calibrating inter-rater reliability until high agreement was reached, periodically checking the observers' performance relative to MET experts, and providing them with feedback to reduce drift (Martino et al., 2008).
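The proficiency markers mentioned above (reflections relative to questions, and MET adherent strategies relative to all interventions) can be computed from session tallies, as in the following minimal sketch. The function name, the counts, and the default cutoff for the adherent share are hypothetical. Only the 2:1 reflection-to-question marker comes from the text, which gives no numeric cutoff for "substantially more."

```python
def mi_proficiency_markers(reflections, questions, adherent, total,
                           min_ratio=2.0, min_adherent_share=0.9):
    """Summarize two count-based proficiency markers for one session.

    min_ratio reflects the "twice as many reflections as questions" marker.
    min_adherent_share is a hypothetical placeholder cutoff.
    """
    if questions > 0:
        ratio = reflections / questions
    else:
        # No questions asked: treat any reflections as exceeding the ratio.
        ratio = float("inf") if reflections > 0 else 0.0
    share = adherent / total if total > 0 else 0.0
    return {
        "reflection_to_question_ratio": ratio,
        "meets_ratio_marker": ratio >= min_ratio,
        "adherent_share": share,
        "meets_share_marker": share >= min_adherent_share,
    }


# Hypothetical tallies for a single session (not study data).
print(mi_proficiency_markers(reflections=10, questions=5, adherent=18, total=20))
```

Because the markers depend on frequency counts rather than on presence or absence, any systematic overestimation of strategy frequency by a rater group would carry directly into these ratios.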
Supervisors received some training and monthly monitoring in their use of the rating system, but their level of training did not approach the rigor, feedback, and time involved in preparing the observers. In contrast, the therapists simply reviewed how to complete the checklist based upon their understanding of MET strategies derived from the workshop training and biweekly supervision, without additional integrity rating training. Thus, while all raters may have been subject to bias, the observers were trained most thoroughly to reduce variability in their interpretation of the rating items. Their ratings likely represented the least biased depiction of the therapists' MET implementation.
If one assumes that the observers' ratings are most reliable or least biased, and if both therapists and supervisors are prone to overestimate the presence, frequency, and competence of the therapists' implementation of empirically validated therapies such as MET, it is possible that therapists and supervisors are more likely to rate a therapist's performance as sufficiently proficient when it may not be. This potential scenario could accentuate the problem of therapists believing and reporting they are implementing empirically supported treatments without actually having changed their treatment-as-usual practices (Miller, 2007; Miller, Sorensen, Selzer, & Brigham, 2006; Santa Ana et al., in press). Hence, this study serves as a reminder to be wary of drawing conclusions regarding the extent to which specific treatments are used in clinical practice based solely on what therapists say happened and adds a cautionary note about the complementary positive bias supervisors may have about therapists' adherence and competence levels.
This study has several limitations. First, some of the rating discrepancies may have been due to the different rating methods used by the therapists, supervisors, and observers rather than systematic differences in what the three rater groups believed occurred in the sessions. For example, therapists may have evaluated their performance from taped sessions more conscientiously than when self-reporting on their performance immediately after sessions (Hill et al., 1994). Second, it is possible that the therapists and supervisors, in being more involved in the cases, may have been more sensitive to certain aspects of the treatment than the observers, who may not have recognized some interventions because of their rigorous reading of the item definitions to maximize reliability. Third, although observers' ICC estimates were comparatively good, variability in inter-rater reliability across MET items might partially explain rating discrepancies among therapists, supervisors, and observers. Fourth, the use of only one rater per session for each rater group may have diminished the validity of the treatment integrity measures (Hill & Lambert, 2004). Fifth, the generalizability of the findings was limited given the small number of therapists and supervisors included in this study, the inclusion of only programs participating in the Clinical Trials Network, and the study's focus on MET integrity instead of additional treatment approaches. Finally, this study did not examine the relationship of the different adherence and competence ratings provided by therapists, supervisors, and observers to treatment outcomes. Thus, which perspective matters the most for predicting how clients might benefit from MET remains an open question.
Nonetheless, this study demonstrated that different perspectives about treatment integrity are not interchangeable. Future research should include the clients' perspective about what they believe their therapists are doing in sessions. This direction would require the development of treatment integrity measures that gather multiple perspectives on strategies used in treatment (McCarthy & Barber, in press). In addition, the predictive validity of client, therapist, supervisor, and observer integrity ratings on treatment processes and outcomes needs to be established. It is possible that the perspectives from different rater groups might have unique relations to treatment processes and outcomes or that a shared or composite evaluation may be the best indicator of the therapists' treatment adherence and competence.
1 Two sites in the three-session protocol had the same person conduct MET supervision in the study. This supervisor's position and location permitted her to traverse both agencies.
2 For brevity, tables of fit indices for models of therapist MET consistent adherence are not included here. They are available upon request from the corresponding author.
3 A table presenting the results of the re-analysis of the mean fundamental and advanced MET strategy adherence and competence ratings among therapists, supervisors, and observers involving only one randomly selected taped session per client (n = 64) is available from the corresponding author.