The results of this study may help to sort out the discrepancy between two views of the importance of the alliance to psychotherapy outcome. Measures of the alliance that take into account the typical level of the alliance across multiple sessions were substantially better predictors of outcome at termination than was the alliance measured at a single session. The fact that a single session is inadequate to measure individual patient differences in the alliance was evident in the generalizability coefficients that we calculated. At least two treatment sessions were needed to arrive at an alliance score with a minimally acceptable (.80) generalizability coefficient. Ideally, however, research instruments have a generalizability coefficient higher than .80. Very good (.90 or above) patient-level generalizability coefficients were consistently achieved across the current sample and two replication samples only when the alliance was aggregated over four or more occasions. Apparently, the reduction in error variance from 23% (generalizability coefficient of .77) to 10% or 5% has an appreciable impact on the relation of the alliance to outcome. Relatively few studies in the alliance literature have aggregated the alliance over four or more assessments. This suggests that the 4.8% of outcome variance associated with the alliance reported in the meta-analysis by Martin et al. (2000), and the 4.4% reported by Horvath and Bedi (2002), may underestimate the size of the alliance-outcome relationship that would be obtained with an adequate patient-level measurement of the alliance (aggregated over multiple sessions).
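The way a generalizability coefficient rises as more sessions are averaged can be sketched with the standard projection formula for a mean of k occasions (the same form as the Spearman-Brown formula). This is a minimal illustration, not the analysis reported in this study: it assumes a simple design in which patients are crossed with sessions and session-specific error is uncorrelated across sessions. The function name `projected_g` and the single-session value of .70 are hypothetical choices made only for the example.

```python
def projected_g(single_session_g: float, n_sessions: int) -> float:
    """Project the generalizability coefficient of an n-session average.

    Assumes (hypothetically) that error variance is independent across
    sessions, so averaging n sessions divides the error variance by n.
    """
    if not 0.0 < single_session_g < 1.0:
        raise ValueError("single_session_g must lie strictly between 0 and 1")
    if n_sessions < 1:
        raise ValueError("n_sessions must be at least 1")
    # Scale total single-session variance to 1, so the coefficient itself
    # is the patient (universe-score) variance and the remainder is error.
    patient_var = single_session_g
    error_var = 1.0 - single_session_g
    return patient_var / (patient_var + error_var / n_sessions)


# Illustrative single-session coefficient; NOT an estimate from this study.
for k in (1, 2, 4, 8):
    print(k, round(projected_g(0.70, k), 3))
```

Under these assumptions the coefficient increases with every added session, but with diminishing returns, which is why aggregating a handful of sessions captures most of the available gain.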
Although aggregating more sessions increases the size of the alliance-outcome relation, the use of late-in-treatment sessions can introduce confounds into that relation. Our results indicated that, from session 10 to 16, session-by-session changes in depressive symptoms were significantly predictive of subsequent session-to-session changes in the alliance. This reverse causation can lead to the alliance being a marker for outcome. Studies that average the alliance across treatment, including late sessions (e.g., Ogrodniczuk, Piper, Joyce, & McCallum, 2000), might be reporting alliance-outcome correlations that are biased upwards because of this confound. To adequately understand the scope of the impact of prior symptom change on the alliance, further studies using large sample sizes and employing structural equation modeling are needed to tease out the direction of causality and, in particular, to examine the possibility of reciprocal influences between the alliance and outcome. In addition, future research can explore whether the clinician’s view of the importance of the alliance in relation to outcome is biased by selective recall, is in part an illusion created by the impact of prior symptom change on the alliance, or is an intuitive integration of these reciprocal influences over time. Until such studies are conducted, however, researchers who examine the alliance-outcome correlation should be aware that, while averaging over several sessions yields a more dependable alliance score, averaging over a large number of sessions (particularly later-in-treatment sessions) may well increase the influence of prior symptomatic improvement (or other third variables) on the alliance-outcome relation.
In the current study, we aggregated the alliance over four consecutive early-in-treatment sessions and found that the relation of these average alliance scores to outcome was larger than that found when using single-session assessments of the alliance. However, it should be noted that the generalizability coefficients simply indicate that more assessments are better, not that averaging, for example, sessions 3, 4, 5, and 6 is optimal. Aggregating any random sample of sessions would likely yield similarly improved generalizability coefficients and better prediction of outcome compared to a single session. From a clinical point of view, however, a primary interest in the alliance is in how it sets the stage for the ongoing work of therapy. Thus, assessment of the alliance early in treatment is most relevant to this clinical view of the role of the alliance. Moreover, as mentioned, late-in-treatment assessments of the alliance are more likely to be influenced by prior symptom improvement.
Several studies (e.g., Barber et al., 2000; Crits-Christoph et al., 2009; Klein et al., 2003), like the current one, have used early symptom improvement as a covariate when examining the alliance-outcome relation. Although it is useful to do this in order to rule out the impact of early symptom change on the alliance, the determinants of such early symptom change may be of particular interest in themselves. Moreover, removing this early symptom change from final symptom change may reduce variability in final outcomes, thereby artificially limiting correlations with outcome.
A particularly important finding of the current study was the relatively low therapist-level generalizability coefficients found in all three samples examined. These findings are especially important because of investigations (Baldwin et al., 2007; Crits-Christoph et al., 2009) demonstrating stronger alliance-outcome relationships at the therapist level than at the patient level of analysis, despite the less than ideal therapist-level assessment in those studies. In the Baldwin et al. (2007) study, there were 4.1 patients per therapist; in the Crits-Christoph et al. (2009) study, there were 12.9 patients per therapist. The data of these previous studies, taken together with our current finding of low therapist generalizability coefficients when the ratio of patients per therapist is relatively low (e.g., below 50), imply that stronger relationships between the alliance and outcome would be evident at the therapist level if high numbers of patients per therapist were used in a study. Thus, an ideal study that would uncover an accurate estimate of the effect size for the impact of the alliance on outcome at both the patient and therapist level would include assessments of the alliance at multiple sessions and a large number of patients per therapist.
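The dependence of the therapist-level coefficient on the number of patients per therapist can be sketched in the same spirit. The example below is only an illustration under stated assumptions: patients are nested within therapists, each therapist's score is the mean alliance across his or her patients, and a hypothetical proportion of alliance variance (`therapist_icc`) lies between therapists. The value of .05 and both function names are assumptions introduced for the example; only the patients-per-therapist ratios of 4.1 and 12.9 come from the studies cited above.

```python
import math


def therapist_level_g(therapist_icc: float, patients_per_therapist: float) -> float:
    """Generalizability of a therapist's mean alliance over n patients.

    therapist_icc is the (hypothetical) share of single-patient alliance
    variance that lies between therapists; the rest is treated as error.
    """
    if not 0.0 < therapist_icc < 1.0:
        raise ValueError("therapist_icc must lie strictly between 0 and 1")
    if patients_per_therapist <= 0:
        raise ValueError("patients_per_therapist must be positive")
    error_var = (1.0 - therapist_icc) / patients_per_therapist
    return therapist_icc / (therapist_icc + error_var)


def patients_needed(therapist_icc: float, target_g: float) -> int:
    """Smallest whole number of patients per therapist reaching target_g."""
    if not 0.0 < target_g < 1.0:
        raise ValueError("target_g must lie strictly between 0 and 1")
    n = target_g * (1.0 - therapist_icc) / (therapist_icc * (1.0 - target_g))
    # Subtract a tiny tolerance so floating-point noise does not add a patient.
    return max(1, math.ceil(n - 1e-9))


ICC = 0.05  # illustrative value only, not an estimate from any cited study
for n in (4.1, 12.9, 50):
    print(n, round(therapist_level_g(ICC, n), 3))
print("patients per therapist for a .80 coefficient:", patients_needed(ICC, 0.80))
```

When only a small share of alliance variance lies between therapists, the required number of patients per therapist grows quickly, which is consistent with the design implication drawn above.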
In designing an ideal study of the alliance in relation to outcome, one other factor is relevant: statistical power. While incorporating a large number of patients per therapist is likely to guarantee adequate power for examining effects at the patient level, a study would also need to include a relatively large number of therapists to achieve statistical significance at the therapist level. The relatively large number (80) of therapists in the Baldwin et al. (2007) study likely provided enough statistical power to detect a therapist-level effect, despite the fact that the low number (4.1) of patients per therapist in that study placed a ceiling on the potential size of the alliance-outcome effect at the therapist level (i.e., a low therapist-level generalizability coefficient). As unrealistic as it may sound, the results presented here, along with statistical power considerations, lead to the conclusion that a study designed to accurately estimate the size of the alliance-outcome relationship at both the patient and therapist level would include perhaps 50 therapists, each treating 60 patients, for a total sample size of 3000 patients, and assessment of the alliance at four or more occasions.
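The idea that imperfect dependability places a ceiling on an observed correlation can be illustrated with the classical attenuation formula from test theory. This is a sketch under explicit assumptions rather than a reanalysis: it treats the generalizability coefficient as a reliability, assumes measurement errors are uncorrelated with each other and with true scores, and uses hypothetical function names. The input values are chosen only for illustration; an observed correlation of .22 corresponds to roughly 4.8% of variance.

```python
import math


def attenuated_r(true_r: float, g_predictor: float, g_outcome: float = 1.0) -> float:
    """Correlation expected after measurement error shrinks the true value."""
    _check(g_predictor, g_outcome)
    return true_r * math.sqrt(g_predictor * g_outcome)


def disattenuated_r(observed_r: float, g_predictor: float, g_outcome: float = 1.0) -> float:
    """Correlation corrected for unreliability (classical attenuation formula)."""
    _check(g_predictor, g_outcome)
    return observed_r / math.sqrt(g_predictor * g_outcome)


def _check(*coefficients: float) -> None:
    # Coefficients must be usable as reliabilities: greater than 0, at most 1.
    for g in coefficients:
        if not 0.0 < g <= 1.0:
            raise ValueError("coefficients must be in the interval (0, 1]")


# Illustrative inputs only: an observed r of .22 and a predictor
# coefficient of .77, with the outcome treated as perfectly dependable.
corrected = disattenuated_r(0.22, g_predictor=0.77)
print("corrected r:", round(corrected, 3))
print("variance explained:", round(corrected ** 2, 3))
```

Because the correction divides by the square root of the coefficient, a low therapist-level coefficient can mask a substantially larger underlying relationship, although a corrected value remains only as trustworthy as the coefficient estimates that go into it.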
Beyond the practical considerations of such a study and the lack of previous knowledge about limited therapist-level generalizability coefficients, one reason why most investigators do not think in these sample size terms is that the goal of most research is simply to detect an effect (whether or not the detected effect is attenuated from the maximum possible effect), not to determine the accurate size of an effect. In fact, the majority of published studies of the alliance did indeed detect a statistically significant relation of the alliance to outcome. But a full understanding of the role of the alliance in psychotherapy, at both the scientific and clinical level, would answer the question of how large the alliance-outcome relationship is and whether the patient level, the therapist level, or both are contributing. To answer this question, issues related to adequacy of assessment (at relevant levels) and study design need to be taken into account.
We can speculate on the implications of these results for clinical practice and the training of psychotherapists. For ongoing monitoring of the alliance in clinical practice, it would obviously be useful to measure the alliance repeatedly. To the extent that this presents a burden on patients, particularly if an outcome instrument is already administered at treatment sessions, very brief measures of the alliance might have greater clinical utility, assuming the reliability and validity of such measures are adequate. In a training context where beginning therapists are being evaluated on their ability to form positive alliances with patients, the current results indicate that adequate evaluation of a therapist requires the assessment of the alliance across a relatively large number of patients. Whether extreme outliers (i.e., trainee therapists who consistently form very poor alliances) can be detected using a relatively small number of patients is a question that could be addressed in future research studies.
Several limitations of this study are important to note. First, the size of the alliance-outcome relationship might vary by type of treatment. Within treatments in which the techniques might have a more potent impact (e.g., exposure therapy for certain anxiety disorders), the alliance might have less of an impact on outcome. Second, the alliance-outcome relationship might vary by disorder. Many alliance studies in the literature use samples of patients with a depressive disorder, or a mixed sample in which depression is common. The provision of a caring and empathic therapeutic relationship that facilitates a strong alliance may be especially important within the context of a disorder like depression, which is often characterized by disconnection from other people, loneliness, and low self-esteem. Thus, generalizability of the current findings to various outpatient populations is uncertain. However, the fact that generalizability coefficients in our Center-wide pooled study database, which incorporated a variety of treatments and disorders, were quite similar at both the patient and therapist level to those found in the depression sample and the cocaine study suggests that the results found here are not highly disorder-specific. The findings here may also not generalize to other alliance instruments. Other self-report alliance scales, and therapist or observer versions of the CALPAS and other scales, may yield different generalizability coefficients. Thus, with such other scales, a greater or lesser number of sessions and patients per therapist might be needed for adequate assessment of the alliance at the patient and therapist levels, respectively. Finally, as noted, prior symptomatic improvement and other potential third variables need to continue to be addressed in future correlational studies of the alliance in relation to outcome.
Another concern that might be expressed about the results presented here is that the less than optimal dependability of measurement on which we focus is a problem that applies to much of research. The effect size for almost any investigation reported in the scientific literature could be made larger if the reliability, or dependability, of the measures used were improved. However, as mentioned earlier, this concern reduces to the issue of whether an investigation is attempting to uncover whether an effect exists or attempting to accurately determine the size of an effect. We therefore conclude that the current results and the associated design implications should be considered in the debate over how important the alliance is to psychotherapy outcome. Beyond the alliance, other areas of research might benefit from the examination of generalizability coefficients relevant to different design facets and the implications of these coefficients for a full understanding of a phenomenon.