The interrater reliabilities were excellent for both Study 1 and Study 2. Establishing rater reliability in studies of depression treatment is critically important; however, most studies do not report on rater training or reliability measures (Mulsant et al., 2002). In typical industry-supported clinical trials, investigators' meetings are convened to instruct raters and investigators in the proper use of the various instruments. However, rigorous assessments of rater reliability rarely occur at these meetings or at any later time. Interrater reliability has been shown to be essential to reducing variability in multi-site trials (Small et al., 1996). In that report, inadequate rater training and the absence of a measure of interrater reliability were shown to skew the results.
The relatively high ICCs for all groups of raters participating in this study are consistent with the interrater reliability reported in several HDRS studies that used traditional videotapes. In a large multi-center study, the ICC for conventionally trained raters scoring the HDRS was 0.97 (Sackeim et al., 2001), and in a single-center study with multiple raters, the ICC for conventionally trained raters was 0.95 (Feske et al., 2004).
Although not supported as a training tool in all settings (Sanchez et al., 1995), videotapes of patients have been used to train raters and establish reliability in some clinical trials (Andreasen et al., 1982; Muller et al., 1998; Muller and Wetzel, 1998; Muller and Dragicevic, 2003). Limitations of this technique include the logistical support needed to mail videotapes to raters at multiple sites, the lack of interactivity in video-based training, and the additional burden of mailing paper scoring sheets to a data management center and entering the data.
The computer-based system provides interactive training that is continuously available through any computer with high-speed Internet connectivity. The integrated database provides ICC calculations and report generation without the additional work of mailing or faxing data and entering it by hand. New raters and new sites can be added over time, and ICCs can be recalculated for the expanded group. Finally, rater drift can be assessed.
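The current manuscript does not specify which form of the ICC the system computes. As one illustrative possibility, when every rater scores every videotaped interview, a common choice is the two-way random-effects, single-rater coefficient of Shrout and Fleiss (1979):

ICC(2,1) = (BMS − EMS) / [BMS + (k − 1) EMS + k (JMS − EMS) / n]

where BMS is the between-patients mean square, JMS the between-raters mean square, EMS the residual mean square, n the number of interviews rated, and k the number of raters.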
It is important to note that the Web-based system described in the current manuscript addresses rater training only with regard to scoring conventions. The equally important component of training raters in the clinical interview skills required to administer the HDRS was not addressed in this study. The importance and effectiveness of Web-based rater interview training for the HDRS have been demonstrated previously (Kobak et al., 2003; Kobak et al., 2005a; Kobak et al., 2005b; Targum, 2006; Kobak et al., 2006; Jeglic et al., 2007). Additional limitations of this study include the relatively small sample size and the fact that all of the “naïve” raters were experienced clinicians or raters who had used different assessment instruments. Finally, the use of videotapes may artificially inflate estimates of reliability by reducing the information variance that would result if each rater interviewed the patient independently (Spitzer and Williams, 1980).
In conclusion, the current study evaluated a Web-based system of interactive scoring training and reliability testing among raters using the HDRS in both a single-site and a multi-site study. The calculated ICCs support the effectiveness of this system without the additional logistical burden involved in the use of videotapes.