Despite the importance of establishing shared scoring conventions and assessing interrater reliability in clinical trials in psychiatry, these elements are often overlooked. Obstacles to rater training and reliability testing include the logistic difficulties of providing live training sessions, or of mailing videotapes of patients to multiple sites and collecting the data for analysis. To address some of these obstacles, a Web-based interactive video system was developed. It uses actors of diverse ages, genders, and races to train raters to score the Hamilton Depression Rating Scale and to assess interrater reliability. This system was tested with a group of experienced and novice raters within a single site. It was subsequently used to train the raters of a federally funded multi-center clinical trial on scoring conventions and to test their interrater reliability. The advantages and limitations of using interactive video technology to improve the quality of clinical trials are discussed.
In clinical trials, the reliability of the data collected ultimately determines the validity of the studies’ conclusions (Kobak et al., 1996). In psychiatry, the primary outcome measures often depend on interviewers’ skills in eliciting information, as well as their interpretations of the subjects’ responses (Kobak et al., 2005a). When multiple raters are used in a clinical trial, differences between raters in interviewing technique and scoring criteria introduce variability that can distort the outcome measures (Muller and Szegedi, 2002; Bourin et al., 2004). Despite the importance of statistically establishing raters’ reliability, review of the literature suggests that this issue is often ignored in clinical trials, including those of depression treatment (Mulsant et al., 2002). This is especially problematic in multi-center trials in which groups of raters are geographically dispersed, may change over time, and may recruit patients over several years.
We have previously reported that videotapes of professional actors performing scripted interviews of the Hamilton Depression Rating Scale (HDRS) could not be distinguished from videotapes of actual patients when scored by experienced raters (Rosen et al., 2004). Building on the findings of that study, we developed a Web-based system using professional actors both to train raters to score the HDRS using shared scoring conventions and to assess interrater reliability. This report describes: 1) the development of the system, 2) a study of the HDRS scoring tutorial and reliability testing with both naïve and experienced raters, and 3) the results of a field test of this system in a multi-site NIMH-funded study.
The web-based system consists of three components: 1) a scoring tutorial program, 2) a reliability testing program, and 3) an administrative program. To use this system, users must have a high-speed Internet connection and the “Flash” plug-in for their Internet browser. To develop the scoring tutorial and reliability testing programs, informed consent was obtained to video-record 21 HDRS interviews of seven patients participating in an NIMH-funded study of depression at initiation of treatment, in mid-treatment, and in partial or full remission. The semi-structured interview utilized for this project is based on the published interview by Williams (Williams, 1988) and has been previously used in depression trials in the U.S. (Mulsant et al., 1999; Tew, Jr. et al., 1999; Sackeim et al., 2000; Sackeim et al., 2001; Gildengers et al., 2005; Feske et al., 2004; Reynolds, III et al., 2006; Dombrovski et al., 2006). As each patient was followed through the course of his or her treatment, the scores of these 21 interviews spanned the full range of severity: below 10 (absence of depression), 11–20 (mild to moderate depression), 21–29 (severe depression), and greater than 30 (very severe depression, including psychosis). The videotaped interviews of the patients were transcribed, yielding 21 scripts, which were modified to remove all information that might identify the actual patients. To create realistic portrayals of different stages of depression in diverse populations, three male and three female actors were recruited to portray young, mid-life, and elderly adults. One of the male and one of the female actors were African American. Each actor recorded 9 or 10 scripts that were slightly modified to be age- and gender-appropriate for the actor (for instance, a reference to a child might be changed to a reference to a grandchild).
Ten of the scripts were used to create the tutorial program designed to train raters on scoring conventions. The scoring tutorial program provides video vignettes for every possible score of the 28 HDRS items. For item scores not represented by actual interviews, the scripts were modified by changing either the intensity or the frequency of symptoms to shift the score to a more or a less severe rating. In the tutorial mode, trainees have the option of watching every vignette for each question in order of increasing severity; alternatively, they can watch them in random order. While the rater is observing the interview in the tutorial mode, the scoring guidelines are presented in text format in a box below the video for reference. In the tutorial mode, raters assign scores, and the system informs them when their scores differ from the scores assigned by two expert psychiatrist-raters (JR and BHM), who have more than 20 years of cumulative experience administering and scoring the HDRS.
Following completion of the tutorial, the raters are directed by the system to the reliability testing program. The testing program was created with the 11 scripts that were not used in the tutorial program. To test interrater reliability, raters are presented with 6 of the HDRS interviews representing the full range of severity of depression. As in the tutorial mode, while raters watch the interview, the scoring guideline corresponding to the item being probed by the interviewer is presented in text format below the video stream. After raters select a score for a particular item, the system progresses to the next question. Raters have the opportunity to go back and review any question and their scores until they have scored all the items and “lock in” their scores at the end of the testing session. Once raters complete a particular interview and lock their scores, the scores are stored in a database and are available for calculating interrater reliability. All raters associated with a given study complete the reliability testing mode with the same 6 interviews. Repeat testing to assess rater drift over time can be accomplished with an alternate set of interviews.
The system is designed to provide scoring tutorials and reliability testing using the 17-, 24-, or 28-item versions of the HDRS. The scoring conventions used for the first 17 items are based on the published conventions of the 17-item “Grid-Hamilton,” which provides a single score for each item based on both the intensity and frequency of depressive symptoms (Kalali et al., 2002). The scoring conventions used for items 18–28 were adapted by two of the authors (JR and BHM) to be congruent with the Grid-Hamilton scoring conventions.
The administrative program is designed to perform several functions. The overall administrator of a clinical trial can identify the sites participating in the study and designate a site coordinator for each site. The overall administrator also specifies the version of the HDRS to be used for training and reliability testing (i.e., the 17-, 24-, or 28-item version). In multi-site studies, the site coordinators enter the names and ID numbers of raters at each research site. Sites and raters can be added or removed during the course of a clinical trial. A database stores the test scores of each rater. Intraclass correlation (ICC) coefficients are calculated for raters participating in a particular study, or by site, according to the formula described by Shrout and Fleiss (1979), who describe calculations for three main cases depending on the assignment of judges. Our study follows Case 3, in which “each target is rated by each of the same k judges, who are the only judges of interest” (p. 421).
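The manuscript does not publish the system’s code, but the Case 3 calculation it cites follows directly from a two-way ANOVA on a targets-by-judges score matrix. As an illustrative sketch only (the function name and the choice of the single-measure form ICC(3,1) are assumptions; the system may instead report the average-measure form ICC(3,k)), the computation could look like this:

```python
import numpy as np

def icc_case3(ratings):
    """ICC(3,1) per Shrout & Fleiss (1979), Case 3: each target is
    rated by each of the same k judges, who are the only judges of
    interest. `ratings` is an (n_targets x k_judges) score matrix."""
    r = np.asarray(ratings, dtype=float)
    n, k = r.shape
    grand = r.mean()

    # Two-way ANOVA sums of squares
    ss_targets = k * ((r.mean(axis=1) - grand) ** 2).sum()
    ss_judges = n * ((r.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((r - grand) ** 2).sum()
    ss_error = ss_total - ss_targets - ss_judges

    bms = ss_targets / (n - 1)             # between-targets mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)
```

When two judges’ scores differ only by a constant offset, the residual mean square is zero and the ICC is 1.0; disagreement that is not a consistent offset drives the estimate down.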
Research raters were recruited from the research programs of the Department of Psychiatry at the University of Pittsburgh School of Medicine to conduct an initial evaluation of the web-based system prior to finalization of the system and actual field testing. All participants were research raters in one of the psychiatry research programs. All of them had received previous training on at least one rating instrument using classroom instruction and videotapes to establish reliability. However, some had not been trained to administer and score the HDRS. Regardless of their prior experience with the HDRS, they were required to complete the tutorial program prior to reliability testing. ICCs were calculated for the entire group and for three subgroups based on prior experience with the HDRS: 1) naive raters with no prior experience with the HDRS, 2) experienced raters who had administered the HDRS fewer than 150 times, and 3) highly experienced raters who had administered the HDRS 150 times or more.
Raters were also asked to keep track of, and to report the amount of time and the number of sessions needed to complete the tutorial program and the reliability testing program.
To further evaluate the system, a field trial was completed. In this study, six sites involved in an NIMH-funded multi-site study of late-life mood disorders used the system to train raters on shared scoring conventions and to assess interrater reliability.
For both Study 1 and Study 2, the scoring tutorial and the reliability testing were to be completed within a two-week window. Within that time frame, raters were permitted flexibility in how much time they spent with the system and in the number of sessions needed to complete the tutorial and the testing. For Study 1, the 28-item version of the HDRS was used; for Study 2, the 17-item version was used.
Of the 17 raters who participated in this study, 7 were naive, 3 were experienced, and 7 were highly experienced. The mean age was 42.3 years (range: 22–60). One rater was male; one rater was an African American woman; the remaining raters were Caucasian women.
Based on self-reports, the tutorial was completed in a mean of 1.8 hours (range: 1–2.5 hours) over a mean of 2.5 sessions (range: 1–4). The reliability testing was completed in a mean of 3.3 hours (range: 2.5–5) over a mean of 2.6 separate sessions (range: 1–4).
The ICCs for the naive, experienced, and highly experienced subgroups were 0.94, 0.93, and 0.96, respectively. The ICC calculated for the entire group was 0.95.
Of the 13 participating raters, 10 were female. One woman was Asian, one was Hispanic, and the remaining raters were Caucasian. The mean age was 34.3 years (range: 23–58). All participants completed the tutorial before going on to the test mode. The ICC for this group was 0.98, and no outliers were identified. There were no problems accessing the web site, completing the interactive tutorial and testing, or recovering the ICC data.
The interrater reliabilities were excellent for both Study 1 and Study 2. Establishing rater reliability in studies of depression treatment is critically important; however, most studies do not report on rater training or reliability measures (Mulsant et al., 2002). In typical industry-supported clinical trials, investigators’ meetings are convened to instruct raters and investigators on the proper use of the various instruments. However, rigorous assessments of rater reliability rarely occur at these meetings or at any later time. The practical importance of interrater reliability in reducing variability in multi-site trials has been established (Small et al., 1996); in that report, inadequate rater training and the absence of a measure of interrater reliability were shown to skew the results.
The relatively high ICCs for all groups of raters participating in this study are consistent with interrater reliability described in several studies with the HDRS that used traditional videotapes. In a large multi-center study, the ICC for conventionally-trained raters on scoring the HDRS was 0.97 (Sackeim et al., 2001), and in a single center study with multiple raters, the ICC for conventionally trained raters was 0.95 (Feske et al., 2004).
Although not supported as a training tool in all settings (Sanchez et al., 1995), videotapes of patients have been used to train raters and establish reliability in some clinical trials (Andreasen et al., 1982; Muller et al., 1998; Muller and Wetzel, 1998; Muller and Dragicevic, 2003). Limitations to this technique include the logistical support needed to mail videotapes to all raters at multiple sites, the inability to interact with video-based training, and the additional burden of mailing paper scoring sheets to a data management center and entering the data.
The computer-based system provides interactive training that is continuously available through any computer with high-speed connectivity to the Internet. The integrated database provides ICC calculations and report generation without the additional work of mailing or faxing data and entering it by hand. New raters and new sites can be added over time, and ICCs can be recalculated for the group. Finally, rater drift can be assessed.
It is important to note that the Web-based system described in the current manuscript addresses rater training only in regard to scoring conventions. The equally important component of training raters on the clinical interview skills required for the administration of the HDRS was not addressed in this study. The importance and effectiveness of providing rater interview training for the HDRS with a Web-based instrument has been previously demonstrated (Kobak et al., 2003; Kobak et al., 2005a; Kobak et al., 2005b; Targum, 2006; Kobak et al., 2006; Jeglic et al., 2007). Additional limitations of this study include the relatively small sample size and the fact that all of the “naïve” raters were experienced clinicians or raters who had used different assessment instruments. Finally, the use of videotapes may artificially inflate estimates of reliability by reducing the information variance that would result if each rater interviewed the patient independently (Spitzer and Williams, 1980).
In conclusion, the current study evaluated a web-based system for interactive scoring training and reliability testing in groups of raters using the HDRS in both a single-site and a multi-site study. The calculated ICCs support the effectiveness of this system, which avoids the additional logistical burden involved in the use of videotapes.
This work was sponsored in part by the National Institutes of Health (MH061639, MH069430, MH062565, MH067028, MH068847, HS011976).