The AUC for the adaptive BDI was a respectable 88%, indicating that the adaptive test could correctly classify large proportions of both positive and negative cases. The 'best case' adaptive BDI was able to classify a subject using an average of only 5.6 questions. The latent depression score generated in that simulation was highly correlated with the BDI total score. In addition, for the subset of the data for which Hamilton scores were available, the latent score was more highly correlated with the Hamilton than the BDI total score was. The latter results indicate that the adaptive BDI would be as useful as the full BDI as a measure of the severity of depression. Thus, in our simulation the adaptive BDI was as accurate as the full scale BDI while dramatically improving efficiency. We note that the CAT algorithm can be 'tuned' to the assessment purpose at hand, for which one might choose another point that emphasized either sensitivity or specificity.
The results also suggested, however, that five or six questions would be the required number of items for only a few participants. Indeed, the adaptive BDI asked fewer than five questions for the majority of patients. Even when the adaptive BDI asked few questions, it usually made the same screening decision as the full BDI: the rates of disagreements between the adaptive BDI and the SCID were slightly higher when more questions were asked. In addition, the algorithm asked more questions about positive cases. This is an attractive outcome, because the patients' answers provide useful symptom data for the clinician. Thus, an adaptive test budgets the patient and clinician time spent on measuring depression, allocating it primarily to persons for whom there are reasons for concern.
A reduction of 15 questions may not seem important, given that the full BDI takes only a few minutes to complete. In our experience, however, clinicians are very concerned about both office visit time and maintaining a smooth flow of patients through the waiting room. In addition, health care providers are confronted with recommendations that they screen for many illnesses and health-related problems. Saving questions on a depression screen might free time to screen for other problems such as domestic violence or substance abuse.
Based on this simulation, it appears that adaptive testing has significant promise for settings where both high efficiency and high accuracy are essential. These settings include primary care, where clinician time is a rate-limiting factor, and ongoing monitoring after successful treatment in specialty mental health care, where respondent burden is a constraint. The next step should be to field test adaptive instruments and validate them prospectively, to make certain that they are acceptable to patients and clinicians, and to measure the costs of implementing them.
The principal limitation of our study is that it is a simulation. In particular, we assumed that participants would respond to questions presented adaptively similarly to the way they responded on the paper BDI, and that the order in which questions are asked does not have an important effect on accuracy. We have some confidence in these assumptions, based on prior research showing that adaptive versions of tests in other fields have accuracy that is similar to paper and pencil versions [16
A second limitation is that we compiled our study group from several research studies. Although the study group included a wide range of both healthy and acutely ill individuals, it would be preferable to have a random sample from a defined health care setting. We attempted to statistically control the prevalence of MDE in our bootstrap analyses and found that the performance of the regular and adaptive BDI improved slightly when the prevalence of MDE was decreased. While this is reassuring, we speculate that because our study group may have lacked some of the mildly depressed and difficult to classify cases that are found in many real world settings. The sample that we have collected may also explain why we found evidence for only one factor in the BDI data, where other researchers have found evidence for two or more. Although we view our study as providing strong support for the concept of adaptive measurement of depression, because of these limitations it needs to be replicated in an actual sample.
A final limitation of our study is the BDI itself. Our results argue that a computerized adaptive test has the same sensitivity and specificity as its full-length version, and the same validity as a measure of symptom severity. Conversion to CAT cannot, however, produce a test with higher validity than the base instrument. Although the BDI is widely used, one may still judge that, in light of a false positive rate of 21%, neither the regular nor the adaptive BDI is sufficiently accurate to serve as a screen for MDE. Our results nevertheless suggest that whatever depression instrument one chooses, it would be more efficient to administer it adaptively.