The present study examined the utility of the anhedonic depression scale from the Mood and Anxiety Symptoms Questionnaire (MASQ-AD) as a way to screen for depressive disorders. Using receiver operating characteristic analysis, the sensitivity and specificity of the full 22-item MASQ-AD scale, as well as the 8- and 14-item subscales, were examined in relation to both current and lifetime DSM-IV depressive disorder diagnoses in two nonpatient samples. As a means of comparison, the sensitivity and specificity of a measure of a relevant personality dimension, neuroticism, were also examined. Results from both samples support the clinical utility of the MASQ-AD scale as a means of screening for depressive disorders. Findings were strongest for the MASQ-AD 8-item subscale and when predicting current depression status. Furthermore, the MASQ-AD 8-item subscale outperformed the neuroticism measure under certain conditions. The overall usefulness of the MASQ-AD scale as a screening device is discussed, as well as possible cutoff scores for use in research.
depressive disorders; anhedonic depression; Mood and Anxiety Symptoms Questionnaire; receiver operating characteristic analysis; screening
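The screening statistics above can be sketched in code. The following is a minimal illustration of computing sensitivity and specificity at a given cutoff; the scores, diagnosis flags, and cutoff are hypothetical, not the study's data.

```python
def sens_spec(scores, diagnoses, cutoff):
    """Sensitivity/specificity of 'score >= cutoff' as a positive screen."""
    tp = sum(1 for s, d in zip(scores, diagnoses) if s >= cutoff and d)
    fn = sum(1 for s, d in zip(scores, diagnoses) if s < cutoff and d)
    tn = sum(1 for s, d in zip(scores, diagnoses) if s < cutoff and not d)
    fp = sum(1 for s, d in zip(scores, diagnoses) if s >= cutoff and not d)
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative data only: scale scores and diagnosis flags (1 = disorder).
scores = [12, 18, 21, 31, 14, 29, 35, 26, 27, 16]
diagnoses = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
print(sens_spec(scores, diagnoses, 25))  # → (0.8, 0.8)
```

Sweeping the cutoff over the observed score range and plotting the resulting (1 − specificity, sensitivity) pairs is what produces the ROC curve itself.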
The Brief Symptom Inventory (BSI), Mood and Anxiety Symptom Questionnaire-30 (MASQ-D30), Short Form Health Survey 36 (SF-36), and Dimensional Assessment of Personality Pathology-Short Form (DAPP-SF) are generic instruments that can be used in Routine Outcome Monitoring (ROM) of patients with common mental disorders. We aimed to generate reference values usually encountered in 'healthy' and 'psychiatrically ill' populations to facilitate correct interpretation of ROM results.
We included the following specific reference populations: 1294 subjects from the general population (ROM reference group) recruited through general practitioners, and 5269 psychiatric outpatients diagnosed with mood, anxiety, or somatoform (MAS) disorders (ROM patient group). The outermost 5% of observations were used to define limits for one-sided reference intervals (95th percentiles for BSI, MASQ-D30 and DAPP-SF, and 5th percentiles for SF-36 subscales). Internal consistency and Receiver Operating Characteristics (ROC) analyses were performed.
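The one-sided reference limits described above can be sketched as a percentile computation. This uses the nearest-rank convention, one of several percentile definitions, and is illustrative rather than the authors' exact procedure.

```python
def one_sided_limit(values, upper=True, pct=0.95):
    """One-sided reference limit leaving the outermost 5% of observations
    beyond it. upper=True gives the 95th percentile (BSI/MASQ-D30/DAPP-SF
    style); upper=False gives the 5th percentile (SF-36 style)."""
    xs = sorted(values)
    q = pct if upper else 1 - pct
    # nearest-rank percentile (one common convention)
    idx = max(0, min(len(xs) - 1, round(q * len(xs)) - 1))
    return xs[idx]

sample = list(range(1, 101))          # hypothetical scores 1..100
print(one_sided_limit(sample))         # → 95 (upper limit)
print(one_sided_limit(sample, False))  # → 5 (lower limit)
```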
Mean age was 40.3 years (SD=12.6) for the ROM reference group and 37.7 years (SD=12.0) for the ROM patient group. The proportion of females was 62.8% and 64.6%, respectively. The mean cut-off value for healthy individuals was 0.82 for the BSI subscales, 23 for the three MASQ-D30 subscales, 45 for the SF-36 subscales, and 3.1 for the DAPP-SF subscales. Discriminative power of the BSI, MASQ-D30, and SF-36 was good, but it was poor for the DAPP-SF. For all instruments, the internal consistency of the subscales ranged from adequate to excellent.
Discussion and conclusion
Reference values for the clinical interpretation were provided for the BSI, MASQ-D30, SF-36, and DAPP-SF. Clinical information aided by ROM data may represent the best means to appraise the clinical state of psychiatric outpatients.
Reference values; Routine outcome monitoring; Questionnaires; Mood disorders; Anxiety disorders; Somatoform disorders
This study investigated the combination of item response theory and computerized adaptive testing (CAT) for psychiatric measurement as a means of reducing the burden of research and clinical assessments.
Data were from 800 participants in outpatient treatment for a mood or anxiety disorder; they completed 616 items of the 626-item Mood and Anxiety Spectrum Scales (MASS) on two occasions. The first administration was used to design and evaluate a CAT version of the MASS through post hoc simulation; the second confirmed the functioning of the CAT in live testing.
Tests of competing models based on item response theory supported the scale’s bifactor structure, consisting of a primary dimension and four group factors (mood, panic-agoraphobia, obsessive-compulsive, and social phobia). Both simulated and live CAT showed a 95% average reduction (585 items) in items administered (24 and 30 items, respectively) compared with administration of the full MASS. The correlation between scores on the full MASS and the CAT version was .93. For the mood disorder subscale, differences in scores between two groups of depressed patients—one with bipolar disorder and one without—on the full scale and on the CAT showed effect sizes of .63 (p<.003) and 1.19 (p<.001) standard deviation units, respectively, indicating better discriminant validity for CAT.
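The effect sizes in standard deviation units reported above are standardized mean differences. A pooled-SD (Cohen's d) computation is one common form; this sketch uses made-up group data and is not necessarily the authors' exact estimator.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical subscale scores for two depressed groups.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # → 0.5
```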
Instead of using small fixed-length tests, clinicians can create item banks with a large item pool, and a small set of the items most relevant for a given individual can be administered with no loss of information, yielding a dramatic reduction in administration time and patient and clinician burden.
The overlap between Depression and Anxiety has led some researchers to conclude that they are manifestations of a broad, non-specific neurotic disorder. However, others believe that they can be distinguished despite sharing symptoms of general distress. The Tripartite Model of Affect proposes an anxiety-specific, a depression-specific and a shared symptoms factor. Watson and Clark developed the Mood and Anxiety Symptom Questionnaire (MASQ) to specifically measure these Tripartite constructs. Early research showed that the MASQ distinguished between dimensions of Depression and Anxiety in non-clinical samples. However, two recent studies have cautioned that the MASQ may show limited validity in clinical populations. The present study investigated the clinical utility of the MASQ in a clinical sample of adolescents and young adults.
A total of 204 young people consecutively referred to a specialist public mental health service in Melbourne, Australia, were approached, and 150 consented to participate. Of these, 136 participants completed both a diagnostic interview and the MASQ.
The majority of the sample rated for an Axis-I disorder, with Mood and Anxiety disorders most prevalent. The disorder-specific scales of the MASQ significantly discriminated Anxiety (61.0%) and Mood Disorders (72.8%); however, the predictive accuracy for the presence of Anxiety Disorders was very low (29.8%). From ROC analyses, a cut-off of 76 was proposed for the depression scale to indicate 'caseness' for Mood Disorders. The resulting sensitivity/specificity was superior to that of the CES-D.
It was concluded that the depression-specific scale of the MASQ showed good clinical utility, but that the anxiety-specific scale showed poor discriminant validity.
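One common way to derive a 'caseness' cut-off from an ROC analysis, as in the study above, is to choose the score maximizing Youden's J (sensitivity + specificity − 1). This sketch uses hypothetical scores and is not the authors' exact procedure.

```python
def youden_cutoff(scores, cases):
    """Return the cutoff maximizing Youden's J = sensitivity + specificity - 1,
    treating 'score >= cutoff' as a positive screen."""
    best_j, best_cut = -1.0, None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, c in zip(scores, cases) if s >= cut and c)
        fn = sum(1 for s, c in zip(scores, cases) if s < cut and c)
        tn = sum(1 for s, c in zip(scores, cases) if s < cut and not c)
        fp = sum(1 for s, c in zip(scores, cases) if s >= cut and not c)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j

# Hypothetical depression-scale scores and diagnostic status.
print(youden_cutoff([10, 20, 70, 80, 90, 60], [0, 0, 1, 1, 1, 0]))  # → (70, 1.0)
```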
Short-form patient-reported outcome measures are popular because they minimize patient burden. We assessed the efficiency of static short forms and computer adaptive testing (CAT) using data from the Patient-Reported Outcomes Measurement Information System (PROMIS) project.
We evaluated the 28-item PROMIS depressive symptoms bank. We used post hoc simulations based on the PROMIS calibration sample to compare several short-form selection strategies and the PROMIS CAT to the total item bank score.
Compared with full-bank scores, all short forms and CAT produced highly correlated scores, but CAT outperformed each static short form in almost all criteria. However, short-form selection strategies performed only marginally worse than CAT. The performance gap observed in static forms was reduced by using a two-stage branching test format.
Using several polytomous items in a calibrated unidimensional bank to measure depressive symptoms yielded a CAT that provided marginally superior efficiency compared to static short forms. The efficiency of a two-stage semi-adaptive testing strategy was so close to CAT that it warrants further consideration and study.
Computer adaptive testing; PROMIS; Item response theory; Short form; Two-stage testing
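The two-stage branching format mentioned above can be sketched as a fixed routing block whose raw score selects one of several fixed second-stage blocks. All item identifiers and routing thresholds below are hypothetical; a real bank would carry IRT item parameters and use model-based scoring.

```python
# Hypothetical item blocks; items are bare ids for illustration.
ROUTING_BLOCK = ["r1", "r2", "r3", "r4"]
STAGE2 = {
    "low":  ["e1", "e2", "e3", "e4"],   # easier items
    "mid":  ["m1", "m2", "m3", "m4"],
    "high": ["h1", "h2", "h3", "h4"],   # harder items
}

def route(stage1_raw_score, max_raw=4):
    """Branch on the stage-1 raw score (thresholds are illustrative)."""
    if stage1_raw_score <= max_raw // 3:
        return "low"
    if stage1_raw_score <= 2 * max_raw // 3:
        return "mid"
    return "high"

def two_stage_items(stage1_raw_score):
    """Full semi-adaptive form: routing block plus the targeted block."""
    return ROUTING_BLOCK + STAGE2[route(stage1_raw_score)]

print(two_stage_items(3))  # routes to the harder block
```

Because only one branching decision is made, such a form can be printed as a small set of fixed paths, which is one reason the abstract suggests it as a practical alternative to fully adaptive CAT.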
Unlike other areas of medicine, psychiatry is almost entirely dependent on patient report to assess the presence and severity of disease; therefore, it is particularly crucial that we find both more accurate and efficient means of obtaining that report.
To develop a computerized adaptive test (CAT) for depression, called the Computerized Adaptive Test–Depression Inventory (CAT-DI), that decreases patient and clinician burden and increases measurement precision.
A psychiatric clinic and community mental health center.
A total of 1614 individuals with and without minor and major depression were recruited for study.
Main Outcome Measures
The focus of this study was the development of the CAT-DI. The 24-item Hamilton Rating Scale for Depression, Patient Health Questionnaire 9, and the Center for Epidemiologic Studies Depression Scale were used to study the convergent validity of the new measure, and the Structured Clinical Interview for DSM-IV was used to obtain diagnostic classifications of minor and major depressive disorder.
A mean of 12 items per study participant was required to achieve a 0.3 SE in the depression severity estimate and maintain a correlation of r=0.95 with the total 389-item test score. Using empirically derived thresholds based on a mixture of normal distributions, we found a sensitivity of 0.92 and a specificity of 0.88 for the classification of major depressive disorder in a sample consisting of depressed patients and healthy controls. Correlations on the order of r=0.8 were found with the other clinician and self-rating scale scores. The CAT-DI provided excellent discrimination throughout the entire depressive severity continuum (minor and major depression), whereas the traditional scales did so primarily at the extremes (eg, major depression).
Traditional measurement fixes the number of items administered and allows measurement uncertainty to vary. In contrast, a CAT fixes measurement uncertainty and allows the number of items to vary. The result is a significant reduction in the number of items needed to measure depression and increased precision of measurement.
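The fixed-uncertainty logic described above can be sketched as a maximum-information CAT loop under a two-parameter logistic (2PL) model that stops once the posterior SD of the severity estimate falls below a target. The item parameters, grid, and answer model below are illustrative, not the CAT-DI's.

```python
import math

# Illustrative 2PL item bank: (discrimination a, difficulty b).
ITEMS = [
    (1.8, -1.5), (1.2, -0.5), (2.0, 0.0), (1.5, 0.7),
    (1.7, 1.4), (1.1, -1.0), (1.9, 0.3), (1.4, 1.0),
]
GRID = [g / 10 for g in range(-40, 41)]  # theta grid, -4 to 4

def p_endorse(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def eap(responses):
    """Expected-a-posteriori theta and posterior SD under a N(0,1) prior."""
    post = []
    for t in GRID:
        w = math.exp(-t * t / 2)  # standard normal prior (up to a constant)
        for (a, b), u in responses:
            p = p_endorse(t, a, b)
            w *= p if u else 1 - p
        post.append(w)
    z = sum(post)
    mean = sum(t * w for t, w in zip(GRID, post)) / z
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post)) / z
    return mean, math.sqrt(var)

def run_cat(answer, se_target=0.3, max_items=8):
    """Administer maximum-information items until the posterior SD
    (used here as the SE of the severity estimate) falls below target."""
    responses, remaining = [], list(ITEMS)
    theta, se = 0.0, float("inf")
    while remaining and se > se_target and len(responses) < max_items:
        # pick the item with maximum Fisher information at current theta
        item = max(remaining, key=lambda ab: ab[0] ** 2
                   * p_endorse(theta, *ab) * (1 - p_endorse(theta, *ab)))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap(responses)
    return theta, se, len(responses)

# Deterministic toy respondent: endorses items easier than 0.5 logits.
print(run_cat(lambda item: item[1] < 0.5))
```

With a small bank like this the loop typically exhausts `max_items` before reaching the SE target; a bank the size of the CAT-DI's is what makes the fixed-uncertainty stopping rule attainable in roughly a dozen items.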
The Mood and Anxiety Symptom Questionnaire (MASQ) was designed to specifically measure the Tripartite model of affect and is proposed to offer a delineation between the core components of anxiety and depression. Factor analytic data from adult clinical samples have shown mixed results; however, no studies employing confirmatory factor analysis (CFA) have supported the predicted structure of distinct Depression, Anxiety and General Distress factors. The Tripartite model has not been validated in a clinical sample of older adolescents and young adults. The aim of the present study was to examine the validity of the Tripartite model using scale-level data from the MASQ and correlational and confirmatory factor analysis techniques.
137 young people (M = 17.78 years, SD = 2.63) referred to a specialist mental health service for adolescents and young adults completed the MASQ and a diagnostic interview.
All MASQ scales were highly inter-correlated, with the lowest correlation between the depression- and anxiety-specific scales (r = .59). This pattern of correlations was observed for all participants rating for an Axis-I disorder but not for participants without a current disorder (r = .18). Confirmatory factor analyses were conducted to evaluate the model fit of a number of solutions. The predicted Tripartite structure was not supported. A 2-factor model demonstrated superior model fit and parsimony compared to 1- or 3-factor models. These broad factors represented Depression and Anxiety and were highly correlated (r = .88).
The present data lend support to the notion that the Tripartite model does not adequately explain the relationship between anxiety and depression in all clinical populations. Indeed, in the present study this model was found to be inappropriate for a help-seeking community sample of older adolescents and young adults.
Questionnaires used by health services to identify children with psychosocial problems are often rather short. The psychometric properties of such short questionnaires are mostly less than needed for an accurate distinction between children with and without problems. We aimed to assess whether a short Computerized Adaptive Test (CAT) can overcome the weaknesses of short written questionnaires when identifying children with psychosocial problems.
We used a Dutch national data set obtained from parents of children invited for a routine health examination by Preventive Child Healthcare with 205 items on behavioral and emotional problems (n = 2,041, response 84%). In a random subsample we determined which items met the requirements of an Item Response Theory (IRT) model to a sufficient degree. Using those items, item parameters necessary for a CAT were calculated and a cut-off point was defined. In the remaining subsample we determined the validity and efficiency of a Computerized Adaptive Test using simulation techniques, with current treatment status and a clinical score on the Total Problem Scale (TPS) of the Child Behavior Checklist as criteria.
Of the 205 items available, 190 sufficiently met the criteria of the underlying IRT model. For 90% of the children, a score above or below the cut-off point could be determined with 95% accuracy. The mean number of items needed to achieve this was 12. Sensitivity and specificity with the TPS as a criterion were 0.89 and 0.91, respectively.
An IRT-based CAT is a very promising option for the identification of psychosocial problems in children, as it can lead to an efficient, yet high-quality identification. The results of our simulation study need to be replicated in a real-life administration of this CAT.
We provide detailed instructions for analyzing patient-reported outcome (PRO) data collected with an existing (legacy) instrument so that scores can be calibrated to the PRO Measurement Information System (PROMIS) metric. This calibration facilitates migration to computerized adaptive test (CAT) PROMIS data collection, while facilitating research using historical legacy data alongside new PROMIS data.
A cross-sectional convenience sample (n = 2,178) from the Universities of Washington and Alabama at Birmingham HIV clinics completed the PROMIS short form and Patient Health Questionnaire (PHQ-9) depression symptom measures between August 2008 and December 2009. We calibrated the tests using item response theory. We compared measurement precision of the PHQ-9, the PROMIS short form, and simulated PROMIS CAT.
Dimensionality analyses confirmed the PHQ-9 could be calibrated to the PROMIS metric. We provide code used to score the PHQ-9 on the PROMIS metric. The mean standard errors of measurement were 0.49 for the PHQ-9, 0.35 for the PROMIS short form, and 0.37, 0.28, and 0.27 for 3-, 8-, and 9-item-simulated CATs.
The strategy described here facilitated migration from a fixed-format legacy scale to PROMIS CAT administration and may be useful in other settings.
Calibration; Computerized adaptive testing; Depression; Item banks; Item response theory; PROMIS
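On a standardized θ metric (population variance 1), a standard error of measurement maps to an approximate marginal reliability via 1 − SE². Assuming the SEMs above are on such a metric (an assumption, not stated in the abstract), the conversion is:

```python
def marginal_reliability(se):
    """Approximate reliability on a standardized theta metric: 1 - SE^2.
    Assumes trait variance of 1; not applicable to T-score-metric SEs."""
    return 1 - se ** 2

# The reported mean SEMs would imply roughly:
print(marginal_reliability(0.49))  # PHQ-9: ~0.76
print(marginal_reliability(0.35))  # PROMIS short form: ~0.88
```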
The purpose of this study was to assess the utility of measuring the current physical functioning status of children with complex spinal impairments by applying computerized adaptive testing (CAT) methods. CAT uses a computer interface to administer the most optimal items based on previous responses, reducing the number of items needed to obtain a scoring estimate.
This was a prospective study of 77 subjects (0.6–19.8 years) with spinal impairments who were seen during a routine clinic visit. Using a multidimensional version of the Pediatric Evaluation of Disability Inventory CAT program (PEDI-MCAT), we evaluated content range, accuracy and efficiency, known-groups validity, concurrent validity with the Pediatric Outcomes Data Collection Instrument (PODCI), and test-retest reliability in a sub-sample (n=16) within a two-week interval.
We found the PEDI-MCAT to have sufficient item coverage in both self-care and mobility content for this sample, although a majority of the patients tended to score at the higher ends of both scales. Both the accuracy of PEDI-MCAT scores as compared to a fixed-format of the PEDI (r = 0.98 for both mobility and self-care) and test-retest reliability were very high (self-care: ICC (3,1)=0.98, mobility: ICC(3,1)=0.99). The PEDI-MCAT took an average of 2.9 minutes for the parents to complete. The PEDI-MCAT detected expected differences between patient groups, and scores on the PEDI-MCAT correlated in expected directions with scores from the PODCI domains.
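The test-retest statistic used above, ICC(3,1) (two-way mixed effects, consistency, single measures), can be computed from a subjects × occasions matrix. A pure-Python sketch with made-up ratings:

```python
def icc31(ratings):
    """ICC(3,1): two-way mixed, consistency, single measures.
    `ratings` is a list of rows, one per subject, one column per occasion."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Perfectly consistent retest (second score always one point higher):
print(icc31([[1, 2], [2, 3], [3, 4]]))  # → 1.0
```

Consistency (rather than absolute agreement) is why the uniform one-point shift above still yields an ICC of 1.0.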
Use of the PEDI-MCAT to assess the physical functioning status, as perceived by parents of children with complex spinal impairments, appears to be feasible and achieves accurate and efficient estimates of self-care and mobility function. Additional item development will be needed at the higher functioning end of the scale to avoid ceiling effects for older children.
Level of Evidence
This is a level II prospective study designed to establish the utility of computer adaptive testing as an evaluation method in a busy pediatric spine practice.
computerized adaptive testing; assessment; outcomes; spine impairments
Many hospitals have adopted mobile nursing carts that can be easily rolled up to a patient’s bedside to access charts and help nurses perform their rounds. However, few papers have reported data regarding the use of wireless computers on wheels (COW) at patients’ bedsides to collect questionnaire-based information of their perception of hospitalization on discharge from the hospital.
The purpose of this study was to evaluate the relative efficiency of computerized adaptive testing (CAT) and the precision of CAT-based measures of perceptions of hospitalized patients, as compared with those of nonadaptive testing (NAT). An Excel module of our CAT multicategory assessment is provided as an example.
A total of 200 patients who were discharged from the hospital responded to the CAT-based 18-item inpatient perception questionnaire on a COW. The number of questions administered was recorded, and the responses were calibrated using the Rasch model. Results were compared with those from NAT to demonstrate the advantage of CAT over NAT.
Patient measures derived from CAT and NAT were highly correlated (r = 0.98) and their measurement precisions were not statistically different (P = .14). CAT required fewer questions than NAT (an efficiency gain of 42%), suggesting a reduced burden for patients. There were no significant differences between groups in terms of gender and other demographic characteristics.
CAT-based administration of surveys of patient perception substantially reduced patient burden without compromising the precision of measuring patients' perceptions of hospitalization. The animation-based Excel CAT module we developed for the wireless COW is recommended for use in hospitals.
Computerized adaptive testing; computer on wheels; classic test theory; IRT; item response theory; nonadaptive testing
The use of item response theory (IRT) to measure self-reported outcomes has burgeoned in recent years. Perhaps the most important application of IRT is computer-adaptive testing (CAT), a measurement approach in which the selection of items is tailored for each respondent.
To provide an introduction to the use of CAT in the measurement of health outcomes, describe several IRT models that can be used as the basis of CAT, and discuss practical issues associated with the use of adaptive scaling in research settings.
The development of a CAT requires several steps that are not required in the development of a traditional measure, including identification of "starting" and "stopping" rules. CAT's most attractive advantage is its efficiency: greater measurement precision can be achieved with fewer items. Disadvantages of CAT include the high cost and level of technical expertise required to develop one.
Researchers, clinicians, and patients benefit from the availability of psychometrically rigorous measures that are not burdensome. CAT outcome measures hold substantial promise in this regard, but their development is not without challenges.
Measurement; quality of life; psychometrics; reliability
The purpose of this research was to calibrate an item bank for a computerized adaptive test (CAT) of asthma impact on health-related quality of life (HRQOL), test CAT versions of varying lengths, conduct preliminary validity testing, and evaluate item bank readability.
Asthma Impact Survey (AIS) bank items that passed focus group, cognitive testing, and clinical and psychometric reviews were administered to adults with varied levels of asthma control. Adults self-reporting asthma (N=1106) completed an Internet survey including 88 AIS items, the Asthma Control Test (ACT), and other HRQOL outcome measures. Data were analyzed using classical and modern psychometric methods, real-data CAT simulations, and known groups validity testing.
A bi-factor model with a general factor (asthma impact) and several group factors (cognitive function, fatigue, mental health, physical function, role function, sexual function, self-consciousness/stigma, sleep, and social function) was tested. Loadings on the general factor were above 0.5 and were substantially larger than group factor loadings, and fit statistics were acceptable. Item functioning for most items and fit to the model was acceptable. CAT simulations demonstrated several options for administration and stopping rules. AIS distinguished between respondents with differing levels of asthma control.
The new 50-item AIS item bank demonstrated favorable psychometric characteristics, preliminary evidence of validity, and accessibility at moderate reading levels. Developing item banks for CAT can improve the precise, efficient, and comprehensive monitoring of asthma outcomes, and may facilitate patient-centered care.
asthma control; Asthma Impact Survey; item response theory; patient-reported outcome; health-related quality of life
Dyspnea is a common symptom among patients with heart failure. Currently, there is no standardized, rapid, precise method to assess dyspnea.
Methods and Results
From a review of the literature, we pooled questions from various questionnaires assessing dyspnea. A total of 201 patients with heart failure completed all questions in the preliminary item bank. Each item asks how much shortness of breath the patient had when doing an activity. Medical charts were reviewed for hospitalization within 1 or 3 months of completing the questions. We created a dyspnea item bank of 44 items. Computer Adaptive Tests (CAT) generated from this item bank can assess dyspnea by administering on average 10 questions. Simulated CAT scores were generated for comparison with the item bank scores. The CAT scores had a correlation of 0.98 with item bank scores. Logistic regression models predicting the probability of being hospitalized from the dyspnea score were statistically significant (p<0.05). A 5-point score increase was associated with a 32% increase in the odds of hospitalization within 1 month and a 20% increase in the odds of hospitalization within 3 months.
This computer-based tool for dyspnea assessment achieves precision similar to that of answering the entire dyspnea item bank, with less patient burden.
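The reported odds ratios imply a per-point logistic coefficient of ln(OR₅)/5, where OR₅ is the odds ratio per 5-point increase. A small sketch of that arithmetic; the ORs (1.32 at 1 month, 1.20 at 3 months) come from the results above, while the extrapolation to other score increases is illustrative.

```python
import math

def per_point_beta(or_per_5):
    """Logistic regression coefficient per score point implied by the
    odds ratio for a 5-point increase."""
    return math.log(or_per_5) / 5

def odds_multiplier(beta, score_increase):
    """Multiplicative change in odds for a given score increase."""
    return math.exp(beta * score_increase)

b1 = per_point_beta(1.32)  # 1-month model
# A 10-point increase would imply 1.32**2 ≈ 1.74x the odds at 1 month.
print(odds_multiplier(b1, 10))
```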
The aim of this article is to report the development and preliminary testing of a prototype computerized adaptive test of chronic pain (CHRONIC PAIN-CAT) conducted in two stages: 1) evaluation of various item selection and stopping rules through real-data simulated administrations of the CHRONIC PAIN-CAT; 2) a feasibility study of the actual prototype CHRONIC PAIN-CAT assessment system conducted in a pilot sample. Item calibrations developed from a US general population sample (N=782) were used to program a pain severity and impact item bank (k=45), and real-data simulations were conducted to determine a CAT stopping rule. The CHRONIC PAIN-CAT was programmed on a tablet PC using QualityMetric's Dynamic Health Assessment (DYNHA®) software and administered to a clinical sample of pain sufferers (n=100). The CAT was completed in significantly less time than the static (full item bank) assessment (p<.001). On average, 5.6 items were dynamically administered by CAT to achieve a precise score. Scores estimated from the two assessments were highly correlated (r=.89) and both assessments discriminated across pain severity levels (p<.001, RV=.95). Patients' evaluations of the CHRONIC PAIN-CAT were favourable.
This report demonstrates that the CHRONIC PAIN-CAT is feasible for administration in a clinic. The application has the potential to improve pain assessment and help clinicians manage chronic pain.
Chronic Pain; Item Response Theory; Computer Adaptive Testing; Pain Assessment
The COPD Assessment Test (CAT™) is a new short health status measure for routine use. New questionnaires require reference points so that users can understand the scores; descriptive scenarios are one way of doing this. A novel method of creating scenarios is described.
A Bland and Altman plot showed a consistent relationship between CAT scores and scores obtained with the St George's Respiratory Questionnaire for COPD (SGRQ-C) permitting a direct mapping process between CAT and SGRQ items. The severity associated with each CAT item was calculated using a probabilistic model and expressed in logits (log odds of a patient of given severity affirming that item 50% of the time). Severity estimates for SGRQ-C items in logits were also available, allowing direct comparisons with CAT items. CAT scores were categorised into Low, Medium, High and Very High Impact. SGRQ items of corresponding severity were used to create scenarios associated with each category.
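The logit anchoring described above follows from the Rasch model: an item's difficulty is the severity at which the probability of affirming it is exactly 50%. A minimal sketch of that relationship:

```python
import math

def p_affirm(theta, b):
    """Rasch model: probability that a patient of severity theta (logits)
    affirms an item of difficulty b (logits)."""
    return 1 / (1 + math.exp(-(theta - b)))

# At theta == b the probability is exactly 0.5 - the 50% anchoring
# used to place CAT and SGRQ-C items on a common severity scale.
print(p_affirm(1.0, 1.0))  # → 0.5
print(p_affirm(2.0, 1.0) > 0.5)  # more severe patients affirm more often
```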
Each CAT category was associated with a scenario comprising 12 to 16 SGRQ-C items. A severity 'ladder' associating CAT scores with exemplar health status effects was also created. Items associated with 'Low' and 'Medium' Impact appeared to be subjectively quite severe in terms of their effect on daily life.
These scenarios provide users of the CAT with a good sense of the health impact associated with different scores. More generally they provide a surprising insight into the severity of the effects of COPD, even in patients with apparently mild-moderate health status impact.
A real-data simulation of computerized adaptive testing (CAT) is an important step in real life CAT applications. Such a simulation allows CAT developers to evaluate important features of the CAT system such as item selection and stopping rules before live testing. SIMPOLYCAT, an SAS macro program, was created by the authors to conduct real-data CAT simulation based on polytomous item response theory (IRT) models. In SIMPOLYCAT, item responses can be input from an external file or generated internally based on item parameters provided by users. The program allows users to choose among methods of setting initial θ, approaches to item selection, trait estimators, CAT stopping criteria, polytomous IRT models, and other CAT parameters. In addition, CAT simulation results can be saved easily and used for further study. The purpose of this article is to introduce SIMPOLYCAT, briefly describe the program algorithm and parameters, and provide examples of CAT simulations using generated and real data. Visual comparisons of the results obtained from the CAT simulations are presented.
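One polytomous IRT model a SIMPOLYCAT user could select is Samejima's graded response model, in which P(X ≥ k) follows a logistic curve and category probabilities are successive differences. A sketch with illustrative parameters:

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Samejima graded response model: P(X >= k) = logistic(a*(theta - b_k));
    category probabilities are differences of adjacent cumulative curves.
    `thresholds` must be increasing; parameters here are illustrative."""
    cum = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# A 4-category item (3 thresholds) for a respondent at theta = 0.
print(grm_category_probs(0.0, 1.5, [-1.0, 0.0, 1.0]))
```

The probabilities always sum to 1, and increasing θ shifts mass toward the higher categories, which is what lets the CAT score polytomous responses.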
To develop and evaluate a prototype measure (OA-DISABILITY-CAT) for osteoarthritis research using Item Response Theory (IRT) and Computer Adaptive Test (CAT) methodologies.
Study Design and Setting
We constructed an item bank consisting of 33 activities commonly affected by lower extremity (LE) osteoarthritis. A sample of 323 adults with LE osteoarthritis reported their degree of limitation in performing everyday activities and completed the Health Assessment Questionnaire-II (HAQ-II). We used confirmatory factor analyses to assess scale unidimensionality and IRT methods to calibrate the items and examine the fit of the data. Using CAT simulation analyses, we examined the performance of OA-DISABILITY-CATs of different lengths compared to the full item bank and the HAQ-II.
One distinct disability domain was identified. The 10-item OA-DISABILITY-CAT demonstrated a high degree of accuracy compared with the full item bank (r=0.99). The item bank and the HAQ-II scales covered a similar estimated scoring range. In terms of reliability, 95% of OA-DISABILITY reliability estimates were over 0.83, versus 0.60 for the HAQ-II. Except at the highest scores, the 10-item OA-DISABILITY-CAT demonstrated superior precision to the HAQ-II.
The prototype OA-DISABILITY-CAT demonstrated promising measurement properties compared to the HAQ-II, and is recommended for use in LE osteoarthritis research.
outcome assessment (Health Care); osteoarthritis; clinical trials; disability; item response theory; computer adaptive testing
Patient-reported outcomes (PROs) are an important endpoint in orthopedics providing comprehensive information about patients' perspectives on treatment outcome. Computer-adaptive test (CAT) measures are an advanced method for assessing PROs using item sets that are tailored to the individual patient. This provides increased measurement precision and reduces the number of items. We developed a CAT version of the Forgotten Joint Score (FJS), a measure of joint awareness in everyday life. CAT development was based on FJS data from 580 patients after THA or TKA (808 assessments). The CAT version reduced the number of items by half at comparable measurement precision. In a feasibility study we administered the newly developed CAT measure on tablet PCs and found that patients actually preferred electronic questionnaires over paper–pencil questionnaires.
patient-reported outcomes; forgotten joint score; electronic data capture; computer-adaptive testing; total knee arthroplasty; total hip arthroplasty
Computerized adaptive testing (CAT) is being applied to health outcome measures developed as paper-and-pencil (P&P) instruments. Differences in how respondents answer items administered by CAT vs. P&P can increase error in CAT-estimated measures if not identified and corrected.
Two methods for detecting item-level mode effects are proposed using Bayesian estimation of posterior distributions of item parameters: (1) a modified robust Z (RZ) test, and (2) 95% credible intervals (CrI) for the CAT-P&P difference in item difficulty. A simulation study was conducted under the following conditions: (1) data-generating model (one- vs. two-parameter IRT model); (2) moderate vs. large DIF sizes; (3) percentage of DIF items (10% vs. 30%), and (4) mean difference in θ estimates across modes of 0 vs. 1 logits. This resulted in a total of 16 conditions with 10 generated datasets per condition.
Both methods evidenced good to excellent false positive control, with RZ providing better control of false positives and CrI slightly higher power, irrespective of measurement model. False positives increased when items were very easy to endorse and when there were mode differences in mean trait level. True positives were predicted by CAT item usage, absolute item difficulty, and item discrimination. Overall, RZ outperformed CrI, owing to its better control of false positive DIF.
Whereas false positives were well controlled, particularly for RZ, power to detect DIF was suboptimal. Research is needed to examine the robustness of these methods under varying prior assumptions concerning the distribution of item and person parameters and when data fail to conform to prior assumptions. False identification of DIF when items were very easy to endorse is a problem warranting additional investigation.
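The robust Z approach above standardizes CAT-vs-P&P item difficulty differences against outlier-resistant location and scale estimates. A sketch of the general robust-Z form, using the median and 0.74 × IQR (which approximates the SD for normal data); the authors' modified Bayesian variant will differ in detail.

```python
from statistics import median

def robust_z(diffs):
    """Robust Z statistics for item difficulty differences:
    (d - median) / (0.74 * IQR). Items with |RZ| > 1.96 would be
    flagged as potential mode-effect DIF under this general form."""
    med = median(diffs)
    xs = sorted(diffs)
    n = len(xs)
    q1 = median(xs[: n // 2])
    q3 = median(xs[(n + 1) // 2:])
    iqr = q3 - q1
    return [(d - med) / (0.74 * iqr) for d in diffs]

# Five well-behaved items and one outlier (hypothetical differences).
print(robust_z([0.1, -0.1, 0.0, 0.2, -0.2, 2.0]))
```

Because the median and IQR are barely moved by the outlying item, its Z statistic stands out sharply, whereas a mean/SD standardization would be dragged toward it.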
To develop and test a prototype dyspnea computer adaptive test.
Two outpatient medical facilities.
A convenience sample of 292 adults with COPD.
Main Outcome Measure
We developed a modified and expanded item bank and computer adaptive test (CAT) for the Dyspnea Management Questionnaire (DMQ), an outcome measure consisting of four dyspnea dimensions: dyspnea intensity, dyspnea anxiety, activity avoidance, and activity self-efficacy.
Factor analyses supported a four-dimensional model underlying the 71 DMQ items. The DMQ item bank achieved acceptable Rasch model fit statistics, good measurement breadth with minimal floor and ceiling effects, and evidence of high internal consistency reliability (α = 0.92 to 0.98). In CAT simulation analyses, the DMQ-CAT showed high measurement accuracy relative to the total item pool (r = 0.83 to 0.97, p < .0001) and evidence of good to excellent concurrent validity (r = −0.61 to −0.80, p < .0001). All DMQ-CAT domains showed evidence of known-groups validity (p ≤ 0.001).
The DMQ-CAT reliably and validly captured four distinct dyspnea domains. Multidimensional dyspnea assessment in COPD is needed to better measure the effectiveness of pharmacologic, pulmonary rehabilitation, and psychosocial interventions in not only alleviating the somatic sensation of dyspnea but also reducing dysfunctional emotions, cognitions, and behaviors associated with dyspnea, especially for anxious patients.
Dyspnea; COPD; Outcomes assessment; Reliability; Validity
To evaluate the accuracy of computer adaptive tests (CATs) of daily routines for child- and parent-reported outcomes following pediatric spinal cord injury (SCI) and to evaluate the validity of the scales.
One hundred ninety-six daily routine items were administered to 381 youths and 322 parents. Pearson correlations, intraclass correlation coefficients (ICC), and 95% confidence intervals (CI) were calculated to evaluate the accuracy of simulated 5-item, 10-item, and 15-item CATs against the full-item banks and to evaluate concurrent validity. Independent samples t tests and analysis of variance were used to evaluate the ability of the daily routine scales to discriminate between children with tetraplegia and paraplegia and among 5 motor groups.
ICC and 95% CI demonstrated that simulated 5-, 10-, and 15-item CATs accurately represented the full-item banks for both child- and parent-report scales. The daily routine scales demonstrated discriminative validity, except between 2 motor groups of children with paraplegia. Concurrent validity of the daily routine scales was demonstrated through significant relationships with the FIM scores.
Child- and parent-reported outcomes of daily routines can be obtained using CATs with the same relative precision of a full-item bank. Five-item, 10-item, and 15-item CATs have discriminative and concurrent validity.
computer adaptive test; concurrent validity; daily routines; discriminative validity; pediatric spinal cord injury
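The agreement statistic used above, the intraclass correlation coefficient, can be sketched for the two-column case (a simulated short-form CAT score against the full-item-bank score). This is a generic Shrout & Fleiss ICC(2,1) implementation under illustrative simulated data, not the study's actual analysis; the function name and noise level are assumptions.

```python
import numpy as np

def icc_2_1(scores):
    """Two-way random-effects, absolute-agreement ICC (Shrout & Fleiss ICC(2,1)).

    `scores` is an (n subjects x k measures) array, e.g. column 0 a
    simulated 10-item CAT score and column 1 the full-bank score.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-measures
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative data: full-bank scores vs a noisier short-form approximation
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 200)            # 'true' full-bank scores
cat = theta + rng.normal(0.0, 0.2, 200)      # short CAT adds ~0.2 SD of error
icc = icc_2_1(np.column_stack([cat, theta]))
```

Unlike a Pearson correlation, this absolute-agreement form penalizes systematic offsets between the short form and the full bank, which is why it is the usual choice for claiming that a 5-, 10-, or 15-item CAT "represents" the full bank.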
This study applied Item Response Theory (IRT) and Computer Adaptive Test (CAT) methodologies to develop a prototype function and disability assessment instrument for use in aging research. Herein, we report on the development of the CAT version of the Late-Life Function & Disability instrument (Late-Life FDI) and evaluate its psychometric properties.
We employed confirmatory factor analysis, IRT methods, validation, and computer simulation analyses of data collected from 671 older adults residing in residential care facilities. We compared accuracy, precision, and sensitivity to change of scores from CAT versions of two Late-Life FDI scales with scores from the fixed-form instrument. Score estimates from the prototype CAT versus the original instrument were compared in a sample of 40 older adults.
Distinct function and disability domains were identified within the Late-Life FDI item bank and used to construct two prototype CAT scales. Using retrospective data, scores from computer simulations of the prototype CAT scales were highly correlated with scores from the original instrument. In the computer simulations, the accuracy, precision, and sensitivity to change of the CATs closely approximated those of the fixed-form scales, especially for the 10- and 15-item CAT versions. In the prospective study, each CAT was administered in less than 3 minutes, and CAT scores were highly correlated with scores generated from the original instrument.
CAT scores of the Late-Life FDI were highly comparable to those obtained from the full-length instrument with a small loss in accuracy, precision, and sensitivity to change.
outcome assessment (Health Care); geriatrics; rehabilitation
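The CAT simulations these abstracts rely on share a common skeleton: administer the unasked item with maximum Fisher information at the current trait estimate, score the responses, and repeat until a fixed length is reached. A minimal sketch under a two-parameter logistic (2PL) model with expected-a-posteriori (EAP) scoring is shown below; the item bank, test length, and all parameter values are illustrative assumptions, not any study's actual bank.

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL probability of endorsing an item."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap(responses, a, b, grid=np.linspace(-4, 4, 81)):
    """Expected-a-posteriori theta estimate with a standard-normal prior."""
    prior = np.exp(-0.5 * grid ** 2)
    like = np.ones_like(grid)
    for r, ai, bi in zip(responses, a, b):
        p = p2pl(grid, ai, bi)
        like *= p if r else (1.0 - p)
    post = prior * like
    post /= post.sum()
    return float((grid * post).sum())

def simulate_cat(true_theta, a, b, n_items, rng):
    """Administer n_items, each chosen by maximum information at the current estimate."""
    remaining = list(range(len(a)))
    responses, used, theta_hat = [], [], 0.0
    for _ in range(n_items):
        info = [a[j] ** 2 * p2pl(theta_hat, a[j], b[j]) * (1 - p2pl(theta_hat, a[j], b[j]))
                for j in remaining]
        j = remaining.pop(int(np.argmax(info)))
        r = rng.random() < p2pl(true_theta, a[j], b[j])   # simulated response
        used.append(j)
        responses.append(r)
        theta_hat = eap(responses, [a[k] for k in used], [b[k] for k in used])
    return theta_hat

# Illustrative 40-item bank, 100 simulees, 10-item CAT
rng = np.random.default_rng(2)
a = rng.uniform(0.8, 2.0, 40)
b = rng.normal(0.0, 1.0, 40)
thetas = rng.normal(0.0, 1.0, 100)
estimates = np.array([simulate_cat(t, a, b, 10, rng) for t in thetas])
corr = np.corrcoef(thetas, estimates)[0, 1]
```

Correlating `estimates` with the generating trait values (or, in retrospective designs, with full-bank scores) yields exactly the kind of accuracy coefficient these abstracts report for their 5-, 10-, and 15-item CATs.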
Health Related Quality of Life (HRQoL) is a relevant variable in the evaluation of health outcomes. Questionnaires based on Classical Test Theory typically require a large number of items to evaluate HRQoL. Computer Adaptive Testing (CAT) can be used to reduce test length while maintaining and, in some cases, improving accuracy. This study aimed at validating a CAT based on Item Response Theory (IRT) for evaluation of generic HRQoL: the CAT-Health instrument.
Cross-sectional study of subjects aged over 18 attending Primary Care Centres for any reason. CAT-Health was administered along with the SF-12 Health Survey. Age, gender and a checklist of chronic conditions were also collected. CAT-Health was evaluated considering: 1) feasibility: completion time and test length; 2) content range coverage, Item Exposure Rate (IER) and test precision; and 3) construct validity: differences in the CAT-Health scores according to clinical variables and correlations between both questionnaires.
396 subjects answered CAT-Health and the SF-12; 67.2% were female, with a mean (SD) age of 48.6 (17.7) years, and 36.9% did not report any chronic condition. Median completion time for CAT-Health was 81 seconds (IQR = 59-118) and increased with age (p < 0.001). The median number of items administered was 8 (IQR = 6-10). Neither ceiling nor floor effects were found for the score. None of the items in the pool had an IER of 100%, and the IER exceeded 5% for 27.1% of the items. The Test Information Function (TIF) peaked between levels -1 and 0 of HRQoL. Statistically significant differences were observed in the CAT-Health scores according to the number and type of conditions.
Although domain-specific CATs exist for various areas of HRQoL, CAT-Health is one of the first IRT-based CATs designed to evaluate generic HRQoL and it has proven feasible, valid and efficient, when administered to a broad sample of individuals attending primary care settings.
The Patient-Reported Outcomes Measurement Information System (PROMIS) was developed as one of the first projects funded by the NIH Roadmap for Medical Research Initiative to re-engineer the clinical research enterprise. The primary goal of PROMIS is to build item banks and short forms measuring key health outcome domains, manifested across a variety of chronic diseases, that could be used as a "common currency" across research projects. To date, item banks, short forms, and computerized adaptive tests (CAT) have been developed for 13 domains with relevance to pediatric and adult subjects. To enable easy delivery of these new instruments, PROMIS built a web-based resource (Assessment Center) for administering CATs and other self-report data, tracking item and instrument development, monitoring accrual, managing data, and storing statistical analysis results. Assessment Center can also be used to deliver custom researcher-developed content, and has numerous features that support both simple and complicated accrual designs (branching, multiple arms, multiple time points, etc.). This paper provides an overview of the development of the PROMIS item banks and details Assessment Center functionality.