The Brief Symptom Inventory (BSI), Mood and Anxiety Symptom Questionnaire-30 (MASQ-D30), Short Form Health Survey 36 (SF-36), and Dimensional Assessment of Personality Pathology-Short Form (DAPP-SF) are generic instruments that can be used in Routine Outcome Monitoring (ROM) of patients with common mental disorders. We aimed to generate reference values usually encountered in ‘healthy’ and ‘psychiatrically ill’ populations to facilitate correct interpretation of ROM results.
We included the following specific reference populations: 1294 subjects from the general population (ROM reference group) recruited through general practitioners, and 5269 psychiatric outpatients diagnosed with mood, anxiety, or somatoform (MAS) disorders (ROM patient group). The outermost 5% of observations were used to define limits for one-sided reference intervals (95th percentiles for the BSI, MASQ-D30, and DAPP-SF, and 5th percentiles for the SF-36 subscales). Internal consistency and Receiver Operating Characteristic (ROC) analyses were performed.
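The one-sided reference limits described here reduce to simple percentile computations on the healthy reference sample. A minimal sketch in Python (the score distributions below are simulated placeholders, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 'healthy' reference sample (placeholder distributions, not study data)
bsi_healthy = rng.gamma(shape=2.0, scale=0.4, size=1294)    # symptom scale: higher = worse
sf36_healthy = rng.normal(loc=75.0, scale=15.0, size=1294)  # health scale: higher = better

# Symptom scales (BSI, MASQ-D30, DAPP-SF): the upper 5% of healthy scores
# defines the limit, i.e. the 95th percentile
bsi_limit = np.percentile(bsi_healthy, 95)

# SF-36 subscales: the lower 5% defines the limit, i.e. the 5th percentile
sf36_limit = np.percentile(sf36_healthy, 5)
```

A patient scoring above `bsi_limit` (or below `sf36_limit`) falls outside the range covered by 95% of the healthy reference group.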
Mean age was 40.3 years (SD=12.6) for the ROM reference group and 37.7 years (SD=12.0) for the ROM patient group. The proportion of females was 62.8% and 64.6%, respectively. Averaged across subscales, the cut-off values for healthy individuals were 0.82 for the BSI, 23 for the three MASQ-D30 subscales, 45 for the SF-36, and 3.1 for the DAPP-SF. Discriminative power of the BSI, MASQ-D30 and SF-36 was good, but it was poor for the DAPP-SF. For all instruments, the internal consistency of the subscales ranged from adequate to excellent.
Discussion and conclusion
Reference values for the clinical interpretation were provided for the BSI, MASQ-D30, SF-36, and DAPP-SF. Clinical information aided by ROM data may represent the best means to appraise the clinical state of psychiatric outpatients.
Reference values; Routine outcome monitoring; Questionnaires; Mood disorders; Anxiety disorders; Somatoform disorders
The tripartite model categorizes symptoms of depression and anxiety into three groups: 1) non-specific general distress that is shared between depression and anxiety, 2) depression-specific symptoms that include low positive affect and loss of interest, and 3) anxiety-specific symptoms that include somatic arousal. The Mood and Anxiety Symptoms Questionnaire (MASQ) was developed to measure these three factors of depression and anxiety. The purpose of the present study was to test the psychometric properties of the Korean version of the MASQ (K-MASQ) in adolescents.
Community-dwelling adolescents (n=933) were randomly assigned to two groups. Exploratory factor analysis and confirmatory factor analysis were conducted in each group to identify the factor structure of the K-MASQ. The reliability and validity of the K-MASQ were also evaluated.
Our results support the three-factor structure of the K-MASQ in adolescents. However, we found that the specific items of each factor differed from those of the original MASQ. That is, the depression-specific factor was only related to low positive affect and not loss of interest, and the anxiety-specific factor included more items related to general somatic symptoms of anxiety. The reliability and validity of the K-MASQ were found to be satisfactory.
The K-MASQ supports the tripartite model of depression and anxiety and has satisfactory reliability and validity among Korean adolescents. The K-MASQ can be used to distinguish unique symptoms of depression and anxiety in Korean adolescents.
Anxiety; Depression; Assessment; Adolescent; Mood and Anxiety Symptom Questionnaire; Korea
The present study examined the utility of the anhedonic depression scale from the Mood and Anxiety Symptoms Questionnaire (MASQ-AD) as a way to screen for depressive disorders. Using receiver operating characteristic analysis, the sensitivity and specificity of the full 22-item MASQ-AD scale, as well as the 8- and 14-item subscales, were examined in relation to both current and lifetime DSM-IV depressive disorder diagnoses in two nonpatient samples. As a means of comparison, the sensitivity and specificity of a measure of a relevant personality dimension, neuroticism, were also examined. Results from both samples support the clinical utility of the MASQ-AD scale as a means of screening for depressive disorders. Findings were strongest for the MASQ-AD 8-item subscale and when predicting current depression status. Furthermore, the MASQ-AD 8-item subscale outperformed the neuroticism measure under certain conditions. The overall usefulness of the MASQ-AD scale as a screening device is discussed, as well as possible cutoff scores for use in research.
depressive disorders; anhedonic depression; Mood and Anxiety Symptoms Questionnaire; receiver operating characteristic analysis; screening
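Cut-off selection of the kind explored in the study above is typically done by scanning candidate cutoffs and trading sensitivity against specificity, for example via Youden's J. A hedged sketch (the score distributions and the score range are illustrative assumptions, not the study's data):

```python
import numpy as np

def sens_spec(scores, is_case, cutoff):
    """Sensitivity/specificity of the rule 'score >= cutoff indicates a case'."""
    scores = np.asarray(scores)
    is_case = np.asarray(is_case, dtype=bool)
    flagged = scores >= cutoff
    sens = (flagged & is_case).sum() / is_case.sum()
    spec = (~flagged & ~is_case).sum() / (~is_case).sum()
    return sens, spec

# Hypothetical MASQ-AD-8 total scores (range 8-40) for cases and non-cases
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(28, 5, 100), rng.normal(18, 5, 300)])
labels = np.array([True] * 100 + [False] * 300)

# Scan cutoffs and pick the one maximising Youden's J = sens + spec - 1
cutoffs = np.arange(scores.min(), scores.max())
j = [sum(sens_spec(scores, labels, c)) - 1 for c in cutoffs]
best = float(cutoffs[int(np.argmax(j))])
```

In practice a screening instrument might deliberately favour sensitivity over specificity rather than maximise J, so the chosen cutoff depends on the intended use.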
The overlap between Depression and Anxiety has led some researchers to conclude that they are manifestations of a broad, non-specific neurotic disorder. However, others believe that they can be distinguished despite sharing symptoms of general distress. The Tripartite Model of Affect proposes an anxiety-specific, a depression-specific and a shared symptoms factor. Watson and Clark developed the Mood and Anxiety Symptom Questionnaire (MASQ) to specifically measure these Tripartite constructs. Early research showed that the MASQ distinguished between dimensions of Depression and Anxiety in non-clinical samples. However, two recent studies have cautioned that the MASQ may show limited validity in clinical populations. The present study investigated the clinical utility of the MASQ in a clinical sample of adolescents and young adults.
A total of 204 young people consecutively referred to a specialist public mental health service in Melbourne, Australia, were approached, and 150 consented to participate. Of these, 136 participants completed both a diagnostic interview and the MASQ.
The majority of the sample rated for an Axis-I disorder, with Mood and Anxiety disorders most prevalent. The disorder-specific scales of the MASQ significantly discriminated Anxiety (61.0%) and Mood Disorders (72.8%); however, the predictive accuracy for the presence of Anxiety Disorders was very low (29.8%). From ROC analyses, a cut-off of 76 was proposed for the depression scale to indicate ‘caseness’ for Mood Disorders. The resulting sensitivity/specificity was superior to that of the CES-D.
It was concluded that the depression-specific scale of the MASQ showed good clinical utility, but that the anxiety-specific scale showed poor discriminant validity.
Questionnaires used by health services to identify children with psychosocial problems are often rather short. The psychometric properties of such short questionnaires mostly fall short of what is needed to distinguish accurately between children with and without problems. We aimed to assess whether a short Computerized Adaptive Test (CAT) can overcome the weaknesses of short written questionnaires when identifying children with psychosocial problems.
We used a Dutch national data set obtained from parents of children invited for a routine health examination by Preventive Child Healthcare with 205 items on behavioral and emotional problems (n = 2,041, response 84%). In a random subsample we determined which items met the requirements of an Item Response Theory (IRT) model to a sufficient degree. Using those items, item parameters necessary for a CAT were calculated and a cut-off point was defined. In the remaining subsample we determined the validity and efficiency of a Computerized Adaptive Test using simulation techniques, with current treatment status and a clinical score on the Total Problem Scale (TPS) of the Child Behavior Checklist as criteria.
Of the 205 items available, 190 sufficiently met the criteria of the underlying IRT model. For 90% of the children, a score above or below the cut-off point could be determined with 95% accuracy. The mean number of items needed to achieve this was 12. Sensitivity and specificity with the TPS as a criterion were 0.89 and 0.91, respectively.
An IRT-based CAT is a very promising option for the identification of psychosocial problems in children, as it can lead to an efficient, yet high-quality identification. The results of our simulation study need to be replicated in a real-life administration of this CAT.
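A classification CAT of the kind described above does not need a precise score for every child; it only needs the confidence interval around the current trait estimate to clear the cut-off. A minimal sketch of such a stopping rule (the z-value and function shape are illustrative assumptions, not the study's algorithm):

```python
def classify(theta_hat, se, cutoff, z=1.645):
    """Decide 'above'/'below' only once the one-sided 95% bound clears the cut-off."""
    if theta_hat - z * se > cutoff:
        return "above"
    if theta_hat + z * se < cutoff:
        return "below"
    return "undecided"  # keep administering items until decided or the bank is exhausted
```

Administration stops as soon as `classify` returns a decision, which is consistent with the small mean number of items (12 of 190) reported above: respondents far from the cut-off are classified after very few items.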
Computerized adaptive tests (CAT) provide an alternative to fixed-length assessments for diagnostic screening and severity measurement of psychiatric disorders. We sought to cross-sectionally validate a suite of computerized adaptive tests for mental health (CAT-MH) in a community psychiatric sample.
145 adult psychiatric outpatients and controls were prospectively evaluated with CAT for depression, mania and anxiety symptoms, compared to gold-standard psychiatric assessments including: Structured Clinical Interview for DSM IV-TR (SCID), Hamilton Rating Scale for Depression (HAM-D25), Patient Health Questionnaire (PHQ-9), Center for Epidemiologic Studies Depression Scale (CES-D), and Global Assessment of Functioning (GAF).
Sensitivity and specificity for the computerized adaptive diagnostic test for depression (CAD-MDD) were .96 and .64, respectively (.96 and 1.00 for major depression versus controls). CAT for depression severity (CAT-DI) correlated well with the standard depression scales HAM-D25 (r=.79), PHQ-9 (r=.90), and CES-D (r=.90), and had OR=27.88 for a current SCID major depressive disorder diagnosis across its range. CAT for anxiety severity (CAT-ANX) correlated with HAM-D25 (r=.73), PHQ-9 (r=.78), and CES-D (r=.81), and had OR=11.52 for a current SCID generalized anxiety disorder diagnosis across its range. CAT for mania severity (CAT-MANIA) did not correlate well with HAM-D25 (r=.31), PHQ-9 (r=.37), or CES-D (r=.39), but had OR=11.56 for a current SCID bipolar diagnosis across its range. Participants found the CAT-MH suite of tests acceptable and easy to use, averaging 51.7 items and 9.4 minutes to complete the full battery.
Compared to current gold-standard diagnostic and assessment measures, CAT-MH provides an effective, rapidly-administered assessment of psychiatric symptoms.
We report on the selection of self-report measures for inclusion in the NIH Toolbox that are suitable for assessing the full range of negative affect including sadness, fear, and anger. The Toolbox is intended to serve as a “core battery” of assessment tools for cognition, sensation, motor function, and emotional health that will help to overcome the lack of consistency in measures used across epidemiological, observational, and intervention studies. A secondary goal of the NIH Toolbox is the identification of measures that are flexible, efficient, and precise, an agenda best fulfilled by the use of item banks calibrated with models from item response theory (IRT) and suitable for adaptive testing. Results from a sample of 1,763 respondents supported use of the adult and pediatric item banks for emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS®) as a starting point for capturing the full range of negative affect in healthy individuals. Content coverage for the adult Toolbox was also enhanced by the development of a scale for somatic arousal using items from the Mood and Anxiety Symptom Questionnaire (MASQ) and scales for hostility and physical aggression using items from the Buss-Perry Aggression Questionnaire (BPAQ).
sadness; fear; anger; item response theory; measurement
The Mood and Anxiety Symptom Questionnaire (MASQ) was designed to specifically measure the Tripartite model of affect and is proposed to offer a delineation between the core components of anxiety and depression. Factor analytic data from adult clinical samples have shown mixed results; however, no studies employing confirmatory factor analysis (CFA) have supported the predicted structure of distinct Depression, Anxiety and General Distress factors. The Tripartite model has not been validated in a clinical sample of older adolescents and young adults. The aim of the present study was to examine the validity of the Tripartite model using scale-level data from the MASQ and correlational and confirmatory factor analysis techniques.
137 young people (M = 17.78 years, SD = 2.63) referred to a specialist mental health service for adolescents and young adults completed the MASQ and a diagnostic interview.
All MASQ scales were highly inter-correlated, with the lowest correlation between the depression- and anxiety-specific scales (r = .59). This pattern of correlations was observed for all participants rating for an Axis-I disorder but not for participants without a current disorder (r = .18). Confirmatory factor analyses were conducted to evaluate the model fit of a number of solutions. The predicted Tripartite structure was not supported. A 2-factor model demonstrated superior model fit and parsimony compared to 1- or 3-factor models. These broad factors represented Depression and Anxiety and were highly correlated (r = .88).
The present data lend support to the notion that the Tripartite model does not adequately explain the relationship between anxiety and depression in all clinical populations. Indeed, in the present study this model was found to be inappropriate for a help-seeking community sample of older adolescents and young adults.
This study investigated the combination of item response theory and computerized adaptive testing (CAT) for psychiatric measurement as a means of reducing the burden of research and clinical assessments.
Data were from 800 participants in outpatient treatment for a mood or anxiety disorder; they completed 616 items of the 626-item Mood and Anxiety Spectrum Scales (MASS) at two times. The first administration was used to design and evaluate a CAT version of the MASS by using post hoc simulation. The second confirmed the functioning of CAT in live testing.
Tests of competing models based on item response theory supported the scale’s bifactor structure, consisting of a primary dimension and four group factors (mood, panic-agoraphobia, obsessive-compulsive, and social phobia). Both simulated and live CAT showed a 95% average reduction (585 items) in items administered (24 and 30 items, respectively) compared with administration of the full MASS. The correlation between scores on the full MASS and the CAT version was .93. For the mood disorder subscale, differences in scores between two groups of depressed patients—one with bipolar disorder and one without—on the full scale and on the CAT showed effect sizes of .63 (p<.003) and 1.19 (p<.001) standard deviation units, respectively, indicating better discriminant validity for CAT.
Instead of using small fixed-length tests, clinicians can create item banks with a large item pool, and a small set of the items most relevant for a given individual can be administered with no loss of information, yielding a dramatic reduction in administration time and patient and clinician burden.
Unlike other areas of medicine, psychiatry is almost entirely dependent on patient report to assess the presence and severity of disease; therefore, it is particularly crucial that we find both more accurate and efficient means of obtaining that report.
To develop a computerized adaptive test (CAT) for depression, called the Computerized Adaptive Test–Depression Inventory (CAT-DI), that decreases patient and clinician burden and increases measurement precision.
Setting
A psychiatric clinic and community mental health center.
A total of 1614 individuals with and without minor and major depression were recruited for study.
Main Outcome Measures
The focus of this study was the development of the CAT-DI. The 24-item Hamilton Rating Scale for Depression, Patient Health Questionnaire 9, and the Center for Epidemiologic Studies Depression Scale were used to study the convergent validity of the new measure, and the Structured Clinical Interview for DSM-IV was used to obtain diagnostic classifications of minor and major depressive disorder.
A mean of 12 items per study participant was required to achieve a 0.3 SE in the depression severity estimate and maintain a correlation of r=0.95 with the total 389-item test score. Using empirically derived thresholds based on a mixture of normal distributions, we found a sensitivity of 0.92 and a specificity of 0.88 for the classification of major depressive disorder in a sample consisting of depressed patients and healthy controls. Correlations on the order of r=0.8 were found with the other clinician and self-rating scale scores. The CAT-DI provided excellent discrimination throughout the entire depressive severity continuum (minor and major depression), whereas the traditional scales did so primarily at the extremes (eg, major depression).
Traditional measurement fixes the number of items administered and allows measurement uncertainty to vary. In contrast, a CAT fixes measurement uncertainty and allows the number of items to vary. The result is a significant reduction in the number of items needed to measure depression and increased precision of measurement.
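The fixed-uncertainty logic can be sketched end to end: select the most informative item, update a posterior over the latent trait, and stop once the posterior standard error falls below a target. A simplified simulation under a two-parameter logistic (2PL) model (the item parameters, grid, and stopping values are illustrative assumptions, not the CAT-DI's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2PL item bank: discriminations a, difficulties b
n_items = 200
a = rng.uniform(1.0, 2.5, n_items)
b = rng.normal(0.0, 1.2, n_items)

grid = np.linspace(-4, 4, 121)       # latent-trait grid
prior = np.exp(-0.5 * grid**2)       # standard-normal prior (unnormalized)

def p_endorse(theta, i):
    """2PL probability of endorsing item i at trait level theta."""
    return 1.0 / (1.0 + np.exp(-a[i] * (theta - b[i])))

def eap(post):
    """Expected a posteriori estimate and posterior SD on the grid."""
    w = post / post.sum()
    theta_hat = (grid * w).sum()
    se = np.sqrt(((grid - theta_hat) ** 2 * w).sum())
    return theta_hat, se

def simulate_cat(true_theta, se_target=0.3, max_items=50):
    post = prior.copy()
    used = []
    while True:
        theta_hat, se = eap(post)
        if (se < se_target and used) or len(used) >= max_items:
            return theta_hat, se, len(used)
        # Pick the unused item with maximum Fisher information at theta_hat
        p = p_endorse(theta_hat, np.arange(n_items))
        info = a**2 * p * (1 - p)
        info[used] = -np.inf
        i = int(np.argmax(info))
        used.append(i)
        # Simulate a response and update the posterior
        endorsed = rng.random() < p_endorse(true_theta, i)
        post *= p_endorse(grid, i) if endorsed else 1 - p_endorse(grid, i)

theta_hat, se, n_used = simulate_cat(true_theta=1.0)
```

Because the loop terminates on the standard-error criterion rather than on an item count, respondents at different severity levels receive different numbers of items, but all are measured to roughly the same precision.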
The objectives of the present study are to investigate the precision of static (fixed-length) short forms versus computerized adaptive testing (CAT) administration, response pattern scoring versus summed score conversion, and test-retest reliability (stability) of the Patient Reported Outcomes Measurement Information System (PROMIS®) pediatric self-report scales measuring the latent constructs of depressive symptoms, anxiety, anger, pain interference, peer relationships, fatigue, mobility, upper extremity functioning and asthma impact with polytomous items.
Participants (N = 331) between the ages of 8 and 17 were recruited from outpatient general pediatrics and subspecialty clinics. Of the 331 participants, 137 were diagnosed with asthma. Three scores based on item response theory (IRT) were computed for each respondent: CAT response-pattern expected a posteriori (EAP) estimates, short-form response-pattern EAP estimates, and short-form summed-score EAP estimates. Scores were also compared between participants with and without asthma. To examine test-retest reliability, 54 children were selected for retesting approximately two weeks after the first assessment.
A short CAT (maximum 12 items with a standard error of 0.4) was found, on average, to be less precise than the static short forms. The CAT appears to have limited usefulness over and above what can be accomplished with existing static short forms (8–10 items). Stability of the scale scores over a two-week period was generally supported.
The study provides further information on the psychometric properties of the PROMIS pediatric scales and extends the previous IRT analyses to include precision estimates of dynamic versus static administration, test-retest reliability, and validity of administration across groups. Both the positive and negative aspects of using CAT vs. short forms are highlighted.
PROMIS; pediatrics; self-report; patient reported outcomes; item response theory; computerized adaptive testing
Short-form patient-reported outcome measures are popular because they minimize patient burden. We assessed the efficiency of static short forms and computer adaptive testing (CAT) using data from the Patient-Reported Outcomes Measurement Information System (PROMIS) project.
We evaluated the 28-item PROMIS depressive symptoms bank. We used post hoc simulations based on the PROMIS calibration sample to compare several short-form selection strategies and the PROMIS CAT to the total item bank score.
Compared with full-bank scores, all short forms and CAT produced highly correlated scores, but CAT outperformed each static short form in almost all criteria. However, short-form selection strategies performed only marginally worse than CAT. The performance gap observed in static forms was reduced by using a two-stage branching test format.
Using several polytomous items in a calibrated unidimensional bank to measure depressive symptoms yielded a CAT that provided marginally superior efficiency compared to static short forms. The efficiency of a two-stage semi-adaptive testing strategy was so close to CAT that it warrants further consideration and study.
Computer adaptive testing; PROMIS; Item response theory; Short form; Two-stage testing
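The two-stage branching format found competitive in the study above can be sketched as a routing block whose summed score selects the second block. A minimal illustration (the item ids, the 0-4 response range, and the branching threshold are hypothetical):

```python
# Hypothetical two-stage branching format for a depressive-symptoms bank:
# everyone answers a short routing block, then a second block is chosen
# from the routing summed score (mild vs. severe symptom levels).

routing_items = ["R1", "R2", "R3", "R4"]   # stage-1 routing block
second_stage = {
    "low":  ["L1", "L2", "L3", "L4"],      # items targeting milder symptoms
    "high": ["H1", "H2", "H3", "H4"],      # items targeting more severe symptoms
}

def administer(answer_item):
    """answer_item: callable mapping an item id to a 0-4 polytomous response."""
    responses = {item: answer_item(item) for item in routing_items}
    branch = "high" if sum(responses.values()) >= 8 else "low"
    responses.update({item: answer_item(item) for item in second_stage[branch]})
    return branch, responses

# Simulated respondent endorsing moderate-to-severe symptoms on every item
branch, resp = administer(lambda item: 3)
```

Unlike a fully adaptive CAT, this format adapts only once, which makes it easy to administer on paper or in simple software while recovering much of CAT's efficiency.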
The purpose of this research was to calibrate an item bank for a computerized adaptive test (CAT) of asthma impact on health-related quality of life (HRQOL), test CAT versions of varying lengths, conduct preliminary validity testing, and evaluate item bank readability.
Asthma Impact Survey (AIS) bank items that passed focus group, cognitive testing, and clinical and psychometric reviews were administered to adults with varied levels of asthma control. Adults self-reporting asthma (N=1106) completed an Internet survey including 88 AIS items, the Asthma Control Test (ACT), and other HRQOL outcome measures. Data were analyzed using classical and modern psychometric methods, real-data CAT simulations, and known groups validity testing.
A bi-factor model with a general factor (asthma impact) and several group factors (cognitive function, fatigue, mental health, physical function, role function, sexual function, self-consciousness/stigma, sleep, and social function) was tested. Loadings on the general factor were above 0.5 and substantially larger than the group factor loadings, and fit statistics were acceptable. Item functioning and fit to the model were acceptable for most items. CAT simulations demonstrated several options for administration and stopping rules. The AIS distinguished between respondents with differing levels of asthma control.
The new 50-item AIS item bank demonstrated favorable psychometric characteristics, preliminary evidence of validity, and accessibility at moderate reading levels. Developing item banks for CAT can improve the precise, efficient, and comprehensive monitoring of asthma outcomes, and may facilitate patient-centered care.
asthma control; Asthma Impact Survey; item response theory; patient-reported outcome; health-related quality of life
Multidimensional computerized adaptive testing enables precise measurements of patient-reported outcomes at an individual level across different dimensions. This study examined the construct validity of a multidimensional computerized adaptive test (CAT) for fatigue in rheumatoid arthritis (RA).
The ‘CAT Fatigue RA’ was constructed based on a previously calibrated item bank. It contains 196 items and three dimensions: ‘severity’, ‘impact’ and ‘variability’ of fatigue. The CAT was administered to 166 patients with RA. They also completed a traditional, multidimensional fatigue questionnaire (BRAF-MDQ) and the SF-36 in order to examine the CAT’s construct validity. A priori criterion for construct validity was that 75% of the correlations between the CAT dimensions and the subscales of the other questionnaires were as expected. Furthermore, comprehensive use of the item bank, measurement precision and score distribution were investigated.
The a priori criterion for construct validity was supported for two of the three CAT dimensions (severity and impact but not for variability). For severity and impact, 87% of the correlations with the subscales of the well-established questionnaires were as expected but for variability, 53% of the hypothesised relations were found. Eighty-nine percent of the items were selected between one and 137 times for CAT administrations. Measurement precision was excellent for the severity and impact dimensions, with more than 90% of the CAT administrations reaching a standard error below 0.32. The variability dimension showed good measurement precision with 90% of the CAT administrations reaching a standard error below 0.44. No floor- or ceiling-effects were found for the three dimensions.
The CAT Fatigue RA showed good construct validity and excellent measurement precision on the dimensions severity and impact. The dimension variability had less ideal measurement characteristics, pointing to the need to recalibrate the CAT item bank with a two-dimensional model, solely consisting of severity and impact.
To document the development and psychometric evaluation of the Patient-Reported Outcomes Measurement Information System (PROMIS) Physical Function (PF) item bank and static instruments.
Study Design and Setting
Items were evaluated using qualitative and quantitative methods. 16,065 adults answered item subsets (n>2,200/item) on the Internet, with over-sampling of the chronically ill. Classical test theory and item response theory (IRT) methods were used to evaluate 149 PROMIS PF items plus 10 SF-36 and 20 HAQ-DI items. A graded response model was used to estimate item parameters, which were normed to a mean of 50 (SD=10) in a US general population sample.
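The norming step described here is a linear rescaling of the IRT trait estimates so that a general-population reference sample has mean 50 and SD 10 (a T-score metric). A sketch with simulated estimates (the reference distribution is a placeholder, not the PROMIS calibration sample):

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder IRT trait estimates (theta) for a general-population norming sample
theta_ref = rng.normal(0.1, 0.95, size=5000)

def to_t_score(theta, ref=theta_ref):
    """Rescale theta so the reference sample has mean 50 and SD 10."""
    return 50.0 + 10.0 * (np.asarray(theta) - ref.mean()) / ref.std()

t_ref = to_t_score(theta_ref)
```

On this metric a score of 60 means one reference-sample standard deviation above the general-population mean, which makes scores comparable across instruments normed the same way.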
The final bank consists of 124 PROMIS items covering upper-extremity, central, and lower-extremity functions and instrumental activities of daily living (IADL). In simulations, a 10-item Computerized Adaptive Test (CAT) eliminated floor effects and decreased ceiling effects, achieving higher measurement precision than any comparable-length static tool across four standard deviations of the measurement range. These improved psychometric properties transferred to the CAT’s superior ability to identify differences between age and disease groups.
The item bank provides a common metric and can improve the measurement of PF by facilitating the standardization of PRO measures and implementation of CATs for more efficient PF assessments over a larger range.
Item Response Theory; Computer Adaptive Test; physical function; health status; questionnaire
The Patient-Reported Outcomes Measurement Information System (PROMIS®) is an NIH Roadmap initiative devoted to developing better measurement tools for assessing constructs relevant to the clinical investigation and treatment of all diseases—constructs such as pain, fatigue, emotional distress, sleep, physical functioning, and social participation. Following creation of item banks for these constructs, our priority has been to validate them, most often in short-term observational studies. We report here on a three-month prospective observational study with depressed outpatients in the early stages of a new treatment episode (with assessments at intake, one-month follow-up, and three-month follow-up). The protocol was designed to compare the psychometric properties of the PROMIS depression item bank (administered as a computerized adaptive test, CAT) with two legacy self-report instruments: the Center for Epidemiological Studies Depression scale (CESD; Radloff, 1977) and the Patient Health Questionnaire (PHQ-9; Spitzer et al., 1999). PROMIS depression demonstrated strong convergent validity with the CESD and the PHQ-9 (with correlations in a range from .72 to .84 across all time points), as well as responsiveness to change when characterizing symptom severity in a clinical outpatient sample. Identification of patients as “recovered” varied across the measures, with the PHQ-9 being the most conservative. The use of calibrations based on models from item response theory (IRT) provides advantages for PROMIS depression both psychometrically (creating the possibility of adaptive testing, providing a broader effective range of measurement, and generating greater precision) and practically (these psychometric advantages can be achieved with fewer items—a median of 4 items administered by CAT—resulting in less patient burden).
depression; item response theory; measurement; self-report; patient-reported outcomes
Health Related Quality of Life (HRQoL) is a relevant variable in the evaluation of health outcomes. Questionnaires based on Classical Test Theory typically require a large number of items to evaluate HRQoL. Computer Adaptive Testing (CAT) can be used to reduce test length while maintaining and, in some cases, improving accuracy. This study aimed to validate a CAT based on Item Response Theory (IRT) for the evaluation of generic HRQoL: the CAT-Health instrument.
Cross-sectional study of subjects aged over 18 attending Primary Care Centres for any reason. CAT-Health was administered along with the SF-12 Health Survey. Age, gender and a checklist of chronic conditions were also collected. CAT-Health was evaluated considering: 1) feasibility: completion time and test length; 2) content range coverage, Item Exposure Rate (IER) and test precision; and 3) construct validity: differences in the CAT-Health scores according to clinical variables and correlations between both questionnaires.
396 subjects answered CAT-Health and the SF-12; 67.2% were female, with a mean age (SD) of 48.6 (17.7) years, and 36.9% did not report any chronic condition. Median completion time for CAT-Health was 81 seconds (interquartile range 59-118) and increased with age (p < 0.001). The median number of items administered was 8 (interquartile range 6-10). Neither ceiling nor floor effects were found for the score. None of the items in the pool had an IER of 100%, and it was over 5% for 27.1% of the items. The Test Information Function (TIF) peaked between levels -1 and 0 of HRQoL. Statistically significant differences were observed in the CAT-Health scores according to the number and type of conditions.
Although domain-specific CATs exist for various areas of HRQoL, CAT-Health is one of the first IRT-based CATs designed to evaluate generic HRQoL and it has proven feasible, valid and efficient, when administered to a broad sample of individuals attending primary care settings.
To build an item response theory-based computer-adaptive balance test (CAT) from three traditional, fixed-form balance measures: the Berg Balance Scale (BBS), the Performance-Oriented Mobility Assessment (POMA), and the Dynamic Gait Index (DGI); and to examine whether the CAT’s psychometric performance exceeded that of the individual measures.
Secondary analysis combining two existing datasets.
187 community-dwelling older adults, 65 years or older, mean age 75.2±6.8 years, 69% female.
Main Outcome Measure(s)
BBS, POMA, and DGI items were compiled into an initial 38-item bank. Rasch Partial Credit Model was used for final item bank calibration. CAT simulations were conducted to identify the ideal CAT. CAT score accuracy, reliability, floor and ceiling effects, and validity were examined. Floor and ceiling effects and validity of CAT and individual measures were compared.
A 23-item bank met model expectations. A 10-item CAT was selected, showing very strong association with full item bank scores (r=0.97), and good overall reliability (0.78). Reliability was better in low- to mid-balance ranges due to better item targeting to balance ability, compared with highest balance ranges. No floor effect was noted. CAT ceiling effect (11.2%) was significantly lower than POMA (40.1%) and DGI (40.3%) ceiling effects (p<0.0001 per comparison). The CAT outperformed individual measures, being the only test to discriminate between fallers and non-fallers (p=0.0068), and strongest predictor of self-reported function.
The balance CAT showed excellent accuracy, good overall reliability, and excellent validity compared with the individual measures, being the only measure to discriminate between fallers and non-fallers. Prospective examination, particularly in low-functioning elderly and clinical populations with balance deficits, is recommended. Development of an improved CAT based on an expanded item bank containing higher-difficulty items is also recommended.
computer-adaptive testing; postural balance; aged
Many hospitals have adopted mobile nursing carts that can be easily rolled up to a patient’s bedside to access charts and help nurses perform their rounds. However, few papers have reported data on the use of wireless computers on wheels (COW) at patients’ bedsides to collect questionnaire-based information about patients’ perceptions of hospitalization at discharge from the hospital.
The purpose of this study was to evaluate the relative efficiency of computerized adaptive testing (CAT) and the precision of CAT-based measures of perceptions of hospitalized patients, as compared with those of nonadaptive testing (NAT). An Excel module of our CAT multicategory assessment is provided as an example.
A total of 200 patients discharged from the hospital responded to the CAT-based 18-item inpatient perception questionnaire on the COW. The number of questions administered was recorded, and the responses were calibrated using the Rasch model. They were compared with those from NAT to show the advantage of CAT over NAT.
Patient measures derived from CAT and NAT were highly correlated (r = 0.98) and their measurement precisions were not statistically different (P = .14). CAT required fewer questions than NAT (an efficiency gain of 42%), suggesting a reduced burden for patients. There were no significant differences between groups in terms of gender and other demographic characteristics.
CAT-based administration of patient perception surveys substantially reduced patient burden without compromising the precision of measuring patients’ perceptions of hospitalization. The animated CAT Excel module we developed for the wireless COW is recommended for use in hospitals.
Computerized adaptive testing; computer on wheels; classical test theory; IRT; item response theory; nonadaptive testing
Goldberg’s General Health Questionnaire (GHQ) items are frequently used to assess psychological distress but no study to date has investigated the GHQ-30’s potential for adaptive administration. In computerized adaptive testing (CAT) items are matched optimally to the targeted distress level of respondents instead of relying on fixed-length versions of instruments. We therefore calibrate GHQ-30 items and report a simulation study exploring the potential of this instrument for adaptive administration in a longitudinal setting.
GHQ-30 responses of 3445 participants with 2 completed assessments (baseline, 7-year follow-up) in the UK Health and Lifestyle Survey were calibrated using item response theory. Our simulation study evaluated the efficiency of CAT administration of the items, cross-sectionally and longitudinally, with different estimators, item selection methods, and measurement precision criteria.
To yield accurate distress measurements (marginal reliability of at least 0.90), nearly all GHQ-30 items need to be administered to most survey respondents in general population samples. When lower accuracy is permissible (marginal reliability of 0.80), adaptive administration saves approximately two-thirds of the items. For longitudinal applications, change scores based on the complete set of GHQ-30 items correlate highly with change scores from adaptive administrations.
The rationale for CAT-GHQ-30 is only supported when the required marginal reliability is lower than 0.9, which is most likely to be the case in cross-sectional and longitudinal studies assessing mean changes in populations. Precise measurement of psychological distress at the individual level can be achieved, but requires the deployment of all 30 items.
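The link between the marginal-reliability criterion and test length follows from the standard relationship between reliability and measurement error, rel ≈ 1 − SE²/Var(θ). A minimal sketch of the SE threshold implied by each reliability target, assuming a latent-trait variance of 1 (a common convention; the abstract does not state this explicitly):

```python
import math

def se_for_marginal_reliability(rel, theta_var=1.0):
    """SE(theta) threshold implied by a marginal reliability target,
    using rel = 1 - SE^2 / Var(theta)."""
    return math.sqrt((1.0 - rel) * theta_var)

se_090 = se_for_marginal_reliability(0.90)  # stricter criterion
se_080 = se_for_marginal_reliability(0.80)  # looser criterion
```

Relaxing the target from 0.90 to 0.80 raises the tolerable SE from roughly 0.32 to 0.45, which is why the adaptive administration can stop so much earlier at the lower accuracy level.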
Computerized adaptive testing; Item response theory; Bifactor model; Measurement invariance; General Health Questionnaire
The authors developed a computerized adaptive test for anxiety that decreases patient and clinician burden and increases measurement precision.
A total of 1,614 individuals with and without generalized anxiety disorder from a psychiatric clinic and community mental health center were recruited. The focus of the present study was the development of the Computerized Adaptive Testing–Anxiety Inventory (CAT-ANX). The Structured Clinical Interview for DSM-IV was used to obtain diagnostic classifications of generalized anxiety disorder and major depressive disorder.
An average of 12 items per subject was required to achieve a 0.3 standard error in the anxiety severity estimate and maintain a correlation of 0.94 with the total 431-item test score. CAT-ANX scores were strongly related to the probability of a generalized anxiety disorder diagnosis. Using both the Computerized Adaptive Testing–Depression Inventory and the CAT-ANX, comorbid major depressive disorder and generalized anxiety disorder can be accurately predicted.
Traditional measurement fixes the number of items but allows measurement uncertainty to vary. Computerized adaptive testing fixes measurement uncertainty and allows the number and content of items to vary, leading to a dramatic decrease in the number of items required for a fixed level of measurement uncertainty. Potential applications for inexpensive, efficient, and accurate screening of anxiety in primary care settings, clinical trials, psychiatric epidemiology, molecular genetics, children, and other cultures are discussed.
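The fixed-uncertainty mechanism described above can be sketched as a simple simulation: repeatedly administer the unused item with the greatest Fisher information at the current ability estimate, update the estimate, and stop once the standard error falls below the target (0.3, as in this study). This is an illustrative sketch only, assuming a 2PL model and EAP scoring with a standard normal prior; the actual CAT-ANX item bank and model are not reproduced here, so the item parameters below are synthetic.

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL probability of endorsing an item."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def info(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def eap(responses, grid=np.linspace(-4, 4, 161)):
    """EAP estimate of theta and its posterior SD (standard normal prior).
    responses: list of (a, b, x) tuples with x in {0, 1}."""
    post = np.exp(-grid ** 2 / 2)              # unnormalized N(0, 1) prior
    for a, b, x in responses:
        p = p2pl(grid, a, b)
        post = post * (p if x == 1 else 1.0 - p)
    post /= post.sum()
    theta = float((grid * post).sum())
    se = float(np.sqrt(((grid - theta) ** 2 * post).sum()))
    return theta, se

def run_cat(true_theta, a, b, se_target=0.3, seed=0):
    """Administer items by maximum information until SE < se_target."""
    rng = np.random.default_rng(seed)
    remaining = list(range(len(a)))
    responses, theta, se = [], 0.0, float("inf")
    while remaining and se >= se_target:
        # most informative unused item at the current estimate
        j = max(remaining, key=lambda i: info(theta, a[i], b[i]))
        remaining.remove(j)
        x = int(rng.random() < p2pl(true_theta, a[j], b[j]))
        responses.append((a[j], b[j], x))
        theta, se = eap(responses)
    return theta, se, len(responses)

# Synthetic 60-item bank: discriminations 1.5-2.5, difficulties over [-3, 3]
bank_rng = np.random.default_rng(42)
a = bank_rng.uniform(1.5, 2.5, 60)
b = np.linspace(-3, 3, 60)
theta_hat, se, n_items = run_cat(true_theta=1.0, a=a, b=b)
```

With a well-targeted bank, the loop typically terminates after a small fraction of the items, mirroring the efficiency gains reported for the CAT-ANX.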
This study evaluated psychometric properties of the Patient Health Questionnaire-9 (PHQ-9), the Center for Epidemiological Studies Depression Scale-10 (CESD-10), and the eight-item PROMIS Depression Short Form (PROMIS-D-8; 8b short form) in a sample of individuals living with multiple sclerosis (MS).
Data were collected by a self-reported mailed survey of a community sample of people living with MS (n=455). Factor structure, inter-item reliability, convergent/discriminant validity and assignment to categories of depression severity were examined.
A one-factor confirmatory factor analytic model had adequate fit for all instruments. Scores on the depression scales were more highly correlated with one another than with scores on measures of pain, sleep disturbance, and fatigue. The CESD-10 categorized about 37% of participants as having significant depressive symptoms. At least moderate depression was indicated for 24% of participants by the PHQ-9. The PROMIS-D-8 identified 19% of participants as having at least moderate depressive symptoms and about 7% as having at least moderately severe depression. None of the examined scales had ceiling effects, but the PROMIS-D-8 had a floor effect.
Overall, scores on all three scales demonstrated essential unidimensionality and had acceptable inter-item reliability and convergent/discriminant validity. Researchers and clinicians can choose any of these scales to measure depressive symptoms in individuals living with MS. The PHQ-9 offers validated cut-off scores for diagnosing clinical depression. The PROMIS-D-8 minimizes the impact of somatic features on the assessment of depression and allows for flexible administration, including Computerized Adaptive Testing (CAT). The CESD-10 measures two aspects of depression, depressed mood and lack of positive affect, while still providing an interpretable total score.
depression; multiple sclerosis; CESD-10; PHQ-9; PROMIS
To assess the feasibility and psychometric properties of eight scales covering two domains of the newly developed Work Disability Functional Assessment Battery (WD-FAB): physical function (PF) and behavioral health (BH) function.
Adults unable to work due to a physical (n=497) or mental (n=476) disability.
Each disability group responded to a survey consisting of the relevant WD-FAB scales and existing measures of established validity. The WD-FAB scales were evaluated with regard to data quality (score distribution; percent “I don’t know” responses), efficiency of administration (number of items required to achieve reliability criterion; time required to complete the scale) by computerized adaptive testing (CAT), and measurement accuracy as tested by person fit. Construct validity was assessed by examining both convergent and discriminant correlations between the WD-FAB scales and scores on same-domain and cross-domain established measures.
Data quality was good and CAT efficiency was high across both WD-FAB domains. Measurement accuracy was very good for the PF scales; BH scales demonstrated more variability. Construct validity correlations, both convergent and divergent, between all WD-FAB scales and established measures were in the expected direction and range of magnitude.
The data quality, CAT efficiency, person fit, and construct validity of the WD-FAB scales were well supported, suggesting that the WD-FAB could be used to assess physical and behavioral health function related to work disability. Variation in scale performance suggests the need for future work on item replenishment and refinement, particularly regarding the Self-Efficacy scale.
Validation Studies; Disability Evaluation; US Social Security Administration; Outcomes Assessment; Psychometrics
Quality of life (QoL) questionnaires are desirable for clinical practice but can be time-consuming to administer and interpret, making their widespread adoption difficult.
Our aim was to assess the performance of the World Health Organization Quality of Life (WHOQOL)-100 questionnaire as four item banks to facilitate adaptive testing using simulated computer adaptive tests (CATs) for physical, psychological, social, and environmental QoL.
We used data from the UK WHOQOL-100 questionnaire (N=320) to calibrate item banks using item response theory, including psychometric assessments of differential item functioning, local dependency, unidimensionality, and reliability. We simulated CATs to assess the number of items administered before prespecified levels of reliability were met.
The item banks (40 items) all displayed good model fit (P > .01) and were unidimensional (fewer than 5% of t tests significant), reliable (Person Separation Index > .70), and free from differential item functioning (no significant analysis of variance interaction) or local dependency (residual correlations < +.20). When matched for reliability, the item banks were between 45% and 75% shorter than the paper-based WHOQOL measures. Across the four domains, a high standard of reliability (alpha > .90) could be achieved with a median of 9 items.
Using CAT, simulated assessments were as reliable as paper-based forms of the WHOQOL with a fraction of the number of items. These properties suggest that these item banks are suitable for computerized adaptive assessment. These item banks have the potential for international development using existing alternative language versions of the WHOQOL items.
Patient-Reported Outcome Measures (PROMs) are important for evaluating mental health services. Yet, no specific PROM exists for the large and diverse mental health supported accommodation sector. We aimed to produce and validate a PROM specifically for supported accommodation services, by adapting the Client’s Assessment of Treatment Scale (CAT) and assessing its psychometric properties in a large sample.
Focus groups with service users in the three main types of mental health supported accommodation services in the United Kingdom (residential care, supported housing and floating outreach) were conducted to adapt the contents of the original CAT items and assess the acceptability of the modified scale (CAT-SA). The CAT-SA was then administered in a survey to service users across England. Internal consistency was assessed using Cronbach’s alpha. Convergent validity was tested through correlations with subjective quality of life and satisfaction with accommodation, as measured by the Manchester Short Assessment of Quality of Life (MANSA).
All seven original items of the CAT were regarded as relevant to appraisals of mental health supported accommodation services, with only slight modifications to the wording required. In the survey, data were obtained from 618 clients. The internal consistency of the CAT-SA items was 0.89. Mean CAT-SA scores were correlated with the specific accommodation item on the MANSA (rs = 0.37, p < .001).
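Internal consistency figures like the 0.89 reported here are conventionally Cronbach's alpha computed over the item-level responses. A small sketch of the standard computation; the CAT-SA data themselves are not available, so the seven-item matrix below is illustrative:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

# Illustrative data for seven items (the CAT-SA has seven items):
# each item = shared trait + item-specific noise, so items intercorrelate
rng = np.random.default_rng(0)
trait = rng.normal(size=200)
noise = rng.normal(scale=0.8, size=(200, 7))
items = trait[:, None] + noise
alpha = cronbach_alpha(items)
```

With a strong shared factor, as simulated here, alpha lands in the high range typical of a coherent scale; fully parallel items yield exactly 1.0.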
The content of the CAT-SA has relevance to service users living in mental health supported accommodation. The findings from our large survey show that the CAT-SA is acceptable across different types of supported accommodation and suggest good psychometric properties. The CAT-SA appears to be a valid and easy-to-use PROM for service users in mental health supported accommodation services.
Electronic supplementary material
The online version of this article (doi:10.1186/s12888-016-0755-3) contains supplementary material, which is available to authorized users.
Patient Reported Outcome; Supported Accommodation; Treatment Satisfaction; Mental Health