The Brief Symptom Inventory (BSI), Mood & Anxiety Symptom Questionnaire-30 (MASQ-D30), Short Form Health Survey 36 (SF-36), and Dimensional Assessment of Personality Pathology-Short Form (DAPP-SF) are generic instruments that can be used in Routine Outcome Monitoring (ROM) of patients with common mental disorders. We aimed to generate reference values usually encountered in 'healthy' and 'psychiatrically ill' populations to facilitate correct interpretation of ROM results.
We included the following specific reference populations: 1294 subjects from the general population (ROM reference group) recruited through general practitioners, and 5269 psychiatric outpatients diagnosed with mood, anxiety, or somatoform (MAS) disorders (ROM patient group). The outermost 5% of observations were used to define limits for one-sided reference intervals (95th percentiles for BSI, MASQ-D30 and DAPP-SF, and 5th percentiles for SF-36 subscales). Internal consistency and Receiver Operating Characteristics (ROC) analyses were performed.
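As an illustration of the one-sided reference limits described above, the sketch below takes the 95th percentile of a reference sample for scales where higher scores mean more symptoms and the 5th percentile for scales where higher scores mean better functioning. The data, the function names, and the linear-interpolation percentile rule are assumptions made for this example; the percentile method actually used in the study is not stated here.

```python
import math

def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one observation")
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

def reference_limit(scores, higher_is_worse=True):
    """One-sided 95% reference limit from a reference-group sample."""
    return percentile(scores, 95 if higher_is_worse else 5)

# Hypothetical reference-group scores, for illustration only.
symptom_scores = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.4]
functioning_scores = [35, 50, 55, 60, 70, 75, 80, 85, 90, 95]

upper_limit = reference_limit(symptom_scores, higher_is_worse=True)
lower_limit = reference_limit(functioning_scores, higher_is_worse=False)
print(upper_limit, lower_limit)
```

A patient score above `upper_limit` on a symptom scale, or below `lower_limit` on a functioning scale, would fall outside the range covering 95% of the reference group.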
Mean age was 40.3 years (SD=12.6) for the ROM reference group and 37.7 years (SD=12.0) for the ROM patient group. The proportion of females was 62.8% and 64.6%, respectively. The mean of the cut-off values for healthy individuals was 0.82 for the BSI subscales, 23 for the three MASQ-D30 subscales, 45 for the SF-36 subscales, and 3.1 for the DAPP-SF subscales. Discriminative power of the BSI, MASQ-D30 and SF-36 was good, but it was poor for the DAPP-SF. For all instruments, the internal consistency of the subscales ranged from adequate to excellent.
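Internal consistency is commonly summarized with Cronbach's alpha. The sketch below shows that calculation on made-up item responses; the abstract does not say which coefficient was used, so treating it as alpha is an assumption, as are the data and the function name.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the summed scale score across respondents.
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 4 respondents, 3 items.
responses = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5]]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

The same variance estimator is used in the numerator and denominator, so the choice between population and sample variance does not change the result.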
Discussion and conclusion
Reference values for the clinical interpretation were provided for the BSI, MASQ-D30, SF-36, and DAPP-SF. Clinical information aided by ROM data may represent the best means to appraise the clinical state of psychiatric outpatients.
Reference values; Routine outcome monitoring; Questionnaires; Mood disorders; Anxiety disorders; Somatoform disorders
The present study examined the utility of the anhedonic depression scale from the Mood and Anxiety Symptoms Questionnaire (MASQ-AD) as a way to screen for depressive disorders. Using receiver-operator characteristic analysis, the sensitivity and specificity of the full 22-item MASQ-AD scale, as well as the 8- and 14-item subscales, were examined in relation to both current and lifetime DSM-IV depressive disorder diagnoses in two nonpatient samples. As a means of comparison, the sensitivity and specificity of a measure of a relevant personality dimension, neuroticism, were also examined. Results from both samples support the clinical utility of the MASQ-AD scale as a means of screening for depressive disorders. Findings were strongest for the MASQ-AD 8-item subscale and when predicting current depression status. Furthermore, the MASQ-AD 8-item subscale outperformed the neuroticism measure under certain conditions. The overall usefulness of the MASQ-AD scale as a screening device is discussed, as well as possible cutoff scores for use in research.
depressive disorders; anhedonic depression; Mood and Anxiety Symptoms Questionnaire; receiver-operator characteristic analysis; screening
The overlap between Depression and Anxiety has led some researchers to conclude that they are manifestations of a broad, non-specific neurotic disorder. However, others believe that they can be distinguished despite sharing symptoms of general distress. The Tripartite Model of Affect proposes an anxiety-specific, a depression-specific and a shared symptoms factor. Watson and Clark developed the Mood and Anxiety Symptom Questionnaire (MASQ) to specifically measure these Tripartite constructs. Early research showed that the MASQ distinguished between dimensions of Depression and Anxiety in non-clinical samples. However, two recent studies have cautioned that the MASQ may show limited validity in clinical populations. The present study investigated the clinical utility of the MASQ in a clinical sample of adolescents and young adults.
A total of 204 young people consecutively referred to a specialist public mental health service in Melbourne, Australia, were approached and 150 consented to participate. Of these, 136 participants completed both a diagnostic interview and the MASQ.
The majority of the sample rated for an Axis-I disorder, with Mood and Anxiety disorders most prevalent. The disorder-specific scales of the MASQ significantly discriminated Anxiety (61.0%) and Mood Disorders (72.8%); however, the predictive accuracy for presence of Anxiety Disorders was very low (29.8%). From ROC analyses, a cut-off of 76 was proposed for the depression scale to indicate 'caseness' for Mood Disorders. The resulting sensitivity/specificity was superior to that of the CES-D.
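To make the cut-off evaluation concrete, the sketch below computes sensitivity and specificity for a single threshold against a diagnostic criterion. The scores and diagnoses are invented, and treating a score at or above the cut-off as a 'case' is an assumption; an ROC analysis repeats this calculation over every candidate cut-off.

```python
def sensitivity_specificity(scores, has_disorder, cutoff):
    """Classify score >= cutoff as a 'case' and compare with the diagnosis."""
    tp = fp = tn = fn = 0
    for score, diseased in zip(scores, has_disorder):
        predicted_case = score >= cutoff
        if predicted_case and diseased:
            tp += 1
        elif predicted_case and not diseased:
            fp += 1
        elif diseased:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)  # proportion of true cases detected
    specificity = tn / (tn + fp)  # proportion of non-cases correctly cleared
    return sensitivity, specificity

# Hypothetical scale scores and interview-based diagnoses.
scores = [60, 70, 75, 76, 80, 85, 90, 95]
diagnosed = [False, False, True, False, True, True, False, True]

sens, spec = sensitivity_specificity(scores, diagnosed, cutoff=76)
print(sens, spec)
```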
It was concluded that the depression-specific scale of the MASQ showed good clinical utility, but that the anxiety-specific scale showed poor discriminant validity.
Questionnaires used by health services to identify children with psychosocial problems are often rather short. The psychometric properties of such short questionnaires are mostly less than needed for an accurate distinction between children with and without problems. We aimed to assess whether a short Computerized Adaptive Test (CAT) can overcome the weaknesses of short written questionnaires when identifying children with psychosocial problems.
We used a Dutch national data set obtained from parents of children invited for a routine health examination by Preventive Child Healthcare with 205 items on behavioral and emotional problems (n = 2,041, response 84%). In a random subsample we determined which items met the requirements of an Item Response Theory (IRT) model to a sufficient degree. Using those items, item parameters necessary for a CAT were calculated and a cut-off point was defined. In the remaining subsample we determined the validity and efficiency of a Computerized Adaptive Test using simulation techniques, with current treatment status and a clinical score on the Total Problem Scale (TPS) of the Child Behavior Checklist as criteria.
Out of 205 items available, 190 sufficiently met the criteria of the underlying IRT model. For 90% of the children a score above or below the cut-off point could be determined with 95% accuracy. The mean number of items needed to achieve this was 12. Sensitivity and specificity with the TPS as a criterion were 0.89 and 0.91, respectively.
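One common way to decide that a score lies above or below a cut-off with a stated confidence is to stop testing once the confidence interval around the provisional estimate no longer contains the cut-off. The sketch below shows that rule; it is an assumption about how such a classification CAT could work, not a description of the procedure used in this study, and all names and numbers are hypothetical.

```python
def classify_against_cutoff(theta_hat, se, cutoff, z=1.96):
    """Return 'above', 'below', or 'undecided' for a provisional estimate.

    theta_hat and se are the current trait estimate and its standard error;
    z=1.96 corresponds to a two-sided 95% interval.
    """
    lower = theta_hat - z * se
    upper = theta_hat + z * se
    if lower > cutoff:
        return "above"
    if upper < cutoff:
        return "below"
    return "undecided"  # keep administering items

print(classify_against_cutoff(1.2, 0.3, cutoff=0.5))
print(classify_against_cutoff(0.6, 0.3, cutoff=0.5))
print(classify_against_cutoff(-0.5, 0.3, cutoff=0.5))
```

Respondents far from the cut-off are classified after very few items, while those close to it need more, which is one reason the mean test length can be short.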
An IRT-based CAT is a very promising option for the identification of psychosocial problems in children, as it can lead to an efficient, yet high-quality identification. The results of our simulation study need to be replicated in a real-life administration of this CAT.
We report on the selection of self-report measures for inclusion in the NIH Toolbox that are suitable for assessing the full range of negative affect including sadness, fear, and anger. The Toolbox is intended to serve as a “core battery” of assessment tools for cognition, sensation, motor function, and emotional health that will help to overcome the lack of consistency in measures used across epidemiological, observational, and intervention studies. A secondary goal of the NIH Toolbox is the identification of measures that are flexible, efficient, and precise, an agenda best fulfilled by the use of item banks calibrated with models from item response theory (IRT) and suitable for adaptive testing. Results from a sample of 1,763 respondents supported use of the adult and pediatric item banks for emotional distress from the Patient-Reported Outcomes Measurement Information System (PROMIS®) as a starting point for capturing the full range of negative affect in healthy individuals. Content coverage for the adult Toolbox was also enhanced by the development of a scale for somatic arousal using items from the Mood and Anxiety Symptom Questionnaire (MASQ) and scales for hostility and physical aggression using items from the Buss-Perry Aggression Questionnaire (BPAQ).
sadness; fear; anger; item response theory; measurement
This study investigated the combination of item response theory and computerized adaptive testing (CAT) for psychiatric measurement as a means of reducing the burden of research and clinical assessments.
Data were from 800 participants in outpatient treatment for a mood or anxiety disorder; they completed 616 items of the 626-item Mood and Anxiety Spectrum Scales (MASS) at two times. The first administration was used to design and evaluate a CAT version of the MASS by using post hoc simulation. The second confirmed the functioning of CAT in live testing.
Tests of competing models based on item response theory supported the scale’s bifactor structure, consisting of a primary dimension and four group factors (mood, panic-agoraphobia, obsessive-compulsive, and social phobia). Both simulated and live CAT showed a 95% average reduction (585 items) in items administered (24 and 30 items, respectively) compared with administration of the full MASS. The correlation between scores on the full MASS and the CAT version was .93. For the mood disorder subscale, differences in scores between two groups of depressed patients—one with bipolar disorder and one without—on the full scale and on the CAT showed effect sizes of .63 (p<.003) and 1.19 (p<.001) standard deviation units, respectively, indicating better discriminant validity for CAT.
Instead of using small fixed-length tests, clinicians can create item banks with a large item pool, and a small set of the items most relevant for a given individual can be administered with no loss of information, yielding a dramatic reduction in administration time and patient and clinician burden.
Unlike other areas of medicine, psychiatry is almost entirely dependent on patient report to assess the presence and severity of disease; therefore, it is particularly crucial that we find both more accurate and efficient means of obtaining that report.
To develop a computerized adaptive test (CAT) for depression, called the Computerized Adaptive Test–Depression Inventory (CAT-DI), that decreases patient and clinician burden and increases measurement precision.
A psychiatric clinic and community mental health center.
A total of 1614 individuals with and without minor and major depression were recruited for the study.
Main Outcome Measures
The focus of this study was the development of the CAT-DI. The 24-item Hamilton Rating Scale for Depression, Patient Health Questionnaire 9, and the Center for Epidemiologic Studies Depression Scale were used to study the convergent validity of the new measure, and the Structured Clinical Interview for DSM-IV was used to obtain diagnostic classifications of minor and major depressive disorder.
A mean of 12 items per study participant was required to achieve a 0.3 SE in the depression severity estimate and maintain a correlation of r=0.95 with the total 389-item test score. Using empirically derived thresholds based on a mixture of normal distributions, we found a sensitivity of 0.92 and a specificity of 0.88 for the classification of major depressive disorder in a sample consisting of depressed patients and healthy controls. Correlations on the order of r=0.8 were found with the other clinician and self-rating scale scores. The CAT-DI provided excellent discrimination throughout the entire depressive severity continuum (minor and major depression), whereas the traditional scales did so primarily at the extremes (eg, major depression).
Traditional measurement fixes the number of items administered and allows measurement uncertainty to vary. In contrast, a CAT fixes measurement uncertainty and allows the number of items to vary. The result is a significant reduction in the number of items needed to measure depression and increased precision of measurement.
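The contrast between fixing test length and fixing measurement uncertainty can be sketched as a small simulation. The example below is not the CAT-DI algorithm: it assumes a unidimensional two-parameter logistic model with dichotomous items, a randomly generated hypothetical item bank, maximum-information item selection, and expected a posteriori (EAP) scoring, and it stops once the posterior standard deviation reaches 0.3.

```python
import math
import random

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item with discrimination a, location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [-4.0 + 0.1 * i for i in range(81)]  # quadrature points for theta

def eap(responses):
    """Posterior mean and SD of theta under a standard-normal prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for (a, b), x in responses:
            p = p_endorse(theta, a, b)
            w *= p if x else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.3):
    """Administer items until the posterior SD falls to se_target."""
    remaining = list(bank)
    responses = []
    theta, se = 0.0, 1.0  # start at the prior mean and SD
    while remaining and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda ab: item_information(theta, *ab))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap(responses)
    return theta, se, len(responses)

# Hypothetical 300-item bank and a simulated respondent with true theta = 1.0.
rng = random.Random(0)
bank = [(rng.uniform(1.0, 2.5), rng.uniform(-3.0, 3.0)) for _ in range(300)]
true_theta = 1.0
theta_hat, se, n_items = run_cat(
    bank, lambda item: rng.random() < p_endorse(true_theta, *item)
)
print(round(theta_hat, 2), round(se, 2), n_items)
```

The number of items returned varies from respondent to respondent, while the precision at stopping is held at the target, which is the trade-off the paragraph above describes.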
The Mood and Anxiety Symptom Questionnaire (MASQ) was designed to specifically measure the Tripartite model of affect and is proposed to offer a delineation between the core components of anxiety and depression. Factor analytic data from adult clinical samples have shown mixed results; however, no studies employing confirmatory factor analysis (CFA) have supported the predicted structure of distinct Depression, Anxiety and General Distress factors. The Tripartite model has not been validated in a clinical sample of older adolescents and young adults. The aim of the present study was to examine the validity of the Tripartite model using scale-level data from the MASQ and correlational and confirmatory factor analysis techniques.
137 young people (M = 17.78, SD = 2.63) referred to a specialist mental health service for adolescents and young adults completed the MASQ and diagnostic interview.
All MASQ scales were highly inter-correlated, with the lowest correlation between the depression- and anxiety-specific scales (r = .59). This pattern of correlations was observed for all participants rating for an Axis-I disorder but not for participants without a current disorder (r = .18). Confirmatory factor analyses were conducted to evaluate the model fit of a number of solutions. The predicted Tripartite structure was not supported. A 2-factor model demonstrated superior model fit and parsimony compared to 1- or 3-factor models. These broad factors represented Depression and Anxiety and were highly correlated (r = .88).
The present data lend support to the notion that the Tripartite model does not adequately explain the relationship between anxiety and depression in all clinical populations. Indeed, in the present study this model was found to be inappropriate for a help-seeking community sample of older adolescents and young adults.
Short-form patient-reported outcome measures are popular because they minimize patient burden. We assessed the efficiency of static short forms and computer adaptive testing (CAT) using data from the Patient-Reported Outcomes Measurement Information System (PROMIS) project.
We evaluated the 28-item PROMIS depressive symptoms bank. We used post hoc simulations based on the PROMIS calibration sample to compare several short-form selection strategies and the PROMIS CAT to the total item bank score.
Compared with full-bank scores, all short forms and CAT produced highly correlated scores, but CAT outperformed each static short form in almost all criteria. However, short-form selection strategies performed only marginally worse than CAT. The performance gap observed in static forms was reduced by using a two-stage branching test format.
Using several polytomous items in a calibrated unidimensional bank to measure depressive symptoms yielded a CAT that provided marginally superior efficiency compared to static short forms. The efficiency of a two-stage semi-adaptive testing strategy was so close to CAT that it warrants further consideration and study.
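A two-stage branching format can be sketched as a short routing block followed by one of several fixed second-stage blocks chosen from the routing score. The block names, item labels, and cut points below are hypothetical and are not taken from the PROMIS analysis.

```python
def second_stage_block(router_responses, blocks, cut_points):
    """Pick the second-stage block from the summed router score.

    cut_points is an ascending list of (max_router_score, block_name) pairs;
    scores above the last bound fall into the last block.
    """
    router_score = sum(router_responses)
    for max_score, name in cut_points:
        if router_score <= max_score:
            return name, blocks[name]
    last_name = cut_points[-1][1]
    return last_name, blocks[last_name]

# Hypothetical layout: 4 router items scored 0-4, then one of three 4-item blocks.
blocks = {
    "low": ["L1", "L2", "L3", "L4"],
    "middle": ["M1", "M2", "M3", "M4"],
    "high": ["H1", "H2", "H3", "H4"],
}
cut_points = [(4, "low"), (10, "middle"), (16, "high")]

name, items = second_stage_block([3, 2, 4, 3], blocks, cut_points)
print(name, items)
```

Unlike a full CAT, only one adaptive decision is made, so the format can be administered on paper or with very simple software.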
Computer adaptive testing; PROMIS; Item response theory; Short form; Two-stage testing
The purpose of this research was to calibrate an item bank for a computerized adaptive test (CAT) of asthma impact on health-related quality of life (HRQOL), test CAT versions of varying lengths, conduct preliminary validity testing, and evaluate item bank readability.
Asthma Impact Survey (AIS) bank items that passed focus group, cognitive testing, and clinical and psychometric reviews were administered to adults with varied levels of asthma control. Adults self-reporting asthma (N=1106) completed an Internet survey including 88 AIS items, the Asthma Control Test (ACT), and other HRQOL outcome measures. Data were analyzed using classical and modern psychometric methods, real-data CAT simulations, and known groups validity testing.
A bi-factor model with a general factor (asthma impact) and several group factors (cognitive function, fatigue, mental health, physical function, role function, sexual function, self-consciousness/stigma, sleep, and social function) was tested. Loadings on the general factor were above 0.5 and were substantially larger than group factor loadings, and fit statistics were acceptable. Item functioning for most items and fit to the model was acceptable. CAT simulations demonstrated several options for administration and stopping rules. AIS distinguished between respondents with differing levels of asthma control.
The new 50-item AIS item bank demonstrated favorable psychometric characteristics, preliminary evidence of validity, and accessibility at moderate reading levels. Developing item banks for CAT can improve the precise, efficient, and comprehensive monitoring of asthma outcomes, and may facilitate patient-centered care.
asthma control; Asthma Impact Survey; item response theory; patient-reported outcome; health-related quality of life
Health Related Quality of Life (HRQoL) is a relevant variable in the evaluation of health outcomes. Questionnaires based on Classical Test Theory typically require a large number of items to evaluate HRQoL. Computer Adaptive Testing (CAT) can be used to reduce test length while maintaining and, in some cases, improving accuracy. This study aimed at validating a CAT based on Item Response Theory (IRT) for evaluation of generic HRQoL: the CAT-Health instrument.
Cross-sectional study of subjects aged over 18 attending Primary Care Centres for any reason. CAT-Health was administered along with the SF-12 Health Survey. Age, gender and a checklist of chronic conditions were also collected. CAT-Health was evaluated considering: 1) feasibility: completion time and test length; 2) content range coverage, Item Exposure Rate (IER) and test precision; and 3) construct validity: differences in the CAT-Health scores according to clinical variables and correlations between both questionnaires.
A total of 396 subjects answered CAT-Health and SF-12; 67.2% were female and the mean age (SD) was 48.6 (17.7) years. Of these, 36.9% did not report any chronic condition. Median completion time for CAT-Health was 81 seconds (interquartile range = 59-118) and it increased with age (p < 0.001). The median number of items administered was 8 (interquartile range = 6-10). Neither ceiling nor floor effects were found for the score. None of the items in the pool had an IER of 100%, and the IER was over 5% for 27.1% of the items. Test Information Function (TIF) peaked between levels -1 and 0 of HRQoL. Statistically significant differences were observed in the CAT-Health scores according to the number and type of conditions.
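The Item Exposure Rate can be read as the share of administrations in which a given item was shown. A minimal sketch of that calculation follows, using invented administration logs and item labels.

```python
from collections import Counter

def item_exposure_rates(administrations, pool):
    """Fraction of administrations in which each pool item was presented."""
    # set(items) guards against counting an item twice within one administration.
    counts = Counter(item for items in administrations for item in set(items))
    n = len(administrations)
    return {item: counts[item] / n for item in pool}

# Hypothetical logs: each inner list is the set of items one respondent saw.
logs = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "a"]]
pool = ["a", "b", "c", "d", "e"]

rates = item_exposure_rates(logs, pool)
print(rates)
```

In this toy log, item `a` has an exposure rate of 100% (every respondent saw it) and item `e` is never used; the first usually indicates a fixed starting item, the second an item that adds little information.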
Although domain-specific CATs exist for various areas of HRQoL, CAT-Health is one of the first IRT-based CATs designed to evaluate generic HRQoL and it has proven feasible, valid and efficient, when administered to a broad sample of individuals attending primary care settings.
Many hospitals have adopted mobile nursing carts that can be easily rolled up to a patient’s bedside to access charts and help nurses perform their rounds. However, few papers have reported data regarding the use of wireless computers on wheels (COW) at patients’ bedsides to collect questionnaire-based information on their perception of hospitalization on discharge from the hospital.
The purpose of this study was to evaluate the relative efficiency of computerized adaptive testing (CAT) and the precision of CAT-based measures of perceptions of hospitalized patients, as compared with those of nonadaptive testing (NAT). An Excel module of our CAT multicategory assessment is provided as an example.
A total of 200 patients who were discharged from the hospital responded to the CAT-based 18-item inpatient perception questionnaire on COW. The number of questions administered was recorded and the responses were calibrated using the Rasch model. They were compared with those from NAT to show the advantage of CAT over NAT.
Patient measures derived from CAT and NAT were highly correlated (r = 0.98) and their measurement precisions were not statistically different (P = .14). CAT required fewer questions than NAT (an efficiency gain of 42%), suggesting a reduced burden for patients. There were no significant differences between groups in terms of gender and other demographic characteristics.
CAT-based administration of surveys of patient perception substantially reduced patient burden without compromising the precision of measuring patients’ perceptions of hospitalization. The Excel module of animation-CAT on the wireless COW that we developed is recommended for use in hospitals.
Computerized adaptive testing; computer on wheels; classic test theory; IRT; item response theory; nonadaptive testing
The authors developed a computerized adaptive test for anxiety that decreases patient and clinician burden and increases measurement precision.
A total of 1,614 individuals with and without generalized anxiety disorder from a psychiatric clinic and community mental health center were recruited. The focus of the present study was the development of the Computerized Adaptive Testing–Anxiety Inventory (CAT-ANX). The Structured Clinical Interview for DSM-IV was used to obtain diagnostic classifications of generalized anxiety disorder and major depressive disorder.
An average of 12 items per subject was required to achieve a 0.3 standard error in the anxiety severity estimate and maintain a correlation of 0.94 with the total 431-item test score. CAT-ANX scores were strongly related to the probability of a generalized anxiety disorder diagnosis. Using both the Computerized Adaptive Testing–Depression Inventory and the CAT-ANX, comorbid major depressive disorder and generalized anxiety disorder can be accurately predicted.
Traditional measurement fixes the number of items but allows measurement uncertainty to vary. Computerized adaptive testing fixes measurement uncertainty and allows the number and content of items to vary, leading to a dramatic decrease in the number of items required for a fixed level of measurement uncertainty. Potential applications for inexpensive, efficient, and accurate screening of anxiety in primary care settings, clinical trials, psychiatric epidemiology, molecular genetics, children, and other cultures are discussed.
Recent approaches to outcome measurement involving Computerized Adaptive Testing (CAT) offer a way of measuring disability in low back pain (LBP) that can reduce the burden on patient and professional. The aim of this study was to explore the potential of CAT in LBP for measuring disability as defined in the International Classification of Functioning, Disability and Health (ICF), which includes impairments, activity limitation, and participation restriction.
A total of 266 patients with low back pain answered questions from a range of widely used questionnaires. An exploratory factor analysis (EFA) was used to identify disability dimensions, which were then subjected to Rasch analysis. Reliability was tested by internal consistency and person separation index (PSI). Discriminant validity of disability levels was evaluated by the Spearman correlation coefficient (r), intraclass correlation coefficient [ICC(2,1)] and the Bland-Altman approach. A CAT was developed for each dimension, and the results were checked against simulated and real applications from a further 133 patients.
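The Bland-Altman approach mentioned above compares two sets of scores for the same people by looking at their differences. The sketch below computes the mean difference (bias) and the conventional 95% limits of agreement; the paired scores are invented, and the pairing of full-length and CAT scores is an assumption made for illustration.

```python
from statistics import mean, stdev

def bland_altman(full_scores, cat_scores):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [c - f for f, c in zip(full_scores, cat_scores)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # half-width of the 95% limits of agreement
    return bias, bias - spread, bias + spread

# Hypothetical disability estimates from the full item set and from the CAT.
full = [1.0, 2.0, 3.0, 4.0]
cat = [1.1, 1.9, 3.2, 4.0]

bias, lower, upper = bland_altman(full, cat)
print(round(bias, 3), round(lower, 3), round(upper, 3))
```

A bias close to zero with narrow limits indicates that the two administrations can be used interchangeably for individual patients, which a high correlation alone does not guarantee.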
Factor analytic techniques identified two dimensions named "body functions" and "activity-participation". After deletion of some items for failure to fit the Rasch model, the remaining items were mostly free of Differential Item Functioning (DIF) for age and gender. Reliability exceeded 0.90 for both dimensions. The disability levels generated using all items and those obtained from the real CAT application were highly correlated (i.e. > 0.97 for both dimensions). On average, 19 and 14 items were needed to estimate the precise disability levels using the initial CAT for the first and second dimension. However, a marginal increase in the standard error of the estimate across successive iterations substantially reduced the number of items required to make an estimate.
Using a combined approach of EFA and Rasch analysis, this study has shown that it is possible to calibrate items onto a single metric in a way that can be used to provide the basis of a CAT application. Thus there is an opportunity to obtain a wide variety of information to evaluate the biopsychosocial model in its more complex forms, without necessarily increasing the burden of information collection for patients.
Workplace bullying is a prevalent problem in contemporary workplaces that has adverse effects on both the victims of bullying and organizations. With the rapid development of computer technology in recent years, there is an urgent need to determine whether item response theory–based computerized adaptive testing (CAT) can be applied to measure exposure to workplace bullying.
The purpose of this study was to evaluate the relative efficiency and measurement precision of a CAT-based test for hospital nurses compared to traditional nonadaptive testing (NAT). Under the preliminary conditions of a single domain derived from the scale, a CAT module bullying scale model with polytomously scored items is provided as an example for evaluation purposes.
A total of 300 nurses were recruited and responded to the 22-item Negative Acts Questionnaire-Revised (NAQ-R). All NAT (or CAT-selected) items were calibrated with the Rasch rating scale model and all respondents were randomly selected for a comparison of the advantages of CAT and NAT in efficiency and precision by paired t tests and the area under the receiver operating characteristic curve (AUROC).
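Under the Rasch rating scale model, all items share one set of category thresholds and differ only in their location. The sketch below computes the response-category probabilities for one item; the person measure, item location, and threshold values are hypothetical and are not estimates from the NAQ-R data.

```python
import math

def rating_scale_probs(theta, delta, taus):
    """Category probabilities P(X = 0..M) under the Andrich rating scale model.

    theta: person measure (logits); delta: item location (logits);
    taus: the M category thresholds shared by all items.
    """
    # Cumulative sums of (theta - delta - tau_j); the empty sum for category 0 is 0.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - delta - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    expo = [math.exp(v - peak) for v in logits]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical five-category item with symmetric thresholds.
probs = rating_scale_probs(theta=0.0, delta=0.0, taus=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

A CAT built on this model uses these probabilities both to pick the next most informative item and to update the person measure after each response.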
The NAQ-R is a unidimensional construct that can be applied to measure exposure to workplace bullying through CAT-based administration. Nursing measures derived from both tests (CAT and NAT) were highly correlated (r=.97) and their measurement precisions were not statistically different (P=.49) as expected. CAT required fewer items than NAT (an efficiency gain of 32%), suggesting a reduced burden for respondents. There were significant differences in work tenure between the 2 groups (bullied and nonbullied) at a cutoff point of 6 years at 1 worksite. With an AUROC of 0.75 (95% CI 0.68-0.79), respondents with logits greater than –4.2 (or summed scores >30) were defined as highly likely to have been bullied in the workplace.
With CAT-based administration of the NAQ-R for nurses, their burden was substantially reduced without compromising measurement precision.
computerized adaptive testing; computer on wheels; classic test theory; item response theory; nonadaptive testing; the Negative Acts Questionnaire-Revised
To develop outpatient adaptive short forms (ASFs) for the Activity Measure for Post-Acute Care (AM-PAC) item bank for use in outpatient therapy settings.
A convenience sample of 11,809 adults with spine, lower extremity, upper extremity and miscellaneous orthopedic impairments who received outpatient rehabilitation in one of 127 outpatient rehabilitation clinics in the US. We identified optimal items for use in developing outpatient ASFs based on the Basic Mobility and Daily Activities domains of the AM-PAC item bank. Patient scores were derived from the AM-PAC computerized adaptive testing (CAT) program. Items were selected for inclusion on the ASFs based on functional content, range of item coverage, measurement precision, item exposure rate, and data collection burden.
Two outpatient ASFs were developed: 1) an 18-item Basic Mobility ASF and 2) a 15-item Daily Activities ASF, derived from the same item bank used to develop the AM-PAC-CAT. Both ASFs achieved acceptable psychometric properties.
In outpatient PAC settings where CAT outcome applications are currently not feasible, IRT-derived ASFs provide an efficient way to monitor patients’ functional outcomes. The development of ASF functional outcome instruments linked by a common, calibrated item bank has the potential to create a bridge to outcome monitoring across PAC settings and can make the eventual transition from ASFs to CAT applications easier and more acceptable to the rehabilitation community.
Outcomes Assessment; Rehabilitation; Item Response Theory; Physical Functioning
The chronic obstructive pulmonary disease (COPD) Assessment Test (CAT) is a concise health status measure for COPD. COPD patients have a variety of comorbidities, but little is known about their impact on quality of life. This study was designed to investigate comorbid factors that may contribute to high CAT scores.
An observational study at Keio University and affiliated hospitals enrolled 336 COPD patients and 67 non-COPD subjects. Health status was assessed by the CAT, the St. George's Respiratory Questionnaire (SGRQ), and all components of the Medical Outcomes Study Short-Form 36-Item (SF-36) version 2, which is a generic measure of health. Comorbidities were identified based on patients’ reports, physicians’ records, and questionnaires, including the Frequency Scale for the Symptoms of Gastro-esophageal reflux disease (GERD) and the Hospital Anxiety and Depression Scale. Dual X-ray absorptiometry measurements of bone mineral density were performed.
The CAT showed moderate-good correlations with the SGRQ and all components of the SF-36. The presence of GERD, depression, arrhythmia, and anxiety was significantly associated with a high CAT score in the COPD patients.
Symptomatic COPD patients have a high prevalence of comorbidities. A high CAT score should alert the clinician to a higher likelihood of certain comorbidities such as GERD and depression, because these diseases may co-exist unrecognized.
Clinical trial registered with UMIN (UMIN000003470).
Chronic obstructive pulmonary disease; Health status; Depression; Gastro-esophageal reflux; Comorbidity; Osteoporosis
We provide detailed instructions for analyzing patient-reported outcome (PRO) data collected with an existing (legacy) instrument so that scores can be calibrated to the PRO Measurement Information System (PROMIS) metric. This calibration facilitates migration to computerized adaptive test (CAT) PROMIS data collection, while facilitating research using historical legacy data alongside new PROMIS data.
A cross-sectional convenience sample (n = 2,178) from the Universities of Washington and Alabama at Birmingham HIV clinics completed the PROMIS short form and Patient Health Questionnaire (PHQ-9) depression symptom measures between August 2008 and December 2009. We calibrated the tests using item response theory. We compared measurement precision of the PHQ-9, the PROMIS short form, and simulated PROMIS CAT.
Dimensionality analyses confirmed the PHQ-9 could be calibrated to the PROMIS metric. We provide code used to score the PHQ-9 on the PROMIS metric. The mean standard errors of measurement were 0.49 for the PHQ-9, 0.35 for the PROMIS short form, and 0.37, 0.28, and 0.27 for 3-, 8-, and 9-item-simulated CATs.
The strategy described here facilitated migration from a fixed-format legacy scale to PROMIS CAT administration and may be useful in other settings.
Calibration; Computerized adaptive testing; Depression; Item banks; Item response theory; PROMIS
In 2012, the American Orthopaedic Foot & Ankle Society® established a national network for collecting and sharing data on treatment outcomes and improving patient care. One of the network’s initiatives is to explore the use of computerized adaptive tests (CATs) for patient-level outcome reporting.
We determined whether the CAT from the NIH Patient Reported Outcome Measurement Information System® (PROMIS®) Physical Function (PF) item bank provides efficient, reliable, valid, precise, and adequately covered point estimates of patients’ physical function.
After informed consent, 288 patients with a mean age of 51 years (range, 18–81 years) undergoing surgery for common foot and ankle problems completed a web-based questionnaire. Efficiency was determined by time for test administration. Reliability was assessed with person and item reliability estimates. Validity evaluation included content validity from expert review and construct validity measured against the PROMIS® Pain CAT and patient responses based on tradeoff perceptions. Precision was assessed by standard error of measurement (SEM) across patients’ physical function levels. Instrument coverage was based on a person-item map.
Average time of test administration was 47 seconds. Reliability was 0.96 for person and 0.99 for item. Construct validity against the Pain CAT had an r value of −0.657 (p < 0.001). Precision had an SEM of less than 3.3 (equivalent to a Cronbach’s alpha of ≥ 0.90) across a broad range of function. Concerning coverage, the ceiling effect was 0.32% and there was no floor effect.
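The stated correspondence between an SEM and a reliability coefficient can be illustrated with the classical test theory relation reliability = 1 - (SEM / SD)^2. The sketch below assumes the SEM is reported on the PROMIS T-score metric (SD of 10 in the reference population), which the abstract does not state explicitly; the function names are illustrative only.

```python
import math


def reliability_from_sem(sem: float, sd: float = 10.0) -> float:
    """Reliability implied by a standard error of measurement.

    Assumes the classical relation reliability = 1 - (SEM / SD)^2 and,
    by default, a T-score metric with SD = 10 (an assumption here).
    """
    return 1.0 - (sem / sd) ** 2


def sem_from_reliability(reliability: float, sd: float = 10.0) -> float:
    """Inverse of the relation above: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    print(round(reliability_from_sem(3.3), 3))
    print(round(sem_from_reliability(0.90), 2))
```

Under these assumptions an SEM of 3.3 corresponds to a reliability of roughly 0.89, and a reliability of 0.90 to an SEM of about 3.16, so the equivalence quoted in the abstract is approximate.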
The PROMIS® PF CAT appears to be an excellent method for measuring outcomes for patients with foot and ankle surgery. Further validation of the PROMIS® item banks may ultimately provide a valid and reliable tool for measuring patient-reported outcomes after injuries and treatment.
Level of Evidence
Level III, diagnostic study. See Instructions for Authors for a complete description of levels of evidence.
Electronic supplementary material
The online version of this article (doi:10.1007/s11999-013-3097-1) contains supplementary material, which is available to authorized users.
The Computer Adaptive Test version of the Community Reintegration of Injured Service Members measure (CRIS-CAT) consists of three scales measuring Extent of, Perceived Limitations in, and Satisfaction with community integration. The CRIS-CAT was developed using item response theory methods. The purposes of this study were to assess the reliability; the concurrent, known-group, and predictive validity; and the respondent burden of the CRIS-CAT.
This was a three-part study that included 1) a cross-sectional field study of 517 homeless, employed, and Operation Enduring Freedom / Operation Iraqi Freedom (OEF/OIF) Veterans who completed all items in the CRIS item set, 2) a cohort study with one-year follow-up of 135 OEF/OIF Veterans, and 3) a 50-person study of CRIS-CAT administration. Conditional reliability of simulated CAT scores was calculated from the field study data, and concurrent validity and known group validity were examined using Pearson product-moment correlations and ANOVAs. Data from the cohort were used to examine the ability of the CRIS-CAT to predict key one-year outcomes. Data from the CRIS-CAT administration study were used to calculate ICC (2,1), minimum detectable change (MDC), and average number of items used during CAT administration.
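As an illustration of how a minimum detectable change can be derived from a test-retest ICC, the sketch below uses the common convention SEM = SD * sqrt(1 - ICC) and MDC = z * sqrt(2) * SEM. The 95% confidence level (z = 1.96) and the example SD are assumptions made for illustration; the abstract does not report which convention or values the authors used.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM from the score SD and a reliability coefficient such as ICC(2,1)."""
    return sd * math.sqrt(1.0 - icc)


def minimum_detectable_change(sd: float, icc: float, z: float = 1.96) -> float:
    """MDC at the confidence level implied by z (1.96 for 95%, assumed here).

    The sqrt(2) factor reflects measurement error in both of the two
    scores being compared.
    """
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, icc)


if __name__ == "__main__":
    # Hypothetical inputs: a scale with SD = 10 and ICC = 0.90.
    print(round(minimum_detectable_change(10.0, 0.90), 2))
```

A higher ICC shrinks the SEM and therefore the MDC, which is why reliability and detectable change are usually reported together.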
Reliability scores for all scales were above 0.75, but decreased at both ends of the score continuum. CRIS-CAT scores were correlated with concurrent validity indicators and differed significantly between the three Veteran groups (P < .001). The odds of having any Emergency Room visits were reduced for Veterans with better CRIS-CAT scores (Extent, Perceived, and Satisfaction, respectively: OR = 0.94, 0.93, 0.95; P < .05). CRIS-CAT scores were predictive of SF-12 physical and mental health-related quality of life scores at the 1-year follow-up. Scales had ICCs >0.9. MDCs were 5.9, 6.2, and 3.6, respectively, for the Extent, Perceived, and Satisfaction subscales. The numbers of items administered at Visit 1 (mean, SD) were 14.6 (3.8), 10.9 (2.7), and 10.4 (1.7), respectively, for the Extent, Perceived, and Satisfaction subscales.
The CRIS-CAT demonstrated sound measurement properties including reliability, construct, known group and predictive validity, and it was administered with minimal respondent burden. These findings support the use of this measure in assessing community reintegration.
To use item response theory (IRT) data simulations to construct and perform initial psychometric testing of a newly developed instrument, the Social Security Administration Behavioral Health Function (SSA-BH) instrument, that aims to assess behavioral health functioning relevant to the context of work.
Cross-sectional survey followed by item response theory (IRT) calibration data simulations.
A sample of individuals applying for SSA disability benefits, claimants (N=1015), and a normative comparative sample of US adults (N=1000)
Main Outcome Measure
Social Security Administration Behavioral Health Function (SSA-BH) measurement instrument
Item response theory analyses supported the unidimensionality of four SSA-BH scales: Mood and Emotions (35 items), Self-Efficacy (23 items), Social Interactions (6 items), and Behavioral Control (15 items). All SSA-BH scales demonstrated strong psychometric properties including reliability, accuracy, and breadth of coverage. High correlations of the simulated 5- or 10-item CATs with the full item bank indicated robust ability of the CAT approach to comprehensively characterize behavioral health function along four distinct dimensions.
Initial testing and evaluation of the SSA-BH instrument demonstrated good accuracy, reliability, and content coverage along all four scales. Behavioral function profiles of SSA claimants were generated and compared to age- and sex-matched norms along four scales: Mood and Emotions, Behavioral Control, Social Interactions, and Self-Efficacy. The CAT-based approach offers the ability to collect standardized, comprehensive functional information about claimants efficiently, which may prove useful in the context of the SSA’s work disability programs.
Behavioral health; Outcome assessment (healthcare); Work disability; SSA disability determination; Disability evaluation
To develop and test a prototype dyspnea computer adaptive test.
Two outpatient medical facilities.
A convenience sample of 292 adults with COPD.
Main Outcome Measure
We developed a modified and expanded item bank and computer adaptive test (CAT) for the Dyspnea Management Questionnaire (DMQ), an outcome measure consisting of four dyspnea dimensions: dyspnea intensity, dyspnea anxiety, activity avoidance, and activity self-efficacy.
Factor analyses supported a four-dimensional model underlying the 71 DMQ items. The DMQ item bank achieved acceptable Rasch model fit statistics, good measurement breadth with minimal floor and ceiling effects, and evidence of high internal consistency reliability (α = 0.92 to 0.98). Using CAT simulation analyses, the DMQ-CAT showed high measurement accuracy compared to the total item pool (r = 0.83 to 0.97, p < .0001) and evidence of good to excellent concurrent validity (r = −0.61 to −0.80, p < .0001). All DMQ-CAT domains showed evidence for known-groups validity (p ≤ 0.001).
The DMQ-CAT reliably and validly captured four distinct dyspnea domains. Multidimensional dyspnea assessment in COPD is needed to better measure the effectiveness of pharmacologic, pulmonary rehabilitation, and psychosocial interventions in not only alleviating the somatic sensation of dyspnea but also reducing dysfunctional emotions, cognitions, and behaviors associated with dyspnea, especially for anxious patients.
Dyspnea; COPD; Outcomes assessment; Reliability; Validity
The objectives of this study were to develop a functional outcome instrument for hip and knee osteoarthritis research (OA-FUNCTION-CAT) using item response theory (IRT) and computer adaptive test (CAT) methods and to assess its psychometric performance compared to the current standard in the field.
We conducted an extensive literature review, focus groups, and cognitive testing to guide the construction of an item bank consisting of 125 functional activities commonly affected by hip and knee osteoarthritis. We recruited a convenience sample of 328 adults with confirmed hip and/or knee osteoarthritis. Subjects reported their degree of functional difficulty and functional pain in performing each activity in the item bank and completed the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC). Confirmatory factor analyses were conducted to assess scale uni-dimensionality, and IRT methods were used to calibrate the items and examine the fit of the data. We assessed the performance of OA-FUNCTION-CATs of different lengths relative to the full item bank and WOMAC using CAT simulation analyses.
Confirmatory factor analyses revealed distinct functional difficulty and functional pain domains. Descriptive statistics for scores from 5-, 10-, and 15-item CATs were similar to those for the full item bank. The 10-item OA-FUNCTION-CAT scales demonstrated a high degree of accuracy compared with the item bank (r = 0.96 and 0.89, respectively). Compared to the WOMAC, both scales covered a broader score range and demonstrated a higher degree of precision at the ceiling and reliability across the range of scores.
The OA-FUNCTION-CAT provided superior reliability throughout the score range and improved breadth and precision at the ceiling compared with the WOMAC. Further research is needed to assess whether these improvements carry over into superior ability to measure change.
The use of item response theory (IRT) to measure self-reported outcomes has burgeoned in recent years. Perhaps the most important application of IRT is computer-adaptive testing (CAT), a measurement approach in which the selection of items is tailored for each respondent.
To provide an introduction to the use of CAT in the measurement of health outcomes, describe several IRT models that can be used as the basis of CAT, and discuss practical issues associated with the use of adaptive scaling in research settings.
The development of a CAT requires several steps that are not required in the development of a traditional measure including identification of “starting” and “stopping” rules. CAT's most attractive advantage is its efficiency. Greater measurement precision can be achieved with fewer items. Disadvantages of CAT include the high cost and level of technical expertise required to develop a CAT.
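The role of "starting" and "stopping" rules can be made concrete with a minimal sketch of an adaptive testing loop. It assumes a two-parameter logistic (2PL) IRT model, maximum-information item selection, and expected a posteriori (EAP) scoring with a standard normal prior; these are common choices but are not taken from the article, and the item parameters and names below are hypothetical.

```python
import math

# Quadrature grid over the latent trait, from -4 to +4 in steps of 0.1.
GRID = [-4.0 + 0.1 * i for i in range(81)]


def p_endorse(theta, a, b):
    """2PL probability of endorsing an item (discrimination a, location b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def eap(responses):
    """EAP estimate and posterior SD under a standard normal prior.

    responses is a list of ((a, b), endorsed) pairs.
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized prior density
        for (a, b), endorsed in responses:
            p = p_endorse(t, a, b)
            w *= p if endorsed else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, start_theta=0.0, se_stop=0.4, max_items=10):
    """Administer items adaptively.

    Starting rule: begin at start_theta.
    Stopping rule: stop once the posterior SD is at or below se_stop,
    max_items have been given, or the bank is exhausted.
    """
    theta, se = start_theta, float("inf")
    remaining = list(bank)
    responses = []
    while remaining and len(responses) < max_items and se > se_stop:
        # Pick the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda ab: item_information(theta, *ab))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap(responses)
    return theta, se, len(responses)


if __name__ == "__main__":
    # Hypothetical bank: nine items with equal discrimination.
    bank = [(1.5, b / 2) for b in range(-4, 5)]
    # Hypothetical respondent who endorses every item located below 1.0.
    print(run_cat(bank, lambda item: item[1] < 1.0))
```

Tightening `se_stop` trades respondent burden for precision, which is the efficiency argument made above: the loop ends as soon as the estimate is precise enough rather than after a fixed number of items.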
Researchers, clinicians, and patients benefit from the availability of psychometrically rigorous measures that are not burdensome. CAT outcome measures hold substantial promise in this regard, but their development is not without challenges.
Measurement; quality of life; psychometrics; reliability
Routine Outcome Monitoring (ROM) is used as a means to enrich the process of treatment with feedback on patient outcomes, facilitating patient involvement and shared decision making. While traditional ROM measures focus on retrospective accounts of symptoms, novel mHealth technology makes it possible to collect real life, in-the-moment ambulatory data that allow for an ecologically valid assessment of personalized and contextualized emotional and behavioural adjustment in the flow of daily life (mROM).
In a sample of 34 patients with major depressive disorder, treated with antidepressants, the combined effect of treatment and natural course was examined over a period of 18 weeks with Ecological Momentary Assessment (EMA). EMA consisted of repeated, within-subject, mini-measurements of experience (eg positive affect, negative affect, medication side effects) and context (eg stressors, situations, activities) at 10 unselected semi-random moments per day, for a period of six days, repeated three times over the 18-week period (baseline, week 6 and week 18).
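One way to picture a semi-random sampling scheme of this kind is a stratified schedule: the waking day is split into equal blocks and one prompt is drawn at random within each block. The sketch below is an assumption about how such a schedule could be generated, not the procedure used in the study; the waking window (07:30 to 22:30) and the function name are hypothetical.

```python
import random


def daily_beep_times(n_beeps=10, start_min=7 * 60 + 30, end_min=22 * 60 + 30, rng=None):
    """Return n_beeps prompt times (minutes after midnight), one per equal block.

    Each time is drawn uniformly within its own block, so prompts are
    unpredictable to the respondent but still spread across the day.
    """
    if rng is None:
        rng = random.Random()
    block = (end_min - start_min) / n_beeps
    return [start_min + (i + rng.random()) * block for i in range(n_beeps)]


if __name__ == "__main__":
    # Six sampling days with ten prompts each, as in the design described above.
    rng = random.Random(2024)  # fixed seed only to make the example repeatable
    for day in range(1, 7):
        times = daily_beep_times(rng=rng)
        print(day, [f"{int(t) // 60:02d}:{int(t) % 60:02d}" for t in times])
```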
EMA measures of emotional and behavioural adjustment were sensitive to the effects of treatment and natural course over the 18-week period, particularly EMA measures focussing on positive mood states and the ability to use natural rewards (impact of positive events on positive mood states), with standardized effect sizes of 0.4–0.5. EMA measures of activities, social interaction, stress-sensitivity and negative mood states were also sensitive to change over time.
This study supports the use of mROM as a means to involve the patient in the process of needs assessment and treatment. EMA data are meaningful to the patient, as they reflect daily life circumstances. Assessment of treatment response with mROM data allows for an interpretation of the effect of treatment at the level of daily life emotional and social adjustment – as an index of health, obviating the need for an exclusive focus on traditional measures of ‘sickness’.