1.  Psychometric Properties of Self-Report Concussion Scales and Checklists 
Journal of Athletic Training  2012;47(2):221-223.
Reference/Citation:
Alla S, Sullivan SJ, Hale L, McCrory P. Self-report scales/checklists for the measurement of concussion symptoms: a systematic review. Br J Sports Med. 2009;43 (suppl 1):i3–i12.
Clinical Question:
Which self-report symptom scales or checklists are psychometrically sound for clinical use to assess sport-related concussion?
Data Sources:
Articles available in full text, published from the establishment of each database through December 2008, were identified from PubMed, Medline, CINAHL, Scopus, Web of Science, SPORTDiscus, PsycINFO, and AMED. Search terms included brain concussion, signs or symptoms, and athletic injuries, combined with the AND Boolean operator, and searches were limited to studies published in English. The authors also hand searched the reference lists of retrieved articles. Additional searches of books, conference proceedings, theses, and Web sites of commercial scales were conducted to provide further information about the psychometric properties and development of scales in articles meeting the inclusion criteria.
Study Selection:
Articles were included if they identified all the items on the scale and the article was either an original research report describing the use of scales in the evaluation of concussion symptoms or a review article that discussed the use or development of concussion symptom scales. Only articles published in English and available in full text were included.
Data Extraction:
From each study, the following information was extracted by the primary author using a standardized protocol: study design, publication year, participant characteristics, reliability of the scale, and details of the scale or checklist, including name, number of items, time of measurement, format, mode of report, data analysis, scoring, and psychometric properties. A quality assessment of included studies was performed using 16 items from the Downs and Black checklist, assessing reporting, internal validity, and external validity.
Main Results:
The initial database search identified 421 articles. After 131 duplicate articles were removed, 290 articles remained and were added to 17 articles found during the hand search, for a total of 307 articles; of those, 295 were available in full text. Sixty articles met the inclusion criteria and were used in the systematic review. The quality of the included studies ranged from 9 to 15 points out of a maximum quality score of 17. The included articles were published between 1995 and 2008 and included a collective total of 5864 concussed athletes and 5032 nonconcussed controls, most of whom participated in American football. The majority of the studies were descriptive studies monitoring the resolution of concussive self-report symptoms compared with either a preseason baseline or healthy control group, with a smaller number of studies (n = 8) investigating the development of a scale.
The authors initially identified 20 scales used among the 60 included articles. Further review revealed that 14 of the scales were variations of the Pittsburgh Steelers postconcussion scale (the Post-Concussion Scale, Post-Concussion Scale: Revised, Post-Concussion Scale: ImPACT, Post-Concussion Symptom Scale: Vienna, Graded Symptom Checklist [GSC], Head Injury Scale, McGill ACE Post-Concussion Symptoms Scale, and CogState Sport Symptom Checklist), leaving 6 core scales, which the authors discussed further. The 6 core scales were the Pittsburgh Steelers Post-Concussion Scale (17 items), Post-Concussion Symptom Assessment Questionnaire (10 items), Concussion Resolution Index postconcussion questionnaire (15 items), Signs and Symptoms Checklist (34 items), Sport Concussion Assessment Tool (SCAT) postconcussion symptom scale (25 items), and Concussion Symptom Inventory (12 items). Each of the 6 core scales includes symptoms associated with sport-related concussion; however, the number of items varied across scales. Most scales used a 7-point Likert scale; a smaller number used a dichotomous (yes/no) classification.
Only 7 of the 20 scales had published psychometric properties, and only 1 scale, the Concussion Symptom Inventory, was empirically driven (Rasch analysis), with development of the scale occurring before its clinical use. Internal consistency (Cronbach α) was reported for the Post-Concussion Scale (.87), Post-Concussion Scale: ImPACT 22-item (.88–.94), Head Injury Scale 9-item (.78), and Head Injury Scale 16-item (.84). Test-retest reliability has been reported only for the Post-Concussion Scale (Spearman r = .55) and the Post-Concussion Scale: ImPACT 21-item (Pearson r = .65). With respect to validity, the SCAT postconcussion scale has demonstrated face and content validity, the Post-Concussion Scale: ImPACT 22-item and Head Injury Scale 9-item have reported construct validity, and the Head Injury Scale 9-item and 16-item have published factorial validity.
Sensitivity and specificity have been reported only with the GSC (0.89 and 1.0, respectively) and the Post-Concussion Scale: ImPACT 21-item when combined with the neurocognitive component of ImPACT (0.819 and 0.849, respectively). Meaningful change scores were reported for the Post-Concussion Scale (14.8 points), Post-Concussion Scale: ImPACT 22-item (6.8 points), and Post-Concussion Scale: ImPACT 21-item (standard error of the difference = 7.17; 80% confidence interval = 9.18).
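The Cronbach α values quoted above summarize internal consistency, the degree to which items on a symptom scale covary. A minimal Python sketch of the standard formula, using hypothetical Likert-style responses rather than data from any of the reviewed scales:

```python
def cronbach_alpha(items):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances
    / variance of total scores), for a respondents-by-items matrix."""
    k = len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[j] for row in items]) for j in range(k)]
    total_var = var([sum(row) for row in items])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# hypothetical symptom ratings: 4 respondents x 3 items
data = [
    [1, 2, 1],
    [3, 3, 4],
    [2, 2, 2],
    [4, 5, 4],
]
print(round(cronbach_alpha(data), 3))  # 0.954
```

Values in the .78 to .94 range reported for the scales above indicate that their items largely covary, i.e., tap a common construct.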
Conclusions:
Numerous scales exist for measuring the number and severity of concussion-related symptoms, with most evolving from the neuropsychology literature pertaining to head-injured populations. However, very few of these were created in a systematic manner that follows scale development processes and have published psychometric properties. Clinicians need to understand these limitations when choosing and using a symptom scale for inclusion in a concussion assessment battery. Future authors should assess the underlying constructs and measurement properties of currently available scales and use the ever-increasing prospective data pools of concussed athlete information to develop scales following appropriate, systematic processes.
PMCID: PMC3418135  PMID: 22488289
mild traumatic brain injuries; evaluation; reliability; validity; sensitivity; specificity
2.  Limitations of True Score Variance to Measure Discriminating Power: Psychometric Simulation Study 
Journal of Abnormal Psychology  2010;119(2):300-306.
Demonstrating a specific cognitive deficit usually involves comparing patients' performance on two or more tests. The psychometric confound occurs if the psychometric properties of these tests lead patients to show greater cognitive deficits in one domain. One way to avoid the psychometric confound is to use tests with a similar level of discriminating power, a test's ability to index true individual differences in classical psychometric theory. One suggested way to measure discriminating power is to calculate true score variance (Chapman & Chapman, 1978). Despite the centrality of these formulations, there has been no systematic examination of the relationship between the observable property of true score variance and the latent property of discriminating power. We simulated administrations of free-response and forced-choice tests by creating different replicable ability scores for two groups across wide ranges of psychometric properties (i.e., difficulty, reliability, observed variance, and number of items) and computing an ideal index of discriminating power. Simulation results indicated that true score variance had only a limited ability to predict discriminating power (it explained about 10% of the variance in replicable ability scores). Furthermore, this predictive ability varied across tests with wide ranges of psychometric variables, such as difficulty, observed variance, reliability, and number of items. Discriminating power depends on a complicated interaction of psychometric properties that is not well estimated solely by a test's true score variance.
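The simulation logic above rests on classical test theory's decomposition of an observed score into true score plus error, with true score variance equal to observed variance times reliability. A toy re-creation of that decomposition, with arbitrary parameters and not the authors' actual simulation code:

```python
import random

random.seed(0)  # reproducible toy run

def simulate_scores(n, true_sd, error_sd):
    """Classical test theory: observed score = true score + random error."""
    pairs = []
    for _ in range(n):
        t = random.gauss(0, true_sd)
        pairs.append((t, t + random.gauss(0, error_sd)))
    return pairs

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

pairs = simulate_scores(10_000, true_sd=3.0, error_sd=4.0)
observed_var = variance([obs for _, obs in pairs])
true_var = variance([t for t, _ in pairs])
reliability = true_var / observed_var  # roughly 9 / 25 = 0.36 here
```

True score variance (observed variance times reliability) captures only part of what determines how well a test separates groups, which is the gap the simulation study quantifies.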
doi:10.1037/a0018400
PMCID: PMC2869469  PMID: 20455603
3.  Item response theory analysis of cognitive tests in people with dementia: a systematic review 
BMC Psychiatry  2014;14:47.
Background
Performance on psychometric tests is key to diagnosis and monitoring treatment of dementia. Results are often reported as a total score, but there is additional information in individual items of tests which vary in their difficulty and discriminatory value. Item difficulty refers to an ability level at which the probability of responding correctly is 50%. Discrimination is an index of how well an item can differentiate between patients of varying levels of severity. Item response theory (IRT) analysis can use this information to examine and refine measures of cognitive functioning. This systematic review aimed to identify all published literature which had applied IRT to instruments assessing global cognitive function in people with dementia.
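The difficulty and discrimination definitions above correspond to the b and a parameters of the two-parameter logistic (2PL) IRT model. A minimal sketch with hypothetical parameter values:

```python
import math

def p_correct(theta, a, b):
    """2-parameter logistic (2PL) IRT model: probability of a correct
    response given ability theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is exactly 50%, matching the
# definition of item difficulty given above.
assert p_correct(0.0, a=1.5, b=0.0) == 0.5

# A more discriminating item separates abilities around b more sharply:
sharp = p_correct(1.0, a=2.0, b=0.0) - p_correct(-1.0, a=2.0, b=0.0)
flat = p_correct(1.0, a=0.5, b=0.0) - p_correct(-1.0, a=0.5, b=0.0)
```

Here `sharp` exceeds `flat`: the high-discrimination item distinguishes patients one logit above and below its difficulty much better than the low-discrimination item does.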
Methods
A systematic review was carried out across Medline, Embase, PsycINFO, and CINAHL. Search terms relating to IRT and dementia were combined to find all IRT analyses of global functioning scales in dementia.
Results
Of 384 articles identified, four studies met the inclusion criteria, covering a total of 2,920 people with dementia from six centers in two countries. These studies used three cognitive tests (MMSE, ADAS-Cog, BIMCT) and three IRT methods (item characteristic curve analysis, Samejima's graded response model, and the 2-parameter model). Memory items were the most difficult. Naming the date in the MMSE and the memory items, specifically word recall, of the ADAS-Cog were the most discriminatory.
Conclusions
Four published studies were identified which used IRT on global cognitive tests in people with dementia. This technique increased the interpretative power of the cognitive scales, and could be used to provide clinicians with key items from a larger test battery which would have high predictive value. There is need for further studies using IRT in a wider range of tests involving people with dementia of different etiology and severity.
doi:10.1186/1471-244X-14-47
PMCID: PMC3931670  PMID: 24552237
Item response theory; Dementia; Psychometrics; Cognition; Alzheimer disease; MMSE; Systematic review
4.  Developing a Short Form of Benton’s Judgment of Line Orientation Test: An Item Response Theory Approach 
The Clinical neuropsychologist  2011;25(4):670-684.
The Judgment of Line Orientation (JLO) test was developed to be, in Arthur Benton's words, “as pure a measure of one aspect of spatial thinking as could be conceived.” The JLO test has been widely used in neuropsychological practice for decades. The test has high test-retest reliability (Franzen, 2000), as well as good neuropsychological construct validity, as shown through neuroanatomical localization studies (Tranel, Vianna, Manzel, Damasio, & Grabowski, 2009). Despite its popularity and strong psychometric properties, the full-length version of the test (30 items) has been criticized as unnecessarily long (Strauss, Sherman, & Spreen, 2006). There have been many attempts at developing short forms; however, these forms have been limited in their ability to estimate scores accurately. Taking advantage of a large sample of JLO performances from 524 neurological patients with focal brain lesions, we used techniques from Item Response Theory (IRT) to estimate each item's difficulty and power to discriminate among various levels of ability. A random-item IRT model was used to estimate the influence of item stimulus properties as predictors of item difficulty. These results were used to optimize the selection of items for a shorter method of administration that maintained comparability with the full form using significantly fewer items. The effectiveness of this method was replicated in a second sample of 82 healthy elderly participants. The findings should help broaden the clinical utility of the JLO and enhance its diagnostic applications.
doi:10.1080/13854046.2011.564209
PMCID: PMC3094715  PMID: 21469016
5.  The AMC Linear Disability Score project in a population requiring residential care: psychometric properties 
Background
Currently there is much interest in the flexible framework offered by item banks for measuring patient-relevant outcomes, including functional status. However, few item banks have been developed to quantify functional status as expressed by the ability to perform activities of daily life.
Method
This paper examines the psychometric properties of the AMC Linear Disability Score (ALDS) project item bank using an item response theory model and full information factor analysis. Data were collected from 555 respondents on a total of 160 items.
Results
Following the analysis, 79 items remained in the item bank. The other 81 items were excluded because of: difficulties in presentation (1 item); low levels of variation in response pattern (28 items); significant differences in measurement characteristics for males and females or for respondents under or over 85 years old (26 items); or lack of model fit to the data at item level (26 items).
Conclusions
It is conceivable that the item bank will have different measurement characteristics for other patient or demographic populations. However, these results indicate that the ALDS item bank has sound psychometric properties for respondents in residential care settings and could form a stable base for measuring functional status in a range of situations, including the implementation of computerised adaptive testing of functional status.
doi:10.1186/1477-7525-2-42
PMCID: PMC514531  PMID: 15291958
6.  Measurement precision of the disability for back pain scale by applying Rasch analysis 
Background
The Oswestry Disability Index (ODI) is widely used for patients with back pain. However, few studies have examined its psychometric properties using modern measurement theory. The purpose of this study was to investigate the psychometric properties of the ODI in patients with back pain using Rasch analysis.
Methods
A total of 408 patients with back pain participated in this cross-sectional study. Patients were recruited from the orthopedic, neurosurgery, rehabilitation departments and pain clinic of two hospitals. Rasch analysis was used to examine the Chinese version of ODI 2.1 for unidimensionality, item difficulty, category function, differential item functioning, and test information.
Results
The fit statistics showed 10 items of the ODI fitted the model’s expectation as a unidimensional scale. The ODI measured the different levels of functional limitation without skewing toward the lower or higher levels of disability. No significant ceiling and floor effects and gaps among the items were found. The reliability was high and the test information curve demonstrated precise dysfunction estimation.
Conclusions
Our results showed that the ODI is a unidimensional questionnaire with high reliability. The ODI can precisely estimate the level of dysfunction, and the item difficulty of the ODI matches the person ability. For clinical application, using logits scores could precisely represent the disability level, and using the item difficulty could help clinicians design progressive programs for patients with back pain.
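The logit scores mentioned in the conclusion come from the Rasch model, in which the probability of endorsing an item depends only on the difference between person ability and item difficulty, both on the same logit scale. A minimal sketch with hypothetical item difficulties, not the published ODI calibrations:

```python
import math

def rasch_p(person, item):
    """Rasch model: P(endorse) = 1 / (1 + exp(-(ability - difficulty))),
    with ability and difficulty both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(person - item)))

def expected_raw_score(person, item_difficulties):
    """Expected raw scale score: sum of endorsement probabilities."""
    return sum(rasch_p(person, d) for d in item_difficulties)

# hypothetical difficulties for a 10-item scale, ordered easy to hard
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(round(expected_raw_score(0.0, difficulties), 2))  # 4.58
```

A person whose ability equals an item's difficulty has a 50% chance of endorsing it, which is how item difficulties and person abilities end up on one common scale.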
doi:10.1186/1477-7525-11-119
PMCID: PMC3717282  PMID: 23866814
Back pain; Rasch analysis; Oswestry disability index; Functional measure; Disability
7.  The grounded psychometric development and initial validation of the Health Literacy Questionnaire (HLQ) 
BMC Public Health  2013;13:658.
Background
Health literacy has become an increasingly important concept in public health. We sought to develop a comprehensive measure of health literacy capable of diagnosing health literacy needs across individuals and organisations by utilizing perspectives from the general population, patients, practitioners and policymakers.
Methods
Using a validity-driven approach we undertook grounded consultations (workshops and interviews) to identify broad conceptually distinct domains. Questionnaire items were developed directly from the consultation data following a strict process aiming to capture the full range of experiences of people currently engaged in healthcare through to people in the general population. Psychometric analyses included confirmatory factor analysis (CFA) and item response theory. Cognitive interviews were used to ensure questions were understood as intended. Items were initially tested in a calibration sample from community health, home care and hospital settings (N=634) and then in a replication sample (N=405) comprising recent emergency department attendees.
Results
Initially, 91 items were generated across 6 scales with agree/disagree response options and 5 scales with difficulty-in-undertaking-tasks response options. Cognitive testing revealed that most items were well understood and only minor re-wording was required. Psychometric testing of the calibration sample identified 34 poorly performing or conceptually redundant items, which were removed, resulting in 10 scales. These were then tested in the replication sample and refined to yield 9 final scales comprising 44 items. A 9-factor CFA model was fitted to these items with no cross-loadings or correlated residuals allowed. Given the very restricted nature of the model, the fit was quite satisfactory: χ²(WLSMV, 866 df) = 2927, p < .001, CFI = 0.936, TLI = 0.930, RMSEA = 0.076, and WRMR = 1.698. Final scales included: Feeling understood and supported by healthcare providers; Having sufficient information to manage my health; Actively managing my health; Social support for health; Appraisal of health information; Ability to actively engage with healthcare providers; Navigating the healthcare system; Ability to find good health information; and Understand health information well enough to know what to do.
Conclusions
The HLQ covers 9 conceptually distinct areas of health literacy to assess the needs and challenges of a wide range of people and organisations. Given the validity-driven approach, the HLQ is likely to be useful in surveys, intervention evaluation, and studies of the needs and capabilities of individuals.
doi:10.1186/1471-2458-13-658
PMCID: PMC3718659  PMID: 23855504
Health literacy; Measurement; Assessment; Health competencies; Psychometrics; HLQ
8.  ITEM ANALYSIS OF THREE SPANISH NAMING TESTS: A CROSS-CULTURAL INVESTIGATION 
NeuroRehabilitation  2009;24(1):75-85.
Neuropsychological evaluations conducted in the United States and abroad commonly include the use of tests translated from English to Spanish. The use of translated naming tests for evaluating predominantly Spanish speakers has recently been challenged on the grounds that translating test items may compromise a test's construct validity. The Texas Spanish Naming Test (TNT) has been developed in Spanish specifically for use with Spanish speakers; however, it is unlikely that patients from diverse Spanish-speaking geographical regions will perform uniformly on a naming test. The present study evaluated and compared the internal consistency and patterns of item difficulty and item discrimination for the TNT and two commonly used translated naming tests in three countries (i.e., United States, Colombia, Spain). Two hundred fifty-two subjects (126 demented, 116 nondemented) across the three countries were administered the TNT, the Modified Boston Naming Test-Spanish (MBNT-S), and the naming subtest from the CERAD. The TNT demonstrated internal consistency superior to its counterparts, an item-difficulty pattern superior to that of the CERAD naming test, and an item-discrimination pattern superior to that of the MBNT-S across countries. Overall, all three Spanish naming tests differentiated nondemented and moderately demented individuals, but the results suggest the items of the TNT are the most appropriate for use with Spanish speakers. Preliminary normative data for the three tests in each country are provided.
doi:10.3233/NRE-2009-0456
PMCID: PMC2666471  PMID: 19208960
9.  Better assessment of physical function: item improvement is neglected but essential 
Arthritis Research & Therapy  2009;11(6):R191.
Introduction
Physical function is a key component of patient-reported outcome (PRO) assessment in rheumatology. Modern psychometric methods, such as Item Response Theory (IRT) and Computerized Adaptive Testing, can materially improve measurement precision at the item level. We present the qualitative and quantitative item-evaluation process for developing the Patient Reported Outcomes Measurement Information System (PROMIS) Physical Function item bank.
Methods
The process was stepwise: we searched extensively to identify extant Physical Function items and then classified and selectively reduced the item pool. We evaluated retained items for content, clarity, relevance and comprehension, reading level, and translation ease via expert review, patient surveys, focus groups, and cognitive interviews. We then assessed items using classical test theory and IRT, used confirmatory factor analyses to estimate item parameters, and used graded response modeling for parameter estimation. We retained the 20 Legacy (original) Health Assessment Questionnaire Disability Index (HAQ-DI) items and the 10 SF-36 PF-10 items for comparison. Subjects were from rheumatoid arthritis, osteoarthritis, and healthy aging cohorts (n = 1,100) and a national Internet sample of 21,133 subjects.
Results
We identified 1,860 items. After qualitative and quantitative evaluation, 124 newly developed PROMIS items composed the PROMIS item bank, which included revised Legacy items with good fit that met IRT model assumptions. Results showed that the clearest and best-understood items were simple, in the present tense, and straightforward. Basic tasks (like dressing) were rated as more relevant and important than complex ones (like dancing). Revised HAQ-DI and PF-10 items with five response options had higher item-information content than comparable original Legacy items with fewer response options. IRT analyses showed that the Physical Function domain satisfied general criteria for unidimensionality, with one-, two-, three-, and four-factor models having comparable fits. Correlations between factors in the test data sets were > 0.90.
Conclusions
Item improvement must underlie attempts to improve outcome assessment. The clear, personally important and relevant, ability-framed items in the PROMIS Physical Function item bank perform well in PRO assessment. They will benefit from further study and application in a wider variety of rheumatic diseases in diverse clinical groups, including those at the extremes of physical functioning, and in different administration modes.
doi:10.1186/ar2890
PMCID: PMC3003539  PMID: 20015354
10.  Bifactor and Item Response Theory Analyses of Interviewer Report Scales of Cognitive Impairment in Schizophrenia 
Psychological assessment  2011;23(1):245-261.
We conducted psychometric analyses of two interview-based measures of cognitive deficits: the 21-item Clinical Global Impression of Cognition in Schizophrenia (CGI-CogS; Ventura et al., 2008) and the 20-item Schizophrenia Cognition Rating Scale (SCoRS; Keefe et al., 2006), which were administered on two occasions to a sample of people with schizophrenia. Traditional psychometrics, bifactor analysis, and item response theory (IRT) methods were used to explore item functioning and dimensionality and to compare the instruments. Despite containing similar item content, responses to the CGI-CogS demonstrated superior psychometric properties (e.g., higher item intercorrelations, better spread of ratings across response categories) relative to the SCoRS. We argue that these differences arise mainly from the differential use of prompts and how the items are phrased and scored. Bifactor analysis demonstrated that although both measures capture a broad range of cognitive functioning (e.g., working memory, social cognition), the common variance on each is overwhelmingly explained by a single general factor. IRT analyses of the combined pool of 41 items showed that measurement precision peaks in the mild to moderate range of cognitive impairment. Finally, simulated adaptive testing revealed that only about 10 to 12 items are necessary to achieve latent trait estimates with reasonably small standard errors for most individuals. This suggests that these interview-based measures of cognitive deficits could be shortened without loss of measurement precision.
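The peaked-precision result has a standard IRT interpretation: each item contributes Fisher information near its difficulty, and the standard error of the ability estimate is the reciprocal square root of total information. A sketch with hypothetical 2PL parameters, not the CGI-CogS or SCoRS calibrations:

```python
import math

def item_info(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def ability_se(theta, items):
    """Standard error of the ability estimate = 1 / sqrt(total information)."""
    return 1.0 / math.sqrt(sum(item_info(theta, a, b) for a, b in items))

# a hypothetical 12-item pool with difficulties clustered at b = 1.0
# (mild impairment), each with discrimination a = 1.2
pool = [(1.2, 1.0)] * 12
print(round(ability_se(1.0, pool), 3))  # 0.481
```

Away from the cluster (e.g., theta = -2) the same pool yields a much larger standard error, which is why precision concentrates where item difficulties lie, and why roughly a dozen well-placed items can suffice in adaptive testing.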
doi:10.1037/a0021501
PMCID: PMC3183749  PMID: 21381848
item response theory; CGI-CogS; SCoRS; schizophrenia and cognitive deficits; computerized adaptive testing
11.  Assessment of health-related quality of life in arthritis: conceptualization and development of five item banks using item response theory 
Background
Modern psychometric methods based on item response theory (IRT) can be used to develop adaptive measures of health-related quality of life (HRQL). Adaptive assessment requires an item bank for each domain of HRQL. The purpose of this study was to develop item banks for five domains of HRQL relevant to arthritis.
Methods
About 1,400 items were drawn from published questionnaires or developed from focus groups and individual interviews and classified into 19 domains of HRQL. We selected the following 5 domains relevant to arthritis and related conditions: Daily Activities, Walking, Handling Objects, Pain or Discomfort, and Feelings. Based on conceptual criteria and pilot testing, 219 items were selected for further testing. A questionnaire was mailed to patients from two hospital-based clinics and a stratified random community sample. Dimensionality of the domains was assessed through factor analysis. Items were analyzed with the Generalized Partial Credit Model as implemented in Parscale. We used graphical methods and a chi-square test to assess item fit. Differential item functioning was investigated using logistic regression.
Results
Data were obtained from 888 individuals with arthritis. The five domains were sufficiently unidimensional for an IRT-based analysis. Thirty-one items were deleted due to lack of fit or differential item functioning. Daily Activities had the narrowest range for the item location parameter (-2.24 to 0.55) and Handling Objects had the widest range (-1.70 to 2.27). The mean (median) slope parameter for the items ranged from 1.15 (1.07) in Feelings to 1.73 (1.75) in Walking. The final item banks are comprised of 31–45 items each.
Conclusion
We have developed IRT-based item banks to measure HRQL in 5 domains relevant to arthritis. The items in the final item banks provide adequate psychometric information for a wide range of functional levels in each domain.
doi:10.1186/1477-7525-4-33
PMCID: PMC1550394  PMID: 16749932
12.  Development of the Two Stage Rapid Estimate of Adult Literacy in Dentistry (TS-REALD) 
This work proposes a revision of the 30-item Rapid Estimate of Adult Literacy in Dentistry (REALD-30) into a more efficient and easier-to-use two-stage scale. Using a sample of 1,405 individuals (primarily women) enrolled in a Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), the present work applies principles of item response theory and multi-stage testing to revise the REALD-30 into a two-stage test of oral health literacy, named the Two-Stage REALD (TS-REALD), which maximizes score precision at various levels of participant ability. Based on the participant's score on the 5-item first stage (i.e., the routing test), one of three stage-two tests is administered: a 4-item Low Literacy test, a 6-item Average Literacy test, or a 3-item High Literacy test. The reliability of TS-REALD scores is greater than .85 across a wide range of ability. The TS-REALD was found to be predictive of the perceived impact of oral conditions on well-being after controlling for educational level, overall health, dental health, and a general health literacy measure. Although it contains approximately one-third of the items of the original scale, the TS-REALD maintained similar psychometric qualities.
doi:10.1111/j.1600-0528.2011.00619.x
PMCID: PMC3165105  PMID: 21592170
Dental Health Literacy; Dental Care; Oral Health Quality of Life; Health Literacy; Psychometrics
13.  Differential Item Functioning of the Boston Naming Test in Cognitively Normal African American and Caucasian Older Adults 
Scores on the Boston Naming Test (BNT) are frequently lower for African American than for Caucasian adults. Although demographically based norms can mitigate the impact of this discrepancy on the likelihood of erroneous diagnostic impressions, a growing consensus suggests that group norms do not sufficiently address or advance our understanding of the underlying psychometric and sociocultural factors that lead to between-group score discrepancies. Using item response theory and methods to detect differential item functioning (DIF), the current investigation moves beyond comparisons of the summed total score to examine whether the conditional probability of responding correctly to individual BNT items differs between African American and Caucasian adults. Participants included 670 adults age 52 and older who took part in Mayo's Older Americans and Older African Americans Normative Studies. Under a 2-parameter logistic IRT framework and after correction for the false discovery rate, 12 items were shown to demonstrate DIF. Six of these 12 items (“dominoes,” “escalator,” “muzzle,” “latch,” “tripod,” and “palette”) were also identified in additional analyses using hierarchical logistic regression models and represent the strongest evidence for race/ethnicity-based DIF. These findings afford a finer characterization of the psychometric properties of the BNT and expand our understanding of between-group performance.
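The study detected DIF with IRT and hierarchical logistic regression; the core idea in both is that, conditional on overall ability, an item's success rate should not differ by group. A deliberately crude, hypothetical stratified screen illustrating that conditioning step (not the study's analysis):

```python
def dif_check(responses, groups, totals, n_strata=3):
    """Crude DIF screen for one item: compare the item's proportion
    correct across two groups within strata of the total (matching)
    score. Large gaps at matched ability suggest DIF; formal analyses
    use logistic regression or IRT models, as in the study above."""
    lo, hi = min(totals), max(totals)
    width = (hi - lo) / n_strata or 1.0
    gaps = []
    for s in range(n_strata):
        upper = lo + (s + 1) * width + (1 if s == n_strata - 1 else 0)
        members = [i for i, t in enumerate(totals)
                   if lo + s * width <= t < upper]
        by_group = {}
        for i in members:
            by_group.setdefault(groups[i], []).append(responses[i])
        if len(by_group) == 2:
            p_a, p_b = (sum(v) / len(v) for v in by_group.values())
            gaps.append(abs(p_a - p_b))
    return max(gaps, default=0.0)

# hypothetical: at the same matching score, group A always names the
# item correctly and group B never does -> maximal DIF signal of 1.0
item = [1, 1, 1, 0, 0, 0]
group = ["A", "A", "A", "B", "B", "B"]
matching = [5, 5, 5, 5, 5, 5]
print(dif_check(item, group, matching))  # 1.0
```

Matching on total score before comparing groups is what separates DIF (a property of the item) from a genuine between-group ability difference.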
doi:10.1017/S1355617709990361
PMCID: PMC2835360  PMID: 19570311
Boston Naming Test; Item response theory; Differential item functioning; Ethnicity; Race; Bias
14.  The five item Barthel index 
OBJECTIVES—Routine data collection is now considered mandatory. Therefore, staff rated clinical scales that consist of multiple items should have the minimum number of items necessary for rigorous measurement. This study explores the possibility of developing a short form Barthel index, suitable for use in clinical trials, epidemiological studies, and audit, that satisfies criteria for rigorous measurement and is psychometrically equivalent to the 10 item instrument.
METHODS—Data were analysed from 844 consecutive admissions to a neurological rehabilitation unit in London. Random half samples were generated. Short forms were developed in one sample (n=419), by selecting items with the best measurement properties, and tested in the other (n=418). For each of the 10 items of the BI, item total correlations and effect sizes were computed and rank ordered. The best items were defined as those with the lowest cross product of these rank orderings. The acceptability, reliability, validity, and responsiveness of three short form BIs (five, four, and three item) were determined and compared with the 10 item BI. Agreement between scores generated by short forms and 10 item BI was determined using intraclass correlation coefficients and the method of Bland and Altman.
RESULTS—The five best items in this sample were transfers, bathing, toilet use, stairs, and mobility. Of the three short forms examined, the five item BI had the best measurement properties and was psychometrically equivalent to the 10 item BI. Agreement between scores generated by the two measures for individual patients was excellent (ICC=0.90) but not identical (limits of agreement=1.84±3.84).
CONCLUSIONS—The five item short form BI may be a suitable outcome measure for group comparison studies in comparable samples. Further evaluations are needed. Results demonstrate a fundamental difference between assessment and measurement and the importance of incorporating psychometric methods in the development and evaluation of health measures.


doi:10.1136/jnnp.71.2.225
PMCID: PMC1737527  PMID: 11459898
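The method of Bland and Altman cited above reduces to computing the mean difference (bias) between paired scores and its 95% limits of agreement (bias ± 1.96 SD of the differences). A minimal sketch, using invented paired scores rather than the study's data:

```python
import statistics

def bland_altman_limits(x, y):
    """Bias (mean difference) and 95% limits of agreement between two
    measures taken on the same patients (Bland & Altman method)."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical paired totals: full 10-item BI vs a rescaled short form
full_bi = [90, 85, 70, 60, 95, 80]
short_bi = [88, 86, 68, 62, 96, 78]
bias, (lower, upper) = bland_altman_limits(full_bi, short_bi)
```

Whereas the ICC summarizes agreement for the group, the limits of agreement show the range of disagreement to expect for an individual patient, which is why the abstract reports both.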
15.  Rasch Analysis of the Fullerton Advanced Balance (FAB) Scale 
Physiotherapy Canada  2011;63(1):115-125.
ABSTRACT
Purpose: This cross-sectional study explores the psychometric properties and dimensionality of the Fullerton Advanced Balance (FAB) Scale, a multi-item balance test for higher-functioning older adults.
Methods: Participants (n=480) were community-dwelling adults able to ambulate independently. Data gathering consisted of survey and balance performance assessment. Psychometric properties were assessed using Rasch analysis.
Results: Mean age of participants was 76.4 (SD=7.1) years. Mean FAB Scale scores were 24.7/40 (SD=7.5). Analyses for scale dimensionality showed that 9 of the 10 items fit a unidimensional measure of balance. Item 10 (Reactive Postural Control) did not fit the model. The reliability of the scale to separate persons was 0.81 out of 1.00; the reliability of the scale to separate items in terms of their difficulty was 0.99 out of 1.00. Cronbach's alpha for a 10-item model was 0.805. Items of differing difficulties formed a useful ordinal hierarchy for scaling patterns of expected balance ability scoring for a normative population.
Conclusion: The FAB Scale appears to be a reliable and valid tool to assess balance function in higher-functioning older adults. The test was found to discriminate among participants of varying balance abilities. Further exploration of concurrent validity of Rasch-generated expected item scoring patterns should be undertaken to determine the test's diagnostic and prescriptive utility.
doi:10.3138/ptc.2009-51
PMCID: PMC3024205  PMID: 22210989
aged; balance; fall risk assessment tool; falls; psychometrics; FAB Scale; aînés; chutes; équilibre; outil d'évaluation du risque de chute; psychométrie
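Cronbach's alpha, reported above as 0.805 for the 10-item model, is computed from the item variances and the variance of the total scores. A stdlib-only sketch; the item-score columns below are invented:

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns
    (each column holds one item's scores across all respondents)."""
    k = len(items)          # number of items
    n = len(items[0])       # number of respondents

    def pop_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(col[i] for col in items) for i in range(n)]
    item_var_sum = sum(pop_var(col) for col in items)
    return k / (k - 1) * (1.0 - item_var_sum / pop_var(totals))

# Hypothetical scores: two items, four respondents
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])
```

Alpha rises toward 1.0 as items covary strongly (two identical columns give exactly 1.0), which is why a value of ~0.8 is usually read as adequate internal consistency for a clinical scale.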
16.  Development and preliminary psychometric testing of a new OA pain measure – an OARSI/OMERACT initiative 
Osteoarthritis and Cartilage  2008;16(4):409-414.
Summary
Objective
To evaluate the measurement properties of a new osteoarthritis (OA) pain measure.
Methods
The new tool, comprising 12 questions on constant vs intermittent pain, was administered by phone to 100 subjects aged 40+ years with hip or knee OA, followed by three global hip/knee questions, the Western Ontario and McMaster Universities (WOMAC) pain subscale, the symptom subscales of the Hip Disability and OA Outcome Score (HOOS) or Knee Injury and OA Outcome Score (KOOS), and the limitation dimension of the Late Life Function and Disability Instrument (LLFDI). Test-retest reliability was assessed by re-administration after 48–96 h. Item response distributions, inter-item correlations, item-total correlations and Cronbach's alpha were assessed. Principal component analysis was performed and test-retest reliability was assessed by intra-class correlation coefficient (ICC).
Results
There was good distribution of response options across all items. The mean intensity was higher for intermittent vs constant pain, indicating subjects could distinguish the two concepts. Inter-item correlations ranged from 0.37 to 0.76, indicating no item redundancy. One item, predictability of pain, was removed from subsequent analyses as correlations with other items and item-total correlations were low. The 11-item scale had a corrected inter-item correlation range of 0.54–0.81 with Cronbach's alpha of 0.93 for the combined sample. Principal components analysis demonstrated factorial complexity. As such, scoring was based on the summing of individual items. Test-retest reliability was excellent (ICC 0.85). The measure was significantly correlated with each of the other measures [Spearman correlations −0.60 (KOOS symptoms) to 0.81 (WOMAC pain scale)], except the LLFDI, where correlations were low.
Conclusions
Preliminary psychometric testing suggests this OA pain measure is reliable and valid.
doi:10.1016/j.joca.2007.12.015
PMCID: PMC3268063  PMID: 18381179
Osteoarthritis; Hip; Knee; Pain; Outcome measure; Validation; Instrument development
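The corrected item-total correlations used above correlate each item with the total of the remaining items, so that the item does not inflate the correlation by being part of its own total. A sketch with invented scores:

```python
def corrected_item_total(item, other_items):
    """Pearson correlation between one item's scores and the summed
    total of the remaining items (the 'corrected' item-total r)."""
    rest_total = [sum(vals) for vals in zip(*other_items)]
    n = len(item)
    mx = sum(item) / n
    my = sum(rest_total) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item, rest_total)) / n
    sx = (sum((x - mx) ** 2 for x in item) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in rest_total) / n) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: the candidate item tracks the rest of the scale
r = corrected_item_total([1, 2, 3], [[1, 2, 3], [2, 4, 6]])
```

An item with a low corrected item-total r, like the "predictability of pain" item above, contributes little shared variance and is a natural candidate for removal.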
17.  Development and assessment of floor and ceiling items for the PROMIS physical function item bank 
Arthritis Research & Therapy  2013;15(5):R144.
Introduction
Disability and Physical Function (PF) outcome assessment has had limited ability to measure functional status at the floor (very poor functional abilities) or the ceiling (very high functional abilities). We sought to identify, develop and evaluate new floor and ceiling items to enable broader and more precise assessment of PF outcomes for the NIH Patient-Reported-Outcomes Measurement Information System (PROMIS).
Methods
We conducted two cross-sectional studies using NIH PROMIS item improvement protocols with expert review, participant survey and focus group methods. In Study 1, respondents with low PF abilities evaluated new floor items, and those with high PF abilities evaluated new ceiling items for clarity, importance and relevance. In Study 2, we compared difficulty ratings of new floor items by low functioning respondents and ceiling items by high functioning respondents to reference PROMIS PF-10 items. We used frequencies, percentages, means and standard deviations to analyze the data.
Results
In Study 1, low (n = 84) and high (n = 90) functioning respondents were mostly White, women, 70 years old, with some college, and disability scores of 0.62 and 0.30. More than 90% of the 31 new floor and 31 new ceiling items were rated as clear, important and relevant, leaving 26 ceiling and 30 floor items for Study 2. Low (n = 246) and high (n = 637) functioning Study 2 respondents were mostly White, women, 70 years old, with some college, and Health Assessment Questionnaire (HAQ) scores of 1.62 and 0.003. Compared to difficulty ratings of reference items, ceiling items were rated from 10% to more than 40% more difficult to do, and floor items were rated from about 12% to nearly 90% less difficult to do.
Conclusions
These new floor and ceiling items considerably extend the measurable range of physical function at either extreme. They will help improve instrument performance in populations with broad functional ranges and those concentrated at one or the other extreme ends of functioning. Optimal use of these new items will be assisted by computerized adaptive testing (CAT), reducing questionnaire burden and ensuring item administration to appropriate individuals.
doi:10.1186/ar4327
PMCID: PMC3978724  PMID: 24286166
18.  A functional difficulty and functional pain instrument for hip and knee osteoarthritis 
Arthritis Research & Therapy  2009;11(4):R107.
Introduction
The objectives of this study were to develop a functional outcome instrument for hip and knee osteoarthritis research (OA-FUNCTION-CAT) using item response theory (IRT) and computer adaptive test (CAT) methods and to assess its psychometric performance compared to the current standard in the field.
Methods
We conducted an extensive literature review, focus groups, and cognitive testing to guide the construction of an item bank consisting of 125 functional activities commonly affected by hip and knee osteoarthritis. We recruited a convenience sample of 328 adults with confirmed hip and/or knee osteoarthritis. Subjects reported their degree of functional difficulty and functional pain in performing each activity in the item bank and completed the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC). Confirmatory factor analyses were conducted to assess scale uni-dimensionality, and IRT methods were used to calibrate the items and examine the fit of the data. We assessed the performance of OA-FUNCTION-CATs of different lengths relative to the full item bank and WOMAC using CAT simulation analyses.
Results
Confirmatory factor analyses revealed distinct functional difficulty and functional pain domains. Descriptive statistics for scores from 5-, 10-, and 15-item CATs were similar to those for the full item bank. The 10-item OA-FUNCTION-CAT scales demonstrated a high degree of accuracy compared with the item bank (r = 0.96 and 0.89, respectively). Compared to the WOMAC, both scales covered a broader score range and demonstrated a higher degree of precision at the ceiling and reliability across the range of scores.
Conclusions
The OA-FUNCTION-CAT provided superior reliability throughout the score range and improved breadth and precision at the ceiling compared with the WOMAC. Further research is needed to assess whether these improvements carry over into superior ability to measure change.
doi:10.1186/ar2760
PMCID: PMC2745788  PMID: 19589168
19.  Psychometric Properties of Reverse-Scored Items on the CES-D in a Sample of Ethnically Diverse Older Adults 
Psychological assessment  2011;23(2):558-562.
Background
Reverse-scored items on assessment scales increase cognitive processing demands, and may therefore lead to measurement problems for older adult respondents.
Objective
To examine possible psychometric inadequacies of reverse-scored items on the Center for Epidemiologic Studies Depression Scale (CES-D) when used to assess ethnically diverse older adults.
Methods
Using baseline data from a gerontologic clinical trial (n=460), we tested the hypotheses that the reversed items on the CES-D: (a) are less reliable than non-reversed items, (b) disproportionately lead to intra-individually atypical responses that are psychometrically problematic, and (c) evidence improved measurement properties when an imputation procedure based on the scale mean is used to replace atypical responses.
Results
In general, the results supported the hypotheses. Relative to non-reversed CES-D items, the four reversed items were less internally consistent, were associated with lower item-scale correlations, and were more often answered atypically at an intra-individual level. Further, the atypical responses were negatively correlated with responses to psychometrically sound non-reversed items that had similar content. The use of imputation to replace atypical responses enhanced the predictive validity of the set of reverse-scored items.
Conclusions
Among older adult respondents, reverse-scored items are associated with measurement difficulties. It is recommended that appropriate correction procedures, such as item re-administration or statistical imputation, be applied to reduce the difficulties.
doi:10.1037/a0022484
PMCID: PMC3115428  PMID: 21319906
CES-D; depression; reversed item format; older adults
20.  Adaptive Short Forms for Outpatient Rehabilitation Outcome Assessment 
Objective
To develop outpatient adaptive short forms (ASFs) for the Activity Measure for Post-Acute Care (AM-PAC) item bank for use in outpatient therapy settings.
Design
A convenience sample of 11,809 adults with spine, lower extremity, upper extremity and miscellaneous orthopedic impairments who received outpatient rehabilitation in one of 127 outpatient rehabilitation clinics in the US. We identified optimal items for use in developing outpatient ASFs based on the Basic Mobility and Daily Activities domains of the AM-PAC item bank. Patient scores were derived from the AM-PAC computerized adaptive testing (CAT) program. Items were selected for inclusion on the ASFs based on functional content, range of item coverage, measurement precision, item exposure rate, and data collection burden.
Results
Two outpatient ASFs were developed: 1) an 18-item Basic Mobility ASF and 2) a 15-item Daily Activities ASF, derived from the same item bank used to develop the AM-PAC-CAT. Both ASFs achieved acceptable psychometric properties.
Conclusions
In outpatient PAC settings where CAT outcome applications are currently not feasible, IRT-derived ASFs provide the efficient capability to monitor patients’ functional outcomes. The development of ASF functional outcome instruments linked by a common, calibrated item bank has the potential to create a bridge to outcome monitoring across PAC settings and can make the eventual transition from ASFs to CAT applications easier and more acceptable to the rehabilitation community.
doi:10.1097/PHM.0b013e318186b7ca
PMCID: PMC3947754  PMID: 18806511
Outcomes Assessment; Rehabilitation; Item Response Theory; Physical Functioning
21.  Development of A Promis Item Bank to Measure Pain Interference 
Pain  2010;150(1):173-182.
This paper describes the psychometric properties of the PROMIS Pain Interference (PROMIS-PI) bank. An initial candidate item pool (n=644) was developed and evaluated based on review of existing instruments, interviews with patients, and consultation with pain experts. From this pool, a candidate item bank of 56 items was selected and responses to the items were collected from large community and clinical samples. A total of 14,848 participants responded to all or a subset of candidate items. The responses were calibrated using an item response theory (IRT) model. A final 41-item bank was evaluated with respect to IRT assumptions, model fit, differential item functioning (DIF), precision, and construct and concurrent validity. Items of the revised bank had good fit to the IRT model (CFI and NNFI/TLI ranged from 0.974 to 0.997), and the data were strongly unidimensional (e.g., ratio of first and second eigenvalue = 35). Nine items exhibited statistically significant DIF. However, adjusting for DIF had little practical impact on score estimates and the items were retained without modifying scoring. Scores provided substantial information across levels of pain; for scores in the T-score range 50-80, the reliability was equivalent to 0.96 to 0.99. Patterns of correlations with other health outcomes supported the construct validity of the item bank. The scores discriminated among persons with different numbers of chronic conditions, disabling conditions, levels of self-reported health, and pain intensity (p< 0.0001). The results indicated that the PROMIS-PI items constitute a psychometrically sound bank. Computerized adaptive testing and short forms are available.
doi:10.1016/j.pain.2010.04.025
PMCID: PMC2916053  PMID: 20554116
Quality-of-life outcomes; quality-of-life measurement; pain
22.  Evaluating professionalism in medical undergraduates using selected response questions: findings from an item response modelling study 
BMC Medical Education  2011;11:43.
Background
Professionalism is a difficult construct to define in medical students but aspects of this concept may be important in predicting the risk of postgraduate misconduct. For this reason attempts are being made to evaluate medical students' professionalism. This study investigated the psychometric properties of Selected Response Questions (SRQs) relating to the theme of professional conduct and ethics, comparing them with two sets of control items: those testing pure knowledge of anatomy and those evaluating the ability to integrate and apply knowledge ("skills"). The performance of students on the SRQs was also compared with two external measures estimating aspects of professionalism in students: peer ratings of professionalism and their Conscientiousness Index, an objective measure of behaviours at medical school.
Methods
Item Response Theory (IRT) was used to analyse both question and student performance for SRQs relating to knowledge of professionalism, pure anatomy and skills. The relative difficulties, discrimination and 'guessabilities' of each theme of question were compared with each other using Analysis of Variance (ANOVA). Student performance on each topic was compared with the measures of conscientiousness and professionalism using parametric and non-parametric tests as appropriate. A post-hoc analysis of power for the IRT modelling was conducted using a Monte Carlo simulation.
Results
Professionalism items were less difficult than the anatomy and skills SRQs, poorer at discriminating between candidates, and more erratically answered than anatomy questions. Moreover, professionalism item performance was uncorrelated with the standardised Conscientiousness Index scores (rho = 0.009, p = 0.90). In contrast, there were modest but significant correlations between standardised Conscientiousness Index scores and performance on anatomy items (rho = 0.20, p = 0.006) though not skills (rho = 0.11, p = 0.10). Likewise, students with high peer ratings for professionalism had superior performance on anatomy SRQs but not on professionalism-themed questions. A trend of borderline significance (p = 0.07) was observed for performance on skills SRQs and professionalism nomination status.
Conclusions
SRQs related to professionalism are likely to have relatively poor psychometric properties and lack associations with other constructs associated with undergraduate professional behaviour. The findings suggest that such questions should not be included in undergraduate examinations and may raise issues with the introduction of Situational Judgement Tests into Foundation Years selection.
doi:10.1186/1472-6920-11-43
PMCID: PMC3146946  PMID: 21714870
23.  Psychometric Properties of the Participation Scale among Former Buruli Ulcer Patients in Ghana and Benin 
Background
Buruli ulcer is a stigmatising disease treated with antibiotics and wound care, and sometimes surgical intervention is necessary. Permanent limitations in daily activities are a common long term consequence. It is unknown to what extent patients perceive problems in participation in social activities. The psychometric properties of the Participation Scale, used in other disabling diseases such as leprosy, were assessed for use in former Buruli ulcer patients.
Methods
Former Buruli ulcer patients in Ghana and Benin, their relatives, and healthy community controls were interviewed using the Participation Scale, Buruli Ulcer Functional Limitation Score, and the Explanatory Model Interview Catalogue to measure stigma. The Participation Scale was tested for the following psychometric properties: discrimination, floor and ceiling effects, internal consistency, inter-item correlation, item-total correlation and construct validity.
Results
In total 386 participants (143 former Buruli ulcer patients with their relatives (137) and 106 community controls) were included in the study. The Participation Scale displayed good discrimination between former Buruli ulcer patients and healthy community controls. No floor and ceiling effects were found. Internal consistency (Cronbach's alpha) was 0.88. In Ghana, mean inter-item correlation of 0.29 and item-total correlations ranging from 0.10 to 0.69 were found while in Benin, a mean inter-item correlation of 0.28 was reported with item-total correlations ranging from −0.08 to 0.79. With respect to construct validity, 4 out of 6 hypotheses were not rejected, though correlations between various constructs differed between countries.
Conclusion
The results indicate the Participation Scale has acceptable psychometric properties and can be used for Buruli ulcer patients in Ghana and Benin. Future studies can use this Participation Scale to evaluate the long term restrictions in participation in daily social activities of former BU patients.
Author Summary
Buruli ulcer is a stigmatising condition caused by infection with Mycobacterium ulcerans. Besides the long term medical consequences, Buruli ulcer may lead to participation restrictions in social life. The Participation Scale intends to assess perceived participation restrictions; however, this instrument has been developed in patients affected by leprosy and other disabling conditions, and has never been used before among Buruli ulcer patients. We aimed to analyze the reliability and validity of the Participation Scale among former Buruli ulcer patients in Ghana and Benin. This study included former Buruli ulcer patients from 2 different treatment sites, along with their relatives and healthy community controls residing in similar geographical areas. Former Buruli ulcer patients were interviewed using the Participation Scale, Buruli Ulcer Functional Limitation Score, and the Explanatory Model Interview Catalogue to measure stigma. Relatives and healthy community controls were interviewed using the Participation Scale. We tested the Participation Scale for discrimination, floor and ceiling effects, internal consistency, inter-item correlation, item-total correlation and construct validity. The results of the analysis suggest that the Participation Scale has acceptable psychometric properties. As such, the instrument can be used to assess participation restrictions among former Buruli ulcer patients in Ghana and Benin.
doi:10.1371/journal.pntd.0003254
PMCID: PMC4230837  PMID: 25393289
24.  An Item Response Theory (IRT) Analysis of the Short Inventory of Problems-Alcohol and Drugs (SIP-AD) among non-treatment seeking Men-Who-Have-Sex-With-Men: Evidence for a shortened 10-item SIP-AD 
Addictive behaviors  2009;34(11):948-954.
The Short Inventory of Problems-Alcohol and Drugs (SIP-AD) is a 15-item measure that concurrently assesses negative consequences associated with alcohol and illicit drug use. Current psychometric evaluation has been limited to classical test theory (CTT) statistics, and it has not been validated among non-treatment seeking men-who-have-sex-with-men (MSM). Methods from Item Response Theory (IRT) can improve upon CTT by providing an in-depth analysis of how each item performs across the underlying latent trait that it is purported to measure. The present study examined the psychometric properties of the SIP-AD using methods from both IRT and CTT among a non-treatment seeking MSM sample (N = 469). Participants were recruited from the New York City area and were asked to participate in a series of studies examining club drug use. Results indicated that five items on the SIP-AD demonstrated significant item misfit or differential item functioning (DIF) across race/ethnicity and HIV status. These five items were dropped and two-parameter IRT analyses were conducted on the remaining 10 items, which indicated a restricted range of item location parameters (−.15 to −.99) plotted at the lower end of the latent negative consequences severity continuum, and reasonably high discrimination parameters (1.30 to 2.22). Additional CTT statistics were compared between the original 15-item SIP-AD and the refined 10-item SIP-AD and suggest that the differences were negligible, with the refined 10-item SIP-AD indicating a high degree of reliability and validity. Findings suggest the SIP-AD can be shortened to 10 items and appears to be an unbiased, reliable, and valid measure among non-treatment seeking MSM.
doi:10.1016/j.addbeh.2009.06.004
PMCID: PMC2726268  PMID: 19564078
Item Response Theory; Reliability; Validity; Alcohol Use; Drug Use
25.  Re-evaluating a vision-related quality of life questionnaire with item response theory (IRT) and differential item functioning (DIF) analyses 
Background
For the Low Vision Quality Of Life questionnaire (LVQOL) it is unknown whether the psychometric properties are satisfactory when an item response theory (IRT) perspective is considered. This study evaluates some essential psychometric properties of the LVQOL questionnaire in an IRT model, and investigates differential item functioning (DIF).
Methods
Cross-sectional data were used from an observational study among visually-impaired patients (n = 296). Calibration was performed for every dimension of the LVQOL in the graded response model. Item goodness-of-fit was assessed with the S-X² test. DIF was assessed on relevant background variables (i.e. age, gender, visual acuity, eye condition, rehabilitation type and administration type) with likelihood-ratio tests for DIF. The magnitude of DIF was interpreted by assessing the largest difference in expected scores between subgroups. Measurement precision was assessed by presenting test information curves; reliability with the index of subject separation.
Results
All items of the LVQOL dimensions fitted the model. There was significant DIF on several items. For two items the maximum difference between expected scores exceeded one point, and DIF was found on multiple relevant background variables. Item 1 'Vision in general' from the "Adjustment" dimension and item 24 'Using tools' from the "Reading and fine work" dimension were removed. Test information was highest for the "Reading and fine work" dimension. Indices for subject separation ranged from 0.83 to 0.94.
Conclusions
The items of the LVQOL showed satisfactory item fit to the graded response model; however, two items were removed because of DIF. The adapted LVQOL with 21 items is DIF-free and therefore seems highly appropriate for use in heterogeneous populations of visually impaired patients.
doi:10.1186/1471-2288-11-125
PMCID: PMC3201037  PMID: 21888648
Visual impairment; Vision-related quality of life; Item response theory; Graded response model; Differential item functioning