Choices were presented to 9 individuals with developmental disabilities using a two-choice format. Each pair of items, selected based on prior preference assessment, was presented to each participant in three conditions (actual items, pictures of the items, and spoken-name presentation) using a reversal design. The evaluation was conducted using food items, and was then repeated using nonfood items. The participants were also given a test to measure their skills on discrimination tasks ranging in difficulty from simple to conditional discriminations. The participants' abilities to make consistent choices with food and nonfood items were predicted, with 94% accuracy, by their discrimination skills. The findings suggest that presentation methods can affect the accuracy of a choice assessment, and that the systematic assessment of basic discrimination skills can be used to predict the effectiveness of different presentation methods in this population.
Scores on the Boston Naming Test (BNT) are frequently lower for African American than for Caucasian adults. Although demographically-based norms can mitigate the impact of this discrepancy on the likelihood of erroneous diagnostic impressions, a growing consensus suggests that group norms do not sufficiently address or advance our understanding of the underlying psychometric and sociocultural factors that lead to between-group score discrepancies. Using item response theory and methods to detect differential item functioning (DIF), the current investigation moves beyond comparisons of the summed total score to examine whether the conditional probability of responding correctly to individual BNT items differs between African American and Caucasian adults. Participants included 670 adults age 52 and older who took part in Mayo's Older Americans and Older African Americans Normative Studies. Under a 2-parameter logistic IRT framework and after correction for the false discovery rate, 12 items were shown to demonstrate DIF. Six of these 12 items (“dominoes,” “escalator,” “muzzle,” “latch,” “tripod,” and “palette”) were also identified in additional analyses using hierarchical logistic regression models and represent the strongest evidence for race/ethnicity-based DIF. These findings afford a finer characterization of the psychometric properties of the BNT and expand our understanding of between-group performance.
Boston Naming Test; Item response theory; Differential item functioning; Ethnicity; Race; Bias
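As an illustration of the two ingredients named in the BNT abstract above, the following minimal Python sketch shows a 2-parameter logistic item response function and a Benjamini-Hochberg step-up correction for the false discovery rate. It is not the authors' analysis code: the function names, the parameter values in the demo, and the choice of Benjamini-Hochberg as the FDR procedure are assumptions made for illustration only.

```python
import math


def two_pl_probability(theta, a, b):
    """Probability of a correct response under a 2PL model.

    theta: latent ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it is rejected at FDR level alpha.

    Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha and
    reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    flagged = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            flagged[idx] = True
    return flagged


if __name__ == "__main__":
    # Hypothetical parameters: same ability and discrimination, but the item
    # is harder (larger b) in one group. That gap at equal theta is what
    # DIF refers to.
    p_reference = two_pl_probability(theta=0.0, a=1.2, b=0.0)
    p_focal = two_pl_probability(theta=0.0, a=1.2, b=0.5)
    print(p_reference, p_focal)

    # Hypothetical per-item DIF p-values, not taken from the study.
    print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

In this sketch DIF appears as a difference in the probability of a correct response at the same ability level, and the FDR step limits the share of falsely flagged items when many items are tested at once.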
Professionalism is a difficult construct to define in medical students but aspects of this concept may be important in predicting the risk of postgraduate misconduct. For this reason attempts are being made to evaluate medical students' professionalism. This study investigated the psychometric properties of Selected Response Questions (SRQs) relating to the theme of professional conduct and ethics, comparing them with two sets of control items: those testing pure knowledge of anatomy, and items evaluating the ability to integrate and apply knowledge ("skills"). The performance of students on the SRQs was also compared with two external measures estimating aspects of professionalism in students: peer ratings of professionalism and their Conscientiousness Index, an objective measure of behaviours at medical school.
Item Response Theory (IRT) was used to analyse both question and student performance for SRQs relating to knowledge of professionalism, pure anatomy and skills. The relative difficulties, discrimination and 'guessabilities' of each theme of question were compared with each other using Analysis of Variance (ANOVA). Student performance on each topic was compared with the measures of conscientiousness and professionalism using parametric and non-parametric tests as appropriate. A post-hoc analysis of power for the IRT modelling was conducted using a Monte Carlo simulation.
Professionalism items were less difficult compared to the anatomy and skills SRQs, poorer at discriminating between candidates and more erratically answered when compared to anatomy questions. Moreover professionalism item performance was uncorrelated with the standardised Conscientiousness Index scores (rho = 0.009, p = 0.90). In contrast there were modest but significant correlations between standardised Conscientiousness Index scores and performance at anatomy items (rho = 0.20, p = 0.006) though not skills (rho = .11, p = .1). Likewise, students with high peer ratings for professionalism had superior performance on anatomy SRQs but not professionalism themed questions. A trend of borderline significance (p = .07) was observed for performance on skills SRQs and professionalism nomination status.
SRQs related to professionalism are likely to have relatively poor psychometric properties and lack associations with other constructs associated with undergraduate professional behaviour. The findings suggest that such questions should not be included in undergraduate examinations and may raise issues with the introduction of Situational Judgement Tests into Foundation Years selection.
Although the Autobiographical Memory Test (AMT) is widely used, its psychometric properties have rarely been investigated. This paper utilises data gathered from a 10-item written version of the AMT, completed by 5792 adolescents participating in the Avon Longitudinal Study of Parents and Children, to examine the psychometric properties of the measure. The results show that the scale derived from responses to the AMT operates well over a wide range of scores, consistent with the aim of deriving a continuous measure of over-general memory. There was strong evidence of group differences in terms of gender, low negative mood, and IQ, and these were in agreement when comparing an item response theory (IRT) approach with that based on a sum score. One advantage of the IRT model is the ability to assess and consequently allow for differential item functioning. This additional analysis showed evidence of response bias for both gender and mood, resulting in attenuation in the mean differences in AMT across these groups. Implications of the findings for the use of the AMT measure in different samples are discussed.
Avon Longitudinal Study of Parents and Children; ALSPAC; Autobiographical Memory Test; AMT; Graded response model; Differential item functioning; Mood congruence
There is currently considerable interest in the flexible framework offered by item banks for measuring patient-relevant outcomes, including functional status. However, few item banks have been developed to quantify functional status, as expressed by the ability to perform activities of daily life.
This paper examines the psychometric properties of the AMC Linear Disability Score (ALDS) project item bank using an item response theory model and full information factor analysis. Data were collected from 555 respondents on a total of 160 items.
Following the analysis, 79 items remained in the item bank. The remaining 81 items were excluded because of: difficulties in presentation (1 item); low levels of variation in response pattern (28 items); significant differences in measurement characteristics for males and females or for respondents under or over 85 years old (26 items); or lack of model fit to the data at item level (26 items).
It is conceivable that the item bank will have different measurement characteristics for other patient or demographic populations. However, these results indicate that the ALDS item bank has sound psychometric properties for respondents in residential care settings and could form a stable base for measuring functional status in a range of situations, including the implementation of computerised adaptive testing of functional status.
Physical function is a key component of patient-reported outcome (PRO) assessment in rheumatology. Modern psychometric methods, such as Item Response Theory (IRT) and Computerized Adaptive Testing, can materially improve measurement precision at the item level. We present the qualitative and quantitative item-evaluation process for developing the Patient Reported Outcomes Measurement Information System (PROMIS) Physical Function item bank.
The process was stepwise: we searched extensively to identify extant Physical Function items and then classified and selectively reduced the item pool. We evaluated retained items for content, clarity, relevance and comprehension, reading level, and translation ease by experts and patient surveys, focus groups, and cognitive interviews. We then assessed items by using classic test theory and IRT, used confirmatory factor analyses to estimate item parameters, and graded response modeling for parameter estimation. We retained the 20 Legacy (original) Health Assessment Questionnaire Disability Index (HAQ-DI) and the 10 SF-36's PF-10 items for comparison. Subjects were from rheumatoid arthritis, osteoarthritis, and healthy aging cohorts (n = 1,100) and a national Internet sample of 21,133 subjects.
We identified 1,860 items. After qualitative and quantitative evaluation, 124 newly developed PROMIS items composed the PROMIS item bank, which included revised Legacy items with good fit that met IRT model assumptions. Results showed that the clearest and best-understood items were simple, in the present tense, and straightforward. Basic tasks (like dressing) were more relevant and important versus complex ones (like dancing). Revised HAQ-DI and PF-10 items with five response options had higher item-information content than did comparable original Legacy items with fewer response options. IRT analyses showed that the Physical Function domain satisfied general criteria for unidimensionality with one-, two-, three-, and four-factor models having comparable model fits. Correlations between factors in the test data sets were > 0.90.
Item improvement must underlie attempts to improve outcome assessment. The clear, personally important and relevant, ability-framed items in the PROMIS Physical Function item bank perform well in PRO assessment. They will benefit from further study and application in a wider variety of rheumatic diseases in diverse clinical groups, including those at the extremes of physical functioning, and in different administration modes.
The Effective Consumer Scale (EC-17) measures the skills of musculoskeletal patients in managing their own healthcare. The objectives of this study were to translate the EC-17 into Dutch and to further evaluate its psychometric properties.
The EC-17 was translated and cognitively pretested following cross-cultural adaptation guidelines. Two hundred and thirty-eight outpatients (52 % response rate) with osteoarthritis or fibromyalgia completed the EC-17 along with other validated measures. Three weeks later, 101 patients completed the EC-17 again.
Confirmatory factor analysis supported the unidimensional structure of the scale. The items adequately fit the Rasch model and only one item demonstrated differential item functioning. Person reliability was high (0.92), but item difficulty levels tended to cluster around the middle of the scale, and measurement precision was highest for moderate and lower levels of skills. The scale demonstrated adequate test–retest reliability (ICC = 0.71), and correlations with other measures were largely as expected.
The results supported the validity and reliability of the Dutch version of the EC-17, but suggest that the scale is best targeted at patients with relatively low levels of skills. Future studies should further examine its sensitivity to change in a clinical trial specifically aimed at improving effective consumer skills.
Arthritis; Consumer participation; Psychometrics; Rasch analysis
The 17-item Hamilton Rating Scale for Depression (HRSD17) and the Montgomery-Åsberg Depression Rating Scale (MADRS) are two widely used clinician-rated symptom scales. A 6-item version of the HRSD (HRSD6) was created by Bech to address the psychometric limitations of the HRSD17. The psychometric properties of these measures were compared using classical test theory (CTT) and item response theory (IRT) methods. IRT methods were used to equate total scores on any two scales. To assess the robustness of the results, data were drawn from two distinctly different outpatient studies of nonpsychotic major depression: a 12-month study of highly treatment-resistant patients (n=233) and an 8-week acute phase drug treatment trial (n=985).
MADRS and HRSD6 items generally contributed more to the measurement of depression than HRSD17 items as shown by higher item-total correlations and higher IRT slope parameters. The MADRS and HRSD6 were unifactorial while the HRSD17 contained 2 factors. The MADRS showed about twice the precision in estimating depression as either the HRSD17 or HRSD6 for average severity of depression. An HRSD17 of 7 corresponded to an 8 or 9 on the MADRS and 4 on the HRSD6.
The MADRS would be superior to the HRSD17 in the conduct of clinical trials.
MADRS; HRSD; item response theory; classical test theory; psychometrics
The Judgment of Line Orientation (JLO) test was developed to be, in Arthur Benton’s words, “as pure a measure of one aspect of spatial thinking, as could be conceived.” The JLO test has been widely used in neuropsychological practice for decades. The test has a high test-retest reliability (Franzen, 2000), as well as good neuropsychological construct validity as shown through neuroanatomical localization studies (Tranel, Vianna, Manzel, Damasio, & Grabowski, 2009). Despite its popularity and strong psychometric properties, the full-length version of the test (30 items) has been criticized as being unnecessarily long (Strauss, Sherman, & Spreen, 2006). There have been many attempts at developing short forms; however, these forms have been limited in their ability to estimate scores accurately. Taking advantage of a large sample of JLO performances from 524 neurological patients with focal brain lesions, we used techniques from Item Response Theory (IRT) to estimate each item’s difficulty and power to discriminate among various levels of ability. A random item IRT model was used to estimate the influence of item stimulus properties as predictors of item difficulty. These results were used to optimize the selection of items for a shorter method of administration which maintained comparability with the full form using significantly fewer items. The effectiveness of this method was replicated in a second sample of 82 healthy elderly participants. The findings should help broaden the clinical utility of the JLO and enhance its diagnostic applications.
Neuropsychological evaluations conducted in the United States and abroad commonly include the use of tests translated from English to Spanish. The use of translated naming tests for evaluating predominately Spanish-speakers has recently been challenged on the grounds that translating test items may compromise a test’s construct validity. The Texas Spanish Naming Test (TNT) has been developed in Spanish specifically for use with Spanish-speakers; however, it is unlikely patients from diverse Spanish-speaking geographical regions will perform uniformly on a naming test. The present study evaluated and compared the internal consistency and patterns of item-difficulty and -discrimination for the TNT and two commonly used translated naming tests in three countries (i.e., United States, Colombia, Spain). Two hundred fifty two subjects (126 demented, 116 nondemented) across three countries were administered the TNT, Modified Boston Naming Test-Spanish, and the naming subtest from the CERAD. The TNT demonstrated superior internal consistency to its counterparts, a superior item difficulty pattern than the CERAD naming test, and a superior item discrimination pattern than the MBNT-S across countries. Overall, all three Spanish naming tests differentiated nondemented and moderately demented individuals, but the results suggest the items of the TNT are most appropriate to use with Spanish-speakers. Preliminary normative data for the three tests examined in each country are provided.
The 10-item Connor-Davidson Resilience Scale (10-item CD-RISC) is an instrument for measuring resilience that has shown good psychometric properties in its original version in English. The aim of this study was to evaluate the validity and reliability of the Spanish version of the 10-item CD-RISC in young adults and to verify whether it is structured in a single dimension as in the original English version.
Cross-sectional observational study including 681 university students ranging in age from 18 to 30 years. The number of latent factors in the 10 items of the scale was analyzed by exploratory factor analysis. Confirmatory factor analysis was used to verify whether a single factor underlies the 10 items of the scale as in the original version in English. The convergent validity was analyzed by testing whether the mean of the scores of the mental component of SF-12 (MCS) and the quality of sleep as measured with the Pittsburgh Sleep Index (PSQI) were higher in subjects with better levels of resilience. The internal consistency of the 10-item CD-RISC was estimated using the Cronbach α test and test-retest reliability was estimated with the intraclass correlation coefficient.
The Cronbach α coefficient was 0.85 and the test-retest intraclass correlation coefficient was 0.71. The mean MCS score and the level of quality of sleep in both men and women were significantly worse in subjects with lower resilience scores.
The Spanish version of the 10-item CD-RISC showed good psychometric properties in young adults and thus can be used as a reliable and valid instrument for measuring resilience. Our study confirmed that a single factor underlies the resilience construct, as was the case of the original scale in English.
Resilience; 10-item CD-RISC; Young adults; Reliability; Validity; Questionnaire
The aim of this study is to evaluate the validity and the psychometric properties of a German version of the 20-item neck pain and disability scale (NPAD) for use in primary care settings. Four hundred and forty-eight participants from 15 general practices in the area of Göttingen, Germany, completed a multidimensional questionnaire including a newly developed German version of the NPAD (NPAD-d) and self-reported demographic and clinical information. Reliability was tested using Cronbach’s alpha. Item-to-total score correlations were analysed. Factor structure was explored by using unrestricted principal factor analysis. Construct validity of the NPAD-d was evaluated by simple correlation analyses (Pearson’s rho) with social and clinical characteristics. The discriminative abilities of the NPAD-d were examined by comparing differences between subgroups stratified on non-NPAD-d pain related characteristics using t tests for mean scores. Cronbach’s alpha of NPAD-d was 0.94. Item-to-total scale correlations ranged between 0.414 and 0.829. Exploratory principal factor analysis indicated that the NPAD-d covers one factor with an explained variance of 48%. Correlation analysis showed high correlations with criterion variables. The NPAD-d scores of subgroups of patients were significantly different, showing good discriminative validity of the scale. The NPAD-d demonstrated good validity and reliability in this general practice setting. The NPAD-d may be useful in the clinical assessment process and the management of neck pain.
Neck pain; Assessment; General practice; Validity; Reliability
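Several of the abstracts above report Cronbach's alpha and item-to-total correlations. The sketch below shows, under stated assumptions, how these two statistics are commonly computed from a respondents-by-items score matrix. The function names and the demo data are hypothetical, sample variances are used, and the item-total correlation is the "corrected" variant (item versus the sum of the remaining items), which may differ from what any given study computed.

```python
import math
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def corrected_item_total_correlations(scores):
    """Correlate each item with the sum of all other items."""
    k = len(scores[0])
    result = []
    for j in range(k):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]
        result.append(pearson(item, rest))
    return result


if __name__ == "__main__":
    # Hypothetical responses of four people to three items (not study data).
    demo = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
    print(cronbach_alpha(demo))
    print(corrected_item_total_correlations(demo))
```

Removing the item from the total before correlating avoids inflating the coefficient, which matters most for short scales.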
This paper describes the psychometric properties of the PROMIS Pain Interference (PROMIS-PI) bank. An initial candidate item pool (n=644) was developed and evaluated based on review of existing instruments, interviews with patients, and consultation with pain experts. From this pool, a candidate item bank of 56 items was selected and responses to the items were collected from large community and clinical samples. A total of 14,848 participants responded to all or a subset of candidate items. The responses were calibrated using an item response theory (IRT) model. A final 41-item bank was evaluated with respect to IRT assumptions, model fit, differential item functioning (DIF), precision, and construct and concurrent validity. Items of the revised bank had good fit to the IRT model (CFI and NNFI/TLI ranged from 0.974 to 0.997), and the data were strongly unidimensional (e.g., ratio of first and second eigenvalue = 35). Nine items exhibited statistically significant DIF. However, adjusting for DIF had little practical impact on score estimates and the items were retained without modifying scoring. Scores provided substantial information across levels of pain; for scores in the T-score range 50-80, the reliability was equivalent to 0.96 to 0.99. Patterns of correlations with other health outcomes supported the construct validity of the item bank. The scores discriminated among persons with different numbers of chronic conditions, disabling conditions, levels of self-reported health, and pain intensity (p< 0.0001). The results indicated that the PROMIS-PI items constitute a psychometrically sound bank. Computerized adaptive testing and short forms are available.
Quality-of-life outcomes; quality-of-life measurement; pain
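The PROMIS-PI abstract above expresses IRT precision as an "equivalent" reliability. One common way to make that translation, sketched here as an assumption rather than as the authors' exact computation, treats the latent trait as having unit variance in the reference population, so that reliability is approximately 1 minus the squared standard error, and then rescales to the T-score metric (mean 50, SD 10). The function names are hypothetical.

```python
import math


def reliability_from_se(se_theta):
    """Approximate reliability from a standard error on a unit-variance theta scale."""
    return 1.0 - se_theta ** 2


def t_score_se_for_reliability(reliability, t_sd=10.0):
    """Standard error in T-score units implied by a given reliability.

    Assumes the T-score metric has standard deviation t_sd (10 by convention).
    """
    return t_sd * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    for rel in (0.96, 0.99):
        print(rel, t_score_se_for_reliability(rel))
```

Under this approximation, a reliability of 0.96 corresponds to a standard error of about 2 T-score points and 0.99 to about 1 point.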
The Functional Assessment of Cancer Therapy (FACT) is one of the most commonly used self-report instruments for evaluation of health-related quality of life in oncology patients. However, cultural considerations necessitate testing of the subscales in different populations. We sought to qualitatively and quantitatively investigate the applicability and psychometric properties of the Chinese version of the FACT-Cervix (FACT-Cx) in Chinese women with cervical cancer.
Ten personal interviews were conducted in order to explore patients’ opinions about the scale and its items in depth. In addition the questionnaire was administered to 400 women with cervical cancer to test its psychometric properties. Reliability was assessed using Cronbach’s alpha coefficient and item-subscale correlation while validity was evaluated using factor analysis and known-group validity.
Some items related to sex and the ability to give birth were questioned in the personal interviews, mostly regarding their significance and acceptance in the Chinese cultural context. The Cronbach’s alphas of FACT-Cx and the subscales were greater than 0.7, except for the cervical-cancer-specific subscale which was 0.57. Factor analysis demonstrated that the FACT-G construct generally paralleled the original. There were significant differences in the FACT-Cx and some subscales between those receiving and not receiving treatment and among the patients with different performance status.
In general, psychometric properties of the Chinese version supported its use with cervical cancer patients in Mainland China. Further work is needed to improve the psychometric adequacy of the cervical-cancer-specific subscale and adjust it to cultural considerations.
Health-related quality of life; FACT-Cx; FACT-G; Chinese version; Psychometric properties; Cervical cancer
With the Profile of Mood States (POMS), a German version of an international instrument for the assessment of mood is available. The paper introduces a new short version containing 24 items and four scales. In a study about indoor climate in 4596 office workers only a few missing values were noted. Psychometric analyses showed very good characteristics of the four scales regarding their internal consistency (Cronbach’s α) and scale fit. High floor effects indicated a limited exhaustion of the scale range. Age and gender effects of the scale scores concerned the scales “vigour” and “fatigue”. Furthermore, the scales of the POMS discriminated between groups with different self-reported disease incidences. A less beneficial characteristic of the POMS could be noted in terms of a high correlation of the scales “numbness” and “fatigue”. With the tested version of the POMS, a short instrument with good psychometric properties has been presented which can be used in healthy as well as in health-impaired persons.
Item response theory (IRT) is extensively used to develop adaptive instruments of health-related quality of life (HRQoL). However, each IRT model has its own function to estimate item and category parameters, and hence different results may be found using the same response categories with different IRT models. The present study used the Rasch rating scale model (RSM) to examine and reassess the psychometric properties of the Persian version of the PedsQL™ 4.0 Generic Core Scales.
The PedsQL™ 4.0 Generic Core Scales was completed by 938 Iranian school children and their parents. Convergent, discriminant and construct validity of the instrument were assessed by classical test theory (CTT). The RSM was applied to investigate person and item reliability, item statistics and ordering of response categories.
The CTT method showed that the scaling success rates for convergent and discriminant validity were 100% in all domains with the exception of physical health in the child self-report. Moreover, confirmatory factor analysis supported a four-factor model similar to its original version. The RSM showed that 22 out of 23 items had acceptable infit and outfit statistics (<1.4, >0.6), person reliabilities were low, item reliabilities were high, and item difficulty ranged from -1.01 to 0.71 and -0.68 to 0.43 for child self-report and parent proxy-report, respectively. The RSM also showed that successive response categories for all items were not located in the expected order.
This study revealed that, in all domains, the five response categories did not perform adequately. It is not known whether this problem is a function of the meaning of the response choices in the Persian language or an artifact of a mostly healthy population that did not use the full range of the response categories. The response categories should be evaluated in further validation studies, especially in large samples of chronically ill patients.
quality of life; school children; Iran; Rasch model
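The PedsQL abstract above hinges on whether response categories are "located in the expected order" under the Rasch rating scale model. As a minimal sketch, assuming the usual Andrich parameterisation (person location theta, item location delta, and category thresholds tau shared across items), the code below computes category probabilities and checks whether the thresholds increase monotonically. The function names and numeric values are illustrative and are not estimates from the study.

```python
import math


def rsm_category_probabilities(theta, delta, thresholds):
    """Category probabilities under the Rasch rating scale model.

    thresholds holds tau_1..tau_m; the returned list has m + 1 entries,
    one per response category 0..m.
    """
    logits = [0.0]  # category 0 has an empty sum in the numerator
    running = 0.0
    for tau in thresholds:
        running += theta - delta - tau
        logits.append(running)
    largest = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - largest) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_are_ordered(thresholds):
    """True if each threshold is strictly larger than the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


if __name__ == "__main__":
    # Hypothetical thresholds for a three-category item.
    print(rsm_category_probabilities(theta=0.0, delta=0.0, thresholds=[-1.0, 1.0]))
    print(thresholds_are_ordered([-1.0, 1.0]))
    print(thresholds_are_ordered([0.5, -0.2, 1.0]))
```

Disordered thresholds mean that at least one category is never the most probable response at any trait level, which is one reason to consider collapsing categories.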
To test the psychometric properties of the short form of the Chinese version Diabetes Quality of Life for Youth scale (C-DQOLY-SF).
RESEARCH DESIGN AND METHODS
A 30-item C-DQOLY-SF was administered to 371 adolescents with type 1 diabetes. Exploratory and confirmatory factor analysis, correlation with HbA1c, internal consistency, and test-retest reliability were used to examine the psychometric characteristics of C-DQOLY-SF.
A 25-item questionnaire with three correlated second-order factor structures best fitted the data. Scores on the 25-item C-DQOLY-SF significantly correlated with HbA1c values. Cronbach’s α and ICCs of each scale and subscale ranged from 0.77 to 0.90 and from 0.70 to 0.92, respectively.
The C-DQOLY-SF has satisfactory reliability and validity. The C-DQOLY-SF can be conveniently used in clinical settings to assess the quality of life of adolescents with type 1 diabetes.
The psychometric properties of instruments used to measure self-reported experiences of discrimination in epidemiologic studies are rarely assessed, especially regarding construct validity. The authors used 2000–2001 data from the Coronary Artery Risk Development in Young Adults (CARDIA) Study to examine differential item functioning (DIF) in 2 versions of the Experiences of Discrimination (EOD) Index, an index measuring self-reported experiences of racial/ethnic and gender discrimination. DIF may confound interpretation of subgroup differences. Large DIF was observed for 2 of 7 racial/ethnic discrimination items: White participants reported more racial/ethnic discrimination for the “at school” item, and black participants reported more racial/ethnic discrimination for the “getting housing” item. The large DIF by race/ethnicity in the index for racial/ethnic discrimination probably reflects item impact and is the result of valid group differences between blacks and whites regarding their respective experiences of discrimination. The authors also observed large DIF by race/ethnicity for 3 of 7 gender discrimination items. This is more likely to have been due to item bias. Users of the EOD Index must consider the advantages and disadvantages of DIF adjustment (omitting items, constructing separate measures, and retaining items). The EOD Index has substantial usefulness as an instrument that can assess self-reported experiences of discrimination.
African Americans; bias (epidemiology); observer variation; prejudice; psychometrics; questionnaires; reproducibility of results
The purpose of this study was to describe the questionnaire development process for evaluating important elements of an evidence-based practice (EBP) curriculum and to report on initial reliability and validity testing for the primary component of the questionnaire, an EBP knowledge exam.
The EBP knowledge test was evaluated with students enrolled in a doctor of chiropractic program. The initial version was tested with a sample of 374 and a revised version with a sample of 196 students. Item performance and reliability were assessed using item difficulty, item discrimination, and internal consistency. An expert panel assessed face and content validity.
The first version of the knowledge exam demonstrated a low internal consistency (KR20=0.55) and a few items had poor item difficulty and discrimination. This resulted in an expansion in the number of items from 20 to 40, as well as a revision of the poorly performing items from the initial version. The KR20 of the second version was 0.68; 32 items had item difficulties of between 0.20 and 0.80 and 26 items had item discrimination values of 0.20 or greater.
A questionnaire for evaluating a revised EBP integrated curriculum was developed and evaluated. Psychometric testing of the EBP knowledge component provided some initial evidence for acceptable reliability and validity.
Evidence-Based Practice; Reproducibility of Results; Questionnaires; Chiropractic; Knowledge
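The EBP knowledge exam abstract above reports KR-20, item difficulty, and item discrimination for dichotomously scored items. The sketch below shows one conventional way to compute them. It is not the authors' procedure: the function names and demo data are hypothetical, KR-20 is computed with the population variance of total scores, and discrimination is taken to be the upper-lower group index with a 27% split, which is an assumption because the abstract does not say which discrimination statistic was used.

```python
from statistics import pvariance


def item_difficulty(responses):
    """Proportion of respondents answering each item correctly (0/1 scoring)."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]


def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item responses."""
    k = len(responses[0])
    p = item_difficulty(responses)
    pq_sum = sum(pj * (1.0 - pj) for pj in p)
    total_variance = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - pq_sum / total_variance)


def upper_lower_discrimination(responses, fraction=0.27):
    """Difference in proportion correct between top and bottom scoring groups."""
    n = len(responses)
    group_size = max(1, round(fraction * n))
    ranked = sorted(responses, key=sum, reverse=True)
    upper, lower = ranked[:group_size], ranked[-group_size:]
    k = len(responses[0])
    return [
        sum(r[j] for r in upper) / group_size - sum(r[j] for r in lower) / group_size
        for j in range(k)
    ]


if __name__ == "__main__":
    # Hypothetical 0/1 responses of five examinees to four items.
    demo = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(item_difficulty(demo))
    print(kr20(demo))
    print(upper_lower_discrimination(demo))
```

With these statistics in hand, items can be screened against cut-offs such as the 0.20 to 0.80 difficulty band and the 0.20 discrimination floor mentioned in the abstract.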
To create self-report physical function (PF) measures for children using modern psychometric methods for item analysis as part of Patient Reported Outcomes Measurement Information System (PROMIS).
Study Design and Setting
PROMIS qualitative methodology was applied to develop two PF item pools comprised of 32 mobility and 38 upper extremity items. Items were computer administered to subjects aged 8–17 years. Scale dimensionality and sources of local dependence (LD) were evaluated with factor analysis. Items were analyzed for differential item functioning (DIF) between genders. Items with LD, DIF, or low discrimination were considered for removal. Computerized adaptive testing performance was simulated, and short forms were constructed.
3,048 children (51.8% female, 40% non-white, 22.7% chronically ill) participated. At least 754 respondents answered each item. Factor analytic results confirmed two dimensions of PF. Fifty-two of 70 items tested were retained. A 23 item mobility bank and a 29 item upper extremity bank resulted, and 8 item short forms were created. The item banks have high information from the population mean to 3 standard deviations below.
PROMIS pediatric PF item banks and 8-item short forms assess two dimensions, mobility and upper extremity function, and show good psychometric characteristics after large scale testing.
quality of life; outcome measure; disability; child; adolescent; psychometric methods
data collection is now considered mandatory. Therefore, staff rated clinical scales that consist of multiple items should have the minimum number of items necessary for rigorous measurement. This study explores the possibility of developing a short form Barthel index, suitable for use in clinical trials, epidemiological studies, and audit, that satisfies criteria for rigorous measurement and is psychometrically equivalent to the 10 item instrument.
were analysed from 844 consecutive admissions to a neurological
rehabilitation unit in London. Random half samples were generated.
Short forms were developed in one sample (n=419), by selecting items
with the best measurement properties, and tested in the other (n=418).
For each of the 10 items of the BI, item total correlations and effect
sizes were computed and rank ordered. The best items were defined as
those with the lowest cross product of these rank orderings. The
acceptability, reliability, validity, and responsiveness of three short
form BIs (five, four, and three item) were determined and compared with
the 10 item BI. Agreement between scores generated by short forms and
10 item BI was determined using intraclass correlation coefficients and
the method of Bland and Altman.
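The item-selection rule and the agreement check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and all numeric values are hypothetical, rank 1 is taken to mean "best" (highest value), ties are broken alphabetically (an assumption), and the limits of agreement use the conventional bias ± 1.96 SD of the paired differences.

```python
from statistics import mean, stdev


def rank_desc(scores):
    """Rank items so the highest score gets rank 1 (ties broken by name)."""
    ordered = sorted(scores, key=lambda item: (-scores[item], item))
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def select_items(item_total_corr, effect_size, k):
    """Return the k items with the lowest product of the two rank orderings."""
    corr_rank = rank_desc(item_total_corr)
    es_rank = rank_desc(effect_size)
    product = {item: corr_rank[item] * es_rank[item] for item in item_total_corr}
    return sorted(product, key=lambda item: (product[item], item))[:k]


def limits_of_agreement(scores_a, scores_b):
    """Bland-Altman 95% limits of agreement for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of differences
    return bias - spread, bias + spread


# Hypothetical values for illustration only; NOT the study's data.
corr = {"transfers": 0.85, "bathing": 0.70, "feeding": 0.60, "bowels": 0.50}
es = {"transfers": 0.90, "bathing": 0.95, "feeding": 0.40, "bowels": 0.45}
best_two = select_items(corr, es, k=2)

# Hypothetical paired scores (e.g. a rescaled short form vs. the full scale).
low, high = limits_of_agreement([10, 12, 14, 16], [9, 13, 13, 17])
```

Multiplying the two ranks favours items that do well on both criteria at once; an item ranked first on one criterion but last on the other is penalised relative to one that is consistently near the top.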
RESULTS—The five best
items in this sample were transfers, bathing, toilet use, stairs, and
mobility. Of the three short forms examined, the five item BI had the
best measurement properties and was psychometrically equivalent to the
10 item BI. Agreement between scores generated by the two measures for
individual patients was excellent (ICC=0.90) but not identical (limits
item short form BI may be a suitable outcome measure for group
comparison studies in comparable samples. Further evaluations are
needed. Results demonstrate a fundamental difference between assessment
and measurement and the importance of incorporating psychometric
methods in the development and evaluation of health measures.
This study was conducted to translate and validate the Brief Pain Inventory (BPI) questionnaire in the Malay language. The psychometric properties in terms of construct and concurrent validity of the Malay version of BPI were evaluated. The internal consistency and test-retest stability were also evaluated.
The original version of the BPI was translated into Malay using the standard procedure and piloted among 35 cancer patients with pain. Of 119 eligible patients, 113 (95.0%) agreed to participate in the main study; their ages ranged from 18 to 76 years. They were interviewed between August and November 2004 to evaluate the psychometric properties of the Malay version of the BPI.
In factor analysis, the pain intensity items loaded highly on one factor and the pain interference items loaded on the other. Together, the two factors explained 62% of the variance. Against the Karnofsky Performance Scale (KPS), the pain intensity scale had a moderate negative (Pearson’s) correlation (r=−0.520, p<0.001) and the pain interference scale had a good negative correlation (r=−0.732, p<0.001), indicating appropriate concurrent validity. The coefficient alpha of both scales demonstrated good internal consistency of the items. The intraclass correlation coefficient for test-retest stability was 0.61 for the pain intensity scale and 0.88 for the pain interference scale.
Overall, the Malay version of the BPI is a reliable and valid instrument for cancer pain assessment and it is comparable with the original version of the BPI in terms of structure and psychometric properties.
The Mayo Cognitive Factor Scores were derived from a “core battery” consisting of the WAIS-R, WMS-R, and Auditory Verbal Learning Test. The present study sought to clarify the factor structure of an expanded neuropsychological battery in normal elderly controls. Confirmatory factor analysis was performed on the WAIS-III, WRAT-3 Reading, Boston Naming Test, Controlled Oral Word Association Test, Category Fluency, Rey-Osterrieth Complex Figure, Visual Form Discrimination, and Trail Making Test A & B. A base four-factor model consistent with the WAIS-III factor structure was utilized. Only one novel five-factor model, which differentiated processing speed tests from motor speed tests, improved upon this base model. Other models, including those with a factor for executive function, a division of construction/visuospatial ability, or “hold”/“no hold” language abilities, did not.
The 15-item Care Transition Measure (CTM-15) is a measure for assessing the quality of care during transition from the patients’ perspective. The purpose of this study was to test the psychometric properties of the CTM-15 and CTM-3 (a 3-item version of the CTM-15) in Singapore, a multi-ethnic urban state in South-east Asia.
A consecutive sample of patients was recruited from two tertiary hospitals. The subjects or their proxies were interviewed 3 weeks after discharge from hospital to home in English or Chinese using the CTM-15 questionnaire. Information about patients’ visits to the emergency department (ED), non-elective rehospitalisation for the condition of index hospitalisation, and care experience after discharge was also collected from respondents. Psychometric properties of CTM-15 and CTM-3 based on the five-point response scale (i.e. strongly disagree, disagree, neutral, agree, and strongly agree) and the three-point response scale (i.e. [strongly] agree, neutral, and [strongly] disagree) were tested for English and Chinese versions separately. Internal consistency reliability was assessed using Cronbach’s alpha and construct validity was tested with t-test or Pearson’s correlation by examining hypothesised association of CTM scores with ED visit, rehospitalisation, and experience with care after discharge. Exploratory factor analysis was performed to examine latent dimensions of CTM-15.
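For readers unfamiliar with the internal consistency statistic named above, here is a minimal sketch of Cronbach's alpha using the standard formula alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). It is not the study's analysis code: the function name and the response matrix are hypothetical, and sample (n−1) variances are assumed.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items matrix of numeric scores."""
    k = len(rows[0])                      # number of items
    columns = list(zip(*rows))            # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical five-point responses from five respondents to three items;
# NOT data from the study.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)
```

Because alpha depends on the number of items k as well as on how strongly the items covary, a 3-item scale will generally yield a lower alpha than a 15-item scale built from similar items.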
A total of 414 (proxy: 96.1%) and 165 (proxy: 84.8%) subjects completed the interviews in English and Chinese, respectively. Cronbach’s alpha values of the different CTM-15 versions ranged from 0.81 to 0.87. In contrast, Cronbach’s alpha values of the CTM-3 ranged from 0.42 to 0.63. Both CTM-15 and CTM-3 were correlated with care experience after discharge regardless of survey language or response scale (Pearson’s correlation coefficient: 0.36 to 0.46). Among the English-speaking respondents, the CTM-15 and CTM-3 scores based on both the three- and five-point response scales discriminated well between patients with and without ED visits or rehospitalisation for their index condition. Among Chinese-speaking respondents, no difference in CTM scores was observed between patients with and without ED visits or patients with and without rehospitalisation. The English and Chinese versions of the CTM-15 items demonstrated a similar 4-factor structure representing general care plan, medication, agreement on care plan, and specific care instructions.
The care transition measure is a valid and reliable measure for quality of care transition in Singapore. Moreover, the care transition measure can be administered to proxies using a simpler response scale. The discriminatory power of the Chinese version of this instrument needs to be further tested in future studies.
To evaluate existing measures of health numeracy using Item Response Theory (IRT).
A cross-sectional study was conducted. Participants completed health numeracy measures, including the Lipkus Expanded Health Numeracy Scale (Lipkus) and the Medical Data Interpretation Test (MDIT). The Lipkus and MDIT were scaled with IRT utilizing the 2-parameter logistic model.
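As a brief illustration of the 2-parameter logistic model mentioned above, the sketch below computes the probability of a correct response and the item information at a given ability level. It assumes the common logistic parameterisation P(θ) = 1 / (1 + exp(−a(θ − b))) without the 1.7 scaling constant that some software applies. The function names and parameter values are hypothetical and are not estimates from the study.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response.

    theta: respondent ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical item: moderately discriminating (a = 1.2), slightly hard (b = 0.5).
for theta in (-2.0, 0.0, 0.5, 2.0):
    print(theta, round(p_correct(theta, 1.2, 0.5), 3),
          round(item_information(theta, 1.2, 0.5), 3))
```

Under this model an item is most informative at θ = b, and its peak information grows with the square of the discrimination parameter, which is why items with low discrimination contribute little to the Test Information Function.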
Three hundred fifty-nine (359) participants were surveyed. Classical test theory parameters and IRT scaling parameters of the numeracy measures showed most items to be at least moderately discriminating. Modified versions of the Lipkus and MDIT were scaled after eliminating items with low discrimination, high difficulty parameters, and poor model fit. The modified versions demonstrated a good range of discrimination and difficulty as indicated by the Test Information Functions.
An IRT analysis of the Lipkus and MDIT indicates that both health numeracy scales discriminate well across a range of ability.
Health numeracy skills are needed in order for patients to successfully participate in their medical care. The accurate assessment of health numeracy may help health care providers to tailor patient education interventions to the patient’s level of understanding and ability. Item response theory scaling methods can be used to evaluate the discrimination and difficulty of individual items as well as the overall assessment.
Item Response Theory; Numeracy; Health Literacy; Measurement