Ann Rheum Dis. 2007 August; 66(8): 1078–1084.
Published online 2007 January 11. doi:  10.1136/ard.2006.058693
PMCID: PMC1954720

Reliability of the ICF Core Set for rheumatoid arthritis



The comprehensive ICF Core Set for rheumatoid arthritis (RA) is a selection of 96 categories from the International Classification of Functioning, Disability and Health (ICF), representing relevant aspects in the functioning of RA patients.


To study the reliability of the ICF Core Set for RA in rheumatological practice, and to explore the metric of the qualifiers' scale.


25 RA patients from an outpatient department of rheumatology were interviewed using the ICF Core Set for RA (76% females, mean (SD) age 57.5 (12.5) years, disease duration 15.9 (14.6) years). Interviews were performed independently by both a physiotherapist and an occupational therapist on the same day and again after one week by one of them. The severity of the patients' problems was quantified on a qualifier scale ranging from 0 (no problem) to 4 (complete problem). Analyses of intra‐rater and inter‐rater agreement, kappa statistics, and Rasch analyses were applied.


Mean intra‐rater (inter‐rater) complete agreement for all categories was seen in 59% (47%) of observations, ranging from 29% (0%) to 96% (80%) for individual categories. Weighted kappa statistics with value ≥0.4 showed reliability in 86% of categories within raters, and in 43% of categories between raters. Improved inter‐rater and intra‐rater reliability was observed with a reduced number of qualifiers for the categories.


Inter‐rater and intra‐rater reliability of the ICF Core Set for RA was low to moderate. The metric of the qualifiers' scale may be improved by reducing the number of qualifiers to three for all components.

Keywords: International Classification of Functioning, Disability and Health (ICF); reliability; ICF Core Set; rheumatoid arthritis; health status measure

Rheumatoid arthritis (RA) is a chronic disabling disease that is often associated with limitations in physical, mental, and social function,1,2 with potential work disability.3,4 In order to describe and assess daily functioning and disability from a bio‐psychosocial perspective in all aspects of health, the framework of the World Health Organization International Classification of Functioning, Disability and Health (ICF) can be used. This framework provides a unified and standard language for the description of health and health related conditions and a common framework among all health professions.5 The ICF structure comprises two parts: (1) functioning and disability; and (2) contextual factors. Each part has two components: (i) Body Functions and Structures, and Activities and Participation; (ii) Environmental Factors and Personal Factors. Components are further defined by the so called ICF categories. To rate the magnitude or the severity of the problem in each of the ICF categories, WHO proposes the so called qualifiers' scale. The categories are scored by a health professional during an interview with the patient.

In order to facilitate the application of the ICF in clinical practice, core sets were developed for specific diseases as short lists of the ICF categories that are important for patients. They serve either as minimal standards for the reporting of functioning and health in clinical and epidemiological studies and in clinical encounters (brief ICF Core Set), or as standards for multiprofessional, comprehensive assessment that takes influential environmental factors into account (comprehensive ICF Core Set). The comprehensive ICF Core Set for RA represents the typical spectrum of functioning of patients with RA with a selection of 96 categories,6 in four components: Body Functions (b), Body Structures (s), Activities and Participation (d), and Environmental Factors (e). The preliminary version of the comprehensive ICF Core Set for RA6 (hereafter referred to as the ICF Core Set for RA) was developed by a group of experts, all rheumatology health professionals, in a formal decision making and consensus process. The process included a Delphi exercise,7 a systematic literature review,8 and an empirical data collection with the ICF checklist.9

The ICF Core Set for RA has undergone content validation from a patient perspective,10,11 but the reliability of the qualifiers' scale of the ICF Core Set for RA has not yet been studied. The objective of this study was to assess the observer reliability of the ICF Core Set for RA in rheumatological practice. The specific aims were (1) to estimate the inter‐rater and intra‐rater reliability of the ICF Core Set for RA when applied by health professionals in a specialised rheumatology facility, and (2) to study the metric of the qualifiers' scale.

Patients and methods

Study design

We conducted a reliability study with assessments at two time points one week apart in RA patients with stable disease activity.


A sample of 25 patients with RA12 was included in the study after giving written informed consent. Patients were recruited consecutively from the rheumatology outpatient department at Diakonhjemmet Hospital in Oslo, Norway. The recent disease history was established from the patients and from hospital records.

Data collection

Patients were interviewed twice: at the first time point (T1) and 6–10 days later at the second time point (T2). Interviews at T1 were performed independently by an occupational therapist (SL) and by a physiotherapist (RHM). At T2, each patient was assessed by only one of the two therapists, assigned at random. Both health professionals worked at the department of rheumatology and had been trained using a video issued by the ICF branch of the WHO at the Ludwig‐Maximilian University in Munich, Germany.

The health professionals rated the magnitude or the severity of the problem in each of the ICF categories primarily on the basis of information obtained during the interview and, only when necessary and in very few cases, on medical information available in the patient charts.

A short introduction to the concepts of the ICF was given in lay terms to all patients at the beginning of each interview. Open ended questions were used to ask patients about their level of functioning in the area covered by each of the ICF categories contained in the ICF Core Set for RA. Patients also filled in self administered questionnaires while waiting for the interview. The rheumatology professionals had no access to the completed questionnaires.


ICF Core Set for RA

The ICF Core Set for RA includes 25 categories from the component Body Functions, 18 from the component Body Structures, 32 from the component Activities and Participation, and 21 from the component Environmental Factors.6 By error, one of the categories (b715 stability of joints function) was omitted, leaving 95 categories for assessment. The severity of the patients' problems in each of the ICF categories is quantified with the qualifiers' scale. The qualifiers' scale of the components Body Functions and Structures, and Activities and Participation has five response levels ranging from 0 to 4, corresponding to no/mild/moderate/severe/complete impairment. For example, a moderate problem with walking would give a score of 2. The qualifiers' scale of the component Environmental Factors has nine response levels ranging from −4 to +4. A specific environmental factor can be a barrier (−4 to −1), a facilitator (+1 to +4), or can have no influence (0) on the patient's functioning. If the factor has an influence, the extent of the influence (either positive or negative) can be coded with mild/moderate/severe/complete. In addition, there are the response options “8, not specified” and “9, not applicable” for each category.

Health status

Measures to describe the study population and the stability of the disease between the two time points included visual analogue scales (0–100) for pain, fatigue, and patient global assessment of disease activity, the modified Health Assessment Questionnaire (MHAQ),13 short form SF‐36,14 and Rheumatoid Arthritis Disease Activity Index (RADAI).15

Data analysis and statistics

We used descriptive statistics to describe the study population. To examine the stability of patients between both assessments the limits of agreement statistics by Bland and Altman were applied.16 Patients were excluded from the intra‐rater reliability analysis if the difference in health status exceeded limits of agreement in two of the seven measures (pain, fatigue, patient global, RADAI, MHAQ, mental health, and physical function from SF‐36) over the 6–10 day period between the assessments.
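The stability check described above can be sketched in Python as follows. This is only an illustration, not the authors' SPSS procedure: it assumes the conventional Bland–Altman limits (mean difference ± 1.96 SD of the differences), it reads "two of the seven measures" as "at least two", and the function names and data layout are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(t1, t2):
    """Return (lower, upper) 95% limits of agreement for paired scores.

    Assumes the conventional Bland-Altman definition:
    mean difference +/- 1.96 * sample SD of the differences.
    """
    diffs = [b - a for a, b in zip(t1, t2)]
    centre = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return centre - half_width, centre + half_width


def unstable_patients(measures, min_outside=2):
    """Return indices of patients whose T1-to-T2 change falls outside the
    limits of agreement in at least `min_outside` measures.

    `measures` maps a measure name to a (t1_scores, t2_scores) pair, with
    one score per patient in the same patient order (hypothetical layout).
    """
    n_patients = len(next(iter(measures.values()))[0])
    outside = [0] * n_patients
    for t1, t2 in measures.values():
        low, high = limits_of_agreement(t1, t2)
        for i, (a, b) in enumerate(zip(t1, t2)):
            if not low <= (b - a) <= high:
                outside[i] += 1
    return [i for i, count in enumerate(outside) if count >= min_outside]
```

Patients returned by `unstable_patients` would be the ones left out of the intra‐rater analysis under these assumptions.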

Since the response options “not specified” (8) and “not applicable” (9) of the qualifiers' scale are not part of the ordinal scale ranging from 0 to 4 for the components of functioning (−4 to +4 for environmental factors), they were not included in the kappa statistics. Thus, the response options 8 and 9 were considered missing data.
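A minimal Python sketch of this handling of codes 8 and 9 is given below. It assumes pairwise deletion (a pair of ratings is dropped if either rating is 8 or 9), which the text does not spell out; the function names are hypothetical.

```python
MISSING_CODES = {8, 9}  # 8 = "not specified", 9 = "not applicable"


def to_ordinal(score, environmental=False):
    """Return the qualifier as a point on the ordinal scale, or None if missing.

    Functioning components use 0..4; environmental factors use -4..+4.
    """
    if score in MISSING_CODES:
        return None
    lowest = -4 if environmental else 0
    if not lowest <= score <= 4:
        raise ValueError(f"qualifier out of range: {score}")
    return score


def paired_ratings(ratings_a, ratings_b, environmental=False):
    """Keep only the rating pairs in which both ratings are on the ordinal scale.

    Pairwise deletion is an assumption made for this sketch.
    """
    pairs = []
    for a, b in zip(ratings_a, ratings_b):
        a = to_ordinal(a, environmental)
        b = to_ordinal(b, environmental)
        if a is not None and b is not None:
            pairs.append((a, b))
    return pairs
```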

To examine the inter‐rater and intra‐rater reliability of the ICF categories, raw agreement was calculated and weighted kappa statistics were applied to investigate whether there was more agreement than might occur by chance given random guessing. The choice of this approach was based on recent suggestions by, for example, Landis and Koch,17 to place more weight on the raw data than on the kappa coefficient in agreement statistics.18,19 Intraclass correlation coefficient (ICC) analyses were not applicable for the categorical analyses in this study.
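The two agreement measures can be illustrated with a short Python sketch. The authors computed weighted kappa in SAS and do not state the weighting scheme; linear disagreement weights are assumed here purely for illustration, and the function names are hypothetical.

```python
def raw_agreement(pairs):
    """Proportion of rating pairs in which both ratings are identical."""
    return sum(a == b for a, b in pairs) / len(pairs)


def weighted_kappa(pairs, levels=(0, 1, 2, 3, 4)):
    """Cohen's weighted kappa with linear disagreement weights (assumed).

    kappa_w = 1 - (weighted observed disagreement / weighted expected
    disagreement), where the expected value comes from the raters' marginals.
    """
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(pairs)

    # Observed proportions for each (rating A, rating B) cell.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in pairs:
        observed[index[a]][index[b]] += 1 / n

    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at the extremes
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]

    if disagree_exp == 0:
        # Both raters used one and the same level throughout: kappa is undefined.
        return float("nan")
    return 1 - disagree_obs / disagree_exp
```

Reporting `raw_agreement` alongside `weighted_kappa` reflects the emphasis on raw data described above, since kappa can be low even when raw agreement is high.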

To explore the metric of the qualifiers' scale, the Rasch model for ordered response levels was used.20 Reversed threshold estimates provide sufficient evidence to conclude that the empirical ordering is not consistent with the intended ordering. Thresholds define the boundaries between response categories. Collapsing response levels may reveal the effective number and ordering of levels post hoc.20 Therefore, in case of reversed threshold estimates, response levels were collapsed to obtain an effective number and ordering of response categories post hoc.
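For readers unfamiliar with threshold ordering, one common way of writing a Rasch model for ordered response levels is the rating formulation below. The notation is introduced here for illustration and is not the authors': \theta_n is the location of patient n, \delta_i the location of ICF category i, \tau_k the k-th threshold, and m the number of thresholds (one fewer than the number of response levels, so m = 4 for a five level scale). The text does not state which parameterisation was fitted in RUMM 2020, so this should be read as a sketch only.

```latex
\[
P(X_{ni} = x) =
  \frac{\exp\!\left( x(\theta_n - \delta_i) - \sum_{k=1}^{x} \tau_k \right)}
       {\sum_{j=0}^{m} \exp\!\left( j(\theta_n - \delta_i) - \sum_{k=1}^{j} \tau_k \right)},
  \qquad x = 0, 1, \dots, m,
\]
% Empty sums (x = 0 or j = 0) are taken to be zero.
% Intended ordering of the response levels:  \tau_1 < \tau_2 < \dots < \tau_m.
% A reversed threshold means \tau_k > \tau_{k+1} for some k.
```

Under this formulation, a reversed pair of thresholds means that the response level between them is never the most probable response at any point along the scale, which is the situation that collapsing adjacent levels is meant to resolve.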

Two dimensions were studied with the Rasch model: the dimension “functioning” and the dimension “environmental factors.” In line with the bio‐psychosocial model on which the ICF is based, the ICF categories of the components Body Functions and Structures, and Activities and Participation were included in the Rasch analysis to study the dimension functioning. The environmental factors categories were included in the Rasch analysis to study the dimension environmental factors. After collapsing response categories, agreement statistics were recalculated.

The data analysis regarding descriptive statistics and Bland–Altman plots was performed using SPSS 12.0. Weighted kappa values were analysed using the Statistical Analysis System (SAS version 9.1.3). Rasch analyses were performed with the program RUMM 2020.21 When applicable, the level of statistical significance was set to p<0.05. The study was approved by the regional ethics committee.


Results

Demographic data and clinical characteristics of the participants are shown in table 1. For the analyses of the intra‐rater reliability, two patients were excluded because the disease was not stable between the two time points of assessment. Mean time for assessments with the ICF Core Set for RA was 34.2 (SD 9.1, range 20–75) minutes.

Table 1 Demographic characteristics and scores for health status measures of patients (n = 25) at baseline (mean (SD) for continuous variables, % for counts)

Missing values of more than 5%, the result of response options “not specified” and “not applicable,” were present in 9/95 categories within raters (9%) and 23/95 categories (24%) between raters.

Tables 2–5 list the ICF categories in the different components Body Functions, Body Structures, Activities and Participation, and Environmental Factors and present intra‐rater and inter‐rater reliability with percentages for complete agreement and kappa statistics.

Table 2 Body functions: categories from ICF Core Set for RA with intra‐rater and inter‐rater reliability
Table 3 Body structures: categories from ICF Core Set for RA with intra‐rater and inter‐rater reliability
Table 4 Activities and participation: categories from ICF Core Set for RA with intra‐rater and inter‐rater reliability
Table 5 Environmental factors: categories from ICF Core Set for RA with intra‐rater and inter‐rater reliability

Mean intra‐rater agreement for all ICF categories was 59%, which increased to 72% after collapsing of qualifiers. Agreement for individual categories ranged from 29% (e340) to 96% (b510) before, and from 44% (e450) to 96% (b510) after, collapsing of qualifiers (tables 2–5).

Mean inter‐rater agreement for all ICF categories was 47%, which increased to 61% after collapsing of qualifiers. Agreement for individual categories ranged from 0% (e450) to 80% (d560) before, and from 8% (d415) to 88% (d560) after, collapsing of qualifiers (tables 2–5).

The mean intra‐rater agreement per component was 61% for Body Functions, 62% for Body Structures, 60% for Activities and Participation, and 52% for Environmental Factors. The mean inter‐rater agreement was 55% for Body Functions, 46% for Body Structures, 51% for Activities and Participation, and 31% for Environmental Factors. At least 50% agreement was seen in 52% of the ICF categories between raters (78% after collapsing) and in 77% of the ICF categories within raters (99% after collapsing).

Weighted kappa statistics showed reliability of 0.4 or higher in 82/95 ICF categories (86%) within raters, but only in 41/95 ICF categories (43%) between raters (table 6).

Table 6 Frequency of observer agreement within and between raters for categories in the ICF Core Set for RA

Rasch analyses suggested that reduction of the number of qualifiers from five to three—and from nine to three for environmental factors—improved both inter‐rater and intra‐rater agreement. According to these results, the response levels 1–2 and 3–4 of the ICF categories belonging to Body Functions, Body Structures, and Activities and Participation were collapsed, respectively. In the component Environmental Factors, the response levels from −4 to −1 and from 1 to 4 were collapsed.

Several considerations were thereby taken into account: Firstly, the number of response categories that does not follow the consecutive order intended was considered. Secondly, a further collapsing strategy was studied—namely, the collapsing of the response categories 3 and 4. However, this strategy did not yield satisfactory results as most of the ICF categories still presented response categories that did not have a consecutive order (results not shown). Also, owing to the low frequencies in response categories 3 and 4, no further collapsing strategies, such as collapsing response categories 1 and 2 and 2 and 3, were considered. Finally, the same response format was intended for all ICF categories. The proposed collapsing strategy is clinically intuitive for judging the severity of a problem in the corresponding ICF categories. After collapsing the response categories, only four ICF categories in the functioning component did not follow a consecutive order and five in the component environmental factors.
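The collapsing strategy described above can be sketched in Python as follows. The groupings (1–2 and 3–4 for the functioning components; −4 to −1 and 1 to 4 for environmental factors) follow the text, but the labels given to the collapsed levels (0/1/2 and −1/0/+1) and the function names are assumptions made for illustration.

```python
def collapse_functioning(qualifier):
    """Collapse the five functioning levels to three: 0 -> 0, 1-2 -> 1, 3-4 -> 2."""
    if qualifier == 0:
        return 0
    if qualifier in (1, 2):
        return 1
    if qualifier in (3, 4):
        return 2
    raise ValueError(f"qualifier out of range: {qualifier}")


def collapse_environmental(qualifier):
    """Collapse the nine environmental levels to three:
    barrier (-4..-1) -> -1, no influence (0) -> 0, facilitator (1..4) -> +1.
    """
    if not -4 <= qualifier <= 4:
        raise ValueError(f"qualifier out of range: {qualifier}")
    return (qualifier > 0) - (qualifier < 0)


def collapse_pairs(pairs, collapse):
    """Apply a collapsing function to both ratings of every pair."""
    return [(collapse(a), collapse(b)) for a, b in pairs]


def raw_agreement(pairs):
    """Proportion of rating pairs in which both ratings are identical."""
    return sum(a == b for a, b in pairs) / len(pairs)


# Invented example ratings (not study data): two raters, four patients.
example = [(1, 2), (3, 4), (0, 0), (2, 3)]
before = raw_agreement(example)
after = raw_agreement(collapse_pairs(example, collapse_functioning))
```

In this invented example raw agreement rises from 25% to 75%, because ratings that differed only within a collapsed group now count as identical. It illustrates the mechanism only and says nothing about the size of the effect in the study data.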


Discussion

In this study the ICF Core Set for RA demonstrated low and at best moderate reliability. Agreement for the individual categories between different health professionals was lower than agreement within the same professional. Explorative analyses suggested that reducing the scale qualifiers from five to three (and from nine to three for environmental factors) could improve both inter‐rater and intra‐rater reliability.

This is the first study to explore and establish the reliability of an ICF Core Set for a specific disease. Extensive testing of the ICF is necessary as the ICF is a WHO adopted classification for global application. Reliability testing of the ICF has so far only been performed in one other study where not a core set, but specified ICF categories were tested for inter‐rater reliability in geriatric patients, and a moderate reliability was reported.22

ICF categories are not self assessed: a patient's report is interpreted and scored by an interviewer, which necessarily leads to discrepancies between scores and to reduced reliability. Health professionals with different specialties, such as the occupational therapist and the physiotherapist in our study, also differ in their focus on and awareness of the individual ICF categories. One would therefore expect better inter‐rater agreement if professionals from the same specialty made the judgments.

Within this context it is important to remember that limits of agreement in clinical and health status measures are generally quite wide. This was, for example, demonstrated when we examined the test‐retest reliability for different data collection methods of self reported health status in patients with RA.23 In rheumatology, low reliability when grading tender joints is known,24 and inter‐rater agreement hardly exceeds the level of chance.25 Therefore, it does not surprise us that certain ICF categories of the ICF Core Set for RA show low reliability.

In this study, low reliability, in particular low inter‐rater agreement, was demonstrated especially for categories concerning the environmental factors. Categories from this component of the ICF are scored by attributing a positive or negative graded weight. During the assessment, the patient indicates whether a factor (for example, family) is a facilitator, a barrier, or both. Because the patient may describe an environmental factor as both a facilitator and a barrier, the interviewer faces the challenge of deciding on one overall judgment when scoring the individual category. In addition, environmental factors may be difficult to assess owing to difficult technical terms that do not always have associations in everyday life or do not have a clear meaning. Another problem in testing the ICF arises from the mixed scale, which combines the ordinal qualifiers 0–4 with codes recording that a category is not specified or not applicable. The ICF qualifiers “8, not specified” and “9, not applicable” are very useful from a clinical point of view but they represent a “barrier” that is difficult to overcome when performing statistical analyses.

Assessment of patients with the ICF Core Set is time consuming. Lack of clarity in the wording of the ICF categories may also have created a need for explanation, leading to variance in the time required to complete the assessment (range 20–75 minutes) and potentially to variance in agreement. Increasing experience of the health professionals in this study did not considerably shorten the time (data not shown); time use was constant across all three assessments. The reliability of the core set in RA might be improved by means of a training manual.

Our analysis suggests reducing the number of qualifiers when applying the ICF Core Set for RA, because the metric of the five qualifier scale was not sound. In addition to improved reliability, the feasibility of the ICF Core Set in RA could be improved when fewer qualifiers need to be considered by the assessor. The suggested reduction of response qualifiers needs further confirmation in other studies as our finding of improved reliability with fewer response categories was driven by our own data. In addition, the sample size of our study has to be considered. This sample size was comparable with other studies examining reliability.23,26 The number of ICF categories considered in the dimension functioning and in the dimension environmental factors for the Rasch analyses, respectively, is fairly high in relation to the number of patients in our sample. In future studies, a larger sample size should be included in the Rasch analyses to augment the precision of the parameter estimations of the Rasch model.

A limitation of this study is that it was conducted at one centre, so our findings may not be extrapolated to other settings of patients and health professionals. In addition, the results of the Rasch analyses were used to derive conclusions regarding the category response functions of the qualifiers' scale and were not further considered to obtain information regarding the fit of the ICF categories to the Rasch model. Further studies are needed to investigate whether, from the viewpoint of modern test theory, the ICF Core Set for RA can contribute to building a measure.

A strength of this study is that we rigorously sought to examine reliability of the comprehensive ICF Core Set for RA for both inter‐rater and intra‐rater agreement. Patients in our sample represent the typical average age group of patients with RA in our outpatient clinic27 with age comparable to patients in a recent validation of the patient perspective of the ICF Core Set for RA.10 Both assessors in this study were experienced clinicians with interest in and knowledge of the ICF concept.

In summary, this study has established the reliability of the ICF Core Set for RA, which was low to moderate and varied considerably across categories and between raters. Our analysis suggests reducing the number of qualifiers when applying the ICF Core Set in RA. It remains to be shown whether this empirical observation of improved agreement with a reduced number of three response qualifiers in all dimensions may enhance the acceptance and feasibility of the ICF Core Set in RA. The importance of testing the ICF Core Set for RA is apparent as the ICF has been adopted by WHO, and research on the applicability of the ICF in RA and in other rheumatic diseases is warranted to confirm our findings in this new field. The ICF is an ongoing process in which results from evaluations and suggestions for further improvement are continuously integrated.


Acknowledgements

This project is part of the ICF Core Sets Validation Study and was supported by a research grant from the European League Against Rheumatism (EULAR). The authors thank all patients who participated in the study.


Abbreviations

ICF - International Classification of Functioning, Disability and Health

MHAQ - Modified Health Assessment Questionnaire

RA - rheumatoid arthritis

RADAI - Rheumatoid Arthritis Disease Activity Index

SF‐36 - Short Form 36

VAS - visual analogue scale


References

1. Uhlig T, Kvien T K, Glennås A, Smedstad L M, Førre O. The incidence and severity of rheumatoid arthritis, results from a county register in Oslo, Norway. J Rheumatol 1998;25:1078–1084.
2. Wolfe F. A reappraisal of HAQ disability in rheumatoid arthritis. Arthritis Rheum 2000;43:2751–2761.
3. Sokka T, Kautiainen H, Möttonen T, Hannonen P. Work disability in rheumatoid arthritis 10 years after the diagnosis. J Rheumatol 1999;26:1681–1685.
4. Ødegård S, Kvien T K, Finset A, Uhlig T. Physical and psychological predictors for work disability over seven years in patients with rheumatoid arthritis. Scand J Rheumatol 2005;34:441–447.
5. World Health Organization. International classification of functioning, disability and health: ICF. Geneva: WHO, 2001.
6. Stucki G, Cieza A, Geyh S, Battistella L, Lloyd J, Symmons D, et al. ICF core sets for rheumatoid arthritis. J Rehabil Med 2004;44(suppl):87–93.
7. Weigl M, Cieza A, Andersen C, Kollerits B, Amann E, Stucki G. Identification of relevant ICF categories in patients with chronic health conditions: a Delphi exercise. J Rehabil Med 2004;44(suppl):12–21.
8. Brockow T, Cieza A, Kuhlow H, Sigl T, Franke T, Harder M, et al. Identifying the concepts contained in outcome measures of clinical trials on musculoskeletal disorders and chronic widespread pain using the International Classification of Functioning, Disability and Health as a reference. J Rehabil Med 2004;44(suppl):30–36.
9. Ewert T, Fuessl M, Cieza A, Andersen C, Chatterji S, Kostanjsek N, et al. Identification of the most common patient problems in patients with chronic conditions using the ICF checklist. J Rehabil Med 2004;44(suppl):22–29.
10. Stamm T A, Cieza A, Coenen M, Machold K P, Nell V P, Smolen J S, et al. Validating the International Classification of Functioning, Disability and Health Comprehensive Core Set for Rheumatoid Arthritis from the patient perspective: a qualitative study. Arthritis Rheum 2005;53:431–439.
11. Coenen M, Cieza A, Stamm T A, Amann E, Kollerits B, Stucki G. Validation of the International Classification of Functioning, Disability and Health (ICF) Core Set for rheumatoid arthritis from the patient perspective using focus groups. Arthritis Res Ther 2006;8:R84.
12. Arnett F C, Edworthy S M, Bloch D A, McShane D J, Fries J F, Cooper N S, et al. The American Rheumatism Association 1987 revised criteria for the classification of rheumatoid arthritis. Arthritis Rheum 1988;31:315–324.
13. Pincus T, Summey J A, Soraci S A Jr, Wallston K A, Hummon N P. Assessment of patient satisfaction in activities of daily living using a modified Stanford Health Assessment Questionnaire. Arthritis Rheum 1983;26:1346–1353.
14. Ware J E Jr, Sherbourne C D. The MOS 36‐item short‐form health survey (SF‐36). I. Conceptual framework and item selection. Med Care 1992;30:473–483.
15. Fransen J, Langenegger T, Michel B A, Stucki G. Feasibility and validity of the RADAI, a self‐administered rheumatoid arthritis disease activity index. Rheumatology (Oxford) 2000;39:321–327.
16. Bland J M, Altman D G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–310.
17. Landis J R, Koch G G. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
18. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ 1992;304:1491–1494.
19. Feinstein A R, Cicchetti D V. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543–549.
20. Andrich D. A rating formulation for ordered response categories. Psychometrika 1978;43:581–594.
21. Andrich D, Sheridan B E, Luo G. RUMM2020: Rasch Unidimensional Models for Measurement. Perth, Western Australia: RUMM Laboratory, 2002.
22. Okochi J, Utsunomiya S, Takahashi T. Health measurement using the ICF: test‐retest reliability study of ICF codes and qualifiers in geriatric care. Health Qual Life Outcomes 2005;3:46.
23. Kvien T K, Mowinckel P, Heiberg T, Dammann K L, Dale Ø, Aanerud G J, et al. Performance of health status measures with a pen‐based personal digital assistant. Ann Rheum Dis 2005;64:480–484.
24. Uhlig T, Smedstad L M, Vaglum P, Moum T, Gérard N, Kvien T K. The course of rheumatoid arthritis and predictors of psychological, physical and radiographic outcome after 5 years of follow‐up. Rheumatology (Oxford) 2000;39:732–741.
25. Hart L E, Tugwell P, Buchanan W W, Norman G R, Grace E M, Southwell D. Grading of tenderness as a source of interrater error in the Ritchie articular index. J Rheumatol 1985;12:716–717.
26. Guillemin F, Billot L, Boini S, Gerald N, Øddgård S, Kvien T K. Reproducibility and sensitivity to change of 5 methods for scoring hand radiographic damage in patients with rheumatoid arthritis. J Rheumatol 2005;32:766–768.
27. Uhlig T, Kvien T K, Jensen J L, Axell T. Sicca symptoms, saliva and tear production, and disease variables in 636 patients with rheumatoid arthritis. Ann Rheum Dis 1999;58:415–422.
