The 39-item Parkinson's Disease Questionnaire, and particularly its summary index (PDQ-39SI), is a widely used patient-reported clinical trial endpoint. A basic assumption when summing items into a total score is that they represent a common variable. We therefore assessed the unidimensionality of the PDQ-39SI using Rasch and confirmatory factor analysis. Both analyses showed model misfit. Adjustment for differential item functioning and disordered response category thresholds did not improve model fit, and residual analyses showed deviation from unidimensionality. These data indicate multidimensionality and challenge the interpretation and validity of PDQ-39SI scores. Clinicians and investigators should use and interpret the PDQ-39SI with caution.
A basic assumption when summing rating scale items into a total score is that the items represent a common underlying construct; that is, they should be unidimensional [Nunnally and Bernstein 1994]. When total scores are not unidimensional they are technically invalid and their meaning is ambiguous since it is unclear what scores represent [Smith 2002]. This cannot be compensated for by trial design and may cause misleading inferences that influence patient care. Unambiguous interpretation is a prerequisite for scores to be acceptable as clinical trial endpoints [Food and Drug Administration 2006].
One approach to assess rating scale unidimensionality is Rasch analysis [Wilson 2005; Tennant et al. 2004a; Hobart 2003; Andrich 1988; Rasch 1960]. Since unidimensionality is an explicit Rasch model assumption, adequate model fit has often been taken as support for scale unidimensionality. For example, Jenkinson et al. [2003b] Rasch analyzed the 40-item Amyotrophic Lateral Sclerosis Assessment Questionnaire to determine the legitimacy of summarizing the scale into a single score. Results indicated that the items fit the Rasch model, which was taken as support for unidimensionality and, therefore, also for the validity of constructing an overall summary index [Jenkinson et al. 2003b]. However, it has been shown that multidimensional data can fit the model; therefore, the unidimensionality assumption needs to be tested explicitly [Tennant and Pallant 2006; Smith 2002, 1996].
In Parkinson's disease (PD), the 39-item PD questionnaire (PDQ-39) is the most widely used patient reported rating scale endpoint in clinical trials. Recent observations challenge the validity and interpretability of a majority of its eight scales [Hagell and Nygren 2007] and its eight-item short form, the PDQ-8 [Franchignoni et al. 2008]. However, no studies using methods such as Rasch analysis have assessed the dimensionality of the PDQ-39 summary index (PDQ-39SI), an overall PDQ-39 score [Jenkinson et al. 1997]. Such analyses are relevant because unidimensionality is a relative matter relating to the level of perspective and conceptualization [Pallant and Tennant 2007; Andrich 1988]. For example, although the grouping of items into eight PDQ-39 scales may not have been successful in defining eight unidimensional variables, all 39 items together could still represent a single variable. We therefore assessed whether the PDQ-39 appears to represent a unidimensional construct.
Details of the sample and data collection have been reported elsewhere [Hagell and Nygren 2007]. Briefly, self-reported postal survey PDQ-39 data from 202 people (79% response rate) with neurologist-diagnosed PD [Gibb and Lees 1988] were analyzed (Table 1). The study was approved by the local research ethics committee.
The PDQ-39 [Peto et al. 1995] is a PD-specific health status questionnaire comprising 39 items. Respondents are requested to affirm one of five ordered response categories according to how often, due to their PD, they have experienced the problem defined by each item. Items are grouped into eight scales that are scored by expressing summed item scores as a percentage score ranging between 0 and 100 (100 = more health problems). Based on results from exploratory factor analysis, a PDQ-39 summary index (PDQ-39SI) has been proposed [Jenkinson et al. 1997]. The PDQ-39SI is derived by the sum of the eight PDQ-39 scale scores divided by eight (the number of scales), which yields a score between 0 and 100 (100 = more health problems). This is equivalent to expressing the sum of all 39 item responses as a percentage score.
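The scoring rule described above can be sketched in a few lines. This is a minimal illustration only: the function names and the dictionary layout are assumptions made for the example, item responses are assumed to be coded 0 to 4, and the toy scales below do not reproduce the actual PDQ-39 item-to-scale assignment.

```python
def scale_score(item_responses, max_per_item=4):
    """Express summed item responses (each 0..max_per_item) as a 0-100 percentage."""
    if not item_responses:
        raise ValueError("a scale needs at least one item response")
    return 100.0 * sum(item_responses) / (max_per_item * len(item_responses))


def summary_index(responses_by_scale):
    """Sum of the scale scores divided by the number of scales."""
    scores = [scale_score(items) for items in responses_by_scale.values()]
    return sum(scores) / len(scores)


# Hypothetical respondent with two toy scales (not the real PDQ-39 layout)
example = {"scale_a": [4, 4], "scale_b": [0, 0]}
index = summary_index(example)
```

Higher values indicate more health problems on both the scale scores and the summary index.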
The logic of computing and reporting the PDQ-39SI is based on the assumption that the PDQ-39 represents a single underlying construct [Jenkinson et al. 1997]. This assumption was tested using Rasch analysis and confirmatory factor analysis.
The Rasch model [Rasch 1960] mathematically defines what is required from item responses in order for them to express linear measures rather than mere numbers. It separately locates persons and items on a common logit (log-odds units) metric that is centered by the mean item location, which is set at zero.
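For orientation, the simplest (dichotomous) form of the model can be written as follows. The symbols are notation assumed here for illustration: \(\theta_n\) is the location of person \(n\), \(\delta_i\) the location of item \(i\), and \(I\) the number of items. For ordered response categories, the same logic is applied to each pair of adjacent categories.

```latex
\[
P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\qquad
\ln\frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n - \delta_i,
\qquad
\sum_{i=1}^{I} \delta_i = 0 .
\]
```

The second expression shows why the metric is in log-odds units, and the third expresses the centering of the scale at a mean item location of zero.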
Fit of data to the Rasch model is assessed by examining the accordance between expected and observed responses across person locations (class intervals) on the measured construct [Andrich et al. 2004-2005; Andrich 1988]. Overall fit is supported by a nonsignificant item-trait interaction chi-square statistic, and individual item fit is supported by nonsignificant standardized residuals ranging between −2.5 and +2.5 [Andrich et al. 2004-2005; Andrich 1988]. Residuals represent the discrepancy between observed and expected item responses. Large positive residuals primarily suggest violation of unidimensionality, whereas large negative residuals signal local dependency (i.e. item responses are dependent on responses to other items, suggesting item redundancy). Large residuals, both positive and negative, violate model assumptions and distort measurement.
However, fit statistics can be somewhat insensitive in detecting multidimensionality [Tennant and Pallant 2006; Smith 2002, 1996]. Smith [2002] therefore proposed a combined approach to dimensionality testing. First, a principal component analysis (PCA; a form of factor analysis) of the residuals is used to identify potential subdimensions in the scale. A series of independent t-tests is then conducted to assess whether subsets of items yield different person measures. If violation of unidimensionality is trivial, the number of person locations that differ between two item sets is small. This approach attempts to assess whether scales are sufficiently unidimensional to be treated as such in practice [Tennant and Pallant 2006; Smith 2002].
Differential item functioning (DIF) is an additional aspect of fit to the Rasch model that may result from, for example, multidimensionality and can bias scale scores [Borsboom 2006; Holland and Wainer 1993; Andrich 1988]. DIF analyses assess whether subgroups of people with similar levels on the measured construct respond systematically differently to items [Andrich et al. 2004-2005; Hagquist and Andrich 2004; Tennant et al. 2004]. When DIF is uniform (i.e. item responses differ uniformly between groups across the measured construct) this can be adjusted for by splitting the item into two new items, one for each subgroup [Hagquist and Andrich 2004; Tennant et al. 2004b].
When ordered response categories are used, such as with the PDQ-39, Rasch analysis can assess whether response categories work as assumed; that is, if they reflect an increasing amount of the measured variable [Andrich et al. 2004-2005; Hagquist and Andrich 2004; Hagquist 2001]. If thresholds between adjacent response categories (i.e. the points where there are 50/50 probabilities of scoring, e.g. 2 or 3) are disordered, these categories do not work as intended. This indicates problems such as too many response categories or overlapping category labels, or may be due to multidimensionality [Andrich et al. 2004-2005; Hagquist and Andrich 2004; Hagquist 2001].
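The notion of a threshold can be made concrete with a small sketch of category probabilities for a single item under a partial credit parameterization. The function names and the threshold values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def category_probabilities(theta, thresholds):
    """Probabilities of scoring 0..m on one item with m thresholds (in logits)."""
    # Running sums of (theta - threshold_k); the empty sum for category 0 is 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(thresholds):
    """True if every threshold is located above the preceding one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


# Hypothetical item: a person located exactly at the second threshold (0.5)
# is equally likely to score 1 or 2.
probs = category_probabilities(0.5, [-1.0, 0.5, 2.0])
```

With ordered thresholds, each category is the most probable response somewhere along the continuum. With disordered thresholds, at least one category is never the most probable response at any location.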
To account for the procedure of creating PDQ-39SI scores from the eight suggested PDQ-39 scale scores we also used confirmatory factor analysis. Confirmatory factor analysis assesses statistically whether and to what extent empirical data fit a predefined hypothesized structure. Confirmatory factor analysis is therefore generally recommended over exploratory factor analysis when there is an a priori hypothesis regarding dimensionality [Floyd and Widaman 1995]. The extent to which empirical data accord with the hypothesized structure is assessed by a chi-square statistic that is expected to be nonsignificant when data fit the model. Because this statistic is sensitive to sample size, goodness-of-fit is also assessed by various descriptive fit indices [Schermelleh-Engel and Moosbrugger 2003].
All 39 items were analyzed regarding fit to the unrestricted (partial credit) Rasch model for ordered response categories. Unidimensionality was further scrutinized by PCA of the residuals followed by independent t-tests. Two estimated locations for each person were compared; one from the items with the strongest positive and one from the items with the strongest negative residual loadings (> ±0.3, respectively) on the first principal component (factor) [Tennant and Pallant 2006]. Unidimensionality was considered statistically supported if the proportion of significant individual t-tests, or the lower bound of the associated 95% binomial confidence interval (CI), did not exceed 0.05 [Tennant and Pallant 2006].
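The decision rule in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in the software used in the study: the function names are hypothetical, a normal-approximation critical value of 1.96 stands in for the exact t distribution, and a simple Wald interval is used for the binomial confidence interval, which may differ from the interval the software reports.

```python
import math


def proportion_significant(est_a, est_b, se_a, se_b, critical=1.96):
    """Share of persons whose two location estimates differ significantly."""
    flags = []
    for a, b, sa, sb in zip(est_a, est_b, se_a, se_b):
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        flags.append(abs(t) > critical)
    return sum(flags) / len(flags)


def wald_interval(p, n, z=1.96):
    """Approximate 95% binomial confidence interval for a proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def unidimensionality_supported(p, n, limit=0.05):
    """Supported if the proportion, or its lower CI bound, does not exceed the limit."""
    lower, _ = wald_interval(p, n)
    return p <= limit or lower <= limit
```

For example, `unidimensionality_supported(0.36, 202)` evaluates to `False` under these assumptions, because both the proportion and its lower bound lie well above 0.05.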
Next, the hypothesized scales-to-summary index structure of the PDQ-39SI was assessed by confirmatory factor analysis. The a priori hypothesis that the eight PDQ-39 scales represent a single underlying construct was tested by means of chi-square statistics and four descriptive fit indices: the Goodness-of-Fit Index (GFI), the Adjusted Goodness-of-Fit Index (AGFI), the Comparative Fit Index (CFI), and the Root Mean Square Error of Approximation (RMSEA) [Schermelleh-Engel and Moosbrugger 2003].
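Two of the descriptive indices named above can be computed directly from chi-square statistics using their commonly cited definitions. The sketch below is illustrative: the function names are hypothetical, the RMSEA variant dividing by N − 1 is assumed (some programs divide by N), and the input values in the usage note are invented rather than taken from this study.

```python
import math


def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation (N - 1 variant)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative Fit Index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    if d_baseline == 0.0:
        return 1.0
    return 1.0 - d_model / d_baseline
```

As a usage example, a one-factor model with eight indicators has 20 degrees of freedom (36 observed variances and covariances minus 16 free parameters). With a hypothetical model chi-square of 60.2 and N = 202, `rmsea(60.2, 20, 202)` is approximately 0.10.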
In case of signs of multidimensionality, two potential sources were explored. First, we examined the presence of DIF between genders and age groups (as defined by the median: ≤72 versus >72 years old). When DIF was detected, this was adjusted for by splitting items into subgroup specific items [Andrich et al. 2004-2005; Hagquist and Andrich 2004]. Secondly, we assessed if increasing health problems (as defined by PDQ-39SI scores) were reflected by increasing probabilities of endorsing response categories 0 (‘never’) through 4 (‘always’) by examining the thresholds between categories [Andrich et al. 2004-2005; Hagquist and Andrich 2004]. When disordered thresholds were found, we explored whether collapsing adjacent response categories improved model fit and unidimensionality. Analyses were performed using SPSS 14 (SPSS Inc., Chicago, IL), RUMM2020 (Rumm Laboratory Pty Ltd., Perth) and AMOS 5 (SmallWaters Corp., Chicago, IL) for Windows.
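Collapsing adjacent response categories amounts to recoding raw scores before the model is re-estimated. A minimal sketch follows; the mapping shown is a hypothetical example and not necessarily one of the recodings applied in this study.

```python
def is_valid_collapse(mapping):
    """New scores must start at 0, never decrease, and rise by at most 1 per step."""
    new = [mapping[k] for k in sorted(mapping)]
    return new[0] == 0 and all(0 <= b - a <= 1 for a, b in zip(new, new[1:]))


def collapse_categories(responses, mapping):
    """Recode raw category scores using an explicit old -> new mapping."""
    if not is_valid_collapse(mapping):
        raise ValueError("mapping must merge adjacent categories only")
    return [mapping[r] for r in responses]


# Hypothetical recoding: merge categories 1 and 2 of a 0-4 scale (five -> four)
MERGE_1_AND_2 = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
recoded = collapse_categories([0, 1, 2, 3, 4, 2], MERGE_1_AND_2)
```

The validity check guards against recodings that would reorder categories or skip a score, either of which would change the meaning of the response scale rather than simplify it.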
Rasch analysis yielded a significant item-trait interaction chi-square statistic (χ², 300.064; p < 0.0001), indicating lack of overall model fit. Reliability was 0.96. Inspection of individual item fit suggested that 12 items did not fit the model (Table 2). Among these, eight items (23, 25, 30, 32, 33, 37, 38, and 39) displayed large positive residuals, indicating departure from unidimensionality. PCA followed by independent t-tests showed that the proportion of significantly different person measures based on items with strong positive and negative loadings on the first principal component was 0.36 (95% CI, 0.33–0.39). Similarly, confirmatory factor analysis of the proposed scales-to-summary index structure showed inadequate goodness-of-fit (Figure 1).
Next we examined the presence of DIF by gender and age. Four items (19, 24, 34, and 35) displayed significant DIF by gender and item 10 showed DIF by age. These items were then split into gender and age specific ones, respectively. Overall Rasch model fit remained significant (item-trait interaction χ², 235.358; p < 0.0001) and misfit was found for the same items as before. Reliability was unchanged at 0.96.
We then assessed whether the five response categories worked as assumed. We found disordered response category thresholds in 24 items (Table 3). Threshold disordering typically involved category 1 (‘seldom’), although disordering of all thresholds occurred. Figure 2 exemplifies these observations by displaying items with (Figure 2A, B) and without (Figure 2C) threshold disordering. Response categories were then collapsed into four (16 items: 1, 6, 7, 9-14, 24, 29, 31, 33, 34, 38 and 39) and three (eight items: 3-5, 8, 23, 28, 30 and 37) categories in order to obtain response scale functionality. This did not improve overall model fit (item-trait interaction χ², 236.136; p < 0.0001). At the item level, misfit was resolved for items 29, 37, 38 and 39. However, two additional items (15 and 16) now displayed signs of misfit (fit residual values of 3.65 and 3.23, respectively). Independent t-tests of the DIF adjusted scale with collapsed response categories showed that the proportion of significantly different person measures was 0.35 (95% CI, 0.32–0.38). Reliability was unchanged at 0.96.
This study tested whether the PDQ-39 represents a unidimensional construct. Such assessments are essential as legitimate use of total scores assumes unidimensionality, and violation thereof challenges the meaning and validity of scores. Both Rasch and confirmatory factor analyses gave similar results in that neither approach found support for the unidimensionality of the PDQ-39. This challenges the validity and, consequently, the interpretability of the PDQ-39SI.
There are at least three related reasons why unidimensionality is important to consider [Smith 2002; Stout 1987]. Firstly, unidimensionality is a basic assumption for valid calculation of total scores. Secondly, unambiguous interpretation requires scores to represent a single defined attribute. That is, scores on a scale that is used to measure one variable should not be appreciably influenced by varying levels on one or more other variables. Thirdly, if scores do not represent a common line of inquiry it is unclear if two individuals with the same score can be considered comparable. Similarly, the interpretation of any differences between individuals will be ambiguous since it is unknown how they actually differ. This hampers understanding of clinical trial outcomes, which in turn has consequences for selecting interventions for individual patients.
We found evidence that the PDQ-39 does not represent a unidimensional construct. These observations are in accordance with the ambiguities observed regarding the dimensionality of the eight PDQ-39 scales [Hagell and Nygren 2007] and the PDQ-8 [Franchignoni et al. 2008]. Previous studies addressing the dimensionality of the PDQ-39SI have conducted exploratory factor analyses of the eight PDQ-39 scale scores [Jenkinson and Fitzpatrick 2007; Luo et al. 2005; Tan et al. 2004; Jenkinson et al. 2003a]. Similarly to the initial derivation of the PDQ-39SI [Jenkinson et al. 1997], these studies suggested that the eight PDQ-39 scales represent a common construct by showing that all eight PDQ-39 scale scores loaded on a single factor according to the eigenvalue > 1 criterion. However, this approach is generally discouraged because it typically yields erroneous results [Gorsuch 1983]. Firstly, it tends to identify too many or too few factors (dimensions) and, secondly, the number of factors identified tends to relate to the number of variables included in the analysis (regardless of the actual number of dimensions in the data). With eight variables, as with the PDQ-39 scales, identification of a single dimension is therefore not surprising [Hair et al. 2006]. Furthermore, in the majority of studies using exploratory factor analysis [Jenkinson and Fitzpatrick 2007; Luo et al. 2005; Tan et al. 2004; Jenkinson et al. 2003a] the identified single factor explained less than 50% of the total variance and in no instance did it exceed the recommended 60% [Hair et al. 2006]. That is, more than half of the information contained in the eight PDQ-39 scales was typically not accounted for.
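An idealized example illustrates how the eigenvalue > 1 criterion can retain a single factor that nevertheless explains less than half of the variance. Assume, purely for illustration, \(p\) standardized variables that all correlate pairwise at the same value \(r > 0\); these are not the correlations observed in any of the cited studies.

```latex
\[
\mathbf{R} = (1 - r)\,\mathbf{I} + r\,\mathbf{1}\mathbf{1}^{\top}
\;\Longrightarrow\;
\lambda_1 = 1 + (p - 1)\,r, \qquad \lambda_2 = \dots = \lambda_p = 1 - r .
\]
\[
p = 8,\; r = 0.4:\qquad
\lambda_1 = 3.8, \quad \lambda_2 = \dots = \lambda_8 = 0.6, \quad
\frac{\lambda_1}{p} = \frac{3.8}{8} = 47.5\% .
\]
```

In this setting exactly one eigenvalue exceeds 1 for any positive \(r\), so the criterion returns a single factor regardless of how much variance that factor leaves unexplained.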
It may be obvious from the nature of scales such as the PDQ-39 that they reflect illness-related aspects as perceived and interpreted by patients. On one hand, it may therefore be argued that it is of less concern exactly what such scores represent. However, this would be analogous to relying on an overall score that represents unknown aspects of neurological impairments and using this to understand outcomes in clinical trials. The impact of diseases such as PD is vast and involves a variety of aspects. In order to be able to understand these and to offer interventions to improve patient wellbeing, they need to be measured without ambiguity. This is not to say that valid overall measurement of the impact of disease from the patient's perspective cannot be obtained. A prerequisite, however, is that such instruments are based on and developed according to well-defined theories [Doward et al. 2004].
We found some evidence for the presence of DIF by age and gender. However, this does not appear to be a main source of violations to unidimensionality in the PDQ-39SI since adjustment for DIF did not improve dimensionality. Similarly, while problems with the rating scale response categories may relate to multidimensionality, it appears unlikely that this would be a major explanation here since explorative post hoc combination of response categories did not improve model fit. Instead, the dimensionality problem appears to be a conceptual one where items do not work in harmony to define a common variable.
The observed disordering among the PDQ-39 response category thresholds shows that the response scale does not work as intended. This may be due to, for example, unclear distinctions between categories or difficulties making fine-tuned ratings [Andrich et al. 2004–2005; Hagquist and Andrich 2004; Hagquist 2001]. Referring back to Figure 2B, threshold disordering means that the location at which people are equally likely to respond ‘often’ or ‘always’ represents fewer health problems than that at which they are equally likely to respond ‘sometimes’ or ‘often’. When this phenomenon occurs something has gone wrong in the interaction between respondents, items and the response options, and the clinical meaning of the response scale is unclear.
Similarly to previous observations in the US and UK [Paterson et al. 2005; Bushnell and Martin 1999], respondents to the original Swedish PDQ-39 found the distinction between ‘occasionally’ and ‘sometimes’ ambiguous [Hagell and McKenna 2003]. As in the US PDQ-39 [Bushnell and Martin 1999], the Swedish PDQ-39 used in this study therefore substituted ‘occasionally’ by ‘seldom’ [Kim et al. 2006]. Such a modification is supported by studies specifically addressing people's interpretation of response category labels in other populations [Skevington and Tucker 1999; Szabo et al. 1996]. However, while the change from ‘occasionally’ to ‘seldom’ improved this apparent ambiguity of the PDQ-39 response options, a significant proportion of respondents still found them difficult to use [Kim et al. 2006]. Furthermore, the observed problems with the PDQ-39 response categories do not appear to be specific for the Swedish version of the scale, as similar observations have been reported with other language versions [Franchignoni et al. 2008].
Despite displaying misfit to a unidimensional measurement model, PDQ-39 items may still prove useful for measurement, provided that one or several subsets of items can be shown to represent a clearly defined variable. However, the aim of this study was to assess whether the present version of the full PDQ-39 represents a unidimensional construct. Additional studies are needed to explore if reduction and/or regrouping of its items can produce a more valid and interpretable outcome measure.
This is the first independent study to assess the dimensionality of the PDQ-39SI using contemporary methods. We found clear indications of multidimensionality that cannot be explained by technical aspects of the scale but probably relate to conceptual problems. This argues against its usefulness as a clinical trial endpoint [Food and Drug Administration 2006]. More independent studies regarding the dimensionality of the PDQ-39 are needed to confirm or falsify these observations. Meanwhile, clinicians and investigators should use and interpret the PDQ-39SI with caution.
The authors wish to thank all participating patients for their cooperation and Jan Reimer for assistance with data collection. The study was supported by the Swedish Research Council, the Swedish Parkinson Academy, the Swedish Parkinson Foundation, the Skane County Council Research and Development Foundation, and the Faculty of Medicine, Lund University.
The authors have no conflicts of interest.
Peter Hagell, Department of Health Sciences, Lund University and Department of Neurology, Lund University Hospital, Lund, Sweden; Email: Peter.Hagell@med.lu.se.
Maria H. Nilsson, Department of Health Sciences, Lund University and Department of Neurosurgery, University Hospital, Lund, Sweden.