The new COSMIN guidelines regarding responsiveness
Lidwine B Mokkink, Caroline B Terwee, Dirk L Knol, Henrica CW de Vet
Address: Department of Epidemiology and Biostatistics and the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, the Netherlands
In this piece, Dr Angst argues that some of the claims of the COSMIN guidelines about responsiveness do not match the demands of clinical reality and conflict with the findings of numerous epidemiological studies.
We thank Dr Angst for his interest in the COSMIN checklist and we think he raises some relevant issues concerning responsiveness. Before we give our reaction to these issues, we would like to emphasize that the COSMIN checklist was developed to evaluate the methodological quality of studies on measurement properties. It is not a checklist to assess the quality of a measurement instrument. Furthermore, we think it is important to mention that the COSMIN checklist was developed in a Delphi study in which over 40 international experts were involved. The members of the Steering Committee did not have a vote in these Delphi rounds [9].
We do not agree with all issues raised by Dr Angst. However, we might not have been clear enough in our manual. Therefore, we would like to take the opportunity to explain the viewpoints of COSMIN regarding responsiveness in more detail. Based on the remarks of Dr Angst we further clarified some issues in the COSMIN manual [1].
In our response, we focus on four points which will elucidate COSMIN's ideas around responsiveness and deal with the concerns of Dr Angst:
Responsiveness is longitudinal validity and therefore the assessment of responsiveness closely follows the way in which validity of measurement instruments is assessed.
A distinction should be made between the interpretation of changes in health status and responsiveness as a measurement property of a measurement instrument.
The literature on responsiveness using effect sizes and other "inappropriate measures" should not be thrown away, but provides less evidence than previously thought.
The COSMIN guidelines do not reject the "transition" question, but recommend testing hypotheses about expected relations with the "transition" question.
Responsiveness is longitudinal validity
Dr Angst argues that "For most clinicians, responsiveness is not (only) a question of longitudinal validity - they simply wish to find that instrument that more accurately detects changes over time than the other by a quantitative measure". According to COSMIN, these amount to the same thing, because accurately detecting change means measuring the true amount of change, which is a matter of longitudinal validity.
The COSMIN panel was very clear that responsiveness should be considered as longitudinal validity. If you want to measure change, a valid instrument should truly measure changes in the construct(s) it purports to measure. The only distinction between (construct and criterion) validity and responsiveness is that validity concerns the validity of single scores while responsiveness concerns the validity of change scores. Consequently, the COSMIN panel concluded that responsiveness should be evaluated in the same way as validity, i.e. by comparing changes on the instrument with changes on the gold standard, or - since often there is no gold standard - by testing hypotheses, e.g. about expected correlations with changes in other measures, or expected differences in changes between groups.
One of the most difficult tasks when testing hypotheses is formulating challenging hypotheses. By testing hypotheses we aim to show that the instrument truly measures changes in the construct(s) it purports to measure. In practice, this means that the instrument should measure (changes in) the right construct(s) and not (changes in) something else, but also that it should measure the right amount of change, i.e. it should not under- or overestimate the real change in the construct that has occurred. This latter aspect is often overlooked in assessing responsiveness. In the COSMIN manual we explain that specific hypotheses should therefore include an expectation about the direction and magnitude of the correlation between changes in the instrument under study and changes in a comparator instrument, or an expectation about differences in change scores on the instrument between groups.
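As an illustration only, a hypothesis of this kind could be checked along the following lines. This is a minimal sketch, not a procedure prescribed by COSMIN: the scores are invented, the threshold of 0.5 is a hypothetical expectation that would have to be justified for the instruments at hand, and sampling uncertainty (e.g. a confidence interval around the correlation) is ignored.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def change_scores(baseline, follow_up):
    """Per-patient change: follow-up score minus baseline score."""
    return [after - before for before, after in zip(baseline, follow_up)]


# Hypothetical scores for six patients (invented for illustration).
instrument_t0 = [40, 55, 38, 62, 47, 50]
instrument_t1 = [52, 58, 49, 63, 60, 51]
comparator_t0 = [30, 44, 29, 51, 35, 41]
comparator_t1 = [41, 46, 42, 50, 49, 43]

# Hypothesis formulated BEFORE looking at the data (assumed value):
# changes on the two instruments correlate positively, with r >= 0.5.
HYPOTHESISED_MIN_R = 0.5

r = pearson(
    change_scores(instrument_t0, instrument_t1),
    change_scores(comparator_t0, comparator_t1),
)
hypothesis_confirmed = r >= HYPOTHESISED_MIN_R
print(f"r = {r:.2f}, hypothesis confirmed: {hypothesis_confirmed}")
```

The point of the sketch is that both the direction and the minimum magnitude of the correlation are fixed in advance, so the result can either confirm or refute the expectation.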
In one of the COSMIN articles [2], we provided some examples of hypotheses based on one of our previous studies [5]. We would like to emphasize that these hypotheses were only used as examples. The COSMIN panel considered it not possible to formulate standards for the number of hypotheses that need to be tested in a construct validity study. This depends on the construct to be measured and the content and measurement properties of the comparator instruments [1]. The definition of criteria for good measurement properties was beyond the scope of the COSMIN study.
Note that for assessing validity, too, we have no quantitative measures and we test an arbitrarily chosen number of hypotheses. There is no criterion to decide whether an instrument is valid or responsive. Assessing validity or responsiveness is a continuous process of accumulating evidence.
Distinction between the interpretation of changes in health status and responsiveness as a measurement property of a measurement instrument
Effect sizes and related parameters have been introduced by Cohen [10] to provide a standardized measure of the magnitude of an effect. These measures are used to interpret changes in health status, or magnitudes of treatment effects.
It is impossible to assess in one study both the treatment effect and the responsiveness of a measurement instrument based on the same effect size. If the effect size is zero, either the intervention has no effect or the outcome measure is not responsive. If the effect size is moderate, more conclusions are possible: either the effect is moderate and the outcome measure is responsive, or the effect is large or small and the outcome measure has poor responsiveness because the true effect is over- or underestimated by the instrument. So the argument of the COSMIN panel is that the effect size only has meaning as a measure of responsiveness if we know (or assume) beforehand what the magnitude of the effect of the intervention is. If, for example, we expect a large effect of the intervention, we can test the hypothesis that the measurement instrument shows an effect size of 0.8 or higher. But if we expect a small effect of the intervention, we would not expect such a high effect size. This example shows that a high effect size does not necessarily indicate good responsiveness.
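The argument can be made concrete with a small sketch. It assumes one common variant of the effect size (mean change divided by the standard deviation of the baseline scores) and that higher scores mean improvement; the data and the range used for a "small" effect are invented for illustration and are not taken from the COSMIN manual.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the SD of baseline scores (assumed variant)."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def hypothesis_confirmed(es, expected_range):
    """True if the observed effect size lies in the range expected a priori."""
    low, high = expected_range
    return low <= es < high


# Hypothetical scores for five patients before and after treatment.
baseline = [50, 60, 40, 55, 45]
follow_up = [58, 68, 48, 63, 53]

es = effect_size(baseline, follow_up)

# Two alternative a priori expectations about the treatment effect.
LARGE_EFFECT = (0.8, float("inf"))  # "0.8 or higher", as in the text
SMALL_EFFECT = (0.2, 0.5)           # hypothetical range for a small effect

confirmed_if_large = hypothesis_confirmed(es, LARGE_EFFECT)
confirmed_if_small = hypothesis_confirmed(es, SMALL_EFFECT)
print(f"ES = {es:.2f}; large expected: {confirmed_if_large}; "
      f"small expected: {confirmed_if_small}")
```

The same observed effect size supports responsiveness when a large treatment effect was expected, and counts against it when only a small effect was expected.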
When several instruments are compared in the same study, this could give evidence for the relative responsiveness of the instruments. But again, only if a hypothesis is being tested that includes the expected magnitude of the treatment effect. Let us suppose that we have three measurement instruments (A, B, and C), all measuring the same construct. The intervention given is expected to moderately affect the construct measured by the three instruments. Results show that instrument A has an effect size of 0.80, instrument B of 0.40 and instrument C of 0.15. Based on our hypothesis of a moderate effect we should conclude that instrument B appears to best measure the construct of interest. Instrument A seems to overestimate the treatment effect (e.g. because it shows change in persons who do not really change), and instrument C seems to underestimate it. This example shows that it may not always be appropriate to conclude that the instrument with the highest effect size is the most responsive.
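The three-instrument example can be written out as a short sketch. The effect sizes are those given above; the numerical value attached to a "moderate" effect (0.5, following Cohen's conventional benchmark) and the tolerance around it are assumptions made purely for illustration.

```python
# Effect sizes from the example in the text.
observed = {"A": 0.80, "B": 0.40, "C": 0.15}

# Assumptions for illustration: a "moderate" expected effect is taken
# as an effect size of about 0.5, with a hypothetical tolerance of 0.15.
EXPECTED_ES = 0.5
TOLERANCE = 0.15


def judge(es, expected=EXPECTED_ES, tolerance=TOLERANCE):
    """Compare an observed effect size with the a priori expectation."""
    if es > expected + tolerance:
        return "over-estimates"
    if es < expected - tolerance:
        return "under-estimates"
    return "in line with expectation"


verdicts = {name: judge(es) for name, es in observed.items()}

# The instrument closest to the expected effect, versus the one with
# the highest effect size: these need not coincide.
closest = min(observed, key=lambda name: abs(observed[name] - EXPECTED_ES))
highest = max(observed, key=observed.get)

for name in observed:
    print(f"Instrument {name}: ES = {observed[name]:.2f}, {verdicts[name]}")
print(f"Closest to expectation: {closest}; highest effect size: {highest}")
```

Under these assumptions the instrument that best matches the hypothesis (B) differs from the one with the highest effect size (A).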
The literature on responsiveness using effect sizes and other "inappropriate measures" should not be thrown away, but provides less evidence than previously thought
In the previous paragraphs, we have tried to explain that the COSMIN panel does not totally discard effect sizes as parameters of responsiveness, but argues that it is necessary to formulate and test hypotheses about the magnitude of change that is to be expected from the treatment. Many responsiveness studies, however, have been published (and still are) in which an instrument was considered responsive just because the effect size was larger than 0.8. This is what COSMIN considers inappropriate.
Note that COSMIN does not intend to set aside decades of research. Published studies do provide evidence for responsiveness, but less than previously thought. These studies can be included in systematic reviews of measurement properties. However, it is then up to the authors of the review to decide (in retrospect) whether the results found are as could have been expected, taking the treatment, the construct to be measured, the population etc. into account. This may, however, be more difficult and more prone to bias than formulating hypotheses before the data collection.
The COSMIN guidelines do not reject the "transition" question, but recommend testing hypotheses about expected relations with the "transition" question
We agree with Dr Angst that there is no gold standard for pain and other PROs measuring symptoms and perceptions and that the "transition" question (global rating of change) is often the only criterion available for measuring change in PROs. However, there is an ongoing debate on its validity and reliability [11]. It is therefore still unclear whether the "transition" question should be considered a gold, silver, or no standard at all. Therefore the COSMIN panel proposed to consider it a construct approach to responsiveness (comparable to construct validity) if the "transition" question is used to assess responsiveness. In that case it is recommended to define and test hypotheses, e.g. about the expected correlation between changes on the instrument under study and the "transition" question. Moreover, it should not be the one and only hypothesis to be tested. This is what we presented in our example.
As with effect sizes, COSMIN does not discard the use of a "transition" question in the assessment of responsiveness, but recommends formulating and testing hypotheses about what correlations are to be expected.