This study clearly shows that the threshold value for a continuous variable, whether it defines a relevant improvement or an acceptable symptom state, depends strongly on the technique used to derive it. This first observation prompted us to perform a systematic validation of the proposed thresholds. We evaluated the validity of each proposed threshold by calculating, against external gold standards, the probability of being considered in good condition for the groups of patients who were below or above that threshold. We used two external gold standards, one reflecting the patient's perspective (the patient's global assessment) and one reflecting the physician's perspective (the DAS28-ESR), and calculated the positive LR: the best threshold was considered to be the one with the highest observed positive LR. Using this methodology, we were able to propose an absolute change of at least 3, a relative change of at least 50%, and a maximum score of 2 as optimal thresholds for the RAID score, defining an absolute MCII, a relative MCII, and an acceptable symptom state, respectively.
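As a minimal sketch of this validation step, the positive LR of a candidate threshold can be computed as sensitivity divided by (1 - specificity) against a binary external anchor, and the candidate with the highest value retained. The function names, the candidate cut-offs, and the eight-patient data set below are hypothetical illustrations, not the study data or the authors' code.

```python
from typing import Sequence


def positive_lr(meets_threshold: Sequence[bool], anchor_positive: Sequence[bool]) -> float:
    """Positive likelihood ratio: sensitivity / (1 - specificity)."""
    if len(meets_threshold) != len(anchor_positive):
        raise ValueError("both sequences must have the same length")
    pairs = list(zip(meets_threshold, anchor_positive))
    tp = sum(1 for m, a in pairs if m and a)          # meets threshold, anchor positive
    fp = sum(1 for m, a in pairs if m and not a)      # meets threshold, anchor negative
    fn = sum(1 for m, a in pairs if not m and a)
    tn = sum(1 for m, a in pairs if not m and not a)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("the anchor must contain both positive and negative patients")
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)              # equals 1 - specificity
    if false_positive_rate == 0:
        return float("inf")                           # no false positives at this cut-off
    return sensitivity / false_positive_rate


def best_threshold(changes, anchor_positive, candidates):
    """Return the candidate cut-off with the highest positive LR (first one on ties)."""
    return max(
        candidates,
        key=lambda c: positive_lr([x >= c for x in changes], anchor_positive),
    )


# Hypothetical data: improvement in score (higher = more improvement) and
# whether the external anchor classifies each patient as improved.
changes = [0.5, 1.0, 2.0, 3.5, 4.0, 1.5, 3.0, 5.0]
anchor_improved = [False, False, True, True, True, False, False, True]

for cutoff in (1, 2, 3):
    lr = positive_lr([x >= cutoff for x in changes], anchor_improved)
    print(f"cut-off {cutoff}: positive LR = {lr:.2f}")
```

In the study design described above, the same calculation would be repeated once per external anchor (patient and physician perspective), and the two sets of results compared for concordance.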
This study has some weaknesses, but also several strengths. The very wide range of threshold values obtained with different methodologies raises the question of the optimal way to address this issue. All the techniques used in this study have been previously adopted, though no consensus has been reached in the field of clinical epidemiology [10-21]. This lack of consensus can be explained by the different rationales of each technique. The empirical technique involves asking physicians to propose relevant thresholds based on the simplicity of their proposal or on their experience [20]. The aim of another technique is to avoid proposing a value below the measurement error of the outcome measure, as any interpretation of results using a threshold below the noise due to this measurement error is hazardous [22,23]. Finally, the techniques using an external anchor are also very relevant [14,15]. Although the validity of this external anchor may be questioned (here we used the previously reported gold standard MCII and PASS questions, which might raise the issue of circular reasoning), these techniques make it possible to select the optimal threshold by weighing the arguments for and against, favoring either sensitivity (for example, the 75th percentile technique [14,15]) or both sensitivity and specificity (for example, the ROC curve and correct probability techniques [23,24]).
In this study, therefore, we decided to use all the different techniques in a uniform group of patients (namely, patients with active definite RA requiring a TNF blocker) receiving the same TNF blocker (etanercept). Despite this uniformity, we observed very wide variability in the thresholds proposed by these analyses. From our point of view, such variability justifies a systematic evaluation of the validity of any proposed threshold, and the main question is how to define the optimal methodology for evaluating such validity. We approached this question by calculating the capacity of a proposed threshold to adequately classify a patient, using previously validated external anchors from both the patient's perspective and the physician's perspective. The MCII and PASS questions were considered to be a gold standard anchor for the patient's perspective [14,15]. Because we also used the MCII and PASS questions to derive the thresholds, one might be concerned by the potential circular reasoning of this approach. This is why we decided to use not only this external anchor but also another one (the DAS28), which is considered to reflect the physician's perspective [26] when evaluating a patient. We then calculated the positive LR. This approach, using two different external anchors, resulted in quite good concordance between the two analyses for each proposed threshold, strengthening our finding. This agrees with the results of a previous study suggesting that the PASS corresponds to moderate disease activity [29]. The data presented in the figures also suggest that the most stringent thresholds are the most valid, at least with regard to our definition of external validity.
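To illustrate how an anchor-based technique can balance sensitivity and specificity, the sketch below picks the cut-off that maximizes Youden's index (sensitivity + specificity - 1) over a set of candidates. The choice of Youden's index is an assumption made for illustration, since the exact ROC criterion is not stated here, and the function name and data are hypothetical.

```python
def youden_cutoff(changes, anchor_positive, candidates):
    """Return (cut-off, J) maximizing Youden's J = sensitivity + specificity - 1.

    Assumption: Youden's index stands in for the ROC criterion, which the
    text does not specify. A patient "meets" a cut-off when change >= cut-off.
    """
    n_pos = sum(1 for a in anchor_positive if a)
    n_neg = len(anchor_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("the anchor must contain both positive and negative patients")
    best, best_j = None, float("-inf")
    for c in candidates:
        tp = sum(1 for x, a in zip(changes, anchor_positive) if x >= c and a)
        tn = sum(1 for x, a in zip(changes, anchor_positive) if x < c and not a)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # strict comparison keeps the first cut-off on ties
            best, best_j = c, j
    return best, best_j


# Hypothetical data: improvement in score and the anchor's classification.
changes = [0.5, 1.0, 2.0, 3.5, 4.0, 1.5, 3.0, 5.0]
anchor_improved = [False, False, True, True, True, False, False, True]

cutoff, j = youden_cutoff(changes, anchor_improved, [1, 2, 3, 4])
print(f"selected cut-off: {cutoff} (J = {j:.2f})")
```

A criterion of this kind weighs false positives and false negatives equally, whereas ranking cut-offs by positive LR rewards a low false-positive rate, which is consistent with more stringent thresholds scoring better under the latter.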
One weakness of this study is that we were unable to evaluate the discriminant capacity of the proposed thresholds in order to validate them. Another potential weakness is that the proposed thresholds were defined in a single study with a relatively small sample size. On the other hand, a strength of this study is that all the different analyses used to define thresholds for a continuous variable were performed on a uniform group of patients. Despite these limitations, by calculating the positive LR against two external anchors, we found differences between the candidate thresholds, so that we were able to propose an absolute change of 3 points and a relative change of 50% for defining a clinically relevant improvement, and a maximum score of 2 for defining an acceptable symptom state. Further studies in different patient populations, evaluating different facets of validity (including, for example, discriminatory capacity), are necessary to confirm these proposals.