Five PSI were considered for empirical validation in public acute-care hospitals across Spain. All of them showed systematic variability (variation beyond chance), were shown to have a cluster effect, and were able to detect hospitals performing above the expected rate. Nevertheless, several questions should be addressed to provide a nuanced statement on their usefulness.

Is the estimated variation systematic or due to chance?

Except in the case of MLM, which is considered a quasi-sentinel event, we would need to know more about the baseline distribution of adverse events to properly answer this question; however, given the nature and rationale behind the safety indicators, we might assume that this distribution is expected to be close to zero.

Our approach was precisely based on testing the alternative hypothesis through the estimation of robust Empirical Bayes confidence intervals against zero as the null value. The precision of the estimated intervals, together with the distance between the lower limit and the zero value (the closest figure was 0.12, in PE-DVT), supports the hypothesis that the observed variation is systematic rather than random.
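The test above can be sketched as follows. This is a minimal illustration, not the study's actual estimation: the estimate and robust standard error are hypothetical values, chosen so that the lower limit lands near the 0.12 reported for PE-DVT.

```python
def eb_interval(estimate, robust_se, z=1.96):
    """Robust 95% confidence interval for an Empirical Bayes estimate."""
    return (estimate - z * robust_se, estimate + z * robust_se)

# Hypothetical PSI estimates and robust standard errors, for illustration only
psi_estimates = {
    "PE-DVT": (0.45, 0.17),
    "DU": (2.10, 0.40),
}

for psi, (est, se) in psi_estimates.items():
    lower, upper = eb_interval(est, se)
    # If the interval excludes the null value of zero, the variation
    # observed for this PSI is taken as systematic rather than random.
    systematic = lower > 0
    print(f"{psi}: CI95% ({lower:.2f}, {upper:.2f}) -> systematic: {systematic}")
```

With these hypothetical inputs, the PE-DVT lower limit works out to roughly 0.12, which is how an interval "against zero as the null value" rules out chance as the explanation.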

Is the observed variation due to hospital-providers, rather than to patients?

If this were not the case, PSI would not be useful for their intended purpose, which is to elicit differences attributable to health care.

Our approach sought to elicit the hospital effect by estimating the existence of variation beyond the case-mix of patients treated, through the so-called cluster effect. As mentioned in the results, in the studied PSI a noticeable part of the variation was attributed to the hospital where the patients were treated. However, it might be argued that in a multilevel approach this finding is quite dependent on the goodness of the risk adjustment: the worse the adjustment at the patient level, the higher the proportion of variance that could eventually be explained at the hospital level. This is particularly true in studies using administrative data, where the limited information available on specific patient characteristics might reduce the goodness of risk-adjustment methods.

A way to mitigate this limitation is to reduce the extra-variance due to differences in case-mix that the model is unable to capture, by modelling only the largest hospitals. These are teaching hospitals with more than 450 beds, able to provide high-tech services and, ultimately, homogeneous with regard to patient case-mix, particularly in studies with a sample size as large as ours.

The results of this exercise showed a significant reduction in rho-statistic values, backing the hypothesis that the risk-adjustment strategy was missing some relevant patient characteristics. Despite this finding, the cluster effect remained: the rho-statistic equalled 0.06 (CI95%: 0.03 to 0.11) in MLM; 0.05 (CI95%: 0.03 to 0.07) in DU; 0.10 (CI95%: 0.07 to 0.14) in CRI; 0.02 (CI95%: 0.01 to 0.03) in PE-DVT; and 0.03 (CI95%: 0.03 to 0.05) in PS.
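For readers unfamiliar with the rho-statistic, it is the intraclass correlation: the share of total variance attributable to the hospital level. A common way to compute it for a two-level logistic model is the latent-variable approach, where the patient-level residual variance is fixed at pi-squared over three. The between-hospital variance below is a hypothetical figure chosen for illustration, not an estimate from the study.

```python
import math

def rho_latent(sigma_u2):
    """Intraclass correlation (rho) for a two-level logistic model,
    using the latent-variable approach: level-1 residual variance
    is fixed at pi^2 / 3 on the logit scale."""
    return sigma_u2 / (sigma_u2 + math.pi**2 / 3)

# A hypothetical between-hospital variance of 0.21 on the logit scale
# yields a rho of about 0.06, the order of magnitude reported above for MLM.
print(f"rho = {rho_latent(0.21):.2f}")
```

Under this convention, a rho of 0.06 means roughly 6% of the latent-scale variance in the outcome sits between hospitals rather than between patients.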

Are results dependent on the coding practices affecting Elixhauser comorbidities?

A particular phenomenon that could also affect the cluster estimates, and ultimately the reliance on PSI, is the differential coding intensity across hospitals. In fact, the number of secondary diagnoses has already been proven to influence international comparisons [21]. In theory, if this variation were closely related to coding intensity in hospitals, the cluster effect would suffer an important reduction when the number of secondary diagnoses was considered as a factor in the multilevel models; otherwise, the variation would be very much related to the patients, thus affecting the risk-adjustment estimates.

For the purpose of this exploration, the number of secondary diagnoses was categorized using the median value (4 secondary diagnoses) as a threshold. In general terms, when both models were compared, a clear reduction in the Elixhauser comorbidity β coefficients, together with stable rho-value estimates, was observed (Additional file 1). Given that the number of secondary diagnoses absorbed part of the variance in the new model and the β coefficients changed, variation is also expected in the random-effects estimation for each hospital. However, an excellent correlation (Pearson coefficient values) between the original random effects and the new ones was found: 0.83 in postoperative sepsis, 0.86 in postoperative PE-DVT, 0.94 in decubitus ulcer and 0.96 in catheter-related infection. On the other hand, except in the case of decubitus ulcer, the changes in the statistical nature of the random effect (i.e. hospitals found statistically different than average turning into statistically similar, and the other way round) were null or negligible.

Are PSI precise enough to detect hospitals with rates above the expected?

Although PSI are quite infrequent events, shrunken residuals from the multilevel analysis have proven precise enough to detect hospitals above the expected rate. The Figure shows some quite straightforward illustrations of this capacity. Nevertheless, determining in what manner the cluster effect might be influenced by either outlier hospitals or the extra-variance attributable to the mix of hospitals within the sample is also needed.

With regard to the former, the estimation barely changed once those outlier values, easily identifiable at the two ends of the distribution in the Figure, were excluded (data not shown). Most important is the latter. To understand this effect, new residuals were estimated and plotted in the most *a priori* homogeneous centres, the largest ones as described in the previous paragraphs. As observed, except in the case of MLM, where heterogeneity across hospitals was the underlying reason for the results (just 4 out of 47 hospitals were statistically above the expected rate in this second analysis), in the remaining PSI this capacity remained noticeably high: 23% of the hospitals were flagged above the expected rate in decubitus ulcer, up to 36% in catheter-related infection, 25% in postoperative pulmonary embolism or deep vein thrombosis, and up to 28% in postoperative sepsis (Additional file 2).

Should policy-makers and managers trust PSI?

Our work aimed at shedding light on some empirical properties that PSI are expected to fulfil in order to be useful for safety measurement and, ultimately, to allow concerned users informed quality management: representing systematic variation across providers (ruling out randomness as an alternative explanation of the differences) and flagging hospitals as potential underperformers regardless of the mix of patients they treat. However, proper use requires debating two lessons learnt in this study, and reflecting upon other aspects that were not part of our work.

As for the lessons learnt with the studied PSI, due to the aforementioned flaws in adjusting for patient risk, we need to be aware that hospitals with greater complexity might be signalled as falsely poor performers, particularly if they do not properly report secondary diagnoses. Secondly, the hospital effect (cluster effect) does exist, quite consistently across different statistical models; however, its magnitude clearly decreases when studying homogeneous hospital-providers. Although obvious, this message directly points towards comparing comparables, particularly when risk adjustment is expected to be sub-optimal.

As for the reflection on other issues not addressed in this exercise, it is worth pointing out that the study of the empirical properties is just a partial view of PSI validity. Further debate upon other validity issues ought to be pursued in order to fully trust PSI usage. For this purpose, we have to be able to answer whether PSI measure what they are supposed to measure. In this work, we have assumed construct validity, since PSI were carefully developed for safety measurement purposes [10, 11], and face validity has been granted in advance for the Spanish case by carrying out an *ad hoc* face-validity project [23]. However, criterion validity, the ability of an indicator to flag true positive cases and true negative cases by comparison with a gold standard, has to be specifically addressed, in context. Fortunately, for the Spanish NHS, a recent piece of research on surgical discharges shed some light on criterion validity [33]. In general terms, the five PSI were proven to perform quite well in terms of positive likelihood ratio (+LR). The most conservative estimation yielded a +LR of 26.8 in decubitus ulcer, a +LR of 406.3 in catheter-related infection, a +LR of 149.3 in PE-DVT and a +LR of 25.32 in postoperative sepsis. These figures seemed high enough to adopt these PSI as a screening tool, except in the case of decubitus ulcer, clearly affected by underreporting (false negative cases) and the existence of present-on-admission ulcers (false positive cases).
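The positive likelihood ratio used above is a standard screening-test statistic: sensitivity divided by one minus specificity. A short sketch, with a hypothetical sensitivity/specificity pair for illustration only (the actual figures come from the cited study):

```python
def positive_lr(sensitivity, specificity):
    """Positive likelihood ratio: how much a positive flag raises the
    odds that the event truly occurred, against a gold standard."""
    return sensitivity / (1 - specificity)

# Hypothetical values: a sensitivity of 0.80 with a specificity of 0.97
# gives a +LR of about 26.7, the order of magnitude reported for
# decubitus ulcer and postoperative sepsis.
print(f"+LR = {positive_lr(0.80, 0.97):.1f}")
```

The contrast is instructive: a +LR above roughly 10 is conventionally taken as strong evidence, which is why even the most modest figures reported (around 25) support using these PSI for screening, while the decubitus ulcer caveat concerns the quality of the underlying coding rather than the statistic itself.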

Some additional effort should be made to evaluate the stability of PSI over time (out of the scope of this work); in the meantime, taking the studied PSI as screening tools, while wisely assessing the limits pointed out throughout this work in specific contexts, might help to identify those centres from which best-practice lessons can be drawn and those where intervention is clearly needed.