The study was conducted at a school of nursing in Berne, Switzerland, with nursing students in their second of three curricular years. The two-step approach of the study consisted, first, of an evaluation of the evidence for content validity, and second, of a generalizability analysis to estimate the reliability of the instrument.
Forward-backward translation of the questionnaire
Since the study was conducted in a German-speaking country, the English QSF had to be translated into German. We used a forward-backward translation approach, which is recommended for translating test instruments [12]. Using this approach, a native speaker of the target language (in our case German) translated the instrument from the source language (English), and another person fluent in English then translated the text back from German into English. The original and the back-translated versions were then compared to ensure that the meaning and the nuances of the text were conserved.
Evidence for the content validity of the mQSF items
The content validity of the 18 mQSF items was ascertained by asking 25 medical and nursing education experts from Switzerland, Germany and Austria to rate the importance of each item on a four-point rating scale (1 = not at all important; 4 = very important), using an online survey tool. An even number of scale points (no "neutral" middle position) was used to force clear ratings. The experts were alumni of the Master of Medical Education Programme at the University of Berne, Switzerland, who were actively involved in SP programmes at their own institutions. They were also invited to comment on the mQSF, e.g. on whether they thought additional items should be added.
Because the items were rated on an ordinal rating scale, both mean and median ratings were calculated. Cronbach's α was calculated to ascertain homogeneity among raters, and item-total correlations were computed to check whether any item was inconsistent with the rest of the scale and would thus have to be discarded.
We regarded the rated relevance of an mQSF item as the most important criterion. If the mean rating of an item was below 2.5, we examined the item-total correlation of that item in more detail and decided to withdraw the item if its item-total correlation was negative.
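The item statistics and the withdrawal rule described above can be sketched in a few lines of Python. This is only an illustration, not the software used in the study. It assumes the ratings are stored as an experts × items matrix and that the item-total correlation is the "corrected" variant (each item correlated with the sum of the remaining items), which the text does not specify. The function names and the toy data are hypothetical.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a 2-D array (rows = cases, columns = components)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    sum_component_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - sum_component_var / total_var)


def corrected_item_total(scores):
    """Correlate each column with the sum of all *other* columns (assumed variant)."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
        for i in range(scores.shape[1])
    ])


def items_to_withdraw(ratings, mean_cutoff=2.5):
    """Indices of items whose mean is below the cutoff AND whose item-total r is negative."""
    ratings = np.asarray(ratings, dtype=float)
    low_mean = ratings.mean(axis=0) < mean_cutoff
    negative_r = corrected_item_total(ratings) < 0
    return np.flatnonzero(low_mean & negative_r)


# Made-up toy data (5 experts x 4 items), NOT the ratings collected in the study.
toy = np.array([
    [4, 4, 3, 1],
    [4, 3, 3, 2],
    [3, 3, 2, 2],
    [3, 2, 2, 3],
    [2, 2, 1, 4],
])
print("mean per item:  ", toy.mean(axis=0))
print("median per item:", np.median(toy, axis=0))
print("alpha:          ", cronbach_alpha(toy))
print("withdraw items: ", items_to_withdraw(toy))
```

To examine homogeneity among raters rather than among items, the same `cronbach_alpha` function could be applied to the transposed matrix, so that raters become the columns.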
Reliability of the mQSF
We were interested in the reliability of the ratings of SP feedback quality and in how this reliability might be increased, e.g. by having more than one judge rate the quality. For this purpose, a generalizability analysis (using Genova [13]) was performed; reliability estimates were based on a partitioning of the variance into true variance and multiple sources of error variance.
Six SPs were videotaped during eight clinical encounters with different students; at the end of each encounter, feedback was given by the SPs. One videotaped encounter per SP was randomly selected for assessment by ten faculty members, who judged the feedback according to the mQSF items. The six SPs, four females and two males, had at least one year of experience in role-playing and giving feedback. Three SPs portrayed a case of acute postoperative pain after an open appendectomy and were instructed to act as if they were afraid that something had gone wrong during the operation. The other three SPs enacted the role of a patient in a consultation on oral anticoagulation therapy after aortic valve replacement; they were instructed to act as if they were indifferent toward the information they received. All SP clinical encounters used and recorded in this investigation were specifically designed for this purpose and were in line with the curricular competences the students had acquired up to that point.
In the G-study, the quality of feedback given in these six encounters was rated by ten judges (teachers from our institution who were trained in the use of the mQSF) using a rating scale for the mQSF that ranged from 1 (= strongly agree) to 4 (= strongly disagree). We expanded the originally dichotomous rating options to a four-point rating scale because we wanted to provide more subtle parameters for the assessment of SP performance in terms of qualitative holistic judgments [14]. Three weeks later, the procedure was repeated with the same ten teachers and the same six recorded SCEs. We thus had a fully crossed Video (encounter) by Rater by Occasion (6×10×2) design in which we treated all facets as random.
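For readers unfamiliar with G-studies, the following sketch shows how variance components for a fully crossed, all-random three-facet design with one score per cell can be estimated from ANOVA mean squares. It is not the Genova program used in the study. It assumes the scores are arranged as an array of shape (videos, judges, occasions), for example one mQSF total score per cell, and the function name and the simulated input are hypothetical.

```python
import numpy as np


def variance_components(x):
    """ANOVA estimates for a fully crossed v x j x o random-effects design.

    x has shape (n_v, n_j, n_o) with one observation per cell.
    Negative estimates are returned as-is; in practice they are often set to zero.
    """
    x = np.asarray(x, dtype=float)
    n_v, n_j, n_o = x.shape
    grand = x.mean()

    # Marginal means, kept broadcastable against x.
    m_v = x.mean(axis=(1, 2), keepdims=True)
    m_j = x.mean(axis=(0, 2), keepdims=True)
    m_o = x.mean(axis=(0, 1), keepdims=True)
    m_vj = x.mean(axis=2, keepdims=True)
    m_vo = x.mean(axis=1, keepdims=True)
    m_jo = x.mean(axis=0, keepdims=True)

    # Mean squares = sum of squares / degrees of freedom.
    ms_v = n_j * n_o * ((m_v - grand) ** 2).sum() / (n_v - 1)
    ms_j = n_v * n_o * ((m_j - grand) ** 2).sum() / (n_j - 1)
    ms_o = n_v * n_j * ((m_o - grand) ** 2).sum() / (n_o - 1)
    ms_vj = n_o * ((m_vj - m_v - m_j + grand) ** 2).sum() / ((n_v - 1) * (n_j - 1))
    ms_vo = n_j * ((m_vo - m_v - m_o + grand) ** 2).sum() / ((n_v - 1) * (n_o - 1))
    ms_jo = n_v * ((m_jo - m_j - m_o + grand) ** 2).sum() / ((n_j - 1) * (n_o - 1))
    resid = x - m_vj - m_vo - m_jo + m_v + m_j + m_o - grand
    ms_vjo = (resid ** 2).sum() / ((n_v - 1) * (n_j - 1) * (n_o - 1))

    # Solve the expected-mean-square equations of the random model.
    return {
        "v": (ms_v - ms_vj - ms_vo + ms_vjo) / (n_j * n_o),
        "j": (ms_j - ms_vj - ms_jo + ms_vjo) / (n_v * n_o),
        "o": (ms_o - ms_vo - ms_jo + ms_vjo) / (n_v * n_j),
        "vj": (ms_vj - ms_vjo) / n_o,
        "vo": (ms_vo - ms_vjo) / n_j,
        "jo": (ms_jo - ms_vjo) / n_v,
        "vjo": ms_vjo,  # three-way interaction confounded with residual error
    }


# Simulated scores on a 1-4 scale in the 6 x 10 x 2 layout; NOT the study data.
rng = np.random.default_rng(0)
simulated = rng.integers(1, 5, size=(6, 10, 2))
for name, value in variance_components(simulated).items():
    print(f"sigma^2({name}) = {value:.3f}")
```

The component for "v" corresponds to the true (universe-score) variance of the videos, while the remaining components are the sources of error that a D-study can reduce by averaging over more judges or occasions.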
In the subsequent decision study (D-study), facet V (the video of a CD-recorded clinical encounter) was the object of measurement, whereas the numbers (n) of judges (facet J) and occasions (facet O) were varied (Figure ).
The object of measurement for the D-study. Facet "V" (video), of a CD-recorded clinical encounter, number (n) of judges (facet J), occasions (facet O)
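The logic of such a D-study can be illustrated with the standard formulas for a design in which videos (V) are crossed with random judges (J) and occasions (O). The sketch below computes both the generalizability coefficient (relative decisions) and the dependability coefficient Φ (absolute decisions); which of the two was reported is not stated in this section, so showing both is an assumption. The function name is hypothetical, and the variance components in the usage example are arbitrary placeholders, not results of the study.

```python
def d_study(vc, n_j, n_o):
    """Return (G coefficient, Phi) for videos as the object of measurement.

    vc  : dict of variance components with keys v, j, o, vj, vo, jo, vjo
    n_j : number of judges averaged over in the D-study
    n_o : number of occasions averaged over in the D-study
    """
    # Relative error: only interactions that involve the object of measurement.
    rel_err = vc["vj"] / n_j + vc["vo"] / n_o + vc["vjo"] / (n_j * n_o)
    # Absolute error: additionally the main effects of judges and occasions
    # and their interaction.
    abs_err = rel_err + vc["j"] / n_j + vc["o"] / n_o + vc["jo"] / (n_j * n_o)
    g_coef = vc["v"] / (vc["v"] + rel_err)
    phi = vc["v"] / (vc["v"] + abs_err)
    return g_coef, phi


# Arbitrary placeholder components for illustration only (NOT study results).
placeholder = {"v": 0.30, "j": 0.05, "o": 0.00,
               "vj": 0.10, "vo": 0.02, "jo": 0.00, "vjo": 0.20}

for n_j in (1, 2, 3, 5, 10):
    for n_o in (1, 2):
        g_coef, phi = d_study(placeholder, n_j, n_o)
        print(f"judges={n_j:2d} occasions={n_o}  G={g_coef:.2f}  Phi={phi:.2f}")
```

Varying `n_j` and `n_o` in this way shows how many judges or rating occasions would be needed to reach a target reliability.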
Ethical approval was sought from the ethics committee of the State of Bern, Switzerland. Informed consent was obtained from all participating students and SPs. Participation in the study was completely voluntary. All participants were free to leave the study at any time without any repercussions. There was no financial compensation.