This paper reports the implementation of the SCT format in our school as a mandatory test of clinical reasoning for medical students. The test format was well accepted, although student opinion of the enjoyment and educational value of the test was mixed. Students’ scores were widely spread, and the distribution of scores approximated what we expected of a 5th year cohort. It is reasonable to expect students in the 5th year of a program to be performing well, and high failure rates are hard to justify.
The failure rate of 3% was lower than the expected rate of 5% for the MEQ assessment, which was replaced by the SCT. Setting a passing score must result from a transparent, reproducible, objective, and defensible process [15], but ultimately standard setting is arbitrary. Some methods of standard setting have delivered pass marks that vary substantially between different universities for the same questions [16]. Also, there may be greatly different results within the same group of examiners depending on the method used [17]. Hypothetical prediction by experts of borderline performance, as is commonly employed in standard setting, seems intrinsically less fair or robust than correlating student performance with the real performance of a reference group in the same test, as is done with the SCT. Performance of candidates can be related to the performance of a reference panel of experts, using the panel standard deviation as a yardstick [18]. However, it is also possible that results would have been different with different members in the reference panels. The appropriateness of this method of standard setting needs to be explored further, and further research into alternative methods of standard setting, for example as described by Collard et al. [19], is required.
Our group of senior examiners determined that 5th year medical students who scored within 2 SD of experts in the same test were performing above expectations. The crucial pass/fail cut point was set at 4 SD below the expert mean score after detailed consideration of the performance of a volunteer 6th year cohort in the same assessment. We had expected that the 6th Year and 5th Year groups would perform to a similar level, whereas the 5th Year group performed significantly better, by 5%. Based on academic rank in their 5th year assessments, the 6th Year sample appeared, purely by chance, to be representative of their cohort (data not shown). The difference in SCT results is probably best explained by differences in preparation: there were no stakes for the 6th Year volunteers, in contrast with high stakes for the 5th Years.
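The relationship between the panel yardstick and the two thresholds described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the panel scores are hypothetical, the use of the sample standard deviation is an assumption, and the function names and the treatment of a score falling exactly on a threshold are our own choices.

```python
from statistics import mean, stdev


def panel_z_score(candidate_score, panel_scores):
    """Distance of a candidate from the panel mean, in panel SD units.

    Assumption: the sample SD (n - 1 denominator) is used; the text does
    not say which SD estimator was applied.
    """
    return (candidate_score - mean(panel_scores)) / stdev(panel_scores)


def classify(candidate_score, panel_scores, fail_below=-4.0, above_from=-2.0):
    """Classify a candidate against the thresholds described in the text.

    fail_below=-4.0 mirrors the cut point 4 SD below the expert mean;
    above_from=-2.0 mirrors "within 2 SD of experts".
    Handling of scores exactly on a threshold is an assumption.
    """
    z = panel_z_score(candidate_score, panel_scores)
    if z < fail_below:
        return "fail"
    if z >= above_from:
        return "above expectations"
    return "pass"


# Hypothetical expert panel totals (percent), for illustration only.
panel = [80, 82, 78, 84, 76]

# Cut score implied by the panel: mean minus 4 SD.
cut_score = mean(panel) - 4 * stdev(panel)

for score in (79, 70, 65):
    print(score, round(panel_z_score(score, panel), 2), classify(score, panel))
```

Because the cut score depends on both the panel mean and the panel SD, a panel with a wider spread of scores yields a more lenient cut point, which is one reason the composition of the reference panel matters.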
The use of a new test format meant developing a bank of new questions, which required the input of more members of our faculty than had previously been required for our MEQ. In addition, faculty members and students had to be trained in this new method of assessment. The partnership we established with the University of Montreal to share web facilities has considerably eased this workload, especially for undertaking ERP work and for online training. These web facilities also allow collaboration with other Australian universities in item banking, test administration and research.
Writing good SCT questions has not proved easy. The same is true of any type of question, but the reasons for SCT questions being unsuitable are often not obvious. Firstly, the questions should not only be testing factual recall. Secondly, the selection of the terms in the Likert responses needs careful consideration in order to achieve a spread of modal responses across the 5-point range for the assessment as a whole. For example, many colleagues do not feel comfortable selecting extreme descriptors such as “essential” or “absolutely contraindicated”. We recommend avoiding such extreme descriptors and providing a scale on only one dimension, such as “more probable, much more probable”. Thirdly, a surprising number of questions that appeared to be good were rejected due to discordance in the expert reference panel. We are still evaluating this and are unsure whether those questions are “good” or “bad” and whether they can be used in assessment. In subsequent multidisciplinary reviews we established that, whilst in some cases experts have simply made a mistake in selecting a Likert response (a data entry error), this is not the commonest reason for expert discordance. We have found that in some cases the question is ambiguous or has some other fault that was not detected by its author or by our original review panel. However, it is also apparent that for some questions some “experts” are simply wrong. For example, question 1 of Table was excluded from consideration because of a scoring distribution that equally rewarded responses indicating that a ventilation-perfusion scan was more useful and responses indicating that it was less useful in the investigation of suspected pulmonary embolism in the presence of an abnormal chest Xray.
In a post hoc review of this question it became apparent that some experts were simply not aware that the diagnostic accuracy of a ventilation-perfusion scan is lower in the presence of the specified abnormality on chest Xray, and that the alternative investigation of a spiral contrast CT scan was then the local gold standard in that situation. Thus, we have what appears to be a usable question on investigation of suspected pulmonary embolism that has a bimodal distribution that, if used, would award significant partial credit (up to 0.87 of a full mark) to unacceptable responses. We think this apparent lack of expertise in “experts” is partly a reflection of subspecialisation within disciplines and partly a reflection of some experts being out of touch in areas peripheral to their specific interests. This is an important observation and not only for undergraduate assessment. For now we have dealt with this problem by not using those questions. More research is needed on questions in which there is expert discordance.
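How a split panel produces substantial partial credit for an unacceptable response can be sketched as follows. The sketch assumes the aggregate scoring method commonly used for SCTs, in which the credit for a response is the number of panel members who chose it divided by the number who chose the modal response. The panel counts below are hypothetical, chosen only to reproduce a credit of about 0.87, and the 0.8 threshold for flagging competing responses is an illustrative assumption.

```python
from collections import Counter


def aggregate_credits(panel_responses):
    """Credit per Likert option: panel count divided by the modal count."""
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return {option: n / modal_count for option, n in counts.items()}


def score_response(response, credits):
    """Credit awarded to a candidate; options no expert chose score 0."""
    return credits.get(response, 0.0)


def high_credit_options(credits, threshold=0.8):
    """Options whose credit is at least `threshold` (hypothetical cut-off).

    More than one option here, especially on opposite sides of the
    neutral point, suggests a discordant panel worth reviewing.
    """
    return sorted(option for option, c in credits.items() if c >= threshold)


# Hypothetical panel of 30 on a -2..+2 scale: 15 experts answered
# "less useful" (-1), 13 answered "more useful" (+1), 2 answered 0.
panel_responses = [-1] * 15 + [1] * 13 + [0] * 2
credits = aggregate_credits(panel_responses)

print(round(score_response(1, credits), 2))   # credit for "more useful"
print(high_credit_options(credits))           # options competing for full credit
```

Under this scheme a candidate who picks the minority response of a nearly evenly split panel earns almost full marks, which is why excluding such questions, or reviewing the panel, matters for the defensibility of the final score.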
This, of course, raises the question of who should be in expert reference panels. It may be that generalists are more appropriate members of panels when it comes to assessing medical students. In the traditional specialties, generalists as opposed to subspecialists appear to be more able to answer a broader range of questions. We have preliminary data (not shown) to suggest that general practitioner panels will score around the same mean as a specialist panel but with a wider SD. In relation to our example of the optimal investigation for suspected pulmonary embolism, we speculate that recent medical graduates would have no problem in answering that question in a manner that would be acceptable for its inclusion in assessment. We are currently undertaking further research in this, including with a reference panel of recent medical graduates.
Perhaps the strongest evidence of the validity of our SCT relates to the extensive process involved in developing and selecting the questions. This started with question writing by an expert, with each question vetted by a single colleague before submission for ERP work. It continued with the ERP work itself, which would appear to “weed out” most unsuitable questions (because we excluded questions in which experts were discordant in their responses), and ended with a review by an experienced committee of examiners before deployment of the questions.
Based on the literature, we were expecting a Cronbach alpha of around 0.8 for our SCT. An alpha of 0.78 was computed for the 158 questions sat by the 23 students in the 6th Year sample. We were therefore surprised to compute an alpha of only 0.62 on the same questions in our 5th year cohort. We recalculated alpha after eliminating items with a negative item:total correlation, which of course improved the statistic, but on these data had surprisingly little effect on the results. In retrospect, we believe that identifying items with a negative item:total correlation should prompt a closer look at those items for item writing flaws, but is not an indication to remove them unless flaws are uncovered.
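The reliability calculation described above can be sketched as follows. This is a minimal illustration on a tiny hypothetical data set, not our analysis. It assumes the standard formula for Cronbach's alpha, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), and it uses the corrected form of the item:total correlation (each item against the total of the remaining items), which is an assumption because the form used is not stated. Real SCT item scores are partial credits between 0 and 1; binary scores are used here only to keep the arithmetic easy to follow.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a students x items matrix of item scores."""
    k = len(scores[0])
    item_variances = [variance(col) for col in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation; undefined if either series is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(scores):
    """Correlation of each item with the total of all other items."""
    totals = [sum(row) for row in scores]
    result = []
    for col in zip(*scores):
        rest = [t - c for t, c in zip(totals, col)]
        result.append(pearson(col, rest))
    return result


def drop_items(scores, indices):
    """Copy of the matrix without the item columns listed in `indices`."""
    return [[v for j, v in enumerate(row) if j not in indices] for row in scores]


# Hypothetical scores: 5 students (rows) x 4 items (columns).
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

negative = [j for j, r in enumerate(item_rest_correlations(scores)) if r < 0]
alpha_all = cronbach_alpha(scores)
alpha_trimmed = cronbach_alpha(drop_items(scores, negative))
print(negative, round(alpha_all, 3), round(alpha_trimmed, 3))
```

In this toy example removing the flagged item changes alpha sharply because there are only four items; with a long test the effect of removing a few items is typically much smaller. The list of flagged items is best treated as a prompt to review those items rather than as grounds for deleting them.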
There was no formal collection of faculty perceptions of the new test format. Test preparation took a lot of energy from faculty, but the general impression was that the format induced reasoning activities that are closer to the reality of practice than those induced by our MEQ, the format formerly used. Like all students, ours generally do not like changes in the assessment system, and this was reflected in mixed student evaluations of the experience of the new SCT. Since then we have introduced the SCT in earlier years of the medical program, and by the later years our students are now very familiar with it. Reports to faculty on the SCT by student representatives are now favourable.