According to the IRT-based analysis, the psychometric properties of the professionalism SRQs were inferior to those of the items testing knowledge of anatomy. In particular, professionalism items were relatively poor at discriminating between candidates. This is highlighted by the low person separation indices observed for these items: to discriminate reliably between two groups of candidates, a person separation index of more than two would be required, whereas for the professionalism items these values were much less than one. The relationships between Conscientiousness Index scores, professionalism peer nominations and performance on anatomy SRQs were modest but statistically significant. However, no such relationships were observed between the former two measures and performance on professionalism items. The third set of items, the skills items relating to the application of knowledge, was observed to have psychometric properties intermediate between those of the professionalism and anatomy items. Although there were no statistically significant associations between performance on skills items and the ratings of conscientiousness and professionalism, there was at least the suggestion of a trend. As with the professionalism items, the person separation indices for the skills items were relatively low. Taken together, the characteristics of the three types of item may imply that the testing of applied, as opposed to pure, knowledge is generally less reliable using the SRQ format. This possibility may at least partly explain the poor psychometric properties of the professionalism items, which suggest that SRQs may not be an appropriate measure or predictor of professionalism, at least for undergraduate medical students.
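To make the separation thresholds above more concrete, the following is a minimal sketch of the standard Rasch relationship between the person separation index G and person reliability R, namely R = G² / (1 + G²). The function names and the example values of G are illustrative assumptions and are not taken from the study data.

```python
import math


def separation_to_reliability(g: float) -> float:
    """Convert a person separation index G to person reliability R = G^2 / (1 + G^2)."""
    if g < 0:
        raise ValueError("separation index must be non-negative")
    return g ** 2 / (1.0 + g ** 2)


def reliability_to_separation(r: float) -> float:
    """Inverse conversion: G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    if not 0.0 <= r < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(r / (1.0 - r))


if __name__ == "__main__":
    # Hypothetical separation values chosen only to illustrate the scale.
    for g in (0.5, 1.0, 2.0):
        print(f"G = {g:.1f} -> reliability = {separation_to_reliability(g):.2f}")
```

Under this relationship, a separation index of two corresponds to a reliability of 0.80, whereas a separation index below one corresponds to a reliability below 0.50.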
Whilst some clinical exposure occurs during the first two years of Durham University Medical School training, knowledge-based performance is still the main focus of study. Therefore, conscientious study may be more closely allied to peer perceptions of professionalism than in later stages of medical training, where more patient and staff interactions are observed amongst peers. It could be argued that performance on anatomy items most closely reflects this aspect of professionalism, given that without conscientious study it is difficult to perform well on this topic. However, the converse was not true, in that those whom peers perceived as least professional did not demonstrate poorer performance in any area assessed by SRQs. This suggests that medical students may be relatively accurate at perceiving high but not low levels of conscientiousness, in contrast to previous findings where the Conscientiousness Index was associated with low but not high ratings of professionalism. This apparent anomaly could be due to the wider definition of the Conscientiousness Index, which encapsulates a range of information on behaviour, in contrast to anatomy performance, which is restricted in scope. Thus, these two correlates of conscientiousness may be related to professionalism in subtly different ways.
It is also necessary to explain why the present findings seem to be at odds with those reported by Patterson et al., namely that SJTs predict workplace performance by GP trainees [13]. There are two possible explanations. Firstly, professionalism may be developmental in nature, and perhaps early undergraduate medical students do not respond appropriately to SJTs because they have not yet developed appropriate situational judgement. The other, more encouraging, explanation is that SJTs measure aspects of professionalism different from those measured by the Conscientiousness Index. The strongest association we have found between the Conscientiousness Index and professionalism suggests that conscientiousness accounts for 25% of the variance in professionalism. While this is the largest single component that has been identified, at least to our knowledge, it leaves room for other, unknown, components to play significant roles, and there is no reason to believe that these co-vary with conscientiousness. The other four members of psychology's 'Big Five' (extroversion, neuroticism, agreeableness and openness to experience [29]) would be obvious candidates. Equally, Wilkinson identifies five clusters of measures of professionalism, one of which clearly correlates with conscientiousness, and the other four may well represent different aspects of professionalism [2]. In addition, it is possible that increased patient exposure in later training years may increase students' understanding of the correct response in clinical situations and lead to more consistent responses to items related to professional behaviour.
Rasch analysis has previously been shown to be a useful approach when exploring the psychometric properties of medical undergraduate exam SRQs [30]. Although not the focus of the present study, the findings from the present Rasch analysis of the exam items also suggest that SRQ format (e.g. EMQ versus MCQ) may influence item characteristics in a topic-specific way. This observation merits further research. More importantly, the findings from the present study should raise some concerns regarding the use of SJTs for selection to Foundation, as proposed by the Medical Schools Council. As the candidates for these latter high-stakes assessments fall between undergraduate and postgraduate, supporting evidence regarding the properties of these tests in populations at that stage of professional development is urgently required. If these tests do not perform adequately, this may result in strong candidates failing to obtain one of their preferred foundation year posts or, in the worst-case scenario, any post at all.
Strengths and limitations
This is the first published study to combine two distinct indices of professionalism with SRQ performance, using an IRT approach. The application of IRT allowed an interval metric of ability to be constructed from the exam question responses. Moreover, the psychometric properties of the items could be explored more fully than classical test theory would normally allow. Ideally, the ratings of professionalism in the two year groups would both have been derived from tutor group ratings; as they were not, some caution must be exercised in interpreting the professionalism nominations. One of the strengths of IRT is the ability to derive relatively distribution-free measures of performance. It would therefore have been desirable to use test equating via shared items to link absolute SRQ ability across year groups rather than standardised Rasch scores, although the lack of shared questions precluded this.
The SRQ response data utilised in this study did not include sociodemographic variables, such as gender and ethnicity. Thus, it was not possible to assess the response data for the presence of differential item functioning (DIF; response bias not due to underlying ability) according to such candidate characteristics. This may be an important area of future research.
The Monte Carlo simulation suggested that the item difficulty estimates were precise and reliable in the majority of cases. However, item discrimination and guessing parameters should ideally be evaluated via a less constrained model, such as the two-parameter logistic model (which adds item discrimination) or the three-parameter logistic model (which also adds guessing), rather than estimated using the more constrained Rasch model. Such models require more respondents for stable estimation. Thus, the application of IRT, whilst possible with a relatively small number of respondents, is more suited to larger population samples.
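For reference, the difference between the Rasch model and the less constrained logistic models can be sketched with a single item response function. This is a generic textbook formulation, not the estimation code used in the study; the function name, parameter names and example values are illustrative assumptions. Setting the discrimination a to 1 and the guessing parameter c to 0 recovers the Rasch form, in which only the item difficulty b varies between items.

```python
import math


def irt_probability(theta: float, b: float, a: float = 1.0, c: float = 0.0) -> float:
    """Probability of a correct response under a logistic IRT model.

    theta: candidate ability (logits)
    b:     item difficulty (logits)
    a:     item discrimination (fixed at 1 in the Rasch form)
    c:     lower asymptote or 'guessing' parameter (fixed at 0 in Rasch and 2PL)
    """
    if not 0.0 <= c < 1.0:
        raise ValueError("guessing parameter must lie in [0, 1)")
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    theta, b = 0.0, 0.0  # hypothetical candidate and item, equal in ability and difficulty
    print("Rasch:", irt_probability(theta, b))
    print("2PL  :", irt_probability(theta, b, a=2.0))
    print("3PL  :", irt_probability(theta, b, a=2.0, c=0.2))
```

When ability equals difficulty, the Rasch and two-parameter forms both give a probability of 0.5, while a non-zero guessing parameter raises it. Each additional free parameter per item must be estimated from the same responses, which is why the fuller models are usually paired with larger samples.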