The initial results indicate good interrater reliability for the SOAT for the specific scenarios that were used. The ankle assessments had the highest reliability coefficients (α = .91), followed by the shoulder and knee when the scores of the SPs were included in the grading process. The reliability of the ankle assessment may have been highest because it had the fewest special tests and likely the least complex scenario compared with the shoulder and knee assessments. Future testing with more scenarios is needed to compare the SOAT reliability of 1 joint with another.
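For reference, the Cronbach α statistic reported above takes its standard form for a scale of k items, where σ²ᵢ is the variance of item i and σ²_X is the variance of the total score (the article does not give its computational details, so this is simply the conventional definition):

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right)
```

Higher values indicate that item scores covary strongly with the total score, which is why α is used here as an index of internal consistency across raters' itemized grades.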
During the content validation study,4 some athletic therapists anecdotally questioned whether an assessment tool that permitted such flexibility for both examinee and examiner would allow consistent grading. As our study shows, the initial scale reliability results of the SOAT support the need for further research into its use as a standard protocol for orthopaedic assessment.
Traditional OSCEs commonly employed in medical education to measure a clinical construct are limited by binary tools.17,18
In contrast, the SOAT employs both binary and global ratings of performance. Some authors have criticized traditional OSCEs, stating that they have a tendency to trivialize the underlying construct that they attempt to measure and thus call their overall validity into question.19,20,25
The trivialization seems to have 2 causes: (1) all scales are binary, thus removing the rater's expert opinion of the performance,25 and (2) scales are traditionally a list of detailed tasks intended to represent the underlying construct.19,20
In contrast, a rater using the SOAT is grading more than just a dichotomous scale. Raters make judgements based on the individual's performance rather than on a predetermined set of answers. In addition, tasks and grades are associated with the examinee's ability to draw connections between categories, thus permitting complete flexibility in decision making throughout the orthopaedic assessment. In other words, the SOAT requires raters to use expert judgement when grading examinees and changes how raters evaluate each examinee relative to another. For example, if one examinee decided to use the Lachman test for anterior cruciate ligament stability but another chose the anterior drawer test, both choices could be considered correct if applied appropriately. Furthermore, after the examinee chooses a test, the rater can grade not only whether the examinee selected the correct test for anterior cruciate ligament stability but also how well the examinee performed the test, using a continuous scale rather than merely recording whether the test was done. Provision for rater judgement and decision making may add to the overall validity of the SOAT in measuring the underlying construct, particularly relative to traditional OSCEs.
In addition to rating students on detailed tasks within each major category and history subcategory, raters also can rate students on global scales. Researchers originally thought that dichotomous scales would increase reliability, but authors of many studies with a global rating scale have shown that this theory is false.25–27
As a result, the SOAT was designed as a hybrid of detailed checklists and global rating scales. Both the addition of global rating scales to the detailed checklist and the iterative nature of grading examinees with the SOAT may have helped address concerns of validity with OSCEs (or practical, performance-based examinations).
Conceptually, the SOAT has a slightly different approach from the one Denegar and Fraser1
proposed. They recommended that evidence-based decisions for special tests be employed during the physical examination. In support of this approach, some authors of meta-analyses have proposed the superiority of the sensitivity and specificity of some special tests compared with others.28,29
However, even if 1 special test demonstrates superior diagnostic power compared with another, how each special test integrates into an overall orthopaedic assessment has not been shown. Ideally, a standardized assessment protocol should be developed for the evaluation of all patients with knee injuries. Yet those protocols will likely be limited by the many factors associated with each case. Based on these limitations, the SOAT was designed to address concerns that OSCEs tend to trivialize content, bringing into question the overall validity of the measurements.19,20,25
The strong reliability of the scales may be attributed to the extensive content validation undertaken early in the development of the rating scale. In addition, a thorough rater training session may also have contributed to the tool's overall reliability.4
Some investigators have not reported the validation process with performance-based examinations, leading readers to infer that this step in the overall validation may not have been established before measuring the reliability of the tool employed.30,31
Ignoring the initial content validation phase in the development of competency measures may result in lower overall scale reliability.12
In contrast, good reliability does not ensure good validity. A balance and constant interaction between validity and reliability are critical to an instrument's evolution toward construct validity.32
A final explanation of the SOAT's reliability could be the extensive training required for each rater and SP before the examination. Part of the training process included a review of the rules associated with the SOAT. The SOAT rules and assumptions for use originally were published by Lafave et al.4
The SP and rater training session was standardized through a common PowerPoint (version 2003; Microsoft Corp, Redmond, WA) presentation and set of detailed instructions on how the tool was to be used during the examination. The training session was interactive, permitting rater trainees to ask questions and gain clarification based on the specific scenarios. Although the training session took approximately 3 hours to complete, the explicit training on the procedures for using the SOAT may have contributed to the resulting strong reliability coefficients.
One limitation of our study was the research design. Ideally, a fully crossed generalizability design, in which each examinee is tested by the same raters across multiple cases, is warranted. However, the SOAT is unlike traditional OSCEs that have a station length of approximately 10 to 15 minutes and multiple stations to measure the same underlying construct. Rather, the SOAT is designed so that the history, physical examination, and interpretation are not only part of the same station over a 30-minute period but also rely on the subsequent section for rater scoring. Two 30-minute stations for each of 30 students, booked in 45-minute time slots, equates to 60 slots and 45 hours of examining. Practicality was one of the main psychometric issues that Harden and Gleeson18 raised. Even if ample rest is provided between testing time blocks, testing students for 45 hours does not seem practical. Thus, we involved multiple examiners in our study, which limited the research design and the statistical analyses available (ie, ICC or Cronbach α).
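The 45-hour estimate above can be reproduced with a short calculation (a sketch using only the figures given in the text):

```python
# Examination time for the design described above:
# 30 students, each completing 2 stations, with every
# station sitting booked in a 45-minute time slot.
students = 30
stations_per_student = 2
slot_minutes = 45

total_minutes = students * stations_per_student * slot_minutes
total_hours = total_minutes / 60
print(total_hours)  # 45.0
```

The 60 station sittings at 45 minutes each yield the 45 examining hours that made a fully crossed design impractical.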
Another limitation of our study was that participants were limited to student volunteers from the clinical practicum class at Mount Royal College in the winter semester of 2006. This convenience sample may not permit the results to be generalized to other athletic therapy student populations in Canada or elsewhere. Although generalizability theory may have offered a solution to this limitation, the rationale for not employing that technique is explained in the preceding paragraph. Future investigators should therefore test the SOAT with a broader population of athletic therapy students beyond the Mount Royal College environment.