Described competence levels
Prior to developing and administering the assessment described in this article, it was necessary to produce an initial assessment specification. A panel of eight subject-matter experts composed a hierarchy of three ordered competency levels (see Table ). These were intended to capture the range of knowledge and skills required for safe and effective practice across all levels of responsibility. In conjunction with a blueprint of content strands, they were also used to guide item construction.
Treating missing responses
Missing responses, of which there were relatively few (see column missing in Table ), were treated as incorrect. This seemed reasonable given the proposed use of the FSEP instrument as a high-stakes regulatory assessment.
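As a minimal illustration of this scoring rule (a sketch assuming a hypothetical 0/1 response matrix in which NaN marks an omitted response):

```python
import numpy as np

# Hypothetical scored responses (rows: candidates, columns: items);
# NaN marks a missing response.
responses = np.array([
    [1.0, 0.0, np.nan, 1.0],
    [np.nan, 1.0, 1.0, 0.0],
])

# Treat missing responses as incorrect (score 0) before analysis.
scored = np.nan_to_num(responses, nan=0.0)
print(scored)
```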
FSEP assessment classical and Rasch item statistics (n = 877)
Classical statistics, together with Rasch calibration estimates, are presented for all items in Table . For each item, the summary statistics were as follows (a brief computational sketch of several of these statistics is given after the list):
1. Item number in test form (Qn)
2. Item bank identification tag (ID)
3. The mean item score, i.e. the proportion of the sample answering correctly (p)
4. The item standard deviation (sd)
5. The classical item discrimination (d), which should be at least 0.2 but preferably closer to, or in excess of, 0.4 [29]
6. The correlation between the item score and the test total score (r), which should be at least 0.3 [30]
7. The alpha reliability estimate of the test if the item were omitted (a)
8. For each item, the percentage of candidates selecting each response option and the corresponding point-biserial correlation; the latter should be positive for the correct response and negative for incorrect responses
9. The percentage of missing responses (missing)
10. The item difficulty (Logit)
11. The infit mean square estimate (Infit), which should be within the range 0.8 to 1.2 [31]
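As a brief computational sketch of several of these statistics (assuming a complete 0/1 scored response matrix; the discrimination index d, often computed from upper and lower score groups, is omitted here):

```python
import numpy as np

def classical_item_stats(X):
    """Classical item statistics for a 0/1 scored matrix X
    (rows: candidates, columns: items)."""
    n_persons, n_items = X.shape
    p = X.mean(axis=0)          # mean item score / proportion correct (p)
    sd = X.std(axis=0, ddof=1)  # item standard deviation (sd)
    total = X.sum(axis=1)
    stats = []
    for i in range(n_items):
        rest = total - X[:, i]  # total score excluding item i
        r = np.corrcoef(X[:, i], rest)[0, 1]  # corrected item-total correlation (r)
        # Cronbach's alpha with item i deleted (a)
        Xd = np.delete(X, i, axis=1)
        k = Xd.shape[1]
        alpha = (k / (k - 1)) * (1 - Xd.var(axis=0, ddof=1).sum()
                                 / Xd.sum(axis=1).var(ddof=1))
        stats.append((p[i], sd[i], r, alpha))
    return stats
```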
The median item discrimination value was 0.32, with an interquartile range of 0.24 to 0.37, indicating that most items discriminated positively and effectively. Modest values were anticipated for the relatively easy items. However, seven items had classical item discrimination values between 0.03 and 0.2, and these items warranted a qualitative review. In particular, items 4, 15, 24 and 32 (with discrimination values of 0.1 or less) were omitted from the empirical derivation of described competency levels and from the residual-based evaluation of scale dimensionality. The standard error of measurement associated with each logit value was approximately 0.07 logits, which was deemed sufficiently small for the purposes of this study.
Several key test parameters were also tabulated from the CTT and Rasch analyses so that any weaknesses concerning the assumptions and requirements of the assessment could be identified. These are summarised in Table .
Test statistics from CTT and IRT analyses of the 40-item RANZCOG FSEP trial instrument
The reliability and fit values in Table provide initial evidence that a single dominant latent variable underpinned the set of items. The PCA of Rasch residuals produced a first residual component with an eigenvalue of 1.65, explaining 4.65% of the variance. These figures suggest that no additional structures were apparent in the FSEP scale. The analysis of the proportion of non-invariant person scores revealed a slight departure from unidimensionality: the proportion of differences was 7.75% (68/877), with a 95% confidence interval lying slightly in excess of the 5% critical value (see Table ). Significant changes in fewer than 59 individual scores would have been required to accept the hypothesis of unidimensionality (and the local independence of items). This result may improve following a review of the remaining underperforming items, and it will also prompt a qualitative review of differences between the cognitive demands of the positively and negatively loading items in light of the intended construct. Overall, the test statistics indicated that the test could reasonably separate candidates on the basis of ability (i.e. that it possessed acceptable criterion validity) while also demonstrating construct validity. The priority emerging from these statistics was that the number of items would need to be increased in the move to high-stakes applications of the assessment.
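As a worked check on these proportions, a normal-approximation confidence interval reproduces the reported pattern (a sketch; the original analysis may have used a different interval procedure):

```python
import math

n = 877        # candidates
k = 68         # non-invariant person scores
p_hat = k / n  # observed proportion, about 0.0775

# Normal-approximation 95% confidence interval for the proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"{p_hat:.4f} [{lo:.4f}, {hi:.4f}]")  # lower bound ~0.060, above the 0.05 criterion
```

Under this approximation the lower bound of the interval falls just above 5%; with 58 or fewer non-invariant scores the interval would have included the 5% critical value, consistent with the statement above.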
Many of the characteristics of the test can be identified from what is referred to as a Variable Map, presented for the RANZCOG FSEP test in Figure . The chart has several sections. Working from the left of the figure, the first is a scale ranging from approximately -2.0 to +3.0. This is the logit scale, the metric of the Rasch model analysis, which enables person ability and item difficulty to be mapped conjointly. The distribution of candidate ability is presented next, with each 'X' representing 1.3 candidates. It clearly shows that the range of ability is quite broad.
Variable map from the initial assessment data.
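For reference, the logit metric arises from the dichotomous Rasch model. Assuming standard notation, with person ability theta_n and item difficulty delta_i, the probability of a correct response is

```latex
P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
```

so the response probability depends only on the distance between person and item on the logit scale, which is what allows abilities and difficulties to be mapped conjointly.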
The next component of the chart is the distribution of items illustrating their relative difficulty. It can be seen that these ranged from about -1.5 to 2.5 logits. Codes are used to describe and identify each item. In this program, colour coding and level labels (i.e. L1, L2 and L3) were also incorporated to differentiate between items that were intended to tap into the different hypothesised levels. This highlighted items that were perceived by the practitioner population as being more difficult or easier than the item writing panel had envisaged. For example, item 32 (labelled 32 L1) was found to be more difficult than the expert committee had expected. It was therefore suggested that the item be qualitatively reviewed so that an explanation for this discrepancy might be established.
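For readers reproducing this kind of display, a conjoint person-item map can be drawn with standard plotting tools. The sketch below uses simulated abilities and difficulties, not the study data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated person abilities and item difficulties (logits) for illustration only.
rng = np.random.default_rng(0)
abilities = rng.normal(0.5, 0.8, 877)
difficulties = np.linspace(-1.5, 2.5, 40)

fig, (ax_p, ax_i) = plt.subplots(1, 2, sharey=True, figsize=(6, 6))
# Left panel: distribution of candidate ability on the logit scale.
ax_p.hist(abilities, bins=30, orientation="horizontal")
ax_p.invert_xaxis()
ax_p.set_ylabel("Logit scale")
ax_p.set_title("Candidates")
# Right panel: item difficulties mapped onto the same scale.
ax_i.plot(np.zeros_like(difficulties), difficulties, "k_", markersize=12)
ax_i.set_title("Items")
plt.show()
```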
Figure illustrates how the items on the test were divided into blueprinted content domains. It can be seen that the distribution of item difficulties varies across domains and does not necessarily cover the range of candidate abilities in every blueprint column. For some content domains (such as Definitions) this might be appropriate, but for others it may be necessary to introduce new items.
It can also be seen that there are gaps, as highlighted by the ellipses. This is important information for two reasons: firstly, subsequent item and test development can be informed by targeting new items at the locations of the gaps; and secondly, these gaps indicate regions in which there is minimal information about candidate skills. Appropriately writing and seeding new items to fill these gaps increases the capacity of the test to discriminate between candidates of similar abilities on the basis of knowledge and skills matched to the regions where such gaps emerge [19].
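Under the Rasch model, these gaps can also be expressed in terms of test information, which at ability theta is the sum of P(1 - P) over items; information dips wherever item difficulties are sparse. The following sketch uses hypothetical difficulties (not the study data) to illustrate the idea:

```python
import numpy as np

def test_information(theta, difficulties):
    """Rasch test information at ability theta: the sum of P(1 - P) over items."""
    p = 1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties))))
    return float(np.sum(p * (1.0 - p)))

# Hypothetical item difficulties (logits) with a gap between 0.5 and 1.8.
diffs = [-1.5, -1.0, -0.5, 0.0, 0.3, 0.5, 1.8, 2.0, 2.5]
for theta in np.linspace(-2.0, 3.0, 11):
    print(f"theta = {theta:+.1f}  information = {test_information(theta, diffs):.2f}")
```

Candidates located in a low-information region are measured less precisely, which is the statistical counterpart of the gaps highlighted in the figure.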
Interpreting the test: Competency levels
In addition to ability measures, other measures related to educational outcomes were derived from the data. One of these, referred to here as competency levels, relates directly to criterion-referenced interpretation of tests. Combining criterion-referenced interpretation with item response modelling directly links the position of a person or an item on the variable (as shown in the variable map) to an interpretation of what a practitioner, or group of practitioners, can do, rather than focussing on a score or on performance relative to a percentage or a group. It also orients the use of the test data towards substantive interpretation of the measurement rather than the reporting of a score or grade. The procedure gives meaning to test scores and is used here to derive the substantive interpretation of the levels of increasing competence.
It can be seen from the variable map in Figure that several items grouped together at different points along the uni-dimensional scale. The major question was whether these clusters could be interpreted as having something in common. Each item was reviewed for the skills involved in responding to it; assigning meaning to the clusters was a matter of substantive interpretation. The process required an understanding of, or empathy with, 'how practitioners think' when they respond to the items.
To assist in this procedure, the items were ordered by increasing difficulty (logit value). Each item was also analysed for the underpinning cognitive skill involved in obtaining the correct answer. The results of these analyses are presented in Table and Figure .
Cognitive skills audit as carried out by the specialist panel
Items ranked by relative difficulty with cognitive skill descriptions (preliminary item cluster cut-points shown as vertical dashed lines).
The question then arose: if the difficulty increased across sets of items, did the nature of the underpinning skill also alter? The two sets of information were therefore explored in unison. Natural breaks in difficulty were identified, and the items and their cognitive descriptions were then examined to determine whether a set with a common substantive interpretation could be found. Measurement error was also taken into consideration to determine whether breaks were statistically significant (see the PISA 2003 Technical Report published by the OECD for a description of this issue). A panel of eight subject-matter experts undertook this exercise; together they identified the breaks in the variable and then offered the substantive interpretation of the levels of proficiency. These have been presented in Table .
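As a minimal illustration of the measurement-error check, a break between adjacent items on the ordered list can be tested against the combined standard errors of their difficulty estimates (a sketch assuming the roughly 0.07 logit standard errors reported above; the panel combined such checks with substantive judgement):

```python
import math

def significant_break(d1, d2, se1=0.07, se2=0.07, z=1.96):
    """True if the gap between adjacent item difficulties (in logits)
    is larger than measurement error alone would explain."""
    gap = abs(d2 - d1)
    return gap > z * math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical adjacent difficulties from an ordered item list.
print(significant_break(0.45, 0.52))  # False: within error, not a cut point
print(significant_break(0.45, 0.95))  # True: a candidate cut point
```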
Empirically derived competency levels for the FSEP assessment variable
Establishing competency levels empirically
With a view to deriving described competency levels, initial analysis focused on the ordered item bar chart (Figure ). Four or even five potential cut points or levels initially appear where there is an abrupt change in relative item difficulty across groups of items. Closer analysis of the chart, and of the discriminatory capacity of the individual items, reveals just three distinct cut points. Equally, around each of these points there is an identifiable change in the relative cognitive skills required of participants.
In order to identify the cut points, analysis of the individual items was undertaken. As previously described, item performance (Table ) revealed four questions with low discrimination values, requiring their removal from the item bank and their exclusion from this exercise. Also, as previously discussed, the variable map (Figure ) highlights questions which appeared more or less difficult than the subject-matter experts had supposed. Closer inspection of these items typically revealed simple but fundamentally important concepts embedded in inappropriately complex questions. Bearing in mind the limitations of these individual items and their relative positioning on the ordered item bar chart, a direct comparison with the skills audit demonstrates the three cut points previously alluded to.
Up to cut point 1 (ID1033) [Level 1], the bulk of the items are simple questions based around simple concepts, such as recall of definitions and application of those definitions. There is some basic physiology of fetal heart rate (FHR) control, generally as applicable to the normal CTG. Most of these early items have no CTG example or clinical scenario attached, and therefore little synthesis of information is required. Of the two questions which do have a CTG attached, both require low-level synthesis of "normal" CTGs to achieve the correct response.
From cut point 1 (ID1015) [Level 2], the items require a slightly deeper understanding of the physiology of FHR control and the application of that physiology to a given circumstance. Increasingly, these questions involve CTG/FHR abnormalities. There are still some recall-of-definition items, a few of which have an associated CTG and/or clinical scenario. Around half of these Level 2 items have a CTG and/or clinical scenario, and increasingly these items require moderate-level synthesis.
From the second cut point (ID1017) [Level 3], almost all the questions have an associated CTG and clinical scenario. Of the two which do not, one is marked for removal and the other requires rewriting to improve its discriminatory capacity. Increasingly, these items require high-level synthesis of a clinical scenario, an abnormal CTG and its implicit physiology in order to determine appropriate management. Increasingly, the CTGs associated with these Level 3 items involve less common but high-risk examples of CTG abnormalities.
The skills within each cluster were then paraphrased to produce global proficiency level descriptors. This process utilised the range of sources of evidence described previously: the Rasch variable map (Figure ), the bar graph of items ranked by relative difficulty (Figure ), and the descriptions of cognitive skills embedded within the items (Table ). The relative merit of each item's statistical performance was also noted with reference to Table . These pieces of evidence, analysed in unison, provided a methodology for locating and defining the substantive skills within different bands of the measurement scale.
Evaluating competency level agreement
Qualitatively evaluating the agreement between the contents of Table and Table constituted a key step in justifying the extent to which construct validity had been achieved (beyond the indications of the reliability and fit statistics).
The number of levels identified empirically for this test was the same as the number of levels hypothesised by the panel of experts. While this condition is not necessary to uphold consistency between hypothesised and derived levels, it does simplify evaluation of their correspondence. The empirically derived levels from the skills audit follow the same direction, structure and content as those mooted by the expert panel in the initial development of the test. The link between the hypothesised and derived constructs is therefore clear, and the empirical data strongly support the planned direction of the test. Given a set of circumstances, the test can be used, with confidence commensurate with measurement error, to identify the level of competence of the practitioners who have been assessed with the instrument.
The order and content of the knowledge involved in the levels identify the typical competence of the practitioners. It is also evident that the distribution of practitioners covers the full range of the variable. The expert panel will use these results to decide how to treat these data and how the performance levels should be used. More importantly, those involved in providing the education program can use the distribution of practitioners over levels of competence to identify points of intervention for supplementary training, and to advise practitioners regarding their likelihood of success in a retest situation after additional training.