Results of the item analyses are reported in the item-analysis table, which provides the item difficulty and item discrimination values at baseline (full sample) and at post-intervention (intervention and control groups separately). At baseline, item difficulty varied widely, from a low of 14% correct to a high of 85% correct. Point-biserial (p-bis) correlations revealed generally strong discrimination, with values ranging from a low of .10 to a high of .54. As described above, these values represent the correlation between the response to a single item and the score on the complete set of items on the same test form, which indicates each item's ability to distinguish more knowledgeable from less knowledgeable individuals. These results indicate that the first two criteria for item adequacy (variability in item difficulty and high item discrimination) were generally met at baseline.
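The two item statistics described above can be illustrated with a minimal sketch. It assumes responses have already been scored as 0 (incorrect) or 1 (correct); the function name `item_statistics` and the four-examinee data set are hypothetical and are not taken from the study. Following the description in the text, each item is correlated with the total score on all items, so the item itself is included in the total.

```python
import numpy as np

def item_statistics(scored):
    """Return (difficulty, discrimination) for a 0/1 scored response matrix.

    scored: rows are examinees, columns are items.
    difficulty: proportion of examinees answering each item correctly.
    discrimination: point-biserial correlation between each item and the
        total score (the item is included in the total, as in the text).
    """
    scored = np.asarray(scored, dtype=float)
    difficulty = scored.mean(axis=0)
    total = scored.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scored[:, j], total)[0, 1]
        for j in range(scored.shape[1])
    ])
    return difficulty, discrimination

# Hypothetical scored responses: 4 examinees x 3 items (illustration only).
scored = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
])
difficulty, discrimination = item_statistics(scored)
print(difficulty)       # proportions correct: 0.75, 0.5, 0.25
print(discrimination)   # one point-biserial value per item
```

An item that every examinee answers identically has zero variance, so its correlation is undefined; a fuller implementation would need to handle that case explicitly.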
Following the intervention, separate item statistics were calculated for the intervention and control groups. For the control group, test characteristics were expected to be generally stable from baseline to post-intervention, with similar patterns of difficulty and discrimination. For the intervention group, in contrast, difficulty values (i.e., p-values representing the proportion of correct responses) and discrimination values were expected to increase because of learning gains from participation in the intervention. Item difficulty p-values were also expected to be higher for the intervention group than for the control group, and higher at post-intervention than at baseline. Results of the item analysis at follow-up are reported in the same table. As expected, and similar to baseline, item difficulties (intervention group range 0.29 to 0.83; control group range 0.16 to 0.83) and discrimination values (intervention group range 0.22 to 0.60; control group range 0.13 to 0.65) met item adequacy requirements at post-intervention. For 13 of the 15 items, p-values were higher for the intervention group at post-intervention than for the full sample at baseline, which is consistent with knowledge gains from pre- to post-intervention. Comparing the two groups at post-intervention only, p-values were also generally higher for the intervention group than for the control group (13 of 15 items), which further supports the impact of the intervention on knowledge gains.
Finally, we investigated distractor quality for each item, at baseline and post-intervention, by calculating the p-bis for each incorrect response option. At baseline and at post-intervention (group statistics calculated separately), the p-bis values for the vast majority of distractors were less than .05. However, two items (7 and 14) each had one problematic distractor at both baseline and post-intervention. This finding indicates that there may be a problem with these two items, though it could be resolved by revising the one problematic response choice for each item. Furthermore, the distractor confusion may explain why these were two of the most difficult test items (see the item-analysis table).
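The distractor check can be sketched in the same way: for a single item, each incorrect option is turned into a 0/1 indicator (chosen or not chosen) and correlated with the total test score. The function name `distractor_pbis`, the option labels, and the data below are hypothetical. The .05 value mirrors the figure reported above; treating it as a flagging cutoff is an assumption made for illustration.

```python
import numpy as np

def distractor_pbis(choices, key, total):
    """Point-biserial correlation between choosing each distractor and total score.

    choices: option selected by each examinee for one item (e.g. "A", "B").
    key: the correct option, which is skipped.
    total: total test score for each examinee.
    """
    choices = np.asarray(choices)
    total = np.asarray(total, dtype=float)
    result = {}
    for option in sorted(set(choices.tolist())):
        if option == key:
            continue
        chose = (choices == option).astype(float)  # 1 if this distractor was picked
        result[option] = float(np.corrcoef(chose, total)[0, 1])
    return result

# Hypothetical responses to one item from six examinees (illustration only).
choices = ["A", "A", "B", "C", "B", "C"]
total = [14, 12, 9, 4, 5, 7]
pbis = distractor_pbis(choices, key="A", total=total)

# Assumed flagging rule: a distractor at or above .05 attracts examinees who
# score relatively well overall, which suggests the option may be confusing.
flagged = [option for option, r in pbis.items() if r >= 0.05]
print(pbis, flagged)
```

A well-functioning distractor is chosen mainly by lower-scoring examinees, so its correlation with the total score should be negative or near zero.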
Descriptive statistics for the test as a whole, at baseline (full sample) and at post-intervention (intervention and control groups), are reported in a separate table, which reveals the difficulty of this knowledge test. Overall, mean test scores were only moderate, though participant scores ranged across the full scale, from 0 to 15 points. Test scores increased from baseline to post-intervention for both groups, though the mean increase was larger for the intervention group. The results of the RM ANOVA also revealed greater gains from baseline to post-intervention for the intervention group than for the control group, with a significant interaction between time and group (F test, p<0.001; partial η²=0.161). This partial η² corresponds to a large effect size (Cohen 1988). The accompanying figure illustrates the changes from baseline to post-intervention for the two groups.
Knowledge test scores by time and condition
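As a rough check on the effect-size interpretation above, the reported partial η² can be converted to Cohen's f using f = sqrt(η² / (1 - η²)) and compared with Cohen's (1988) conventional benchmarks for f (0.10 small, 0.25 medium, 0.40 large). The sketch below is illustrative only: the function names are hypothetical, and applying benchmarks developed for η² to a partial η² from a repeated-measures design is a common convention rather than a step stated in the source.

```python
import math

def cohens_f_from_eta_sq(eta_sq):
    """Convert an eta-squared value to Cohen's f."""
    if not 0.0 <= eta_sq < 1.0:
        raise ValueError("eta_sq must be in the interval [0, 1)")
    return math.sqrt(eta_sq / (1.0 - eta_sq))

def label_effect(f):
    """Label Cohen's f using the conventional 0.10 / 0.25 / 0.40 benchmarks."""
    if f >= 0.40:
        return "large"
    if f >= 0.25:
        return "medium"
    if f >= 0.10:
        return "small"
    return "negligible"

# Partial eta-squared reported for the time-by-group interaction.
f = cohens_f_from_eta_sq(0.161)
label = label_effect(f)
print(round(f, 3), label)  # 0.438 large
```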
Age and gender differences were also examined. There was no significant difference between girls (M=2.67) and boys (M=2.26) on the total score of the KAPS, t(226)=2.40; 4th graders (M=6.70, SD=2.61) tended to score higher on the measure than 3rd graders (M=5.99, SD=2.27).
Correlational analyses indicated that the overall score on the KAPS was moderately associated with attributions of intentionality in relational situations (p<0.01) and in instrumental situations (p<0.01) on a commonly used hostile attributional bias measure (Crick 1995; Leff et al. 2006). Similar results were found when correlating the overall score of the KAPS with another measure of social cognitive processing (Hughes et al. 2004). For instance, the KAPS was moderately associated with attributions of intentionality on the SCAP (p<0.001). In addition, there were low-to-moderate correlations between the overall KAPS score and teacher reports of students' physical aggression (p<0.01) and relational aggression (p<0.05). Finally, the overall score on the KAPS was negatively related to peer nominations of overt aggression (p<0.05) and positively related to peer nominations of prosocial behavior.
In summary, item-level analyses suggest that 13 of the 15 items demonstrate strong psychometric properties across all item statistics, while the remaining 2 items demonstrate adequate item statistics. Combined with results suggesting that the test has strong test-retest reliability, is low-to-moderately associated with similar though distinct constructs (hostile attributions and ratings of student behavior by teachers and peers), and is robustly sensitive to treatment effects, these findings indicate that the KAPS has considerable potential for use with urban African American youth.