This paper reports on the evaluation of a self-assessment instrument designed for use with child care providers. Test-retest and inter-rater reliability, as well as criterion validity, were assessed using a weighted kappa statistic. Interpreting these data using the method proposed by Muñoz and Bangdiwala [44], overall reliability and validity of the instrument indicate that it is an accurate and stable measure of the child care environment. This approach provides less arbitrary, simulation-based interpretation guidelines for the kappa test statistic and improves upon the conventional method proposed by Landis and Koch in 1977 [43].
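For reference, the conventional Landis and Koch benchmarks can be expressed as a simple lookup. This is a sketch, not part of the study's analysis; the band labels and cutoffs are the standard ones from their 1977 paper, and their fixed, data-independent nature is precisely what motivates the criticism that they are arbitrary.

```python
def landis_koch_band(kappa: float) -> str:
    """Map a kappa statistic to the conventional Landis-Koch (1977)
    descriptive band. The cutoffs are fixed conventions rather than
    values derived from the data at hand."""
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_band(0.15))  # slight
print(landis_koch_band(0.65))  # substantial
```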
A limitation of the kappa statistic as a measure of concordance was demonstrated when analyzing these data. Question N5F assessed food used to control behavior and yielded a kappa statistic of 0.00. Because there was no variability in the scores reported on the self-assessment instrument for that question (all center directors reported a score of "4"), the weighted kappa (using the Cicchetti and Allison [45] weight) was unable to yield a meaningful test statistic and therefore did not accurately represent agreement between the two measures. With the exception of this one question (N5F), responses on the NAP SACC self-assessment ranged from 1–4 for 44 of the 56 questions. For 11 of the questions (N1B, N1E, N3C, N5C, N5D, N7D, PA1C, PA1D, PA2A, PA2B, PA3B), responses were limited to three of the four categories, with variation in which category went unselected; only for question N5F was a single response category selected by all respondents. Percent agreement for this question (N5F) was 87.88%, which provided some indication of reasonable concordance. In this specific case, an alternate test of agreement would be more appropriate [46].
Thus, in addition to weighted kappa statistics, percent (exact) agreement is also presented for these data. Although this measure does not account for agreement due to chance, and may therefore overstate agreement, it provided a more appropriate interpretation for question N5F and has merit in its own right.
Regardless of the statistical test used, for validity testing, scores on the self-assessment instrument were higher than those on the EPAO for more than two-thirds of the questions. This was expected, given that self-report may be associated with social desirability. Child care center directors may wish to describe their center in the best possible light, which is a limitation of the self-assessment approach. The original intent of the NAP SACC self-assessment instrument, however, was to raise awareness and spark interest in the child care staff completing it. Use of the instrument as a primary outcome measure for research studies is therefore not recommended, or should be undertaken with caution. A more objective measure, such as the EPAO, may be more appropriate if researchers hope to accurately capture policies and practices at the child care facility. The EPAO, however, is not without limitations. Observation that takes place over a single day will capture only those behaviors and practices that occur regularly or that happen to coincide with the day of observation. In addition, child care center staff may behave or interact differently with children in the presence of an outside observer. Observation repeated over several days may yield more accurate results, since sporadic behaviors could be observed and staff may be less likely to alter their behavior after a number of observation days. In general, questions that assessed the behaviors of staff (N1D, N1E, N2C, N4A, N4B, N5B, N5C, N5D, N7E, PA1D, and PA1E) had lower kappa statistics than questions that examined more concrete outcomes. The questions with the highest kappa statistics for both types of reliability assessed fixed, or tangible, aspects of the child care center environment (N3E, N7B, N9A, PA2B, PA3A, and PA6A), although this pattern did not hold for the validity kappa test statistics.
Review of documents (e.g., menus, lesson plans, policies) may help to supplement information gleaned from observation, but there is some evidence that menus may not always accurately reflect food served at the child care center [47].
When questions on the NAP SACC instrument were grouped by category and separated into those with a kappa test statistic below .20 and those with a statistic at or above .20, some within-category patterns emerged. Questions related to staff behavior and provision of food were fairly evenly split, while questions that assessed center behavior (e.g., fundraising practices) and the overall environment tended to include more questions with a lower kappa test statistic. The category that yielded the highest percentage of kappa test statistics at or above .20 was provision of physical activity.
An additional limitation of the study is the small sample size for test-retest reliability testing and the potential for non-response bias, as this subsample differed in racial composition from the total sample. Center directors who completed a second self-assessment instrument (n = 38) were more likely to be in centers that served predominantly white children and had fewer African-American and Native American children. No differences emerged between the center staff who participated in the inter-rater reliability (n = 59) and the validity (n = 69) testing.
Despite some limitations, results of validity testing in this sample of child care centers were not without merit. Validity testing yielded kappa statistics lower than those found for reliability, but still provided evidence of reasonable agreement between the two measurement instruments. Reliability testing generally yielded higher kappa statistics, and inter-rater reliability results were slightly better than those for test-retest reliability. Raters from the same child care center may have worked together and answered questions similarly, despite instructions to complete the self-assessment instruments independently, which is a limitation of this study. On the other hand, given that kappa statistics were excellent but not perfect, raters may simply have been accurately reporting the same behaviors and policies seen at their child care center.
Future studies may wish to employ both an objective measure of the child care environment and the self-assessment instrument pre- and post-intervention to see whether the instruments perform in a similar or parallel manner. Further assessment of the validity of the self-assessment instrument should be conducted using multiple days of observation, with less reliance on menus for documentation of actual food served. Questions with poor reliability and validity may be revised and retested, or eliminated from the final instrument.