Approximately 1.1 million students are enrolled in the New York City (NYC) public school system. At each of the 1,160 schools, absences are recorded on a "bubble sheet" during a designated homeroom class period. Bubble sheets are forwarded to the school's administrative offices and scanned into a local database that automatically transmits the data to a central database at the NYC Department of Education (DOE). Preliminary data are available by noon each day; however, these are subject to correction over the next few days for students who, for example, arrive late. The reason for absence is not recorded.
For the purposes of this evaluation, we downloaded attendance data for the 2001–02, 2002–03, and 2003–04 academic years from the DOE website [7]. Data consisted of the daily percent present (i.e., not absent) aggregated by 'Community School Districts,' which include elementary and middle school children (kindergarten through 8th grade), and 'High School' (9th grade and above). We calculated the median daily absentee rates separately for elementary/middle and high school students over the three-year period and examined whether the rates differed between the two groups. We hypothesized that absenteeism would be significantly higher among the older students, which would make age group-specific analyses necessary. We also assessed whether absenteeism varied by day of week and whether it was higher on days scheduled for parent-teacher conferences, state exams, or half days, so that we could control for such days. Statistical significance of differences was assessed using the Wilcoxon signed-rank test.
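The group comparison can be illustrated with a minimal Python sketch. It assumes the two series are paired by school day, which the text does not state, and it uses the large-sample normal approximation without tie or continuity corrections. The function name `wilcoxon_signed_rank` and all attendance figures are hypothetical; the analyses reported here were run in SAS.

```python
from math import erfc, sqrt
from statistics import median


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation).

    Returns (W_plus, two_sided_p). Pairs with zero difference are dropped.
    Assumption: no tie or continuity correction is applied to the variance.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    return w_plus, erfc(abs(z) / sqrt(2))  # two-sided p-value


# Hypothetical daily percent present for ten school days.
elem_present = [92.0, 91.5, 93.0, 92.5, 91.0, 92.2, 93.1, 90.8, 92.7, 91.9]
hs_present = [85.0, 84.0, 85.9, 84.8, 83.1, 83.9, 84.4, 82.0, 83.5, 82.1]

# Percent absent is the complement of percent present.
elem_absent = [100 - p for p in elem_present]
hs_absent = [100 - p for p in hs_present]

print(median(elem_absent), median(hs_absent))
print(wilcoxon_signed_rank(hs_absent, elem_absent))
```

With real data, the same paired comparison could be repeated for each day of the week or for scheduled low-attendance days versus ordinary days.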
To identify days on which prospective surveillance would have indicated a statistically significant overall increase in absenteeism (a 'signal'), and specifically to determine whether the system could provide timely indications of community-wide influenza, we retrospectively analyzed daily data from October 1, 2001 through June 25, 2004. The analysis mimicked prospective monitoring: each day's analysis used all data from September 15, 2001 up to and including the day of analysis, but no future data. Separate analyses were carried out for elementary/middle and high school students.
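As a rough illustration of how a retrospective analysis can mimic prospective monitoring, the sketch below yields, for each analysis day, only the records dated on or before that day. The function name `prospective_runs`, the record layout, and the sample values are hypothetical and are not taken from the study's code.

```python
from datetime import date


def prospective_runs(records, first_analysis_day):
    """Yield (analysis_day, history) pairs.

    `records` is a list of (date, percent_absent) tuples. Each history
    contains every record up to and including the analysis day, and
    nothing dated later.
    """
    records = sorted(records)  # oldest first
    for i, (day, _) in enumerate(records):
        if day >= first_analysis_day:
            yield day, records[: i + 1]


# Hypothetical records: data collection starts before the first analysis day.
sample = [
    (date(2001, 9, 17), 9.1),
    (date(2001, 9, 28), 8.7),
    (date(2001, 10, 1), 9.4),
    (date(2001, 10, 2), 8.9),
]
for day, history in prospective_runs(sample, date(2001, 10, 1)):
    print(day, len(history))
```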
Students are absent from school for many reasons unrelated to illness, and we wanted to control for these reasons whenever possible. We therefore took several steps to minimize the number of false positive signals that would normally result from a time-series analysis of data with many extreme data points (i.e., days with increased absenteeism). First, we removed extreme data points with a known explanation for absenteeism (e.g., Halloween) from the analysis because they were uninformative. Next, we adjusted the observed daily percent absent using a linear regression model based on an existing DOHMH ambulance dispatch surveillance model [1]. The percent absent was modeled as a linear function of day of week (parameterized as 4 dummy variables with Tuesday as reference) and whether or not the day was a scheduled low-attendance day (clerical half-days, parent-teacher conference days, and state exam days). We compared daily percentages with a 14-day baseline using a modified cumulative sums (CuSum) method [8]. The two modifications were: 1) We eliminated from the baseline any day on which the residual from the regression model was more than two times the standard error of all residuals. This reduced the influence of extreme, uninformative data points on the baseline mean and standard deviation. 2) We terminated CuSum signals if the percent absent returned to within 0.5 standard deviations of the baseline mean. This reduced the number of multi-day signals caused by a single-day spike in absenteeism that was extreme enough to produce signals for two or three consecutive days. The daily percent absent was plotted along with any CuSum C1, C2 or C3 signal [8], the daily number of emergency department (ED) patients ages 5–17 complaining of fever or influenza-like illness from an existing ED surveillance system [2], and the weekly number of influenza A and B isolates identified at NYC reference laboratories.
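In notation introduced here for illustration only, the adjustment model described above can be written as

$$\text{absent}_t = \beta_0 + \beta_{\text{Mon}} M_t + \beta_{\text{Wed}} W_t + \beta_{\text{Thu}} Th_t + \beta_{\text{Fri}} F_t + \gamma L_t + \varepsilon_t ,$$

where $M_t$, $W_t$, $Th_t$ and $F_t$ are day-of-week indicators (Tuesday is the reference), $L_t$ indicates a scheduled low-attendance day, and $\varepsilon_t$ is the residual.

The two CuSum modifications can be sketched in Python as follows. This is a simplified one-sided cumulative sum, not a reproduction of the C1, C2 and C3 statistics used in the study. The function name `modified_cusum` and the reference value `k` and decision threshold `h` are assumptions; only the 14-day baseline, the two-standard-error exclusion rule and the 0.5 standard deviation termination rule come from the text. The sketch also assumes that the standard error of the residuals is computed from days before the analysis day, in keeping with prospective monitoring.

```python
from statistics import mean, stdev


def modified_cusum(values, residuals, baseline_days=14, k=1.0, h=2.0,
                   outlier_mult=2.0, reset_sd=0.5):
    """Return one boolean per day, True where a signal fires.

    values    : daily (adjusted) percent absent, oldest first
    residuals : regression residuals for the same days
    k, h      : hypothetical CuSum reference value and threshold
    """
    signals = [False] * len(values)
    cusum = 0.0
    for t in range(baseline_days, len(values)):
        # Spread of all residuals observed before the analysis day.
        resid_se = stdev(residuals[:t])
        # Modification 1: drop baseline days with extreme residuals.
        baseline = [values[i] for i in range(t - baseline_days, t)
                    if abs(residuals[i]) <= outlier_mult * resid_se]
        if len(baseline) < 2:
            continue
        mu, sd = mean(baseline), stdev(baseline)
        if sd == 0:
            continue
        z = (values[t] - mu) / sd
        if z <= reset_sd:
            # Modification 2: end the signal once the value is back
            # within 0.5 SD of the baseline mean.
            cusum = 0.0
        else:
            cusum = max(0.0, cusum + z - k)
        signals[t] = cusum > h
    return signals
```

On a hypothetical series that alternates between 5.0% and 5.2% absent with a two-day jump to 9.0%, this sketch flags the two elevated days and stops signaling as soon as the series returns to baseline, because the spike days are excluded from later baselines and the running sum is reset.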
To assess geographic clustering in absenteeism, we obtained a more detailed dataset from the DOE, which consisted of the daily number of students registered and absent by school and grade during the 2001–02 academic year. Spatial clustering by school location was assessed using a modified, purely spatial scan statistic [9] with a 30-day baseline, a 1-day maximum temporal window, and a maximum spatial window of 20% of registered students. More than one significant spatial cluster per day was possible, and clusters of absenteeism could occur at a single school or at several schools in a contiguous geographic area. We evaluated whether this method would have detected the sole gastrointestinal school outbreak reported during the 2001–02 school year. Analyses were carried out using SAS version 8.0 (SAS Institute, Cary, NC) and SaTScan version 4.0.3 (available free at [11