The forms from eight states, Alberta, and the Caribbean were located on the World Wide Web. Sixteen Tuberculosis Controllers designated by the National Tuberculosis Controllers Association responded to e-mail requests for copies of their contact investigation forms. The remaining forms were obtained through phone or postal mail requests. Two state health departments were unresponsive. Overall, 59 unique forms were collected from fifty states, two countries and three municipalities. New Mexico was unique in having adopted the form directly from another state, Texas.
Officials from four states provided non-standard forms. We considered these forms, as well as the county forms noted above, to be representative of data standards likely to exist in the state. However, fields unique to these forms were not added to the analysis.
Summary of Form Content
Forty-four fields derived from an initial sample of 15 forms were summarized for all available forms. Overall, 2.7% of the data elements were flagged during review as artifacts of the form encoding process. An example is the encoded field “treatment completion date”, which conflated two fields ultimately used in subsequent analyses, “treatment completion” and “treatment stop date”.
The collection of forms had a mean of 17.0 fields and a median of 16 fields. On average, a form consists primarily of fields from the Identification (32.4% of fields) and Testing (24.4% of fields) categories.
However, the frequencies of individual fields drawn from a category differed: on average, Testing fields are present in 66.4% of forms and Identity fields in 40.6% of forms. Each of the following fields was present in at least 75% of forms, representing the most standard data elements: Name, DOB (date of birth), Address, Treatment given, TST-date, X-ray date, TST-size, and X-ray result, where TST refers to the tuberculin skin test for TB. Medical evaluation fields occupy the smallest portion of form fields (4.7%) and are the least frequent (15.4%).
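The prevalence figures above can be illustrated with a minimal sketch. The form names and field sets below are hypothetical toy data, not the survey's forms, and the function names `field_prevalence` and `standard_fields` are assumptions introduced for illustration; only the 75% threshold comes from the text.

```python
# Hypothetical toy data: each form is represented by the set of fields it contains.
forms = {
    "form_a": {"Name", "DOB", "Address", "TST_date"},
    "form_b": {"Name", "DOB", "TST_date", "Phone"},
    "form_c": {"Name", "Address", "TST_date"},
    "form_d": {"Name", "DOB", "Address", "Symptoms"},
}


def field_prevalence(forms):
    """Return {field: fraction of forms that contain the field}."""
    n_forms = len(forms)
    all_fields = set().union(*forms.values())
    return {
        field: sum(field in fields for fields in forms.values()) / n_forms
        for field in all_fields
    }


def standard_fields(forms, threshold=0.75):
    """Return the fields present in at least `threshold` of forms, sorted by name."""
    prevalence = field_prevalence(forms)
    return sorted(f for f, p in prevalence.items() if p >= threshold)


prevalence = field_prevalence(forms)
common = standard_fields(forms)
print(common)
```

In this toy set, a field counts as standard when it appears in at least three of the four forms.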
Existing data standards may be appraised not just by the prevalence of particular data elements, but also through groupings of states using similar forms and groupings of fields occurring in similar contexts. Seeking such patterns, we performed cluster analysis on the dataset, treating each field as a dichotomous variable with a value of one or zero when the field was present or absent, respectively.
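As a minimal sketch of the dichotomous encoding described above, each form can be turned into a 0/1 vector over a fixed field list. The field list and the two example forms are hypothetical. The Manhattan distance is the metric named in the Figure 2 caption; for 0/1 vectors it simply counts the fields on which two forms disagree.

```python
# Hypothetical subset of fields, in a fixed order shared by all forms.
FIELDS = ["Name", "DOB", "Address", "Phone", "TST_date"]


def encode(form_fields, fields=FIELDS):
    """Encode a form as a 0/1 vector: 1 if the field is present, 0 if absent."""
    return [1 if f in form_fields else 0 for f in fields]


def manhattan(u, v):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


form_x = encode({"Name", "DOB", "Address"})
form_y = encode({"Name", "DOB", "Phone", "TST_date"})
distance = manhattan(form_x, form_y)
print(form_x, form_y, distance)
```

Here the two forms differ on Address, Phone, and TST_date, so their distance is 3.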
Two large, distinct groupings of state forms were revealed in an initial clustering of forms by state (Figure 2). Group 1 included [AL,CA,CO,IA,IN,MO,MT,ND,NH,NV,SD,UT,WA,WV,WY]. Group 2 included [Alberta, AR, AZ, LA, NC-2, NE, NJ, OK, PA]. Several smaller groups (especially pairs) also consistently clustered together, including [HI,MS], [IL,Caribbean], [NV-2,TN], [CT,RI], [ME,VA], [OR, VA-2] and others.
Figure 2. This is a hierarchical clustering of the form elements using complete-linkage (furthest-neighbor) and the Manhattan distance metric. This dendrogram identifies states with similar forms that could more easily share data or standardize their data models.
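The complete-linkage procedure named in the caption can be sketched in a few lines. This is a naive, dependency-free illustration under stated assumptions, not the software used in the study (the text does not name it): the labels and vectors are hypothetical, the function name `complete_linkage` is introduced here, and ties are broken by taking the first closest pair found.

```python
def manhattan(u, v):
    """Manhattan distance; for 0/1 vectors, the number of differing fields."""
    return sum(abs(a - b) for a, b in zip(u, v))


def complete_linkage(vectors):
    """Agglomerative clustering with complete (furthest-neighbor) linkage.

    `vectors` maps a label to a 0/1 list. Returns the merge history as a list
    of (cluster_a, cluster_b, distance) tuples; clusters are tuples of labels.
    """
    clusters = [(label,) for label in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(
                    manhattan(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = tuple(sorted(clusters[i] + clusters[j]))
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical presence/absence vectors for four forms.
toy = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [1, 1, 1, 1, 0, 0],
    "C": [0, 0, 0, 1, 1, 1],
    "D": [0, 0, 1, 1, 1, 1],
}
history = complete_linkage(toy)
for step in history:
    print(step)
```

The merge history corresponds to reading a dendrogram from the leaves upward: similar forms join at small distances, and the two resulting groups join last at the largest distance.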
Some tight regional associations exist within this first group, especially the contiguous western states [CA,CO,MT,ND,NV,SD,UT,WY]. This group consists of fairly sparse forms, with an average of 12.8 fields compared to 19.4 in the ungrouped states. One characteristic common among these states is the absence of two of the more common fields, Phone and Address. These are potentially very important fields that help to correctly identify individuals, match them to other data, or perform contact follow-up; these activities can be critical to verification of TB control activities [16]. An evaluation of contact investigations in the state of California, which conducts one fifth of the nation’s TB investigations, found a large proportion of contacts lost during the investigation process [16]. Both of these missing fields could potentially be used to locate contacts or to identify high-risk regions in location-based investigations [17].
Like the Group 1 forms, the Group 2 state forms also have fewer fields on average, with a mean of 14.6 fields. Unlike Group 1, however, these forms use the most common fields more consistently, with a virtual absence of other fields. Although this group is not confined to a single geographic region, adjacencies between [NJ,PA] and [AR,LA,OK] point to regional similarities in contact investigation practices.
A heatmap based on simultaneous hierarchical clustering [18] of form fields and states (Figure 3) yields further information about the types of data that are often collected together, chiefly by differentiating common from uncommon fields. Clustering on fields in the heatmap reveals a striking distinction, visible in the red highlighting of fields present, between the frequency of core and non-core fields. These 14 core fields are listed in the table below. Of these, only 4 fields (Phone, Address, Name, and DOB) are among the 17 fields recommended by AHIMA for Master Patient Index (MPI) identifiers [19]; however, other fields such as Xray_Date could pertain to the AHIMA MPI fields for “Encounter/service type” and “Patient Disposition”. Although these core fields were common in most forms, they were coded inconsistently and many relied upon free text.
Figure 3. A simultaneous hierarchical clustering of states (rows) and fields (columns). Presence/absence of a field is shown with a red (1) or white (0) cell. The core group of fields segregates clearly on the right. Group 1 and Group 2 states which often cluster ...
Core data fields present in forms and databases from the national survey, based on cluster analysis of data fields.
Other groupings of fields are significant, although not as standardized as the common fields. These “motifs” of fields not only cluster together, but are also pertinent to the same category of data. One such motif is a group of data elements that record patient Medical History: [Prior_TST, Prior_Treatment_for_LTBI, Prior_Treatment_Date, Prior_Treatment_Complete, Symptoms], highlighted in blue in Figure 3. These fields cluster together, as do most of the states recording these variables: [MD,SC,MN,OR,VA-2] and KY. This appears to reflect a trend in contact investigations toward identifying possible cases that might be susceptible to MDR-TB and/or non-compliance. Few other states record these variables.
Another motif of data elements relates to Treatment: monitoring drug regimens given to patients. It consists of the variables [Rx_type, Completion, Outcome_Or_Reason_Not_Completed, ], highlighted in yellow in Figure 3. These fields also cluster together in the dendrogram, and are predominantly used by two groupings of states. One is a subset of the Group 1 states [CA,NV,IN,WA,MT,NH], and the other is a set of inconsistently grouped states [NM,TX,GA,DE,KY]. The CDC checklist mentioned previously [6] is also clustered with this second set. As with the previous motif, few other states record these variables. These data serve a purpose similar to that of the previous motif, but prospectively: they seek to ensure completion of drug regimens, thereby preventing relapse.
Other smaller motifs relate to Exposure [Last_Exposure_Date, Place], and contact Identity [Race/Ethnicity, Gender]. Both of these occur primarily in the non-Group-1/2 states. The grouping of variables having similar semantic types serves as an independent validation of the clustering approach.
Although these motifs were identified by visual inspection of Figure 3, Principal Component Analysis (PCA) [20] of the dataset confirms their importance to variation in the form collection. In the first component, accounting for 15% of the overall variance, four of the top five contributing variable loadings (data fields) were from the exposure and identity motifs mentioned. The second component accounted for 10% of the variance; four of its top five variables comprised the treatment monitoring motif. The third component accounted for 8% of the variance; four of its top five variables comprised most of the patient history motif. These results confirm that the observed field motifs are relatively independent among the sampled forms.
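The notions of "variance accounted for" and "top contributing loadings" can be illustrated with a minimal sketch of the first principal component. The text does not say how the PCA was computed; power iteration on the sample covariance matrix is used here only because it needs no external libraries. The toy matrix, the choice of four field names, and the function name `first_principal_component` are assumptions for illustration and do not reproduce the survey data or its percentages.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, explained_variance_ratio) for the first component.

    `rows` is a forms-by-fields list of 0/1 lists with at least two rows and
    at least one non-constant field. Uses power iteration on the covariance.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    total_variance = sum(cov[j][j] for j in range(p))

    # Non-uniform start vector (an arbitrary choice) to reduce the chance of
    # starting orthogonal to the leading eigenvector.
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    # Rayleigh quotient gives the eigenvalue (variance along the component).
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)
    )
    return v, eigenvalue / total_variance


# Hypothetical presence/absence matrix: 6 forms by 4 fields.
fields = ["Last_Exposure_Date", "Place", "Rx_type", "Completion"]
rows = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
loadings, ratio = first_principal_component(rows)
ranked = sorted(zip(fields, loadings), key=lambda fl: abs(fl[1]), reverse=True)
print(ranked, ratio)
```

Ranking fields by the absolute value of their loadings is what identifies the top contributing variables of a component; in this toy matrix the two exposure fields, which vary most and always co-occur, dominate the first component.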