AMIA Annu Symp Proc. 2011; 2011: 19–27.
Published online 2011 October 22.
PMCID: PMC3243268

Methods to Identify Standard Data Elements in Clinical and Public Health Forms


The fragmentation of clinical and public health systems results in divergent information collection practices, presenting challenges to standardization and EHR certification efforts. Data forms employed in public health jurisdictions nationwide reflect these differences in patient treatment, monitoring and evaluation, and follow-up, presenting challenges for data integration. To study these variations, we surveyed tuberculosis contact investigation forms from all fifty states, three municipalities and two countries. We apply statistics and cluster analysis to analyze the divergent content of contact investigation forms with the goal of characterizing normative practices and identifying a common core of data fields. We found widespread variation in data elements between states in the study, with the “Name” field being the only ubiquitous data element. Our method reveals distinct groupings of data fields employed in certain regions, allowing the simultaneous identification of core standard data fields as well as variations in practice.


The public health infrastructure in the United States encompasses national, state, city and county jurisdictions. Although some common reporting systems exist for specific diseases 1 2, local jurisdictions often follow individualized procedures. Such differences are reflected in the information collected in contact investigations, and the inconsistencies that arise in the resulting data complicate program evaluation and data aggregation across jurisdictions. Furthermore, to the extent that certain types of data are not routinely collected, epidemiologists cannot rely on these data to evaluate and improve current disease control practices.

Contact investigations are an efficient way to locate individuals exposed to infectious cases of tuberculosis (TB). However, the conduct of these investigations varies widely according to population size and density, local demographics and risk factors, health department funding levels, and state reporting requirements.

A process is needed to aid decision-makers in data integration, standardization, and program evaluation that nevertheless recognizes the unique public health practices arising in unique jurisdictions or with particular disease control efforts3 4. In order to quantify the uniformity among data forms from diverse jurisdictions, we demonstrate here the application of unsupervised clustering to identify commonly occurring form elements and regions having similar data architectures.

A common complaint among epidemiologists and researchers is the poor quality and standardization of data in public health settings 5. In this environment, the CDC has taken actions such as the promotion of contact investigation standards in 2005 6, 7, and the implementation of statewide surveillance systems based on the National Electronic Disease Surveillance System (NEDSS) architecture2. Despite these efforts, TB contact investigation practices at the state, county, district, city, and clinic levels remain divergent 8.

Outside of the tuberculosis domain, the Council of State and Territorial Epidemiologists has undertaken an effort to obtain standard fields used in statewide disease reporting as input to the Public Health Case Reporting use case for electronic health records9. Efforts such as these require new methods to identify “core” data elements that form de facto standards in public health data forms.

The tasks of merging disparate data sources and record linkage are broad and long-standing problems in the field of computer science10 11 12. A database schema is the overall structure of a relational database, and as with paper forms, the same data may have varied representations. Schema matching is one name for the task of reconciling two or more designs so that similar data may be pooled13. Because database schema definitions are based on explicit grammars, their similarity to text-based data models such as XML documents and HTML forms has recently been recognized14, providing a conceptual link to structured paper forms. However, these tools usually target pairwise matching of data fields and require substantial human intervention.

A statistical method to analyze web query interfaces was introduced by He et al15. This approach uses observed schemas from the web to create a generative probability model of the domain assuming independence between data elements, thereby deriving a common template describing the observed documents. This paper similarly analyzes TB forms by comparing patterns of attributes that are present or absent. Rather than fitting a generative model, however, we use summary statistics and cluster analysis to characterize the variability in the forms collection and to identify meaningful groupings of states or attributes. Further, by categorizing variables into medically relevant categories, we shed light on the content and usage of these forms. In this paper, we use the terms field, variable, and data element interchangeably to describe units of information from contact investigation forms such as Name or Date of Birth.


An initial convenience sample of 15 forms available on the internet was analyzed to confirm the lack of standardization of contact investigation forms. Based on this initial sample, it was determined that a full nationwide analysis should be conducted. Contact investigation summary forms were requested from the designated Tuberculosis Control officer of each state in the United States. If forms could not be located online, they were first requested via e-mail using contact information from the National Tuberculosis Controllers Association (NTCA) website. Non-responding states were then contacted by phone using the NTCA contact or, failing that, a different (and often more current) contact listed on the state health department website. Participating states returned forms via e-mail attachment, fax, or mail. A CDC TB contact data checklist6 and two forms from outside the US were included for comparison.

Statewide forms were not uniformly available. In the event of non-response from health department staff, contact investigation forms were obtained from county or district offices. A standardized format for contact data is not used in several states. In these cases, forms from local jurisdictions, contact interview forms, or spreadsheets were used as a proxy for the state. Finally, some states have transitioned to electronic data entry – in these cases, data fields from the information system were used when available.

Data tabulation and descriptive statistics were conducted in Microsoft Excel. Cluster analysis and other statistical analyses were conducted in the R and Stata statistical packages. Fields were matched between forms by field name, with unknown or unclear matches being resolved by form guidelines, matching controlled values, or consultation with TB controllers. These matched field archetypes were then categorized into classes depending on the type of information they represented. These classes are listed in Figure 1.

Figure 1.
Summary of national contact investigation form composition. (A) This chart depicts the percentage of fields from each category on an average form (excluding Comments). Identification fields comprise the largest portion. (B) This chart depicts the average ...

Each form was analyzed to determine which fields were present, and with what cardinality and controlled value set. Only the dichotomous presence/absence of data fields was used in further methods.
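The dichotomous coding described above can be illustrated with a minimal sketch. The form identifiers, field names, and the `presence_matrix` helper are hypothetical and chosen only for illustration; they are not the coding actually used in the survey.

```python
# Hypothetical toy data: form identifiers and field names are illustrative,
# not the survey's actual coding.
forms = {
    "form_A": {"Name", "DOB", "Address", "TST_date"},
    "form_B": {"Name", "DOB", "Phone"},
    "form_C": {"Name", "TST_date", "TST_size", "Xray_date"},
}


def presence_matrix(forms):
    """Return (form_ids, field_names, rows), where rows[i][j] is 1 if
    form i contains field j and 0 otherwise."""
    form_ids = sorted(forms)
    field_names = sorted(set().union(*forms.values()))
    rows = [
        [1 if field in forms[form_id] else 0 for field in field_names]
        for form_id in form_ids
    ]
    return form_ids, field_names, rows


form_ids, field_names, rows = presence_matrix(forms)
print(field_names)
for form_id, row in zip(form_ids, rows):
    print(form_id, row)
```

Each row is then a 0/1 vector over the matched field archetypes, which is the only input the later clustering steps require.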

Relationships among forms were investigated with cluster analysis. Hierarchical clusters were obtained both with and without frequency weights on variables in order to study the impact of more common variables. Cluster analysis was performed under the Manhattan distance measure with complete linkage. The Manhattan distance has the intuitive interpretation of representing the number of field differences between two forms when field presence/absence is dichotomized.
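As a rough illustration of this step, the sketch below computes Manhattan distances between 0/1 field vectors and merges clusters under complete linkage. It is a plain-Python stand-in, not the R or Stata code used in the study; the function names and the four toy vectors are assumptions, the frequency-weighted variant is not shown, and ties between equally close pairs are broken arbitrarily by iteration order.

```python
from itertools import combinations


def manhattan(a, b):
    """Number of differing fields when a and b are 0/1 presence vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_linkage(vectors):
    """Agglomerative clustering under complete (furthest-neighbor) linkage.

    Returns a list of merges as (members_a, members_b, distance), where the
    distance between two clusters is the largest pairwise Manhattan distance
    between their members.
    """
    clusters = {i: [i] for i in range(len(vectors))}  # cluster id -> members
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            d = max(
                manhattan(vectors[i], vectors[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Hypothetical presence/absence vectors for four forms over five fields.
vectors = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
]
merges = complete_linkage(vectors)
for members_a, members_b, distance in merges:
    print(members_a, members_b, distance)
```

In this toy example the two pairs of forms that differ by a single field merge first, and the final merge distance of 4 is the largest number of field differences between any two forms in the joined groups.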

To reveal further patterns in field usage in forms across regions, a cluster heatmap using row/column clustering was employed to identify simultaneous groups of data fields and states using them. These groupings were then investigated to identify frequently co-occurring semantic motifs pointing to differences in form content and usage across regions.
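A two-way clustering of this kind can be sketched by ordering the rows (states) and the columns (fields) of the presence/absence matrix separately, each by its own hierarchical clustering, and then displaying the reordered matrix. The example below is a simplified text rendering under stated assumptions: the state and field labels are hypothetical, `leaf_order` is an illustrative helper that uses Manhattan distance with complete linkage, and a real cluster heatmap would also draw the dendrograms.

```python
def manhattan(a, b):
    """Number of positions at which two 0/1 vectors differ."""
    return sum(abs(x - y) for x, y in zip(a, b))


def leaf_order(vectors):
    """Order items by complete-linkage agglomeration so that items merged
    early end up adjacent. Ties are broken by iteration order."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(
                    manhattan(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical data: rows are states, columns are fields.
states = ["S1", "S2", "S3", "S4"]
fields = ["F1", "F2", "F3", "F4", "F5"]
matrix = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
]

row_order = leaf_order(matrix)
columns = [list(col) for col in zip(*matrix)]  # transpose to cluster fields
col_order = leaf_order(columns)

# Text "heatmap": '#' marks a field that is present, '.' one that is absent.
print("    " + " ".join(fields[j] for j in col_order))
for i in row_order:
    cells = "  ".join("#" if matrix[i][j] else "." for j in col_order)
    print(states[i] + "  " + cells)
```

After reordering, states with identical forms sit in adjacent rows and fields that are used together sit in adjacent columns, which is what makes blocks of co-occurring fields visible in a heatmap.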

To determine whether the local burden of tuberculosis was related to the contact data collected by TB control programs, we compared TB case rates for each state in our study to the number of fields present in the contact investigation forms of that state.


The forms from eight states, Alberta, and the Caribbean were located on the World Wide Web. Sixteen Tuberculosis Controllers designated by the National Tuberculosis Controllers Association responded to e-mail requests for copies of their contact investigation forms. The remaining forms were obtained through phone or postal mail requests. Two state health departments were unresponsive. Overall, 59 unique forms were collected from fifty states, two countries and three municipalities. New Mexico was unique in having adopted the form directly from another state, Texas.

Officials from four states provided non-standard forms. We considered these forms, as well as the county forms noted above, to be representative of data standards likely to exist in the state. However, fields unique to these forms were not added to the analysis.

Summary of Form Content

Forty-four fields derived from an initial sample of 15 forms were summarized for all available forms. Overall, 2.7% of the data elements were flagged during review as artifacts of the form encoding process. An example is the encoded field “treatment completion date”, which conflated two fields ultimately used in subsequent analyses, “treatment completion” and “treatment stop date”.

The collection of forms had a mean of 17.0 fields and median of 16 fields. As shown in Figure 1A, on average a form consists primarily of fields from the Identification (32.4% of fields) and Testing (24.4% of fields) categories.

However, the frequencies of individual fields drawn from a category differed, as seen in Figure 1B and C: on average, Testing fields were present in 66.4% of forms and Identity fields in 40.6% of forms. Each of the following fields was present in at least 75% of forms, representing the most standard data elements: Name, DOB (date of birth), Address, Treatment given, TST-date, X-ray date, TST-size, and X-ray result, where TST refers to the tuberculin skin test for TB. Medical evaluation fields occupy the smallest portion of form fields (4.7%) and are the least frequent (15.4%).
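The per-field prevalence and the 75% cutoff used above can be computed directly from the presence/absence data. The following is a minimal sketch; the four toy forms and the helper names `field_prevalence` and `common_fields` are hypothetical and do not reproduce the survey results.

```python
def field_prevalence(forms):
    """Map each field to the fraction of forms in which it appears."""
    counts = {}
    for fields in forms.values():
        for field in fields:
            counts[field] = counts.get(field, 0) + 1
    return {field: count / len(forms) for field, count in counts.items()}


def common_fields(forms, threshold=0.75):
    """Fields present in at least `threshold` of the forms."""
    prevalence = field_prevalence(forms)
    return sorted(f for f, p in prevalence.items() if p >= threshold)


# Hypothetical toy data, not the survey forms.
forms = {
    "form_1": {"Name", "DOB", "TST_date"},
    "form_2": {"Name", "DOB", "Phone"},
    "form_3": {"Name", "TST_date", "Phone", "Symptoms"},
    "form_4": {"Name", "DOB", "TST_date"},
}

prevalence = field_prevalence(forms)
standard = common_fields(forms, threshold=0.75)
ubiquitous = common_fields(forms, threshold=1.0)
print(standard)
print(ubiquitous)
```

Raising the threshold to 1.0 isolates fields present on every form, which is how a single ubiquitous field such as Name would be identified.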

Hierarchical Clustering

Existing data standards may be appraised not just by the prevalence of particular data elements, but also through groupings of states using similar forms and groupings of fields occurring in similar contexts. Seeking such patterns, we performed cluster analysis on the dataset, treating each field as a dichotomous variable with a value of one or zero when the field was present or absent, respectively.

Two large, distinct groupings of state forms were revealed in an initial clustering of forms by state (Figure 2). Group 1 included [AL, CA, CO, IA, IN, MO, MT, ND, NH, NV, SD, UT, WA, WV, WY]. Group 2 included [Alberta, AR, AZ, LA, NC-2, NE, NJ, OK, PA]. There were several additional smaller groups (especially pairs) that consistently grouped together, including [HI, MS], [IL, Caribbean], [NV-2, TN], [CT, RI], [ME, VA], [OR, VA-2] and others.

Figure 2.
This is a hierarchical clustering of the form elements using complete-linkage (furthest-neighbor) and the Manhattan distance metric. This dendrogram identifies states with similar forms that could more easily share data or standardize their data models. ...

Some tight regional associations exist within this first group, especially the contiguous western states [CA,CO,MT,ND,NV,SD,UT,WY]. This group consists of fairly sparse forms, with an average of 12.8 fields compared to 19.4 in the ungrouped states. One characteristic common among these states is the absence of two of the more common fields, Phone and Address. These are potentially very important fields that help to correctly identify individuals, match them to other data, or perform contact follow-up; these activities can be critical to verification of TB control activities16. An evaluation of contact investigations in the state of California, which conducts one fifth of the nation’s TB investigations, found a large proportion of contacts lost during the investigation process16. Both of these missing fields could potentially be used to locate contacts or to identify high-risk regions in location-based investigations 17.

Like Group 1, the Group 2 state forms also have fewer fields on average, with a mean of 14.6 fields. Unlike Group 1, however, these forms use the most common fields more consistently, with a virtual absence of other fields. Although this group is not confined to a single geographic region, adjacencies between [NJ,PA] and [AR,LA,OK] point to regional similarities in contact investigation practices.

A heatmap based on simultaneous hierarchical clustering18 of form fields and states in Figure 3 yields further information about the types of data that are often collected together, chiefly in differentiating common from uncommon fields. Clustering on fields in the heatmap reveals a striking distinction, visible in red highlighting of fields present, between the frequency of core vs. non-core fields. These 14 core fields are shown in Table 1 below. Of these, only four fields (Phone, Address, Name, and DOB) are among the 17 fields recommended by AHIMA for Master Patient Index identifiers19; however, other fields such as Xray_Date and Xray_Result could pertain to the AHIMA MPI fields for “Encounter/service type” and “Patient Disposition”. Although these core fields were common in most forms, they were coded inconsistently and many relied upon free text.

Figure 3.
A simultaneous hierarchical clustering of states (row) and fields (columns). Presence/absence of a field is shown with a red (1) or white (0) cell. The core group of fields segregates clearly on the right. Group 1 and Group 2 states which often cluster ...
Table 1.
Core data fields present in forms and databases from the national survey, based on cluster analysis of data fields.

Other groupings of fields are significant, although not as standardized as the common fields. These “motifs” of fields not only cluster together, but are also pertinent to the same category of data. One such motif is a group of data elements that record patient Medical History: [Prior_TST, Prior_Treatment_for_LTBI, Prior_Treatment_Date, Prior_Treatment_Complete, and Symptoms], highlighted in blue in Figure 3. These fields are clustered together, as are most states recording these variables: [MD,SC,MN,OR,VA-2] and KY. This is clearly a trend in contact investigations intended to identify possible cases that might be susceptible to MDR-TB and/or non-compliance. Few other states record these variables.

Another motif of data elements relates to Treatment: monitoring drug regimens given to patients. It consists of the variables [Rx_type, Completion, Outcome_Or_Reason_Not_Completed, and Rx_Stop_Date], highlighted in yellow in Figure 3. These fields also cluster together in the dendrogram, and are predominantly used by two groupings of states. One is a subset of the Group 1 states [CA,NV,IN,WA,MT,NH], and the other is an inconsistently grouped set of states [NM,TX,GA,DE,KY]. The CDC checklist mentioned previously6 is also clustered with this second set. As with the previous motif, few other states record these variables. These data serve a similar purpose to the previous motif, only in a prospective fashion: they seek to ensure completion of drug regimens, thereby preventing relapse.

Other smaller motifs relate to Exposure [Last_Exposure_Date, Place], and contact Identity [Race/Ethnicity, Gender]. Both of these occur primarily in the non-Group-1/2 states. The grouping of variables having similar semantic types serves as an independent validation of the clustering approach.

Although these motifs were identified by visual inspection of Figure 3, Principal Component Analysis (PCA) 20 of the dataset confirms their importance to variation in the form collection. In the first component, accounting for 15% of the overall variance, four of the top five contributing variable loadings (data fields) were from the exposure and identity motifs mentioned. The second component accounted for 10% of the variance; four of its top five variables comprised the treatment monitoring motif. The third component accounted for 8% of the variance; four of its top five variables comprised most of the patient history motif. These results confirm the observed field motifs as being relatively independent among the sampled forms.
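To make the PCA step concrete, the sketch below extracts the first principal component of a small presence/absence matrix and reports its share of the total variance along with the field loadings. It is an illustration only: it uses power iteration on the covariance matrix rather than the statistical package used in the study, the toy data and function names are assumptions, and only the leading component is computed.

```python
def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=500):
    """Leading eigenvalue and unit eigenvector of `cov` by power iteration.
    Assumes cov is not all zeros and the start vector is not orthogonal
    to the leading eigenvector."""
    p = len(cov)
    v = [1.0 + 0.1 * j for j in range(p)]  # arbitrary non-zero start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * cv[i] for i in range(p))
    return eigenvalue, v


# Hypothetical data: fields 0 and 1 always co-occur (a "motif"),
# field 2 varies on its own, and field 3 is present on every form.
field_names = ["Rx_type", "Rx_Stop_Date", "Phone", "Name"]
rows = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]

cov = covariance(rows)
eigenvalue, loadings = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = eigenvalue / total_variance

ranked = sorted(zip(field_names, loadings), key=lambda t: -abs(t[1]))
print(round(share, 3))
for name, loading in ranked:
    print(name, round(loading, 3))
```

In this toy case the two co-occurring fields carry the largest loadings on the first component, while the field present on every form contributes nothing because it has no variance, mirroring how a motif of fields can dominate a component.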

Correlates of Form Complexity

We assessed the correlation of form complexity, measured as the total number of fields present, with TB case rates and with the recency of form development when this information was available. The results in Figure 4 show a weak but significant association between form complexity and TB case rates. A stronger association is seen between form age and the number of fields, with newer forms tending to collect more complex data; however, the unavailability of older forms results in high leverage in the regression for the two pre-1990 forms. These results may reflect the complex relationship of TB case rates with public health investment in TB control in urban versus rural environments. They may also indicate a tendency toward increasing program monitoring data in states having recently updated data collection practices. For example, a 2007 form not represented in this analysis contains an additional field for Quantiferon results21, a test not in widespread practice before 2005.
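The slope and R² reported for such an association come from a simple linear regression of field count on the explanatory variable. The sketch below shows the calculation in plain Python; the data points are invented for illustration and are not the case rates or field counts from the study, and confidence intervals are not computed.

```python
def simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.
    Returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical values: case rate per 100,000 and number of fields per form.
case_rate = [1.0, 2.0, 3.0, 4.0]
n_fields = [12, 14, 15, 19]

slope, intercept, r_squared = simple_regression(case_rate, n_fields)
print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```

A low R² alongside a slope whose confidence interval excludes zero is what "weak but significant" means here: the trend is detectable, but it explains only a small part of the variation in form complexity.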

Figure 4.
(A) A scatterplot of contact investigation form complexity vs. TB case rate (states only) shows a very weak correlation (slope 0.87; 95% CI [0.24–1.5]; R² = 0.11). However, in (B) we see a greater apparent correlation with date of form revision ...


Recent strides toward electronic reporting systems make this an opportune time to broadly evaluate the diversity of data representations on the ground. Rarely are software developers tasked with unifying so many diverse data sources, and surveys of existing public health practice are sorely needed. The methods presented in this paper support standardization efforts by identifying high-level trends, groups of common practice, and the diversity of semantic type motifs.

In this case study, only eight of forty-four fields were individually present in at least 75% of forms (Name, DOB, Address, Treatment given, TST-date, X-ray date, TST-size, X-ray result). However, fewer than 50% of the states have forms that contain (collectively) all of these eight basic elements. Only five states, MD, DE, HI, MS, and TN, had forms that encompass all of the thirteen “core” fields in Figure 3. Surprisingly, Name was the only ubiquitous field in the forms sampled. That this field is followed by TST-date and TST-size in frequency betrays the decades-long use of the tuberculin skin test in TB control among exposed individuals.
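The distinction between fields that are individually common and forms that contain all of them together can be checked with a small set-containment test. This is a sketch on hypothetical data; the `share_with_all` helper and the toy forms are illustrative only.

```python
def share_with_all(forms, required):
    """Fraction of forms containing every field in `required`,
    plus the identifiers of those forms."""
    required = set(required)
    hits = sorted(
        form_id for form_id, fields in forms.items()
        if required <= set(fields)
    )
    return len(hits) / len(forms), hits


# Hypothetical toy data: each required field is present on at least
# three of the four forms, yet only half the forms contain all three.
forms = {
    "form_1": {"Name", "DOB", "TST_date"},
    "form_2": {"Name", "DOB", "Phone"},
    "form_3": {"Name", "TST_date", "Phone", "Symptoms"},
    "form_4": {"Name", "DOB", "TST_date"},
}

share, complete_forms = share_with_all(forms, ["Name", "DOB", "TST_date"])
print(share, complete_forms)
```

Because different forms omit different fields, the share of forms covering the whole set can fall well below the prevalence of any single field in it.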

Our analyses investigated possible explanatory variables relating to form complexity in this setting. Both TB case rate and date of form revision show significant but weak associations with increasing form complexity. The case rate trend may reflect the increasing number of risk factors and case management data required for outbreak management in regions with a historically higher burden of tuberculosis. The trend toward increasing complexity over time may be more nuanced, as it is likely influenced by a greater diversity of tuberculosis metrics and risk factors over time 16. Understanding the source of data complexity in clinical environments may be a first step to preventing proliferation of competing representations and to identifying and promoting standards.

Another approach to understanding the diversity of data collected in this setting is to identify sources of commonality among the state forms. The use of cluster analysis helps to identify sets, or motifs, of data fields that group together in clinical use and practice. Motifs such as patient history, treatment monitoring, and patient identity and relationships have distinct clinical utility to public health, and might be components of standards uniquely useful to programs with differing public health challenges; indeed, the CDC may include certain performance metrics for individual jurisdictions depending on their unique epidemiology.

The approach of simultaneously clustering states using data motifs could be useful in promoting regional standardization approaches targeted to groups having the particular clinical needs represented by specific data motifs. As a side effect, this approach will also identify patterns of public health practice sharing that transcend geographic proximity. For example, the state of Kentucky and the city of San Francisco shared virtually identical forms in content, layout and coding. Where local geographic patterns do exist, neighboring states and the CDC might use this analysis to harmonize their practices and data collection regionally. More generally, provider groups in clinical settings seeking to identify areas and strategies for standardization might use a similar approach to sift through a multitude of data sources to highlight areas of semantic commonality and communities of practice.

Related methods of analyzing clinical data complexity have recently been employed by the Council of State and Territorial Epidemiologists in the Case Reporting Standardization Workgroup, as part of an effort to advise criteria for meaningful use in EHR standardization. We believe this approach could be applied more generally to many sources of clinical data, and could be adapted for use with electronic data sources as well as paper-based forms. As more clinical and public health environments embrace electronic records, efforts to unify redundant data representations will become increasingly important.


The main focus of our analysis was the level of standardization of common fields between states. We did not record information about how the forms were used in different states, by whom they were filled out, or which protocols were followed. Furthermore, we did not examine the contents of fields, nor the consistency with which they are completed. Index case fields, overall form layout, and field codes also were not analyzed. These additional properties could potentially influence the standardization of data or data collection practices. For example, some state forms still retained a graphical three-tier home/work/leisure place of association graph in the form layout, reflecting an outdated tuberculosis control paradigm. While our analysis did not address these structural features, they could also conceivably be encoded and analyzed using cluster analysis without change to our methodology.

In addition to identifying a snapshot of data standards in this area of public health practice, we used metadata from forms to investigate features such as revision history and case rates that were anticipated to influence form complexity. However, we did not have sufficient examples of multiple form revisions from individual states to validate the hypothesized growth in complexity over time. We are currently collecting more recent forms and database schemas to enable a longitudinal study of information use changes.

Despite these limitations, this comprehensive overview of contact investigation data collection will inform standardization efforts, data pooling, and epidemiologic studies targeting particular variables.


This survey of contact investigation data standards provides insight into the similarities and differences between state practices, laying out in detail which kinds of data are most common in the field. While it is suboptimal to pursue standardization after a diversity of data representations has already been adopted, this is nevertheless a common scenario in health care. The methods presented may help future standards efforts identify not only the critical data required by all systems, but also other key functional groups of data from subsets of systems.

Furthermore, we have suggested groupings of states and data fields that should be useful to TB control officers at local, state and federal levels in determining best practices and charting a path to true standardization. The health system represents a wealth of knowledge and experience that is reflected in the data collected. By examining artifacts such as data forms, the informatics community can aid the incorporation of this knowledge into ongoing data standardization and integration efforts.


This work was supported by the NLM Biomedical Informatics Training Grant 1T32GM063495-01. We are greatly indebted to the many health officials at all levels of the public health system for generously contributing forms, guidelines, time and wisdom to this project.


1. Elimination DoT, editor. Centers for Disease Control and Prevention; 2003. Tuberculosis information management system user’s Guide, Version 1.2.
2. Status of State Electronic Disease Surveillance Systems --- United States, 2007. Morbidity and Mortality Weekly Report. 2009 Jul 31;58(29):804–807.
3. Response to the Request for Information on the Development and Adoption of a National Health Information Network from the ONCHIT, DHHS. Baltimore, MD: Public Health Data Standards Consortium; Nov 15, 2004. 2005.
4. Pina J, Turner A, Kwan-Gett T, Duchin J. Task analysis in action: the role of information systems in communicable disease reporting. 2009.
5. O’Carroll PW. Public health informatics and information systems. Springer Verlag; 2003.
6. CDC Guidelines for the investigation of contacts of persons with infectious tuberculosis: recommendations from the National Tuberculosis Controllers Association and CDC. Morbidity and Mortality Weekly Report. 2005;54(RR-17):1–47.
7. Jereb J, Leary L, Taylor Z. Aggregate Reports for Tuberculosis Program Evaluation: Training Manual and User’s Guide. Atlanta, GA: Centers for Disease Control and Prevention; 2005.
8. Wilce M, Shrestha-Kuwahara R, Taylor Z, Qualls N, Marks S. Tuberculosis contact investigation policies, practices, and challenges in 11 US communities. Journal of Public Health Management and Practice. 2002;8(6):69.
9. 2009–11. Public Health Case Reporting Community Website. Accessed 3-1-2011.
10. Newcombe HB, Kennedy JM, Axford S, James A. Automatic linkage of vital records. Science. 1959;130(3381):954–959.
11. Fellegi IP, Sunter AB. A theory for record linkage. Journal of the American Statistical Association. 1969;64(328):1183–1210.
12. Subramanyan GS, Yokoe DS, Sharnprapai S, Tang Y, Platt R. An algorithm to match registries with minimal disclosure of individual identities. Public Health Reports. 1999;114(1):91.
13. Kim W, Seo J. Classifying schematic and data heterogeneity in multidatabase systems. Computer. 1991;24(12):12–18.
14. Doan AH, Domingos P, Halevy AY. Reconciling schemas of disparate data sources: A machine-learning approach. ACM SIGMOD Record. 2001;30(2):509–520.
15. He B, Chang KCC. Statistical schema matching across web query interfaces. 2003.
16. Sprinson J, Flood J, Fan C, et al. Evaluation of tuberculosis contact investigations in California. The International Journal of Tuberculosis and Lung Disease. 2003;7(Supplement 3):S363–S368.
17. Klovdahl AS, Graviss EA, Yaganehdoost A, et al. Networks and tuberculosis: an undetected community outbreak involving public places. Soc Sci Med. 2001 Mar;52(5):681–694.
18. Gower JC, Digby P. Expressing complex relationships in two dimensions. Interpreting multivariate data. 1981:83–118.
19. Fabian D, Haenke J, Webb L. Reconciling and managing EMPIs. Journal of AHIMA/American Health Information Management Association. 2010;81(4):52.
20. Jolliffe I. Principal component analysis. 2002.
21. Health WSDo, editor. Olympia, WA: Washington State Department of Health; 2007. Tuberculosis Contact Investigation Form.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association