Study setting
The Missouri Lower Respiratory Infection (LRI) Project was a large prospective cohort study of outcomes (mortality and functional change) of nursing home residents who developed an LRI [10, 11]. The protocol was approved by institutional review boards at two medical centers, several independent hospitals, and two nursing home ethical review panels. Conducted in central Missouri and the St. Louis area, the study enrolled subjects from August 1995 through September 1998, and data collection continued for an additional three months.
Our institutional review board helped develop an appropriate strategy for enrolling participants. We contacted attending physicians in all facilities that had agreed to participate in the study. Physicians either declined to participate, or agreed to have trained study nurses provide timely, comprehensive evaluations of their residents who developed an illness consistent with an LRI. Physicians could a priori exclude any resident from evaluation. The study nurse recorded initial data and quickly communicated findings to the resident's physician, usually by facsimile transmission. Treatment decisions were left to the attending physician. Because these detailed evaluations were authorized by attending physicians who received clinical information and made treatment decisions accordingly, evaluations were considered part of appropriate care. For this reason, institutional review boards allowed a simplified consent process consisting of a simple refusal or acceptance of the clinical evaluation by the resident or a family member.
Study enrollment was a two-step process [10]. Criteria for evaluating, excluding, and enrolling residents are shown in Table 1. First, after residents who met exclusion criteria were eliminated, eligible residents with signs and symptoms compatible with an LRI were evaluated. Second, based on the evaluation and chest radiograph results, residents who met the LRI definition (Table 1) were enrolled. We refer the reader to Mehr et al. [10] for further details regarding evaluation and enrollment. Residents could be enrolled multiple times, provided that they were well and off antibiotics for at least seven days following the previous episode. In the analysis, we used generalized estimating equations to adjust for individuals being represented in the data more than once.
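The re-enrollment rule above can be expressed as a simple date check. The following is a minimal Python sketch, not the project's own code; the function name and its three date arguments are hypothetical, and it assumes that a "well since" date and an antibiotic stop date are both recorded for the previous episode.

```python
from datetime import date, timedelta

# "At least seven days" comes from the text; everything else is illustrative.
MIN_CLEAR_PERIOD = timedelta(days=7)


def eligible_for_reenrollment(well_since: date,
                              antibiotics_stopped: date,
                              new_illness_onset: date) -> bool:
    """Return True if the resident was both well and off antibiotics
    for at least seven days before the new illness began."""
    # The clear period starts only once BOTH conditions hold.
    clear_since = max(well_since, antibiotics_stopped)
    return new_illness_onset - clear_since >= MIN_CLEAR_PERIOD


# Nine clear days before the new illness: eligible.
ok = eligible_for_reenrollment(date(1996, 2, 1), date(1996, 2, 3), date(1996, 2, 12))
# Only five clear days: not eligible.
too_soon = eligible_for_reenrollment(date(1996, 2, 1), date(1996, 2, 3), date(1996, 2, 8))
```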
| Table 1. Criteria for evaluating, excluding, and enrolling residents in the Missouri LRI Project. |
Evaluation information was subsequently abstracted from medical records without recording personal identifiers on the abstraction forms. Other data were obtained by medical record abstraction and follow-up visits with surviving residents. Data were also collected on costs of care and potential quality-of-care indicators for facilities. Using these data, we have conducted analyses that consider several outcomes, including mortality, functional status, indicators of radiographic diagnosis of pneumonia, and costs of care. Figure shows a flowchart of the project's organizational activities.
Data Collection
All study nurses were trained with a standard protocol. To verify examination procedures, portions of evaluations were performed by different nurses (with the resident's permission) and compared immediately following the second evaluation. Additionally, the principal investigator or the co-investigator overseeing the St. Louis site shadowed each study nurse to observe evaluation skills and provided immediate feedback.
Starting with the design of our data forms, we employed many standard data management procedures to minimize missing and erroneous data (Table 2). For example, we designed data forms with multiple-choice items and check boxes whenever possible to avoid problems with interpreting handwriting; data abstractors used a specific code to indicate that items were blank rather than inadvertently omitted; and the fields for continuous variables (e.g., temperature, white blood cell count) on our forms included an appropriate number of digits, decimal points (where appropriate), and clearly labeled measurement units.
| Table 2. Data management principles.* |
All forms were pre-tested by investigators and research assistants. As a result, we dropped some data elements that were judged too time-consuming to find in the medical records (for example, date of the latest pneumococcal vaccine, which could require searching several years of charts for some residents). Each form included the study title, the form title, space for the subject's identification number, and a footer with the version number and date. We were fortunate to have a full-time research assistant who had extensive experience with chart abstraction. She initially trained all of the other research assistants by visiting facilities and going through the abstraction forms item by item. Subsequently, research assistants from each site (central Missouri and St. Louis) developed a manual that captured all of this information. We used conference calls to facilitate this process. The manual included an overview of the study forms, information on requesting and examining medical records, a decision matrix on what information to record for each type of resident (e.g., enrolled vs. evaluated but not enrolled), detailed instructions for locating and abstracting each form's data elements, Current Procedural Terminology (CPT) codes to be recorded for the economic analysis, medication lists and codes, copies of each form, and common abbreviations and medical shorthand.
Initially, we did not appreciate the complexity of our data management needs. Within a few months, we defined clear rules on which personnel were responsible for each data management task and how each task was to be completed. Early in the study, 51 evaluations were selected for complete re-abstraction by another research assistant. Abstractors compared these forms to determine where differences occurred, further standardized their methods, and reconciled any errors that were made.
All computerized data were stored on a secure network that limited access to authorized individuals and required a password for entry. Paper forms were stored in locked cabinets when not in use. We used a relational database to track enrollment, follow-up evaluations, and receipt and location (e.g., at data entry) of all forms. To ensure confidentiality, each resident was assigned a study identification number that was included on all forms in lieu of personal identifiers. After data cleaning was completed, resident names and social security numbers were completely expunged from the tracking database, as required by our IRB. All files on our computer network were backed up regularly; approximately quarterly, files were copied and stored at an offsite location so they could be recreated in case of a major system failure.
We used twenty different forms for data collection. This necessitated substantial data quality control over an extended period of time, and precluded data entry by project staff. After visual inspection for legibility and completeness, forms were sent to an on-campus data entry facility in batches of manageable size as they became available. Data cleaning was a two-step process involving data entry followed by detailed examination of the data for potential errors. To reduce typographical errors, forms were double-entered and verified; after one data entry operator entered a form, a second operator entered the same form and resolved typographical differences, if necessary. To facilitate identification and processing, we printed forms on differently colored paper. We then used SAS software, Version 6.1 of the SAS System for Windows [12], to read data batches, check for errors, correct errors, and compile batches of entered data into data sets for analysis. These procedures are summarized below. Detailed descriptions of these procedures, including input statements used for data entry and management using SAS software, are available in additional file 1.
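The double-entry verification described above amounts to comparing two independent keyings of the same form and listing the fields that disagree. A minimal Python sketch follows; it is illustrative only (the data entry facility's actual software is not described), and the field names and values are hypothetical.

```python
def compare_entries(first: dict, second: dict) -> list:
    """Return (field, first_value, second_value) for every field
    where the two independent keyings of a form differ."""
    fields = sorted(set(first) | set(second))
    return [(f, first.get(f), second.get(f))
            for f in fields
            if first.get(f) != second.get(f)]


# Hypothetical keyings of one form by two operators.
entry_a = {"EVTEMP": "101.2", "EVWBC": "12.4", "EVPULSE": "88"}
entry_b = {"EVTEMP": "101.2", "EVWBC": "14.2", "EVPULSE": "88"}

# Each discrepancy would be resolved against the paper form.
discrepancies = compare_entries(entry_a, entry_b)
```

A field present in only one keying is also reported, with `None` standing in for the absent value.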
Initial screening and enrollment forms were entered in the tracking database within a week of a resident's evaluation. The tracking database checked for internal consistency (e.g., residents who met enrollment criteria were enrolled). These data were checked and corrected before follow-up assessments. The database was also used to print out lists of individuals who should receive 30- or 90-day follow-up evaluations, lists of individuals whose records were available for abstraction, and lists of missing or inconsistent data. We used weekly meetings to distribute these lists, collect incoming forms, and discuss any problems that arose. These weekly meetings provided regular discussions of problems and solutions that were critical to the data management process. We kept minutes of all meetings. In addition to weekly project meetings, staff involved with data collection regularly met with the data manager and principal investigator. We also kept a log of issues that resulted in procedural changes. One entry reads, "If we don't know whether a medication was given in capsule or tablet form, specify tablet. (10/26/95)." Batches of forms were sent to data entry approximately monthly throughout the project. The turnaround time for data entry was typically two to four weeks.
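The follow-up lists produced by the tracking database can be sketched as a query over evaluation dates. The Python below is a minimal illustration under stated assumptions: the study used a relational database whose schema is not given, so the function, the identifiers, and the rule that a follow-up becomes due exactly 30 or 90 days after evaluation are all hypothetical.

```python
from datetime import date, timedelta


def followups_due(enrollments, completed, today, windows=(30, 90)):
    """List (study_id, days, due_date) for follow-ups that are due
    by `today` and have not yet been recorded as completed.

    enrollments: iterable of (study_id, evaluation_date)
    completed:   set of (study_id, days) already on file
    """
    due = []
    for study_id, evaluated_on in enrollments:
        for days in windows:
            due_date = evaluated_on + timedelta(days=days)
            if due_date <= today and (study_id, days) not in completed:
                due.append((study_id, days, due_date))
    # Oldest outstanding follow-ups first.
    return sorted(due, key=lambda row: row[2])


# Hypothetical tracking data.
enrollments = [("C0001", date(1996, 1, 10)), ("C0002", date(1996, 3, 1))]
completed = {("C0001", 30)}
work_list = followups_due(enrollments, completed, today=date(1996, 4, 15))
```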
Data Entry and Cleaning
Before forms were submitted for data entry, they were visually inspected for completeness and legibility. Errors found at this stage were corrected by drawing a single horizontal line through the erroneous value, printing the correct value above or next to the original, initialing and dating the change, and adding an explanatory note when appropriate. We followed the same procedure to correct data following entry, with the exception that a specific SAS command was created for each edit. The erroneous data were never obscured, thus maintaining a clear audit trail of all changes. For each study form, we developed a data dictionary that contained several elements, including the name, description, type, allowable values (range), and maximum field width for each variable. This helped data entry personnel set up a series of data entry formats that ensured output of high-quality data files. Any questions about form legibility or out-of-range responses were flagged by data entry personnel for later resolution.
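A data dictionary with the elements listed above (name, description, type, allowable range, maximum field width) can drive automatic checks of entered values. The following Python sketch shows one way this might look; the two variables and their ranges are hypothetical placeholders, not values from the study's dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    description: str
    kind: type        # int or float in this sketch
    low: float        # smallest allowable value
    high: float       # largest allowable value
    max_width: int    # maximum number of characters on the form


# Hypothetical entries; real limits would be set with clinician input.
DICTIONARY = [
    FieldSpec("EVTEMP", "Temperature (degrees F)", float, 90.0, 108.0, 5),
    FieldSpec("EVPULSE", "Pulse (beats/min)", int, 30, 200, 3),
]


def check_record(record: dict) -> list:
    """Return (variable, problem) pairs for one entered record."""
    problems = []
    for spec in DICTIONARY:
        raw = record.get(spec.name, "").strip()
        if raw == "":
            problems.append((spec.name, "missing"))
        elif len(raw) > spec.max_width:
            problems.append((spec.name, "too wide"))
        else:
            try:
                value = spec.kind(raw)
            except ValueError:
                problems.append((spec.name, "not numeric"))
                continue
            if not spec.low <= value <= spec.high:
                problems.append((spec.name, "out of range"))
    return problems


clean = check_record({"EVTEMP": "101.2", "EVPULSE": "88"})
flagged = check_record({"EVTEMP": "1012", "EVPULSE": ""})
```

Here a temperature keyed without its decimal point is reported as out of range, which is the kind of value that would be flagged for later resolution.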
We established several rules to ensure accountability and data quality. Except for data entry, completed forms never left the office. The original electronic files we received from data entry were never altered. We copied each file and worked with the copies, never the originals. Each batch was given a name that identified the type of form and included a sequential number identifying the data entry batch. For example, the raw data files for participant evaluation forms for the Columbia site were named EVALCL01.DAT through EVALCL35.DAT, as there were 35 batches of entered forms. This allowed us to use simple macro variable names to refer to the batches in our SAS software programs. Figure shows a flow chart of the computational tasks.
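A sequential naming scheme like the one above makes it easy to generate the full list of batch files in a loop. The Python sketch below illustrates the idea; it assumes, based only on the EVALCL01.DAT example, that a name is a form code, a site code, and a two-digit batch number, and the function name is hypothetical.

```python
def batch_filenames(form_code: str, site_code: str, n_batches: int) -> list:
    """Build names such as EVALCL01.DAT ... EVALCL35.DAT."""
    return [f"{form_code}{site_code}{i:02d}.DAT"
            for i in range(1, n_batches + 1)]


# The 35 evaluation-form batches mentioned in the text.
names = batch_filenames("EVAL", "CL", 35)
```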
Entered data were returned to us as flat text files. Because changes must be made at specific row and column locations, directly editing a large text file can be quite difficult, particularly when each observation extends over hundreds of columns and several rows. A change made to the wrong location may be particularly difficult to find and correct. For this reason, and also to preserve the original data, the entered data files were never altered, but used to create analyzable SAS data sets. For each batch of forms, the input program read the text file, reported potential anomalies, and created a SAS data set. Each variable was given a label that included the form and a brief description of the variable. To facilitate naming almost 2,900 variables, 2–3 characters that identified forms were often used at the beginning of variable names, and variable labels included the source form as well. For example, variable names for the evaluation form usually started with EV, while those for the 90-day (quarterly) evaluation started with Q90.
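Reading a fixed-column text file into named variables, while leaving the file itself untouched, can be sketched as follows. This is a simplified Python illustration rather than the SAS input program the authors used: the column layout is hypothetical, and each observation is assumed to occupy a single row, whereas the study's forms spanned several rows.

```python
# Hypothetical layout: variable -> (start, end), 0-based, end exclusive.
LAYOUT = {
    "ID": (0, 5),
    "EVTEMP": (5, 10),
    "EVPULSE": (10, 13),
}


def parse_line(line: str) -> dict:
    """Slice one fixed-width row into named, whitespace-trimmed fields."""
    return {name: line[start:end].strip()
            for name, (start, end) in LAYOUT.items()}


def read_batch(text: str) -> list:
    """Parse every non-blank row; the source text is only read, never modified."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


# Two hypothetical rows as they might arrive from data entry.
raw = "C0001101.2 88\nC0002 98.6102\n"
records = read_batch(raw)
```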
The input program was primarily devoted to statements that checked the entered data for potential errors. Our strategies for checking data quality included range checks, consistency checks, and checks for missing data. Developing boundaries for out-of-range values required a collaborative effort between the data manager and the clinician-investigators. We focused staff efforts on the highest priority data, recognizing that some missing data would simply take too much time to recover. For example, for our main outcome measure (mortality), we performed a death certificate search for the three residents who were lost to follow-up and consequently had no missing mortality data. Similarly, we placed a high priority on determining activities of daily living (ADL) status and body mass index. We defined high priority items as those required for determining study eligibility (e.g., vital signs, respiratory signs and symptoms, recent change in status, age, time in facility), outcomes (e.g., mortality, ADL status, health care use for the economic analysis), and variables that were considered likely covariates or confounders based on our previous work and literature review (e.g., laboratory tests, chest x-ray results, body mass index, cognitive status, comorbid diagnoses).
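One way to focus recovery effort on high priority items is to tally missing values per variable and list the high priority variables first. The Python sketch below illustrates this; the variable names, the set of high priority variables, and the codes treated as missing are all hypothetical.

```python
from collections import Counter

# Hypothetical high priority variables (eligibility, outcomes, key covariates).
HIGH_PRIORITY = {"EVTEMP", "EVAGE"}


def missing_report(records, missing_codes=("", ".")):
    """Count missing values per variable.

    High priority variables sort first; within each group, variables
    with more missing values come first, then alphabetical order.
    """
    counts = Counter()
    for rec in records:
        for name, value in rec.items():
            if value in missing_codes:
                counts[name] += 1
    return sorted(counts.items(),
                  key=lambda kv: (kv[0] not in HIGH_PRIORITY, -kv[1], kv[0]))


sample = [
    {"EVTEMP": "", "EVAGE": "84", "EVVACC": ""},
    {"EVTEMP": "99.1", "EVAGE": "84", "EVVACC": "."},
]
report = missing_report(sample)
```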
All editing programs were tested against a dummy dataset containing known errors to make sure that they detected anomalous values and did not report in-range data as anomalous. We also made comparisons across different forms to check whether they were collected in the proper sequence, whether variables such as date of birth and gender were consistent across multiple forms, whether forms were entered more than once, and whether forms indicated as entered in the management database matched the entered data files.
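Two of the cross-form checks described above, consistency of fixed characteristics and detection of forms entered twice, can be sketched in a few lines of Python. This is an illustration with hypothetical field names and data; because residents could be enrolled more than once, a real duplicate check would also need an episode or date component in its key.

```python
from collections import defaultdict


def cross_form_conflicts(forms, keys=("dob", "sex")):
    """Return {(study_id, key): sorted distinct values} wherever a
    resident has more than one value for a fixed characteristic."""
    seen = defaultdict(set)
    for form in forms:
        for key in keys:
            seen[(form["study_id"], key)].add(form[key])
    return {pair: sorted(values)
            for pair, values in seen.items() if len(values) > 1}


def duplicate_forms(forms):
    """Return (study_id, form) pairs that appear more than once."""
    counts = defaultdict(int)
    for form in forms:
        counts[(form["study_id"], form["form"])] += 1
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical dummy data seeded with one conflict and one duplicate.
dummy = [
    {"study_id": "C0001", "form": "EVAL", "dob": "1910-04-02", "sex": "F"},
    {"study_id": "C0001", "form": "Q90", "dob": "1910-04-20", "sex": "F"},
    {"study_id": "C0002", "form": "EVAL", "dob": "1905-11-30", "sex": "M"},
    {"study_id": "C0002", "form": "EVAL", "dob": "1905-11-30", "sex": "M"},
]
conflicts = cross_form_conflicts(dummy)
duplicates = duplicate_forms(dummy)
```

Seeding the dummy data with known problems, as here, mirrors the testing approach in the text: the checks should report exactly the planted errors and nothing else.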
All code for editing data was stored in separate files of SAS statements for each form and batch. We could thus locate the edit commands for an individual without difficulty and correct them easily if necessary. Whenever these files of edit commands were changed, the original text file was re-read, the edits were applied, and a new SAS data set was saved. In addition to the notes written on paper forms, we sometimes included comment statements in the files of edit statements to explain why data values were changed or added. This helped preserve the "audit trail" of edits, an important process for maintaining data integrity.
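The pattern of keeping edits separate from the raw data and re-applying them on every read can be sketched as below. This Python example is an analogue of the approach, not the authors' SAS edit files; the edit list format, identifiers, and values are hypothetical. Requiring the old value to match is an added safeguard assumed here, so that an edit aimed at the wrong record is reported rather than silently applied.

```python
import copy

# Hypothetical edit list: (study_id, variable, old_value, new_value, reason).
# The reason field plays the role of the comment statements in the text.
EDITS = [
    ("C0002", "EVPULSE", "812", "82", "keying error; checked against paper form"),
]


def apply_edits(raw_records, edits):
    """Return an edited copy of the records plus a log of what was applied.

    The raw records are never modified, so the full set of edits can be
    re-run from scratch whenever the edit list changes.
    """
    cleaned = copy.deepcopy(raw_records)
    by_id = {rec["ID"]: rec for rec in cleaned}
    log = []
    for study_id, variable, old, new, _reason in edits:
        rec = by_id.get(study_id)
        if rec is None or rec.get(variable) != old:
            log.append((study_id, variable, "NOT APPLIED"))
            continue
        rec[variable] = new
        log.append((study_id, variable, "applied"))
    return cleaned, log


raw_batch = [{"ID": "C0002", "EVPULSE": "812"}]
cleaned_batch, edit_log = apply_edits(raw_batch, EDITS)
```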
Data were checked and edited as soon as possible after entry to help ensure that information was still available. After making repeat requests for some irretrievable information, we developed a computerized database of potential problems and their resolutions. This provided further documentation of all data changes and helped avoid sending staff out repeatedly to investigate the same issues.
Creating data sets for analysis
Once the edits for each batch were complete, we appended data from each batch to a master file. Rather than waiting until all data were entered, we created interim datasets to check for consistency across datasets, check for duplicate forms, and compare entered forms with the tracking database to see if the two sources matched. Statistical analysis highlighted more potential problems, making necessary another (abbreviated) round of checking and editing. After editing was complete, we calculated the error rate for each batch with the following formula:
Similarly, we summed errors and forms over all batches to derive an error rate for each type of form.
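Computing a rate per batch and then a pooled rate per form type can be sketched as follows. This Python example rests on an assumption: it takes the error rate to be the number of errors divided by the number of values entered (forms multiplied by fields per form), which may differ from the definition used in the study. The batch counts and the figure of 100 fields per form are hypothetical.

```python
def error_rate(errors: int, forms: int, fields_per_form: int = 1) -> float:
    """Errors per entered value (ASSUMED definition, see lead-in).

    With fields_per_form=1 this reduces to errors per form.
    """
    if forms <= 0 or fields_per_form <= 0:
        raise ValueError("forms and fields_per_form must be positive")
    return errors / (forms * fields_per_form)


# Hypothetical (errors, forms) counts for three batches of one form type.
batches = [(4, 50), (1, 50), (0, 25)]
FIELDS = 100  # hypothetical number of fields on this form

per_batch = [error_rate(e, n, FIELDS) for e, n in batches]

# Pooled rate: sum errors and forms over all batches before dividing,
# rather than averaging the per-batch rates.
total_errors = sum(e for e, _ in batches)
total_forms = sum(n for _, n in batches)
overall = error_rate(total_errors, total_forms, FIELDS)
```

Summing before dividing weights each batch by its size, so a small batch with an unusual rate does not distort the figure for the form type.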