Throughout the world, a major and increasing cause of death and disability is traumatic injury (
World Health Organization, 2001). The loss of productivity from injury in the United States in 2000 alone was estimated at $326 billion (
Finkelstein, Corso, & Miller, 2006). Recent efforts, including those by nurse scientists (
Richmond et al., 2007;
Scheetz, 2005;
Sommers et al., 2006;
Thompson et al., 2008), have been focused on injury prevention and on improving systems of trauma care in order to reduce morbidity and mortality following injury (
Sommers, 2006). Multivariate and longitudinal research approaches are generally needed to address the questions posed by these issues in injury research. As a result, trauma researchers frequently encounter the problem of missing data, which must be handled appropriately. Additionally, the sources and types of data used in injury research might compound the problem. If missing data issues are addressed improperly, they might lead to inefficient, underpowered analyses or biased estimates (
Joseph, Belisle, Tamim, & Sampalis, 2004).
The purpose of this paper is to provide guidance for managing the problem of missing data in clinical studies of trauma in order to minimize bias and increase the validity of findings for subsequent translation. An integrative review and a simulated analysis based upon data from the National Study on the Costs and Outcomes of Trauma (NSCOT) were used as an example.
Missing data are an omnipresent problem in clinical research, and the problem is often compounded in trauma research, where the prevalence of missing data is frequently much higher. The first contributor to the scope of the problem in trauma research is the data source. The severity of injury and the nature of the emergent care being provided to patients might interfere with data collection (
Joseph et al., 2004). Data might also be missing for a variety of reasons including: missing reports concerning prehospital care (
Newgard, 2006) or an item that is not fully testable in a particular patient (e.g., pupillary response in a patient with trauma to one eye). Traumatic injury is, by its very nature, an unexpected event; thus, the injured person might be unable to provide needed information and a proxy informant might not be available (
Moore et al., 2005). People at increased risk for trauma include those with comorbidities such as substance abuse, which might make follow-up more difficult in longitudinal studies (
Gentilello, Donovan, Dunn, & Rivara, 1995;
Holavanahalli et al., 2006;
Nilsen, Holmqvist, Nordqvist, & Bendtsen, 2007).
The second area that contributes to the problem of missing data in trauma research is that many studies are conducted as secondary analyses of available data. To facilitate studies of injury, researchers have several sources of secondary clinical data available to them, including local and state hospital trauma registries, national hospital data from the Agency for Healthcare Research and Quality Healthcare Cost and Utilization Project (HCUP), and national trauma databases such as the American College of Surgeons’ National Trauma Data Bank. When considering the use of secondary sources of data, researchers must balance the knowledge to be gained against the number of incomplete cases available for analysis. Thus, regardless of whether a study is conducted prospectively or retrospectively, missing data are an issue that injury researchers must be able to handle on a routine basis in order to ensure that their results are precise, valid, and reliable.
The extent and pattern of missing data may vary greatly, from a single item to an entire time point of data. In trauma research, items, which are one or more measures on a multi-item survey, may be missing for various reasons. For example, a participant might not want to answer a specific question about alcohol use on a depression questionnaire or might be unable to provide an open-ended follow-up answer on a multi-item pain question because of intubation. At the variable level, data may be missing for an entire single-item measure, such as age, or for an entire multiple-item measure, e.g., the Glasgow Coma Scale (GCS). Lastly, an entire time point of data might be missing because of factors such as participant death, withdrawal from the study, or inability to be contacted for follow-up. These reasons often result in different patterns of “missingness” (absence) within the data, and the underlying reasons (e.g., withdrawal from the study because of the perceived burden of the intervention) should be examined separately.
Although one frequently hears that the ideal study is one without missing data, achieving this is generally not feasible. The best study designers, however, attempt to decrease the influence of missing data through various techniques. These include pilot-testing questionnaires to ensure that multiple-item measures are clear and feasible. If members of the healthcare team will be collecting data for the study, it is imperative that the rationale for the study and the need for each item are apparent to all.
Researchers should also allow time during planning and training for the team to ask questions and provide feedback regarding feasibility, so that the team has maximal “buy-in” concerning the study. Additionally, employing adequate retention strategies is critical in longitudinal studies to decrease loss to follow-up (Carpenter & Kenward, n.d.;
Fox-Wasylyshyn & El-Masri, 2005). In order to accurately assess the effect of missing data, researchers should report the type, amount, and patterns of missing data on the main variables of interest. Such reporting also allows readers to assess data quality (for an example of good data reporting, see
Mackenzie et al., 2007).
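As an illustration of the reporting recommended above, the following minimal sketch tabulates the amount of missing data per variable and the distinct patterns of missingness. The records, variable names, and helper functions are hypothetical and are not drawn from any study cited here; None marks a missing value.

```python
from collections import Counter

# Hypothetical records (assumption for illustration); None marks a missing value.
records = [
    {"age": 34, "gcs": 15, "followup_pain": 3},
    {"age": 71, "gcs": None, "followup_pain": None},
    {"age": 52, "gcs": 9, "followup_pain": None},
    {"age": None, "gcs": 14, "followup_pain": 2},
    {"age": 45, "gcs": None, "followup_pain": None},
]
variables = ["age", "gcs", "followup_pain"]


def missing_by_variable(rows, names):
    """Return {variable: (number missing, proportion missing)}."""
    n = len(rows)
    out = {}
    for name in names:
        k = sum(1 for r in rows if r[name] is None)
        out[name] = (k, k / n)
    return out


def missing_patterns(rows, names):
    """Count each distinct pattern; 'X' = observed, '.' = missing."""
    return Counter(
        "".join("." if r[name] is None else "X" for name in names)
        for r in rows
    )


summary = missing_by_variable(records, variables)
patterns = missing_patterns(records, variables)

for name, (k, p) in summary.items():
    print(f"{name}: {k} missing ({p:.0%})")
for pattern, count in patterns.most_common():
    print(pattern, count)
```

Each pattern string follows the order of the variables list, so a pattern such as "X.." denotes a participant with age recorded but both GCS and follow-up pain missing.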
A plethora of current biostatistical research addresses the problem of missing data in clinical studies. The issue is complex and is inseparable from the general problem of statistics: inference. It does not take sophisticated statistical expertise to understand the two basic concerns caused by missing data in patient care research. They are:
- Is the available sample a random sample of the patient population of interest?
- Are all relevant and available data being used in the analysis?
The latter question deals directly with efficiency; that is to say, is the researcher maximizing power to answer the question of interest?
A sample selected at random is what drives statistical inference, that is, the ability to generalize from a sample to a population. A randomly selected sample is not necessarily representative. For instance, if random selection from a mixed-gender population of patients happened to yield only female participants, the sample would not be truly representative of the population of interest. It follows, then, that whether a sample was selected at random cannot be assessed by looking at the sample itself. Instead, the appropriateness of random selection should be judged by how the sample was obtained. Similarly, whether data are missing randomly cannot be assessed from the sample alone. In practice, however, in the absence of further information, researchers often check patterns of missing data by creating a dummy variable (0 = present, 1 = absent) and assessing its correlation with other variables (
Fox-Wasylyshyn & El-Masri, 2005). If the correlation coefficient is high, then the data are probably not missing completely at random (MCAR). A nomenclature for missing data has been developed to address this concern. The resulting taxonomy differentiates among three possible situations:
Situation 1
The available sample, although incomplete, is still a random sample. This type of missing data is often referred to as MCAR or ignorable. Because the available sample is still a random sample from the population of interest, the usual statistical inferential methods remain valid when missing data are ignored. Sometimes, it is clear that missing data are MCAR. For example, suppose the admission GCS score was not obtained for patients randomly assigned to one data collector. The reason the GCS score is missing is unrelated to the test itself. Additionally, because of the random assignment of data collectors, it is unrelated to other variables such as time of admission. An available case analysis (an analysis of participants with admission GCS scores available) will be valid. However, the analysis would be less precise than if no data were missing.
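The informal dummy-variable check described earlier can be sketched as follows. This is a minimal illustration on simulated data, not an analysis from any cited study: the ages, the missingness probabilities, and the pearson helper are all assumptions made for the example. A missingness indicator for GCS (0 = present, 1 = absent) is correlated with age under an MCAR mechanism and under an age-dependent mechanism.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ages = [rng.uniform(18, 90) for _ in range(2000)]  # hypothetical ages

# MCAR (assumed): every patient has the same 20% chance of a missing GCS score.
mcar_missing = [1 if rng.random() < 0.20 else 0 for _ in ages]

# Age-dependent (assumed): probability of a missing score rises
# linearly from 5% at age 18 to 50% at age 90.
age_dep_missing = [
    1 if rng.random() < 0.05 + 0.45 * (a - 18) / 72 else 0 for a in ages
]

r_mcar = pearson(ages, mcar_missing)
r_age = pearson(ages, age_dep_missing)
print(f"MCAR indicator vs. age:          r = {r_mcar:.3f}")
print(f"Age-dependent indicator vs. age: r = {r_age:.3f}")
```

A correlation near zero is consistent with MCAR but does not establish it, because the check can detect only dependence on variables that were observed.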
Situation 2
The available sample is a biased sample with respect to the population of interest, but the bias depends on known variables. Inference will be invalid if this systematic bias is not taken into account. This type of missing data is called missing at random (MAR) and, perhaps confusingly, is also ignorable. This is because, given the available information, participants with complete data can be identified to represent other participants with missing data. For example, suppose the GCS score is missing for some trauma patients, but this time, the score is more likely to be missing for older patients. Because age is an available covariate for all participants, selection bias can be corrected. A method that could be used in this instance of MAR data in a regression analysis is inverse probability weighting (
Finkelstein et al., 2006;
McKnight, McKnight, Sidani, & Figueredo, 2007).
These techniques are indicated to correct for certain under-represented covariate profiles (e.g., those who stay in the study versus those who drop out), assigning weights proportional to the inverse of the probability of being observed (
Fitzmaurice, Laird, & Ware, 2004;
McKnight et al., 2007). In this approach, only complete cases are included, and the data are then reweighted so that their distribution more closely approximates that of the full sample based upon the covariate profiles (
Little & Rubin, 2002). A drawback to this approach in a multivariate situation is that it does not always allow for making use of all available data. Because of the constraints necessary for its implementation, weighting is generally used only when missing data patterns are few and similar in nature, and only for a limited number of analyses (univariate, generalized estimating equations;
McKnight et al., 2007).
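The weighting idea can be illustrated with a minimal sketch under simplifying assumptions: a simulated cohort, a single binary covariate (older versus younger), and estimation of a mean rather than a regression model. All values and names are hypothetical. In practice, the probability of being observed is commonly estimated with a model such as logistic regression rather than with simple group proportions.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical cohort (assumption): older patients have lower GCS scores on
# average and are also more likely to have a missing score (MAR given age).
patients = []
for _ in range(20000):
    older = rng.random() < 0.5
    gcs = min(15, max(3, round(rng.gauss(11 if older else 14, 2))))
    observed = rng.random() < (0.4 if older else 0.9)
    patients.append({"older": older, "gcs": gcs, "observed": observed})

# Known only because the data are simulated.
true_mean = sum(p["gcs"] for p in patients) / len(patients)

# Complete-case analysis: ignores the over-representation of younger patients.
complete = [p for p in patients if p["observed"]]
cc_mean = sum(p["gcs"] for p in complete) / len(complete)


def observed_rate(group):
    """Estimated probability that GCS is observed within an age group."""
    members = [p for p in patients if p["older"] == group]
    return sum(p["observed"] for p in members) / len(members)


p_obs = {True: observed_rate(True), False: observed_rate(False)}

# Weight each complete case by the inverse of its probability of being observed.
weights = [1.0 / p_obs[p["older"]] for p in complete]
ipw_mean = sum(w * p["gcs"] for w, p in zip(weights, complete)) / sum(weights)

print(f"true mean          = {true_mean:.2f}")
print(f"complete-case mean = {cc_mean:.2f}")
print(f"weighted mean      = {ipw_mean:.2f}")
```

Because older patients are under-represented among the complete cases, each observed older patient receives a larger weight, which pulls the weighted estimate back toward the full-sample value.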
Multiple imputation (MI) is another appropriate method that can be used for any type of statistical analysis to handle MAR data (
Donders, van der Heijden, Stijnen, & Moons, 2006;
Rubin, 1987). In this method, multiple (m) versions (typical range 5–20) of the data set are created using available data to predict missing values. These data sets are then used to conduct m analyses, each with its own regression values and confidence intervals, which are then combined into one inferential analysis (
Rubin, 1987).
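The combining step can be sketched with the pooling rules commonly attributed to Rubin (1987): the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m). The function name and the five coefficient values below are hypothetical and serve only to illustrate the arithmetic.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Pool m point estimates and standard errors from m imputed data sets."""
    m = len(estimates)
    variances = [se ** 2 for se in std_errors]
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    u_bar = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (m - 1 denominator)
    total_var = u_bar + (1 + 1 / m) * b      # total variance
    # Degrees of freedom for the reference t distribution.
    if b > 0:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    else:
        df = math.inf
    return {"estimate": q_bar, "se": math.sqrt(total_var), "df": df}


# Hypothetical regression coefficients from m = 5 imputed data sets.
coefs = [0.50, 0.54, 0.46, 0.52, 0.48]
ses = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled = pool_rubin(coefs, ses)
print(pooled)
```

The pooled standard error exceeds the standard error from any single imputed data set whenever the estimates differ across imputations, which is how the method carries forward the uncertainty caused by the missing values.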
The particular appeal of this method is that once completed data sets have been developed, standard statistical methods can be used. Through the use of MI data sets, researchers are able to reduce the bias associated with single imputation, which might lead to confidence intervals that are too narrow (
Fitzmaurice et al., 2004;
Newgard, 2006). Single imputation techniques (e.g., mean value, last value carried forward) are generally not recommended because they underestimate uncertainty (
McKnight et al., 2007;
Rubin & Schenker, 1991). The MI approach is equally appropriate for handling missing data in both cross-sectional and longitudinal designs with MAR data (
Ali & Siddiqui, 2000). Several statistical packages are available for MI, including R, AMELIA II, SOLAS 3.0, SAS, and S-PLUS. For further detailed information on the steps of MI in nursing research, see the reviews by
Patrician (2002),
McCleary (2002), and
Kneipp and McIntosh (2001).
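The concern that single imputation underestimates uncertainty can be demonstrated with a short simulation. The scores, the 40% missingness rate, and the variable names below are assumptions made for illustration. Missing values are replaced by the observed mean, and the resulting standard deviation and naive standard error are compared with those computed from the observed values alone.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed for reproducibility

# Hypothetical scores; roughly 40% are then set to missing completely at random.
full = [rng.gauss(12, 3) for _ in range(500)]
observed = [x if rng.random() > 0.4 else None for x in full]

seen = [x for x in observed if x is not None]
mean_seen = statistics.fmean(seen)

# Single (mean) imputation: every missing value becomes the observed mean.
imputed = [mean_seen if x is None else x for x in observed]

sd_seen = statistics.stdev(seen)
sd_imputed = statistics.stdev(imputed)

# Standard error of the mean from the observed values only, versus the naive
# standard error that treats imputed values as if they were real data.
se_observed_only = sd_seen / len(seen) ** 0.5
se_naive = sd_imputed / len(imputed) ** 0.5

print(f"SD, observed values only: {sd_seen:.2f}")
print(f"SD, after mean imputation: {sd_imputed:.2f}")
print(f"SE, observed values only: {se_observed_only:.3f}")
print(f"SE, naive after imputation: {se_naive:.3f}")
```

The imputed values add no spread but inflate the apparent sample size, so both the standard deviation and the naive standard error shrink even though no new information has been collected.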
A third method available for handling MAR data is maximum likelihood (ML). ML is a model-based procedure for parameter estimation that in some situations has results similar to MI (
Collins, Schafer, & Kam, 2001). Multiple imputation can be intuitively understood as an extension of single imputation, a familiar concept to many researchers. In contrast, understanding ML methods requires a high degree of statistical knowledge. Furthermore, implementing ML procedures requires special software (LISREL, EQS, Amos, and others), which might not be familiar to some researchers (
Allison, 2002). For these practical reasons, we have focused on MI as the method of choice for MAR data. A detailed discussion of ML methods can be found in
Little and Rubin (2002) and
Allison (2002).
Situation 3
The available sample is biased, and the bias depends on the missing information itself. This is called nonignorable (NI) missing data or missing not at random (MNAR) because inference that ignores the missing data will be biased. Sometimes, it is known that missing data are NI; for example, admission GCS scores might be missing, and it might be known that they are more likely to be missing for milder levels of brain injury. No methods currently exist for making accurate inferences in this situation. However, MI has been shown to yield estimates that are less biased than those from the available case analysis in some NI situations (
Joseph et al., 2004).
The success of multiple imputation in an NI setting depends on having participants in the sample who can be identified using known covariates to reasonably represent those for whom data are missing or, put another way, on having a variable that is highly correlated with the missing variables (such as age in the GCS example). Even in such a situation, when data are NI, the question of potential bias lingers. The possible effect of the informative missingness can be assessed using sensitivity analyses (
Allison, 2002).
In general, a sensitivity analysis is any repetition of analysis under an alternative assumption. Such analyses, then, are used to test the robustness of the statistical analysis to underlying assumptions (
Allison, 2002;
Saltelli, Chan, & Scott, 2000). In a study with missing data, sensitivity analyses allow researchers to model data across a reasonable range of expected values and assess the effects of the type of missing data assumed to be present. For example, if previous studies indicate an expected median GCS score of 13 in a sample, the researcher could impute higher scores for those with missing values. Specifically for multiple imputation analysis, which as typically implemented relies on the assumption that missing data are ignorable, it is possible to specify NI imputation models (
Rubin, 1987). If results of a sensitivity analysis are not substantially different under alternative assumptions about the type of missing data present (models), a researcher can have confidence that even if the data are NI, for practical purposes the potential effect of the missing data is ignorable.
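One simple way to carry out such a sensitivity analysis is to shift the imputed values by a fixed amount and repeat the analysis across a range of shifts, an approach sometimes called delta adjustment. The sketch below is only an illustration under stated assumptions: the observed scores are simulated, the number of missing cases is invented, the imputation model is a normal draw around the observed mean, and only the point estimate of the mean is pooled.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed for reproducibility

# Hypothetical observed admission GCS scores and an assumed number of
# patients whose scores are missing.
observed_gcs = [min(15, max(3, round(rng.gauss(12, 2.5)))) for _ in range(300)]
n_missing = 100

obs_mean = statistics.fmean(observed_gcs)
obs_sd = statistics.stdev(observed_gcs)


def pooled_mean(delta, m=20):
    """Mean GCS averaged over m imputed data sets, with imputed draws shifted by delta.

    delta = 0 corresponds to an ignorable (MAR-type) assumption; delta > 0
    assumes that patients with missing scores had milder injuries.
    """
    means = []
    for _ in range(m):
        draws = [
            min(15.0, max(3.0, rng.gauss(obs_mean + delta, obs_sd)))
            for _ in range(n_missing)
        ]
        means.append(statistics.fmean(observed_gcs + draws))
    return statistics.fmean(means)


# Repeat the analysis under alternative assumptions about the missing scores.
results = {delta: pooled_mean(delta) for delta in (0.0, 1.0, 2.0)}
for delta, estimate in results.items():
    print(f"delta = {delta:.1f}: estimated mean GCS = {estimate:.2f}")
```

If the substantive conclusion is the same across the plausible range of shifts, the analysis can be regarded as robust to the assumed missing data mechanism; if it changes, the dependence on that assumption should be reported.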
Besides the important concern of having a representative sample for inferential statistics, there is also a concern about the efficient use of data. If the incomplete variable is a covariate that is not itself of interest, imputation of the covariate can be used to improve the precision of the analysis. Imputation allows patients’ records to be used that would otherwise not be included, thereby increasing sample size and power. Changes in the regression equations will be reflected in increases in the amount of variance accounted for by the statistical models. However, if a predictor or outcome of interest in a clinical study is incomplete, imputation, while potentially correcting for bias, does not actually improve the precision of an association estimate.