During the 1990s, CDC developed a statistical method to address the problem of the increasing proportions of cases of HIV reported without a risk factor.6
This method, which assigns a risk factor distribution to cases without a reported risk factor, is based on reporting patterns (four to 10 years before the date the dataset was created) among cases that were originally reported without a risk factor, but that were later reclassified as having a known risk factor, which was obtained from follow-up investigations and chart reviews. Reclassified cases are divided into 16 groups representing the cross-classification of four regions (Northeast, Midwest, South, West), two sexes (female, male), and two races (white, other). Proportions of risk factor reclassification are calculated for all transmission categories for each of the 16 combinations of region, sex, and race. These proportions are combined with reporting delay weights and applied to cases for which risk factor information is missing.
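The redistribution logic described above can be illustrated with a minimal sketch. It is not CDC's implementation: the function names, the category labels, the toy records, and the assumption that a reporting delay weight simply multiplies each case's fractional contribution are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def reclassification_proportions(reclassified):
    """Proportion of each risk factor within each (region, sex, race) stratum.

    reclassified: iterable of (region, sex, race, risk_factor) tuples for cases
    first reported without a risk factor and later reclassified.
    """
    counts = defaultdict(Counter)
    for region, sex, race, risk in reclassified:
        counts[(region, sex, race)][risk] += 1
    return {
        stratum: {risk: n / sum(c.values()) for risk, n in c.items()}
        for stratum, c in counts.items()
    }

def redistribute(nrr_cases, proportions):
    """Spread each case without a risk factor across risk factors.

    nrr_cases: iterable of (region, sex, race, delay_weight) tuples.
    Assumption: the delay weight multiplies the stratum proportion.
    """
    totals = Counter()
    for region, sex, race, weight in nrr_cases:
        for risk, p in proportions[(region, sex, race)].items():
            totals[risk] += weight * p
    return dict(totals)

# Hypothetical toy data for a single stratum (one of the 16).
reclassified = [
    ("South", "male", "white", "MSM"),
    ("South", "male", "white", "MSM"),
    ("South", "male", "white", "IDU"),
    ("South", "male", "white", "HRH"),
]
props = reclassification_proportions(reclassified)
nrr = [("South", "male", "white", 1.0), ("South", "male", "white", 1.2)]
estimated = redistribute(nrr, props)
print(estimated)
```

In this sketch each case without a risk factor contributes fractional counts to every transmission category, so the adjusted totals sum to the sum of the weights rather than to a whole number of cases.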
Calculations of the proportions of redistributed risk factors are based on two assumptions: (1) the distribution of risk factors among cases initially submitted with no reported risk factor (NRR) does not change during the period used in calculating weights, and (2) NRR cases that were later reclassified are representative of all NRR cases. Both of these assumptions are increasingly unlikely to be valid. The pattern of risk factors has changed since the beginning of the epidemic,1,7
and reclassified cases usually represent cases for which risk factors are easiest to find (Personal communication, Eve Mokotoff, Michigan Department of Community Health, and Judith Sackoff, New York City Department of Health and Mental Hygiene, June 2005). In addition, a recent reabstraction study found that for males, the current method overestimated the number of cases attributed to male-to-male sex and IDU and underestimated the number of cases attributed to HRH contact; for females, it overestimated IDU and underestimated HRH contact.8
Until the ascertainment and reporting of HIV risk factors improve significantly, surveillance is likely to rely on statistical approaches to adjust for missing risk factor information.
Missing data are an ongoing problem in routinely collected data and in large-scale epidemiologic studies.9
Some frequently used but less sound ways of handling missing data are list-wise deletion, pair-wise deletion, and mean substitution.10–14
More statistically rooted methods of handling missing data are concentrated not on merely replacing a missing value but on attempting, by using available data, to preserve the relationships inherent in the dataset.10,12–14
Multiple imputation, the method of choice for large datasets,15
is one such method. It requires specification of a statistical model and is considered a sound approach.12,13
Multiple imputation does not attempt to estimate each missing value. Rather than estimating risk factor distribution probabilities for cases with missing risk factors, as the current redistribution approach does, multiple imputation draws a random sample of the missing values from their distribution. This process results in valid statistical inferences that properly reflect the uncertainty due to missing values. Instead of filling in a single value for each missing value, multiple imputation16
replaces each missing value with a set of plausible values that preserve the statistical distribution of the imputed variable and its relationships with other variables in the imputation model. The multiply imputed datasets are then analyzed by using standard procedures for complete data. Results from these analyses are then combined to get the final estimates.
Specifically, multiple imputation follows these steps:
- Impute missing values by using an appropriate model incorporating random variation; repeat M times, generating M datasets.
- Perform standard statistical analyses on each dataset.
- Combine results from the M datasets to compute the overall multiple imputation estimate and its standard error (SE).
This method maintains the original variability of the missing data by creating imputed values, which are based on variables correlated with the missing data and the reasons the data are missing. Uncertainty is accounted for by generating iterations of the missing data and observing the variability between the imputed datasets.14
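The three steps listed above can be sketched for a single binary variable. This is an illustration under stated assumptions, not the analysis reported here: the function names are hypothetical, the imputation model is a simple Beta-Bernoulli model with a uniform prior, and the combining step uses Rubin's rules, which the text does not spell out but which are the standard way to pool multiply imputed results.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Step 3: pool M estimates and their variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()              # overall point estimate
    within = variances.mean()             # average within-imputation variance
    between = estimates.var(ddof=1)       # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, np.sqrt(total)          # estimate and its SE

def impute_and_analyze(observed, n_missing, m=10, seed=0):
    """Steps 1 and 2 for a 0/1 variable, followed by pooling."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed)
    n = observed.size + n_missing
    ones = observed.sum()
    estimates, variances = [], []
    for _ in range(m):
        # Step 1: draw the parameter from its posterior (random variation),
        # then draw the missing values given that parameter.
        p = rng.beta(1 + ones, 1 + observed.size - ones)
        imputed = rng.binomial(1, p, size=n_missing)
        completed = np.concatenate([observed, imputed])
        # Step 2: standard complete-data analysis (a proportion and its variance).
        q = completed.mean()
        estimates.append(q)
        variances.append(q * (1 - q) / n)
    return rubin_combine(estimates, variances)

# Hypothetical data: 8 observed values and 4 missing values.
est, est_se = impute_and_analyze([1, 0, 1, 1, 0, 0, 1, 0], n_missing=4)
print(est, est_se)
```

The between-imputation term is what carries the uncertainty due to missing values into the pooled SE; a single imputation would omit it and understate the variance.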
Assumptions of the multiple imputation method include the following: the data must be missing at random (the probability of being missing depends on observed variables), the model used to generate the imputed values must be “correct” in some sense (i.e., must include all anticipated predictor variables), and the model used in the analysis must be consistent with the model used in the imputation.15,16
The use of multiple imputation is desirable in adjusting for missing HIV risk factor information because it produces unbiased parameter estimates, which reflect the uncertainty associated with estimating missing data. In addition, multiple imputation methods are available in easy-to-use software.17–19
We used SAS® procedure MI19
with a discriminant function analysis, based on multivariate normal theory. We compared the results from multiple imputation and the results from the risk factor redistribution method currently used by CDC.
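To make the idea of discriminant function imputation concrete, the following is a simplified sketch, not a reproduction of the SAS procedure. It assumes continuous covariates that are multivariate normal within each category with a common covariance matrix, and it plugs in the estimated means, covariance, and prior proportions directly; a proper imputation would also draw these parameters from their posterior distribution. The function name and the simulated data are hypothetical.

```python
import numpy as np

def discriminant_impute(x_obs, y_obs, x_mis, rng):
    """Draw a category for each row of x_mis from its posterior probabilities."""
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs)
    x_mis = np.asarray(x_mis, dtype=float)
    classes = np.unique(y_obs)                      # sorted category labels
    n = x_obs.shape[0]
    means = np.array([x_obs[y_obs == c].mean(axis=0) for c in classes])
    priors = np.array([(y_obs == c).mean() for c in classes])
    # Pooled within-category covariance matrix.
    centered = x_obs - means[np.searchsorted(classes, y_obs)]
    pooled = centered.T @ centered / (n - len(classes))
    precision = np.linalg.inv(pooled)
    # Log posterior up to a constant: log prior minus half the Mahalanobis distance.
    diffs = x_mis[:, None, :] - means[None, :, :]
    maha = np.einsum("ikd,de,ike->ik", diffs, precision, diffs)
    log_post = np.log(priors) - 0.5 * maha
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Imputation step: sample a category rather than taking the most likely one.
    draws = np.array([rng.choice(classes, p=p) for p in post])
    return draws, post

# Hypothetical, well-separated simulated data with two categories.
rng = np.random.default_rng(0)
x_a = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
x_b = rng.normal(loc=10.0, scale=1.0, size=(20, 2))
x_obs = np.vstack([x_a, x_b])
y_obs = np.array(["A"] * 20 + ["B"] * 20)
x_mis = np.array([[0.0, 0.0], [10.0, 10.0]])
draws, post = discriminant_impute(x_obs, y_obs, x_mis, rng)
print(draws, post.round(3))
```

Sampling from the posterior probabilities, instead of assigning the most probable category, is what keeps the imputed values from understating the variability of the missing data.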
In our analysis, we included AIDS data from all 50 states and DC and HIV data from 32 states (as of 2004). All data, after collection by state and local health departments, were reported to CDC without personally identifying information.
We used information in HARS from the 50 states and DC about people whose diagnosis of AIDS had been made from 2000 to 2004 and who had been reported through June 2005 to assess the variables that were missing in ≤20% of cases, those that were thought to be correlated with the lack of reported risk factors, and those that will be used in future analyses of surveillance data. We tested the correlation of covariates with reported risk factor and with the absence of risk factor information by using Cramer's V statistic20
and p-values from chi-square tests. Variables considered to be control variables in analyses and variables with a Cramer's V statistic of approximately ≥0.1 for males and females were retained for further analyses. All of the variables that were correlated with the absence of risk factor information were included in our analysis. Data were imputed 10 times for each combination of sex (male, female) and data type (HIV, AIDS); this number of imputations was chosen on the basis of a relative efficiency of about 95% or better.
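Two quantities in this paragraph can be sketched briefly. The first function computes Cramer's V from a contingency table of counts. The second uses Rubin's relative efficiency formula, 1 / (1 + λ/m), where λ is the fraction of missing information and m is the number of imputations; the text does not state which formula was used, so applying it here is an assumption. Function names and example tables are hypothetical.

```python
import math

def cramers_v(table):
    """Cramer's V for an r x c table of counts (list of rows).

    Assumes every row and column total is positive, so that all
    expected counts are nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

def relative_efficiency(fraction_missing_info, m):
    """Rubin's relative efficiency of m imputations versus infinitely many."""
    return 1.0 / (1.0 + fraction_missing_info / m)

# Hypothetical tables: perfect association and no association.
v_perfect = cramers_v([[10, 0], [0, 10]])
v_none = cramers_v([[5, 5], [5, 5]])

# With 10 imputations, efficiency stays near 95% even if half of the
# information is missing.
re_10 = relative_efficiency(0.5, 10)
print(v_perfect, v_none, round(re_10, 4))
```

Note that λ is the fraction of missing information about the parameter being estimated, which is generally not the same as the proportion of cases with a missing value.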
Multiple imputation models were calculated for each combination of sex (male, female) and transmission category; only the missing values for risk factors were imputed. No interaction terms were included in the models. A sensitivity analysis of case frequency by time to reclassification (in months) led to our decision to use data from the past five years, a period sufficient to capture approximately 85% of the cases that were eventually reclassified.
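The kind of sensitivity analysis described here can be sketched as the cumulative share of eventually reclassified cases captured within a candidate time window. The function name and the list of months are hypothetical and do not reproduce the surveillance data.

```python
def share_captured(months_to_reclassify, window_months):
    """Share of eventually reclassified cases reclassified within the window."""
    if not months_to_reclassify:
        raise ValueError("at least one reclassified case is required")
    within = sum(1 for m in months_to_reclassify if m <= window_months)
    return within / len(months_to_reclassify)

# Hypothetical months from initial report to reclassification.
months = [3, 8, 14, 20, 27, 35, 41, 52, 58, 60, 75, 90]

# Compare candidate windows of one to five years.
shares = {w: share_captured(months, w) for w in (12, 24, 36, 48, 60)}
for window, share in shares.items():
    print(f"{window} months: {share:.0%}")
```

A window would then be chosen as the shortest period that captures an acceptable share of reclassifications, balancing coverage against the risk that older reporting patterns no longer apply.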
In the HIV analysis, we included data on diagnoses made from 2000 to 2004 (reported to CDC through June 2005) from 32 states with name-based HIV reporting. All inclusion criteria and analyses of AIDS data were repeated with HIV data.