The purpose of this study was to assess an alternative statistical approach—multiple imputation—to risk factor redistribution in the national human immunodeficiency virus (HIV)/acquired immunodeficiency syndrome (AIDS) surveillance system as a way to adjust for missing risk factor information.
We used an approximate model incorporating random variation to impute values for missing risk factors for HIV and AIDS cases diagnosed from 2000 to 2004. The process was repeated M times to generate M datasets. We combined results from the datasets to compute an overall multiple imputation estimate and standard error (SE), and then compared results from multiple imputation and from risk factor redistribution. Variables in the imputation models were age at diagnosis, race/ethnicity, type of facility where diagnosis was made, region of residence, national origin, CD-4 T-lymphocyte cell count within six months of diagnosis, and reporting year.
In HIV data, male-to-male sexual contact accounted for 67.3% of cases by risk factor redistribution and 70.4% (SE=0.45) by multiple imputation. Also among males, injection drug use (IDU) accounted for 11.6% and 10.8% (SE=0.34), and high-risk heterosexual contact for 15.1% and 13.0% (SE=0.34) by risk factor redistribution and multiple imputation, respectively. Among females, IDU accounted for 18.2% and 17.9% (SE=0.61), and high-risk heterosexual contact for 80.8% and 80.9% (SE=0.63) by risk factor redistribution and multiple imputation, respectively.
Because multiple imputation produces less biased subgroup estimates and offers objectivity and a semiautomated approach, we suggest consideration of its use in adjusting for missing risk factor information.
Since the early 1980s, recording behavioral risk factors associated with acquired immunodeficiency syndrome (AIDS) and then with human immunodeficiency virus (HIV) infection has been critical in elucidating the infectious nature of the epidemic, identifying areas in which prevention efforts are essential (i.e., screening blood donations), and focusing prevention and treatment programs on the basis of major transmission routes. However, throughout the 1990s, the proportion of cases that were reported to the Centers for Disease Control and Prevention (CDC) without an identified risk factor for HIV infection increased. In 2005, approximately 40% of HIV cases, compared with less than 20% in 1994, were reported to CDC without risk factor information.1,2
In the U.S., the legal authority to collect and store information on cases of HIV and AIDS resides with the governments of the 50 states, the District of Columbia (DC), and U.S.-dependent areas. Government agencies voluntarily forward HIV and AIDS surveillance data to CDC after removing personally identifying information, including the patient's name, from the record for each case.
The combination of expansion in reporting volume (a result of integrating HIV with AIDS reporting), reliance on laboratory reports as the initial case notification to health departments, and decreased access to detailed documentation for follow-up of newly reported cases resulted in larger case loads for follow-up by surveillance staff and decreasing success in acquiring the necessary information.3–5
The first case report forms, developed by CDC in the early 1980s, were designed to collect clinical, demographic, and risk factor information for each case. Initially, only information about intravenous drug use, blood transfusion, and sexual preference (the term used on the first form) was requested. Today, the standardized risk factors for adults that are collected for public health surveillance purposes are male-to-male sexual contact; injection drug use (IDU); high-risk heterosexual (HRH) contact (contact with a person known to have, or to be at high risk for, HIV infection, with high risk based on, for example, a history of male-to-male sexual contact, IDU, or receipt of blood products); and receipt of a blood product, transfusion, or transplant.5
Although a person can have multiple risk factors, for the purposes of analysis and presentation in reports, the risk factor information on each surveillance record is summarized according to hierarchical categories. In descending order of priority, these hierarchical categories are:
This hierarchy is based on the probability of transmission per act as well as the prevalence of infection among people to whom these categories apply. For a classification of HRH contact, the case report form must bear an indication of a sex partner with, or at high risk for, HIV infection.
In this article, we describe the method currently used to redistribute risk factors when risk factor information is missing from the national HIV/AIDS reporting system (HARS),6 present and evaluate an alternative method to address the growing proportion of HIV/AIDS cases reported without risk factors,3,4 and suggest an approach for handling future missing risk factor information in HARS. We focused on the risk factors among adults and adolescents; fortunately, cases of HIV infection in children have become rare in the U.S., and most are attributable to perinatal exposure.1
During the 1990s, CDC developed a statistical method to address the problem of the increasing proportions of cases of HIV reported without a risk factor.6 This method, which assigns a risk factor distribution to cases without a reported risk factor, is based on reporting patterns (four to 10 years before the date the dataset was created) among cases that were originally reported without a risk factor, but that were later reclassified as having a known risk factor, which was obtained from follow-up investigations and chart reviews. Reclassified cases are divided into 16 groups representing the cross-classification of four regions (Northeast, Midwest, South, West), two sexes (female, male), and two races (white, other). Proportions of risk factor reclassification are calculated for all transmission categories for each of the 16 combinations of region, sex, and race. These proportions are combined with reporting delay weights and applied to cases for which risk factor information is missing.
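The core of the redistribution calculation described above can be illustrated with a minimal sketch. It assumes a simplified setting: the function names, the record layout, and the toy counts are hypothetical, and the reporting delay weights mentioned in the text are omitted. It is not the CDC implementation.

```python
from collections import Counter, defaultdict

def reclassification_proportions(reclassified):
    """Compute risk factor proportions within each (region, sex, race) stratum.

    `reclassified` is an iterable of (region, sex, race, risk_factor) tuples for
    cases first reported without a risk factor and later reclassified.
    """
    counts = defaultdict(Counter)
    for region, sex, race, risk in reclassified:
        counts[(region, sex, race)][risk] += 1
    return {
        stratum: {risk: n / sum(c.values()) for risk, n in c.items()}
        for stratum, c in counts.items()
    }

def redistribute(nrr_counts, proportions):
    """Spread cases with no reported risk factor across risk factors.

    `nrr_counts` maps a stratum to its number of cases without a risk factor.
    Returns expected (possibly fractional) case counts per risk factor.
    Reporting delay weights are deliberately left out of this sketch.
    """
    totals = Counter()
    for stratum, n in nrr_counts.items():
        for risk, p in proportions[stratum].items():
            totals[risk] += n * p
    return dict(totals)

# Toy, made-up records for a single stratum (not study data).
stratum = ("South", "male", "white")
reclassified = [
    stratum + ("male-to-male sexual contact",),
    stratum + ("male-to-male sexual contact",),
    stratum + ("IDU",),
    stratum + ("HRH contact",),
]
props = reclassification_proportions(reclassified)
expected = redistribute({stratum: 8}, props)
```

In this toy stratum, half of the reclassified cases fall under male-to-male sexual contact, so half of the eight cases without a risk factor are assigned to that category. The method yields a single set of expected counts with no accompanying measure of uncertainty.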
Calculations of the proportions of redistributed risk factors are based on two assumptions: (1) the distribution of risk factors among cases initially submitted with no reported risk factor (NRR) does not change during the period used in calculating weights, and (2) NRR cases that were later reclassified are representative of all NRR cases. Both of these assumptions are increasingly unlikely to be valid. The pattern of risk factors has changed since the beginning of the epidemic,1,7 and reclassified cases usually represent cases for which risk factors are easiest to find (Personal communication, Eve Mokotoff, Michigan Department of Community Health, and Judith Sackoff, New York City Department of Health and Mental Hygiene, June 2005). In addition, a recent reabstraction study found that for males, the current method overestimated the number of cases attributed to male-to-male sex and IDU and underestimated the number of cases attributed to HRH contact; for females, it overestimated IDU and underestimated HRH contact.8 Until the ascertainment and reporting of HIV risk factors improve significantly, surveillance is likely to rely on statistical approaches to adjust for missing risk factor information.
Missing data are an ongoing problem in routinely collected data and in large-scale epidemiologic studies.9 Some frequently used but less sound ways of handling missing data are list-wise deletion, pair-wise deletion, and mean substitution.10–14 More statistically rooted methods of handling missing data concentrate not on merely replacing a missing value but on attempting, by using available data, to preserve the relationships inherent in the dataset.10,12–14
Multiple imputation, the method of choice for large datasets,15 is one such method. It requires specification of a statistical model and is considered a sound approach.12,13 Multiple imputation does not attempt to estimate each missing value. Instead of estimating the risk factor distribution probabilities for cases with missing risk factors, as the current redistribution approach does, the multiple imputation approach draws a random sample of the missing values from their distribution. This process results in valid statistical inferences that properly reflect the uncertainty due to missing values. Instead of filling in a single value for each missing value, multiple imputation16 replaces each missing value with a set of plausible values that preserve the statistical distribution of the imputed variable and its relationship with the other variables in the imputation model. The multiply imputed datasets are then analyzed by using standard procedures for complete data. Results from these analyses are then combined to get the final estimates.
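The idea of replacing each missing value with random draws, repeated to give several completed datasets, can be sketched as follows. This is an illustration only: it swaps in a simple approximate Bayesian bootstrap within covariate cells in place of a model-based imputation, and the record layout, function names, and toy values are hypothetical.

```python
import random
from collections import defaultdict

def impute_once(records, rng):
    """Return a completed copy of `records` with each missing 'risk' filled in.

    Within each (sex, region) cell, a donor pool is first resampled with
    replacement from the observed risk factors, and missing values are then
    drawn from that pool. A cell with missing values but no observed donors
    raises KeyError; a real analysis would need to handle that case.
    """
    pools = defaultdict(list)
    for r in records:
        if r["risk"] is not None:
            pools[(r["sex"], r["region"])].append(r["risk"])
    donors = {cell: rng.choices(pool, k=len(pool)) for cell, pool in pools.items()}
    completed = []
    for r in records:
        filled = dict(r)  # copy so the original records stay untouched
        if filled["risk"] is None:
            filled["risk"] = rng.choice(donors[(r["sex"], r["region"])])
        completed.append(filled)
    return completed

def multiply_impute(records, m, seed=0):
    """Generate m completed datasets from one incomplete dataset."""
    rng = random.Random(seed)
    return [impute_once(records, rng) for _ in range(m)]

# Toy, made-up records (not study data).
records = [
    {"sex": "female", "region": "South", "risk": "IDU"},
    {"sex": "female", "region": "South", "risk": "HRH contact"},
    {"sex": "female", "region": "South", "risk": "HRH contact"},
    {"sex": "female", "region": "South", "risk": None},
    {"sex": "female", "region": "South", "risk": None},
]
datasets = multiply_impute(records, m=10, seed=1)
```

Each completed dataset can then be analyzed with ordinary complete-data methods, for example by computing the proportion of cases in each transmission category. The spread of those proportions across datasets carries the uncertainty due to the missing values.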
Specifically, multiple imputation follows these steps:
This method maintains the original variability of the missing data by creating imputed values, which are based on variables correlated with the missing data and the reasons the data are missing. Uncertainty is accounted for by generating iterations of the missing data and observing the variability between the imputed datasets.14
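Combining the per-dataset results into one estimate and standard error is conventionally done with Rubin's rules. The sketch below assumes that each imputed dataset has already produced a point estimate and its standard error; the function name and the toy numbers are hypothetical and are not results from this study.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, standard_errors):
    """Pool M per-dataset estimates with Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error), where the total variance
    is the within-imputation variance plus (1 + 1/M) times the
    between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])
    between = variance(estimates)  # sample variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Toy example: three imputed datasets, each with SE = 1.0.
estimate, se = pool_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what separates this standard error from one computed on a single filled-in dataset: the more the imputed datasets disagree, the larger the pooled standard error.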
Assumptions of the multiple imputation method include the following: the data must be missing at random (the probability of being missing depends only on observed variables), the model used to generate the imputed values must be “correct” in some sense (i.e., must include all anticipated predictor variables), and the model used in the analysis must be consistent with the model used in the imputation.15,16
The use of multiple imputation is desirable in adjusting for missing HIV risk factor information because it produces unbiased parameter estimates, which reflect the uncertainty associated with estimating missing data. In addition, multiple imputation methods are available in easy-to-use software.17–19 We used SAS® procedure MI19 with a discriminant function analysis, based on multivariate normal theory. We compared the results from multiple imputation and the results from the risk factor redistribution method currently used by CDC.
In our analysis, we included AIDS data from all 50 states and DC and HIV data from 32 states (as of 2004). All data, after collection by state and local health departments, were reported to CDC without personally identifying information.
We used information in HARS from the 50 states and DC about people whose diagnosis of AIDS had been made from 2000 to 2004 and who had been reported through June 2005 to assess the variables that were missing in ≤20% of cases, those that were thought to be correlated with the lack of reported risk factors, and those that will be used in future analyses of surveillance data. We tested the correlation of covariates with reported risk factor and with the absence of risk factor information by using Cramer's V statistic20 and p-values from Chi-square tests. The variables considered control variables in analyses and the variables with a Cramer's V statistic of approximately ≥0.1 for males and females were retained for further analyses. All of the variables that were correlated with the absence of risk factor information were included in our analysis. Data were imputed 10 times for each combination of sex (males, females) and dataset (HIV, AIDS); 10 imputations gave a relative efficiency of about 95% or better.
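The two screening quantities named above can be computed as in the following sketch. It assumes a contingency table with no empty rows or columns and uses the standard relative efficiency formula for M imputations, 1 / (1 + λ/M), where λ is the fraction of missing information. The function names and example inputs are hypothetical.

```python
import math

def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows.

    Assumes every row and column total is positive, so that all expected
    counts are nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

def relative_efficiency(fraction_missing_info, m):
    """Efficiency of m imputations relative to infinitely many."""
    return 1.0 / (1.0 + fraction_missing_info / m)

# Perfect association gives V = 1; independence gives V = 0.
v_strong = cramers_v([[10, 0], [0, 10]])
v_none = cramers_v([[5, 5], [5, 5]])

# Illustrative value only: with a fraction of missing information of 0.5,
# 10 imputations give a relative efficiency of roughly 95%.
re_10 = relative_efficiency(0.5, 10)
```

The fraction of missing information is not the same as the share of cases with a missing risk factor, so the 0.5 used here is an assumption chosen only to show how 10 imputations can reach about 95% efficiency.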
Multiple imputation models were calculated for each combination of males and females, and transmission categories; only the missing values for risk factors were imputed. No interaction terms were included in the models. A sensitivity analysis of case frequency by time (in months) to reclassify a case resulted in our decision to use data from the past five years (sufficient to capture approximately 85% of the cases that were eventually reclassified).
In the HIV analysis, we included data on diagnoses made from 2000 to 2004 (reported to CDC through June 2005) from 32 states with name-based HIV reporting. All inclusion criteria and analyses of AIDS data were repeated with HIV data.
The variables retained and used in multiple imputation models to impute values for missing risk factors for HIV and AIDS analyses included age at diagnosis, race/ethnicity, type of facility where diagnosis was made, region of residence, national origin, T-lymphocyte cell count (CD4) within six months of diagnosis (AIDS cases only), and reporting year (Tables 1a and 1b and Tables 2a and 2b).
In the AIDS data, male-to-male sexual contact accounted for 57.8% by risk factor redistribution, compared with 60.6% by multiple imputation (SE=0.30) (Table 3a). Also among males, IDU accounted for 19.5% by risk factor redistribution and 18.2% by multiple imputation (SE=0.26), and HRH contact accounted for 15.5% by risk factor redistribution and 14.2% by multiple imputation (SE=0.22).
In the AIDS data on females, the estimates of cases attributable to IDU were very close: 28.8% by risk factor redistribution and 29.6% by multiple imputation (Table 3b). HRH contact accounted for 68.8% by risk factor redistribution and 68.3% by multiple imputation.
In the HIV data, male-to-male sexual contact accounted for 67.3% by risk factor redistribution and 70.4% by multiple imputation (SE=0.45) (Table 4a). Also among males, IDU accounted for 11.6% by risk factor redistribution and 10.8% by multiple imputation (SE=0.34), and HRH contact accounted for 15.1% by risk factor redistribution and 13.0% by multiple imputation (SE=0.34).
Among females, the distribution of HIV was similar to the distribution of AIDS. IDU accounted for 18.2% by risk factor redistribution and 17.9% by multiple imputation (SE=0.61) (Table 4b). In the other major transmission category for females—HRH contact—estimates were 80.8% by risk factor redistribution and 80.9% (SE=0.63) by multiple imputation.
For data on HIV infection and AIDS in males, the multiple imputation estimates for male-to-male sexual contact were slightly higher than the proportions from the risk factor redistribution method and were slightly lower for HRH contact and IDU. For females, however, the multiple imputation estimates for both IDU and HRH contact were very similar to those from risk factor redistribution. Overall, the differences are not of public health significance. We could not test statistical significance between results from the two methods because there is no simple way to estimate the uncertainty associated with the result derived from the risk factor redistribution method. Whether or not the differences are statistically significant is not important, as some differences between the two methods are expected. The absence of a difference in the overall results does not mean that there is no difference in the results for subpopulations. One advantage of the multiple imputation method is that it provides appropriate (unbiased or less biased) estimates not only for the overall risk factor distribution, but also for the risk factor distribution within each subpopulation group that can be characterized by variables included in the imputation model.
These results for the major transmission categories (four for males and two for females) compare favorably with detailed reviews of the medical records of females in three states during the late 1990s8 and interviews conducted with females during the mid-1990s.21 The Enhanced HIV Risk Factor Assessment Project, a review of medical records in three states, concluded that compared with medical record reviews, for females, risk factor redistribution overestimated IDU and underestimated HRH contact; for males, it overestimated male-to-male sexual contact and IDU and underestimated HRH contact.8
Given the increase in the proportion of HIV cases that have occurred in females, estimates from multiple imputation appear not only plausible but more realistic than risk factor redistribution, for which data from the past four to 10 years are used. No interviews or medical record reviews have been conducted recently enough to serve as a comparison with our results. The multiple imputation methodology itself, however, is being used for national datasets generated by the National Center for Health Statistics: National Health Interview Survey, State and Local Area Integrated Telephone Survey, National Health and Nutrition Examination Survey, and by the Federal Reserve for the Survey of Consumer Finances.
In the future, data from CDC's Medical Monitoring Project (MMP) can be used to evaluate the performance of multiple imputation or other alternatives (methods or classification schemes) to risk factor redistribution. The MMP is a national, population-based surveillance project collecting information on clinical outcomes and behaviors of HIV-infected individuals receiving care in the U.S. In addition, CDC and its state surveillance partners are exploring the addition of a female presumed heterosexual contact category.
Results from multiple imputation and risk factor redistribution methods may differ because more variables are included in multiple imputation analysis (seven for multiple imputation; three for risk factor redistribution). In addition, unlike the risk factor redistribution method, the multiple imputation method takes into account the relationships between those variables and the variable being imputed (risk factor) so that overall variability of the missing data is maintained and parameter estimates are unbiased.
Unlike risk factor redistribution, multiple imputation is not based on assumptions about the data that are no longer valid. In addition, multiple imputation could be automated as part of CDC's annual processing of national HIV/AIDS data, resulting in the use of a documented method that is accepted at the national and state levels. Another advantage of multiple imputation is that the method could be reassessed every three or so years (a less automated assessment involving reassessing the variables and determining the number of imputations) instead of the annual labor-intensive work needed to determine the proportions for risk factor redistribution.
Among the practical considerations in adopting the multiple imputation method are the training needs of CDC staff and state surveillance coordinators, the development of SAS programs for national and state use, and the development of procedures for disseminating information about changes.
The most noteworthy limitation of the multiple imputation method is the need for resources to implement the change to a new system. Most of the resources needed would be at the national level. A second limitation is that we were not able to fully assess the missing-at-random assumption. However, as an alternative, we included in the analysis all variables that were collected in the national system and that we knew were of good quality and correlated with missing risk factor information.
Even though the overall results of multiple imputation and risk factor redistribution are similar, results for some subgroups may differ statistically. However, multiple imputation produces less biased subgroup estimates because it maintains the statistical relationship between variables, particularly the relationship between the risk factor and the variables determining the subgroups. This advantage, coupled with the objectivity and relatively automated approach that multiple imputation offers, leads us to recommend that the national HIV surveillance program consider adopting the multiple imputation method to adjust for missing (not reported) risk factor information.
The findings and conclusions in this article are those of the authors and do not necessarily represent the views of the Centers for Disease Control and Prevention.