To determine whether the imputation procedure used to replace missing data by the U.S. Census Bureau produces bias in the estimates of health insurance coverage in the Current Population Survey's (CPS) Annual Social and Economic Supplement (ASEC).
Eleven percent of the respondents to the monthly CPS do not take the ASEC supplement, and the entire supplement for these respondents is imputed by the Census Bureau. We compare the health insurance coverage of these “full-supplement imputations” with that of respondents who answer the ASEC supplement. We then compare demographic characteristics of the two groups and model the likelihood of having insurance coverage given that the data are imputed, controlling for demographic characteristics. Finally, to gauge the impact of imputation on the uninsurance rate, we remove the full-supplement imputations and reweight the data, and we also use the multivariate regression model to simulate what the uninsurance rate would be under the counterfactual scenario in which no cases had the full-supplement imputation.
The noninstitutionalized U.S. population under 65 years of age in 2004.
The CPS-ASEC survey was extracted from the U.S. Census Bureau's FTP web page in September of 2004 (http://www.bls.census.gov/ferretftp.htm).
In the 2004 CPS-ASEC, 59.3 percent of the full-supplement imputations under age 65 years had private health insurance coverage as compared with 69.1 percent of the nonfull-supplement imputations. Furthermore, full-supplement imputations have a 26.4 percent uninsurance rate while all others have an uninsurance rate of 16.6 percent. Having imputed data remains a significant predictor of health insurance coverage in multivariate models with demographic controls. Both our reweighting strategy and our counterfactual modeling show that the uninsured rate is approximately one percentage point higher than it should be for people under 65 (i.e., approximately 2.5 million more people are counted as uninsured due to this imputation bias).
The imputed ASEC data are coding too many people to be uninsured. The situation is complicated by the current survey items in the ASEC instrument allowing all members of a household to be assigned coverage with the single press of a button. The Census Bureau should consider altering its imputation specifications and, more importantly, altering how it collects survey data from those who respond to the supplement.
The bias affects many different policy simulations, policy evaluations and federal funding allocations that rely on the CPS-ASEC data.
The Robert Wood Johnson Foundation.
The U.S. Census Bureau's Annual Social and Economic Supplement (ASEC) to the Current Population Survey (CPS) provides the most visible estimate of the number of uninsured people in the United States. The ASEC has become the survey of record for estimates of health insurance coverage because it produces both national and state estimates of health insurance coverage, makes its micro data available to analysts within 6 months after the data are collected, contains a wealth of demographic information (including family structure and income), and releases its detailed report on an annual basis (Blewett et al. 2004). The ASEC estimates of health insurance coverage are widely used in academic research literature and media outlets, and it is the survey to which all other surveys are compared for coverage measurement.
The ASEC estimates of health insurance coverage are used for a variety of purposes. The Congressional Budget Office makes use of the ASEC to help score legislation (Glied, Remler, and Zivin 2002), and states use the ASEC to monitor progress in determining the success of the State Children's Health Insurance Program (SCHIP) in reducing the number of low income uninsured children in each state (Davern, Blewett et al. 2003). The ASEC is also used to allocate three to four billion dollars per year to states to fund SCHIP based, in part, on the number of low income uninsured children in each state and the number of low-income children in each state (Davern, Blewett et al. 2003).
Because of their many uses, the ASEC uninsurance estimates have been scrutinized over the years, especially as they tend to be higher than those from most surveys that ask about health insurance coverage at a single point in time (Lewis, Ellwood, and Czajka 1998; Fronstin 2000; Short 2001; Congressional Budget Office 2003). Some of the national surveys that measure health insurance coverage include the National Health Interview Survey (NHIS), the Household Component of the Medical Expenditure Panel Survey (MEPS-HC), and the Survey of Income and Program Participation (SIPP) (Blewett et al. 2004). The ASEC estimate of the number of people with no health insurance for the entire previous calendar year is typically higher than the full-year uninsurance estimates produced using data from these other surveys. In addition, the ASEC full-year uninsurance estimates are even higher than many “point-in-time” uninsurance estimates from other surveys (Congressional Budget Office 2003; Czajka 2005; Peterson 2005). Several authors have offered potential reasons why the health insurance estimates differ across the various federal surveys, including: differences in sample frame; sample selection and population coverage; mode of survey administration; survey operationalization of the concept of uninsurance; misreporting by respondents in the survey; and data processing (Lewis, Ellwood, and Czajka 1998; Fronstin 2000; Short 2001; Congressional Budget Office 2003).
One of the least explored reasons why the ASEC differs from other surveys is the impact of missing data imputation on health insurance coverage estimates from the ASEC. Davern et al. (2004) explicitly examined this issue with respect to state survey estimates of health insurance coverage, finding that the process used by the Census Bureau to impute missing insurance data in the ASEC biases the state estimates of uninsurance. Some states had higher and some states had lower rates of coverage due to bias, but the national rate was unbiased (Davern et al. 2004).
In this paper we focus explicitly on the national estimate of uninsurance to explore whether the imputation process used by the Census Bureau explains the higher uninsurance estimates found in the ASEC relative to other national surveys. Specifically, we examine whether there is a significant difference in estimates of uninsurance for those cases that have health insurance data imputed (11 percent of the ASEC sample) and those that do not. We begin by describing the imputation methodology the Census Bureau uses for the ASEC health insurance items and why we think the current methodology may impact the overall rates of coverage to produce an upward bias in the national estimates of uninsurance.
The CPS is a monthly survey conducted by the Census Bureau and sponsored by the U.S. Bureau of Labor Statistics. The ASEC supplement is added to the general CPS monthly survey in February through April of each year. The CPS has a rotating panel design in which sampled households are in the sample for 4 months, then rotate out for 8 months, and then rotate back in for an additional 4 months. The major exception to this rotation is that November respondents in their eighth month of interviewing may be re-interviewed if they meet certain criteria (children or minority members in the household). They are recontacted to take the CPS and ASEC supplement again between February and March (U.S. Census Bureau 2002; Davern, Beebe et al. 2003).
Missing data in the form of item nonresponse is a common problem in survey research (Groves et al. 2002). Missing data results when someone refuses to answer a survey item, an entire supplement, or an entire survey. Item missing data also results when respondents “do not know” an answer to a question. An estimated 11 percent of the CPS monthly core survey respondents do not respond to the entire ASEC supplement. These 11 percent have their entire ASEC supplement values imputed. We refer to these cases as “full-supplement imputations” and they are the focus of our analysis.1
Statisticians have developed a wide range of techniques for dealing with item nonresponse (e.g., Kalton 1983; Kalton and Kasprzyk 1986; Little and Rubin 1987; Rubin 1996; Heeringa, Little, and Raghunathan 2002; Marker, Judkins, and Winglee 2001). Most of the techniques use information from the completed cases to impute a model-based estimate to the cases with missing data. The Census Bureau uses “hotdeck” imputation to replace item nonresponse in all of its household surveys (e.g., CPS, SIPP, decennial census, and the Survey of Program Dynamics). Hotdeck is a type of model-based imputation by which a respondent's valid value for a specific variable is assigned to another respondent that does not have a valid value for the variable. The respondent with the valid value is called a “donor” and the one with a missing value is called a “recipient.” For example, if the donor is 15 years old, then the recipient (respondent with missing age) is given a value of 15 and the donor maintains the age of 15. Donors and recipients are matched together based on key demographic characteristics that they share in common; for example, sex, age, and work status (for additional information see David et al. 1986; Mason, Lesser, and Traugott 2001; Marker, Judkins, and Winglee 2001; Davern et al. 2004).
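The donor and recipient matching described above can be illustrated with a minimal sketch. It assumes a simple random draw from donors in the same matching cell; the Census Bureau's production hotdeck routine is more elaborate, and the function name, field names, and sample records here are hypothetical.

```python
import random

def hotdeck_impute(records, target, match_keys, seed=0):
    """Fill missing `target` values by copying a donor's value from the same cell.

    Assumption: donors are drawn at random within a cell defined by `match_keys`.
    """
    rng = random.Random(seed)
    # Collect valid (donor) values for each matching cell.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            cell = tuple(rec[k] for k in match_keys)
            donors.setdefault(cell, []).append(rec[target])
    imputed = []
    for rec in records:
        new = dict(rec)            # copy, so donors keep their own values
        new["imputed"] = False
        if rec[target] is None:
            pool = donors.get(tuple(rec[k] for k in match_keys))
            if pool:               # value stays missing if the cell has no donor
                new[target] = rng.choice(pool)
                new["imputed"] = True
        imputed.append(new)
    return imputed

# Hypothetical records: two recipients, each with one matching donor.
people = [
    {"sex": "F", "age_group": "19-64", "working": True, "insured": True},
    {"sex": "F", "age_group": "19-64", "working": True, "insured": None},
    {"sex": "M", "age_group": "0-18", "working": False, "insured": False},
    {"sex": "M", "age_group": "0-18", "working": False, "insured": None},
]
result = hotdeck_impute(people, "insured", ["sex", "age_group", "working"])
```

Keeping an `imputed` flag on each record is what makes the later comparison of imputed and reported cases possible.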
Although properly specified imputation can alter basic distributional summary statistics (means and variances) compared with the statistics calculated using complete cases only, it should not transform the relationships among variables. If there was a relationship between two variables in the reported data it should remain the same in the imputed data, and no new relationships should appear after the imputation. The basic idea of model-based imputation is to use the existing relationships within the reported data to make an informed guess as to what the actual response would have been. In assessing accuracy of datasets with imputed hotdeck values for estimating demographic characteristics, a critical question is whether there is a relationship between having an imputed value and the concept being imputed after controlling for the covariates used to impute the data. If this relationship is strong after controlling for the variables used in the imputation process, there is evidence that bias may have been introduced. To examine whether this is occurring in the ASEC data we explore three questions: (1) is there a difference between the imputed cases and the nonimputed cases with respect to key demographic variables and health insurance coverage?; (2) is there a difference in the probability of being uninsured after controlling for covariates used to impute health insurance coverage?; and (3) is the difference in the probability of being uninsured between those cases with and without imputed data enough to produce a substantively significant difference in the national uninsurance estimate?
The ASEC imputation specification for private insurance coverage has two stages. The first is to impute whether each person is a private health insurance policyholder (done separately for self-purchased and employer sponsored insurance). The second part of the private health insurance imputation process is to impute whether the policyholder has a “family plan” or an “individual plan.” If the person is imputed to have a family plan, then dependent coverage is extended to other specific family members. If the person is imputed to have an individual plan, then no dependent coverage is extended to other family members in the household.
The private insurance coverage imputation for the ASEC data has two limitations that we believe are working together to create a significant problem. First, the imputation specification does not use family size when imputing whether or not a person is a private coverage policyholder (U.S. Census Bureau 1998). This is a problem because one-person families are more likely to be private health insurance policyholders after controlling for other covariates. As a result, we expect too few policyholders to be imputed in one-person families, and therefore too little private health insurance coverage among people living in one-person families.
After the private insurance coverage policyholder is imputed, then family or individual coverage is imputed. When a person is imputed to be a policyholder with family coverage, the specification then extends this private coverage to specific relatives living in the same household. Dependent coverage in the imputation process is not extended to a nonchild or nonspouse of the policyholder (U.S. Census Bureau 1998). Although theoretically correct, this is problematic because this limitation is not enforced in the computer-assisted ASEC survey instrument where dependent coverage can be extended to everyone within the household regardless of relationship. Specifically, respondents are asked if “anyone else in the household is covered by this type of insurance” with one acceptable response being “everyone,” at which point the Census Bureau's field representative can enter “A” (for “all”) regardless of the relationship among the people being assigned coverage. Therefore some nondependent household members may inappropriately be assigned dependent coverage during the survey interview. Because of this difference between the survey interview and the imputation specifications, we expect less dependent health insurance coverage in the imputed data than in the reported data.
We use the 2004 ASEC supplement to the CPS for this analysis. Data were collected from 77,149 households representing 213,241 individuals; the household response rate for the monthly portion of the CPS was 84 percent (U.S. Census Bureau 2004). In this paper we limit our analysis to ASEC respondents under the age of 65; those aged 65 and older are very likely to have coverage through Medicare.
We examine the full-supplement imputation cases to determine whether there is bias in the national estimates of uninsurance. The first analysis shows general demographic characteristics of all people in the ASEC under the age of 65. We conduct independent two-sample t-tests comparing the demographic characteristics of the full-supplement imputations with those of the nonfull-supplement imputations. The key variables of interest are health insurance coverage as well as other demographic characteristics. The second analysis shows, by household size, the percent of households in which everyone in the household has coverage. This analysis is further broken into full-supplement imputation households (defined as households in which any one person is a full-supplement imputation) versus all others.2
The third analysis uses a multinomial logistic regression model with the dependent variable taking on one of three values indicating whether the person was coded: (1) to be uninsured; (2) to have any public coverage; and (3) as having private coverage only. If the respondent indicates both private and public insurance, the person is coded as having public insurance.3 The key covariate of interest in this model is whether the person was a full-supplement imputation.
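The three-category coding of the dependent variable can be written out as a short sketch. The function name, argument names, and category labels are illustrative assumptions, not names from the ASEC file.

```python
def coverage_category(has_private: bool, has_public: bool) -> str:
    """Code a person into one of the three outcome categories.

    Anyone with public coverage, including those who also report private
    coverage, is placed in the public category, as described in the text.
    """
    if has_public:
        return "any_public"
    if has_private:
        return "private_only"
    return "uninsured"

# The four possible combinations of reported coverage.
codes = {
    (priv, pub): coverage_category(priv, pub)
    for priv in (False, True)
    for pub in (False, True)
}
```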
We use the following variables in our model as they are used in the Census Bureau's health insurance coverage hotdeck routine: age, veteran status, employment, earnings, employer size, self-employment, family labor force participation, and poverty status. In our model we also use several other covariates, including race, ethnicity, education, family size, and citizenship, to see whether they are able to explain away the relationship between being an imputed case and the probability of being uninsured.
In addition to these key covariates of coverage, we include an interaction term for being under 19 years of age and being a full-supplement imputation. It is possible the impact will vary by whether the person is a child or an adult because children are much more likely to obtain dependent coverage than to be a policyholder. Several of the variables are not collected for children, and we have coded children to not be married, be out of the labor force, and have less than a high school degree.
In our final analysis we estimate the impact of the full-supplement imputation on the uninsurance rate, private coverage rate, and public coverage rate. The impact is assessed in two distinct ways. First, in order to give the multinomial logistic regression results a meaningful scale we use the recycled predictions methodology to obtain estimated coverage rates (Graubard and Korn 1999; StataCorp 2001; Kronick and Gilmer 2002). The recycled predictions approach uses the actual values for a respondent (e.g., black, male, three-person family) to determine the probability of having the various coverage types (e.g., public or private) under a counterfactual scenario. We alter only one variable in the counterfactual scenario to observe its marginal impact on an individual's probability of coverage. We sum these new altered counterfactual person probabilities for various types of coverage to get the adjusted overall rate of coverage under the scenario. The counterfactual recycled predictions analysis gives everyone the value of “not full-supplement imputation.” These recycled rates allow us to control for everything else we include in our model (e.g., demographic characteristics and key covariates) while trying to isolate the effect of being a full-supplement imputation. We use this analysis to answer the central question: holding other covariates constant, what is the impact of full-supplement imputation on coverage rates?
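The recycled predictions calculation can be sketched as follows. This is an illustration under stated assumptions: the coefficients are made up (they are not the Table 3 estimates), the model has only two covariates, and person-level probabilities are combined as a weighted mean. All names are hypothetical.

```python
import math

# Hypothetical multinomial logit coefficients; private coverage is the base category.
COEFS = {
    "uninsured": {"const": -1.6, "imputed": 0.8, "poor": 1.0},
    "public": {"const": -2.0, "imputed": 0.3, "poor": 1.5},
}

def predict_probs(person):
    """Return the predicted probability of each coverage outcome for one person."""
    scores = {"private": 0.0}
    for outcome, b in COEFS.items():
        scores[outcome] = (b["const"] + b["imputed"] * person["imputed"]
                           + b["poor"] * person["poor"])
    denom = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / denom for k, s in scores.items()}

def weighted_rates(people, override=None):
    """Weighted mean predicted probability per outcome.

    `override` replaces selected covariates for everyone (the counterfactual);
    all other covariates keep each person's actual values.
    """
    totals = {"private": 0.0, "uninsured": 0.0, "public": 0.0}
    weight_sum = 0.0
    for p in people:
        probs = predict_probs({**p, **(override or {})})
        for k in totals:
            totals[k] += p["weight"] * probs[k]
        weight_sum += p["weight"]
    return {k: v / weight_sum for k, v in totals.items()}

# Hypothetical weighted sample.
sample = [
    {"imputed": 1, "poor": 0, "weight": 1500.0},
    {"imputed": 0, "poor": 1, "weight": 2000.0},
    {"imputed": 0, "poor": 0, "weight": 1800.0},
    {"imputed": 1, "poor": 1, "weight": 1200.0},
]
as_observed = weighted_rates(sample)
counterfactual = weighted_rates(sample, override={"imputed": 0})
```

Comparing `as_observed` with `counterfactual` isolates the contribution of the imputation flag, because every other covariate is held at its actual value.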
We then assess the impact of the full-supplement imputations on the rate of uninsurance and the rate of private health insurance coverage by reweighting the ASEC data after removing all of the full-supplement imputation cases. We adjust the nonfull-supplement imputation cases in the ASEC to population control totals from the entire ASEC by race, ethnicity, gender, age, and poverty status. This analysis helps to further answer the question of what would happen to the ASEC data if the full-supplement cases were treated as nonrespondents, as opposed to having their ASEC items fully imputed.4 We compare the reweighted and recycled predictions estimates to the standard ASEC estimates to gauge the impact of imputation on the health insurance coverage measures.
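A minimal sketch of this kind of reweighting is given below. It assumes a simple cell-based ratio adjustment, in which respondents' weights are scaled so that each cell matches the full-sample weighted total; the adjustment actually used may differ (for example, raking across margins). The field names and records are hypothetical.

```python
from collections import defaultdict

def reweight_respondents(records, cell_keys):
    """Drop full-supplement imputations and scale the remaining weights
    so each cell's total matches the full-sample weighted total.

    Assumption: every cell contains at least one non-imputed record.
    """
    full_totals = defaultdict(float)
    kept_totals = defaultdict(float)
    for r in records:
        cell = tuple(r[k] for k in cell_keys)
        full_totals[cell] += r["weight"]
        if not r["full_imputed"]:
            kept_totals[cell] += r["weight"]
    adjusted = []
    for r in records:
        if r["full_imputed"]:
            continue  # treated as a supplement nonrespondent
        cell = tuple(r[k] for k in cell_keys)
        factor = full_totals[cell] / kept_totals[cell]
        adjusted.append({**r, "weight": r["weight"] * factor})
    return adjusted

def uninsured_rate(records):
    """Weighted share of records flagged as uninsured."""
    total = sum(r["weight"] for r in records)
    return sum(r["weight"] for r in records if r["uninsured"]) / total

# Hypothetical records in two adjustment cells.
records = [
    {"age_group": "0-18", "poor": False, "weight": 100.0, "full_imputed": False, "uninsured": False},
    {"age_group": "0-18", "poor": False, "weight": 50.0, "full_imputed": True, "uninsured": True},
    {"age_group": "19-64", "poor": True, "weight": 200.0, "full_imputed": False, "uninsured": True},
    {"age_group": "19-64", "poor": True, "weight": 100.0, "full_imputed": False, "uninsured": False},
    {"age_group": "19-64", "poor": True, "weight": 100.0, "full_imputed": True, "uninsured": True},
]
reweighted = reweight_respondents(records, ["age_group", "poor"])
original_rate = uninsured_rate(records)
reweighted_rate = uninsured_rate(reweighted)
```

The reweighted file represents the same population total as the original, so the two uninsurance rates are directly comparable.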
Table 1 shows the basic demographics of those cases with full-supplement imputations compared with everyone else in the ASEC sample. The full-supplement imputations make up 10.8 percent of the ASEC data and they are significantly less likely to have private insurance coverage: 59.3 percent compared with 69.1 percent for all other cases. There is no significant difference in public coverage between the full-supplement imputations compared with all other cases.5 The full-supplement imputations also have significantly higher rates of uninsurance, at 26.4 percent compared with 16.6 percent for all others. Among other contrasts, the full-supplement imputations are also less likely to be under 19 years of age, less likely to be working, more likely to be black and less likely to be white, and are more likely to be out of the labor force.
Table 2 compares health insurance coverage rates for those in households where any one person in the household is a full-supplement imputation and those in households without any full-supplement imputations. As expected, the findings are consistent across household size; households with at least one person with a full-supplement imputation have lower rates of coverage. For example, a person living alone who is a full-supplement imputation is 10 percent more likely to be uninsured; those living in two-person full-supplement households are 18.3 percent more likely to have at least one uninsured household member; and so forth. The percent difference between the full-supplement imputations and the nonfull-supplement imputations is lowest for one-person households, highest for two- and three-person households, with four- and five-person households falling in the middle.
Table 3 shows the results from the multinomial logistic regression model. Being a full-supplement imputation case is a strong predictor of whether the respondent is uninsured versus having private health insurance coverage. The full-supplement imputation cases are 2.2 times more likely to be uninsured relative to having private health insurance coverage after controlling for other important covariates. In addition, the interaction effect representing those full-supplement imputations (relative risk ratio=0.67) from one-person families shows that the impact of being a full-supplement imputation is somewhat reduced for this group. This is because, with no other household members, they are not affected by the survey process that allows all household members to be assigned the same coverage.
The public insurance coverage estimates were also significant. Full-supplement imputation cases were more likely to be uninsured than to have public insurance coverage. The relative risk ratio was approximately 1.69 and significant, and for full-supplement children it was 0.65 and significant. We believe that full-supplement imputation cases are much more likely to be uninsured because fewer cases are being imputed to have private health insurance coverage. Because the multinomial logistic regression contains interaction effects whose coefficients are difficult to interpret in isolation, we performed additional analyses.
In Table 4 we present the unadjusted ASEC estimates of private coverage, public coverage, and uninsurance for people under 65 years of age. We also break this table into children under age 19 and adults aged 19–64 and report the results from two adjustment methods. The first adjustment is the reweighted data and the second is the “model-based” recycled prediction estimates. Both of these techniques (reweighting and recycled predictions) yield strikingly similar results. The national unadjusted ASEC estimate of people with private coverage is 68.1 percent for those under 65 years of age. In the adjusted “model-based” recycled predictions the estimate is higher, at 69.0 percent, and the reweighted results are slightly higher still, at 69.1 percent. The rates for any public insurance coverage do not vary among the three methods. The unadjusted ASEC national estimate of uninsurance is 17.6 percent, compared with 16.7 percent using the “model-based” recycled predictions and 16.6 percent using the reweighted data. The results within each age stratum follow the same pattern observed for the total population: compared with the unadjusted estimates, the reweighted and recycled prediction estimates are higher for private coverage, lower for uninsurance, and similar for public insurance.
We find evidence that the Census Bureau estimates of health insurance coverage are biased with respect to the full-supplement imputations. We estimate that correcting the bias would yield roughly 2.5 million fewer uninsured people, or about 6 percent of the total number of uninsured, among those under 65 years of age.6 This magnitude of difference is similar to the adjustment made to the 1999 uninsurance estimate resulting from the addition of a new health insurance verification question (Nelson and Mills 2001).
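The arithmetic behind these two figures can be checked with a short calculation. It uses the under-65 population figure from endnote 6 and the unadjusted uninsurance rate from Table 4; treating the bias as exactly one percentage point is a simplifying assumption, and the variable names are illustrative.

```python
population_under_65 = 253_621_207  # people under 65 in 2004 (endnote 6)
unadjusted_rate = 0.176            # unadjusted ASEC uninsurance rate (Table 4)
bias_points = 0.01                 # assumed bias of one percentage point

# People counted as uninsured because of the imputation bias.
excess_uninsured = population_under_65 * bias_points

# Total uninsured under the unadjusted estimate, and the bias as a share of it.
total_uninsured = population_under_65 * unadjusted_rate
share_of_uninsured = excess_uninsured / total_uninsured

print(f"{excess_uninsured / 1e6:.1f} million, {share_of_uninsured:.1%} of the uninsured")
```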
The goal of imputation is to use available data to make an informed estimate of what the missing value should be. Imputation should not alter or introduce new relationships among the variables (Davern et al. 2004). Our results show that cases with no imputed values differ significantly from the full-supplement imputation cases with respect to health insurance coverage. The differences persist even after controlling for the variables that were used to impute health insurance coverage. Our results demonstrate that bias in health insurance coverage estimates is introduced by the current imputation specifications used by the Census Bureau.
Following from these results we recommend that the Census Bureau alter its imputation specifications to eliminate this bias. An important first step in this process is to use family size when imputing private health insurance policyholder status. Adjusting the specifications for who is allowed to receive dependent health insurance coverage is much more complicated. Imputations should reflect the reported data and should not be used to enforce rules that are not enforced in the reported data. As currently designed, the ASEC instrument may allow for too much dependent coverage to be reported among nonimputed households with the “A” for “all” possible response to who else in the household is covered under the policy. With one press of a button dependent coverage can be assigned to people in the household who may not be eligible to receive health insurance coverage through a specific plan. It is quite possible that this assigns dependent coverage to people outside of the family health insurance unit who may not, in fact, be eligible for that coverage. To further complicate the picture, there is evidence from administrative data that a fair number of people actually receiving dependent health insurance coverage benefits do not, in fact, qualify (i.e., are not part of the family health insurance unit). Companies such as Ford and Northwest Airlines have investigated their health insurance coverage rolls to remove ineligible people, such as older children and ineligible unmarried partners, who had been enrolled in the plan. In both cases about 10 percent of the people obtaining insurance from the companies were found to not be eligible (Appleby 2004; Cummins and Fedor 2005). As a result we think it is important to actually ask who is covered without necessarily restricting dependent coverage to specific family members as the imputation does.
We also think that simply allowing dependent coverage to be assigned to everyone in the household with a press of a button is not a good practice.
The ASEC survey instrument may allow for incorrect reporting of dependent coverage by allowing the “A” option to assign coverage to all household members, but the imputation specification is not the place to fix this problem. In the short run the Census Bureau should fix the imputed data to reflect the reported data by allowing an imputation of “A” just as the reported data does. However, in the long run the Census Bureau should alter its survey instrument to eliminate the “A” for “all” option. This is especially critical because dependent coverage is of particular concern as the cost of employer-based coverage continues to increase and the offer and take up of dependent coverage continues to decline (Holohan 2003; Gould 2004). To get a more refined estimate of dependent coverage and policyholders, additional prompts are needed to clearly specify dependent coverage in the ASEC questionnaire. Improved data collection will result in more reliable estimates as well as better baseline data from which to impute missing values.
It is widely acknowledged in the research community that data from the ASEC produce estimates of uninsurance for an entire year that are too high (e.g., Congressional Budget Office 2003; Peterson 2005). In this paper we have documented a previously unexplored reason why these estimates may be too high: the Census Bureau's imputation procedure for full-supplement cases. However, the adjusted estimates presented in Table 4 are still much higher than the full-year uninsured estimates observed in other Census Bureau surveys, such as the SIPP (Congressional Budget Office 2003; Peterson 2005). The imputation bias discussed here is not large enough to reconcile the ASEC estimates with other surveys measuring full-year uninsurance rates. However, the adjusted numbers do bring the estimates closer to other estimates by increasing the amount of private coverage and lowering the number of uninsured.
The authors wish to thank Linda Bilheimer, John Czajka, Deborah Chollet, Steve Zuckerman, Marie Wang, Chuck Nelson, Robert Mills, and Joanne Pascal for participating in a meeting at Mathematica Policy Research in February 2004 to discuss the preliminary data analysis contained in this paper. Their insights greatly improved our interpretation and understanding of the issue. The authors would also like to thank Karen Soderberg for an outstanding job of editing this manuscript. All remaining problems are the fault of the authors only. Preparation of this manuscript was funded by Grant no. 38846 from The Robert Wood Johnson Foundation to the State Health Access Data Assistance Center at the University of Minnesota, School of Public Health.
1Another 2–3 percent of the ASEC respondents have one or more of the items in the health insurance series imputed, but we do not explicitly examine these at this point. An additional 15 percent of respondents have only dependent health insurance (either privately purchased or employer sponsored) marked as imputed. All of these imputed cases have private health insurance coverage (no one is imputed to be uninsured). After investigation, the Housing and Household Economic Statistics Division of the Census Bureau found that these cases were mistakenly marked as imputed. The Census Bureau informed us that the specification will be altered to remove this problem.
2For the vast majority (85 percent) of people in households with at least one full-supplement imputation person, everyone in the household is a full-supplement imputation.
3This is done because the full-supplement imputations are much more likely to have both public and private health insurance coverage than those people who respond to the supplement. Putting those with both public and private health insurance coverage into the public insurance category highlights the main issue regarding imputation and private coverage. To fix this problem, the Census Bureau could impute public coverage first and then use the imputed public coverage in the hotdeck to impute private coverage. This will reduce the number of full-supplement imputations with both types of coverage.
4This is what is done in other surveys, such as the NHIS, to adjust for supplement nonresponse. In the NHIS the sample child and sample adult supplements have sample loss from the household and person portion of the interview as sampled respondents refuse to take them. These supplements weight the responding cases to represent the entire adult and the entire child population. Supplement refusers do not have their full set of supplement data imputed as in the CPS-ASEC (National Center for Health Statistics 2003).
5See endnote 3.
6There were an estimated 253,621,207 people under 65 in the United States in 2004; multiplying by 1 percent gives approximately 2.5 million.