We describe a general approach for conducting population-based case-control studies of cancer among the US elderly using SEER-Medicare data. Cases and controls are drawn from the population of Medicare beneficiaries over age 65 years who reside in SEER catchment areas. Exposures are assessed by using linked Medicare claims.
A major strength of such case-control studies is the essentially complete ascertainment of cancer cases from the source population. Cancer registries participating in the SEER program are required to meet strict standards with respect to case ascertainment and data quality (http://seer.cancer.gov). PEDSF data derived from SEER include information on tumor histology, grade, and stage, allowing analysis by cancer subtype. In parallel, the availability of a 5% random subcohort of Medicare beneficiaries provides an opportunity to select controls who appropriately reflect the source population. Individuals in this subcohort are eligible to be selected as controls for as long as they remain cancer free.
Cases and controls are thus fully representative of the elderly Medicare population living in SEER areas. These samples can be generalized to the entire US elderly population with 2 caveats. First, 3% of people over age 65 years do not have Medicare. Medicare eligibility depends on having Social Security benefits, or being married to someone with benefits, which in turn depends on documentation of work history. Although the proportion of elderly who do not qualify is very small, poor people and recent immigrants are presumably overrepresented among them. The second caveat is that SEER areas are not entirely representative of the overall US population. SEER areas were selected to include a relatively large fraction of racial/ethnic minorities (refer to http://seer.cancer.gov/). SEER areas also overrepresent urban areas and higher income persons (16).
As we illustrate in , an added strength of the described approach is the very large number of cancer cases and controls that can be evaluated. The sample size is substantial even for some less common cancers, and these large numbers enhance the investigator's ability to examine rare medical conditions and procedures as cancer risk factors.
Importantly, researchers should be cautioned regarding several limitations. Because Medicare coverage is largely restricted to elderly people, the SEER-Medicare data cannot be used to evaluate risk factors that arise earlier in life. Likewise, the vast majority of cancer cases who can be included in a case-control study (i.e., with antecedent Medicare claims data for exposure assessment) are over age 65 years. One must be cognizant that results from studies of the elderly may not be generalizable to younger populations. Nonetheless, because risk of most cancers increases steeply with age, such studies are directly informative for a substantial fraction of cancer cases.
The major issues with use of SEER-Medicare to conduct case-control studies concern the completeness and accuracy of Medicare claims to evaluate risk factors of interest (i.e., exposure assessment). First, only conditions diagnosed and recorded by a health-care provider, or related procedures, can be evaluated. If a medical condition is asymptomatic or underdiagnosed in the elderly (e.g., possible examples include hepatitis C virus infection, depression, and alcoholism), then reliance on Medicare claims will lack sensitivity. In addition, as described earlier, Medicare claims (particularly in NCH) may falsely document the presence of a condition when it is not actually present. This nonspecificity can be reduced by requiring multiple claims or a MEDPAR claim for the condition.
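The rule of requiring multiple claims or a MEDPAR claim can be written as a simple classification function. The sketch below is illustrative only: the `Claim` record, the `is_exposed` function, and the default threshold of two claims on distinct dates are assumptions introduced here, not fields of the SEER-Medicare files or a prescribed algorithm.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Claim:
    """Hypothetical, simplified claim record (not the actual file layout)."""
    source: str          # "MEDPAR", "NCH", or "OUTPT"
    service_date: date
    has_condition: bool  # True if the claim carries a diagnosis code for the condition


def is_exposed(claims: Iterable[Claim], min_other_claims: int = 2) -> bool:
    """Classify a subject as exposed if there is at least one MEDPAR claim,
    or at least `min_other_claims` NCH/OUTPT claims on distinct dates.
    The threshold of 2 is an assumption chosen for illustration."""
    relevant = [c for c in claims if c.has_condition]
    if any(c.source == "MEDPAR" for c in relevant):
        return True
    other_dates = {c.service_date for c in relevant if c.source in ("NCH", "OUTPT")}
    return len(other_dates) >= min_other_claims
```

Under this rule, a single physician claim does not establish exposure, whereas a single hospitalization claim does. Stricter variants (e.g., requiring a minimum interval between claims) trade sensitivity for specificity.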
Furthermore, the claims files described in do not provide data on some classical exposures of interest to cancer epidemiologists. For instance, there are limited data on tobacco or alcohol use, except indirectly as indicated by the presence of medical conditions that arise from smoking (e.g., emphysema) or drinking (e.g., alcoholic hepatitis). Laboratory test results are likewise unavailable, except when abnormalities trigger a medical diagnosis (e.g., anemia). Similarly, data on physical activity and body mass index are not available, although obesity itself can be evaluated as a claims diagnosis. Without Part D data, researchers have no information on medication use, except for certain drugs administered as infusions or injections (e.g., chemotherapy). These restrictions limit the range of conditions that can be evaluated as risk factors.
As noted above, we typically exclude a period prior to cancer diagnosis/control selection from exposure assessment, because medical evaluation of cases likely leads to heightened ascertainment of medical conditions. This bias could be quite severe, since cases, as they develop early signs of cancer, would be expected to increasingly visit their health-care providers. Both nonspecific health complaints and symptoms related to the organ system in which the incipient cancer is situated would prompt added testing and diagnoses in cases.
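A minimal sketch of this exclusion is shown below. The function name `claims_in_exposure_window` and the parameter `exclusion_days` are hypothetical, and the length of the excluded period is left to the investigator rather than fixed here.

```python
from datetime import date, timedelta
from typing import Iterable, List


def claims_in_exposure_window(
    claim_dates: Iterable[date], index_date: date, exclusion_days: int
) -> List[date]:
    """Keep only claims dated strictly before the excluded period, which
    covers the `exclusion_days` days up to the index date (cancer diagnosis
    for cases, selection date for controls)."""
    cutoff = index_date - timedelta(days=exclusion_days)
    return [d for d in claim_dates if d < cutoff]
```

The same rule is applied to cases and controls, so that any loss of exposure information from the excluded period affects both groups in the same way.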
An example that supports exclusion of a period prior to diagnosis/selection is provided in , using unpublished data from a case-control study of skin cancer in the elderly (15). HIV infection is an established strong risk factor for Kaposi sarcoma, and results using the Medicare claims data for the period ending 3 months before case diagnosis/control selection support this conclusion (2.33% of cases with a claim for HIV vs. 0.13% of controls, yielding a crude odds ratio of 18). However, these prevalence estimates based on antecedent Medicare claims substantially underestimate the true HIV prevalence. Notably, for the cases, many additional HIV diagnoses are present in the Medicare claims data in the months at or after Kaposi sarcoma diagnosis, reflecting both newly diagnosed infections (prompted by HIV testing after recognition of the cancer) and initial claims for previously recognized HIV infection. Indeed, the 14 HIV claims prior to diagnosis represent only 20% of all such claims among the cases. In contrast, controls have few additional claims documenting HIV infection at or after their selection date, reflecting an absence of specific medical attention and testing relative to their arbitrary selection date. To avoid differential exposure assessment between cases and controls, one must utilize only the HIV diagnoses prior to case diagnosis/control selection, even though this approach results in a marked underascertainment of HIV (particularly for the cases). In turn, this nondifferential underascertainment leads to a bias toward the null in the magnitude of association with exposures of interest.
US Medicare Claims Documenting HIV Infection Among Kaposi Sarcoma Cases and Controls, 1973–2007
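The crude odds ratio quoted in the text can be reproduced from the two prevalences, and the same function can illustrate how equal underascertainment in cases and controls attenuates an association. In the sketch below, the function names are introduced here for illustration, and the second example uses hypothetical prevalences and a hypothetical sensitivity with perfect specificity assumed; only the 2.33% and 0.13% figures come from the text.

```python
def odds_ratio_from_prevalence(p_cases: float, p_controls: float) -> float:
    """Crude odds ratio from exposure prevalence in cases and in controls."""
    odds_cases = p_cases / (1.0 - p_cases)
    odds_controls = p_controls / (1.0 - p_controls)
    return odds_cases / odds_controls


def observed_prevalence(true_prevalence: float, sensitivity: float) -> float:
    """Prevalence captured by claims when only a fraction `sensitivity` of
    truly exposed subjects have a claim (perfect specificity assumed)."""
    return sensitivity * true_prevalence


# Figures from the text: 2.33% of cases vs. 0.13% of controls.
crude_or = odds_ratio_from_prevalence(0.0233, 0.0013)
print(f"crude OR: {crude_or:.1f}")  # about 18, matching the value in the text

# Hypothetical illustration: the same sensitivity (20%) applied to both groups.
true_or = odds_ratio_from_prevalence(0.10, 0.01)
observed_or = odds_ratio_from_prevalence(
    observed_prevalence(0.10, 0.2), observed_prevalence(0.01, 0.2)
)
print(f"true OR: {true_or:.2f}, observed OR: {observed_or:.2f}")
```

In the hypothetical example the observed odds ratio is smaller than the true one but remains above 1, which is the direction of bias described above.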
Another important limitation in exposure assessment arises from the restricted window available before diagnosis/selection in which to assess Medicare claims. Specifically, claims data are not available prior to age 65 years (rarely, data are available from younger ages if the person was covered due to end-stage renal disease or disability) or 1986 for inpatient data in MEDPAR (NCH and OUTPT data begin in 1991). In addition, for cases and controls from 2003 to 2005, only claims in 1998 and after can be assessed. One may evaluate associations with time since first Medicare claim as a proxy for duration of exposure (i.e., latency), and increasing risk with increasing duration can be taken as evidence for an etiologic relation. However, for many exposures, the interval based on claims data is only a rough proxy, because the data cover only a limited time period, and it is usually not possible to determine when the exposure was first present. Furthermore, if the effect of an exposure on cancer risk is greatest soon after onset of the exposure (e.g., a new user effect for medication), evaluation of claims data that capture mostly long-term exposures will lead to an underestimate of the association (21).
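The constraints on the claims window can be combined into a single calculation of the earliest date from which claims are available for a subject. The sketch below is a simplification under stated assumptions: the text gives only the starting years of the files, so January 1 is assumed as the start date; entitlement before age 65 years (end-stage renal disease or disability) is ignored; and the function names are hypothetical.

```python
from datetime import date

# Starting years are from the text; the January 1 start dates are an assumption.
FILE_START = {
    "MEDPAR": date(1986, 1, 1),
    "NCH": date(1991, 1, 1),
    "OUTPT": date(1991, 1, 1),
}


def sixty_fifth_birthday(birth_date: date) -> date:
    """Date on which the subject turns 65 (1 March for 29 February births)."""
    try:
        return birth_date.replace(year=birth_date.year + 65)
    except ValueError:  # born 29 February; the target year is not a leap year
        return date(birth_date.year + 65, 3, 1)


def claims_window_start(birth_date: date, index_date: date, source: str) -> date:
    """Earliest date from which claims in `source` can be assessed."""
    start = max(sixty_fifth_birthday(birth_date), FILE_START[source])
    # For cases and controls from 2003 to 2005, only claims in 1998 and after.
    if 2003 <= index_date.year <= 2005:
        start = max(start, date(1998, 1, 1))
    return start


def years_of_claims(birth_date: date, index_date: date, source: str) -> float:
    """Approximate duration (years) of claims available before the index date."""
    start = claims_window_start(birth_date, index_date, source)
    return max((index_date - start).days, 0) / 365.25
```

A duration computed in this way could be used to describe the available window or to apply a minimum-duration restriction, bearing in mind that it measures data availability rather than time since onset of the exposure.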
While reliance on Medicare data somewhat restricts the duration over which associations can be assessed, the time window is often quite long. For example, among the controls shown in , the median duration for which claims were available prior to selection was 8.0 years, and 30.1% had 11 years or more. Nonetheless, cases and controls selected at young ages or in early calendar years will have more limited claims data, and less opportunity to be identified as exposed to the risk factor of interest, than subjects selected at older ages or later in calendar time. Depending on the exposure of interest, there may be a minimum duration of available claims data required to be reasonably certain of capturing the exposure, which would then entail eliminating subjects who are younger or from earlier calendar years. Our approach to control selection matches controls to cases according to age and calendar year, so that the lack of sensitivity in exposure assessment that arises from the limited duration of claims data is nondifferential.
Although we focused on a case-control design, other options can be utilized with the SEER-Medicare database. One is a case-cohort study design, considering all cancer cases in the Medicare population along with the reconstructed 5% random subcohort. Using incidence density sampling, it would also be possible to individually match controls to cases to create a “nested” case-control study (i.e., nested in the Medicare cohort). However, given the extremely large number of subjects, both approaches are computationally challenging. The case-cohort design requires repeated evaluation of exposure information (i.e., prior medical conditions or treatments based on Medicare claims) in each successive risk set. For the nested case-control study, the computational burden associated with individual control selection and analyses using conditional logistic regression could be substantial, and this approach would only be feasible for cancers where the number of cases is not too large. Nonetheless, these case-cohort and case-control approaches would be expected to yield equivalent measures of association.
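Incidence density sampling, as mentioned above for the nested design, can be sketched as follows. This is a generic illustration of risk-set sampling, not the procedure used in our studies: the `Person` record and function names are hypothetical, additional matching factors (e.g., age and calendar year) are omitted, and the case is assumed to have a recorded diagnosis date.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical follow-up record for a cohort member."""
    person_id: str
    entry: date                       # start of follow-up
    exit: date                        # end of follow-up
    cancer_dx: Optional[date] = None  # first cancer diagnosis, if any


def at_risk(person: Person, t: date) -> bool:
    """True if the person is under follow-up and cancer free on date t."""
    in_follow_up = person.entry <= t <= person.exit
    cancer_free = person.cancer_dx is None or person.cancer_dx > t
    return in_follow_up and cancer_free


def sample_risk_set_controls(
    case: Person, cohort: List[Person], n_controls: int, rng: random.Random
) -> List[Person]:
    """Sample up to `n_controls` controls, without replacement, from those at
    risk on the case's diagnosis date. Persons who develop cancer later remain
    eligible, and a person may be sampled for more than one case."""
    t = case.cancer_dx  # assumed not None for a case
    risk_set = [p for p in cohort if p.person_id != case.person_id and at_risk(p, t)]
    return rng.sample(risk_set, min(n_controls, len(risk_set)))
```

Because the risk set must be rebuilt for every case, the cost of this step grows with the product of the number of cases and the size of the cohort, which is the computational burden noted above.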
In closing, we encourage investigators to utilize SEER-Medicare data, which can be readily obtained, to conduct studies evaluating risk factors for cancer. Such studies have compelling strengths, including the availability of large population-based samples of cancer cases and representative controls. There are also important challenges, particularly related to the limitations and complexities of claims data, and we hope our discussion will facilitate appropriate study design and analysis.