Am J Epidemiol. 2011 October 1; 174(7): 860–870.
Published online 2011 August 4. doi:  10.1093/aje/kwr146
PMCID: PMC3203375

Use of Surveillance, Epidemiology, and End Results-Medicare Data to Conduct Case-Control Studies of Cancer Among the US Elderly


Cancer is an important cause of morbidity in the elderly, and many medical conditions and treatments influence cancer risk. The Surveillance, Epidemiology, and End Results (SEER)-Medicare database can be used to conduct population-based case-control studies that elucidate the etiology of cancer among the US elderly. SEER-Medicare links data on malignancies ascertained through SEER cancer registries to claims from Medicare, the US government insurance program for people over age 65 years. Under one approach described herein, elderly cancer cases are ascertained from SEER data (1987–2005). Matched controls are selected from a 5% random sample of Medicare beneficiaries. Risk factors of interest, including medical conditions and procedures, are identified by using linked Medicare claims. Strengths of this design include the ready availability of data, representative sampling from the US elderly population, and large sample size (e.g., under one scenario: 1,176,950 cases, including 221,389 prostate cancers, 185,853 lung cancers, 138,041 breast cancers, and 124,442 colorectal cancers; and 100,000 control subjects). Limitations reflect challenges in exposure assessment related to Medicare claims: restricted range of evaluable risk factors, short time before diagnosis/selection for ascertainment, and inaccuracies in claims. With awareness of limitations, investigators have in SEER-Medicare data a valuable resource for epidemiologic research on cancer etiology.

Keywords: aged, case-control studies, data collection, epidemiologic methods, Medicare, neoplasms, risk factors, SEER Program

Cancer is a major cause of morbidity in the United States, with a total of 1.34 million cases reported during 2005 from 49 of the 50 states (1). Cancer incidence typically rises with age, and a disproportionate fraction of cases occur among the elderly. For example, among people aged 65 years or older in the same 49 US states in 2005, there were 738,000 cancers (55% of the US total), including 131,000 lung cancers (67%), 90,000 colorectal cancers (64%), 113,000 prostate cancers (61%), and 79,000 female breast cancers (42%) (1).

Medical conditions are among the factors that are known or suspected to affect cancer risk. Examples of important etiologic associations include viral infections (e.g., human immunodeficiency virus (HIV) with Kaposi sarcoma and non-Hodgkin lymphoma, hepatitis C virus with liver cancer), autoimmune conditions (e.g., rheumatoid arthritis and Sjögren syndrome with non-Hodgkin lymphoma, ulcerative colitis with colon cancer), and metabolic conditions (e.g., obesity with cancers of the colon, esophagus, and uterus) (2–6). In addition, certain medical treatments or procedures can strongly increase subsequent cancer risk (e.g., radiation therapy and sarcomas, solid organ transplant and non-Hodgkin lymphoma) (4, 7). These associations arise because of the direct effects of the medical conditions or their treatment (e.g., inflammation, metabolic disturbances, or direct DNA damage), or they may be explained by shared genetic traits or environmental exposures that predispose to both the medical condition and cancer. As with cancer, the prevalence of many of these medical conditions increases with age.

Understanding the etiology of cancer in the US elderly is important, especially given the overall aging of the population. The Surveillance, Epidemiology, and End Results (SEER)-Medicare database links data from the National Cancer Institute's SEER cancer registry program with claims data from Medicare, the federally funded insurance program for the US elderly. These data are made available to investigators and have been used extensively in research. This resource is valuable for conducting research on cancer in the elderly because, as we describe below, it combines population-based ascertainment of cancer outcomes with insurance claims that can be used to assess the prevalence of medical conditions and treatments. For example, we previously used SEER-Medicare data in several case-control studies that evaluated risk of hematologic malignancies and skin cancers in association with immune-related conditions (8–15).

In the present paper, we describe a general approach to case-control studies of cancer using SEER-Medicare data. We highlight opportunities available to researchers and describe appropriate methods. In particular, we document unique challenges that arise from limitations in exposure assessment related to Medicare claims data, including a restricted range of evaluable risk factors, short time before cancer diagnosis or control selection for ascertainment of medical conditions and treatments, and inaccuracies in claims. These topics were not covered in our previous studies, which were examples of this approach but did not address detailed issues in methodology.


SEER-Medicare data sources

SEER is a National Cancer Institute-funded program collecting data on cancer incidence and survival from US cancer registries. SEER began in 1973 with 9 state and metropolitan area cancer registries. Successive expansions in 1992 and 2001 led to the inclusion in SEER of 17 cancer registries that presently cover approximately 26% of the US population. The number of elderly adults (aged ≥65 years) in SEER coverage regions and the number of cancers in the elderly population are shown by calendar year in Table 1. In total, 146 million person-years are covered during 1973–2007, with 3.1 million incident cancers.

Table 1.
Population and Cancer Counts Among People at Least 65 Years of Age, SEER Cancer Registry Regions, 1973–2007

Medicare provides federally funded health insurance for approximately 97% of persons aged 65 years or older in the United States (16). Medicare also provides health insurance for individuals under age 65 years who have end-stage renal disease or medical disability. As of 2005, Medicare covered 42.6 million people, of whom 35.8 million (84%) were over the age of 65 years. All beneficiaries are entitled to Part A coverage, which includes hospital inpatient care. Approximately 96% of participants pay to subscribe to Part B coverage, which covers physician and outpatient services. In January 2006, Medicare began offering voluntary outpatient coverage for medications (Part D); these data have only recently been made available for researchers.

Medicare reimburses providers under a fee-for-service model for specific procedures (e.g., office visit, surgery, radiographic imaging) tied to appropriate medical diagnoses. Alternatively, some Medicare beneficiaries (24% as of 2010) choose to enroll in a health maintenance organization (HMO) that provides capitated care. HMOs are not required to submit claims to Medicare for individual services, so there is no information about specific medical conditions or health care provided for Medicare beneficiaries enrolled in HMOs.

The SEER-Medicare database comprises files created during electronic linkage of SEER and Medicare data. The linkage utilizes a deterministic algorithm based on name, Social Security number, sex, and date of birth. The match successfully links 94% of the SEER cancer cases over age 65 years with specific Medicare recipients, the deficit reflecting that 3% of elderly people do not have Medicare and that an additional 3% do not have sufficient or accurate enough information for the linkage. Resulting match files are stripped of identifiers. The SEER-Medicare linkage is updated biennially. This paper utilizes data from the 2008 linkage, which includes SEER cancer cases through 2005 and Medicare claims through 2007; as of December 2010, SEER-Medicare data will include SEER cases through 2007 and Medicare claims through 2009.

Table 2 describes the files included in the SEER-Medicare database. The patient entitlement and diagnosis summary file (PEDSF) contains information on all SEER cancer cases who matched to Medicare records, including demographic data, details from SEER on cancer type (e.g., site, morphology, grade, stage), Medicare eligibility and coverage, and socioeconomic data collected by the US Census for the census tract where the patient resides. The summarized denominator (SUMDENOM) file contains similar demographic and Medicare data for a 5% sample of Medicare beneficiaries living in the SEER areas, randomly selected on the basis of the last 2 digits of their Social Security number. However, the SUMDENOM file excludes people from the 5% sample who were reported to SEER with an incident cancer (i.e., individuals who are in the PEDSF file have been removed from the SUMDENOM file).

Table 2.
Description of SEER-Medicare Data Files, 1973–2007

The remaining files listed in Table 2 provide Medicare claims data for individuals in the PEDSF and SUMDENOM files. Specifically, the medical provider analysis and review (MEDPAR) file provides hospital claims for all short stay, long stay, and skilled nursing facility care. The national claims history (NCH) file includes claims from physicians and other noninstitutional medical care providers. The outpatient (OUTPT) file contains claims from institutional outpatient providers, including hospital outpatient departments, rural health clinics, renal dialysis facilities, outpatient rehabilitation facilities, and mental health centers.

An overview of the case-control study design

The SEER-Medicare database provides an opportunity to conduct case-control studies utilizing population-based sampling. Specifically, consider the population of all elderly Medicare beneficiaries (aged 65 years or older) living in the SEER registry areas as the source population, that is, a cohort that people enter when they receive Medicare coverage. By selecting cancer cases over age 65 years from the PEDSF file, which consists of cancers identified by the SEER registries, one obtains a complete census of all cancers arising in this source population.

Although data on the entire Medicare cohort are not available, it is straightforward to construct a subcohort representing a 5% random sample. To do this, one utilizes the SUMDENOM file and adds back the people who developed cancer. This can be accomplished by using a flag in the PEDSF file that indicates which SEER cases were originally in the 5% sample. Using the 2008 version of the SEER-Medicare data, the authors found that the combined data set created from the SUMDENOM file and the flagged cases from the PEDSF file comprise a 5% subcohort of 812,290 Medicare beneficiaries who were living in SEER areas during some time point of SEER coverage. In a case-control study, one can sample from this 5% subcohort of Medicare recipients to create a representative sample of controls.
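The reconstruction of the 5% subcohort described above can be sketched as follows. This is a minimal illustration in Python, not the authors' SAS implementation; the record layout and the field names `person_id` and `in_five_percent_sample` are hypothetical placeholders rather than actual PEDSF or SUMDENOM variable names.

```python
def build_subcohort(sumdenom_rows, pedsf_rows):
    """Combine cancer-free 5% sample members with flagged SEER cases.

    Field names are hypothetical placeholders, not real SEER-Medicare
    variable names.
    """
    subcohort = {}
    # SUMDENOM holds 5% sample members who were never reported to SEER.
    for row in sumdenom_rows:
        subcohort[row["person_id"]] = {
            "person_id": row["person_id"],
            "ever_case": False,
        }
    # Add back SEER cases flagged as originally belonging to the 5% sample.
    for row in pedsf_rows:
        if row["in_five_percent_sample"]:
            subcohort[row["person_id"]] = {
                "person_id": row["person_id"],
                "ever_case": True,
            }
    return list(subcohort.values())


# Toy data: two cancer-free sample members and two SEER cases,
# only one of which was originally in the 5% sample.
sumdenom = [{"person_id": "A"}, {"person_id": "B"}]
pedsf = [
    {"person_id": "C", "in_five_percent_sample": True},
    {"person_id": "D", "in_five_percent_sample": False},
]
subcohort = build_subcohort(sumdenom, pedsf)
subcohort_ids = sorted(member["person_id"] for member in subcohort)
```

In this toy example, person D is a SEER case outside the 5% sample and therefore remains available as a case but is never eligible to serve as a control.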

For the selected cases and controls, one then uses the linked Medicare claims prior to cancer diagnosis/control selection to identify the presence of medical conditions, treatments, or procedures (i.e., “exposures”) possibly related to cancer risk. We emphasize that only exposures reflected in Medicare claims can be evaluated. As discussed below in more detail, this somewhat narrow definition prevents consideration of some important cancer risk factors. In the Discussion, we also review additional issues regarding availability and accuracy of claims that warrant careful attention.

Additional details on sampling of cases and controls

Because exposures are identified through Medicare claims, a key aspect of subject selection is to ensure comparability of the available Medicare data in cases and controls. Claims data are limited by age (data for most people are unavailable before age 65 years) and calendar year (no MEDPAR claims before 1986, no OUTPT or NCH claims before 1991; Table 2). In addition, for PEDSF cases diagnosed in 2003–2005, the most recent linkage does not include Medicare claims before 1998.

The PEDSF and the SUMDENOM files provide additional information on Medicare coverage status for each calendar month. Periods when subjects were covered by both Parts A and B and were not in an HMO are most informative with respect to claims data, and the investigator can use this information to select subjects with a minimum period of Medicare coverage or evaluate for differences in coverage between cases and controls that would lead to differential exposure assessment. For many analyses, we exclude a period prior to diagnosis/selection (several months to a year) from exposure assessment, because during this period cases may have been ill from their incipient cancer and, in comparison to controls, would have been more rigorously evaluated and treated for underlying conditions.

These considerations affect case and control selection. For example, with the exclusion of 1 year of exposure data from Medicare claims immediately prior to diagnosis/selection, we use the following selection criteria for cases: 1) diagnosis in PEDSF of the cancer of interest as a first cancer, where the cancer was not diagnosed first on autopsy or on the death certificate; 2) age at cancer diagnosis of 66–99 years; 3) calendar year of cancer diagnosis of 1987 or after; 4) at least 13 months of Part A, Part B, non-HMO Medicare coverage prior to cancer diagnosis (because Medicare coverage is usually continuous, and exclusion of 1 year of data prior to diagnosis would entail at least 1 earlier month of coverage for assessment of exposures).
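The four example criteria above can be expressed as a simple filter. The sketch below is illustrative only; the dictionary keys are hypothetical names chosen for readability, and the 13-month default reflects the 1-year exclusion scenario described in the text.

```python
def is_eligible_case(record, min_coverage_months=13):
    """Apply the four example case-selection criteria to one record.

    All keys are hypothetical; they do not correspond to PEDSF variables.
    """
    return (
        # 1) first cancer, not first diagnosed at autopsy or on death certificate
        record["first_cancer"]
        and not record["autopsy_or_death_certificate_only"]
        # 2) age at diagnosis of 66-99 years
        and 66 <= record["age_at_diagnosis"] <= 99
        # 3) diagnosis in 1987 or later
        and record["diagnosis_year"] >= 1987
        # 4) minimum Part A, Part B, non-HMO coverage before diagnosis
        and record["prior_ab_non_hmo_months"] >= min_coverage_months
    )


example_cases = [
    {"first_cancer": True, "autopsy_or_death_certificate_only": False,
     "age_at_diagnosis": 72, "diagnosis_year": 1995,
     "prior_ab_non_hmo_months": 40},
    {"first_cancer": True, "autopsy_or_death_certificate_only": False,
     "age_at_diagnosis": 65, "diagnosis_year": 1995,
     "prior_ab_non_hmo_months": 40},   # too young
    {"first_cancer": True, "autopsy_or_death_certificate_only": False,
     "age_at_diagnosis": 80, "diagnosis_year": 2001,
     "prior_ab_non_hmo_months": 6},    # insufficient coverage
]
eligible_flags = [is_eligible_case(r) for r in example_cases]
```

The variations described in the next paragraph (for example, requiring diagnosis in 1992 or later) would correspond to changing the thresholds in this filter.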

Several variations are possible. First, the investigator may include all cases of the cancer of interest, not just as a first cancer; include cases diagnosed at autopsy or on death certificate; or not exclude cases based on a maximum age at cancer diagnosis. Second, depending on the importance of capturing outpatient claims documenting the exposure of interest, one may require that cases be diagnosed in 1992 or after to ensure availability of NCH and OUTPT claims data. Third, the investigator can specify that the minimum duration of Medicare coverage be continuous or that coverage be obtained over specific time windows prior to diagnosis.

Control selection from the 5% random sample of Medicare recipients in the SEER areas mirrors the above criteria for cases. For each calendar year from which cases were sampled, we enumerate individuals in the 5% random sample who were cancer free as of July 1 (the midpoint) of that year and who meet the specified Medicare coverage requirement (e.g., at least 13 months of prior Part A, Part B, non-HMO coverage). From the eligible group, we randomly select controls for each calendar year who are frequency matched to cases by sex and age as of July 1 of that year. Controls can be sampled only once in a calendar year, but they can be sampled repeatedly across multiple years, and they can later become cancer cases. However, cancer cases diagnosed in 2003–2005 cannot be used as controls before 2003, because unlike cancer cases before 2003, they lack claims data prior to 1998.
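The frequency-matched control sampling described above might be sketched as follows. This is an assumption-laden illustration: the keys `year`, `sex`, and `age_group`, the `controls_per_case` ratio, and the fixed random seed are all hypothetical, and the sketch does not handle the restriction on cases diagnosed in 2003–2005.

```python
import random
from collections import Counter, defaultdict


def sample_controls(cases, eligible, controls_per_case=1, seed=0):
    """Frequency-match controls to cases by calendar year, sex, and age group.

    `eligible` holds one entry per person per calendar year in which that
    person was cancer free on July 1 and met the coverage requirement.
    All keys and the matching ratio are hypothetical.
    """
    rng = random.Random(seed)
    needed = Counter((c["year"], c["sex"], c["age_group"]) for c in cases)
    pool = defaultdict(list)
    for person in eligible:
        pool[(person["year"], person["sex"], person["age_group"])].append(person)
    selected = []
    for stratum, n_cases in sorted(needed.items()):
        candidates = pool[stratum]
        k = min(len(candidates), n_cases * controls_per_case)
        # Sampling without replacement within a year-specific stratum means
        # a person is chosen at most once per calendar year, but the same
        # person may be chosen again in another year.
        selected.extend(rng.sample(candidates, k))
    return selected


cases = [
    {"year": 1995, "sex": "F", "age_group": "70-74"},
    {"year": 1995, "sex": "F", "age_group": "70-74"},
    {"year": 1996, "sex": "M", "age_group": "75-79"},
]
eligible = [
    {"person_id": "P1", "year": 1995, "sex": "F", "age_group": "70-74"},
    {"person_id": "P2", "year": 1995, "sex": "F", "age_group": "70-74"},
    {"person_id": "P3", "year": 1995, "sex": "F", "age_group": "70-74"},
    {"person_id": "P1", "year": 1996, "sex": "F", "age_group": "70-74"},
    {"person_id": "P4", "year": 1996, "sex": "M", "age_group": "75-79"},
]
controls = sample_controls(cases, eligible)
```

Because a person can appear in the sampled control set in more than one year, and can later appear as a case, the resulting records are not independent.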

Additional details on exposure assessment using Medicare claims data

For the selected cases and controls, the investigator assesses the presence of exposures of interest using the Medicare claims submitted prior to the diagnosis/selection date. Table 2 provides some relevant characteristics of the Medicare files; a more detailed description of Medicare claims files is beyond the scope of this article, and readers are referred elsewhere.

For many conditions, an inpatient diagnosis in MEDPAR may be considered to indicate a more severe manifestation than claims present only in the NCH or OUTPT files. In addition, for many conditions, inpatient diagnoses may be more reliable than other claims, because hospitals are more thoroughly audited for accuracy of claim diagnoses than individual providers (17–20). For this reason, we often consider a medical condition to be present if there is either a single MEDPAR diagnosis or 2 NCH or OUTPT diagnoses separated by at least 30 days. As noted above, we usually exclude a period prior to diagnosis/selection to avoid differential ascertainment of exposures between cases and controls. Additional measures of exposure that can be used to assess associations with cancer include latency (time from first Medicare claim for the exposure until diagnosis/selection), inpatient (MEDPAR) vs. outpatient-only (NCH or OUTPT) diagnoses, and a “dose-response” relation using the number of claims for the condition as a measure of severity.
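The claims-based exposure definition described above (one MEDPAR diagnosis, or 2 NCH or OUTPT diagnoses at least 30 days apart, after excluding a period before diagnosis/selection) can be sketched as a small function. The tuple layout of the claims and the 365-day default exclusion window are assumptions made for illustration.

```python
from datetime import date, timedelta


def has_condition(claims, index_date, exclusion_days=365, min_gap_days=30):
    """Return True if the claims meet the example exposure definition.

    `claims` is a list of (source, claim_date) tuples for one condition,
    with source in {"MEDPAR", "NCH", "OUTPT"} (a hypothetical layout).
    `index_date` is the diagnosis date (cases) or selection date (controls).
    """
    # Ignore claims inside the excluded period before diagnosis/selection.
    cutoff = index_date - timedelta(days=exclusion_days)
    usable = [(src, d) for src, d in claims if d < cutoff]
    # A single inpatient (MEDPAR) diagnosis is sufficient.
    if any(src == "MEDPAR" for src, _ in usable):
        return True
    # Otherwise require 2 NCH/OUTPT diagnoses at least min_gap_days apart;
    # such a pair exists exactly when the earliest and latest are that far apart.
    outpatient_dates = sorted(d for src, d in usable if src in ("NCH", "OUTPT"))
    if len(outpatient_dates) < 2:
        return False
    return (outpatient_dates[-1] - outpatient_dates[0]).days >= min_gap_days


index_date = date(2000, 6, 1)
two_spaced = [("NCH", date(1998, 1, 10)), ("OUTPT", date(1998, 3, 15))]
two_close = [("NCH", date(1998, 1, 10)), ("NCH", date(1998, 1, 20))]
one_inpatient = [("MEDPAR", date(1997, 5, 1))]
recent_inpatient = [("MEDPAR", date(2000, 1, 15))]  # inside excluded period
```

With these toy inputs, `two_spaced` and `one_inpatient` satisfy the definition, whereas `two_close` and `recent_inpatient` do not.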

Statistical analysis

The prevalence of the exposure of interest is compared between cases and controls by using contingency tables and unconditional logistic regression. In the logistic regression models, the investigator adjusts for the matching factors such as calendar year, sex, and age. Polytomous logistic regression is used when more than one type of cancer case is analyzed (e.g., subtypes of non-Hodgkin lymphoma).

Under the approach we have outlined, the variance of the odds ratios from these models needs to be adjusted for the multiple sampling of controls across calendar years and the inclusion of some controls as subsequent cases (12). Further statistical details are given in the Appendix.

Upon request to the corresponding author, we will provide the following macros for SAS software (SAS Institute, Inc., Cary, North Carolina) that assist investigators in the selection of cases and controls and in statistical analyses, using the above approach.

  1. ALLCANCER.FILE.SAS: Selects cancer cases from PEDSF.
  2. SUMDENOM.ALLCANCERS.SAS: Selects matched controls from the 5% random sample of Medicare beneficiaries in SEER areas.
  3. ROBUSTVARIANCE: Performs polytomous logistic regression accounting for the sampling design described in this paper.


Table 3 presents the number of SEER cancers in 1992–2005 selected as cases by using the criteria described above. Overall, there are 1,176,950 cancer cases, including 221,389 prostate cancers, 185,853 lung cancers, 138,041 breast cancers, and 124,442 colorectal cancers.

Table 3.
Cancers Selected as Casesa Among Elderly US Medicare Beneficiaries (n = 1,176,950), 1992–2005

Table 4 compares these cases with 100,000 cancer-free controls selected as described above. By design, the cases are frequency matched perfectly by sex, age category, and calendar year. The cases and controls are also similar in terms of race/ethnicity and duration of Medicare claims data. These controls represent 86,336 unique individuals, with 74,249 selected once, 10,709 selected twice, and 1,378 selected 3 or more times. Also, 7,125 controls (7.1%) subsequently developed cancer.

Table 4.
Characteristics of Cases and Controls Sampled From SEER-Medicare, 1992–2005a

The prevalence of some example medical conditions and procedures among the controls is presented in Table 5. As expected, some chronic viral infections (e.g., hepatitis C virus and HIV) and medical conditions (e.g., organ transplantation) that are strongly associated with cancers are quite rare in this population. A higher prevalence is seen for additional medical conditions and treatments of potential interest (e.g., rheumatoid arthritis, blood transfusion), and other conditions that may not be linked to cancer are also very common, as expected (e.g., depression, essential hypertension). The apparent prevalence of these conditions decreases with use of more stringent criteria, such as requiring multiple supporting claims (Table 5).

Table 5.
Prevalence of Selected Medical Conditions and Procedures Among 100,000 US Medicare Controls, 1973–2007


We describe a general approach for conducting population-based case-control studies of cancer among the US elderly using SEER-Medicare data. Cases and controls are drawn from the population of Medicare beneficiaries over age 65 years who reside in SEER catchment areas. Exposures are assessed by using linked Medicare claims.

A major strength of such case-control studies is the essentially complete ascertainment of cancer cases from the source population. Cancer registries participating in the SEER program are required to meet strict standards with respect to case ascertainment and data quality. PEDSF data derived from SEER include information on tumor histology, grade, and stage, allowing analysis by cancer subtype. In parallel, the availability of a 5% random subcohort of Medicare beneficiaries provides an opportunity to select controls who appropriately reflect the source population. Individuals in this subcohort are eligible to be selected as controls for as long as they remain cancer free.

Cases and controls are thus fully representative of the elderly Medicare population living in SEER areas. These samples can be generalized to the entire US elderly population with 2 caveats. First, 3% of people over age 65 years do not have Medicare. Medicare eligibility depends on having Social Security benefits, or being married to someone with benefits, which in turn depends on documentation of work history. Although the proportion of elderly who do not qualify is very small, presumably poor people and recent immigrants would be overrepresented. The second caveat is that SEER areas are not entirely representative of the overall US population. SEER areas were selected to include a relatively large fraction of racial/ethnic minorities. SEER areas also overrepresent urban areas and higher income persons (16).

As we illustrate in Table 3, an added strength of the described approach is the very large number of cancer cases and controls that can be evaluated. The sample size is substantial even for some less common cancers, and these large numbers enhance the investigator's ability to examine rare medical conditions and procedures as cancer risk factors.

Importantly, researchers should be cautioned regarding several limitations. Because Medicare coverage is largely restricted to elderly people, the SEER-Medicare data cannot be used to evaluate risk factors that arise earlier in life. Likewise, the vast majority of cancer cases who can be included in a case-control study (i.e., with antecedent Medicare claims data for exposure assessment) are over age 65 years. One must be cognizant that results from studies of the elderly may not be generalizable to younger populations. Nonetheless, because risk of most cancers increases steeply with age, such studies are directly informative for a substantial fraction of cancer cases.

The major issues with use of SEER-Medicare to conduct case-control studies concern the completeness and accuracy of Medicare claims to evaluate risk factors of interest (i.e., exposure assessment). First, only conditions diagnosed and recorded by a health-care provider, or related procedures, can be evaluated. If a medical condition is asymptomatic or underdiagnosed in the elderly (e.g., possible examples include hepatitis C virus infection, depression, and alcoholism), then reliance on Medicare claims will lack sensitivity. In addition, as described earlier, Medicare claims (particularly in NCH) may falsely document the presence of a condition when it is not actually present. This nonspecificity can be reduced by requiring multiple claims or a MEDPAR claim for the condition.

Furthermore, the claims files described in Table 2 do not provide data on some classical exposures of interest to cancer epidemiologists. For instance, there are limited data on tobacco or alcohol use, except indirectly as indicated by the presence of medical conditions that arise from smoking (e.g., emphysema) or drinking (e.g., alcoholic hepatitis), or laboratory test results, except when abnormalities trigger a medical diagnosis (e.g., anemia). Similarly, data on physical activity and body mass index are not available, although obesity itself can be evaluated as a claims diagnosis. Without Part D data, researchers have no information on medication use, except for certain drugs administered as infusions or injections (e.g., chemotherapy). These restrictions limit the range of conditions that can be evaluated as risk factors.

As noted above, we typically exclude a period prior to cancer diagnosis/control selection from exposure assessment, because medical evaluation of cases likely leads to heightened ascertainment of medical conditions. This bias could be quite severe, since cases, as they develop early signs of cancer, would be expected to increasingly visit their health-care providers. Both nonspecific health complaints and symptoms related to the organ system in which the incipient cancer is situated would prompt added testing and diagnoses in cases.

An example that supports exclusion of a period prior to diagnosis/selection is provided in Table 6, using unpublished data from a case-control study of skin cancer in the elderly (15). HIV infection is an established strong risk factor for Kaposi sarcoma, and results using the Medicare claims data for the period before 3 months prior to case diagnosis/control selection support this conclusion (2.33% of cases with a claim for HIV vs. 0.13% of controls, yielding a crude odds ratio of 18). However, these prevalence estimates based on antecedent Medicare claims substantially underestimate the true HIV prevalence. Notably, for the cases, many additional HIV diagnoses are present in the Medicare claims data in the months at or after Kaposi sarcoma diagnosis, reflecting both newly diagnosed infections (prompted by HIV testing after recognition of the cancer) and initial claims for previously recognized HIV infection. Indeed, the 14 HIV claims prior to diagnosis represent only 20% of all such claims among the cases. In contrast, controls have few additional claims documenting HIV infection at or after their selection date, reflecting an absence of specific medical attention and testing relative to their arbitrary selection date. To avoid differential exposure assessment between cases and controls, one must utilize only the HIV diagnoses prior to case diagnosis/control selection, even though this approach results in a marked underascertainment of HIV (particularly for the cases). In turn, this nondifferential underascertainment leads to a bias toward the null in the magnitude of association with exposures of interest.
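The crude odds ratio quoted above can be checked directly from the two reported prevalences. The short sketch below assumes only the rounded percentages given in the text (2.33% of cases and 0.13% of controls), so the result is approximate.

```python
def crude_odds_ratio(p_cases, p_controls):
    """Odds ratio computed from exposure prevalence in cases and controls."""
    odds_cases = p_cases / (1 - p_cases)
    odds_controls = p_controls / (1 - p_controls)
    return odds_cases / odds_controls


# Prevalence of an antecedent HIV claim, as reported in the text.
or_hiv = crude_odds_ratio(0.0233, 0.0013)
print(round(or_hiv))  # about 18, in line with the reported crude odds ratio
```

Because HIV claims are rare in both groups, the odds ratio is close to the simple ratio of the two prevalences.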

Table 6.
US Medicare Claims Documenting HIV Infection Among Kaposi Sarcoma Cases and Controls, 1973–2007a

Another important limitation in exposure assessment arises from the restricted window available before diagnosis/selection in which to assess Medicare claims. Specifically, claims data are not available prior to age 65 years (rarely, data are available from younger ages if the person was covered due to end-stage renal disease or disability) or 1986 for inpatient data in MEDPAR (NCH and OUTPT data begin in 1991). In addition, for cases and controls from 2003 to 2005, only claims in 1998 and after can be assessed. One may evaluate associations with time since first Medicare claim as a proxy for duration of exposure (i.e., latency), and increasing risk with increasing duration can be taken as evidence for an etiologic relation. However, for many exposures, the interval based on claims data is only a rough proxy, because the data cover only a limited time period, and it is usually not possible to determine when the exposure was first present. Furthermore, if the effect of an exposure on cancer risk is greatest soon after onset of the exposure (e.g., a new user effect for medication), evaluation of claims data that capture mostly long-term exposures will lead to an underestimate of the association (21).

While reliance on Medicare data somewhat restricts the duration over which associations can be assessed, the time window is often quite long. For example, among the controls shown in Table 4, the median duration for which claims were available prior to selection was 8.0 years, and 30.1% had 11 years or more. Nonetheless, cases and controls selected at young ages or in early calendar years will have more limited claims data, and less opportunity to be identified as exposed to the risk factor of interest, than subjects selected at older ages or later in calendar time. Depending on the exposure of interest, there may be a minimum duration of available claims data required to be reasonably certain of capturing the exposure, which would then entail eliminating subjects who are younger or from earlier calendar years. Our approach to control selection matches them to the cases according to age and calendar year, so that the lack of sensitivity in exposure assessment that arises from the limited duration of claims data is nondifferential.

Although we focused on a case-control design, other options can be utilized with the SEER-Medicare database. One is a case-cohort study design, considering all cancer cases in the Medicare population along with the reconstructed 5% random subcohort. Using incidence density sampling, it would also be possible to individually match controls to cases to create a “nested” case-control study (i.e., nested in the Medicare cohort). However, given the extremely large number of subjects, both approaches are computationally challenging. The case-cohort design requires repeated evaluation of exposure information (i.e., prior medical conditions or treatments based on Medicare claims) in each successive risk set. For the nested case-control study, the computational burden associated with individual control selection and analyses using conditional logistic regression could be substantial, and this approach would only be feasible for cancers where the number of cases is not too large. Nonetheless, these case-cohort and case-control approaches would be expected to yield equivalent measures of association.

In closing, we encourage investigators to utilize SEER-Medicare data, which can be readily obtained, to conduct studies evaluating risk factors for cancer. Such studies have compelling strengths, including the availability of large population-based samples of cancer cases and representative controls. There are also important challenges, particularly related to the limitations and complexities of claims data, and we hope our discussion will facilitate appropriate study design and analysis.


Author affiliations: Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland (Eric A. Engels, Ruth M. Pfeiffer); Information Management Services, Rockville, Maryland (Winnie Ricker, William Wheeler, Ruth Parsons); and Division of Cancer Control and Population Sciences, National Cancer Institute, Rockville, Maryland (Joan L. Warren).

This research was supported by the National Cancer Institute.

The authors acknowledge the efforts of the Applied Research Program, National Cancer Institute; the Office of Research, Development, and Information, Centers for Medicare and Medicaid Services; Information Management Services, Inc.; and the Surveillance, Epidemiology, and End Results (SEER) Program tumor registries in the creation of the SEER-Medicare database.

The interpretation and reporting of these data are the sole responsibility of the authors.

Conflict of interest: none declared.



Abbreviations
HIV, human immunodeficiency virus
HMO, health maintenance organization
MEDPAR, medical provider analysis and review
NCH, national claims history
PEDSF, patient entitlement and diagnosis summary file
SEER, Surveillance, Epidemiology, and End Results
SUMDENOM, summarized denominator


Variance Calculation for Polytomous Logistic Regression

Our variance calculation was previously presented in Quinlan et al. (12) and modifies an approach first described in Anderson et al. (8). Let Y = (Y0, Y1, Y2, …, YK) denote the outcome variable in a nested case-control study comprising one control group and K case groups. We use indicator notation, that is, Y0 = 1 if the person is a control and 0 otherwise; and Yi = 1 if the person is a case of type i and 0 otherwise, i = 1, …, K. We use polytomous logistic regression to compare each case group with the controls, by modeling:

$$P(Y_j = 1 \mid X) = \frac{\exp(\theta_j^{T} X)}{\sum_{k=0}^{K} \exp(\theta_k^{T} X)}, \qquad j = 0, 1, \ldots, K,$$

for the covariate vector X = [1, X1, …, Xm], which includes a one for the intercept term. As $\sum_{j=0}^{K} P(Y_j = 1 \mid X) = 1$, we assume θ0 = [0, …, 0]. We then use maximum likelihood estimation to obtain the log odds ratio estimates θj = [θj1, θj2, …, θjm], j = 1, …, K, for the jth case type in the polytomous logistic model.
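The model described above can be illustrated with a minimal Python sketch that evaluates the category probabilities for one subject, with the control coefficients fixed at zero. The function name `polytomous_probs`, the example coefficients, and the choice to carry an intercept coefficient in each θj are assumptions made for illustration and are not taken from the article.

```python
import math

def polytomous_probs(x, thetas):
    """Return [P(Y_0=1|x), ..., P(Y_K=1|x)] with theta_0 fixed at zero.

    x      -- covariate vector, including the leading 1 for the intercept
    thetas -- list of K coefficient vectors, one per case type (hypothetical values)
    """
    # Linear predictors; the control group (j = 0) has theta_0 = 0, so eta_0 = 0.
    etas = [0.0] + [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(etas)
    weights = [math.exp(e - m) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

# One binary exposure (x = [intercept, exposure]) and K = 2 case types.
probs = polytomous_probs([1.0, 1.0], [[-2.0, 0.5], [-3.0, 1.0]])
```

Because θ0 = 0, the ratio `probs[j] / probs[0]` equals exp(θj'X), which is why each θj can be read as a set of log odds ratios comparing case type j with the controls.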

Although the corresponding covariance estimator accounts for the fact that the same control group is used for each disease subtype comparison, we additionally need to consider that, because of constraints in our subcohort, a substantial number of individuals were sampled multiple times as controls, and that some individuals were sampled as controls before developing disease and becoming cases. Let the covariance matrix of the maximum likelihood estimates of the log odds ratio parameters be denoted by Σ. For each study subject i, we obtain the scores Si = (Si1, …, SiK) from each of the K polytomous logistic regression models. For example, for subject i, the score for model j, or, equivalently, for θj, is given by Sij = Xij[Yij − P(Yij = 1|Xij, θj)]. We define the matrix of scores for n study subjects as

$$S = \begin{pmatrix} S_{11} & S_{12} & \cdots & S_{1K} \\ \vdots & \vdots & & \vdots \\ S_{n1} & S_{n2} & \cdots & S_{nK} \end{pmatrix}.$$

Control subjects have entries in every column of the score matrix S, as they contribute to all logistic models. Individuals who served as controls before they were selected as cases also contribute to several logistic models. Using the above notation, the asymptotic variance of the estimates (θ1, …, θK) is given by ΣBΣ. B is estimated by the following equation:

$$\hat{B} = \sum_{i} \Big( \sum_{r} S_{i}^{(r)} \Big)^{T} \Big( \sum_{r} S_{i}^{(r)} \Big),$$

where the outer sum is over individuals i, and the inner sums are over the repeated measurements r on the same person, with $S_{i}^{(r)}$ denoting the row of scores for the rth measurement on person i.
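The variance calculation described above can be sketched in plain Python under stated assumptions: each sampled record carries a stacked score vector, records from the same person are summed before outer products are taken, and the result is pre- and post-multiplied by the model-based covariance Σ. The helper names (`cluster_meat`, `matmul`, `sandwich`), the toy scores, and the toy Σ are hypothetical and serve only to illustrate the form ΣBΣ; this is not the authors' implementation.

```python
from collections import defaultdict

def cluster_meat(scores, person_ids):
    """B = sum over persons of (summed score)^T (summed score)."""
    p = len(scores[0])
    totals = defaultdict(lambda: [0.0] * p)
    # Inner sum: add up the score rows of all records belonging to one person.
    for s, pid in zip(scores, person_ids):
        row = totals[pid]
        for a in range(p):
            row[a] += s[a]
    # Outer sum: accumulate the outer product of each person's summed score.
    B = [[0.0] * p for _ in range(p)]
    for t in totals.values():
        for a in range(p):
            for b in range(p):
                B[a][b] += t[a] * t[b]
    return B

def matmul(A, C):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[i][k] * C[k][j] for k in range(len(C)))
             for j in range(len(C[0]))] for i in range(len(A))]

def sandwich(Sigma, B):
    """Robust covariance Sigma * B * Sigma."""
    return matmul(matmul(Sigma, B), Sigma)

# Toy data: person "a" was sampled twice, person "b" once.
scores = [[1.0, 0.0], [0.5, -1.0], [-1.5, 1.0]]
person_ids = ["a", "a", "b"]
Sigma = [[0.1, 0.0], [0.0, 0.1]]  # hypothetical model-based covariance

B = cluster_meat(scores, person_ids)
V = sandwich(Sigma, B)
```

Summing within person before taking outer products is what lets repeated sampling of the same individual inflate (or deflate) the variance; treating the three records as independent would give a different B.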


1. CDC WONDER. United States cancer statistics, 1999–2005 mortality archive request. Atlanta, GA: Centers for Disease Control and Prevention, US Department of Health and Human Services; 2008. (Accessed September 2, 2010).
2. Renehan AG, Tyson M, Egger M, et al. Body-mass index and incidence of cancer: a systematic review and meta-analysis of prospective observational studies. Lancet. 2008;371(9612):569–578.
3. Tucker MA, D'Angio GJ, Boice JD, Jr, et al. Bone sarcomas linked to radiotherapy and chemotherapy in children. N Engl J Med. 1987;317(10):588–593.
4. Grulich AE, van Leeuwen MT, Falster MO, et al. Incidence of cancers in people with HIV/AIDS compared with immunosuppressed transplant recipients: a meta-analysis. Lancet. 2007;370(9581):59–67.
5. Ekström Smedby K, Vajdic CM, Falster M, et al. Autoimmune disorders and risk of non-Hodgkin lymphoma subtypes: a pooled analysis within the InterLymph Consortium. Blood. 2008;111(8):4029–4038.
6. Saito I, Miyamura T, Ohbayashi A, et al. Hepatitis C virus infection is associated with the development of hepatocellular carcinoma. Proc Natl Acad Sci U S A. 1990;87(17):6547–6549.
7. Ekbom A, Helmick C, Zack M, et al. Ulcerative colitis and colorectal cancer. A population-based study. N Engl J Med. 1990;323(18):1228–1233.
8. Anderson LA, Pfeiffer R, Warren JL, et al. Hematopoietic malignancies associated with viral and alcoholic hepatitis. Cancer Epidemiol Biomarkers Prev. 2008;17(11):3069–3075.
9. Anderson LA, Landgren O, Engels EA. Common community acquired infections and subsequent risk of chronic lymphocytic leukaemia. Br J Haematol. 2009;147(4):444–449.
10. Anderson LA, Gadalla S, Morton LM, et al. Population-based study of autoimmune conditions and the risk of specific lymphoid malignancies. Int J Cancer. 2009;125(2):398–405.
11. Anderson LA, Pfeiffer RM, Landgren O, et al. Risks of myeloid malignancies in patients with autoimmune conditions. Br J Cancer. 2009;100(5):822–828.
12. Quinlan SC, Morton LM, Pfeiffer RM, et al. Increased risk for lymphoid and myeloid neoplasms in elderly solid-organ transplant recipients. Cancer Epidemiol Biomarkers Prev. 2010;19(5):1229–1237.
13. Chang CM, Quinlan SC, Warren JL, et al. Blood transfusions and the subsequent risk of hematologic malignancies. Transfusion. 2010;50(10):2249–2257.
14. Lanoy E, Engels EA. Skin cancers associated with autoimmune conditions among elderly adults. Br J Cancer. 2010;103(1):112–114.
15. Lanoy E, Costagliola D, Engels EA. Skin cancers associated with HIV infection and solid-organ transplantation among elderly adults. Int J Cancer. 2010;126(7):1724–1731.
16. Warren JL, Klabunde CN, Schrag D, et al. Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population. Med Care. 2002;40(8 suppl):IV-3–IV-18.
17. Katz JN, Barrett J, Liang MH, et al. Sensitivity and positive predictive value of Medicare Part B physician claims for rheumatologic diagnoses and procedures. Arthritis Rheum. 1997;40(9):1594–1600.
18. Fowles JB, Lawthers AG, Weiner JP, et al. Agreement between physicians’ office records and Medicare Part B claims data. Health Care Financ Rev. 1995;16(4):189–199.
19. Kiyota Y, Schneeweiss S, Glynn RJ, et al. Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: estimating positive predictive value on the basis of review of hospital records. Am Heart J. 2004;148(1):99–104.
20. Klabunde CN, Harlan LC, Warren JL. Data sources for measuring comorbidity: a comparison of hospital records and Medicare claims for cancer patients. Med Care. 2006;44(10):921–928.
21. Ray WA. Evaluating medication effects outside of clinical trials: new-user designs. Am J Epidemiol. 2003;158(9):915–920.

Articles from American Journal of Epidemiology are provided here courtesy of Oxford University Press