Much attention is currently directed at identifying sub-types of cancers that are genetically and clinically distinct. The expectation is that sub-typing on the basis of somatic genomic characteristics will supplant traditional pathological sub-types with respect to relevance for targeted therapies and clinical course. Less attention has been paid to the goal of validating sub-types on the basis of the distinctiveness of their etiologies. In this article it is shown that studies of individuals with double primary malignancies provide uniquely valuable information for establishing the etiologic distinctiveness of candidate tumor sub-types. Studies of double primaries have the potential to definitively rank candidate taxonomic systems with respect to their etiological relevance by determining which sub-types are most highly correlated in the double primaries. The concept is illustrated with data from studies of the concordance of estrogen receptor (ER) and progesterone receptor (PR) status in bilateral breast cancers, where it is shown that double primaries are much more likely to be concordant with respect to ER status than with respect to PR status. The high concordance of ER status is consistent with a growing literature demonstrating the etiologic distinctiveness of ER+ and ER- tumors.
Much attention is currently being directed at the goal of classifying cancers into distinct molecular sub-types, using extensive genomic data that distinguishes the somatic characteristics of these sub-types.1,2 It is anticipated that sub-classifications based on molecular characteristics will lead to a better understanding of cancer biology, and better avenues for determining appropriate targeted therapies.3,4 Efforts to validate the relevance of candidate sub-types have usually focused on establishing clinical distinctiveness on the basis of, say, different survival for patients in the sub-types, or by merely showing that the genomically determined sub-types differ systematically with respect to conventional pathologic criteria. It is reasonable to speculate, however, that sub-types that are genuinely biologically distinct are likely also to possess distinct etiologies. Conventionally, epidemiologists have examined the potential etiologic heterogeneity of cancer sub-types by comparing the candidate sub-types with respect to the presence of known risk factors for the cancer in question.5 Since it is widely believed that many risk factors for cancer, especially genetic factors, have not yet been identified, this strategy is necessarily limited in scope. The thesis of this article is that optimal sub-typing of cancer from an etiologic perspective can be accomplished without the need to know any risk factors, merely by studying the co-occurrence of the sub-types in series of patients with double primary malignancies.
The ageing of the population, allied to a gradual increase in survival following cancer, has led to a rapid increase in the occurrence of second primary cancers. In fact, over 16% of all cancers reported to the SEER registries in the USA are second primaries.6 This has led to recognition of second primaries as a special resource for epidemiologic studies, especially where the two primaries have occurred in the same organ.7–10 A unique feature of the occurrence of a double primary is the opportunity it affords for comparison of characteristics of the two tumors. Thus investigators have been interested in whether independently occurring cancers in an individual patient are more likely to be similar with regard to tumor pathologic characteristics. For example, many investigators have endeavored to correlate the estrogen receptor (ER) status of pairs of contralateral breast cancers.11–21 These studies have generally shown a fairly high correlation. That is, if the patient’s first cancer is ER+ then the second cancer is much more likely to be ER+ as well. This has led to speculation about the causes of this aggregation, but such speculation is hampered by the absence of a conceptual structure for interpreting these results in the context of cancer risk.
In this article a conceptually simple mathematical structure for interpreting the results of studies of tumor sub-classifications between double primaries is presented, allowing unique insight into the etiologic heterogeneity of tumor sub-classifications. A framework is provided for comparing the relative merits of competing classification systems, such as those based on traditional gross pathology and systems based on, for example, selected tumor markers or clustering of genome-wide array patterns. The results also provide insight into study design strategies for identifying new cancer risk factors and into the interpretation of existing case-control studies, and they offer a basic framework for interpreting the interplay of germ-line and somatic genomic profiles.
The fundamental premise is that the degree of correlation of the sub-types of independent double primary tumors is directly related to the degree of risk heterogeneity in the population with respect to the sub-types. This relationship can be expressed by a relatively simple equation, but first we need to define and explain the terms that are used.
Consider for simplicity two tumor sub-types, A and B. The relevant data available from a study of double primaries is a simple cross-tabulation of the frequencies of occurrence of these sub-types in the two tumors from each patient. Let these frequencies be denoted by $F_{AA}$, $F_{BB}$, $F_{AB}$ and $F_{BA}$ (Table 1), where, for example, $F_{AA}$ is the number of patients in which both tumors are of type A, $F_{AB}$ is the number of patients where the first tumor is of type A and the second tumor is of type B, etc. For reasons that are explained in detail in the Statistical Appendix, the key measure of association is the odds ratio. The odds ratio from the 2×2 table cross-classifying the tumor types is defined as the parameter $\psi$, and its estimate is given by

$$\hat{\psi} = \frac{F_{AA} F_{BB}}{F_{AB} F_{BA}}. \qquad (1)$$
Our goal is to relate the preceding odds ratio to the risk heterogeneity of the two sub-types. Conceptually, each individual in the population can be classified into one of a large number of risk categories, where people in a risk category all possess similar risk of the disease under investigation. On this basis we can define terms that characterize the variation in risk among individuals in the population. Specifically, we can define $K^2$ to be the population coefficient of risk variation for the cancer type overall. This is the variance of the risks of individuals in the population divided by the square of the mean risk. We can define analogous terms that characterize the variation in the risks for each of the tumor sub-types, and the degree to which these risk profiles are aligned. Specifically, we define $K_A^2$ to be the population coefficient of risk variation for tumor sub-type A, we define $K_B^2$ to be the coefficient of risk variation for tumor sub-type B, and we define $K_C$ to be a corresponding standardized term for the risk covariance, representing the degree to which risks of A and B are aligned from person to person. Thus $K_C$ is a natural term for characterizing the degree of risk heterogeneity between tumor sub-types A and B. These terms represent the true variations in risk among individuals in the population, though they cannot be observed directly without complete and perfect knowledge of all of the factors that influence risk, both known and unknown.
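As a minimal illustration of these definitions, the sketch below computes the four coefficients for a made-up population with three risk categories. The prevalences and risks are hypothetical and chosen only to keep the arithmetic easy to follow. The function name `risk_coefficients` is illustrative, and the formulas assume the definitions of the Statistical Appendix (variance or covariance of risk divided by the product of the relevant mean risks).

```python
def risk_coefficients(p, r_a, r_b):
    """Risk variation coefficients for a discrete risk distribution.

    p   : prevalence of each risk category (should sum to 1)
    r_a : risk of sub-type A in each category
    r_b : risk of sub-type B in each category
    Returns (K2, KA2, KB2, KC).
    """
    def mean(values):
        # Population mean, weighting each category by its prevalence
        return sum(pi * v for pi, v in zip(p, values))

    r = [a + b for a, b in zip(r_a, r_b)]            # overall risk per category
    mu, mu_a, mu_b = mean(r), mean(r_a), mean(r_b)
    k2 = mean([x * x for x in r]) / mu**2 - 1        # variance / mean^2, overall
    ka2 = mean([a * a for a in r_a]) / mu_a**2 - 1   # same for sub-type A
    kb2 = mean([b * b for b in r_b]) / mu_b**2 - 1   # same for sub-type B
    # Covariance of the A and B risks divided by mu_A * mu_B
    kc = mean([a * b for a, b in zip(r_a, r_b)]) / (mu_a * mu_b) - 1
    return k2, ka2, kb2, kc


# Hypothetical population with three risk categories (illustrative values only)
p = [0.5, 0.3, 0.2]
r_a = [0.01, 0.02, 0.05]
r_b = [0.02, 0.01, 0.01]
k2, ka2, kb2, kc = risk_coefficients(p, r_a, r_b)
print(k2, ka2, kb2, kc)
```

In this toy population the covariance coefficient comes out negative, because the category with the highest risk of A has a low risk of B.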
The key result is that the observed association (odds ratio) between tumor types among double primaries is directly related to the degree of risk covariation in the population. Specifically,

$$\psi = \frac{(1+K_A^2)(1+K_B^2)}{(1+K_C)^2}. \qquad (2)$$
That is, $\psi$ is inversely related to $K_C$, and so $\psi$ can equivalently be used as a measure of risk heterogeneity, with the advantage that, unlike $K_C$, it can be observed directly. The conceptual simplicity of this result relies on some key approximating assumptions. The two tumors in each patient must be biologically independent, an assumption that is degraded to the extent that metastases are misdiagnosed as second primaries. It is also assumed that the two cancer occurrences are experimental replicates, driven by an identical constellation of genetic and environmental risk factors possessed by the patient, an assumption that is necessarily an approximation. Further discussion of the credibility of these and other assumptions, along with a proof of the mathematical result, is provided in the Statistical Appendix.
To understand the implications of equation (2), consider some special cases. First consider the circumstance in which the risk profiles of the two sub-types are perfectly correlated. This is the case if, for every risk category, the risk of tumor type A is directly proportional to the risk of tumor type B. In this setting it is easily shown that $\psi = 1$. In other words, independence in the occurrence of A and B tumors (an odds ratio of 1) corresponds to perfect alignment (a perfect correlation of 1) of their risk profiles. The extreme opposite would occur in the improbable context of risk exclusivity. This occurs if a person who has a positive risk for tumor A necessarily has zero risk of tumor B, and vice versa. In this case the odds ratio is infinite. This result can be understood by recognizing that risk exclusivity would imply that only double primaries of type AA or type BB could occur, since no person is at risk for both A and B tumors, and so the denominator of the odds ratio in equation (1) is inevitably zero. The more important point to recognize is that the more negatively correlated the population risk profiles, i.e. the more heterogeneous the risk profiles, the greater the odds ratio observed in the tumor types of double primaries. Finally, consider the situation in which the risks of the two tumor sub-types are simply uncorrelated in the population, i.e. the linear correlation coefficient is zero. In this case $K_C = 0$ and so $\psi = (1+K_A^2)(1+K_B^2)$. In this setting the magnitude of the odds ratio between tumor sub-types is determined by a combination of the underlying degrees of risk variation of the two tumor sub-classifications.
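These special cases can be reproduced numerically. The sketch below assumes, as in the Statistical Appendix, that the two tumors are replicates driven by the same personal risks, in which case the expected odds ratio reduces to a ratio of risk moments. The function name and all three populations are hypothetical and serve only to illustrate the three scenarios.

```python
from math import inf


def concordance_odds_ratio(p, r_a, r_b):
    """Expected concordance odds ratio psi for a discrete risk distribution.

    Under the replicate assumption psi reduces to
    sum(p*rA^2) * sum(p*rB^2) / sum(p*rA*rB)^2.
    """
    m_aa = sum(pi * a * a for pi, a in zip(p, r_a))
    m_bb = sum(pi * b * b for pi, b in zip(p, r_b))
    m_ab = sum(pi * a * b for pi, a, b in zip(p, r_a, r_b))
    # No one is at risk of both sub-types: discordant pairs cannot occur
    return inf if m_ab == 0 else m_aa * m_bb / m_ab**2


# 1. Proportional risks (risk of A is three times the risk of B everywhere)
proportional = concordance_odds_ratio(
    [0.5, 0.3, 0.2], [0.03, 0.06, 0.12], [0.01, 0.02, 0.04])

# 2. Risk exclusivity (nobody has positive risk of both A and B)
exclusive = concordance_odds_ratio(
    [0.5, 0.3, 0.2], [0.02, 0.0, 0.05], [0.0, 0.03, 0.0])

# 3. Uncorrelated risks (A risk and B risk vary independently)
uncorrelated = concordance_odds_ratio(
    [0.25] * 4, [0.01, 0.01, 0.03, 0.03], [0.01, 0.02, 0.01, 0.02])

print(proportional, exclusive, uncorrelated)
```

In the third population the A risks have a squared coefficient of variation of 0.25 and the B risks one of 1/9, so the odds ratio is the product of 1.25 and 10/9 even though the two risks are uncorrelated.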
These concepts can be illustrated in the context of breast cancer where numerous studies correlating pathologic characteristics of bilateral breast cancers have been published. A particular aspect of breast tumors that has garnered a lot of attention is the concordance of hormone receptor levels. A literature search was undertaken to identify articles in which cross-tabulated frequencies of either ER or PR status, or both, were reported in women with contra-lateral breast cancer. This involved initially a Medline search using the key words “receptor” “contralateral” and “breast”. This was followed up by examining articles cited in those identified in the Medline search. This process identified 10 articles in which data were presented in sufficient detail for our purposes.11–20 An especially large study was identified where the cross-tabulated data were not presented, and these data were kindly supplied by the author upon request.21 The results were initially evaluated for heterogeneity using the Monte Carlo version of the Breslow and Day statistic, and based on these results summary odds ratios and confidence intervals were calculated using the Mantel-Haenszel method.
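For readers who wish to reproduce this kind of pooling, the following is a minimal sketch of a Mantel-Haenszel summary odds ratio across several 2×2 concordance tables. The counts are hypothetical and are not the data of Table 2. The confidence interval uses the Robins-Greenland-Breslow variance of the log odds ratio, which is one common choice; the article does not state which variance estimator was used, and the heterogeneity test is not reproduced here.

```python
from math import exp, log, sqrt


def mantel_haenszel_or(tables, z=1.96):
    """Mantel-Haenszel summary odds ratio with an approximate 95% CI.

    tables : list of (f_aa, f_ab, f_ba, f_bb) counts, one tuple per study
    Returns (summary_or, lower, upper). Assumes at least one study has
    both concordant and discordant pairs, so no denominator is zero.
    """
    r_sum = s_sum = pr = ps_qr = qs = 0.0
    for f_aa, f_ab, f_ba, f_bb in tables:
        n = f_aa + f_ab + f_ba + f_bb
        r = f_aa * f_bb / n           # concordant cross-product term
        s = f_ab * f_ba / n           # discordant cross-product term
        p = (f_aa + f_bb) / n         # share of concordant pairs
        q = (f_ab + f_ba) / n         # share of discordant pairs
        r_sum += r
        s_sum += s
        pr += p * r
        ps_qr += p * s + q * r
        qs += q * s
    or_mh = r_sum / s_sum
    # Robins-Greenland-Breslow variance of log(OR_MH)
    var_log = (pr / (2 * r_sum**2)
               + ps_qr / (2 * r_sum * s_sum)
               + qs / (2 * s_sum**2))
    half_width = z * sqrt(var_log)
    return or_mh, exp(log(or_mh) - half_width), exp(log(or_mh) + half_width)


# Hypothetical counts for three studies (NOT the data of Table 2)
studies = [(40, 10, 12, 15), (25, 8, 6, 9), (60, 14, 16, 20)]
or_mh, lower, upper = mantel_haenszel_or(studies)
print(or_mh, lower, upper)
```

With a single table the summary estimate reduces to the ordinary cross-product odds ratio, which gives a quick check on the implementation.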
The data are presented in Table 2. A strong association for ER status is identified, with a summary odds ratio of 5.2 (95% CI 3.8–7.2). In contrast, the association of PR status in the two tumors is seen to be much more modest (OR=2.1, 95% CI 1.6–2.9). These results indicate that the ER status of breast cancers has etiologic relevance, and that sub-groups of breast cancers characterized by ER status should possess distinct risk factors. Conversely, there is considerably less evidence to support the hypothesis that classifying breast cancers by PR status would be likely to be fruitful in identifying distinct risk factors. Note that the study by Weitzel et al.20 was excluded because the observed odds ratio was substantially higher than in the other studies, and because this study was conducted solely in BRCA1/2 carriers. Inclusion of this study would further accentuate the distinction observed between the summary odds ratios for ER and PR.
The explosion of information about the extent and nature of somatic mutations in cancers during the past several years has led to a renewed interest in the pathologic sub-classification of tumors. Many studies have been conducted that endeavor to use, for example, genome-wide data to sub-classify tumors on the basis of their somatic molecular characteristics.22,23 These molecular classifications have challenged the notion that traditional pathologic criteria are the most relevant for classifying tumors. Tumors can be classified on the basis of the presence of a somatic mutation in a single gene, such as TP53, or a small number of distinct candidate genes. Alternatively, hierarchical clustering of expression array data can reveal apparent clustering of tumors suggesting distinct sub-classifications. Investigators studying these phenomena have traditionally used clinical criteria, such as case survival, to try to validate the relevance of any postulated classification system, although validation presents its own challenges.24 The thesis of this article is that genuine tumor sub-classifications are likely to be etiologically distinct, and that etiologic heterogeneity is, in and of itself, an important criterion for validating the relevance of any candidate sub-classification system.
How do we test for etiologic heterogeneity of tumor sub-classifications? Classically this is accomplished in case-control studies where the odds ratios of candidate or known risk factors are compared between the tumor sub-categories. Although comparison of each subtype with a common control group is frequently employed for this purpose, it has been shown that the most direct and efficient way to identify etiologic heterogeneity of individual risk factors is simply by comparing the risk profiles of the sub-classifications without the need for population controls in a case-only design.25 This strategy is necessary to identify individual risk factors that have distinctive influences on the risk heterogeneity. However, the theory presented in this article has shown that a more global assessment of etiologic heterogeneity is possible prior to conducting case-control or case-only studies to identify the sources of the heterogeneity. That is, detailed analyses of the tumor characteristics of double primaries can, in principle, provide important insights for planning and analyzing future epidemiologic investigations. By examining in a series of double primaries the odds ratios of various candidate tumor sub-classifications based on somatic molecular profiling and/or visible pathologic or clinical characteristics, and by identifying the classification with the largest odds ratio, future epidemiologic investigations could be based on searching for risk factors that provide distinctive effects on the disease sub-categories so identified, or indeed by examining these sub-categories in separate studies.
However, this strategy does have limitations. In particular, one needs studies of double primary malignancies where the tissue from both tumors is available in order that the necessary pathological classifications can be accomplished on both tumors. The examples presented of breast cancers typically involve relatively small studies in settings where the ingredients of the sub-classification (ER, PR and histology) are collected routinely. Prospective studies of new approaches based on, say, genome-wide arrays would involve logistical challenges to collect the necessary specimens in sufficient numbers and quality to accomplish the sub-classifications of the pairs of tumors. In addition to these technical and logistical challenges, the proposed method can really only be contemplated for cancer sites where the occurrence of independent second primaries is relatively common. This certainly includes breast cancer and melanoma, and possibly sites such as lung and colorectal, though care would be necessary to identify genuinely independent second primaries. Studies of paired organs such as the ovary, the testicle and the kidney are also feasible in principle, though limited by the rarity of occurrence of double malignancies. These sample size limitations also affect the feasibility of studying rare tumor sub-types.
The data from studies of breast cancer show a strong concordance of double primaries on the basis of ER status. This result is consistent with a growing literature of epidemiologic studies that have identified specific reproductive risk factors that differ markedly in their effect on the risks of ER+ versus ER- breast cancers.26–28 Studies have also been conducted that point to genetic factors that distinguish ER+ from ER- tumors.29,30 Interestingly, it has also been shown that risk factors distinguish breast cancer histologic types.31 Investigators have also studied the etiologic heterogeneity of more refined molecular sub-types, such as those based on HER2 expression in addition to ER and PR status,32,33 categories originally suggested by expression profiling.34 A more detailed investigation in a large study of paired contralateral breast tumors would be necessary to determine the combination of receptor status, histology, and possibly other tumor characteristics that provide the axes on which the etiologic heterogeneity of breast cancer is best represented.
The theory on which the results are based also has technical limitations. It involves the premise that cancer occurrence is a fundamentally stochastic phenomenon that is influenced by genetic and environmental risk propensities that are unique to the individual. We must assume that this individual cancer propensity is the predominant influence on the risk of both the first cancer and the second cancer. This allows us to assume that the two occurrences are essentially experimental replicates, which in turn allows us to infer the degree of person-to-person risk variation in the population (see the Statistical Appendix for further details). Further, the assumption allows us to infer indirectly the risk covariance between the sub-classes, i.e. the degree of risk heterogeneity. This assumption is clearly not literally correct, and it could be perturbed by issues such as the differential impact of treatment for the first primary on the risk of a second primary of the different sub-types, differences in case survival of the sub-types, changing underlying risk due to the aging of the individual, and diagnostic errors, such as misclassifications of metastases as second primaries (or vice versa). Indeed, in the literature on ER status in breast cancer the authors of the studies reported in Table 2 were frequently focused on the impact of hormonal treatment on the receptor status of the second primary. Also, although it is conventional to classify contralateral breast cancers as “independent” second primaries, in fact this is not a settled issue. In studies of the clonal relatedness of contralateral breast tumors the mutational profiles of synchronous tumors often appear more similar than for metachronous tumors (see for example Imyanitov et al.35). 
In fact, in the studies presented in Table 2, the ER/PR profiles of synchronous cases generally exhibited greater association than for the metachronous cases presented in the table (data not shown), suggesting that a proportion of contralateral breast cancers may actually be metastases from the first primary, with this proportion being higher for more contemporaneous (i.e. synchronous) occurrences. These potential problems should dissuade us from over-interpreting the magnitude of observed concordance odds ratios. However, the thesis is that the approach is nonetheless valuable as a tool for identifying risk heterogeneity from a broad brush perspective, and for comparing and ranking classification systems on their concordances.
In summary, observation of the relative frequencies of concordances and discordances of sub-types of double primary malignancies provides a unique opportunity to gain insight into cancer epidemiology. The degree of concordance provides direct evidence of the global risk heterogeneity of the sub-categories, providing experimental validation of the etiologic relevance of the sub-classification system. The presence of strong etiologic heterogeneity points to the need for epidemiologic investigations that involve separate study of the sub-types to search for the distinct risk factors (or distinct effects of individual risk factors) that are causing the heterogeneity. Failure to conduct stratified epidemiologic studies in the presence of strong etiologic heterogeneity inevitably diminishes the sensitivity and statistical power of the study to detect important risk factors whose effects are different for the sub-types. Careful evaluation of the concordance of tumor characteristics in double primaries should be an important tool in the on-going effort to uncover the causes of cancer.
I am grateful to Malcolm Pike for insightful comments on an early draft of this article, to Monica Brown for supplying detailed data from her study using the California Cancer Registry, and to Erica Schubert for help with the literature review. This work was supported by the National Cancer Institute at the National Institutes of Health (CA124504 and CA131010).
Every individual in the population can be classified into one of many risk categories based on the magnitude of the risk. Let the prevalence of the $i$th risk category be $p_i$, where $\sum_i p_i = 1$. Let the cancer risk for individuals in the $i$th category be $r_i$, where this risk is the sum of the cancer risks of the tumor sub-types, i.e. $r_i = r_{Ai} + r_{Bi}$, where $r_{Ai}$ is the risk of a cancer of type A, and $r_{Bi}$ is the risk of a cancer of type B. Further, let the corresponding mean risks be, respectively, $\mu$, $\mu_A$ and $\mu_B$, where $\mu = \mu_A + \mu_B$. We can represent the relative degree to which risk varies in the population by $K^2 = v/\mu^2$, the square of the coefficient of variation of the distribution of risks in the population, where $v = \sum_i p_i r_i^2 - \mu^2$. Similarly, the risk variation coefficients for sub-types A and B are denoted $K_A^2 = v_A/\mu_A^2$ and $K_B^2 = v_B/\mu_B^2$, where $v_A = \sum_i p_i r_{Ai}^2 - \mu_A^2$ and $v_B = \sum_i p_i r_{Bi}^2 - \mu_B^2$. Further, let $c$ be the covariance in the risk profiles, i.e. $c = \sum_i p_i r_{Ai} r_{Bi} - \mu_A \mu_B$, and define the risk covariance coefficient as $K_C = c/(\mu_A \mu_B)$.
These stratified population risks are the fundamental forces that drive the observed occurrences of cancer types A and B in patients who experience double malignancies. Defining $E_{AA} = E(F_{AA}/N)$, $E_{AB} = E(F_{AB}/N)$, etc. as the long-run expected frequencies of these co-occurrences, we can express these as a function of the underlying risk correlation structure as follows. Consider $E_{AA}$ first. $E_{AA}$ represents the probability that both the first and second tumors are of sub-type A, given that a double malignancy has been observed in the patient. This can be further decomposed into the probability that the first tumor is of type A, multiplied by the probability that the second tumor is of type A, given that the first is of type A. The probability that the first tumor is of type A is simply $\mu_A/\mu$. However, to calculate the probability associated with the second tumor we need to understand the concept of risk-biased sampling. Patients with tumors of type A are sampled in direct proportion to their risks of a type A tumor.36 If, say, people in risk group $i$ have twice the risk of those in risk group $j$, i.e. $r_{Ai} = 2r_{Aj}$, then they are twice as likely to be sampled (i.e. to have a cancer of type A occur). In fact, more generally, risk group $i$ will be represented among individuals with cancer of type A in proportion to $p_i r_{Ai}$. Thus in the risk profile for patients with A tumors, the population frequencies $p_1, p_2, p_3, \ldots$ are replaced by $q_{A1}, q_{A2}, q_{A3}, \ldots$ where $q_{Ai} = p_i r_{Ai} / \sum_i p_i r_{Ai}$. It follows that among patients with double primaries in which the first tumor was of type A, the probability that the second tumor is also of type A is

$$\frac{\sum_i q_{Ai} r_{Ai}}{\sum_i q_{Ai} r_i} = \frac{\sum_i p_i r_{Ai}^2}{\sum_i p_i r_{Ai}(r_{Ai} + r_{Bi})} = \frac{\mu_A(1+K_A^2)}{\mu_A(1+K_A^2) + \mu_B(1+K_C)},$$

where the last step uses $\sum_i p_i r_{Ai}^2 = \mu_A^2(1+K_A^2)$ and $\sum_i p_i r_{Ai} r_{Bi} = \mu_A \mu_B (1+K_C)$. Thus

$$E_{AA} = \frac{\mu_A}{\mu} \cdot \frac{\mu_A(1+K_A^2)}{\mu_A(1+K_A^2) + \mu_B(1+K_C)},$$

and by a similar derivation

$$E_{AB} = \frac{\mu_A}{\mu} \cdot \frac{\mu_B(1+K_C)}{\mu_A(1+K_A^2) + \mu_B(1+K_C)}, \quad E_{BA} = \frac{\mu_B}{\mu} \cdot \frac{\mu_A(1+K_C)}{\mu_B(1+K_B^2) + \mu_A(1+K_C)}, \quad E_{BB} = \frac{\mu_B}{\mu} \cdot \frac{\mu_B(1+K_B^2)}{\mu_B(1+K_B^2) + \mu_A(1+K_C)}.$$

From this it is easily shown that

$$\psi = \frac{E_{AA} E_{BB}}{E_{AB} E_{BA}} = \frac{(1+K_A^2)(1+K_B^2)}{(1+K_C)^2}.$$
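The chain of steps in this appendix can be checked numerically. The sketch below, which assumes a made-up three-category population, builds the risk-biased weights, forms the four expected frequencies, and compares the resulting odds ratio with the closed form $(1+K_A^2)(1+K_B^2)/(1+K_C)^2$ implied by the definitions above. Function and variable names are illustrative only.

```python
def expected_frequencies(p, r_a, r_b):
    """Expected shares (E_AA, E_AB, E_BA, E_BB) among double primaries."""
    mu_a = sum(pi * a for pi, a in zip(p, r_a))
    mu_b = sum(pi * b for pi, b in zip(p, r_b))
    mu = mu_a + mu_b

    def second_tumor_split(r_first):
        # Risk-biased sampling: category i is represented among patients
        # with a given first tumor type in proportion to p_i * r_first_i.
        total = sum(pi * rf for pi, rf in zip(p, r_first))
        q = [pi * rf / total for pi, rf in zip(p, r_first)]
        to_a = sum(qi * a for qi, a in zip(q, r_a))
        to_b = sum(qi * b for qi, b in zip(q, r_b))
        # Probabilities that the second tumor is A or B, given one occurs
        return to_a / (to_a + to_b), to_b / (to_a + to_b)

    a_then_a, a_then_b = second_tumor_split(r_a)
    b_then_a, b_then_b = second_tumor_split(r_b)
    return (mu_a / mu * a_then_a, mu_a / mu * a_then_b,
            mu_b / mu * b_then_a, mu_b / mu * b_then_b)


# Hypothetical population (illustrative values only)
p = [0.5, 0.3, 0.2]
r_a = [0.01, 0.02, 0.05]
r_b = [0.02, 0.01, 0.01]

e_aa, e_ab, e_ba, e_bb = expected_frequencies(p, r_a, r_b)
psi_from_e = e_aa * e_bb / (e_ab * e_ba)

# Closed form from the risk variation and covariance coefficients
mu_a = sum(pi * a for pi, a in zip(p, r_a))
mu_b = sum(pi * b for pi, b in zip(p, r_b))
k_a2 = sum(pi * a * a for pi, a in zip(p, r_a)) / mu_a**2 - 1
k_b2 = sum(pi * b * b for pi, b in zip(p, r_b)) / mu_b**2 - 1
k_c = sum(pi * a * b for pi, a, b in zip(p, r_a, r_b)) / (mu_a * mu_b) - 1
psi_closed = (1 + k_a2) * (1 + k_b2) / (1 + k_c) ** 2

print(psi_from_e, psi_closed)
```

The two values agree up to floating-point error, because the denominators of the four expected frequencies cancel in the odds ratio.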
It is important to recognize that the preceding mathematical structure is constructed for conceptual clarity, but clearly contains some important assumptions and approximations. The key assumption is that two cancers that occur in the same patient are experimental replicates. To make this assumption we must be confident that the tumors are biologically independent. This would not be the case if one of the tumors is actually a metastasis of the first tumor, mis-diagnosed as a second primary. There is a considerable recent literature on this evolving topic of investigation, but current thinking is that for some cancers, notably contralateral breast cancer and melanoma, we can be confident that most diagnosed second primaries are biologically independent, while for others metastases may frequently be misdiagnosed as second primaries.35,37–40 Clearly frequent misdiagnoses of this nature would inevitably inflate the apparent association of tumor sub-types. The notion of experimental replication also requires that we view the probability of occurrence of a first cancer in an individual as being the same as the probability of a second cancer, given the occurrence of the first. The idea here is that the cancer risk of any individual is approximately constant over the period in which the two cancers occur. This assumption is clearly not literally true, since cancer risk changes with age, and the second cancer necessarily occurs at a later age. However, the relatively few years that usually elapse between most observed double primaries suggests that the influence of age on this phenomenon is probably minor. A more serious concern is that treatment for the first primary, especially systemic medical treatment as opposed to surgery or radiotherapy, may alter the risk profile for the subsequent cancer.41 Finally we assume population-based incidence sampling of patients with a second cancer. 
In this way we can relate the occurrence frequencies directly with the cancer risks in the underlying population. Because of these issues we must view the resulting analysis as providing broad, overarching inferences that can be useful for planning research strategies, rather than an analysis that provides precise estimates of effect.
Novelty: The article provides a unique technical strategy to determine if candidate tumor sub-types are etiologically distinct.
Impact: The method has the ability to determine, among candidate tumor classification systems, which system is the most relevant for defining sub-types that are etiologically distinct.