Biometrics. Author manuscript; available in PMC 2009 October 13.
Published in final edited form as:
PMCID: PMC2761024

Comparison of Properties of Tests for Assessing Tumor Clonality


In a recent article Begg et al. (2007) proposed a statistical test to determine whether or not a diagnosed second primary tumor is biologically independent of the original primary tumor, by comparing patterns of allelic losses at candidate genetic loci. The proposed Concordant Mutations Test is a conditional test, an adaptation of Fisher’s Exact Test, that requires no knowledge of the marginal mutation probabilities. The test was shown to have generally good properties, but is susceptible to anti-conservative bias if there is wide variation in mutation probabilities between loci, or if the individual mutation probabilities of the parental alleles for individual patients differ substantially from each other. In this article, a likelihood ratio test is derived in an effort to address these validity issues. This test requires pre-specification of the marginal mutation probabilities at each locus, parameters for which some information will typically be available in the literature. In simulations this test is shown to be valid, but to be considerably less efficient than the Concordant Mutations Test for sample sizes (numbers of informative loci) typical of this problem. Much of the efficiency deficit can be recovered, however, by restricting the allelic imbalance parameter estimate to a pre-specified range, assuming that this parameter is in the pre-specified range.

Keywords: Clonality, concordant mutations test, likelihood ratio test

1. Introduction

Cancer pathologists are increasingly exploring the use of genetic fingerprinting to assist in classifying tumors. This is likely to be particularly useful in distinguishing second primary cancers from metastases, especially in clinical scenarios where this distinction is difficult when based solely on gross pathology, and where the correct diagnosis is clinically relevant. This is the case, for example, for contralateral cancers in the same organ type with the same histology, or for the occurrence of, say, a solitary lung nodule in a patient who has survived a previous head and neck primary of the same cell type (Geurts et al., 2005; Leong et al., 1998). Since tumors typically harbor many somatic mutations, the patterns of these mutations provide the evidence for distinguishing clonal tumors (i.e. metastases), characterized by common somatic mutations that occurred in the single, originating clonal cell, from independent tumors, where the patterns of mutations have no common origin. These genetic fingerprints can be determined by studying markers of somatic mutations, such as the presence of loss of heterozygosity, at candidate loci known to experience frequent allelic losses in the tumor type under evaluation. Numerous studies of this nature have been conducted in recent years in many cancer sites (see for example Ha and Califano, 2003; Hafner et al., 2002; Huang et al., 2001; Imyanitov et al., 2002).

In a recent article, our group proposed a new statistical test for this purpose (Begg et al., 2007). This is a relatively simple adaptation of Fisher’s Exact Test, in which the marginal frequencies of somatic mutations on the two tumors are fixed, and the test statistic is a simple count of the number of common concordant mutations that occur on the same parental alleles. We hereafter refer to this as the Concordant Mutations Test (CM). The reference distribution can be expressed as a simple combinatorial sum, as in Fisher’s Test. The attraction of this approach is its simplicity, allied to the fact that it is based solely on the data observed in a single patient. However, it was shown that its validity depends on two important assumptions. The first assumption is that the probability of a mutation is common across the genetic loci investigated. It was demonstrated that the range of mutation probabilities typically studied may have little impact on the properties of the CM test. The second assumption is that the mutation probability at each locus is the same for each parental allele. An imbalance in these probabilities has the effect of inducing correlation in the observed mutations between tumors even when the tumors arise independently. Although the CM test may work well when there are modest departures from these assumptions, it is of interest to explore the development of techniques that are not sensitive to these validity concerns.

The goal of this article is to explore a new test for this problem that is designed to circumvent these threats to the validity of the CM test. This new test is a likelihood ratio test that requires knowledge of the marginal mutation probabilities at each locus. Some knowledge of these probabilities will usually be available on the basis of data from the literature, and these estimates will improve as tumors are increasingly examined for patterns of allelic loss.

2. Methods

We follow the notation of Begg et al. (2007) as closely as possible. The data consist of indicators of loss of heterozygosity (LOH), and indicators of whether losses common to the two tumors occurred on the same parental allele, at each of J informative candidate loci. Let the locus be denoted by the subscript i, and let a_i = 1 if a mutation occurs on the same allele in both tumors at the ith locus (0 otherwise), e_i = 1 if a mutation occurs in both tumors regardless of allelic concordance, f_i = 1 if only tumor 1 has a mutation at the ith locus, g_i = 1 if only tumor 2 has a mutation at the ith locus, and h_i = 1 if neither tumor has a mutation at the ith locus. In general, pairs of tumors that share a larger number of “concordant” loci are more likely to be clonal, while a larger number of discordant loci suggests independent origin of the tumors. The Concordant Mutations Test utilizes a = Σa_i as the test statistic, and has a closed-form reference distribution (Begg et al., 2007).
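As an illustration of this notation, the following minimal Python sketch derives the five indicators at each locus and their totals. The per-tumor encoding ("M" or "P" for a loss on the maternal or paternal allele, None for no loss) and the function names are assumptions introduced here for illustration; they do not come from Begg et al. (2007).

```python
from typing import List, Optional, Tuple

# Hypothetical encoding for one tumor at one informative locus:
# "M" or "P" = loss on the maternal / paternal allele, None = no loss.
Call = Optional[str]


def locus_indicators(t1: Call, t2: Call) -> Tuple[int, int, int, int, int]:
    """Return (a, e, f, g, h) for a single locus."""
    e = int(t1 is not None and t2 is not None)   # loss in both tumors
    a = int(e == 1 and t1 == t2)                 # ... on the same parental allele
    f = int(t1 is not None and t2 is None)       # loss in tumor 1 only
    g = int(t1 is None and t2 is not None)       # loss in tumor 2 only
    h = int(t1 is None and t2 is None)           # loss in neither tumor
    return a, e, f, g, h


def summarize(tumor1: List[Call], tumor2: List[Call]) -> Tuple[int, ...]:
    """Sum each indicator over loci, giving the totals (a, e, f, g, h)."""
    rows = [locus_indicators(x, y) for x, y in zip(tumor1, tumor2)]
    return tuple(sum(col) for col in zip(*rows))


# Made-up calls at five loci, for illustration only.
tumor1 = ["M", "P", None, "M", None]
tumor2 = ["M", "M", None, None, "P"]
totals = summarize(tumor1, tumor2)
```

Here the first element of `totals` is the count a that serves as the Concordant Mutations test statistic.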

2.1 Proposed Test Based on the Likelihood Ratio

We assume at the outset that the individual mutation probabilities at each locus are distinct but known, and are denoted by p_i, i = 1, …, J. Generally some information regarding these mutational probabilities will be available from the literature on any locus that would be considered a candidate for studies of this nature, and the amount of such background information is likely to increase rapidly in the future from studies such as the Cancer Genome Atlas, sponsored by the National Cancer Institute.

The tests are designed to distinguish two hypotheses: H_I and H_C. Under the independence model (H_I), the two tumors are biologically independent. That is, all of the somatic mutations that gave rise to the tumors occurred in different cells. In constructing the CM test, the reference distribution is based on the assumption that these mutational patterns are statistically independent. However, this assumption may not be correct. For example, at any given locus, say the ith locus, the somatic mutation could occur on either the paternal allele or the maternal allele. If only one of these alleles contains an effective copy of the crucial tumor suppressor gene that is the reason for the frequent observation of allelic losses at this locus in the cancer type under investigation, then a loss on this particular allele is more likely to be observed in tumors than a loss on the other allele. Let π be the conditional probability that an observed loss occurs on this favored allele, given that a mutation has occurred, where π ≥ ½. Then if the two tumors occur independently, i.e. if H_I is true, P(a_i = 1 | e_i = 1) = π² + (1 − π)² ≥ ½. Thus even under the independence model there will be an induced correlation between the mutational patterns on the two tumors, and the magnitude of this correlation increases with the extent to which π exceeds ½.
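The lower bound of ½ and the growth of the concordance probability with π can be seen by completing the square. The short derivation below is added for illustration; the numerical values of π are examples only.

```latex
\[
\pi^2 + (1-\pi)^2 \;=\; 2\pi^2 - 2\pi + 1 \;=\; \tfrac{1}{2} + 2\left(\pi - \tfrac{1}{2}\right)^2 \;\ge\; \tfrac{1}{2},
\]
with equality only at $\pi = \tfrac{1}{2}$. For example,
\[
\pi = 0.6:\; 0.36 + 0.16 = 0.52, \qquad
\pi = 0.7:\; 0.49 + 0.09 = 0.58, \qquad
\pi = 0.8:\; 0.64 + 0.04 = 0.68 .
\]
```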

Under the clonal hypothesis (H_C), both of the tumors originate from a single (clonal) cell in which the pivotal somatic mutations occurred. Later, the colony of daughter cells from this clonal cell gives rise to the second tumor when one (or more) of these cells migrates to form a new (metastatic) colony. Subsequently the growth of both the original colony and the new colony may become dominated by cells that experience subsequent somatic mutations that confer a growth advantage. The occurrences of these subsequent mutations are “independent” in the sense described in the previous paragraph, i.e. if a candidate locus experiences mutations in both tumors the probability of concordance is π² + (1 − π)². We assume that the parameter π is common to all of the candidate loci under investigation. Our ability to distinguish H_C from H_I is primarily governed by whether or not the preponderance of observed mutations at the time the tumors are pathologically evaluated occurred during the clonal phase of tumor development. If most of the mutations occurred in the clonal phase, then the two mutational patterns will be very similar. Conversely, if there is a prolonged “independent” phase with many subsequent mutations, then the clonality signal will be harder to detect. We characterize the strength of the clonality signal with the parameter c, where c is defined to be the conditional probability that an observed mutation occurred in the clonal phase (versus the subsequent independent phase) of the tumor development. Thus c = 0 corresponds to the independence hypothesis H_I, while H_C is characterized by c > 0. Under this parametric model, the likelihood (for an individual patient) takes the form:

\[
L(c,\pi) = \prod_{i=1}^{J}
\left[ c p_i + \frac{(1-c)^2 p_i^2}{1-c p_i}\left\{\pi^2 + (1-\pi)^2\right\} \right]^{a_i}
\left[ \frac{(1-c)^2 p_i^2}{1-c p_i}\, 2\pi(1-\pi) \right]^{e_i - a_i}
\left[ \frac{2(1-c)\, p_i (1-p_i)}{1-c p_i} \right]^{f_i + g_i}
\left[ \frac{(1-p_i)^2}{1-c p_i} \right]^{h_i},
\]

where 0 ≤ c ≤ 1 and 0.5 ≤ π ≤ 1. The likelihood ratio statistic is L = L(ĉ, π̂)/L(0, π̂₀), where (ĉ, π̂) is the MLE of the unconstrained likelihood, while π̂₀ is the MLE of π when c = 0. These estimates are obtained by numerical maximization as there is no closed form solution in general. If there are no common mutations on the two tumors, i.e. if a and e − a are both 0, where e = Σe_i, then the likelihood provides no information about π and we set π̂ = 0.5.
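The numerical maximization can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain grid search rather than a dedicated optimizer, codes each locus by the observed cell (0 = concordant loss, 1 = discordant loss, 2 = loss in one tumor only, 3 = no loss), and uses the per-locus cell probabilities given in Section 2.2. The function names, the grid resolution, the cap of c at 0.99 (to avoid degenerate probabilities) and the `pi_max` argument for restricting the search range of π are all assumptions introduced here.

```python
import math
from typing import List, Sequence, Tuple


def cell_probs(p: float, c: float, pi: float) -> Tuple[float, float, float, float]:
    """P(concordant), P(discordant), P(one tumor only), P(neither) at one locus."""
    d = 1.0 - c * p
    both_indep = (1.0 - c) ** 2 * p ** 2 / d  # both tumors mutated, no clonal event
    return (
        c * p + both_indep * (pi ** 2 + (1.0 - pi) ** 2),
        both_indep * 2.0 * pi * (1.0 - pi),
        2.0 * (1.0 - c) * p * (1.0 - p) / d,
        (1.0 - p) ** 2 / d,
    )


def log_lik(data: Sequence[int], p: Sequence[float], c: float, pi: float) -> float:
    """Log-likelihood; data[i] in {0, 1, 2, 3} is the observed cell at locus i."""
    total = 0.0
    for cell, p_i in zip(data, p):
        prob = cell_probs(p_i, c, pi)[cell]
        if prob <= 0.0:
            return -math.inf
        total += math.log(prob)
    return total


def grid(lo: float, hi: float, n: int) -> List[float]:
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def lr_statistic(data: Sequence[int], p: Sequence[float],
                 pi_max: float = 1.0, n_grid: int = 51) -> float:
    """Likelihood ratio L = L(c_hat, pi_hat) / L(0, pi0_hat) by grid search."""
    has_common = any(cell in (0, 1) for cell in data)
    # With no common mutations the data carry no information on pi: fix it at 0.5.
    pis = grid(0.5, pi_max, n_grid) if has_common else [0.5]
    cs = grid(0.0, 0.99, n_grid)  # assumption: c capped just below 1
    null = max(log_lik(data, p, 0.0, pi) for pi in pis)
    full = max(log_lik(data, p, c, pi) for c in cs for pi in pis)
    return math.exp(full - null)


# Made-up data for illustration: 10 loci, each with p_i = 0.3 (assumed known).
p = [0.3] * 10
clonal_like = [0, 0, 0, 3, 3, 3, 3, 3, 3, 3]
indep_like = [1, 2, 2, 2, 2, 3, 3, 3, 3, 3]
lr_clonal = lr_statistic(clonal_like, p)
lr_indep = lr_statistic(indep_like, p)
lr_clonal_restricted = lr_statistic(clonal_like, p, pi_max=0.8)
```

Because the grid for c includes 0, the statistic is never below 1. A finer grid or a constrained optimizer would give more precise estimates at higher cost.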

To obtain a reference distribution for the test statistic we sample from the estimated distribution of the data under the null hypothesis. That is, for each locus i we generate (a_i, e_i − a_i, f_i + g_i, h_i) from a multinomial distribution with a size parameter of 1 using the following probabilities:

\[
\left[\, p_i^2\left\{\hat\pi_0^2 + (1-\hat\pi_0)^2\right\},\;\; 2 p_i^2\, \hat\pi_0 (1-\hat\pi_0),\;\; 2 p_i (1-p_i),\;\; (1-p_i)^2 \,\right].
\]

A reference likelihood ratio statistic, L*, is calculated from this dataset. This process is repeated a large number of times. The p-value of the test is the relative frequency with which L* > L, i.e. with which the reference statistic exceeds the observed one.
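The sampling scheme can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: loci are coded by cell (0 = concordant loss, 1 = discordant loss, 2 = loss in one tumor only, 3 = no loss), the statistic is passed in as a function, and a simple stand-in statistic (the count of concordant losses) is used so that the block is self-contained. In the procedure described above the statistic would be the likelihood ratio. All function names and the default of 2000 replicates are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def null_cell_probs(p_i: float, pi0: float) -> List[float]:
    """Cell probabilities at one locus under independence (c = 0)."""
    return [
        p_i ** 2 * (pi0 ** 2 + (1.0 - pi0) ** 2),  # concordant loss
        p_i ** 2 * 2.0 * pi0 * (1.0 - pi0),        # discordant loss
        2.0 * p_i * (1.0 - p_i),                   # loss in one tumor only
        (1.0 - p_i) ** 2,                          # no loss
    ]


def sample_null(p: Sequence[float], pi0: float, rng: random.Random) -> List[int]:
    """One simulated dataset: a multinomial draw of size 1 at each locus."""
    cells = range(4)
    return [rng.choices(cells, weights=null_cell_probs(p_i, pi0))[0] for p_i in p]


def monte_carlo_p_value(observed: float, p: Sequence[float], pi0: float,
                        statistic: Callable[[List[int], Sequence[float]], float],
                        n_rep: int = 2000, seed: int = 0) -> float:
    """Relative frequency with which the reference statistic exceeds the observed one."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_rep):
        if statistic(sample_null(p, pi0, rng), p) > observed:
            exceed += 1
    return exceed / n_rep


def count_concordant(data: List[int], p: Sequence[float]) -> float:
    """Stand-in statistic for this sketch only (not the likelihood ratio)."""
    return float(sum(1 for cell in data if cell == 0))


p = [0.3] * 10  # illustrative mutation probabilities, assumed known
p_val = monte_carlo_p_value(observed=3.0, p=p, pi0=0.5, statistic=count_concordant)
```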

As we subsequently show in Section 3, the preceding unconstrained LR test has sub-optimal properties in our small-sample setting, because the data carry too little information to estimate π reliably. A pragmatic solution is to constrain the range of admissible estimates for π. Empirical results suggest that by restricting the MLE of π to the arbitrary range [0.5, 0.8] we can utilize the available information more efficiently, assuming that π truly does lie in this range. In the simulations in Section 3 we present results for this test, referred to as the LR(0.8) test.

2.2 Design of Simulations

We examined datasets with different numbers of informative loci (10, 20, 30), with signal strengths either null (c = 0), moderate (c = 0.5) or large (c = 0.9), with allelic probability imbalances represented by π = 0.5, π = 0.6 and π = 0.7, and with “average” mutation probabilities of p = 0.3 and p = 0.5. Variation in mutation probabilities was selected based on the variance of log{p_i/(1 − p_i)} obtained from a motivating dataset of Imyanitov et al. (2002). In each setting the individual values of p_i were chosen to conform to these variances as indicated in the table footnotes in Begg et al. (2007).

These configurations demonstrate the operating characteristics of the likelihood ratio approach under perfect circumstances, i.e. where the values of {p_i} used in this test are known without error. In practice we will only have estimates of these quantities. To estimate the degree of plausible measurement error in practice we have made use of a literature review for a planned clonality study in melanoma. In the planning of this study we identified 20 markers at sites for which LOH is common in melanoma. The reported frequencies of LOH at these sites ranged from 10% to 56%, and the denominators of these relative frequencies ranged from 9 to 23. Based on these statistics we make the assumption that the presumed known (logit) values of {p_i} in the LR method typically possess an error variance of around 0.32. In order to address the impact of this uncertainty regarding {p_i}, we have created simulations in which errors are added to each generated value of p_i, where log{p_i*/(1 − p_i*)} = log{p_i/(1 − p_i)} + ε_i, and where ε_i is generated from a normal distribution with mean 0 and variance σ² = 0.32. That is, the original datasets were generated using the “correct” {p_i}, while the LR test statistic and its reference distribution were calculated using {p_i*}.
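The perturbation on the logit scale can be sketched as below. The function name, the seed and the three example probabilities (chosen within the 10% to 56% range quoted above) are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence


def perturb_on_logit_scale(p: Sequence[float], sigma2: float,
                           rng: random.Random) -> List[float]:
    """Return p* with logit(p*_i) = logit(p_i) + eps_i, eps_i ~ N(0, sigma2)."""
    sd = math.sqrt(sigma2)
    perturbed = []
    for p_i in p:
        logit = math.log(p_i / (1.0 - p_i)) + rng.gauss(0.0, sd)
        perturbed.append(1.0 / (1.0 + math.exp(-logit)))  # back to a probability
    return perturbed


rng = random.Random(2024)                 # arbitrary seed, for reproducibility only
p_true = [0.10, 0.30, 0.56]               # illustrative "correct" values
p_star = perturb_on_logit_scale(p_true, sigma2=0.32, rng=rng)
```

Working on the logit scale keeps every perturbed value strictly between 0 and 1.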

The simulations were generated in the following way. First, we specified values of {p_i}, c, π and σ². We then generated the values of the mutation probabilities subject to error, i.e. {p_i*}, as indicated above. We generated the data (a_i, e_i − a_i, f_i + g_i, h_i) for the ith locus from a multinomial distribution with a size parameter of 1 using the following probabilities:

\[
\left[\, c p_i + \frac{(1-c)^2 p_i^2}{1-c p_i}\left\{\pi^2 + (1-\pi)^2\right\},\;\;
\frac{(1-c)^2 p_i^2}{1-c p_i}\, 2\pi(1-\pi),\;\;
\frac{2(1-c)\, p_i (1-p_i)}{1-c p_i},\;\;
\frac{(1-p_i)^2}{1-c p_i} \,\right],
\]

and we repeated this for each locus to obtain the complete dataset. We then calculated the test statistic on the basis of this dataset. For calculation of the likelihood ratio test statistic and its corresponding p-value we either use {p_i} or {p_i*} depending on whether we are evaluating the test in the “ideal” or the “realistic” setting. The size of the test is the relative frequency with which the p-value is less than the nominal value (0.05) in simulations generated with c = 0, and the power corresponds to simulations with c ≠ 0.
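One way to see where these four probabilities come from is sketched below. This is our reading of the model in Section 2.1, offered as an aid rather than as the authors' stated derivation; the auxiliary symbol q_i is introduced here and does not appear in the original. Suppose a clonal mutation occurs at locus i with probability c p_i, and otherwise each tumor independently acquires a mutation with probability q_i, chosen so that the marginal mutation probability in each tumor remains p_i.

```latex
\[
p_i = c p_i + (1 - c p_i)\, q_i
\quad\Longrightarrow\quad
q_i = \frac{(1-c)\,p_i}{1 - c p_i}, \qquad 1 - q_i = \frac{1 - p_i}{1 - c p_i}.
\]
\begin{align*}
P(a_i = 1) &= c p_i + (1 - c p_i)\, q_i^2 \left\{\pi^2 + (1-\pi)^2\right\}
            = c p_i + \frac{(1-c)^2 p_i^2}{1 - c p_i}\left\{\pi^2 + (1-\pi)^2\right\},\\
P(e_i - a_i = 1) &= (1 - c p_i)\, q_i^2\, 2\pi(1-\pi)
            = \frac{(1-c)^2 p_i^2}{1 - c p_i}\, 2\pi(1-\pi),\\
P(f_i + g_i = 1) &= (1 - c p_i)\, 2 q_i (1 - q_i)
            = \frac{2(1-c)\,p_i(1-p_i)}{1 - c p_i},\\
P(h_i = 1) &= (1 - c p_i)(1 - q_i)^2
            = \frac{(1-p_i)^2}{1 - c p_i}.
\end{align*}
The four terms sum to
\[
c p_i + (1 - c p_i)\left\{q_i + (1 - q_i)\right\}^2 = 1,
\]
and setting $c = 0$ recovers the null probabilities
$\left[p_i^2\{\pi^2 + (1-\pi)^2\},\; 2p_i^2\pi(1-\pi),\; 2p_i(1-p_i),\; (1-p_i)^2\right]$.
```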

3. Results

The size and power of the tests are displayed in Table 1. We see that the likelihood ratio (LR) test succeeds in its goal of correcting the validity problems that affect the Concordant Mutations (CM) test when the mutation probabilities vary across loci and/or the alleles at each locus possess unequal mutation probabilities. The size of the test remains at <0.05 for the constrained version of the test (LR(0.8)), and the test continues to have good validity properties even when the “known” values of {p_i} used in the construction of the test are known only with considerable error (see the column denoted “LR*(0.8)”).

Table 1
Size and Power

We also present the corresponding results for power when the data are generated from a clonal model with c = 0.5. [Data are not shown for the setting of a strong clonal signal, with c = 0.9, as all of the tests maintain high power in this configuration.] The results in the power section of Table 1 are “calibrated” to a significance level of 0.05 to facilitate direct comparison of the tests in their abilities to distinguish H_I and H_C. The results show a clear pattern. The unconstrained LR test, though valid, is substantially less powerful than the CM test. The LR test which constrains the estimate of π to the range [0.5, 0.8] appears to recover a considerable portion of this efficiency loss. Some additional power is lost when we take into consideration the fact that knowledge of {p_i} is required for the LR approach, while {p_i} will be subject to measurement error in practice. This efficiency loss is evidenced by the reduced power in the final column of Table 1, although these losses seem modest even in the presence of the substantial degree of measurement error used in these simulations.
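The article does not spell out how the power figures were calibrated. One common approach, sketched here purely as an assumption, is to replace the nominal cut-off with the largest p-value threshold whose empirical rejection rate in the null simulations (c = 0) does not exceed 0.05, and then to apply that threshold to the simulations generated under the alternative. The function name and the toy p-values are hypothetical.

```python
from typing import Sequence


def calibrated_power(null_p: Sequence[float], alt_p: Sequence[float],
                     alpha: float = 0.05) -> float:
    """Power at a threshold tuned so the empirical null rejection rate is <= alpha."""
    threshold = None
    for t in sorted(set(null_p)):
        rate = sum(1 for q in null_p if q <= t) / len(null_p)
        if rate <= alpha:
            threshold = t          # still within the allowed size: keep raising
        else:
            break
    if threshold is None:          # even the smallest null p-value rejects too often
        return 0.0
    return sum(1 for q in alt_p if q <= threshold) / len(alt_p)


# Toy p-values for illustration only (not simulation results from the article).
null_p = [k / 100 for k in range(1, 101)]
alt_p = [0.01, 0.03, 0.20, 0.50]
power = calibrated_power(null_p, alt_p)
```

With discrete test statistics the achieved size can fall below the target, so comparisons calibrated this way remain approximate.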

4. Data Analysis Example

We have re-analyzed the data set from Imyanitov et al. (2002) using both tests. Each of 14 genetic loci was examined for presence of LOH. The data consist of pairs of bilateral breast cancers from 28 patients. Overall the CM test classifies 7 of the 28 patients as clonal, while the LR test with the parameter π restricted to the range [0.5, 0.8] classifies 9 of the patients as clonal. For 24 of the 28 patients, the results of the two tests are concordant with respect to statistical significance at the 5% level. In general, the CM test gives greater weight to the presence of the concordant mutations, while for the LR test, data patterns that are inconsistent with the pre-specified marginal probabilities seem to have a greater influence than in the CM test. Examples from 7 selected patients are displayed in Table 2. Losses are denoted as “▲”, the absence of LOH by “○”, and non-informative loci by “–”. Concordant and discordant losses at the same locus are denoted by “▲▲” and “▲△”, respectively. Case #25 highlights the fact that the LR test can find significant evidence of clonality even in the presence of a single concordant mutation.

Table 2
Selected Patients with Breast Cancer [Adapted from Imyanitov et al. (2002)]

5. Discussion

Our research does not clearly establish the preferred test. The advantages of the likelihood ratio test are that it has a valid test size and that, when the estimate of π is restricted to a pre-specified range, its discriminatory power is almost as good as that of the Concordant Mutations Test. On the other hand, the Concordant Mutations Test does not rely on pre-specification of individual marginal mutation probabilities at each locus, and it is simple to construct and calculate.

The two tests differ in their use of the data in an important way. The Concordant Mutations test statistic is the count of the number of (potentially clonal) concordant mutations. Thus it can only lead to a result in favor of H_C if there are significantly more concordant mutations than expected under H_I. The presence of loci at which no mutations occurred in either tumor has only an indirect effect on this test. By contrast, the likelihood ratio test weighs the presence of these concordant normal loci more heavily in favor of the clonal hypothesis. Indeed it is possible to create datasets where there are no concordant mutations, yet the likelihood ratio test leads to a significant result in favor of H_C. Since concordant mutations represent active events that occurred in tumor development, i.e. allelic losses, while concordant normals represent merely the absence of such events, it is not clear that one can make a persuasive case that two tumors are clonal merely on the basis of their common failure to experience as many random allelic events as anticipated. In other words, establishing the clonal origin of two tumors may realistically require evidence of active, potentially clonal events, rather than “passive” evidence from the absence of allelic events. Thus, the Concordant Mutations test may be the appropriate statistic to use in practice for establishing the clonal origin of tumors, notwithstanding its validity limitations, and the availability of an alternative test with a more accurately calibrated false positive error rate.


The research was supported by the National Cancer Institute, award number CA098438.


  • Begg CB, Eng KH, Hummer AJ. Statistical tests for clonality. Biometrics. 2007;63:522–530.
  • Geurts TW, Nederlof PM, van den Brekel MW, van’t Veer LJ, de Jong D, Hart AA, van Zandwijk N, Klomp H, Balm AJ, van Velthuysen ML. Pulmonary squamous cell carcinoma following head and neck squamous cell carcinoma: metastasis or second primary? Clinical Cancer Research. 2005;11:6608–6614.
  • Ha PK, Califano JA. The molecular biology of mucosal field cancerization of the head and neck. Critical Reviews in Oral Biology and Medicine. 2003;14:363–369.
  • Hafner C, Knuechel R, Stoehr R, Hartmann A. Clonality of multifocal urothelial carcinomas: 10 years of molecular genetic studies. International Journal of Cancer. 2002;101:1–6.
  • Huang J, Behrens C, Wistuba I, Gazdar AF, Jagirdar J. Molecular analysis of synchronous and metachronous tumors of the lung: impact on management and prognosis. Annals of Diagnostic Pathology. 2001;5:321–329.
  • Imyanitov EN, Suspitsin EN, Grigoriev MY, Togo AV, Kuligina E, Belogubova EV, Pozharisski KM, Turkevich EA, Rodriquez C, Cornelisse CJ, Hanson KP, Theillet C. Concordance of allelic imbalance profiles in synchronous and metachronous bilateral breast carcinomas. International Journal of Cancer. 2002;100:557–564.
  • Leong PP, Rezai B, Koch WM, Reed A, Eisele D, Lee DJ, Sidransky D, Jen J, Westra WH. Distinguishing second primary tumors from lung metastases in patients with head and neck squamous cell carcinoma. Journal of the National Cancer Institute. 1998;90:972–977.