Search tips
Search criteria 


Logo of biostsLink to Publisher's site
Biostatistics. 2011 July; 12(3): 478–492.
Published online 2011 January 20. doi:  10.1093/biostatistics/kxq082
PMCID: PMC3114655

Causal assessment of surrogacy in a meta-analysis of colorectal cancer trials

Yun Li,* Jeremy M.G. Taylor, and Michael R. Elliott
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA ; yunlisph/at/


When the true end points (T) are difficult or costly to measure, surrogate markers (S) are often collected in clinical trials to help predict the effect of the treatment (Z). There is great interest in understanding the relationship among S, T, and Z. A principal stratification (PS) framework has been proposed by Frangakis and Rubin (2002) to study their causal associations. In this paper, we extend the framework to a multiple trial setting and propose a Bayesian hierarchical PS model to assess surrogacy. We apply the method to data from a large collection of colon cancer trials in which S and T are binary. We obtain the trial-specific causal measures among S, T, and Z, as well as their overall population-level counterparts that are invariant across trials. The method allows for information sharing across trials and reduces the nonidentifiability problem. We examine the frequentist properties of our model estimates and the impact of the monotonicity assumption using simulations. We also illustrate the challenges in evaluating surrogacy in the counterfactual framework that result from nonidentifiability.

Keywords: Bayesian estimation, Counterfactual model, Identifiability, Multiple trials, Principal stratification, Surrogate marker


In clinical trials, the true clinical outcomes of interest can take long or cost much to collect. Hence, there is often great interest in identifying and collecting surrogate markers that can inform us regarding the effect of the treatment more quickly and/or at lower expense. We refer to surrogate markers as intermediate outcomes that have strong associations with the true end point, are measured after the treatment assignment and carry information about the treatment effect. Different investigators use different terminology; here we follow the definition of Buyse and Molenberghs (1998) and Taylor and others (2005). As biomarkers continue to be discovered at an ever-increasing rate, the interest in studying surrogate markers and assessing the relationship among the surrogate marker (S), the true end point (T), and the treatment (Z) continues to increase.

Our research was motivated by the ongoing interest in exploring the relationship between the shorter-term end point of disease-free survival (DFS) and the longer-term end point of overall survival (OS) for colorectal cancer patients. The adjuvant colon cancer end points (ACCENT) group obtained individual data from stage II and III colorectal cancer patients identified from a large collection of clinical trials. Based on the data, Sargent and others (2005, 2007) concluded that DFS after 3 years follow-up could replace the 5-year OS as a primary end point. The data were then reanalyzed by Alonso and Molenberghs (2008) and Baker (2008) among others. While previous research is helpful in predicting the treatment effect on T in a new trial, it has not focused on fostering understanding of the causal process leading from Z to T through S which can add supporting explanatory information (Joffe and others, 2007). In this paper, we directly assess the causal associations among S, T, and Z in the meta-analytic setting using the principal stratification (PS) framework defined by Frangakis and Rubin (2002).

Much work has been proposed to assess surrogacy. In a single trial setting, Prentice (1989) defined S as a perfect surrogate when S fully captures the effect of the treatment on T. For less than perfect surrogacy, Freedman and others (1992) proposed to measure the proportion of treatment effect explained by S. These methods utilize the model f(T|S,T) and examine the degree to which Z affects S and S subsequently affects T, which Joffe and Greene (2009) term “the causal-effects paradigm.” In a meta-analytical setting, the paradigm is also explored by Alonso and Molenberghs (2007) and Baker (2008) among others. To ensure that the parameter estimates are valid, there must be no unmeasured confounders between Z and T, or between Z and S, or between S and T. In clinical trials, the purpose of randomization is to balance the distributions of confounders at baseline between 2 comparison groups; however, no such balance is achieved for the confounders between S and T. When such confounders exist, the parameter estimates may be invalid in the causal-effect paradigm.

In a PS framework, it hypothesizes that every patient has 2 potential outcomes (i.e. counterfactuals) corresponding to each treatment. However, we only observe one set of the potential outcomes because a patient either receives the control or the experimental treatment but not both. In this paper, we focus on binary S and T. For every patient, there are 4 possible realizations of the pair of the 2 potential outcomes of S and these pairs define the 4 principal strata. Since membership in a principal stratum is assumed to exist prior to treatment, it cannot be affected by treatment. Our goal is to examine the joint distribution of the potential outcomes of S and those of T. The set of probabilities associated with the cross-classification of different pairs of the potential outcomes of S and T describe the causal association of S and T and are often quantity of interest. These causal association measures examine how Z affects S related to how Z affects T, which Joffe and Greene (2009) term “the causal-association paradigm”. They are valid no matter whether or not an unmeasured confounder exists between S and T or whether or not S affects T, which can be advantageous and more realistic in practice.

In a single trial setting, Li and others (2010) have developed a method to estimate these causal association parameters defined in the PS framework. Their method applies to the setting when it is sensible to assume monotonicity, that is, it is impossible for anyone in the study to benefit more from the control treatment than the experimental treatment with respect to both S and T. Gilbert and Hudgens (2008) and Wolfson and Gilbert (2010) have also proposed surrogacy measures in this framework which are also restricted to the single trial setting and assume monotonicity.

In this paper, our goal is to evaluate the causal associations among S, T, and Z in a meta-analytic framework. In a meta-analysis of multiple trials, it is sensible to assume that the different trials come from a common population and each trial consists of a sample from this population. We hypothesize that there is a latent joint distribution of the potential outcomes of S and those of T at the population level which is invariant, and that each trial has its own distribution of potential outcomes that is linked to the population distribution. We propose a Bayesian hierarchical PS model to obtain these population-level causal association measures and the trial-specific counterparts that vary from trial to trial. Our model allows information sharing across different trials and accounts for clustering. Our proposed method differs from previous methods in several important aspects. First, our method applies to the meta-analytic setting and provides both trial-specific and overall population-level causal measures. Second, we relax the monotonicity assumption and permit the possibility that there are some patients who would benefit more from the control treatment than the experimental treatment. The extension is important because it may be more realistic particularly when the 2 treatments are both active. In addition, we explore the impact of the monotonicity assumption when there is a slight violation and investigate the challenges in evaluating surrogacy in a counterfactual framework that result from nonidentifiability. Previous approaches by Daniels and Hughes (1997) and the Rtrial2 measure proposed by Buyse and others (2000) are also developed in a meta-analytic setting that rely only on the treatment effects at the group level and are useful in decision making about the new treatment effect. Our causal association measures can be useful in obtaining a finer understanding of the causal process, and the framework allows us to increase precision of estimates through incorporating model assumptions that are sensible in the causal process.

In Section 2, we introduce our model and estimation methods. In Section 3, we apply our methods to the colon cancer data. In Section 4, we use simulations to examine the frequentist properties of our estimates. We then investigate the impact of the monotonicity assumption and illustrate the challenges associated with nonidentifiability. We conclude with a discussion in Section 5.


2.1. Notation, causal framework, and PS

Suppose there are N trials, and Mi patients in the ith trial. Let Zij = zij denote patient j in trial i assigned either control (zij = 0) or experimental (zij = 1) treatment, where i = 1,…,N and j = 1,…,Mi. Let Sij and Tij be the observed surrogate and clinical outcomes, respectively. The realizations of (Sij,Tij) are denoted by sij and tij, defined without loss of generality as either 0 for bad outcomes or 1 for good outcomes. For each contingency table cross-classified by all possible levels of Zij, Sij, and Tij in trial i, the observed cell frequency is denoted by ristz and the cell proportion is denoted by An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx1_ht.jpg, where An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx2_ht.jpg.

Let Sij(0) denote the potential surrogate outcome, if, possibly contrary to the fact, subject j in trial i is randomized to zij = 0 and Sij(1) if she/he is assigned to zij = 1. We define Tij(0) and Tij(1) similarly. The 4 principal strata are defined by the possible realization of the potential outcomes pair (Sij(0),Sij(1)): (0,0), (0,1), (1,1) and (1,0), which we refer to as “never responsive,” “responsive,” “always responsive,” and “harmed” and denote as Sij* = n,r,a,h. Similarly, there are 4 possible values for (Tij(0),Tij(1)) denoted by Tij* = n,r,a,h. For each table in trial i cross-classified by Sij* and Tij*, we denote the cell frequency by niklz and the cell proportion by An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx3_ht.jpg, corresponding to probability pikl, where k,l = 1,2,3,4. There is an analagous 4 × 4 table for the population probabilities pkl's which summarize the joint distribution among (S(0),S(1)) and (T(0),T(1)) at the population level and are invariant across trials.

2.2. Assumptions

In addition to the standard assumptions of ignorability of treatment assignment and stable unit treatment value assumption (Rubin, 1978, 1980). Another common assumption is monotonicity which requires that Sij(1) ≥ Sij(0) and Tij(1) ≥ Tij(0) for all i,j; hence, we cannot have (Sij(0) = 1,Sij(1) = 0) or (Tij(0) = 1,Tij(1) = 0). That is, pik4 = pi4l = pk4 = p4l = 0, where k,l = 1,2,3,4. Since one of the potential outcomes is unobserved, the PS framework is overparameterized and this assumption is crucial in improving the identifiability of the model and the precision of the estimates. However, it requires that every patient would have done at least as well as that when she or he receives Zij = 1 relative to that when she or he receives Zij = 0. It is perhaps true for most patients but not usually satisfied for all patients. For example, in the colon cancer studies, if some patients had fatal adverse side effects from the experimental therapy, they would have had better outcomes had they received control treatment, even though the average effect of the treatment may be better. In the sections that follow, we will describe the causal estimands of interest with the monotonicity assumption (denoted by monotonicity model) and without (denoted by full model) and investigate the impact of violations of the assumption.

2.3. Population-level and trial-specific surrogacy measures

In addition to pkl's and pikl's, many quantities derived from these probabilities are of interest. By definition, the causal effect on T (CET) is based on the comparisons between 2 potential outcomes. Because of randomization, CETi = pi + 2pi + 4, which equals the intent-to-treat (ITT) effect of Zij on Tij. Similarly, the causal effect on S in trial i (CESi) is pi2 +pi4 + which equals the ITT effect on Sij. We partition the ITT effect on T into 2 components. One is the ITT effect within the principal strata of “always responsive” and “never responsive” which is equal to pi12 + pi32 − (pi14 + pi34) and defined as the net dissociative effect (NDE). It measures the net ITT effect on T for patients whose S's are not responsive to Z. The other is the ITT effect within the principal strata of “responsive” and “harmed” which is pi22 + pi42 − (pi24 + pi44), and we define it as the net associative effect (NAE). Note that CETi = NDEi + NAEi. We define the associative proportion (APi) as NAEi/CETi and the surrogate associative proportion (SAPi) as NAEi/CESi. These quantities are generalizations of the same terminology defined by Li and others (2009) when monotonicity was assumed. The population-level causal surrogacy measures NDE, NAE, CET, CES, AP, and SAP are defined similarly. The comparisons of pi22/pi2 + relative to pi12/pi1 + , pi32/pi3 + and pi42/pi4 + are also interesting as we would expect pi22/pi2 + to be higher than the other ratios for a good surrogate. Similarly, we can compare pi44/pi4 + with pi14/pi1 + , pi24/pi2 + and pi34/pi3 + . The quantities pi22/pi2 + and pi44/pi4 + measure the fractions of the patients whose potential outcomes of S and T are (0,1) and (1,0), respectively, which are standardized by the sample size in each principal stratum. The comparisons at the population level can be similarly made.

2.4. Observed data and complete data

Consider the complete data consisting of Zi, potential outcomes of Sij and Tij as summarized by niklz. The observed data consists of the number of patients ristz for each combination of Zij, Sij, and Tij. The connection between niklz and ristz is as follows: ri000 = ni110 + ni120 + ni210 + ni220, ri010 = ni130 + ni140 + ni230 + ni240, ri100 = ni310 + ni320 + ni410 + ni420, ri110 = ni330 + ni340 + ni430 + ni440, ri001 = ni111 + ni141 + ni411 + ni441, ri011 = ni121 + ni131 + ni421 + ni431, ri101 = ni211 + ni311 + ni241 + ni341, and ri111 = ni221 + ni231 + ni321 + ni331. The connections between πistz's and pikl's are analogous. Since the number of patients in each arm per trial is known and there are 8 observed counts for each table in each trial, we can estimate 6 free parameters. Hence, out of 15 free counterfactual probabilities in each trial, 9 are nonidentifiable. Some combinations are identifiable such as pi2 +pi4 + , pi1 +pi3 + , pi + 2pi + 4, and so on. When the monotonicity assumption is made, the relationship between the observed data and complete data will follow as above except that nik4z, ni4lz, pik4, and pi4l are assumed to be zeros for k,l = 1,2,3,4.

2.5. The model for complete data

For simplicity, we assume equal patient allocations between the 2 arms. We extend a flexible log-linear model used by Li and others (2009) and assume the cell frequency in the complete data tables niklz follows a Poisson distribution, which has mean μikl,

An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx4_ht.jpg

We can interpret λikS and λilT as row and column effects, respectively, and λiklST as the interactions. These are analogous to the random effects in a mixed model setup. For identifiability of the log-linear model, we use the constraints, λi3S = λi3T = λik3ST = λi3lST = 0, which treats the third row and the third column as the reference cells. Conditional on the sum of the cell counts for each table, the cell frequencies are multinomial, that is, ni00z,…,ni11z|ri + + z~multinomial(ri + + z,pi00,…,pi11). The counterfactual probabilities in each table can be expressed as An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx5_ht.jpg

In an overparametrized model, there is little information on the correlation parameters and hence simpler independent priors for the trial-level parameters are used and given by λi~N(λ,σλ2), λikS~NkS, σλkS2), λilT~NlTλlT2), and λiklST~NklSTλkl2). We can interpret the hyperparameters as population-level parameters in the latent contingency table for the hypothetical population, which are analogous to the fixed effects in a mixed model setting. The population-level parameters follow the same constraints for identifiability: λ3S = λ3T = λk3ST = λ3lST = 0. The population-level probabilities pkl can be obtained accordingly. We assign the hyperparameters proper priors as follows: (λ,λkSlTklST)~N(δ,σ2), (σλ − 2λkS − 2λlT − 2λkl − 2)~Gamma(τab). We let δ = 0 and σ2 = 9/4 that result in priors with characteristics similar to those of Garret and Zeger (2000) for logistic regression and being relatively flat on the counterfactual probabilities. We parameterize Gamma(τab) to have an expected mean of τaτb and variance of τaτb2. We assume τa = 2.1 and τb = 1/2 that corresponds to a moderately diffuse prior on the variances with 95% probability interval of (0.35,7.80).

The conditional posterior distributions and the use of a data augmentation estimation method are described in the supplementary material available at Biostatistics online. When the monotonicity assumption is made, the model and the estimation procedure can be easily modified accordingly by letting λi4S = λi4T = λi44ST = λ4S = λ4T = λ44ST = nik4z = ni4lz = pik4z = pi4lz = 0.

2.6. Model selection

We use the deviance information criterion (DIC) based on the complete-data likelihood instead of the observed-data likelihood for assessment and comparison of the models with and without the monotonicity assumption. It is suggested as a more reliable measure in the setting of missing data (Celeux and others, 2006). We first define the deviance as D(θ) = − 2log(p(y|θ)) + C, where y represents the complete data (niklz's), θ are the model parameters, p(y|θ) is the likelihood function and C is a constant which will cancel out in calculations. Let An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx6_ht.jpg refer to the expectation of D and An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx7_ht.jpg denote the deviance at the expectation of θ. Here, we use the the posterior mean of θ as An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx8_ht.jpg. The effective number of parameters of a model is computed as An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx9_ht.jpg. We define DIC as An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx10_ht.jpg. Models with smaller DICs are preferred.


3.1. The results

Multiple trials have been conducted to evaluate fluorouracil (FU)-based chemotherapy for patients with early-stage colon cancer. The ACCENT Group obtained individual data from 18 phase III trials conducted from 1978 to 1999 involving 20 898 patients. Sargent and others (2005) defined DFS (i.e. no cancer recurrence or death) after 3 years of median follow-up as S and OS after 5 years of median follow-up as T. We define a binary DFS status at 3-year follow-up as S and the binary OS status at 5-year follow-up as T. For the 5% of censored patients, we imputed S or T using a Cox model with a Weibull baseline hazard function. Data from 10 of these trials was available together with 2 additional similar trials (Table 1) resulting in 14 036 patients in total.

Table 1.

Summaries of colon cancer trials

The Pearson correlation between CETi and CESi from 12 trials is estimated as 0.81 (p < 0.001). We also adopt a latent variable approach which used a pseudo-likelihood estimation method for the bivariate mixed model assumed for the latent S and T (Tilahun and others, 2008). We obtained that the treatment adjusted trial-level correlation is 0.928 with standard error (SE) of 0.508 and the individual level correlation is 0.919 (SE = 0.029). Hence, there is a close association between the treatment effect on S and that on T, and in addition S and T for each individual is highly correlated.

We apply our proposed methods with and without assuming monotonicity. We run Monte Carlo Markov chains of 80 000 iterations for the monotonicity model and 2 000 000 for the full model. Trace plots and the Gelman–Rubin statistics appear to be acceptable as convergence tests. We collect 2 000 posterior draws via systematic sampling after discarding the first half of the iterations. The population-level quantities are presented in Table 2. For the 9 probabilities shared by the full model and the monotonicity model, the posterior means are similar in magnitude. Very high probabilities for p11 and p33 indicate that most patients are either always responders or never responders for both S and T.

Table 2.

Bayesian estimates for population-level counterfactual quantities for ACCENT data with and without monotonicity. Prior specifications: τa = 2.1, τb = 1/2, δ = 0, and σ2 = 9/4

The summary measures such as NDE, NAE, AP, and SAP, are much influenced by which model we fit. When the full model is used, NAE is much bigger than NDE; both AP and SAP are bigger than 0.5. This implies that the majority of the effect of Z on T occurs among the patients for whom Z also impacts their S. The posterior means of p22/p2 + , p12/p1 + , p32/p3 + , and p42/p4 + are estimated as 0.305, 0.003, 0.021, and 0.022 and those of p44/p4 + , p14/p1 + , p24/p2 + , and p34/p3 + are estimated as 0.027, 0.002, 0.013, and 0.017. The measures p22/p2 + and p44/p4 + are the largest among those compared that indicates good concordance between the potential outcomes for S and T within each principal stratum. All these measures indicate 3-year DFS is a good surrogate for 5-year OS. We also notice that these surrogacy measures have negative lower bounds because this model allows negative treatment effects. On the other hand, when we fit the monotonicity model, NDE is much larger than NAE and AP and SAP are close to 0.1. These measures imply that the majority of the ITT effect on T occurs among the patients whose S's are not responsive to Z. The monotonicity assumption forces all quantities to be nonnegative and the lower bounds for these 4 measures are all positive.

The posterior distributions for these 4 measures at the population level and trial level are plotted in Figures 1 and and2.2. The population-level quantities appear to summarize all the trial-specific ones well. Similarly, the means and spreads of these distributions appear to be sensitive to the monotonicity assumption. The conclusions are opposite in this example depending on whether or not this assumption is made. Intuitively, the potential outcomes of 3-year DFS and 5-year OS are more likely to be closely associated because of the inherent connection between the definitions of DFS and OS. In addition, since in many trials, 2 active treatments were compared, it is likely that some patients may benefit from the control treatment more than from the experimental treatment and the monotonicity assumption may not be sensible. We use the DIC to compare 2 fitted models. The DIC is 556.0 for the full model and is 593.8 for the monotonicity model which indicates that the full model is the preferred model by DIC which is consistent with our intuition. While the full model gives more sensible results, the degree of nonidentifiability is much higher than the monotonicity model. The 95% credible intervals from the full model for NAE and NDE are much wider and include 0's. The intervals for AP and SAP go outside of ( − 1,1) because the denominators in our example, CES and CET, are close to zeros. The counterfactual model requires one to divide CES and CET into even smaller compartments which subsequently need much larger sample size to reach narrower ranges. As a result, the inference for these measures is not conclusive.

Fig. 1.

Population-level (solid lines) and trial-specific (dashed lines) surrogacy measures: NDE and NAE for ACCENT data. Left panel: with monotonicity; and Right panel: without monotonicity.

Fig. 2.

Population-level (solid lines) and trial-specific (dashed lines) surrogacy measures: AP and SAP for ACCENT data. Left panel: with monotonicity; and Right panel: without monotonicity.

3.2. Sensitivity of priors and model structure

In Figure 3, we examine the identifiability of the population-level probabilities in a full model by plotting prior and posterior distributions against each other (Garret and Zeger, 2000). The less identifiable a parameter is the more similarity and overlap between the prior and posterior distributions and the more sensitive the parameter estimate is toward a prior distribution. We find that there is more information from the data regarding p11 and p33 than p34 and p43. To further assess the prior impact on the inference, we vary the means of the priors for the hyperparameters by changing δ to 0.7 but fix σ2, then we fix δ but change σ2 to 1 and 4. Relative to that when δ = 0 and σ2 = 9/4, we find that the changes in the posterior means and posterior standard deviations (PSDs) are usually less than 10% except for AP, SAP, and very small probabilities (e.g. p41 and p43). We then vary (τab = 2.1,1/2) to (τab = 1,1) that corresponds to wider 95% probability interval of (0.27,39.50). The changes in posterior means are usually smaller than 10% except AP, NAE and p22; however, the changes in PSDs are much bigger and often greater than 50%. These results are presented in the supplementary material available at Biostatistics online.

Fig. 3.

Prior and posterior distributions on population-level counterfactual probabilities. Dashed lines for the prior distributions and solid lines for the posterior distributions.

To investigate the impact of the model structure on the posterior distributions, we perform different analyses by assuming different mean structures for μikl (2.1) (results shown in the supplementary material available at Biostatistics online). We first allow λiklST = λklST. Relative to the full model, the changes in posterior means and PSDs are very small. We then let λiklST = 0 which result in drastically different posterior means and PSDs. Since the set λiklST represents associations between S and T, it may be reasonable to assume that they are the same in all trials, but unreasonable to assume that they are all zeros, we never-the-less consider both to assess the impact of different assumptions.


We conducted a small simulation study to examine the frequentist properties of the population-level counterfactual quantities in the proposed method and the impact of the monotonicity assumption. We generated 200 data sets with the parameter specifications similar to the posterior means of the parameters in the full model from the ACCENT example. We fit both the full model and the monotonicity model to each data set. The summary statistics over 200 simulations are presented in Table 3. The mean bias (Bias) and posterior standard deviation (An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx11_ht.jpg) are computed by averaging over the biases for the posterior means and the PSDs across 200 simulated data, respectively. The empirical standard deviation (ESD) is the standard deviation of the 200 posterior means from each simulation and 95% coverage rates (CR) measure the probability of whether or not the 2.5–97.5 percentile of each posterior distribution covers the true value. When the fitted model is the full model, the relative biases are usually small for all quantities examined with the exception of very small quantities such as p12 and p41 whose absolute biases are still small but the relative biases are big. For identifiable quantities such as CES and CET, the An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx11_ht.jpgs are close to ESDs and the CRs are close to 95%, which is consistent with the asymptotic frequentist property. For most of the nonidentifiable quantities, the average PSDs are usually slightly larger than ESDs and the CRs are usually slightly larger than 95% with the exception of p33 when the 2.5–97.5 percentile of its prior distribution does not include the true value. This is consistent with the literature for nonidentifiable models (Gustafson, 2005). However, for AP and SAP, An external file that holds a picture, illustration, etc.
Object name is biostskxq082fx11_ht.jpg is much bigger than ESD because the denominators, CET and CES, are very small, albeit often consistent with reality, which can result in the estimates of CET and CES close to zero and consequently unstable estimates of AP and SAP.

Table 3.

Frequentist properties for population-level counterfactual quantities summarized over 200 simulated data sets. The true parameter values: (λ,λ1S,λ2S,λ4S,λ1T,λ2T,λ4T,λ11ST,λ12ST,λ ...

The impact of the monotonicity assumption appears to be huge. Although ESDs are smaller than those from the full model (results not shown), the magnitudes of the biases are uniformly much larger. The nominal CRs are often much smaller than 95%. While NAE is much bigger than NDE in the full model which is consistent with the truth, the NAE estimate is much smaller than NDE and contributes to only a small fraction of the CET in the monotonicity model. The CRs are very poor for these quantities of interest (e.g. 22% for NDE and 13% for AP). Hence, despite the true additional probabilities pertaining to the full model being very small in our simulations, their impact on our statistical inferences can be enormous by assuming these probabilities to be zero.

We also explore the capacity of DIC as a tool to choose the better model. The first set of simulations let the truth be the full model with the same parameter specifications as above. We find that for 126 times out of 184 (68.5%), the full model leads to smaller DICs and is preferred to the monotonicity model; for 16 out of 200 times, the DIC values were negative for the full model. The second simulations use the same parameter specification but allow λ4S = λ4T = λ14ST = λ24ST = λ44ST = 0 to satisfy monotonicity. For 196 times out of 200 (98.5%), the fitted monotonicity model leads to smaller DICs and is preferred. Hence, when the true model is the full model and the degree of nonidentifiability increases, the DIC has smaller power to detect the correct model.

To investigate the limiting distributions of the population-level parameters, we conduct more simulations with the number of trials increasing from 12, 50 to 100 and with 100 patients per trial when both the true model and the fitted model are the full model. The results are shown in the supplementary material available at Biostatistics online. We find that the biases are negligible in all scenarios; as the number of trials increase, the ESDs for almost all population probabilities decrease suggesting that these quantities converge in probability.


We propose a Bayesian hierarchical model to assess surrogacy in a meta-analytic setting in a PS framework. Our model captures the association between the potential outcomes of S and those of T accounting for clustering. It provides the trial-specific causal association measures that vary from trial to trial and the population measures that describe the average causal association in the latent population. We then apply our method to an important clinical application in evaluating the causal surrogacy of 3-year DFS for 5-year OS based on a large collection of multiple colon cancer trials.

Our data analysis also demonstrates several methodological issues: (1) sensitivity to monotonicity assumption; and (2) challenges associated with nonidentifiability. The analysis shows that the relative magnitude of the NAE versus the NDE may be drastically different with and without the monotonicity assumption. Monotonicity is often assumed in previous literature to reduce the degree of nonidentifiability and to allow one to estimate the parameters more reliably and accurately. However, in our analysis, the monotonicity model did not provide sensible results. Our simulations show that a slight deviation from the assumption, that is, in the existence of a very small fraction of the patients in the “harmed” group, the estimates can be severely biased and intervals have undercoverage. On the other hand, a full model without the assumption gives us more sensible results in our analysis. In our simulations, the estimated causal association measures generally have good properties in terms of bias, variance estimation, and coverage probabilities. However, relaxing the assumption can result in huge efficiency loss and uncertainty in the estimates and subsequently the statistical inference can be uninformative. A higher degree of nonidentifiability appears to pose greater challenges in computations and achieving robustness toward model structures and prior specifications. Hence, there is a trade off between bias and variance, in this case between the ability to reliably estimate the counterfactual quantities and the ability to restrict the estimates to narrower ranges. The nonidentifiability issues can become more challenging when the treatment effects on S and those on T are small. The difficulty in assessing the surrogacy in a counterfactual framework is also illustrated by the work of Wolfson and Gilbert (2010). A limitation of our approach is our parametric assumptions for our hierarchical model. While we explain the sensitivity of our results to these assumptions to some degree, we also note that assumption-free analysis is impossible in the causal setting because of the large “missing data” problem implied by the unobserved potential outcomes. Even nonparametric estimation of simple compliance average causal effects or instrumental variable estimator requires monotonicity, which as our example shows, can have profound effect on inference.

There are many additional research questions that can be explored in our surrogacy setting. There is a need for investigation on the sample size requirements to detect meaningful associations and the methods to select best partially identifiable models. Our approach applies to the setting where S and T are binary. As one reviewer points out, when a continuous S and T are arbitrarily dichotomized, it can artificially create NAEs which may not exist. It will be useful to propose methods to evaluate causal surrogacy when S and T are other common data types.


National Institutes of Health (CA129102, CA25224).

Supplementary Material

Supplementary Data:


The authors would like to thank anonymous reviewers, Anna Chernin, Drs Abel Tilabun, Bhramar Mukherjee, Greg Yothers, and Penn/Michigan Causal Group for their help. Conflict of Interest: None declared.


  • Alonso A, Molenberghs G. Surrogate marker evaluation from an information theory perspective. Biometrics. 2007;1:180–186. [PubMed]
  • Alonso A, Molenberghs G. Evaluating time to cancer recurrence as a surrogate marker for survival from an information theory perspective. Statistical Methods in Medical Research. 2008;17:497–504. [PubMed]
  • Baker SG. Two simple approaches for validating a binary surrogate endpoint using data from multiple trials. Statistical Methods in Medical Research. 2008;17:505–514. [PubMed]
  • Buyse M, Molenberghs G. Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029. [PubMed]
  • Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000;1:49–67. [PubMed]
  • Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models. Bayesian Analysis. 2006;1:651–674.
  • Daniels MJ, Hughes MD. Meta-analysis for the evaluation of potential surrogate markers. Statistics in Medicine. 1997;16:1965–1982. [PubMed]
  • Frangakis CE, Rubin DB. Principal stratification in casual inference. Biometrics. 2002;58:21–29. [PubMed]
  • Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic disease. Statistics in Medicine. 1992;11:167–178. [PubMed]
  • Garret ES, Zeger SL. Latent class model diagnosis. Biometrics. 2000;56:1055–1067. [PubMed]
  • Gilbert PB, Hudgens MG. Evaluating causal effect predictiveness of candidate surrogate endpoints. Biometrics. 2008;65:1223–1232. [PMC free article] [PubMed]
  • Gustafson P. On model expansion, model contraction, identifiability, and prior information: two illustrative scenarios involving mismeasured variables. Statistical Science. 2005;20:111–140.
  • Joffe MM, Greene T. Related causal frameworks for surrogate outcomes. Biometrics. 2009;65:530–538. [PubMed]
  • Joffe MM, Small D, Hsu C. Defining and estimating intervention effects for groups that will develop an auxiliary outcome. Statistical Science. 2007;22:74–97.
  • Li Y, Taylor JMG, Elliott MR. A Bayesian approach to surrogacy assessment using principal stratification in clinical trials. Biometrics. 2010;66:523–531. [PMC free article] [PubMed]
  • Prentice RL. Surrogate endpoints in clinical trials, definition and operational criteria. Statistics in Medicine. 1989;8:431–440. [PubMed]
  • Rubin DB. Bayesian-inference for causal effects—role of randomization. The Annals of Statistics. 1978;6:34–58.
  • Rubin DB. Randomization analysis of experimental-data - the Fisher randomization test—comment. Journal of American Statistical Association. 1980;75:591–593.
  • Sargent DJ, Patiyil S, Yothers G. and others. End points for colon cancer adjuvant trials: observations and recommendations based on individual patient data from 20 898 patients enrolled onto 18 randomized trials from the ACCENT Group. Journal of Clinical Oncolology. 2007;25:4569–4574. [PubMed]
  • Sargent DJ, Wieand HS, Haller DG. and others. Disease-free survival versus overall survival as a primary end point for adjuvant colon cancer studies: individual patient data from 20 898 patients on 18 randomized trials. Journal of Clinical Oncolology. 2005;23:8664–8670. [PubMed]
  • Taylor JMG, Wang Y, Thiébaut R. Counterfactual links to the proportion of treatment effect explained by a surrogate marker. Biometrics. 2005;61:1102–1111. [PubMed]
  • Tilahun A, Pryseley A, Alonso A. and others. Information theory-based surrogate marker evaluation from several randomized clinical trials with binary endpoints, using SAS. Journal of Biopharmaceutical Statistics. 2008;18:326–341. [PubMed]
  • Wolfson J, Gilbert P. Statistical Identifiability and the Surrogate Endpoint Problem, with Application to Vaccine Trials. Biometrics. 2010;66:1153–1161. [PMC free article] [PubMed]

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press