HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal.
 
Stat Biopharm Res. Author manuscript; available in PMC 2010 June 23.
Published in final edited form as:
Stat Biopharm Res. 2010 May 1; 2(2): 229–238.
doi:  10.1198/sbr.2009.0070
PMCID: PMC2890300
NIHMSID: NIHMS171595

Evaluating the Proportion of Treatment Effect Explained by a Continuous Surrogate Marker in Logistic or Probit Regression Models

Jie Huang, Associate Director and Bin Huang, Associate Professor

Abstract

Using surrogate endpoints in clinical trials is desirable for drug development because trials can be shortened and therefore made more cost-effective. Validating a surrogate for the clinical endpoint is critical in this context. One of the key steps in the statistical validation of a surrogate in a single trial is to estimate the proportion of treatment effect explained (PTE or PE) by the surrogate. Often PTE is estimated from the difference between the treatment coefficients of two models for the clinical endpoint, one adjusting for the surrogate and one not. This method has inherent problems: the two models may not be valid simultaneously, and the estimate can often lie outside the interval [0, 1]. In this article, we provide alternative measures for evaluating the proportion of treatment effect explained by a surrogate in logistic or probit regression models. Our measures can be estimated easily with any statistical program capable of binary regression modeling, and their interpretation can be illustrated using Ordinal Dominance (OD) curves, so the concept can be understood visually by any practical user. Simulation shows that our alternative measures yield more accurate estimates, which are less biased and less variable and have narrower confidence intervals. A clinical trial example is provided.

Keywords: Biomarker, Logistic, Mediator, Probit model, Ordinal dominance curve, Surrogate validation

1. Introduction

In clinical trials, study endpoints such as cancer survival or cardiovascular event require a prolonged follow-up time. The trials can be costly; it may be difficult to enroll patients, and even more difficult to follow and monitor patients. However, a surrogate endpoint can usually be observed early during a trial, and can thus be used as an attractive substitute for the clinical endpoint in studying treatment effect. Using surrogate endpoints, trials may be shortened, and it may be possible to avoid fatal clinical endpoints before drug approval. It makes efficacious treatment available to patients sooner, saves patients’ lives, and reduces medical expenditure. However, a critical question must be answered beforehand: What may serve as a valid surrogate endpoint? Both the U.S. regulatory agency Food and Drug Administration (FDA) (Temple 1995) and the European Medicines Agency (EMEA) (EMEA/CHMP report 2007) have recognized the importance of developing and validating surrogate endpoints. The validation of biomarkers as surrogate endpoints is part of FDA’s “critical path initiative.” Workshops and meetings were organized among the regulatory agencies, industry representatives, and the academics to discuss and set out position statements on surrogate endpoints (An NIDAID Workshop 1989; Biomarkers Definitions Working Group 2001; DeGruttola et al. 2001). In recent years, statistical validation of surrogate endpoints has attracted increased attention.

In his landmark article, Prentice (1989) defined a surrogate as “a response variable for which a test of the null hypothesis of no relationship to the treatment groups under comparison is also a valid test of the corresponding null hypothesis based on the true endpoint.” He defined three criteria for admitting a valid surrogate. One of the criteria requires that “the full effect of treatment on the true endpoint is captured by the surrogate, that is, f (T |S, Z) = f (T |S).” This condition is rather stringent and unlikely to hold in practice. To assess this condition, Freedman, Graubard, and Schatzkin (1992) proposed a statistic to measure the proportion of the treatment effect explained by the surrogate (hereafter PE(FGS)). This statistic is defined as the percentage change in the treatment effect between two models, one adjusting for the surrogate marker and one not. It is well known that PE(FGS) suffers from several serious drawbacks (DeGruttola et al. 1997; Bycott and Taylor 1998; Wang and Taylor 2002). In particular, either the point estimate or its confidence interval can fall outside the range [0, 1]. As a result, it often fails to provide a meaningful assessment of “the proportion of treatment effect explained by a surrogate marker.” Subsequently, Wang and Taylor (2002) provided the more desirable alternative measures F(F’) for PTE, which are less variable and have point estimates within [0, 1] under certain conditions. When the surrogate marker and the outcome endpoint are either both continuous or both binary, F(F’) can be calculated easily. For a continuous surrogate marker and a binary outcome, Wang and Taylor (2002) suggested two approaches to derive the measures, but one is computationally difficult, and the other is not easily interpretable. The need to evaluate a continuous surrogate for a binary endpoint arises often in clinical studies. 
For example, blood pressure level has been accepted by clinicians and regulators as a surrogate for the incidence of stroke and congestive heart failure (Biomarkers Definitions Working Group 2001; DeGruttola et al. 2001). In an oncology imaging study, the use of post-therapy FDG-PET as a metabolic surrogate marker of tumor response in cervical cancer was prospectively validated (Schwarz et al. 2007). In this article, we provide alternative measures to evaluate the proportion of treatment effect explained by a surrogate in logistic and probit regression.

The remainder of this article is organized as follows. Section 2 introduces the definition of PTE in general. Section 3 provides our simple alternative PTE measures for the case of continuous surrogate marker and binary outcome. Our measures can be obtained easily with any commonly used statistical computation package capable of binary linear regression modeling. Simulation results are shown in Section 4 for comparing our measures to PE(FGS). Results of our measures and PE(FGS) are compared to each other; both are compared to the estimated true PTE from the continuous latent outcome of the binary outcome. Interpretations of our measures are provided using Ordinal Dominance (OD) curves (Bamber 1975); a concept that can be easily understood by practical users. In Section 5, a clinical example is illustrated for the method. Discussions and conclusions are provided in Section 6.

2. Measure of the Proportion of Treatment Effect Explained by the Surrogate

Freedman, Graubard, and Schatzkin (1992) defined the proportional of treatment effect explained by the surrogate (PE(FGS)) as the percentage change of the coefficient of the treatment effect from two models, that is, with or without adjusting for a surrogate in the model. Their article focused on the binary outcome in the logistic model. The method provides an intuitive concept, but the two logistic models may not be valid at the same time. Estimation of PE(FGS) is shown to be quite variable; the point estimate and its confidence interval can be outside of interval [0,1] (Freedman 2001). When the endpoint is the time to event, Lin, Fleming, and DeGruttola (1997) defined PTE based on the two Cox proportional hazard models with or without adjusting for the surrogate in the model. The two models can have different baseline hazards. Although both models may be approximated by valid Cox PH models, the confidence interval is again very wide.

Considering estimating PTE in the generalized linear model setting, the outcome observation (ti) is assumed to be from a natural exponential family with density function:

$$f(t_i;\theta_i,\phi)=\exp\left[\frac{t_i\theta_i-b(\theta_i)}{a(\phi)}+c(t_i,\phi)\right].$$

The outcome (ti) relates to treatment (zi), and the surrogate (si) by a generalized model [T |Z, S],

$$E(T_i \mid z_i, s_i)=h^{-1}(\beta_0+\beta_1 s_i+\beta_2 z_i).$$

h(.) is the link function for the generalized linear model. If this full model holds, the marginal model of ti, conditional on treatment only, would be obtained by integrating over the distribution function of f (s|zi). The marginal model [T |Z] is

$$f(t_i \mid z_i)=\int f(t_i \mid z_i, s)\, f(s \mid z_i)\, ds.$$

If both f (ti |zi, si) and f (si |zi) are normal density functions, the marginal model [T |Z] can be obtained easily and is again normal, like [T |Z, S]. In general, however, the integral may have no closed form, and f (ti |zi, si) and f (ti |zi) may be different types of density functions. For example, if h(.) is a logit link function, numerical approximation is necessary to derive the marginal model [T |Z].
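As a rough illustration of the numerical approximation mentioned above, the sketch below integrates a logit full model over a normal surrogate distribution with a simple trapezoid rule. The function names and all parameter values are hypothetical and chosen only for illustration; any other quadrature rule could be substituted.

```python
import math
from statistics import NormalDist


def expit(x):
    """Inverse logit, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def marginal_prob(z, beta0, beta1, beta2, mu_s, sigma_s, n=4001, width=8.0):
    """Approximate P(T=1 | Z=z) = integral of expit(b0 + b1*s + b2*z) f(s|z) ds.

    f(s|z) is assumed normal(mu_s, sigma_s); the integral is evaluated with
    the trapezoid rule on mu_s +/- width * sigma_s.
    """
    dist = NormalDist(mu_s, sigma_s)
    lo = mu_s - width * sigma_s
    step = 2.0 * width * sigma_s / (n - 1)
    total = 0.0
    for i in range(n):
        s = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * expit(beta0 + beta1 * s + beta2 * z) * dist.pdf(s)
    return total * step


# Hypothetical values: surrogate mean 1.0 under Z=0 and 0.5 under Z=1.
p0 = marginal_prob(0, beta0=-0.5, beta1=1.0, beta2=-0.4, mu_s=1.0, sigma_s=1.0)
p1 = marginal_prob(1, beta0=-0.5, beta1=1.0, beta2=-0.4, mu_s=0.5, sigma_s=1.0)
```

The resulting marginal probabilities are generally not of logistic form in z once the surrogate is integrated out, which is the point made in the text.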

In practice, a reduced model for outcome (ti) and treatment (zi) is fitted with the same link function as the full model but omitting the surrogate variable:

$$E(T_i \mid z_i)=h^{-1}(\gamma_0+\gamma_1 z_i).$$

The PE(FGS) is estimated by

$$\mathrm{PE(FGS)}=1-\frac{\beta_2}{\gamma_1}.$$
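The calculation above amounts to one line of arithmetic once the two treatment coefficients are available. The sketch below uses made-up coefficient values (they do not come from any fitted model in this article) to show how easily the ratio can leave [0, 1].

```python
def pe_fgs(beta2, gamma1):
    """PE(FGS) = 1 - beta2 / gamma1.

    beta2:  treatment coefficient in the full model (adjusted for the surrogate)
    gamma1: treatment coefficient in the reduced model (surrogate omitted)
    """
    if gamma1 == 0:
        raise ZeroDivisionError("gamma1 = 0: PE(FGS) is undefined")
    return 1.0 - beta2 / gamma1


# Hypothetical coefficients, for illustration only.
inside = pe_fgs(beta2=-0.2, gamma1=-0.5)    # adjusted effect is smaller: 0.6
outside = pe_fgs(beta2=-0.6, gamma1=-0.5)   # adjusted effect is larger: -0.2
```

Because the estimate is a ratio of two noisy coefficients, a small gamma1 inflates its variability, which is consistent with the instability described in the text.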

This definition of PE(FGS) is intuitive and easy to calculate, but it has inherent problems. First, the full model and the marginal model may not be valid simultaneously except in special cases. Second, the PE(FGS) estimate can be extremely variable, and its point estimate and confidence interval can lie outside the interval [0,1]. For these reasons, more reliable statistical measures of PTE are needed.

3. Alternative Measure of the Proportion of Treatment Effect Explained by the Surrogate: A Better Approach

3.1 Definition of F (F’) as Alternative Measures of PTE

Motivated by the idea of Tsiatis, DeGruttola, and Wulfsohn (1995), Wang and Taylor (2002) defined alternative measures (F and F’) for the proportion of treatment effect explained by a surrogate. The measures are defined by two ratios, that is,

$$F=\frac{AA-AB}{AA-BB}\quad\text{and}\quad F'=\frac{BA-BB}{AA-BB}.$$

Terms AA, BB, AB, and BA are defined as

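$$AA=\tilde{h}\left(\int g_A(s)\,dP_A(s)\right),\quad AB=\tilde{h}\left(\int g_A(s)\,dP_B(s)\right),\quad BA=\tilde{h}\left(\int g_B(s)\,dP_A(s)\right),\quad\text{and}\quad BB=\tilde{h}\left(\int g_B(s)\,dP_B(s)\right),\tag{1}$$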

where PA(s) and PB(s) denote the probability distribution functions of the surrogate (S) in the treatment A group and the treatment B group (treatment indicator Z = 0 for A and Z = 1 for B), respectively; gA(s) and gB(s) are functions of the conditional distributions of outcome [T |s] in the two treatment groups; and h̃(.) is a monotone link function. The quantities {AA, AB, BA, BB} are four types of potential outcomes, one for each combination of treatment group and surrogate distribution. In other words, if one could manipulate the experiment so that the distribution of the surrogate marker could be assigned independently of the treatment condition, one would observe four types of outcomes, labeled AA, AB, BA, and BB. Since such manipulation is not possible in reality, only AA and BB, the expected outcomes under treatments A and B, can be observed. The potential outcomes AB and BA are introduced to evaluate the effect of the surrogate marker. (AA − BB) measures the overall treatment difference, and (AA − BB) = (AA − AB) + (AB − BB). Here (AA − AB) measures the change in the probability of outcome (T = 1) for patients in the treatment A group had their surrogate distribution changed from PA(S) to PB(S), and (AB − BB) measures the change in the probability of outcome (T = 1) due to the change of treatment, assuming the surrogate follows its distribution under treatment B. Thus F measures the proportion of the overall treatment difference that is attributable to the change in the distribution of the surrogate, or “the overall treatment difference explained by a surrogate.” F’ is defined similarly, as a mathematically symmetric counterpart of F: a similar interpretation holds for F’ and (BA − BB), with (AA − BB) = (AA − BA) + (BA − BB). Wang and Taylor (2002) studied, analytically and by simulation, the cases in which the outcome and the surrogate are either both continuous or both binary. When both T and S are normally distributed continuous variables, F = F’ = PE(FGS). 
Although this may not be the case in the nonnormal setting, simulation studies in Wang and Taylor’s work and in this article show that F and F’ are very close in value. The average of F and F’ has thus been suggested. For a binary outcome and a continuous surrogate marker, Wang and Taylor suggested two approaches. One is to define the treatment effect on the probability scale, that is, to assume logit(gZ(S)) is a linear model in Z and S, which leads to a messy expression for F (F’); the other is to model gZ(S) by a linear function, which makes the interpretation of F(F’) difficult. Neither approach provides a good solution for the practical user. However, if we use the probit model for [T |S, Z] to define the treatment effect and the normal distribution function for PZ(s), we are able to derive analytical solutions for F and F’. When the logistic model is desired for [T |S, Z], F and F’ can be derived by approximating the logistic function with the probit function. For the remainder of the article, we examine the measures F (F’) and provide an easy solution for the setting of logistic and probit regression models of a binary clinical outcome. Interpretations of our measures are provided graphically.
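Once the four terms AA, AB, BA, and BB are available, the two ratios and their average reduce to simple arithmetic. The sketch below is only an illustration; the function name and the four input values are hypothetical.

```python
def f_measures(aa, ab, ba, bb):
    """Return (F, F', average) from the four potential-outcome terms.

    F  = (AA - AB) / (AA - BB)
    F' = (BA - BB) / (AA - BB)
    """
    overall = aa - bb  # overall treatment difference
    if overall == 0:
        raise ValueError("AA = BB: no overall treatment difference to explain")
    f = (aa - ab) / overall
    f_prime = (ba - bb) / overall
    return f, f_prime, 0.5 * (f + f_prime)


# Hypothetical event probabilities, for illustration only.
f, f_prime, f_avg = f_measures(aa=0.60, ab=0.50, ba=0.48, bb=0.40)
```

With these inputs F is about 0.5 and F’ about 0.4: half of the overall difference AA − BB is attributed to the shift in the surrogate distribution under F, and slightly less under F’.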

3.2 Estimation of F(F’) for Binary Outcome in Randomized Clinical Study

In the setting of a randomized clinical trial, T is a binary outcome, Z is a binary indicator of treatment, and S is a continuous surrogate marker. Without loss of generality, we assume that an increased surrogate value makes an event more likely (T = 1 for event and T = 0 for nonevent). Let us assume PA(S) and PB(S) are normal distribution functions (Φ denotes the standard normal distribution function)

$$P_A(s)=\Phi\left(\frac{s-\mu_A}{\sigma_A}\right),\qquad P_B(s)=\Phi\left(\frac{s-\mu_B}{\sigma_B}\right).\tag{2}$$

Let gA(S) and gB(S) be the mean of the conditional distribution of [T |S, Z = 0] and [T |S, Z = 1], that is, gA(S) = Pr(T = 1|S, Z = 0) and gB(S) = Pr(T = 1|S, Z = 1). More specifically, let the conditional probability of the outcome (T) given S take the probit functional forms

$$g_A(s)=\Pr(T=1 \mid s, Z=0)=\Phi(\beta_0+\beta_1 s),\qquad g_B(s)=\Pr(T=1 \mid s, Z=1)=\Phi(\beta_0+\beta_1 s-\omega).\tag{3}$$

The parameter ω (= −β2) represents the effect of treatment on the outcome. Setting ω as nonnegative ensures reduced odds of an event (T = 1) for treatment versus placebo given S = s. In the expression for F(F’), (AA − AB) measures the change in the probability [T = 1|Z = A] for patients in the treatment A group, had their surrogate distribution followed PB(S) instead of PA(S). A similar interpretation holds for (BA − BB). Both F and F’ measure the part of the treatment effect that is due to the change in the distribution of the surrogate S. From definitions (2) and (3), it can be shown that

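$$AA=\Phi\left(\frac{\beta_0/\beta_1+\mu_A}{\sqrt{1/\beta_1^2+\sigma_A^2}}\right),\quad BB=\Phi\left(\frac{(\beta_0-\omega)/\beta_1+\mu_B}{\sqrt{1/\beta_1^2+\sigma_B^2}}\right),\quad AB=\Phi\left(\frac{\beta_0/\beta_1+\mu_B}{\sqrt{1/\beta_1^2+\sigma_B^2}}\right),\quad BA=\Phi\left(\frac{(\beta_0-\omega)/\beta_1+\mu_A}{\sqrt{1/\beta_1^2+\sigma_A^2}}\right).\tag{4}$$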
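The closed forms in (4) can be checked numerically. The sketch below evaluates the four terms for one hypothetical parameter set and compares AB with a direct trapezoid-rule evaluation of the integral of gA(s) dPB(s). It assumes β1 > 0 and the identity function for the outer transformation; all names and values are illustrative only.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def potential_outcomes(beta0, beta1, omega, mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form AA, AB, BA, BB from Equation (4); assumes beta1 > 0."""
    def term(intercept, mu, sigma):
        return PHI((intercept / beta1 + mu) / math.sqrt(1.0 / beta1**2 + sigma**2))

    return {
        "AA": term(beta0, mu_a, sigma_a),
        "AB": term(beta0, mu_b, sigma_b),
        "BA": term(beta0 - omega, mu_a, sigma_a),
        "BB": term(beta0 - omega, mu_b, sigma_b),
    }


def integrate_numerically(intercept, beta1, mu, sigma, n=4001, width=8.0):
    """Trapezoid-rule value of the integral of Phi(intercept + beta1*s) dP(s),
    where P is the normal(mu, sigma) distribution function."""
    dist = NormalDist(mu, sigma)
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        s = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * PHI(intercept + beta1 * s) * dist.pdf(s)
    return total * step


# Hypothetical parameter values, for illustration only.
params = dict(beta0=-0.5, beta1=0.8, omega=0.3,
              mu_a=1.0, sigma_a=1.0, mu_b=0.4, sigma_b=1.0)
terms = potential_outcomes(**params)
check_ab = integrate_numerically(params["beta0"], params["beta1"],
                                 params["mu_b"], params["sigma_b"])
```

The two values of AB agree to within the quadrature error, so in this setting no numerical integration is needed in practice.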

Each of the parameters in (4) can be estimated from the fitted models (2) and (3), so F (F’) can be calculated easily. The choice of g(.) and h̃(.) may vary. In the setting of interest, we use the identity function for h̃(.) and the mean function for g(.). When the logit function is desired for the conditional probability [T |S], no analytical solution is available for the integrals in (4). However, one may use the good approximation

$$\frac{\exp(y)}{1+\exp(y)}\approx\Phi\left(\frac{\sqrt{3}}{\pi}\,y\right)$$

to obtain the estimates of AA, BB, AB, and BA.
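To give a feel for the quality of this approximation, the sketch below compares the logistic function with the normal distribution function rescaled by √3/π, the factor that matches the variances of the two distributions. The grid and the helper names are arbitrary choices made for this illustration.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf
SCALE = math.sqrt(3.0) / math.pi  # matches the logistic variance pi^2 / 3


def logistic(y):
    """Logistic distribution function exp(y) / (1 + exp(y))."""
    return 1.0 / (1.0 + math.exp(-y))


def probit_approx(y):
    """Normal approximation to the logistic function."""
    return PHI(SCALE * y)


# Largest absolute discrepancy on a grid over [-8, 8].
grid = [i / 100.0 for i in range(-800, 801)]
max_abs_error = max(abs(logistic(y) - probit_approx(y)) for y in grid)
```

On this grid the largest discrepancy is on the order of 0.02, so fitted logit coefficients can be rescaled and plugged into the probit-based expressions at the cost of a small approximation error.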

In the special case ω = 0, we have AB − BB = 0 and AA − BA = 0, thus F = F’ = 1. If μA = μB and σA = σB, that is, the distributions of S are the same in both treatment groups, then AA = AB, which implies F = 0, and BA = BB, which implies F’ = 0. From the definitions of AA, BB, AB, and BA, it is easy to see that each term is a probability and therefore lies within [0, 1]. Without loss of generality, it is assumed that AA ≥ BB. For F and F’ to be bounded within [0, 1], the necessary and sufficient conditions are AA ≥ AB ≥ BB and AA ≥ BA ≥ BB. Of more practical interest are the following sufficient conditions (Wang and Taylor 2002):

  • R1: PA(.) is stochastically higher than PB(.);
  • R2: gA(s) and gB(s) are nondecreasing functions of s;
  • R3: gA(s) ≥ gB(s) for all s.

Such conditions are generally met if S is a surrogate endpoint. In particular, for the setting of interest, if the variances of S are equal for both treatment groups, then all three conditions are met.
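The special cases and the bounds above can be verified numerically. The sketch below computes F and F’ directly from the probit and normal parameters, using the algebraically equivalent form (β0 + β1μ)/√(1 + β1²σ²) of each term in (4), which holds for β1 > 0. Parameter values are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def f_and_f_prime(beta0, beta1, omega, mu_a, mu_b, sigma_a, sigma_b):
    """Return (F, F') under probit g and normal surrogate distributions."""
    def term(intercept, mu, sigma):
        return PHI((intercept + beta1 * mu) / math.sqrt(1.0 + (beta1 * sigma) ** 2))

    aa = term(beta0, mu_a, sigma_a)
    ab = term(beta0, mu_b, sigma_b)
    ba = term(beta0 - omega, mu_a, sigma_a)
    bb = term(beta0 - omega, mu_b, sigma_b)
    return (aa - ab) / (aa - bb), (ba - bb) / (aa - bb)


# Perfect surrogate: no direct treatment effect (omega = 0).
perfect = f_and_f_prime(-0.5, 0.8, 0.0, mu_a=1.0, mu_b=0.4, sigma_a=1.0, sigma_b=1.0)

# Useless surrogate: identical surrogate distributions in both groups.
useless = f_and_f_prime(-0.5, 0.8, 0.3, mu_a=1.0, mu_b=1.0, sigma_a=1.0, sigma_b=1.0)

# Intermediate case with equal variances, where R1-R3 all hold.
partial = f_and_f_prime(-0.5, 0.8, 0.3, mu_a=1.0, mu_b=0.4, sigma_a=1.0, sigma_b=1.0)
```

In the intermediate case both measures fall strictly between 0 and 1 and are close to each other, in line with the sufficient conditions R1 to R3.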

Unlike PE(FGS), the definition of the F(F’) measures does not rest on specific model assumptions. The measures can be applied to any model as long as each term can be calculated.

3.3 Graphical Presentation of F(F’) for the Continuous Surrogate

By definition, each of the terms AA, BB, AB, and BA is a probability defined by two distribution functions F(.) and G(.), that is, ∫ F(x)dG(x). Thus, each term can be represented by the area under an ordinal dominance (OD) curve (Bamber 1975) connecting (0, 0) and (1, 1) in a two-dimensional probability space. Figure 1 illustrates a clinical example that is presented in detail in Section 5. OD curves are plotted for AA (solid line), BB (long dashed), AB (dotdashed), and BA (dotted dash). The areas under the curves are the values of AA, BB, AB, and BA. Thus the area bounded by the AA and BB curves corresponds to (AA − BB), that is, the overall treatment effect. Further, the area (AA − BB) can be divided into two mutually exclusive areas, (AA − AB) and (AB − BB). Therefore the F statistic can be represented by the ratio of two areas: the area between the AA and AB curves (AA − AB) and the area between the AA and BB curves (AA − BB). Similarly, F’ is represented by the ratio of the areas (BA − BB) and (AA − BB). This graphic presentation is easy to understand and corresponds directly to the definitions of the F and F’ statistics. When the AA and BB curves overlap, there is no treatment effect; the larger the (AA − BB) area, the stronger the treatment effect. On the other hand, the closer together the AB and BB curves (or the BA and AA curves) are, the higher the percentage of the treatment effect explained by the surrogate. When AA overlaps AB or BA overlaps BB, F(F’) = 0, suggesting the surrogate is useless. When AA overlaps BA or AB overlaps BB, F(F’) = 1, suggesting the surrogate is perfect.
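The area interpretation can be illustrated with a short numerical sketch. It traces the curve of points (PA(s), gA(s)) as s runs over the range of the surrogate and compares the trapezoid area under that curve with the closed-form value of AA. The axis assignment, the helper names, and the parameter values are assumptions made for this illustration.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def od_area(g, p, lo, hi, n=4001):
    """Trapezoid area under the curve traced by (p(s), g(s)) for s in [lo, hi]."""
    step = (hi - lo) / (n - 1)
    area = 0.0
    x_prev, y_prev = p(lo), g(lo)
    for i in range(1, n):
        s = lo + i * step
        x, y = p(s), g(s)
        area += 0.5 * (y + y_prev) * (x - x_prev)
        x_prev, y_prev = x, y
    return area


# Hypothetical parameters for treatment group A.
beta0, beta1, mu_a, sigma_a = -0.5, 0.8, 1.0, 1.0


def g_a(s):
    """Pr(T = 1 | s, Z = 0) under the probit model."""
    return PHI(beta0 + beta1 * s)


def p_a(s):
    """Distribution function of the surrogate in group A."""
    return PHI((s - mu_a) / sigma_a)


area_aa = od_area(g_a, p_a, mu_a - 8.0 * sigma_a, mu_a + 8.0 * sigma_a)
closed_aa = PHI((beta0 + beta1 * mu_a) / math.sqrt(1.0 + (beta1 * sigma_a) ** 2))
```

Repeating the calculation with the other three combinations of g and P gives the remaining curves, and F and F’ follow as ratios of the areas between them.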

Figure 1
Graphic presentation of F(F’) in terms of the Ordinal Dominance (OD) curves for a clinical case study. Areas under each curve are the values for AA, BB, AB, and BA. OD curves are plotted with different symbols for AA (—- solid line), ...

4. Simulation Study

A simulation study is conducted for various scenarios (Table 1) in which PTE varies from useless to perfect (0%, 16.67%, 33.33%, 66.67%, 80%, and 100%). The purpose of the simulation is to compare the performance of the F(F’) measures with that of the PE(FGS) measure in the probit model. For the specific setting of the problem (i.e., binary outcome and treatment with continuous surrogate), PE(FGS) and F(F’) are by definition not the same. However, suppose the outcome endpoint is continuous (Tc), and treatment (Z), surrogate (S), and outcome (Tc) are linearly related by S = α0 + α1Z + e and Tc = β0 + β1S + β2Z + ε (e and ε are iid normal). Then, under the assumption that both Tc and S are normally distributed, PE(FGS) and F(F’) all coincide, that is, PE(FGS) = F = F’ = α1β1/(α1β1 + β2) (Wang and Taylor 2002). We refer to this value as the true latent PTE. A binary random variable T is derived from the latent continuous measure Tc, that is, T is a random variable from a Bernoulli distribution with probability Φ(tc). Simulation results are compared to this true latent PTE to assess the performance of PE(FGS) and F(F’). Unless the surrogate is perfect or useless, F and F’ usually differ slightly; thus we also include the estimate FM (= (F + F’)/2). Theoretical expected values of F, F’, FM, and latent PTE are calculated and presented in Table 1.
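A data-generating step of this kind could be sketched as follows. The parameter values are hypothetical and are not the Table 1 settings. The sketch also assumes that the Bernoulli probability is Φ applied to the linear predictor β0 + β1S + β2Z, which is one reading of the description above.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf


def latent_pte(alpha1, beta1, beta2):
    """True latent PTE = alpha1*beta1 / (alpha1*beta1 + beta2)."""
    return alpha1 * beta1 / (alpha1 * beta1 + beta2)


def simulate_trial(n_per_arm, alpha0, alpha1, beta0, beta1, beta2,
                   sigma_s=1.0, seed=0):
    """Generate (z, s, t) triples for a two-arm trial.

    s is normal around alpha0 + alpha1*z; t is Bernoulli with probability
    Phi(beta0 + beta1*s + beta2*z) (an assumption of this sketch).
    """
    rng = random.Random(seed)
    rows = []
    for z in (0, 1):
        for _ in range(n_per_arm):
            s = alpha0 + alpha1 * z + rng.gauss(0.0, sigma_s)
            prob = PHI(beta0 + beta1 * s + beta2 * z)
            t = 1 if rng.random() < prob else 0
            rows.append((z, s, t))
    return rows


# Hypothetical parameters giving a latent PTE of 2/3.
pte = latent_pte(alpha1=-1.0, beta1=0.5, beta2=-0.25)
rows = simulate_trial(500, alpha0=1.0, alpha1=-1.0,
                      beta0=-0.5, beta1=0.5, beta2=-0.25)
```

Setting beta2 to zero gives a latent PTE of one (perfect surrogate), and setting alpha1 to zero gives a latent PTE of zero (useless surrogate).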

Table 1
Simulation design setting with true values

The sample size is set to 500 for each treatment group. For each scenario, simulation results are obtained from 1000 replicates and are summarized in Table 2. The bootstrap estimates are based on 1000 bootstrap samples. Since the distributional forms of the F and F’ statistics are unknown, following Wang and Taylor’s (2002) suggestion, the bias-corrected (BC) bootstrap method (Efron and Tibshirani 1986) is used for constructing the confidence intervals of F and F’. The following two models are used for the simulation.
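For readers unfamiliar with the bias-corrected bootstrap, the following generic sketch shows the idea for an arbitrary statistic. It is not the code used for the simulations reported here: the function name, the quantile rule, and the toy data are all illustrative assumptions, and in a trial setting one would typically resample within treatment groups.

```python
import random
from statistics import NormalDist, mean

STD = NormalDist()


def bc_bootstrap_ci(data, statistic, n_boot=1000, level=0.95, seed=0):
    """Bias-corrected (BC) bootstrap percentile interval for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    boots = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Bias-correction constant: normal quantile of the share of bootstrap
    # estimates below the original estimate (clamped so inv_cdf stays finite).
    share = sum(b < theta_hat for b in boots) / n_boot
    share = min(max(share, 1.0 / (n_boot + 1)), n_boot / (n_boot + 1.0))
    z0 = STD.inv_cdf(share)
    alpha = 1.0 - level
    p_lo = STD.cdf(2.0 * z0 + STD.inv_cdf(alpha / 2.0))
    p_hi = STD.cdf(2.0 * z0 + STD.inv_cdf(1.0 - alpha / 2.0))

    def quantile(p):
        idx = min(max(int(round(p * (n_boot - 1))), 0), n_boot - 1)
        return boots[idx]

    return theta_hat, quantile(p_lo), quantile(p_hi)


# Toy data: the statistic here is a sample mean, standing in for F or F'.
rng = random.Random(1)
sample = [rng.gauss(0.5, 1.0) for _ in range(200)]
estimate, lower, upper = bc_bootstrap_ci(sample, mean)
```

When the bootstrap distribution is centered on the original estimate, z0 is close to zero and the interval reduces to the ordinary percentile interval.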

$$S=\alpha_0+\alpha_1 Z+e;\tag{Model 1}$$

$$\Phi^{-1}(P(T=1))=\beta_0+\beta_1 S+\beta_2 Z.\tag{Model 2}$$

To calculate PTE, the marginal model below is estimated,

$$\Phi^{-1}(P(T=1))=\gamma_0+\gamma_1 Z.\tag{Model 3}$$
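Because Z is binary, Model 3 is saturated, and its maximum likelihood estimates follow directly from the observed event proportions in the two groups. The sketch below illustrates this shortcut on made-up data. Model 2 includes the continuous surrogate and has no such shortcut, so it would be fitted iteratively with a statistical package.

```python
from statistics import NormalDist

INV_PHI = NormalDist().inv_cdf


def fit_marginal_probit(z, t):
    """Maximum likelihood fit of probit(P(T=1)) = gamma0 + gamma1*Z, binary Z.

    With a single binary covariate the fitted probabilities equal the observed
    event proportions, which must lie strictly between 0 and 1.
    """
    n1 = sum(z)
    n0 = len(z) - n1
    p0 = sum(ti for zi, ti in zip(z, t) if zi == 0) / n0
    p1 = sum(ti for zi, ti in zip(z, t) if zi == 1) / n1
    gamma0 = INV_PHI(p0)
    gamma1 = INV_PHI(p1) - gamma0
    return gamma0, gamma1


# Made-up data: 6 of 10 events under Z = 0 and 3 of 10 under Z = 1.
z = [0] * 10 + [1] * 10
t = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7
gamma0, gamma1 = fit_marginal_probit(z, t)
```

A negative gamma1 here reflects the lower event proportion in the treated group.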

Estimates of PE(FGS) and FM are summarized in Table 2. The distributions of these estimates are plotted as histograms in Figure 2. The results show that, in all of the cases considered, F(F’) and FM have little bias and smaller standard deviations (SD) than PE(FGS). In all cases, the estimated PE(FGS) is biased toward 0 and seriously underestimates the true PTE, except for design case C6, in which β2 = 0 and the surrogate is perfect. The variability of the estimated PE(FGS) can be more than twice that of FM, except in design C3. The variability is particularly large when the marker is useless or perfect (cases C1 and C6). In design C3, the SD of PE(FGS) is smaller because the mean estimate is around zero (compared to the true effect of 1/3). In fact, when the true PTE is less than 1/3, the estimated PE(FGS) is negative. The bias of the estimated PE(FGS) shrinks as the true surrogate effect approaches one; however, its standard deviation remains larger than that of the F(F’) estimates. As presented in Figure 2, the distributions of PE(FGS) are more skewed to the left than those of F(F’), particularly for design C1 (true effect = 0). Also in Table 2, the average FM is close to the true PTE. The SDs of F and F’ are similar. The problems of PE(FGS) identified in our simulation are consistent with the simulation findings of other studies (Bycott and Taylor 1998; Wang and Taylor 2002).

Figure 2
Histograms of the distributions of the estimated PE(FGS) and FM for different simulation settings. FM has smaller biases and standard deviations (SD) than PE(FGS).
Table 2
Summary of asymptotic results from simulation study for F, F’, FM, and PE(FGS)

Table 3 presents the 95% CI (confidence interval) and its coverage rate for F, F’, FM, and PE(FGS). PE(FGS) has the poorest coverage rate, especially when the true PTE is small (only 4.8% when true PTE = 66.7%). On the other hand, F, F’, and FM have near-nominal coverage rates (>93%) in all scenarios. Table 4 shows the numbers of times (out of 1000 simulations) that the 95% CIs lie within [0, 1], have lower bound ≥ 0, and have upper bound ≤ 1. As discussed earlier, PE(FGS) underestimates the true value by 30% or more. Its 95% CI is therefore shifted in the negative direction: the lower bound is often below 0 and the upper bound is below 1. This can be seen from the results of cases C1–C4. PE(FGS) is worse than FM in terms of lower bound ≥ 0 or CI within [0, 1]; the two are comparable in terms of upper bound ≤ 1. In cases C5 and C6, the results for PE(FGS) and FM are comparable and both CIs are often within [0, 1], but the point estimate of PE(FGS) and its CI are biased toward zero.

Table 3
95% confidence intervals (CI) by BC pivotal percentile, and coverage
Table 4
Shown are the numbers of times (out of 1000) that the 95% confidence intervals (CIs) lie between [0,1], that lower bounds are greater than or equal to zero, and that upper bounds are less than or equal to one for FM and PE(FGS)

With the probit model, it can be derived that Model 3 takes the form

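$$\Phi^{-1}(P(T=1 \mid Z))=\frac{\beta_0+\alpha_0\beta_1}{\sqrt{1+\beta_1^2\sigma_s^2}}+\frac{\beta_2+\alpha_1\beta_1}{\sqrt{1+\beta_1^2\sigma_s^2}}\,Z.$$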

Thus, PE(FGS) by definition can be derived, that is,

$$\mathrm{PE(FGS)}=1-\frac{\beta_2\sqrt{1+\beta_1^2\sigma_s^2}}{\beta_2+\alpha_1\beta_1}.$$

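Comparing PE(FGS) to the latent PTE (= α1β1/(α1β1 + β2), given earlier in this section), the bias (latent PTE minus PE(FGS)) is

$$\frac{\beta_2\left(\sqrt{1+\beta_1^2\sigma_s^2}-1\right)}{\beta_2+\alpha_1\beta_1},$$

which is ≥ 0 whenever β2 and α1β1 share the same sign, suggesting PE(FGS) almost always underestimates the true effect. It is unbiased only when β2 = 0 or β1 = 0. When the logit model is desired for Model 2, the approximation

$$\frac{\exp(y)}{1+\exp(y)}\approx\Phi\left(\frac{\sqrt{3}}{\pi}\,y\right)$$

can be used to estimate PTE, and the same method is applicable.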
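The bias expression derived above can be checked with a few lines of arithmetic. The sketch below computes the implied marginal coefficient γ1, the resulting PE(FGS), and the latent PTE for one hypothetical parameter set, and compares their difference with the closed-form bias. The function names and values are illustrative only.

```python
import math


def implied_gamma1(alpha1, beta1, beta2, sigma_s):
    """Treatment coefficient of the marginal probit model (Model 3)."""
    return (beta2 + alpha1 * beta1) / math.sqrt(1.0 + beta1**2 * sigma_s**2)


def pe_fgs_probit(alpha1, beta1, beta2, sigma_s):
    """PE(FGS) = 1 - beta2 / gamma1 under the probit setting."""
    return 1.0 - beta2 / implied_gamma1(alpha1, beta1, beta2, sigma_s)


def latent_pte(alpha1, beta1, beta2):
    """Latent PTE = alpha1*beta1 / (alpha1*beta1 + beta2)."""
    return alpha1 * beta1 / (alpha1 * beta1 + beta2)


def bias_formula(alpha1, beta1, beta2, sigma_s):
    """Closed-form value of latent PTE minus PE(FGS)."""
    root = math.sqrt(1.0 + beta1**2 * sigma_s**2)
    return beta2 * (root - 1.0) / (beta2 + alpha1 * beta1)


# Hypothetical parameter values, for illustration only.
args = dict(alpha1=-1.0, beta1=0.5, beta2=-0.25, sigma_s=1.0)
direct = latent_pte(args["alpha1"], args["beta1"], args["beta2"]) - pe_fgs_probit(**args)
closed = bias_formula(**args)
```

With these values the latent PTE is 2/3 while PE(FGS) is about 0.63, so the difference is small here. It grows with β1σs and with the share of the effect carried by β2.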

Our simulation shows that FM is a more accurate and less variable measure than PE(FGS). Its point estimate is bounded within [0, 1]. It is unavoidable that the confidence interval of F (F’) can extend outside the interval [0, 1] when the true PTE is close to 0 or 1; otherwise, its 95% CI lies mostly within [0, 1]. The precision, variability, and coverage probability are much improved relative to the PE(FGS) estimate.

5. An Example From an Ophthalmology Study

We use an example from published literature (Buyse et al. 1998; Wang and Taylor 2002). This is a randomized clinical trial study investigating the effect of interferon-α in 190 patients with age-related macular degeneration (ARMD). The surrogate S and clinical end point T are defined as S = change of visual acuity at six months (lines loss of vision in six months)

$$T=\begin{cases}0 & \text{if the patient lost fewer than three lines of vision at one year compared to baseline,}\\ 1 & \text{if the patient lost three or more lines of vision at one year compared to baseline.}\end{cases}$$

A Box–Cox plot suggests that five extreme data points (≥8 or ≤−6) could be outliers. The surrogate variable is reasonably normally distributed within each treatment group. A two-group t-test on the surrogate was not significant for all 190 patients, but became significant after excluding the five patients with extreme values. Whether these five patients are outliers is arguable. For the purpose of illustrating Prentice’s validation criteria, their data are excluded from the analysis. Three of the criteria are met: treatment is significant for clinical endpoint T, the surrogate is also significant for clinical endpoint T, and there is no interaction between treatment and surrogate. The criterion that the treatment effect on the clinical endpoint is captured by the surrogate is assessed by the PTE estimation.

Point estimates of PTE, F, and F’, together with bootstrap estimates and their confidence intervals (CIs), are reported in Table 5. The CIs are constructed by bootstrap with 1000 replicates. The latent PTE, calculated from the continuous clinical endpoint (vision at one year), is 78%. Our median estimates of F and F’ are similar (62% and 64.5%). They are also similar to the estimates reported by Wang and Taylor (2002) (69% and 61.9%), where the surrogate is treated as binary data. The CIs of F, F’, and FM are clearly narrower than that of PE(FGS), and they are about 30% narrower than those reported by Wang and Taylor (2002). The point estimates of F and F’ are all within (0, 1), and the lower bounds of the 95% CIs are above 0, suggesting that change of visual acuity at six months is a potential surrogate for the clinical endpoint (loss of three or more lines of vision at one year).

Table 5
Estimates and 95% confidence interval of F, F’, and PE(FGS) for ARMD data

Figure 1 presents the effect of treatment in the placebo group (AA curve, solid line) and in the interferon-α group (BB curve, long dashed line). The BB curve lies below the AA curve, suggesting that interferon-α treatment may be effective in preventing a loss of three or more lines of vision in ARMD patients by the end of one year. The treatment effect is represented by the inner area between the AA and BB curves. To demonstrate how much of the treatment effect could be explained by the surrogate measure, the AB (short dashed line) and BA (dotted dash line) curves were plotted together with the AA and BB curves. The area between the AA and AB curves is about 2/3 (F = 0.628) of the (AA − BB) area, suggesting that a substantial amount of the treatment effect could be explained by the surrogate measure. Similarly, the F’ (= 0.647) measure is represented by the area (BA − BB) as a fraction of the area (AA − BB).

6. Conclusion and Discussion

This article provides a useful and more accurate estimate of the proportion of treatment effect explained by a surrogate when the clinical endpoint is modeled by logistic or probit regression. The estimate can be calculated easily with any existing statistical computation package that is capable of binary regression modeling, such as SAS, SPSS, or S-Plus. The measure can be presented graphically as a ratio of areas between ordinal dominance curves.

The derived measures F(F’) have much smaller bias and smaller variability than the commonly used PE(FGS). The expected value of F(F’) lies inside [0, 1] under a few mild conditions, while PE(FGS) often lies outside [0, 1]. Unlike PE(FGS), the measures F(F’) do not require that the fitted marginal model [T |Z] and the full model [T |S, Z] be of the same type. Under a logistic model for [T |Z, S], the marginal model [T |Z] is no longer logistic, and it is not easily obtained. F(F’) can still be estimated easily by a proper transformation of the logit to the probit function, and it is shown to be more reliable and efficient. Under a probit model for [T |S, Z], the marginal model [T |Z] is a valid probit model as well. Even so, the PE(FGS) estimate is shown to be unreliable and to seriously underestimate the true PTE, by as much as 30% or more.

Although both F and F’ are, in theory, intended to quantify the proportion of the treatment effect explained by a surrogate marker, the two can differ slightly in a nonnormal setting. FM, the average of the two, is less biased and less variable than F and F’, and is much better than PE(FGS). We recommend the use of FM. Many other alternative measures have been proposed to assess the proportion of the treatment effect that is explained by a surrogate marker or mediator. A closely related proposal is the relative indirect effect (RIND) measure proposed by Huang et al. (2004). The RIND measure is obtained by first estimating an imaginary quantity, that is, the expected potential outcome given a specific treatment had the effect of treatment not acted through the intermediate surrogate endpoint. When g(.) is taken as the expected mean of the conditional distribution of [T |S, Z], this expected potential outcome is equivalent to the AB measure in our definition, and RIND is equivalent to the F measure of Wang and Taylor (2002). There are slight differences between the RIND and F(F’) measures. First, RIND is defined by setting g(.) as the expected conditional mean; second, unlike the F(F’) measure, the RIND measure does not require the treatment to be binary; third, RIND was defined based on a system of two generalized regression equations, and thus generally requires linearity assumptions between the treatment Z and the outcome T. However, such assumptions could be easily relaxed. Both the F(F’) and RIND measures were proposed for a single trial only. Although it is accepted practice to evaluate surrogacy from multiple trials (Buyse et al. 2000), validation based on a single trial can provide preliminary evidence in the early study of a surrogate or biomarker. Surrogate validation using F(F’) can be valid in a large single trial. 
Recently, a new measure was proposed based on information theory which can be applied to a wide variety of settings (Alonso and Molenberghs 2007). The information-theoretic based measure inherits a difficulty in providing a hard cut-off value for discriminating good or bad surrogates.

In this article, we have considered only a single surrogate endpoint in the clinical trial. In many situations, there could be multiple biomarkers on the pathway to the clinical endpoint, where the combination of biomarkers may serve as a surrogate endpoint. One may follow the two-stage approach, by first finding a best linear combination of these markers with the available statistical methods (Pepe 2003), then applying the method presented in this article to validate the surrogate. If multiple surrogate markers assume a joint distribution, F(F’) can be derived under multiple integration.

Acknowledgments

The authors thank the Pharmacological Therapy for Macular Degeneration Study Group, F. Hoffmann-La Roche, Dr. Marc Buyse, and Dr. Geert Molenberghs for providing the data and permission to use them. This work was partially supported by NIDA grant R01DA019965-01A1 awarded to Dr. Bin Huang.

Contributor Information

Jie Huang, Novartis Pharmaceuticals, Oncology Business Unit, East Hanover, NJ 07936 (jie.huang@novartis.com).

Bin Huang, Center for Epidemiology and Biostatistics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229 (bin.huang@cchmc.org).

REFERENCES

  • Alonso A, Molenberghs G. Surrogate Marker Evaluation from an Information Theory Perspective. Biometrics. 2007;63:180–186. [PubMed]
  • Bamber D. The Area Above the Ordinal Dominance Graph and the Area Below the Receiver Operating Characteristic Graph. Journal of Mathematical Psychology. 1975;12:387–415.
  • Biomarkers Definitions Working Group. Biomarkers and Surrogate Endpoints: Preferred Definitions and Conceptual Framework. Clinical Pharmacology and Therapeutics. 2001;69(3):89–95. [PubMed]
  • Buyse M, Molenberghs G. Criteria for the Validation of Surrogate Endpoint in Randomized Experiments. Biometrics. 1998;54:1014–1029. [PubMed]
  • Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The Validation of Surrogate Endpoints in Meta-analyses of Randomized Experiments. Biostatistics. 2000;1:49–67. [PubMed]
  • Bycott PW, Taylor J. An Evaluation of a Measure of the Proportion of the Treatment Effect Explained by a Surrogate Marker. Controlled Clinical Trials. 1998;19:555–568. [PubMed]
  • DeGruttola V, Fleming T, Lin DY, Coombs R. Perspective: Validating Surrogate Markers—Are We Being Naive? Journal of Infectious Diseases. 1997;175:237–246. [PubMed]
  • DeGruttola V, Clax PC, Demets DL, Downing GJ, Ellenberg SS, Freedman L, Gail MH, Prentice R, Wittes J, Zeger SL. Considerations in the Evaluation of Surrogate Endpoints in Clinical Trials: Summary of a National Institutes of Health Workshop. Controlled Clinical Trials. 2001;22:485–502. [PubMed]
  • Efron B, Tibshirani R. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science. 1986;1:54–77.
  • EMEA/CHMP report. Innovative Drug Development Approaches: Report From the EMEA / CHMP-Think-Tank Group on Innovative Drug Development. [March 2007]. 2007. http://www.EMEA.EUROPA.EU/PDFS/HUMAN/ITF/12731807EN.PDF.
  • Freedman LS. Confidence Intervals and Statistical Power of the ‘Validation’ Ratio for Surrogate or Intermediate Endpoint. Journal of Statistical Planning and Inference. 2001;96:143–153.
  • Freedman LS, Graubard BI, Schatzkin A. Statistical Validation of Intermediate Endpoints for Chronic Disease. Statistics in Medicine. 1992;11:167–178. [PubMed]
  • Huang B, Sivaganesan S, Succop P, Goodman E. Statistical Assessment of Mediational Effects for Logistic Mediational Methods. Statistics in Medicine. 2004;23:2713–2728. [PubMed]
  • Lin DY, Fleming TR, DeGruttola V. Estimating the Proportion of Treatment Effect Explained by a Surrogate Marker. Statistics in Medicine. 1997;16:1515–1527. [PubMed]
  • NIAID Workshop. Statistical Issues for HIV Surrogate Endpoints: Point/Counterpoint. Statistics in Medicine. 1998;17:2435–2462. [PubMed]
  • Pepe MS. The Evaluation of Medical Tests for Classification and Predictions. Oxford University Press; 2003. (Oxford Statistical Science Series).
  • Prentice R. Surrogate Endpoints in Clinical Trials: Definition and Operational Criteria. Statistics in Medicine. 1989;8:431–440. [PubMed]
  • Temple RJ. A Regulatory Authority’s Opinion About Surrogate Endpoints. In: Nimmo WS, Tucker GT, editors. Clinical Measurement in Drug Evaluation. Wiley; New York: 1995. pp. 3–22.
  • Tsiatis AA, DeGruttola V, Wulfsohn MS. Modeling the Relationship of Survival to Longitudinal Data Measured With Error: Applications to Survival and CD4 Counts in Patients with AIDS. Journal of the American Statistical Association. 1995;90:27–37.
  • Wang Y, Taylor J. A Measure of the Proportion of Treatment Effect Explained by a Surrogate Marker. Biometrics. 2002;58:803–812. [PubMed]
  • Schwarz JK, Siegel BA, Dehdashti F, Grigsby PW. Association of Posttherapy Positron Emission Tomography With Tumor Response and Survival in Cervical Carcinoma. Journal of the American Medical Association. 2007;298(19):2289–2295. [PubMed]