In this article, we study the problem of surrogate marker evaluation based on the same set of assumptions A1–A4 made by GH.

The major focus of GH is the one-marker setting. While the estimation methods proposed in their paper can be readily extended to accommodate more than one marker, their summary measures of surrogate value (i.e., PAE and AS) are essentially defined in terms of a single marker value. As the number of markers in the model increases, it becomes increasingly difficult to quantify an individual marker’s contribution. Instead, it becomes essential to characterize the joint effect of multiple markers using a metric that can be compared between risk models. This is what we set out to achieve here using a new graphical tool and its summary measure. We proposed a graphical tool for characterizing the distribution of the risk difference between randomized treatment arms as a function of marker values, and used this tool to put different risk models on the same scale for comparison with respect to their principal surrogate value. In particular, we proposed a clinically meaningful summary measure, the standardized total gain, derived from the risk difference distribution as a basis for inference. This summary measure is appealing because it characterizes the capacity of the model to classify subjects into treatment-effective and treatment-ineffective categories. Its limitation is that it is well defined only for the arithmetic difference between *risk*_{(0)} and *risk*_{(1)}. Depending on the scientific question, the treatment effect may be represented by a different type of contrast (e.g., the risk ratio (GH)), in which case alternative summary measures may be preferred.
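To make the summary measure concrete, the following Python sketch computes a total gain and a standardized version from per-subject risks. It is an illustration under stated assumptions, not the estimator developed in this article: the function names and inputs are hypothetical, the total gain is taken to be the mean absolute deviation of the risk difference from its mean, and the standardization divides by the largest value attainable when the risk difference lies in [0, 1]. The exact definition used in the main text may differ.

```python
import numpy as np


def risk_difference_curve(risk0, risk1):
    """Return (v, delta_sorted): the empirical quantile curve of the risk difference.

    risk0 and risk1 are hypothetical per-subject model-based risks under
    placebo and active treatment, evaluated at each subject's marker values.
    """
    delta = np.sort(np.asarray(risk0, dtype=float) - np.asarray(risk1, dtype=float))
    v = np.arange(1, delta.size + 1) / delta.size
    return v, delta


def standardized_total_gain(risk0, risk1):
    """Return (total_gain, standardized_total_gain) for per-subject risks."""
    delta = np.asarray(risk0, dtype=float) - np.asarray(risk1, dtype=float)
    mean_delta = delta.mean()
    # Total gain: area between the risk-difference curve and its mean,
    # which equals the mean absolute deviation of delta.
    total_gain = np.abs(delta - mean_delta).mean()
    # ASSUMPTION: standardize by the maximum attainable when delta is in [0, 1]
    # with the same mean, reached when delta takes only the values 0 and 1.
    max_gain = 2.0 * mean_delta * (1.0 - mean_delta)
    stg = total_gain / max_gain if max_gain > 0 else float("nan")
    return total_gain, stg
```

Under this definition, a model whose risk difference is the same for every subject has a standardized total gain of 0, and a model that separates subjects perfectly into those with risk difference 1 and those with risk difference 0 has a value of 1.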

The graphical tool can be applied to multiple markers because it focuses on the distribution of the risk difference rather than the distribution of the marker, and it can be used to guide vaccine/treatment development. For example, identifying the region of marker values with large risk differences may provide a lead for refinement of the vaccine or treatment. In practice, the predictiveness curve and its summary measure can be used to compare markers or to evaluate the incremental value of a new marker. Then, for a chosen model, we can further explore how the risk difference depends on individual marker values.

Including more markers increases model complexity and poses a challenge to estimation. Here we are interested in continuous markers and continuous or discrete baseline covariates. The existing nonparametric method, which discretizes continuous variables, performs unsatisfactorily when the marker’s performance is evaluated conditional on covariates. The fully parametric method is also unappealing because it relies on an assumption about the joint distribution of the baseline covariates and the markers. Here we developed a semiparametric approach for estimation, in which an easy-to-implement EM algorithm is employed to maximize the estimated likelihood. The method works either in a standard randomized trial or when a close-out placebo vaccination (CPV) component is added to help identify and estimate *risk*_{(0)}. In addition to developing the standardized total gain and using the close-out design, this work extends GH by providing a method for evaluating and comparing the surrogate value of multiple biomarkers, and by providing a more robust estimation method that naturally handles continuous biomarkers and continuous or discrete covariates. The method accommodates two-phase sampling designs, which are commonly used in clinical trials. On the other hand, the semiparametric estimator based on the EM algorithm does take more computation time than the parametric method in GH. Moreover, while GH explicitly allows the continuous marker to be subject to left-censoring, our new work does not address this issue, which is a topic of current research.

Under the baseline predictor strategy utilized in GH, with multiple biomarkers we need to be able to predict each of the biomarkers fairly well, and the more biomarkers there are, the greater the challenge in accomplishing this. The CPV strategy is particularly attractive as the number of biomarkers increases, because its effectiveness in predicting the biomarkers does not decline with the number of biomarkers. By extending from the one-biomarker setting to the setting of two or more biomarkers, we also face all of the challenges of model selection in ordinary regression modeling, such as collinearity. In practice, we can consider different approaches to handling collinearity, such as selecting biomarkers that measure different biological functions, or reducing the dimension of the markers using techniques such as principal components analysis.
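As an illustration of the dimension-reduction option, the sketch below projects a matrix of collinear markers onto its leading principal components using a singular value decomposition. The function name and inputs are hypothetical. The code only centers the markers, so whether to also scale them to unit variance is left to the analyst. It is not part of the estimation method developed in this article.

```python
import numpy as np


def principal_component_scores(markers, n_components=1):
    """Return (scores, explained) for the leading principal components.

    markers: array of shape (n_subjects, n_markers), a hypothetical input.
    scores: array of shape (n_subjects, n_components).
    explained: fraction of total variance carried by each retained component.
    """
    X = np.asarray(markers, dtype=float)
    Xc = X - X.mean(axis=0)  # center each marker
    # Rows of Vt are the principal directions, ordered by singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]
```

When two markers are nearly collinear, the first component carries almost all of their joint variance, and its score could then stand in for the pair as a single composite marker in the risk model.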

In practice it is important to check the validity of the parametric structural models for *risk*_{(1)} and *risk*_{(0)} specified by A4. In a standard trial design, while it is straightforward to test goodness-of-fit of models for *risk*_{(1)}, models for *risk*_{(0)} cannot be tested. Fortunately the CPV design provides a way, based on equation (5), from which it is apparent that *P*{*S*(1)|*Y*(0) = 0, *W*} is identified from the CPV sample, *P*{*Y*(0) = 1|*W*} is identified from placebo subjects, and *P*{*S*(1)|*W*} is identified from active treatment subjects. Therefore a goodness-of-fit test can be constructed based on the difference between *risk*_{(0)} obtained under A4 and that obtained based on (5).
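For orientation, the identification argument can be written out with Bayes' rule. The display below is a reconstruction from the three quantities named in this paragraph, not a quotation of equation (5). The arguments s_1 and w are notation assumed here, and for a continuous marker P{S(1) = s_1 | ·} is read as a conditional density.

```latex
\[
  \mathit{risk}_{(0)}(s_1, w)
  \equiv P\{Y(0) = 1 \mid S(1) = s_1, W = w\}
  = 1 - \frac{P\{S(1) = s_1 \mid Y(0) = 0, W = w\}\,
              \bigl[\,1 - P\{Y(0) = 1 \mid W = w\}\,\bigr]}
             {P\{S(1) = s_1 \mid W = w\}}
\]
```

Each factor on the right-hand side corresponds to one of the three data sources: the CPV sample, the placebo arm, and the active treatment arm.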

The meaningful interpretation of the standardized total gain (STG) as a measure of classification accuracy relies on an extra assumption (A5, monotonicity), even though our method for its estimation does not require this assumption. Thus again we recommend the STG for settings where monotonicity is plausible, for example placebo-controlled trials where there is a significant overall beneficial treatment effect. We leave it to other work to explore relaxation of this assumption, which would be important for trials of two active treatments.

We developed our method for the scenario of Constant Biomarkers (CB). However, for placebo-controlled trials the method can also be applied to the general case in which *S*(0) varies. For example, in an influenza vaccine trial in which the biomarker(s) are immune response(s) to influenza targets, *S*(0) will vary due to prior influenza illnesses. If interest lies in the risk conditional on both *S*(1) and *S*(0), we can enhance the study design by measuring the anti-influenza immune response(s) at baseline for subjects assigned to active treatment, which can substitute for *S*(0). Then vaccine arm subjects have data on both potential biomarkers (*S*(1), *S*(0)), which allows direct application of our semiparametric method. The semiparametric location-scale model may be employed to estimate the distribution of *S*(1) conditional on *W* and *S*(0).

Finally, with various summary measures of surrogate value developed in the literature, an important objective is to evaluate the comparative performance of the summary measures in terms of discrimination, predictiveness, etc. It does not appear possible to address these questions directly based on a single trial, as what is needed is meta-analysis of multiple trials, or at least meta-analysis of sub-sets of one very large trial. Meta-analysis would allow assessing, across study units, the correlation of treatment effects on the biomarker(s) with treatment effects on the clinical endpoint (for example such methods are developed and discussed in Daniels and Hughes (1997); Molenberghs et al. (2002, 2008)). If a summary measure is a good predictor of the level of clinical treatment efficacy, then trials with high surrogate value (according to the measure) will have a tight correlation, and trials with low surrogate value according to the measure will have low correlation. This kind of assessment could be formalized into a metric for comparing the predictiveness of different summary measures.
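A very simple version of the trial-level correlation described here can be sketched as follows. The function and its inputs are hypothetical, and the sketch uses a plain Pearson correlation of per-unit effect estimates. It ignores the within-unit estimation error that a formal meta-analytic model would need to account for, so it should be read as an illustration of the idea rather than a proposed metric.

```python
import numpy as np


def trial_level_correlation(marker_effects, clinical_effects):
    """Pearson correlation, across study units, of two sets of treatment effects.

    marker_effects: estimated treatment effect on the biomarker per study unit.
    clinical_effects: estimated treatment effect on the clinical endpoint per unit.
    Both are hypothetical inputs, one entry per trial or sub-set of a trial.
    """
    m = np.asarray(marker_effects, dtype=float)
    c = np.asarray(clinical_effects, dtype=float)
    if m.ndim != 1 or m.shape != c.shape:
        raise ValueError("inputs must be one-dimensional and of equal length")
    if m.size < 3:
        raise ValueError("at least three study units are needed")
    return float(np.corrcoef(m, c)[0, 1])
```

Computing this correlation separately within groups of study units that have high and low values of a given summary measure would give one rough way to compare how well different summary measures track clinical treatment efficacy.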