
Biometrics. Author manuscript; available in PMC 2010 September 1.


Published online 2008 September 29. doi: 10.1111/j.1541-0420.2008.01125.x

PMCID: PMC2921328

NIHMSID: NIHMS102410

The publisher's final edited version of this article is available at Biometrics


**SUMMARY**

Suppose that we are interested in using new bio- or clinical markers, in addition to conventional markers, to improve prediction or diagnosis of a patient’s clinical outcome. The incremental value of the new markers is typically assessed by averaging across patients in the entire study population. However, when measuring the new markers is costly or invasive, an overall improvement does not justify measuring them in all patients. A more practical strategy is to use the patient’s conventional markers to decide whether the new markers are needed for improving prediction of his/her health outcomes. In this article, we propose inference procedures for the incremental values of new markers across various subgroups of patients classified by the conventional markers. The resulting point and interval estimates can be quite useful for medical decision makers seeking to balance the predictive or diagnostic value of new markers against their associated cost and risk. Our proposals are theoretically justified and illustrated empirically with two real examples.

**1. Introduction**

Biological and technological advances continually generate promising new clinical markers and biomarkers with the potential to improve medical care by providing more accurate, personalized predictions of health outcomes and diagnoses of clinical phenotypes. However, extensive use of new markers may provide only negligible improvements in prediction or diagnosis while subjecting patients to additional risks and costs. It is therefore important to develop statistical methods that can quantify, for individual patients, the value of new markers over conventional ones, especially when measuring these markers is costly or invasive. As an example from a recent study, Wang et al. (2006) examined extensively the incremental values of ten new biomarkers in predicting first major cardiovascular events or death in a Framingham cohort. There were 3209 participants in the study. They were followed for a median of 7.4 years, during which 207 participants died and 169 had a first major cardiovascular event. Based on various prediction precision criteria, the study investigators found that these ten contemporary biomarkers added only moderate *overall* predictive value to the classical risk factors, including gender, age, total cholesterol, HDL cholesterol, blood pressure, smoking status, and diabetes mellitus. In contrast, other investigators studying different populations with different prediction precision measures demonstrated that certain biomarkers, for example the high-sensitivity C-reactive protein, provide clinically useful prognostic information on top of the traditional Framingham risk score for heart disease (Ridker et al., 2002, 2007; Blumenthal et al., 2007).

Despite these often controversial findings in the literature, clinical practitioners would generally not change their recommendation for a patient’s care on the basis of the extra marker information if the patient, for example, has either a *high* or a *low* conventional risk score. Therefore, a practically important question is how to *systematically* identify patients who would benefit from the additional markers, instead of evaluating these markers based only on their average incremental value across the entire population (D’Agostino, 2006). In this article, we propose procedures to estimate the incremental values of new markers for diagnosis or prediction in various subgroups of patients classified by conventional markers. These estimates, coupled with their sampling variations, provide a useful tool for researchers and practitioners to decide when, after observing the conventional risk factors, the new markers are needed. In Section 2, we describe the new procedure in detail and provide theoretical justification. In Section 3, we illustrate our methods with two examples, one with a continuous response and the other with a binary outcome.

There are quite a few procedures in the literature for evaluating the overall incremental value of new markers for an entire population of interest. For example, Pepe et al. (2004) compared the ROC curves among models with and without an additional marker. Recently, Tian et al. (2007) and Uno et al. (2007) proposed robust inference procedures for evaluating prediction rules. Prediction or diagnostic precision measures, which may be used for comparing different prediction procedures, have also been proposed and utilized, for example, by Brier (1950), Breiman et al. (1984), Spiegelhalter (1986), Korn and Simon (1990), McLachlan (1992), Mittlböck and Schemper (1996), Ripley (1996), Zhou et al. (2002), Pepe (2003) and Pencina et al. (2008).

**2. Estimating Subject-Specific Prediction Error Based on Risk Score Constructed from Conventional Markers**

Let *Y* be a continuous or binary response variable, *U* be the set of its conventional marker values, and *V* be the corresponding counterpart from the new markers. Our data consist of *n* independent copies {(*Y_{i}*, *U_{i}*, *V_{i}*), *i* = 1, …, *n*} of (*Y*, *U*, *V*). When the dimension of (*U*, *V*) is not small, fully nonparametric estimation of the conditional mean of *Y* is hampered by the curse of dimensionality.

A practically feasible alternative to handle this problem is to consider a parametric or semi-parametric approach. To this end, let *X* be a *p*-dimensional vector, a function of *U*. Assume that the conditional mean of *Y* given *U* can be approximated by the following *working* model

$$E(Y\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}U)={g}_{1}(\beta \prime X),$$

(2.1)

where *g*_{1}(·) is a smooth, strictly increasing, known function and β is an unknown vector of parameters. Note that the first component of *X* is one. In this article, we deal with the interesting and challenging case that β′*X* is a continuous variable.

To estimate the regression parameters for model (2.1) which, most likely, is an approximation to the true conditional mean of *Y* given *U*, one may use the estimator based on the simple estimating function

$${S}_{1}(\beta )={\displaystyle \sum _{i=1}^{n}{X}_{i}\phantom{\rule{thinmathspace}{0ex}}\{{Y}_{i}-{g}_{1}(\beta \prime {X}_{i})\},}$$

(2.2)

where {(*Y_{i}*, *X_{i}*), *i* = 1, …, *n*} are the observed data. Let $\hat{\beta}$ denote the resulting estimator, the root of *S*_{1}(β) = 0. Even when model (2.1) is misspecified, $\hat{\beta}$ converges to a finite constant vector β_{0} as *n* increases.
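As a concrete numerical sketch (ours, not the authors’ software), the root of the estimating function (2.2) can be found by Newton iteration; the logistic link and simulated data below are illustrative assumptions.

```python
# Sketch: solve S1(beta) = sum_i X_i {Y_i - g1(beta' X_i)} = 0 by Newton
# iteration.  The logistic link g1 and the simulated data are only for
# illustration; any smooth, strictly increasing g1 works the same way.
import numpy as np

def solve_estimating_eq(X, Y, g, g_dot, n_iter=50, tol=1e-10):
    """Newton iterations for S(beta) = X'(Y - g(X beta)) = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        score = X.T @ (Y - g(eta))              # S(beta)
        info = X.T @ (g_dot(eta)[:, None] * X)  # minus the Jacobian of S
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy data with an intercept column (first component of X is one)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
expit = lambda s: 1.0 / (1.0 + np.exp(-s))
Y = rng.binomial(1, expit(0.3 + 0.8 * X[:, 1])).astype(float)
beta_hat = solve_estimating_eq(X, Y, expit,
                               lambda s: expit(s) * (1.0 - expit(s)))
```

At convergence the estimating function is (numerically) zero, which is exactly the defining property of $\hat{\beta}$ used throughout the section.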

Now, consider a future independent subject with (*Y*, *X*) = (*Y*^{0}, *X*^{0}). For a given β in (2.1), let *Ŷ*_{1}(β′*X*^{0}) be the predictor for *Y*^{0}. For example, when *Y*^{0} is continuous, one may let *Ŷ*_{1}(β′*X*^{0}) = *g*_{1}(β′*X*^{0}), and when *Y*^{0} is binary, one may predict *Y*^{0} by a binary variable *Ŷ*_{1}(β′*X*^{0}) = *I*{*g*_{1}(β′*X*^{0}) ≥ 0.5}, where *I*(·) is the indicator function. Other prediction rules for the binary case will be discussed in the Example Section. To evaluate the performance of ${\hat{Y}}_{1}(\hat{\beta}\prime {X}^{0})$, we first need to quantify its prediction accuracy based on a “distance” between the true *Y*^{0} and the predicted ${\hat{Y}}_{1}(\hat{\beta}\prime {X}^{0})$, denoted by $D\{{Y}^{0},{\hat{Y}}_{1}(\hat{\beta}\prime {X}^{0})\}$. For example, a simple, physically interpretable choice is the absolute prediction error with *D*(*a*, *b*) = |*a* − *b*|. When *Y* is binary, this distance function is simply the overall mis-classification rate. Choosing an appropriate distance measure is a crucial yet often difficult step in evaluating the incremental value for the cost-benefit decision; such a quantification should have a practical/clinical interpretation. We discuss this issue in detail with real examples in Section 3 and also in the Remarks Section.

Since clinical practitioners almost always group subjects with a “risk scoring system” for medical decision making, we evaluate *Ŷ*_{1}(·) via an average prediction error over the set of *X*’s with “similar” values of ${g}_{1}(\hat{\beta}\prime X)$. To be specific, let *J_{z}* = (*z* − *d*_{1}, *z* + *d*_{2}] be an interval around *z*, where *d*_{1} and *d*_{2} are given nonnegative constants, and define

$${\mathcal{D}}_{1}(z)=E\phantom{\rule{thinmathspace}{0ex}}[D\{{Y}^{0},{\hat{Y}}_{1}({\beta}_{0}^{\prime}{X}^{0})\}|\phantom{\rule{thinmathspace}{0ex}}{g}_{1}({\beta}_{0}^{\prime}{X}^{0})\in {J}_{z}].$$

(2.3)

As a process in *z*, this moving average $\{{\mathcal{D}}_{1}(z)\}$ provides a performance profile of *Ŷ*_{1}(·) over all possible values of ${g}_{1}({\beta}_{0}^{\prime}X)$. The proper choice of the interval *J_{z}* is made on a case-by-case basis and is illustrated with two examples in the next section.

Now, let *W* be a *q* × 1 vector, a function of *U* and *V*. Assume that a working model for the conditional mean of *Y* given *U* and *V* is

$$\mathrm{E}(Y\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}U,V)={g}_{2}(\theta \prime W),$$

(2.4)

where *g*_{2}(·) is a smooth, strictly increasing, known function and θ is an unknown vector of parameters. The first component of *W* is one. Again, we assume that *g*_{2}(θ′*W*) is a continuous variable. Let $\hat{\theta}$ be the estimator for θ obtained from the following simple estimating function

$${S}_{2}(\theta )={\displaystyle \sum _{i=1}^{n}{W}_{i}\phantom{\rule{thinmathspace}{0ex}}\{{Y}_{i}-{g}_{2}(\theta \prime {W}_{i})\},}$$

(2.5)

where *W_{i}* is the value of *W* for the *i*th subject. As with $\hat{\beta}$, the estimator $\hat{\theta}$ converges to a constant vector θ_{0} as *n* increases, even if model (2.4) is misspecified. For the future subject, let *Ŷ*_{2}(θ′*W*^{0}) be the predictor of *Y*^{0} based on model (2.4), defined analogously to *Ŷ*_{1}(·), and define the corresponding average prediction error

$${\mathcal{D}}_{2}(z)=E\phantom{\rule{thinmathspace}{0ex}}\left[D\{{Y}^{0},{\hat{Y}}_{2}({\theta}_{0}^{\prime}{W}^{0})\}\right|\text{\hspace{1em}}{g}_{1}({\beta}_{0}^{\prime}{X}^{0})\in {J}_{z}],$$

(2.6)

where the expectation is taken with respect to (*Y*^{0}, *X*^{0}, *W*^{0}). Then, as processes in *z*, ${\mathcal{D}}_{1}(z)$, ${\mathcal{D}}_{2}(z)$ and

$${\mathrm{\Delta}}_{0}(z)={\mathcal{D}}_{1}(z)-{\mathcal{D}}_{2}(z)$$

(2.7)

provide a global picture for identifying subgroups of patients who would benefit from the additional markers.

To estimate ${\mathcal{D}}_{1}(z)$ and ${\mathcal{D}}_{2}(z)$, one may use ${\hat{\mathcal{D}}}_{1}(z)={\tilde{\mathcal{D}}}_{1}(z,\hat{\beta})$ and ${\hat{\mathcal{D}}}_{2}(z)={\tilde{\mathcal{D}}}_{2}(z,\hat{\beta},\hat{\theta})$, respectively, where

$${\tilde{\mathcal{D}}}_{1}(z,\beta )=\frac{{\displaystyle {\sum}_{i=1}^{n}D\{{Y}_{i},{\hat{Y}}_{1}(\beta \prime {X}_{i})\}I\{{g}_{1}(\beta \prime {X}_{i})\in {J}_{z}\}}}{{\displaystyle {\sum}_{i=1}^{n}I\{{g}_{1}(\beta \prime {X}_{i})\in {J}_{z}\}}}$$

(2.8)

and

$${\tilde{\mathcal{D}}}_{2}(z,\beta ,\theta )=\frac{{\displaystyle {\sum}_{i=1}^{n}D\{{Y}_{i},{\hat{Y}}_{2}(\theta \prime {W}_{i})\}I\{{g}_{1}(\beta \prime {X}_{i})\in {J}_{z}\}}}{{\displaystyle {\sum}_{i=1}^{n}I\{{g}_{1}(\beta \prime {X}_{i})\in {J}_{z}\}}}.$$

(2.9)

We then let $\hat{\mathrm{\Delta}}(z)={\hat{\mathcal{D}}}_{1}(z)-{\hat{\mathcal{D}}}_{2}(z)$ to estimate Δ_{0}(*z*). In Appendix A of the web-based supplementary material available at http://www.tibs.org/biometrics, we show that with the distance function *D*(*a*, *b*) = |*a* − *b*| or a function thereof, the above three estimators are uniformly consistent over an interval Ω consisting of all *z*’s whose intervals *J_{z}* are properly contained in the support of ${g}_{1}({\beta}_{0}^{\prime}X)$. Similar arguments may be used for cases with other distance functions.
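The plug-in estimators (2.8), (2.9) and their difference can be sketched in a few lines (our own code with illustrative names; we use a symmetric closed window, whereas *J_{z}* above may be asymmetric and half-open).

```python
# Sketch of (2.8)-(2.9): averages of absolute prediction errors over
# subjects whose conventional risk score falls in a window around z.
# A symmetric window [z - h, z + h] is an assumption for simplicity.
import numpy as np

def windowed_pred_error(Y, score, pred, z, half_width):
    """Mean of |Y_i - pred_i| over {i : |score_i - z| <= half_width}."""
    m = np.abs(score - z) <= half_width
    return np.mean(np.abs(Y[m] - pred[m]))

def delta_hat(Y, score, pred1, pred2, z, half_width):
    """Estimated incremental value: D1-hat(z) - D2-hat(z)."""
    return (windowed_pred_error(Y, score, pred1, z, half_width)
            - windowed_pred_error(Y, score, pred2, z, half_width))

# tiny worked example: model 2 predicts the first subject correctly,
# model 1 does not, so the estimated incremental value is positive
Y = np.array([1.0, 0.0])
score = np.array([0.5, 0.5])
d = delta_hat(Y, score, np.array([0.0, 0.0]), np.array([1.0, 0.0]),
              z=0.5, half_width=0.1)
```

Evaluating `delta_hat` on a grid of *z* values traces out the curve $\hat{\mathrm{\Delta}}(\cdot)$ discussed below.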

To make further inferences about the added value from the new markers for predicting the response, in Appendix A of the web-based supplementary material, we show that the limiting distributions of the processes ${\hat{\mathcal{W}}}_{1}(z)={n}^{1/2}\{{\hat{\mathcal{D}}}_{1}(z)-{\mathcal{D}}_{1}(z)\}$, ${\hat{\mathcal{W}}}_{2}(z)={n}^{1/2}\{{\hat{\mathcal{D}}}_{2}(z)-{\mathcal{D}}_{2}(z)\}$ and $\hat{\mathcal{W}}(z)={n}^{1/2}\{\hat{\mathrm{\Delta}}(z)-{\mathrm{\Delta}}_{0}(z)\}$ are the same as those of the Gaussian processes ${\mathcal{W}}_{1}^{*}(z),{\mathcal{W}}_{2}^{*}(z)\phantom{\rule{thinmathspace}{0ex}}\text{and}\phantom{\rule{thinmathspace}{0ex}}{\mathcal{W}}^{*}(z)$, respectively, for $z\in \mathrm{\Omega}$, where ${\mathcal{W}}^{*}(z)={\mathcal{W}}_{1}^{*}(z)-{\mathcal{W}}_{2}^{*}(z)$,

$$\begin{array}{c}{\mathcal{W}}_{1}^{*}(z)={n}^{\frac{1}{2}}\left\{\frac{{\displaystyle {\sum}_{i=1}^{n}[D\{{Y}_{i},{\hat{Y}}_{1}(\hat{\beta}\prime {X}_{i})\}-{\hat{\mathcal{D}}}_{1}(z)]I\{{g}_{1}(\hat{\beta}\prime {X}_{i})\in {J}_{z}\}{G}_{i}}}{{\displaystyle {\sum}_{i=1}^{n}I\{{g}_{1}(\hat{\beta}\prime {X}_{i})\in {J}_{z}\}}}+{\tilde{\mathcal{D}}}_{1}(z,{\hat{\beta}}^{*})-{\hat{\mathcal{D}}}_{1}(z)\right\},\hfill \\ {\mathcal{W}}_{2}^{*}(z)={n}^{\frac{1}{2}}\left\{\frac{{\displaystyle {\sum}_{i=1}^{n}[D\{{Y}_{i},{\hat{Y}}_{2}(\hat{\theta}\prime {W}_{i})\}-{\hat{\mathcal{D}}}_{2}(z)]I\{{g}_{1}(\hat{\beta}\prime {X}_{i})\in {J}_{z}\}{G}_{i}}}{{\displaystyle {\sum}_{i=1}^{n}I\{{g}_{1}(\hat{\beta}\prime {X}_{i})\in {J}_{z}\}}}+{\tilde{\mathcal{D}}}_{2}(z,{\hat{\beta}}^{*},{\hat{\theta}}^{*})-{\hat{\mathcal{D}}}_{2}(z)\right\},\hfill \\ \hfill {\hat{\beta}}^{*}=\hat{\beta}+{\left\{{\displaystyle \sum _{i=1}^{n}{\dot{g}}_{1}(\hat{\beta}\prime {X}_{i}){X}_{i}{X}_{i}^{\prime}}\right\}}^{-1}{\displaystyle \sum _{i=1}^{n}{X}_{i}\{{Y}_{i}-{g}_{1}(\hat{\beta}\prime {X}_{i})\}{G}_{i}},\hfill \\ \hfill {\hat{\theta}}^{*}=\hat{\theta}+{\left\{{\displaystyle \sum _{i=1}^{n}{\dot{g}}_{2}(\hat{\theta}\prime {W}_{i}){W}_{i}{W}_{i}^{\prime}}\right\}}^{-1}{\displaystyle \sum _{i=1}^{n}{W}_{i}\{{Y}_{i}-{g}_{2}(\hat{\theta}\prime {W}_{i})\}{G}_{i},}\hfill \end{array}$$

and {*G*_{1}, …, *G_{n}*} are independent standard normal random variables that are independent of the data. Realizations of the three Gaussian processes ${\mathcal{W}}_{1}^{*}(z),{\mathcal{W}}_{2}^{*}(z)\phantom{\rule{thinmathspace}{0ex}}\text{and}\phantom{\rule{thinmathspace}{0ex}}{\mathcal{W}}^{*}(z)$ given above can be generated easily. Let $\hat{\mathrm{\Omega}}$ be the counterpart of Ω with β_{0} replaced by $\hat{\beta}$. Then, for *l* = 1, 2 and any given $z\in \hat{\mathrm{\Omega}}$, a (1 − α), 0 < α < 1, point-wise confidence interval for ${\mathcal{D}}_{l}(z)$ is

$${\hat{\mathcal{D}}}_{l}(z)\pm {n}^{-\frac{1}{2}}{\xi}_{\alpha /2}{\sigma}_{{\mathcal{W}}_{l}^{*}(z)}.$$

(2.10)

Here, ${\sigma}_{{\mathcal{W}}_{l}^{*}(z)}^{2}$ is the variance of the random variable ${\mathcal{W}}_{l}^{*}(z)$ and ξ_{α} is the upper 100α-th percentile of the standard normal distribution. Furthermore, a (1 − α) simultaneous confidence band for $\{{\mathcal{D}}_{l}(z),\phantom{\rule{thinmathspace}{0ex}}z\in \hat{\mathrm{\Omega}}\}$ is

$${\hat{\mathcal{D}}}_{l}(z)\pm {n}^{-\frac{1}{2}}{\tau}_{l\alpha}{\sigma}_{{\mathcal{W}}_{l}^{*}(z)},$$

(2.11)

where

$$\text{pr}\phantom{\rule{thinmathspace}{0ex}}\{{\text{sup}}_{z\in \hat{\mathrm{\Omega}}}\phantom{\rule{thinmathspace}{0ex}}|{\mathcal{W}}_{l}^{*}(z)/{\sigma}_{{\mathcal{W}}_{l}^{*}(z)}|<{\tau}_{l\alpha}\}=1-\alpha .$$

It is important to note that, in contrast to standard subgroup analysis, our proposal handles the multiple-comparison problem via such simultaneous confidence interval estimates over the scoring system indexed by *z*.
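A simplified sketch of the resampling band may help fix ideas. It keeps only the multiplier term of ${\mathcal{W}}_{l}^{*}(z)$ and ignores the parameter-perturbation terms $\tilde{\mathcal{D}}(z,\hat{\beta}^{*},\hat{\theta}^{*})$ shown in the display above, so it is an approximation of the full procedure under our own illustrative setup.

```python
# Simplified perturbation-resampling band: approximate W*(z) by its
# multiplier term only (the full scheme also perturbs beta-hat and
# theta-hat), then calibrate tau by the sup of standardized draws.
import numpy as np

def simultaneous_band(err, score, z_grid, half_width,
                      n_resamp=500, alpha=0.05, rng=None):
    rng = rng if rng is not None else np.random.default_rng(1)
    n = len(err)
    masks = [np.abs(score - z) <= half_width for z in z_grid]
    D_hat = np.array([err[m].mean() for m in masks])       # D-hat(z)
    W = np.empty((n_resamp, len(z_grid)))
    for b in range(n_resamp):
        G = rng.standard_normal(n)                         # multipliers G_i
        for j, m in enumerate(masks):
            W[b, j] = np.sqrt(n) * np.sum((err[m] - D_hat[j]) * G[m]) / m.sum()
    sigma = W.std(axis=0)                                  # sd of W*(z)
    tau = np.quantile(np.max(np.abs(W / sigma), axis=1), 1 - alpha)
    half = tau * sigma / np.sqrt(n)
    return D_hat - half, D_hat + half

# illustrative data: score plays the role of g1(beta-hat' X)
rng = np.random.default_rng(3)
score = rng.uniform(0.0, 1.0, 400)
Y = rng.binomial(1, score).astype(float)
err = np.abs(Y - (score >= 0.5))                           # |Y - Yhat|
z_grid = np.linspace(0.2, 0.8, 13)
band_lo, band_hi = simultaneous_band(err, score, z_grid, half_width=0.1)
```

Taking the supremum over the grid before the quantile step is what widens the band relative to the point-wise interval, which is how the multiple comparisons are controlled.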

To construct interval estimators for Δ_{0}(*z*), it is important to note that $\hat{\mathrm{\Delta}}(z)$ has a degenerate limiting distribution when ${\hat{Y}}_{1}({\beta}_{0}^{\prime}X)={\hat{Y}}_{2}({\theta}_{0}^{\prime}W)$ for all ${g}_{1}({\beta}_{0}^{\prime}X)\in {J}_{z}$. Therefore, to obtain reasonable interval estimators in practice, we consider the set $\tilde{\mathrm{\Omega}}$ such that for $z\in \tilde{\mathrm{\Omega}}$, ${\sum}_{i=1}^{n}I\{{\hat{Y}}_{1}(\hat{\beta}\prime {X}_{i})\ne {\hat{Y}}_{2}(\hat{\theta}\prime {W}_{i}),\phantom{\rule{thinmathspace}{0ex}}{g}_{1}(\hat{\beta}\prime {X}_{i})\in {J}_{z}\}/{\sum}_{i=1}^{n}I\{{g}_{1}(\hat{\beta}\prime {X}_{i})\in {J}_{z}\}>{d}_{3}$, where *d*_{3} is a given positive number. Then, for $z\in \tilde{\mathrm{\Omega}}$, a (1 − α), 0 < α < 1, point-wise confidence interval for Δ_{0}(*z*) is

$$\hat{\mathrm{\Delta}}(z)\pm {n}^{-\frac{1}{2}}{\xi}_{\alpha /2}{\sigma}_{{\mathcal{W}}^{*}(z)}.$$

(2.12)

Here, ${\sigma}_{{\mathcal{W}}^{*}(z)}^{2}$ is the variance of the random variable ${\mathcal{W}}^{*}(z)$. Moreover, a (1 − α) simultaneous confidence band for $\{{\mathrm{\Delta}}_{0}(z),\phantom{\rule{thinmathspace}{0ex}}z\in \tilde{\mathrm{\Omega}}\}$ is

$$\hat{\mathrm{\Delta}}(z)\pm {n}^{-\frac{1}{2}}{\tau}_{\alpha}{\sigma}_{{\mathcal{W}}^{*}(z)},$$

(2.13)

where

$$\text{pr}\phantom{\rule{thinmathspace}{0ex}}\{{\text{sup}}_{z\in \tilde{\mathrm{\Omega}}}\phantom{\rule{thinmathspace}{0ex}}|{\mathcal{W}}^{*}(z)/{\sigma}_{{\mathcal{W}}^{*}(z)}|<{\tau}_{\alpha}\}=1-\alpha .$$

Note that for the case with a continuous response *Y*, $\tilde{\mathrm{\Omega}}=\hat{\mathrm{\Omega}}$.

Now, since we use the entire data set both to estimate the parameters in (2.1) and (2.4) and to estimate the average prediction errors (2.3) and (2.6), ${\mathcal{D}}_{1}(\cdot)$ and ${\mathcal{D}}_{2}(\cdot)$ may be significantly underestimated. To reduce such potential bias, one may consider the commonly used *K*-fold crossvalidation scheme. Specifically, we randomly split the data into *K* disjoint subsets of about equal size, labeled ${\mathcal{I}}_{k}$, *k* = 1, …, *K*. For each *k*, we use all the observations *not* in ${\mathcal{I}}_{k}$ to estimate the parameters in (2.1) and (2.4) via the estimating functions (2.2) and (2.5), and then use the observations in ${\mathcal{I}}_{k}$ to estimate the prediction errors ${\mathcal{D}}_{1}(\cdot)$ and ${\mathcal{D}}_{2}(\cdot)$ with (2.8) and (2.9). Denote the resulting estimators by ${\hat{\mathcal{D}}}_{1k}(\cdot)$ and ${\hat{\mathcal{D}}}_{2k}(\cdot)$, respectively. The crossvalidated estimators for ${\mathcal{D}}_{1}(\cdot)$, ${\mathcal{D}}_{2}(\cdot)$ and Δ_{0}(·) are ${\tilde{\mathcal{D}}}_{1}(\cdot)={K}^{-1}{\sum}_{k=1}^{K}{\hat{\mathcal{D}}}_{1k}(\cdot)$, ${\tilde{\mathcal{D}}}_{2}(\cdot)={K}^{-1}{\sum}_{k=1}^{K}{\hat{\mathcal{D}}}_{2k}(\cdot)$ and $\tilde{\mathrm{\Delta}}(\cdot)={\tilde{\mathcal{D}}}_{1}(\cdot)-{\tilde{\mathcal{D}}}_{2}(\cdot)$, respectively. Again, these estimators are uniformly consistent provided *K* is small relative to *n*.

In Appendix B of the web-based supplementary material, we show that for large *n*, the distributions of the processes ${\tilde{\mathcal{W}}}_{1}(\cdot)={n}^{1/2}\{{\tilde{\mathcal{D}}}_{1}(\cdot)-{\mathcal{D}}_{1}(\cdot)\}$, ${\tilde{\mathcal{W}}}_{2}(\cdot)={n}^{1/2}\{{\tilde{\mathcal{D}}}_{2}(\cdot)-{\mathcal{D}}_{2}(\cdot)\}$ and $\tilde{\mathcal{W}}(\cdot)={n}^{1/2}\{\tilde{\mathrm{\Delta}}(\cdot)-{\mathrm{\Delta}}_{0}(\cdot)\}$ can also be approximated well by those of ${\mathcal{W}}_{1}^{*}(\cdot)$, ${\mathcal{W}}_{2}^{*}(\cdot)$ and ${\mathcal{W}}^{*}(\cdot)$, respectively. Point-wise and simultaneous confidence intervals for ${\mathcal{D}}_{1}(\cdot)$, ${\mathcal{D}}_{2}(\cdot)$, and Δ_{0}(·) can then be constructed from the crossvalidated estimates and their large-sample distributions accordingly.
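The *K*-fold scheme above can be sketched as follows (our own code under an illustrative continuous-response setup with *g*_{1} the identity, as in the HIV example below; `fit` and `predict` are placeholder callables, not the authors’ API).

```python
# Sketch of the K-fold crossvalidated prediction-error curve: fit on the
# complement of each fold, average |Y - Yhat| on the held-out fold
# within each score window, then average the K curves.
import numpy as np

def cv_pred_error(X, Y, fit, predict, z_grid, half_width, K=5, rng=None):
    rng = rng if rng is not None else np.random.default_rng(2)
    folds = rng.permutation(len(Y)) % K          # balanced fold labels
    curves = np.zeros((K, len(z_grid)))
    for k in range(K):
        train, test = folds != k, folds == k
        beta = fit(X[train], Y[train])
        score = predict(X[test], beta)           # g1(beta' X); here also Yhat
        err = np.abs(Y[test] - score)
        for j, z in enumerate(z_grid):
            m = np.abs(score - z) <= half_width
            curves[k, j] = err[m].mean() if m.any() else np.nan
    return np.nanmean(curves, axis=0)            # crossvalidated D-tilde(z)

# illustrative linear working model fitted by least squares
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=300)
fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
predict = lambda A, beta: A @ beta
curve = cv_pred_error(X, Y, fit, predict,
                      z_grid=np.linspace(-1.0, 3.0, 9), half_width=0.5)
```

Because each subject’s error is computed from a model fitted without that subject, the curve avoids the optimistic bias described above.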

**3. Examples**

We use two examples to illustrate the new proposals. The first example is from a clinical trial conducted by the AIDS Clinical Trials Group, ACTG 320 (Hammer et al., 1997). The study demonstrated that, for various response endpoints, the three-drug combination therapy consisting of indinavir, zidovudine and lamivudine is on average much better than the two-drug combination without indinavir for treating HIV-infected patients. Unfortunately, even with this potent combination, some patients may not respond to treatment yet suffer non-trivial toxicity. Therefore, for future patient management, it is important to have a reliable model for predicting a patient’s treatment response based on certain “baseline” markers. A common approach is to use the baseline CD4 count and HIV RNA, a measure of viral load, together with the early changes of these two markers after initiation of therapy, for treatment guidance (Demeter et al., 2001). For resource-limited regions, however, obtaining RNA measurements is relatively expensive. Therefore, a challenging question is when RNA is needed in addition to CD4 for better prediction of the patient’s response.

Recently, Tian et al. (2007) demonstrated that, in a population-average sense, neither the baseline RNA nor the early RNA change (from baseline to Week 8) adds a clinically meaningful value for predicting the long-term change in CD4 (from baseline to Week 24), an important measure of the patient’s immune response. Here, we try to locate a subgroup of patients, if any, who would benefit from the expensive marker RNA. To this end, let the response *Y* be the change in CD4 cell counts from Week 0 to Week 24, let *U* consist of age, baseline CD4 and the early change in CD4, and let *V* consist of the baseline RNA and the early change in RNA. For our analysis, in Models (2.1) and (2.4), we let *X* = (1, *U*′)′, *W* = (1, *U*′, *V*′)′, and *g*_{1}(·) and *g*_{2}(·) be the identity function. Also, we let *Ŷ*_{1}(β′*X*) = β′*X*, *Ŷ*_{2}(θ′*W*) = θ′*W*, *D*(*a*, *b*) = |*a* − *b*| and the interval *J_{z}* be [*z* − 10, *z* + 10].

Table 1. Estimates of the regression parameters with their standard errors and corresponding p-values for testing zero covariate effects for the AIDS example

For both working models, we utilized the 5-fold crossvalidation scheme discussed in Section 2 to obtain the regression parameter estimates and then ${\tilde{\mathcal{D}}}_{1}(\cdot)$, ${\tilde{\mathcal{D}}}_{2}(\cdot)$, and $\tilde{\mathrm{\Delta}}(\cdot)$. In Figure 1, we present these estimated prediction errors and their differences with the corresponding 0.95 point-wise and simultaneous confidence intervals given in (2.10)–(2.13). The values of ${\tilde{\mathcal{D}}}_{1}(z)$ based on the model with age, baseline CD4 and early change in CD4 range from 37 to 74. The values of ${\tilde{\mathcal{D}}}_{2}(z)$ based on the model with additional RNA information range from 36 to 73. The estimated differences $\tilde{\mathrm{\Delta}}(z)$ range from −1.7 to 6.0. These indicate that there is no clinically meaningful gain from RNA for any subgroup of patients classified by $\hat{\beta}\prime X$. One may draw further statistical inference about Δ_{0}(·). For example, for subjects whose score ${g}_{1}(\hat{\beta}\prime X)\in {J}_{z}=[40,60]$, the estimated $\tilde{\mathrm{\Delta}}(50)=0.45$ with a 0.95 point-wise interval of (−3.25, 4.15) and a simultaneous interval of (−7.48, 8.38). Note that the results reported here are based on the 5-fold crossvalidated estimates.

Figure 1. Point estimates for ${\mathcal{D}}_{1}(\cdot)$, ${\mathcal{D}}_{2}(\cdot)$ and Δ_{0}(·) with corresponding 0.95 point-wise (dashed lines) and simultaneous (shaded regions) confidence intervals for the HIV example.

The data for the second example are from a population of patients screened for a clinical study, called TRACE, for treating heart failure or acute myocardial infarction (MI) (Kober et al., 1995). There were 6676 patients screened. Each patient had six routine clinical covariates: age, creatinine (CRE), occurrence of heart failure (CHF), history of diabetes (DIA), history of hypertension (HYP), and cardiogenic shock after MI (KS). Moreover, each patient had an echocardiographic assessment of left ventricular systolic function, quantified by a measure called the wall motion index (WMI). Compared with the above six covariates, the WMI is relatively expensive to obtain. Although not every screened patient entered the clinical trial, all screened patients were followed closely for mortality.

Recently, Thune et al. (2005) studied the prognostic importance of left ventricular systolic function, in addition to the patient’s medical history, in patients diagnosed with either heart failure or acute MI. It would be interesting to identify subpopulations that can benefit from the extra WMI measure for predicting clinical outcomes such as mortality. Here, we let the outcome *Y* be a binary variable that is one if the patient died within five years. The five-year survival rate for this data set is approximately 42%. To evaluate the incremental value of WMI, we first fit the data using Model (2.1) with *X* = (1, AGE, CRE, CHF, DIA, HYP, KS)′ and *g*_{1}(*s*) = exp(*s*)/{1 + exp(*s*)}. With the extra variable WMI, we fit a second logistic regression model with *W* = (*X*′, WMI)′. A total of 5921 subjects have complete predictor information. The estimates of the regression parameters with their standard errors are reported in Table 2. Note that WMI is highly statistically significant.

Here, we consider the prediction rules

$${\hat{Y}}_{1}(\beta \prime X)=I\{{g}_{1}(\beta \prime X)\ge c\},$$

(3.1)

and

$${\hat{Y}}_{2}(\theta \prime W)=I\{{g}_{2}(\theta \prime W)\ge c\}.$$

(3.2)

Note that we have also fitted the data with more complicated models, for example, by including various interaction terms. The results, however, are almost identical to those based on the above two additive models. For binary response variables, we consider the distance function *D*(*a*, *b*) = |*a* − *b*|, where *a* is the observed response and *b* is the binary predicted response based on the working model with estimated event rate *g*(·). This distance function is a conventional metric for evaluating binary prediction rules. One may use other distance functions, for instance, by letting *b* be *g*(·) from (2.1) and (2.4) and *D*(*a*, *b*) = |*a* − *b*| or (*a* − *b*)^{2}. Moreover, one may consider a likelihood-based criterion to evaluate Models (2.1) and (2.4). In general, a metric comparing *Y* with the estimated *g*(·) can efficiently discriminate between two prediction models. For the present case, however, we are more interested in evaluating the practical performance of the specific prediction rules than the adequacy of the model fitting (although the two are closely related). Thus, considering distance functions between a practically applicable prediction rule *Ŷ* and the true response *Y* seems more relevant. The distance function |*a* − *b*| consists of two discordance rates, or two types of error rates. Specifically, ${\mathcal{D}}_{1}(z)$ in (2.3) is ${\mathcal{D}}_{11}(z)+{\mathcal{D}}_{10}(z)$, where ${\mathcal{D}}_{11}(z)=E[{Y}^{0}D\{1,{\hat{Y}}_{1}({\beta}_{0}^{\prime}{X}^{0})\}|\phantom{\rule{thinmathspace}{0ex}}{g}_{1}({\beta}_{0}^{\prime}{X}^{0})\in {J}_{z}]\phantom{\rule{thinmathspace}{0ex}}\text{and}\phantom{\rule{thinmathspace}{0ex}}{\mathcal{D}}_{10}(z)=E[(1-{Y}^{0})D\{0,{\hat{Y}}_{1}({\beta}_{0}^{\prime}{X}^{0})\}|\phantom{\rule{thinmathspace}{0ex}}{g}_{1}({\beta}_{0}^{\prime}{X}^{0})\in {J}_{z}]$ are the false negative and false positive discordance rates, respectively.
Similarly, ${\mathcal{D}}_{2}(z)={\mathcal{D}}_{20}(z)+{\mathcal{D}}_{21}(z)$. Let ${\mathrm{\Delta}}_{0}(z)={\mathcal{D}}_{10}(z)-{\mathcal{D}}_{20}(z)$ and ${\mathrm{\Delta}}_{1}(z)={\mathcal{D}}_{11}(z)-{\mathcal{D}}_{21}(z)$. Oftentimes, a false negative error is more serious than a false positive one. Therefore, one may consider a weighted sum Δ_{0}(*w*, *z*) = *w*_{0}Δ_{0}(*z*) + *w*_{1}Δ_{1}(*z*) to evaluate the prediction rules. Here, *w* = (*w*_{0}, *w*_{1})′ and the weights *w*_{0} and *w*_{1} reflect the “cost” of making these two types of errors. It is interesting to note that the corresponding distance function for Δ_{0}(*w*, *z*) is ${w}_{0}^{1-Y}{w}_{1}^{Y}|Y-\hat{Y}|$. For a given *w*, the crossvalidated point estimates $\tilde{\mathrm{\Delta}}(w,z)$ and the corresponding interval estimates for Δ_{0}(*w*, *z*) can be constructed as for Δ_{0}(*z*) in Section 2. The large sample properties of $\tilde{\mathrm{\Delta}}(w,z)$ are derived in Appendix A of the web-based supplementary material.
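The error decomposition and weighting just described can be sketched in a few lines (our own code; variable names and the symmetric window are illustrative, and the weights encode the relative cost of the two error types).

```python
# Sketch of the false-negative / false-positive decomposition within a
# score window and the weighted criterion w0*D10(z) + w1*D11(z).
import numpy as np

def weighted_window_error(Y, Yhat, score, z, half_width, w0=1.0, w1=1.0):
    m = np.abs(score - z) <= half_width
    y, yhat = Y[m], Yhat[m]
    fp = np.mean((1 - y) * yhat)        # D10: false-positive discordance
    fn = np.mean(y * (1 - yhat))        # D11: false-negative discordance
    return w0 * fp + w1 * fn

# tiny worked example: one false negative and one false positive among
# four subjects; false negatives cost three times as much (w1 = 3)
Y = np.array([1.0, 1.0, 0.0, 0.0])
Yhat = np.array([0.0, 1.0, 1.0, 0.0])
score = np.full(4, 0.5)
e = weighted_window_error(Y, Yhat, score, z=0.5, half_width=0.2,
                          w0=1.0, w1=3.0)
```

With equal weights the function reduces to the windowed mis-classification rate used earlier, since |*Y* − *Ŷ*| is the sum of the two discordance indicators.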

We first considered the most commonly used prediction rule with *c* = 0.5 and *w*_{0} = *w*_{1} = 1. The 5-fold crossvalidated estimates, obtained by letting *J _{z}* be the entire real line in (2.8) and (2.9), for the overall prediction errors

Figure 2. Point estimates for ${\mathcal{D}}_{1}(\cdot)$, ${\mathcal{D}}_{2}(\cdot)$ and Δ_{0}(·) with corresponding 0.95 point-wise (dashed lines) and simultaneous (shaded regions) confidence intervals for the screened population of the TRACE study (the prediction **...**

To illustrate the effect of the weighting parameter *w* = (*w*_{0}, *w*_{1}) on the incremental value of WMI, we present in Figure 3(a),(b),(c) and (d) the point and interval estimates of Δ_{0}(*w*, *z*) for the predictors (3.1) and (3.2) with *c* = 0.5 and various choices of *w*.

Figure 3. Point estimate $\tilde{\mathrm{\Delta}}(w,\cdot)$ for Δ_{0}(*w*, ·) with various weights and the corresponding 0.95 point-wise (dashed lines) and simultaneous (shaded regions) confidence intervals for the screened population of the TRACE study **...**

Note that when *w*_{0} ≠ *w*_{1}, even if the working model is correctly specified, the prediction rule in (3.1) or (3.2) with *c* = 0.5 is not optimal with respect to the weighted error rate. Furthermore, with the unequal weighting criterion, for some subgroups of patients, inclusion of the extra information of WMI may significantly decrease the prediction precision. For a given *w*, with the weighted sum prediction precision measure ${w}_{0}{\mathcal{D}}_{10}(z)+{w}_{1}{\mathcal{D}}_{11}(z)$, it is straightforward to show that the optimal prediction rule based on *X* that minimizes this criterion is *Ŷ* = *I*{pr(*Y* = 1 | *X*) ≥ *c_{w}*}, where *c_{w}* = *w*_{0}/(*w*_{0} + *w*_{1}).
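The optimal cutoff *w*_{0}/(*w*_{0} + *w*_{1}) follows because flipping a prediction from 1 to 0 trades a cost of *w*_{0}{1 − pr(*Y* = 1 | *X*)} for *w*_{1} pr(*Y* = 1 | *X*). A quick numerical check (our own, with illustrative numbers) confirms it by brute force over a grid of cutoffs:

```python
# Brute-force check of the weighted-error cutoff: with known event
# probabilities p = pr(Y=1|X), the rule I{p >= c} has expected weighted
# error w0*E[(1-p)*Yhat] + w1*E[p*(1-Yhat)]; its minimizer over c
# should sit near w0/(w0 + w1).
import numpy as np

def expected_weighted_error(p, c, w0, w1):
    """Expected weighted error of the rule I{p >= c}, averaged over a
    population with event probabilities p."""
    yhat = (p >= c).astype(float)
    return np.mean(w0 * (1 - p) * yhat + w1 * p * (1 - yhat))

p = np.linspace(0.01, 0.99, 99)     # illustrative population of risks
w0, w1 = 1.0, 3.0                   # false negatives cost 3x more
cuts = np.linspace(0.01, 0.99, 99)
errs = np.array([expected_weighted_error(p, c, w0, w1) for c in cuts])
best = cuts[int(np.argmin(errs))]   # should lie near w0/(w0+w1) = 0.25
```

Making false negatives costlier (larger *w*_{1}) pushes the optimal cutoff below 0.5, i.e., the rule classifies more patients as high risk.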

Figure 4. Point estimate $\tilde{\mathrm{\Delta}}(w,\cdot)$ for Δ_{0}(*w*, ·) with the “optimal” weights and the corresponding 0.95 point-wise (dashed lines) and simultaneous (shaded regions) confidence intervals for the screened population **...**

As suggested by a reviewer of the paper, we have also applied our method to quantify the incremental value of new biomarkers based on the integrated discrimination improvement (IDI) index (Pencina et al., 2008; Pepe et al., 2008). The IDI index is defined as the integrated difference in Youden’s indices (Youden, 1950). It can be viewed as an improvement in the average sensitivity and specificity with the new markers. The crossvalidated point estimates and the corresponding 0.95 simultaneous interval estimates are presented in Figure 1 of the web-based supplementary material. With this utility function, it appears that WMI would be beneficial for predicting a subject’s five-year survival for all patients whose conventional risk scores are between 0.10 and 0.81. It is important to note that there are no specific prediction rules attached to this approach (similar to the area under the ROC curve as an average measure over a class of prediction rules). Therefore, in the end it is not clear which prediction rule(s) one would recommend for practical use based on the model with the additional markers. Utilizing an overall measure of the incremental value for evaluating new markers over various subgroups of patients can nonetheless be useful at the initial stage of the investigation.

**4. Remarks**

As for any scientific investigation, we need to define the endpoint of the study at the very first step. For the HIV example in Section 3, instead of using the change in CD4 counts as the endpoint, one may be interested in the percent change in CD4 counts from the baseline level. This is equivalent to considering the change in the log-transformed CD4 counts as the endpoint. For the TRACE example in Section 3, we dichotomized the patient’s survival time so that the response variable in our analysis is the binary survival status at five years. Naturally, we may consider prediction rules for other time points. A good prediction rule for five-year survival may not be appropriate for ten-year survival. Therefore, when evaluating prediction rules for the patient’s survival time, a global distance function between the observed and predicted survival times, for example one based on the *L*_{1} norm, may not be desirable. Moreover, in practice the survival time often cannot be observed completely, and the support of its censoring variable is generally shorter than that of the survival time. As a result, it is difficult to evaluate prediction rules efficiently with, for example, the *L*_{1} distance without artificial extrapolation. On the other hand, using the approach taken by Uno et al. (2007), one may identify patients who would benefit from the additional biomarkers via prediction of *t*-year survival.

After we select the endpoint of the study, we should make every effort to find the “best” models (2.1) and (2.4) that fit the data well. Model (2.1) then provides a reasonable scoring system with which we can classify future patients into subgroups via the conventional markers. Subsequently, we fit the response variable with the conventional and new marker values jointly to build model (2.4). Such an elaborate joint model may include interactions between the conventional and new markers, which are potential contributors to the varying incremental values of the new markers.

The next crucial step is to choose a proper prediction precision measure to quantify the incremental value of the new markers. Different distance functions between the predicted and observed responses may result in quite different conclusions regarding the selection of subsets of patients, as illustrated by the two real examples in Section 3. Since the final decision on using the new biomarkers would be based on the trade-off between the risk/cost and the benefit, the distance function needs to be “clinically” or heuristically interpretable. In analyzing the data from the HIV and TRACE trials, we proposed several distance functions for illustration. The choice of such functions is by no means restricted to those discussed in Section 3. For example, one may use a theoretically interesting metric such as the conventional likelihood ratio statistic or the mean squared error loss to differentiate two prediction models (one with and the other without the new markers). However, the magnitude of the gain under such a metric is often difficult to interpret when cost or risk is involved in decision making.

In practice, the choice of the distance function even for the simple case with a binary response is rather complex. In the cardiovascular disease arena, a patient with more than a 10% risk of having a serious cardiovascular event within ten years is conventionally regarded as high risk and usually would be recommended for certain preventive treatments. However, the utility function may vary across individuals, and hence different patients may have different optimal cutoff points for predicting patient-level outcomes. The weighted sum of prediction error rates presented in this article is an attempt to cope with this complicated cost-benefit issue. The complexities of choosing a distance function extend to the case with continuous responses. For example, weighting absolute prediction errors according to the observed response may lead to a more meaningful penalty in some applications than the unweighted counterpart.
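A weighted sum of prediction error rates of the kind discussed above can be sketched as follows. The cutoff, weights, and data are illustrative assumptions, not quantities from the paper's analyses; the weights encode the relative cost of missing a case versus over-treating a non-case.

```python
import numpy as np

# Hypothetical predicted 10-year risks and observed event indicators.
risk  = np.array([0.05, 0.12, 0.30, 0.08, 0.22, 0.03, 0.15, 0.40])
event = np.array([0,    1,    1,    0,    0,    0,    1,    1   ])

def weighted_error(risk, event, cutoff=0.10, w_fn=0.7, w_fp=0.3):
    """Weighted sum of false-negative and false-positive rates at a cutoff.

    Different utility functions correspond to different (w_fn, w_fp)
    weights and hence different optimal cutoff points.
    """
    high = risk >= cutoff
    fnr = np.mean(~high[event == 1])   # cases classified as low risk
    fpr = np.mean(high[event == 0])    # non-cases classified as high risk
    return w_fn * fnr + w_fp * fpr

total = weighted_error(risk, event)
```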

In this paper, we provide a useful tool for making valid inferences on the incremental value of new markers simultaneously over subsets of patients, given a well-defined endpoint, prediction models, utility (distance) function, and study population. We propose the use of a simultaneous confidence band for the incremental values to control the type I error and determine whether the new markers have a positive incremental value in a subset of patients. As the sample size increases, the confidence bands become tighter, and one would be able to identify all subsets of interest more accurately.
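One generic way to calibrate a simultaneous band of this kind is via a bootstrap sup-statistic; the sketch below is an illustration of that general idea on simulated subgroup-level data, not the specific resampling procedure developed in Section 2. All quantities (the number of subgroups, the per-patient error reductions) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: for each of k subgroups indexed by the conventional
# risk score, n per-patient error reductions from adding the new markers.
k, n = 5, 200
true_gain = np.array([0.00, 0.01, 0.03, 0.05, 0.08])
data = true_gain[:, None] + rng.normal(0.0, 0.1, size=(k, n))

est = data.mean(axis=1)
se = data.std(axis=1, ddof=1) / np.sqrt(n)

# Calibrate one critical value c with the bootstrap sup-statistic, so that
# the k intervals est_j +/- c * se_j cover all subgroups jointly at ~95%.
B = 1000
sup_stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot = data[:, idx].mean(axis=1)
    sup_stats[b] = np.max(np.abs(boot - est) / se)
c = np.quantile(sup_stats, 0.95)

lower = est - c * se
upper = est + c * se
# Subgroups whose simultaneous lower limit exceeds zero show evidence of
# a positive incremental value from the new markers.
positive = lower > 0
```

Because one critical value must cover all k subgroups at once, c exceeds the pointwise normal quantile, which is exactly the price paid for controlling the type I error simultaneously.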

Lastly, we may want to identify subsets where the incremental value is greater than a positive threshold to incorporate the cost of measuring the new markers. As such, subsets where the new markers have a positive, yet very small added predictive value would be excluded. If the new markers become less costly or invasive in the future, we may construct a new scoring system to index patients. It is likely that some old markers may not be needed on top of the new ones.

The authors are grateful to Drs. Jens Jakob Thune and James Signorovitch for helpful comments on the paper. The authors are also very grateful to two reviewers and an associate editor for their insightful comments. This research was partially supported by the US NIH grants for I2B2, AIDS and General Medical Sciences.

**Supplementary Materials**

Web Appendices and figure referenced in Sections 2 and 3 are available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.

- Blumenthal R, Michos E, Nasir K. Further improvements in CHD risk prediction for women. JAMA. 2007;297:641–643. [PubMed]
- Breiman L, Friedman J, Stone C, Olshen R. Classification and regression trees. Chapman & Hall/CRC; 1984.
- Brier G. Verification of forecasts expressed in terms of probability. Monthly Weather Review. 1950;78:1–3.
- D’Agostino RB. Risk prediction and finding new independent prognostic factors. Journal of Hypertension. 2006;24:643–645. [PubMed]
- Demeter L, Hughes M, Coombs R, Jackson J, Grimes J, Bosch R, Fiscus S, Spector S, Squires K, Fischl M, Hammer S. Predictors of virologic and clinical outcomes in HIV-1-infected patients receiving concurrent treatment with indinavir, zidovudine, and lamivudine. AIDS Clinical Trials Group Protocol 320. Annals of Internal Medicine. 2001;135:954–964. [PubMed]
- Hammer S, Squires K, Hughes M, Grimes J, Demeter L, Currier J, Eron J, Feinberg J, Balfour H, Deyton L, Chodakewitz J, Fischl M. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. New England Journal of Medicine. 1997;337:725–733. [PubMed]
- Kober L, Torp-Pedersen C, Carlsen J, Bagger H, Eliasen P, Lyngborg K, Videbak J, Cole D, Auclert L, Pauly N, Aliot E, Persson S, Camm A. A clinical trial of the angiotensin-converting-enzyme inhibitor trandolapril in patients with left ventricular dysfunction after myocardial infarction. New England Journal of Medicine. 1995;333:1670–1676. [PubMed]
- Korn E, Simon R. Measures of explained variation for survival data. Statistics in Medicine. 1990;9:487–503. [PubMed]
- McLachlan J. Discriminant analysis and statistical pattern recognition. John Wiley & Sons; 1992.
- Mittlböck M, Schemper M. Explained Variation for Logistic Regression. Statistics in Medicine. 1996;15:1987–1997. [PubMed]
- Pencina MJ, D’Agostino R, Sr, D’Agostino R, Jr, Vasan R. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine. 2008;27:157–172. [PubMed]
- Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford University Press; 2003.
- Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. American Journal of Epidemiology. 2004;159:882–890. [PubMed]
- Pepe MS, Feng Z, Gu JW. Comments on “Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond” by M. J. Pencina et al. Statistics in Medicine. 2008;27:173–181. [PubMed]
- Ridker P, Rifai N, Rose L, Buring J, Cook N. Comparison of C-reactive protein and low-density lipoprotein cholesterol levels in the prediction of first cardiovascular events. New England Journal of Medicine. 2002;347:1557–1565. [PubMed]
- Ridker P, Buring J, Rifai N, Cook N. Development and validation of improved algorithms for the assessment of global cardiovascular risk in women: the Reynolds risk score. JAMA. 2007;297:611–619. [PubMed]
- Ripley B. Pattern recognition and neural networks. Cambridge University Press; 1996.
- Spiegelhalter D. Probabilistic prediction in patient management and clinical trials. Statistics in Medicine. 1986;5:421–433. [PubMed]
- Thune J, Carlsen C, Buch P, Seibak M, Burchardtb H, Torp-Pedersen C, Kober L. Different prognostic impact of systolic function in patients with heart failure and/or acute myocardial infarction. European Journal of Heart Failure. 2005;7:852–858. [PubMed]
- Tian L, Cai T, Goetghebeur E, Wei LJ. Model evaluation based on the distribution of estimated absolute prediction error. Biometrika. 2007;94:297–311.
- Uno H, Cai T, Tian L, Wei LJ. Evaluating prediction rules for t-Year survivors with censored regression models. Journal of American Statistical Association. 2007;102:527–537.
- Wang T, Gona P, Larson M, Tofler G, Levy D, Newton-Cheh C, Jacques P, Rifai N, Selhub J, Robins S, Benjamin E, D’Agostino R, Vasan R. Multiple biomarkers for the prediction of first major cardiovascular events and death. New England Journal of Medicine. 2006;355:2631–2639. [PubMed]
- Youden W. An index for rating diagnostic tests. Cancer. 1950;3:32–35. [PubMed]
- Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. John Wiley & Sons; 2002.
