
J Appl Stat. Author manuscript; available in PMC 2010 July 7.

Published in final edited form as:

J Appl Stat. 2009 July 7; 36(7): 769–778.

doi: 10.1080/0266476YYxxxxxxxx

PMCID: PMC2712304

NIHMSID: NIHMS75077

Binbing Yu, Laboratory of Epidemiology, Demography and Biometry, National Institute on Aging, Bethesda, Maryland 20892, U.S.A.


In disease screening and diagnosis, multiple markers are often measured and combined in order to improve the accuracy of diagnosis. McIntosh and Pepe (2002, *Biometrics* **58**, 657–664) showed that the risk score, defined as the probability of disease conditional on multiple markers, is the optimal function for classification based on the Neyman-Pearson Lemma. They proposed a two-step procedure to approximate the risk score. However, the resulting ROC curve is only defined on a subrange (*L*, *h*) of the false-positive rates in (0,1), and determination of the lower limit *L* requires extra prior information. In practice, most diagnostic tests are imperfect, and it is rare that a single marker is uniformly better than all the others. Using simulation, I show that the multivariate adaptive regression spline (MARS) is a useful tool for approximating the risk score when combining multiple markers, especially when the ROC curves from multiple tests cross. The resulting ROC curve is defined on the whole range (0,1), and the method is easy to implement and has an intuitive interpretation. Sample code for the application is given in the appendix.

The effectiveness of continuous markers in distinguishing between healthy and diseased subjects is generally assessed through the use of the Receiver Operating Characteristic (ROC) curve [11]. A subject is assessed as diseased (positive) or healthy (negative) according to whether the subject's marker value is greater than or less than or equal to a specified threshold value. Associated with any threshold value is the probability of a true positive (sensitivity) and the probability of a true negative (specificity). The resulting theoretical ROC curve is the plot of sensitivity versus 1-specificity for all possible threshold values. The area under the ROC curve (AUC) is a widely accepted summary index of the overall performance of diagnostic procedures.
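The threshold-sweep construction and the AUC can be made concrete in a few lines of code. The following Python sketch (with made-up marker values, not data from this article) computes the empirical ROC points and the AUC in its Mann-Whitney form, AUC = *P*(*Y*_{D} > *Y*_{H}) + 0.5 *P*(*Y*_{D} = *Y*_{H}):

```python
import numpy as np

def empirical_auc(y_diseased, y_healthy):
    """AUC in its Mann-Whitney form: P(Y_D > Y_H) + 0.5*P(Y_D == Y_H)."""
    d = np.asarray(y_diseased, dtype=float)
    h = np.asarray(y_healthy, dtype=float)
    greater = (d[:, None] > h[None, :]).mean()
    ties = (d[:, None] == h[None, :]).mean()
    return greater + 0.5 * ties

def roc_points(y_diseased, y_healthy):
    """(1 - specificity, sensitivity) pairs over all observed thresholds."""
    d = np.asarray(y_diseased, dtype=float)
    h = np.asarray(y_healthy, dtype=float)
    thresholds = np.unique(np.concatenate([d, h]))
    pts = []
    for c in thresholds:
        tpr = np.mean(d > c)   # sensitivity
        fpr = np.mean(h > c)   # 1 - specificity
        pts.append((fpr, tpr))
    return pts

# A marker that perfectly separates the two groups has AUC 1.
print(empirical_auc([3.0, 4.0], [1.0, 2.0]))  # -> 1.0
```

A marker with no discriminating ability (identical distributions in both groups) gives AUC 0.5, the chance level.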

In the screening and diagnosis of disease, often multiple tests are performed or multiple markers are routinely measured. For example, the neuropathologic diagnosis of Alzheimer's disease is typically based on the densities and distributions of brain lesions such as senile plaques and neurofibrillary tangles from different brain regions [17]. Strategies for combining information from multiple tests are therefore important, since a combination of the tests may provide more accurate diagnosis than any single test alone [13]. When the multiple markers are multivariate normal in both diseased and healthy populations, Su and Liu [16] derived an optimal linear discriminant function that maximizes the AUC. Pepe and Thompson [13] considered how to choose linear combinations of markers in order to optimize diagnostic accuracy. Pepe et al. [12] proposed using the empirical AUC as the objective function for combining multiple markers. The literature on combining multiple biomarkers (tests) is vast; for other combination approaches, see Baker [1], Ma and Huang [9], Yang and Jin [18], Zou et al. [20], among others.

Let *D* be a binary indicator of disease status, with *D* = 1 for diseased subjects and *D* = 0 for nondiseased subjects, and let the multiple diagnostic markers be denoted by **Y** = (*Y*_{1}, …, *Y*_{K}). Several methods have been proposed to find the optimal linear combination of the multiple markers, i.e., to find a vector **β** = (*β*_{1}, …, *β*_{K}) such that the linear score

$${L}_{\beta}(\mathbf{\text{Y}})={\beta}_{1}{Y}_{1}+{\beta}_{2}{Y}_{2}+\cdots +{\beta}_{K}{Y}_{K}$$

(1)

maximizes an objective function, e.g., the AUC. For standardization, one can restrict *β*_{1} = 1. Based on the Neyman-Pearson Lemma, McIntosh and Pepe [10] showed that the likelihood ratio (LR) rule is the uniformly most sensitive (UMS) screening test for combining multiple markers, in the sense that its ROC curve is maximized at every point. They also cleverly used Bayes' theorem to show that screening rules based on the risk score, i.e., the probability of being diseased conditional on the multiple markers, are equivalent to the LR rule.
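To see what optimizing (1) involves, the sketch below performs a naive grid search for the empirical-AUC-maximizing weight *β*_{2} with *β*_{1} = 1, on simulated bivariate-normal markers. Both the data and the grid-search strategy are illustrative assumptions; the cited papers use more refined optimization algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical markers: healthy N(0,1) in each coordinate, diseased shifted.
y_h = rng.normal(0.0, 1.0, size=(n, 2))
y_d = rng.normal([1.0, 1.0], 1.0, size=(n, 2))

def auc(score_d, score_h):
    """Empirical AUC of a scalar score (ties ignored for continuous data)."""
    return (score_d[:, None] > score_h[None, :]).mean()

# With beta_1 fixed at 1, search beta_2 on a grid for the empirical-AUC maximizer.
grid = np.linspace(-3, 3, 121)
aucs = [auc(y_d[:, 0] + b * y_d[:, 1], y_h[:, 0] + b * y_h[:, 1]) for b in grid]
best = grid[int(np.argmax(aucs))]
print(best, max(aucs))
```

Since the grid contains *β*_{2} = 0 (marker 1 alone), the maximized combination can never do worse than the first marker by itself on the training data.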

They proposed a two-step screening procedure to approximate the risk score, which involves refitting a regression model on a restricted region of the observations. Although the ROC curve from the two-step procedure is substantially closer to the optimal ROC curve, it is only estimable on a subrange of (0,1), and the choice of the restricted region requires extra information. I will show later that the multivariate adaptive regression spline (MARS) is a useful alternative tool for approximating the risk score.

Under the likelihood ratio rule, a decision of positive test is made whenever

$$\mathit{\text{LR}}(\mathbf{\text{Y}})\equiv \frac{P(\mathbf{\text{Y}}|D=1)}{P(\mathbf{\text{Y}}|D=0)}>{c}_{\alpha},$$

(2)

where *c*_{α} is chosen so that the false positive rate (Type I error) is *α*.

The Neyman-Pearson Lemma proves that the LR test is the most powerful test in the simplest situation, where both the null and alternative hypotheses are simple, i.e., each contains a single known distribution. The uniformly most powerful (UMP) test also exists if both hypotheses come from a real-parameter family with a monotone likelihood ratio [8]. However, it is rarely true that the marker distributions *P*(**Y**|*D*) are simple; they usually involve more than one unknown parameter. For example, in McIntosh and Pepe [10], the joint distribution of the two ovarian cancer markers is multivariate normal in the control group and a mixture of multivariate normals in the diseased group. In practice, the parameters are unknown and have to be estimated from the sample. The optimality of the LR rule relies on the correct specification of the distributions *P*(**Y**|*D*); a misspecified model could lead to a less powerful, or even biased, test. A more general class of models should therefore provide a better approximation to the risk score than the linear combination of the multiple tests.

By Bayes' rule, McIntosh and Pepe [10] showed that the risk score is a monotone function of the LR function. The risk score is defined as

$$P(D=1|\mathbf{\text{Y}})=\frac{\mathit{\text{LR}}(\mathbf{\text{Y}})q}{\mathit{\text{LR}}(\mathbf{\text{Y}})q+1},$$

where *q* = *P*(*D* = 1)/*P*(*D* = 0) is the odds of disease in the population. They then proposed a two-step screening procedure to approximate the risk-score rule. Although the two-step procedure yields an AUC close to that of the LR rule, it has several drawbacks. First, the estimation of the ROC curve in the two-step procedure is restricted to a subrange of the false positive rate, *α* ∈ (*L*, *h*) ⊂ (0, 1). Often, researchers are interested in the ROC curve over the whole range (0,1), especially at both ends, where the false positive (negative) rate is low. Second, the choice of the lower bound *L* requires some a priori information. It is suggested to use *L* = 5/*n*_{D̄}, where *n*_{D̄} is the number of nondiseased subjects.
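The monotone relationship between the risk score and the LR is easy to verify numerically. In the Python sketch below, the prevalence (and hence the odds *q*) is an assumed illustrative value; ranking subjects by risk score reproduces the ranking by LR, which is why both rules trace the same ROC curve:

```python
# Risk score as a monotone transform of the likelihood ratio:
#   P(D=1|Y) = LR(Y)*q / (LR(Y)*q + 1), with q the disease odds.
def risk_score(lr, q):
    return lr * q / (lr * q + 1.0)

q = 0.01 / 0.99  # hypothetical disease prevalence of 1%
lrs = [0.5, 1.0, 2.0, 10.0]
scores = [risk_score(lr, q) for lr in lrs]

# Monotonicity: sorting by risk score is the same as sorting by LR.
print(scores == sorted(scores))  # -> True
```

Because thresholding any monotone transform of LR(**Y**) yields the same family of decision rules, the risk score inherits the optimality of the LR rule.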

Pepe et al. [12] mention that the class of linear predictors for approximating the risk score is quite large, including generalized additive models, smoothing and regression splines, kernel methods, etc. They subsequently model the risk score using a generalized linear model with an unspecified link *g*(·), where

$$P(D=1|\mathbf{\text{Y}})=g({L}_{\beta}(\mathbf{\text{Y}})).$$

This relaxes the logistic assumption for *g*(·). In the literature thus far, the linear score *L*_{β}(**Y**) has been the predominant form for combining markers.

MARS, developed by Friedman [4], is a multivariate non-parametric regression procedure. It builds flexible regression models by fitting separate splines (or basis functions) over distinct intervals of the predictor variables. Kooperberg et al. [6] extended MARS to categorical outcomes; their procedure uses linear splines and their tensor products to fit a regression model for a polychotomous response. Based on simulations, Kooperberg et al. [6] showed that MARS performs at least as well as other established methods for binary responses, e.g., logistic regression, classification and regression trees, etc., with the advantage of being much faster on huge datasets. The procedure is implemented by the functions `POLYCLASS` and `POLYMARS` in the library `polspline` for R 2.7.0 [14]. To avoid over-fitting and for easier interpretation, I only consider the model with linear splines but no interaction (product) terms. As a result, the risk score is modeled as

$$\text{logit}\{P\left(D=1|\mathbf{\text{Y}}\right)\}={g}_{0}+\sum _{i=1}^{K}{g}_{i}({Y}_{i}),$$

(3)

where

$${g}_{i}({Y}_{i})={b}_{i0}{Y}_{i}+\sum _{j=1}^{{J}_{i}}{b}_{ij}{({Y}_{i}-{\tau}_{ij})}^{+},$$

with (*z* − *τ*)^{+} = max(*z* − *τ*, 0). This model is also called a piecewise linear model, and the *τ*_{ij}, *j* = 1, …, *J*_{i}, are the change points (knots) for marker *Y*_{i}.
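A minimal illustration of model (3): construct the linear-spline basis (*y* − *τ*)^{+} and fit the logistic model by Newton-Raphson. The single knot *τ* below is fixed by hand for illustration, whereas `POLYCLASS` selects knots adaptively; the data are simulated, not taken from the paper:

```python
import numpy as np

def hinge(y, tau):
    """Linear-spline basis (y - tau)^+ used in model (3)."""
    return np.maximum(y - tau, 0.0)

rng = np.random.default_rng(1)
n = 500
# One marker: healthy N(0,1), diseased N(1.5,1); d is the disease indicator.
y1 = np.concatenate([rng.normal(0, 1, n), rng.normal(1.5, 1, n)])
d = np.concatenate([np.zeros(n), np.ones(n)])

# Design matrix: intercept, linear term, and one hinge at a hypothetical
# knot tau = 0.5; the logit of the risk score is linear in these columns.
tau = 0.5
X = np.column_stack([np.ones_like(y1), y1, hinge(y1, tau)])

# Plain Newton-Raphson fit of the logistic model (fixed iteration count).
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (d - p))

print(np.round(beta, 2))
```

As in the two-marker example in the text, the fitted slope on *y*_{1} is `beta[1]` below the knot and `beta[1] + beta[2]` above it.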

There are several advantages to using MARS to approximate the risk score. First, the implementation is simpler than the two-step procedure, and the resulting ROC curve is defined on the whole range (0,1). Second, the combination in (3) has an easy interpretation. For example, for two markers *Y*_{1} and *Y*_{2} with one change point *τ*_{1} for *Y*_{1}, the linear combination is *b*_{10}*Y*_{1} + *b*_{20}*Y*_{2} when *Y*_{1} < *τ*_{1} and (*b*_{10} + *b*_{11})*Y*_{1} + *b*_{20}*Y*_{2} when *Y*_{1} ≥ *τ*_{1}. Based on this interpretation, the MARS procedure yields a piecewise screening rule similar to that of the two-step procedure, as shown in Figure 3a of McIntosh and Pepe [10], but the calculation is simpler and less likely to fail. Next, we use simulation to compare the proposed method with the linear combination *L*_{β}(**Y**).

Here I consider two markers *Y*_{1} and *Y*_{2}. In all cases the marker values for the nondiseased subjects are generated from a standard normal *N*(0, 1) distribution, while the marker values for the diseased subjects are generated from $N({\mu}_{1},{\sigma}_{1}^{2})$ and $N({\mu}_{2},{\sigma}_{2}^{2})$ for tests *Y*_{1} and *Y*_{2}, respectively. The marker values can be independent (*ρ* = 0) or correlated with *ρ* = 0.5.
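This simulation design can be sketched as follows; the parameter values here are illustrative placeholders rather than the Table 1 settings:

```python
import numpy as np

def simulate_markers(n, mu, sigma, rho, rng):
    """Bivariate-normal markers with correlation rho (sketch of the
    simulation design in the text; parameters are illustrative)."""
    cov = np.array([[sigma[0] ** 2, rho * sigma[0] * sigma[1]],
                    [rho * sigma[0] * sigma[1], sigma[1] ** 2]])
    return rng.multivariate_normal(mu, cov, size=n)

rng = np.random.default_rng(2)
# Nondiseased: standard normal in each coordinate; diseased: shifted/scaled.
healthy = simulate_markers(100_000, [0.0, 0.0], [1.0, 1.0], 0.5, rng)
diseased = simulate_markers(100_000, [1.0, 1.5], [1.0, 1.2], 0.5, rng)
print(np.round(np.corrcoef(healthy.T)[0, 1], 2))  # close to 0.5
```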

The parameters of the distributions for the two markers are shown in Table 1. For Cases 1-3, marker *Y*_{2} has better diagnostic accuracy than marker *Y*_{1}. For Cases 4-6, the ROC curves for the two markers cross. Figure 1 shows the ROC curves for two typical cases. The sample size (*n*) is the same in the control and diseased groups, *n*_{D} = *n*_{D̄} = *n*.

The results are based on 1000 simulations. The AUCs of the ROC curves from the different methods are shown in Table 2. The columns *Y*_{1}, *Y*_{2} and *L*_{β}(**Y**) give the AUCs for the two individual markers and their linear combination, respectively. The column ELR gives the AUC based on the estimated likelihood ratio

$$\mathit{\text{ELR}}(\mathbf{\text{Y}})=\frac{\widehat{P}(\mathbf{\text{Y}}|D=1)}{\widehat{P}(\mathbf{\text{Y}}|D=0)},$$

where *P̂*(**Y**|*D*) are the estimated normal distributions $N({\widehat{\mu}}_{k},{\widehat{\sigma}}_{k}^{2}),k=1,2$. Because the LR rule is the optimal classification rule, the AUC based on the LR rule is the largest and is used as a benchmark. Because the simulated data are normally distributed, the AUCs from the normal linear discriminant (NLD) rule [16] are shown in the last column. For all cases, the AUCs based on *L*_{β}(**Y**) are lower than those from the MARS and ELR methods.
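For normal markers, the ELR can be computed by plugging sample means and standard deviations into the normal densities. The sketch below uses illustrative parameter values (not the Table 1 settings) and assumes the two markers are independent:

```python
import numpy as np

def normal_pdf(y, mu, sd):
    """Density of N(mu, sd^2)."""
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
n = 1000
# Independent normal markers (illustrative parameters).
y_h = rng.normal([0.0, 0.0], 1.0, size=(n, 2))
y_d = rng.normal([1.0, 1.2], 1.0, size=(n, 2))

def elr(y, mu_d, sd_d, mu_h, sd_h):
    """Estimated likelihood ratio assuming independent normal markers."""
    num = normal_pdf(y[:, 0], mu_d[0], sd_d[0]) * normal_pdf(y[:, 1], mu_d[1], sd_d[1])
    den = normal_pdf(y[:, 0], mu_h[0], sd_h[0]) * normal_pdf(y[:, 1], mu_h[1], sd_h[1])
    return num / den

# Plug-in estimates from the sample, as in the ELR column of the text.
mu_d, sd_d = y_d.mean(axis=0), y_d.std(axis=0, ddof=1)
mu_h, sd_h = y_h.mean(axis=0), y_h.std(axis=0, ddof=1)
s_d = elr(y_d, mu_d, sd_d, mu_h, sd_h)
s_h = elr(y_h, mu_d, sd_d, mu_h, sd_h)
auc = (s_d[:, None] > s_h[None, :]).mean()
print(round(auc, 3))
```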

When the correlation *ρ* increases from 0 to 0.5, the relative gains of the MARS and ELR methods over the linear score remain similar. Compared with the linear score method, the increase in AUC from using MARS is slightly larger for the smaller sample size. I also examined markers with high correlation, *ρ* = 0.8, which is often the case when exposures are mixtures. The AUC estimates for *ρ* = 0.8 are similar to those for *ρ* = 0.5. Overall, the MARS method performs well and yields AUCs close to those of the optimal LR rule, especially when the two ROC curves cross. This is because neither method relies on a linear combination of the markers. Thus, MARS appears to be a useful alternative for approximating the risk score.

DeLong et al. [3] compared the discriminating ability of three screening scores on 49 consecutive ovarian cancer patients undergoing correction of intestinal obstruction at Duke University Medical Center. The 12 patients surviving more than 2 months postoperatively were considered surgical successes, and the remaining 37 were considered failures. The failure cases are considered test positive, so *n*_{D} = 37 and *n*_{D̄} = 12.

The AUCs are 0.622 for ALB, 0.543 for TP and 0.687 for the K-G score. One would like to see whether a combination could improve test accuracy. Figure 2a shows the ROC curves for the three scores and their linear combination. The K-G score has higher sensitivity than the other two scores when the false positive rate is greater than 0.5, but lower sensitivity when the false positive rate is low; thus, their ROC curves cross.

When Su and Liu's linear combination is used, the vector of weights is (*β*_{ALB}, *β*_{TP}, *β*_{KG}).

Figure 2b compares the ROC curves from the NLD method, the two-step method and the MARS. Because *n*_{D̄} = 12, the lower bound for the two-step procedure is *L* = 5/12 ≈ 0.42.

When MARS is used to fit the risk score, one change point is found for the K-G score at *τ* = 3. The vector of weights is (*β*_{ALB}, *β*_{TP}, *β*_{KG}).

Motivated by a biological model, McIntosh and Pepe [10] simulated two biomarkers, e.g., CA 125 (marker 1) and Her-2/neu (marker 2). For the control group, both markers have standard normal marginal distributions. For the diseased group, marker 1 has a mixture normal distribution *Y*_{1} ~ **I**(*W*_{1} = 0)*N*(0, 1) + **I**(*W*_{1} = 1)*N*(1.68, 2) and marker 2 has a mixture normal distribution *Y*_{2} ~ **I**(*W*_{2} = 0)*N*(0, 1) + **I**(*W*_{2} = 1)*N*(2.54, 2), where (*W*_{1}, *W*_{2}) are binary indicators of whether the markers elevate with the onset of cancer. Let *p*_{ij} = *P*(*W*_{1} = *i*, *W*_{2} = *j*) denote the joint probabilities of the indicators.
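The mixture design can be sketched as follows. The joint probabilities *p*_{ij} below are illustrative placeholders (the biological model in [10] specifies its own values), and *N*(μ, 2) is read here as a normal with variance 2:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
# Controls: standard normal marginals for both markers.
y_h = rng.normal(0.0, 1.0, size=(n, 2))

# Diseased: each marker elevates only when its indicator W_k equals 1.
# p_ij = P(W1=i, W2=j); illustrative values, uniform over the four cells,
# in the order (0,0), (0,1), (1,0), (1,1).
p = [0.25, 0.25, 0.25, 0.25]
cells = rng.choice(4, size=n, p=p)
w1, w2 = cells // 2, cells % 2
y_d = np.column_stack([
    np.where(w1 == 1, rng.normal(1.68, np.sqrt(2), n), rng.normal(0, 1, n)),
    np.where(w2 == 1, rng.normal(2.54, np.sqrt(2), n), rng.normal(0, 1, n)),
])
print(y_d.shape)
```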

The AUCs for the two markers are 0.640 and 0.614, respectively. The AUCs from the linear combination, MARS and the LR rule are 0.715, 0.739 and 0.745, respectively. As shown in Figure 3, the bottom two ROC curves are those of the two individual markers; they cross at a false positive rate of 0.8. The ROC curve from MARS is very close to that from the LR rule and is almost indistinguishable from it for false positive rates below 0.4. The ROC curve from MARS lies above that from the linear combination over the whole range (0,1). The difference in AUC between MARS and the linear combination is highly statistically significant because of the large sample size. Looking closely at the range (0, 0.15), the figure is very similar to Figure 2b in [10].

In this article, MARS is proposed as a tool for approximating the risk score in disease diagnosis. This approach extends the linear combination of the markers by adding linear spline terms. The resulting ROC curve is defined on the whole range (0,1) of false positive rates, and the implementation is easier than the two-step procedure. Based on simulation, the AUCs from MARS are very close to those from the LR rule.

The MARS technique is now widely used as a classification tool in problems with a large number of predictors, for example gene expression data [7]. MARS has previously been shown to be potentially useful for gene association studies and the detection of variable interactions [2, 19]. As a new basis for diagnostic tests is being provided by the vast amount of data on gene expression now becoming available through large-scale measurement of mRNA abundance [15], we expect that MARS will become a useful tool for disease diagnosis using gene expression data. To facilitate use of the proposed method, sample code for the analysis of DeLong et al.'s data is provided in the appendix.

This research was supported in part by the Intramural Research Program of the National Institute on Aging.

```r
#===================================================================
# This program compares different methods of approximating risk
# scores using the data from DeLong et al. (1988)
# Data: http://ssc.utexas.edu/docs/sashelp/sugi/24/Posters/p250-24.pdf
#===================================================================

# Include R libraries
library(ROCR)
library(caTools)
library(polspline)

# Read DeLong et al's data
YD <- read.table("Delong.txt", na.strings=".", sep="\t", header=TRUE)
YD$y1 <- as.numeric(YD$y1)
YD$y2 <- as.numeric(YD$y2)

# Figure 2a
pred.y1 <- prediction(YD$y1, YD$d)
perf.y1 <- performance(pred.y1, 'tpr', 'fpr')
pred.y2 <- prediction(YD$y2, YD$d)
perf.y2 <- performance(pred.y2, 'tpr', 'fpr')
pred.y3 <- prediction(YD$y3, YD$d)
perf.y3 <- performance(pred.y3, 'tpr', 'fpr')

# Linear combination via logistic regression
pred4 <- glm(d ~ y1 + y2 + y3, family=binomial(link="logit"), data=YD)
pred.y4 <- prediction(pred4$fitted.values, YD$d)
perf.y4 <- performance(pred.y4, 'tpr', 'fpr')

plot(perf.y1, lty=2, lwd=2, main="", xaxis.lwd=2, yaxis.lwd=2,
     yaxis.cex.axis=1.5, xaxis.cex.axis=1.5, box.lwd=2,
     cex.label=2, cex.main=2)
plot(perf.y2, lty=1, lwd=2, add=TRUE)
plot(perf.y3, lty=3, lwd=2, add=TRUE)
plot(perf.y4, lty=1, lwd=3, add=TRUE)
# Legend line types/widths match the plotted curves above
legend(0, 1.0, c("ALB", "TP", "K-G Score", "Linear Combination"),
       lty=c(2, 1, 3, 1), lwd=c(2, 2, 2, 3), cex=1.25)

# Figure 2b
plot(perf.y4, lty=1, lwd=2, main="", xaxis.lwd=2, yaxis.lwd=2,
     yaxis.cex.axis=1.5, xaxis.cex.axis=1.5, box.lwd=2,
     cex.label=2, cex.main=2)

#------ Approximate risk score using POLYCLASS (MARS) --------------
fmars <- polyclass(YD$d, YD[, c("y1", "y2", "y3")], silent=TRUE, maxdim=7)
nk <- fmars$knots[, 1]   # number of knots selected for each marker
na <- sum(nk)            # total number of knots

# Build the linear-spline basis (y - tau)^+ for each selected knot
aY <- matrix(0, nrow=nrow(YD), ncol=na)
j <- 1
for (k in 1:3) {
  if (nk[k] > 0) {
    for (l in 1:(nk[k])) {
      tau <- fmars$knots[k, l + 1]
      aY[, j] <- pmax(YD[, paste("y", k, sep="")] - tau, 0)
      j <- j + 1
    }
  }
}
aY <- data.frame(aY)

if (na > 0) {
  names(aY) <- paste("a", 1:na, sep="")
  YD <- cbind(YD, aY)
  if (na == 1) pred5 <- glm(d ~ y1 + y2 + y3 + a1, family=binomial, data=YD)
  if (na == 2) pred5 <- glm(d ~ y1 + y2 + y3 + a1 + a2, family=binomial, data=YD)
  if (na == 3) pred5 <- glm(d ~ y1 + y2 + y3 + a1 + a2 + a3, family=binomial, data=YD)
  print(summary(pred5))
  # Likelihood-ratio test of the linear combination against the MARS fit;
  # degrees of freedom equal the number of added spline terms
  print("Chi-square test (linear combination vs. MARS)=")
  print(round(1 - pchisq(deviance(pred4) - deviance(pred5), df=na), 3))
  pred.y5 <- prediction(pred5$fitted.values, YD$d)
  perf.y5 <- performance(pred.y5, 'tpr', 'fpr')
  plot(perf.y5, lty=2, lwd=3, add=TRUE)
}
if (na == 0) print("no improvement of MARS")
legend(0, 1.0, c("Linear Combination", "MARS"), lty=c(1, 2), lwd=c(2, 3), cex=1.25)
```

1. Baker S. Identifying combination of cancer markers for further study as triggers of early intervention. Biometrics. 2000;56:1082–1087. [PubMed]

2. Cook NR, Zee RYL, Ridker PM. Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Statistics in Medicine. 2004;23:1439–1453. [PubMed]

3. DeLong ER, Delong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845. [PubMed]

4. Friedman JH. Multivariate adaptive regression splines (with discussion). Annals of Statistics. 1991;19:1–141.

5. Kim HJ, Fay MP, Feuer EJ, Midthune DN. Permutation tests for joinpoint regression with applications to cancer rates. Statistics in Medicine. 2000;19:335–351. (correction: 2001 20, 655). [PubMed]

6. Kooperberg C, Bose S, Stone CJ. Polychotomous Regression. Journal of the American Statistical Association. 1997;92:117–127.

7. Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics and Data Analysis. 2005;48:869–885.

8. Lehmann EL. Testing Statistical Hypotheses. 2nd ed. Chapman and Hall; New York: 1994.

9. Ma S, Huang J. Combining multiple markers for classification using ROC. Biometrics. 2007;63:751–757. [PubMed]

10. McIntosh MW, Pepe MS. Combining several screening tests: optimality of the risk score. Biometrics. 2002;58:657–664. [PubMed]

11. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2002.

12. Pepe MS, Cai T, Longton G. Combining predictors for classification using the area under the Receiver Operating Characteristic curve. Biometrics. 2006;62:221–229. [PubMed]

13. Pepe MS, Thompson ML. Combining diagnostic test results to increase accuracy. Biostatistics. 2000;1:123–140. [PubMed]

14. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2008. URL http://www.R-project.org.

15. Sandvik AK, Alsberg BK, Norsett KG, Yadetie F, Waldum HL, Lagreid A. Gene expression analysis and clinical diagnosis. Clinica Chimica Acta. 2006;363:157–164. [PubMed]

16. Su JQ, Liu JS. Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association. 1993;88:1350–1355.

17. Xiong C, McKeel DW, Miller JP, Morris JC. Combining correlated diagnostic tests: application to neuropathologic diagnosis of Alzheimer's disease. Medical Decision Making. 2004;24:659–669. [PubMed]

18. Yang Y, Jin Z. Combining dependent tests to compare the diagnostic accuracies - a nonparametric approach. Statistics in Medicine. 2006;25:1239–1250. [PubMed]

19. York TP, Eaves LJ. Common disease analysis using multivariate adaptive regression splines (MARS): Genetic Analysis Workshop simulated sequence data. Genetic Epidemiology. 2001;21:S649–S654. [PubMed]

20. Zou KH, Bhagwat JG, Carrino JA. Statistical combination schemes of repeated diagnostic test data. Academic Radiology. 2006;13:566–572. [PMC free article] [PubMed]
