Stat Probab Lett. Author manuscript; available in PMC 2010 November 15.
Published in final edited form as:
Stat Probab Lett. 2009 November 15; 79(22): 2321–2327.
PMCID: PMC2772215
NIHMSID: NIHMS139290

# The Optimal Linear Combination of Multiple Predictors Under the Generalized Linear Models

Hua Jin, Ph.D.1 and Ying Lu, Ph.D.2

## Summary

Multiple alternative diagnostic tests for one disease are commonly available to clinicians. It is important to use all of the good diagnostic predictors simultaneously to establish a new predictor with higher statistical utility. Under the generalized linear model for binary outcomes, the linear combination of multiple predictors in the link function is proved optimal in the sense that the area under the receiver operating characteristic (ROC) curve of this combination is the largest among all possible linear combinations. The result is applied to data from the Study of Osteoporotic Fractures (SOF), with a comparison to Su and Liu's approach.

Keywords: Multiple diagnostic predictors, Linear combinations, Receiver operating characteristic curve, Generalized linear model

## 1. Introduction

Multiple alternative diagnostic tests for one disease are commonly available to clinicians, and statistical methods are available for evaluating these alternatives. For instance, multivariate regression models can be used to determine the marginal advantage of additional tests (Richards, 1995), and this approach has been used in osteoporosis research to determine the relative risk and/or odds ratio of osteoporotic fracture (Hans, 1999). While these parameters are important for determining future fracture risk or developing clinical prediction rules based on tests, multivariate regression does not directly provide information on test utility or on optimizing test combinations.

Because sensitivity and specificity of a diagnostic test depend on the threshold used to define abnormal, the receiver operating characteristic (ROC) curve is often used to assess the utility of a diagnostic test (DeLong, 1988; Pepe, 1997). This methodology has been extended to multivariate tests by constructing a linear function and then evaluating a combined utility based on the ROC of the linear function. Su and Liu (1993) have shown that under a normal distribution assumption, the best linear combination of diagnostic tests to achieve maximum area under a ROC curve is the linear discriminant function. However, no literature covers the general situation.

In this article, under the generalized linear model, we provide the optimal linear combination among all possibilities under the criterion that the area under the corresponding ROC curve of the combination is maximized, which is named the ROC criterion. The main results are presented in Section 2. Section 3 considers the estimation of the optimal linear combination and its corresponding area under ROC curve. Section 4 demonstrates an application of the proposed method to data from the Study of Osteoporotic Fractures (SOF) with comparison to Su and Liu’s approach. Discussion and conclusions are presented in the last section.

## 2. The Best Linear Combination under the ROC Criterion

In this section it is proved that the linear combination in the link function of a model for binary responses maximizes sensitivity uniformly at any given specificity under the generalized linear model. Here we only present results for a simple case with two predictors, which can be easily generalized to the situation of multiple diagnostic markers.

Let Z be the binary outcome and (X1, X2) the random predictive variables. Suppose the generalized linear model for binary responses holds, that is,

$P(Z=1\mid X_1,X_2)=h(\beta_0+\beta_1X_1+\beta_2X_2)$
(1)

where β = (β0, β1, β2)′ is the 3-vector of parameters and h is a known function that is bounded in the unit interval (0,1). Although a wide choice of link functions is available, the three most commonly used in practice are the following:

1. the logistic regression model:
$h(x)=\frac{\exp(x)}{1+\exp(x)}$
2. the probit model:
$h(x)=\Phi(x)$
3. the complementary log-log model:
$h(x)=1-\exp(-\exp(x))$
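The three link functions above can be written down directly; the following minimal sketch (ours, not from the paper) checks that each is strictly increasing and bounded in (0, 1), as the model requires:

```python
import math

# The three common link functions h mapping the linear predictor into (0, 1);
# the probit link uses the standard normal CDF, written here via erf.
def logistic(x):
    return math.exp(x) / (1.0 + math.exp(x))

def probit(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cloglog(x):
    return 1.0 - math.exp(-math.exp(x))

# All three are strictly increasing and bounded in the unit interval.
for h in (logistic, probit, cloglog):
    values = [h(x) for x in (-2.0, 0.0, 2.0)]
    assert all(0.0 < v < 1.0 for v in values)
    assert values == sorted(values)
```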

Let α1X1 + α2X2 be any linear combination, with sensitivity Sn and specificity Sp. First, we will show that the linear combination selected by the link function of the model, β1X1 + β2X2, dominates all other possible linear combinations in the sense that it provides the highest sensitivity uniformly over the entire range of specificity; thus, it is the best linear combination, yielding the largest area under the ROC curve (AUC) among all linear combinations. Then we give a mathematical formula for the corresponding largest AUC.

The following useful lemma is originally due to Chebyshev (see Hardy, 1959).

### Lemma 1

Let U be a one-dimensional non-degenerate random variable. If g is bounded, positive, and strictly decreasing, then

$E[U\cdot g(U)]<E[U]\cdot E[g(U)]$

holds, provided the expectation of U exists.

### Theorem 1

Suppose the generalized linear model (1) for binary responses holds and the function h is continuous and strictly monotone. If (X1, X2) is a two-dimensional continuous variable with a continuous probability density and its expectation exists, then for any given specificity Sp, the coefficients of the best linear combination that provides the highest sensitivity uniformly are (α1, α2) ∝ (β1, β2).
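A small simulation can illustrate the claim of Theorem 1 that the link-function combination dominates other linear combinations in AUC. This is a sketch under assumed coefficients and standard normal predictors (not from the paper), comparing empirical AUCs via the Mann-Whitney rank statistic:

```python
import numpy as np

def empirical_auc(score, z):
    # Mann-Whitney / rank form of the empirical AUC (no ties for continuous scores).
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = z.sum()
    n0 = len(z) - n1
    return (ranks[z == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# Simulate a logistic model with assumed coefficients (b1, b2) = (1.5, -1.0).
rng = np.random.default_rng(0)
n = 40000
b0, b1, b2 = -1.0, 1.5, -1.0
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x1 + b2 * x2)))
z = rng.binomial(1, p)

# The link-function combination should beat any other linear combination.
auc_beta = empirical_auc(b1 * x1 + b2 * x2, z)
others = [empirical_auc(a1 * x1 + a2 * x2, z)
          for a1, a2 in [(1, 0), (0, -1), (1, -2)]]
assert auc_beta >= max(others) - 1e-3   # small slack for sampling noise
```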

Theorem 1 may easily be extended to the multiple-marker situation, which is presented in the following theorem without proof.

### Theorem 2

Suppose the generalized linear model for binary responses holds, i.e.,

$P(Z=1\mid X_1,\cdots,X_k)=h(\beta_0+\beta_1X_1+\cdots+\beta_kX_k)$
(2)

where β = (β0, β1, ···, βk)′ is the (k + 1)-vector of parameters with k ≥ 2. Let α1X1 + ··· + αkXk be any linear combination. If h is a known function that is also continuous, strictly monotone, and bounded in the unit interval (0,1), and (X1, ···, Xk) is a k-dimensional continuous variable with a continuous probability density whose expectation exists, then for any given specificity, the coefficients of the optimal linear combination that provides the highest sensitivity uniformly are (α1, ···, αk) ∝ (β1, ···, βk).

### Theorem 3

Suppose the conditions in Theorem 2 hold. Then the best linear combination under the ROC criterion is also given by (α1, ···, αk) ∝ (β1, ···, βk). If h is strictly increasing, then the largest AUC is

$LAUC=\frac{1}{P(Z=1)P(Z=0)}\int G(x)h(x)\,dG(x)-\frac{P(Z=1)}{2P(Z=0)}$

where P(Z = 1) = ∫h(x)dG(x), P(Z = 0) = 1 − P(Z = 1), and G(x) is the distribution function of X = β0 + β1X1 + ··· + βkXk, which can be derived from the joint distribution of (X1, ···, Xk).
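The Theorem 3 formula can be checked numerically. In this sketch we take G to be the standard normal distribution function and h the logistic link (an illustrative choice of ours, not from the paper), evaluate LAUC by a Riemann sum, and compare it with a Monte Carlo estimate of the AUC of the score X itself:

```python
import math
import numpy as np

h = lambda x: 1.0 / (1.0 + np.exp(-x))                        # logistic link
phi = lambda x: np.exp(-x**2 / 2) / math.sqrt(2 * math.pi)    # N(0,1) density
Phi = np.vectorize(lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2))))

# Riemann-sum evaluation of the Theorem 3 formula with G = Phi.
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
p1 = np.sum(h(x) * phi(x)) * dx          # P(Z = 1) = integral of h dG
p0 = 1.0 - p1
lauc = np.sum(Phi(x) * h(x) * phi(x)) * dx / (p1 * p0) - p1 / (2 * p0)

# Monte Carlo check: empirical AUC of the score X under the same model.
rng = np.random.default_rng(1)
xs = rng.normal(size=200000)
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-xs)))
order = np.argsort(xs)
ranks = np.empty(len(xs))
ranks[order] = np.arange(1, len(xs) + 1)
n1 = z.sum()
n0 = len(z) - n1
auc_mc = (ranks[z == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
assert abs(lauc - auc_mc) < 0.01
```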

For rare diseases such as the osteoporotic fractures discussed in the application section, we have the following simple formula for the largest area under the ROC curve of the best linear combination, which follows from a direct computation.

### Corollary 1

If the prevalence of the disease, P(Z = 1), is small enough, for example less than 5%, then the largest AUC may be approximated as follows:

$LAUC\approx\frac{1}{P(Z=1)}\int G(x)h(x)\,dG(x)$

whose error is about P(Z = 1)/2, i.e. less than 2.5% if P(Z = 1) ≤ 5%.
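A quick numerical check of Corollary 1 (our own sketch, with an assumed shifted logistic link so that the event is rare): the approximation should differ from the exact Theorem 3 value by roughly P(Z = 1)/2:

```python
import math
import numpy as np

# Shifted logistic link makes the event rare; X ~ N(0, 1) is an assumed choice.
h = lambda x: 1.0 / (1.0 + np.exp(-(x - 5.0)))
phi = lambda x: np.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
Phi = np.vectorize(lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2))))

x = np.linspace(-12, 12, 200001)
dx = x[1] - x[0]
p1 = np.sum(h(x) * phi(x)) * dx          # small prevalence P(Z = 1)
p0 = 1.0 - p1
A = np.sum(Phi(x) * h(x) * phi(x)) * dx  # integral of G h dG
exact = A / (p1 * p0) - p1 / (2 * p0)    # Theorem 3
approx = A / p1                          # Corollary 1
assert p1 < 0.05                         # rare-disease regime
assert abs(approx - exact) < p1          # error is about p1 / 2
```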

## 3. Estimation of the Best Linear Combination and the largest AUC

In practical applications, we first need to fit the generalized linear model to the data. Suppose that for study subject i, the response Zi has the Bernoulli distribution

$P(Z_i=z_i)=\theta_i^{z_i}(1-\theta_i)^{1-z_i},\quad z_i=0,1,$

and θi is determined by the link function θi = h(β0 + β1xi1 + ··· + βkxik), where β = (β0, β1, ···, βk)′ is the (k + 1)-vector of parameters with k ≥ 2, h is a known continuous and strictly monotone function bounded in the unit interval (0,1), and (xi1, ···, xik) are the values of the k diagnostic predictors included in the model for the ith subject (i = 1, ···, n).

It is common to estimate the parameters β = (β0, β1, ···, βk)′ from the likelihood function

$L(\beta;z)=\prod_{i=1}^{n}\theta_i^{z_i}(1-\theta_i)^{1-z_i}.$

We can obtain the maximum likelihood estimate of β by maximizing L(β; z). An iterative algorithm, such as the popular Newton-Raphson method implemented in standard software from SAS and S-Plus, may be used to compute it numerically (Cox, 1989).
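The Newton-Raphson step for the logistic case can be sketched in a few lines. This is our own minimal implementation on simulated data with assumed coefficients, not the SAS/S-Plus routine:

```python
import numpy as np

def fit_logistic(X, z, iters=25):
    """Newton-Raphson (IRLS) for the logistic-regression MLE.
    X is the n x (k+1) design matrix including an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)                    # Bernoulli variance weights
        grad = X.T @ (z - p)                 # score vector
        hess = X.T @ (X * W[:, None])        # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Recover known (assumed) coefficients from simulated data.
rng = np.random.default_rng(2)
n = 50000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
true_beta = np.array([-1.0, 0.8, -0.5])
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))
beta_hat = fit_logistic(X, z)
assert np.max(np.abs(beta_hat - true_beta)) < 0.1
```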

As for estimation of the largest AUC, we further need to know the joint distribution of the diagnostic predictors. There are two approaches to this problem. One is the conventional parametric method, of which the normal approximation is the most popular. The other is the standard nonparametric technique, which uses the empirical distribution based on the sample data to estimate the population distribution directly. We can then derive an estimate of the distribution function of β0 + β1X1 + ··· + βkXk and of the corresponding area under the ROC curve based on Theorem 3 of this paper.
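The nonparametric route can be sketched as follows (our own illustration on simulated data with assumed coefficients): replace G by the empirical distribution of the scores and the integral in Theorem 3 by a sample average, then compare with a direct rank-based AUC:

```python
import numpy as np

# Simulated scores X = b0 + b1*X1 + b2*X2 under assumed coefficients.
rng = np.random.default_rng(3)
n = 100000
score = -2.0 + 1.0 * rng.normal(size=n) + 0.7 * rng.normal(size=n)
h_vals = 1.0 / (1.0 + np.exp(-score))     # logistic link

# Plug-in estimate: G replaced by the empirical CDF of the scores,
# the integral dG by an average over the sample (Theorem 3 formula).
order = np.argsort(score)
G_hat = np.empty(n)
G_hat[order] = (np.arange(1, n + 1) - 0.5) / n
p1 = h_vals.mean()
p0 = 1.0 - p1
lauc_plugin = (G_hat * h_vals).mean() / (p1 * p0) - p1 / (2 * p0)

# Sanity check against a direct simulation of outcomes.
z = rng.binomial(1, h_vals)
ranks = np.empty(n)
ranks[order] = np.arange(1, n + 1)
n1 = z.sum()
n0 = n - n1
auc_rank = (ranks[z == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
assert abs(lauc_plugin - auc_rank) < 0.01
```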

## 4. Application

In this section, we apply our method to the data obtained from the Study of Osteoporotic Fractures (SOF) for illustration and compare it to the approach of Su and Liu (1993). From 1986 to 1988, SOF recruited 9,704 white women aged 65 years or older from four areas of the United States. At baseline, bone mineral density (BMD) was measured at the calcaneus, distal radius and proximal radius using single photon absorptiometry (SPA). At the second visit (1988–89), surviving participants had BMD measurements of the posterior-anterior (PA) spine (L1–L4) and proximal femur (neck, trochanter, total hip regions of interest) using dual x-ray absorptiometry (DXA). Fractures of the hip were recorded for each subject at each visit. More details about the study design and the data have been published previously (Cummings, 1995).

We included 7,127 women from the study. All of these women had forearm, calcaneal, hip and spine bone mineral density (BMD) measurements. In addition to the BMD measurements, many other previously identified predictive variables at baseline were also investigated. Furthermore, they all had known 5-year hip fracture status: either they were followed for 5 years after visit 2 without hip fracture or they had hip fracture within five years after visit 2. Women who were lost to follow-up within 5 years without known hip fracture were excluded.

All 43 candidate variables, including patient demographics, clinical BMD measured by DXA and SPA, medical history, X-ray assessment of prevalent fractures and vertebral heights, functional status, vision test results, and nutrition, have been identified as significant predictors of hip fracture risk. They may reflect different aspects of osteoporosis and aging and may help in understanding the etiology of osteoporosis and hip fracture, yet only a few of them are necessary to identify subjects with elevated fracture risk. Standard approaches to the analysis of binary data, such as logistic regression and the probit model, show that a linear combination of age, femoral neck BMD, and loss of height best predicts hip fracture. Our previous study also suggested that these three variables could be used to build a classification rule non-inferior to the optimal recursive partitioning rule (Jin, 2004). Here we consider these three predictors and seek the best linear combination under the ROC criterion.

Assume the generalized linear model (2) holds for the SOF data, where Z = 1 stands for hip fracture, and X1, X2, and X3 denote age, femoral neck BMD, and loss of height, respectively. If we further assume $h(x)=\frac{\exp(x)}{1+\exp(x)}$, standard software such as S-Plus or SAS provides the following fitted logistic regression model:

$\mathrm{logit}\,P(Z=1\mid X_1,X_2,X_3)=-3.89+0.075X_1-8.90X_2+0.100X_3.$

It follows from our Theorems 2 and 3 that X(l) = 0.075X1 − 8.90X2 + 0.100X3 is the optimal linear combination in the sense that it provides the highest sensitivity uniformly over the entire range of specificity, and hence also under the ROC criterion. The estimated best coefficients are thus proportional to (0.075, −8.90, 0.100).

We need to estimate the joint distribution of (X1, X2, X3) in order to obtain the corresponding largest area under the ROC curve. If we further assume that (X1, X2, X3)τ ~ N(μ, Σ), where τ denotes the transpose of a vector, the standard normal estimation gives $\hat{\mu}=(71.06,0.65,3.36)^{\tau}$ and

$\hat{\Sigma}=\begin{pmatrix}23.88&-0.13&3.77\\-0.13&0.012&-0.05\\3.77&-0.05&8.05\end{pmatrix},$

which leads to −3.89 + 0.075X1 − 8.90X2 + 0.10X3 ~ N(−4.05, 1.22). It follows from Theorem 3 that the largest AUC under the logistic regression model is 0.798, with a bootstrap standard error of 0.049 based on 250 replications.
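The reported value can be approximately reproduced from Theorem 3 under the fitted normal approximation X ~ N(−4.05, 1.22) with the logistic link. The numerical integration and tolerance below are our own sketch; it only checks that the result lands near the reported 0.798:

```python
import math
import numpy as np

# Fitted normal approximation of the score reported in the text:
# X = -3.89 + 0.075*X1 - 8.90*X2 + 0.10*X3 ~ N(-4.05, 1.22) (variance 1.22).
mu, var = -4.05, 1.22
sd = math.sqrt(var)

h = lambda x: 1.0 / (1.0 + np.exp(-x))    # logistic link
g = lambda x: np.exp(-((x - mu) ** 2) / (2 * var)) / (sd * math.sqrt(2 * math.pi))
G = np.vectorize(lambda x: 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2)))))

# Riemann-sum evaluation of the Theorem 3 formula.
x = np.linspace(mu - 10 * sd, mu + 10 * sd, 200001)
dx = x[1] - x[0]
p1 = np.sum(h(x) * g(x)) * dx             # estimated prevalence P(Z = 1)
p0 = 1.0 - p1
lauc = np.sum(G(x) * h(x) * g(x)) * dx / (p1 * p0) - p1 / (2 * p0)
assert 0.77 < lauc < 0.83                 # near the reported 0.798
```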

We obtain a similar result under the probit model, that is, h(x) = Φ(x). Fitting the probit model suggests X(p) = 0.036X1 − 4.09X2 + 0.048X3 as the best linear combination. Hence, the estimated best coefficients are proportional to (0.036, −4.09, 0.048), and the corresponding largest AUC is 0.805 (with standard error 0.050). If instead we use the approximate formula in Corollary 1, the two largest AUCs are estimated to be 0.788 and 0.795, just 1% lower than the exact values.

We can also consider applying Su and Liu's method to the hip fracture data. Under their approach, the two conditional distributions are estimated as

$(X_1,X_2,X_3)^{\tau}\mid Z=1\sim N\!\left(\begin{pmatrix}74.97\\0.56\\5.28\end{pmatrix},\begin{pmatrix}36.83&-0.079&8.28\\-0.079&0.007&-0.069\\8.28&-0.069&13.23\end{pmatrix}\right)$

and

$(X_1,X_2,X_3)^{\tau}\mid Z=0\sim N\!\left(\begin{pmatrix}70.93\\0.66\\3.30\end{pmatrix},\begin{pmatrix}22.92&-0.12&3.36\\-0.12&0.012&-0.046\\3.36&-0.046&7.76\end{pmatrix}\right).$

The general method in Section 3 of their paper gives the best coefficients $(\alpha_1^s,\alpha_2^s,\alpha_3^s)$, with an estimated largest AUC of 0.760.

To compare the goodness of fit of the three models, we directly estimate the ROC curves corresponding to the three coefficient vectors $(\alpha_1^l,\alpha_2^l,\alpha_3^l)$, $(\alpha_1^p,\alpha_2^p,\alpha_3^p)$ and $(\alpha_1^s,\alpha_2^s,\alpha_3^s)$ empirically, i.e. from the SOF data itself, and then calculate the areas under the curves. It is not surprising that the empirical AUCs are all 0.805, because the three optimal linear combinations produced by the different models differ little. The logistic regression and probit models therefore fit the SOF data very well, while Su and Liu's method underestimates the area under the ROC curve of the best linear combination.
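The empirical ROC/AUC computation used in this comparison can be sketched generically. In the example below (our own simulated data with assumed coefficients), two nearly proportional coefficient vectors give almost identical empirical AUCs, mirroring the observation above:

```python
import numpy as np

def empirical_roc_auc(score, z):
    """Empirical ROC: sweep all thresholds, integrate by trapezoids."""
    order = np.argsort(-score)                 # descending score
    z_sorted = np.asarray(z)[order]
    n1 = z_sorted.sum()
    n0 = len(z_sorted) - n1
    tpr = np.concatenate([[0.0], np.cumsum(z_sorted) / n1])
    fpr = np.concatenate([[0.0], np.cumsum(1 - z_sorted) / n0])
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

# Nearly proportional coefficient vectors rank subjects almost identically,
# so their empirical AUCs nearly coincide.
rng = np.random.default_rng(4)
n = 20000
x1, x2, x3 = rng.normal(size=(3, n))
p = 1.0 / (1.0 + np.exp(-(-2.0 + 1.0 * x1 - 0.8 * x2 + 0.5 * x3)))
z = rng.binomial(1, p)
auc_a = empirical_roc_auc(1.00 * x1 - 0.80 * x2 + 0.50 * x3, z)
auc_b = empirical_roc_auc(1.05 * x1 - 0.85 * x2 + 0.52 * x3, z)
assert abs(auc_a - auc_b) < 0.01
```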

## 5. Discussion and Conclusion

In medical applications, it is important to use all the good diagnostic predictors of a disease simultaneously to establish a new predictor with higher statistical utility. Linear combinations of multiple predictors are of particular interest to us. In this paper, we consider the common setting in which generalized linear models are used to fit data with binary outcomes. Since the linear combination identified by the link function is proved optimal under the ROC criterion, we only need to estimate the best linear combination by the standard procedure once the generalized linear model passes the goodness-of-fit test. It is therefore easy for clinicians to obtain the optimal linear combination of multiple diagnostic predictors under the generalized linear model.

One referee of Su and Liu’s paper pointed out, “Using logistic regression to identify tests that best predict presence or absence of disease is also common”. Although logistic regression is usually less efficient than normal discriminant analysis when the normality assumption holds (Efron, 1975; Ruiz-Velasco, 1991), the generalized linear models for binary data are more robust than the latter, because the choice and estimation of the best linear combination require no assumption about the joint distribution of the multiple predictors. Su and Liu acknowledged this feature, stated clearly by Cox and Snell (1989): “once a vector of explanatory variables is given, then the probability that this individual belongs to one of the two groups is determined.” Our SOF data provide a real example illustrating this point.

There are two directions in which this paper may be extended. On the one hand, we can extend our work from linear to non-linear models, which may require a new concept, the ROC region, to assess the statistical utility of a diagnostic predictor instead of the ROC curve. On the other hand, we may extend our interest from binary data to more complex data with more than two outcomes, as discussed by Yang and Carlin (2000) using an ROC surface approach. All of these extensions are currently under investigation.

## Acknowledgments

The study is supported by grants from the National Institutes of Health (R01EB004079) and the National Bureau of Statistics of China (LX:2006B45).

## APPENDIX: PROOFS

#### Proof of Theorem 1

Without loss of generality, let α2 = β2. For simplicity, let α1 = α.

Suppose that β2 > 0 and h is strictly increasing. Let f(x1, x2) be the density function of the two-dimensional continuous variable (X1, X2), and let h̄(x) = 1 − h(x). Then the specificity of the linear combination αX1 + β2X2 at the threshold c can be expressed as

$S_p=P(\alpha X_1+\beta_2X_2<c\mid Z=0)=\frac{1}{P(Z=0)}\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\frac{c-\alpha x_1}{\beta_2}}\bar h(\beta_0+\beta_1x_1+\beta_2x_2)f(x_1,x_2)\,dx_2\right]dx_1$

and, therefore,

$\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\frac{c-\alpha x_1}{\beta_2}}\bar h(\beta_0+\beta_1x_1+\beta_2x_2)f(x_1,x_2)\,dx_2\right]dx_1=S_p\cdot P(Z=0)$
(A1)

For any given specificity Sp, equation (A1) defines c as a differentiable function of α, denoted c(α) below. Furthermore, differentiating both sides of (A1) with respect to α yields

$\int_{-\infty}^{\infty}\bar h(\beta_0+\beta_1x+c(\alpha)-\alpha x)\,f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)\frac{c'(\alpha)-x}{\beta_2}\,dx=0$
(A2)

Evaluating (A2) at the specific choice α = β1, and noting that the nonzero factor h̄(β0 + β1x + c(β1) − β1x) = h̄(β0 + c(β1)) does not depend on x and so can be taken out of the integral, leads to the equality:

$\int_{-\infty}^{\infty}\frac{c'(\beta_1)-x}{\beta_2}\,f\!\left(x,\frac{c(\beta_1)-\beta_1x}{\beta_2}\right)dx=0$
(A3)

where c′(α) denotes the derivative of c(α) with respect to α.

As for the sensitivity of the combination, it can be rewritten as

$S_n=1-\frac{1}{P(Z=1)}\left[S(\alpha)-S_p\cdot P(Z=0)\right]$

by utilizing (A1), where $S(\alpha)=\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\frac{c(\alpha)-\alpha x_1}{\beta_2}}f(x_1,x_2)\,dx_2\right]dx_1$ is the distribution function of αX1 + β2X2 evaluated at c(α). So we just need to prove that S(α) attains its absolute minimum at α = β1 under the restriction (A1), or equivalently (A2).

Because

$S'(\alpha)=\int_{-\infty}^{\infty}\frac{c'(\alpha)-x}{\beta_2}\,f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx,$
(A4)

it follows from (A3) and (A4) that

$S'(\beta_1)=\int_{-\infty}^{\infty}\frac{c'(\beta_1)-x}{\beta_2}\,f\!\left(x,\frac{c(\beta_1)-\beta_1x}{\beta_2}\right)dx=0$
(A5)

On the other hand, it is easy to see that, for any fixed α, $\frac{1}{\beta_2}\int_{-\infty}^{\infty}f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx$ is the value of the density function of the linear combination αX1 + β2X2 at the point c(α). Let $p(x)=f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)\Big/\int_{-\infty}^{\infty}f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx$ be the density function of a random variable U, and let g(x) = h̄(β0 + β1x + c(α) − αx). Then, when α < β1, g is bounded, positive, and strictly decreasing. From Lemma 1, we have

$E[U\cdot g(U)]<E[U]\cdot E[g(U)],$

which leads to the following inequality:

$\frac{\int_{-\infty}^{\infty}xg(x)f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx}{\int_{-\infty}^{\infty}g(x)f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx}<\frac{\int_{-\infty}^{\infty}xf\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx}{\int_{-\infty}^{\infty}f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx}$
(A6)

From (A2), we see that

$\frac{\int_{-\infty}^{\infty}xg(x)f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx}{\int_{-\infty}^{\infty}g(x)f\!\left(x,\frac{c(\alpha)-\alpha x}{\beta_2}\right)dx}=c'(\alpha)$
(A7)

Combining (A4), (A6) and (A7), we obtain that S′(α) < 0 when α < β1.

Similarly, we can prove that S′(α) > 0 when α > β1. Therefore, combining with (A5), we finish the proof that S(α) has the absolute minimum, hence Sn(c(α)) has the absolute maximum, at α = β1 when β2 > 0 and h is strictly increasing.

In a similar way, it is easy to prove that the coefficients of the best linear combination that provides the highest sensitivity uniformly are (α1, α2) ∝ (β1, β2) when β2 < 0 or h is strictly decreasing. Thus the proof of the theorem is complete.

#### Proof of Theorem 3

Once again, let h̄(x) = 1 − h(x). It follows from (2) that

$P(Z=1\mid X=x)=h(x),$

where X = β0 + β1X1 + ··· + βkXk has distribution function G. So we have P(Z = 1) = ∫h(x)dG(x), and the specificity of the best linear combination at the threshold c can be expressed as

$S_p(c)=P(X<c\mid Z=0)=\frac{1}{P(Z=0)}\int_{-\infty}^{c}\bar h(x)\,dG(x).$

Then $dS_p(c)=\frac{1}{P(Z=0)}\bar h(c)\,dG(c)$, and

$\int\frac{1}{P(Z=1)}G(c)\,dS_p(c)=\frac{1}{P(Z=1)P(Z=0)}\int G(c)\bar h(c)\,dG(c).$

On the other hand, its sensitivity is

$S_n(c)=1-\frac{1}{P(Z=1)}\left[G(c)-S_p(c)P(Z=0)\right]=1+\frac{P(Z=0)}{P(Z=1)}S_p(c)-\frac{1}{P(Z=1)}G(c).$

Then the largest area under the ROC curve is given by

$\begin{aligned}AUC&=\int S_n(c)\,dS_p(c)\\&=\int\left(1+\frac{P(Z=0)}{P(Z=1)}S_p(c)\right)dS_p(c)-\int\frac{1}{P(Z=1)}G(c)\,dS_p(c)\\&=\left(1+\frac{P(Z=0)}{2P(Z=1)}\right)-\frac{1}{P(Z=1)P(Z=0)}\int G(c)\bar h(c)\,dG(c)\\&=\left(1+\frac{P(Z=0)}{2P(Z=1)}\right)-\frac{1}{P(Z=1)P(Z=0)}\int G(c)(1-h(c))\,dG(c)\\&=\frac{1}{P(Z=1)P(Z=0)}\int G(c)h(c)\,dG(c)+\left(1+\frac{P(Z=0)}{2P(Z=1)}-\frac{1}{2P(Z=1)P(Z=0)}\right)\\&=\frac{1}{P(Z=1)P(Z=0)}\int G(x)h(x)\,dG(x)-\frac{P(Z=1)}{2P(Z=0)}.\end{aligned}$


## References

• Cox DR, Snell EJ. The Analysis of Binary Data. London: Chapman and Hall; 1989. p. 132.
• Cummings SR, Nevitt MC, Browner WS, et al. Risk factors for hip fracture in white women. Study of Osteoporotic Fractures Research Group. The New England Journal of Medicine. 1995;332:767–773.
• DeLong ER, DeLong DM, Clark-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a non-parametric approach. Biometrics. 1988;44:837–845.
• Efron B. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association. 1975;70:892–898.
• Hans D, Srivastav SK, Singal C, et al. Does combining the results from multiple bone sites measured by a new quantitative ultrasound device improve discrimination of hip fracture? Journal of Bone and Mineral Research. 1999;14:644–651.
• Hardy GH, Littlewood JE, Polya G. Inequalities. London and New York: Cambridge University Press; 1959. p. 43.
• Jin H, Lu Y, Stone KL, et al. Classification algorithms for hip fracture prediction based on recursive partitioning methods. Medical Decision Making. 2004;24(4):386–397.
• Pepe MS. A regression modelling framework for receiver operating characteristic curves in medical diagnostic testing. Biometrika. 1997;85:595–608.
• Richards RJ, Hammitt JK, Tsevat J. Finding the optimal multiple-test strategy using a method analogous to logistic regression analysis. Medical Decision Making. 1995;16:367–375.
• Ruiz-Velasco S. Asymptotic efficiency of logistic regression relative to linear discriminant analysis. Biometrika. 1991;78:235–243.
• Su J, Liu J. Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association. 1993;88:1350–1355.
• Yang H, Carlin D. ROC surface: a generalization of ROC curve analysis. Journal of Biopharmaceutical Statistics. 2000;10:183–196.
