Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3359021

Formats

Article sections

- Abstract
- 1. Introduction
- 2. Techniques for the generalized linear model and generalized estimating equations with a single covariate
- 3. The multiple covariate case
- 4. Monte Carlo study
- 5. An application to analysis of survival data based on pseudo-values
- 6. Discussion
- Supplementary Material
- References

Authors

Related links

Comput Stat Data Anal. Author manuscript; available in PMC 2012 May 23.

Published in final edited form as:

Comput Stat Data Anal. 2011 January 1; 55(1): 226–235.

PMCID: PMC3359021

NIHMSID: NIHMS374003

We consider the problem of dichotomizing a continuous covariate when performing a regression analysis based on a generalized estimation approach. The problem involves estimation of the cutpoint for the covariate and testing the hypothesis that the binary covariate constructed from the continuous covariate has a significant impact on the outcome. Due to the multiple testing used to find the optimal cutpoint, we need to make an adjustment to the usual significance test to preserve the type-I error rates. We illustrate the techniques on one data set of patients given unrelated hematopoietic stem cell transplantation. Here the question is whether the CD34 cell dose given to patient affects the outcome of the transplant and what is the smallest cell dose which is needed for good outcomes.

In many medical studies it is of interest to investigate the relationship between explanatory variables, such as prognostic factors, treatment factors or patient characteristics, and the outcome. The outcome may be continuous, such as the time to some event or the level of some assayed enzyme, or it may be a discrete outcome, such as an indicator of relapse or death. We may also be interested in state occupation probabilities in multistate processes modeled via a pseudo-observation approach applied to censored data as discussed in, for example, Andersen et al. (2003). In any case, regression models based on generalized estimating equations (GEEs) (McCullagh and Nelder, 1989; Liang and Zeger, 1986) can be applied to examine the effects of covariates on the outcome.

In many cases the covariates of interest are continuous in nature. While they could be modeled as such, many clinicians find the interpretation of such covariates difficult and they prefer to model these effects as categorical or binary covariates reflecting different prognostic groups of patients based on the measured value of the continuous covariate. In other cases, the covariate may represent the dose of some drug or some other therapeutic agent, and the clinician is interested in identification of a therapeutic threshold. Even though evaluation of a variable’s prognostic value is best done with the variable in its continuous form and there is a possible loss of information when categorizing a continuous covariate, the need for thresholds for clinical use and treatment decisions justifies the development of appropriate statistical methods for finding optimal cutpoints. Methodologies for finding optimal cutpoints are also needed and used in various tree building algorithms (Lausen et al., 2004).

When discretizing a continuous covariate and then testing for the covariate effect, a number of techniques can be used. One group of techniques relies on the investigator to provide cutpoints based on historical data or it uses cutpoints based on a split into groups at a predetermined percentile of the continuous covariate. Another approach is to let the data decide on the cutpoint and then perform the test of the covariate effect on the two resulting groups. In most cases the continuous covariate in this approach is split into groups based on either the largest value of the likelihood or the largest value of some two-sample test statistic after a search of possible cutpoints. This selection of the largest likelihood or test statistic leads to an inflated type-I error due to multiple testing, so some correction needs to be applied to obtain the correct type-I error.

A data-dependent cutpoint selection and test adjustment has been proposed for survival data by a number of authors. The approach is based on a Cox proportional hazards model with a single covariate defined as 1 or 0 depending on the value of the continuous covariate. Jespersen (1986) based an adjusted test on the maximum value of the score statistic from the Cox model which suitably standardized converges to a Brownian bridge under the null hypothesis. Contal and O’Quigley (1999) modify the log rank test statistic and show that the process consisting of the score statistic using cutpoints at the order statistics of the continuous covariates converges to a Brownian bridge process. Lausen and Schumacher (1992, 1996) showed that for any standardized test statistic *C* (δ) for the two-sample problem with groups defined by a threshold parameter, δ, the following convergence result holds:

Here *X*_{(m)} is the *m*th-order statistic of the continuous covariate and *W*^{0} is a standard Brownian bridge process. Klein and Wu (2004) compare these estimators and extend the Contal and O’Quigley approach to the accelerated failure time model and the Cox model with additional covariates.

In this note we examine these data-driven methods in a generalized linear model framework. We are particularly interested in using these techniques in pseudo-observation regression problems. The pseudo-observation approach has been suggested as a method for direct censored data regression modeling for the survival function (Logan et al., 2008), for the cumulative incidence function (Klein and Andersen, 2005), for the mean survival time (Andersen et al., 2004), for multistate probabilities (Andersen et al., 2003) and for mean quality of life (Andrei and Murray, 2007). In this approach, pseudo-observations are formed as the difference between the full sample and leave-one-out estimator based on an approximately unbiased estimator of the parameter of interest. These pseudo-observations are then used in a GEE model.

The methods we developed will be illustrated using data from a study from the Center for Blood and Marrow Transplantation Registry. In this study 709 patients with Acute Myeloid Leukemia (*n* = 395) or Acute Lymphocytic Leukemia (*n* = 397) in first (*n* = 204) or second (*n* = 488) complete remission were given a Bone Marrow Transplant (BMT) from an unrelated donor transplant. The cell source for all the transplants was bone marrow. The research question of interest was: “What level of CD34 cells in the graft is needed to lower the death in remission rates and is there a threshold dose of CD34 cells to affect the relapse rates?”. The questions require an estimation of the threshold dose level of CD34 cells and a comparison of the “high” and “low” CD34 dose groups for death in remission and relapse cumulative incidence probabilities.

In this section we show two approaches to discretizing a continuous covariate and testing for its significance based on the generalized estimating equation approach. Here we assume that the response for individual *i, i* = 1, …, *n*, is possibly multivariate and is denoted by **Y**_{i} = (*Y*_{i1}, …, *Y _{iJ})^{t}*. We allow for the

where *g*(·) is a known link function, α_{j} for *j* = 1, …,*J*, δ and γ are unknown parameters and is the dichotomized covariate obtained from *X*, namely

(1)

We are interested in estimating the threshold parameter δ as well as the inference on the parameter γ. For a given δ, estimates of parameters can be based on the generalized estimating equations (GEEs) suggested by Liang and Zeger (1986):

(2)

where **β** = (γ, α_{1}, …, α_{j})^{T} and **V**_{i} is a working covariance matrix. GEE estimates of **β**, are found by solving Eq. (2). A sandwich estimator can be used to estimate the covariance matrix of **β**:

where

and

Standard packages such as GEESE in R or PROC GENMOD in SAS can be used to obtain estimates and their variances for a given δ.

In our setup, we are interested on the hypothesis H_{0} : γ = 0. One test that can be used is the Wald test given by , which has a standard normal distribution when *H*_{0} is true and δ is known *a priori*. A second test is the generalized score test (Boos, 1992). For the two-sample problem with no covariates, Liu et al. (2008) show that the score statistic for the test H_{0} : γ = 0 is given by

where *n*_{0} is the number of observations with , *n*_{1} is the number of observations with is the estimator of α_{j} under the null hypothesis, . Under the null hypothesis, has a chi-square distribution with one degree of freedom when δ is known *a priori*.

In order to estimate δ, one common approach is to compute either for values of δ in some range and pick up the value that maximizes . This procedure is appropriate for estimating δ, but an adjustment must be made in order to make an inference for γ and preserve the type-I error rate.

The first adjustment that can be made is a GEE version of the approach in Lausen and Schumacher (1992, 1996). Both the generalized score statistic and the Wald statistic can be used. In order to apply this approach, the possible range of threshold values must be restricted to [*X*_{(nε)}; *X*_{(n(1−ε))}], where *X*_{(k)} is the *k*th-order statistic of the continuous variable and 0 < ε < 0.5. Let *C* (δ) be the test statistic for the two-sample problem with groups defined by δ. It can be shown that

as *n* → ∞, where *W*^{0} is a Brownian bridge. Details of the proof can be found in Lausen and Schumacher (1992) and Billingsley (1968, chap. 4). Miller and Siegmund (1982) show that, for *b* > 0,

where (·) is the standard normal density function. This result motivates the following approximation for a corrected *p*-value:

where *z* = Φ^{−1}(1 − *p*/2), Φ(·) is the standard normal cumulative distribution function and *p* is the unadjusted *p*-value computed at the maximum of the test statistic. Notice that the approximation is valid for *p* < 0.5. Inferences for γ can be made by computing the corrected *p*-value for the maximum value of the test statistic.

The second approach considered here is a modification of the method in Contal and O’Quigley (1999). To construct the test statistic, recall that, for the two-sample problem without covariates, assuming that the working covariance matrix is the identity matrix, the score function can be written as

where is the linear predictor and μ_{ij} = *g*^{−1}(η_{ij}). Let

(3)

be the score evaluated under the null hypothesis.

Billingsley (1999, pg. 196) proved the following theorem.

**Theorem 1.**
*Let* {ξ_{1}, ξ_{2}, …} *be a stationary and ergodic process for which its conditional expected value* E(ξ_{i}|ξ_{1}, …, ξ_{i−1}) *is equal to zero with probability* 1 *and for which the expected value*
*is positive and finite. Define*

*If*

*for p* [0, 1]; *then S _{n}*(

In our setup, in order to apply this result, we must first put the data in increasing order with respect to the continuous covariate *X*. Now, with the ordered data, define

and assume that **Y**_{i}**Y**_{i′}. With this assumption, we can then see that the condition E(ξ_{i}|ξ_{1}, …, ξ_{i−1}) = 0 is verified. It is important to notice that we are assuming that individuals are independent, but observations within the same individual may be correlated. Also, under the null hypothesis H_{0} : γ = 0, recalling that we are assuming a common dispersion parameter for distribution of the responses, we have that the **Y**_{i} are independent and identically distributed (i.i.d.), so the ξ_{i} are also independent and identically distributed. Therefore, the ξ_{i} are stationary and ergodic. Finally, under the null hypothesis, we have

where **V**_{Y} is the covariance matrix of **Y**_{i} and **ġ** is the vector of elements , which is finite. With the ordered data, it is clear that there is a value δ_{p} for which . In fact, any value of δ_{p} between *X*_{([np])} and *X*_{([np]+1)} satisfies , so the midpoint can be used. If we define

we have *S _{n}*(

In practice, we replace the unknown quantities in the ξ_{i} by consistent estimators under the null, so we work with *Ŝ _{n}*(

(4)

Therefore, because *Ŝ _{n}*(1) = 0, we have that

(5)

If we order the ξ by the continuous covariate for which we want to find a cutpoint, under the null hypothesis this sequence is formed by i.i.d. observations and the sequence of *Ŝ _{n}*(

In the previous section, we considered the situation with only a single covariate in the model. However, in most practical applications there are other covariates, and we now discuss the extension to allow for many covariates in the model. The test statistic remains the same, so we discuss it here briefly.

As before, assume that *Y _{ij}* is the outcome for the

where *g*(·) is a known link function, α_{j} for *j* = 1, …,*J*, γ and **β**^{*} are the unknown parameters and is the categorized covariate given by (1).

As before, we are interested in finding the cutpoint δ and testing the hypothesis about γ. For a given δ, denote by **β** = (γ, α_{1}, α_{2}, …, α_{J}, **β**^{*T})^{T} the vector of all unknown parameters. The estimating equations are given by (2) and the variance of estimates can be estimated by the sandwich estimator.

In order to test the hypothesis H_{0} : γ = 0, the usual Wald test or the generalized score test (Boos, 1992) can be used. The Wald test can be obtained easily using any standard package for GEEs. In order to derive an expression for the generalized score test, using an independence working covariance matrix **V**_{i} = **I**, we rewrite the score Eq. (2) as

where **Z**_{i} is the (*J* × *p*) matrix of covariates (design matrix) of the *i*th observation, i.e., it is the augmented matrix given by , *p* is the number of rows of the parameter vector **β**, **1**_{J} is the vector with *J* rows of 1s, **I**_{J×J} is the identity matrix and **Ġ**_{i} = diag(*ġ*(η_{ij})) is the diagonal matrix of derivatives . The score test for H_{0} : γ = 0 is based on the first element of **U**(**β**) and the expression is exactly the same as that obtained for the single covariate case. As before, we must order the data increasing on the continuous covariate *X* and let be defined as in (3). We assume that the **Y**_{i} are independent random variables with mean **μ**_{i} and the same covariance matrix **Σ**, so the (**Y**_{i} − **μ**_{i}) are also independent with mean zero and common covariance matrix **Σ**. Now suppose we have covariates which constitute an i.i.d. sample from a distribution *P*(·). The random vectors **Ġ**_{i}(**Y**_{i} − **μ**_{i}) are, therefore, i.i.d. with zero mean and variance E_{P}(**Ġ**_{i}**ΣĠ**_{i}). This allows us to apply the same results as in the situation without any covariates. Therefore, we can compute the sequence of *Ŝ _{n}*(

A simulation study was designed to compare the performance of the different test procedures in terms of their type-I error rate and power. We constructed the simulation for correlated observations, and *J*-variate normal random variables were generated for each subject, *J* = 2 or 4. We took the means to be

(6)

for *i* = 1, …, *n*, *j* = 1, …, *J*. Here the *J* variables had a common correlation of ρ = 0, 0.25, or 0.5. The *z*-values used in generating the *T*-values were obtained by dichotomizing a continuous covariate *X* taken to be uniform [−1, 1]. Under the alternative hypothesis we used cutpoints of zero and 0.2. Total samples of size *n* = 50, 100, 200 and 400 were examined. Ten thousand samples were generated for each combination of γ, δ, ρ, *J* and *n*.

We computed the Wald and score test statistics along with both uncorrected and corrected *P*-values, and the sup score test statistic. We used ε = 0, 0.1, 0.15, 0.2 and 0.25 to restrict the range of possible values for the cutpoint. Although this restriction is not required for the new test statistic, we also computed the new test statistic using this range restriction.

Since we have a large number of scenarios, we applied analysis of variance (ANOVA) techniques to summarize the results. For the type-I error rate, we defined the outcome, *R*, as the percentage rejection rate minus the nominal rate of 5%, so good performance is indicated by values of the expectation of *R* near 0. Negative values indicate a conservative test procedure and positive vales indicate that the test procedure inflates the type-I error. The first model fit to our results has the following effects:

(7)

When fitting the model without an intercept and normalizing the effects of the other factors to have a sum of zero, the estimates for the interaction terms have the interpretation as average deviations from the nominal level of 5% adjusted for the effects of the other factors. Table 1 compares the size of the tests by values of ε. The results on Table 1 show clearly that an adjustment must be made because there is an enormous inflation of type-I error rate for both the uncorrected Wald and score test statistics. The adjusted Wald statistic also inflates the type-I error rate a little bit, but the inflation gets smaller when we increase the value of ε. The score test statistic is conservative, and we can also see that the type-I error gets closer to 5% when we move ε towards 0.5. The sup score test statistic is still conservative, but has a type-I error closer to the nominal one when compared to the other test statistic.

Table 2 examines, using appropriate ANOVA models, the effect of sample size, number of points and correlation on the size of the three adjusted tests. In Table 2, in the first model fitted we see that the Wald test is anti-conservative for small samples while the other two tests are conservative. For moderate samples the test based on the sup scores seems to perform the best. The models for the number of time points and the correlation ρ in Table 2 suggest that the results hold regardless of the dimension of **Y** or the correlation.

To study the power of the tests we simulated as described above with γ = 0.25 or 0.5 and δ = 0 or 0.5. ANOVA methods on the percent rejections, *R*, are used to summarize the data. Table 3, based on the ANOVA model

(8)

shows that the power of the tests is not affected by the choice of ε. We see in Table 4 that the power increases as the sample size increases and as the dimensionality of **Y** increases. Also in Table 4, we see no difference in power for the two true cutpoints. In all tables the adjusted Wald statistic has the highest power. However, since it was anti-conservative this does not suggest that it is the best statistic to use. The new sup score statistic performs quite well and we suggest its use.

Finally, we show some results for the cutpoint estimates. Fig. 1 shows the mean bias for different sample sizes and different cutpoints. The results suggest small bias for all three procedures, with a smaller standard error for the sup score method. We also conducted a simulation study with data generated with a different link function and different values for the β in the linear predictor of the mean function and different cutpoints. The conclusions are similar to the ones shown here, so we do not include the results here.

We illustrate the use of these techniques using data on 708 patients given an unrelated donor bone marrow transplant for acute leukemia. The question of interest is the effect of the number of CD34 cells in the donor marrow on the incidence of relapse and death in remission. Clinicians would like to know if there is a minimum cell dose beyond which patients have either lower relapse or less death in remission.

To examine the threshold CD34 count question for these two competing risks we use the pseudo-observation method of Klein and Andersen (2005). In this approach for relapse we compute at a grid of time points τ_{j} and the pseudo-observations, *Y _{ij}*, are based on the difference between the full sample estimate of the cumulative incidence of relapse and the leave-one-out estimator of the cumulative incidence. This difference or pseudo-observation is used in a generalized estimating equation to study the covariate effects on the relapse cumulative incidence function. In the example we use

Fig. 2 shows the value of the three statistics. All statistics are adjusted for the patient’s age, Karnofsky performance score and the degree of HLA matching. Here we see that the optimal cutpoint for both relapse and death in remission is at 2.93 × 10^{6} CD34 cells. Table 5 shows the unadjusted and adjusted tests for the effect of CD34 count. We see that if no adjustment is made for the estimation of the cutpoint that all tests are significant at the 5% level, while the more appropriate adjusted tests suggest that the CD34 cell count is only important for death in remission. In Table 6 we present the parameter estimates which shows that patients with low CD34 counts have significantly more death in remission.

Cutpoints for the bone marrow data example, model for death in remission and relapse with covariates age, group and KPS.

We have proposed a new test statistic that can be used to estimate a threshold value and also to make inference about the regression coefficient associated with the categorized covariate for generalized linear models and generalized estimating equations. We compared our test statistic with the Wald and score tests statistics as well as their corrected versions. Our simulations show clearly that adjustments must be made when making inference for the regression parameter after a cutpoint is selected because the type-I error rates are extremely high when no correction is used. The adjusted Wald test still inflates the type-I error a little while the adjusted score test statistic is more conservative. Our proposed test statistic is conservative, but with type-I error probability closer to the desired one. Also, the method provides good estimates of the true cutpoint when a threshold model is correct.

Professor Tunes-da-Silva’s work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo – FAPESP, Brazil, grant number 2007/02823-3. Professor Klein’s work was supported by a grant from the National Cancer Institute (R01 CA54706-14).

**Appendix. Supplementary data**

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.csda.2010.02.016.

- Andersen PK, Hansen MG, Klein JP. Regression analysis of restricted mean survival time based on pseudo-observations. Lifetime Data Analysis. 2004;10:335–350. [PubMed]
- Andersen PK, Klein JP, Rosthoj S. Generalized linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika. 2003;90:15–27.
- Andrei A-C, Murray S. Regression models for the mean of the quality-of-life-adjusted restricted survival time using pseudo-observations. Biometrics. 2007;63:398–404. [PubMed]
- Billingsley P. Convergence of Probability Measures. New York: Wiley and Sons; 1968.
- Billingsley P. Convergence of Probability Measures. 2nd ed. New York: Wiley and Sons; 1999.
- Boos D. On generalized score tests. The American Statistician. 1992;46(4):327–333.
- Contal C, O’Quigley J. An application of changepoint methods in studying the effect of age on survival in breast cancer. Computational Statistics and Data Analysis. 1999;30:253–270.
- Graw F, Gerds TA, Schumacher M. On pseudo-values for regression analysis in competing risks models. Lifetime Data Analysis. 2009;15:241–255. [PubMed]
- Jespersen NCB. Dichotomizing a continuous covariate in the Cox regression model. Statistical Research Unit of University of Copenhagen, Research Report. 1986;86(2)
- Klein JP, Andersen PK. Regression modeling of competing risks data based on pseudovalues of the cumulative incidence function. Biometrics. 2005;61:223–229. [PubMed]
- Klein JP, Wu J-T. Discretizing a continuous covariate in survival studies. In: Balakrishnan N, Rao CR, editors. Handbook of Statistics 23: Advances in Survival Analysis. New York: Elsevier; 2004. pp. 27–42.
- Lausen B, Hothorn T, Bretz F, Schumacher M. Assessment of optimal selected prognostic factors. Biometrical Journal. 2004;46(3):364–374.
- Lausen B, Schumacher M. Maximally selected rank statistics. Biometrics. 1992;48:73–85.
- Lausen B, Schumacher M. Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Computational Statistics and Data Analysis. 1996;21:307–326.
- Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;78:13–22.
- Liu L, Logan B, Klein JP. Inference for current leukemia free survival. Lifetime Data Analysis. 2008;14:432–446. [PMC free article] [PubMed]
- Logan BR, Klein JP, Zhang MJ. Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation. Biometrics. 2008;64(3):733–740. [PMC free article] [PubMed]
- McCullagh P, Nelder J. Generalized Linear Models. London: Chapman and Hall; 1989.
- Miller R, Siegmund D. Maximally selected chi square statistics. Biometrics. 1982;38:1011–1016.
- O’Quigley J. Proportional Hazards Regression. New York: Springer; 2008.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |