


Article sections

- Summary
- 1. Introduction
- 2. Profile likelihood approach
- 3. EM algorithm for a semiparametric model
- 4. General concepts
- 5. Non-linear transformation models
- 6. Summary and justification of the procedure
- 7. Real data example
- 8. Conclusion
- References

J R Stat Soc Series B Stat Methodol. Author manuscript; available in PMC 2010 November 30.

Published in final edited form as:

J R Stat Soc Series B Stat Methodol. 2003 August 1; 65(3): 759–774.

doi: 10.1111/1467-9868.00414. PMCID: PMC2994590

NIHMSID: NIHMS252524

A. Tsodikov, University of Utah, Salt Lake City, USA


In semiparametric models, the dimension *d* of the maximum likelihood problem is potentially unlimited. Conventional estimation methods generally behave like *O*(*d*^{3}). A new *O*(*d*) estimation procedure is proposed for a large class of semiparametric models. Potentially unlimited dimension is handled in a numerically efficient way through a Nelson–Aalen-like estimator. Discussion of the new method is put in the context of recently developed minorization–maximization algorithms based on surrogate objective functions. The procedure for semiparametric models is used to demonstrate three methods to construct a surrogate objective function: using the difference of two concave functions, the EM way and the new quasi-EM (QEM) approach. The QEM approach is based on a generalization of the EM-like construction of the surrogate objective function so it does not depend on the missing data representation of the model. Like the EM algorithm, the QEM method has a dual interpretation, a result of merging the idea of surrogate maximization with the idea of imputation and self-consistency. The new approach is compared with other possible approaches by using simulations and analysis of real data. The proportional odds model is used as an example throughout the paper.

Potentially unlimited dimension has been the most critical deterrent to the use of maximum likelihood estimation (MLE) in semiparametric regression models. In survival analysis, methods based on the partial likelihood (Cox, 1972) are specific to the proportional hazards (PH) model and do not extend to other models. Straightforward Newton-type methods of maximizing the likelihood for the full model generally require *O*(*d*^{3}) operations to solve the system of score equations, where *d* is the number of model parameters. The principal part of the set of *d* parameters in a semiparametric model is used to specify a stepwise function *H* which approaches a continuous ‘true’ *H* in probability, as *d* → ∞. Although theoretically almost any likelihood can be maximized by a Newton-type method, its high complexity makes the problem computationally difficult for large *d*. The development of general, stable and numerically efficient algorithms for semiparametric MLE has been a long-standing problem (Fleming and Lin, 2000). Such algorithms are the subject of this paper. The argument goes as follows. The bottle-neck of a maximization algorithm for a semiparametric likelihood is the estimation of *H*. Let *l* be the log-likelihood of a semiparametric model, treated as a functional of *H*. Consider a class of continuous semiparametric models with the log-likelihood of the form (informally)

$$l={\displaystyle \sum _{t}{D}_{t}\phantom{\rule{thinmathspace}{0ex}}\text{log}\{\mathrm{d}H(t)\}+{\displaystyle \sum _{t}\text{log}\{\vartheta (H,t|z)\},}}$$

(1)

where *D _{t}* is the number of exact (failure) observations at time *t* and ϑ is a model-specific functional of *H*, given the covariates *z*. Setting to zero the derivative of *l* with respect to d*H*(τ) yields the estimating equation

$$\mathrm{d}H(\tau )={D}_{\tau}/{\displaystyle \sum _{t}\mathrm{\Theta}(H,t|z),}$$

(2)

where Θ is a functional representing a negative ‘derivative’ of log(ϑ). Since both sides of equation (2) depend on *H*, an iterative procedure is required to make the equation self-consistent,

$$\mathrm{d}{H}^{(k+1)}(\tau )={D}_{\tau}/{\displaystyle \sum _{t}\mathrm{\Theta}({H}^{(k)},t|z)},$$

(3)

where *k* counts iterations. Iterative updating of *H* by using equation (3) is the basic idea behind the algorithm. As we shall see, the above procedure is intimately linked to the EM algorithm as used to fit certain PH frailty models in survival analysis (Oakes, 1989; Klein, 1992; Nielsen *et al.*, 1992). The EM algorithm handles *H* in an *O*(*d*) way through the use of the Nelson–Aalen–Breslow estimator (Andersen *et al.*, 1993) for the cumulative hazard *H*. This is made possible as the M-step reduces to the PH model. However, a large amount of analytic work would be required to specify an estimation procedure for a new non-PH model. Expectation at the E-step may prove to be inaccessible in a closed form, and Monte Carlo extensions of the EM approach are much less computationally attractive. Recently, an optimization transfer approach (Lange *et al.*, 2000) was proposed that allows us to construct EM-like procedures without the use of missing data. For a target function *l*(**x**), **x** ∈ ℝ^{n}, the minorization–maximization (MM) algorithm (Lange *et al.*, 2000) proceeds by construction of the so-called surrogate objective function *Q*(**x**|**y**) such that *Q*(**y**|**y**) = *l*(**y**) and *Q*(**x**|**y**) ≤ *l*(**x**), for any **x**, to ensure monotonicity of the procedure. Maximization of the target function *l* proceeds iteratively as

$${\mathbf{x}}^{(k+1)}=\text{arg}\phantom{\rule{thinmathspace}{0ex}}\underset{\mathbf{x}}{\text{max}}\{Q(\mathbf{x}|{\mathbf{x}}^{(k)})\}.$$

(4)

The MM algorithm converges in *l* and in **x** under fairly general conditions (Lange *et al.*, 2000). In the likelihood interpretation, the EM algorithm is a particular case of the MM algorithm. Unfortunately, there is no automatic way to construct *Q*. The procedure (3) interpreted as an MM algorithm is used to highlight three methods to construct a surrogate objective function: using the difference of two concave functions, the EM way and a new quasi-EM (QEM) approach. These methods link the EM algorithm for frailty models and its modifications with the MM algorithms. In the QEM approach, ‘E’ in the EM is replaced by the quasi-expectation operator QE, which is not based on the concept of a random variable. The result is the so-called QEM algorithm, which presents us with a recipe of generalizing an EM procedure into a ‘distribution-free’ one, representing a particular MM algorithm.
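To make the MM iteration (4) concrete, here is a minimal one-dimensional sketch (a toy target chosen purely for illustration, not from the paper): maximizing *l*(*x*) = log(*x*) − √*x*, written as a difference of two concave functions, with the surrogate obtained by linearizing the subtracted part at the current iterate.

```python
import math

# Toy illustration of an MM iteration.  The target l(x) = log(x) - sqrt(x)
# is written as B(x) - A(x) with B(x) = log(x) and A(x) = sqrt(x), both
# concave.  Replacing A by its tangent at the current iterate y gives a
# minorizing surrogate Q(x|y) = log(x) - A(y) - A'(y)*(x - y); maximizing Q
# means solving B'(x_new) = A'(y), i.e. 1/x_new = 1/(2*sqrt(y)), so the MM
# update is the closed form x_new = 2*sqrt(y).

def mm_maximize(x0, tol=1e-12, max_iter=500):
    x = x0
    for _ in range(max_iter):
        x_new = 2.0 * math.sqrt(x)      # closed-form surrogate maximizer
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x
```

Each update increases *l*, and the iterates converge to the stationary point *x* = 4, where *l*′(*x*) = 1/*x* − 1/(2√*x*) vanishes.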

The problem of nonparametric maximum likelihood estimation (NPMLE) with the semiparametric model is to find estimates of regression coefficients **β** and an NPMLE estimate of *H* such that they deliver the maximum of a suitably defined likelihood function *l* = *l*(**β**, *H*). In this paper we use a profile likelihood approach to maximize *l*. The profile likelihood is defined as a supremum of the full likelihood taken over the nonparametric part of the model

$${l}_{\text{pr}}(\mathit{\beta})=\underset{H}{\text{max}}\phantom{\rule{thinmathspace}{0ex}}\{l(\mathit{\beta},H)\}.$$

(5)

Assuming that we can find the global maximum of *l* with respect to *H*, given **β**, we may write the profile likelihood as an implicit function of **β**

$${l}_{\text{pr}}(\mathit{\beta})=l\{\mathit{\beta},H(\mathit{\beta})\},$$

(6)

where *H*(**β**) is the solution of equation (5). Our algorithms will be designed following a straightforward nested procedure:

- maximize
*l*_{pr}(**β**) by a conventional non-linear programming method (e.g. a directions set method); - for any
**β**as demanded in the above maximization procedure, solve problem (5).

As the number of parameters of a semiparametric model is potentially unlimited, obtaining the inverse of the full information matrix can be computationally prohibitive. Therefore, we use the profile information matrix

$${\mathbf{I}}_{\mathit{\beta},\mathit{\beta}}^{\mathrm{P}}=-\frac{{\partial}^{2}{l}_{\text{pr}}(\mathit{\beta})}{\partial \mathit{\beta}\phantom{\rule{thinmathspace}{0ex}}\partial {\mathit{\beta}}^{\mathrm{T}}}$$

(7)

to derive a standard error estimator for **β**. In this paper we adopt a pragmatic numerical approach. In the course of maximization of the profile likelihood with respect to **β**, a dense sample of the profile likelihood surface is generated near a stationary point. The curvature of the profile likelihood surface at the stationary point can be estimated by fitting a quadratic function to some domain around the point by using least squares. For example, the domain can be limited to points that cannot be rejected by using the likelihood ratio test (applied informally). Alternatively, a more sophisticated approach can be used based on implicit differentiation of *l*_{pr}.
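The quadratic-fit idea can be sketched as follows (synthetic one-dimensional profile likelihood with known curvature; names and data are hypothetical, not the paper's): sample *l*_{pr} densely near its maximum, fit a quadratic by least squares and read the standard error off the fitted curvature.

```python
import numpy as np

# Curvature by least squares for a scalar beta: the profile information is
# I = -d^2 l_pr / d beta^2, estimated from a quadratic fitted to a dense
# sample of profile likelihood values near the maximum.

def se_from_profile(betas, lpr_values):
    c2, c1, c0 = np.polyfit(betas, lpr_values, deg=2)  # l_pr ~ c2*b^2 + c1*b + c0
    info = -2.0 * c2                                   # negative curvature
    return 1.0 / np.sqrt(info)                         # s.e. estimate for beta

# Synthetic profile likelihood: l_pr(beta) = -(beta - 1)^2 / (2 * 0.2^2),
# so the true standard error is 0.2.
true_se = 0.2
betas = np.linspace(0.9, 1.1, 21)                      # dense sample near the maximum
lpr = -(betas - 1.0) ** 2 / (2.0 * true_se ** 2)
```

With an exactly quadratic surface the fit recovers the curvature exactly; in practice the accuracy depends on how closely the profile likelihood is quadratic over the chosen domain.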

The rest of the paper will be devoted to constructing efficient NPMLE methods for obtaining *l*_{pr}, i.e. for maximizing *l* with respect to *H*, given **β**, as this is the crux of the matter.

Practically, inference based on the profile likelihood is similar to that based on the partial likelihood for the PH model, which is quite convenient. Theoretically, however, inference based on the profile likelihood is not straightforward, as the usual theory of MLE does not apply to unlimited dimension. Important results have been obtained regarding a theoretical justification for the NPMLE method and the profile likelihood for semiparametric models (Murphy, 2000; van der Vaart, 1998; Murphy and van der Vaart, 1997). It was shown that profile likelihoods with nuisance parameters estimated out behave like ordinary likelihoods under some conditions. In particular, these results apply to the PH model, the proportional odds (PO) model (Murphy, 2000; Murphy *et al.*, 1997) and the PH frailty model (Murphy, 1994, 1995), and presumably to most other models.

Let *t _{i}*, *i* = 1,…, *n*, *t*_{1} < ⋯ < *t _{n}*, be the distinct ordered observation times in the sample. With observations grouped by these time points, the log-likelihood (1) takes the form

$$l={\displaystyle \sum _{i=1}^{n}{D}_{i}\phantom{\rule{thinmathspace}{0ex}}\text{log}(\mathrm{\Delta}{H}_{i})+{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j\in {\mathcal{C}}_{i}\cup {\mathcal{D}}_{i}}\text{log}\{\vartheta (\mathrm{\Delta}\mathbf{H},{t}_{i}|\mathit{\beta},{\mathbf{z}}_{\mathit{\text{ij}}},{c}_{\mathit{\text{ij}}})\},}}}$$

(8)

where *D _{i}* is the number of failures that are associated with *t _{i}*, Δ*H _{i}* is the jump of *H* at *t _{i}*, 𝒟 _{i} and 𝒞 _{i} are the sets of failure and censored observations at *t _{i}* respectively, and *c _{ij}* is the event indicator for subject *ij* (1 for a failure and 0 for a censored observation).

For example, consider a PO model for the survival function *G*, given covariates **z**,

$$G(t|\mathit{\beta},\mathbf{z})=G\{t|\theta (\mathit{\beta},\mathbf{z})\}=\frac{\theta (\mathit{\beta},\mathbf{z})}{\theta (\mathit{\beta},\mathbf{z})+H(t)},$$

(9)

where θ is a predictor and *H* is some nonparametrically specified base-line cumulative hazard. The model is named after the PO property that for any two values of the predictor, θ_{1} and θ_{2}, with corresponding survival functions *G*_{1}(*t*) and *G*_{2}(*t*), the odds ratio

$$\frac{\text{odds}\{{G}_{1}(t)\}}{\text{odds}\{{G}_{2}(t)\}}=\frac{{\theta}_{1}}{{\theta}_{2}}$$

is a constant in *t*, where odds(*a*) = *a*/(1 − *a*).
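The PO property is easy to verify numerically from equation (9): odds{*G*(*t*|θ)} = θ/*H*(*t*), so the ratio of odds reduces to θ_{1}/θ_{2} at every *t*. A small sketch (the grid of *H* values is illustrative, not from the paper):

```python
import numpy as np

# Numerical check of the PO property: with G(t|theta) = theta/(theta + H(t)),
# odds{G} = G/(1 - G) = theta/H, so the odds ratio is constant in t.

def surv(theta, H):
    return theta / (theta + H)

def odds(a):
    return a / (1.0 - a)

H = np.linspace(0.1, 5.0, 50)          # an arbitrary cumulative hazard grid
theta1, theta2 = 0.8, 2.5
ratio = odds(surv(theta1, H)) / odds(surv(theta2, H))
# ratio equals theta1/theta2 at every point of the grid
```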

This paper was inspired by the idea of representing a semiparametric model as a mixture (frailty) model, and to use the EM algorithm to fit it. With this idea in mind, consider a PH mixture model

$$G(t|\mathit{\beta},\mathbf{z})=E\{F{(t)}^{U(\mathit{\beta},\mathbf{z})}|\mathbf{z}\},$$

(10)

where *F* = exp(−*H*) is the base-line survival function corresponding to *H*, and *U* = *U*(**β**, **z**) is used to indicate that the distribution of random variable *U* depends on covariates and regression coefficients. This model can be considered a compact expression for a family of so-called PH frailty models, or PH models with random effects considered by Hougaard (1984), Klein (1992), Nielsen *et al.* (1992), Wassel and Moeschberger (1993), Clayton and Cuzick (1985) and many others, for different distributions of *U*, possibly dependent on covariates.

To construct the EM algorithm for a particular model (PO in the example), we represent it as a PH mixture model (inverse transform), and then follow the usual logic of the EM algorithm construction for frailty models, as for example in Nielsen *et al.* (1992).

We note that ℒ(*s*|·) = *E*[exp{−*s* *U*(·)}] is the Laplace transform of *U*(·), and that for the PH mixture model (10)

$$G(t|\cdot )=\mathcal{L}\{H(t)|\cdot \}=\mathcal{L}[-\text{log}\{F(t)\}|\cdot ].$$

From the latter equation and equation (9), we conclude that *U* for the PO model represents exponential regression, as ℒ(*s*|·) = θ(·)/{θ(·) + *s*} is the Laplace transform of an exponential distribution with mean θ^{−1}.

With the PH mixture model (10), pretend that *U* is known for each subject *ij*, continuing the notation of Section 2. The complete-data likelihood under non-informative right censoring corresponds to the PH model with predictors *U _{ij}*

$${l}_{\text{cd}}={\displaystyle \sum _{i=1}^{n}\{{D}_{i}\phantom{\rule{thinmathspace}{0ex}}\text{log}(\mathrm{\Delta}{H}_{i})-{\displaystyle \sum _{j\in {\mathcal{C}}_{i}\cup {\mathcal{D}}_{i}}{U}_{\mathit{\text{ij}}}{H}_{i}\}.}}$$

(11)

Since the complete-data likelihood (11) is linear in missing data *U _{ij}*, the E-step reduces to imputation of each *U _{ij}* by its conditional expectation *Û*, given the observed data. For the PO model, with exponentially distributed *U* and predictor θ = θ(**β**, **z**),

$$\widehat{U}=\frac{{\displaystyle \int u\text{}{F}^{u}{(\mathit{\text{uh}})}^{c}\theta \text{exp}(-\theta u)\phantom{\rule{thinmathspace}{0ex}}\mathrm{d}u}}{{\displaystyle \int {F}^{u}{(\mathit{\text{uh}})}^{c}\theta \phantom{\rule{thinmathspace}{0ex}}\text{exp}}(-\theta u)\phantom{\rule{thinmathspace}{0ex}}\mathrm{d}u}=\frac{\mathrm{\Gamma}(c+2)\theta /{(\theta +H)}^{c+2}}{\mathrm{\Gamma}(c+1)\theta /{(\theta +H)}^{c+1}}=\frac{c+1}{\theta (\mathit{\beta},\mathbf{z})+H(t)},$$

(12)

where *h* is the hazard function corresponding to *H*. A similar derivation of *Û* for a gamma frailty model can be found, for example, in Parner (1998).

Maximization of the complete-data likelihood (11) with respect to *H*, and with *U _{ij}* imputed by *Û _{ij}* at the E-step, yields the Nelson–Aalen–Breslow-type estimator

$$\mathrm{\Delta}{H}_{m}={D}_{m}/{\displaystyle \sum _{\mathit{\text{ij}}\in {\mathcal{R}}_{m}}{\widehat{U}}_{\mathit{\text{ij}}},}\text{\hspace{1em}\hspace{1em}\hspace{1em}}m=1,\dots ,n,$$

where ℛ _{m} = {*ij* : *j* ∈ 𝒞 _{i} ∪ 𝒟 _{i}, *i* ≥ *m*} is the set of subjects at risk just before *t _{m}*.

Finally, for the PO model we have the iterative EM procedure

$$\mathrm{\Delta}{H}_{m}^{(k+1)}={D}_{m}\phantom{\rule{thinmathspace}{0ex}}{\left\{{\displaystyle \sum _{\mathit{\text{ij}}\in {\mathcal{R}}_{m}}\frac{{c}_{\mathit{\text{ij}}}+1}{\theta (\mathit{\beta},{\mathbf{z}}_{\mathit{\text{ij}}})+{H}_{i}^{(k)}}}\right\}}^{-1},\text{\hspace{1em}\hspace{1em}\hspace{1em}}m=1,\dots ,n,$$

(13)

where *k* counts iterations.
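A direct implementation of update (13) is short. The sketch below (synthetic data; the variable names are mine, not the paper's) groups subjects by the index of their observation time, imputes (*c _{ij}* + 1)/{θ _{ij} + *H _{i}*^{(k)}} for everyone at risk and renormalizes the jumps of *H*:

```python
import numpy as np

# One pass of the EM update (13) for the PO model.  Subjects are indexed by
# the distinct ordered time t_i they are observed at (time_idx), with c = 1
# for a failure and c = 0 for a censored observation.

def em_update(dH, theta, time_idx, c):
    H = np.cumsum(dH)                              # H_i at each distinct time t_i
    contrib = (c + 1.0) / (theta + H[time_idx])    # (c_ij + 1)/(theta_ij + H_i^(k))
    new_dH = np.empty_like(dH)
    for m in range(len(dH)):
        D_m = c[time_idx == m].sum()               # failures tied to t_m
        new_dH[m] = D_m / contrib[time_idx >= m].sum()   # risk set R_m: i >= m
    return new_dH

# Tiny synthetic data set: 6 subjects over 3 distinct observation times.
time_idx = np.array([0, 0, 1, 1, 2, 2])   # index i of each subject's time t_i
c = np.array([1, 0, 1, 1, 1, 0])          # 1 = failure, 0 = censored
theta = np.full(6, 1.5)                   # predictors theta(beta, z_ij)

dH = np.full(3, 0.1)                      # initial jumps of H
for _ in range(1000):                     # iterate (13) to self-consistency
    dH = em_update(dH, theta, time_idx, c)
```

At convergence a further pass leaves the jumps unchanged, which is the self-consistency property of equation (3).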

It is intriguing that we can formally derive procedure (13) as an immediate corollary of the argument presented in Section 1. Indeed, using equation (9), we write the likelihood for the PO model as

$$l={\displaystyle \sum _{i=1}^{n}\left[{D}_{i}\phantom{\rule{thinmathspace}{0ex}}\text{log}(\mathrm{\Delta}{H}_{i})+{\displaystyle \sum _{j\in {\mathcal{C}}_{i}\cup {\mathcal{D}}_{i}}\text{log}\left\{\frac{\theta (\mathit{\beta},{\mathbf{z}}_{\mathit{\text{ij}}})}{{\{\theta (\mathit{\beta},{\mathbf{z}}_{\mathit{\text{ij}}})+{H}_{i}\}}^{{c}_{\mathit{\text{ij}}}+1}}\right\}}\right].}$$

(14)

On differentiating equation (14) with respect to Δ*H _{m}*, setting the derivative to zero and assigning the iteration index *k* as in procedure (3), we recover procedure (13).

This observation deserves discussion. The EM derivation presented above for the PO model is model specific and its feasibility depends on the success and simplicity of the inverse Laplace transform and the integrals that are evaluated at the E-step (12). The PH mixture representation of a semiparametric model may not exist, in which case the EM derivation ultimately fails. Necessary and sufficient conditions for this representation to exist are a corollary of the Bernstein theorem (Feller, 1971): the survival function must be a completely monotonic function of *H*. A function ψ(*H*) is called completely monotonic if all its derivatives ψ^{(i)} exist, *i* = 1, 2,…, and (−1)^{i} ψ^{(i)}(*H*) ≥ 0, *H* > 0. In particular, the survival function (10) of the PH mixture model is an infinitely differentiable function of *F*. The alternative derivation of procedure (13) bypasses all the above-mentioned difficulties and formally works for any model in a straightforward and simple fashion. This raises a series of questions. Does the procedure of Section 1 work for any model? What is its relationship to the EM algorithm? Does it inherit the monotonicity, stability and convergence of the EM algorithm?

A clue to generalizing the EM algorithm described above is the observation that the derivation of the E-step (12) does not require knowledge of the distribution of *U*. Indeed, denote by γ(*x*) the moment-generating function of *U* (other arguments are omitted), so that γ(*x*) = *E*(*x ^{U}*) = ℒ{−log(*x*)}. Then the imputed value (12) can be written as

$$\widehat{U}=\frac{E({F}^{U}{U}^{c+1})}{E({F}^{U}{U}^{c})}=c+F\frac{{\mathrm{\gamma}}^{(c+1)}(F)}{{\mathrm{\gamma}}^{(c)}(F)},$$

(15)

where γ^{(c)} denotes the derivative of order *c*; γ^{(0)} := γ, *c* = 0, 1. Expression (15) represents a variation on the topic of the derivation of moments from the transform of a distribution. The consequence of equation (15) is a straightforward and general specification of the E-step for any mixture model formulated in terms of the moment-generating function. In fact, it is even more general, as will be shown in what follows. To elaborate further on the issues raised above, we need to make a few theoretical observations, considered in the next section.
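Expression (15) can be checked against a concrete mixture. Taking *U* gamma distributed with shape α and rate β (an assumed example, not one treated in the text), γ(*x*) = *E*(*x ^{U}*) = {β/(β − log *x*)}^{α}, and *c* + *x*γ^{(c+1)}(*x*)/γ^{(c)}(*x*) should reproduce the well-known gamma-frailty posterior mean (α + *c*)/(β − log *x*):

```python
import sympy as sp

# Check of the generalized E-step (15) for gamma-distributed U with shape
# alpha and rate beta (assumed example): gamma(x) = E(x^U)
# = {beta/(beta - log x)}^alpha for x in (0, 1].

x, alpha, beta = sp.symbols('x alpha beta', positive=True)
gamma = (beta / (beta - sp.log(x))) ** alpha   # moment-generating function of U

for c in (0, 1):
    U_hat = c + x * sp.diff(gamma, x, c + 1) / sp.diff(gamma, x, c)  # (15)
    expected = (alpha + c) / (beta - sp.log(x))  # gamma-frailty posterior mean
    # compare numerically at a sample point (log x <= 0 for x in (0, 1])
    vals = {x: sp.Rational(2, 5), alpha: 2, beta: 3}
    assert abs(float((U_hat - expected).subs(vals))) < 1e-9
```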

For studying procedure (3), the following MM construction (Lange *et al.*, 2000) is useful. Let *l*(**x**) = *B*(**x**) − *A*(**x**), where *A* and *B* are differentiable concave functions. The iterative maximization procedure,

$$\nabla B({\mathbf{x}}^{(k+1)})=\nabla A({\mathbf{x}}^{(k)}),$$

(16)

where ∇*A*(**x**) = ∂*A*/∂**x** is the gradient of *A*, represents an MM algorithm, as follows from convexity arguments. The surrogate objective function for the above construction has the form

$$\mathcal{Q}(\mathbf{x}|{\mathbf{x}}^{(k)})=B(\mathbf{x})-A({\mathbf{x}}^{(k)})-{\nabla}^{\mathrm{T}}A({\mathbf{x}}^{(k)})(\mathbf{x}-{\mathbf{x}}^{(k)}).$$

(17)

Let ℰ be the observed event, and *U* be a random variable (vector) representing missing data. The EM algorithm is a method to maximize the log-likelihood function *l*(**x**) = log{*L*(**x**)} of the form

$$L(\mathbf{x})=E\{{L}_{0}(\mathbf{x}|U)\},$$

(18)

where *L*_{0}(**x**|*U*) is the conditional likelihood, given missing data (the likelihood constructed to estimate **x** as if *U* were a covariate), and **x** does not include parameters of the distribution of *U*. To facilitate ‘distribution-free’ generalizations, we intentionally avoid explicit expressions involving the distribution of *U*, and we use the conditional rather than the joint likelihood of *U* and ℰ to represent the EM procedure. In this construction, unknown parameters entering the distribution of *U* do not participate in the procedure, and the maximization is considered with respect to parameters **x** only. For any function *f*(*U*), the conditional expectation of *f*(*U*), given the observed event ℰ and **x**, is represented as

$$E\{f(U)|\mathcal{E},\mathbf{x}\}=\frac{E\{f(U){L}_{0}(\mathbf{x}|U)\}}{E\{{L}_{0}(\mathbf{x}|U)\}}.$$

This suggests the following explicit functional notation for the conditional expectation operator

$$E(f|g):=\frac{E(fg)}{E(g)},$$

(19)

for any functions *f* and *g* of *U*, where *U* is a random variable, and *E*(*g*) is the probability of the condition. A standard Jensen inequality argument shows that, with this notation,

$$Q(\mathbf{x}|\mathbf{y})=l(\mathbf{y})+E\{{l}_{0}(\mathbf{x}|U)-{l}_{0}(\mathbf{y}|U)|{L}_{0}(\mathbf{y}|U)\},\text{\hspace{1em}\hspace{1em}\hspace{1em}}{l}_{0}=\text{log}({L}_{0}),$$

(20)

is a surrogate objective function for the target function *l*(**x**). The operation of finding *Û* such that *l*_{0}(**x**|*Û*) = *E*{*l*_{0}(**x**|*U*)|*L*_{0}(**y**|*U*)} is referred to as missing data imputation. If imputation is easy (E-step), maximization of *Q* with respect to **x** reduces to that of *l*_{0}(**x**|*Û*), a complete-data problem.

To prove that any converging sequence **x**^{(k)} → **x**^{*}, designed according to equation (4), gives us a stationary point in the limit, we follow the argument at **x** = **y** = **x**^{*}

$$\frac{\partial Q(\mathbf{x}|\mathbf{y})}{\partial \mathbf{x}}=\frac{E\{\partial {L}_{0}(\mathbf{x}|U)/\partial \mathbf{x}\}}{L(\mathbf{y})}=\frac{1}{L(\mathbf{y})}\frac{\partial L(\mathbf{x})}{\partial \mathbf{x}}=0,$$

(21)

which implies that the score equation ∂*L*(**x**)/∂**x** = 0 is satisfied in the limit.

The EM algorithm proceeds by iterations $E\{{l}_{0}^{\prime}({\mathbf{x}}^{(k+1)}|U)\phantom{\rule{thinmathspace}{0ex}}{L}_{0}({\mathbf{x}}^{(k)}|U)\}=0$, where at the *k*th iteration this equation is solved for **x**^{(k+1)}.

Let us revisit the EM construction and ask: what properties of the E-operator did we actually use in the derivation of Section 4.2? They are conditional expectation performed according to expression (19), linearity and the Jensen inequality in equation (20), and interchangeability of *E* and ∂/∂**x** in equation (21). Operators satisfying these properties will be called QE operators. As soon as a QE operator satisfying these requirements and such that *L*(**x**) = QE{*L*_{0}(**x**|*U*)} is constructed, the likelihood function *L* can be maximized by an MM algorithm with the surrogate objective function as in equation (20) with *E* replaced by QE. The rationale behind this substitution is that the evaluation of *E* requires that *U* be a random variable, and that we know its distribution, whereas that of QE does not.

Formally, let there be a set of basis functions (including the function *f*(*u*) ≡ 1), and let the set of admissible functions consist of all linear combinations of basis functions. In other words, an admissible function *f* has the form *f* = Σ _{i} *a _{i}* *f _{i}*, where the *f _{i}* are basis functions and the *a _{i}* are constants.

Define QE as a linear functional mapping admissible functions to real numbers such that

- QE(**1**) := 1, where **1** means a function that is equivalent to 1 (*normalization*);
- for any *f* = Σ _{i} *a _{i}* *f _{i}*, QE(*f*) := Σ _{i} *a _{i}* QE(*f _{i}*) (*linearity*);
- for any admissible function *f*(*u*, *a*) such that ∂*f*(*u*, *a*)/∂*a* is also admissible,$$\frac{\partial \text{QE}\{f(U,a)\}}{\partial a}=\text{QE}\left\{\frac{\partial f(U,a)}{\partial a}\right\}$$(*interchangeability*);
- as in expression (19), for any admissible functions *g* and *fg* such that QE(*g*) ≠ 0,$$\text{QE}(f|g):=\frac{\text{QE}(fg)}{\text{QE}(g)}$$(22)(*conditional QE*); and
- given admissible functions *g*, *fg* and *g* log(*f*),$$\text{QE}\{\text{log}(f)|g\}\le \text{log}\{\text{QE}(f|g)\}$$(23)(*Jensen inequality*).

Let us consider the QE requirements more closely. We start by postulating one basis function *f*_{0}(*U*, *a*) and the value of QE on that basis function, QE(*f*_{0}) := γ(*a*), where γ is some function of *a*. Depending on how many times we are allowed to differentiate under the QE symbol, the derivatives

$${f}^{(i)}(U,a)={\partial}^{(i)}{f}_{0}(U,a)/\partial {a}^{i}$$

must also be included in the set of admissible functions, so that derivatives of QE(*f*_{0}) can be defined. Moreover, QE on *f*^{(i)}, *i* = 1, 2,…, are automatically defined by the interchangeability property through derivatives of γ, as QE{*f*^{(i)}} = γ^{(i)}. As we can see, the QE construction is cloned from *f*_{0} and γ. The mathematical expectation *E*{*f*_{0}(*U*, *a*)} is an integral transform of the distribution of *U*, where *a* is an argument of the transform. Depending on the choice of the function *f*_{0}, QE mimics differential properties of the corresponding transform, whereas the function γ is not necessarily an integral transform.

To study procedure (3), a QE construction based on moment-generating functions is useful. This construction is cloned from the basis function *f*_{0}(*U*, *a*) = *a ^{U}*, 0 ≤ *a* ≤ 1, and a function γ(*a*), giving

$$\begin{array}{c}\hfill \text{QE}({a}^{U})=\mathrm{\gamma}(a),\hfill \\ \hfill \text{QE}(U{a}^{U})=a{\mathrm{\gamma}}^{\prime}(a),\hfill \\ \hfill \text{QE}({U}^{2}{a}^{U})=a{\mathrm{\gamma}}^{\prime}(a)+{a}^{2}{\mathrm{\gamma}}^{\prime\prime}(a),\hfill \end{array}$$

(24)

by the interchangeability property with two derivatives of γ allowed. We now derive the Jensen inequality for this construction. First, noting equation (15), introduce the conditional QE

$$\mathrm{\Theta}(x|c):=\text{QE}(U|{U}^{c}{x}^{U})=c+x\frac{{\mathrm{\gamma}}^{(c+1)}(x)}{{\mathrm{\gamma}}^{(c)}(x)},$$

(25)

where we used expressions (22) and (24) to obtain the right-hand part of expression (25), *c* = 0, 1. The function Θ is a surrogate of conditional expectation of *U*, given observed data, and it serves as an imputation operator in the QEM construction. Next, let ℱ _{k} be the family of functions ℱ _{k} = {*U ^{k}* *a ^{U}* : 0 ≤ *a* ≤ 1}, *k* = 0, 1. Then, taking *f* = *a ^{U}* and *g* = *U ^{c}* *b ^{U}*, and noting that

$$\frac{\partial}{\partial a}[\text{QE}\{\text{log}(f)|g\}-\text{log}\{\text{QE}(f|g)\}]=\frac{1}{a}\{\mathrm{\Theta}(b|c)-\mathrm{\Theta}(\mathit{\text{ab}}|c)\},$$

the following can be proved.

Theorem 1 (Jensen inequality for QE). Let Θ(*x*|*c*) as defined by expression (25) be a non-decreasing function of *x*, *c* = 0, 1. Then inequality (23) holds true for any *f* ∈ ℱ _{0} and *g* ∈ ℱ _{0} ∪ ℱ _{1}.

It is interesting that mathematical expectation satisfies the assumption of non-decreasing Θ, which we call the *convexity assumption* for reasons that will become clear later as we discuss an application of the above theory to semiparametric likelihood.

Theorem 2 (convexity for mathematical expectation). Let γ be a function defined by using the mathematical expectation operator *E* as γ(*x*) := *E*(*x ^{U}*), where *U* is a non-negative random variable. Then Θ(*x*|*c*) as defined in expression (25) is non-decreasing in *x* for any *c* = 0, 1.

*Proof*. The Cauchy–Schwarz inequality with functions ξ(*U*, *x*) = *U*^{1+c/2}*x*^{U/2} and ζ(*U*, *x*) = *U*^{c/2}*x*^{U/2} can be used to show that Θ′(*x*|*c*) ≥ 0.
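Both the identities (24) and the convexity assumption of theorem 2 can be illustrated with an actual random variable. Taking *U* ~ Poisson(λ) (an assumed example), γ(*a*) = *E*(*a ^{U}*) = exp{λ(*a* − 1)}, so γ′ = λγ and γ″ = λ²γ, and Θ(*a*|*c*) = *c* + λ*a* is visibly non-decreasing:

```python
import numpy as np

# Monte Carlo illustration of the identities (24) and of theorem 2 for
# U ~ Poisson(lam): gamma(a) = E(a^U) = exp{lam*(a - 1)}, so that
# gamma'(a) = lam*gamma(a) and gamma''(a) = lam^2*gamma(a).

rng = np.random.default_rng(0)
lam = 2.0
U = rng.poisson(lam, size=2_000_000).astype(float)

a = np.linspace(0.1, 0.9, 9)
gamma = np.exp(lam * (a - 1.0))

# Sample averages against the closed forms in (24):
mc = [np.array([np.mean(ai ** U) for ai in a]),            # E(a^U)
      np.array([np.mean(U * ai ** U) for ai in a]),        # E(U a^U)
      np.array([np.mean(U ** 2 * ai ** U) for ai in a])]   # E(U^2 a^U)
exact = [gamma,
         a * lam * gamma,                                  # a * gamma'(a)
         a * lam * gamma + (a * lam) ** 2 * gamma]         # a*gamma' + a^2*gamma''

# Theorem 2: Theta(a|c) = c + a*gamma^(c+1)(a)/gamma^(c)(a) is non-decreasing.
theta0 = a * (lam * gamma) / gamma               # = lam * a
theta1 = 1.0 + a * (lam ** 2 * gamma) / (lam * gamma)   # = 1 + lam * a
```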

To study procedure (1)–(3) in more detail using the developments of Section 4, we need to specify a certain structure of the likelihood function to be optimized. To do this, let us confine ourselves to a large, yet specific, class of semiparametric survival models. Consider a parametric regression model with support on [0, 1]. Let γ(*x*|**β**, **z**), *x* ∈ [0, 1], be a parametrically specified distribution function in *x*, conditional on covariates **z**. We require that γ be twice differentiable with respect to *x* and the regression coefficients **β**.

We can now define a semiparametrically specified survival function *G*(*t*|**β**, **z**), given covariates, as

$$G(t|\mathit{\beta},\mathbf{z})=\mathrm{\gamma}\{F(t)|\mathit{\beta},\mathbf{z}\},$$

(26)

where the base-line survival function *F* is specified nonparametrically. The class of models (26) will be called non-linear transformation models (NTMs). Functions like γ will be called NTM-generating functions. An NTM is obtained by plugging a nonparametrically specified survival function *F* into a parametric distribution function γ with the support compatible with the range of *F*. One important subclass of NTMs is the family of PH mixture models (10), for which γ is a moment-generating function of *U*,

$$\mathrm{\gamma}(x|\mathit{\beta},\mathbf{z})=E({x}^{U(\mathit{\beta},\mathbf{z})}|\mathbf{z}).$$

To represent a semiparametric model in the NTM form, we need to express its survival function *G* as a function γ of a base-line survival function *F* (this representation is not unique and is not always possible). For example, from equation (9) we obtain the PO model in the form *G*(*t*|·) = θ(·)/[θ(·) − log{*F*(*t*)}], which gives

$$\mathrm{\gamma}(x|\cdot )=\frac{\theta (\cdot )}{\theta (\cdot )-\text{log}(x)}.$$

(27)

The class of NTMs includes the class of linear transformation models (Cheng *et al.*, 1995, 1997). It is easy to show that a linear transformation model can be represented as γ(*x*|**β**, **z**) = *p*[log{θ(**β**, **z**)} + *q*(*x*)], where *p* is a parametrically specified tail function, *q* is an inverse tail function and θ is a predictor (it is convenient to specify *q* as the inverse of *p*; then θ = 1 corresponds to the base-line γ(*x*|·) = *x*).
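For instance, the PO generating function (27) fits this linear transformation form with the logistic tail function *p*(*v*) = 1/{1 + exp(−*v*)} and *q*(*x*) = −log{−log(*x*)}. This particular pair is my derivation by matching (27), not a choice made in the text, and this *q* is not the inverse of *p*:

```python
import numpy as np

# Numerical check that the PO generating function (27),
# gamma(x|theta) = theta/(theta - log x), can be written in the linear
# transformation form p[log(theta) + q(x)] with the logistic tail
# p(v) = 1/(1 + exp(-v)) and q(x) = -log(-log x)  (assumed pair, derived
# by matching (27)).

def p(v):
    return 1.0 / (1.0 + np.exp(-v))

def q(x):
    return -np.log(-np.log(x))     # defined for x in (0, 1)

x = np.linspace(0.05, 0.95, 50)
theta = 1.7
lhs = p(np.log(theta) + q(x))
rhs = theta / (theta - np.log(x))
```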

In the survival analysis formulation, under non-informative right censoring, the contributions of observations sampled from an NTM (26) to the likelihood are log(−d*G*) and log(*G*), for a failure and a censored observation respectively. We have

$$-\mathrm{d}G(t|\mathit{\beta},\mathbf{z})=\mathrm{\gamma}\prime \{F(t)|\mathit{\beta},\mathbf{z}\}\phantom{\rule{thinmathspace}{0ex}}F(t)\phantom{\rule{thinmathspace}{0ex}}\mathrm{d}H(t),$$

where γ′(*x*|·) = ∂γ(*x*|·)/∂*x*, differentials are taken with respect to *t* under a continuous model and we recall that *F* = exp(−*H*). We may now rewrite the likelihood (8) for an NTM as

$$l={\displaystyle \sum _{i=1}^{n}{D}_{i}\text{log}(\mathrm{\Delta}{H}_{i})+}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j\in {\mathcal{C}}_{i}\cup {\mathcal{D}}_{i}}\text{log}\{\vartheta ({F}_{i}|\mathit{\beta},{\mathbf{z}}_{\mathit{\text{ij}}},{c}_{\mathit{\text{ij}}})\},}}$$

(28)

where

$$\vartheta (x|\cdot ,c)={x}^{c}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{\gamma}}^{(c)}(x|\cdot ),$$

and Δ*H _{i}* is substituted for d*H*(*t _{i}*), with *F _{i}* = exp(−*H _{i}*). Applying the general procedure (3) to the likelihood (28), we obtain

$$\mathrm{\Delta}{H}_{m}^{(k+1)}={D}_{m}/{\displaystyle \sum _{\mathit{\text{ij}}\in {\mathcal{R}}_{m}}\mathrm{\Theta}({F}_{i}^{(k)}|\mathit{\beta},{\mathbf{z}}_{\mathit{\text{ij}}},{c}_{\mathit{\text{ij}}})}.$$

(29)

This procedure is a generalization of procedure (13) to the NTM family. For the PO model, substituting equation (27) into expression (25), we obtain

$$\mathrm{\Theta}(x|\cdot ,c)=\frac{c+1}{\theta (\cdot )-\phantom{\rule{thinmathspace}{0ex}}\text{log}(x)}.$$

(30)

It is clear that, with Θ given by equation (30), the general procedure (29) turns into the procedure (13) derived for the PO model in Section 3.
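The reduction can be checked with sympy: substituting the PO generating function (27) into the general imputation operator (25) should return (30). A sketch:

```python
import sympy as sp

# Check that substituting the PO generating function (27) into the
# imputation operator (25) yields expression (30),
# Theta(x|c) = (c + 1)/(theta - log x).

x, theta = sp.symbols('x theta', positive=True)
gamma = theta / (theta - sp.log(x))            # generating function (27)

for c in (0, 1):
    Theta = c + x * sp.diff(gamma, x, c + 1) / sp.diff(gamma, x, c)  # (25)
    target = (c + 1) / (theta - sp.log(x))                           # (30)
    # compare numerically at a sample point
    vals = {x: sp.Rational(3, 10), theta: sp.Rational(17, 10)}
    assert abs(float((Theta - target).subs(vals))) < 1e-12
```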

We now make use of the QE theory of Section 4.3 to provide a link between NTMs and the QE operator. Equations (24) summarizing second-order differential properties of the QE operator will be the main instrument of this section.

First, let us synchronize the development of Section 4.3 and the definition of NTM (26) in Section 5.1 by assuming that the function γ that is used in both sections is the same function. In fact, we already used this synchronization when we noticed in the previous section that Θ in equation (29) and in expression (25) is the same function. Now, from the first line of expression (24), with *F*(*t*) instead of *a*, we obtain the QE form of the NT model

$$G(t|\mathit{\beta},\mathbf{z})={\text{QE}}_{\mathit{\beta},\mathbf{z}}\{F{(t)}^{U}\}\phantom{\rule{thinmathspace}{0ex}},$$

(31)

where the subscript **β**, **z** to the QE operator indicates that QE is defined by using the function γ(*x*|**β**, **z**). Equation (31) is a postulate in the definition of the QE operator, and its link to the NTMs is established as we assume that QE is defined by using an NTM-generating function γ.

Now, consider the likelihood function *l* (28). Given an observation (*t*, **z**, *c*), *c* = 0, 1, its contribution υ{*F*(*t*)|**β**, **z**, *c*} to the likelihood *L* = exp(*l*) can be written as

$$\upsilon (F|\cdot ,c)=\vartheta (F|\cdot ,c)\mathrm{\Delta}{H}^{c}=\mathrm{\Delta}{H}^{c}{F}^{c}{\mathrm{\gamma}}^{(c)}(F|\cdot )=\text{QE}(\mathrm{\Delta}{H}^{c}{U}^{c}{F}^{U})\phantom{\rule{thinmathspace}{0ex}},$$

where the last equation follows from the first two lines of expression (24) and linearity of QE. As a result, the likelihood of an NTM mimics that of a mixture model

$$L=\prod \text{QE}(\mathrm{\Delta}{H}^{c}{U}^{c}\phantom{\rule{thinmathspace}{0ex}}{F}^{U}).$$

Consider the hazard function λ(*t*|**z**), corresponding to the survival function *G*(*t*|**z**). Differentiating the survival function (26), and using expression (24), we have

$$\lambda (t|\cdot )=-\frac{1}{G(t|\cdot )}\frac{\partial G(t|\cdot )}{\partial t}=\frac{\mathrm{\gamma}\prime \{F(t)|\cdot \}\phantom{\rule{thinmathspace}{0ex}}F(t)}{\mathrm{\gamma}\{F(t)|\cdot \}}h(t)=\frac{\text{QE}\{U\phantom{\rule{thinmathspace}{0ex}}F{(t)}^{U}\}}{\text{QE}\{F{(t)}^{U}\}}h(t),$$

where *h* is the hazard function corresponding to *F*. Applying the definition of conditional QE (22) to this expression, and using expression (25), we obtain

$$\lambda(t|\mathbf{z})=\mathrm{QE}\{U|F(t)^{U}\}\,h(t)=\Theta(F|\cdot,0)\,h(t).$$

This is a generalization of the fact that the population hazard function at time *t* in a heterogeneous population is represented as a conditional average, given survival up to time *t*.
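For the PO model, the identity λ(*t*|·) = Θ{*F*(*t*)|·, 0} *h*(*t*) can be spot checked numerically. The sketch below is illustrative only: it assumes a unit exponential baseline *F*(*t*) = exp(−*t*) (so *h* ≡ 1) and an arbitrary value θ = 2, and compares the formula with a central-difference approximation of −d log *G*/d*t*.

```python
import math

# Spot check of lambda(t) = Theta(F(t)|., 0) h(t) for the PO model.
# Assumed (not from the paper): baseline F(t) = exp(-t), h(t) = 1, theta = 2.
theta = 2.0
F = lambda t: math.exp(-t)                       # baseline survival function
gamma = lambda x: theta / (theta - math.log(x))  # NTM-generating function (27)
G = lambda t: gamma(F(t))                        # model survival function (26)

def hazard_numeric(t, eps=1e-6):
    # lambda(t) = -(d/dt) log G(t), by central differences
    return -(math.log(G(t + eps)) - math.log(G(t - eps))) / (2 * eps)

def hazard_formula(t):
    # Theta(F(t)|., 0) h(t) = h(t) / {theta - log F(t)}, using (30)
    return 1.0 / (theta - math.log(F(t)))

print(abs(hazard_numeric(1.3) - hazard_formula(1.3)) < 1e-6)  # True
```

The agreement simply reflects that d(θ − log *F*)/d*t* = *h*(*t*), so −d log *G*/d*t* = *h*(*t*)/{θ − log *F*(*t*)}.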

Bringing these derivations together with expression (25), we have the following theorem.

Theorem 3 (QEM construction). Consider a survival analysis problem for an NTM generated by the function γ(*x*|**β**, **z**), with fixed covariates. With the QE operator as defined in Section 4.3, and using the same NTM-generating function γ in its definition, the following representations are valid:

- survival function,$$G(t|\mathit{\beta},\mathbf{z})=\mathrm{QE}_{\mathit{\beta},\mathbf{z}}\{F(t)^{U}\};$$
- hazard function,$$\lambda(t|\mathbf{z})=\mathrm{QE}\{U|F(t)^{U}\}\,h(t)=\Theta(F|\cdot,0)\,h(t),$$where λ and *h* are the hazard functions corresponding to *G* and *F* respectively;
- likelihood function,$$l=\sum_{i=1}^{n}\left(\sum_{j\in\mathcal{C}_{i}\cup\mathcal{D}_{i}}\log\left[\mathrm{QE}_{\mathit{\beta},\mathbf{z}_{ij}}\{(U\,\Delta H_{i})^{c_{ij}}\,F_{i}^{U}\}\right]\right);$$
- imputation operator,$$\widehat{U}=\mathrm{QE}(U|U^{c}\,F^{U})=\Theta(F|\cdot,c),\qquad c=0,1,$$where *Û* denotes *U*, imputed by using the conditional QE operator.

Let us now go back to the EM algorithm of Section 3 and see how the results obtained since then allow us to streamline and justify our algorithm construction, using the PO model as an example. In summary, we now have the following procedure.

- Obtain the NTM-generating function, representing the model survival function as a function of *F*. For the PO model (9), *G*(*t*|·) = θ(·)[θ(·) − log{*F*(*t*)}]^{−1}, we have equation (27),$$\gamma(x|\cdot)=\theta(\cdot)\{\theta(\cdot)-\log(x)\}^{-1}.$$
- Obtain the imputation operator (25),$$\Theta(x|\cdot)=c+x\,\gamma^{(c+1)}(x|\cdot)\,\gamma^{(c)}(x|\cdot)^{-1}.$$For the PO model, this results in equation (30),$$\Theta(x|\cdot)=(c+1)\{\theta(\cdot)-\log(x)\}^{-1}.$$Check that Θ(*x*|·) is a non-decreasing function of *x* (see the justification below).
- Obtain the profile likelihood by iterations (29),$$\Delta H_{m}^{(k+1)}=D_{m}{\left\{\sum_{ij\in\mathcal{R}_{m}}\Theta(F_{i}^{(k)}|\mathit{\beta},\mathbf{z}_{ij},c_{ij})\right\}}^{-1}.$$
- Maximize the profile likelihood with respect to **β** as in Section 2.
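The iterative step for the baseline hazard can be sketched in a few lines of code. The following is a minimal illustration, not the author's implementation: the data and θ values are made up, **β** is held fixed so each subject carries a known θ, and for the PO model Θ reduces to (*c* + 1)/(θ + *H*_{i}) because log *F*_{i} = −*H*_{i}.

```python
# Sketch of the profile fixed-point iterations (29) for the PO model, beta fixed.
# Hypothetical data: (time, c, theta) with c = 1 for an event, 0 for censoring.
data = [(1.0, 1, 1.5), (2.0, 0, 1.5), (2.5, 1, 0.8), (4.0, 1, 0.8), (5.0, 0, 1.5)]

event_times = sorted({t for t, c, _ in data if c == 1})
dH = {t: 1.0 / len(event_times) for t in event_times}   # initial base-line jumps

for _ in range(200):                                    # fixed-point iterations
    # cumulative hazard H at each subject's observed time
    H = {t: sum(dH[s] for s in event_times if s <= t) for t, _, _ in data}
    new_dH = {}
    for m in event_times:
        D_m = sum(1 for t, c, _ in data if c == 1 and t == m)   # events at m
        # sum of Theta over the risk set R_m; Theta = (c + 1)/(theta + H_i)
        denom = sum((c + 1) / (th + H[t]) for t, c, th in data if t >= m)
        new_dH[m] = D_m / denom                                 # update (29)
    dH = new_dH

print({t: round(v, 4) for t, v in dH.items()})
```

At convergence, Δ**H** solves the self-consistency equation (29); the outer step then maximizes the resulting profile likelihood over **β**.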

For the PH mixture model, QE is equivalent to E (compare equations (31) and (10)), which makes Θ the conditional expectation of *U*, given observed data (compare expressions (25) and (15)). In this case, the above procedure is an EM algorithm.

Justification of this procedure works through the proof of monotonicity (i.e. the likelihood is improved at each step) under the following assumption.

Consider an NTM with the NTM-generating function γ. Assume that

$$\Theta(x|\mathit{\beta},\mathbf{z},c)\ \text{is a non-decreasing function of}\ x\text{, for any}\ \mathit{\beta},\mathbf{z},c.$$

(32)
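For the PO model, assumption (32) is immediate from equation (30): as *x* increases on (0, 1], log(*x*) increases, so θ(·) − log(*x*) decreases and Θ increases. A numerical spot check (the grid and the θ values below are arbitrary choices, not from the paper):

```python
import math

# Check condition (32) for the PO model: Theta(x|., c) = (c + 1)/(theta - log x)
# should be non-decreasing in x on (0, 1]; theta values below are arbitrary.
def theta_op(x, theta, c):
    return (c + 1) / (theta - math.log(x))

grid = [i / 100 for i in range(1, 101)]          # x in (0, 1]
for theta in (0.5, 1.0, 3.0):
    for c in (0, 1):
        vals = [theta_op(x, theta, c) for x in grid]
        assert all(a <= b for a, b in zip(vals, vals[1:]))
print("condition (32) holds on the grid")
```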

We have two ways to show monotonicity.

- Observe that, under assumption (32), the likelihood (28), as a function of the vector Δ**H**, represents a difference between two concave functions, Σ_{i}*D*_{i} log(Δ*H*_{i}) and −Σ_{ij} log ϑ(*F*_{i}|·, *c*_{ij}). This follows from the fact that Θ is the negative derivative of log ϑ(*F*_{i}|·) with respect to *H*_{i}. Therefore, monotonicity follows from the results of Section 4.1.
- Observe that the likelihood is represented as a QE, *L* = Π QE(Δ*H*^{c}*U*^{c}*F*^{U}) (Section 5.3), and that under assumption (32) the QE operator satisfies the Jensen inequality (Section 4.3). Therefore, the EM proof of monotonicity works.

Convergence of the algorithm under monotonicity follows from the results of Lange *et al.* (2000) and Wu (1983) under fairly general conditions.

As an example we use data from the National Cancer Institute’s ‘Surveillance epidemiology and end results’ programme. Using the publicly available database for the programme, 11621 cases of primary prostate cancer diagnosed in the state of Utah between 1988 and 1999 were identified. The following selection criteria were applied to a total of 19819 Utah cases registered in the database: valid positive survival time, valid stage of the disease and age 18 years or more. Prostate cancer specific survival was analysed by the stage of the disease (localized or regional *versus* distant). For the definition of stages as well as for other details of the data we refer the reader to the documentation for the programme at http://seer.cancer.gov/.

The PH and the PO models with **z** representing two groups corresponding to the localized or regional stage (10765 cases) and distant stage (856 cases) respectively were fitted by using the profile MM algorithm. The log-odds-ratio estimate was −3.251, with 95% asymptotic confidence interval (−3.416, −3.086). A likelihood ratio test showed that the difference between groups is highly significant (*p* < 0.0001). Observed (Kaplan–Meier) and expected model-based estimates of the survival functions by group are shown in Fig. 1. The PO model showed a superior fit to the data.

Fig. 1. Observed (Kaplan–Meier) *versus* expected (model-based) survival curves corresponding to (a) the PH and (b) the PO model fitted to the prostate cancer data

On the basis of the PO model, four different approaches to model fitting are compared.

- *MM or QEM*: the method is described in Section 6. Maximization of the profile likelihood is performed by the Powell method (Press *et al.*, 1994).
- *EM*: the EM method is similar to the EM algorithm as used to fit frailty models with predictor θ(**β**, **z**). Using the QEM formulation, the procedure is as follows.
  - With the current iterates **β**^{(k)} and *F*^{(k)}, compute ${V}_{ij}^{(k)}={\theta}^{-1}({\mathit{\beta}}^{(k)},{\mathbf{z}}_{ij})\,\Theta({F}_{i}^{(k)}|{\mathit{\beta}}^{(k)},{\mathbf{z}}_{ij})$ for each subject *ij*.
  - Maximize the partial likelihood for a PH model with (imputed) predictor ${V}_{ij}^{(k)}\,\theta(\mathit{\beta},{\mathbf{z}}_{ij})$ with respect to **β**. Set **β**^{(k+1)} at the solution.
  - Update the function *F* by using the Nelson–Aalen estimator for the PH model fitted at the previous step. Denote the solution by *F*^{(k+1)}.
  - Set *k* = *k* + 1. Continue iterations until convergence.
- *Parametric*: in the parametric method, the function *F* is specified as a Weibull survival function. The parametric regression model is fitted by using the Powell method applied to all model parameters.
- *Full model (Powell)*: apply the Powell method to maximize the log-likelihood of the full semiparametric model with respect to the joint vector of regression coefficients **β** and the base-line hazard Δ**H**.

Computation of θ, Θ or γ is counted as one operation. For a given tolerance ε, the stopping rule is defined as *l*^{(k+1)} − *l*^{(k)} < ε. If the method required solving a nested numerical problem (MM or EM), the tolerance for the nested problem is specified as ε/100.
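The stopping rule can be expressed as a small driver loop. In the sketch below (the names and the toy step are hypothetical), the outer loop stops when the log-likelihood gain falls below ε, and any nested solver would receive the tighter tolerance ε/100.

```python
# Hypothetical driver illustrating the stopping rule l^(k+1) - l^(k) < eps,
# with nested numerical problems solved to tolerance eps / 100.
def iterate_until(step, l_init, eps):
    l_prev = l_init
    n_ops = 0
    while True:
        l_new, ops = step(eps / 100)   # nested solver gets the tighter tolerance
        n_ops += ops
        if l_new - l_prev < eps:       # stopping rule from the text
            return l_new, n_ops
        l_prev = l_new

# Toy step: the log-likelihood gap to the optimum halves each outer iteration.
state = {"l": -1.0}
def toy_step(inner_tol):
    state["l"] *= 0.5
    return state["l"], 1               # one 'operation' per outer step

l_final, ops = iterate_until(toy_step, -1.0, 1e-6)
print(ops, l_final)                    # 20 outer steps for eps = 1e-6
```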

First, we evaluate the precision-by-operations characteristics of the above numerical methods. The precision is measured by *l** − *l*^{(k)}, where the exact solution *l** was approximated by the solution obtained for ε = 10^{−20}. Shown in Fig. 2 are the precision-by-operations curves for the four methods, obtained by varying ε. It is clear from Fig. 2 that the profile MM algorithm outperforms the other approaches in the number of operations that are required to reach a given precision. The profile MM method is closely followed by the frailty EM algorithm. Fitting the full semiparametric model by the Powell method shows the worst performance. The advantage of the EM-like approaches over methods that draw the function *F* into a conventional maximization is explained by their use of a closed form solution for *F* in the form of the Nelson–Aalen–Breslow estimator. For the same reason, the MM and the EM procedures show a linear trend with increasing dimension, given fixed precision, as shown in Fig. 3. To obtain Fig. 3, samples of size 100–1000 were generated from the parametric (Weibull) PO model fitted to the same data. The MM, EM and full model (Powell) procedures were applied to each such sample. The tolerance ε = 10^{−3} was used for the MM algorithm. The tolerance for the other two methods was tuned to give a likelihood that was as close as possible to, yet smaller than, the likelihood achieved by the MM method (to keep the comparison conservative). The profile MM algorithm shows the most favourable behaviour with increasing dimension, followed by the EM procedure. It comes as no surprise that the full model (Powell) method shows the greatest complexity.

Fig. 2. Precision of likelihood maximization by the number of operations (precision is measured as the difference between the limiting value of the likelihood as operations tend to ∞ and the maximal likelihood value achieved under a stopping rule; curves …

We presented an application of the general MM principle to a class of semiparametric models. Three methods of specifying the surrogate objective function were demonstrated. In particular, we clarified the connection between the likelihood-based MM principle and the imputation-based self-consistency principle that is used in EM algorithms for semiparametric models. To study this connection, we built an EM-like world behind the MM algorithm by using the QEM construction. The approaches were illustrated by using continuous NTMs in a survival analysis context. This is just one example of how these constructions can be used. Discrete survival models, cure models, multivariate semiparametric models, models with time-dependent covariates and many other statistical models can be treated by application of the principles presented in this paper. Construction of surrogate objective functions is not straightforward. Having an option to work a particular problem from both ends (likelihood or convexity *versus* imputation or self-consistency) may increase the chance of finding efficient and general procedures that are applicable to large classes of models.

The author thanks Dr Ken Boucher, Dr Marvin Zelen, the referees and the Joint Editor for many helpful comments. This research is partially supported by National Cancer Institute grant 1U01 CA97414-01 (Cancer Intervention and Surveillance Modeling Network), Department of Defence grant DAMD17-03-1-0034 and National Institute of Ageing grant 1RO1 AG14650-01A2.

- Andersen P, Borgan Ø, Gill R, Keiding N. Statistical Models based on Counting Processes. New York: Springer; 1993.
- Cheng S, Wei L, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82:835–845.
- Cheng S, Wei L, Ying Z. Predicting survival probabilities with semiparametric transformation models. J. Am. Statist. Ass. 1997;92:227–235.
- Clayton D, Cuzick J. Multivariate generalizations of the proportional hazards model (with discussion) J. R. Statist. Soc. A. 1985;148:82–117.
- Cox D. Regression models and life-tables (with discussion) J. R. Statist. Soc. B. 1972;34:187–220.
- Feller W. An Introduction to Probability Theory and Its Applications. New York: Wiley; 1971.
- Fleming T, Lin D. Survival analysis in clinical trials: past developments and future directions. Biometrics. 2000;56:971–983. [PubMed]
- Hougaard P. Life table methods for heterogeneous populations: distributions describing the heterogeneity. Biometrika. 1984;71:75–83.
- Klein J. Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics. 1992;48:795–806. [PubMed]
- Lange K, Hunter D, Yang I. Optimization transfer using surrogate objective functions (with discussion) J. Comput. Graph. Statist. 2000;9:1–59.
- Murphy S. Consistency in a proportional hazards model incorporating a random effect. Ann. Statist. 1994;22:712–731.
- Murphy S. Asymptotic theory for the frailty model. Ann. Statist. 1995;23:182–198.
- Murphy S. On profile likelihood. J. Am. Statist. Ass. 2000;95:449–485.
- Murphy S, Rossini A, van der Vaart A. Maximum likelihood estimation in the proportional odds model. J. Am. Statist. Ass. 1997;92:968–976.
- Murphy S, van der Vaart A. Semiparametric likelihood ratio inference. Ann. Statist. 1997;25:1471–1509.
- Nielsen G, Gill R, Andersen P, Sorensen T. A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Statist. 1992;19:25–43.
- Oakes D. Bivariate survival models induced by frailties. J. Am. Statist. Ass. 1989;84:487–493.
- Parner E. Asymptotic theory for the correlated gamma frailty model. Ann. Statist. 1998;26:183–214.
- Press W, Flannery B, Teukolsky S, Vetterling W. Numerical Recipes in Pascal: the Art of Scientific Computing. New York: Cambridge University Press; 1994.
- van der Vaart A. Asymptotic Statistics. Cambridge: Cambridge University Press; 1998.
- Wassel J, Moeschberger M. A bivariate survival model with modified gamma frailty for assessing the impact of interventions. Statist. Med. 1993;12:241–248. [PubMed]
- Wu C. On the convergence properties of the EM algorithm. Ann. Statist. 1983;11:95–103.
