PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Lifetime Data Anal. Author manuscript; available in PMC 2010 November 26.
Published in final edited form as:
PMCID: PMC2992554
NIHMSID: NIHMS252530

Profile information matrix for nonlinear transformation models

Abstract

For semiparametric models, interval estimation and hypothesis testing based on the information matrix for the full model is a challenge because of potentially unlimited dimension. Use of the profile information matrix for a small set of parameters of interest is an appealing alternative. Existing approaches for the estimation of the profile information matrix are either subject to the curse of dimensionality, or are ad-hoc and approximate and can be unstable and numerically inefficient. We propose a numerically stable and efficient algorithm that delivers an exact observed profile information matrix for regression coefficients for the class of Nonlinear Transformation Models [A. Tsodikov (2003) J R Statist Soc Ser B 65:759–774]. The algorithm deals with the curse of dimensionality and requires neither large matrix inverses nor explicit expressions for the profile surface.

Keywords: Profile likelihood, Semiparametric models, Information matrix

1 Introduction

In semiparametric models the parameter is partitioned as (β, H) with β a low-dimensional parameter of interest and H a high-dimensional nuisance parameter. For example, in semiparametric regression survival models, β is the vector of regression coefficients and H is the baseline cumulative hazard function estimated as a step-function by the Nonparametric Maximum Likelihood Estimator (NPMLE). The dimension of H is given by the number of distinct failure times and increases with the sample size.

Within the NPMLE framework the following tools are available for interval estimation and hypothesis testing for β.

  1. Likelihood Ratio. The likelihood ratio statistic for testing H0 : β = β0 is defined as,
    LR(β0)=2((β^,H^)(β0,H^(β0))),
    where [ell] is the log-likelihood function, ([beta], Ĥ) is the NPMLE of (β, H), and Ĥ(β) is the MLE of H given β. Although classical ML theory does not directly apply to unlimited dimension, for many semiparametric models LR has an asymptotic chi-square distribution with d degrees of freedom, where d is the dimension of β. A (1 − α)% confidence set for β is given by
    {β:LR(β)Cd,α},
    where Cd is the α percentile of the chi-square distribution with d degrees of freedom. When the asymptotic distribution of LR is unknown, bootstrap can be used to approximate Cd.
    The likelihood ratio approach for building confidence regions for β involves inverting the LR surface, which is quite computer intensive as repeated maximizations of the likelihood with respect to H are required.
  2. Wald Statistic. An alternative method of inference for β is based on the Wald statistic defined as
    W(β)=(β^β)Tββ1(β^β),
    where ∑ββ is the β-submatrix of the inverse of the observed information matrix
    I=(2(β,H)ββT2(β,H)βHT2(β,H)HβT2(β,H)HHT)|β=β^,H=H^
    Note that in the presence of nuisance parameters the information matrix needs to be inverted twice (Severini 2000, p. 121), the first time in its high-dimensional full model form I, and the second time as a dim(β)-submatrix of ∑ = I−1.
    Under certain conditions, W is asymptotically equivalent to the likelihood ratio and asymptotically has a chi-square distribution with d degrees of freedom. In this case,
    {β:W(β)Cd,α},
    is a confidence set of approximate coverage probability 1 − α.

The bottleneck of this procedure is the invertion of a potentially infinitely large matrix I.

The two methods of inference on β described above are based on the full model. An appealing alternative is to consider the so-called profile likelihood (Murphy and van der Vaart 2000)

pr(β)=maxH(β,H).

The profile likelihood may be used as a likelihood for β. The MLE for β, the first component of the pair ([beta], Ĥ) that maximizes [ell](β, H), is the maximizer of the profile likelihood function [ell]pr(β).

Theoretical justification for the use of the profile likelihood for semiparametric models was given in (Murphy and van der Vaart 1997, 2000; van der Vaart 1998). It was shown that profile likelihoods with the nuisance parameter estimated out behave like ordinary likelihoods under regularity conditions. These conditions need to be verified on a case-by-case basis as the general theory remains a challenge. Theoretical justification has been obtained for the proportional odds (PO) model (Murphy and van der Vaart 2000; Murphy et al. 1997) and the PH frailty models (Murphy 1994, 1995; Parner 1998; Kosorok et al. 2004).

The observed profile information matrix will be denoted Ipr,

Ipr=2pr(β)ββT|β=β^.

This matrix is asymptotically the same as ββ1, and summarizes partial information on β.

The Likelihood Ratio and Wald statistics based on [ell]pr are easier to obtain than the ones based on the full model provided

  • a numerically efficient method is available to profile out the nuisance parameter H, and
  • it is possible to derive the exact observed profile information matrix or estimate it in a computationally efficient way.

Fulfilling both conditions is a challenge. First, maximization over H is a problem of potentially very large dimension. Second, in most cases [ell]pr cannot be differentiated analytically. Several alternatives for estimating the profile information matrix have been proposed in the literature. However, they are all approximations, often difficult to calibrate in practice, and algorithms to obtain them are computationally costly.

In this paper we propose a computationally efficient exact solution for the class of semiparametric Nonlinear Transformation Models (NTM) (Tsodikov 2003). The basic assumption that defines this model family is that the survival function at each timepoint t is a function of H(t) mapping real numbers [0, ∞] → [0, 1] rather than a functional mapping a functional space to [0,1]. In other words, the model-based survival function is obtained by plugging a cumulative hazard H or a baseline survival function F = exp(− H) into a suitably defined parametric function (so-called model-generating function, see Sect.2). Note that a similar assumption underlies the von-Mises Calculus (van der Vaart 1998, p. 291). The NTM class includes the proportional hazards (PH) model, univariate PH frailty models, the PO model, and cure models such as the PHPH model (Tsodikov 2002; Tsodikov et al. 2003). A numerically efficient Quasi-EM algorithm, a subset of the MM family (Lange et al. 2000) was developed to obtain the maximum profile likelihood for NTM models (Tsodikov 2003). The algorithm has since been used in computer intensive settings such as the bootstrap (Dixon et al. 2005).

The algorithm for the exact Ipr proposed in this paper works under the following two basic assumptions.

Independence of the future. Independence of the future means that the contribution to the likelihood of an observed event at time t depends on the past H[0,t] of the function H, but not on the future.

Nonlinear Transformation Model Assumption. The survival function given covariates is specified as a parametric transformation of H. A detailed definition is given below.

We compare our method to the following three existing techniques used to estimate the profile information matrix that amount to particular forms of numerical differentiation of the second order.

  1. Discretized second derivative. Corollary 3 of Murphy and van der Vaart (2000) shows that under certain conditions
    2logpr(β^+hnνn)logpr(β^)nhn2PνTIprν,
    (1)
    for all sequences νnPνandhnP0 such that (nhn)1=OP(1).
    This result can be used to derive an estimate of Ipr. Note that this method requires careful maintenance of the speed of convergence of the sequence {hn} as the condition (nhn)1=OP(1) implies that the convergence should be neither too slow nor too fast. The reason is that the precision of the discrete differential operator as n → ∞ on the left side of (1) needs to be measured against the convergence of MLEs to the true value. Indeed, under regularity conditions, the asymptotic expansion of the likelihood ratio statistic about the MLE [beta] has the form n(ββ^)TIpr(ββ^)+oP(nββ*+1), where β* is the true value. The procedure (1) is designed to extract the quadratic term by setting β = [beta] + hnνn, and by ensuring the 1/n rate of convergence of β to simultaneously [beta] and β* so that the quadratic term is indeed the dominant one. Otherwise, the expansion would be dominated by its oP(1) part if hn is too fast or by oP(nββ*) if hn is too slow.
    See Sect. 4 for further details and implementation of this method.
  2. Fitting a Quadratic Form. Asymptotically, under regularity conditions, the profile likelihood surface around the true β is quadratic. Nielsen et al. (1992) proposed fitting a quadratic form to [ell]pr(β) in some domain around the maximum likelihood estimator, [beta], and the derivation of an approximate profile information matrix using the estimated coefficients of the form. Note that globally the likelihood surface is not quadratic. The quadratic approach is difficult to implement as a sufficiently small domain around [beta] where the likelihood surface can be well approximated by a quadratic form is not well defined. Misspecification of this domain with the quadratic method often leads to estimates of the profile information matrix that are not positive definite, particularly if the number of covariates is large. Yet the domain needs to be large enough to ensure adequate precision and sufficient sample size representing the number of likelihood evaluations within the domain. This balancing act is notoriously difficult as the true variance is unknown and the likelihood surface is specific to the data set being analyzed.
  3. Numerical Differentiation of the Profile Likelihood. Standard numerical algorithms can be used to numerically differentiate the profile likelihood function. We use Ridder’s method (Press et al. 1994) in the examples presented in Sect. 4. The difficulties in the implementation of this idea are similar to the ones with the Quadratic Form approach. Numerical differentiation requires choosing a tolerance for the estimation of the derivatives, and typically involves interpolation of the function. The precision and speed of these methods are inversely related and they vary widely dependent on the tolerance. Since the likelihood surface is dataset-specific, this method may require calibration and tuning for a particular dataset.

The approximating nature of the standard approaches outlined above, the need to balance various tradeoffs in their implementation, and a likely need to tweak implementation based on the dataset at hand, makes it difficult to develop these approaches to the point of automation sufficient for use in standard statistical software.

The algorithm proposed in this paper is exact, automatic and requires no tuning or calibration. This makes it an attractive alternative, particularly with statistical software applications in mind.

The PO model is used in this paper to compare via simulations the performance of the three estimation methods for Ipr and the proposed exact algorithm. For different sample sizes, the approaches were compared in terms of the number of operations required to achieve a reasonable standardized precision. Naturally, the exact method outperforms any approximating method if an ever better precision is demanded. In our numerical study we focus on practical precisions where approximating methods could nevertheless represent a viable competition to an exact procedure. Numerical efficiency and precision of the computation of Ipr is of great importance for variable selection procedures. In an example involving seven variables, backward variable selection using the Wald statistic based on the exact profile information matrix took less than one third of the time of the quadratic approach. We also compared the estimation methods in terms of relative error. Of all approximating methods, the numerical approach has the smallest relative error.

As a result of these studies we believe that the exact method should be the primary choice for NTM.

2 Nonlinear transformation models

NTM are defined as follows (Tsodikov 2002, 2003).

Definition 1 Let γ(x |β,z) be a parametrically specified distribution function with x-domain of [0,1]. Let F(t) be a nonparametrically specified baseline survival function. A semiparametric regression survival model is called a Nonlinear Transformation Model if, conditional on the covariates z, its survival function G can be represented in the form

G(t|β,z)=γ(F(t)|β,z).
(2)

The function γ is called the NTM-generating function.

Note that F(t) = exp(− H(t)) where H(t) is the baseline cumulative hazard function. With this in mind we can write the hazard function of the model as

λ(t|β,z)=γ(F(t)|β,z)γ(F(t)|β,z)F(t)h(t),
(3)

where h(t) = H′(t) is the baseline hazard function.

In Tsodikov (2003) a Quasi-EM (QEM) point estimation algorithm for the NTM was developed and conditions that ensure its convergence were given.

The algorithm solves a functional self-consistency score equation of the form H = ψ(β, H) for H, where ψ is a mapping that generalizes a Nelson–Aalen–Breslow estimator for the proportional hazards model so that its denominator depends on H as well as β. Functional iterations

H(k+1)=ψ(β,H(k)),  k=1,2,
(4)

are exercised until Ĥ, the fixed-point of ψ, has been approximated, H(k)Ĥ, as k → ∞, see Tsodikov (2003) for details.

Although any parameterization of γ in terms of β and z is allowed, in the examples we assume that γ is parameterized through a set of parameters/predictors θ, η, …, where each predictor is further parameterized using generally different sets of regression coefficients β1, β2, …, so that θ=exp(β1Tz),η=exp(β2Tz),.

2.1 Profile likelihood approach

Let ti, i = 1, …, n be a set of failure times, arranged in ascending order, tn+1 := ∞. Associated with each ti is a set of subjects Di with covariates zij, j [set membership] Di who fail at ti, and a set of subjects Ci with covariates zij, j [set membership] Ci who are censored at time t [set membership][ti, ti+1). The observed event for the subject ij is a triple (ti, zij, cij), where c is a censoring indicator, c = 1 if failure, c = 0 if right censored. Let H be the baseline cumulative hazard, with H(0) = 0. We assume than H(t) is a step function with jumps at the failure times ti, i = 1, …, n. As a step-function, H can be characterized by the vector h = (h1, …, hn), where hi = ΔHi is the jump of H at ti. With this notation, under an NT model (2), (3) and non-informative censoring, the likelihood of survival data takes the form

=i=1nDilog(hi)+i=1nj𝒞i𝒟ilogϑ(Fi|β,zij,cij),

where

ϑ(x|β,z,c)=xccγ(x|β,z)xc,

[partial differential]0γ/[partial differential]x0 = γ, Di is the number of failures associated with ti and

Fi=F(ti)=exp(l=1ihl).

The profile likelihood is defined as a supremum of the full likelihood [ell] taken over the nonparametric part of the model

pr(β)=maxh(β,h).

The MLE of h for a given β will be denoted ĥ(β) = (ĥ1, …, ĥn), with ĥk = ĥk(β), then [ell]pr(β) = [ell](β, ĥ(β)).

Differentiating [ell] with respect to h and setting the score equal to 0 we obtain Ĥ(β) as the solution of the functional self-consistency equation

h^m=Dm(i,j)mΘ(Fi|β,zij,cij),  m=1,,n,
(5)

where Fi is a function of h1, …, hi,

Θ(x|β,z,c)= log ϑ(x|β,z,c)x=c+xγ(c+1)(x|β,z,c)γ(c)(x|β,z,c),
(6)

γ(a) = [partial differential]aγ(x)/[partial differential]xa, and Rm is the set of subjects at risk just prior to tm, Rm = {(i, j) : im, j [set membership] Ci [union or logical sum] Di}.

2.2 Point estimation

Point estimation proceeds along the lines of the following nested procedure,

  • maximize [ell]pr(β) by a conventional nonlinear programming method, for example, the Powell method (Press et al., 1994),
  • for each β demanded in the above maximization procedure, find maxh(β,h) as the fixed point of (5).

The QEM algorithm makes use of the straightforward recursion to obtain the profile likelihood,

hm(k+1)=Dm(i,j)mΘ(exp(l=1ihl(k))|β,zij,cij),=1,2,;m=1,,n,
(7)

where k counts iterations. Note that an increment of k occurs only once all the parameters hi, i = 1, …, n have been updated.

It can be shown that if Θ is nondecreasing in x, each update of H using (7) strictly improves the likelihood, given β. This guarantees convergence of the sequence of likelihood values [ell](β, h(k)) to [ell](β, ĥ(β)) under fairly general conditions (Tsodikov 2003); that is, Ĥ(β) is the fixed point of the recursion given in (7).

It should be noted that the proposed information matrix algorithm is not contingent on using a specific method for point estimation. Yet it builds on the idea of the self-consistency through implicit differentiation of the self-consistency equation.

3 Profile information matrix

The profile information matrix is the observed information matrix derived from the profile likelihood,

Ipr(β)=2pr(β)ββT.

Implicit differentiation of the profile likelihood yields the following expression for the profile information matrix

Ipr(β)=Iββ+hβTIhhhβ+hβTIhβ+IhβThβ+m=1nhmhm,ββ,
(8)

where h = h(β) is some function of β, and

hβ=h(β)β,  Iab=2abT,  hm=(β,h)hm, and hm,ββ=2hm(β)ββT,

with a and b equal to β or h.

When evaluated at the MLE Ĥ(β), where Ĥ is a function defined implicitly as the solution of the score equation

h(β,h)=0h^=h^(β),
(9)

the information matrix simplifies to

Ipr=Iββ+Ih^βTh^β.
(10)

Indeed, by virtue of the score equation (9),

h(β,h^(β))0.
(11)

Differentiating (11) with respect to β, we also have

dh(β,h^(β))dβ=Ih^β+Ih^h^h^β0,
(12)

with (10) now following from (8) on substitution of (11) and (12).

It should be noted, however, that unless the score equation (9) is solved for h exactly, the short form of the observed profile information matrix (10) is generally not going to be symmetric. Except in the Cox model, there is no closed form solution to Ĥ, and this function is an output of a numerical algorithm such as (7) converging to Ĥ with some tolerance. To preserve the symmetry of Ipr, we prefer to keep some of the theoretically redundant terms in (8) and use the form

Ipr(β)=Iββ+h^βIh^h^h^β+h^βIh^β+Ih^βTh^β.
(13)

Notice that Ipr has dimension d × d, d = dim(β). Therefore only a small matrix needs to be inverted in order to get an estimator of the covariance matrix of regression coefficients.

The difficulty in (13) is that since Ĥ(β) is defined implicitly, so is the potentially large Jacobian matrix [partial differential]ĥ/[partial differential]β. Therefore, the Jacobian is generally unavailable in a closed form. The success in the calculation of the profile information matrix is determined by the existence of an efficient numerical method to compute [partial differential]ĥ/[partial differential]β. Generally, computation of [partial differential]ĥ/[partial differential]β is as difficult as taking the inverse of the original full model information matrix (O(n3) operations required), and this derivation defeats the purpose. However, if the functional 𝛝(H, t|·) that defines model contributions to the likelihood depends on (H,t) only through H(t), which is the case for the NT models (2), [partial differential]ĥ/[partial differential]β can be obtained by solving a system of linear equations with a special structure. This specific structure of the linear system can be exploited to derive an efficient numerical solution given in Proposition 1.

We first show how to obtain Iββ, Ihβ and Ihh.

The H-score of an NT model is,

hk=Dkhk(i,j)kΘ(Fi|β,zij,cij).

Differentiating the H-score with respect to β we get,

2hkβm=(i,j)kΘβm(Fi|β,zij,cij).

Evaluation of derivatives of Θ or γ with respect to β depends on the parameterization of the model’s predictor as a function of explanatory variables z, which is model-specific. Once a model is specified, the calculation of Iββ and Ihβ is straightforward.

Since Fi=exp(l=1ihl), we have

Θ(Fi|·)hm={Q(Fi|·),mi,0,m>i,
(14)

where

Q(x|·,c)=xΘ(x|·,c)x=(Θ(x|·,c)c)(Θ(x|·,c+1)Θ(x|·,c)),
(15)

and “·” stands for “β, z”, and Θ is given by (6) extended to c = 0,1,2. Note that [partial differential]Θ(Fi|·)/[partial differential]hm is a constant in m for mi or m > i.

From (14) it follows that,

2hkhm=(i,j)max{k,m}Q(Fi|β,zij,cij)+Dkhk21{k=m},

where

1{k=m}={1,k=m,0,km.

From this we get Ihh.

Now we turn our attention to the Jacobian [partial differential]ĥ/[partial differential]β. Proposition 1 gives the main result used to efficiently calculate [partial differential]ĥ/[partial differential]β in the case of NT models. Its proof is given in the Appendix.

Proposition 1 Let D be an n × n diagonal matrix with diagonal elements di ≠ 0, i = 1, …, n. Let R = (Rkl) be an n × n matrix, with Rkl=i=max{k,l}nai, where ai, i = 1, … n are real numbers. Let b be an n-dimensional vector.

Define the functions [var phi]k: RR, k = 1, …, n recursively as

φn(y)=bndnandny,φk(y)=1dk(bki=knaiy+l=k+1ni=kl1aiφl(y)),=n1,,1,

for y in R. Let [phi] : RR be the function given by φ˜(y)=k=1nφk(y) and let

y˜=φ˜(0)1+φ˜(0)φ˜(1).

Then the solution to the system of equations (D + R)x = b is the n-dimensional vector x = ([var phi]1(), …, [var phi]n())T.

We now show that the Jacobian [partial differential]ĥ/[partial differential]β satisfies a relationship of the form as discussed in Proposition 1. Differentiating the self-consistency equation (5) implicitly, we get that Ĥ satisfies the relationship

h^mβk=h^m2Dm(l=1n(i,j)max{m,l}Q(Fi|β,zij,cij)h^lβk+(i,j)mΘβk(Fi|β,zij,cij)),
(16)

where Q is the function given in (15).

Let D be the diagonal matrix with elements

dm=Dm(h^m)2,m=1,,d.

Let R = (Rml) with Rml=i=max{m,l}nai, where

ai=j𝒞i𝒟iQ(Fi|β,zij,cij),i=1,,n

and for k = 1, …, d let

b(k)=((i,j)1Θ(Fi|β,zij,cij)βk,,(i,j)nΘ(Fi|β,zij,cij)βk)T.

It follows from (16) that

h^βk=D1(Rh^βkb(k)).

Hence,

(R+D)h^βk=b(k).

Therefore, for each k = 1, …, n the vector [partial differential]ĥk can be obtained from Proposition 1. We now have all the components of (13) defined. This completes the exposition of our method.

4 Examples

In the examples we compare the performance of four methods to compute the observed profile information matrix. A brief explanation of the methods and details on how they were implemented in our examples are given below.

  1. Discretized. The estimation is based on the result of Corollary 3 in Murphy and van der Vaart (2000). Under certain conditions
    2logpr(β^+hnνn)logpr(β^)nhn2PνTIprν,
    (17)
    for all sequences νnPνandhnP0such that(nhn)1=OP(1).
    In order to estimate all the elements of Ipr, we chose ν = ei, i = 1, …, d and ν = ei + ej, 1 ≤ i<jd, where ei are Euclidean basis vectors.
    We set νn [equivalent] ν, and hn=10/  (nCk) with C = 1.4 and k such that |f(10/(nCk))f(10/(nCk1))|<0.001, where f(h) is the left-hand side of Eq. (17) considered as a function of h. This procedure was motivated by Dixon et al. (2005) who considered a choice of hn in the one dimensional (d = 1) situation.
  2. Quadratic. This approach approximates the profile likelihood surface by a quadratic form and derives the estimate of the information matrix from the coefficients of the form fitted to the surface. Specifically, let Δβ be a vector of deviations of the β values sampled in the vicinity of [beta], and let Δ[ell]pr be the induced vector of deviations of the profile likelihood from its maximum value, [ell]pr([beta]). Then, if Δβ is sufficiently small
    Δpr12ΔβTIprΔβ.
    Fitting the quadratic form (1/2)ΔβT AΔβ to points (Δβ, Δ[ell]pr) by least squares produces an estimate, Â, of the profile information matrix Ipr.
    In our implementation of this method we limit the domain to points that are not rejected at 0.05 significance level by the LR test (applied informally and disregarding the multi-comparison issue). In other words, points β are included if −2{[ell]pr([beta]) − [ell]pr(β)} ≤ Cd,0.05, where Cd,0.05 is the 0.05th upper tail percentile of the χ2 distribution with d = dim(β) degrees of freedom. Since the validity of the quadratic approximation is itself a prerequisite for the validity of the likelihood ratio statistic, this choice is far from perfect. Yet this procedure would ensure a desired property of the domain shrinking with sample size, and we know of no better alternative.
  3. Numerical. The calculation of the observed profile information matrix is carried on using Ridder’s numerical differentiation of the profile likelihood function, see Press et al. (1994).
    Let f : RR be a differentiable function. By definition, the derivative of f is the limit as h → 0 of the incremental quotient
    q(h)=f(x+h)f(x)h.
    The basic idea of Ridder’s method is to calculate q(h) for several values of h, and then extrapolate the result to the limit h = 0. In the case of a function with domain on R, the vector of first derivatives is obtained by applying the algorithm on each coordinate at a time and leaving the other coordinates fixed.
    Numerical experimentation showed that this approach gives the second derivatives of [ell]pr(β) with very high precision, albeit at a greater computational cost than other methods.
  4. Exact. This is the method developed in Sect. 3 of this paper for computation of the exact observed profile information matrix for NTM.
    PO model will be used as a basis for all our comparisons. The validity of NPMLE and the profile likelihood for this model has been demonstrated elsewhere.

4.1 The PO model

Given covariates z, the survival function G(t|β,z) of a PO model can be written in the form,

G(t|β,z)=G(t|θ(β,z))=θ(β,z)θ(β,z)+H(t),
(18)

where H is some nonparametrically specified baseline cumulative hazard function, and θ is a predictor. Since H = − log F, the NTM-generating function of the PO model is

γ(x|·)=θ(·)θ(·)logx.

A characteristic feature of the PO model is that for any two values, θ1, θ2, of the predictor, the odds ratio

Odds(G(t|θ1))Odds(G(t|θ2))=θ1θ2

is constant in t.

It follows that

Θ(x|·,c)=c+1θ(·)logx.

We consider an exponential parameterization of the predictor θ(β, z) = exp(βTz). With this parameterization,

θβ=θz,  2θββT=θzzT.

The following derivatives of Θ are necessary to specify the algorithm of Sect.3,

Θβ=Θθθz,  2ΘββT=(2Θθ2θ2+Θθθ)zzT,

where

Θ(x|·,c)θ=c+1(θ(·)logx)2,  and  2Θ(x|·,c)θ2=2(c+1)(θ(·)logx)3.

4.2 Real data

As an example, we use data from the National Cancer Institutes Surveillance Epidemiology and End Results (SEER) program. Using the publicly available SEER database, 11,621 cases of primary prostate cancer diagnosed in the state of Utah between 1988 and 1999 were identified. The following selection criteria were applied to a total of 19,819 Utah-cases registered in the database: valid positive survival time, valid stage of the disease, age ≥18 years. Prostate cancer specific survival was analyzed by stage of the disease (localized/regional vs. distant). For the definition of stages as well as for other details of the data we refer the reader to SEER documentation http://www.seer.cancer.gov/.

The data analysis presented in this paper is a continuation of the one given in Tsodikov (2003). Two groups of patients representing stage at diagnosis of the disease are considered, hence the predictor in the PO model has a single parameter β. The log odds ratio β measures the disadvantage of being in the distant stage relative to local/regional stage. The QEM algorithm was applied to fit the PO model to the data. The maximum likelihood estimate of β was [beta] = −3.251. Confidence intervals for β were obtained using the Wald statistic based on the profile information matrix. The confidence interval based on the quadratic approximation of the profile information matrix was (− 3.416, − 3.086) and the one obtained through the exact profile information matrix was (− 3.415, − 3.086). Excellent concordance of the two confidence intervals is due to the large sample size and the small dimension of the regression parameter, a situation when approximating methods tend to be accurate.

In the case of a single parameter, the observed profile information matrix is a scalar. The estimates of the observed profile information matrix were 142.1011, 141.2158 and 141.7424 for the Discretized, Quadratic and Numerical approaches respectively and the Exact value was 141.7423. Although the values were quite similar it is clear that the discretized and quadratic approaches depart from the true value.

4.3 Simulations

4.3.1 Simulations setup

We simulated age at diagnosis for an adult-onset disease using a PO model. The example in its baseline survival is loosely based on incidence of prostate cancer. The baseline survival function was assumed to follow a Weibull distribution with the median of 38 years and shape of 1.8, the risk starting at the age of 18 (a fixed number of 18 years was added to survival time and censoring). With these parameters incidence before the age of 40 is negligible. An independent censoring mechanism was assumed. Censoring times were generated using Weibull distribution with a median of 46 years and shape parameter of 4. Observations in excess of 105 years were type-I censored at 105. Two covariates were introduced, one categorical with 3 levels, and one continuous (a risk factor) with a range between − 1 and 1. Values for both covariates were generated independently. The continuous covariate followed a uniform distribution. The discrete distribution for the categorical covariate assumed the following probabilities, 0.7 (level 1), 0.5 (level 2), and 0.1 (level 3). The following covariate effects were assumed. An effect of the log odds ratio of 2 was assumed for a unit change in the continuous factor. The categorical covariate was assumed to have a progressing effect on the risk of the disease. The log odds ratios comparing level 2 and level 3 to the baseline level 1 were 1.5 and 2.5, respectively.

4.3.2 Speed

To assess the speed of performance of the four methods we calculated the number of operations required to compute the exact information matrix and its approximations. Evaluation of Θ, γ, their analytically specified derivatives or similar comparable procedures were counted as one operation. Figure 1 shows the number of operations by sample size and method. In order to make the performance results comparable, the precision of estimation algorithms was calibrated on an ad-hoc basis so that the relative error of the three methods (Discretized, Quadratic, Numerical) was approximately the same (0.02). For any method A, the relative error was defined as ‖Ipr(A) − Ipr(Exact)‖/‖Ipr(Exact)‖, where Ipr(A) is an estimate of the observed profile information matrix computed using method A, and the norm is defined as the sum of absolute values of all elements of the matrix. Regardless of the sample size, the exact calculation outperformed the approximate methods. Inference based on the discretized second derivative required between 10 and 30 times as many operations as the exact calculation. The quadratic approach required between 60 and 200 times as many operations as the calculation of the exact Ipr matrix. The numerical method was computationally very costly requiring between 600 and 7,000 as many operations as the exact approach. However, the numerical approach behaved better than the other two methods in terms of relative error as shown in Sect.4.3.3.

Fig. 1
Operations by sample size characteristics of four methods of computation of the observed profile information matrix. The Exact method developed in this paper shows the highest numerical efficiency

4.3.3 Precision

A sample of size 500 was used to find the smallest possible relative error of the method when adjusting the different parameters involved on an ad-hoc basis. The best relative error achieved by the Discretized method was 0.01 and 8.13 × 105 operations were required. This number was 0.013 for the quadratic approach with 5.32 × 106 operations required, while the numerical approach achieved a relative error of 8 × 10−7 and required 3.87 × 108 operations. This example shows that the numerical approach is perhaps the only one of the approximating methods that can compete with the exact procedure in terms of precision required in real-life analysis. Its high computational cost, however, makes it a poor choice for variable selection and other procedures requiring repeated evaluations of Ipr.

4.3.4 Statistical properties

Three sets of experiments were performed with samples of size 100, 500, and 1,000. For each sample size, 1,000 simulated samples were generated. The covariance matrices based on Ipr were computed for each sample using the four approaches discussed. The mean and standard deviation of each of the entries of the estimators of covariance matrices under study were estimated from the 1,000 replicates. In addition, point semiparametric MLE estimates of the three parameters entering the profile likelihood (log odds ratios for the continuous factor and level 2 vs. 1 and 3 vs. 1 contrasts) were used to compute the empirical covariance matrix based on 1,000 replicates. A comparison of the estimated means of the entries of the covariance matrices calculated using exact and numerical approaches with the empirical ones were used to evaluate how well these methods estimate the true finite sample variance-covariance. The results are shown in Table 1. Two factors contributed to the distance between the exact and numerical approaches and the empirical one: the finite-sample bias of covariance estimates based on Ipr, and the bias in the estimate of Ipr by an approximating method (this latter bias does not pertain to the exact method). The following basic conclusions are evident from the Table 1.

  1. All methods are much better at estimating variances (left half of the table) than covariances (right half of the table);
  2. The precision of estimation of covariance improves rapidly with sample size;
  3. Under all sample sizes the numerical approach showed excellent concordance with the exact method. This is in agreement with our earlier observation that of all approximating methods, the numerical approach is most precise albeit computationally costly. The quadratic method was the least precise in this analysis;
  4. For reasonably large sample sizes all methods are in good agreement with the empirical estimate.
Table 1
Empirical covariance matrix based on 1,000 replicates and mean covariance matrix and sd based on 1,000 replicates for the four methods studied

5 Discussion

In this paper we have proposed a method to compute the profile information matrix based on implicit differentiation of the self-consistency equation. Computationally the method outperformed all existing approaches to the best of our knowledge. An attractive property of the procedure is that it is exact contingent upon point estimates. Even though exact point estimates are hardly ever available, the precision of variance-covariance estimation is improved as the method does not add any error to the one associated with imprecision of point estimates. Numerically efficient and stable procedures for point estimates have been developed earlier and provide a good complement to this methodology. We recommend the Exact method as a preferred choice with NTM.

Since derivatives of the profile likelihood are defined implicitly, applying the Newton–Raphson method to the profile likelihood for point estimation is a challenge. The Newton–Raphson method typically requires exact inverse Hessian matrix and is not guaranteed to converge if this matrix is approximated. The results of this paper can be used to provide an exact inverse Hessian matrix at any point in the parameter space and thus enable the Newton–Raphson method for use with the profile likelihood.

Acknowledgements

This research is supported by National Cancer Institute grant U01 CA97414, and Department of Defence grant DAMD17-03-1-0034.

Appendix

6.1 Proof of Proposition 1

The equation (D + R)x = b implies

dkxk+l=1nRklxl=bk,  k=1,,n.
(19)

Since Rkl=i=max{k,l}nai, it follows from (19) that for k = 1, …, n,

bk=dkxk+i=knail=1kxl+l=k+1ni=lnaixl=dkxk+i=knail=1kxl+i=k+1nl=k+1iaixl=dkxk+i=knai(l=1nxll=i+lnxl).

The second equality above is a consequence of a change of summation order.

Hence, solving the system of equations (D + R)x = b is equivalent to solving the system

xk=1dk(bki=knaiy+l=k+1ni=kl1aixl),  k=1,,ny=l=1nxl.

Notice that {xk} are in fact functions of y obtained recursively, xk = [var phi]k(y), and y=k=1nφk(y)=φ˜(y). The solution of this system of equations is the vector ([var phi]1(), …, [var phi]n())T where satisfies [phi]() = . Since [var phi]k, k = 1, …, n are linear functions of y, so is [phi]. Hence, [phi](y) − y = ay + b with

a=φ˜(1)1φ˜(0)   and =φ˜(0).

Therefore,

y˜=φ˜(0)1+φ˜(0)φ˜(1).

Contributor Information

A. Tsodikov, Department of Biostatistics, School of Public Health, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109-2029, USA, ude.hcimu@vokidost.

G. Garibotti, Centro Regional Universitario Bariloche, Universidad Nacional del Comahue, Quintral 1250, 8400 Bariloche, Argentina, na.vog.aenc.bac@ttobirag.

References

  • Dixon JR, Kosorok MR, Lee BL. Functional inference in semiparametric models using the piggyback bootstrap. Ann Inst Statist Math. 2005;57:255–277.
  • Kosorok MR, Lee BL, Fine JP. Robust inference for univariate proportional hazards frailty regression models. Ann Statist. 2004;32:1448–1491.
  • Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions (with discussion) J Comput Graph Statist. 2000;9:1–59.
  • Murphy SA. Consistency in a proportional hazards model incorporating a random effect. Ann Statist. 1994;22(2):712–731.
  • Murphy SA. Asymptotic theory for the frailty model. Ann Statist. 1995;23(1):182–198.
  • Murphy SA, van der Vaart AW. Semiparametric likelihood ratio inference. Ann Statist. 1997;25:1471–1509.
  • Murphy SA, van der Vaart AW. On profile likelihood. J Am Statist Assoc. 2000;95:449–485.
  • Murphy SA, Rossini AJ, van der Vaart AW. Maximum likelihood estimation in the proportional odds model. J Am Statist Assoc. 1997;92(439):968–976.
  • Nielsen GG, Gill RD, Andersen PK, Sorensen TI. A counting process approach to maximum likelihood estimation in frailty models. Scand J Statist. 1992;19:25–43.
  • Parner E. Asymptotic theory for the correlated Gamma-frailty model. Ann Statist. 1998;26:183–214.
  • Press WH, Flannery BP, Teukolsky SA, Vetterling WT. The art of scientific computing. New York: Cambridge University Press; 1994. Numerical recipies in Pascal.
  • Severini TA. Likelihood methods in statistics. New York: Oxford University Press; 2000.
  • Tsodikov A. Semiparametric models of long- and short-term survival: an application to the analysis of breast cancer survival in Utah by age and stage. Statist Med. 2002;21:895–920. [PubMed]
  • Tsodikov A. Semiparametric models: a generalized self-consistency approach. J R Statist Soc, Ser B. 2003;65:759–774. [PMC free article] [PubMed]
  • Tsodikov A, Ibrahim JG, Yakovlev AY. Estimating cure rates from survival data: an alternative to two-component mixture models. J Am Statist Assoc. 2003;98:1063–1078. [PMC free article] [PubMed]
  • van der Vaart AW. Cambridge series in statistical and probabilistic mathematics. Cambridge UK: Cambridge University Press; 1998. Asymptotic statistics.