

Test (Madr). Author manuscript; available in PMC 2011 January 6.

Published in final edited form as:

Test (Madr). 2009 May 1; 18(1): 1–43.

doi: 10.1007/s11749-009-0138-x; PMCID: PMC3016756

NIHMSID: NIHMS198968

Joseph G. Ibrahim, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA; Email: ibrahim@bios.unc.edu.


Incomplete data are quite common in biomedical and other types of research, especially in longitudinal studies. During the last three decades, a vast amount of work has been done in the area. This has led, on the one hand, to a rich taxonomy of missing-data concepts, issues, and methods and, on the other hand, to a variety of data-analytic tools. Elements of the taxonomy include: missing data patterns, mechanisms, and modeling frameworks; inferential paradigms; and sensitivity analysis frameworks. These are described in detail. A variety of concrete modeling devices is presented. To fix ideas, two case studies are considered. The first concerns quality of life among breast cancer patients, while the second examines data from the Muscatine children’s obesity study.

In a longitudinal study, each experimental or observational unit is measured at baseline and repeatedly over time. Incomplete data are not unusual under such designs, as many subjects are not available to be measured at all time points. In addition, a subject can be missing at one follow-up time and then measured again at one of the next, resulting in nonmonotone missing data patterns. Such data present a considerable modeling challenge for the statistician.

Rubin (1976) distinguished between three important mechanisms. When missingness is unrelated to the data, it is termed *missing completely at random* (MCAR). When missingness depends on the observed data but, given the observed data, not on the unobserved data, the mechanism is *missing at random* (MAR). A mechanism where missingness depends on the unobserved data, perhaps in addition to the observed data, is termed *missing not at random* (MNAR). In the likelihood and Bayesian paradigms, and when mild regularity conditions are satisfied, the MCAR and MAR mechanisms are *ignorable*, in the sense that inferences can proceed by analyzing the observed data only, without explicitly addressing a (parametric) form of the missing data mechanism. In this situation, MNAR mechanisms are *nonignorable*. Note that frequentist inference is generally ignorable only under MCAR.
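A small simulation makes the three mechanisms concrete. The sketch below is a hypothetical illustration in plain Python: a bivariate response with the two measurements deliberately independent, and arbitrary dropout rules that remove the second measurement under each mechanism.

```python
import random

def make_missing(y1, y2, mechanism, rng):
    """Decide whether the second measurement is observed, according to
    the stated missingness mechanism; return (y1, y2-or-None)."""
    if mechanism == "MCAR":
        drop = rng.random() < 0.3   # ignores the data entirely
    elif mechanism == "MAR":
        drop = y1 > 0               # depends only on the observed y1
    else:                           # MNAR
        drop = y2 > 0               # depends on the value that would
                                    # have been observed
    return (y1, None if drop else y2)

rng = random.Random(1)
pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10000)]
results = {}
for mech in ("MCAR", "MAR", "MNAR"):
    kept = [make_missing(y1, y2, mech, rng)[1] for y1, y2 in pairs]
    kept = [v for v in kept if v is not None]
    results[mech] = sum(kept) / len(kept)
# Since y1 and y2 are independent here, the observed mean of y2 stays
# close to the true mean (zero) under MCAR and MAR; under MNAR it is
# pulled well below zero, because large values are preferentially removed.
```

With correlated measurements, a naive observed-data mean would also be distorted under MAR, but an analysis conditioning on the observed y1 would repair it; only the MNAR case requires a model for the mechanism itself.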

In the ignorable situation, standard longitudinal data software allowing for unbalanced data can be used. Examples include the SAS procedures MIXED, GLIMMIX, and NLMIXED, and the S-Plus and R functions `lme` and `nlme`. Such tools eliminate complete-case bias by incorporating all available information. However, in the nonignorable case, methods that do not model the missing data mechanism are subject to bias.

Ignorable likelihood analyses and appropriate frequentist techniques, such as weighted generalized estimating equations (Robins et al. 1995), provide a versatile framework compared to the collection of simple methods, such as complete case analysis or last observation carried forward. Nevertheless, nonignorable missing data occur very commonly in longitudinal studies. In many cancer and AIDS clinical trials, the side effects of the treatment may affect participation, and missingness can depend on the outcome and the treatment covariate. In quality of life studies, compliance is not compulsory, and those with a poor prognosis may be more likely not to complete the questionnaire at every visit. In environmental studies, geographic location or environmental factors may influence the response. Examples of nonignorable missingness can also be found in longitudinal psychiatric studies (Molenberghs et al. 1997; Little and Wang 1996).

Estimating parameters with nonignorable missing data is complex. Likelihood-based methods require specification of the joint distribution of the data and the missing data mechanism. This specification can be further classified into three types of models: selection, pattern-mixture, and shared-parameter models (Little 1995). The selection approach models the hypothetical complete data together with the missing data process conditional on the hypothetical complete data. The pattern-mixture approach models the distribution of the data conditional on the missing data pattern. Both of these approaches will be discussed in this paper. The third approach, shared-parameter models, accounts for the dependence between the measurement and missingness processes by means of latent variables such as random effects (Wu and Bailey 1988, 1989; Wu and Carroll 1988; Creemers et al. 2009).

There is an enormous literature on missing data methods in longitudinal studies. We refer the reader to the excellent books by Diggle et al. (2002), Fitzmaurice et al. (2004), Verbeke and Molenberghs (2000), Molenberghs and Verbeke (2005), Molenberghs and Kenward (2007), Daniels and Hogan (2008), Fitzmaurice et al. (2008), and the many references therein. Most of the literature focuses on maximum likelihood methods of estimation with nonignorable missing longitudinal data, with emphasis on mixed-effects models and normally distributed outcomes. A substantial part of the literature also assumes monotone patterns of missingness, where sequences of measurements on some subjects simply terminate prematurely. Approaches using selection models include Diggle and Kenward (1994), Little (1995), and Ibrahim et al. (2001). Approaches based on pattern-mixture models include Little (1994, 1995), Little and Wang (1996), Hogan and Laird (1997), and Thijs et al. (2002). Troxel et al. (1998a, 1998b) propose a selection model which is valid for nonmonotone missing data but is intractable for more than three time points. There is less literature, however, on estimating parameters for the class of generalized linear mixed models (GLMM) with nonignorable missing data. Follmann and Wu (1995) consider an extension of the conditional linear model to generalized linear models. Molenberghs et al. (1997) propose a selection model for longitudinal ordinal data with nonrandom dropout. Ekholm and Skinner (1998) discuss a pattern-mixture model for a longitudinal binary, partially incomplete data set. Ibrahim et al. (2001) propose a method for estimating parameters in the GLMM using a selection model with nonignorable missing response data, while Fitzmaurice and Laird (2000) propose a method based on generalized estimating equations for estimating parameters in the GLMM using a mixture model with nonignorable dropouts.

While other methods of estimation with nonignorable nonresponse will be considered briefly, likelihood-based frequentist methods using selection and pattern-mixture models will be the primary focus of this paper. The literature is too vast to review all inference paradigms here, among them multiple imputation, Bayesian methods, and weighted estimating equations. For the class of generalized linear models, Ibrahim et al. (2005) present a detailed overview and comparison of the four main paradigms for handling missing covariate data: (i) maximum likelihood (ML), (ii) multiple imputation (MI), (iii) Bayesian methods, and (iv) weighted estimating equations (WEE).

The remainder of this section motivates the setting with two real longitudinal data sets with likely nonignorable missing data. Section 2 discusses types of missing data in longitudinal studies. Section 3 focuses on estimation in the normal random effects model. Section 4 discusses methods for estimation in the GLMM. Section 5 reviews methods for handling nonignorable missing covariate and/or response data in the GLMM. Shared-parameter models are the topic of Sect. 6. We give a brief discussion of Bayesian methods in Sect. 7 and give some concluding remarks in Sect. 8.

As previously mentioned, many longitudinal studies call for estimation methods that can handle nonignorable missing data, since the possibility that such a mechanism is operating is impossible to rule out. This section presents two common examples to illustrate the problem in more detail.

Consider a data set concerning the quality of life among breast cancer patients in a clinical trial comparing four different chemotherapy regimens conducted by the International Breast Cancer Study Group (IBCSG Trial VI; Ibrahim et al. 2001). The main outcomes of the trial were time until relapse and death, but patients were also asked to complete quality of life questionnaires at baseline and at three-month intervals. Some patients did refuse, on occasion, to complete the questionnaire. However, even when they refused, the patients were asked to complete an assessment at their next follow-up visit. Thus, the structure of this trial resulted in nonmonotone patterns of missing data. One longitudinal quality of life outcome was the patient’s self-assessment of her mood, measured on a continuous scale from 0 (best) to 100 (worst). The three covariates of interest included a dichotomous covariate for language (Italian or Swedish), a continuous covariate for age, and three dichotomous covariates for the treatment regimen (4 regimens). Data from the first 18 months of the study were used, implying that each questionnaire was filled out at most seven times, i.e., at baseline plus at six follow-up visits.

There are 397 observations in the data set, and mood is missing at least one time for 71% of the cases, resulting in 116 (29%) complete cases. The amount of missing data is minimal at baseline (2%) and ranges between 24% and 31% at the other six times: 26.2% at the second, 24.2% at the third, 29% at the fourth, 24.9% at the fifth, 28.2% at the sixth, and 30.5% at the seventh occasion. Table 1 provides a summary of the missing data patterns; the overall fraction of missing measurements is 23.6%. All patients were alive at the end of 18 months, so no missingness is due to death. However, it is reasonable to conjecture that the mood of the patient affected their decision to fill out the questionnaire. In this case, the missingness would be MNAR, and an analysis that does not include the missing data mechanism would be biased. In fact, Ibrahim et al. (2001) show a slight difference in the significance of one of the treatment covariates and the age covariate between their ignorable and nonignorable models.

The Muscatine Coronary Risk Factor Study (MCRFS) was a longitudinal study of coronary risk factors in school children (Woolson and Clarke 1984; Ekholm and Skinner 1998). Five cohorts of children were measured for height and weight in 1977, 1979, and 1981. Relative weight was calculated as the ratio of a child’s observed weight to the median weight for their age-sex-height group. Children with a relative weight greater than 110% of the median weight for their respective stratum were classified as obese. The analysis of this study involves binary data (1 = obese, 0 = not obese) collected at successive time points. For every cohort, each of the following seven response patterns occurs: (*p, p, p*), (*p, p, m*), (*p, m, p*), (*m, p, p*), (*p, m, m*), (*m, p, m*), and (*m, m, p*), where a *p* (*m*) denotes that the child was present (missing) for the corresponding measurement. The distribution over the patterns is shown in Table 2.
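The obesity classification and the p/m pattern coding translate directly into code. A minimal sketch, in which the weights and records are hypothetical rather than taken from the MCRFS data:

```python
def is_obese(weight, median_weight):
    """Relative weight is the child's weight divided by the median weight
    of the age-sex-height stratum; above 110% counts as obese."""
    return weight / median_weight > 1.10

def pattern(responses):
    """Encode the three survey occasions as a p/m string, with None
    marking a missed occasion, e.g. (1, None, 0) -> 'pmp'."""
    return "".join("m" if r is None else "p" for r in responses)

# Hypothetical child: 45 kg against a stratum median of 40 kg.
assert is_obese(45.0, 40.0)          # relative weight 112.5%
assert not is_obese(43.0, 40.0)      # relative weight 107.5%
assert pattern((1, None, 0)) == "pmp"
assert pattern((None, None, 1)) == "mmp"
```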

The statistical problem is to estimate the obesity rate as a function of age and sex. However, as can be seen in Table 2, many data records are incomplete since many children participated in only one or two occasions of the survey. Ekholm and Skinner (1998) report that the two main reasons for nonresponse were: (i) no consent form signed by the parents was received, and (ii) the child was not in school on the day of the examination. If the parent did not sign the consent form because they did not want their child to be labeled as obese, or if the child did not attend school the day of the survey because of their weight, then the missingness would depend on the unobserved outcome and would likely be MNAR. In that case, an analysis that ignores the missing data mechanism would be biased. However, since the outcome is binary, these data cannot be modeled using the normal random effects model. Instead, a general method for estimating parameters for the class of GLMMs with nonignorable missing response data is needed.

We will now formalize the ideas loosely described in the introduction. Methods for handling missing data often depend on the pattern of missingness and the mechanism that generates the missing values. To illustrate the various missingness patterns and mechanisms in a regression setting, consider a data set that consists of a vector of responses *y _{i}* = (*y*_{i1}, …, *y*_{in_i})′ for each of *N* subjects, some components of which may be missing.

Data follow a *monotone missing pattern* if, once a subject misses a measurement occasion, s/he is never observed again. Monotone missing data are also termed dropout. For example, missing values in the vector of responses *y _{i}* take the dropout form if, whenever *y*_{ij} is missing, all later components *y*_{ik}, *k* > *j*, are missing as well.

Data follow a *nonmonotone missing pattern* if at least some subject values are observed again after a missing value occurs. For example, if *y _{i}* contains missing values, they are intermittent, and a missing *y*_{ij} may be followed by an observed *y*_{ik} for some *k* > *j*.
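The two pattern definitions amount to a simple classifier over a subject's response vector. A sketch, with `None` standing in for a missing measurement:

```python
def missing_pattern(y):
    """Classify one subject's response vector as 'complete',
    'monotone' (dropout), or 'nonmonotone' (intermittent)."""
    observed = [v is not None for v in y]
    if all(observed):
        return "complete"
    first_missing = observed.index(False)
    # Dropout: nothing is observed after the first missing occasion.
    if not any(observed[first_missing:]):
        return "monotone"
    return "nonmonotone"

assert missing_pattern([1.2, 0.7, 0.9]) == "complete"
assert missing_pattern([1.2, 0.7, None]) == "monotone"
assert missing_pattern([1.2, None, 0.9]) == "nonmonotone"
```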

We present the mechanisms in accordance with Rubin (1987) and Little and Rubin (2002).

Missing data are *missing completely at random* if the failure to observe a value does not depend on any values of *y _{i}*, either observed or missing, or any other observed values. For example, suppose that some components of *y _{i}* are missing because of a purely chance mechanism, such as an administrative error, unrelated to all observed and unobserved quantities; such missingness is MCAR.

Missing data are *missing at random* if the failure to observe a value does not depend on the values of *y _{i}* which are unobserved, given the observed ones. However, the missingness may depend on other observed values. For example, suppose, as before, that some components of *y _{i}* are missing, but now the decision to skip a measurement occasion depends only on responses recorded at earlier occasions; such missingness is MAR.

The missing data mechanism is said to be *missing not at random* if the failure to observe a value depends on the value that would have been observed. For example, suppose that some components of *y _{i}* are missing but that the probability of missingness depends on those unrecorded values themselves, as when a patient skips a visit because of the very outcome the visit would have measured.

Within the likelihood or Bayesian inferential frameworks, and when the parameters governing the measurement and missingness process are functionally independent, then MCAR and MAR mechanisms are ignorable. However, the frequentist framework generally requires the mechanism to be MCAR for ignorability to apply (Rubin 1976).

The normal random-effects model, also known as the *Laird–Ware model* (Laird and Ware 1982), is a special case of the generalized linear mixed model, which is the subject of the next section. The model is intended for continuous, normally distributed outcomes. Precisely, for a given individual *i* with *n _{i}* repeated measurements, the Laird–Ware model for the *n _{i}* × 1 outcome vector *y _{i}* is

$${y}_{i}={X}_{i}\beta +{Z}_{i}{b}_{i}+{e}_{i},\text{\hspace{1em}}i=1,\dots ,N,$$

(1)

where *y _{i}* is the *n _{i}* × 1 response vector, *X _{i}* is a known *n _{i}* × *p* design matrix, β is the corresponding *p* × 1 vector of fixed effects, *Z _{i}* is a known *n _{i}* × *q* design matrix, *b _{i}* is the corresponding *q* × 1 vector of random effects, and *e _{i}* is an *n _{i}* × 1 vector of within-subject errors. The random effects and errors are assumed independent, with

$${e}_{i}\sim {N}_{{n}_{i}}(0,{\sigma}^{2}{I}_{{n}_{i}}),\text{\hspace{1em}\hspace{1em}}{b}_{i}\sim {N}_{q}(0,D),$$

where *I _{ni}* is the *n _{i}* × *n _{i}* identity matrix and *D* is the *q* × *q* covariance matrix of the random effects. Conditional on the random effects, the model can equivalently be written as

$$({y}_{i}|\beta ,{\sigma}^{2},{b}_{i})\sim {N}_{{n}_{i}}({X}_{i}\beta +{Z}_{i}{b}_{i},{\sigma}^{2}{I}_{{n}_{i}}).$$

(2)

The model in (2) assumes a distinct set of regression coefficients for each individual once the random effects are known. Upon integration over the random effects, the so-called marginal distribution of *y _{i}* is

$$({y}_{i}|\beta ,{\sigma}^{2},D)\sim {N}_{{n}_{i}}({X}_{i}\beta ,{Z}_{i}D{Z}_{i}^{\prime}+{\sigma}^{2}{I}_{{n}_{i}}).$$

(3)
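The marginal covariance in (3) can be formed and checked by simulation. A sketch in NumPy, with arbitrary dimensions and parameter values:

```python
import numpy as np

n_i, q = 4, 2                            # repeated measures, random effects
rng = np.random.default_rng(0)
Z_i = rng.standard_normal((n_i, q))      # random-effects design matrix
D = np.array([[1.0, 0.3], [0.3, 0.5]])   # Cov(b_i)
sigma2 = 0.8                             # residual variance

# Marginal covariance of y_i after integrating out b_i, as in (3):
Sigma_i = Z_i @ D @ Z_i.T + sigma2 * np.eye(n_i)

# Monte Carlo check: draw from the hierarchical model and compare.
m = 200_000
b = rng.multivariate_normal(np.zeros(q), D, size=m)
e = rng.standard_normal((m, n_i)) * np.sqrt(sigma2)
y = b @ Z_i.T + e        # X_i * beta omitted: it shifts the mean only,
                         # not the covariance
emp = np.cov(y, rowvar=False)
max_dev = np.max(np.abs(emp - Sigma_i))  # close to zero
```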

We first describe maximum likelihood estimation in the normal mixed model with no missing response or covariate data. Maximum likelihood (ML) estimation has been extensively considered for the normal random effects model (see, for example, Laird and Ware 1982; Jennrich and Schluchter 1986). The standard approach is to take the first and second derivatives of the log-likelihood based on the marginal distribution of *y _{i}* and use Newton–Raphson (based on the observed information) or Fisher scoring (based on the expected information) methods as the basis for iteratively obtaining the maximum likelihood estimates. Often, a hybrid approach to this iterative method is taken, where the updated value of the covariance parameters (σ^{2}, *D*) is used to compute the fixed effects β in closed form by generalized least squares at each iteration.

The method described here uses the expectation-maximization (EM) algorithm (Dempster et al. 1977) for computing ML estimates. The EM algorithm is advantageous over the Newton–Raphson or Fisher scoring algorithms when formulating models with large numbers of covariance parameters. The procedure consists of two steps. The first step uses weighted least squares ideas to update β, which is equivalent to maximizing the likelihood with respect to β while holding the covariance parameters θ = (σ^{2}, *D*) fixed. In the second step, θ is updated via EM, using *Y* = (*y*_{1}, …, *y _{N}*) as the observed data and *b* = (*b*_{1}, …, *b _{N}*) as the missing data.

Starting out with the first step, the log-likelihood based on the observed data, *Y*, is

$$\begin{array}{rl}\ell (\beta ,{\sigma}^{2},D) & =\text{log}\left[\prod _{i=1}^{N}f({y}_{i}|\beta ,{\sigma}^{2},D)\right]\\ & =-\sum _{i=1}^{N}\frac{{n}_{i}}{2}\text{log}(2\pi )-\frac{1}{2}\sum _{i=1}^{N}\text{log}|{\Sigma}_{i}|-\frac{1}{2}\sum _{i=1}^{N}({y}_{i}-{X}_{i}\beta )^{\prime}{\Sigma}_{i}^{-1}({y}_{i}-{X}_{i}\beta ),\end{array}$$

where ${\Sigma}_{i}={Z}_{i}{\mathit{\text{DZ}}}_{i}^{\prime}+{\sigma}^{2}{I}_{{n}_{i}}$. The score equation for β is given by

$$\frac{d\ell}{d\beta}={\displaystyle \sum _{i=1}^{N}{X}_{i}^{\prime}{\Sigma}_{i}^{-1}{y}_{i}-}{\displaystyle \sum _{i=1}^{N}{X}_{i}^{\prime}{\Sigma}_{i}^{-1}{X}_{i}\beta}.$$

Setting this first derivative equal to zero and solving for β produces the ML estimate,

$$\widehat{\beta}={\left({\displaystyle \sum _{i=1}^{N}{X}_{i}^{\prime}{\Sigma}_{i}^{-1}{X}_{i}}\right)}^{-1}{\displaystyle \sum _{i=1}^{N}{X}_{i}^{\prime}{\Sigma}_{i}^{-1}{y}_{i}}.$$
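This closed-form estimator can be sketched directly. A NumPy version under the assumption that the Σ_i are known, with a hypothetical simulated design:

```python
import numpy as np

def gls_beta(X_list, y_list, Sigma_list):
    """beta_hat = (sum_i X_i' Sigma_i^{-1} X_i)^{-1} sum_i X_i' Sigma_i^{-1} y_i."""
    p = X_list[0].shape[1]
    A, b = np.zeros((p, p)), np.zeros(p)
    for X, y, S in zip(X_list, y_list, Sigma_list):
        SinvX = np.linalg.solve(S, X)   # Sigma_i^{-1} X_i, no explicit inverse
        A += X.T @ SinvX
        b += SinvX.T @ y
    return np.linalg.solve(A, b)

# Simulated check with a known beta and exchangeable covariance:
rng = np.random.default_rng(1)
beta_true = np.array([2.0, -1.0])
X_list, y_list, S_list = [], [], []
for _ in range(50):
    X = np.column_stack([np.ones(5), rng.standard_normal(5)])
    S = 0.5 * np.eye(5) + 0.2           # diagonal 0.7, off-diagonal 0.2
    y = X @ beta_true + rng.multivariate_normal(np.zeros(5), S)
    X_list.append(X); y_list.append(y); S_list.append(S)
beta_hat = gls_beta(X_list, y_list, S_list)   # close to (2, -1)
```

Solving with `np.linalg.solve` rather than explicitly inverting each Σ_i is the standard numerically stable choice.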

The second step uses the complete data log-likelihood given by

$$\begin{array}{rl}\ell (\beta ,{\sigma}^{2},D) & =\text{log}\left[\prod _{i=1}^{N}f({y}_{i},{b}_{i}|\beta ,{\sigma}^{2},D)\right]=\sum _{i=1}^{N}\text{log}[f({y}_{i}|\beta ,{\sigma}^{2},{b}_{i})]+\sum _{i=1}^{N}\text{log}[f({b}_{i}|D)]\\ & =\sum _{i=1}^{N}\left(-\frac{{n}_{i}}{2}\text{log}(2\pi )-\frac{{n}_{i}}{2}\text{log}({\sigma}^{2})-\frac{1}{2{\sigma}^{2}}({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})^{\prime}({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})\right)\\ & \phantom{=}+\sum _{i=1}^{N}\left(-\frac{q}{2}\text{log}(2\pi )-\frac{1}{2}\text{log}|D|-\frac{1}{2}{b}_{i}^{\prime}{D}^{-1}{b}_{i}\right).\end{array}$$

This expression establishes that ${\sum }_{i=1}^{N}({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})^{\prime}({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})\equiv {\sum }_{i=1}^{N}{e}_{i}^{\prime}{e}_{i}$ and ${\sum }_{i=1}^{N}{b}_{i}{b}_{i}^{\prime}$ are the complete data sufficient statistics for σ^{2} and *D*, respectively. The M-step is then given by

$${\widehat{\sigma}}^{2}=\frac{{\displaystyle {\sum}_{i=1}^{N}{e}_{i}^{\prime}{e}_{i}}}{{\displaystyle {\sum}_{i=1}^{N}{n}_{i}}},\text{\hspace{1em}}\widehat{D}={\displaystyle \sum _{i=1}^{N}\frac{{b}_{i}{b}_{i}^{\prime}}{N}},$$

and thus

$${\widehat{\Sigma}}_{i}={Z}_{i}\widehat{D}{Z}_{i}^{\prime}+{\widehat{\sigma}}^{2}{I}_{{n}_{i}}.$$

The E-step consists of calculating the expected value of the sufficient statistics given the observed data and the current parameter estimates:

$$\begin{array}{c}E({b}_{i}{b}_{i}^{\prime}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D})\hfill \\ \text{}=E({b}_{i}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D})E({b}_{i}^{\prime}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D})+\text{Var}({b}_{i}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D})\hfill \\ \text{}=\widehat{D}{Z}_{i}^{\prime}{\widehat{\Sigma}}_{i}^{-1}({y}_{i}-{X}_{i}\widehat{\beta})({y}_{i}-{X}_{i}\widehat{\beta})\prime {\widehat{\Sigma}}_{i}^{-1}{Z}_{i}\widehat{D}+\widehat{D}-\widehat{D}{Z}_{i}^{\prime}{\widehat{\Sigma}}_{i}^{-1}{Z}_{i}\widehat{D},\hfill \\ E({e}_{i}^{\prime}{e}_{i}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D})\hfill \\ \text{}=\text{tr}(E({e}_{i}{e}_{i}^{\prime}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D}))\hfill \\ \text{}=\text{tr}(E({e}_{i}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D})E({e}_{i}^{\prime}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D})+\text{Var}({e}_{i}|{y}_{i},\widehat{\beta},{\widehat{\sigma}}^{2},\widehat{D}))\hfill \\ \text{}=\text{tr}({\widehat{\sigma}}^{4}{\widehat{\Sigma}}_{i}^{-1}({y}_{i}-{X}_{i}\widehat{\beta})({y}_{i}-{X}_{i}\widehat{\beta})\prime {\widehat{\Sigma}}_{i}^{-1}+{\widehat{\sigma}}^{2}{I}_{{n}_{i}}-{\widehat{\sigma}}^{4}{\widehat{\Sigma}}_{i}^{-1}),\hfill \end{array}$$

where *e _{i}* = *y _{i}* − *X _{i}*β − *Z _{i}b _{i}*. The E- and M-steps are iterated until convergence.
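One full iteration of this scheme can be sketched as follows. This is a minimal NumPy illustration for a random-intercept model with simulated data and arbitrary starting values, not the authors' code:

```python
import numpy as np

def em_step(y_list, X_list, Z_list, beta, sigma2, D):
    """One iteration of EM for the Laird-Ware model with fully observed
    responses.  E-step: expected complete-data sufficient statistics
    E[b_i b_i'|y_i] and E[e_i'e_i|y_i].  M-step: closed-form updates of
    sigma^2 and D; beta is refreshed by generalized least squares, as in
    the hybrid scheme described in the text."""
    N, p = len(y_list), len(beta)
    S_bb, S_ee, n_tot = np.zeros_like(D), 0.0, 0
    A, c = np.zeros((p, p)), np.zeros(p)
    for y, X, Z in zip(y_list, X_list, Z_list):
        n_i = len(y)
        Sigma = Z @ D @ Z.T + sigma2 * np.eye(n_i)
        Sinv = np.linalg.inv(Sigma)
        r = y - X @ beta
        Sr = Sinv @ r
        DZ = D @ Z.T
        # E[b b'|y] = D Z'Sinv r r'Sinv Z D + D - D Z'Sinv Z D
        S_bb += DZ @ np.outer(Sr, Sr) @ DZ.T + D - DZ @ Sinv @ DZ.T
        # E[e'e|y] = sigma^4 r'Sinv Sinv r + sigma^2 n_i - sigma^4 tr(Sinv)
        S_ee += sigma2**2 * (Sr @ Sr) + sigma2 * n_i - sigma2**2 * np.trace(Sinv)
        n_tot += n_i
        A += X.T @ Sinv @ X
        c += X.T @ Sinv @ y
    return np.linalg.solve(A, c), S_ee / n_tot, S_bb / N

# Simulated random-intercept data (arbitrary true values):
rng = np.random.default_rng(5)
beta_true, sigma2_true, d_true = np.array([1.0, 0.5]), 1.0, 1.0
y_list, X_list, Z_list = [], [], []
for _ in range(200):
    X = np.column_stack([np.ones(5), rng.standard_normal(5)])
    Z = np.ones((5, 1))
    b = rng.normal(0.0, np.sqrt(d_true))
    y = X @ beta_true + Z[:, 0] * b + rng.normal(0.0, np.sqrt(sigma2_true), 5)
    y_list.append(y); X_list.append(X); Z_list.append(Z)

beta, sigma2, D = np.zeros(2), 2.0, np.array([[2.0]])
for _ in range(50):
    beta, sigma2, D = em_step(y_list, X_list, Z_list, beta, sigma2, D)
```

After a few dozen iterations the estimates settle near the generating values, reflecting the linear but reliable convergence noted below.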

Note that the EM algorithm converges linearly, in contrast to the super-linear convergence of Fisher scoring and even quadratic convergence of Newton–Raphson. However, key advantages of the EM algorithm are that (1) implementation is frequently more straightforward and intuitive and (2) there is a much lower risk of divergence. Sometimes, hybrid algorithms can be used, setting out with EM and then switching to Fisher scoring or Newton–Raphson. Alternatively, EM-acceleration methods can be used (Louis 1982; Meilijson 1989). Such methods are also useful when determining measures of precision.

When the missing data mechanism is MNAR, one needs to specify a (parametric) model for missingness alongside the aforementioned model for the outcomes and incorporate it into the complete data log-likelihood. The missing data mechanism is defined as the distribution of the *n _{i}* × 1 random vector *r _{i}* of missingness indicators, where *r _{ij}* = 1 if *y _{ij}* is missing and *r _{ij}* = 0 otherwise. *Selection models* factor the joint distribution of the outcomes, random effects, and missingness indicators as

$$f({y}_{i},{b}_{i},{r}_{i}|\beta ,{\sigma}^{2},D,\varphi )=f({y}_{i}|\beta ,{\sigma}^{2},{b}_{i})f({b}_{i}|D)f({r}_{i}|\varphi ,{y}_{i}),$$

whereas *pattern-mixture models* employ the reverse factorization

$$f({y}_{i},{b}_{i},{r}_{i}|\beta ,{\sigma}^{2},D,\varphi )=f({y}_{i}|\beta ,{\sigma}^{2},{b}_{i},{r}_{i})f({b}_{i}|D)f({r}_{i}|\varphi ).$$

The term “pattern-mixture” emphasizes that the marginal distribution of $y=({y}_{1}^{\prime},\dots ,{y}_{N}^{\prime})^{\prime}$ is a mixture of pattern-specific distributions. Most estimation methods assume that the distribution of *r _{i}* depends on the measurement process only through the responses *y _{i}*, as in the selection factorization above.

The complete data log-likelihood for the selection model is

$$\ell (\gamma )=\text{log}\left[{\displaystyle \prod _{i=1}^{N}f({y}_{i},{b}_{i},{r}_{i}|\beta ,{\sigma}^{2},D,\varphi )}\right]$$

(4)

$$\begin{array}{c}={\displaystyle \sum _{i=1}^{N}l(\gamma ;{y}_{i},{b}_{i},{r}_{i})}\hfill \\ ={\displaystyle \sum _{i=1}^{N}\text{log}[f({y}_{i}|\beta ,{\sigma}^{2},{b}_{i})]}+{\displaystyle \sum _{i=1}^{N}\text{log}[f({b}_{i}|D)]}+{\displaystyle \sum _{i=1}^{N}\text{log}[f({r}_{i}|\varphi ,{y}_{i})]},\hfill \end{array}$$

(5)

where γ = (β, σ^{2}, *D*, ϕ). Estimation of (β, σ^{2}, *D*) is usually of interest with often, but not always, both the random effects and ϕ being viewed as nuisance parameters. Diggle and Kenward (1994) discuss estimation methods for selection models assuming monotone missing data. However, these methods are not easily extended to the analysis of nonmonotone missing data, where a subject may be observed after a missing value occurs. The method described next, based on the so-called EM by Method of Weights (Ibrahim 1990), is general in that it applies to both monotone and nonmonotone missing data.

For ease of exposition, write *y _{i}* = (*y*_{obs,i}, *y*_{mis,i}), where *y*_{obs,i} collects the observed components of *y _{i}* and *y*_{mis,i} the missing ones. The E-step for subject *i* at the (*t* + 1)th iteration then computes the expected complete data log-likelihood

$$\begin{array}{cc}{Q}_{i}(\gamma |{\gamma}^{(t)})\hfill & =E(l(\gamma ;{y}_{i},{b}_{i},{r}_{i})|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\hfill \\ \hfill & ={\displaystyle \int {\displaystyle \int \text{log}[f({y}_{i}|\beta ,{\sigma}^{2},{b}_{i})]\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({y}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{db}}}_{i}\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}+{\displaystyle \int {\displaystyle \int \text{log}[f({b}_{i}|D)]\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({y}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{db}}}_{i}\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}}+{\displaystyle \int {\displaystyle \int \text{log}[f({r}_{i}|\varphi ,{y}_{i})]\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({y}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{db}}}_{i}\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}}}}}}\hfill \\ \hfill & \equiv {I}_{1}+{I}_{2}+{I}_{3},\hfill \end{array}$$

(6)

where γ^{(t)} = (β^{(t)}, σ^{2(t)}, *D*^{(t)}, ϕ^{(t)}), and *f* (*y*_{mis,i}, *b _{i}*|*y*_{obs,i}, *r _{i}*, γ^{(t)}) is the conditional distribution of the missing data and the random effects given the observed data and the current parameter value.

To integrate out *b _{i}* from *I*_{1} and *I*_{2}, write

$$f({y}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})=f({b}_{i}|{y}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({y}_{\text{mis},i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})$$

and note that standard conditional distribution calculations yield

$$({b}_{i}|{y}_{i},{\gamma}^{(t)})\sim {N}_{q}({b}_{i}^{(t)},{\Sigma}_{i}^{(t)}),$$

where

$$\begin{array}{cc}{b}_{i}^{(t)}\hfill & ={D}^{(t)}{Z}_{i}^{\prime}{({Z}_{i}{D}^{(t)}{Z}_{i}^{\prime}+{\sigma}^{2(t)}{I}_{{n}_{i}})}^{-1}({y}_{i}-{X}_{i}{\beta}^{(t)})\hfill \\ \hfill & ={\Sigma}_{i}^{(t)}{Z}_{i}^{\prime}({y}_{i}-{X}_{i}{\beta}^{(t)})/{\sigma}^{2(t)},\hfill \end{array}$$

and

$$\begin{array}{cc}{\Sigma}_{i}^{(t)}\hfill & ={D}^{(t)}-{D}^{(t)}{Z}_{i}^{\prime}{({Z}_{i}{D}^{(t)}{Z}_{i}^{\prime}+{\sigma}^{2(t)}{I}_{{n}_{i}})}^{-1}{Z}_{i}{D}^{(t)}\hfill \\ \hfill & ={[{\sigma}^{-2(t)}{Z}_{i}^{\prime}{Z}_{i}+{({D}^{(t)})}^{-1}]}^{-1}.\hfill \end{array}$$
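The two pairs of algebraically equivalent expressions above (the second following from the Sherman–Morrison–Woodbury identity) can be verified numerically. A sketch with arbitrary dimensions and parameter values, where `y_minus_Xb` stands in for $y_i - X_i\beta^{(t)}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n_i, q = 5, 2
Z = rng.standard_normal((n_i, q))
D = np.array([[1.0, 0.2], [0.2, 0.7]])
sigma2 = 0.6
y_minus_Xb = rng.standard_normal(n_i)

V = Z @ D @ Z.T + sigma2 * np.eye(n_i)   # marginal covariance of y_i

# First form: posterior covariance via the n_i x n_i marginal covariance.
Sigma_post_1 = D - D @ Z.T @ np.linalg.inv(V) @ Z @ D
# Second form: inverse of the q x q information matrix (Woodbury).
Sigma_post_2 = np.linalg.inv(Z.T @ Z / sigma2 + np.linalg.inv(D))

# Two matching forms of the posterior mean b_i^(t):
b_1 = D @ Z.T @ np.linalg.inv(V) @ y_minus_Xb
b_2 = Sigma_post_2 @ Z.T @ y_minus_Xb / sigma2
```

The second forms are usually cheaper, since they invert *q* × *q* rather than *n _{i}* × *n _{i}* matrices.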

Now, *I*_{1} can be written as

$$\begin{array}{cc}{I}_{1}\hfill & ={\displaystyle \int {\displaystyle \int \text{log}[f({y}_{i}|\beta ,{\sigma}^{2},{b}_{i})]\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({y}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{db}}}_{i}\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}}}\hfill \\ \hfill & =-\frac{{n}_{i}}{2}\text{log}(2\pi )-\frac{{n}_{i}}{2}\text{log}({\sigma}^{2})-{\displaystyle \int \frac{1}{2{\sigma}^{2}}\left({\displaystyle \int ({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})\prime ({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})f({b}_{i}|{y}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{db}}}_{i}}\right)\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({y}_{\text{mis},i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}}.\hfill \end{array}$$

(7)

To evaluate the integral with respect to *b _{i}* in (7), note that

$$({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})\prime ({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i})=({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(t)})\prime ({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(t)})-2({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(t)})\prime {Z}_{i}({b}_{i}-{b}_{i}^{(t)})+({b}_{i}-{b}_{i}^{(t)})\prime ({Z}_{i}^{\prime}{Z}_{i})({b}_{i}-{b}_{i}^{(t)}).$$

(8)

Substituting (8) into (7), we have

$${I}_{1}=-\frac{{n}_{i}}{2}\text{log}(2\pi )-\frac{{n}_{i}}{2}\text{log}({\sigma}^{2})-\frac{1}{2{\sigma}^{2}}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}\left(\text{tr}({Z}_{i}^{\prime}{Z}_{i}{\Sigma}_{i}^{(t)})+{\displaystyle \int ({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(t)})\prime ({y}_{i}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(t)})}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({y}_{\text{mis},i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}\right).$$

(9)

Following similar logic, and upon noting that (*b _{i}*|*y _{i}*, γ^{(t)}) is normal with mean ${b}_{i}^{(t)}$ and covariance ${\Sigma}_{i}^{(t)}$, we obtain

$${I}_{2}=-\frac{q}{2}\text{log}(2\pi )-\frac{1}{2}\text{log}(|D|)-\frac{1}{2}\text{tr}({D}^{-1}{\Sigma}_{i}^{(t)})-\frac{1}{2}{\displaystyle \int ({b}_{i}^{(t)\prime}{D}^{-1}{b}_{i}^{(t)})\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({y}_{\text{mis},i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}}.$$

(10)

Finally, for *I*_{3}, *b _{i}* can be easily integrated out since log[*f*(*r _{i}*|ϕ, *y _{i}*)] does not depend on *b _{i}*, giving

$${I}_{3}={\displaystyle \int \text{log}[f({r}_{i}|\varphi ,{y}_{i})]}\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({y}_{\text{mis},i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{dy}}}_{\text{mis},i}.$$

(11)

The E-step, expressed via (9), (10), and (11), does not involve *b _{i}*. Thus, to complete the E-step, we merely need to sample from [*y*_{mis,i}|*y*_{obs,i}, *r _{i}*, γ^{(t)}]. This conditional density satisfies

$$f({y}_{\text{mis},i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\propto \text{exp}\left(-\frac{1}{2}({y}_{i}-{X}_{i}{\beta}^{(t)})\prime {({Z}_{i}{D}^{(t)}{Z}_{i}^{\prime}+{\sigma}^{2(t)}{I}_{{n}_{i}})}^{-1}({y}_{i}-{X}_{i}{\beta}^{(t)})\right)\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({r}_{i}|{y}_{\text{mis},i},{y}_{\text{obs},i},{\gamma}^{(t)}),$$

(12)

which has the form of a normal density times a logistic regression for the *r _{i}*’s. Thus, the distribution belongs to the class of log-concave densities, and Gibbs sampling from (12) is straightforward, using the adaptive rejection algorithm of Gilks and Wild (1992).
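Gilks and Wild's adaptive rejection sampler exploits this log-concavity; implementing it is somewhat involved, so the sketch below substitutes a plain random-walk Metropolis step (named as such; it is not the paper's algorithm) targeting a one-dimensional density of the same normal-times-logistic form as (12). The missingness model f(r = 1 | y) = expit(φ0 + φ1 y) and all parameter values are hypothetical:

```python
import math
import random

def log_target(y, mu, var, phi0, phi1):
    """Unnormalized log density: N(mu, var) for the missing value times
    the logistic probability that the value is in fact missing (r = 1)."""
    eta = phi0 + phi1 * y
    return -0.5 * (y - mu) ** 2 / var + eta - math.log1p(math.exp(eta))

def rw_metropolis(mu, var, phi0, phi1, n_draws, rng, step=1.0):
    y, draws = mu, []
    for _ in range(n_draws):
        prop = y + rng.gauss(0.0, step)
        delta = (log_target(prop, mu, var, phi0, phi1)
                 - log_target(y, mu, var, phi0, phi1))
        if delta >= 0 or rng.random() < math.exp(delta):
            y = prop
        draws.append(y)
    return draws

rng = random.Random(3)
# phi1 > 0: larger values are more likely to be missing, so draws for a
# missing component concentrate above the unconditional mean mu = 0.
draws = rw_metropolis(mu=0.0, var=1.0, phi0=0.0, phi1=2.0,
                      n_draws=20000, rng=rng)
mean_draw = sum(draws[2000:]) / len(draws[2000:])   # positive
```

Because the target is log-concave, adaptive rejection sampling would draw from it exactly, without the tuning and autocorrelation of a Metropolis chain.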

Precisely, the procedure is as follows. Let *u*_{i1}, …, *u _{imi}* be a sample of size *m _{i}* from *f*(*y*_{mis,i}|*y*_{obs,i}, *r _{i}*, γ^{(t)}), obtained via the Gibbs sampler, and let ${y}_{i}^{(k)}$ denote *y _{i}* with its missing components replaced by *u _{ik}*. For each completed vector, define

$${b}_{i}^{(\mathit{\text{tk}})}={\Sigma}_{i}^{(t)}{Z}_{i}^{\prime}({y}_{i}^{(k)}-{X}_{i}{\beta}^{(t)})/{\sigma}^{2(t)}.$$

Then, the E-step for the *i*th observation at the (*t* + 1)th iteration takes the form

$${Q}_{i}(\gamma |{\gamma}^{(t)})=-\frac{{n}_{i}}{2}\text{log}(2\pi )-\frac{{n}_{i}}{2}\text{log}({\sigma}^{2})-\frac{1}{2{\sigma}^{2}}\left(\text{tr}({Z}_{i}^{\prime}{Z}_{i}{\Sigma}_{i}^{(t)})+\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}({y}_{i}^{(k)}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})\prime ({y}_{i}^{(k)}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})}\right)-\frac{q}{2}\text{log}(2\pi )-\frac{1}{2}\text{log}(|D|)-\frac{1}{2}\text{tr}({D}^{-1}{\Sigma}_{i}^{(t)})-\frac{1}{2{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}{b}_{i}^{(\mathit{\text{tk}})\prime}{D}^{-1}{b}_{i}^{(\mathit{\text{tk}})}}+\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}\text{log}[f({r}_{i}|\varphi ,{y}_{i}^{(k)})]}.$$

Obviously, the E-step for all *N* observations is given by

$$Q(\gamma |{\gamma}^{(t)})={\displaystyle \sum _{i=1}^{N}{Q}_{i}(\gamma |{\gamma}^{(t)})}.$$

Stubbendick and Ibrahim (2003) extend this approach to the problem of nonignorable missing covariates and/or responses in the normal mixed model. Similar MCEM algorithms have been developed for other types of models, such as generalized linear models and survival models by Ibrahim et al. (1999a, 1999b), Chen and Ibrahim (2002), Herring and Ibrahim (2001, 2002), Herring et al. (2002, 2004), Chen and Ibrahim (2006), and Chen et al. (2007).

Let us turn to the M-step, which maximizes *Q*(γ|γ^{(t)}); closed-form updates are available for (β, σ^{2}, *D*). The procedure for the M-step is as follows:

- Find ϕ^{(t+1)} to maximize

$${Q}_{\varphi}={\displaystyle \sum _{i=1}^{N}\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}\text{log}[f({r}_{i}|\varphi ,{y}_{i}^{(k)})]}}.$$

(13)

- Find *D*^{(t+1)} to maximize

$${Q}_{D}=-\frac{N}{2}\text{log}(|D|)-\frac{1}{2}\text{tr}\left({D}^{-1}{\displaystyle \sum _{i=1}^{N}{\Sigma}_{i}^{(t)}}\right)-\frac{1}{2}{\displaystyle \sum _{i=1}^{N}\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}{b}_{i}^{(\mathit{\text{tk}})\prime}{D}^{-1}{b}_{i}^{(\mathit{\text{tk}})}}},$$

(14)

which yields

$${D}^{(t+1)}=\frac{1}{N}{\displaystyle \sum _{i=1}^{N}\left[\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}{b}_{i}^{(\mathit{\text{tk}})}{b}_{i}^{(\mathit{\text{tk}})\prime}+{\Sigma}_{i}^{(t)}}\right]}.$$

- Find β^{(t+1)} to minimize

$${Q}_{\beta}={\displaystyle \sum _{i=1}^{N}\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}({y}_{i}^{(k)}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})\prime \phantom{\rule{thinmathspace}{0ex}}({y}_{i}^{(k)}-{X}_{i}\beta -{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})}),}}$$

(15)

which yields

$${\beta}^{(t+1)}={\left({\displaystyle \sum _{i=1}^{N}{X}_{i}^{\prime}{X}_{i}}\right)}^{-1}{\displaystyle \sum _{i=1}^{N}\left({X}_{i}^{\prime}\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}({y}_{i}^{(k)}-{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})}\right)}.$$

- Find σ^{2(t+1)} to minimize

$${Q}_{{\sigma}^{2}}=\frac{n}{2}\text{log}({\sigma}^{2})+\frac{1}{2{\sigma}^{2}}{\displaystyle \sum _{i=1}^{N}\text{tr}({Z}_{i}^{\prime}{Z}_{i}{\Sigma}_{i}^{(t)})}+\frac{1}{2{\sigma}^{2}}\left({\displaystyle \sum _{i=1}^{N}\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}({y}_{i}^{(k)}-{X}_{i}{\beta}^{(t+1)}-{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})\prime}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}({y}_{i}^{(k)}-{X}_{i}{\beta}^{(t+1)}-{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})}\right),$$

(16)

where $n={\displaystyle {\sum}_{i=1}^{N}{n}_{i}}$, which leads to

$${\sigma}^{2(t+1)}=\frac{1}{n}{\displaystyle \sum _{i=1}^{N}\left(\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}({y}_{i}^{(k)}-{X}_{i}{\beta}^{(t+1)}-{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})\prime \phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}({y}_{i}^{(k)}-{X}_{i}{\beta}^{(t+1)}-{Z}_{i}{b}_{i}^{(\mathit{\text{tk}})})+\text{tr}({Z}_{i}^{\prime}{Z}_{i}{\Sigma}_{i}^{(t)})}\right)}.$$
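To make the updates concrete, the sketch below instantiates this EM scheme for a simulated random-intercept model (*q* = 1, *Z _{i}* a column of ones) with fully observed responses, so that *m _{i}* = 1 and the E-step reduces to the exact posterior mean and variance of *b _{i}*. This is a minimal numpy illustration of the scalar analogues of (14)–(16), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a random-intercept model: y_ij = x_ij * beta + b_i + e_ij
N, ni = 300, 5
beta_true, d_true, s2_true = 2.0, 1.0, 0.5
X = rng.normal(size=(N, ni))                      # one covariate per occasion
b_true = rng.normal(scale=np.sqrt(d_true), size=N)
Y = X * beta_true + b_true[:, None] \
    + rng.normal(scale=np.sqrt(s2_true), size=(N, ni))

beta, d, s2 = 0.0, 1.0, 1.0
for _ in range(200):
    # E-step: posterior variance Sigma_i and mean b_i of the random intercept
    Sig = 1.0 / (ni / s2 + 1.0 / d)               # Sigma_i^(t) (scalar, q = 1)
    bh = Sig * (Y - X * beta).sum(axis=1) / s2    # b_i^(t), as in the text
    # M-step: closed-form updates, scalar analogues of (14)-(16)
    d = np.mean(bh ** 2 + Sig)                                # D update
    beta = (X * (Y - bh[:, None])).sum() / (X ** 2).sum()     # beta update
    resid = Y - X * beta - bh[:, None]
    s2 = np.mean((resid ** 2).sum(axis=1) + ni * Sig) / ni    # sigma^2 update
```

After convergence, (beta, d, s2) approximate the maximum likelihood estimates; with missing responses one would instead average the E-step quantities over *m _{i}* > 1 completed data sets drawn from (12).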

Diggle and Kenward (1994) proposed a binomial model for the missing data mechanism under the selection modeling approach, i.e.,

$$f(r|\varphi ,y)={\displaystyle \prod _{i=1}^{N}{\displaystyle \prod _{j=1}^{{n}_{i}}{[P({r}_{\mathit{\text{ij}}}=1|\varphi )]}^{{r}_{\mathit{\text{ij}}}}\phantom{\rule{thinmathspace}{0ex}}{[(1-P({r}_{\mathit{\text{ij}}}=1|\varphi ))]}^{1-{r}_{\mathit{\text{ij}}}}}},$$

where *P*(*r _{ij}* = 1|ϕ) is modeled via a logistic regression involving all of the previous outcomes and the current outcome. This model takes the form

$$\text{logit}(P({r}_{\mathit{\text{ij}}}=1|\varphi ))\equiv \text{log}\phantom{\rule{thinmathspace}{0ex}}\left[\frac{P({r}_{\mathit{\text{ij}}}=1|\varphi )}{1-P({r}_{\mathit{\text{ij}}}=1|\varphi )}\right]={\varphi}_{0}+{\varphi}_{1}{y}_{\mathit{\text{ij}}}+{\displaystyle \sum _{k=2}^{j}{\varphi}_{k}{y}_{i,j+1-k}}$$

for *i* = 1, …, *N* and *j* = 1, …, *n _{i}*. The model can be extended to permit possible relationships between the missing data process and covariates, including time, by making ϕ_{0} a function of covariates, for example

$${\varphi}_{0}={\displaystyle \sum _{q=1}^{r}{\varphi}_{q0}{x}_{\mathit{\text{qj}}}}.$$

(17)

For example, for the IBCSG data, consider a logistic regression model that includes the previous and current outcomes and treatment as covariates. Such a choice would specialize (17) to

$$\text{logit}(P({r}_{\mathit{\text{ij}}}=1|\varphi ))={\varphi}_{0}+{\varphi}_{1}{y}_{i,j-1}+{\varphi}_{2}{y}_{\mathit{\text{ij}}}+{\varphi}_{3}{\text{trt}}_{\mathit{\text{Ai}}}+{\varphi}_{4}{\text{trt}}_{\mathit{\text{Bi}}}+{\varphi}_{5}{\text{trt}}_{\mathit{\text{Ci}}}$$

for *i* = 1, …, *N* and *j* = 1, …, *n _{i}*, where trt_{Ai}, trt_{Bi}, and trt_{Ci} are indicator variables for the treatment received by subject *i*.
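Evaluating the dropout probability then amounts to inverting the logit. A small sketch with purely illustrative coefficient values (the phi below are hypothetical, not estimates from the IBCSG data):

```python
import math

def dropout_prob(phi, y_prev, y_curr, trt):
    """P(r_ij = 1) under the logistic dropout model above.

    phi: (phi0, ..., phi5); trt: indicator triple (trt_A, trt_B, trt_C)."""
    eta = (phi[0] + phi[1] * y_prev + phi[2] * y_curr
           + phi[3] * trt[0] + phi[4] * trt[1] + phi[5] * trt[2])
    return 1.0 / (1.0 + math.exp(-eta))

phi = (-2.0, 0.1, 0.8, 0.3, 0.0, -0.2)   # hypothetical values
p_low = dropout_prob(phi, y_prev=1.0, y_curr=0.0, trt=(1, 0, 0))
p_high = dropout_prob(phi, y_prev=1.0, y_curr=2.0, trt=(1, 0, 0))
```

With ϕ_{2} > 0, a larger current (possibly unobserved) outcome raises the dropout probability, which is exactly what makes the mechanism nonignorable.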

A more general multinomial missing data model which incorporates a general correlation structure can be constructed by specifying the joint distribution of *r _{i}* = (*r*_{i1}, …, ${r}_{i{n}_{i}}$)′ as a product of one-dimensional conditional distributions:

$$f({r}_{i1},\dots ,{r}_{i{n}_{i}}|\varphi ,{y}_{i})=f({r}_{i{n}_{i}}|{\varphi}_{{n}_{i}},{r}_{i1},\dots ,{r}_{i,{n}_{i}-1},{y}_{i})\cdot f({r}_{i,{n}_{i}-1}|{\varphi}_{{n}_{i}-1},{r}_{i1},\dots ,{r}_{i,{n}_{i}-2},{y}_{i})\phantom{\rule{thinmathspace}{0ex}}\cdots \phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({r}_{i2}|{\varphi}_{2},{r}_{i1},{y}_{i})\cdot f({r}_{i1}|{\varphi}_{1},{y}_{i}),$$

(18)

where ϕ_{a} is a vector of indexing parameters for the *a*th conditional distribution and $\varphi =({\varphi}_{{n}_{1}}^{\prime},\dots ,{\varphi}_{{n}_{N}}^{\prime})\prime $. Thus,

$$f(r|\varphi ,y)={\displaystyle \prod _{i=1}^{N}f({r}_{i1},\dots ,{r}_{{\mathit{\text{in}}}_{i}}|\varphi ,{y}_{i})},$$

where *f*(*r*_{i1}, …, ${r}_{i{n}_{i}}$|ϕ, *y _{i}*) is given by (18).
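Because (18) builds the joint distribution from one-dimensional conditionals, any valid choice of the conditional models yields a proper distribution over the 2^{n_i} possible missingness sequences. A quick numerical check, using a hypothetical logistic conditional that depends on the history through the number of earlier missed visits:

```python
import itertools
import math

def cond_prob(n_prev_missed, j):
    """Hypothetical P(r_ij = 1 | history): logistic in the count of earlier 1's."""
    eta = -1.0 + 0.5 * n_prev_missed + 0.1 * j
    return 1.0 / (1.0 + math.exp(-eta))

n_i = 4
total = 0.0
for pattern in itertools.product([0, 1], repeat=n_i):
    p = 1.0
    for j, r in enumerate(pattern):
        pj = cond_prob(sum(pattern[:j]), j)
        p *= pj if r == 1 else 1.0 - pj
    total += p   # probabilities over all 2^{n_i} patterns must sum to 1
```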

Also, it has been assumed throughout that [*r _{i}*|ϕ, *y _{i}*] depends directly on the responses *y _{i}*. An alternative is to let the missing data mechanism depend on the responses only through their conditional mean, *E*(*y _{i}*|β, σ^{2}, *b _{i}*), in which case

$$f({r}_{i1},\dots ,{r}_{i{n}_{i}}|\varphi ,\beta ,{\sigma}^{2},{b}_{i})=f({r}_{i{n}_{i}}|{\varphi}_{{n}_{i}},{r}_{i1},\dots ,{r}_{i,{n}_{i}-1},E({y}_{i}|\beta ,{\sigma}^{2},{b}_{i}))\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({r}_{i,{n}_{i}-1}|{\varphi}_{{n}_{i}-1},{r}_{i1},\dots ,{r}_{i,{n}_{i}-2},E({y}_{i}|\beta ,{\sigma}^{2},{b}_{i}))\phantom{\rule{thinmathspace}{0ex}}\cdots \phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({r}_{i2}|{\varphi}_{2},{r}_{i1},E({y}_{i}|\beta ,{\sigma}^{2},{b}_{i}))\phantom{\rule{thinmathspace}{0ex}}f\phantom{\rule{thinmathspace}{0ex}}({r}_{i1}|{\varphi}_{1},E({y}_{i}|\beta ,{\sigma}^{2},{b}_{i})).$$

Other innovations for the normal mixed model include Lipsitz et al. (2000), who consider Box–Cox transformations on the response variable in the presence of missing data, and Lipsitz et al. (2002), who consider missing data mechanisms based on outcome-dependent follow-up. Lipsitz et al. (2002) extend their method to longitudinal binary outcome data in Fitzmaurice et al. (2006).

Pattern-mixture models are based on an alternative factorization of *f*(*y _{i}*, *b _{i}*, *r _{i}*|γ), in which the response distribution is specified conditionally on the missing data pattern and the pattern probabilities are modeled marginally. The log-likelihood becomes

$$\begin{array}{cc}\ell (\gamma )\hfill & =\text{log}\phantom{\rule{thinmathspace}{0ex}}\left[{\displaystyle \prod _{i=1}^{N}f({y}_{i},{b}_{i},{r}_{i}|\beta ,{\sigma}^{2},D,\varphi )}\right]\hfill \\ \hfill & ={\displaystyle \sum _{i=1}^{N}l\phantom{\rule{thinmathspace}{0ex}}(\gamma ;{y}_{i},{b}_{i},{r}_{i})}\hfill \\ \hfill & ={\displaystyle \sum _{i=1}^{N}\text{log}\phantom{\rule{thinmathspace}{0ex}}[f({y}_{i}|\beta ,{\sigma}^{2},{b}_{i},{r}_{i})]}+{\displaystyle \sum _{i=1}^{N}\text{log}\phantom{\rule{thinmathspace}{0ex}}[f({b}_{i}|D)]}+{\displaystyle \sum _{i=1}^{N}\text{log}\phantom{\rule{thinmathspace}{0ex}}[f({r}_{i}|\varphi )]},\hfill \end{array}$$

where γ = (β, σ^{2}, *D*, ϕ). Since the distribution of *y _{i}* depends on *r _{i}*, a different set of response-model parameters is allowed within each missing data pattern.

Recall the Laird–Ware model for the outcome vector *y _{i}* in (1). If the fixed parameters are allowed to differ across the missing data patterns indexed by *r _{i}* = *k*, then

$$({y}_{i}|\beta ,{\sigma}^{2},{b}_{i},{r}_{i}=k)\sim {N}_{{n}_{i}}({X}_{i}{\beta}^{(k)}+{Z}_{i}{b}_{i},{\sigma}^{2(k)}{I}_{{n}_{i}})$$

under the pattern-mixture factorization. The specification of distinct fixed parameters, (β^{(k)}, σ^{2(k)}), creates major identification problems because not all parameters of the complete data distribution of *y _{i}* are estimable from the incomplete patterns. However, assumptions about the missing data mechanism can yield additional restrictions that help to identify the models, while avoiding the explicit parametric specification of the mechanism that the selection model approach requires. See also Verbeke and Molenberghs (2000) and Molenberghs and Verbeke (2005) for reviews.

To illustrate this idea, consider the analysis presented in Little and Wang (1996), in which pattern-mixture models are developed for a multivariate multiple regression. Suppose that we have a sample of *N* independent observations on *p* continuous outcome variables and *q* covariates, so that *y _{i}* = (*y*′_{(1)i}, *y*′_{(2)i})′, where the *p*_{1} × 1 subvector *y*_{(1)i} is always observed and the *p*_{2} × 1 subvector *y*_{(2)i} is observed when *r _{i}* = 0 and missing when *r _{i}* = 1 (*p* = *p*_{1} + *p*_{2}). The model is

$$\begin{array}{c}({y}_{i}|\beta ,\Sigma ,{r}_{i}=k)\sim {N}_{p}\left({\beta}^{(k)}{x}_{i},{\Sigma}^{(k)}\right),\text{\hspace{1em}}k=0,1;\hfill \\ {r}_{i}|\varphi \sim \text{Bernoulli}\phantom{\rule{thinmathspace}{0ex}}\left(\frac{{e}^{\varphi \prime {x}_{i}}}{1+{e}^{\varphi \prime {x}_{i}}}\right)\Rightarrow \text{logit}\left[P({r}_{i}=1|\varphi )\right]=\varphi \prime {x}_{i},\hfill \end{array}$$

(19)

where *y _{i}* is *p* × 1, *x _{i}* is *q* × 1, β^{(k)} is a *p* × *q* matrix of regression coefficients for pattern *k*, and Σ^{(k)} is the corresponding *p* × *p* covariance matrix.

The important step in developing pattern-mixture models is in making an assumption about the missing data mechanism. Suppose that

$$P({r}_{i}=1|{y}_{(1)i},{y}_{(2)i},{x}_{i})=g({y}_{(2)i},{x}_{i}),$$

(20)

where *g* is an arbitrary function of *y*_{(2)i} and *x _{i}*. Since missingness depends on the value of the possibly missing variable *y*_{(2)i}, the mechanism is nonignorable. Within pattern *k*, factor the joint distribution of the outcomes as

$$f({y}_{(1)i},{y}_{(2)i}|{\theta}^{(k)},{x}_{i},{r}_{i}=k)=f({y}_{(2)i}|{\theta}_{2}^{(k)},{x}_{i},{r}_{i}=k)f({y}_{(1)i}|{\theta}_{1\xb72}^{(k)},{y}_{(2)i},{x}_{i},{r}_{i}=k),$$

where ${\theta}_{2}^{(k)}=({\beta}_{x:2\xb7x}^{(k)},{\Sigma}_{22\xb7x}^{(k)})$ consists of the (*p*_{2} × *q*) regression coefficient matrix and (*p*_{2} × *p*_{2}) residual covariance matrix for the regression of *y*_{(2)i} on *x _{i}* within pattern *k*, and ${\theta}_{1\xb72}^{(k)}$ contains the corresponding parameters for the regression of *y*_{(1)i} on *y*_{(2)i} and *x _{i}* within pattern *k*.

Note that assumption (20) states that [*r _{i}*|*y*_{(1)i}, *y*_{(2)i}, *x _{i}*] does not depend on *y*_{(1)i}. Consequently, the conditional distribution of *y*_{(1)i} given *y*_{(2)i} and *x _{i}* is the same in both patterns; that is,

$${\theta}_{1\xb72}^{(0)}={\theta}_{1\xb72}^{(1)}={\theta}_{1\xb72}.$$

(21)

This yields ${p}_{1}({p}_{2}+q)+\frac{{p}_{1}({p}_{1}+1)}{2}$ restrictions that help to identify the model, and likelihood inference now depends on the relative sizes of *p*_{1} and *p*_{2}.
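The counting argument can be verified mechanically. Using the parameter totals stated for model (19) in this section, the sketch below tallies total, identifiable, and restricted parameters and confirms that the model is just identified when *p*_{1} = *p*_{2} and overidentified when *p*_{1} > *p*_{2}:

```python
def pattern_mixture_counts(p1, p2, q):
    """Parameter counts for the two-pattern model (19) under restriction (21)."""
    total = 2 * p1 * q + p1 * (p1 + 1) + 2 * p2 * (p1 + q) + p2 * (p2 + 1)
    identified = (2 * p1 * q + p1 * (p1 + 1) + p2 * (p1 + q)
                  + p2 * (p2 + 1) // 2)
    restrictions = p1 * (p2 + q) + p1 * (p1 + 1) // 2
    return total, identified, restrictions

tot, ident, restr = pattern_mixture_counts(p1=3, p2=3, q=2)     # just identified
tot2, ident2, restr2 = pattern_mixture_counts(p1=2, p2=1, q=2)  # overidentified
```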

The log-likelihood of θ for the model in (19) is

$$\begin{array}{cc}\ell \left({\theta}^{(0)},{\theta}^{(1)}\right)\hfill & =\text{log}\phantom{\rule{thinmathspace}{0ex}}\left[{\displaystyle \prod _{i=1}^{N}f({y}_{i}|\beta ,\Sigma ,{r}_{i}=k)}\right]\hfill \\ \hfill & =-\frac{{N}_{0}{p}_{1}}{2}\text{log}(2\pi )-\frac{{N}_{0}}{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}\left|{\Sigma}_{11\xb7x}^{(0)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{0}}({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(0)}{x}_{i})\prime {\displaystyle {\Sigma}_{11\xb7x}^{(0)-1}\left({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(0)}{x}_{i}\right)}}-\frac{{N}_{1}{p}_{1}}{2}\text{log}(2\pi )-\frac{{N}_{1}}{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}\left|{\Sigma}_{11\xb7x}^{(1)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{1}}\left({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(1)}{x}_{i}\right)\prime {\displaystyle {\Sigma}_{11\xb7x}^{(1)-1}\left({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(1)}{x}_{i}\right)}}-\frac{{N}_{0}{p}_{2}}{2}\text{log}(2\pi )-\frac{{N}_{0}}{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}\left|{\Sigma}_{22\xb71x}^{(0)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{0}}({y}_{(2)i}-{\beta}_{1:2\xb71x}^{(0)}{y}_{(1)i}-{\beta}_{x:2\xb71x}^{(0)}{x}_{i})\prime {\Sigma}_{22\xb71x}^{(0)-1}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}\left({y}_{(2)i}-{\beta}_{1:2\xb71x}^{(0)}{y}_{(1)i}-{\beta}_{x:2\xb71x}^{(0)}{x}_{i}\right)}.\hfill \end{array}$$

The model has 2*p*_{1}*q* + *p*_{1}(*p*_{1} + 1) + 2*p*_{2}(*p*_{1} + *q*) + *p*_{2}(*p*_{2} + 1) parameters, but only $2{p}_{1}q+{p}_{1}({p}_{1}+1)+{p}_{2}({p}_{1}+q)+\frac{{p}_{2}({p}_{2}+1)}{2}$ can be identified from the data, namely

$${\theta}_{\mathit{\text{id}}}=\left({\beta}_{x:1\xb7x}^{(0)},{\Sigma}_{11\xb7x}^{(0)},{\beta}_{x:1\xb7x}^{(1)},{\Sigma}_{11\xb7x}^{(1)},{\beta}_{1:2\xb71x}^{(0)},{\beta}_{x:2\xb71x}^{(0)},{\Sigma}_{22\xb71x}^{(0)}\right).$$

If *p*_{1} = *p*_{2}, then the number of restrictions in (21) equals the number of unidentified parameters, and the model is just identified. Maximum likelihood (ML) estimates for the identified parameters are obtained by standard complete data methods, namely two multivariate regressions of *y*_{(1)} on *x* and one multivariate regression of *y*_{(2)} on *y*_{(1)} and *x*. For example,

$$\begin{array}{c}{\widehat{\beta}}_{x:1\xb7x}^{(0)}={\left(X\prime X\right)}^{-1}X\prime Y,\hfill \\ {\widehat{\Sigma}}_{11\xb7x}^{(0)}=\frac{1}{{N}_{0}}\left(Y-X{\widehat{\beta}}_{x:1\xb7x}^{(0)}\right)\prime \phantom{\rule{thinmathspace}{0ex}}\left(Y-X{\widehat{\beta}}_{x:1\xb7x}^{(0)}\right),\hfill \end{array}$$

where *Y* is an *N*_{0} × *p*_{1} matrix of responses, and *X* is an *N*_{0} × *q* matrix of covariates. The estimates of interest, however, are the parameters of [*y*_{(1)i}, *y*_{(2)i}|*x _{i}*], averaged over patterns; these are recovered from the pattern-specific estimates as

$$\begin{array}{c}{\widehat{\beta}}_{x:1\xb7x}=(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}{\widehat{\beta}}_{x:1\xb7x}^{(0)}+{\widehat{p}}_{x}{\widehat{\beta}}_{x:1\xb7x}^{(1)},\hfill \\ \text{}{\widehat{\Sigma}}_{11\xb7x}=(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}{\widehat{\Sigma}}_{11\xb7x}^{(0)}+{\widehat{p}}_{x}\phantom{\rule{thinmathspace}{0ex}}{\widehat{\Sigma}}_{11\xb7x}^{(1)}+{\widehat{p}}_{x}(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}\left({\widehat{\beta}}_{x:1\xb7x}^{(0)}-{\widehat{\beta}}_{x:1\xb7x}^{(1)}\right)\mathit{\text{xx}}\prime \phantom{\rule{thinmathspace}{0ex}}\left({\widehat{\beta}}_{x:1\xb7x}^{(0)}-{\widehat{\beta}}_{x:1\xb7x}^{(1)}\right)\prime ,\hfill \\ {\widehat{\beta}}_{x:2\xb7x}={\widehat{\beta}}_{x:2\xb7x}^{(0)}+{\left({\widehat{\beta}}_{2:1\xb72x}^{(0)}\right)}^{-1}\left({\widehat{\beta}}_{x:1\xb7x}-{\widehat{\beta}}_{x:1\xb7x}^{(0)}\right),\hfill \\ \text{}{\widehat{\Sigma}}_{22\xb7x}={\widehat{\Sigma}}_{22\xb7x}^{(0)}+{\left({\widehat{\beta}}_{2:1\xb72x}^{(0)}\right)}^{-1}\left({\widehat{\Sigma}}_{11\xb7x}-{\widehat{\Sigma}}_{11\xb7x}^{(0)}\right){\phantom{\rule{thinmathspace}{0ex}}\left({\widehat{\beta}}_{2:1\xb72x}^{(0)\prime}\right)}^{-1},\hfill \\ \text{}{\widehat{\Sigma}}_{21\xb7x}={\widehat{\Sigma}}_{21\xb7x}^{(0)}+{\left({\widehat{\beta}}_{2:1\xb72x}^{(0)}\right)}^{-1}\left({\widehat{\Sigma}}_{11\xb7x}-{\widehat{\Sigma}}_{11\xb7x}^{(0)}\right),\hfill \end{array}$$

where ${\widehat{p}}_{x}={e}^{\widehat{\varphi}\prime x}/(1+{e}^{\widehat{\varphi}\prime x})$ is the estimated probability that *r _{i}* = 1 for a subject with covariates *x*, obtained from the logistic regression in (19).
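The formula for ${\widehat{\Sigma}}_{11\xb7x}$ is just the law of total variance for a two-component mixture: the within-pattern covariances plus a between-pattern spread term. A scalar numerical check under hypothetical pattern moments:

```python
p = 0.3              # hypothetical P(r_i = 1 | x)
m0, m1 = 1.0, 2.5    # pattern-specific means (beta^(k) x for a fixed x)
v0, v1 = 0.8, 1.2    # pattern-specific variances

mean = (1 - p) * m0 + p * m1
# direct computation via the second moment of the mixture
ex2 = (1 - p) * (v0 + m0 ** 2) + p * (v1 + m1 ** 2)
var_direct = ex2 - mean ** 2
# the within-plus-between form used in the text
var_formula = (1 - p) * v0 + p * v1 + p * (1 - p) * (m0 - m1) ** 2
```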

If *p*_{1} > *p*_{2}, then the number of restrictions in (21) exceeds the number of unidentified parameters, and the model is overidentified. Explicit ML estimates cannot be obtained, and an iterative method such as the EM algorithm is required. The complete data log-likelihood of θ is

$$\begin{array}{cc}\ell \left({\theta}^{(0)},{\theta}^{(1)}\right)\hfill & =\text{log}\phantom{\rule{thinmathspace}{0ex}}\left[{\displaystyle \prod _{i=1}^{N}f({y}_{i}|\beta ,\Sigma ,{r}_{i}=k)}\right]\hfill \\ \hfill & =-\frac{{N}_{0}{p}_{1}}{2}\text{log}(2\pi )-\frac{{N}_{0}}{2}\text{log}\left|{\Sigma}_{11\xb7x}^{(0)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{0}}\left({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(0)}{x}_{i}\right)\prime {\Sigma}_{11\xb7x}^{(0)-1}\left({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(0)}{x}_{i}\right)}\hfill \\ \hfill & \phantom{=}-\frac{{N}_{1}{p}_{1}}{2}\text{log}(2\pi )-\frac{{N}_{1}}{2}\text{log}\left|{\Sigma}_{11\xb7x}^{(1)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{1}}\left({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(1)}{x}_{i}\right)\prime {\Sigma}_{11\xb7x}^{(1)-1}\left({y}_{(1)i}-{\beta}_{x:1\xb7x}^{(1)}{x}_{i}\right)}\hfill \\ \hfill & \phantom{=}-\frac{{N}_{0}{p}_{2}}{2}\text{log}(2\pi )-\frac{{N}_{0}}{2}\text{log}\left|{\Sigma}_{22\xb71x}^{(0)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{0}}\left({y}_{(2)i}-{\beta}_{1:2\xb71x}^{(0)}{y}_{(1)i}-{\beta}_{x:2\xb71x}^{(0)}{x}_{i}\right)\prime {\Sigma}_{22\xb71x}^{(0)-1}\left({y}_{(2)i}-{\beta}_{1:2\xb71x}^{(0)}{y}_{(1)i}-{\beta}_{x:2\xb71x}^{(0)}{x}_{i}\right)}\hfill \\ \hfill & \phantom{=}-\frac{{N}_{1}{p}_{2}}{2}\text{log}(2\pi )-\frac{{N}_{1}}{2}\text{log}\left|{\Sigma}_{22\xb71x}^{(1)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{1}}\left({y}_{(2)i}-{\beta}_{1:2\xb71x}^{(1)}{y}_{(1)i}-{\beta}_{x:2\xb71x}^{(1)}{x}_{i}\right)\prime {\Sigma}_{22\xb71x}^{(1)-1}\left({y}_{(2)i}-{\beta}_{1:2\xb71x}^{(1)}{y}_{(1)i}-{\beta}_{x:2\xb71x}^{(1)}{x}_{i}\right)}.\hfill \end{array}$$

From this it can be seen that the complete data sufficient statistics involving missing data are

$${S}_{12}^{(1)}={\displaystyle \sum _{i=1}^{{N}_{1}}{y}_{(1)i}{y}_{(2)i}^{\prime}},\text{\hspace{1em}\hspace{1em}}{S}_{x2}^{(1)}={\displaystyle \sum _{i=1}^{{N}_{1}}{x}_{i}{y}_{(2)i}^{\prime}},\text{\hspace{1em}\hspace{1em}}{S}_{22}^{(1)}={\displaystyle \sum _{i=1}^{{N}_{1}}{y}_{(2)i}{y}_{(2)i}^{\prime}}.$$

The E-step at each iteration replaces these statistics by their expected values given the observed data and current parameter estimates, which can be calculated from the first and second moments of *y*_{(2)i}:

$$\begin{array}{c}\text{}E\phantom{\rule{thinmathspace}{0ex}}\left({y}_{(2)i}|\widehat{\beta},\widehat{\Sigma},{y}_{(1)i},{x}_{i},{r}_{i}=1\right)={\widehat{\beta}}_{1:2\xb71x}^{(1)}{y}_{(1)i}+{\widehat{\beta}}_{x:2\xb71x}^{(1)}{x}_{i},\hfill \\ \text{Var}\left({y}_{(2)i}|\widehat{\beta},\widehat{\Sigma},{y}_{(1)i},{x}_{i},{r}_{i}=1\right)={\widehat{\Sigma}}_{22\xb71x}^{(1)}.\hfill \end{array}$$
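These are the standard normal-regression moments. A quick scalar check (hypothetical joint covariance, no covariates) that the conditional pieces reconstruct the marginal variance:

```python
# Hypothetical joint covariance of (y1, y2)
S11, S12, S22 = 2.0, 0.7, 1.5

beta_21 = S12 / S11               # regression coefficient of y2 on y1
v_cond = S22 - S12 ** 2 / S11     # conditional variance Var(y2 | y1)

# Var(y2) recovered from the conditional decomposition
S22_rebuilt = v_cond + beta_21 ** 2 * S11
```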

The M-step computes new parameter estimates by a complete-data maximization subject to the constraints induced by the missing data assumption. Therefore, for the restrictions of (21), the likelihood function for the complete data is rewritten as

$$\begin{array}{cc}\ell \left({\theta}^{(0)},{\theta}^{(1)}\right)\hfill & =\text{log}\phantom{\rule{thinmathspace}{0ex}}\left[{\displaystyle \prod _{i=1}^{N}f({y}_{i}|\beta ,\Sigma ,{r}_{i}=k)}\right]\hfill \\ \hfill & =-\frac{{N}_{0}{p}_{1}}{2}\text{log}(2\pi )-\frac{{N}_{0}}{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}|{\Sigma}_{11\xb72x}|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{0}}({y}_{(1)i}-{\beta}_{2:1\xb72x}{y}_{(2)i}-{\beta}_{x:1\xb72x}{x}_{i})\prime \phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}{\displaystyle {\Sigma}_{11\xb72x}^{-1}({y}_{(1)i}-{\beta}_{2:1\xb72x}{y}_{(2)i}-{\beta}_{x:1\xb72x}{x}_{i})}}-\frac{{N}_{0}{p}_{2}}{2}\text{log}(2\pi )-\frac{{N}_{0}}{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}\left|{\Sigma}_{22\xb7x}^{(0)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{0}}\left({y}_{(2)i}-{\beta}_{x:2\xb7x}^{(0)}{x}_{i}\right)\prime \phantom{\rule{thinmathspace}{0ex}}{\displaystyle {\Sigma}_{22\xb7x}^{(0)-1}\left({y}_{(2)i}-{\beta}_{x:2\xb7x}^{(0)}{x}_{i}\right)}}-\frac{{N}_{1}{p}_{2}}{2}\text{log}(2\pi )-\frac{{N}_{1}}{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}\left|{\Sigma}_{22\xb7x}^{(1)}\right|-\frac{1}{2}{\displaystyle \sum _{i=1}^{{N}_{1}}\left({y}_{(2)i}-{\beta}_{x:2\xb7x}^{(1)}{x}_{i}\right)\prime \phantom{\rule{thinmathspace}{0ex}}{\Sigma}_{22\xb7x}^{(1)-1}\left({y}_{(2)i}-{\beta}_{x:2\xb7x}^{(1)}{x}_{i}\right)}.\hfill \end{array}$$

(22)

Note that the E-step requires the regression of *y*_{(2)i} on *y*_{(1)i} and *x _{i}* for the pattern with missing data, whereas the M-step requires the regression of *y*_{(1)i} on *y*_{(2)i} and *x _{i}*. These computations can be organized via the sweep operator; define

$$D=\left[\begin{array}{ccc}\hfill {D}_{11}^{-1}\hfill & \hfill {\beta}_{x:1\xb72x}^{\prime}\hfill & \hfill {D}_{12}^{-1}\hfill \\ \hfill {\beta}_{x:1\xb72x}\hfill & \hfill {\Sigma}_{11\xb72x}\hfill & \hfill {\beta}_{2:1\xb72x}\hfill \\ \hfill {D}_{12}^{-1}\hfill & \hfill {\beta}_{2:1\xb72x}^{\prime}\hfill & \hfill {D}_{22}^{-1}\hfill \end{array}\right]\phantom{\rule{thinmathspace}{0ex}},$$

where

$$\left[\begin{array}{cc}\hfill {D}_{11}\hfill & \hfill {D}_{12}\hfill \\ \hfill {D}_{21}\hfill & \hfill {D}_{22}\hfill \end{array}\right]=-{N}_{1}\phantom{\rule{thinmathspace}{0ex}}{\left[\begin{array}{cc}\hfill {S}_{\mathit{\text{xx}}}^{(1)}\hfill & \hfill {S}_{x2}^{(1)\prime}\hfill \\ \hfill {S}_{2x}^{(1)}\hfill & \hfill {S}_{22}^{(1)}\hfill \end{array}\right]}^{-1}.$$

Note that the elements of *D* are calculated using the parameter estimates from the previous M-step. Once the E-step is completed, the missing parts of the following matrix, *A*_{1}, can be filled in, leading to

$${A}_{1}=\frac{1}{{N}_{1}}\left[\begin{array}{ccc}{S}_{\mathit{\text{xx}}}^{(1)}\hfill & {S}_{x1}^{(1)}\hfill & {S}_{x2}^{(1)}\hfill \\ {S}_{x1}^{(1)\prime}\hfill & {S}_{11}^{(1)}\hfill & {S}_{21}^{(1)\prime}\hfill \\ {S}_{x2}^{(1)\prime}\hfill & {S}_{12}^{(1)}\hfill & {S}_{22}^{(1)}\hfill \end{array}\right]\phantom{\rule{thinmathspace}{0ex}}.$$

If we let

$${A}_{0}=\frac{1}{{N}_{0}}\left[\begin{array}{ccc}{S}_{\mathit{\text{xx}}}^{(0)}\hfill & {S}_{x1}^{(0)}\hfill & {S}_{x2}^{(0)}\hfill \\ {S}_{x1}^{(0)\prime}\hfill & {S}_{11}^{(0)}\hfill & {S}_{21}^{(0)\prime}\hfill \\ {S}_{x2}^{(0)\prime}\hfill & {S}_{12}^{(0)}\hfill & {S}_{22}^{(0)}\hfill \end{array}\right]$$

and *A* = (*N*_{0}*A*_{0} + *N*_{1}*A*_{1})/*N*, then the M-step is completed by sweeping on the first block of *A*_{0} and *A*_{1} to obtain ${\theta}_{2}^{(0)}=({\beta}_{x:2\xb7x}^{(0)},{\Sigma}_{22\xb7x}^{(0)})$ and ${\theta}_{2}^{(1)}=({\beta}_{x:2\xb7x}^{(1)},{\Sigma}_{22\xb7x}^{(1)})$, respectively. The values of θ_{1·2} = (β_{2:1·2x}, β_{x:1·2x}, Σ_{11·2x}) are obtained by sweeping on the first and third blocks of *A*. Notice that ${\theta}_{2}^{(0)}$ is not affected by the E-step and so does not need iteration. This process yields ML estimates of $({\beta}_{x:2\xb7x}^{(0)},{\Sigma}_{22\xb7x}^{(0)},{\beta}_{x:2\xb7x}^{(1)},{\Sigma}_{22\xb7x}^{(1)})$. ML estimates of $({\beta}_{x:1\xb7x}^{(0)},{\Sigma}_{11\xb7x}^{(0)},{\beta}_{x:1\xb7x}^{(1)},{\Sigma}_{11\xb7x}^{(1)})$ can be obtained from the regression of *y*_{(1)i} on *x _{i}* for each pattern, and ML estimates of ϕ can be obtained from a logistic regression of *r _{i}* on *x _{i}*. Averaging over patterns then gives

$$\begin{array}{c}{\widehat{\beta}}_{x:1\xb7x}=(1-{\widehat{p}}_{x}){\widehat{\beta}}_{x:1\xb7x}^{(0)}+{\widehat{p}}_{x}{\widehat{\beta}}_{x:1\xb7x}^{(1)},\hfill \\ \text{}{\widehat{\mathit{\Sigma}}}_{11\xb7x}=(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}{\widehat{\mathit{\Sigma}}}_{11\xb7x}^{(0)}+{\widehat{p}}_{x}\phantom{\rule{thinmathspace}{0ex}}{\widehat{\mathit{\Sigma}}}_{11\xb7x}^{(1)}+{\widehat{p}}_{x}(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}({\widehat{\beta}}_{x:1\xb7x}^{(0)}-{\widehat{\beta}}_{x:1\xb7x}^{(1)})\mathit{\text{xx}}\prime \phantom{\rule{thinmathspace}{0ex}}({\widehat{\beta}}_{x:1\xb7x}^{(0)}-{\widehat{\beta}}_{x:1\xb7x}^{(1)})\prime ,\hfill \\ {\widehat{\beta}}_{x:2\xb7x}=(1-{\widehat{p}}_{x}){\widehat{\beta}}_{x:2\xb7x}^{(0)}+{\widehat{p}}_{x}{\widehat{\beta}}_{x:2\xb7x}^{(1)},\hfill \\ \text{}{\widehat{\mathit{\Sigma}}}_{22\xb7x}=(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}{\widehat{\mathit{\Sigma}}}_{22\xb7x}^{(0)}+{\widehat{p}}_{x}\phantom{\rule{thinmathspace}{0ex}}{\widehat{\mathit{\Sigma}}}_{22\xb7x}^{(1)}+{\widehat{p}}_{x}(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}({\widehat{\beta}}_{x:2\xb7x}^{(0)}-{\widehat{\beta}}_{x:2\xb7x}^{(1)})\mathit{\text{xx}}\prime \phantom{\rule{thinmathspace}{0ex}}({\widehat{\beta}}_{x:2\xb7x}^{(0)}-{\widehat{\beta}}_{x:2\xb7x}^{(1)})\prime ,\hfill \\ \text{}{\widehat{\mathit{\Sigma}}}_{21\xb7x}=(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}{\widehat{\mathit{\Sigma}}}_{21\xb7x}^{(0)}+{\widehat{p}}_{x}\phantom{\rule{thinmathspace}{0ex}}{\widehat{\mathit{\Sigma}}}_{21\xb7x}^{(1)}+{\widehat{p}}_{x}(1-{\widehat{p}}_{x})\phantom{\rule{thinmathspace}{0ex}}({\widehat{\beta}}_{x:1\xb7x}^{(0)}-{\widehat{\beta}}_{x:1\xb7x}^{(1)})\mathit{\text{xx}}\prime \phantom{\rule{thinmathspace}{0ex}}({\widehat{\beta}}_{x:2\xb7x}^{(0)}-{\widehat{\beta}}_{x:2\xb7x}^{(1)})\prime ,\hfill \end{array}$$

where ${\widehat{p}}_{x}={e}^{\widehat{\varphi}\prime x}/(1+{e}^{\widehat{\varphi}\prime x})$ is the estimated probability that *r _{i}* = 1 for a subject with covariates *x*.

If *p*_{1} < *p*_{2}, then the model remains underidentified, and additional restrictive assumptions are needed to identify the model parameters. Little and Wang (1996) suggest assuming

$$P({r}_{i}=1|{y}_{(1)i},{y}_{(2)i},{x}_{i})=g({y}_{(2s)i},{x}_{i}),$$

where *y*_{(2s)i} is a subset of the variables *y*_{(2)i} with dimension *p*_{(2s)} ≤ *p*_{1}. Using this approach, inference follows directly from the two scenarios previously described (*p*_{1} = *p*_{2} case when *p*_{1} = *p*_{(2s)} and *p*_{1} > *p*_{2} case when *p*_{1} > *p*_{(2s)}). The choice of subset variables is important to the success of the model, and reasons for dropout should be determined.

All likelihood-based methods for handling nonignorable missing data must make some unverifiable assumptions, since the missing data mechanism included in the model depends on unobserved responses. Such a model is essentially nonidentifiable unless some unverifiable constraints are imposed. Inferences are only possible once these assumptions have been made, and the following aspects of the model need to be carefully considered: the bias and efficiency of parameter estimates, sensitivity to model specification, computational expense, and ease of implementation and interpretation. Selection and pattern-mixture models represent two different methods for handling nonignorable missing longitudinal data; each has its advantages and disadvantages.

Selection models directly model the distribution of primary interest, that is, the marginal distribution of the longitudinal outcomes. Thus, this method is more intuitive to most investigators. Selection models allow for a more natural way to model the missing data process, and since the missing data mechanism is modeled conditional on the repeated outcomes, it is very easy to formulate hypotheses about the missing data mechanism. However, to ensure identifiability, the set of outcomes is usually restricted in some way, and arbitrary constraints must be applied to the missing data model. It is unclear how these restrictions on the missing data mechanism translate into assumptions about the distribution of the unobserved outcomes. Sensitivity of parameter estimates to model assumptions needs to be considered, as well as the complexity of the computational algorithms required to fit the models.

Pattern-mixture models make specific assumptions about the distribution of the unobserved outcomes, and therefore, it may be easier to explore the sensitivity of results to model specification. By modeling the outcomes separately for each pattern, problems of identifiability are made explicit. Model identifiability is more obscure in the selection modeling approach, and in this case, one needs to characterize identifiability theoretically. Chen et al. (2004a, 2006, 2009) have carried out such investigations. The main drawback of pattern-mixture models is that the parameters of interest are not immediately available. The primary focus of inference is on the marginal distribution of the outcomes, which can only be obtained by averaging over patterns. Hence, one cannot examine the effects of the individual covariates on the marginal distribution of the outcomes in terms of the regression coefficients. Also, as shown in the previous section, the computations needed for a simple multivariate multiple regression with just one pattern of missing data are complex. It is possible that pattern-mixture models may be computationally intractable for random-effects models or more general patterns of incomplete data.

Several methods have been proposed for dealing with series of measurements that may be right censored due to death or withdrawal. The right censoring is termed informative if the censoring probabilities depend on an individual subject’s underlying rate of change (slope) of the outcome variable. Thus, informative censoring is a special type of nonignorable missing data, and the class of joint models for longitudinal data and a nonignorable censoring process represents a specific case of the selection model. Wu and Carroll (1988) combine the normal random effects model with a probit model for the censoring process. They derive pseudo-maximum likelihood estimates and refer to their procedure as probit pseudo-maximum likelihood estimation (PPMLE). Wu and Bailey (1989) prove that under the probit model, the expectation of the slope for subject *i* is a monotonic increasing (decreasing) function of the censoring time, and instead of modeling the censoring process, they propose a conditional linear model for the individual least squares estimated slope. This method can be described as an approximation to account for the informative right censoring when estimating and comparing changes of a continuous outcome variable.

Consider the following general framework. Assume that in a longitudinal study, *n* measurements on the outcome variable are planned for each participant and that the participants are allocated into two equal sized treatment groups. Let ${y}_{i}=({y}_{i1},\dots ,{y}_{i{j}_{i}})\prime$ denote the *j _{i}* ≤ *n* measurements actually observed for subject *i* before censoring, taken at times ${t}_{1},\dots ,{t}_{{j}_{i}}$. The individual response vectors follow the linear model

$${y}_{i}={X}_{i}{\beta}_{i}+{e}_{i},$$

where

$${e}_{i}\sim {N}_{{j}_{i}}(0,{\sigma}^{2}{I}_{{j}_{i}}),\text{\hspace{1em}\hspace{1em}}{\beta}_{i}\sim {N}_{2}({\beta}_{k},D),$$

and *k* indexes the treatment group. The ordinary least squares estimate of β_{i}, based on the *j _{i}* observed measurements, is

$${\widehat{\beta}}_{i}|{j}_{i}={({X}_{i}^{\prime}{X}_{i})}^{-1}{X}_{i}^{\prime}{y}_{i},$$

where

$${X}_{i}^{\prime}=\left[\begin{array}{ccc}1\hfill & \cdots \hfill & 1\hfill \\ {t}_{1}\hfill & \cdots \hfill & {t}_{{j}_{i}}\hfill \end{array}\right]\phantom{\rule{thinmathspace}{0ex}}.$$

The conditional linear model approach writes the slope as a linear function of the censoring time with normal errors. Specifically,

$$({\widehat{\beta}}_{i1}|{j}_{i}=j)={\gamma}_{0k}+{\gamma}_{1}{t}_{j}+{e}_{\mathit{\text{kj}}},$$

(23)

where *E*(*e _{kj}*) = 0 and $\text{Var}({e}_{\mathit{\text{kj}}})={\sigma}_{\mathit{\text{kj}}}^{2}$. Two methods to estimate the expected slopes, β_{k1}, within each treatment group are considered. The first is the linear minimum variance unbiased (LMVUB) estimator,

$$\text{LMVUB}\phantom{\rule{thinmathspace}{0ex}}({\beta}_{k1})={\widehat{\gamma}}_{0k}+{\widehat{\gamma}}_{1}{E}_{\mathit{\text{ik}}}\phantom{\rule{thinmathspace}{0ex}}({t}_{{j}_{i}}),$$

where ${E}_{\mathit{\text{ik}}}({t}_{{j}_{i}})$ denotes the estimated expected censoring time for subjects in group *k*. The second is the linear minimum mean squared error (LMMSE) estimator,

$$\text{LMMSE}({\beta}_{k1})={\displaystyle \sum _{j=2}^{n}{W}_{\mathit{\text{kj}}}{\overline{\widehat{\beta}}}_{k1j},}$$

where

$${\overline{\widehat{\beta}}}_{k1j}=\frac{{\displaystyle {\sum}_{i\in k:{j}_{i}=j}{\widehat{\beta}}_{i1}}}{{n}_{\mathit{\text{kj}}}},$$

with *n _{kj}* denoting the number of subjects in group *k* censored after the *j*th measurement, and *W _{kj}* denoting the corresponding weights.
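A minimal simulation of the conditional linear model idea: regress subjects' least squares slopes on their dropout times, then evaluate the fitted line at the average dropout time. All numbers are simulated, and the slopes are generated directly from a relation like (23) rather than from raw longitudinal data; this sketches the mechanics only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
g0, g1 = 1.0, 0.5                    # slope depends on dropout time: informative
t_drop = rng.integers(2, 6, size=N)  # number of measurements before dropout
slopes = g0 + g1 * t_drop + rng.normal(scale=0.3, size=N)  # LS slope estimates

# Fit the conditional linear model (23): slope_i = g0k + g1 * t_{j_i} + error
A = np.column_stack([np.ones(N), t_drop])
g0_hat, g1_hat = np.linalg.lstsq(A, slopes, rcond=None)[0]

# LMVUB-style estimate: fitted line evaluated at the mean dropout time
lmvub = g0_hat + g1_hat * t_drop.mean()
```

Because an OLS line passes through the point of means, evaluating it at the in-sample mean dropout time reproduces the group's average slope; the adjustment matters when the expected censoring time is estimated separately or compared across groups.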

The generalized linear mixed model (GLMM) is the generalized linear model (GLM) extension of the normal linear random effects model. It is defined as follows. Suppose that, conditional on the random effects *b _{i}*, the sampling distribution of *y _{ij}* belongs to the exponential family

$$f({y}_{\mathit{\text{ij}}}|{\theta}_{\mathit{\text{ij}}},\tau )=\text{exp}\{\tau [{y}_{\mathit{\text{ij}}}{\theta}_{\mathit{\text{ij}}}-a({\theta}_{\mathit{\text{ij}}})]+c({y}_{\mathit{\text{ij}}},\tau )\},$$

(24)

where τ is a scalar dispersion parameter. Except for the normal random effects model, it will be assumed that τ = τ_{0}, where τ_{0} is known, since τ_{0} = 1 in the logistic and Poisson regression models. The *y _{ij}*’s are assumed to be independent given the random effects, and the canonical parameter θ_{ij} is linked to the covariates and random effects through the linear predictor η_{ij} = *x*′_{ij}β + *z*′_{ij}*b _{i}*. The complete data likelihood is then

$$f(y,b|\beta ,D)={\displaystyle \prod _{i=1}^{N}{\displaystyle \prod _{j=1}^{{n}_{i}}f({y}_{\mathit{\text{ij}}}|\beta ,{b}_{i})f({b}_{i}|D),}}$$

where *f*(*b _{i}*|*D*) is the *q*-dimensional normal density of the random effects,

$$f({b}_{i}|D)={(2\pi )}^{-q/2}{|D|}^{-1/2}\text{exp}\left\{-\frac{1}{2}{b}_{i}^{\prime}{D}^{-1}{b}_{i}\right\}.$$

To induce a correlation structure on the responses, inference is based on the marginal likelihood of β and *D* with the random effects integrated out. This is given by

$$f(y|\beta ,D)={\displaystyle {\int}_{{R}^{\mathit{\text{Nq}}}}f(y,b|\beta ,D)\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{db}},}$$

(25)

where *R ^{Nq}* denotes *Nq*-dimensional Euclidean space.

If *y* and *X* are fully observed, then the likelihood function based on the observed data is given by (25). Note that

$$\begin{array}{cc}f(y|\beta ,D)\hfill & ={\displaystyle {\int}_{{R}^{\mathit{\text{Nq}}}}\phantom{\rule{thinmathspace}{0ex}}f(y,b|\beta ,D)\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{db}}}\hfill \\ \hfill & ={\displaystyle {\int}_{{R}^{q}}\cdots {\displaystyle {\int}_{{R}^{q}}\left[{\displaystyle \prod _{i=1}^{N}{\displaystyle \prod _{j=1}^{{n}_{i}}f({y}_{\mathit{\text{ij}}}|\beta ,{b}_{i})f({b}_{i}|D)\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{db}}}_{i}}}\right]}}\hfill \\ \hfill & ={\displaystyle \prod _{i=1}^{N}\left[{\displaystyle {\int}_{{R}^{q}}{\displaystyle \prod _{j=1}^{{n}_{i}}f({y}_{\mathit{\text{ij}}}|\beta ,{b}_{i})f({b}_{i}|D)\phantom{\rule{thinmathspace}{0ex}}{\mathit{\text{db}}}_{i}}}\right].}\hfill \end{array}$$

(26)

Thus, the marginal likelihood involves evaluating *N* *q*-dimensional integrals. For the general class of GLMM’s, these integrals do not have a closed form and are very difficult to evaluate. This problem led to the development of quasi-likelihood-based methods. Quasi-likelihood was first introduced for the generalized linear model by Wedderburn (1974), who defined the quasi-likelihood function as follows. Suppose that *y _{i}* has mean µ_{i} and variance function *V*(µ_{i}); then the quasi-likelihood *Q*(*y _{i}*, µ_{i}) is defined through the relation

$$\frac{\partial Q({y}_{i},{\mu}_{i})}{\partial {\mu}_{i}}=\frac{{y}_{i}-{\mu}_{i}}{V({\mu}_{i})}.$$

The log-likelihood is a special case of the quasi-likelihood function, but Wedderburn (1974) showed that estimation can be based on any function *Q*(*y _{i}*, µ_{i}) satisfying this relation, even when it does not correspond to an actual log-likelihood.

In the GLMM, the conditional distribution of [*y*|β, *b*] plays the same role as the distribution of [*y*|β] in the fixed-effects GLM, and the joint quasi-likelihood function is the sum of the quasi-likelihoods of [*y*|β, *b*] and [*b*|*D*]. Since inference is based on the marginal likelihood of β and *D* with the random effects integrated out, an integrated quasi-likelihood function is used to estimate θ = (β, *D*). This is defined by

$$\text{exp}(Q(y,\mu |b))\propto |{I}_{N}\otimes D{|}^{-1/2}{\displaystyle \int \text{exp}\phantom{\rule{thinmathspace}{0ex}}\left\{\left(-\frac{1}{2}{\displaystyle \sum _{i=1}^{N}{\text{dev}}_{i}}\right)-\frac{1}{2}b\prime {({I}_{N}\otimes D)}^{-1}b\right\}\mathit{\text{db}},}$$

where dev_{i} denotes the deviance measure of fit for subject *i*, and *b* is the *Nq* × 1 stacked vector of the *b _{i}*’s. A Laplace approximation to the integral yields

$$Q(y,\mu |b)\approx [y\prime \theta (\eta )-J\prime a(\theta (\eta ))]-\frac{1}{2}b\prime {({I}_{N}\otimes D)}^{-1}b,$$

where *y* is the $n\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}1(n={\displaystyle {\sum}_{i=1}^{N}{n}_{i}})$ vector of the *y _{ij}*’s, θ(η) is the *n* × 1 vector of canonical parameters evaluated at the linear predictor η = *X*β + *Zb*, and *J* is an *n* × 1 vector of ones. Setting the derivatives with respect to β and *b* equal to zero leads to the system of equations

$$\left[\begin{array}{cc}\hfill X\prime \mathit{\text{WX}}\hfill & \hfill X\prime \mathit{\text{WZ}}\hfill \\ \hfill Z\prime \mathit{\text{WX}}\hfill & \hfill Z\prime \mathit{\text{WZ}}+{({I}_{N}\otimes D)}^{-1}\hfill \end{array}\right]\phantom{\rule{thinmathspace}{0ex}}\left[\begin{array}{c}\hfill \beta \hfill \\ \hfill b\hfill \end{array}\right]=\left[\begin{array}{c}\hfill X\prime {\mathit{\text{WY}}}^{*}\hfill \\ \hfill Z\prime {\mathit{\text{WY}}}^{*}\hfill \end{array}\right]\phantom{\rule{thinmathspace}{0ex}},$$

where *W* = *G R*^{−1}*G*, ${Y}^{*}=\widehat{\eta}+(y-\widehat{\mu}){G}^{-1}$, $G=\frac{d\mu}{d\eta}$, and *R* = Var(*y*|β, *b*). Substitution of $\widehat{\beta}$ and $\widehat{b}$ into the approximated quasi-likelihood function and evaluation of *W* at $\widehat{\beta}$ and $\widehat{b}$ generate an approximate profile quasi-likelihood function for inference on *D*. Breslow and Clayton (1993) show that differentiating a REML version of this function with respect to the components of *D* yields the following estimating equations for the variance parameters:

$$-\frac{1}{2}\left[({Y}^{*}-X\beta )\prime {V}^{-1}\frac{\partial V}{\partial {D}_{\mathit{\text{ij}}}}{V}^{-1}({Y}^{*}-X\beta )-\text{tr}\left(P\frac{\partial V}{\partial {D}_{\mathit{\text{ij}}}}\right)\right]=0,$$

where *V* = *W*^{−1} + *Z*(*I _{N}* ⊗ *D*)*Z*′ and $P={V}^{-1}-{V}^{-1}X{(X\prime {V}^{-1}X)}^{-1}X\prime {V}^{-1}$. This procedure is commonly referred to as penalized quasi-likelihood (PQL).

Alternatively, numerical integration methods have been proposed, based on so-called nonadaptive or adaptive Gaussian quadrature. The first of these implements a conventional quadrature rule. The second exploits the bell-shaped form of the conditional likelihood function, centering and scaling the quadrature nodes around the portion with highest mass. While more accurate than the PQL and PL methods, numerical integration can be computationally intensive and very sensitive to starting values. It has been implemented in the SAS procedures GLIMMIX and NLMIXED.
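To make the quadrature idea concrete, the following sketch (illustrative only; all function and variable names are ours, not part of the original development) evaluates the marginal log-likelihood (26) for a random-intercept (*q* = 1) logistic GLMM by nonadaptive Gauss–Hermite quadrature:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def marginal_loglik(beta, d, y, X, quad_points=20):
    """Marginal log-likelihood (26) for a random-intercept logistic GLMM.

    y and X are lists over subjects: y[i] is (n_i,), X[i] is (n_i, p).
    d is the random-intercept variance; the intercept b_i ~ N(0, d) is
    integrated out by nonadaptive Gauss-Hermite quadrature.
    """
    nodes, weights = hermgauss(quad_points)   # rule for weight exp(-x^2)
    b = np.sqrt(2.0 * d) * nodes              # rescale nodes to N(0, d)
    total = 0.0
    for yi, Xi in zip(y, X):
        eta = Xi @ beta                                   # fixed-effects part
        lin = eta[:, None] + b[None, :]                   # (n_i, K) predictors
        # conditional log-likelihood at each node: sum_j log f(y_ij | beta, b_k)
        loglik_k = np.sum(yi[:, None] * lin - np.log1p(np.exp(lin)), axis=0)
        # quadrature approximation to the i-th q-dimensional integral in (26)
        total += np.log(np.sum(weights * np.exp(loglik_k)) / np.sqrt(np.pi))
    return total
```

As the random-intercept variance tends to zero, this reduces to the ordinary fixed-effects logistic log-likelihood, which provides a simple numerical check.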

When some components of *y* are nonignorably missing, the estimation problem based on the observed data likelihood in (26) becomes more complicated since another integral over the missing data and the missing data mechanism would be introduced. Ibrahim et al. (2001) have developed a Monte Carlo EM algorithm for the selection model that facilitates straightforward estimation of β and *D*. Less work has been done in estimating parameters for the GLMM with nonignorable missing data using a pattern-mixture modeling approach. Fitzmaurice and Laird (2000) propose a method based on generalized estimating equations (Liang and Zeger 1986), but their approach is an extension of Wu and Bailey’s conditional linear model (Wu and Bailey 1988, 1989) rather than a pattern-mixture model as described by Little (1993, 1995).

Recall that the complete data log-likelihood for the selection model is given by (5), where now *f*(*y _{i}*|β, *b _{i}*) is specified by the GLMM in (24). At the (*t* + 1)th iteration of the Monte Carlo EM algorithm, the E-step requires a sample from

$$[{y}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}]$$

for each *i*. This can be done via the Gibbs sampler by sampling from the complete conditionals [*y*_{mis,i}|*y*_{obs,i}, *b _{i}*, *r _{i}*, γ^{(t)}] and [*b _{i}*|*y*_{mis,i}, *y*_{obs,i}, *r _{i}*, γ^{(t)}], where

$$f({y}_{\text{mis},i}|{y}_{\text{obs},i},{b}_{i},{r}_{i},{\gamma}^{(t)})\propto f({y}_{i}|{b}_{i},{\gamma}^{(t)})f({r}_{i}|{y}_{i},{\gamma}^{(t)})$$

(27)

and

$$f({b}_{i}|{y}_{\text{mis},i},{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\propto f({y}_{i}|{b}_{i},{\gamma}^{(t)})f({b}_{i}|{\gamma}^{(t)}).$$

(28)

The products on the right side of (27) and (28) are log-concave for the class of GLMM’s in (24), since *f*(*y _{i}*|*b _{i}*, γ^{(t)}), *f*(*b _{i}*|γ^{(t)}), and *f*(*r _{i}*|*y _{i}*, γ^{(t)}) are each log-concave in their arguments. Thus, the adaptive rejection algorithm of Gilks and Wild (1992) can be used within the Gibbs sampler to sample from the one-dimensional conditional distributions

$$f({y}_{\text{mis},i\ell}|{y}_{\text{mis},\mathit{\text{i j}}},j\ne \ell ,{y}_{\text{obs},i},{b}_{i},{r}_{i},{\gamma}^{(t)})$$

(29)

and

$$f({b}_{i\ell}|{b}_{\mathit{\text{ij}}},j\ne \ell ,{y}_{\text{mis},i},{y}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}),$$

(30)

where *y*_{mis,iℓ} denotes the ℓth component of *y*_{mis,i} (*s _{i}* × 1), and *b _{iℓ}* denotes the ℓth component of *b _{i}* (*q* × 1).

Suppose that for the *i*th observation, a sample of size *m _{i}*, ${\upsilon}_{i1},\dots ,{\upsilon}_{i{m}_{i}}$, is taken from [*y*_{mis,i}, *b _{i}*|*y*_{obs,i}, *r _{i}*, γ^{(t)}], where υ_{ik} denotes the *k*th joint draw of (*y*_{mis,i}, *b _{i}*). Then the E-step for the *i*th observation at the (*t* + 1)th iteration can be written as

$$\begin{array}{cc}{Q}_{i}(\gamma |{\gamma}^{(t)})\hfill & =\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}l(\gamma ;{y}_{\text{obs},i},{\upsilon}_{\mathit{\text{ik}}},{r}_{i})}\hfill \\ \hfill & =\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}\text{log}\phantom{\rule{thinmathspace}{0ex}}f({y}_{i}|\beta ,{b}_{i})+}\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}\text{log}\phantom{\rule{thinmathspace}{0ex}}f({b}_{i}|D)+}\frac{1}{{m}_{i}}{\displaystyle \sum _{k=1}^{{m}_{i}}\text{log}\phantom{\rule{thinmathspace}{0ex}}f({r}_{i}|\varphi ,{y}_{i}).}\hfill \end{array}$$

(31)

Note that this E-step takes a complete data weighted form in which each (*y*_{mis,i}, *b _{i}*) gets filled in by a set of *m _{i}* sampled values, each receiving weight 1/*m _{i}*. Summing over the observations, the E-step for the full sample is

$$Q(\gamma |{\gamma}^{(t)})={\displaystyle \sum _{i=1}^{N}{\displaystyle \sum _{k=1}^{{m}_{i}}\frac{1}{{m}_{i}}l(\gamma ;{y}_{\text{obs},i},{\upsilon}_{\mathit{\text{ik}}},{r}_{i})}.}$$
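The Monte Carlo E-step above is simply an equally weighted average of complete data log-likelihoods over the imputed draws. A minimal sketch (our notation; `loglik` stands in for a user-supplied complete data log-likelihood *l*(γ; *y*_{obs,i}, υ_{ik}, *r _{i}*)):

```python
def Q_function(gamma, samples, loglik):
    """Monte Carlo E-step: Q(gamma | gamma_t) as the equally weighted sum
    over imputed draws, as in (31).

    samples[i] : list of m_i draws v_ik = (y_mis_i, b_i) taken from the
                 conditional distribution at the current iterate gamma_t.
    loglik(gamma, i, v) : complete data log-likelihood for subject i with
                 the missing data and random effects filled in by v.
    """
    Q = 0.0
    for i, draws in enumerate(samples):
        m_i = len(draws)
        # each draw receives weight 1/m_i
        Q += sum(loglik(gamma, i, v) for v in draws) / m_i
    return Q
```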

The resulting M-step is like one of complete data for the GLMM and can be obtained as follows. Let

$$\dot{Q}(\gamma |{\gamma}^{(t)})=({\dot{Q}}^{(1)}(\beta |{\gamma}^{(t)}),{\dot{Q}}^{(2)}(D|{\gamma}^{(t)}),{\dot{Q}}^{(3)}(\varphi |{\gamma}^{(t)}))\prime $$

denote the score vector of *Q*(γ|γ^{(t)}) so that

$$\dot{Q}(\gamma |{\gamma}^{(t)})\equiv {\displaystyle \sum _{i=1}^{N}{\dot{Q}}_{i}(\gamma |{\gamma}^{(t)})}={\displaystyle \sum _{i=1}^{N}{\displaystyle \sum _{k=1}^{{m}_{i}}\frac{1}{{m}_{i}}\frac{\partial l(\gamma ;{y}_{\text{obs},i},{\upsilon}_{\mathit{\text{ik}}},{r}_{i})}{\partial \gamma}}}.$$

Also, let

$$\ddot{Q}(\gamma |{\gamma}^{(t)})\equiv \frac{{\partial}^{2}Q(\gamma |{\gamma}^{(t)})}{\partial \gamma \partial \gamma \prime}$$

denote the Hessian matrix. Since β, *D*, and ϕ are distinct, the derivatives of *l*(γ; *y*_{obs,i}, υ_{ik}, *r _{i}*) with respect to each of them involve only the corresponding piece of the complete data log-likelihood, so the M-step reduces to three separate complete data maximizations. At convergence of the EM algorithm, the observed information matrix can be approximated by

$$\mathcal{I}(\widehat{\gamma})=-\ddot{Q}(\widehat{\gamma}|\widehat{\gamma})-\left\{\left[{\displaystyle \sum _{i=1}^{N}{\displaystyle \sum _{k=1}^{{m}_{i}}\frac{1}{{m}_{i}}{S}_{i}(\widehat{\gamma};{y}_{\text{obs},i},{\upsilon}_{\mathit{\text{ik}}},{r}_{i})\phantom{\rule{thinmathspace}{0ex}}{S}_{i}(\widehat{\gamma};{y}_{\text{obs},i},{\upsilon}_{\mathit{\text{ik}}},{r}_{i})}\prime}\right]-{\displaystyle \sum _{i=1}^{N}{\dot{Q}}_{i}(\widehat{\gamma}|\widehat{\gamma}){\dot{Q}}_{i}(\widehat{\gamma}|\widehat{\gamma})\prime}\right\},$$

(32)

where $\widehat{\gamma}$ is the estimate of γ at EM convergence, and

$${S}_{i}(\widehat{\gamma};{y}_{\text{obs},i},{\upsilon}_{\mathit{\text{ik}}},{r}_{i})={\left[\frac{\partial l(\gamma ;{y}_{\text{obs},i},{\upsilon}_{\mathit{\text{ik}}},{r}_{i})}{\partial \gamma}\right]}_{\phantom{\rule{thinmathspace}{0ex}}\gamma =\widehat{\gamma}}\phantom{\rule{thinmathspace}{0ex}}.$$

The quantities in (32) are easily computed since both $\ddot{Q}(\widehat{\gamma}|\widehat{\gamma})$ and ${\dot{Q}}_{i}(\widehat{\gamma}|\widehat{\gamma})$ are available from the M-step, and *S _{i}*($\widehat{\gamma}$; *y*_{obs,i}, υ_{ik}, *r _{i}*) is the complete data score for the *i*th observation evaluated at $\widehat{\gamma}$.
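Given the Hessian of *Q* and the per-draw complete data scores, (32) can be assembled directly. A sketch (names and input layout are ours):

```python
import numpy as np

def observed_information(Q_hess, scores):
    """Observed information matrix (32) at EM convergence.

    Q_hess    : Hessian of Q at gamma_hat, shape (p, p).
    scores[i] : (m_i, p) array of complete data scores
                S_i(gamma_hat; y_obs_i, v_ik, r_i) for subject i.
    """
    p = Q_hess.shape[0]
    outer_sum = np.zeros((p, p))   # sum_i sum_k S_ik S_ik' / m_i
    dotQ_sum = np.zeros((p, p))    # sum_i Qdot_i Qdot_i'
    for S in scores:
        m_i = S.shape[0]
        outer_sum += S.T @ S / m_i
        Qdot_i = S.mean(axis=0)    # subject-level score Qdot_i(gamma_hat)
        dotQ_sum += np.outer(Qdot_i, Qdot_i)
    return -Q_hess - (outer_sum - dotQ_sum)
```

At EM convergence the subject-level scores sum to zero, so the bracketed correction term measures only the Monte Carlo/missing-information contribution.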

The method described here is valid for arbitrary patterns of missing data in the response variable. The complexity of the estimate of *D* in the M-step depends on the structure of *D*. In any case, the estimation of *D* corresponds to estimation from a problem of complete data, and one can use any existing complete data software to estimate *D*. Also, note that models for the missing data mechanism in GLMM’s take the same form as in the normal random-effects model.

Recall that pattern-mixture models stratify the incomplete data by the pattern of missing values and formulate distinct models within each stratum. Thus, the complete data log-likelihood is written as

$$l(\gamma )={\displaystyle \sum _{i=1}^{N}\text{log}[f({y}_{i}|\beta ,{b}_{i},{r}_{i})]}+{\displaystyle \sum _{i=1}^{N}\text{log}[f({b}_{i}|D)]+{\displaystyle \sum _{i=1}^{N}\text{log}[f({r}_{i}|\varphi )].}}$$

Little work has been done using pattern-mixture models for GLMM’s with nonignorable missing data. Ekholm and Skinner (1998) analyze longitudinal binary data using a pattern-mixture model but do not generalize their method to the GLMM. Fitzmaurice and Laird (2000) develop a model for the GLMM with nonignorable dropout, which they consider to be a *mixture* model based on Wu and Bailey’s conditional linear model (Wu and Bailey 1988, 1989), since dropout time is used as a covariate. This is the method that will be described here.

Consider the following notation. Assume that *N* subjects are to be observed at the same set of *n* occasions, {*t*_{1}, …, *t _{n}*}. Let ${y}_{i}^{c}=({y}_{i1},\dots ,{y}_{\mathit{\text{in}}})\prime $ denote the complete response vector for subject *i*, and let *r _{i}* denote the dropout time, taking values in {2, …, *n* + 1}, with *r _{i}* = *n* + 1 indicating that the subject completes the study.

Consider models for *y _{i}*, conditional on the time of dropout, that are of the following general form:

$$g(E({y}_{\mathit{\text{ij}}}|\beta ,{r}_{i}))={z}_{\mathit{\text{ij}}}^{\prime}\beta ,$$

(33)

where *g*(·) is a known link function, and the design vector, *z _{ij}*, includes the dropout time, the covariates, and their interactions. The parameters in this model have an unappealing interpretation due to the stratification by pattern of dropout, which may depend on the outcome. Therefore, the parameter of interest is not β, but the marginal expectation of the repeated outcome averaged over the distribution of dropout times,

$$E({y}_{\mathit{\text{ij}}}|\beta )={\mu}_{\mathit{\text{ij}}}={\displaystyle \sum _{l=2}^{n+1}{\varphi}_{\mathit{\text{il}}}\phantom{\rule{thinmathspace}{0ex}}{g}^{-1}({z}_{\mathit{\text{ij}}}^{\prime}\beta ),}$$

where *z _{ij}* is evaluated at dropout time *l*, and ϕ_{il} = *P*(*r _{i}* = *l*) denotes the probability that subject *i* has dropout time *l*.

Unlike the normal random effects model, it is difficult to account for the covariance among the repeated outcomes when the responses are categorical, ordinal, or counts. Generalized estimating equations (GEE’s) (see Liang and Zeger 1986; Zeger and Liang 1986) represent a general method for incorporating within-subject correlation in the GLM without having to completely specify the joint distribution of *y _{i}*. Only the forms of the first and second moments are required. Note that the GEE approach can accommodate any intermediate MCAR missingness in the outcome since each subject is allowed a distinct set of measurement times. The estimating equations for β with nonignorable missing data are given by

$$U(\beta )={\displaystyle \sum _{i=1}^{N}{G}_{i}^{\prime}{V}_{i}^{-1}[{y}_{i}-E({y}_{i}|\beta ,{r}_{i})]}=0,$$

where *y _{i}* is the vector of observed responses for subject *i*, ${G}_{i}=\partial E({y}_{i}|\beta ,{r}_{i})/\partial \beta $, and *V _{i}* is a working covariance matrix for *y _{i}*. The solution $\widehat{\beta}$ is consistent and asymptotically normal, and the asymptotic covariance matrix of ${N}^{1/2}(\widehat{\beta}-\beta )$ is

$${V}_{\widehat{\beta}}=\underset{N\to \mathrm{\infty}}{\text{lim}}N\phantom{\rule{thinmathspace}{0ex}}{\left[{\displaystyle \sum _{i=1}^{N}{G}_{i}^{\prime}{V}_{i}^{-1}{G}_{i}}\right]}^{-1}\left[{\displaystyle \sum _{i=1}^{N}{G}_{i}^{\prime}{V}_{i}^{-1}\text{cov}({Y}_{i}){V}_{i}^{-1}{G}_{i}}\right]\phantom{\rule{thinmathspace}{0ex}}{\left[{\displaystyle \sum _{i=1}^{N}{G}_{i}^{\prime}{V}_{i}^{-1}{G}_{i}}\right]}^{-1}.$$
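In practice the middle term is estimated by replacing cov(*Y _{i}*) with the outer product of residuals. A sketch of this empirical sandwich estimator (names and input layout ours):

```python
import numpy as np

def sandwich_cov(G, V, resid):
    """Empirical sandwich estimator of N * Cov(beta_hat) for the GEE above.

    G[i]     : (n_i, p) derivative matrix G_i = d E(y_i)/d beta'.
    V[i]     : (n_i, n_i) working covariance matrix.
    resid[i] : (n_i,) residual y_i - E(y_i | beta_hat, r_i); cov(Y_i) is
               estimated by the outer product of these residuals.
    """
    p = G[0].shape[1]
    bread = np.zeros((p, p))
    meat = np.zeros((p, p))
    for Gi, Vi, ei in zip(G, V, resid):
        Vinv = np.linalg.inv(Vi)
        bread += Gi.T @ Vinv @ Gi                       # sum G' V^{-1} G
        meat += Gi.T @ Vinv @ np.outer(ei, ei) @ Vinv @ Gi
    bread_inv = np.linalg.inv(bread)
    return len(G) * bread_inv @ meat @ bread_inv
```

The estimator is robust in the usual GEE sense: it remains valid even when the working covariance *V _{i}* is misspecified.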

Estimation of the dropout probabilities also needs to be considered. With a small number of discrete covariates, the dropout probabilities, ϕ_{ij}, can be estimated as the sample proportion with each dropout time stratified by covariate pattern. The asymptotic covariance matrix of ${N}^{1/2}(\widehat{\varphi}-\varphi )$ is then given by

$${V}_{\widehat{\varphi}}=\text{diag}(\varphi )-\varphi \varphi \prime .$$

When the number of dropout times or covariates is large, then parametric models such as a multinomial log-linear regression model can be used to estimate ϕ.

The appealing aspect of the mixture model presented above is that the SAS procedure GENMOD, or any other statistical software for GEE’s, can be used to estimate β. The dropout times and their interactions with the other covariates are simply included as additional covariates in the model. The marginal means at times *t _{j}* can then be estimated by

$${\widehat{\mu}}_{\mathit{\text{ij}}}={\displaystyle \sum _{l=2}^{n+1}{\widehat{\varphi}}_{\mathit{\text{il}}}\phantom{\rule{thinmathspace}{0ex}}{g}^{-1}({z}_{\mathit{\text{ij}}}^{\prime}\widehat{\beta})}.$$
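For a single covariate pattern, estimating ϕ by sample proportions, forming its asymptotic covariance diag(ϕ) − ϕϕ′, and averaging the pattern-specific means to obtain μ̂_{ij} can be sketched as follows (the dictionary `Z` mapping dropout times to pattern-specific design vectors is our illustrative device):

```python
import numpy as np

def marginal_means(dropout_times, Z, beta_hat, inv_link, n):
    """Pattern-mixture marginal mean at one occasion j, averaging the
    pattern-specific means over the estimated dropout distribution.

    dropout_times : length-N array with values in {2, ..., n+1}.
    Z             : dict; Z[l] is the design vector z_ij for occasion j
                    when the dropout time equals l (covariate pattern fixed).
    """
    levels = np.arange(2, n + 2)
    # phi_hat: sample proportion dropping out at each time l
    phi = np.array([(dropout_times == l).mean() for l in levels])
    # mu_hat_ij = sum_l phi_l * g^{-1}(z_ij' beta_hat)
    mu = sum(phi[k] * inv_link(Z[l] @ beta_hat) for k, l in enumerate(levels))
    # asymptotic covariance of N^{1/2}(phi_hat - phi): diag(phi) - phi phi'
    V_phi = np.diag(phi) - np.outer(phi, phi)
    return mu, phi, V_phi
```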

Robins et al. (1995) develop a class of estimators for generalized linear mixed models that are based on inverse probability weighted estimating equations (WEE) when the data are MAR. Rotnitzky et al. (1998) extend this methodology to account for nonignorable nonresponse in the outcomes. Their conditional mean model for *y _{it}* is

$$E({y}_{\mathit{\text{it}}}|{X}_{i})={k}_{t}({X}_{i};\beta ),$$

where *k _{t}*(*X _{i}*; β) is a known function of the covariates *X _{i}* and a *p* × 1 parameter vector β.

Consider the following notation. Let υ_{it} be a vector of time-dependent covariates that are not of interest. Define ${w}_{\mathit{\text{it}}}=({\upsilon}_{\mathit{\text{it}}}^{\prime},{y}_{\mathit{\text{it}}})\prime ,t=1,\dots ,T,{w}_{i0}=({x}_{i0}^{\prime},{\upsilon}_{i0}^{\prime},{y}_{i0})\prime ,{w}_{i}=({w}_{i0}^{\prime},{w}_{i1}^{\prime},\dots ,{w}_{\mathit{\text{iT}}}^{\prime})\prime $, and let *r _{it}* be an indicator variable for whether subject *i* is observed at time *t*, with ${r}_{i}=({r}_{i1},\dots ,{r}_{\mathit{\text{iT}}})\prime $. The probability of a given response pattern *r* is then

$$P({r}_{i}=r|{w}_{i})={\displaystyle \prod _{t=1}^{T}P{({r}_{\mathit{\text{it}}}=1|{\overline{r}}_{\mathit{\text{it}}},{w}_{i})}^{{r}_{t}}P{({r}_{\mathit{\text{it}}}=0|{\overline{r}}_{\mathit{\text{it}}},{w}_{i})}^{1-{r}_{t}}={\pi}_{i}(r)},$$

where $P({r}_{\mathit{\text{it}}}=1|{\overline{r}}_{\mathit{\text{it}}},{w}_{i})$ denotes the probability that subject *i* is observed at time *t*, given the response history ${\overline{r}}_{\mathit{\text{it}}}=({r}_{i1},\dots ,{r}_{i,t-1})\prime $ and the complete data *w _{i}*. These response probabilities are modeled parametrically as

$${\lambda}_{\mathit{\text{it}}}={\lambda}_{\mathit{\text{it}}}(\alpha ),$$

where

$$\text{logit}\phantom{\rule{thinmathspace}{0ex}}{\lambda}_{\mathit{\text{it}}}(\alpha )={h}_{t}({\overline{r}}_{\mathit{\text{it}}},{w}_{i};\alpha ),$$

and *h _{t}*(·; α) is a known function of the response history and *w _{i}*, indexed by a *q* × 1 parameter vector α.

In the no missing data case, parameter estimates are found by solving the estimating equations

$$\sum _{i=1}^{N}d({X}_{i};\beta )[{y}_{i}-g({X}_{i};\beta )]=0,$$

where *d*(*X _{i}*; β) is a *p* × *T* matrix of arbitrary functions of *X _{i}*. These estimating equations are unbiased since

$$E(d({X}_{i};\beta )\phantom{\rule{thinmathspace}{0ex}}[{y}_{i}-g({X}_{i};\beta )])=0.$$

In the incomplete data case with unknown response probabilities, the parameters β and α can be jointly estimated from solutions to a simultaneous set of *p* + *q* estimating equations,

$$\sum _{i=1}^{N}\left\{\frac{I({r}_{i}=1)}{{\pi}_{i}(1;\alpha )}\phantom{\rule{thinmathspace}{0ex}}\left[\begin{array}{c}{d}^{(1)}({X}_{i};\beta )\hfill \\ {d}^{(2)}({X}_{i};\beta )\hfill \end{array}\right]\phantom{\rule{thinmathspace}{0ex}}[{y}_{i}-g({X}_{i};\beta )]-\left[\begin{array}{c}{A}_{i}^{(1)}(\alpha )\hfill \\ {A}_{i}^{(2)}(\alpha )\hfill \end{array}\right]\right\}=0,$$

where *d*^{(1)}(*X _{i}*; β) and *d*^{(2)}(*X _{i}*; β) are *p* × *T* and *q* × *T* matrices of arbitrary functions of *X _{i}*, and the ${A}_{i}^{(j)}(\alpha )$ are augmentation terms defined as

$${A}_{i}^{(j)}(\alpha )={\displaystyle \sum _{r\ne 1}\left[I({r}_{i}=r)-\frac{I({r}_{i}=1)}{{\pi}_{i}(1;\alpha )}{\pi}_{i}(r;\alpha )\right]\phantom{\rule{thinmathspace}{0ex}}{f}_{r}^{(j)}({w}_{\mathit{\text{ri}}})}$$

such that

$$\sum _{i=1}^{N}{A}_{i}^{(j)}(\alpha )=0.$$

Rotnitzky et al. (1998) show that the solution to these equations is consistent and asymptotically normally distributed, provided that the conditional mean model and the model for the response probabilities are correctly specified. The variance of $\widehat{\beta}$ depends on the choice of functions *d*^{(j)}(*X _{i}*; β) and ${f}_{r}^{(j)}({w}_{\mathit{\text{ri}}})$, and optimal choices for these functions are discussed in the paper. This approach is extended to semiparametric models for the dropout mechanism by Scharfstein et al. (1999). Lipsitz et al. (1999b) examine theoretical connections between WEE and ML methods.
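In its simplest instance (a scalar mean, *d* ≡ 1, no augmentation terms), the weighted estimating equation reduces to an inverse-probability-weighted average over the completely observed subjects. A sketch with known response probabilities (our simplification, not the full Rotnitzky et al. estimator):

```python
import numpy as np

def ipw_mean(y, observed, pi):
    """Inverse-probability-weighted estimate of E(y): solves
    sum_i I(r_i = 1)/pi_i * (y_i - mu) = 0 for mu.

    y        : outcomes (arbitrary, e.g. NaN, where unobserved).
    observed : 0/1 indicators of complete observation, I(r_i = 1).
    pi       : probabilities of complete observation, pi_i(1).
    """
    w = observed / pi                        # weight 0 for incomplete cases
    vals = np.where(observed, y, 0.0)        # mask unobserved outcomes
    return np.sum(w * vals) / np.sum(w)
```

Upweighting complete cases by 1/π_{i} compensates for subjects with similar data who were not fully observed, which is what makes the equation unbiased under a correctly specified response model.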

Lipsitz et al. (1999a) consider maximum likelihood estimation for the special case of nonignorable missing responses and MAR categorical covariates in longitudinal binary data. More generally, the work of Ibrahim et al. (2001) involving missing nonignorable responses in GLMM’s was extended to include both nonignorable missing responses and/or covariates for the normal mixed model in Stubbendick and Ibrahim (2003) and for the multivariate probit model by Stubbendick and Ibrahim (2006). Following Stubbendick and Ibrahim (2003), the E-step for the *i*th observation at the (*t* + 1)th iteration for the normal mixed model is

$$\begin{array}{cc}{Q}_{i}(\gamma |{\gamma}^{(t)})\hfill & =E(l(\gamma ;{y}_{i},{X}_{i},{b}_{i},{r}_{i})|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\hfill \\ \hfill & ={\displaystyle \int {\displaystyle \int {\displaystyle \int \text{log}[f({y}_{i}|\beta ,{\sigma}^{2},{X}_{i},{b}_{i})]}}}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({y}_{\text{mis},i},{X}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}){\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{db}}}_{i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dX}}}_{\text{mis},i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dy}}}_{\text{mis},i}+{\displaystyle \int {\displaystyle \int {\displaystyle \int \text{log}[f({X}_{i}|\alpha )]}}}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({y}_{\text{mis},i},{X}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}){\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{db}}}_{i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dX}}}_{\text{mis},i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dy}}}_{\text{mis},i}+{\displaystyle \int {\displaystyle \int {\displaystyle \int \text{log}[f({b}_{i}|D)]}}}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({y}_{\text{mis},i},{X}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}){\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{db}}}_{i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dX}}}_{\text{mis},i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dy}}}_{\text{mis},i}+{\displaystyle \int {\displaystyle \int {\displaystyle \int \text{log}[f({r}_{i}|\varphi ,{y}_{i},{X}_{i})]}}}\phantom{\rule{thinmathspace}{0ex}}\times 
\phantom{\rule{thinmathspace}{0ex}}f({y}_{\text{mis},i},{X}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}){\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{db}}}_{i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dX}}}_{\text{mis},i}{\phantom{\rule{thinmathspace}{0ex}}\mathit{\text{dy}}}_{\text{mis},i}\hfill \\ \hfill & \equiv {I}_{1}+{I}_{2}+{I}_{3}+{I}_{4},\hfill \end{array}$$

where γ^{(t)} = (β^{(t)}, σ^{2(t)}, *D*^{(t)}, ϕ^{(t)}), and *f*(*y*_{mis,i}, *X*_{mis,i}, *b _{i}*|*y*_{obs,i}, *X*_{obs,i}, *r _{i}*, γ^{(t)}) is the conditional distribution of the missing data and random effects given the observed data.

To integrate out *b _{i}* from the E-step in closed form, factor the conditional distribution as

$$f({y}_{\text{mis},i},{X}_{\text{mis},i},{b}_{i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})=f({b}_{i}|{y}_{i},{\gamma}^{(t)})f({y}_{\text{mis},i},{X}_{\text{mis},i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}),$$

where

$$({b}_{i}|{y}_{i},{\gamma}^{(t)})\sim {N}_{q}\left({\Sigma}_{i}^{(t)}{Z}_{i}^{\prime}({y}_{i}-{X}_{i}{\beta}^{(t)})/{\sigma}^{2(t)},\phantom{\rule{thinmathspace}{0ex}}{\Sigma}_{i}^{(t)}\right),\phantom{\rule{thickmathspace}{0ex}}\text{where}\phantom{\rule{thickmathspace}{0ex}}{\Sigma}_{i}^{(t)}={[{\sigma}^{-2(t)}{Z}_{i}^{\prime}{Z}_{i}+{({D}^{(t)})}^{-1}]}^{-1}.$$

Then, to complete the E-step, samples only need to be taken from

$$[{y}_{\text{mis},i},{X}_{\text{mis},i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)}].$$

This distribution can be written up to a constant of proportionality as

$$f({y}_{\text{mis},i},{X}_{\text{mis},i}|{y}_{\text{obs},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\propto \text{exp}\phantom{\rule{thinmathspace}{0ex}}(-\frac{1}{2}({y}_{i}-{X}_{i}{\beta}^{(t)})\prime {({Z}_{i}{D}^{(t)}{Z}_{i}^{\prime}+{\sigma}^{2(t)}{I}_{{n}_{i}})}^{-1}({y}_{i}-{X}_{i}{\beta}^{(t)}))\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({r}_{i}|{y}_{\text{mis},i},{y}_{\text{obs},i},{X}_{\text{mis},i},{X}_{\text{obs},i},{\gamma}^{(t)})f({X}_{\text{mis},i}|{X}_{\text{obs},i},{\gamma}^{(t)}),$$

(34)

which has the form of a normal density, times a logistic regression for the *r _{i}*’s, times a regression model for the *X _{i}*’s determined by the assumed covariate distribution.
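Because the conditional distribution of *b _{i}* given *y _{i}* is exactly normal in this model, the E-step draws for *b _{i}* require no rejection sampling. A sketch of a single draw (names ours), matching the mean and covariance displayed above:

```python
import numpy as np

def draw_bi(rng, y_i, X_i, Z_i, beta, sigma2, D):
    """Draw b_i from its exact normal conditional given y_i in the
    normal mixed model, with mean Sigma_i Z_i'(y_i - X_i beta)/sigma2
    and covariance Sigma_i = [Z_i'Z_i/sigma2 + D^{-1}]^{-1}.
    """
    prec = Z_i.T @ Z_i / sigma2 + np.linalg.inv(D)   # conditional precision
    Sigma_i = np.linalg.inv(prec)                    # conditional covariance
    mean = Sigma_i @ Z_i.T @ (y_i - X_i @ beta) / sigma2
    return rng.multivariate_normal(mean, Sigma_i)
```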

This methodology has been extended to the GLMM by Stubbendick and Ibrahim (2006). Thus, an MCEM sample must be generated from [*y*_{mis,i}, *X*_{mis,i}, *b _{i}*|*y*_{obs,i}, *X*_{obs,i}, *r _{i}*, γ^{(t)}]. This can be done via the Gibbs sampler with the complete conditionals

$$f({y}_{\text{mis},i}|{y}_{\text{obs},i},{X}_{\text{mis},i},{X}_{\text{obs},i},{b}_{i},{r}_{i},{\gamma}^{(t)})\propto f({y}_{i}|{X}_{i},{b}_{i},{\gamma}^{(t)})f({r}_{i}|{y}_{i},{X}_{i},{\gamma}^{(t)}),$$

(35)

$$f({X}_{\text{mis},i}|{y}_{\text{mis},i},{y}_{\text{obs},i},{X}_{\text{obs},i},{b}_{i},{r}_{i},{\gamma}^{(t)})\propto f({y}_{i}|{X}_{i},{b}_{i},{\gamma}^{(t)})f({r}_{i}|{y}_{i},{X}_{i},{\gamma}^{(t)})f({X}_{\text{mis},i}|{X}_{\text{obs},i},{\gamma}^{(t)}),$$

(36)

$$f({b}_{i}|{y}_{\text{mis},i},{y}_{\text{obs},i},{X}_{\text{mis},i},{X}_{\text{obs},i},{r}_{i},{\gamma}^{(t)})\propto f({y}_{i}|{X}_{i},{b}_{i},{\gamma}^{(t)})f({b}_{i}|{\gamma}^{(t)}).$$

(37)

When the products on the right-hand side of (35)–(37) are log-concave for the class of GLMMs, the Gibbs sampler along with the adaptive rejection algorithm of Gilks and Wild (1992) can be used to sample from the complete conditionals.

Allowing for nonignorable missing responses *and* covariates presents several additional modeling and computational challenges compared to just the missing response situation. First, a covariate distribution needs to be specified and its parameters estimated. This is done by specifying the covariate distribution via a sequence of one-dimensional conditional distributions as

$$f({x}_{\mathit{\text{ij}}1},\dots ,{x}_{\mathit{\text{ijp}}}|\alpha )=f({x}_{\mathit{\text{ijp}}}|{x}_{\mathit{\text{ij}}1},\dots ,{x}_{\mathit{\text{ij}}(p-1)},{\alpha}_{p})\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({x}_{\mathit{\text{ij}}(p-1)}|{x}_{\mathit{\text{ij}}1},\dots ,{x}_{\mathit{\text{ij}}(p-2)},{\alpha}_{(p-1)})\dots \phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}f({x}_{\mathit{\text{ij}}2}|{x}_{\mathit{\text{ij}}1},{\alpha}_{2})f({x}_{\mathit{\text{ij}}1}|{\alpha}_{1}),$$

(38)

where *x _{ijm}* is the *m*th covariate for subject *i* at occasion *j*, and α_{m} is a vector of parameters indexing the *m*th conditional distribution, with the α_{m}’s distinct.

Unfortunately, the parametric forms of the assumed missing data mechanism and the covariate model are not testable from the data. Many models need to be evaluated owing to the numerous possibilities for the missing data mechanism and/or the covariate distribution and for carrying out sensitivity analyses. In addition, issues related to bias, efficiency, and model fit need to be addressed. In the presence of missing data, Lipsitz et al. (2001) and Fitzmaurice et al. (2001) examine bias issues in longitudinal data. Chen et al. (2008) examine bias and efficiency issues in regression models with missing responses and/or covariates. To address general issues regarding model fit and assessment in the presence of missing data, new methods are needed for defining residuals, diagnostic measures, assessing model fit, and assessing the influence of model perturbations for all types of models, such as GLMs, survival models, and models for longitudinal data. This is a growing, active, and open research area. AIC and BIC are common model assessment tools under the frequentist paradigm. In the presence of missing data, the definition of the AIC/BIC criterion is not clear. Ibrahim et al. (2008b) define AIC as AIC = $-2Q(\widehat{\gamma}|\widehat{\gamma})+2d$, where *d* is the total number of parameters in the model and $Q(\widehat{\gamma}|\widehat{\gamma})$ is the *Q* function from the EM algorithm at convergence. Similarly, they define BIC as BIC = $-2Q(\widehat{\gamma}|\widehat{\gamma})+\text{log}(N)d$. Such measures can be used to assess fit in models for longitudinal data.
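The EM-based criteria of Ibrahim et al. (2008b) are immediate to compute once the *Q* function at convergence is available; a minimal sketch:

```python
import math

def em_aic_bic(Q_hat, d, N):
    """AIC/BIC analogues built from the EM Q function at convergence,
    Q_hat = Q(gamma_hat | gamma_hat), rather than the (intractable)
    observed-data log-likelihood. d = number of parameters, N = subjects.
    """
    aic = -2.0 * Q_hat + 2 * d
    bic = -2.0 * Q_hat + math.log(N) * d
    return aic, bic
```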

A more general framework for model assessment in complete data problems is given in Cook (1986), where he describes a method for assessing the local influence of minor perturbations of a statistical model. His method uses the geometric normal curvature to characterize the behavior of an influence graph based on a well-behaved likelihood function. In the context of the linear mixed model with complete data, Beckman et al. (1987) use local influence to assess the effect of perturbing the error variances, the random-effects variances, and the response vector. Lesaffre and Verbeke (1998) show that the local influence approach is also useful for the detection of influential subjects in a longitudinal data analysis. Zhu and Lee (2001) apply Cook’s approach to the conditional expectation of the complete data log-likelihood function in the EM algorithm instead of the more complicated observed data log-likelihood function. Their Q-displacement function, $2[Q(\widehat{\gamma}|\widehat{\gamma})-Q(\widehat{\gamma}(\omega )|\widehat{\gamma})]$, was used in assessing the local influence of perturbations of selection models with nonignorable missing data in Shi et al. (2009). Zhu et al. (2009) examine residuals, diagnostic measures, and goodness of fit statistics for GLMs with missing covariate data. Shi et al. (2009) also examine local influence approaches for GLMs with missing covariate data, and Garcia et al. (2009) investigate variable selection in GLMs with missing covariate data using penalized likelihood approaches. These procedures are currently being extended to longitudinal data.

Interest in methods for joint modeling of longitudinal and survival time data has developed considerably in recent years (see, e.g., Pawitan and Self 1993; DeGruttola and Tu 1994; Taylor et al. 1994; Faucett and Thomas 1996; Lavalley and DeGruttola 1996; Hogan and Laird 1997, 1998; Henderson et al. 2000; Xu and Zeger 2001; Brown and Ibrahim 2003a, 2003b; Ibrahim et al. 2004; Chen et al. 2004a; Brown et al. 2005; Chi and Ibrahim 2006, 2007).

Broadly speaking, there are three main reasons to consider such models. First, a time-to-event outcome may be measured alongside a longitudinal covariate. Such a joint model then allows, in a natural way, for incorporation of measurement error present in the longitudinal covariate into the model. Second, a number of researchers have used joint modeling methods to exploit longitudinal markers as surrogates for survival. Tsiatis et al. (1995), for instance, propose a model for the relationship of survival to longitudinal data measured with error and, using Prentice’s (1989) criteria, examine whether CD4 counts may serve as a useful surrogate marker for survival in patients with AIDS. Xu and Zeger (2001) investigate the issue of evaluating multiple surrogate endpoints and discuss a joint latent model for a time to clinical event and for repeated measures over time on multiple biomarkers that are potential surrogates. In addition, they propose two complementary measures to assess the relative benefit of using multiple surrogates as opposed to a single one. Another aspect of the problem, discussed by Henderson et al. (2000), Brown and Ibrahim (2003a, 2003b), Ibrahim et al. (2004), Chen et al. (2004b), Brown et al. (2005), and Chi and Ibrahim (2006, 2007), is the identification of longitudinal markers for survival. These authors focus on the use of longitudinal marker trajectories to investigate the association between a longitudinal marker and survival. Renard et al. (2002) used a joint model to explore the usefulness of prostate-specific antigen as a marker for prostate cancer.

Third, and most relevant for us here, such joint models can be used when incomplete longitudinal data are collected. Whenever data are incomplete, one should a priori consider the joint distribution of the responses and missing data process. In this sense, selection models and pattern-mixture models are merely convenient ways to decompose this joint distribution. In a number of applications, it may be attractive to write this joint distribution in terms of latent variables, latent classes, or random effects. This leads to so-called shared-parameter models. In principle, one can augment the full-data distribution with random effects *b _{i}*,

$$f({y}_{i},{r}_{i},{b}_{i}|{X}_{i},{W}_{i},{Z}_{i},\theta ,\psi ,\xi )$$

(39)

and then still consider the selection-model factorization

$$f({y}_{i},{r}_{i},{b}_{i}|{X}_{i},{W}_{i},{Z}_{i},\theta ,\psi ,\xi )=f({y}_{i}|{X}_{i},{b}_{i},\theta )f({r}_{i}|{y}_{i},{b}_{i},{W}_{i},\psi )f({b}_{i}|{Z}_{i},\xi )$$

(40)

and the pattern-mixture model factorization

$$f({y}_{i},{r}_{i},{b}_{i}|{X}_{i},{W}_{i},{Z}_{i},\theta ,\psi ,\xi )=f({y}_{i}|{r}_{i},{b}_{i},{X}_{i},\theta )f({r}_{i}|{b}_{i},{W}_{i},\psi )f({b}_{i}|{Z}_{i},\xi ).$$

(41)

Here, *Z _{i}* and ξ are covariates and parameters, respectively, describing the random-effects distribution. Little (1995) refers to such decompositions as random-coefficient selection and pattern-mixture models, respectively.

Important early references to such models are Wu and Carroll (1988) and Wu and Bailey (1988, 1989). Wu and Carroll (1988) proposed this kind of model for what they termed informative right censoring. For a continuous response, Wu and Carroll suggested using a conventional Gaussian random-coefficient model combined with an appropriate model for time to dropout, such as a proportional hazards, logistic, or probit regression. The combination of probit and Gaussian response allows an explicit solution of the integral and was used in their application.

In a slightly different approach to modeling dropout time as a continuous variable in the latent variable setting, Schluchter (1992) and DeGruttola and Tu (1994) proposed joint multivariate Gaussian distributions for the latent variable(s) of the response process and a variable representing time to dropout. The correlation between these variables induces dependence between dropout and response. To permit more realistic distributions for dropout time, Schluchter (1992) proposed that dropout time itself should be some monotone transformation of the corresponding Gaussian variable. The use of a joint Gaussian representation does simplify computational problems associated with the likelihood. There are clear links here with the Tobit model, and this is made explicit by Cowles et al. (1996), who use a number of correlated latent variables to represent various aspects of an individual’s behavior, such as compliance and attendance at scheduled visits. Models of this type handle nonmonotone missingness quite conveniently. There are many ways in which such models can be extended and generalized.

An important simplification arises when *Y _{i}* and *R _{i}* are assumed independent given the random effects *b _{i}*. In that case, (40) and (41) coincide:

$$f({y}_{i},{r}_{i},{b}_{i}|{X}_{i},{W}_{i},{Z}_{i},\theta ,\psi ,\xi )=f({y}_{i}|{X}_{i},{b}_{i},\theta )f({r}_{i}|{W}_{i},{b}_{i},\psi )f({b}_{i}|{Z}_{i},\xi ).$$

(42)

This route was followed by Follmann and Wu (1995). Note that, when *b _{i}* is assumed to be discrete, a latent-class or mixture model follows. Rizopoulos et al. (2008) study the impact of random-effects misspecification in a shared-parameter model. Beunckens et al. (2008) combine continuous random effects with latent classes, leading to the simultaneous use of mixture and mixed-effects model ideas. It is very natural to handle random-coefficient models, and in particular shared-parameter models, in a Bayesian framework. Examples in the missing data setting are provided by Best et al. (1996) and Carpenter et al. (2002).
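To illustrate the shared-parameter factorization (42), the following sketch simulates data in which a single shared random intercept drives both the longitudinal outcome and dropout, so that *y _{i}* and *r _{i}* are independent given *b _{i}* (all names and parameter values are illustrative):

```python
import numpy as np

def simulate_shared_parameter(rng, N, n, beta, psi, xi):
    """Simulate from the shared-parameter factorization (42): a shared
    random intercept b_i enters both f(y_i | b_i, theta) and the dropout
    model f(r_i | b_i, psi); y_i and r_i are independent given b_i.
    """
    t = np.arange(n)
    Y = np.full((N, n), np.nan)                # NaN marks unobserved cells
    for i in range(N):
        b = rng.normal(0.0, xi)                # f(b_i | xi)
        mu = beta[0] + beta[1] * t + b         # linear trajectory
        y = rng.normal(mu, 1.0)                # f(y_i | b_i, theta)
        for j in range(n):
            # dropout hazard depends on y_i only through the shared b_i
            p_drop = 1.0 / (1.0 + np.exp(-(psi[0] + psi[1] * b)))
            if j > 0 and rng.random() < p_drop:
                break                          # monotone dropout at occasion j
            Y[i, j] = y[j]
    return Y
```

Averaging over *b _{i}* induces dependence between the observed responses and the dropout process, even though they are conditionally independent, which is exactly the mechanism the shared-parameter model exploits.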

Daniels and Hogan (2008) provide a comprehensive survey of Bayesian methods for longitudinal models with missing data. We refer the reader to their book and the many references therein. Here, we only provide a brief general discussion of implementational and methodologic issues for the Bayesian paradigm in the presence of missing data. Fully Bayesian methods require specifying priors for all the parameters and specifying distributions for the missing covariates and/or missing data mechanisms, along with the sampling distribution of the response variable. We note here that Bayesian methods for any missing data problem are, in principle, quite straightforward to implement compared to the no missing data situation, because one simply adds steps to the Gibbs sampler, for example, to sample from the complete conditional distributions of the missing data. Such steps can be easily incorporated into an existing Gibbs sampler for a no missing data problem and will generally be easier to implement than the MCEM algorithm discussed earlier. These fully Bayesian procedures can be easily implemented in WinBUGS or PROC MCMC in SAS for all types of models, including models for longitudinal data.

However, new issues arise in fitting Bayesian models with missing data that do not arise in the frequentist development. First, one has to ensure that the posterior distribution is proper when using improper priors, as it is very easy for the posterior to be improper, especially in nonignorable missing data settings. These issues and other modeling and elicitation issues are discussed in Ibrahim et al. (2001, Chap. 8), Ibrahim et al. (2002, 2008a), and Chen et al. (2002, 2004a, 2006). Second, even when using proper priors, if the model is weakly identifiable, which is often the case in many nonignorable missing data problems, the inferences may be quite sensitive to the choices of the hyperparameters, and one needs clever strategies for specifying informative priors that do not dominate the likelihood. Such strategies are outlined in Huang et al. (2005) for GLMs and can be easily extended to models for longitudinal data. Third, fully Bayesian methods may be more computationally intensive than their frequentist counterparts, and Markov chain Monte Carlo convergence may not be easily achieved.
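A simple way to probe the second issue, sensitivity to hyperparameters under weak identifiability, is to refit the model over a grid of prior variances and see how far the posterior moves. The toy conjugate-normal sketch below (hypothetical numbers throughout) mimics a weakly identified parameter by giving the likelihood very little data, so the prior variance visibly drives the answer.

```python
import numpy as np

rng = np.random.default_rng(1)
# Very little data: mimics a weakly identified parameter, e.g. an MNAR
# sensitivity parameter that the observed data barely inform.
y = rng.normal(3.0, 1.0, size=5)
n, ybar = len(y), y.mean()

# Conjugate normal model: y_i ~ N(theta, 1), theta ~ N(0, tau^2), so the
# posterior mean is n*ybar / (n + 1/tau^2). Sweep the prior variance tau^2.
results = {}
for tau2 in [0.01, 0.1, 1.0, 10.0, 100.0]:
    post_mean = n * ybar / (n + 1.0 / tau2)
    results[tau2] = post_mean
    print(f"tau^2 = {tau2:6.2f}  posterior mean = {post_mean:.3f}")
```

A tight prior pulls the posterior mean toward the prior mean of zero, while a diffuse prior lets it track the sample mean; when these disagree markedly, as here, reporting results over the whole grid is more honest than fixing a single hyperparameter.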

Problems associated with incompletely gathered data, especially in longitudinal and clinical studies, have received considerable attention in recent times (Verbeke and Molenberghs 2000; Fitzmaurice et al. 2004; Molenberghs and Verbeke 2005; Molenberghs and Kenward 2007; Daniels and Hogan 2008; Fitzmaurice et al. 2008).

To describe these issues efficiently, a formal taxonomy, as laid out in this paper, is called for. We have placed emphasis on: (1) missing data patterns (monotone, nonmonotone); (2) missing data mechanisms (MCAR, MAR, MNAR); (3) modeling frameworks (selection, pattern-mixture, and shared-parameter models); (4) inferential paradigms (likelihood, Bayesian, frequentist); (5) ignorability; and (6) outcome types (continuous/linear, noncontinuous/generalized linear). Finally, some attention has also been devoted to sensitivity analysis frameworks.

Thanks to advances in both available methodology and efficient implementations thereof, not least in generally available statistical software tools such as SAS, SPSS, SPlus, WinBUGS, and R, quite advanced analyses are within reach, and there is no longer a need to resort to such simplistic methods as complete-case analysis or last observation carried forward, to name but a few. At the same time, all methods, no matter how sophisticated, rest to some extent on unverifiable assumptions, owing to the simple fact that the missing data are unobserved. Therefore, rather than placing belief in a single model, any analysis should be supplemented with appropriate forms of sensitivity analysis.
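As a small illustration of why complete-case analysis can mislead (a hypothetical simulation, not one of the case studies in this paper), the sketch below makes the missingness of a second measurement depend only on an observed first measurement, i.e. MAR. The complete-case mean is then biased, whereas a model for the second measurement given the first, averaged over all subjects, recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
y1 = rng.normal(0.0, 1.0, n)
y2 = 1.0 + 0.8 * y1 + rng.normal(0.0, 0.6, n)   # true E[y2] = 1.0

# MAR dropout: missingness of y2 depends only on the observed y1.
p_miss = 1.0 / (1.0 + np.exp(-1.5 * y1))
obs = rng.random(n) > p_miss

# Complete-case analysis: biased, because completers over-represent low y1.
cc_mean = y2[obs].mean()

# Likelihood-style fix: model y2 | y1 on the observed cases, then average the
# fitted values over all subjects, observed and missing alike.
slope, intercept = np.polyfit(y1[obs], y2[obs], 1)
model_mean = (intercept + slope * y1).mean()

print(f"true mean 1.00 | complete-case {cc_mean:.2f} | model-based {model_mean:.2f}")
```

Because the regression of y2 on y1 is the same in completers as in the full population under MAR, the model-based estimate is consistent while the raw complete-case mean is not; this is the basic reason likelihood, Bayesian, and weighting methods dominate ad hoc deletion.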


Geert Molenberghs, Center for Statistics, International Institute for Biostatistics and Statistical Bioinformatics, Hasselt University and Catholic University Leuven, Agoralaan 1, 3590 Diepenbeek, Belgium; Email: geert.molenberghs@uhasselt.be.

- Beckman RJ, Nachtsheim CJ, Cook RD. Diagnostics for mixed-model analysis of variance. Technometrics. 1987;29:413–426.
- Best NG, Spiegelhalter DJ, Thomas A, Brayne CEG. Bayesian analysis of realistically complex models. J R Stat Soc Ser A. 1996;159:323–342.
- Beunckens C, Molenberghs G, Verbeke G, Mallinckrodt C. A latent-class mixture model for incomplete longitudinal Gaussian data. Biometrics. 2008;64(1):96–105. [PubMed]
- Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993;88:9–25.
- Brown ER, Ibrahim JG. A Bayesian semiparametric joint hierarchical model for longitudinal and survival data. Biometrics. 2003a;59:221–228. [PubMed]
- Brown ER, Ibrahim JG. Bayesian approaches to joint cure rate and longitudinal models with applications to cancer vaccine trials. Biometrics. 2003b;59:686–693. [PubMed]
- Brown ER, Ibrahim JG, DeGruttola V. A flexible b-spline model for multiple longitudinal biomarkers and survival. Biometrics. 2005;61:64–73. [PubMed]
- Carpenter J, Pocock S, Lamm CJ. Coping with missing data in clinical trials: a model based approach applied to asthma trials. Stat Med. 2002;21:1043–1066. [PubMed]
- Chen M-H, Ibrahim JG. Maximum likelihood methods for cure rate models with missing covariates. Biometrics. 2002;57:43–52. [PubMed]
- Chen M-H, Ibrahim JG, Lipsitz SR. Bayesian methods for missing covariates in cure rate models. Lifetime Data Anal. 2002;8:117–146. [PubMed]
- Chen M-H, Ibrahim JG, Shao Q-M. Propriety of the posterior distribution and existence of the maximum likelihood estimator for regression models with covariates missing at random. J Am Stat Assoc. 2004a;99:421–438.
- Chen M-H, Ibrahim JG, Sinha D. A new joint model for longitudinal and survival data with a cure fraction. J Multivar Anal. 2004b;91:18–34.
- Chen M-H, Ibrahim JG, Shao Q-M. Posterior propriety and computation for the Cox regression model with applications to missing covariates. Biometrika. 2006;93:791–807.
- Chen M-H, Ibrahim JG, Shao Q-M. Model identifiability for the Cox regression model with applications to missing covariates. J Multivar Anal. 2009 (in press) [PMC free article] [PubMed]
- Chen Q, Ibrahim JG. Missing covariate and response data in regression models. Biometrics. 2006;62:177–184. [PubMed]
- Chen Q, Zeng D, Ibrahim JG. Sieve maximum likelihood estimation for regression models with covariates missing at random. J Am Stat Assoc. 2007;102:1309–1317.
- Chen Q, Ibrahim JG, Chen M-H, Senchaudhuri P. Theory and inference for regression models with missing responses and covariates. J Multivar Anal. 2008;99:1302–1331. [PMC free article] [PubMed]
- Chi Y, Ibrahim JG. Joint models for multivariate longitudinal and survival data. Biometrics. 2006;62:432–445. [PubMed]
- Chi Y, Ibrahim JG. A new class of joint models for longitudinal and survival data accommodating zero and non-zero cure fractions: a case study of an international breast cancer study group trial. Stat Sin. 2007;17:445–462.
- Cook RD. Assessment of local influence. J R Stat Soc Ser B. 1986;48:133–169.
- Cowles MK, Carlin BP, Connett JE. Bayesian tobit modeling of longitudinal ordinal clinical trial compliance data with nonignorable missingness. J Am Stat Assoc. 1996;91:86–98.
- Creemers A, Hens N, Aerts M, Molenberghs G, Verbeke G, Kenward MG. Shared-parameter models and missingness at random. 2009 (Submitted for publication)
- Daniels MJ, Hogan JW. Missing data in longitudinal studies. London: Chapman and Hall; 2008.
- DeGruttola V, Tu XM. Modelling progression of CD4 lymphocyte count and its relationship to survival time. Biometrics. 1994;50:1003–1014. [PubMed]
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion) J R Stat Soc Ser B. 1977;39:1–38.
- Diggle P, Kenward MG. Informative drop-out in longitudinal data analysis (with discussion) Appl Stat. 1994;43:49–93.
- Diggle PJ, Heagerty P, Liang K-Y, Zeger SL. Analysis of longitudinal data. London: Oxford University Press; 2002.
- Ekholm A, Skinner C. The Muscatine children’s obesity data reanalysed using pattern mixture models. Appl Stat. 1998;47:251–263.
- Faucett CL, Thomas DC. Simultaneously modelling censored survival data and repeatedly measured covariates: a Gibbs sampling approach. Stat Med. 1996;15:1663–1685. [PubMed]
- Fitzmaurice GM, Laird NM. Generalized linear mixture models for handling nonignorable dropouts in longitudinal studies. Biostatistics. 2000;1:141–156. [PubMed]
- Fitzmaurice GM, Lipsitz SR, Molenberghs G, Ibrahim JG. Bias in estimating association parameters for longitudinal binary responses with drop-outs. Biometrics. 2001;57:15–21. [PubMed]
- Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. New York: Wiley; 2004.
- Fitzmaurice GM, Lipsitz SR, Ibrahim JG, Gelber R, Lipshultz S. Estimation in regression models for longitudinal binary data with outcome-dependent follow-up. Biostatistics. 2006;7:469–485. [PubMed]
- Fitzmaurice GM, Davidian M, Verbeke G, Molenberghs M. Longitudinal data analysis. London: Chapman and Hall; 2008.
- Follman D, Wu M. An approximate generalized linear model with random effects for informative missing data. Biometrics. 1995;51:151–168. [PubMed]
- Garcia RI, Ibrahim JG, Zhu H. Variable selection for regression models with missing data. Stat Sin. 2009 (in press) [PMC free article] [PubMed]
- Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. Appl Stat. 1992;41:337–348.
- Henderson R, Diggle P, Dobson A. Joint modelling of longitudinal measurements and event time data. Biostatistics. 2000;1:465–480. [PubMed]
- Herring AH, Ibrahim JG. Likelihood-based methods for missing covariates in the Cox proportional hazards model. J Am Stat Assoc. 2001;96:292–302.
- Herring AH, Ibrahim JG. Maximum likelihood estimation in random effects cure rate models with nonignorably missing covariates. Biostatistics. 2002;3:387–405. [PubMed]
- Herring AH, Ibrahim JG, Lipsitz SR. Frailty models with missing covariates. Biometrics. 2002;58:98–109. [PubMed]
- Herring AH, Ibrahim JG, Lipsitz SR. Nonignorably missing covariate data in survival analysis: a case study of an international breast cancer study group trial. Appl Stat. 2004;53:293–310.
- Hogan JW, Laird NM. Mixture models for the joint distribution of repeated measures and event times. Stat Med. 1997;16:239–257. [PubMed]
- Hogan JW, Laird NM. Increasing efficiency from censored survival data using random effects from longitudinal covariates. Stat Methods Med Res. 1998;7:28–48. [PubMed]
- Huang L, Chen M-H, Ibrahim JG. Bayesian analysis for generalized linear models with nonignorably missing covariates. Biometrics. 2005;61:767–780. [PubMed]
- Ibrahim JG. Incomplete data in generalized linear models. J Am Stat Assoc. 1990;85:765–769.
- Ibrahim JG, Lipsitz SR, Chen M-H. Missing covariates in generalized linear models when the missing data mechanism is nonignorable. J R Stat Soc Ser B. 1999a;61:173–190.
- Ibrahim JG, Chen MH, Lipsitz SR. Monte Carlo EM for missing covariates in parametric regression models. Biometrics. 1999b;55:591–596. [PubMed]
- Ibrahim JG, Chen M-H, Lipsitz SR. Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika. 2001;88:551–564.
- Ibrahim JG, Chen M-H, Lipsitz SR. Bayesian methods for generalized linear models with covariates missing at random. Can J Stat. 2002;30:55–78.
- Ibrahim JG, Chen M-H, Sinha D. Bayesian methods for joint modeling of longitudinal and survival data with applications to cancer vaccine trials. Stat Sin. 2004;14:863–883.
- Ibrahim JG, Chen M-H, Lipsitz SR, Herring AH. Missing data methods in generalized linear models: a comparative review. J Am Stat Assoc. 2005;100:332–346.
- Ibrahim JG, Chen M-H, Kim S. Bayesian variable selection for the Cox regression model with missing covariates. Lifetime Data Anal. 2008a;14:496–520. [PMC free article] [PubMed]
- Ibrahim JG, Zhu H, Tang N. Model selection criteria for missing data problems using the EM algorithm. J Am Stat Assoc. 2008b;103:1648–1658. [PMC free article] [PubMed]
- Jennrich RI, Schluchter MD. Unbalanced repeated-measures models with structured covariance matrices. Biometrics. 1986;42:805–820. [PubMed]
- Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. [PubMed]
- Lavalley MP, DeGruttola V. Models for empirical Bayes estimators of longitudinal CD4 counts. Stat Med. 1996;15:2289–2305. [PubMed]
- Lesaffre E, Verbeke G. Local influence in linear mixed models. Biometrics. 1998;54:570–582. [PubMed]
- Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Lipsitz SR, Ibrahim JG, Fitzmaurice GM. Likelihood methods for incomplete longitudinal binary responses with incomplete categorical covariates. Biometrics. 1999a;55:214–223. [PubMed]
- Lipsitz SR, Ibrahim JG, Zhao LP. A new weighted estimating equation for missing covariate data with properties similar to maximum likelihood. J Am Stat Assoc. 1999b;94:1147–1160.
- Lipsitz SR, Ibrahim JG, Molenberghs G. Using a Box–Cox transformation in the analysis of longitudinal data with incomplete responses. Appl Stat. 2000;49:287–296.
- Lipsitz SR, Parzen M, Molenberghs G, Ibrahim JG. Testing for bias in weighted estimating equations. Biostatistics. 2001;2:295–307. [PubMed]
- Lipsitz SR, Fitzmaurice GM, Ibrahim JG, Gelber R, Lipshultz S. Parameter estimation in longitudinal studies with outcome-dependent follow-up. Biometrics. 2002;58:621–630. [PubMed]
- Little RJA. Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc. 1993;88:125–134.
- Little RJA. A class of pattern-mixture models for normal incomplete data. Biometrika. 1994;81:471–483.
- Little RJA. Modeling the drop-out mechanism in repeated-measures studies. J Am Stat Assoc. 1995;90:1113–1121.
- Little RJA, Wang Y. Pattern-mixture models for multivariate incomplete data with covariates. Biometrics. 1996;52:98–111. [PubMed]
- Little RJA, Rubin DB. Statistical analysis with missing data. New York: Wiley; 2002.
- Louis T. Finding the observed information matrix when using the EM algorithm. J R Stat Soc Ser B. 1982;44:226–233.
- Meilijson I. A fast improvement to the EM algorithm on its own terms. J R Stat Soc Ser B. 1989;51:127–138.
- Molenberghs G, Verbeke G. Models for discrete longitudinal data. New York: Springer; 2005.
- Molenberghs G, Kenward MG. Missing data in clinical studies. New York: Wiley; 2007.
- Molenberghs G, Kenward MG, Lesaffre E. The analysis of longitudinal ordinal data with nonrandom drop-out. Biometrika. 1997;84:33–44.
- Pawitan Y, Self S. Modeling disease marker processes in AIDS. J Am Stat Assoc. 1993;88:719–726.
- Prentice RL. Surrogate endpoints in clinical trials: definitions and operational criteria. Stat Med. 1989;8:431–440. [PubMed]
- Renard D, Geys H, Molenberghs G, Burzykowski T, Buyse M. Validation of surrogate endpoints in multiple randomized clinical trials with discrete outcomes. Biom J. 2002;44:921–935.
- Rizopoulos D, Verbeke G, Molenberghs G. Shared parameter models under random-effects misspecification. Biometrika. 2008;95:63–74.
- Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc. 1995;90(429):106–121.
- Rotnitzky A, Robins JM, Scharfstein DO. Semiparametric regression for repeated outcomes with nonignorable nonresponse. J Am Stat Assoc. 1998;93:1321–1339.
- Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–592.
- Rubin DB. Multiple imputation for nonresponse in surveys. Wiley series in probability and mathematical statistics: applied probability and statistics. New York: Wiley; 1987.
- Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J Am Stat Assoc. 1999;94:1096–1120.
- Schluchter MD. Methods for the analysis of informatively censored longitudinal data. Stat Med. 1992;11:1861–1870. [PubMed]
- Shi X, Zhu H, Ibrahim JG. Local influence for generalized linear models with missing covariates. Biometrics. 2009 (in press) [PMC free article] [PubMed]
- Stubbendick AL, Ibrahim JG. Maximum likelihood methods for nonignorable responses and covariates in random effects models. Biometrics. 2003;59:1140–1150. [PubMed]
- Stubbendick AL, Ibrahim JG. Likelihood-based inference with nonignorably missing responses and covariates in models for discrete longitudinal data. Stat Sin. 2006;16:1143–1167.
- Taylor JMG, Cumberland WG, Sy JP. A stochastic model for analysis of longitudinal AIDS data. J Am Stat Assoc. 1994;89:727–736.
- Thijs H, Molenberghs G, Michiels B, Verbeke G, Curran D. Strategies to fit pattern-mixture models. Biostatistics. 2002;3:245–265. [PubMed]
- Troxel AB, Harrington DP, Lipsitz SR. Analysis of longitudinal data with nonignorable non-monotone missing values. Appl Stat. 1998a;47:425–438.
- Troxel AB, Lipsitz SR, Harrington DP. Marginal models for the analysis of longitudinal measurements with nonignorable non-monotone missing data. Biometrika. 1998b;85:661–672.
- Tsiatis AA, DeGruttola V, Wulfsohn MS. Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS. J Am Stat Assoc. 1995;90:27–37.
- Verbeke G, Molenberghs G. Linear mixed models for longitudinal data. New York: Springer; 2000.
- Wedderburn RWM. Quasi-likelihood methods, generalised linear models, and the Gauss–Newton method. Biometrika. 1974;61:439–447.
- Wei GC, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J Am Stat Assoc. 1990;85:699–704.
- Wolfinger R, O’Connell M. Generalized linear models: a pseudo-likelihood approach. J Stat Comput Simul. 1993;48:233–243.
- Woolson RF, Clarke WR. Analysis of categorical incomplete longitudinal data. J R Stat Soc Ser A. 1984;147:87–99.
- Wu MC, Bailey KR. Analysing changes in the presence of informative right censoring caused by death and withdrawal. Stat Med. 1988;7:337–346. [PubMed]
- Wu MC, Carroll RJ. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics. 1988;44:175–188.
- Wu MC, Bailey KR. Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics. 1989;45:939–955. [PubMed]
- Xu J, Zeger SL. Joint analysis of longitudinal data comprising repeated measures and times to events. Appl Stat. 2001;50:375–387.
- Zeger SL, Liang K-Y. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986;42:121–130. [PubMed]
- Zhu H-T, Lee S-Y. Local influence for incomplete-data models. J R Stat Soc Ser B. 2001;63:111–126.
- Zhu H, Ibrahim JG, Shi X. Diagnostic measures for generalized linear models with missing covariates. Scand J Stat. 2009 (in press) [PMC free article] [PubMed]
