HHS Public Access author manuscript; accepted for publication in a peer-reviewed journal.

Am Stat. Author manuscript; available in PMC 2010 December 28.
Published in final edited form as:
Am Stat. 2010 August 1; 64(3): 263–267.
doi:  10.1198/tast.2010.09203
PMCID: PMC3010738
NIHMSID: NIHMS258556

Consistency of Normal Distribution Based Pseudo Maximum Likelihood Estimates When Data Are Missing at Random

Ke-Hai Yuan
University of Notre Dame

Abstract

This paper shows that, when variables with missing values are linearly related to observed variables, the normal-distribution-based pseudo MLEs are still consistent. The population distribution may be unknown while the missing data process can follow an arbitrary missing at random mechanism. Enough details are provided for the bivariate case so that readers having taken a course in statistics/probability can fully understand the development. Sufficient conditions for the consistency of the MLEs in higher dimensions are also stated, while the details are omitted.

Keywords: Consistency, maximum likelihood, model misspecification, missing data

1. Introduction

Incomplete or missing data exist in almost all areas of empirical research. They are especially common when data are collected longitudinally and/or by surveys. There can be various reasons for missing data to occur. The process by which data become incomplete was called the missing data mechanism by Rubin (1976). Missing completely at random (MCAR) is a process in which missingness of data is independent of both the observed and the missing values; missing at random (MAR) is a process in which missingness is independent of the missing values given the observed data. When missingness depends on the missing values themselves given the observed data, the process is not missing at random (NMAR). Missing data with an NMAR mechanism are also referred to as non-ignorable non-responses because maximum likelihood estimates (MLE), by ignoring the missing data mechanism, are generally inconsistent. This paper studies the consistency of the normal-distribution-based MLE with MAR data.

In contrast to many ad hoc procedures for missing data analysis, MLEs have the desired property of being consistent even when the specific MAR mechanism is ignored. When modeling real data, however, specifying the correct distribution form to obtain the true MLE is always challenging if not impossible. The normal distribution is often chosen for convenience, not because practical data tend to come from normally distributed populations. Geary (1947, p. 241) observed that “Normality is a myth; there never was, and never will be a normal distribution.” Such an observation was further supported by Micceri (1989), who examined 440 data sets obtained from journal articles, research projects as well as tests, and found that all were significantly nonnormally distributed. Thus, the normal-distribution-based MLEs for real data typically are pseudo MLEs, whose properties have been obtained by White (1982) and Gourieroux, Monfort and Trognon (1984) in the context of complete data. With missing data, however, according to Laird (1988) and Rotnitzky and Wypij (1994), pseudo MLEs will be inconsistent unless the missing data mechanism is MCAR. The need for a correct likelihood function with an MAR mechanism was also noted by Liang and Zeger (1986) and Little (1993). If a pseudo MLE is not consistent when data are MAR, then only the MCAR mechanism can be ignored when modeling practical multivariate data with unknown population distributions. Thus, in addition to being an important mathematical property, consistency of the normal-distribution-based pseudo MLE with MAR data also has wide implications for many areas of applied statistics where the normal distribution is routinely used to model missing data.

Let x = (x1, x2, …, xq)′ be a vector representing a q-variate population, xo contain the observed values and xm contain the missing values in a realization of x. Let r = (r1, r2, …, rq)′, where rj = 1 if xj is observed and 0 otherwise. The missing data in xm are MAR if

$$P(r_j = 0 \mid x_o, x_m) = P(r_j = 0 \mid x_o) = g_j(x_o, \gamma_j), \quad j = 1, 2, \ldots, q \tag{1}$$

and the rj's are conditionally/locally independent given x, where γj is a vector of parameters. In practice, three popular forms of gj(xo, γj) are the interval selection model (see e.g., Schafer, 1997, p. 25)

$$g_j(x_o, \gamma_j) = 1$$

when xo falls into certain hyper-rectangles; the probit selection model

$$g_j(x_o, \gamma_j) = \Phi(\gamma_{j0} + \gamma_{j1}' x_o),$$

where Φ(·) is the cumulative distribution function of N(0, 1) and γj1 contains the regression coefficients; and the logistic selection model

$$g_j(x_o, \gamma_j) = \frac{\exp(\gamma_{j0} + \gamma_{j1}' x_o)}{1 + \exp(\gamma_{j0} + \gamma_{j1}' x_o)}.$$

The interval selection model is widely used in economics (Amemiya, 1973; Heckman, 1979) while the probit and logistic selection models are commonly used in many other disciplines (Allison 2001; Little & Rubin, 2002; Molenberghs & Kenward, 2007; Daniels & Hogan, 2008). Under the interval selection model, Yuan (2009) showed that the normal-distribution-based pseudo MLEs are consistent and asymptotically normally distributed even when the underlying distribution is unknown. The purpose of this note is to extend the result of Yuan (2009) by showing that the normal-distribution-based MLEs are still consistent when the underlying population is unknown and when the gj(xo, γj) in (1) is any function of the observed data, including both the probit and logistic selection models.
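The three selection models can be written as small functions, each returning P(rj = 0 | xo). The following is a minimal sketch only: the function names, the rectangle bounds `lower` and `upper`, and the coefficient values in the usage lines are illustrative assumptions and do not come from the paper.

```python
import math
import numpy as np

def interval_selection(x_o, lower, upper):
    """P(r_j = 0 | x_o) = 1 when every coordinate of x_o lies in [lower, upper], else 0."""
    x_o = np.atleast_2d(x_o)
    inside = np.all((x_o >= lower) & (x_o <= upper), axis=1)
    return inside.astype(float)

def probit_selection(x_o, gamma0, gamma1):
    """P(r_j = 0 | x_o) = Phi(gamma0 + gamma1' x_o), with Phi the N(0, 1) cdf."""
    eta = gamma0 + np.atleast_2d(x_o) @ np.asarray(gamma1, dtype=float)
    # Standard normal cdf written through the error function.
    return 0.5 * (1.0 + np.vectorize(math.erf)(eta / math.sqrt(2.0)))

def logistic_selection(x_o, gamma0, gamma1):
    """P(r_j = 0 | x_o) = exp(eta) / (1 + exp(eta)), eta = gamma0 + gamma1' x_o."""
    eta = gamma0 + np.atleast_2d(x_o) @ np.asarray(gamma1, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

# Usage with hypothetical values: two cases, one observed covariate each.
x_obs = np.array([[0.0], [1.0]])
print(interval_selection(x_obs, 0.5, 2.0))
print(probit_selection(x_obs, 0.0, [1.0]))
print(logistic_selection(x_obs, 0.0, [1.0]))
```

All three functions depend on the observed values only, which is what makes the resulting mechanism MAR in the sense of (1).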

While all the results in Yuan (2009) can be generalized to an MAR mechanism described by probit and logistic selection models, for brevity we only give the details for the bivariate case in section 2. With missing data and a misspecified likelihood function, little literature exists that facilitates thorough understanding of issues related to consistency. We choose this simple model with enough details so that readers having taken a course in statistics/probability can fully understand the development. We will also state the results for consistency for a general q in section 3, but the details will be omitted. We conclude the paper by pointing out that not all pseudo MLEs are consistent.

2. Consistency in the bivariate case

Let x = (x1, x2)′ with

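$$\mu = E(x) = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad \text{and} \quad \Sigma = \mathrm{Cov}(x) = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}, \tag{2}$$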

where $\sigma_{11} = \sigma_1^2$, $\sigma_{22} = \sigma_2^2$ and $\sigma_{12} = \sigma_{21} = \sigma_1\sigma_2\rho$. A sample from x with missing values in x2 can be represented by

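$$\begin{array}{llllll} x_{11}, & \ldots, & x_{n1}, & x_{(n+1)1}, & \ldots, & x_{N1} \\ x_{12}, & \ldots, & x_{n2}, & & & \end{array} \tag{3}$$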

where x(n+1)2, …, xN2 are missing. The interest is to infer (2) based on (3) using the possibly wrong assumption x ~ N2(μ,Σ). Notice that the number of cases with xi2 missing is not controllable so that the n in (3) is a random number.

With two variables, there are a total of 4 possible observed patterns: (xi1, xi2), (xi1, ), ( , xi2) and ( , ). The sample in (3) only contains two of the four. We chose this sample because the MLEs enjoy analytical solutions and the proof of their consistency is simple enough to be understood by a broad audience. We will discuss the consistency of the MLEs with more missing data patterns and more variables in section 3.

Let

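$$\begin{aligned}
\bar{x}_1 &= \frac{1}{N}\sum_{i=1}^{N} x_{i1}, & s_{11} &= \frac{1}{N}\sum_{i=1}^{N} (x_{i1} - \bar{x}_1)^2, \\
\bar{x}_1^* &= \frac{1}{n}\sum_{i=1}^{n} x_{i1}, & s_{11}^* &= \frac{1}{n}\sum_{i=1}^{n} (x_{i1} - \bar{x}_1^*)^2, \\
\bar{x}_2 &= \frac{1}{n}\sum_{i=1}^{n} x_{i2}, & s_{22}^* &= \frac{1}{n}\sum_{i=1}^{n} (x_{i2} - \bar{x}_2)^2, \\
s_{21}^* &= \frac{1}{n}\sum_{i=1}^{n} (x_{i2} - \bar{x}_2)(x_{i1} - \bar{x}_1^*), & \hat{\beta}_{21} &= \frac{s_{21}^*}{s_{11}^*} = \frac{\sum_{i=1}^{n} (x_{i2} - \bar{x}_2)(x_{i1} - \bar{x}_1^*)}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1^*)^2}.
\end{aligned}$$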

Then it follows from Anderson (1957) that the MLEs of (μ1, σ11, μ2, σ22, σ12) based on the normal distribution assumption by ignoring the missing data mechanism are

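$$\hat{\mu}_1 = \bar{x}_1, \quad \hat{\sigma}_{11} = s_{11}, \tag{4a}$$

$$\hat{\mu}_2 = \bar{x}_2 + \hat{\beta}_{21}(\bar{x}_1 - \bar{x}_1^*), \tag{4b}$$

$$\hat{\sigma}_{22} = s_{22}^* + \hat{\beta}_{21}^2(\hat{\sigma}_{11} - s_{11}^*), \tag{4c}$$

$$\hat{\sigma}_{12} = \hat{\beta}_{21}\hat{\sigma}_{11}. \tag{4d}$$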
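The closed-form estimates in (4a) to (4d) are straightforward to compute. The sketch below is one possible implementation, assuming the missing x2 values are coded as `np.nan`; the function name `anderson_mle` and the dictionary keys are hypothetical choices made for illustration.

```python
import numpy as np

def anderson_mle(x1, x2):
    """Normal-theory MLEs (4a)-(4d) for a sample in which only x2 has missing values.

    x1: length-N array, fully observed.
    x2: length-N array, np.nan where the value is missing.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    complete = ~np.isnan(x2)
    x1c, x2c = x1[complete], x2[complete]

    # Moments of x1 from all N cases (np.var uses divisor N by default).
    xbar1, s11 = x1.mean(), x1.var()
    # Starred moments from the n complete cases.
    xbar1_c, s11_c = x1c.mean(), x1c.var()
    xbar2, s22_c = x2c.mean(), x2c.var()
    s21_c = np.mean((x2c - xbar2) * (x1c - xbar1_c))
    beta21 = s21_c / s11_c

    return {
        "mu1": xbar1,                                      # (4a)
        "sigma11": s11,                                    # (4a)
        "mu2": xbar2 + beta21 * (xbar1 - xbar1_c),         # (4b)
        "sigma22": s22_c + beta21 ** 2 * (s11 - s11_c),    # (4c)
        "sigma12": beta21 * s11,                           # (4d)
    }

# Usage with a tiny made-up sample: the last two x2 values are missing.
print(anderson_mle([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, np.nan, np.nan]))
```

With no missing values the starred and unstarred moments coincide, so the function returns the usual sample moments with divisor N.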

Through the work of Rubin (1976) and others, it is widely known that $\hat{\mu}_2$, $\hat{\sigma}_{22}$ and $\hat{\sigma}_{12}$ are consistent when all the missing data in (3) are MCAR or MAR and x ~ N2(μ,Σ). In the following, we will study the consistency of $\hat{\mu}_2$, $\hat{\sigma}_{22}$ and $\hat{\sigma}_{12}$ using (4) when x does not follow the bivariate normal distribution and the missing xi2's in (3) are MAR. For such a purpose, we assume that the population for the data in (3) is

$$x_1 = \mu_1 + \sigma_1 z_1, \quad x_2 = \mu_2 + \sigma_2\left[\rho z_1 + (1 - \rho^2)^{1/2} z_2\right], \tag{5}$$

where z1 and z2 are independent with E(z1) = E(z2) = 0 and Var(z1) = Var(z2) = 1. Clearly, (5) follows a bivariate normal distribution only when z1 ~ N(0, 1) and z2 ~ N(0, 1). However, the population mean vector and variance-covariance matrix of x = (x1, x2)′ remain the same regardless of the distributions of z1 and z2.
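A short simulation can illustrate that the first two moments under (5) do not depend on the distributions of z1 and z2. This is a sketch under stated assumptions: the standardized exponential for z1, the standardized uniform for z2, the parameter values, and the function name `simulate_model5` are all illustrative choices, not part of the paper.

```python
import numpy as np

def simulate_model5(N, mu1, mu2, sigma1, sigma2, rho, rng):
    """Draw N cases from model (5) with nonnormal, standardized z1 and z2."""
    z1 = rng.exponential(1.0, N) - 1.0                  # skewed, mean 0, variance 1
    z2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N)    # flat, mean 0, variance 1
    x1 = mu1 + sigma1 * z1
    x2 = mu2 + sigma2 * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
    return x1, x2

rng = np.random.default_rng(0)
mu1, mu2, sigma1, sigma2, rho = 1.0, 2.0, 1.5, 0.5, 0.6
x1, x2 = simulate_model5(200_000, mu1, mu2, sigma1, sigma2, rho, rng)

sample_mean = np.array([x1.mean(), x2.mean()])
sample_cov = np.cov(x1, x2, bias=True)
target_cov = np.array([[sigma1 ** 2, sigma1 * sigma2 * rho],
                       [sigma1 * sigma2 * rho, sigma2 ** 2]])
print(sample_mean)
print(sample_cov)
print(target_cov)
```

Swapping in any other standardized distributions for z1 and z2 should leave the sample mean and covariance close to the same targets.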

Corresponding to (3) there exist independent random variables zi1 and zi2 such that

$$x_{i1} = \mu_1 + \sigma_1 z_{i1}, \quad x_{i2} = \mu_2 + \sigma_2\left[\rho z_{i1} + (1 - \rho^2)^{1/2} z_{i2}\right].$$

Let

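$$\begin{aligned}
\bar{z}_1^* &= \frac{1}{n}\sum_{i=1}^{n} z_{i1}, \quad \bar{z}_2 = \frac{1}{n}\sum_{i=1}^{n} z_{i2}, \quad s_{z11}^* = \frac{1}{n}\sum_{i=1}^{n} (z_{i1} - \bar{z}_1^*)^2, \\
s_{z21}^* &= \frac{1}{n}\sum_{i=1}^{n} (z_{i2} - \bar{z}_2)(z_{i1} - \bar{z}_1^*), \quad s_{z22}^* = \frac{1}{n}\sum_{i=1}^{n} (z_{i2} - \bar{z}_2)^2.
\end{aligned}$$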

Then

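$$\bar{x}_1^* = \mu_1 + \sigma_1 \bar{z}_1^*, \quad s_{11}^* = \sigma_{11} s_{z11}^*, \quad s_{21}^* = \sigma_2\sigma_1\left[\rho s_{z11}^* + (1 - \rho^2)^{1/2} s_{z21}^*\right], \tag{6a}$$

$$\bar{x}_2 = \mu_2 + \sigma_2\left[\rho \bar{z}_1^* + (1 - \rho^2)^{1/2} \bar{z}_2\right], \quad s_{22}^* = \sigma_{22}\left[\rho^2 s_{z11}^* + 2\rho(1 - \rho^2)^{1/2} s_{z21}^* + (1 - \rho^2) s_{z22}^*\right]. \tag{6b}$$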
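The relations in (6a) and (6b) are algebraic identities, so they can be checked numerically on any set of values. The sketch below treats n simulated rows as the complete cases; the parameter values, the distributions used to draw z1 and z2, and the helper name `cov0` are assumptions made for illustration.

```python
import numpy as np

def cov0(a, b):
    """Sample covariance with divisor n, matching the starred moments."""
    return np.mean((a - a.mean()) * (b - b.mean()))

rng = np.random.default_rng(1)
n = 500
mu1, mu2, s1, s2, rho = 1.0, 2.0, 1.5, 0.5, 0.6
c = np.sqrt(1.0 - rho ** 2)

# Any values work here: the identities do not rely on the distribution of z.
z1 = rng.standard_t(df=5, size=n)
z2 = rng.uniform(-1.0, 1.0, n)
x1 = mu1 + s1 * z1
x2 = mu2 + s2 * (rho * z1 + c * z2)

# Left-hand sides: moments computed directly from x.
s11_direct, s21_direct, s22_direct = cov0(x1, x1), cov0(x2, x1), cov0(x2, x2)

# Right-hand sides: the same moments expressed through z as in (6a) and (6b).
s11_via_z = s1 ** 2 * cov0(z1, z1)
s21_via_z = s2 * s1 * (rho * cov0(z1, z1) + c * cov0(z2, z1))
s22_via_z = s2 ** 2 * (rho ** 2 * cov0(z1, z1)
                       + 2.0 * rho * c * cov0(z2, z1)
                       + c ** 2 * cov0(z2, z2))
xbar2_via_z = mu2 + s2 * (rho * z1.mean() + c * z2.mean())

print(s21_direct, s21_via_z)
print(s22_direct, s22_via_z)
```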

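The equations in (6) allow us to obtain the probability limits of $\bar{x}_1^*$, $\bar{x}_2$ and $s_{jk}^*$ through those of $\bar{z}_1^*$, $\bar{z}_2$ and $s_{zjk}^*$, which further lead to consistency of $\hat{\mu}_2$, $\hat{\sigma}_{12}$ and $\hat{\sigma}_{22}$ in (4).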

We also need to connect the observations in (3) to the MAR mechanism. Let ri = 1 if the xi2 in (3) is observed and ri = 0 if the xi2 in (3) is missing. Because xi1 and zi1 are uniquely determined by each other, we can rewrite the MAR mechanism in equation (1) as

$$P(r_i = 0 \mid x_{i1}, x_{i2}) = P(r_i = 0 \mid x_{i1}) = P(r_i = 0 \mid z_{i1}) = g(x_{i1}, \gamma) = h(z_{i1}), \tag{7}$$

where the parameter vector γ is omitted from h(·). Let the probability density functions (pdf) of z1 and z2 be f1(t) and f2(t), respectively. Then n, the number of complete cases in (3), follows the binomial distribution B(N, po), where, with I being an indicator function,

$$p_o = P(r_i = 1) = E(I\{r_i = 1\}) = E(1 - I\{r_i = 0\}) = 1 - E[E(I\{r_i = 0\} \mid x_{i1})] = 1 - E[P(r_i = 0 \mid x_{i1})] = 1 - E[h(z_{i1})].$$
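As a numerical illustration of this formula, po can be computed exactly for a particular choice of z1 and h and compared with the observed fraction n/N in a simulated sample. Everything specific below is an assumption for the sketch: z1 is taken to be uniform on (-√3, √3), h is a logistic function with hypothetical coefficients `g0` and `g1`, and the function names are made up.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def h(z, g0=-0.5, g1=1.0):
    """Assumed logistic form of P(r_i = 0 | z_i1)."""
    return 1.0 / (1.0 + np.exp(-(g0 + g1 * z)))

def p_observed_uniform(g0=-0.5, g1=1.0):
    """p_o = 1 - E[h(z1)] for z1 uniform on (-sqrt(3), sqrt(3)).

    Uses the antiderivative of the logistic function, log(1 + exp(eta)),
    so the expectation is available in closed form (g1 must be nonzero).
    """
    a, b = -SQRT3, SQRT3
    upper = np.logaddexp(0.0, g0 + g1 * b)
    lower = np.logaddexp(0.0, g0 + g1 * a)
    expected_h = (upper - lower) / (g1 * (b - a))
    return 1.0 - expected_h

rng = np.random.default_rng(2)
N = 200_000
z1 = rng.uniform(-SQRT3, SQRT3, N)
observed = rng.random(N) >= h(z1)   # x_i2 is missing with probability h(z_i1)
n = int(observed.sum())

print(n / N, p_observed_uniform())
```

The count n produced this way is a draw from B(N, po), so n/N should be close to the closed-form value for large N.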

Let ti1 be the realized value of zi1 and f be a generic notation for the probability distribution/density function of the involved random variables. It follows from (7) and

$$f(r_i = 1, t_{i1}) = p_o f(t_{i1} \mid r_i = 1) = P(r_i = 1 \mid t_{i1}) f_1(t_{i1})$$

that

$$f(t_{i1} \mid r_i = 1) = \frac{1}{p_o}\left[1 - h(t_{i1})\right] f_1(t_{i1}).$$

Thus, the zi1 corresponding to the observed xi2 are independent, identically distributed (iid), and each follows the distribution with pdf

$$f_1^*(t) = \frac{1}{p_o}\left[1 - h(t)\right] f_1(t).$$

Notice that, due to the MAR mechanism, the missingness in (3) has nothing to do with zi2. Each zi2 corresponding to either the observed xi2 or missing xi2 still has the same distribution as z2 ~ f2(t).
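The effect of selection on z1, and the absence of any effect on z2, can be seen numerically. The sketch below evaluates the mean and variance of the selected-case density by a midpoint rule and compares them with a simulated sample. The uniform distributions for z1 and z2 and the logistic h with its coefficients are assumptions chosen only for illustration.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)

def h(z, g0=-0.5, g1=1.0):
    """Assumed logistic form of P(r_i = 0 | z_i1)."""
    return 1.0 / (1.0 + np.exp(-(g0 + g1 * z)))

# Midpoint-rule grid over the support of z1 ~ Uniform(-sqrt(3), sqrt(3)).
K = 200_000
a, b = -SQRT3, SQRT3
dt = (b - a) / K
t = a + (np.arange(K) + 0.5) * dt
f1 = np.full(K, 1.0 / (b - a))

weight = (1.0 - h(t)) * f1
p_o = np.sum(weight) * dt              # probability that x_i2 is observed
f1_star = weight / p_o                 # density of z_i1 among the complete cases
mu_star = np.sum(t * f1_star) * dt
var_star = np.sum(t ** 2 * f1_star) * dt - mu_star ** 2

# Simulated counterpart.
rng = np.random.default_rng(3)
N = 200_000
z1 = rng.uniform(a, b, N)
z2 = rng.uniform(a, b, N)
observed = rng.random(N) >= h(z1)
z1_sel, z2_sel = z1[observed], z2[observed]

print(mu_star, z1_sel.mean())
print(var_star, z1_sel.var())
print(z2_sel.mean(), z2_sel.var())
```

Because the assumed h increases in z1, large z1 values are more likely to go with a missing x2, and the selected z1 values have a mean below zero, while the selected z2 values keep mean 0 and variance 1 up to sampling error.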

Let the mean and variance of $u \sim f_1^*(t)$ be $\mu_{z1}^*$ and $\sigma_{z11}^*$. Let $u_1, u_2, \ldots, u_N$ be iid with pdf $f_1^*(t)$; $\omega_1, \omega_2, \ldots, \omega_N$ be iid with $P(\omega_i = 1) = p_o$ and $P(\omega_i = 0) = 1 - p_o$; and the $\omega_i$'s be independent of the $u_i$'s and $z_{i2}$'s. Then $n$ and $\sum_{i=1}^{N} \omega_i$ have the same distribution; $\sum_{i=1}^{n} z_{i1}$ and $\sum_{i=1}^{N} \omega_i u_i$ have the same distribution; $\sum_{i=1}^{n} z_{i1}^2$ and $\sum_{i=1}^{N} \omega_i u_i^2$ have the same distribution; $\sum_{i=1}^{n} z_{i1} z_{i2}$ and $\sum_{i=1}^{N} \omega_i u_i z_{i2}$ have the same distribution; $\sum_{i=1}^{n} z_{i2}$ and $\sum_{i=1}^{N} \omega_i z_{i2}$ have the same distribution; $\sum_{i=1}^{n} z_{i2}^2$ and $\sum_{i=1}^{N} \omega_i z_{i2}^2$ have the same distribution. A brief proof for the above relationships using characteristic functions is provided in Appendix A of Yuan and Lu (2008).

We are now ready to show the result of consistency. Applying the law of large numbers to the average of ωi yields

$$\frac{n}{N} = \frac{1}{N}\sum_{i=1}^{N} \omega_i \xrightarrow{\text{wp1}} p_o,$$

where the equal sign follows from equivalence in distribution and wp1 is the notation for convergence with probability one. Repeatedly applying equivalence in distribution and the law of large numbers to the averages of ωiui and ωizi2, respectively, we have

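$$\bar{z}_1^* = \frac{\sum_{i=1}^{N} \omega_i u_i / N}{n/N} \xrightarrow{\text{wp1}} \frac{p_o E(u)}{p_o} = \mu_{z1}^* \tag{8}$$

and

$$\bar{z}_2 = \frac{\sum_{i=1}^{N} \omega_i z_{i2} / N}{n/N} \xrightarrow{\text{wp1}} \frac{p_o E(z_2)}{p_o} = 0. \tag{9}$$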

Similarly,

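$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} z_{i1}^2 &= \frac{\sum_{i=1}^{N} \omega_i u_i^2 / N}{n/N} \xrightarrow{\text{wp1}} E(u^2); \\
\frac{1}{n}\sum_{i=1}^{n} z_{i1} z_{i2} &= \frac{\sum_{i=1}^{N} \omega_i u_i z_{i2} / N}{n/N} \xrightarrow{\text{wp1}} E(u)E(z_2) = 0; \\
\frac{1}{n}\sum_{i=1}^{n} z_{i2}^2 &= \frac{\sum_{i=1}^{N} \omega_i z_{i2}^2 / N}{n/N} \xrightarrow{\text{wp1}} E(z_2^2) = 1.
\end{aligned}$$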

Thus,

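$$s_{z11}^* = \frac{1}{n}\sum_{i=1}^{n} z_{i1}^2 - (\bar{z}_1^*)^2 \xrightarrow{\text{wp1}} E(u^2) - (\mu_{z1}^*)^2 = \sigma_{z11}^*; \tag{10}$$

$$s_{z21}^* = \frac{1}{n}\sum_{i=1}^{n} z_{i1} z_{i2} - \bar{z}_1^* \bar{z}_2 \xrightarrow{\text{wp1}} 0; \tag{11}$$

$$s_{z22}^* = \frac{1}{n}\sum_{i=1}^{n} z_{i2}^2 - \bar{z}_2^2 \xrightarrow{\text{wp1}} 1. \tag{12}$$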

It is obvious that

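$$\hat{\mu}_1 = \bar{x}_1 \xrightarrow{\text{wp1}} \mu_1 \quad \text{and} \quad \hat{\sigma}_{11} = s_{11} \xrightarrow{\text{wp1}} \sigma_{11}. \tag{13}$$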

Regarding μ1, μ2, σ1, σ2 and ρ as constants, $\bar{x}_1^*$, $\bar{x}_2$, $s_{11}^*$, $s_{21}^*$ and $s_{22}^*$ in (6) are just linear combinations of $\bar{z}_1^*$, $\bar{z}_2$, $s_{z11}^*$, $s_{z21}^*$ and $s_{z22}^*$, whose probability limits have already been obtained. Combining (6a), (8), (10) and (11) yields

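$$\bar{x}_1^* \xrightarrow{\text{wp1}} \mu_1^* = \mu_1 + \sigma_1 \mu_{z1}^*, \quad s_{11}^* \xrightarrow{\text{wp1}} \sigma_{11}^* = \sigma_{11}\sigma_{z11}^*, \quad \text{and} \quad s_{21}^* \xrightarrow{\text{wp1}} \sigma_2\sigma_1\sigma_{z11}^*\rho. \tag{14}$$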

Combining (6b) and (8) to (12) yields

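$$\bar{x}_2 \xrightarrow{\text{wp1}} \mu_2 + \sigma_2\rho\mu_{z1}^* \quad \text{and} \quad s_{22}^* \xrightarrow{\text{wp1}} \sigma_{22}\left[\rho^2\sigma_{z11}^* + (1 - \rho^2)\right]. \tag{15}$$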

Thus,

$$\hat{\beta}_{21} = \frac{s_{21}^*}{s_{11}^*} \xrightarrow{\text{wp1}} \frac{\sigma_2}{\sigma_1}\rho. \tag{16}$$

It follows from (4b) and (13) to (16) that

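$$\hat{\mu}_2 \xrightarrow{\text{wp1}} \mu_2 + \sigma_2\rho\mu_{z1}^* + \frac{\sigma_2}{\sigma_1}\rho\left[\mu_1 - (\mu_1 + \sigma_1\mu_{z1}^*)\right] = \mu_2.$$

So $\hat{\mu}_2$ is consistent. It follows from (4c) and (13) to (16) that

$$\hat{\sigma}_{22} \xrightarrow{\text{wp1}} \sigma_{22}\left[\rho^2\sigma_{z11}^* + (1 - \rho^2)\right] + \frac{\sigma_{22}\rho^2}{\sigma_{11}}(\sigma_{11} - \sigma_{11}\sigma_{z11}^*) = \sigma_{22}.$$

So $\hat{\sigma}_{22}$ is also consistent. It follows from (4d), (13) and (16) that

$$\hat{\sigma}_{12} \xrightarrow{\text{wp1}} \frac{\sigma_2}{\sigma_1}\rho\,\sigma_{11} = \sigma_{12}.$$

So $\hat{\sigma}_{12}$ is, again, consistent.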
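The cancellations behind the limits of the estimates of μ2 and σ22 can be written out term by term. The lines below only expand the expressions obtained from (4b), (4c) and (13) to (16), using σ11 = σ1² and σ22 = σ2²; no additional assumptions are introduced.

```latex
\begin{aligned}
\hat{\mu}_2 &\xrightarrow{\text{wp1}} \mu_2 + \sigma_2\rho\mu_{z1}^*
   + \frac{\sigma_2}{\sigma_1}\rho\,(-\sigma_1\mu_{z1}^*)
 = \mu_2 + \sigma_2\rho\mu_{z1}^* - \sigma_2\rho\mu_{z1}^* = \mu_2, \\
\hat{\sigma}_{22} &\xrightarrow{\text{wp1}} \sigma_{22}\rho^2\sigma_{z11}^*
   + \sigma_{22}(1-\rho^2) + \sigma_{22}\rho^2(1-\sigma_{z11}^*)
 = \sigma_{22}(1-\rho^2) + \sigma_{22}\rho^2 = \sigma_{22}.
\end{aligned}
```

In both limits the only terms that depend on the selection function, through the moments of the selected z1 values, cancel exactly.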

Notice that the g(·,·) or h(·) in (7) can be any function of the observed data. Thus, under the data model (5), the normal-distribution-based pseudo MLEs are consistent for any MAR process.
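A small simulation can illustrate the consistency result. The sketch below is not the paper's own computation: it assumes a standardized exponential z1, a standardized uniform z2, a logistic MAR mechanism with hypothetical coefficients, and illustrative parameter values, and it compares the pseudo MLE of μ2 from (4b) with the complete-case mean of x2.

```python
import numpy as np

def pseudo_mle(x1, x2, observed):
    """Estimates (4b)-(4d); `observed` flags the cases whose x2 is available."""
    x1c, x2c = x1[observed], x2[observed]
    s21_c = np.mean((x2c - x2c.mean()) * (x1c - x1c.mean()))
    beta21 = s21_c / x1c.var()
    mu2_hat = x2c.mean() + beta21 * (x1.mean() - x1c.mean())
    sigma22_hat = x2c.var() + beta21 ** 2 * (x1.var() - x1c.var())
    sigma12_hat = beta21 * x1.var()
    return mu2_hat, sigma22_hat, sigma12_hat

rng = np.random.default_rng(4)
N = 400_000
mu1, mu2, s1, s2, rho = 1.0, 2.0, 1.5, 0.5, 0.6

# Nonnormal population following model (5).
z1 = rng.exponential(1.0, N) - 1.0
z2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N)
x1 = mu1 + s1 * z1
x2 = mu2 + s2 * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)

# Logistic MAR mechanism: the chance that x2 is missing depends on x1 only.
p_missing = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * x1)))
observed = rng.random(N) >= p_missing

mu2_hat, sigma22_hat, sigma12_hat = pseudo_mle(x1, x2, observed)
complete_case_mean = x2[observed].mean()

print("true values:", mu2, s2 ** 2, s1 * s2 * rho)
print("pseudo MLEs:", mu2_hat, sigma22_hat, sigma12_hat)
print("complete-case mean of x2:", complete_case_mean)
```

Under these assumptions the pseudo MLEs should land close to the population values, whereas the complete-case mean is pulled away from μ2 because cases with large x1 are under-represented among the observed x2.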

3. Consistency in general

Parallel to (5), let

$$x = (x_1, x_2, \ldots, x_q)' = \mu + Az, \tag{17}$$

where μ = (μ1, μ2, …, μq)′,

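$$A = \begin{pmatrix} a_{11} & 0 & 0 & \cdots & 0 \\ a_{21} & a_{22} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{q1} & a_{q2} & a_{q3} & \cdots & a_{qq} \end{pmatrix}$$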

and satisfies Σ = AA′, and z = (z1, z2, …, zq)′ with z1, z2, …, zq being independent and standardized random variables. Then E(x) = μ, Cov(x) = Σ and the distribution of x is determined by those of the zj's. When z ~ Nq(0, I), x ~ Nq(μ,Σ). We do not know the distribution form of x in general even when the distributions of zj's are known. Notice that each element xj of x in either (5) or (17) is a linear combination of independent random components in z, which is a necessary condition for the consistency of the pseudo MLEs $\hat{\mu}$ and $\hat{\Sigma}$.
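One way to obtain a lower triangular A with Σ = AA′ is the Cholesky factor of Σ. The sketch below uses it to generate data from (17) with nonnormal components; the particular μ, Σ, and the three component distributions are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 300_000

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 2.0, 0.4],
                  [0.3, 0.4, 1.5]])

# Lower triangular factor with A @ A.T equal to Sigma.
A = np.linalg.cholesky(Sigma)

# Independent, standardized, nonnormal components (one column per z_j).
z = np.column_stack([
    rng.exponential(1.0, N) - 1.0,                  # skewed
    rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N),    # light tails
    rng.laplace(0.0, 1.0 / np.sqrt(2.0), N),        # heavy tails
])

x = mu + z @ A.T                                    # each row is mu + A z_i

sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False, bias=True)
print(sample_mean)
print(sample_cov)
```

Because A is lower triangular with a positive diagonal, the first L coordinates of x and of z determine each other, which is the property the MAR construction in this section relies on.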

Let x1, x2, …, xN be a random sample drawn from x with xi = μ + Azi, where xi = (xi1, xi2, …, xiq)′ and zi = (zi1, zi2, …, ziq)′. Consider a case xi in which d variables $x_{ij_1}, x_{ij_2}, \ldots, x_{ij_d}$ are missing and the event is related in probability to the c values $z_{il_1}, z_{il_2}, \ldots, z_{il_c}$. Let $J = \min(j_1, j_2, \ldots, j_d)$ and $L = \max(l_1, l_2, \ldots, l_c)$. When L < J and (xi1, xi2, …, xiL) are observed, all the information related to missing values is observed. Let ri = (ri1, ri2, …, riq)′ be the vector with rij = 1 if xij is observed and zero otherwise, $x_{im} = (x_{ij_1}, x_{ij_2}, \ldots, x_{ij_d})'$, and xio be the vector of observed variables. Then

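$$\begin{aligned}
P(r_{ij} = 0 \mid x_{io}; x_{im}) &= P(r_{ij} = 0 \mid \{x_{i1}, x_{i2}, \ldots, x_{iL}, \ldots\}; x_{im}) \\
&= P(r_{ij} = 0 \mid \{z_{i1}, z_{i2}, \ldots, z_{iL}, \ldots\}; x_{im}) \\
&= P(r_{ij} = 0 \mid \{z_{i1}, z_{i2}, \ldots, z_{iL}, \ldots\}) \\
&= P(r_{ij} = 0 \mid x_{io}) = g_j(x_{io}, \gamma_j) = h_j(\mathbf{z}_{iL}).
\end{aligned} \tag{18}$$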

Thus, the probability of missingness only depends on the observed values and the missing data mechanism is MAR (Rubin, 1976). Notice that $\mathbf{x}_{iL} = (x_{i1}, x_{i2}, \ldots, x_{iL})'$ is a subset of xio, and thus, the probability function gj(xio, γj) in (18) can also be written as $g_j(\mathbf{x}_{iL}, \gamma_j)$. Also notice that, due to a random process, the number d and the subscripts $j_1, j_2, \ldots, j_d$ may change from case to case while $l_1, l_2, \ldots, l_c$ are held constant across the sample.

Unlike the problems considered in the previous section, the MLEs do not possess analytical forms when the observed data patterns are not monotonic. So we cannot directly show that the MLEs are consistent as was done in the previous section. We cannot use the established theory of maximum likelihood as in Rubin (1976) either, because the MLEs are obtained based on an incorrect likelihood function. By showing that the normal estimating equation is unbiased at the true population values, Yuan (2009) proved that the normal-distribution-based MLEs are consistent when the missingness of xim is due to $(z_{il_1}, z_{il_2}, \ldots, z_{il_c})$ falling into certain hyper-rectangles. When (17) and (18) hold and (xi1, xi2, …, xiL) are observed, using essentially the same approach as in Yuan (2009) we can show that the normal-distribution-based MLEs are consistent regardless of the distribution shape of x and the form of gj(xio, γj) in (18). The MLEs are also asymptotically normally distributed and the covariance matrix of the MLEs can be consistently estimated by a sandwich-type covariance matrix.

4. Conclusions

It has been argued that, in any statistical modeling, the distribution specification is at best only an approximation to the real world (Box, 1979). Thus, all MLEs in practice are pseudo MLEs. In the context of missing data, it is reassuring that pseudo MLEs can remain consistent when (17) and (18) hold. Note that, although data model (17) includes an infinite number of nonnormal distributions, it does not include nonnormal distributions created by nonlinear functions of the independent random variables z1, z2, …, zq. Yuan (2009) describes an example in which the MLEs are not consistent when x = (x1, x2)′ with x1 = z1 and $x_2 = a z_1^2 + b z_2$, where z1 and z2 are independent, and x2 is missing when z1 falls into a certain interval. It can be shown that the MLEs also fail to be consistent for this data model when the MAR mechanism obeys either a logit or a probit selection process. We also note that pseudo MLEs based on an incorrect distributional specification other than Nq(μ,Σ) may not be consistent when missing values are MAR. Indeed, even without missing data, Gourieroux et al. (1984) showed that pseudo MLEs are consistent only when the assumed distribution belongs to a quadratic exponential family.
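The inconsistency under a nonlinear data model can be illustrated with a simulation in the spirit of the example attributed to Yuan (2009) above. The specifics are assumptions made for this sketch: a = b = 1, standard normal z1 and z2, and x2 missing whenever z1 falls in the interval (0, ∞).

```python
import numpy as np

rng = np.random.default_rng(6)
N = 400_000
a, b = 1.0, 1.0

z1 = rng.standard_normal(N)
z2 = rng.standard_normal(N)
x1 = z1
x2 = a * z1 ** 2 + b * z2           # nonlinear in z1, so outside model (17)
true_mu2 = a                        # E(x2) = a * E(z1^2) = a

observed = z1 <= 0.0                # x2 is missing when z1 > 0 (an MAR mechanism)

# Pseudo MLE of mu2 from (4b).
x1c, x2c = x1[observed], x2[observed]
beta21 = np.mean((x2c - x2c.mean()) * (x1c - x1c.mean())) / x1c.var()
mu2_hat = x2c.mean() + beta21 * (x1.mean() - x1c.mean())

print(true_mu2, mu2_hat)
```

Here the regression of x2 on x1 is quadratic, so the linear adjustment in (4b) extrapolates in the wrong direction and the estimate stays far from μ2 however large N is.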

For the purpose of allowing missingness to depend on all the linear combinations of the previously observed variables, we specified A as a lower triangular matrix in (17) so that (z1, z2, …, zL) and (x1, x2, …, xL) are determined by each other. In practice, a participant may join the study after missing a few times and then be missing again. The missingness at the later stage may depend on all the previously observed variables. We can match such a case with (17) by specifying that the rows of A that correspond to the observed variables form the upper-left part of a lower triangular matrix; the consistency result then still holds.

As with a general MAR missing data mechanism, the MAR condition in (18) cannot be tested. Without extra information beyond the observed sample, it is impossible to distinguish between MAR and NMAR mechanisms (Molenberghs et al., 2008). Similarly, the data model (17) cannot be tested either because the distribution of z is arbitrary.

Acknowledgment

We would like to thank the editor, an associate editor, and a referee for comments that led to a significant improvement of the paper.

This research was supported by Grants DA00017 and DA01070 from the National Institute on Drug Abuse and a grant from the National Natural Science Foundation of China (30870784).

References

  • Allison PD. Missing data. Sage; Thousand Oaks, CA: 2001.
  • Amemiya T. Regression analysis when the dependent variable is truncated normal. Econometrica. 1973;41:997–1016.
  • Anderson TW. Maximum likelihood estimates for the multivariate normal distribution when some observations are missing. Journal of the American Statistical Association. 1957;52:200–203.
  • Box GEP. Robustness in the strategy of scientific model building. In: Launer RL, Wilkinson GN, editors. Robustness in statistics. Academic Press; New York: 1979. pp. 201–236.
  • Daniels MJ, Hogan JW. Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. Chapman & Hall; Boca Raton, Florida: 2008.
  • Geary RC. Testing for normality. Biometrika. 1947;34:209–242.
  • Gourieroux C, Monfort A, Trognon A. Pseudo maximum likelihood methods: Theory. Econometrica. 1984;52:681–700.
  • Heckman JJ. Sample selection bias as a specification error. Econometrica. 1979;47:153–161.
  • Laird NM. Missing data in longitudinal studies. Statistics in Medicine. 1988;7:305–315.
  • Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
  • Little RJA. Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association. 1993;88:125–134.
  • Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Wiley; New York: 2002.
  • Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105:156–166.
  • Molenberghs G, Beunckens C, Sotto C, Kenward MG. Every missing not at random model has got a missing at random counterpart with equal fit. Journal of the Royal Statistical Society B. 2008;70:371–388.
  • Molenberghs G, Kenward MG. Missing data in clinical studies. Wiley; Chichester, England: 2007.
  • Rotnitzky A, Wypij D. A note on the bias of estimators with missing data. Biometrics. 1994;50:1163–1170.
  • Rubin DB. Inference and missing data (with discussions). Biometrika. 1976;63:581–592.
  • Schafer JL. Analysis of incomplete multivariate data. Chapman & Hall; London: 1997.
  • White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.
  • Yuan K-H. Normal distribution based pseudo ML for missing data: With applications to mean and covariance structure analysis. Journal of Multivariate Analysis. 2009;100:1900–1918.
  • Yuan K-H, Lu L. SEM with missing data and unknown population distributions using two-stage ML: Theory and its application. Multivariate Behavioral Research. 2008;62:621–652.