HHS Public Access. Author manuscript; accepted for publication in a peer-reviewed journal.
Biom J. Author manuscript; available in PMC 2017 December 27.
Published in final edited form as:
Published online 2015 September 13. doi:  10.1002/bimj.201400124
PMCID: PMC5745043

Causal mediation analysis with a latent mediator


Health researchers are often interested in assessing the direct effect of a treatment or exposure on an outcome variable, as well as its indirect (or mediation) effect through an intermediate variable (or mediator). For an outcome following a nonlinear model, the mediation formula may be used to estimate causally interpretable mediation effects. This method, like others, assumes that the mediator is observed. However, as is common in structural equations modeling, we may wish to consider a latent (unobserved) mediator. We follow a potential outcomes framework and assume a generalized structural equations model (GSEM). We provide maximum-likelihood estimation of GSEM parameters using an approximate Monte Carlo EM algorithm, coupled with a mediation formula approach to estimate natural direct and indirect effects. The method relies on an untestable sequential ignorability assumption; we assess robustness to this assumption by adapting a recently proposed method for sensitivity analysis. Simulation studies show good properties of the proposed estimators in plausible scenarios. Our method is applied to a study of the effect of mother education on occurrence of adolescent dental caries, in which we examine possible mediation through latent oral health behavior.

Keywords: Factor analysis, Measurement error, Mediation formula, Monte Carlo EM algorithm, Structural equations model

1 Introduction

Mediation analysis seeks to determine the extent to which the effect of a treatment or exposure on an outcome is mediated by an intermediate variable. The mediation effect is also known as the indirect effect, while the portion of the treatment effect that does not go through the intermediate variable is referred to as the direct effect. Mediation is often represented by a causal diagram, such as that in Fig. 1a, which displays direct and indirect paths through which an exposure affects an outcome. The inferential goal is to estimate the direct and indirect (or mediation) effects comprising the overall effect of exposure on the final outcome.

Figure 1
(a) Standard mediation model: X, exposure; M, observed mediator; Y, final response. (b) Latent mediator model: X, exposure; U, unobserved mediator; Z1, …, ZM, observed intermediate variables; Y, final response.

Classically, mediation analysis has been conducted via an ad hoc approach involving the fit of multiple regression models (Baron and Kenny, 1986). Recently, a causal model approach to mediation analysis has been developed based on a potential outcomes framework (Robins and Greenland, 1992; Albert, 2008; Imai et al., 2010). To describe this framework, which will be integral to the method developed in this paper, we provide some notation and definitions.

We let Yi(x) denote the potential outcome for (response variable) Y if causal factor X, say, an exposure variable, were set to level x for individual i. (We will sometimes drop the subscript i when not needed for clarity.) The central problem of causal inference is that Y for any individual is observed (at a given time) for only one level of the exposure. Potential outcomes for exposure levels not actually observed are referred to as counterfactuals. To define mediation effects we use nested potential outcomes. For example, Y(x, M(x′)) denotes the potential outcome for Y if exposure, X, were set to x, but the mediator, M, were set to its potential outcome if X were set to x′. We let E{Y(x, M(x′))} denote the average of this potential outcome over the population. Then, for a binary exposure and x = 0, 1, we define the direct effect as

$$D(x) = E\{Y(1, M(x))\} - E\{Y(0, M(x))\} \tag{1}$$

and the indirect effect as

$$I(x) = E\{Y(x, M(1))\} - E\{Y(x, M(0))\}. \tag{2}$$

Note that D(x) and I(x) represent “natural” direct and indirect effects whereby the mediator, M, is set to the value (potential outcome) that would “naturally” be observed under the specified exposure level. The above definition is in contrast to that of the “controlled” direct effect whereby M is fixed at a particular common value. The total effect T = E{Y (1, M(1))} − E{Y (0, M(0))} can be written as T = D(0) + I(1) = D(1) + I(0), demonstrating that the natural direct and indirect effects represent an exact decomposition of the total exposure effect. The estimands D(0) and I(0) have been referred to as the pure natural (direct and indirect) effects, with D(1) and I(1) referred to as the total natural (direct and indirect) effects (Robins and Greenland, 1992).
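
The decomposition can be checked numerically. The following is a minimal sketch in Python; the four values of E{Y(x, M(x′))} are hypothetical numbers chosen only for illustration, and the function names are not from the paper.

```python
# Hypothetical values of E{Y(x, M(x'))}, keyed by (x, x'); illustrative only.
nested_mean = {
    (0, 0): 0.50,
    (0, 1): 0.38,
    (1, 0): 0.61,
    (1, 1): 0.47,
}

def direct_effect(x):
    """Natural direct effect D(x) = E{Y(1, M(x))} - E{Y(0, M(x))}."""
    return nested_mean[(1, x)] - nested_mean[(0, x)]

def indirect_effect(x):
    """Natural indirect effect I(x) = E{Y(x, M(1))} - E{Y(x, M(0))}."""
    return nested_mean[(x, 1)] - nested_mean[(x, 0)]

# Total effect T = E{Y(1, M(1))} - E{Y(0, M(0))}
total = nested_mean[(1, 1)] - nested_mean[(0, 0)]

# Both decompositions recover T (up to floating-point rounding)
decomp_a = direct_effect(0) + indirect_effect(1)  # pure direct + total indirect
decomp_b = direct_effect(1) + indirect_effect(0)  # total direct + pure indirect
print(total, decomp_a, decomp_b)
```

The two decompositions generally assign different shares to the direct and indirect components; they agree only in their sum.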

Approaches to inference based on the mediation formula have been offered under certain restrictive assumptions (described further below). Among these is the assumption that the mediator is measured without error. Recently, the problem of mediator measurement error in mediation analysis has received attention (le Cessie et al., 2012; VanderWeele et al., 2012). This recent research has emphasized that standard approaches for estimation of direct and indirect effects can be biased in the presence of mediator measurement error. Corrections for measurement error are possible, but require specification of nonidentifiable parameters for the association between the observed and unobserved (true) mediator values (VanderWeele et al., 2012).

The measurement error situation can be characterized as one involving a true but unobserved (or latent) mediator measured by a single observed intermediate variable. The situation where multiple measurements are available seems to have received little attention in the causal mediation literature. However, this situation is of interest because it may allow the estimation of the mediation effect through an underlying true value while avoiding the need for a priori or supplementary information about the (otherwise nonidentifiable) association parameters.

Other researchers have addressed the situation of multiple intermediate variables conceived as separate “true” mediators. This situation was addressed in the linear case by Preacher and Hayes (2008), and in the nonlinear (logistic regression model) case by Wang et al. (2013). The latter method allowed for mixed types of (i.e., both continuous and dichotomous) mediators, and used the mediation formula (Albert and Nelson, 2011; Pearl, 2012) to compute natural direct and indirect effects. The joint distribution of the mediators, required in the mediation formula, was modeled by Wang et al. (2013) using a Gaussian copula approach (Song et al., 2009). As this method requires multivariate integration (over the joint distribution of the mediators), it can be rather computationally demanding and may be prohibitive for situations of high dimensionality (i.e., a large number of mediators). Another limitation is that nonidentifiability of specific path effects through individual mediators occurs when mediators are correlated, requiring specification of nonidentifiable counterfactual correlations. Further, these approaches do not account for measurement error in the intermediate variables. In the applicable situation in which the multiple intermediate variables may be considered to measure a single underlying (but unobserved) variable, the use of a latent mediator would appear to offer advantages of dimensionality reduction and improved validity by accounting for measurement error.

Latent variables models have long been used in psychology and sociology and increasingly in biomedical science (Rabe-Hesketh and Skrondal, 2008). The basic model (also referred to as a confirmatory factor analysis or measurement model) posits an unobserved (or latent) variable, also called a factor, that gives rise to a set of observed (or manifest) variables considered to measure the latent variable. More complex models, including structural equations models (SEMs; Bollen, 1989), extend the above measurement model by specifying directed (structural) links between the latent variable and other (latent or observed) variables. We note that a number of general software packages, including SAS and R, and more specialized packages, such as MPlus, can be used to fit SEMs, even handling such features as nonlinear link functions (as in logistic regression) and multilevel data. A reviewer has drawn our attention to recent updates to MPlus, and accompanying papers (Muthén, 2011; Muthén and Asparouhov, 2015), which provide similar analyses to those of the present paper. However, the present paper provides an explicit development of causal inference in the context of the latent mediator model, while also addressing such accompanying issues as identifiability, choice of reference group, and sensitivity analysis. Finally, we note that Wang and Albert (2012) considered a discrete latent mediator in the context of causal mediation analysis for zero-inflated outcomes.

In the present paper, we develop a method of inference for causal mediation effects when multiple intermediate variables are available and may be considered to measure a single latent mediator. To handle mixed (i.e., continuous and dichotomous) variable types, we model outcomes (i.e., the latent mediator, and the observed intermediate and final response variables) using generalized linear models, thus yielding a generalized SEM (GSEM). We use a Monte Carlo EM algorithm (or an approximation via Gauss–Hermite quadrature) to obtain estimates of GSEM model parameters. This approach to fitting the mediation model (which is similar to previous approaches for fitting latent variable models, e.g., Sammel et al., 1997) is coupled to the mediation formula and thus conducted in a causal model (potential outcomes) framework. Bootstrap resampling is used to obtain confidence intervals for mediation effects.

In the next section, we describe the method; in particular, we confirm identifiability of our causal mediation estimands under a specified latent mediator model, and provide expressions for mediation estimators. In Section 3, we illustrate the new method using data from an observational dental study. In this application we assess the direct effect of mother education on presence of dental caries in their adolescent children, and the indirect effect through a latent variable interpreted as mother’s oral health behavior. In addition, we conduct a sensitivity analysis for the latent mediator model context, using a recently developed method (Albert and Wang, 2015) that allows for mixed types of responses. Section 4 presents results from a simulation study of finite sample size properties of the mediation estimators. Our simulation results show that the proposed approach greatly reduces bias in the presence of substantial measurement error compared to a two-step approach that uses a summary measure of the multiple intermediate variables as a presumed observed mediator. Limitations and other concluding remarks are given in Section 5.

2 Latent mediator model and inference

2.1 Model and identifiability

The latent mediator model is represented graphically in Fig. 1b. In this model, X represents an exposure variable, Y an observed final outcome, and U represents a latent (unobserved) mediator. The mediator, U, is measured by multiple observed intermediate variables (Z1, …, ZM); the number (M) of such observed intermediate variables is arbitrary, but as noted below there should be at least two Zs for general identifiability. We also allow for preexposure variables (L) representing potential confounders, but which are not included in Fig. 1b.

Identifiability of natural direct and indirect effects under sequential ignorability was demonstrated by Imai et al. (2010) in the case of an observed mediator. However, the derivation of Imai et al. (2010) also goes through when the mediator is unobserved. The assumptions required, analogous to the sequential ignorability assumption of Imai et al., can be expressed as follows:

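Assumption 1: Sequential ignorability (for latent mediator U)

$$\{Y_i(x', u),\, U_i(x)\} \perp X_i \mid L_i = l \tag{3a}$$
$$Y_i(x', u) \perp U_i(x) \mid X_i = x,\, L_i = l \tag{3b}$$

for x, x′ = 0, 1 and all values u and l.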
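Under Assumption 1 (along with a standard consistency assumption), it follows from the proof of Imai et al. (2010) that

$$E\{Y(x, U(x'))\} = \int\!\!\int E(Y \mid U = u, X = x, L = l)\, f(u \mid X = x', L = l)\, du\, dF_L(l). \tag{4}$$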
Identifiability of direct and indirect effects for a latent mediator (expressions (1) and (2) with U in place of M) is thus obtained so long as the association model terms on the right side of Eq. (4) (namely, E(Y|U = u, X = x, L = l), as well as the conditional distribution of U, denoted as f (U|X = x, L = l)), can be estimated. Equation (4) along with Eqs. (1) and (2) yields what has been referred to (e.g., Pearl, 2012) as the mediation formula (although we will tend to refer to Eq. (4) itself as such).
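
To make the mediation formula concrete, the sketch below evaluates E{Y(x, U(x′))} numerically for one special case: a logistic model for a binary Y and a unit-variance normal model for U, with a single scalar covariate L. The coefficient layout of `beta` and `gamma`, the parameter values, and the function names are assumptions made for illustration. The inner integral over u is approximated by Gauss–Hermite quadrature, and the outer integral over L is replaced by an average over supplied covariate values.

```python
import numpy as np

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-t))

def nested_mean(x, x_prime, l, beta, gamma, n_nodes=30):
    """Approximate E{Y(x, U(x'))}, averaging over the covariate values in l.

    Assumed (hypothetical) model layout:
      logit P(Y=1 | X, U, L) = beta[0] + beta[1]*X + beta[2]*U + beta[3]*L
      U | X, L ~ N(gamma[0] + gamma[1]*X + gamma[2]*L, 1)
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    l = np.asarray(l, dtype=float)
    # Mean of U under exposure level x'
    mu = gamma[0] + gamma[1] * x_prime + gamma[2] * l
    # Change of variable u = mu + sqrt(2) * node maps N(mu, 1) to the
    # Gauss-Hermite weight function exp(-t^2)
    u = mu[:, None] + np.sqrt(2.0) * nodes[None, :]
    p = expit(beta[0] + beta[1] * x + beta[2] * u + beta[3] * l[:, None])
    inner = (p * weights[None, :]).sum(axis=1) / np.sqrt(np.pi)
    return inner.mean()

# Hypothetical parameter values and covariate sample
beta = (-0.5, 0.4, -0.7, 0.2)
gamma = (0.0, 0.8, 0.3)
l_values = np.array([-1.2, -0.3, 0.0, 0.4, 1.1])

d1 = nested_mean(1, 1, l_values, beta, gamma) - nested_mean(0, 1, l_values, beta, gamma)
i0 = nested_mean(0, 1, l_values, beta, gamma) - nested_mean(0, 0, l_values, beta, gamma)
print("D(1) =", d1, " I(0) =", i0)
```

When the coefficient of U in the Y model is zero, the inner integral reduces to the ordinary logistic prediction, which gives a simple check on the quadrature.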

An analogous problem was considered by Kuroki and Pearl (2011), but with an unobserved confounder (rather than mediator). Considering the linear model case, they showed that the effect of some X on Y (both observed) could be identified, given at least two conditional independent proxies (measurements) of an unobserved confounder. This result also obtains, again in the linear case, for an unobserved mediator. Kuroki (2007) also provides identifiability results, but where the unobserved confounder or mediator is discrete.

For a continuous unobserved mediator, general (nonparametric) identification results do not appear to be available. Note that identifiability (as seen in the proof by Imai et al., 2010) requires that the conditional expected value of Y given the mediator (along with the treatment indicator and included covariates) is estimable, which would not be the case in general when the mediator is not observed. However, identifiability may still be attained under standard (e.g., generalized linear) models given an adequate number of intermediate measurements (two, in most cases).

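To start, we will assume the following generalized structural equations (association) model (GSEM):

$$h_1\{E(Y \mid X, U, L)\} = \beta_0 + \beta_1 X + \beta_2 U + \beta_3 L \tag{5a}$$
$$h_2\{E(U \mid X, L)\} = \gamma_0 + \gamma_1 X + \gamma_2 L \tag{5b}$$
$$h_{3m}\{E(Z_m \mid U)\} = \beta_{0m} + \beta_{1m} U \tag{5c}$$
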
where h1, h2, and h3m (m = 1, …, M) are invertible link functions, and the βs and γs are unknown regression parameters; the coefficients of L (as well as L itself) may be vectors. In our approach to estimation, which is based on maximum likelihood (ML), we need to further specify conditional distributions for each response variable (Y, U, Z1, …, ZM). In the present paper, we will assume response variables to follow exponential family distributions. As is commonly done in the literature on latent variable models, we will assume a normal distribution for the latent mediator, U, and a common conditional variance, V (U|X, L), equal to 1.

This model may be extended in a number of ways that can be accommodated by the method described below. These extensions include the following: (i) an exposure-by-mediator (X by U) interaction (by inclusion of an additional term, β4(X · U), in the Y model); (ii) effects of exposure on measured intermediate variables (by inclusion of X as a covariate in one or more of the Zm models in Eq. (5c)); and (iii) alternative (i.e., nonnormal) distributions for U. Identifiability in any such extension will depend on identifiability of the corresponding GSEM model parameters.

2.2 GSEM estimation

Our approach to estimation closely follows Sammel et al. (1997), though with some extensions and modifications. Although Sammel et al. did not consider the mediation problem per se, but rather the problem of analyzing multiple outcomes, the situation they addressed is related to the present problem. Specifically, Sammel et al. considered a situation in which a set of response variables (of possibly mixed types) are predicted by a latent variable and possibly an additional, common, set of fixed covariates. This setup is similar to the present mediation problem (as represented in Fig. 1b) since the observed intermediate variables (Z1, …, ZM) and the final response (Y) can all be viewed as outcomes predicted by the latent variable. Our setup differs in allowing covariates that predict only some of the outcomes (e.g., X predicting Y but not the Zs). In this way, we distinguish between “measurement” variables that are affected by the latent mediator alone, and a final response variable that may be affected by an exposure variable as well as the latent mediator. Similarly to Sammel et al., we conduct ML estimation, focusing on a Gauss–Hermite quadrature approximation to the Monte Carlo EM algorithm. The reader is referred to Sammel et al. (1997) for theory underlying the algorithm.
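
As a rough illustration of where Gauss–Hermite quadrature enters, the sketch below evaluates one individual's marginal log-likelihood contribution with the latent mediator integrated out. It is not the EM algorithm of Sammel et al. (1997); it only shows the quadrature step for a model with a binary Y, one continuous measurement, and any number of binary measurements. The `par` dictionary, its keys, and the function names are hypothetical.

```python
import numpy as np

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-t))

def marginal_loglik(y, z_cont, z_bin, x, l, par, n_nodes=20):
    """Log-likelihood contribution of one individual, integrating out U.

    par is a hypothetical dict with keys:
      'beta'  : (b0, b1, b2, b3), logit P(Y=1) = b0 + b1*x + b2*u + b3*l
      'gamma' : (g0, g1, g2),     U ~ N(g0 + g1*x + g2*l, 1)
      'cont'  : (a0, a1, sigma),  Z_cont ~ N(a0 + a1*u, sigma^2)
      'bin'   : list of (c0, c1), logit P(Z_bin[m]=1) = c0 + c1*u
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    g0, g1, g2 = par["gamma"]
    # Quadrature points for U | X = x, L = l ~ N(mean, 1)
    u = g0 + g1 * x + g2 * l + np.sqrt(2.0) * nodes

    # Conditional likelihood of the final response given U = u
    b0, b1, b2, b3 = par["beta"]
    p_y = expit(b0 + b1 * x + b2 * u + b3 * l)
    lik = p_y**y * (1.0 - p_y) ** (1 - y)

    # Continuous measurement (normal density)
    a0, a1, sigma = par["cont"]
    resid = (z_cont - a0 - a1 * u) / sigma
    lik = lik * np.exp(-0.5 * resid**2) / (sigma * np.sqrt(2.0 * np.pi))

    # Binary measurements (Bernoulli probabilities)
    for z, (c0, c1) in zip(z_bin, par["bin"]):
        p_z = expit(c0 + c1 * u)
        lik = lik * p_z**z * (1.0 - p_z) ** (1 - z)

    # Weighted sum over nodes approximates the integral over U
    return np.log(np.sum(weights * lik) / np.sqrt(np.pi))
```

Summing this quantity over individuals gives an approximate log-likelihood that a generic optimizer could maximize; an EM implementation would instead use the same nodes and weights to approximate the E-step expectations.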

2.3 Mediation effect estimators

From the general mediation formula (4), substituting the ML estimates for the parameters in the GSEM (5), we have the expression for the estimator of the expected nested potential outcome as

$$\hat{E}\{Y(x, U(x'))\} = \frac{1}{N_R} \sum_{i \in \Pi_R} \int \hat{E}(Y \mid U = u, X = x, L = l_i)\, \hat{f}(u \mid X = x', L = l_i)\, du, \tag{6}$$

where hats denote evaluation at the ML estimates and NR is the size of the reference group (denoted as ΠR) for which inference is made. Note that, as is commonly done (Imai et al., 2010), in place of the outer integral in Eq. (4), we sum over the empirical distribution of L in the chosen reference group (thus, sum over the i’s for individuals in that group). An extension of Eq. (6) allowing the use of sampling weights is provided in Appendix A in the Supporting Information.

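Finally, we use the expression for the estimated expected potential outcome (6) to obtain the mediation effect estimates of interest. For example, for the total natural direct and the pure natural indirect effects, D(1) and I(0), we obtain estimators as

$$\hat{D}(1) = \hat{E}\{Y(1, U(1))\} - \hat{E}\{Y(0, U(1))\} \tag{7}$$
$$\hat{I}(0) = \hat{E}\{Y(0, U(1))\} - \hat{E}\{Y(0, U(0))\}. \tag{8}$$
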
To get standard errors and confidence intervals for the mediation effects, we use a bootstrap resampling approach. In our data example and simulations, we obtained 95% confidence intervals using the bootstrap percentile method (DiCiccio and Efron, 1996) based on 500 bootstrap samples.
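
The percentile bootstrap can be sketched generically as follows. The function name, its arguments, and the toy estimator in the usage lines are illustrative assumptions; in the setting of this paper, `estimator` would refit the GSEM on each resample and return the mediation effect estimate.

```python
import numpy as np

def percentile_ci(data, estimator, n_boot=500, level=0.95, seed=0):
    """Bootstrap percentile confidence interval for a scalar estimator.

    data      : 2-D array with one row per individual
    estimator : function mapping a resampled data array to a scalar
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample individuals (rows) with replacement
        idx = rng.integers(0, n, size=n)
        stats[b] = estimator(data[idx])
    alpha = 1.0 - level
    lower, upper = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lower, upper

# Toy usage: interval for the mean of the second column of simulated data
rng = np.random.default_rng(1)
toy = np.column_stack([rng.binomial(1, 0.5, size=100), rng.normal(size=100)])
print(percentile_ci(toy, lambda d: d[:, 1].mean()))
```

Resampling whole rows keeps each individual's exposure, covariates, measurements, and outcome together.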

3 Data example

We present an illustrative example using data from a dental caries study (Nelson et al., 2010). The study examined a cohort of very low birth weight (VLBW, with and without bronchopulmonary dysplasia, BPD) and a matched group of normal birth weight (NBW) children who were recruited at birth and followed through adolescence in an earlier study (Singer et al., 1997). This earlier study assessed a number of parental variables at birth and over time, including demographic variables such as mother’s level of education. The dental study involved a dental clinical exam at around age 14 that assessed, among other variables, the number of decayed, missing, and filled teeth (DMFT) and the oral hygiene index (OHI) score, an indicator of effective oral hygiene behavior.

A follow-up study (Nelson et al., 2012) focusing on parental factors found that adolescents whose mother had less than a high school education when the child was age 3 had a higher mean DMFT than comparable adolescents whose mother had at least a high school education. A further question of interest involves the mechanisms through which mother’s education affects the occurrence of dental caries in her children. One hypothesis is that the mother’s oral health behavior with regard to her child is an important mediator of this relationship. Oral health behavior is measured by such variables as the use of dental sealants, supplemental fluoride treatment, oral hygiene, and regularity of preventative dentist visits. These variables may be thought of as imperfect measures of an underlying latent mediator variable representing a continuum of favorable versus nonfavorable oral health behavior.

We examined this question using data from the study reported by Nelson et al. (2010). We defined a binary exposure variable, “MomEd” (X = 1 for mother’s education above high school level, 0 otherwise), and a binary final outcome, “DMFTD” (Y = 1 if the child had DMFT >0 at the age 14 exam; 0, otherwise). We considered a continuous latent mediator (U, labeled as “Behav”) measured by four “intermediate” variables (denoted as Z1, …, Z4). The first measurement, Z1, the “OHI”, is a roughly continuous variable, ranging from 0 to 3 (higher worse), that indicates the amount of debris on teeth, a surrogate for effectiveness of the child’s brushing behavior. The last three measurements are binary variables: Z2, use of any preventative fluoride treatment; Z3, use of dental sealants; and Z4, regularity of dentist visits (twice a year vs. less). Regularity of dental visits was assessed from a parental questionnaire regarding their child, while dental sealants and the OHI were determined from the child’s clinical exam. We used 173 complete cases (86 “low” and 87 “high” MomEd) in the present analysis.

We assumed that the above model variables are causally related as indicated in the graph in Fig. 1b. We further assumed sequential ignorability (3a) and (3b) once we control for the following set of potential confounders (not shown in Fig. 1b): race (African American vs. other) of child, sex, and birth status (two indicator variables for the three categories: VLBW without BPD, VLBW with BPD, and NBW).

In this data example, we focused on the total natural direct effect, D(1), and the pure natural indirect effect, I(0), noting that the complementary estimands, D(0) and I(1) provide an alternative decomposition of the total exposure effect. We chose the former pair, as they involve the potential outcome, Y (0, U(1)), representing a person with “low education” but where oral health behavior is at the level the person would have if received “high education.” The implied intervention in this case, for example, employing an education program targeting oral health behavior, is readily conceivable. In contrast, the potential outcome Y (1, U(0)), involved in D(0) and I(1), implies an intervention on a “high-education” person producing an oral health behavior level as if the person had “low education,” which is more difficult to imagine.

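We estimated D(1) and I(0) using Eqs. (7) and (8) with parameter estimates obtained under a GSEM (5), and the VLBW (with and without BPD) subsample designated as the reference group (in accordance with the study design). Specifically, we assumed logistic regression models for the binary outcome variables and a linear regression model for the continuous outcome variables as follows:

$$\operatorname{logit} P(Y = 1 \mid X, U, L) = \beta_0 + \beta_1 X + \beta_2 U + \beta_3 L \tag{9a}$$
$$U = \gamma_0 + \gamma_1 X + \gamma_2 L + \varepsilon_U \tag{9b}$$
$$Z_{m_C} = \beta_{0 m_C} + \beta_{1 m_C} U + \varepsilon_{m_C} \tag{9c}$$
$$\operatorname{logit} P(Z_{m_D} = 1 \mid U) = \beta_{0 m_D} + \beta_{1 m_D} U \tag{9d}$$
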
where εU ~ N(0, 1), εmC ~ N(0, σ²mC), all outcome variables are conditionally independent, and mC indexes continuous, and mD dichotomous, intermediate variables (mC = 1, …, MC; mD = MC + 1, …, M; M ≡ MC + MD).

Using the method of Section 2, based on model (9), the estimated natural direct effect (and bootstrap 95% confidence interval) was D̂(1) = 0.12 (−0.23, 0.35), indicating that the probability of DMFT > 0 would increase by an estimated 0.12 if MomEd (X) were changed from “low” to “high” with the level of Behav (U) fixed as if the mother had “high” education. The estimated natural indirect effect was Î(0) = −0.28 (−0.52, 0.076), indicating that the probability of DMFT > 0 would decrease by an estimated 0.28 if MomEd were shifted from “low” to “high” in a way that naturally affected Behav but without affecting mediators on any other pathway between mother’s education and DMFT. We note that the estimated coefficients (or factor loadings) for the regression of each of the four intermediate variables on Behav (Table 1) suggest the interpretation of this latent variable as a positive indicator of oral health behavior (with larger values indicating more favorable oral health behavior). As the above confidence intervals contain 0, neither the estimated direct nor the estimated indirect effect is statistically significantly different from 0. Although we chose to present estimates of D(1) and I(0), the estimates for the corresponding alternative estimands, D(0) and I(1), are nearly the same. This is not surprising, as the current model does not include an exposure-by-mediator interaction term.

Table 1
Parameter estimates from ML fit of Model (9) (without and with MomEd × Behav interaction in Y model) to dental data.

For comparison, we carried out a two-step approach in which a factor analysis of the four intermediate variables is conducted in the first step, and then the resulting factor score is used as an observed mediator (in the standard mediation formula approach) in the second step. The estimates for the natural direct and indirect effects (with 95% bootstrap confidence intervals) based on the two-step approach were −0.12 (−0.30, 0.026) and −0.037 (−0.089, 0.020). As with the ML approach, these effects were not found to be statistically significant (at the 0.05 α-level). However, the estimated indirect effect (presumably due to the nonoptimal construction of the factor) for the two-step, relative to the ML, approach is considerably reduced leaving more of the exposure effect as a direct effect.

As a second approach, we considered the same estimands but under an extended version of association model (9a) that includes an X by U interaction term. The estimated regression parameters for both the interaction and the previous additive model are given in Table 1. The resulting estimates of D(1) and I(0) from the interaction model are nearly the same as those from the additive model, indicating that our primary results are not changed substantially by inclusion of a MomEd-by-Behav (X by U) interaction effect in the Y model.

We note that, although our discussion focused on mediation effects defined on a risk difference scale for a binary outcome, estimands on other scales are readily defined (and estimated) from the expected potential outcomes in Eq. (4) (or its extensions such as Eq. (6)). For example, for the above data using the additive model (9), the natural direct and indirect effects (with 95% bootstrap confidence intervals) on an odds ratio scale are 1.69 (0.48, 5.8) and 0.31 (0.084, 1.4).
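
One way such odds-ratio-scale effects could be formed from the expected potential outcomes is sketched below. The specific definitions used here (ratios of odds of the nested means, paralleling D(1) and I(0)) are an assumption for illustration, as the text does not spell them out, and the input probabilities are hypothetical.

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

def effects_on_or_scale(e11, e01, e00):
    """Odds-ratio analogues of D(1), I(0), and the total effect.

    e11, e01, e00 are (hypothetical) values of E{Y(1, U(1))},
    E{Y(0, U(1))}, and E{Y(0, U(0))} for a binary outcome.
    """
    direct_or = odds(e11) / odds(e01)    # exposure changes, mediator as under x = 1
    indirect_or = odds(e01) / odds(e00)  # mediator changes, exposure held at x = 0
    total_or = odds(e11) / odds(e00)
    return direct_or, indirect_or, total_or

direct_or, indirect_or, total_or = effects_on_or_scale(0.5, 0.4, 0.6)
print(direct_or, indirect_or, total_or)
```

Under these definitions the decomposition is multiplicative: the direct and indirect odds ratios multiply to the total odds ratio, whereas on the risk difference scale the effects add.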

To examine the impact of violations of sequential ignorability, we conducted a sensitivity analysis using the hybrid model approach of Albert and Wang (2015). Details for the present application are given in Appendix B of the Supporting Information. The results of this analysis are presented in Fig. 2. As seen in the figure, the estimated (total) natural direct effect varied from −0.58 to 0.41, and the estimated (pure) natural indirect effect from −0.56 to 0.43, over sensitivity parameter, ϕ, values ranging from −8 to 8. The sensitivity parameter, ϕ, is interpreted as the proportion of the (estimable) MomEd–DMFTD association parameter, β1, that is due to selection bias (manifested as a mean difference between groups with high- and low-education mothers, apart from the effect of education, at a given Behav level), as opposed to a true causal effect of mother’s education level. In the present model, β1 is interpreted as the log DMFTD odds ratio for groups with high versus low mother education conditional on Behav and included baseline covariates. Our estimate of β1 from the dental data is 0.65, providing an estimated odds ratio of exp(0.65) = 1.92.

Figure 2
Sensitivity analysis for dental data. ML estimates from hybrid model of direct (left panel) and indirect (right panel) effects versus sensitivity parameter, ϕ. Solid line, estimates; dotted lines, lower and upper 95% confidence interval bounds. ...

We consider a plausible range for ϕ to be 0.5 to 2, corresponding to a range of odds ratios for the cohort (selection bias) effect and causal effect (of MomEd on DMFTD) of 1.4 to 3.7 and 0.52 to 1.4, respectively. A more detailed discussion of the elicitation of values for ϕ is given in Albert and Wang (2015). As illustrated in Fig. 2, over the range of ϕ from 0.5 to 2, estimated natural direct effects range from −0.13 to 0.06 and estimated natural indirect effects from −0.21 to −0.02. Note that the value of ϕ = 1, corresponding to 100% of the conditional MomEd–DMFTD association being attributable to selection bias, amounts to a deterministic specification of a zero natural direct effect, which is seen as a degenerate confidence interval for D(1) at ϕ = 1 in Fig. 2.
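
The quoted odds ratio ranges can be reproduced from the estimate of β1 = 0.65 under the reading that a proportion ϕ of β1 is attributed to selection bias and the remaining proportion 1 − ϕ to the causal effect. That split, and the function names, are assumptions used here to illustrate the arithmetic.

```python
import math

BETA1_HAT = 0.65  # estimated log odds ratio reported in the text

def selection_or(phi, beta1=BETA1_HAT):
    """Odds ratio attributed to selection bias: exp(phi * beta1)."""
    return math.exp(phi * beta1)

def causal_or(phi, beta1=BETA1_HAT):
    """Odds ratio attributed to the causal effect: exp((1 - phi) * beta1)."""
    return math.exp((1.0 - phi) * beta1)

for phi in (0.5, 1.0, 2.0):
    print(phi, round(selection_or(phi), 2), round(causal_or(phi), 2))
```

At ϕ = 1 the causal odds ratio equals 1, matching the degenerate zero direct effect noted above, and values of ϕ above 1 imply a causal effect in the opposite direction to the observed association.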

Thus, our sensitivity analysis indicates that a negative estimate of the natural indirect effect (indicating a beneficial effect of high mother education on dental caries through improved oral health behavior) is obtained over a plausible range of alternatives to the sequential ignorability assumption. We also see that there is a small range of values where the confidence intervals exclude 0. However, since a statistically significant result for the natural direct effect (based on the 95% confidence intervals) is not maintained over the plausible range for ϕ, and given, further, our lack of knowledge of the true value of ϕ, the results remain inconclusive. We emphasize, however, that the above data example is intended as illustrative and that more work is needed to draw firm substantive conclusions.

4 Simulation studies

4.1 Approach

Next, we describe a series of simulation studies that we conducted to examine the finite sample bias and precision of our natural direct and indirect effect estimators under various scenarios. In our first study, we considered a situation similar to the dental example described above. Namely, we supposed a binary outcome, four observed intermediate variables (three binary and one continuous), and a binary exposure variable. We also considered a single continuous covariate as a baseline confounder (reducing the number relative to our data analysis for simplification). The data were generated according to the GSEM as given in Eq. (9), with binary variables generated as Bernoulli, and the error terms (for the continuous intermediate variable and for the latent mediator, U) drawn from a normal distribution.

We considered four scenarios corresponding to varying magnitudes of natural direct versus natural indirect effects: (i) opposite effects; (ii) all direct; (iii) all indirect; and (iv) equal direct and indirect effects. The first scenario was specified to mimic the dental data (analyzed above) in which the estimated direct and indirect effects had opposite signs. The parameter values used for the simulations are given in the Supporting Information Table A1. We used a continuous intermediate variable error variance, σ1², equal to 0.5 for all four scenarios.

We conducted a second simulation study to compare the properties of our estimators to those of a common two-step approach that computes a summary measure (taken here to be the average) of the observed intermediate variables (Z1, …, ZM) and then carries out a mediation analysis using the summary measure as an observed mediator. Here, we supposed a situation, designed to be favorable to the two-step approach, involving multiple continuous intermediate variables with common factor loadings on a single latent mediator. Specifically, we used the GSEM above (Eq. (9)) with two continuous (and no dichotomous) intermediate variables, denoted Z1 and Z2, setting β0mC = 0 and β1mC = 1 for each, and with corresponding errors, ε1 and ε2, distributed as N(0, σ²) with specified common measurement error variance, σ². In this model the Zs thus represent replicate measurements of a latent variable (or true value) U. In our implementation of the two-step approach, we used the parametric mediation formula (Imai et al., 2010; Albert, 2012), using the average ((Z1 + Z2)/2) as the observed mediator and assuming the latter to be normally distributed.

In this second study, we used a scenario (to be referred to as Scenario 5) that has approximately equal natural direct and indirect effects (of around −0.08, as in Scenario 4). Our main interest was in examining the effect of the level of the measurement error variability on the performance of the two methods. We thus used Scenario 5 with varying values of the measurement error SD (square root of the variance, σ = σ1 = σ2) as follows: 0, 0.3, 0.6, 1, 3, and 6.
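
To give a sense of scale for these SD values, the sketch below computes the classical reliability ratio of the two-replicate average, Var(U | X, L) / {Var(U | X, L) + σ²/2}, using the unit conditional variance assumed for U. The reliability ratio is a standard measurement-error summary introduced here for illustration; it is not a quantity reported in the paper.

```python
def reliability_of_average(sigma, n_replicates=2, var_u=1.0):
    """Share of the conditional variance of the average that is due to U.

    The average of n independent replicates Z_m = U + eps_m, with
    eps_m ~ N(0, sigma^2), has error variance sigma^2 / n.
    """
    return var_u / (var_u + sigma**2 / n_replicates)

for sd in (0, 0.3, 0.6, 1, 3, 6):
    print(sd, round(reliability_of_average(sd), 3))
```

With σ = 0 the average equals U exactly, while at the largest SD values most of the variation in the average is measurement noise.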

For each study, simulated datasets for each scenario were obtained with a sample size of 200, similar to the dental dataset. For each observational unit (individual) the exposure group, X, coded as 0 (nonexposed) or 1 (exposed), was generated as a Bernoulli random variable with P(X = 1) = P(X = 0) = 0.5, and the covariate L was generated independently as N(0,1). For each individual, the latent variable U was then generated, followed by the remaining model variables (i.e., (Z1, …, ZM) and Y), according to the specified model (9). The vector of model variables was generated independently across individuals. For each dataset, the method of Section 2, with the whole sample as the reference group, was used to compute the estimates of D(1) and I(0). Five hundred bootstrap samples were drawn for each generated dataset to obtain 95% confidence intervals using the percentile method. For each scenario, 300 replications were performed.
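
The data-generating steps for the second study can be sketched as below, assuming a logistic model for Y, a unit-variance normal model for U, and two replicate measurements with zero intercept and unit loading. The coefficient layout and the parameter values in the usage lines are hypothetical; the values actually used are those in Supporting Information Table A1.

```python
import numpy as np

def simulate_dataset(n, beta, gamma, sigma, seed=None):
    """Generate one dataset with two replicate measurements of U.

    Assumed (hypothetical) coefficient layout:
      U = gamma[0] + gamma[1]*X + gamma[2]*L + eps_U,  eps_U ~ N(0, 1)
      Z_m = U + eps_m,  eps_m ~ N(0, sigma^2),  m = 1, 2
      logit P(Y=1) = beta[0] + beta[1]*X + beta[2]*U + beta[3]*L
    """
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=n)   # exposure, P(X = 1) = 0.5
    l = rng.normal(size=n)             # baseline covariate, N(0, 1)
    u = gamma[0] + gamma[1] * x + gamma[2] * l + rng.normal(size=n)
    z = u[:, None] + sigma * rng.normal(size=(n, 2))
    eta = beta[0] + beta[1] * x + beta[2] * u + beta[3] * l
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    # u is returned only for checking; it would be unobserved in an analysis
    return {"x": x, "l": l, "z": z, "y": y, "u": u}

data = simulate_dataset(200, beta=(-0.5, -0.4, -0.6, 0.2),
                        gamma=(0.0, 0.8, 0.3), sigma=1.0, seed=2015)
z_bar = data["z"].mean(axis=1)  # summary measure used by the two-step approach
```

Generating U before the Zs and Y mirrors the causal ordering in Fig. 1b.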

We note that some scenarios, particularly those with large measurement error variances relative to the corresponding factor loadings, resulted in occasional very large (in absolute value) individual linear predictor values from the logistic regression models, and consequently a noncomputable likelihood for the given individual. To avoid failure of the computer program in such cases, we incorporated a slight truncation by using 0.0003 and 0.9997 as the lower and upper bounds for the estimated probabilities for the binary outcomes. Aside from this difficulty, it is also possible to have a lack of convergence of the Monte Carlo EM algorithm for any given dataset.
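
The truncation step can be illustrated with a short Python sketch. The bounds 0.0003 and 0.9997 are those stated above. The function names and the use of a numerically stable logistic function are assumptions, and this is not the authors' SAS code.

```python
import math

P_MIN, P_MAX = 0.0003, 0.9997  # bounds stated in the text


def clipped_prob(eta):
    """Logistic probability for linear predictor eta, truncated to [P_MIN, P_MAX]."""
    if eta >= 0:
        p = 1.0 / (1.0 + math.exp(-eta))
    else:
        e = math.exp(eta)  # avoids overflow for very negative eta
        p = e / (1.0 + e)
    return min(max(p, P_MIN), P_MAX)


def bernoulli_loglik(y, eta):
    """Log-likelihood contribution of a binary outcome y in {0, 1}."""
    p = clipped_prob(eta)
    return math.log(p) if y == 1 else math.log(1.0 - p)
```

Without the truncation, an extreme linear predictor gives a probability that rounds to exactly 0 or 1 in floating point, and the log-likelihood of the opposite outcome is then undefined.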

The true values for the estimands were obtained by applying the formula for the expected potential outcomes (4) using the true values for the regression parameters (βs and γs) in the appropriate version of model (5), integrating over the true (standard normal) distribution for U, and averaging over the Ls for all subjects in the dataset. These dataset-specific true values were used in computing the relative biases and coverage probabilities as described below. For descriptive purposes (as presented in Table 3), we computed an overall true value for a given scenario and estimator, as the average of the dataset-specific true values over the multiple replications.
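
As a rough illustration of this computation, the Python sketch below approximates an expected potential outcome E[Y(x, U(x*))] by numerical integration over a standard normal error in U, averaged over covariate values. Eqs. (4), (5), and (9) are not reproduced in this section, so the logistic outcome model, the linear model for U, the trapezoidal rule, and all coefficient names are assumptions made for the example. Natural direct and indirect effects would then be obtained as differences between such quantities.

```python
import math


def expit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def expected_potential_outcome(x, x_star, covariates, beta, gamma,
                               half_width=8.0, n_grid=1601):
    """Approximate E[Y(x, U(x_star))], averaged over the covariate values.

    Assumed (hypothetical) model:
      U = g0 + gx * x_star + gl * L + e,  e ~ N(0, 1)
      P(Y = 1) = expit(b0 + bx * x + bu * U + bl * L)
    """
    b0, bx, bu, bl = beta
    g0, gx, gl = gamma
    step = 2.0 * half_width / (n_grid - 1)
    total = 0.0
    for l in covariates:
        mu_u = g0 + gx * x_star + gl * l
        integral = 0.0
        for k in range(n_grid):
            e = -half_width + k * step
            weight = step * (0.5 if k in (0, n_grid - 1) else 1.0)  # trapezoid
            integral += weight * expit(b0 + bx * x + bu * (mu_u + e) + bl * l) * normal_pdf(e)
        total += integral
    return total / len(covariates)
```

Averaging over the covariate values observed in each simulated dataset is what makes the resulting true values dataset-specific.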

Table 3
Simulation (study 2) results for a two-step versus the latent mediator model approach for estimating direct (D(1)) and indirect (I(0)) effects.

For each scenario and estimator, we computed the bias (mean of the estimate minus the true value), relative bias (the mean of the ratio of the dataset-specific bias and true value) for nonnull estimands, simulation standard error, coverage (percentage of 95% confidence intervals that cover the true estimand value), and power (percentage of 95% confidence intervals that do not cover 0). Analyses (for both real and simulated data) were conducted in SAS version 9.4 using SAS/IML and the LOGISTIC and REG procedures.
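
The performance measures listed above can be computed from the per-replication results roughly as in the following Python sketch. The function name summarize and the input layout are hypothetical. The simulation standard error is taken here to be the standard deviation of the estimates across replications, which is an assumption about the intended definition.

```python
import statistics


def summarize(estimates, truths, intervals):
    """Summarize replications for one scenario and estimator.

    estimates: point estimates, one per replication
    truths:    dataset-specific true values, one per replication
    intervals: (lower, upper) 95% confidence limits, one pair per replication
    """
    n = len(estimates)
    errors = [est - tru for est, tru in zip(estimates, truths)]
    nonnull = all(tru != 0 for tru in truths)
    return {
        "bias": statistics.mean(errors),
        # Relative bias is only defined for nonnull estimands.
        "relative_bias": (
            statistics.mean([err / tru for err, tru in zip(errors, truths)])
            if nonnull else None
        ),
        "simulation_se": statistics.stdev(estimates),
        "coverage_pct": 100.0 * sum(lo <= tru <= hi
                                    for (lo, hi), tru in zip(intervals, truths)) / n,
        "power_pct": 100.0 * sum(not (lo <= 0.0 <= hi) for lo, hi in intervals) / n,
    }
```

For a null estimand such as I(0) = 0, the "power" entry is the percentage of intervals excluding 0, that is, the empirical type I error rate.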

4.2 Results

Results for the two simulation studies are shown in Tables 2 and 3, respectively. From Table 2, the results for Scenario 1 show rather high (absolute values of) relative biases of 20% and 9% for the direct and indirect effect estimators, respectively. For each of the remaining scenarios (2–4), the relative biases were less than 6% for both the natural direct (D(1)) and indirect (I(0)) effect estimators. These contrasting results can be understood by noting that Scenario 1 (which mimicked the dental data) had relatively small factor loadings (analogous to high measurement error variances in continuous intermediate variables), while larger factor loadings were used for Scenarios 2 through 4. In supplementary simulations, we found that relative biases for Scenario 1 were reduced for larger sample sizes (as expected due to the consistency of ML estimators). For example, for n = 5000 per exposure group the relative biases for the direct and indirect effects were 1.8% and 0.9%, respectively.

Table 2
Simulation (study 1) results using latent mediator model approach for estimating direct (D(1)) and indirect (I(0)) effects in four scenarios.

Coverages for the bootstrap confidence intervals were close to (within 3%) the nominal 95% level in all cases but one, namely, the indirect effect in Scenario 2 (with I(0) = 0) where the coverage was nearly 100%. Power was moderate at best even for apparently substantial mediation effect sizes. Thus, it would appear that a sample size of 200 (as used in the dental data) is too small to detect the likely mediation effect magnitudes occurring in that study.

Table 3 provides the results for the second simulation study. Here, we compare the proposed latent mediator approach with the two-step approach described in the previous section. Only in the case of no measurement error (σ = 0) did the two-step approach provide consistently lower relative biases than the latent mediator approach. We note that a case of zero or near-zero error variance would be detectable in practice, as the multiple intermediate variable measurements for each person would then be the same or very close. For σ ≥ 0.6, the latent mediator approach had considerably smaller relative biases than the two-step approach. For large SDs, the latent mediator approach, while maintaining its advantage over the two-step approach, also had large relative biases (exceeding 60% in the most extreme scenario of σ = 6). We note that supplementary simulations showed that for a large sample size of 5000 per exposure group, the latent mediator approach provided low relative biases even for the case of σ = 6. Finally, as shown in Table 3, coverage probabilities for the latent mediator approach were within 3% of the nominal 95% confidence level for all estimators, while those for the two-step approach were sometimes far from the nominal level, and as low as 86% even for σ = 1.
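
One way to see why the two-step approach deteriorates as σ grows is the classical reliability ratio of the averaged measurement. The Python sketch below is a standard measurement-error calculation rather than something computed in the paper. It assumes that U has unit variance and that the errors are independent of U and of each other. The function name is hypothetical.

```python
def reliability_of_mean(sigma, n_rep=2, var_u=1.0):
    """Share of the variance of the average of n_rep replicates that is due to U.

    Var(mean of Z) = var_u + sigma**2 / n_rep under independent errors.
    """
    return var_u / (var_u + sigma ** 2 / n_rep)


# Reliability of (Z1 + Z2)/2 over the grid of error SDs used in Scenario 5.
for sigma in [0, 0.3, 0.6, 1, 3, 6]:
    print(sigma, round(reliability_of_mean(sigma), 3))
```

Under these assumptions the average is a perfect proxy for U only at σ = 0, and at σ = 6 only about one nineteenth of its variance reflects the latent mediator.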

5 Discussion

In this paper, we have developed an approach to causal mediation analysis for a latent mediator that accommodates mixed types of outcome variables. This approach involves the fitting of a GSEM using an approximate Monte Carlo EM algorithm for ML estimation (Sammel et al., 1997), followed by mediation formula computations to obtain estimates of natural direct and indirect effects. This approach makes a standard, though admittedly strong, sequential ignorability assumption. The new method was demonstrated in a dental data example, which revealed a suggestive (though not statistically significant) indirect effect of mother education on dental caries through a latent mediator representing oral health behavior. We studied the robustness of this conclusion to departures from sequential ignorability using a simple new sensitivity analysis method (Albert and Wang, 2015). The data used for this example and a SAS macro implementing the methods are included in the Supporting Information.

The proposed mediation effect estimators were found to have low bias under most of the simulation study scenarios that we examined, but had substantial finite-sample bias in cases of high error variance for the variables measuring the latent mediator. However, the proposed approach was found to have lower bias than a two-step approach in cases of substantial measurement error variance, although the latter has lower bias when there is no measurement error. We note that the two-step approach implies a mediation model structure similar to that of the latent mediator model (both assuming a single mediator), but differs in that it assumes no measurement error for the (summary measure) mediator. Further, in our simulation study we assumed that the intermediate measurement weights (thus, the optimal summary measure) were known in the two-step approach. In more general factor analysis situations where the two-step approach involves the estimation of a factor score in the first step (as in our data example), the advantage of the latent mediator model approach is likely to be greater. Wall and Li (2003) conducted an analogous comparison for a latent variable regression model, and also recommended a simultaneous (full information ML) estimation approach.

Our finding of large finite-sample biases in the case of high measurement error variance (which was seen across methods) has important practical implications. We would recommend in particular that attention be paid to the selection of intermediate measurements in order to avoid this problem. In the factor analysis context, this may be achieved by ensuring that variables used to measure a single factor are “coherent,” as would be indicated by high correlations among them. Although we did not focus on power issues, our results suggest that cases of high measurement error variance would require rather large sample sizes to maintain low bias in the estimation of mediation effects.

A limitation of our proposed approach is the required parametric assumptions of the GSEM. This includes the often reasonable, but unverifiable, assumption of a normally distributed latent mediator. The impact of departures from this assumption is indicated by results in the observed mediator case (Albert, 2012), which show substantial biases in mediation effect estimators (based on the standard mediation formula approach) when the mediator is nonnormally distributed.

Another limitation of our approach is its computational intensity. It is possible that computing times may be decreased by using a more efficient computational technique, for example, based on Newton–Raphson (Lindstrom and Bates, 1988) or a hybrid approach (e.g., Jamshidian and Jennrich, 1997). A more obvious reason for the long computing times is that, due to the complexity of the mediation formula computation (particularly for a continuous mediator), a closed-form variance formula is not available, requiring us to resort to the use of bootstrap sampling. An alternative approach might be to develop an approximate discrete version of the mediation formula that would allow a delta method approach to variance estimation.

Another possible limitation of the proposed method is that some researchers may have difficulty, for a given data context, in interpreting the estimate of the indirect effect through the latent mediator. However, even in such cases, the proportion of exposure effect occurring through the latent mediator may be readily interpretable. Similarly, it may be difficult, in the latent mediator model context, to conceive of a controlled direct effect, which would require the latent mediator to be somehow fixed to a constant value. In contrast, the natural direct and indirect effects, which have descriptive, as opposed to prescriptive, interpretations (Pearl, 2001) are suitable estimands for latent mediator models.

In situations where multiple mechanisms are of interest, the multiple mediator approach of Wang et al. (2013) may be applicable. However, the use of latent variables may be advantageous in that case as well. This will require an extension of the present approach to handle multiple latent mediators. Further research will be needed to address the statistical and computational challenges of this more complex situation.

Supplementary Material



The authors would like to thank Dr. Mary Sammel for sharing her SAS code and Dr. Lynn Singer for providing access to the longitudinal cohort of VLBW and NBW adolescents. They are also grateful to the Editor, an associate editor, and the reviewers for constructive comments that helped to improve the paper. Support for this research was provided in part by research grants R01-DE022674 (J.A.) and R21-DE16469 (S.N.) from the National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, MD, USA, and research grants MC-390592, MC-00127, and MC-00334 (L.S.) from the Maternal and Child Health Program, Health Resources and Services Administration, Department of Health and Human Services, Rockville, MD, USA.


Additional supporting information, including source code to reproduce the results, may be found in the online version of this article at the publisher’s website.

Conflict of interest

The authors have declared no conflict of interest.


  • Albert JM. Mediation analysis via potential outcomes models. Statistics in Medicine. 2008;27:1282–1304.
  • Albert JM. Distribution-free mediation analysis for nonlinear models with confounding. Epidemiology. 2012;23:879–888.
  • Albert JM, Nelson S. Generalized causal mediation analysis. Biometrics. 2011;67:1028–1038.
  • Albert JM, Wang W. Sensitivity analyses for parametric causal mediation effect estimation. Biostatistics. 2015;16:339–351.
  • Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology. 1986;51:1173–1182.
  • Bollen KA. Structural Equations with Latent Variables. Wiley; New York, NY: 1989.
  • DiCiccio TJ, Efron B. Bootstrap confidence intervals. Statistical Science. 1996;11:189–212.
  • Imai K, Keele L, Yamamoto T. Identification, inference and sensitivity analysis for causal mediation effects. Statistical Science. 2010;25:51–71.
  • Jamshidian M, Jennrich RI. Acceleration of the EM algorithm by using quasi-Newton methods. Journal of the Royal Statistical Society, Series B. 1997;59:569–587.
  • Kuroki M. Graphical identifiability criteria for causal effects in studies with an unobserved treatment/response variable. Biometrika. 2007;94:37–47.
  • Kuroki M, Pearl J. Measurement bias and effect restoration in causal inference. UCLA Cognitive Systems Laboratory; Los Angeles, CA: 2011. (Technical Report R-366).
  • le Cessie S, Debeij J, Rosendaal FR, Cannegieter SC, Vandenbroucke J. Quantification of bias in direct effects estimates due to different types of measurement error in the mediator. Epidemiology. 2012;23:551–560.
  • Lindstrom MJ, Bates DM. Newton-Raphson and EM algorithms for linear mixed effects models for repeated measures data. Journal of the American Statistical Association. 1988;83:1014–1022.
  • Muthén B. Applications of causally defined direct and indirect effects in mediation analysis using SEM in Mplus. Muthén & Muthén; Los Angeles, CA: 2011. (Technical Report).
  • Muthén B, Asparouhov T. Causal effects in mediation modeling: an introduction with applications to latent variables. Structural Equation Modeling. 2015;22:12–23.
  • Nelson S, Albert JM, Lombardi G, Wishnek S, Asaad G, Kirchner HL, Singer LT. Dental caries and enamel defects in very low birth weight adolescents. Caries Research. 2010;44:509–518.
  • Nelson S, Lee W, Albert JM, Singer LT. Early maternal psychosocial factors are predictors for adolescent caries. Journal of Dental Research. 2012. doi: 10.1177/0022034512454434.
  • Pearl J. Direct and indirect effects. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence; San Francisco, CA: Morgan Kaufmann. 2001. pp. 411–420.
  • Pearl J. The causal mediation formula: a guide to the assessment of pathways and mechanisms. Prevention Science. 2012;13:426–436.
  • Preacher KJ, Hayes AF. Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods. 2008;40:879–891.
  • Rabe-Hesketh S, Skrondal A. Classical latent variable models for medical research. Statistical Methods in Medical Research. 2008;17:5–32.
  • Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3:143–155.
  • Sammel MD, Ryan LM, Legler JM. Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society, Series B. 1997;59:667–678.
  • Singer LT, Yamashita TS, Lilien L, Collin M, Baley J. A longitudinal study of infants with bronchopulmonary dysplasia and very low birthweight. Pediatrics. 1997;100:987–993.
  • Song PXK, Li M, Yuan Y. Joint regression analysis of correlated data using Gaussian copulas. Biometrics. 2009;65:60–68.
  • VanderWeele TJ, Valeri L, Ogburn EL. The role of misclassification and measurement error in mediation analyses. Epidemiology. 2012;23:561–564.
  • Wall MM, Li R. Tutorial in biostatistics: comparison of multiple regression to two latent variable techniques for estimation and prediction. Statistics in Medicine. 2003;22:3671–3685.
  • Wang W, Albert JM. Estimation of mediation effects for zero-inflated regression models. Statistics in Medicine. 2012;31:3118–3132.
  • Wang W, Nelson S, Albert JM. Estimation of causal mediation effects for a dichotomous outcome in multiple-mediator models using the mediation formula. Statistics in Medicine. 2013;32:4211–4228.