Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Stat Modelling. Author manuscript; available in PMC 2017 April 1.
Published in final edited form as:
PMCID: PMC4982519

Using a latent variable model with non-constant factor loadings to examine PM2.5 constituents related to secondary inorganic aerosols


Factor analysis is a commonly used method of modelling correlated multivariate exposure data. Typically, the measurement model is assumed to have constant factor loadings. However, from our preliminary analyses of the Environmental Protection Agency’s (EPA’s) PM2.5 fine speciation data, we have observed that the factor loadings for four constituents change considerably in stratified analyses. Since invariance of factor loadings is a prerequisite for valid comparison of the underlying latent variables, we propose a factor model that includes non-constant factor loadings that change over time and space using P-spline penalized with the generalized cross-validation (GCV) criterion. The model is implemented using the Expectation-Maximization (EM) algorithm and we select the multiple spline smoothing parameters by minimizing the GCV criterion with Newton’s method during each iteration of the EM algorithm. The algorithm is applied to a one-factor model that includes four constituents. Through bootstrap confidence bands, we find that the factor loading for total nitrate changes across seasons and geographic regions.

Keywords: factor model, non-constant factor loading, P-spline, tensor product spline basis, generalized cross-validation, EM algorithm, PM2.5 constituents

1 Introduction

Fine particulate matter that is, air particles with diameter <2.5 micrometers (PM2.5), is associated with adverse health effects (e.g., Dominici et al., 2006; Schwartz et al., 1996). Laden et al. (2000) suggested PM2.5 from different sources has different toxic effects, implying the importance of the chemical profile of the constituents of PM2.5. Because the constituents are typically correlated, studies that analyze the relationship between constituents and health outcomes often use methods that summarize the correlated constituents, for example, using clustering (Austin et al., 2013) or latent variable models (Gryparis et al., 2007; Nikolov et al., 2008). In particular, widely used source apportionment methods (Thurston et al., 2011) use factor analytic models to summarize constituent data in terms of their relationship to air pollution sources, such as power plant or vehicle exhaust, that can be potentially regulated. Since the air pollution sources are not directly observed, source contributions are treated as latent factors and estimated from the data.

However, methods to summarize PM2.5 constituents may yield inconsistent summaries when applied to large geographic areas or long time periods. Such summaries depend on the correlation among constituents and correlations among constituents may vary with time and space (Austin et al., 2013; Bell et al., 2007). For instance, our preliminary anaylses of PM2.5 constituent data issued by the Environmental Protection Agency (EPA) Air Quality System (AQS) for years 2008 and 2009 (see Section 5 for more details on the data) suggest that the correlation among constituents related to secondary inorganic aerosols (Salvador et al., 2004; Thurston et al., 2011) varies with season and geographic location (Figure 1). Nitrate exhibits the greatest differences in terms of its correlation with other constituents, both across regions and across time within region. As a result, it is not surprising that preliminary factor analyses of this constituent data, stratified by time and region, yield factor loadings that differ across time and region for some constituents. In particular, the factor loading for total nitrate (Figure 2a) demonstrates large differences across regions and time. The estimated factor loadings tend to be larger in the winter months, implying a season effect. Furthermore, the estimated factor loadings are significantly higher in the midwest region, compared to other regions, for all periods except the first quarter of 2008. By contrast, the factor loading for ammonium ion exhibits fewer and less pronounced differences across either region or time (Figure 2b).

Figure 1
Array of heat maps shows how the Pearson correlation among the four PM2.5 constituents (log-transformed) change across different strata defined by region (row) and time period (column). The column headers denotes the year (Yr1: 2008, Yr2: 2009) and the ...
Figure 2
(a) and (b) compare the factor loadings for a particular constituent from stratified confirmatory factor analyses. The y-axis shows the size of the factor loading. Each dot on the plot is the factor loading from analysis of one stratum. The midwest, northeast ...

Differences in the factor model structure are not specifically modelled in current literature that uses factor analytic models applied to constituent data collected across large geographical areas. Although many models have been proposed to reduce data dimension and to predict outcomes based on many constituents simultaneously (Austin et al., 2013; Choi et al., 2009; Peng et al., 2009), a more thorough understanding of how the correlations among PM2.5 constituents change with time and space may allow us to gain further insights on the validity and interpretation of such dimension-reduction approaches and thus potentially improve modelling the association between constituents and health outcomes.

There is now a large literature on building models that analyze air pollution data from the perspective of time and space (e.g., Berrocal et al., 2010; Daniels et al., 2006; Huerta et al., 2004; Shaddick and Wakefield, 2002; Sahu and Mardia, 2005, to name a few). However, we focus on models utilizing latent factors, given their connection to air pollution sources (Nikolov et al., 2008, 2011) and because some of these models can be applied to understanding temporal and spatial patterns in the correlation of PM2.5 constituents specifically. For example, Gryparis et al. (2007) proposed semiparametric latent variable regression models for modelling multiple surrogates of a single pollution source. They used penalized splines to allow non-linear effects of covariates on the latent variable mean and used geoadditive models to account for the temporal and spatial correlations in the latent variable (Kammann and Wand, 2003). Lopes et al. (2008) used a spatial dynamic factor model to incorporate temporal and spatial correlations among observations of a single pollutant from different monitors over a certain time period. In that model, the spatial correlation was introduced through the factor loading matrix, that is, at a given time-point, monitors were treated as different variables. They modelled the temporal correlation among the latent factors using an autoregressive structure. Ippoliti et al. (2012) used a similar dynamic factor model but they coupled two sets of latent factors so that one latent factor could be predicted from the other.

More generally, models using latent variables have gained much popularity in research that relates environmental exposures to health outcomes because they can reduce the dimension of the data by summarizing correlated exposure measurements (see Sánchez et al., 2005). In the last decade, much work has been done to relax model assumptions, including nonlinear relationships among latent variables (Arminger, 1998; Guo et al., 2012; Lee et al, 2003; Lu et al, 2012; Song and Lu, 2010; Zhu and Lee, 1999), nonlinear links relating items to latent variables (Bauer, 2005; Gryparis et al., 2007), as well as models with spatial structures on the latent variables (Fahrmeir and Raach, 2007; Hewson and Bailey, 2010; Liu et al., 2005).

However, except for Zhang and Lee (2009) and Valentini et al. (2013), none of these existing advanced models specifically address the variation in the factor loadings exhibited among the constituent data (Figure 2a). Traditionally, the measurement model that links the observed items to the latent variables is assumed constant across the covariate space (e.g., across time and geographic location). A constant measurement model, also termed measurement model invariance (Horn and McArdle, 1992; Meredith, 1993), is needed for valid comparison of differences in latent variable means across covariate levels or groups. However, it has been shown that allowing the measurement model to vary across values of categorical covariates may be desirable when estimating exposure–health outcome associations, potentially because the flexible measurement model enables better estimation of the latent factor itself (Sánchez et al., 2012; Tao et al., 2015). Allowing the measurement model to vary across continuous covariates, however, is a more challenging problem. Zhang and Lee (2009) used a local maximum likelihood-based estimation procedure to model the factor loadings as varying with time. Valentini et al. (2013) gave the factor loadings a spatial structure and they modelled the factor loadings as conditionally independent multivariate Gaussian Markov Random Fields.

In this article, we propose a semiparametric factor analysis model with non-constant factor loadings that change with multiple covariates, in particular time and space as motivated by the EPA’s PM2.5 constituents data. We use penalized splines to estimate factor loadings that change with multiple covariates. This construction for the factor loadings bears a resemblance to the classic varying coefficients model (Hastie and Tibshirani, 1993). In Section 2, we propose the factor model that incorporates non-constant factor loadings, while Section 3, explains the estimation procedure. A key feature of our estimation algorithm is to include the generalized cross-validation (GCV) score within the EM iterations so that we can optimize the smoothing of the splines in a latent variable setting. In Section 4, we present a simulation study to assess the algorithm and estimation properties for certain model parameters and the estimated factor scores. We then analyze the constituent data set and draw inferences based on bootstrap approaches in Section 5. Section 6 ends with a discussion and future directions.

2 Model

Let yi be a vector of P observed variables measured at time ti and location (ui, vi) denoted by two coordinates, typically longitude and latitude. We use a latent variable model to summarize the information of yi. We introduce non-constant factor loadings in order to incorporate temporal and spatial differences in the marginal correlations among observed variables and to enable assessment of the invariance of the measurement model across time and space.

We assume that the elements of yi are linearly related to latent variables according to the measurement model,


where ηi is a latent variable, Λi = (λ1,i, …, λp,i, …, λP,i)T is a P × 1 vector of factor loadings, μ is a P × 1 vector of scalars and εi is a P × 1 vector of random errors. We let ηi be an i.i.d. random variable following the standard normal distribution. Since the mean of the latent variable is given, the observed variable means μ are identifiable (Sánchez et al., 2005). Furthermore, constraining the variance of the latent variable to 1 allows the identification of the factor loadings up to an additional sum to zero constraint (below). The error vectors εi are assumed to be independent of ηi, and to be independent and identically distributed (i.i.d.), and to follow a multivariate normal distribution, εi ~ N(0, Σ), where Σ is a diagonal covariance matrix. In our data application, the i.i.d. assumption is tenable since we first de-trend the observed data. The normality assumption is also adequate since the log-transformed PM2.5 constituents are approximately normal (see Section 5 for more details).

2.1 Non-constant factor loadings that vary with time and space

In classical latent variable models, the factor loadings are estimated as constant parameters or sometimes allowed to differ across levels of a categorical variable (Horn and McArdle, 1992). However, in our proposed model, factor loadings are allowed to vary smoothly with time and/or location, resembling a varying coefficients model (Hastie and Tibshirani, 1993).

We model the non-constant factor loadings as a smooth function of (ti, ui, vi), that is, the value of the pth factor loading for sample i, λp,i, is specified in the form of an additive model,


where fp,t, fp,uv, fp,tuv are all centred functions (i.e., the sum of the function values across all the data points is zero), λp,0 represents the constant part of the factor loading. The functions fp,t, fp,uv, respectively, capture the main effects of time and location on factor loadings, while fp,tuv tells us whether the location effect changes over time or whether there is interaction between time and location. We will omit ti, ui, vi in the following notations unless necessary.

We represent the non-constant factor loadings using penalized splines. Specifically, we construct univariate cubic B-spline basis functions for each of the covariates, t, u, v and use them to construct tensor product bases for the interaction terms fp,uv(ui, vi) and fp,tuv(ti, ui, vi) (Wood, 2006, Section 4.1.8). Collecting the values of the factor loadings corresponding to the pth observed variable across all observations i = 1, …, n into λp = (λp,1, …, λp,i, …, λp,n)T, we can express λp = λp(βp) = Xpβp, where


Note that we use Λi = (λ1,i, …, λp,i, …, λP,i)T to denote values of the P factor loadings for the ith observation and λp = (λp,1, …, λp,i, …, λp,n)T as the factor loadings for the pth observed variable. The design matrix Xp contains the values of the basis functions evaluated for each observation, with rows corresponding to observations i = 1, …, n, and the columns corresponding to the components in equation (2.2) that have been suitably centered and constrained to remove redundancies that arise when creating the tensor product bases (see Wood, 2006, Section 1.8.1). Parameters βp are the corresponding coefficients, which, except for the intercept λp,0, are penalized.

We use the squared first difference penalty (Eilers and Marx, 1996) to control the degrees of freedom of the smooth, which yields the desirable property of shrinking the smooth to the intercept λp,0. The penalty is constructed as a composite penalty that combines separate penalties for each of the component smooth in equation (2.2), that is, coefficients βp,t, βp,uv and βp,tuv are penalized separately. Furthermore, the interaction terms use tensor product bases and have separate penalties for each covariate conditional on the other covariates in the interaction (see Wood, 2006, Section 4.1.8 for a similar construction). In total, the penalty corresponding to each factor loading relies on six distinct smoothness parameters. These penalties are constructed separately for each of the P factor loadings in the model. The penalty for the spline coefficients of the pth factor loading is Ωp(βp,ωp)=βpTSp(ωp)βp, where S(ωp) is constructed as a block diagonal matrix whose diagonal blocks are 0, and ωp,tSt, ωp,u|vSu|v + ωp,v|uSv|u, ωp,t|uvSt|uv + ωp,u|tvSu|tv + ωp,v|tuSv|tu that correspond to the penalties of βp,t, βp,uv and βp,tuv. We define the overall penalty as Ω(β,ω)=p=1PβpTSp(ωp)βp, where β is the vector of spline coefficients for all the factor loadings combined and ω denotes all the penalty parameters.

3 Estimation

We use the EM algorithm to facilitate the estimation, where the complete data structure includes the observed data yi augmented with unobserved variables ηi. The penalized log-likelihood of the augmented data is


where y=(y1T,,ynT)T and η = (η1, … ηn), respectively, and θ contains the unique parameters in Σ (i.e., residual variances σp2 for each observed variable) in addition to {μp, λp,0, βp|p = 1, …, P}. The E-step computes the expected log-likelihood using the conditional distribution of ηi|yi. The M-step estimates Σ, and μp, λp,0, βp, p = 1, … P, by maximizing the expected log-likelihood. The smoothing parameters ωp are selected within each iteration of the EM.

3.1 EM algorithm

Using the assumption of a normal distribution and omitting the proportionality constant, the complete data log-likelihood is


3.1.1 E-step

Even though we are to maximize the penalized likelihood as in equation (3.1), the penalty does not affect the E-step because the penalty does not involve the missing latent variable. The expected values of ηi and ηi2 are needed in the E-step. Using the conditional normal distribution, these expectations at the (r)-th iteration are listed as follows. (We omit the superscript (r − 1) on Λ, Σ in the formulas below, but the calculations use parameter estimates from the (r − 1) th iteration).


3.1.2 M-step

In the M-step, we estimate μ, Λi, Σ, from the first part of equation (3.1), including the penalty. Given the estimated latent variables at the (r)-th iteration, this step can be thought of as fitting P varying coefficient models, one for each of the P observed variables (e.g., for each pollutant). To see this, first we rearrange the log-likelihood. If we let yp,i denote each of the P observed variables in yi, then we can write the objective penalized log-likelihood as


where y.p = (yp,1, …, yp,n)T and η = (η1, …, ηn)T are n × 1 column vectors. We can do this rearrangement because Σ is a diagonal matrix. This allows us to maximize equation (3.2) separately for each p, that is, for each observed variable. If Σ is not diagonal, this step would instead be analogous to multivariate regression.

For each observed variable y.p, the penalized log-likelihood given η(r) is


where ‘[composite function (small circle)]’ is the entrywise Hadamard product and S(ωp) is the previously described penalty matrix. Equation (3.3) is essentially a regression of y.p on [eta w/ tilde](r) with βp penalized, where


that is, we can write η(r)λp(βp)=η(r)(Xpβp)=[η1(r)x1βpηn(r)xnβp]=η(r)βp. Hence, replacing η(r) [composite function (small circle)] Λp(βp) with [eta w/ tilde](r)βp in equation (3.3) and maximizing the penalized log-likelihood, we then obtain a Ridge regression solution for the coefficients




where (ηTη)(r)=inxiT(ηi2)(r)xi=[x1T(η12)(r),,xnT(ηn2)(r)]Xp.

3.2 Generalized cross-validation

So far, we have treated the smoothing parameters ωp as given, but here we describe how to use the GCV score to select the smoothing parameters (Craven and Wahba, 1979; Golub et al., 1979). GCV is commonly used in the selection of smoothing parameters for generalized additive models (GAM, see Wood, 2006, Section 4.5), where the response variable is to be predicted from known covariates using a spline function. GCV penalizes the residual sum of squares with the effective number of parameters (see equation (3.6)). The smoothing parameters that give the smallest GCV score are selected. In the measurement model (see equation (2.1)), the latent variable, though actually [eta w/ tilde](r) according to equation (3.4), is a natural counterpart of the covariate in a GAM model, so we make use of [eta w/ tilde](r) and have




is the influence matrix (compare with equation (3.5)) and tr(Ap) represents the effective number of parameters used to estimate λp.

Since the latent variable is unknown and is itself to be computed from yp and β(r), use of ||y.p[eta w/ tilde](r) β(r)||2 is not valid if we evaluate GCVp after the EM algorithm converges. This is because [eta w/ tilde](r) from the last iteration varies with the smoothing parameters. Instead, we evaluate GCVp within each iteration of the EM algorithm, where [eta w/ tilde](r) is held as the fixed covariate in the regression-based M-step, and update the smoothing parameters with (ωp)(r) that gives the smallest GCVp for each M-step. This approach of selecting smoothing parameters for each EM iteration is similar to the performance iteration method used in GAM models (see Wood, 2006, Section 4.6).

We minimize GCVp using a numerical approach based on Newton’s method. Here, GCVp is regarded as a function of ρp = log(ωp) because ωp has to be constrained as positive. GCVp (ρp) is approximated around the current estimate of (ρp)(r) with a quadratic function based on the first and second derivatives of GCVp with regard to ρp. Then the minimum of GCVp(ρp) can be approached by successively solving the quadratic minimum. Thus, the M-step is modified as follows. We first solve βp(r) with (ωp)(r−1) = exp ((ρp)(r−1)) plugged in equation (3.5). Then we evaluate ρpGCVp(ρp) and 2(ρpβ)2GCVp(ρp) at (ρp)(r−1), obtain the critical point of the quadratic approximation and re-evaluate the derivatives of GCVp(ρp) again at the critical point. The process is repeated until we get to the minimum of GCVp(ρp) and then (ρp)(r−1) is updated to (ρp)(r).

3.3 Simultaneous confidence band

We use bootstrap resampling to construct a 95% confidence interval for point estimates and 95% simultaneous confidence band for spline estimates (Ruppert et al., 2003). We sample with replacement from the n observations to generate M new data sets of n observations each. In our data application, this resampling approach is valid since de-trending the data removes spatial and temporal correlation. The same EM algorithm is repeated on these data sets to get bootstrap estimates. This bootstrap strategy is based on the assumption that there is little temporal and spatial correlation for the residual error εi, of the actual data.

For single point estimates, such as λp,0, the 2.5% and 97.5% quantile of the bootstrap samples is taken as the bounds of the 95% confidence interval. For spline estimates, like fp,t, we first obtain C, the 95% quantile of


where f^p,t is the bootstrap estimate of fp,t and t is the support of t. Then the 95% simultaneous confidence band is constructed as f^p,t(t)±CVar[f^p,t(t)-f^p,t(t)]. We obtain Var[f^p,t(t)-f^p,t(t)] also from the bootstrap samples.

4. Simulation study

We conduct simulation studies to examine the advantages and disadvantages of this novel estimation method, as well as to verify that the estimation algorithm can recover model parameters. We first examine whether the algorithm can recover the true parameter values when the factor loadings vary with several covariates. For this purpose, we simulate 1000 data sets where P observed variables are associated with one latent factor and the factor loadings are smooth functions of three covariates in this form: λp,i = λp,0 + fp,uv(ui, vi) + fp,t(ti). This model is similar to the one used for our data application. For each of the 1000 data sets, we generate data on an 11 × 11 grid (i.e., 121 (ui, vi)), and for each point on the grid we have 101 time points. Each simulated data set has four observed variables and a sample size of 11 × 11 × 101 = 12, 221, which is a similar size to our data example. The simulation shows that our algorithm can correctly estimate non-constant factor loadings that change with multiple covariates, although some oversmoothing for fp,uv is observed (Figure A.1). Estimates for the other parameters are also unbiased (Table A.1).

Next, we further investigate how biases can enter various components of the model across different simulation scenarios where the factor loadings take on different functional forms. For this study, we let the factor loadings be functions of only one covariate. This simpler setting can enhance computational efficiency while still illustrating the impact of various estimating approaches under different scenarios. Specifically, we investigate how using non-constant factor loadings affects the estimation of model parameters and the factor scores. Examining estimation of factor scores gives insight as to how estimated source contributions in air pollution studies may be affected, for instance.

The simulations are set up with three different shapes for the true factor loadings. We let λp,i = λp,0 + fp(ti), where ti [set membership] [0, 1], fp is a centre function (i.e., i=1nfp(ti)=0, and the three types of fp(ti) are as follows, for p = 1, …, P.

  • Constant: fp(ti) = 0,
  • Non-constant, cyclic: fp(ti)=κ[cp+cos(4πtp,i)], where tp,i=ti-p-12P,
  • Non-constant, non-cyclic:
    where tp,i=ti-34+p-12(P-1).

In the formulas, cp is the constant that ensures i=1nfp(ti)=0 and κ is an amplitude parameter that changes the magnitude by which λp,i deviates from λp,0. Figures 3a and 3b illustrate the shape of fp(tj) for the non-constant factor loadings. In the context of air pollution studies, cyclic factor loadings would resemble seasonal patterns, whereas the non-constant non-cyclic pattern would resemble spatial gradients (e.g., north to south).

Figure 3
(a)–(b). Shape of fp(ti) for the simulation study. Each plot shows three curves differentiated by line styles when a scenario has three observed variables and the amplitude parameter κ = 1.

For each of the three shapes of factor loadings listed above, we hold some parameters constant, but we change the signal-to-noise ratio in the model as follows. For all the scenarios, we set μp = 1 for all p and make λp,02+σp2=8 in all cases. Within each scenario, we let λp,0 take on the same value for all p, but we change the ratio, λp,02/(λp,02+σp2), among the different scenarios. This ratio, which was set at 0.3, 0.5 and 0.8, quantifies the percentage of variance for the pth observed variable explained by the latent factor when the factor loading is constant. For the non-constant factor loadings, we also use different values of κ (0.75, 1.25), which also influence the percentage of variance explained by the latent factor.

For each scenario, we generate 1000 simulated data sets, each with 1000 equally spaced data points within the range of t. For some scenarios, we vary the sample size (500, 5000) and the number of observed variables, P = 6 instead of 3. For each scenario, we estimate the factor loadings using three different approaches: (a) constant factor loadings, (b) non-constant factor loadings using penalized splines and (c) non-constant factor loadings using unpenalized splines. For all the scenarios, we choose equally spaced knots that give 10 spline coefficients (including λp,0).

We report the simulation results primarily focusing on scenarios where (a) each simulated data set has 1000 equally spaced data points within the range of ti, (b) λp,02/(λp,02+σp2)=0.8 and (c) κ = 1.25 for the second and third types of fp(ti). The results for these primary scenarios are provided in Tables 1 and and22.

Table 1
Bias and standard deviation (in parentheses) of [lambda with circumflex]p,0, σ^ρ2 for different simulation scenarios and estimating approaches.
Table 2
Bias and mean squared error (MSE) of the factor score for different simulation scenarios and estimating approaches. IBIAS and IMSE show the bias and MSE integrated over the entire distribution of the latent variable η; the remaining columns show ...

We look at the results of our simulation study from two aspects. First, we compare the bias and standard error of [lambda with circumflex]p,0, σ^p2 among different estimation approaches and simulation scenarios (Table 1). This comparison primarily gives insight into possible loss of efficiency in constant factor loadings when they are instead estimated as non-constants, as well as gives insight regarding bias in residual variances, which ultimately affect how much weight a given observed variable is given when estimating the latent factor. When the factor loadings are constant, the bias and efficiency of [lambda with circumflex]p,0, σ^p2 are similar across estimation approaches, although σ^p2 is slightly underestimated when we use unpenalized splines since without penalization, the splines overfit the data. However, when the factor loadings are non-constant and estimated as constant, [lambda with circumflex]p,0, σ^p2 are both biased. The bias becomes trivial when we correctly estimate the factor loadings as non-constant. We also find that the penalty has little effect on the standard deviations of [lambda with circumflex]p,0, σ^p2 (compare ‘penalized’ with ‘un-penalized’ in Table 1), possibly because these two parameters measure the overall signal-to-noise ratio and thus they are less affected by the smoothness of fp(t) as long as the main shape of the pattern is captured. However, the penalty reduces the mean integrated squared error of fp(t) (see Table A.1), since the penalized splines are smoother.

Next, we compare how estimation approaches affect the factor score (Table 2). We define factor score as [eta w/ hat]i = E[theta w/ hat](ηi|y.,i), where [theta w/ hat] represents all the parameter estimates. In other words, the factor score [eta w/ hat]i is the same as η^i(r) from the E-step, but evaluated at convergence. Note that [eta w/ hat]i is shrunken towards the centre compared with ηi(E[η^iηi]=ΛiT(ΛiΛiT+ε)-1Λiηi<ηi). Thus, [eta w/ hat]i is potentially more biased for values of ηi that are farther away from the centre of its distribution. Therefore, in addition to examining the overall bias, 1ni=1n(η^i-ηi), and overall MSE, 1ni=1n(η^i-ηi)2, of the factor score (ηi is the true value of the latent factor from a given simulated data set), we also examine bias and MSE by percentiles of ηi. Since the distribution of the factor score is symmetric, only three percentile categories are listed in Table 2. The upper three categories have the same bias in negative sign (thus the overall zero bias) and the same MSE. To gain further insight regarding estimation of the factor score at different covariate values, we also draw scatterplots of the estimated factor score versus the true factor, with points on the scatterplot receiving different symbols according to tertiles of the values of the covariate t.

The overall average bias and MSE for the factor scores are similar across estimation approaches; however, differences emerge when we examine bias by percentiles of the distribution of the true latent variable and when we examine the factor scores at different covariate values t (Table 2 and Figure 4a–4b). We find that when the factor loadings are constant, the bias and MSE are similar if we estimate factor loadings either as constant or as non-constant using penalized splines. If we estimate the factor loadings using unpenalized splines, the MSE is unsurprisingly larger. However, when the factor loadings are non-constant, we obtain the smallest bias and MSE when we estimate the factor loadings using penalized splines. Conversely, the bias and MSE are largest when the factor loadings are estimated as constant, and this is most notable for scenarios with the non-cyclic factor loading. In this case, [eta w/ hat]i has a higher MSE for values of ηi at the two ends of its distribution (Figure 4a–4b). This larger MSE at the tails of the distribution is due to different directions in the bias of the factor score, which depends on covariate values (see scatterplot of factor scores versus true latent variable in Figure 4b marked according to the values of t).

Figure 4
(a)–(b). Simulation results showing plots of [eta w/ hat]i, against ηi when factor loadings are estimated as constant or non-constant, when true factor loadings are non-constant. The marker symbol distinguishes ηi by tertiles ...

In scenarios where the variance in the observed variables is smaller (i.e., λp,02/(λp,02+σp2) and κ are smaller), we find that the advantage of correctly estimating the non-constant factor loadings using penalized splines is less obvious. Similarly, when the noise in the data gets larger but the true factor loadings are constant, incorrectly estimating them as non-constant results in slightly increased bias in σ^p2 and increased MSE among all the parameter estimates along with the increase of MSE for the factor score.

5 Data application

We applied our model to four constituents from the US PM2.5 speciation data issued by EPA for years 2008 and 2009. The data set we used included 108 monitors stationed in part of the midwest, south and northeast (see Figure 5). Thirty–six monitors had measurements in every three days when they were functioning, but the majority had measurements only in every six days. Thus, we took every other observation from the thirty–six monitors that had measurements in every three days so that data from all monitors were equally spaced.

Figure 5
(a) is the contour plot of fp,uv for the factor loading of total nitrate. The bigger scattered dots are the monitor locations and the smaller arrayed dots represent areas that are significantly different from zero according to the 95% ...

Many of the constituents in the EPA data were zero inflated and would require special treatment beyond the scope of this article. Thus, we selected constituents that, after taking the logarithm, had an approximate normal distribution. Based on the exploratory factor analysis (see Section 1 and Figure 1), we decided to look at four constituents, sulphur, ammonium ion, total nitrate, sulphate, that are associated with secondary aerosols (Nikolov et al., 2008) and fit into a one-factor structure.

Since the concentration of constituents change across time and space, the mean of each observed variable (μ in equation (2.1)) would need to be modelled as a smooth function. Although this is in principle possible with our modelling approach, we opted instead to de-trend the data and work with residuals, that is, first we regressed each constituent on time and spatial coordinates using GAM and then we used the residuals from these GAMs as the observed variables. Note that this approach not only allowed us to treat μ as constant but also effectively removed spatial and temporal correlation that enabled us to conduct inference using bootstrap.

The factor loadings were initially modelled according to equation (2.2), but the interaction term fp,tuv was insignificant based on the bootstrap confidence band, and thus we decided to keep only ft and fuv. We used cubic B-splines as the basis functions; we chose the number of equally spaced knots so that βt and βuv had 10 and 25 parameters, respectively. Altogether, β had 34 parameters. Each component function then had its own set of smoothing parameters, a total of three smoothing parameters.

Table 3 summarizes the estimates of all the parameters in the model except the spline coefficients. (We do not list μ in the table because they are zero for the detrended data.) Figure 5 shows the time and spatial components of the non-constant factor loading for total nitrate. For the other constituents, the non-constant part is not deemed significant since the confidence bands engulf zero entirely (not shown). From Figure 5a, we can see that the factor loading in the midwest is significantly higher in the midwest region than in the south. Figure 5b shows that the factor loading for total nitrate fluctuates significantly over the two years in a periodic fashion, indicating a strong seasonal pattern. These patterns can be mostly explained by the fact that formation of ammonium nitrate particles is favoured in warmer environments (Salvador et al., 2004). In comparison to the factor scores from the model with non-constant factor loadings (which are essentially unbiased as shown in simulations), the factor scores from the model with constant factor loadings have systematic bias, depending on season and region (Figure 6).

Figure 6
Factor scores from two models: (a) all factor loadings estimated as constant parameters (y-axis), (b) all factor loadings estimated as non-constant (x-axis). Factor scores representing observations from different seasons and regions are plotted as follows: ...
Table 3
Parameter estimates for four PM2.5 constituent data.

6 Discussion

We propose a latent variable model where the factor loadings are allowed to vary across continuous covariate values. This modelling approach extends classical factor analysis models and can help shed light into how the relationship between measured variables and underlying constructs may be modified by covariates, and thus can help improve estimation of underlying latent variables. This model is motivated by analysis of PM2.5 data that showed a noticeable difference for some of the factor loadings across time and region. Variations of factor analysis models have long been used to summarize PM2.5 constituents and gain insight into the sources the particles arose from, a process termed source apportionment. However, such data reduction methods have typically been applied to smaller geographical areas. Factor structures that can better capture the temporal and spatially varying correlations among constituents can help improve summaries of PM2.5 constituents collected from a wide geographical area and time span. Our model allows us to capture time and space-varying correlations among constituents through the temporal and spatial pattern of factor loadings while maintaining a connection to interpetable sources.

We use the penalized regression spline to model factor loadings as functions of multiple covariates and we control the smoothness related to each covariate with its own smoothing parameter. In order to optimize the smoothing parameters in the context of a latent variable model, we propose the use of GCV within each iteration of the EM algorithm. This bypasses the issue of lack of predictors for the marginal distribution of the model by using the latent variable as the predictor in a single M-step. We also incorporate Newton’s method into the EM iterations to facilitate the optimization of multiple smoothing parameters at the same time. The algorithm is successfully implemented in SAS (available as supplementary material), thus enhancing applicability of our method.

Alternatively, we can penalize the spline by treating the spline coefficients as random variables, similar to the likelihood approach used for smoothing parameter selection in additive models (Ruppert et al., 2003; Wood, 2006). Under this approach, the variance of the random coefficients acts as the smoothing parameter and it can be estimated using restricted maximum likelihood (REML). Reiss and Ogden (2009) discusses the advantage of REML over GCV based on a theoretical comparison. However, this approach would be more computationally intensive for a latent factor model because the likelihood does not have a closed form.

In this article, we use bootstrap to draw inference, which can be computationally intensive. Future work can include developing methods to test the significance of different component functions of the non-constant factor loading more efficiently. In our model, we assume that the random errors in equation (2.1) are uncorrelated. In our data application, we de-trended the data before applying our modelling strategy to remove spatial and temporal autocorrelation using GAM. Conditional residuals from the model had negligible correlation, suggesting that an independent error structure may be sufficient for the de-trended data that we have analyzed. Incorporating alternative correlation structures is a desirable next step to avoid the two-step approach we have used. Alternatively, as in the work of Gryparis et al. (2007), the mean trend of the latent variable could also be used to capture the temporal and spatial correlation; this extension could also be used to readily predict constituents in space and time, by first predicting the latent variable at a given time and location.

In our current data analysis, we have modelled every constituent with a penalized regression spline and then assessed whether it is invariant across time and space. Estimating each spline with its own set of smoothing parameters is demanding, although feasible, computationally. It may also be less efficient to build up a model with this much complexity when there are a considerable number of observed items. However, the non-constant factor loadings for many of the constituents may potentially change in similar patterns, due to their common source. Therefore, one potential extension to further improve estimation efficiency is to model multiple non-constant factor loadings with fewer number of curves that capture the essential shape (Kneip and Gasser, 1988; Jiang et al., 2013). The extension of our model to more than one latent variable would be fairly straightforward after proper identifiability constraints are applied. The model can also easily accommodate data missing at random, since it is likelihood based. However, a potentially more challenging problem is the case when observed variables are not measured simultaneously at all locations, that is, missing by design.

The model and algorithm implemented in this article provide a platform for developing more comprehensive measurement models that can allow us to apply flexible dimension reduction strategies. The measurement model for the factor analysis that we have proposed can incorporate known information (such as well-understood sources of air pollution) while learning from the data by allowing for non-constant factor loadings.


The authors acknowledge support from: NIH grants R01ES016932; R01ES017022; P01ES022844; P20ES018171; UM NIEHS Core Center P30 ES017885 and EPA grants RD834800 and RD83543601. The contents of this study are solely the responsibility of the authors and do not necessarily represent the official views of the The National Institute of Environmental Health Sciences (NIEHS) or the National Institute of Health or EPA.


Table A.1

Bias and standard error (in parentheses) of [mu]p, [lambda with circumflex]0,p, σ^ρ2for simulations where the factor loadings change with three covariates.

[mu]1[mu]2[mu]3[mu]4[lambda with circumflex]1.0[lambda with circumflex]2.0[lambda with circumflex]3.0[lambda with circumflex]4.0 σ^12 σ^22 σ^32 σ^42
0.000 (0.023)0.001 (0.023)0.001 (0.024)0.000 (0.025)−0.007 (0.025)−0.009 (0.024)−0.001 (0.025)0.002 (0.025)0.003 (0.066)−0.016 (0.070)−0.010 (0.071)−0.010 (0.067)

Table A.2

Mean integrated squared error (MISE) of fp(t) for different simulation scenarios and estimating approaches.

True factor loadingsEstimating approachf1(t)f2(t)f3(t)
Non-constant cyclicPenalized37.4832.6133.12
Non-constant non-cyclicPenalized30.0139.9437.78

Figure A. 1

An external file that holds a picture, illustration, etc.
Object name is nihms806012f7.jpg

f1,UV, f1,t and their 95% confidence bands, (a), (b) respectively show f1,UV, f1, UV as contour plots. (c) plots f1,t (circles), f1,t (solid line) and its confidence band (shaded area). Results for the other fp,UV, fp,t are similar.


  • Arminger G. A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis—Hastings algorithm. Psychometrika. 1998;63:271–300.
  • Austin E, Coull BA, Zanobetti A, Koutrakis P. A framework to spatially cluster air pollution monitoring sites in US based on the PM2.5 composition. Environmental International. 2013;59:244–54. [PMC free article] [PubMed]
  • Bauer DJ. The role of nonlinear factor-to-indicator relationships in tests of measurement equivalence. Psychological Methods. 2005;10:305–16. [PubMed]
  • Bell ML, Dominici F, Ebisu K, Zeger SL, Samet JM. Spatial and temporal variation in PM2.5 chemical composition in the United States for health effects studies. Environmental Health Perspectives. 2007;115:989–95. [PMC free article] [PubMed]
  • Berrocal VJ, Gelfand AE, Holland DM. A bivariate space-time downscaler under space and time misalignment. The Annals of Applied Statistics. 2010;4:1942–75. [PMC free article] [PubMed]
  • Choi J, Fuentes M, Reich BJ. Spatial-temporal association between fine particulate matter and daily mortality. Computational Statistics and Data Analysis. 2009;53:2989–3000. [PMC free article] [PubMed]
  • Craven P, Wahba G. Smoothing noisy data with spline functions—estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979;31:377–403.
  • Daniels MJ, Zhou ZG, Zou H. Conditionally specified space-time models for multivariate processes. Journal of Computational and Graphical Statistics. 2006;15:157–77.
  • Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, Samet JM. Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA. 2006;295:1127–34. [PMC free article] [PubMed]
  • Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science. 1996;11:89–102.
  • Fahrmeir L, Raach A. A Bayesian semi-parametric latent variable model for mixed responses. Psychometrika. 2007;72:327–46.
  • Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good Ridge parameter. Technometrics. 1979;21:215–23.
  • Gryparis A, Coull BA, Schwartz J, Suh HH. Semiparametric latent variable regression models for spatiotemporal modelling of mobile source particles in the greater Boston area. Journal of the Royal Statistical Society Series. 2007;56:183–209.
  • Guo R, Zhu HT, Chow SM, Ibrahim JG. Bayesian lasso for semiparametric structural equation models. Biometrics. 2012;68:567–77. [PMC free article] [PubMed]
  • Hastie T, Tibshirani R. Varying-coefficient models. Journal of the Royal Statistical Society Series B. 1993;55:757–96.
  • Hewson PJ, Bailey TC. Modelling multivariate disease rates with a latent structure mixture model. Statistical Modelling. 2010;10:241–64.
  • Horn JL, McArdle JJ. A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research. 1992;18:117–44. [PubMed]
  • Huerta G, Sanso B, Stroud JR. A spatiotemporal model for Mexico City ozone levels. Journal of the Royal Statistical Society: Series C. 2004;53:231–48.
  • Ippoliti L, Valentini P, Gamerman D. Space–time modelling of coupled spatiotemporal environmental variables. Journal of the Royal Statistical Society: Series C. 2012;61:175–200.
  • Jiang Q, Wang HS, Xia YC, Jiang GH. On a principal varying coefficient model. Journal of the American Statistical Association. 2013;108:228–36.
  • Kammann EE, Wand MP. Geoadditive models. Journal of the Royal Statistical Society: Series C. 2003;52:1–18.
  • Kneip A, Gasser T. Convergence and consistency results for self-modeling nonlinear-regression. Annals of Statistics. 1988;16:82–112.
  • Laden F, Neas LM, Dockery DW, Schwartz J. Association of fine particulate matter from different sources with daily mortality in six US cities. Environmental Health Perspectives. 2000;108:941–47. [PMC free article] [PubMed]
  • Lee SY, Song XY, Lee JCK. Maximum likelihood estimation of nonlinear structural equation models with ignorable missing data. Journal of Educational and Behavioral Statistics. 2003;28:111–34.
  • Liu X, Wall MM, Hodges JS. Generalized spatial structural equation models. Bio-statistics. 2005;6:539–57. [PubMed]
  • Lopes HF, Salazar E, Gamerman D. Spatial dynamic factor analysis. Bayesian Analysis. 2008;3:759–92.
  • Lu B, Song XY, Li XD. Bayesian analysis of multi-group nonlinear structural equation models with application to behavioral finance. Quantitative Finance. 2012;12:477–88.
  • Meredith W. Measurement invariance, factor-analysis and factorial invariance. Psychometrika. 1993;58:525–43.
  • Nikolov MC, Coull BA, Catalano PJ, Diaz E, Godleski JJ. Statistical methods to evaluate health effects associated with major sources of air pollution: A case-study of breathing patterns during exposure to concentrated Boston air particles. Journal of the Royal Statistical Society: Series C. 2008;57:357–78.
  • Nikolov MC, Coull BA, Catalano PJ, Godleski JJ. Multiplicative factor analysis with a latent mixed model structure for air pollution exposure assessment. Environmetrics. 2011;22:165–78.
  • Peng RD, Bell ML, Geyh AS, McDermott A, Zeger SL, Samet JM, Dominici F. Emergency admissions for cardiovascular and respiratory diseases and the chemical composition of fine particle air pollution. Environmental Health Perspectives. 2009;117:957–63. [PMC free article] [PubMed]
  • Reiss PT, Ogden RT. Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society: Series B. 2009;71:505–23.
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression (Cambridge series in statistical and probabilistic mathematics) Cambridge, New York: Cambridge University Press; 2003.
  • Sahu SK, Mardia KV. A Bayesian kriged Kalman model for short-term forecasting of air pollution levels. Journal of the Royal Statistical Society: Series C. 2005;54:223–44.
  • Salvador P, Artinano B, Alonso DG, Querol X, Alastuey A. Identification and characterisation of sources of PM10 in Madrid (Spain) by statistical methods. Atmospheric Environment. 2004;38:435–47.
  • Sánchez BN, Budtz-Jørgensen E, Ryan LM, Hu H. Structural equation models: A review with applications to environmental epidemiology. Journal of the American Statistical Association. 2005;100:1443–55.
  • Sánchez BN, Kang S, Mukherjee B. A latent variable approach to study gene—environment interactions in the presence of multiple correlated exposures. Biometrics. 2012;68:466–76. [PMC free article] [PubMed]
  • Schwartz J, Dockery DW, Neas LM. Is daily mortality associated specifically with fine particles? Journal of the Air and Waste Management Association. 1996;46:927–39. [PubMed]
  • Shaddick G, Wakefield J. Modelling daily multivariate pollutant data at multiple sites. Journal of the Royal Statistical Society: Series C. 2002;51:351–72.
  • Song XY, Lu ZH. Semiparametric latent variable models with Bayesian P-splines. Journal of Computational and Graphical Statistics. 2010;19:590–608.
  • Tao Y, Sánchez BN, Mukherjee B. Latent variable models for gene—environment interactions in longitudinal studies with multiple correlated exposures. Statistics in Medicine. 2015;34:1227–241. [PMC free article] [PubMed]
  • Thurston GD, Ito K, Lall R. A source apportionment of U.S. fine particulate matter air pollution. Atmospheric Environment. 2011;45:3924–36. [PMC free article] [PubMed]
  • Valentini P, Ippoliti L, Fontanella L. Modeling US housing prices by spatial dynamic structural equation models. Annals of Applied Statistics. 2013;7:763–98.
  • Wood SN. Generalized Additive Models: An Introduction with R (Texts in statistical science) 2006.
  • Zhang WY, Lee SY. Nonlinear dynamical structural equation models. Quantitative Finance. 2009;9:305–14.
  • Zhu HT, Lee SY. Statistical analysis of nonlinear factor analysis models. British Journal of Mathematical and Statistical Psychology. 1999;52:225–42.