Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC4982519

Formats

Article sections

- Abstract
- 1 Introduction
- 2 Model
- 3 Estimation
- 4. Simulation study
- 5 Data application
- 6 Discussion
- References

Authors

Related links

Stat Modelling. Author manuscript; available in PMC 2017 April 1.

Published in final edited form as:

Published online 2016 March 27. doi: 10.1177/1471082X15627004

PMCID: PMC4982519

NIHMSID: NIHMS806012

Address for correspondence: Brisa N. Sánchez, Departments of Biostatistics, M4164 SPH II, 1415 Washington Heights, Ann Arbor, MI 48109-2029, USA. Email: ude.hcimu@asirb

The publisher's final edited version of this article is available at Stat Modelling

Factor analysis is a commonly used method of modelling correlated multivariate exposure data. Typically, the measurement model is assumed to have constant factor loadings. However, from our preliminary analyses of the Environmental Protection Agency’s (EPA’s) PM_{2.5} fine speciation data, we have observed that the factor loadings for four constituents change considerably in stratified analyses. Since invariance of factor loadings is a prerequisite for valid comparison of the underlying latent variables, we propose a factor model that includes non-constant factor loadings that change over time and space using P-spline penalized with the generalized cross-validation (GCV) criterion. The model is implemented using the Expectation-Maximization (EM) algorithm and we select the multiple spline smoothing parameters by minimizing the GCV criterion with Newton’s method during each iteration of the EM algorithm. The algorithm is applied to a one-factor model that includes four constituents. Through bootstrap confidence bands, we find that the factor loading for total nitrate changes across seasons and geographic regions.

Fine particulate matter that is, air particles with diameter <2.5 micrometers (PM_{2.5}), is associated with adverse health effects (e.g., Dominici *et al.,* 2006; Schwartz *et al.,* 1996). Laden *et al.* (2000) suggested PM_{2.5} from different sources has different toxic effects, implying the importance of the chemical profile of the constituents of PM_{2.5}. Because the constituents are typically correlated, studies that analyze the relationship between constituents and health outcomes often use methods that summarize the correlated constituents, for example, using clustering (Austin *et al.,* 2013) or latent variable models (Gryparis *et al.,* 2007; Nikolov *et al.,* 2008). In particular, widely used source apportionment methods (Thurston *et al.,* 2011) use factor analytic models to summarize constituent data in terms of their relationship to air pollution sources, such as power plant or vehicle exhaust, that can be potentially regulated. Since the air pollution sources are not directly observed, source contributions are treated as latent factors and estimated from the data.

However, methods to summarize PM_{2.5} constituents may yield inconsistent summaries when applied to large geographic areas or long time periods. Such summaries depend on the correlation among constituents and correlations among constituents may vary with time and space (Austin *et al.,* 2013; Bell *et al.,* 2007). For instance, our preliminary anaylses of PM_{2.5} constituent data issued by the Environmental Protection Agency (EPA) Air Quality System (AQS) for years 2008 and 2009 (see Section 5 for more details on the data) suggest that the correlation among constituents related to secondary inorganic aerosols (Salvador *et al.,* 2004; Thurston *et al.,* 2011) varies with season and geographic location (Figure 1). Nitrate exhibits the greatest differences in terms of its correlation with other constituents, both across regions and across time within region. As a result, it is not surprising that preliminary factor analyses of this constituent data, stratified by time and region, yield factor loadings that differ across time and region for some constituents. In particular, the factor loading for total nitrate (Figure 2a) demonstrates large differences across regions and time. The estimated factor loadings tend to be larger in the winter months, implying a season effect. Furthermore, the estimated factor loadings are significantly higher in the midwest region, compared to other regions, for all periods except the first quarter of 2008. By contrast, the factor loading for ammonium ion exhibits fewer and less pronounced differences across either region or time (Figure 2b).

Array of heat maps shows how the Pearson correlation among the four PM_{2.5} constituents (log-transformed) change across different strata defined by region (row) and time period (column). The column headers denotes the year (Yr1: 2008, Yr2: 2009) and the **...**

(a) and (b) compare the factor loadings for a particular constituent from stratified confirmatory factor analyses. The *y*-axis shows the size of the factor loading. Each dot on the plot is the factor loading from analysis of one stratum. The midwest, northeast **...**

Differences in the factor model structure are not specifically modelled in current literature that uses factor analytic models applied to constituent data collected across large geographical areas. Although many models have been proposed to reduce data dimension and to predict outcomes based on many constituents simultaneously (Austin *et al*., 2013; Choi *et al*., 2009; Peng *et al*., 2009), a more thorough understanding of how the correlations among PM_{2.5} constituents change with time and space may allow us to gain further insights on the validity and interpretation of such dimension-reduction approaches and thus potentially improve modelling the association between constituents and health outcomes.

There is now a large literature on building models that analyze air pollution data from the perspective of time and space (e.g., Berrocal *et al*., 2010; Daniels *et al*., 2006; Huerta *et al*., 2004; Shaddick and Wakefield, 2002; Sahu and Mardia, 2005, to name a few). However, we focus on models utilizing latent factors, given their connection to air pollution sources (Nikolov *et al*., 2008, 2011) and because some of these models can be applied to understanding temporal and spatial patterns in the correlation of PM_{2.5} constituents specifically. For example, Gryparis *et al*. (2007) proposed semiparametric latent variable regression models for modelling multiple surrogates of a single pollution source. They used penalized splines to allow non-linear effects of covariates on the latent variable mean and used geoadditive models to account for the temporal and spatial correlations in the latent variable (Kammann and Wand, 2003). Lopes *et al*. (2008) used a spatial dynamic factor model to incorporate temporal and spatial correlations among observations of a single pollutant from different monitors over a certain time period. In that model, the spatial correlation was introduced through the factor loading matrix, that is, at a given time-point, monitors were treated as different variables. They modelled the temporal correlation among the latent factors using an autoregressive structure. Ippoliti *et al*. (2012) used a similar dynamic factor model but they coupled two sets of latent factors so that one latent factor could be predicted from the other.

More generally, models using latent variables have gained much popularity in research that relates environmental exposures to health outcomes because they can reduce the dimension of the data by summarizing correlated exposure measurements (see Sánchez *et al*., 2005). In the last decade, much work has been done to relax model assumptions, including nonlinear relationships among latent variables (Arminger, 1998; Guo *et al*., 2012; Lee *et al*, 2003; Lu *et al*, 2012; Song and Lu, 2010; Zhu and Lee, 1999), nonlinear links relating items to latent variables (Bauer, 2005; Gryparis *et al*., 2007), as well as models with spatial structures on the latent variables (Fahrmeir and Raach, 2007; Hewson and Bailey, 2010; Liu *et al*., 2005).

However, except for Zhang and Lee (2009) and Valentini *et al*. (2013), none of these existing advanced models specifically address the variation in the factor loadings exhibited among the constituent data (Figure 2a). Traditionally, the measurement model that links the observed items to the latent variables is assumed constant across the covariate space (e.g., across time and geographic location). A constant measurement model, also termed measurement model invariance (Horn and McArdle, 1992; Meredith, 1993), is needed for valid comparison of differences in latent variable means across covariate levels or groups. However, it has been shown that allowing the measurement model to vary across values of categorical covariates may be desirable when estimating exposure–health outcome associations, potentially because the flexible measurement model enables better estimation of the latent factor itself (Sánchez *et al*., 2012; Tao *et al*., 2015). Allowing the measurement model to vary across continuous covariates, however, is a more challenging problem. Zhang and Lee (2009) used a local maximum likelihood-based estimation procedure to model the factor loadings as varying with time. Valentini *et al*. (2013) gave the factor loadings a spatial structure and they modelled the factor loadings as conditionally independent multivariate Gaussian Markov Random Fields.

In this article, we propose a semiparametric factor analysis model with non-constant factor loadings that change with multiple covariates, in particular time and space as motivated by the EPA’s PM_{2.5} constituents data. We use penalized splines to estimate factor loadings that change with multiple covariates. This construction for the factor loadings bears a resemblance to the classic varying coefficients model (Hastie and Tibshirani, 1993). In Section 2, we propose the factor model that incorporates non-constant factor loadings, while Section 3, explains the estimation procedure. A key feature of our estimation algorithm is to include the generalized cross-validation (GCV) score within the EM iterations so that we can optimize the smoothing of the splines in a latent variable setting. In Section 4, we present a simulation study to assess the algorithm and estimation properties for certain model parameters and the estimated factor scores. We then analyze the constituent data set and draw inferences based on bootstrap approaches in Section 5. Section 6 ends with a discussion and future directions.

Let **y*** _{i}* be a vector of

We assume that the elements of **y*** _{i}* are linearly related to latent variables according to the measurement model,

$${\mathbf{y}}_{i}=\mathit{\mu}+{\mathbf{\Lambda}}_{i}{\eta}_{i}+{\mathit{\epsilon}}_{i},$$

(2.1)

where *η _{i}* is a latent variable,

In classical latent variable models, the factor loadings are estimated as constant parameters or sometimes allowed to differ across levels of a categorical variable (Horn and McArdle, 1992). However, in our proposed model, factor loadings are allowed to vary smoothly with time and/or location, resembling a varying coefficients model (Hastie and Tibshirani, 1993).

We model the non-constant factor loadings as a smooth function of (*t _{i}*,

$${\lambda}_{p,i}={\lambda}_{p,0}+{f}_{p,t}({t}_{i})+{f}_{p,uv}({u}_{i},{v}_{i})+{f}_{p,\mathit{tuv}}({t}_{i},{u}_{i},{v}_{i}),$$

(2.2)

where *f _{p,t}*,

We represent the non-constant factor loadings using penalized splines. Specifically, we construct univariate cubic B-spline basis functions for each of the covariates, *t*, *u*, *v* and use them to construct tensor product bases for the interaction terms *f _{p,uv}(u_{i}*,

$${\mathbf{X}}_{p}=[1,{\mathbf{X}}_{t},{\mathbf{X}}_{uv},{\mathbf{X}}_{\mathit{tuv}}],\phantom{\rule{0.38889em}{0ex}}{\mathit{\beta}}_{p}^{T}=\left[{\lambda}_{p,0},{\mathit{\beta}}_{p,t}^{T},{\mathit{\beta}}_{p,uv}^{T},{\mathit{\beta}}_{p,\mathit{tuv}}^{T}\right].$$

Note that we use **Λ*** _{i}* = (

We use the squared first difference penalty (Eilers and Marx, 1996) to control the degrees of freedom of the smooth, which yields the desirable property of shrinking the smooth to the intercept *λ _{p}*

We use the EM algorithm to facilitate the estimation, where the complete data structure includes the observed data **y*** _{i}* augmented with unobserved variables

$$log\mathcal{L}(\mathit{\theta}\mid \mathbf{y},\mathit{\eta})-\frac{1}{2}\sum _{p=1}^{P}{\mathrm{\Omega}}_{p}({\mathit{\beta}}_{p},{\mathit{\omega}}_{p}),$$

(3.1)

where
$\mathbf{y}={({\mathbf{y}}_{1}^{T},\dots ,{\mathbf{y}}_{n}^{T})}^{T}$ and ** η** = (

Using the assumption of a normal distribution and omitting the proportionality constant, the complete data log-likelihood is

$$log\mathcal{L}(\mathit{\theta}\mid \mathbf{y},\mathit{\eta})=-\sum _{i=1}^{n}\left\{log\mid \mathbf{\sum}\mid +{({\mathbf{y}}_{i}-\mathit{\mu}-{\mathbf{\Lambda}}_{i}{\eta}_{i})}^{T}{\mathbf{\sum}}^{-1}({\mathbf{y}}_{i}-\mathit{\mu}-{\mathbf{\Lambda}}_{i}{\eta}_{i})+{\eta}_{i}^{2}\right\}.$$

Even though we are to maximize the penalized likelihood as in equation (3.1), the penalty does not affect the E-step because the penalty does not involve the missing latent variable. The expected values of *η _{i}* and
${\eta}_{i}^{2}$ are needed in the E-step. Using the conditional normal distribution, these expectations at the (

$$\begin{array}{l}{\eta}_{i}^{(r)}=\mathrm{E}({\eta}_{i}\mid {\mathbf{y}}_{i})\\ =\mathrm{E}({\eta}_{i})+\text{Cov}({\eta}_{i},{\mathbf{y}}_{i}){[\text{Cov}({\mathbf{y}}_{i})]}^{-1}[{\mathbf{y}}_{i}-\mathrm{E}({\mathbf{y}}_{i})]\\ ={\mathbf{\Lambda}}_{i}^{T}{({\mathbf{\Lambda}}_{i}{\mathbf{\Lambda}}_{i}^{T}+\mathbf{\sum})}^{-1}({\mathbf{y}}_{i}-\mathit{\mu}),\end{array}$$

$$\begin{array}{l}{({\eta}_{i}^{2})}^{(r)}=\mathrm{E}({\eta}_{i}^{2}\mid {\mathbf{y}}_{i})\\ =\text{Cov}({\eta}_{i}\mid {\mathbf{y}}_{i})+{[\mathrm{E}({\eta}_{i}\mid {\mathbf{y}}_{i})]}^{2}\\ =\text{Cov}({\eta}_{i})-\text{Cov}({\eta}_{i},{\mathbf{y}}_{i}){[\text{Cov}({\mathbf{y}}_{i})]}^{-1}\text{Cov}({\mathbf{y}}_{i},{\eta}_{i})+{({\eta}_{i}^{2})}^{(r)}\\ =1-{\mathbf{\Lambda}}_{i}^{T}{({\mathbf{\Lambda}}_{i}{\mathbf{\Lambda}}_{i}^{T}+\mathbf{\sum})}^{-1}{\mathbf{\Lambda}}_{i}+{({\eta}_{i}^{2})}^{(r)}.\end{array}$$

In the M-step, we estimate ** μ**,

$$\begin{array}{l}{\sum}_{i=1}^{n}logL\phantom{\rule{0.16667em}{0ex}}\left(\mathit{\mu},{\mathbf{\Lambda}}_{i},\mathbf{\sum};{\mathbf{y}}_{i}\mid {\eta}_{i}^{(r)}\right)-\frac{1}{2}{\sum}_{p=1}^{P}\mathrm{\Omega}({\mathit{\beta}}_{p},{\mathit{\omega}}_{p})={\sum}_{i=1}^{n}{\sum}_{p=1}^{P}logL\phantom{\rule{0.16667em}{0ex}}\left({\mu}_{p},{\lambda}_{p,i},{\sigma}_{p}^{2};{y}_{p,i}\mid {\eta}_{i}^{(r)}\right)-\frac{1}{2}{\sum}_{p=1}^{P}\mathrm{\Omega}({\mathit{\beta}}_{p},{\mathit{\omega}}_{p})\\ ={\sum}_{p=1}^{P}\left\{logL\phantom{\rule{0.16667em}{0ex}}\left({\mu}_{p},{\mathit{\lambda}}_{p}({\mathit{\beta}}_{p}),{\sigma}_{p}^{2};{\mathbf{y}}_{.p}\mid {\mathit{\eta}}^{(r)}\right)-\frac{1}{2}\mathrm{\Omega}({\mathit{\beta}}_{p},{\mathit{\omega}}_{p})\right\},\end{array}$$

(3.2)

where **y**.* _{p}* = (

For each observed variable **y**.* _{p}*, the penalized log-likelihood given

$$\begin{array}{l}logL\phantom{\rule{0.16667em}{0ex}}\left({\mu}_{p},{\mathit{\lambda}}_{p}({\mathit{\beta}}_{p}),{\sigma}_{p}^{2};{\mathbf{y}}_{.p}\mid {\mathit{\eta}}^{(r)}\right)-\frac{1}{2}\mathrm{\Omega}({\mathit{\beta}}_{p},{\mathit{\omega}}_{p})\\ \propto -nlog({\sigma}_{p}^{2})-{\sigma}_{p}^{-2}{\Vert {\mathbf{y}}_{.p}-{\mathbf{1}}_{n}{\mu}_{p}-{\mathit{\eta}}^{(r)}\circ {\mathit{\lambda}}_{p}({\mathit{\beta}}_{p})\Vert}^{2}-{\mathit{\beta}}_{p}^{T}\mathbf{S}({\mathit{\omega}}_{p}){\mathit{\beta}}_{p},\end{array}$$

(3.3)

where ‘’ is the entrywise Hadamard product and **S**(*ω** _{p}*) is the previously described penalty matrix. Equation (3.3) is essentially a regression of

$${\stackrel{\sim}{\mathit{\eta}}}^{(r)}=\left[\begin{array}{c}{\eta}_{1}^{(r)}{\mathbf{x}}_{1}\\ \vdots \\ {\eta}_{n}^{(r)}{\mathbf{x}}_{n}\end{array}\right],$$

(3.4)

that is, we can write
${\mathit{\eta}}^{(r)}\circ {\mathit{\lambda}}_{p}({\mathit{\beta}}_{p})={\mathit{\eta}}^{(r)}\circ ({\mathbf{X}}_{p}{\mathit{\beta}}_{p})=\left[\begin{array}{c}{\eta}_{1}^{(r)}{\mathbf{x}}_{1}{\mathit{\beta}}_{p}\\ \vdots \\ {\eta}_{n}^{(r)}{\mathbf{x}}_{n}{\mathit{\beta}}_{p}\end{array}\right]={\stackrel{\sim}{\mathit{\eta}}}^{(r)}{\mathit{\beta}}_{p}$. Hence, replacing *η*^{(}^{r}^{)} **Λ*** _{p}*(

$$\left[\begin{array}{c}{\mu}_{p}^{(r)}\\ {\mathit{\beta}}_{p}^{(r)}\end{array}\right]={\left\{\left[\begin{array}{cc}{\mathbf{1}}_{n}^{T}{\mathbf{1}}_{n}& {\mathbf{1}}_{n}^{T}{\stackrel{\sim}{\mathit{\eta}}}^{(r)}\\ {({\stackrel{\sim}{\mathit{\eta}}}^{T})}^{(r)}{\mathbf{1}}_{n}& {({\stackrel{\sim}{\mathit{\eta}}}^{T}\stackrel{\sim}{\mathit{\eta}})}^{(r)}\end{array}\right]+\left[\begin{array}{cc}0& 0\\ 0& \mathbf{S}({\mathit{\omega}}_{p})\end{array}\right]\right\}}^{-1}\phantom{\rule{0.16667em}{0ex}}\left[\begin{array}{c}{\mathbf{1}}_{n}^{T}\\ {({\stackrel{\sim}{\mathit{\eta}}}^{T})}^{(r)}\end{array}\right]\phantom{\rule{0.16667em}{0ex}}{\mathbf{y}}_{.p},$$

(3.5)

and

$${({\sigma}_{p}^{2})}^{(r)}={n}^{-1}\phantom{\rule{0.16667em}{0ex}}\left[{({\mathbf{y}}_{.p}-{\mathbf{1}}_{n}{\mu}_{p}^{(r)})}^{T}({\mathbf{y}}_{.p}-{\mathbf{1}}_{n}{\mu}_{p}^{(r)})-2{({\mathbf{y}}_{.p}-{\mathbf{1}}_{n}{\mu}_{p}^{(r)})}^{T}{\stackrel{\sim}{\mathit{\eta}}}^{(r)}{\mathit{\beta}}_{p}^{(r)}+{({\mathit{\beta}}_{p}^{T})}^{(r)}{({\stackrel{\sim}{\mathit{\eta}}}^{T}\stackrel{\sim}{\mathit{\eta}})}^{(r)}{\mathit{\beta}}_{p}^{(r)}\right],$$

where ${({\stackrel{\sim}{\mathit{\eta}}}^{T}\stackrel{\sim}{\mathit{\eta}})}^{(r)}={\sum}_{i}^{n}{\mathbf{x}}_{i}^{T}{({\eta}_{i}^{2})}^{(r)}{\mathbf{x}}_{i}=[{\mathbf{x}}_{1}^{T}{({\eta}_{1}^{2})}^{(r)},\dots ,{\mathbf{x}}_{n}^{T}{({\eta}_{n}^{2})}^{(r)}]\phantom{\rule{0.16667em}{0ex}}{\mathbf{X}}_{p}$.

So far, we have treated the smoothing parameters *ω** _{p}* as given, but here we describe how to use the GCV score to select the smoothing parameters (Craven and Wahba, 1979; Golub

$${\text{GCV}}_{p}=\frac{n{\Vert {\mathbf{y}}_{.p}-{\mathbf{A}}_{p}^{(r)}{\mathbf{y}}_{.p}\Vert}^{2}}{{[n-\text{tr}({\mathbf{A}}_{p}^{(r)})]}^{2}},$$

(3.6)

where

$${\mathbf{A}}_{p}^{(r)}={\left[\begin{array}{c}{\mathbf{1}}_{n}^{T}\\ {({\stackrel{\sim}{\mathit{\eta}}}^{T})}^{(r)}\end{array}\right]}^{T}\phantom{\rule{0.16667em}{0ex}}{\left\{\left[\begin{array}{cc}{\mathbf{1}}_{n}^{T}{\mathbf{1}}_{n}& {\mathbf{1}}_{n}^{T}{\stackrel{\sim}{\mathit{\eta}}}^{(r)}\\ {({\stackrel{\sim}{\mathit{\eta}}}^{T})}^{(r)}{\mathbf{1}}_{n}& {({\stackrel{\sim}{\mathit{\eta}}}^{T}\stackrel{\sim}{\mathit{\eta}})}^{(r)}\end{array}\right]+\left[\begin{array}{cc}0& 0\\ 0& \mathbf{S}({\mathit{\omega}}_{p})\end{array}\right]\right\}}^{-1}\phantom{\rule{0.16667em}{0ex}}\left[\begin{array}{c}{\mathbf{1}}_{n}^{T}\\ {({\stackrel{\sim}{\mathit{\eta}}}^{T})}^{(r)}\end{array}\right]$$

is the influence matrix (compare with equation (3.5)) and tr(**A*** _{p}*) represents the effective number of parameters used to estimate

Since the latent variable is unknown and is itself to be computed from **y*** _{p}* and

We minimize GCV* _{p}* using a numerical approach based on Newton’s method. Here, GCV

We use bootstrap resampling to construct a 95% confidence interval for point estimates and 95% simultaneous confidence band for spline estimates (Ruppert *et al.,* 2003). We sample with replacement from the *n* observations to generate *M* new data sets of *n* observations each. In our data application, this resampling approach is valid since de-trending the data removes spatial and temporal correlation. The same EM algorithm is repeated on these data sets to get bootstrap estimates. This bootstrap strategy is based on the assumption that there is little temporal and spatial correlation for the residual error *ε** _{i}*, of the actual data.

For single point estimates, such as *λ _{p}*

$$\underset{t\in \mathrm{t}}{sup}\left|\frac{{\widehat{f}}_{p,t}^{\ast}(t)-{\widehat{f}}_{p,t}(t)}{\sqrt{\text{Var}[{\widehat{f}}_{p,t}^{\ast}(t)-{\widehat{f}}_{p,t}(t)]}}\right|,$$

where
${\widehat{f}}_{p,t}^{\ast}$ is the bootstrap estimate of *f _{p},_{t}* and

We conduct simulation studies to examine the advantages and disadvantages of this novel estimation method, as well as to verify that the estimation algorithm can recover model parameters. We first examine whether the algorithm can recover the true parameter values when the factor loadings vary with several covariates. For this purpose, we simulate 1000 data sets where *P* observed variables are associated with one latent factor and the factor loadings are smooth functions of three covariates in this form: *λ _{p,i}* =

Next, we further investigate how biases can enter various components of the model across different simulation scenarios where the factor loadings take on different functional forms. For this study, we let the factor loadings be functions of only one covariate. This simpler setting can enhance computational efficiency while still illustrating the impact of various estimating approaches under different scenarios. Specifically, we investigate how using non-constant factor loadings affects the estimation of model parameters and the factor scores. Examining estimation of factor scores gives insight as to how estimated source contributions in air pollution studies may be affected, for instance.

The simulations are set up with three different shapes for the true factor loadings. We let *λ _{p,i}* =

*Constant: f*(_{p}*t*) = 0,_{i}*Non-constant, cyclic*: ${f}_{p}({t}_{i})=\kappa \phantom{\rule{0.16667em}{0ex}}\left[{c}_{p}+cos\phantom{\rule{0.16667em}{0ex}}\left(4\pi {t}_{p,i}^{\ast}\right)\right]$, where ${t}_{p,i}^{\ast}={t}_{i}-{\scriptstyle \frac{p-1}{2P}}$,*Non-constant, non-cyclic*:$${f}_{p}({t}_{i})=\kappa \phantom{\rule{0.16667em}{0ex}}\left[{c}_{p}+4{t}_{p,i}^{\ast}I\phantom{\rule{0.16667em}{0ex}}\left(-\frac{1}{4}\le {t}_{p,i}^{\ast}\le \frac{1}{4}\right)+I\phantom{\rule{0.16667em}{0ex}}\left({t}_{p,i}^{\ast}>\frac{1}{4}\right)-I\phantom{\rule{0.16667em}{0ex}}\left({t}_{p,i}^{\ast}<-\frac{1}{4}\right)\right],$$where ${t}_{p,i}^{\ast}={t}_{i}-{\scriptstyle \frac{3}{4}}+{\scriptstyle \frac{p-1}{2(P-1)}}$.

In the formulas, *c _{p}* is the constant that ensures
${\sum}_{i=1}^{n}{f}_{p}({t}_{i})=0$ and

For each of the three shapes of factor loadings listed above, we hold some parameters constant, but we change the signal-to-noise ratio in the model as follows. For all the scenarios, we set *μ _{p}* = 1 for all

For each scenario, we generate 1000 simulated data sets, each with 1000 equally spaced data points within the range of *t*. For some scenarios, we vary the sample size (500, 5000) and the number of observed variables, *P* = 6 instead of 3. For each scenario, we estimate the factor loadings using three different approaches: (a) constant factor loadings, (b) non-constant factor loadings using penalized splines and (c) non-constant factor loadings using unpenalized splines. For all the scenarios, we choose equally spaced knots that give 10 spline coefficients (including *λ _{p}*

We report the simulation results primarily focusing on scenarios where (a) each simulated data set has 1000 equally spaced data points within the range of *t _{i}*, (b)
${\lambda}_{p,0}^{2}/({\lambda}_{p,0}^{2}+{\sigma}_{p}^{2})=0.8$ and (c)

Bias and standard deviation (in parentheses) of _{p,}_{0},
${\widehat{\sigma}}_{\rho}^{2}$ for different simulation scenarios and estimating approaches.

Bias and mean squared error (MSE) of the factor score for different simulation scenarios and estimating approaches. IBIAS and IMSE show the bias and MSE integrated over the entire distribution of the latent variable *η*; the remaining columns show **...**

We look at the results of our simulation study from two aspects. First, we compare the bias and standard error of _{p,}_{0},
${\widehat{\sigma}}_{p}^{2}$ among different estimation approaches and simulation scenarios (Table 1). This comparison primarily gives insight into possible loss of efficiency in constant factor loadings when they are instead estimated as non-constants, as well as gives insight regarding bias in residual variances, which ultimately affect how much weight a given observed variable is given when estimating the latent factor. When the factor loadings are constant, the bias and efficiency of _{p,}_{0},
${\widehat{\sigma}}_{p}^{2}$ are similar across estimation approaches, although
${\widehat{\sigma}}_{p}^{2}$ is slightly underestimated when we use unpenalized splines since without penalization, the splines overfit the data. However, when the factor loadings are non-constant and estimated as constant, _{p,}_{0},
${\widehat{\sigma}}_{p}^{2}$ are both biased. The bias becomes trivial when we correctly estimate the factor loadings as non-constant. We also find that the penalty has little effect on the standard deviations of _{p}_{,0},
${\widehat{\sigma}}_{p}^{2}$ (compare ‘penalized’ with ‘un-penalized’ in Table 1), possibly because these two parameters measure the overall signal-to-noise ratio and thus they are less affected by the smoothness of * _{p}*(

Next, we compare how estimation approaches affect the factor score (Table 2). We define factor score as * _{i}* = E

The overall average bias and MSE for the factor scores are similar across estimation approaches; however, differences emerge when we examine bias by percentiles of the distribution of the true latent variable and when we examine the factor scores at different covariate values *t* (Table 2 and Figure 4a–4b). We find that when the factor loadings are constant, the bias and MSE are similar if we estimate factor loadings either as constant or as non-constant using penalized splines. If we estimate the factor loadings using unpenalized splines, the MSE is unsurprisingly larger. However, when the factor loadings are non-constant, we obtain the smallest bias and MSE when we estimate the factor loadings using penalized splines. Conversely, the bias and MSE are largest when the factor loadings are estimated as constant, and this is most notable for scenarios with the non-cyclic factor loading. In this case, * _{i}* has a higher MSE for values of

In scenarios where the variance in the observed variables is smaller (i.e.,
${\lambda}_{p,0}^{2}/({\lambda}_{p,0}^{2}+{\sigma}_{p}^{2})$ and *κ* are smaller), we find that the advantage of correctly estimating the non-constant factor loadings using penalized splines is less obvious. Similarly, when the noise in the data gets larger but the true factor loadings are constant, incorrectly estimating them as non-constant results in slightly increased bias in
${\widehat{\sigma}}_{p}^{2}$ and increased MSE among all the parameter estimates along with the increase of MSE for the factor score.

We applied our model to four constituents from the US PM_{2.5} speciation data issued by EPA for years 2008 and 2009. The data set we used included 108 monitors stationed in part of the midwest, south and northeast (see Figure 5). Thirty–six monitors had measurements in every three days when they were functioning, but the majority had measurements only in every six days. Thus, we took every other observation from the thirty–six monitors that had measurements in every three days so that data from all monitors were equally spaced.

(a) is the contour plot of _{p,uv} for the factor loading of total nitrate. The bigger scattered dots are the monitor locations and the smaller arrayed dots represent areas that are significantly different from zero according to the 95% **...**

Many of the constituents in the EPA data were zero inflated and would require special treatment beyond the scope of this article. Thus, we selected constituents that, after taking the logarithm, had an approximate normal distribution. Based on the exploratory factor analysis (see Section 1 and Figure 1), we decided to look at four constituents, sulphur, ammonium ion, total nitrate, sulphate, that are associated with secondary aerosols (Nikolov *et al.,* 2008) and fit into a one-factor structure.

Since the concentration of constituents change across time and space, the mean of each observed variable (** μ** in equation (2.1)) would need to be modelled as a smooth function. Although this is in principle possible with our modelling approach, we opted instead to de-trend the data and work with residuals, that is, first we regressed each constituent on time and spatial coordinates using GAM and then we used the residuals from these GAMs as the observed variables. Note that this approach not only allowed us to treat

The factor loadings were initially modelled according to equation (2.2), but the interaction term *f _{p,tuv}* was insignificant based on the bootstrap confidence band, and thus we decided to keep only

Table 3 summarizes the estimates of all the parameters in the model except the spline coefficients. (We do not list ** μ** in the table because they are zero for the detrended data.) Figure 5 shows the time and spatial components of the non-constant factor loading for total nitrate. For the other constituents, the non-constant part is not deemed significant since the confidence bands engulf zero entirely (not shown). From Figure 5a, we can see that the factor loading in the midwest is significantly higher in the midwest region than in the south. Figure 5b shows that the factor loading for total nitrate fluctuates significantly over the two years in a periodic fashion, indicating a strong seasonal pattern. These patterns can be mostly explained by the fact that formation of ammonium nitrate particles is favoured in warmer environments (Salvador

Factor scores from two models: (a) all factor loadings estimated as constant parameters (*y*-axis), (b) all factor loadings estimated as non-constant (*x*-axis). Factor scores representing observations from different seasons and regions are plotted as follows: **...**

We propose a latent variable model where the factor loadings are allowed to vary across continuous covariate values. This modelling approach extends classical factor analysis models and can help shed light into how the relationship between measured variables and underlying constructs may be modified by covariates, and thus can help improve estimation of underlying latent variables. This model is motivated by analysis of PM_{2.5} data that showed a noticeable difference for some of the factor loadings across time and region. Variations of factor analysis models have long been used to summarize PM_{2.5} constituents and gain insight into the sources the particles arose from, a process termed source apportionment. However, such data reduction methods have typically been applied to smaller geographical areas. Factor structures that can better capture the temporal and spatially varying correlations among constituents can help improve summaries of PM_{2.5} constituents collected from a wide geographical area and time span. Our model allows us to capture time and space-varying correlations among constituents through the temporal and spatial pattern of factor loadings while maintaining a connection to interpetable sources.

We use the penalized regression spline to model factor loadings as functions of multiple covariates and we control the smoothness related to each covariate with its own smoothing parameter. In order to optimize the smoothing parameters in the context of a latent variable model, we propose the use of GCV within each iteration of the EM algorithm. This bypasses the issue of lack of predictors for the marginal distribution of the model by using the latent variable as the predictor in a single M-step. We also incorporate Newton’s method into the EM iterations to facilitate the optimization of multiple smoothing parameters at the same time. The algorithm is successfully implemented in SAS (available as supplementary material), thus enhancing applicability of our method.

Alternatively, we can penalize the spline by treating the spline coefficients as random variables, similar to the likelihood approach used for smoothing parameter selection in additive models (Ruppert *et al*., 2003; Wood, 2006). Under this approach, the variance of the random coefficients acts as the smoothing parameter and it can be estimated using restricted maximum likelihood (REML). Reiss and Ogden (2009) discusses the advantage of REML over GCV based on a theoretical comparison. However, this approach would be more computationally intensive for a latent factor model because the likelihood does not have a closed form.

In this article, we use bootstrap to draw inference, which can be computationally intensive. Future work can include developing methods to test the significance of different component functions of the non-constant factor loading more efficiently. In our model, we assume that the random errors in equation (2.1) are uncorrelated. In our data application, we de-trended the data before applying our modelling strategy to remove spatial and temporal autocorrelation using GAM. Conditional residuals from the model had negligible correlation, suggesting that an independent error structure may be sufficient for the de-trended data that we have analyzed. Incorporating alternative correlation structures is a desirable next step to avoid the two-step approach we have used. Alternatively, as in the work of Gryparis *et al*. (2007), the mean trend of the latent variable could also be used to capture the temporal and spatial correlation; this extension could also be used to readily predict constituents in space and time, by first predicting the latent variable at a given time and location.

In our current data analysis, we have modelled every constituent with a penalized regression spline and then assessed whether it is invariant across time and space. Estimating each spline with its own set of smoothing parameters is demanding, although feasible, computationally. It may also be less efficient to build up a model with this much complexity when there are a considerable number of observed items. However, the non-constant factor loadings for many of the constituents may potentially change in similar patterns, due to their common source. Therefore, one potential extension to further improve estimation efficiency is to model multiple non-constant factor loadings with fewer number of curves that capture the essential shape (Kneip and Gasser, 1988; Jiang *et al*., 2013). The extension of our model to more than one latent variable would be fairly straightforward after proper identifiability constraints are applied. The model can also easily accommodate data missing at random, since it is likelihood based. However, a potentially more challenging problem is the case when observed variables are not measured simultaneously at all locations, that is, missing by design.

The model and algorithm implemented in this article provide a platform for developing more comprehensive measurement models that can allow us to apply flexible dimension reduction strategies. The measurement model for the factor analysis that we have proposed can incorporate known information (such as well-understood sources of air pollution) while learning from the data by allowing for non-constant factor loadings.

The authors acknowledge support from: NIH grants R01ES016932; R01ES017022; P01ES022844; P20ES018171; UM NIEHS Core Center P30 ES017885 and EPA grants RD834800 and RD83543601. The contents of this study are solely the responsibility of the authors and do not necessarily represent the official views of the The National Institute of Environmental Health Sciences (NIEHS) or the National Institute of Health or EPA.

_{1} | _{2} | _{3} | _{4} | _{1.0} | _{2.0} | _{3.0} | _{4.0} | ${\widehat{\sigma}}_{1}^{2}$ | ${\widehat{\sigma}}_{2}^{2}$ | ${\widehat{\sigma}}_{3}^{2}$ | ${\widehat{\sigma}}_{4}^{2}$ |
---|---|---|---|---|---|---|---|---|---|---|---|

0.000 (0.023) | 0.001 (0.023) | 0.001 (0.024) | 0.000 (0.025) | −0.007 (0.025) | −0.009 (0.024) | −0.001 (0.025) | 0.002 (0.025) | 0.003 (0.066) | −0.016 (0.070) | −0.010 (0.071) | −0.010 (0.067) |

True factor loadings | Estimating approach | _{1}(t) | _{2}(t) | _{3}(t) |
---|---|---|---|---|

Constant | Penalized | 3.82 | 4.25 | 3.80 |

Unpenalized | 45.28 | 45.31 | 44.63 | |

Non-constant cyclic | Penalized | 37.48 | 32.61 | 33.12 |

Unpenalized | 51.36 | 41.96 | 42.12 | |

Non-constant non-cyclic | Penalized | 30.01 | 39.94 | 37.78 |

Unpenalized | 45.84 | 50.23 | 52.18 |

- Arminger G. A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis—Hastings algorithm. Psychometrika. 1998;63:271–300.
- Austin E, Coull BA, Zanobetti A, Koutrakis P. A framework to spatially cluster air pollution monitoring sites in US based on the PM
_{2.5}composition. Environmental International. 2013;59:244–54. [PMC free article] [PubMed] - Bauer DJ. The role of nonlinear factor-to-indicator relationships in tests of measurement equivalence. Psychological Methods. 2005;10:305–16. [PubMed]
- Bell ML, Dominici F, Ebisu K, Zeger SL, Samet JM. Spatial and temporal variation in PM
_{2.5}chemical composition in the United States for health effects studies. Environmental Health Perspectives. 2007;115:989–95. [PMC free article] [PubMed] - Berrocal VJ, Gelfand AE, Holland DM. A bivariate space-time downscaler under space and time misalignment. The Annals of Applied Statistics. 2010;4:1942–75. [PMC free article] [PubMed]
- Choi J, Fuentes M, Reich BJ. Spatial-temporal association between fine particulate matter and daily mortality. Computational Statistics and Data Analysis. 2009;53:2989–3000. [PMC free article] [PubMed]
- Craven P, Wahba G. Smoothing noisy data with spline functions—estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979;31:377–403.
- Daniels MJ, Zhou ZG, Zou H. Conditionally specified space-time models for multivariate processes. Journal of Computational and Graphical Statistics. 2006;15:157–77.
- Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, Samet JM. Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA. 2006;295:1127–34. [PMC free article] [PubMed]
- Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science. 1996;11:89–102.
- Fahrmeir L, Raach A. A Bayesian semi-parametric latent variable model for mixed responses. Psychometrika. 2007;72:327–46.
- Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good Ridge parameter. Technometrics. 1979;21:215–23.
- Gryparis A, Coull BA, Schwartz J, Suh HH. Semiparametric latent variable regression models for spatiotemporal modelling of mobile source particles in the greater Boston area. Journal of the Royal Statistical Society Series. 2007;56:183–209.
- Guo R, Zhu HT, Chow SM, Ibrahim JG. Bayesian lasso for semiparametric structural equation models. Biometrics. 2012;68:567–77. [PMC free article] [PubMed]
- Hastie T, Tibshirani R. Varying-coefficient models. Journal of the Royal Statistical Society Series B. 1993;55:757–96.
- Hewson PJ, Bailey TC. Modelling multivariate disease rates with a latent structure mixture model. Statistical Modelling. 2010;10:241–64.
- Horn JL, McArdle JJ. A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research. 1992;18:117–44. [PubMed]
- Huerta G, Sanso B, Stroud JR. A spatiotemporal model for Mexico City ozone levels. Journal of the Royal Statistical Society: Series C. 2004;53:231–48.
- Ippoliti L, Valentini P, Gamerman D. Space–time modelling of coupled spatiotemporal environmental variables. Journal of the Royal Statistical Society: Series C. 2012;61:175–200.
- Jiang Q, Wang HS, Xia YC, Jiang GH. On a principal varying coefficient model. Journal of the American Statistical Association. 2013;108:228–36.
- Kammann EE, Wand MP. Geoadditive models. Journal of the Royal Statistical Society: Series C. 2003;52:1–18.
- Kneip A, Gasser T. Convergence and consistency results for self-modeling nonlinear-regression. Annals of Statistics. 1988;16:82–112.
- Laden F, Neas LM, Dockery DW, Schwartz J. Association of fine particulate matter from different sources with daily mortality in six US cities. Environmental Health Perspectives. 2000;108:941–47. [PMC free article] [PubMed]
- Lee SY, Song XY, Lee JCK. Maximum likelihood estimation of nonlinear structural equation models with ignorable missing data. Journal of Educational and Behavioral Statistics. 2003;28:111–34.
- Liu X, Wall MM, Hodges JS. Generalized spatial structural equation models. Bio-statistics. 2005;6:539–57. [PubMed]
- Lopes HF, Salazar E, Gamerman D. Spatial dynamic factor analysis. Bayesian Analysis. 2008;3:759–92.
- Lu B, Song XY, Li XD. Bayesian analysis of multi-group nonlinear structural equation models with application to behavioral finance. Quantitative Finance. 2012;12:477–88.
- Meredith W. Measurement invariance, factor-analysis and factorial invariance. Psychometrika. 1993;58:525–43.
- Nikolov MC, Coull BA, Catalano PJ, Diaz E, Godleski JJ. Statistical methods to evaluate health effects associated with major sources of air pollution: A case-study of breathing patterns during exposure to concentrated Boston air particles. Journal of the Royal Statistical Society: Series C. 2008;57:357–78.
- Nikolov MC, Coull BA, Catalano PJ, Godleski JJ. Multiplicative factor analysis with a latent mixed model structure for air pollution exposure assessment. Environmetrics. 2011;22:165–78.
- Peng RD, Bell ML, Geyh AS, McDermott A, Zeger SL, Samet JM, Dominici F. Emergency admissions for cardiovascular and respiratory diseases and the chemical composition of fine particle air pollution. Environmental Health Perspectives. 2009;117:957–63. [PMC free article] [PubMed]
- Reiss PT, Ogden RT. Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society: Series B. 2009;71:505–23.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression (Cambridge series in statistical and probabilistic mathematics) Cambridge, New York: Cambridge University Press; 2003.
- Sahu SK, Mardia KV. A Bayesian kriged Kalman model for short-term forecasting of air pollution levels. Journal of the Royal Statistical Society: Series C. 2005;54:223–44.
- Salvador P, Artinano B, Alonso DG, Querol X, Alastuey A. Identification and characterisation of sources of PM
_{10}in Madrid (Spain) by statistical methods. Atmospheric Environment. 2004;38:435–47. - Sánchez BN, Budtz-Jørgensen E, Ryan LM, Hu H. Structural equation models: A review with applications to environmental epidemiology. Journal of the American Statistical Association. 2005;100:1443–55.
- Sánchez BN, Kang S, Mukherjee B. A latent variable approach to study gene—environment interactions in the presence of multiple correlated exposures. Biometrics. 2012;68:466–76. [PMC free article] [PubMed]
- Schwartz J, Dockery DW, Neas LM. Is daily mortality associated specifically with fine particles? Journal of the Air and Waste Management Association. 1996;46:927–39. [PubMed]
- Shaddick G, Wakefield J. Modelling daily multivariate pollutant data at multiple sites. Journal of the Royal Statistical Society: Series C. 2002;51:351–72.
- Song XY, Lu ZH. Semiparametric latent variable models with Bayesian P-splines. Journal of Computational and Graphical Statistics. 2010;19:590–608.
- Tao Y, Sánchez BN, Mukherjee B. Latent variable models for gene—environment interactions in longitudinal studies with multiple correlated exposures. Statistics in Medicine. 2015;34:1227–241. [PMC free article] [PubMed]
- Thurston GD, Ito K, Lall R. A source apportionment of U.S. fine particulate matter air pollution. Atmospheric Environment. 2011;45:3924–36. [PMC free article] [PubMed]
- Valentini P, Ippoliti L, Fontanella L. Modeling US housing prices by spatial dynamic structural equation models. Annals of Applied Statistics. 2013;7:763–98.
- Wood SN. Generalized Additive Models: An Introduction with R (Texts in statistical science) 2006.
- Zhang WY, Lee SY. Nonlinear dynamical structural equation models. Quantitative Finance. 2009;9:305–14.
- Zhu HT, Lee SY. Statistical analysis of nonlinear factor analysis models. British Journal of Mathematical and Statistical Psychology. 1999;52:225–42.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |