Home | About | Journals | Submit | Contact Us | Français |

**|**Biostatistics**|**PMC2733176

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. THE SOUTHERN CALIFORNIA CHS
- 3. THE PROPOSED FUNCTIONAL-BASED MULTILEVEL MODEL
- 4. ANALYSIS OF THE LUNG FUNCTION DATA
- 5. FURTHER TOPICS
- 6. DISCUSSION
- FUNDING
- Supplementary Material
- References

Authors

Related links

Biostatistics. 2008 October; 9(4): 686–699.

Published online 2008 March 18. doi: 10.1093/biostatistics/kxm059

PMCID: PMC2733176

Kiros Berhane^{*}

Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033-9987, USA, Email: ude.csu@sorik

Department of Epidemiology and Public Health, Imperial College London, Norfolk Place, London W2 1PG, UK, Email: ku.ca.lairepmi@rotilom.yssaj

Received 2007 January 31; Revised 2007 November 7; Accepted 2007 December 17.

Copyright © The Author 2008. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

This article has been cited by other articles in PMC.

Flexible multilevel models are proposed to allow for cluster-specific smooth estimation of growth curves in a mixed-effects modeling format that includes subject-specific random effects on the growth parameters. Attention is then focused on models that examine between-cluster comparisons of the effects of an ecologic covariate of interest (e.g. air pollution) on nonlinear functionals of growth curves (e.g. maximum rate of growth). A Gibbs sampling approach is used to get posterior mean estimates of nonlinear functionals along with their uncertainty estimates. A second-stage ecologic random-effects model is used to examine the association between a covariate of interest (e.g. air pollution) and the nonlinear functionals. A unified estimation procedure is presented along with its computational and theoretical details. The models are motivated by, and illustrated with, lung function and air pollution data from the Southern California Children's Health Study.

Cohort studies on the long-term effects of air pollution, such as the Harvard Six-Cities study and the Southern California Children's Health Study (hereafter referred to as CHS), involve multiple measurements on each of many subjects, in addition to possible geographic clustering due to inclusion of several communities with a diverse pollution profile (Peters, Avol, Gauderman, *and others*, 1999). Proper analysis of the resulting longitudinal data should account for the clustering effects, as well as the serial correlation that usually arises due to repeated measurements from the same subject. In such studies, the main interest is in being able to estimate air pollution effects on outcomes such as lung function (i) between times within subjects, (ii) between subjects within communities, and (iii) between communities. The multilevel modeling paradigm (Berhane *and others*, 2004) gives a unified way of examining the effects of air pollution at the different levels outlined above. These models fall under the general umbrella of the well-studied mixed-effects models (Laird and Ware, 1982; Diggle *and others*, 1994).

Growth curve modeling of physiologic measures such as height and lung function in children has always posed challenges because of the nonconstant growth rates as subjects move from preadolescence to puberty and then to young adulthood. Several methods have been proposed to model these growth curves. Fully parametric methods range from simple ones that assume a global polynomial curve (Kory *and others*, 1961) to those that fit separate parametric curves in mutually exclusive pre- and postadolescence age categories (Dockery *and others*, 1983). With the growing popularity of smoothing techniques (Hastie and Tibshirani, 1990), more flexible methods have been proposed for modeling growth curves. Wypij *and others* (1993) used generalized estimating equations–based methods (Liang and Zeger, 1986) that use splines to model lung function growth curves in the Harvard Six-Cities study (Ferris *and others*, 1979). Berhane *and others* (2000) used similar models in a mixed-effects modeling framework on data from the CHS. Other related, but more general, models have also been proposed, along with extensive studies of the ensuing theoretical and computational issues (Berhane and Tibshirani, 1998; Brumback and Rice, 1998; Lin and Zhang, 1999).

While the use of smoothing techniques to model lung function growth has become the state of the art, methods for relating these growth curves to pollution levels are not well established. The main challenge has been in deciding which aspects of the nonlinear growth curves to relate to air pollution and the lack of appropriate statistical methods to conduct the analysis. In this paper, we propose a new technique for modeling important features of nonlinear curves as functions of ecologic covariates. The methodology is motivated by our desire to assess the impact of ambient air pollution on children's lung function growth.

The main features of the proposed methodology are the use of Gibbs sampling techniques to estimate the posterior distribution of important features of the curves (known as functionals) and the use of meta-analytic techniques to relate the posterior means and posterior variances to ambient air pollution levels. The idea of using Gibbs sampling techniques to study posterior distributions of functionals has been used before by Silverman (1985) and Hastie and Tibshirani (2000). Our methods differ from those previously proposed in 2 important aspects: namely, (i) we fit cluster-specific curves with subject-specific random effects to allow for subject-specific attributes and (ii) we introduce a meta-analytic second-stage model in order to relate the posterior means of the functionals along with posterior variances, entering the mixed-effects second-stage model either as inverse-variance weights or as known variances of the random effect for modeling heterogeneity. The near equivalence of such multilevel models to a related unified mixed-effects model has been discussed previously (Berhane *and others*, 2004). Our proposed strategy is similar to those proposed for functional mixed-effects models by Guo (2002) and wavelet-based approaches of Morris and Carroll (2006) and references therein. Our approach has the advantages of simplicity and computational ease for analysis of large epidemiologic data sets. We plan to make our suite of computer programs available to the research community.

We briefly describe the CHS in Section 2. Details of the proposed model are discussed in Section 3, followed by a detailed analysis of CHS data in Section 4. Further topics of interest, conclusions, and areas for future research are discussed in Sections 5 and 6. Note that although the methods are motivated by the CHS data, they are potentially applicable to a wide range of applications where the main interest might be in drawing ecologic inference on scientifically relevant functionals of nonlinear growth trajectories.

The CHS is a longitudinal study designed to assess the long-term effects of air pollution on children's health. The study included a sample of 12 Southern California communities out of 86 communities with routine air pollution monitoring that were selected to exhibit maximum variability with respect to ambient levels of ozone (${\text{O}}_{3}$), particulates (PM${}_{10}$), nitrogen dioxide (NO${}_{2}$), and acid vapor. Figure S1 in the supplementary material (available at *Biostatistics* online, http://www.biostatistics.oxfordjournals.org) depicts the geographic distribution of the study communities. Details of design- and power-related issues are discussed in Navidi *and others* (1994). In 1993, the study enrolled approximately 150 fourth graders and 75 seventh and tenth graders from each community, leading to a total of 3681 students who returned a signed consent form. In 1994, 386 fifth and 111 eighth graders were added from the same schools. A second 4th-grade cohort of 2081 children was enrolled in 1996. Thus, a total of 6259 children have entered the study for observation. Attrition amounted to about 8% per year, 95% of which was due to moving away from participating schools. The models proposed in this paper will be illustrated by using data from all participants of the first cohort of fourth graders.

A baseline questionnaire was completed by the primary caregiver of each child, covering residential history, current residential characteristics (e.g. ventilation and sources of indoor pollution), personal risk factors, respiratory symptoms, and usual activities. An abbreviated yearly follow-up questionnaire was used to collect data on chronic respiratory symptoms and diseases and time-dependent covariates. For more details, see Peters, Avol, Navidi, *and others* (1999).

Lung function tests, which are the focus of this paper, were conducted via field team visits to participating schools in winter to spring of each year (January to June), a period of relatively low pollution levels in the region, so as to minimize the effects of acute pollution episodes. The lung functions measured included forced expiratory volume in 1 s (FEV${}_{1}$), the maximum mid-expiratory flow (MMEF), the peak expiratory flow rate (PEFR), forced volume capacity (FVC), and forced expiratory flow (FEF${}_{75}$). More details on the lung function testing techniques and quality control procedures are given in Peters, Avol, Gauderman, *and others* (1999). For illustrative purposes, our examples in this paper are limited to FEV${}_{1}$, as this is the main measure of lung volume and hence its maximum rate of growth is more likely to be affected by effects of air pollution during a child's growth spurt.

Air pollution levels were assessed via dedicated central site monitors in each study community, providing data on continuous hourly ambient ${\text{O}}_{3}$, PM${}_{10}$, and NO${}_{2}$ and 2-week measures of $\text{PM}{}_{2.5}$ and acid vapors. Individual exposure predictions were made via the “microenvironmental” approach (Navidi and Lurmann, 1995) based on ambient exposure levels, housing characteristics, and time–activity patterns.

Focusing on O_{3}, elemental carbon, and acid vapors, our illustrative examples use multiyear average exposure data obtained by averaging the hourly or 2-week integrated measures of exposure data from central site monitors. An ubiquitous problem in air pollution research is the high correlation between the various air pollutants. This leads to multicollinearity in models that simultaneously assess multiple pollutants and hence precludes our ability to identify the pollutant with maximum impact on children's health. This issue and a potential solution that uses Bayesian model averaging have been discussed briefly in Berhane *and others* (2004). We found out that all the pollutants were highly correlated with each other, with the exception of O_{3} which was uncorrelated with any of the other pollutants. In this paper, we only consider models that assess effects of one pollutant at a time. Extensions of our proposed models to the multipollutant problem in a manner that properly handles the multicollinearity in the pollutant effects, while possible, are beyond the scope of this paper.

To establish notation, we first describe the multilevel linear model that has been used for assessing the effects of air pollution on lung function growth (Berhane *and others*, 2004; Gauderman *and others*, 2004). Denote by $y{}_{cij}$ the outcome measure (e.g. lung function, log transformed to ensure normality) for subject *i* in community *c* at attained age $t{}_{cij}$, where *j* indexes year. Predictors can be time-dependent, time-constant subject-specific attributes, or community-specific. Uppercase and lowercase letters denote community-specific and subject-specific quantities, respectively. Thus, $\text{w}{}_{cij}$ represents time-dependent covariates (e.g. respiratory infection), $\text{w}{}_{ci}$ time-constant fixed covariates (e.g. ethnicity), $P{}_{cj}$ the community annual-average levels of pollution, and *p*_{c} the community-specific multiyear average levels of pollution. A 3-level linear model that allows for subject- and community-level variance components based on subject-specific intercepts and slopes is the following linear mixed-effects model (Laird and Ware, 1982):

(3.1)

where ${e}_{ci}~N(0,{\sigma}_{e,ci}^{2})$, and ${e}_{c}~N(0,{\sigma}_{e,c}^{2})$ are random subject-specific and community-specific intercepts and $e{}_{cij}~N(0,{\sigma}_{cij}^{2})$ is an error term. Similarly, ${f}_{ci}~N(0,{\sigma}_{f,ci}^{2})$ and ${f}_{c}~N(0,{\sigma}_{f,c}^{2})$ are the corresponding random slopes. The implicit assumption of independence between random effects in Model (3.1) could be relaxed to allow for more complex temporal and/or spatial correlation structures. See Berhane *and others* (2004) for more details.

Exploratory analyses have shown that the relationship between log-transformed values of lung function measures and height is linear in short age-intervals, but both the intercept and the slope change with age during the adolescent growth period (Wypij *and others*, 1993). This observation reflects the complex relationship between height growth, age, and lung function growth. It is well known that human lung growth spurt lags by about 6 months from that of height growth spurt. We extend Model (3.1) to depict this complex relationship. Focusing on the temporal level of the model for modeling growth trajectories, a varying-coefficient model (Hastie and Tibshirani, 1993) has the following form:

(3.2)

where “*h*” denotes the time-dependent log-transformed height measurement. In order to make ${f}_{1}\left(t{}_{cij}\right)$ interpretable as a growth curve, *h* is usually centered by a smooth function of “log (height)” on “attained age.”

Options for modeling ${f}_{1}$ and ${f}_{2}$ range from completely nonparametric ones such as smoothing splines to fully parametric functions such as polynomials. Here, we use natural spline basis functions. These are constructed by using B-spline basis functions (DeBoor, 1974), leading to piecewise polynomials that are joined together at a specified number of break points (or knots), and constrained to be linear beyond the boundary knots. The advantage of such inherently parametric representation is that inference is based on solid theoretical grounds and extracting features of growth curves via appropriate derivatives is straightforward. Based on this approach, a multilevel version of (3.1) that allows for nonlinear subject-specific growth parameters could then be formulated as follows:

(3.3)

(3.4)

(3.5)

(3.6)

(3.7)

where $k=1,\dots ,K$ subscripts the *K* natural spline basis functions ${k}_{\left(t{}_{cij}\right)}$. Models (3.3)–(3.7) could be combined to give the following unified mixed-effects model:

(3.8)

For settings where the number of observations per subject is limited, as in the CHS data, it may not be advisable to fit complex (individual-specific and/or community-specific) interactions for the “${\mathrm{\left(k\right)}}^{\left(t{}_{cij}\right)\times \text{h}{}_{cij}}$” terms. It also makes more sense to fit separate smooth functions for each of the communities, while subject-to-subject variability is captured via random error terms in order to avoid overparameterization. Moreover, the biologically important features of the growth curves that may be affected by air pollution may not be extractable from the above formulation. Hence, we propose the following 2-level model, where (i) Models (3.3)–(3.5) are combined and simplified to form a new level-1 model, (ii) a Markov Chain Monte Carlo (MCMC) procedure, such as the Gibbs sampler, is employed to estimate the posterior distribution of functionals (e.g. maximum rate of growth) of the community-specific curves, and (iii) a mixed-effects meta-analytic model is fitted to assess the relationship of the functionals with air pollution. The new level-1 model is given as follows:

(3.9)

where ${\mathbf{e}}_{ci}~\text{MVN}(0,{\Sigma}_{e,ci})$, ${\mathbf{e}}_{ci}=({e}_{ci}^{\left(1\right)},\dots ,{e}_{ci}^{\left(K\right)})$ is the vector of subject-specific random error terms associated with the growth parameters, included to allow for random subject effects, and $e{}_{cij}~N(0,{\sigma}_{cij}^{2})$ is a residual error term. The parametric model (3.9) can be fitted via popular software packages (e.g. PROC MIXED in SAS, LME in Splus). We can then extract the community-specific growth curves, with proper adjustments for subject-to-subject variability due to the rich random-effects structure. Growth rate estimates at any given age can also be extracted by using derivatives of the growth curves, a relatively straightforward process when dealing with natural splines.

When the interest is in predicted value(s) at any age of either the growth curve itself or one of its derivatives, or linear combinations of them, these features are known as linear functionals and inference on them is straightforward (Silverman, 1985) as they allow for closed-form estimation of variances. An important recent substantive publication from the CHS uses this approach to model the relationship between air pollution and lung function growth over the childhood period (Gauderman *and others*, 2004). The need for Gibbs sampling arises when the focus is on nonlinear functionals (e.g. maximum rate of growth) mainly due to the difficulty of getting full access to uncertainty estimates for the functional. This problem arises because $({\lambda}_{1}{f}_{1}+{\lambda}_{2}{f}_{2})\ne ({\lambda}_{1}{f}_{1})+({\lambda}_{2}{f}_{2})$.

The mixed-effects model (3.9) can be rewritten as

(3.10)

where $c=1,\dots ,C$, $i=1,\dots ,I$, and $j=1,\dots ,{n}_{i}$, denote clusters, subjects, and repeated measurements per subject, respectively, and $N={\sum}_{i}{n}_{i}$. Here, $\mathbf{X}{}_{cij}=({\mathrm{kc}}_{\left(t{}_{cij}\right);c=1,\dots ,C;k=1,\dots ,K)}$ represents the fixed effects from the realizations of the B-spline basis functions of the community-specific growth curves, $\mathbf{W}{}_{cij}=(\mathbf{w}{}_{cij}:\mathbf{w}{}_{ci}:(P{}_{cj}-{P}_{c}\left)\right)$ denotes a vector of adjustment factors, and $\mathbf{Z}{}_{cij}=({k}_{\left(t{}_{cij}\right);k=1,\dots ,K)}$ denotes a vector of covariates forming the random component of the model. Additionally, we assume that $\mathbf{e}{}_{cij}~N(0,{\sigma}^{2})$ and ${\mathbf{b}}_{i}~MVN(0,\Sigma )$, where ${\mathbf{b}}_{i}$ denotes the $q\times 1$ vector of random terms. In the last version of (3.10), ${\mathbf{\beta}}^{*}=[\mathbf{\beta},\mathbf{\delta}{]}^{T}$ represents fixed coefficients associated with ${\text{X}}_{ci}^{*}=[{\text{X}}_{ci},{\text{W}}_{ci}]$, ${\text{X}}_{ci}$ denotes the fixed effects from the realizations of the natural spline basis functions of the community-specific growth curves, ${\text{W}}_{ci}$ denotes adjustment factors, and ${\text{Z}}_{ci}$ denotes covariates forming the random component of the model.

Following the Bayesian paradigm, all parameters in (3.10) are treated as random quantities (with appropriate prior distributions) regardless of whether effects are fixed or random. Therefore, we proceed by specifying the prior distributions for the parameters in Model (3.10) as follows:

- ${\mathbf{\beta}}^{*}~\text{MVN}(0,{\mathbf{\sigma}}_{{\mathbf{\beta}}^{*}})$, where ${\mathbf{\sigma}}_{{\mathbf{\beta}}^{*}}^{-1}=0$, indicates vague prior information about ${\mathbf{\beta}}^{*}$.
- ${\text{b}}_{ci}~\text{MVN}(0,{\mathbf{\sigma}}_{\text{b}})$.
- ${\sigma}^{2}~\text{IG}({e}_{{\sigma}^{2}},{f}_{{\sigma}^{2}})$, where IG denotes the inverse gamma distribution.

Hyperpriors also need to be specified for ${\text{b}}_{ci}$, leading to ${\mathbf{\sigma}}_{\text{b}}~W\left(\right(\text{R}{)}^{-1},\rho )$, where *W* denotes the Wishart distribution, a multivariate generalization of the gamma distribution. Here, *R* is a $d\times d$ matrix, *d* is the number of columns in matrix ${\text{Z}}_{ci}$, and $\rho \ge d$ is a scalar degree of freedom parameter. Given all prior and hyperprior information, their full conditional posterior distributions are as follows:

- Given $({\text{y}}^{*}=\text{y}-\text{Zb},{\mathbf{\sigma}}_{{\mathbf{\beta}}^{*}},{\sigma}^{2})$, the posterior distribution of ${\mathbf{\beta}}^{*}$ is $\text{MVN}\left(\text{d}\right({\sigma}^{-2}{\text{X}}^{*}{\text{y}}^{*}),\text{d})$.
- Given $({\text{y}}_{ci}^{*}={\text{y}}_{ci}-{\text{X}}_{ci}^{*}{\mathbf{\beta}}^{*},{\mathbf{\sigma}}_{\text{b}},{\sigma}^{2})$, the posterior distribution of ${\text{b}}_{ci}$ is $\text{MVN}({\text{d}}_{ci}({\sigma}^{-2}{\text{Z}}_{ci}^{T}{\text{y}}_{ci}^{*}),{\text{d}}_{ci})$.
- Given $({\text{y}}^{**}=\text{y}-{\text{X}}^{*}{\mathbf{\beta}}^{*}-\text{Zb},{e}_{{\sigma}^{2}},{f}_{{\sigma}^{2}})$, the posterior distribution of ${\sigma}^{2}$ follows $\text{IG}(\frac{n}{2}+{e}_{{\sigma}^{2}},[\frac{1}{2}({\text{y}}^{**T}{\text{y}}^{**})+{f}_{{\sigma}^{2}}^{-1}{]}^{-1})$.
- given $({\text{b}}_{ci}^{*}={\text{b}}_{ci}-0,\rho ,\text{R})$, the posterior distribution of ${\Sigma}_{\text{b}}$ is $W(k+\rho ,{\sum}_{c}{\sum}_{i}({\text{b}}_{ci}^{T}{\text{b}}_{ci}+{\text{R}}^{-1}{)}^{-1})$, where
*k*is the number of clusters.

Here, ${\text{d}}^{-1}={\sigma}^{-2}{\text{X}}^{*T}{\text{X}}^{*}+{\mathbf{\sigma}}_{{\mathbf{\beta}}^{*}}^{-1}$, $\text{b}=[{\text{b}}_{11},\dots ,{\text{b}}_{1{m}_{1}},\dots ,{\text{b}}_{C1},\dots ,{\text{b}}_{C{m}_{C}}{]}^{T}$, and ${\text{d}}_{ci}^{-1}={\sigma}^{-2}{\text{Z}}_{ci}^{T}{\text{Z}}_{ci}+{\mathbf{\sigma}}_{\text{b}}^{-1}$. In our Gibbs sampling, we use the restricted maximum likelihood (REML) estimates from the linear mixed-effects model as initial values. Using the REML estimates shortens the burn-in period for the Gibbs sampler. Noninformative priors are used in our model for appropriate parameters (e.g. adjustment factors), and parameters for ${\sigma}^{2}$ were specified as $({e}_{{\sigma}^{2}},{f}_{{\sigma}^{2}})=(0.01,0.01)$. Since it has been suggested in the literature (Gelman, 2006) that this choice of prior may prove to be informative, alternative choices of priors are also considered to check for robustness of the substantive findings. Also, the hyperparameters for ${\mathbf{\sigma}}_{\text{b}}$ were specified as $\left(\right(\text{R}{)}^{-1},\rho )=(({\text{I}}_{d}{)}^{-1},d)$, where $d=4$ in our case. Therefore, the Gibbs sampling procedure is implemented as follows:

- Update ${\mathbf{\beta}}^{*}$ and ${\text{b}}_{ci}$ from their own full conditional distributions, that is the distribution of the parameter in question given all other quantities in the model.
- Update the variance components ${\sigma}^{2}$, ${\mathbf{\sigma}}_{\text{b}}$ from their full conditional distribution.
- Repeat (1) and (2) until convergence.

All analyses use WinBUGS. Issues related to convergence were examined by using convergence diagnostic measures as in Gelman and Rubin (1992), implemented using the CODA function in R. This method uses 2 or more parallel chains with different starting values, essentially sampling from an overdispersed distribution. The Gelman and Rubin statistic is calculated based on the comparison of within- and between-chain variances, giving a value close to 1 if the MCMC output from all chains is indistinguishable, and hence, convergence is considered to be reached.

In each cycle of the Gibbs sampling algorithm, community-specific growth curves are estimated along with associated growth rate estimates via derivatives of the community-specific growth curves. First, we extract the community-specific growth curves via where is a $g\times 1$ vector of the community-specific fixed coefficient associated with the natural spline. Then, we obtain the community-specific growth rates by taking the derivatives of $\widehat{{\text{f}}_{c}}$ described above. Finally, we construct the entire posterior distribution of functionals of interest such as the maximum rate of growth. Figure 1 gives a pictorial presentation of the prior specifications for the Bayesian modeling procedure outlined above. Pictorial presentation of the setup for the Bayesian modeling paradigm. 1

Given estimates of functional(s) of interest from the algorithms outlined in Sections 3.1–3.2, along with their entire posterior distributions, one could conduct inferences on the relationships between such functionals and ecologic covariates (e.g. air pollution). To do this, we fit an ecologic weighted regression. This approach takes the uncertainty estimates of the functionals into account by using as weights the inverse of the posterior variance estimates of the community-specific posterior means of the nonlinear functional of interest.

Let ${c}_{}$ and ${P}_{c}$ denote a functional of interest (e.g. maximum rate of growth) and a level of an ecologic covariate (e.g. air pollution) for community *c*, respectively. Let ${V}_{{c}_{}}$ denote the variance estimate of ${c}_{}$ as obtained from the Gibbs sampling algorithm. Then, a weighted ecologic regression model is

(3.11)

where $e=({e}_{1},\dots ,{e}_{C})~N(0,{\tau}^{2})$, with weights $1/{V}_{{c}_{}}$. Model (3.11) implicitly assumes that the posterior distribution of the functional of interest is, at least approximately, Gaussian. It also assumes that the estimates from the communities are uncorrelated with each other (i.e. there is no spatial correlation). Each of these assumptions could be tested, and relaxed when necessary, to accommodate more general classes of models. Specifically, a meta-analytic mixed linear–effects model could be used through an iterative process as in Stram (1996) to allow for a nondiagonal weight matrix as discussed in Section 5.1. A fully Bayesian approach that circumvents the Gaussian assumption is discussed in Section 5.2.

We now present a detailed analysis of the motivating example from the Southern California CHS. To illustrate the methods proposed in this paper, we use data from the first 4th-grade cohort of schoolchildren who were recruited in fall of 1993 from the 12 communities depicted in Figure S1 provided in the supplementary material (available at *Biostatistics* online, http://www.biostatistics.oxfordjournals.org). Lung function tests were conducted annually on all children. Depending on whether subjects were part of annual retest samples (about 10% were retested each year, roughly 3 months apart), up to 13 lung function test measurements were available. We required that each subject has at least 2 measurements.

The growth trajectories of childhood lung function measurements are different in boys and girls due to the relatively earlier growth spurts in girls. Figure 2 (left panel) illustrates the growth trajectories of FEV 1 for female (dashed curve) and male (solid curve) participants in the first 4th-grade cohort of the CHS. The growth trajectories clearly differ by gender, with girls having their growth spurt earlier and also achieving their maximal FEV 1 levels at the end of the observation period, whereas boys show an increasing trend even at the end of their high school years. Hence, lung function growth trajectories should always be fitted separately by gender even though joint inference is possible through a combined analysis.

Gender-specific growth curves (left panel) and rates of growth (right panel) for FEV_{1} in 8- to 18-year-old female (dashed curve) and male (solid curve) participants of the CHS. The curves are smooth functions estimated by natural splines from a mixed-effects **...**

Note that the time at which maximal lung function levels are achieved (and the corresponding levels) could be considered as additional nonlinear functionals for ecologic inference by identifying the region of the growth trajectory at which the rate of growth is almost 0. However, males will still be growing, and this may lead to unstable estimates of their maximum attained lung function. Because our purpose here is to illustrate the methodology via gender-specific and combined analysis, we restrict our analyses to the examination of the effect of air pollutants on the maximum rate of growth. For this, we need both the growth curve and its first derivative in order to examine rates of change. The right panel of Figure 2 depicts the gender-specific rates of growth for female (dashed curve) and male (solid curve) study participants.

The first-level mixed-effects model (3.9) was fitted via community-specific natural spline basis functions with interior knots at 12 and 14 years of age and boundary knots at 10 and 18 years of age. All models were fitted using log(FEV_{1}) to satisfy the normality assumption of the model. The results from this first-level model are summarized in Table 1 for female and male participants. In addition to the spline-based terms for age and log(height), both models included several design and adjustment variables: race/ethnicity, respiratory infection within the month prior to the spirometry testing, prevalent and incident asthma status, body mass index (BMI, including a squared term to model the well-documented U-shaped relationship), physical exercise, personal smoking, field technician, and the spirometer involved in the test.

Posterior mean estimates and Bayesian credible intervals (BCI) for the level-1 model adjustment covariates on FEV_{1} in female and male participants of the CHS

For female participants, there was a significant deficit in FEV_{1} levels (Table 1) for those who had a respiratory illness within the month prior to the spirometry test (− 0.9% deficit). Black children showed a significant deficit in FEV_{1} levels (−11.3%) as did Asian children (−5.8%) compared to their Caucasian counterparts. Marginally significant associations were also observed with diagnosis of asthma at study entry (−2.6% deficit) and Hispanic ethnicity (2.6% elevated) compared to their non-Hispanic white counterparts. There was also a significant quadratic relationship with BMI, but no significant relationship was found with personal smoking or incident asthma during follow-up.

For male participants, the first-level model results presented in Table 1 indicate significant deficits with black ethnicity (−14.2%) and Asian ethnicity (−2.4%) compared to White ethnicity, asthma status at entry (−4.3%), and respiratory infection within a month of spirometry test (−1.3%). There was also a significant quadratic relationship with BMI levels. The only differences from the results for females were the findings of significant positive association with personal smoking (0.9% increase in those who reported smoking) and the significant association with exercising (0.6% increase in those who reported exercising). The anomalous finding related to personal smoking may reflect associations with the overall health status of the subjects and other correlated factors.

Based on the above-described first-level model for females (Table 1), 40 000 posterior realizations of the community-specific growth curves were sampled as described in Section 3.3 after a burn-in period of 10 000 samples. Figure S2 in the supplementary material, available at *Biostatistics* online, gives the posterior mean growth curves, 95% Bayesian confidence limits, and a random sample of 100 posterior realizations (in gray). Note that the gray curves are representative Gibbs sampling realizations from the population growth curves, not subject-specific ones.

Similarly, Figure S3 in the supplementary material, available at *Biostatistics* online, gives the first derivative of the growth curves for females in order to estimate rates of growth. From the Gibbs sampling realizations of these derivative curves, we estimate the posterior distribution of the maximum rate of growth to be modeled against multiyear averages of air pollution in the ecologic regressions that follow. Figures S4 and S5 in the supplementary material, available at *Biostatistics* online, give the community-specific growth and associated derivative curves for male participants.

For ecologic inference, we were interested in the association between long-term levels of air pollution and the nonlinear functional which depicts the maximum rate of growth in FEV_{1}. For simplicity, we focus on air pollution only, although it is possible to consider other ecologic covariates as discussed in Berhane *and others* (2004). Results from fitting the ecologic meta-analytic regression described in Section 3.4 are given in Table 2, with focus on O_{3}, elemental carbon, and acid vapors, after transforming the maximum posterior rates of growth (and their corresponding posterior variances) to their original scale. To examine whether effects of air pollution are different in females and males, tests for pollution-by-gender interactions were conducted in each of the 3 pollutants under consideration. These tests for interaction revealed that the effects of air pollution were similar and not statistically different in females and males (*p* values for interaction were 0.21, 0.59, and 0.86 for O_{3}, elemental carbon, and acid vapors, respectively). Hence, we present results for both genders combined. Table 2 indicates that significant deficits in the maximum rate of growth of FEV_{1} are associated with long-term levels of elemental carbon and acid vapors. For comparisons that were scaled to contrast over the range between least and most polluted communities, there were significant deficits in the maximum rate of growth of FEV_{1} of 27.2 and 32.8 mL for 1.2 μ g of elemental carbon and 9.6 ppb of acid vapor, respectively. Though not significant, the contrast of 37.5 ppb in O_{3} levels (based on average daily levels between 10 AM and 6 PM, where children are mostly outside) was associated with a deficit of 5.9 mL in the maximum growth rate of FEV_{1}.

The results outlined above assume normality of the posterior distributions of the maximum rate of growth. To check whether this claim was supported by the data, tests for normality were conducted for each of the communities. No evidence of nonnormality was found for the marginal posterior distributions of the maximum attained levels of FEV_{1}, but the marginal posterior distributions of the maximum rate of growth were found to be normally distributed only after a log transformation. The second-stage meta-analytic models for the maximum rate of growth of FEV_{1} were refitted using the community-specific mean and variance values of the log-transformed marginal posterior realization of the maximum rate of growth. The results were almost identical with those reported in Table 2.

To assess whether convergence was reached in the Gibbs sampler, the Gelman and Rubin diagnostic (GRD) measures described in Section 3.2 were calculated based on 3 different sets of initial values, that is 0, −1, and the maximum likelihood estimates. Based on 10 000 burn-in iterations with 10 000 more iterations saved for analysis, the estimates for the GRD measures for the residual variance (σ) were found to be equal to 1.0 for both males and females. The corresponding values for the fixed-coefficient vector associated with the natural spline (δ) were found to be 1.03 and 1.04 for females and males, respectively. These results indicate that convergence was reached for all of our models.

As noted in Section 3.2, sensitivity analysis was also conducted to check whether our results were sensitive to our choice of variance component priors. Our substantive findings were found to be robust to the use of alternative choices of priors. For example, using a *uniform* (0,100) prior for σ instead of IG(0.01,0.01) led to significant deficits in the maximum rate of growth of FEV_{1} of 26.0 and 31.3 mL for 1.2 μ g of elemental carbon and 9.6 ppb of acid vapor, respectively, for comparisons that were scaled to contrast over the range between least and most polluted communities. The effect estimate for O_{3} was also essentially identical to that obtained under the IG(0.01,0.01) prior for σ.

In this section, we discuss several methodologic issues that could arise in fitting the proposed models with respect to model building, inference, and model checking.

Given estimates of functional(s) of interest from the algorithms outlined in Sections 3.2–3.3, along with their entire posterior distributions, one could conduct inferences on the relationships between such functionals and ecologic covariates (e.g. air pollution) by using a meta-analytic mixed-effects model as outlined in Stram (1996). This approach takes the uncertainty estimates of the functionals into account in an appropriate manner. In fact, the near equivalence of this approach to a combined mixed-effects model in the linear setting (as in Model (3.1)) has been discussed previously (see Berhane *and others*, 2004, for details).

Let ${c}_{}$ and ${P}_{c}$ denote a functional of interest (e.g. maximum rate of growth) and a level of an ecologic covariate (e.g. air pollution) for community *c*, respectively. Let ${V}_{{c}_{}}$ denote the variance estimate of ${c}_{}$ as obtained from the Gibbs sampling algorithm. Then, the following meta-analytic mixed-effects model could be used:

(5.1)

where $e=({e}_{1},\dots ,{e}_{C})~N(0,{\tau}^{2})$ and $\mathbf{\psi}=({\psi}_{1},\dots ,{\psi}_{C})~N(0,{\mathbf{V}}_{}$. Here, ${\mathbf{V}}_{}$ is a variance matrix for the ${c}_{}$s with ${V}_{{c}_{}}$s as diagonal elements. Model (5.1) leads to an iterative process that alternates between updates for $({\gamma}_{0},{\gamma}_{1})$ and ${\tau}^{2}$ until convergence. It implicitly assumes the normality of the posterior distribution of the functional of interest. The assumption of normality of the posterior distributions of the functionals could be tested via standard procedures, and transformation techniques could be used to achieve normality whenever possible.

When the assumption of normality for the posterior distribution of the functional of interest is untenable, even after transformation, an alternative Bayesian algorithm could be used. The algorithm involves 3 steps: namely, (i) fit a mixed-effects model (3.9), (ii) for each realization from the MCMC process (as outlined in Section 3.3), fit the ecologic model

(5.2)

where ${e}^{\left(g\right)}=({e}_{1}^{\left(g\right)},\dots ,{e}_{C}^{\left(g\right)})~N(0,{\tau}^{2\left(g\right)})$, and ${\mathrm{c\left(g\right)}}_{}^{}$ denotes the *k* th realization of ${c}_{}$ from the Gibbs sampling process, and (iii) conduct Bayesian inference on the distribution of the parameter estimates ${\gamma}_{0}^{\left(g\right)}$ and ${\gamma}_{1}^{\left(g\right)}$.

While natural spline–based models are quite flexible in depicting the nonlinear patterns in lung function, there is still a need to determine the number and position of the knots. Several approaches have been proposed to deal with this issue, which is essentially that of model selection. One could use substantive judgment or a trial-and-error approach to determine the number and position of the knots. It may also be possible to develop partially or fully automated methods such as BRUTO (Hastie, 1989), but such methods need to pay attention to the effects of the intrasubject and intracommunity correlation on the procedures. There is a rich literature on Bayesian approaches to selection of knots in spline-based models. See Biller (2000), DiMatteo *and others* (2001), Zhou and Shen (2001), and references therein for details on related approaches. However, this topic is beyond the scope of this paper.

We have presented a new method for analyzing the effects of ecologic covariates, in our case air pollution, on the biologically interesting aspects of nonlinear growth trajectories of physiologic measures. The method was motivated by a desire to estimate the long-term effects of air pollution on the biologically important and substantively relevant aspects of lung function growth trajectory. The long-term effects of air pollution on the linear growth slope (in short follow-up periods) are well established. But effects of air pollution on the entire growth trajectory (or important aspects of the trajectory) have not been well studied. The methods presented in this paper are intended to bridge this gap.

While the immediate application of the methods is to environmental epidemiology, the methods could also be applied in other settings. The application to air pollution described in Section 4, and hence the methodology, could be extended in several ways, for example to enable more thorough assessment of the multipollutant problem, measurement error in the exposure data, and simultaneous modeling of several outcomes.

California Air Resources Board (A033-186), National Institute of Environmental Health Sciences (5P30ES07048-04, 1PO1ESO939581-01), and US Environmental Protection Agency (CR824034-01-3, R826708-01-0).

We thank Duncan Thomas, Jim Gauderman, John Peters, Frank Gilliland, Rob McConnell, Ed Avol, Fred Lurmann, and members of the external advisory committee (especially Scott Zeger) for valuable discussions. *Conflicts of Interest:* None declared.

- Berhane K, McConnell R, Gilliland F, Islam T, Gauderman WJ, Avol E, London S, Rappaport E, Margolis HG, Peters JM. Sex-specific effects of asthma on pulmonary function in children. American Journal of Respiratory and Critical Care Medicine. 2000;62:1723–1730. [PubMed]
- Berhane K, Stram DO, Gauderman WJ, Thomas DC. Statistical issues in studies of the long term effects of air pollution: the Southern California Children's Health Study (with discussion) Statistical Science. 2004;19:414–449.
- Berhane K, Tibshirani R. Generalized additive models for longitudinal data. Canadian Journal of Statistics. 1998;26:517–535.
- Biller C. Adaptive Bayesian regression splines in semiparametric generalized linear models. Journal of Computational and Graphical Statistics. 2000;9:122–140.
- Brumback BA, Rice JA. Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association. 1998;93:961–983.
- DeBoor C. A Practical Guide to Splines. New York: Springer; 1974.
- Diggle PJ, Liang KY, Zeger SL. Analysis of Longitudinal Data. Oxford: Clarendon Press; 1994.
- DiMatteo I, Genovese CR, Kass RE. Bayesian curve-fitting with free knot splines. Biometrika. 2001;88:1055–1071.
- Dockery DW, Berkey CS, Ware JH, Speizer FE, Ferris BG., Jr Distribution of forced vital capacity and forced expiratory volume in one second in children 6 to 11 years of age. American Review of Respiratory Disease. 1983;128:405–412. [PubMed]
- Ferris BG, Speizer FE, Spengler JD, Dockery D, Bishop YM, Wolfson M, Humble C. Effects of sulfur oxides and respirable particles on human health: methodology and demography of population in study. American Review of Respiratory Disease. 1979;120:767–779. [PubMed]
- Gauderman WJ, Avol E, Gilliland F, Vora H, Thomas D, Berhane K, McConnell R, Kuenzli N, Lurmann F, Rappaport E,
*and others*The effect of air pollution on lung development from 10 to 18 years of age. New England Journal of Medicine. 2004;351:1057–1067. [PubMed] - Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1:515–533.
- Gelman A, Rubin D. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7:457–511.
- Guo W. Functional mixed effects models. Biometrics. 2002;58:121–128. [PubMed]
- Hastie T. Discussion of flexible and parsimonious smoothing and additive modeling, by J.H. Friedman and B.W. Silverman. Technometrics. 1989;31:23–29.
- Hastie T, Tibshirani R. Generalized Additive Models. London: Chapman and Hall; 1990.
- Hastie T, Tibshirani R. Varying coefficient models, (with discussion) Journal of the Royal Statistical Society, Series B. 1993;55:757–796.
- Hastie T, Tibshirani R. Bayesian backfitting, (with discussion) Statistical Science. 2000;15:196–223.
- Kory RC, Callahan R, Boren HG, Syner JC. The Veterans Administration-Army cooperative study of pulmonary function. I. Clinical spirometry in normal men. American Journal of Medicine. 1961;30:243–258. [PubMed]
- Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. [PubMed]
- Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Lin X, Zhang D. Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society, Series B. 1999;61:381–400.
- Morris JM, Carroll RJ. Wavelet-based functional mixed models. Journal of the Royal Statistical Society, Series B. 2006;61:179–199. [PMC free article] [PubMed]
- Navidi W, Lurmann F. Measurement error in air pollution exposure assessment. Journal of Exposure Analysis and Environmental Epidemiology. 1995;5:111–124. [PubMed]
- Navidi W, Thomas D, Stram D,
*and others*Design and analysis of multilevel analytic studies with applications to a study of air pollution. Environmental Health Perspectives. 1994;102:25–32. [PMC free article] [PubMed] - Peters JM, Avol E, Gauderman WJ, Linn WE, Navidi W, London SJ, Margolis H, Rappaport E, Vora H, Gong H, Thomas DC. A study of twelve southern California communities with differing levels and types of air pollution. II. Effects on pulmonary function. American Journal of Respiratory and Critical Care Medicine. 1999;159:768–775. [PubMed]
- Peters JM, Avol E, Navidi W, London SJ, Gauderman WJ, Lurman F, Linn WE, Margolis H, Rappaport E, Gong H, Thomas DC. A study of twelve southern California communities with differing levels and types of air pollution. I. Prevalence of respiratory morbidity. American Journal of Respiratory and Critical Care Medicine. 1999;159:760–767. [PubMed]
- Silverman BW. Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion) Journal of the Royal Statistical Society, Series B. 1985;47:1–52.
- Stram DO. Meta-analysis of published data using a linear mixed-effects model. Biometrics. 1996;52:536–544. [PubMed]
- Wypij D, Pugh M, Ware JH. Modeling pulmonary function growth with regression splines. Statistica Sinica. 1993;3:329–350.
- Zhou S, Shen X. Spatially adaptive regression splines and accurate knot selection schemes. Journal of the American Statistical Association. 2001;96:247–259.

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |