Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Ann Inst Stat Math. Author manuscript; available in PMC 2010 August 5.
Published in final edited form as:
Ann Inst Stat Math. 2009 February 1; 100(2): 278–290.
doi:  10.1016/j.jmva.2008.04.012.
PMCID: PMC2916816



Clustered data arise commonly in practice and it is often of interest to estimate the mean response parameters as well as the association parameters. However, most research has been directed to inference about the mean response parameters with the association parameters relegated to a nuisance role. There is little work concerning both the marginal and association structures, especially in the semiparametric framework. In this paper, our interest centers on inference on the association parameters in addition to the mean parameters. We develop semiparametric methods for both complete and incomplete clustered binary data and establish the theoretical results. The proposed methodology is illustrated through numerical studies.

Keywords: Association, Binary outcomes, Clustered data, Estimating equation, Missing at random, Semiparametric estimation

1 Introduction

Clustered data arise frequently in practice, and their analysis methods typically differ from univariate analysis methods due to possible association within clusters. Incorporating complex association structure in inferential procedures becomes a major challenge for analysis of clustered data, especially for discrete data. Under the parametric framework, various methods have been developed for binary data. See Prentice (1988), Lipsitz et al. (1991), Carey et al. (1993), Molenberghs and Lesaffre (1994), and Yi and Cook (2002), for instance.

Semiparametric models based on generalized estimating equations (GEE) methods and their extensions have become increasingly popular. See, for example, Severini and Staniswalis (1994), Carroll et al. (1997), Xia et al. (1999), Lin and Carroll (2001a, b), Wang (2003), Fan and Li (2004), Wang et al. (2005), Chen and Jin (2005), and Xia and Härdle (2006), among others. These methods mainly concern the marginal mean parameters with the association parameters treated as nuisance. However, in many applications, estimation of the association parameters is a central theme of the study. For example, in familial studies of inherited traits and developmental toxicology studies of laboratory animals (e.g., Hall and Severini 1998), subjects in a family or cluster share common genetic traits or are subject to common environmental factors, and it is of prime scientific interest to study the association between responses.

Under the likelihood formulation, Aerts and Claeskens (1997) and Claeskens and Aerts (2000) explored inference methods for the marginal and association parameters. However, with a marginal formulation for semiparametric regression there is relatively little discussion on featuring both the mean and association structures. To fill up this gap, in this paper we develop semiparametric inference procedures for both marginal and association parameters for clustered binary data, and rigorously establish the asymptotic properties. As missing observations commonly occur, which causes additional difficulty in conducting inference (Little and Rubin 2002), in this paper we also describe methods to handle incomplete data. The proposed methods provide a flexible framework to feature various types of mean and association structures. Such a flexibility however comes at the cost of increasing the complexity of inferential procedures. First, the common computing algorithm based on the Newton-Raphson method that applies to parametric models can not be directly used now due to the unknown function θ(.). Secondly, and more importantly, existing asymptotic distribution theory under the parametric framework (Prentice 1988; Yi and Cook 2002) breaks down for the current setup. The inclusion of a nonparametric term θ(.) into the mean model remarkably changes the nature of the model setup, and it is not trivial to establish the asymptotic results for the estimators of the mean and association parameters.

The remainder of the article is organized as follows. The notation and inference framework are introduced in Section 2 and estimation procedures are described in Section 3. In Section 4 we establish the asymptotic properties of the resulting estimators and discuss the issue of parameter interpretation. Numerical studies are given in Section 5 to assess the performance and to illustrate the use of the proposed methods. In Section 6 we develop the inference methods for handling incomplete data. We conclude the article with a discussion in the last section.

2 Notation and Framework

Suppose that there are n clusters and mi subjects within cluster i, i = 1, . . . , n. Let Yij be the binary response for subject j in cluster i, xij and zij be the p × 1 and q × 1 covariate vectors, respectively. Denote Yi = (Yi1, . . . , Yimi)T, xi = (xi1, ..., ximi)T and zi = (zi1, ..., zimi)T. Define μij = E(Yij|xi, zi), and let μi = (μi1, ..., μimi)T, i = 1, ..., n. Provided the mean of Yij depends only on the covariate vector for subject j, i.e. E(Yij|xi, zi) = E(Yij|xij, zij) (Pepe and Anderson 1994), we consider the regression model


where g(.) is a monotone link function, β and α are parameter vectors, and θ(.) is an unknown smoothing function. The requirement ||α|| = 1 ensures identifiability of α. Common choices of function g(.) include logit, probit, and complementary log-log functions. In some situations, g(.) can assume a flexible but more complex function form. For instance, Kim et al. (2008) proposed a class of link functions based on generalized t-distributions to characterize binary responses.

We assume that Yij and Yi’j’ are independent for different clusters i and i’, but within the same cluster, the responses may be correlated. Let ψijk be the odds ratio between responses Yij and Yik in cluster i (j < k), defined by


Regression models may be employed to feature various association structures, with the dependence of the association on covariates being explicitly reflected:


where h(.) is a monotone link function, uijk is a vector of covariates which specifies the form of the association between Yij and Yik, and ϕ is a vector of regression parameters. Letting uijk be the scalar one, for instance, leads to the exchangeable association between responses within the same cluster; while setting uijkTϕ=ϕjk results in an autoregressive correlation among responses (j < k). Typically, a log-linear regression may be assumed for (2) (Fitzmaurice and Laird 1993).

Let μijk = P(Yij = 1, Yik = 1|xi, zi) be the joint probability for the pair (Yij, Yik), given the covariates xi and zi. It is determined by the marginal means and the odds ratio, given by (Lipsitz et al. 1991; Yi and Thompson 2005)


where aijk = 1 − (1 − ψijk)(μij + μik).

3 Estimation Procedures

In this section we describe marginal methods for estimation of mean response parameters α and β and association parameters ϕ. Let Σi = [σijk] be the true covariance matrix for the response vector Yi for cluster i, with σijj = μij(1 − μij) and σijk = μijkμijμik for jk, and Vi=diag(σijj,j=1,,mi)Cidiag(σijj,j=1,,mi) be a working matrix, where Ci is an invertible working correlation matrix. Throughout the paper we assume that Ci may depend on a parameter vector τ that is distinct from the mean response parameters α and β, and can be estimated by the method of moments.

Let Uiα(α,β,θ(.))=(μiTα)Vi1(Yiμi) and Uiβ(α,β,θ(.))=(μiTαβ)Vi1(Yiμi) It can be seen that both U(α, β, θ(.)) and U(α, β, θ(.)) have zero expectation, i.e., they are unbiased estimating functions for α and β.

To estimate the association parameters ϕ, we employ the alternating logistic regression discussed in Carey et al. (1993) where the conditional expectation ζijk = E(Yij|Yik = yik, xi, zi) is needed for j < k. The conditional expectation ζijk is related to the association, marginal and joint probabilities by ζijk = expit(dijk), where dijk=(logψijk)yik+log(μijμijk1μijμik+μijk), and expit(t) = exp(t)/(1 + exp(t)). Let Vi=diag{ξijk(1ξijk),j<k} be the working matrix, then Uiϕ(α,β,θ(.),ϕ)=(ξiTϕ)Vi1i are unbiased estimating functions for ϕ, where εi = (yi1ζi12, ..., yi1ζi1mi, yi2ζi23, ..., yi,mi−1ζi,mi−1,mi)T, and ζi = (ζi12, ...., ζi1mi, ζi23, ..., ζi,mi−1,mi)T.

If θ(.) is known to be a linear function, then estimation of α, β and ϕ may proceed in a straightforward manner, as outlined in Carey et al. (1993) and Yi and Cook (2002), where the working matrix Vi may be taken as the true covariance matrix Σi. Since the estimating functions U(α, β, θ(.)), U(α, β, θ(.)) and U(α, β, θ(.), ϕ) involve an unknown smooth function θ(.), we need to use nonparametric approaches to estimate this function locally in order to estimate α, β, and ϕ. Assuming θ(u) has the second derivative, we may approximate θ(u) by a locally linear function within the neighborhood of u0 via the Taylor series expansion θ(u) ≈ θ(u0) + θ(1)(u0(uu0) for a given point u0, where d(l)(.) represents the lth derivative of function d(.). Let K(u) be a kernel function (or a mean zero symmetric density function) with a compact support and h be a bandwidth. Denote Kh(t) = K(t/h)/h, a0(u0) = θ(u0), a1(u0) = (1)(u0), and a(u0) = (a0(u0), a1(u0))T. Let Uij=zijTα, and Δi=diag(μij(1),j=1,,mi), where μij(1) is the first derivative of the function g(.) evaluated at xijTβ+θ(zijTα). Define Gij(u, Uij) to be an mi × 2 matrix with the lth column ej × {(uUij)/h}l−1 (l = 1, 2), where ej is an mi × 1 vector of 0 except with the jth entry being 1. Below we describe a two-stage algorithm for estimation of mean parameters α and β and association parameter ϕ.

Stage 1

Step 1

For a given point u in a selected grid find θ^(u,α^,β^)=a^0(u) by solving


with respect to a(u), where the lth element of μi(j) is g{xilTβ+I(l=j)(a0(u)+a1(u)(uUij)h)+I(lj)θ^(Uil,α,β)} with α and β replaced by α^ and β^, respectively, and μij(1)(α^,β) is the first derivative of the function g(.) evaluated at xijTβ+{a0(u)+a1(u)(uzijTα^)h}.

Step 2

Given the estimate θ^(u,α^,β^)=a^0(u) and a^1(u) for points u in the selected grid, update (α^,β^) by solving the following equations for α and β:



where μ^i(α,β)=(μ^i1(α,β),,μ^imi(α,β))T with μ^ij(α,β)=g(xijTβ+θ^(zijTα,α^,β^)).

Step 3

Repeat steps 1 and 2 until convergence of (α^,β^).

Stage 2

To estimate the association parameter ϕ, we solve the equations:


with respect to ϕ, where Uiϕ(α^,β^,θ^(zijTα^,α^,β^),ϕ) is U(α, β, θ(.), ϕ) with μij replaced by g(xijTβ^+θ^(zijTα^,α^,β^)), and θ(.) replaced by θ^(zijT,α^,α^,β^). Denote by ϕ^ the resulting by estimate of ϕ.

To implement the algorithm, we need to choose initial values α0 and β0 and set α^=α0α0 and β^=β0. One may for instance, take initial values as the estimates obtained from the usual generalized linear models with θ(.) specified as the identity function in (1). Given α^ and β^, we first set Vi to be the independence working matrix and apply Stage 1 to obtain estimate (α^,β^). Applying Stage 2 leads to an estimate of ϕ^. Then we iterate Stages 1 and 2 until convergence of (α^,β^,ϕ^). Throughout iterations the working matrix Vi is taken as the true covariance matrix Σi. A similar estimation algorithm is discussed in Yi et al. (2009). However, there is a substantial difference centering on the treatment of the working matrix Vi. In the algorithm of Yi et al. (2009), the independence working matrix is employed throughout all iterations, which may incur some efficiency loss.

4 Theory

4.1 Asymptotic Properties

Analogous to Lin and Carroll (2001a, b) and Wang (2003) we assume mi [equivalent] m for ease of notation. Covariates xi and zi are allowed to be correlated. The triples (Yi, xi, zi), i = 1, 2, ..., n, are assumed independently identically distributed. Let vijl and σijl be the (j, l)th element of Vi1 and Σi1, respectively. Let ϕ^α(u)=θ^(u,α,β)αT and ϕ^β(u)=θ^(u,α,β)βT with the dependence on α and β suppressed in notation, and ϕα(u) and ϕβ(u) be their limits defined in Appendix 2. Denote x~ij=xijϕβ(Uij,α,β)<mi>,<mi>z~ij=θ(1)(Uij)zijϕα(Uij,α,β)<mi>,<mi>x~i=(xi1,,xim)T, and z~i=(zi1,,zim)T. In the sequel, we drop the subject index i for ease of notation whenever expressing expectations. Let A~(V)=E{(z~,x~)TΔV1Δ(z~,x~)}, and B~(V,Σ)=E{(z~,x~)TΔV1ΣV1Δ(z~,x~)}

Theorem 1

Under the conditions in Appendix 1, as n → ∞ and h → 0 at the rate such that nh8 → 0 and nh/log(1/h) → ∞, we have


where Ω(V,Σ)={A~(V)}1B~(V,Σ){A~(V)}1, which is minimized by V = Σ and in this case equals {A~(V)}1.

We note that Theorem 1 does not just apply to binary data under model (1), it applies to continuous data as well. This can be readily seen from the proof in Appendix 2. With univariate data (i.e., m = 1), Carroll et al. (1997) established a similar asymptotic result. With m > 1, it is more difficult to develop the asymptotic theory as there is complexity in accommodating the working matrix Vi for multivariate data. In here, we not only derive the asymptotic distribution for the estimator α^ and β^, but also identify the scenario to obtain the semiparametric efficient estimator. Furthermore, Theorem 1 generalizes the results in Wang et al. (2005) where only a scalar covariate zij is considered. That is, there is no need to estimate parameter α associated with unknown function θ(.). In addition to these contributions, our work distinguishes from existing methods because of the following development on estimation of association parameters ϕ.

Let J=E(UiϕUiϕT) and H = E[−([partial differential]/[partial differential]ϕT)U]. If α, β and θ(.) are all known, estimating functions U are regular parametric unbiased estimating functions of ϕ, and thereby it is straightforward to establish that n(ϕ^ϕ)~N(0,H1J[H1]T), according to Liang and Zeger (1986). When α, β and θ(.) are unspecified and estimated, variation in the estimators α^, β^ and θ^(u) must be taken into account. If θ(.) is a known parametric function, one may easily adapt the arguments in Yi and Cook (2002) to work out the asymptotic distribution of n(ϕ^ϕ). However, here θ(.) is unknown and it is estimated locally, we need to incorporate this local estimation variability into the asymptotic variance of n(ϕ^ϕ) as well. This unknown θ(.) function presents a challenge in establishing the asymptotic distribution for the estimators, and this feature distinguishes the current work from existing results.

Along with the line of the proof of Theorem 1 in Appendix 2, we can derive an approximation of ϕ^h(u)θ(u) as follows


where b*, W2, Q1,*, and Q2,* are given in Appendix 2. Standard but tedious calculations can show that nh{θ^h(u)θ(u)b(u)h2} weakly converges to a normal distribution with zero mean. This result may be used for statistical inference. However, we suggest a bootstrap alternative if needed since the asymptotic variance assumes a very complex form.

Let A~i=j,kdijkϕ[ξijk(1ξijk)](dijkθ(Uij)z~ijT+dijkθ(Uik)z~ikT,dijkβT,dijkϕT) for i = 1, ..., n. Let Δi=diag(μiT(αT,βT)T,ξTϕ), and Ai=((A~i(V),0)T,A~iT)T. We establish the joint asymptotic distribution of the estimator (α^,β^,ϕ^) in Theorem 2, and its proof is given in Appendix 3.

Theorem 2

Under the conditions in Theorem 1,


where Ω* = E[A*−1Δ*diag(V−1,V*−1)Σ*diag(V−1,V*−1)Δ*A*−1], and Σ* is the covariance matrix of the vector (YiT,YiT)T with Yi=i+ξi.

Inferences about parameters α, β and ϕ may be based on Theorem 2, where Ω* is replaced by a consistent estimate in which the associated terms may be substituted by the corresponding empirical estimates. However, implementation of these empirical estimates is too complicated to be of practical interest. It is more plausible to apply the simple bootstrap method for a variance estimate, and this is consistent with available research work concerning semiparametric models, such as Lin and Carroll (2001a, b), Liang et al. (2004), and Wang et al. (2005).

4.2 Bandwidth Selection

In the implementation of the procedures above, choosing an appropriate bandwidth h is very crucial. As bandwidth h affects both bias and variance estimate, there is a trade-off between suitable bias and variance estimate. Bias correction requires the choice of a relatively small bandwidth, whereas variance estimate needs a large value of bandwidth. In principle, bandwidth selection is data driven, and traditional methods such as cross-validation approach may be applied to select a proper bandwidth h based on available data. However, as pointed out in Fan et al. (1995), this approach could perform poorly in some settings with a large magnitude of sample variation produced, hence it is not regarded as a sensible bandwidth selection rule for practical use. Instead, “plugging in” method may be a promising candidate for bandwidth selection. Fan et al. (1995) discussed this approach to handle local polynomial regression under the framework of generalized linear models. Ruppert et al. (1995) explored this method of bandwidth selection with local least squares regression.

Along with the same line we may derive an optimal bandwidth based on the asymptotic weighted mean integrated squared error (AMISE) of θ^h(u)


An optimal bandwidth is given by hopt = C × n−1/5, where C assumes a complex form depending on b*(u), Q1,*, and Q2,*. To reduce the computation burden, we specify C by an ad hoc way in our numerical experiments. See details in Section 5.

4.3 Interpretation of Coefficients α

The proposed models allow for the flexibility of accommodating complex dependence (e.g., curvature) of the responses on covariates. As is true in general statistical modeling, this flexibility is achieved at the cost of less transparent interpretation of the model parameters. In the formulation of the usual generalized linear models, the interpretation of the model is clearly reflected by the link function and the linear form of the coefficients. However, in the current development the inclusion of a nonlinear function θ(.) could dramatically change the meaning of the model, though nonzero values of α and β may indicate “significant” predictors of the response (Carroll et al. 1997). As coefficients α appear in a nonlinear function θ(.) whose form is unknown (this function may not even be monotone), interpretation of coefficients α is not as transparent as that of coefficients β. Here we particularly discuss the parameter interpretation when the link function g−1(.) in (1) is taken as logit.

In the usual setup of a logistic regression model with logit(μij)=xijTβ+zijTα, it is seen that the rth component of β is expressed as βr = logit{μij(xijr + 1)} − logit{μij(xijr)} with other covariates held fixed, where μij(xijr) stresses explicitly the dependence of μij on xijr. That is, the coefficient βr represents the change in the log-odds with one unit change in covariate xijr, given other covariates are fixed. Analogous interpretation applies to the coefficient αr. Alternatively, those coefficients may be represented by means of the partial derivatives. Let oddij = P(Yij = 1|xi, zi)/P(Yij = 0|xi, zi), then logit(μij) = log(oddij), and hence βr=xijr(logoddij), αr=zijr(logoddij). It is apparent that in model (1) log oddij does change linearly in xijr. However, it does not change linearly in zijr as zijr(logoddij) is generally not a constant. Interpretation of α therefore is not as straightforward as that in the usual logistic regression. To understand how this parameter affects the change in the response, we may use average derivatives (Chaudhuri et al. 1997; Huang and Liu 2006) to explain parameters α here. Let cij=(cij1,,cijq)T=E[zij(logoddij)] be the average or expected changes in the log-odds with respect to covariates zij while other covariates are held fixed. It can be seen that, unlike the property of the usual logistic regression such cij depends on the underlying distribution of covariates.

Denote dij=E[θ(1)(zijTα)], j = 1, ..., m, then cij = dijα, i.e., cijr = dijαr, r = 1, ..., q, leading to αst = cijs/cijt. Therefore, the ratio of αs to αt indicates the relative average change in the log-odds with respect to zijs and zjit while other covariates are held fixed. If taking cij1 as the baseline average change of the log-odds (with respect to covariate zij1), then αr=α1cijrcij1, and hence αr may be interpreted as α1 times of the relative average change in the log-odds (related to the change of zijr) as opposed to the baseline average change, with other covariates remained unchanged.

5 Numerical Studies

We now present the results for simulation experiments and real data analysis. In the implementation of the proposed method we use the standard normal density function as the kernel function for the nonparametric procedure in Step 1 of Stage 1. In realizing Step 1 of Stage 1, for given α^ and β^, we divide the range [mini,j(α^Tzij),maxi,j(α^Tzij)] into 50 equally spaced sub-intervals, and take the cutting points as the grids at which θ(.) is estimated. Typically, we choose a varying bandwidth at each iteration by taking h = C × n−1/5, where C is the sample standard deviation of {α^Tzij:i=1,,n;j=1,,m} for given α^ at each iteration.

5.1 Simulation Study

We conduct a simulation study to evaluate the performance of the proposed methods. Here we focus on pairwise association with higher order association being constrained as 0. That is, generate binary vector yi = (yi1, yi2, ..., yim)T from the joint density function (e.g., Yi and Thompson 2005)


where ρijk is the correlation coefficient of Yij and Yik, given by ρijk=(μijkμijμik)σijjσikk. The mean responses are modeled as logit(μij) = βxij + θ(α1zi1 + α2zi2 + α3zi3), where we take θ(t)=sin[π(t1.35536)(1.64533)] as in Carroll et al. (1997). For i = 1, ..., n, consider an exchangeable association structure with log ψijk = ϕ for j < k. Covariates xij are generated from the binomial distribution Bin(1, 0.5) and covariates zij are generated from the uniform distribution U[0, 1]. Set β = 0.3 and α1=α2=α3=13 as in Carroll et al. (1997). Various configurations of ψijk are considered to reflect different strengths of association. In particular, ψijk = 1 represents the scenario of independence structure within clusters. Set m = 4. Two hundred-fifty simulations are run for each parameter configuration. We conduct simulation on different sample sizes with n = 100 and n = 200.

Table 1 reports the differences (Bias) between the estimates and the true values, the empirical standard errors (SE) and the mean squared errors (MSE) for the regression and association parameters. The estimators for the mean and association parameters have reasonably small finite sample biases. It appears that finite sample biases for estimation of β and ϕ tend to become smaller as the sample size increases. It is not surprising that the empirical standard errors for the estimates of the linear coefficient β are smaller than those of α^. The mean squared errors for the estimators α^, β^ and ϕ^ appear reasonably small, though the mean squared errors for α^ tend to be larger than those of β^. As expected, larger sample size leads to smaller standard errors, and then smaller mean squared errors. This simulation demonstrates that the proposed methods give rise to reasonable estimates for both the mean and association parameters.

Table 1
Simulation Results from 250 Replications

5.2 An Example

We apply the proposed methods to analyze a subset of Genetic Analysis Workshops (GAW) 13 data arising from Cohort 2 of the Framingham Heart Study. The Framingham Heart Study is an ongoing prospective study of risk factors for cardiovascular disease (CVD). The objective of the Framingham Heart Study was to identify common factors or characteristics that contribute to CVD by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke.

In the analysis here we consider the data set consisting of 203 families each having 4 members with the baseline measurements. High blood pressure is an important risk factor for cardiovascular disease and is a leading cause of mortality in industrialized countries. As high blood pressure is a complex disorder that results from environmental and genetic factors and their interactions, and other study indicates that blood pressure increases with age (Kraft et al. 2003). It is of interest to study how blood pressure is influenced by the risk factors and how individuals within the same family may be associated. The covariates of interest include age, gender, high density lipoprotein (HDL) and body mass index (BMI) (BMI=weight (kg)/height2 (m2)). Let Yij = 1 if subject j in family i has high blood pressure, and Yij = 0 otherwise.

We consider a semiparametric regression model for the mean response


where xij is gender, taking value 1 for male and 0 otherwise, zij1 is age, zij2 is HDL, and zij3 is BMI. zij1, zij2 and zij3 are standardized as (zijr‾z..r)/s..r where ‾z..r and s..r represent the sample mean and standard deviation of zijrs, respectively, r = 1, 2, 3. Exchangeable association structure is modeled here with log ψi;jk = ϕ, for jk.

Here we conduct three analyses which mainly differ in the treatment of the covariance structure in estimation procedures. Analysis 1 takes Vi as the independence working matrix in both Steps 1 and 2; Analysis 2, following the spirit of Zeger and Diggle (1994), takes Vi as the independence working matrix in Step 1 but the true covariance structure Σi in Step 2; while Analysis 3 is the proposed method which takes Vi as the true covariance structure Σi in both Steps 1 and 2. In Table 2 we report the parameter estimates, standard errors and p-values for the three analyses. The estimates from the three analyses are fairly comparable. Except for HDL, Analysis 3 yields the smallest standard errors, and this agrees with Theorem 1. At significance level 0.05, the three analyses suggest that age plays an important role in predicting high blood pressure. As people get older, they are more prone to have higher blood pressure. There is no evidence that HDL has an effect on having high blood pressure. Strong evidence indicates that BMI has a statistically significant impact on high blood pressure. An individual with a larger BMI has a larger chance to have higher blood pressure. It is noted that there is no evidence to support an existing association among response measurements of family members.

Table 2
Analyses of a Family Data Set from the Framingham Heart Study

To understand if there is a curvature relationship between the response and covariate variables, we plot logit(μij) against the single index θ(.) respectively for female and male data with Analysis 3. The patterns are displayed in Figure 1. It is seen that there are nonlinear trends for both data sets. Thereby using a nonlinear term in the regression (8) for logit(μij) is perhaps more appropriate than using a linear term.

Figure 1
Estimated nonlinear curves for the family data from the Framingham Heart Study. Solid curve is the estimate of logit {P(blood pressure)} for females, and the dotted curve is the estimate of logit {P(blood pressure)} for males.

6 Incomplete Clustered Data

6.1 The Missing Data Process

Let Rij = I(Yij is observed) and Ri = (Ri1, Ri2, ..., Rim)T. Let πij = P(Rij = 1|Yi, xi, zi) be the probability that subject j in cluster i is being observed, given the response and covariates vectors for cluster i. Different missing data mechanisms have been distinguished (Little and Rubin 2002) based on how missing data processes depend on the responses. Here we focus the discussion on missing at random (MAR) mechanisms, where we assume P(Ri=1Yi,xi,zi)=P(Ri=1Yi(0),xi,zi). Here Yi(0) denotes the vector of the observed components of Yi.

Regression models are typically used to relate a function of Yi(0) and the covariates xi and zi to the probability πij. Namely, πij=η(Yi(0),xi,zi;δ), where η(.) is a known function, and δ is the vector of regression parameters.

We model the association among missing data indicators in the same manner as we do to the response components. That is, we define the odds ratio for subjects j and k in cluster i as


Under MAR let ϕ* be the regression parameters linking the odds ratios to the related covariates and observed responses, uijk, say.

Let πijk = P(Rij = 1, Rik = 1|Yi, xi, zi) be the joint probability for Rij and Rik, conditional on the responses and covariates. It is given by


where aijk=1(1ψijk)(πij+πik).

Now we describe the estimating equations for the parameters associated with the missing data process. Let Wi=[wijk] be the m × m matrix with wijj=πij(1πij) and wijk=πijkπijπik for jk. Define Si(δ,ϕ)=(πiTδ)Wi1(Riπi), then i=1nSi(δ,ϕ)=0 are unbiased estimating equations for parameters δ.

By the same spirit as in Section 3, the estimating functions for ϕ* can be written as Si(δ,ϕ)=(ξiTϕ)Wi1(Riξi), where Wi=diag(ξijk(1ξijk),j<k) is the m* × m* diagonal matrix with ξijk=E(RijRik=rik;Yi,xi,zi) given by


and ξi=(ξi12,,ξi1m,ξi23,,ξi,m1,m)T, and m* = m(m − 1)/2.

Setting i=1nSi(δ,ϕ)=0 and i=1nSi(δ,ϕ)=0 leads to the estimator δ^ and ϕ^. Denote π^ij=πij(δ^) and π^ijk=πijk(δ^,ϕ^), and these estimates may enter the estimating functions in Section 6.2 below for conducting estimation of β, α, and ϕ.

6.2 Inference for Response Parameters

To conduct estimation for mean parameters α and β and association parameters ϕ, we need to employ weighted estimating functions that accommodate missingness in the estimation procedures. Let Πi=diag(I(Rij=1)π^ij,j=1,2,,m) and Πi=diag(I(Rij=1,Rik=1)π^ijk,j<k) be the weight matrices. Then unbiased estimating functions for β,α, and θ are given by


Estimation of α, β and ϕ may proceed in the same manner as in Section 3 with the modifications to incorporate the weight matrices Πi and Πi. That is, (3),(4), and (5) are modified as


respectively, and (6) is replaced by (9) with α, β and θ(zijTα) replaced by α^, β^ and θ^(zijTα^,α^,β^), respectively. Here α^, β^ and ϕ^ denote the resultant estimators. Analogous to the proof of Theorem 2, we can derive the following asymptotic result.

Theorem 3

Under the conditions of Theorem 1, n{(α^α)T,(β^β)T,(ϕ^ϕ)T}T has a multivariate normal distribution with mean zero and a sandwich covariance matrix.

7 Discussion

In this paper we develop semiparametric approaches to analyze clustered data. Interest here lies in estimation of the association coefficients in addition to the marginal mean parameters. The simulation studies demonstrate that the proposed methods work well under various situations. In the current development we focus on modeling the mean response with semiparametric regression but the association structure with parametric regression. More generally, we may use a semiparametric specification to model the association structure


with an unknown smooth function η*(.). The extension to including (10) is straightforward by adapting the arguments in the paper, though a more complicated presentation is needed.

Our current work generalizes Carroll et al. (1997) from univariate case to multivariate data, and it also extends Wang et al. (2005) from a scalar covariate z to multiple covariates. Furthermore, the proposed methods incorporate estimation of association parameters, and apply to incomplete data. These features make the proposed methods attractive as they offer a useful tool to handle problems that are not addressed by existing research. The methods we describe here have applications in a wide variety of settings. For example, they can also be generalized to accommodating data with more complex association structures. In many situations, clustered data may arise from longitudinal studies. Clustered longitudinal data feature both a cross-sectional and a longitudinal correlation structure and interest often lies in the strengths of both types of association (Yi and Cook 2002). The proposed methods may be adapted to handle longitudinal data arising in clusters.


The authors gratefully acknowledge an associate editor and two referees, whose comments greatly improved the form and content of this work. The research of Yi and He was supported by the Natural Sciences and Engineering Research Council of Canada. The research of Liang was partially supported by the NIAD/NIH grants AI62247 and AI059773 and NSF grant DMS-0806097. The authors thank Boston University and NHLBI for providing the data set from the Framingham Heart Study (No. N01-HC-25195) in the illustration. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI.

Appendix 1: Conditions

Without exception detailed technical conditions are needed here to guarantee rigorous proofs. Below we just outline several key assumptions with the detailed list of conditions omitted. For more details see Carroll et al. (1997) and Wang et al. (2005).

  1. The density function of zij has a continuous second derivative on its support.
  2. The density function of zijTα is positive and uniformly continuous for α in a neighborhood of its true value.
  3. θ(2)(u) is continuous on its support.
  4. The random vector xij is assumed to have a bounded support.
  5. K(·) is a symmetric probability density function with bounded support.

In the following development the identities are valid to the order of op(an), where an = h2 + {log(n)/nh}1/2 + n−1/2.

Appendix 2: Proof of Theorem 1

We first introduce the following notation, which is similar to that in Wang et al. (2005). The major change is to replace Tij in Wang et al. (2005) with Uij=zijTα. To be specific, let fj(u) be the marginal density function for Uij, and fjl(u, v) be the joint density function for Uij and Uil (jl).



and ϕβ(u) is the solution of


where σjl is the (j, l)th element of Σ−1 and Δj is the (j, j)th element of the diagonal matrix Δ. Further, let


where Q1,[1](u, v) = 0, Q2,[1](u, v) = −Q(u, v), Q1,[k](u,v)=Q(u,v)+A(Q1,[k1];u,v), and Q2,[k](u,v)=A(Q2,[k1];u,v). As k → ∞, at convergence, θ^(u)θ(u) shares the same asymptotic structure as in (11) except that b[k], Q1,[k], and Q2,[k] are replaced by b*, Q1,*, and Q2*, where b(u)=θ(2)(u)W21(u)j=1mljE{ΔjvjlΔlb(Ul)Uj=u}fj(u), Q1,(u,v)=Q(u,v)+A(Q1,;u,v) and Q2,(u,v)=A(Q2,;u,v).

The proof of Theorem 1 can be completed following the spirit of Appendices A.3 and A.4 in Wang et al. (2005). However, great complexity is present in the current development as we have to deal with estimation of an additional parameter vector α. We now prove Theorem 1 with two steps. In the first step, we express the derivative of the estimator θ^(u,α,β) with respect to α, and in the second step, we establish the asymptotic distribution of bthe estimator (α^T,β^T)T. We note that it is the first step that distinguishes the current development from that in Wang et al. (2005).

First we work on the derivative of θ^(u,α,β) with respect to α. Expressing the first component of (3) in terms of any (α, β), we obtain


Differentiating with respect to α yields


Using the arguments similar to the proof of Theorem 5.1 in Ichimura (1993), we can show that the first, third and fourth terms in the first summand is op(1), and the second and third terms in the second summation is of order op(1). On the other hand,




It follows that as n → ∞, ϕ^α(u)ϕα(u) uniformly on u, where ϕα(u) is the solution of the limit form of (12). That is,


Recall (4) and (5):


where μ^i(α,β)=(μ^i1(α,β),,μ^im(α,β))T with the jth element g(xijTβ+θ^(Uij,α^,β^)). To ease the notation, let vecα,β={(α^α)T,(β^β)T}T. Note that elementwisely, we have the Taylor series expansion for μij(β,α)μ^ij(α,β) as follows.


Putting this in a matrix form, we obtain


where μi(1)=(μi1(1),,μim(1))T, θ^(Ui,α,β) and θ(Ui) are stacked vectors with the jth element being θ^(Uij,α,β) and θ(Uij), respectively, and b*c = (b1c1, ..., brcr) denotes the elementwise product of vectors b = (b1, ..., br)T and c = (c1, ..., cr)T.

Note that


After some tedious calculation, (13) can be simplified as


where A~i(V)={μi(1)(z~i,x~i)}TVi1{μi(1)(z~i,x~i)}n and B~i={μi(1)(z~i,x~i)}TVi1[μi(1){θ^(Ui,α,β)θ(Ui)}]. Let A~(V)=E[A~i(V)]=E{(z~,x~)TΔV1Δ(z~,x~)}, then








Adapting the arguments respectively concerning Bn and C2n in Wang et al. (2005), we can show that B1n = op(1) and B2n = op(1). Therefore, using the central limit theorem to (14), we show Theorem 1.

Appendix 3: Proof of Theorem 2

Let ξ^ijk be ζijk with θ(Uil) and β being replaced by θ^(zilTα^,α^,β^) and β^, respectively (l = j, k). Note that dijk is a function of θ(Uij), θ(Uik), β and ϕ. Applying a Taylor series expansion, we express ξijkξ^ijk as


which can further be expressed as


where vecα,β,ϕ={(α^α)T,(β^β)T,(ϕ^ϕ)T}T. After some algebra, we obtain


where B~i is the vector with the rth element


Analogously to the arguments in Appendix 2 for Bn, we can show that Bir=op(1). Now it remains to show the asymptotic distribution. Working on the estimating equations (4)-(6)


we obtain, after some algebra


Using the central limit theorem, we obtain that


where Ω* = E[A*−1Δ*diag(V−1, V*−1)Σ*diag(V−1, V*−1)Δ*A*−1], and Σ* is the covariance matrix of the vector (YiT,YiT)T.

Contributor Information

Grace Y. Yi, Department of Statistics and Actuarial Science, University of Waterloo, Canada N2L 3G1.

Wenqing He, Department of Statistical and Actuarial Sciences, University of Western Ontario, Canada N6A 5B7.

Hua Liang, Department of Biostatistics and Computational Biology, University of Rochester Medical Center, 601 Elmwood Avenue, Box 630, Rochester, NY 14642, USA.


[1] Aerts M, Claeskens G. Local polynomial estimation in multiparameter likelihood models. Journal of the American Statistical Association. 1997;92:1536–1545.
[2] Carey V, Zeger SL, Diggle PJ. Modelling multivariate binary data with alternating logistic regressions. Biometrika. 1993;80:517–526.
[3] Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. Journal of the American Statistical Association. 1997;92:477–489.
[4] Chaudhuri P, Doksum K, Samarov A. On average derivative quantile regression. The Annals of Statistics. 1997;25:715–744.
[5] Chen K, Jin Z. Local polynomial regression analysis of clustered data. Biometrika. 2005;92:59–74.
[6] Claeskens G, Aerts M. Bootstrapping local polynomial estimators in likelihood-based models. Journal of Statistical Planning and Inference. 2000;86:63–80.
[7] Fan J, Heckman NE, Wand MP. Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association. 1995;90:141–150.
[8] Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association. 2004;99:710–723.
[9] Fitzmaurice GM, Laird NM. A likelihood-based method for analysing longitudinal binary responses. Biometrika. 1993;80:141–151.
[10] Hall DB, Severini TA. Extended generalized estimating equations for clustered data. Journal of the American Statistical Association. 1998;93:1365–1375.
[11] Huang JZ, Liu L. Polynomial spline estimation and inference of proportional hazards regression models with flexible relative risk form. Biometrics. 2006;62:793–802. [PubMed]
[12] Ichimura H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econoemtrics. 1993;58:71–120.
[13] Kim S, Chen MH, Dey DK. Flexible generalized t-link models for binary response data. Biometrika. 2008;95:93–106.
[14] Kraft P, Bauman L, Yuan JY, Horvath S. Multivariate variance-components analysis of longitudinal blood pressure measurements from the Framingham heart study. BMC Genetics. 2003;4(Suppl 1):S55. [PMC free article] [PubMed]
[15] Liang H, Wang S, Robins JM, Carroll RJ. Estimation in partially linear models with missing covariates. Journal of the American Statistical Association. 2004;99:357–367.
[16] Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
[17] Lin XH, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001a;96:1045–1056.
[18] Lin XH, Carroll RJ. Semiparametric regression for clustered data. Biometrika. 2001b;88:1179–1185.
[19] Lipsitz SR, Laird NM, Harrington DP. Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika. 1991;78:153–160.
[20] Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. John Wiley & Sons, Inc.; New York: 2002.
[21] Molenberghs G, Lesaffre E. Marginal modelling of correlated ordinal data using an n-Way plackett distribution. Journal of the American Statistical Association. 1994;89:633–644.
[22] Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics, Simulation and Computation. 1994;23:939–951.
[23] Prentice RL. Correlated binary regression with covariates specific to each binary observation. Biometrics. 1988;44:1033–1048. [PubMed]
[24] Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association. 1995;90:1257–1270.
[25] Severini TA, Staniswalis JG. Quasilikelihood estimation in semiparametric models. Journal of the American Statistical Association. 1994;89:501–511.
[26] Wang N. Marginal nonparametric kernel regression accounting for within-subject correlation. Biometrika. 2003;90:43–52.
[27] Wang N, Carroll RJ, Lin X. Efficient semiparametric marginal estimation for longitudinal/clustered data. Journal of the American Statistical Association. 2005;100:147–157.
[28] Xia Y. A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics. 2007;35:2654–2690.
[29] Xia Y, Härdle W. Semi-parametric estimation of partially linear single-index models. Journal of Multivariate Analysis. 2006;97:1162–1184.
[30] Xia Y, Tong H, Li WK. On extended partially linear single-index models. Biometrika. 1999;86:831–842.
[31] Yi GY, Cook RJ. Marginal methods for incomplete longitudinal data arising in clusters. Journal of the American Statistical Association. 2002;97:1071–1080.
[32] Yi GY, He W, Liang H. Analysis of correlated binary data under partially linear single-index logistic models. Journal of Multivariate Analysis. 2009;100:278–290. [PMC free article] [PubMed]
[33] Yi GY, Thompson ME. Marginal and association regression models for longitudinal binary data with drop-outs: a likelihood-based approach. The Canadian Journal of Statistics. 2005;33:3–20.
[34] Zeger SL, Diggle PJ. Semi-parametric models for longitudinal data with applications to CD4 cell numbers in HIV seroconverters. Biometrics. 1994;50:689–699. [PubMed]
[35] Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986;42:121–130. [PubMed]