Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Ann Stat. Author manuscript; available in PMC 2010 November 18.
Published in final edited form as:
Ann Stat. 2009 October 1; 37(5A): 2377–2408.
doi:  10.1214/08-AOS662
PMCID: PMC2987698



In the analysis of cluster data the regression coefficients are frequently assumed to be the same across all clusters. This hampers the ability to study the varying impacts of factors on each cluster. In this paper, a semiparametric model is introduced to account for varying impacts of factors over clusters by using cluster-level covariates. It achieves the parsimony of parametrization and allows the explorations of nonlinear interactions. The random effect in the semiparametric model accounts also for within cluster correlation. Local linear based estimation procedure is proposed for estimating functional coefficients, residual variance, and within cluster correlation matrix. The asymptotic properties of the proposed estimators are established and the method for constructing simultaneous confidence bands are proposed and studied. In addition, relevant hypothesis testing problems are addressed. Simulation studies are carried out to demonstrate the methodological power of the proposed methods in the finite sample. The proposed model and methods are used to analyse the second birth interval in Bangladesh, leading to some interesting findings.

Keywords and phrases: Varying-coefficient models, local linear modelling, cluster level variable, cluster effect

1. Introduction

1.1. Preamble

Longitudinal data analysis has attracted considerable attention in the literature. For longitudinal data, the data from the same cluster are dependent with each other. As far as modelling is concerned, this within cluster dependency is usually accounted by random cluster effects and modelled by a within cluster correlation matrix. The within cluster correlation matrix plays a very important role in longitudinal data analysis as it can be used to improve the efficiency of the estimation. Actually, most of the existing literature is devoted to address how to make use of within cluster correlation matrix to improve the estimation for unknown parameters or functions.

The methodology for parametric based longitudinal data analysis is quite mature; see e.g. Diggle, Heagerty, Liang and Zeger [6] and the references therein. The situation with nonparametric based longitudinal data analysis is very different. One of the main difficulties is how to incorporate the within cluster correlation structure into the estimation procedure. Lin and Carroll [18] recommend that we ignore the within cluster correlation when kernel smoothing is employed. Welsh, Lin and Carroll [27] investigate the possibility of using weighted least squares based on the within cluster correlation structure when spline smoothing is used. They suggest that the weighted least squares estimator based on the true within cluster correlation structure works better than the estimator based on working independence when spline smoothing is used. Other literature about nonparametric longitudinal regression includes Zeger and Diggle [31], Brumback and Rice [2], Hoover et al. [14], Wu et al. [28], Martinussen and Scheike [21], Chiang et al. [3], Huang et al. [15], Wang [25], Fan and Li [8], Chiou and Müller [4], Wang, Carroll, and Lin [26], Qu and Li [23], Lin and Carroll [19], Sun et al. [24], Fan and Wu [10], among others.

Most of the literature assumes the regression parameters or functions are the same across all clusters. However, when the regression effects of some particular clusters are of interest, it is unreasonable to assume the regression parameters or functions are the same across all clusters. The interactions of regression effects with clusters are of interest. A naive method to address this is to let the regression coefficients or functions vary freely over clusters. But, this naive method will not be parsimonious particularly when the number of clusters is large and the issue of estimability arises. In addition, the within cluster dependency may be addressed by the random effect. This leads us to model these cluster-dependent regression coefficients or functions by using cluster level variables. It addresses simultaneously the parsimony and cluster dependency of modeling.

1.2. A motivating example

The data which stimulates this project is from Bangladesh about the second birth interval which is defined as the duration between the first birth and the second birth. The data come from the Bangladesh Demographic and Health Survey (BDHS) of 1996–97 (Mitra et al. [22]), which is a cross-sectional, nationally representative survey of ever-married women aged between 10 and 49. Of interest is how some factors which are commonly found to be associated with contraceptive use in Bangladesh, such as education and religion, affect the second birth interval. The data were collected from different districts (clusters) in the six different divisions of Bangladesh. Bangladesh is divided into six administrative divisions which are Barisal, Chittagong, Dhaka, Kulna, Rajshahi and Sylhet. The data from the same cluster are correlated with each other due to cluster-level factors such as cultural norms and access to family planning programmes. Of particularly interest is how the factors affect the second birth interval in some particular clusters. For example how these factors affect the second birth interval in a rural area in Chittagong division.

Some of the factors of interest are defined on the individual level such as education and are called individual level variables. Some of the factors are defined on the cluster level such as type of region of residence and are called cluster level variables. We use yij to denote the length of the second birth interval of the jth woman in the ith cluster, Xij to denote the vector of the corresponding individual level variables, Zi to denote the vector of the cluster level variables.

Frequently, the linear model


would be used to fit the data. The within cluster dependency would be accounted by that εij, j = 1, ···, ni, are correlated. The covariance matrix of εi = (εi1, ···, εini)T can be incorporated into the estimation procedure.

Model (1.1) would be fine if the interest focuses only on the global impact of the factors. However, if the picture for a particular cluster is of interest, (1.1) would not be adequate. Let’s take education as an example. It is evident that the impact of education in the cluster where Muslims predominate would be different from the cluster where Hindus predominate. To take the difference of this kind into account, we may relax the assumption imposed on (1.1) and allow the regression coefficients to vary over clusters. This leads to


While it accounts for the varying impact across clusters, (1.2) is not parsimnious. In fact, (1.2) involves pm+q regression coefficients, where p and q are the dimensions of Xij and Zi respectively. When m, the number of clusters, is large, there would be too many unknown parameters in model (1.2) for us to get reasonably accurate estimators. In longitudinal data analysis, we often come across large number of clusters. For example, there are 296 clusters in the second birth interval data set that stimulates this paper. If we use (1.2) to fit the data, we would face 296p + q unknown coefficients, and would certainly pay a big price on variances of the resulting estimators.

A sensible approach is to model the factor loadings ai by using cluster level variables. A reasonable model is


where A = (α1, ···, αq), and ei (i = 1, ···, m) are random effects with mean zero. This achieves simultaneously the parsimony and within cluster dependency, and the cluster-dependent factor loadings are allowed. In fact, the number of unknown coefficients in model (1.3) is p(q + 1) + q which is usually much smaller than pm + q.

A further extension of model (1.3) is to let the factor loadings vary with time as the society and technology evolve with time. By allowing the impacts varying with time, we come up with the model


where Uij is time, A(Uij) = (α1(Uij), ···, αq(Uij)). Model (1.4) is the model which we are going to address. It is a kind of varying coefficient model (Xia and Li [30]; Fan and Zhang [9, 12]; Zhang et al. [32]; Li and Liang [17]). To make (1.4) mathematically more clear and general, from now on Uij is not necessarily to be time, it can be any continuous covariate. This allows the nonlinear interaction of individual variables Xij and cluster level variable Zi with Uij. We assume εij is measurement error with mean 0 and variance σ2, and independent of Xij, Uij and Zi, and {ei} are i.i.d. random effects with mean 0p×1 and covariance matrix Σ, and independent of all other random variables. We assume { (XijT,Uij)T} are i.i.d., and so are {Zi}.

In (1.4), β(·), αk(·), k = 0, ···, q, are unknown functions to be estimated, and so are σ2 and Σ. Although (1.4) is stimulated by the second birth interval data, the modeling concept and estimation methodology, which this paper aims to explore, are equally applicable to other kinds of data, such as the data obtained from medicine and engineering.

The paper is organised as follows. We begin in Section 2 with a description of the estimation procedure for the proposed model (1.4). In Section 3 we establish the asymptotic properties of the proposed estimators. Hypothesis test associated with model (1.4) is discussed in Section 4. The performance of the method is assessed by a simulation study in Section 5. In Section 6, we use the proposed model and estimation procedure to analyse the data on the second birth intervals in Bangladesh, and explore how the impacts of the factors of interest on the length of second birth intervals in some particular clusters change over time.

2. Estimation procedure

In this section, we are going to construct the estimation procedure for the proposed model (1.4). We estimate the unknown functional coefficients first, then σ2 and Σ.

2.1. Estimation of functional coefficients

By Taylor’s expansion we have


when Uij is in a small neighbourhood of u, which leads to the following local least squares estimation procedure


where zi0 = 1, Kh(·) = K(·/h)/h, h is bandwidth, and K(·) is kernel function. We minimise L with respect to J, d, bk, ck, k = 0, ···, q, to get the minimiser Ĵ, d, bk, ĉk, k = 0, ···, q. We use bk to estimate αk(u), and Ĵ to estimate β(u). From now on, we denote bk by [alpha]k(u), and Ĵ by [beta](u).



where Ip is size p identity matrix, 1d is a d dimensional vector with each component being 1. By a simple calculation, we have




with ek,p denoting the unit vector of length p with 1 at position k, 0p×q the size p × q matrix with all entries 0 and s = (q + 1)p + q.

In practice, some coefficients in (1.4) are constant. Under such situation, model (1.4) becomes a semivarying coefficient mixed effects model. An interesting question is how to estimate the constant coefficients. Fan and Zhang [9] studied a varying coefficient model with coefficients having different degrees of smoothness. They proposed a two-stages estimation procedure. Based on their idea, we propose a very simple estimation procedure for the unknown constant coefficients. Suppose the jth component αkj(u) of αk(u) is a constant which is denoted by Ckj. We first pretend αkj(u) is a function and get the estimator [alpha]kj(Uil) of αkj(u) at Uil, l = 1, ···, ni, i = 1, ···, m, by the above estimation procedure. Then, take the average of [alpha]kj(Uil) over l = 1, ···, ni, i = 1, ···, m. This average


is our estimator of Ckj. We will show it later that the convergence rate of this estimator is of order OP (n−1/2) when the bandwidth is properly selected. This provides a simple method for estimating the constant coefficients.

A more efficient estimate for the constant coefficient can be obtained by using the profile likelihood method. See, for example, Lam and Fan [16]. For simplicity, we do not pursue this further.

2.2. Estimation of σ2 and Σ

Let ai(Uij)=α0(Uij)+k=1qαk(Uij)zik and ri = (ri1, ···, rini)T, where rij=yijXijTai(Uij)ZiTβ(Uij). Correspondingly, let âi(·) and [r with circumflex]i be their substitution estimators. Set


For each given i, we have the following linear model


The residual sum of squares of this linear model


would be the raw material for estimating σ2. The degree of freedom of rssi is nip. Let RSSi be rssi with ri replaced by [r with circumflex]i. Pooling all {RSSi}together leads to the estimator of σ2 as


Finally, we estimate Σ. From (2.4), we have the least squares estimator of ei as


which leads to


The last two terms are of order OP (m1/2), so they are negligible. Hence,


Therefore, we have


to estimate Σ. In (2.5), êi is i with ri replaced by [r with circumflex]i.

3. Asymptotic properties

In this section, we are going to present the asymptotic properties of the proposed estimators. For any p × p symmetric matrix A, we use vech(A) to denote the vector consisting of all elements on and below the diagonal of the matrix A, vec(A) to denote the vector by simply stacking the column vectors of matrix A below one another. Obviously, there exists a unique p2 × p(p + 1)/2 matrix Rp such that vec(A) = Rpvech(A).

We first introduce some notation. For any function or function vector g(·), we use ġ(·) and g(·) to denote its first and second derivatives, respectively. We use An external file that holds a picture, illustration, etc.
Object name is nihms248828ig1.jpg to denote the collections of all individual and cluster level covariates. Let μi = ∫ tiK(t)dt, νi = ∫ tiK2(t)dt, and


where Ω1(u) and Ω3(·) are respectively p(q +1)× p(q +1) and q × q submatrix of Ω(u). Without loss of generality, we will assume that μ0 = 1.

Our main asymptotic results are presented through the following 6 theorems. We leave the proofs of these theorems in Appendix. The first two theorems give the asymptotic normality of the estimated functional coefficients.

Theorem 1

Under the conditions (1)–(6) in Appendix, when nh5 is bounded, for k = 0, ···, q, we have






Theorem 2

Under the conditions (1)–(6) in Appendix, when nh5 is bounded, we have


where Λ3(u) = Ω3(u)−1 + Ω3(u)−1Ω2(u)TΛ1(u)−1Ω2(u)Ω3(u)−1, Θ2(u) = [Upsilon]2(u)Ξ1(u)[Upsilon]2(u)T, and [Upsilon]2(u) = (Λ2(u)T, Λ3(u)).

We now present the asymptotic normality for the parametric component. To present the asymptotic property of σ2, we assume that the following limits exist and finite: c1 = limm→∞ n/(nmp), c2 = limm→∞ m/(nmp) and


where “plim” denotes convergence in probability.

Theorem 3

Under the conditions (1)–(7) in Appendix, when nh8 → 0, we have


Theorem 3 suggests the estimator [sigma with hat]2 is of convergence rate OP (n−1/2) which is the optimal convergence rate of parametric estimator.

Additional notation is needed for presenting the asymptotic normality of Σ. Write


where Δ1(r), Σ(r) (r = 1, ···, p) denote the rth row of Δ1, Σ, respectively. Let


where Pi is a diagonal matrix generated from the diagonal elements of Pi.

Theorem 4

Under the conditions (1)–(7) in Appendix, when nh8 → 0, we have


where r = p(p + 1)/2 and


Theorem 4 shows the estimator [Sigma] also achieves the convergence rate of OP (n−1/2).

Theorem 5

When αkj(u) is a constant, under the conditions (1)–(6) and (8) in Appendix and nh4 → 0, we have




with ΓijT=(XijT,ZiTXijT,ZiT) being the jth row of matrix Γi.

When the jth (j = 1, ···, q) component βj(u) of β(u) is a constant, Theorem 5 applies to its estimator as well except that the variance is




Theorem 6

Under the conditions (1)–(6), (8) and (9) in Appendix with K(t) having a compact support [−c0, c0], and h = n−ρ, 1/5 ≤ ρ < 1/3, for all u [set membership] [a, b] we have




if K(c0) ≠ 0, and


if K(c0) = 0, K(t) is absolutely continuous and K2(t), (K(t))2 are integrable on (− ∞, +∞).


Theorem 6 gives the distribution of the maximum discrepancy between the estimated functional coefficient and true coefficient. It is the basis for constructing hypothesis test or confidence band. Theorem 6 also applies to the estimator of any component of β(·).

4. Confidence bands and hypothesis test

In this section, we will investigate how to construct the confidence bands for the functional coefficients in model (1.4).

For model (1.4), we often wish to know if an estimated functional coefficient is significantly different from zero or if the estimated functional coefficient is really varying. More generally, we wish to test


where α0(u) is a specific function. This kind of nonparametric testing problems can conveniently handled by using the generalized likelihood ratio method (Fan, Zhang and Zhang [11]). Instead, our test statistics will be based on the constructed confidence bands.

As the proposed confidence bands and test statistics involve the estimation of the biases and variances of the proposed estimators of the functional coefficients, we first construct the estimation procedure for the biases and variances. Throughout this paper, for any functional vector g(u), we use g(i)(u) to denote the ith derivative of g(u).

4.1. Estimation for bias and variance

Following Fan and Gijbels [7], by (2.2), we have for k = 0, ···, q, that


where R = (R11, ···, R1n1, ···, Rm1, ···, Rmnm)T with


By Taylor’s expansion, we have


This leads to


where θ(u) = (α0(u)T, ···, αp(u)T, β(u)T)T. The estimators of [theta with umlaut](u) and θ (3)(u) can be easily obtained by using local cubic fit with an appropriate pilot bandwidth h* (=O(n−1/7)) in Section 2.1 which is optimal for estimating [theta with umlaut](u). We denote the estimators by θ¨^(u) and [theta w/ hat](3)(u). Substituting them into (4.3), we obtain R and hence estimated biases bias^(α^k(u)D) and bias^(β^(u)D).

We now derive an estimator of variance of [alpha]k(u) and [beta](u). Noticing that the estimators are linear in Y, according to (2.2). Hence, we need only to estimate var(Y). A natural estimator is


This together with (2.2) give us the following estimators


with Ak and B defining in (2.2).

4.2. Confidence bands and hypothesis testing

We first state the Theorem based on which the confidence bands and hypothesis tests are constructed.

Theorem 7

Under the conditions (1)–(9) in Appendix with K(t) having a compact support [−c0, c0], and h = n−ρ, 1/5 ≤ ρ < 1/3. Furthermore, if αkj(3)(·) is continuous on [a, b] and the pilot bandwidth h* is of order n−1/7, then for all u [set membership] [a, b] we have


where ωn is exactly the same as that in Theorem 6.


If the component of β(3)(·) is continuous on [a, b], Theorem 7 applies to its estimator as well.

Based on Theorem 7, the 1 − α confidence bands of αkj(u) can be easily constructed as




It is worth to mention that Xia [29] investigated bias-corrected confidence bands for univariate nonparametric regression.

By Theorem 7, hypothesis (4.1) can be tested by the test statistic


and rejecting H0 when the test statistic exceeds the asymptotic critical value cα = −log{−0.5 log(1 − α)}. Similarly, we may want to ask the question whether a specific functional coefficient is really varying. This amounts to testing the composite null hypothesis


Based on Theorems 5 and 7, we test the problem (4.4) by computing the statistic


and rejecting H0 for large values of the test statistic.

5. Simulation study

In this section, we are going to use a simulated example to demonstrate how well the proposed estimation method works, and examine how much loss one could incur if ignoring the structure of ai(·) when the structure holds. We demonstrate the accuracy of the proposed estimators first.

In model (1.4), we take p = 3, q = 2 and m = 100. The cluster sizes ni (i = 1, ···, m) are generated by the integer part of |2ξ| + 6, ξ ~ N(0, 1). The covariantes {Xij} are independently generated from N(0, Ip), {Zi} are independently generated from N(0, Iq), and {Uij} are independently generated from U(0, 1). We also set the random effect ei following the normal distribution N(0p, Σ), measurement error εij following normal distribution N(0, σ2) where Σ = (σij) = 0.52Ip and σ = 0.5. We set β0(u) = sin(2πu) (intercept term), α0(u) = (α01(·), α02(·), α03(·))T a vector with each component being sin(2πu), α1(u) = (α11(·), α12(·), α13(·))T a vector with each component being cos(2πu), α2(u) = (α21(·), α22(·), α23(·))T a vector with each component being sin(πu), β(u) = (β1(·), β2(·))T a vector with each component being sin(2πu).

For any function or functional vector g(·), if ĝ(·) is an estimator of g(·), we define the mean integrated squared error (MISE) of ĝ(·) as


where ||b||2 = bTb and use it to assess the accuracy of the estimators.

The proposed estimation method is employed to estimate the functional coefficients. The kernel function is taken to be Epanechnikov kernel 0.75(1− t2)+ and bandwidth is taken to be 0.15. The MISEs of the proposed estimators of unknown functions are presented in Table 1, based on 100 simulations. The MSEs of the estimators of Σ and σ2 are presented in Table 2. From Tables 1 and and2,2, we can see the proposed estimators are quite accurate.

Table 1
The MISEs of the Estimators
Table 2
The MSEs of the Estimators

To visualize the accuracy of the proposed estimators, among the 100 simulations, we single out the one with median performance, and plot the estimated functions together with their 95% confidence bands in Fig. 1. It shows again the proposed estimators are very accurate.

Fig 1
The solid lines are the true curves, the dashed lines are the estimators, and the dotted lines are 95% confidence bands.

Now, we turn to examine how much loss one could incur when ignoring the structure of ai(·), namely examining the gain of our model (1.4). We now assume the cluster effects ei = 0. Given the sizes of the clusters in the above simulated example, many clusters would be too small for one to estimate the corresponding ai(·) when ignoring the structure of ai(·). So, we now assume all clusters share the same size of 50, and keep the number of clusters 100. Except the change on the sizes of the clusters and the assumption of ei = 0, all remain the same as the above simulated example.

We use MISE1,i to denote the MISE of the estimator of ai(·) obtained by the proposed estimation method, and MISE2,i the MISE of the estimator obtained without using the structure of ai(·), i.e., regarding ai(·) as a free unknown function and estimating it based on the first part of model (1.4). The ratio


is used to assess the loss incured on the estimation for a(·) = (a1(·), ···, am(·)) due to ignoring the structure of ai(·), i = 1, ···, m. We compute the RMISEs for different bandwidths, and plot the obtained RMISEs against the bandwidths in Fig. 2. It is clear, the RMISE is almost 0 when the bandwidth is less than 0.3, and it never goes beyong 0.25, which suggests the loss is significant. Similarly, we define the RMISE for β(·), and plot the RMISEs against different bandwidths. It again shows the loss incured on the estimation for β(·) due to ignoring the the structure of ai(·), i = 1, ···, m, is still significant.

Fig 2
The left panel is the plot of the RMISEs of a(·) against bandwidths, the middle panel is the plot of the RMISEs of β(·), and the right panel is the density function of the estimated random effects in real data analysis.

6. Real data analysis

The data we study come from the Bangladesh Demographic and Health Survey (BDHS) of 1996–97 (Mitra et al. [22]), which is a cross-sectional, nationally representative survey of ever-married women aged between 10 and 49. The analysis is based on a sample of 8189 women nested within 296 primary sampling units or clusters, with sample sizes ranging from 16 to 58. We allow the hierarchical structure of the data by fitting a two-level model with women at level 1 nested within clusters at level 2. A further hierarchical level is the administrative division; Bangladesh is divided into six administrative divisions which are Barisal, Chittagong, Dhaka, Kulna, Rajshahi and Sylhet. Effects at this level are represented in the model by fixed effects since there are only six divisions.

The dependent variable, yij, is the duration in months between the first birth and the second birth for the jth woman in the ith cluster. We consider several covariates which are commonly found to be associated with fertility in Bangladesh. The selected individual-level categorical covariates include the woman’s level of education (none coded by 0, primary or secondary plus coded by 1) denoted by xij1, religion (Hindu coded by 1, Muslim or other coded by 0) denoted by xij2, first child (boy coded by 1) denoted by xij3. Xij = (xij1, xij2, xij3)T. Another individual-level covariate is the year of marriage (Uij). We also consider two cluster-level variables, administrative division and type of region of residence (rural coded by 1, urban coded by 0). We take urban as the reference and the differences between urban and rural clusters are modeled by a dummy variables zi1. We take Barisal as the reference and the differences among the six administrative divisions are modeled by a set of dummy variables zil, l = 2, ···, 6.

The proposed model (1.4) is used to fit the data, and the proposed estimation procedure is employed to estimate αj(·) = (αj1(·), αj2(·), αj3(·))T, j = 0, ···, 6, and β(·) = (β1(·), ···, β6(·))T. The kernel involved in the estimation is taken to be Epanechnikov kernel, and the bandwidth is chosen to be 35% of the range of Uij.

First, we examine how strong the random effect on each cluster is. To this end, for each cluster, we estimate the random effect of this cluster. The density function of the norm of the random effects of all clusters is presented in Fig. 2. It is easy to see that the random effects are close to zero. This indicates that the within cluster correlation has been mainly accounted by the deterministic cluster effect in (1.4).

As we mentioned before, if a coefficient is treated as a function when it is constant, we would pay a price on the variances of the resulting estimators. Thus, for each coefficient in the model, the proposed hypothesis test is employed to test whether it is constant or not. The P-value for each coefficient is depicted in Table 3. Table 3 shows that α01(·), α02(·), α32(·), α42(·), α52(·), β4(·), and β5(·) are nonconstant, the others are constant. From now on, we use αij to denote constant function αij(·), βi constant function βi(·).

Table 3
The P-Values of the Coefficients being Constant

The former indicates the presentence of their nonlinear interactions with the year of marriage, while the latter shows no such interactions present.

The proposed estimation procedure is used to estimate the constant and functional coefficients. The estimated constant coefficients and their standard errors computed by the leave-one-cluster-out Jackknife are presented in Table 4, and the estimated functional coefficients together with their 95% confidence bands are presented in Fig. 3.

Fig 3
The solid lines are the estimated functional coefficients, and dotted lines are the 95% confidence bands.
Table 4
Estimated Constant Coefficients

It is visible, from Table 4, some constant coefficients are not significantly apart from zero. This has some practical indications. For example, α41 (its estimated value is −0.014 with standard error 0.016) not significantly apart from zero indicates the impact of eduction in the division of Kulna is not significantly different to that in Barisal.

As explained in the section of introduction, the proposed modelling and estimation methods mainly serve for the inference for a particular cluster. We are now going to use a cluster in the rural area in Chittagong to illustrate how the proposed method work.

Based on the model (1.4) and the estimators of the unknown coefficients involved, we have the impact of education on the second birth interval in a rural area in Chittagong


Similarly, we can get the impact of Hindu


and the impact of first child in this area


The functional coefficients â1(·) and â2(·) are presented in Fig. 3. The second one in the bottom panel in Fig. 3 is â1(·), which shows that in rural area in Chittagong, even the educated women still have shorter second birth interval than the uneducated women in urban area in Barisal before 1970. This indicates administrative division and type of region of residence play a very important role in fertility behavior in Bangladesh before 1970. It is also noticable that the second birth intervals of the educated women are getting longer.

From the third one in the bottom panel in Fig. 3, which is â2(·), we can see even in rural area in Chittagong, Hindus still have longer second birth interval than Muslims even in the urban area in Barisal after 1970. This suggests that religion plays an important role in fertility behavior in Bangladesh.

The impact â3(·) of boy in Rural in Chittagong is not varying with time. It is 0.128, which suggests even in rural area in Chittagong, the women with first child being boy still have longer second birth interval than the women with first child being girl or dead even in the urban area in Barisal. A possible interpretation is, like many developing countries, the Bangladesh culture always favors boy.


In this section, we will prove the asymptotic distributions of the proposed estimators. For easy description, we write




where s = (q + 1)p + q is defined in (2.2).

Moreover, for any function g(u) on the interval [a, b], define ||g|| = supu[set membership][a,b] |g(u)| and for any matrix A(u) = (aij(u))p×p, set


The following technical conditions are imposed to establish the asymptotic results:

  1. Eε114<, E||e1||4 < ∞, Ezj4<,Exi2d< for some d > 2, where ||e1||2=e1Te1, zj denotes the jth element of Z and xi denotes the ith element of X for j = 1, ···, q, i = 1, ···, p.
  2. [alpha umlaut]kj(·) is continuous in a neighborhood of u, for k = 0, ···, q, j = 1, ···, p, where [alpha umlaut]kj(·) is the jth element of [alpha umlaut]k(·), and assume [alpha umlaut]kj(u) ≠ 0; similarly, β̈l(·) is continuous in a neighborhood of u, for l = 1, ···, q, where β̈l(·) is the lth element of β(·), and β̈l(u) ≠ 0.
  3. The marginal density f(·) of U has a continuous derivative in some neighborhood of u, and f(u) ≠ 0.
  4. Each element of Ω(u) and Ξ1(u) are continuous in the neighborhood of u, and Ω(u) is positive definite at the point u.
  5. The function K(t) is a symmetric density function with a compact support.
  6. ni, i = 1, ···, m, are bounded. n → ∞, h → 0 and nh2 → ∞.
  7. E||(xiTxi)1||4+δ< for some δ > 0 and i = 1, ···, m, where ||·|| denotes Frobenium norm of a matrix.
  8. E[XilTXirΓilΓirTUil=u,Uir=v] is continuous in the neighborhood of u and v respectively for rl, r, l = 1, ···, ni, i = 1, ···, m.
  9. Ω(u) is bounded and has a continuous derivative on [a, b], τ(u) has a bounded derivative on [a, b], 0 < || τ(u)|| < ∞, and the first derivative of K(t) has a finite number of sign changes over its support.

Note that Ω(u) is automatically positive semidefinite at the point u, so the second part of condition (4) is easily satisfied.

To obtain the proof of the theorems, the following lemmas are required.

Lemma 1

Let {Uij } be i.i.d. random variables, {ξij} be identically distributed, and ξij be independent of ξlk for i ≠ l. Assume that the marginal density f(·) of U has a continuous derivative in some neighborhood of u, E(ξ11|U11 = u) is continuous in the neighborhood of u and E(ξ112)<. Let K(·) be a bounded positive function with a bounded support. Then when nh2 → ∞, for λ = 0, 1, 2, ···



Let Sn,λ=n1i=1mj=1niξij(h1(Uiju))λKh(Uiju) and g(v) = E(ξ11|U11 = v), then


by continuity of both the density function f(·) and the conditional expectation function g(·) in the neighborhood of u. Moreover, as K(·) is a bounded function with a bounded support, |uλK(u)| is bounded. Then by Jensen inequality it follows that




Lemma 2

Let {ξn, n ≥ 1} be an independent sequence with Eξn = 0, and


for any j ≥ 1 and some λ > 0, if i=1nEξi2, then on a possibly enlarged probability space, there exists a sequence of independent random variables {ηn, n ≥ 1}, ηn ~ N(0, var(ξn)) such that


where C is a positive constant.

See Lin and Lu [20], P.129, Theorem 2.6.3.

Proof of Theorem 1

It can be shown that








Moreover, it follows from Lemma 1 that




By condition (1)(5)(6), Lindeberg Feller Theorem, Slutsky theorem and inverse of block matrix, it follows that


Similarly, it can be shown that


Since Ln1 and Ln2 is independent, (Ln1 + Ln2) has the following asymptotic distribution:


By similar arguments as establishment of Lemma 1 and condition (2), we find that


This together with (A.1), we get that


Therefore, when nh5 is bounded, we obtain that


Proof of Theorem 2

It follows immediately from the proof of Theorem 1.

Proof of Theorem 3

Let Δri = [r with circumflex]iri, Qi = IniPi, and Qi be a diagonal matrix generated from the diagonal elements of Qi. Now


As Qi is an idempotent matrix and all the diagonal components of QiQi are equal to zero, by straightforward calculation it follows that




As Jn1=(nmp)1i=1m{l=1nir=1,rlniXilT(xiTxi)1Xirεilεir} is a sum of independent variables, by Lindeberg Feller Theorem, it follows that




Since the two terms are uncorrelated, we have that


In the following, we will show that the remaining parts from Jn3 to Jn7 in (A.3) satisfy n1/2Jnl = oP (1), l = 3, ···, 7.

By the conditional bias of [theta w/ hat](u) and law of large numbers, it follows from nh8 → 0 that


Since 0 ≤ [var phi]TQi[var phi][var phi]T[var phi] for any i and ni dimensional vector [var phi], we have


By the conditional bias and covariance matrix of [theta w/ hat](u), it follows that


where l=1niQirlQivl=δrv={1r=v0rv.



By similar expression as (A.1) and straightforward calculation, we have


Moreover, by boundness of kernel function and independence of the random errors, it can be shown that E{|Jn7||An external file that holds a picture, illustration, etc.
Object name is nihms248828ig1.jpg} = OP((nh)−1).

Therefore using Markov inequality, when h → 0, nh2 → ∞ we get


Combing the results from (A.3) to (A.6), we have


Proof of Theorem 4

Using standard arguments as in the proof of Theorem 3 and law of large numbers, when nh2 → ∞, the conditional bias of [Sigma] is


and by straightforward but tedious calculation and Lindeberg-Feller Theorem, when nh8 → 0, it follows that




Therefore, we have


Proof of Theorem 5

By simple calculation, it can be seen that


First, we obtain that E(Tn1) = 0, and for any j let


and F = (F11, ···, F1n1, ···, Fm1, ···, Fmnm)T, then we have


by Lemma 1, straightforward but tedious calculation, and law of large numbers. Therefore using condition (1)(5)(6) and Lindeberg Feller Theorem, it follows that


By the same way, it can be shown that


Moreover, combining the results similar to (A.1) and (A.2), we get


Therefore, by independence of Tn1 and Tn2, when nh4 → 0 we have


Proof of Theorem 6



First of all, we approximate the random matrix I1(u). As f(·) and Ω(·) have continuous derivatives on [a, b], by similar arguments as Lemma 1 and Neumann series expansion, it follows that


uniformly for u [set membership] [a, b] where S(u)=(μ000μ2)Ω(u).

By the asymptotic normality of n1hf(u)H1XTW(ε+xe) in the proof of Theorem 1, we get that


Therefore, using (A.7) and (A.8), it can be seen that


Next, we consider


Let wil=ekp+j,sTΩ(u)1Γil, then


Divide interval [a, b] into n subintervals Jr = [dr−1, dr), r = 1, 2, ···, n − 1, Jn = [dn−1, b] where dr=a+banr. Define Ũil = drI(Uil [set membership] Jr), r = 1, ···, n, and it is obvious that |UilŨil| = O(n−1). Then by law of large numbers for random sequence { wil(εil+XilTei)}, it follows that


uniformly for u [set membership] [a, b]. By the definition of Ũil, we have that


Let ζt=r=1ti=1ml=1niI(UilJr)wil(εil+XilTei)=i=1ml=1niI(aUil<dt)wil(εil+XilTei), ζ0 [equivalent] 0, then by Lemma 2, for any t = 1, ···, n and u [set membership] [a, b], we get


where W(·) is a Wiener process and


It follows from Abel’s transform that






uniformly for u [set membership] [a, b].

For a Wiener process, it is known that (Csörgö and Révész [5], P.44)


when δ is any small number. Using this property and the boundness of K(·), we have


uniformly for u [set membership] [a, b]. This together with (A.11) and (A.12), it follows that






For a Gaussian process, following the similar proof of Lemmas in Härdle [13], we have


Therefore, by (A.13) and (A.14),


From Theorem 2 and Theorem 3.1 of Bickel and Rosenblatt [1], when h = nρ, 1/5 ≤ ρ < 1/3, we have


where ωn is defined in Theorem 6. As


uniformly for u [set membership] [a, b] by straightforward calculation and similar arguments as Lemma 1. Using the same proof of the first part of Theorem 2 of Fan and Zhang [12], the result of Theorem 6 is easily obtained.

Proof of Theorem 7

To prove the theorem, we first derive the rate of convergence for the bias and variance estimators. By (A.7) and its similar arguments, we have


where the rate n−1/7 comes from the pilot estimation of θ¨^(·) and the term o(h) comes from the coefficient in front of [theta w/ hat](3)(·).

Furthermore, by similar proof as Lemma 1, we get


where S(u)=(ν000ν2)Ω(u), and


These results together with (A.7), results of Theorem 3 and 4, we have


Using (A.15) and the same proof of the second part of Theorem 2 of Fan and Zhang [12], the result of Theorem 7 is obtained.

Contributor Information

Wenyang Zhang, Department of Mathematical Sciences, University of Bath, UK.

Jianqing Fan, Department of Operation Research and Financial Engineering Princeton University, USA.

Yan Sun, School of Economics, Shanghai University of Finance and Economics, P.R.China.


1. Bickel PL, Rosenblatt M. On some global measures of the derivations of density function estimates. The Annals of Statistics. 1973;1:1071–1095.
2. Brumback B, Rice JA. Smoothing spline models for the analysis of nested and crossed samples of curves (with discussion) Journal of the American Statistical Association. 1998;93:961–994.
3. Chiang CT, Rice JA, Wu CO. Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables. Journal of the American Statistical Association. 2001;96:605–619.
4. Chiou JM, Müller HG. Estimated estimating equations: semiparametric inference for clustered/longitudinal data. Journal of the Royal Statistical Society, Series B. 2005;67:531–553.
5. Csörgö M, Révész P. Strong approximations in probability and statistics. Academic Press; New York: 1981.
6. Diggle PJ, Heagerty P, Liang KY, Zeger SL. Analysis of Longitudinal Data. Oxford University Press; 2002.
7. Fan J, Gijbels I. Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society, Series B. 1995;57:371–394.
8. Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of American Statistical Association. 2004;99:710–723.
9. Fan J, Zhang W. Statistical estimation in varying coefficient models. The Annals of Statistics. 1999;27:1491–1518.
10. Fan J, Wu Y. Semiparametric estimation of covariance matrices for longitudinal data. Journal of American Statistical Association. 2008 to appear. [PMC free article] [PubMed]
11. Fan J, Zhang C, Zhang J. Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics. 2001;29:153–193.
12. Fan J, Zhang W. Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scandinavian Journal of Statistics. 2000;27:715–731.
13. Härdle W. Asymptotic maximal deviation of M-smoothers. Journal of Multivariate Analysis. 1989;29:163–179.
14. Hoover DR, Rice JA, Wu CO, Yang LP. Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika. 1998;85:809–822.
15. Huang JZ, Wu CO, Zhou L. Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika. 2002;89:111–128.
16. Lam C, Fan J. Profile-Kernel likelihood inference with diverging number of parameters. The Annals of Statistics. 2008 to appear. [PMC free article] [PubMed]
17. Li R, Liang H. Variable selection in semiparametric regression modeling. The Annals of Statistics. 2008;36:261–286. [PMC free article] [PubMed]
18. Lin X, Carroll RJ. Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association. 2000;95:520–534.
19. Lin X, Carroll RJ. Semiparametric estimation in general repeated measures problems. Journal of the Royal Statistical Society, Series B. 2006;68:69–88.
20. Lin ZY, Lu CR. Strong approximation theorem (in chinese) Science Press; Beijing, China: 1992.
21. Martinussen T, Scheike TH. A Semiparametric Additive Regression Model for Longitudinal Data. Biometrika. 1999;86:691702.
22. Mitra SN, Al-Sabir A, Cross AR, Jamil K. Bangladesh and Demographic Health Survey 1996–1997. Dhaka and Calverton, MD: National Institute of Population Research and Training (NIPORT), Mitra and Associates, and Macro International Inc; 1997.
23. Qu A, Li R. Nonparametric modeling and inference function for longitudinal data. Biometrics. 2006;62:379–391. [PMC free article] [PubMed]
24. Sun Y, Zhang W, Tong H. Estimation of the covariance matrix of random effects in longitudinal studies. The Annals of Statistics. 2007;35:2795–2814.
25. Wang N. Marginal Nonparametric Kernel Regression Accounting Within-Subject Correlation. Biometrika. 2003;90:2942.
26. Wang N, Carroll RJ, Lin X. Efficient Semiparametric Marginal Estimation for Longitudinal/Clustered Data. Journal of the American Statistical Association. 2005;100:147157.
27. Welsh AH, Lin X, Carroll RJ. Marginal longitudinal nonparametric regression: Locality and efficiency of spline and kernel methods. Journal of the American Statistical Association. 2002;97:482–493.
28. Wu CO, Chiang CT, Hoover DR. Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. Journal of the American Statistical Association. 1998;93:1388–1402.
29. Xia Y. Bias-corrected confidence bands in nonparametric regression. Journal of the Royal Statistical Society Series B. 1998;60:797–811.
30. Xia Y, Li WK. On the estimation and testing of functional-Coefficient linear models. Statistica Sinica. 1999;9:735–758.
31. Zeger SL, Diggle PJ. Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics. 1994;50:689–699. [PubMed]
32. Zhang W, Lee SY, Song X. Local polynomial fitting in semivarying coefficient models. Journal of Multivariate Analysis. 2002;82:166–188.