Technometrics. Author manuscript; available in PMC 2010 June 16.
Published in final edited form as:
Technometrics. 2010 February 1; 52(1): 124–136.
doi:  10.1198/TECH.2009.08161
PMCID: PMC2886314
NIHMSID: NIHMS208461

Partitioning degrees of freedom in hierarchical and other richly-parameterized models

SUMMARY

Hodges & Sargent (2001) developed a measure of a hierarchical model’s complexity, degrees of freedom (DF), that is consistent with definitions used for scatterplot smoothers, is interpretable in terms of simple models, and enables control of a fit’s complexity by means of a prior distribution on complexity. DF describes the complexity of the whole fitted model, but in general it is unclear how to allocate DF to individual effects. We give a new definition of DF for arbitrary normal-error linear hierarchical models, consistent with Hodges & Sargent’s, that naturally partitions the n observations into DF for individual effects and for error. The new conception of an effect’s DF is the ratio of the effect’s modeled variance matrix to the total variance matrix. This gives a way to describe the sizes of different parts of a model (e.g., spatial clustering vs. heterogeneity), to place DF-based priors on smoothing parameters, and to describe how a smoothed effect competes with other effects. It also avoids difficulties with the most common definition of DF for residuals. We conclude by comparing DF to the effective number of parameters pD of Spiegelhalter et al (2002). Technical appendices and a dataset are available online as supplemental materials.

Keywords: Degrees of freedom, hierarchical model, model complexity, prior distribution

1 Introduction and motivation

Hodges & Sargent (2001) developed a measure of a hierarchical model’s complexity, degrees of freedom (DF), that is consistent with the definition for scatterplot smoothers and interpretable in terms of simple models. DF describes complexity of the whole fitted model but in general it is unclear how to allocate DF to individual parts of a model. We use an example to introduce the problem of measuring the complexity of a model’s components.

Example 1. Periodontal measurements: clustering and heterogeneity

Periodontal attachment loss (AL) – the extent of a tooth’s root (in millimeters) that is no longer attached to surrounding bone by periodontal ligament – is used to diagnose and monitor periodontal disease. Below, we present analyses of AL measurements for 12 research subjects, for each of whom AL was measured on a quadrant of 7 teeth. On each tooth, three sites (distal, direct, and mesial) on both the buccal (cheek) side and lingual (tongue) side were measured, giving 7 × 3 × 2 = 42 sites per subject. Figure 1 is a schematic of one subject’s measurements.

Figure 1
Example 1, neighbor relations in periodontal data. “X” indicates a measurement site, lines indicate neighbor relations, teeth are numbered from the central incisor (1) to the second molar (7); MES: mesial site, DIR: direct site, DIS: distal ...

A site’s AL measurement is the sum of true loss and measurement error, which is substantial. Spatial correlation in true AL arises because if a person has poor hygiene in an area of the mouth, sites in that area may be more prone to loss and because bacterial infection, the ultimate cause of periodontal damage, is transmitted among the sites on a given tooth. Non-spatial heterogeneity arises from local features of the dentition, e.g., tooth malalignments that predispose a site to gum recession or features that make measurement difficult and affect all examiners similarly.

Gilthorpe et al (2003) used non-spatial random effects to merge the “linear” and “burst” theories of progressive attachment loss, while Reich & Hodges (2008a) used spatial random effects to improve detection of disease progression. The preceding paragraph, however, suggests a model with both spatial clustering and non-spatial heterogeneity in true AL. We consider a model commonly used by epidemiologists (Besag et al 1991). For simplicity, this section models one subject and one measurement at each site; Section 3.1 analyzes all 12 subjects, and the dataset is available online with the supplemental materials. Let $y \in \mathbb{R}^N$ be observations on $N$ sites on a spatial lattice with $G$ islands (unconnected groups of sites); $N = 42$ and $G = 1$ in our example. Model $y$ as multivariate normal with mean $\delta + \xi$ and error covariance $\sigma^2 I_N$. Non-spatial heterogeneity is captured by $\xi = (\xi_1, \dots, \xi_N)'$, modelled as $N(0, \tau_h^2 I_N)$. Spatial clustering is captured by $\delta$ as follows. Neighbor relations among sites are summarized in an $N \times N$ matrix $Q$, with off-diagonal entries $q_{ij} = -1$ if sites $i$ and $j$ are neighbors and 0 otherwise, while the diagonal entry $q_{ii}$ is the number of site $i$’s neighbors. Figure 1 shows the neighbor pairs we used; other choices are possible (e.g., Reich et al 2007). Then $\delta$ is modeled with an intrinsic conditional autoregressive model, CAR$(Q, \tau_c^2)$: conditional on $\tau_c^2$, $\delta$ has (improper) joint density

\[
f(\delta \mid \tau_c^2) \propto (\tau_c^2)^{-\frac{N-G}{2}} \exp\!\left( -\frac{1}{2\tau_c^2}\, \delta' Q \delta \right).
\]

This density is always improper because $1_N$ is an eigenvector of $Q$ with eigenvalue 0; the impropriety implicitly allows $\delta$ a non-zero intercept, which is made explicit below. Also note that $\delta' Q \delta = \sum_{j \sim i} (\delta_i - \delta_j)^2$, where $j \sim i$ means $j$ is $i$’s neighbor, so $Q$’s specification has the effect of shrinking neighbors toward each other.
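
Both facts about $Q$ can be checked directly. A minimal sketch, using a hypothetical 5-site chain rather than the paper's 42-site lattice:

```python
import numpy as np

# Hypothetical 5-site chain (the paper's lattice has N = 42 sites).
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]
N = 5

# Build Q: q_ij = -1 for neighbors, q_ii = number of neighbors of site i.
Q = np.zeros((N, N))
for i, j in pairs:
    Q[i, j] = Q[j, i] = -1.0
    Q[i, i] += 1.0
    Q[j, j] += 1.0

rng = np.random.default_rng(0)
delta = rng.normal(size=N)

# delta' Q delta equals the sum of squared neighbor differences.
assert np.isclose(delta @ Q @ delta,
                  sum((delta[i] - delta[j]) ** 2 for i, j in pairs))

# 1_N is an eigenvector of Q with eigenvalue 0, hence the impropriety.
assert np.allclose(Q @ np.ones(N), 0.0)
```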

The model’s components for clustering ($\delta$) and heterogeneity ($\xi$) are controlled by the variances $\tau_c^2$ and $\tau_h^2$ respectively. As $\tau_h^2$ increases, the $\xi_i$’s magnitudes increase but without a spatial pattern. The effect of increasing $\tau_c^2$ is determined by the spatial neighbor pairs. To see how, consider a simplified model without $\xi$, so $y = \delta + \varepsilon$, where $\delta \sim \mathrm{CAR}(Q, \tau_c^2)$ as above and $\varepsilon_i \sim N(0, \sigma^2)$, and assume $G = 1$. Given $r = \sigma^2/\tau_c^2$, the posterior mean and best linear unbiased predictor (BLUP) of $\delta$ is $\hat\delta = (I_N + rQ)^{-1} y$. Let $Q$ have spectral decomposition $Q = VDV'$, where $D = \mathrm{diag}(d_1, \dots, d_{N-1}, 0)$, $d_1 \ge \dots \ge d_{N-1} > 0$, and the orthonormal $V$ partitions as $V = (V_1 \mid \frac{1}{\sqrt N} 1_N)$, where $V_1$ is $N \times (N-1)$ (Hodges et al 2003). The single zero eigenvalue and conforming partition of $V$ arise because there is $G = 1$ island. Then $\hat\delta = V\hat\varphi$, where $\hat\varphi = (I_N + rD)^{-1} V'y$. Because $d_N = 0$, $\hat\varphi_N = \sqrt N\, \bar y$ regardless of $r$, while for $i < N$, $\hat\varphi_i = (V_1'y)_i / (1 + r d_i)$, which is shrunk toward zero by $r d_i > 0$. For the lattice in Figure 1, the three smallest nonzero eigenvalues, $d_{N-1}$, $d_{N-2}$, and $d_{N-3}$, are smaller than the largest, $d_1$, by factors of 248, 62, and 28 respectively; the corresponding columns of $V_1$ describe roughly linear, quadratic, and cubic trends along the long axis of Figure 1. Columns of $V_1$ corresponding to larger $d_i$ describe higher-frequency variation in the spatial structure, and their coefficients $\hat\varphi_i$ are shrunk toward zero to a greater degree for given $r = \sigma^2/\tau_c^2$. Thus for small $\tau_c^2$ the fitted values $\hat\delta$ mostly reflect damped large-scale structure (Reich & Hodges 2008b), and as $\tau_c^2$ increases, damping is reduced in all $\hat\varphi_i$ and the fit becomes increasingly wiggly.
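
This spectral shrinkage can be checked numerically. A sketch, again using a hypothetical 5-site chain in place of the paper's lattice ($r = \sigma^2/\tau_c^2$ is an assumed value):

```python
import numpy as np

# Hypothetical 5-site chain; the paper's lattice has N = 42.
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]
N = 5
Q = np.zeros((N, N))
for i, j in pairs:
    Q[i, j] = Q[j, i] = -1.0
    Q[i, i] += 1.0
    Q[j, j] += 1.0

rng = np.random.default_rng(0)
y = rng.normal(size=N)
r = 2.0                                  # r = sigma^2 / tau_c^2 (assumed)

# Direct BLUP: delta_hat = (I + rQ)^{-1} y.
delta_hat = np.linalg.solve(np.eye(N) + r * Q, y)

# Spectral route: Q = V D V'; coefficient i is shrunk by 1/(1 + r d_i).
d, V = np.linalg.eigh(Q)                 # ascending eigenvalues, d[0] = 0
phi_hat = (V.T @ y) / (1.0 + r * d)
assert np.allclose(V @ phi_hat, delta_hat)

# The coefficient on the constant eigenvector (d = 0) is not shrunk:
# it reproduces the sample-mean component exactly, for any r.
assert np.isclose(phi_hat[0], V[:, 0].sum() * y.mean())
```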

After fitting this model, any of several measures could be used to describe the whole model’s complexity, but the distinct contributions of the heterogeneity and clustering components are not well understood. This paper proposes a partition of a readily interpretable measure of complexity (Hodges & Sargent 2001, henceforth H&S). H&S used a model in which the observed data $y \in \mathbb{R}^d$ are assumed to be multinormal with mean $X_1\theta$ and covariance matrix $\Gamma_0$, with $X_1$ known and $\theta$ constrained by a model or prior distribution, $Z\theta \sim N(0, \Gamma_1)$, with $Z$ known. $\Gamma_0$ and $\Gamma_1$ are usually functions of a few unknowns. H&S reformulated this as a linear model

\[
Y = \begin{pmatrix} y \\ 0 \end{pmatrix}
  = \begin{pmatrix} X_1 \\ Z \end{pmatrix}\theta + \begin{pmatrix} \varepsilon \\ \delta \end{pmatrix}
  = X\theta + e,
\tag{1}
\]

where $e = (\varepsilon', \delta')' \sim N(0, \Gamma)$ for $\Gamma = \mathrm{diag}(\Gamma_0, \Gamma_1)$, so $\varepsilon$ and $\delta$ are independent a priori conditional on unknowns in $\Gamma$. (This formulation is equivalent to H&S’s, but we are about to discard it so we will not belabor the equivalence.) Define $X = (X_1', Z')'$; then given $\Gamma$, $X_1\theta$ has posterior mean or BLUP $X_1(X'\Gamma^{-1}X)^{-1}X_1'\Gamma_0^{-1}y$. H&S used linear model theory to rationalize defining the complexity or degrees of freedom (DF) of this model fit given $\Gamma$ as

\[
\rho = \mathrm{tr}\!\left[ X_1 (X'\Gamma^{-1}X)^{-1} X_1' \Gamma_0^{-1} \right],
\]

and showed ρ is the same as other measures of complexity when both are defined. Lu et al (2007) extended this definition to generalized linear hierarchical models.
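
To make $\rho$ concrete, consider ridge regression written as a two-level normal model. This sketch (toy matrices and assumed variances, not an example from the paper) checks that H&S's formula reproduces the familiar smoother DF, the trace of the ridge hat matrix:

```python
import numpy as np

# Ridge regression as a two-level model: y = X theta + eps,
# theta ~ N(0, tau2 I), eps ~ N(0, sigma2 I). In H&S's notation,
# X1 = X, Z = I_p, Gamma0 = sigma2 I_n, Gamma1 = tau2 I_p.
rng = np.random.default_rng(1)
n, p = 30, 5
X = rng.normal(size=(n, p))
sigma2, tau2 = 1.0, 4.0

# H&S DF: rho = tr[X (X'G0^{-1}X + G1^{-1})^{-1} X' G0^{-1}].
M = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
rho = np.trace(X @ M @ X.T / sigma2)

# Classic smoother DF: trace of the ridge hat matrix with
# penalty lam = sigma2 / tau2. The two agree exactly.
lam = sigma2 / tau2
H = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
assert np.isclose(rho, np.trace(H))
assert 0 < rho < p      # strictly between 0 and rank(X) when lam > 0
```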

H&S’s DF can be used for several purposes. Model complexity is used by model-comparison criteria, though for reasons discussed below, we do not consider such criteria. The original purposes of H&S were to describe the complexity of a fitted hierarchical model and to control that complexity by putting a prior distribution on it, inducing a prior on unknowns in Γ. (Section 2.6 below gives an accessible introduction to this idea.) Several authors have specified priors this way, e.g., Lu et al (2007), extending uniform shrinkage priors (Daniels 1999, Natarajan & Kass 2000), Paciorek & Schervish (2006), and Hodges et al (2007).

H&S’s approach has limitations that restrict its use. Except for special cases, e.g., balanced single-error-term ANOVA (Hodges et al 2007), it is unclear how to allocate DF to individual effects. We use examples to illustrate this limitation and later show the advantage of removing it.

Example 1 revisited

The model for $\delta$ can be re-expressed as $V_1'\delta \sim N(0, \tau_c^2 D_1^{-1})$, where $D_1 = \mathrm{diag}(d_1, \dots, d_{N-1})$. The models for $\delta$ and $\xi$ are then combined as in (1):

\[
\begin{pmatrix} y \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} I_N & I_N \\ V_1' & 0 \\ 0 & I_N \end{pmatrix}
  \begin{pmatrix} \delta \\ \xi \end{pmatrix}
+ \begin{pmatrix} \varepsilon \\ \varphi_c \\ \varphi_h \end{pmatrix}
\]

for $\varepsilon \sim N(0, \sigma^2 I)$, $\varphi_c \sim N(0, \tau_c^2 D_1^{-1})$, and $\varphi_h \sim N(0, \tau_h^2 I_N)$. H&S’s DF is then (Lu et al 2007)

\[
\rho = \mathrm{tr}\!\left(
\begin{pmatrix} I_N & I_N \\ I_N & I_N \end{pmatrix}
\begin{pmatrix}
I_N + \frac{\sigma^2}{\tau_c^2} Q & I_N \\
I_N & \left(1 + \frac{\sigma^2}{\tau_h^2}\right) I_N
\end{pmatrix}^{-1}
\right).
\]

We have assumed Q describes a connected map, so G = 1. After some labor, ρ becomes

\[
\rho = 1 + \sum_{i=1}^{N-1}
\frac{\tau_c^2\sigma^{-2}/d_i + \tau_h^2\sigma^{-2}}{\tau_c^2\sigma^{-2}/d_i + \tau_h^2\sigma^{-2} + 1}.
\]

While this suggests how total DF in the fit might be attributed to δ and ξ, any such partition must be rigorously justified (Section 2.5 below does so).
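
The block-matrix and eigenvalue forms of $\rho$ can be compared numerically. A sketch under stated assumptions (a hypothetical 5-site chain in place of the paper's map, and arbitrary variance values):

```python
import numpy as np

# Hypothetical 5-site chain with one island (G = 1).
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]
N = 5
Q = np.zeros((N, N))
for i, j in pairs:
    Q[i, j] = Q[j, i] = -1.0
    Q[i, i] += 1.0
    Q[j, j] += 1.0

sigma2, tau_c2, tau_h2 = 1.0, 2.0, 0.5   # assumed variance values

# Block-matrix form of rho (Lu et al 2007).
I = np.eye(N)
M = np.vstack([np.hstack([I + (sigma2 / tau_c2) * Q, I]),
               np.hstack([I, (1.0 + sigma2 / tau_h2) * I])])
J = np.block([[I, I], [I, I]])
rho_block = np.trace(J @ np.linalg.inv(M))

# Closed-form sum over the nonzero eigenvalues d_i of Q.
d = np.linalg.eigvalsh(Q)
d = d[d > 1e-10]                         # drop the single zero eigenvalue
rc, rh = tau_c2 / sigma2, tau_h2 / sigma2
rho_sum = 1.0 + np.sum((rc / d + rh) / (rc / d + rh + 1.0))
assert np.isclose(rho_block, rho_sum)
```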

Example 2. Global mean surface temperature: dynamic linear growth model

Summary measures of the earth’s surface temperature are used to describe global climate. Consider the series yt, t = 0, …, 124, of global average temperature deviations (units 0.01 degrees Celsius) from 1881 to 2005. Figure 2 plots yt, available at http://data.giss.nasa.gov/gistemp/tabledata/GLB.Ts.txt; Section 3.2 explains the rest of the plot. We smooth yt using the linear growth model, which captures variation using time-varying mean and trend and independent noise (West & Harrison 1997, Chapter 2). This model has equations for observation error, variation in local mean, and variation in trend respectively:

\[
\begin{aligned}
y_t &= \mu_t + n_t, & t &= 0, \dots, T, \\
\mu_t &= \mu_{t-1} + \beta_{t-1} + w_{1,t}, & t &= 1, \dots, T, \\
\beta_t &= \beta_{t-1} + w_{2,t}, & t &= 1, \dots, T-1.
\end{aligned}
\]
Figure 2
Example 2, global mean surface data, data and fitted smooths for gamma priors.

Let $\mu$, $\beta$, $n$, $w_1$ and $w_2$ be the vectors of $\mu_t$, $\beta_t$, $n_t$, $w_{1,t}$ and $w_{2,t}$ respectively. Assume $n$, $w_1$ and $w_2$ are mutually independent with $n \sim N(0, \sigma_n^2 I_{T+1})$, $w_1 \sim N(0, \sigma_1^2 I_T)$, and $w_2 \sim N(0, \sigma_2^2 I_{T-1})$, so $\sigma_n^2$, $\sigma_1^2$ and $\sigma_2^2$ describe respectively the size of the observational error $n_t$ and the smoothness of the level $\mu_t$ and trend $\beta_t$. The equation for $\beta_t$ is the simplest CAR model, while the model for updating $\mu_t$ is similar, so the variances $\sigma_1^2$ and $\sigma_2^2$ play roles like those of the variances in Example 1.
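
A minimal simulation of the linear growth model may help fix the indexing; the variance values here are assumed for illustration, not estimates from the temperature series:

```python
import numpy as np

# Simulate the linear growth model (hypothetical variances).
rng = np.random.default_rng(2)
T = 124
sigma_n, sigma_1, sigma_2 = 1.0, 0.3, 0.05

beta = np.empty(T)         # beta_0, ..., beta_{T-1}
mu = np.empty(T + 1)       # mu_0, ..., mu_T
mu[0], beta[0] = 0.0, 0.1
for t in range(1, T):      # trend: beta_t = beta_{t-1} + w_{2,t}
    beta[t] = beta[t - 1] + sigma_2 * rng.normal()
for t in range(1, T + 1):  # level: mu_t = mu_{t-1} + beta_{t-1} + w_{1,t}
    mu[t] = mu[t - 1] + beta[t - 1] + sigma_1 * rng.normal()
y = mu + sigma_n * rng.normal(size=T + 1)   # y_t = mu_t + n_t

assert y.shape == (T + 1,) and beta.shape == (T,)
```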

As in H&S, combine the three equations and reformulate them as a linear model:

\[
\begin{pmatrix} y_{(T+1)\times 1} \\ 0_{T\times 1} \\ 0_{(T-1)\times 1} \end{pmatrix}
= \begin{pmatrix}
I_{T+1} & 0_{(T+1)\times T} \\
Z_1 & -I_T \\
0_{(T-1)\times(T+1)} & Z_2
\end{pmatrix}
\begin{pmatrix} \mu \\ \beta \end{pmatrix}
+ \begin{pmatrix} n \\ w_1 \\ w_2 \end{pmatrix},
\]

where $Z_1 = [0_{T\times 1}, I_T] - [I_T, 0_{T\times 1}]$ and $Z_2 = [0_{(T-1)\times 1}, I_{T-1}] - [I_{T-1}, 0_{(T-1)\times 1}]$. Then

\[
\rho = \mathrm{tr}\!\left[
\begin{pmatrix} I_{T+1} & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix}
I_{T+1} + \frac{\sigma_n^2}{\sigma_1^2} Z_1'Z_1 & -\frac{\sigma_n^2}{\sigma_1^2} Z_1' \\
-\frac{\sigma_n^2}{\sigma_1^2} Z_1 & \frac{\sigma_n^2}{\sigma_1^2} I_T + \frac{\sigma_n^2}{\sigma_2^2} Z_2'Z_2
\end{pmatrix}^{-1}
\right].
\]

This is difficult to simplify and thus offers little intuition about how to attribute DF separately to the local mean and the local linear trend (but again, see Section 2.5).

Usually, as in these examples, hierarchical models specify multiple sources of variation in the data. The first example has three: spatial clustering, local heterogeneity, and error. In the second example, variation comes from perturbations in the trend and local mean and from observation error. In these examples, it would be desirable to attribute DF to these meaningful effects.

Such a partition of DF has at least two uses. First, it is an interpretable measure of each effect’s share of the fit’s complexity. Effects compete to explain variation when their design matrices are not orthogonal. In the absence of smoothing this produces well-known collinearity effects; smoothing complicates the situation. Smoothed effects compete with unsmoothed effects (e.g., Reich et al 2006) and with each other, which influences the severity of collinearity effects and the degree to which smoothed effects are made smooth. Allocating a fit’s DF to effects provides a view of this competition. For Example 1’s model, Eberly & Carlin (2000) used $\tau_c/(\tau_c + \tau_h)$ as a measure of clustering’s contribution to model fit. However, $\tau_c$ and $\tau_h$ are not commensurate: they are arbitrary, depending on the scaling of the design matrices for $\delta$ and $\xi$. Also, while the CAR’s variance parameter $\tau_c^2$ can be interpreted as an index of correlation between neighbor pairs, it does not describe that correlation in any precise sense (Wall 2004). DF, by contrast, is independent of scaling (see Section 2.4 below) and directly describes complexity in the fitted values arising from the two parts of the model, with separate DF for clustering ($\mathrm{DF}_c$) and heterogeneity ($\mathrm{DF}_h$). The change from prior to posterior in the distribution of $\mathrm{DF}_c/(\mathrm{DF}_c + \mathrm{DF}_h)$ shows the data’s evidence about the proportion of variation explained by clustering. Section 3.1 explores this.

Second, a partition of DF can be useful for specifying a prior distribution on variance parameters. For the linear growth model, West & Harrison (1997, Chapter 2) and Prado et al (2001) assumed fixed $\sigma_1^2/\sigma_n^2$ and $\sigma_2^2/\sigma_n^2$ to control complexity in the fit. It would be handy to avoid fixing these ratios, but because they are hard to think about, it is hard to specify priors for them. A prior on interpretable DF avoids this difficulty.

This paper proposes a new definition of DF, consistent with H&S, which attributes a fit’s total DF to individual effects. This definition involves a novel conception of an effect’s DF as, loosely speaking, the effect’s fraction of the model’s total variance. The new definition applies to any model with a mixed-effect form and normal distributions for random effects and errors, for example penalized splines in the mixed-model representation (Ruppert et al 2003) and at least some Gaussian process models. Section 2 defines the class of models, then gives and rationalizes the new definition. Section 3 applies it to the examples and explores its uses. The new definition is consistent with the one in Ruppert et al (2003, Section 8.3) except for DF for residuals (Ruppert et al 2003, Section 3.14). Ruppert et al (2003) also gave an approximation to an effect’s DF, saying “for all examples we have observed … there is practically no difference between the approximate and exact degrees-of-freedom values” (p. 176). Section 3 gives an example in which the approximation differs substantially from exact DF. Section 4 considers DF for residuals, in particular why the most common formula (Ruppert et al 2003, Section 3.14), when added to the DF in the fitted model, gives a total less than the sample size, an oddity avoided by the new definition. Section 5 concludes; technical results are in the Appendix, which is available online with the supplemental materials.

Model comparison criteria generally penalize a measure of model fit with a measure of model complexity, e.g., Vaida & Blanchard (2005) derived a penalty based on ρ for a conditional Akaike information criterion for linear mixed models. In defining the deviance information criterion (DIC), Spiegelhalter et al (2002) defined their penalty, the effective number of parameters pD, in terms of a measure of model fit. However, in view of Plummer (2008), it is not clear that the “right” penalty for using the same data to fit and evaluate a model is a simple function of any measure of model complexity, so that describing model complexity and comparing models are distinct problems. We therefore defer further consideration of model comparison, DIC, and pD to Section 5.

2 The new definition of DF

2.1 Model specification and notation

Consider a linear model, which we call “the standard model”, written as

\[
y = X_1\theta_1 + X_2\theta_2 + \varepsilon,
\tag{2}
\]

where $y \in \mathbb{R}^n$ contains the observations, $X_1 \in \mathbb{R}^{n\times p}$ is the design matrix for fixed effects (FE) that are not smoothed, and $X_2 \in \mathbb{R}^{n\times L}$ is the design matrix for smoothed effects, including random effects, effects representing spatial clustering, etc. The vectors $\theta_1 \in \mathbb{R}^p$ and $\theta_2 \in \mathbb{R}^L$ are mean-structure parameters. Partition $X_2$’s columns as $X_2 = [X_{21}, \dots, X_{2l}]$, $l \le L$, and conformably partition $\theta_2 = (\theta_{21}', \dots, \theta_{2l}')'$. Model the $j$th cluster of smoothed parameters $\theta_{2j}$ as $\theta_{2j} \mid \Gamma_{2j} \sim N(0, \Gamma_{2j})$, with the $\theta_{2j} \mid \Gamma_{2j}$ mutually independent; define $\Gamma_2$ as the block diagonal matrix with the $\Gamma_{2j}$ as the blocks on the diagonal. The normally distributed error $\varepsilon$ has mean 0 and covariance $\Gamma_0$. $\Gamma_0$ and the $\Gamma_{2j}$ are nonnegative definite and not necessarily proportional to the identity matrix or even diagonal. As the examples indicate, this is a large class of models. It includes Gaussian process models that can be written as $y = X_1\theta_1 + S + \varepsilon$, where $X_1\theta_1$ represents regressors, $X_2 = I_n$, $\theta_2 = S \in \mathbb{R}^n$ with $\Gamma_2 = \mathrm{cov}(S)$ non-diagonal and usually a function of unknowns, and $\mathrm{cov}(\varepsilon)$ is diagonal.

Further notation includes $R(X)$ and $N(X)$ for the column and null spaces of the matrix $X$, respectively, while $P_X$ is the orthogonal projection onto $R(X)$. Appendix (a) gives necessary facts about the Moore–Penrose generalized inverse, denoted $X^+$.

2.2 Motivation for the new definition

First, consider a linear model simpler than the standard model (2):

\[
y = X_2\theta + \varepsilon,
\tag{3}
\]

where θ is smoothed using the model or prior θ ~ N(0, Γ2), and ε ~ N(0, Γ0), for positive definite Γ0, Γ2. H&S wrote the model on θ as “constraint cases”, and combined it with (3) to give

\[
Y = \begin{pmatrix} y \\ 0 \end{pmatrix}
  = X\theta + \varepsilon^*
  = \begin{pmatrix} X_2 \\ I \end{pmatrix}\theta + \begin{pmatrix} \varepsilon \\ \xi \end{pmatrix},
\]

where ε* is multinormal with mean 0 and block diagonal covariance Γ = diag(Γ0, Γ2). H&S’s DF is

\[
\rho = \mathrm{tr}\!\left[ X_2 (X'\Gamma^{-1}X)^{-1} X_2' \Gamma_0^{-1} \right]
     = \mathrm{tr}\!\left[ X_2 (X_2'\Gamma_0^{-1}X_2 + \Gamma_2^{-1})^{-1} X_2' \Gamma_0^{-1} \right].
\tag{4}
\]

Rewrite the matrix inverse in (4) as (Schott 1997, Theorem 1.7)

\[
(X_2'\Gamma_0^{-1}X_2 + \Gamma_2^{-1})^{-1}
= \Gamma_2 - \Gamma_2 X_2' (X_2\Gamma_2X_2' + \Gamma_0)^{-1} X_2\Gamma_2.
\tag{5}
\]

The left-hand side of (5) is familiar from Bayesian linear models as the conditional posterior covariance of $\theta \mid y, \Gamma$, while the right-hand side is familiar as the conditional covariance of $\theta$ given $y$ from the joint multivariate normal distribution of $\theta$ and $y$. Use (5) to rewrite (4):

\[
\begin{aligned}
\rho &= \mathrm{tr}\!\left[ X_2\Gamma_2X_2'\Gamma_0^{-1} - X_2\Gamma_2X_2'(X_2\Gamma_2X_2'+\Gamma_0)^{-1}X_2\Gamma_2X_2'\Gamma_0^{-1} \right] \\
     &= \mathrm{tr}\!\left[ X_2\Gamma_2X_2'(X_2\Gamma_2X_2'+\Gamma_0)^{-1}\left(X_2\Gamma_2X_2'+\Gamma_0-X_2\Gamma_2X_2'\right)\Gamma_0^{-1} \right] \\
     &= \mathrm{tr}\!\left[ X_2\Gamma_2X_2'(X_2\Gamma_2X_2'+\Gamma_0)^{-1} \right],
\end{aligned}
\]

which has the form: trace [ratio of {modelled variance matrix} to {total variance matrix}]. This suggests defining an effect-specific DF as that effect’s contribution of variance to total variance. In the standard model (2), the mean-structure parameters $\theta_{2j}$ are independent of each other conditional on their covariances $\Gamma_{2j}$, and the variance in $y$ arising from the effect represented by $\theta_{2j}$ is $X_{2j}\Gamma_{2j}X_{2j}'$. Along with variation from the error $\varepsilon$, the total modelled variance for $y$ is $\sum_{1}^{l} X_{2j}\Gamma_{2j}X_{2j}' + \Gamma_0 = X_2\Gamma_2X_2' + \Gamma_0$. Thus, in this special case when all effects are smoothed ($\theta_1$ is null) and all $\Gamma_{2j}$ are positive definite, the reformulated DF for $X_{2j}$ is $\mathrm{tr}[X_{2j}\Gamma_{2j}X_{2j}'(X_2\Gamma_2X_2'+\Gamma_0)^{-1}]$.
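
Identity (5) and the resulting trace form of $\rho$ are easy to confirm numerically; a sketch with toy positive definite matrices (all values arbitrary):

```python
import numpy as np

# Check: tr[X2 (X2' G0^{-1} X2 + G2^{-1})^{-1} X2' G0^{-1}]
#      = tr[X2 G2 X2' (X2 G2 X2' + G0)^{-1}].
rng = np.random.default_rng(3)
n, L = 8, 4
X2 = rng.normal(size=(n, L))
A0 = rng.normal(size=(n, n))
G0 = A0 @ A0.T + n * np.eye(n)     # positive definite error covariance
A2 = rng.normal(size=(L, L))
G2 = A2 @ A2.T + L * np.eye(L)     # positive definite smoothing covariance

lhs = np.trace(
    X2 @ np.linalg.inv(X2.T @ np.linalg.inv(G0) @ X2 + np.linalg.inv(G2))
    @ X2.T @ np.linalg.inv(G0))
rhs = np.trace(X2 @ G2 @ X2.T @ np.linalg.inv(X2 @ G2 @ X2.T + G0))
assert np.isclose(lhs, rhs)
```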

Green (2002) used a different approach to derive an equivalent decomposition of the effective number of parameters pD when Γ0 and Γ2 are completely specified, in which case pD is identical to H&S’s DF. However, Green’s derivation does not appear to extend to cases in which Γ0 or Γ2 are functions of unknowns. The development above and below is also reminiscent of Gelman & Pardoe (2006), who developed measures of explained variance and pooling for each of the effects in a model (which they call “levels”), although their intent and results are rather different.

2.3 Definition of DF

Section 2.2’s reformulation of $\rho$ for model (3), with all effects smoothed, suggests a way to rewrite H&S’s DF as a sum of meaningful quantities, but for the standard model (2) the new formulation must accommodate the nonsmoothed effects $X_1$. A nonsmoothed effect can be viewed as the limit of a smoothed effect whose prior covariance goes to infinity and thus imposes no constraint. The new definition uses $\lambda\Gamma_1 \in \mathbb{R}^{p\times p}$ as $\theta_1$’s covariance matrix, where $\Gamma_1$ is unspecified but positive definite and $\lambda$ is a positive scalar. In the limit as $\lambda$ goes to $+\infty$, $\Gamma_1$ disappears.

For the standard model (2), then, define DF for nonsmoothed effects, denoted by DF(X1), as

\[
\mathrm{DF}(X_1)
= \lim_{\lambda\to+\infty} \mathrm{tr}\!\left[ X_1\lambda\Gamma_1X_1' \left( X_1\lambda\Gamma_1X_1' + X_2\Gamma_2X_2' + \Gamma_0 \right)^{+} \right]
= \mathrm{tr}[P_{X_1}] = \mathrm{rank}(X_1).
\]

Appendix (b) gives a proof. For the smoothed effect θ2j, DF is defined analogously:

\[
\begin{aligned}
\mathrm{DF}(X_{2j})
&= \lim_{\lambda\to+\infty} \mathrm{tr}\!\left[ X_{2j}\Gamma_{2j}X_{2j}' \left( X_1\lambda\Gamma_1X_1' + X_2\Gamma_2X_2' + \Gamma_0 \right)^{+} \right] \\
&= \mathrm{tr}\!\left\{ X_{2j}\Gamma_{2j}X_{2j}' \left[ (I-P_{X_1})(X_2\Gamma_2X_2'+\Gamma_0)(I-P_{X_1}) \right]^{+} \right\}.
\end{aligned}
\]

Appendix (c) gives a proof. Similarly, for the error term ε,

\[
\begin{aligned}
\mathrm{DF}(\varepsilon)
&= \lim_{\lambda\to+\infty} \mathrm{tr}\!\left[ I\,\Gamma_0\,I' \left( X_1\lambda\Gamma_1X_1' + X_2\Gamma_2X_2' + \Gamma_0 \right)^{+} \right] \\
&= \mathrm{tr}\!\left\{ \Gamma_0 \left[ (I-P_{X_1})(X_2\Gamma_2X_2'+\Gamma_0)(I-P_{X_1}) \right]^{+} \right\}.
\end{aligned}
\]

This follows by the argument in Appendix (c). In general this differs from residual degrees of freedom as defined in Ruppert et al (2003, Section 3.14) or Hastie & Tibshirani (1990, Section 3.5).

By this definition, the DF of a smoothed effect $X_{2j}$ or the error $\varepsilon$ is the fraction of variation contributed by $X_{2j}$ or $\varepsilon$ out of the variation not accounted for by the unsmoothed effects $X_1$. Because $\mathrm{DF}(X_{2j})$ can be laborious to compute, Cui (2008, p. 79) derived the approximation $\mathrm{DF}_p(X_{2j}) = \mathrm{tr}\{X_{2j}\Gamma_{2j}X_{2j}'[(I-P_{X_1})(X_{2j}\Gamma_{2j}X_{2j}'+\Gamma_0)(I-P_{X_1})]^{+}\}$, which is equivalent to the approximation in Ruppert et al (2003, pp. 175–176) when both are defined. Section 3.2 briefly considers this approximation’s accuracy.
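
A sketch of these definitions on toy matrices (dimensions and covariances all hypothetical); it also illustrates Property (DF.b) below, that the DF partition exhausts the sample size:

```python
import numpy as np

# Toy instance of the standard model with one smoothed effect.
rng = np.random.default_rng(4)
n, p, L = 10, 2, 4
X1 = rng.normal(size=(n, p))     # unsmoothed effects
X2 = rng.normal(size=(n, L))     # smoothed effects
G2 = 2.0 * np.eye(L)             # Gamma_2, smoothing covariance (assumed)
G0 = 0.5 * np.eye(n)             # Gamma_0, error covariance (assumed)

P1 = X1 @ np.linalg.pinv(X1)     # orthogonal projection onto R(X1)
I = np.eye(n)
Vtot = X2 @ G2 @ X2.T + G0       # smoothed-plus-error variance
Mplus = np.linalg.pinv((I - P1) @ Vtot @ (I - P1))

df_X1 = np.trace(P1)                        # = rank(X1)
df_X2 = np.trace(X2 @ G2 @ X2.T @ Mplus)    # smoothed effect
df_err = np.trace(G0 @ Mplus)               # error

assert np.isclose(df_X1, np.linalg.matrix_rank(X1))
# Property (DF.b): DF(X1) + DF(X2) + DF(eps) = dim(y).
assert np.isclose(df_X1 + df_X2 + df_err, n)
```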

2.4 Properties of the definition

We argue for the new definition’s validity by presenting four of its properties.

(DF.a)

The new definition is consistent with H&S’s DF: the sum of DF over all effects equals H&S’s ρ for the whole model. Appendix (d) gives a proof.

(DF.b)

The sum of DF in the model fit and error is fixed. When $\Gamma_0$ is positive definite, $\mathrm{DF}(X_1) + \sum_{j=1}^{J}\mathrm{DF}(X_{2j}) + \mathrm{DF}(\varepsilon) = \dim(y)$, the dimension of $y$. Appendix (e) gives a proof. For the definition of DF in residuals given by Ruppert et al (2003) and Hastie & Tibshirani (1990), $\mathrm{DF}(X_1) + \sum_{j=1}^{J}\mathrm{DF}(X_{2j}) + \mathrm{DF}(\text{residuals}) < \dim(y)$, with the deficiency arising from $\mathrm{DF}(\text{residuals})$.

(DF.c)

The DF of an effect has a reasonable upper bound: $\mathrm{DF}(X_{2j}) \le \mathrm{rank}((I-P_{X_1})X_{2j}) = \mathrm{rank}([X_1, X_{2j}]) - \mathrm{rank}(X_1) \le \mathrm{rank}(X_{2j})$ for $j = 1, \dots, J$. Appendix (f) gives a proof.

(DF.d)

Scale-invariance property. DF, and particularly priors on DF, avoid problems arising from the scaling of columns of the design matrix. (Section 2.6 discusses prior distributions on DF.) Suppose in the standard model (2) the covariance of each smoothed effect $\theta_{2j}$ and the covariance of the error term $\varepsilon$ are each characterized by a single parameter: $\Gamma_{2j} = \sigma_j^2\Gamma_{2j}^0$ and $\Gamma_0 = \sigma_0^2\Gamma_0^0$, where the $\sigma_j^2$ are unknown scalars and the $\Gamma_{2j}^0$, $\Gamma_0^0$ are known and positive definite. Let $B \in \mathbb{R}^{p\times p}$ be nonsingular, and let $H_j$ be a matrix such that $H_j\Gamma_{2j}^0H_j' = \Gamma_{2j}^0$, e.g., $\Gamma_{2j}^0$ is the identity and $H_j$ is orthogonal. Then the posterior of $X\theta$ arising from independent priors on $(\mathrm{DF}(X_{21}), \dots, \mathrm{DF}(X_{2l}))$ and $\sigma_0^2$ is the same when $X_1$ is transformed to $X_1^* = X_1B$ and $X_2$ is transformed so $X_{2j}^* = t_jX_{2j}H_j$ for nonzero scalars $t_j$. Appendix (g) gives a proof.

2.5 The new definition applied to the examples

Example 1. Model with clustering and heterogeneity

Assume the spatial map represented by $Q$ is connected, so $G = 1$. (This is not necessary.) The CAR model for the spatial clustering effects $\delta$ models $V_1'\delta$ as $N(0, \tau_c^2 D_1^{-1})$, while $1_N'\delta$ is not smoothed, so re-express this model in the standard form (2) by taking $X_1 = 1_N$, $X_{21} = V_1$, $X_{22} = I_N$, and $\theta_1 = 1_N'\delta/N$, $\theta_{21} = V_1'\delta$, $\theta_{22} = \xi$, $\Gamma_{21} = \tau_c^2 D_1^{-1}$, $\Gamma_{22} = \tau_h^2 I_N$, and $\Gamma_0 = \sigma^2 I_N$. The constraint cases become $\theta_{21} \sim N(0, \tau_c^2 D_1^{-1})$ and $\theta_{22} \sim N(0, \tau_h^2 I_N)$. Appendix (h) derives these expressions for $\mathrm{DF}(\delta)$ and $\mathrm{DF}(\xi)$:

\[
\begin{aligned}
\mathrm{DF}(\delta)
&= \mathrm{rank}(X_1) + \mathrm{tr}\!\left\{ X_{21}\Gamma_{21}X_{21}' \left[ (I_N-P_{X_1})(X_{21}\Gamma_{21}X_{21}'+X_{22}\Gamma_{22}X_{22}'+\Gamma_0)(I_N-P_{X_1}) \right]^{+} \right\} \\
&= 1 + \sum_{i=1}^{N-1} \frac{\tau_c^2\sigma^{-2}/d_i}{\tau_c^2\sigma^{-2}/d_i + \tau_h^2\sigma^{-2} + 1}; \\
\mathrm{DF}(\xi)
&= \mathrm{tr}\!\left\{ X_{22}\Gamma_{22}X_{22}' \left[ (I_N-P_{X_1})(X_{21}\Gamma_{21}X_{21}'+X_{22}\Gamma_{22}X_{22}'+\Gamma_0)(I_N-P_{X_1}) \right]^{+} \right\} \\
&= \sum_{i=1}^{N-1} \frac{\tau_h^2\sigma^{-2}}{\tau_c^2\sigma^{-2}/d_i + \tau_h^2\sigma^{-2} + 1}.
\end{aligned}
\]

Thus $\mathrm{DF}(\delta) + \mathrm{DF}(\xi)$ is H&S’s DF. The DF of each component is a function of both variance ratios ($\tau_c^2/\sigma^2$, $\tau_h^2/\sigma^2$), which sheds light on how the effects compete: if $\tau_c^2/\sigma^2$ increases for fixed $\tau_h^2/\sigma^2$, then $\mathrm{DF}(\delta)$ increases and $\mathrm{DF}(\xi)$ decreases. When there is more than one island ($G > 1$), it is easy to show that the DF for clustering and heterogeneity are the sums of the respective DF for each island.
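
This competition can be seen directly from the closed forms; a sketch on a hypothetical 5-site chain in place of the paper's map:

```python
import numpy as np

# Hypothetical 5-site chain; d holds the nonzero eigenvalues of Q.
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]
N = 5
Q = np.zeros((N, N))
for i, j in pairs:
    Q[i, j] = Q[j, i] = -1.0
    Q[i, i] += 1.0
    Q[j, j] += 1.0
d = np.linalg.eigvalsh(Q)
d = d[d > 1e-10]

def df_parts(rc, rh):
    """Closed-form (DF(delta), DF(xi)); rc = tau_c^2/sigma^2, rh = tau_h^2/sigma^2."""
    denom = rc / d + rh + 1.0
    return 1.0 + np.sum((rc / d) / denom), np.sum(rh / denom)

# Raising tau_c^2/sigma^2 at fixed tau_h^2/sigma^2 shifts DF from
# heterogeneity to clustering: the two effects compete.
d1c, d1h = df_parts(rc=0.5, rh=1.0)
d2c, d2h = df_parts(rc=5.0, rh=1.0)
assert d2c > d1c and d2h < d1h
```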

Example 2. Linear growth model

Write the model as a linear model in μ0, the initial local mean, β0, the initial local trend, w1, noise in the local mean, and w2, noise in the trend:

\[
\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_T \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & T \end{pmatrix}
  \begin{pmatrix} \mu_0 \\ \beta_0 \end{pmatrix}
+ \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 \\
1 & 0 & 0 & \cdots & 0 \\
1 & 1 & 0 & \cdots & 0 \\
1 & 1 & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
1 & 1 & 1 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} w_{11} \\ w_{12} \\ \vdots \\ w_{1T} \end{pmatrix}
+ \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & \cdots & 0 \\
1 & 0 & 0 & \cdots & 0 \\
2 & 1 & 0 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
T-1 & T-2 & T-3 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} w_{21} \\ w_{22} \\ \vdots \\ w_{2,T-1} \end{pmatrix}
+ \begin{pmatrix} n_0 \\ n_1 \\ \vdots \\ n_T \end{pmatrix}.
\]

In terms of the standard model, the unsmoothed effects are $\theta_1 = (\mu_0, \beta_0)'$, the smoothed effects are $\theta_2 = (w_1', w_2')'$, the smoothing covariances are $\Gamma_{21} = \sigma_1^2 I_T$ and $\Gamma_{22} = \sigma_2^2 I_{T-1}$, and the error covariance is $\Gamma_0 = \sigma_n^2 I_{T+1}$. $\mathrm{DF}(w_1)$, $\mathrm{DF}(w_2)$, and $\mathrm{DF}(n)$ follow straightforwardly, but the expressions do not simplify, so we omit them and illustrate their properties with some special cases.

When the local mean and trend do not vary ($\sigma_1^2 = \sigma_2^2 = 0$), the model reduces to a linear regression with intercept $\mu_0$ and slope $\beta_0$, with 2 DF; thus when $\sigma_1^2 > 0$ or $\sigma_2^2 > 0$, $\mathrm{DF}(w_1)$ and $\mathrm{DF}(w_2)$ describe complexity in the fit, beyond a linear trend, attributable to these two sources. Note that the matrix $(X_1 | X_{21})$ is saturated and $X_{22} = X_{21}B$ for a specific $B$. Thus if $\sigma_1^2 > 0$ and $\sigma_2^2 > 0$, $w_1$ and $w_2$ compete with each other. Table 1 shows how they compete for various ($\sigma_1^2$, $\sigma_2^2$); the noise variance $\sigma_n^2$ is fixed at 1 and $T$ at 124. When $\sigma_2 = 0$ (Table 1a), $\mathrm{DF}(w_1)$ increases as $\sigma_1$ grows, with analogous results when $\sigma_1 = 0$ and $\sigma_2$ grows. As for the clustering and heterogeneity model, for fixed $\sigma_1 > 0$ (Table 1b), $\mathrm{DF}(w_1)$ is reduced as $\sigma_2$ grows because larger $\sigma_2$ allows $w_2$ to compete more aggressively, so $\mathrm{DF}(w_2)$ grows. The analogous result occurs when $\sigma_1$ increases for fixed $\sigma_2$.
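
The design matrices above can be built for a small $T$ and the DF computed from Section 2.3's definition; a sketch with $T = 10$ and assumed variance values:

```python
import numpy as np

# Linear-growth design matrices for a small T (assumed variances).
T = 10
t = np.arange(T + 1)
X1 = np.column_stack([np.ones(T + 1), t])   # columns (1, t) for mu0, beta0
X21 = np.tril(np.ones((T + 1, T)), k=-1)    # entry (t, s-1) = 1 if s <= t
X22 = np.array([[max(ti - j - 1, 0) for j in range(T - 1)]
                for ti in range(T + 1)],
               dtype=float)                 # w2_{j+1} acts on y_t (t-j-1) times

sigma_n2, sigma_12, sigma_22 = 1.0, 0.1, 0.01
I = np.eye(T + 1)
P1 = X1 @ np.linalg.pinv(X1)                # projection onto R(X1)
V = sigma_12 * X21 @ X21.T + sigma_22 * X22 @ X22.T + sigma_n2 * I
Mplus = np.linalg.pinv((I - P1) @ V @ (I - P1))

df_w1 = np.trace(sigma_12 * X21 @ X21.T @ Mplus)
df_w2 = np.trace(sigma_22 * X22 @ X22.T @ Mplus)
df_n = np.trace(sigma_n2 * Mplus)

# DF(X1) = 2 (intercept and slope); the partition sums to dim(y) = T + 1.
assert np.isclose(2 + df_w1 + df_w2 + df_n, T + 1)
```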

Table 1
DF for w1 and w2 in the linear growth model in different scenarios (σn = 1)

2.6 Using priors on DF to induce priors on smoothing parameters

The most familiar notion of DF is for linear models with one error term (e.g., Weisberg 1985), in which a model’s DF is the fixed, known rank of its design matrix. As extended to scatterplot smoothers (e.g., Hastie & Tibshirani 1990), a fit’s DF is not fixed and known before the model is fit, but rather is a function of tuning constants chosen to control the fit’s smoothness. For DF as redefined here, e.g., in Example 1’s model, the vector of effect-specific DF, $(\mathrm{DF}(\delta), \mathrm{DF}(\xi))$, is a one-to-one function of the vector of variance ratios ($\tau_c^2/\sigma^2$, $\tau_h^2/\sigma^2$) and again is not fixed or known before the model is fit. Because of this one-to-one function, placing a prior distribution on ($\tau_c^2/\sigma^2$, $\tau_h^2/\sigma^2$) induces a prior on $(\mathrm{DF}(\delta), \mathrm{DF}(\xi))$. But it is equally legitimate to place a prior on $(\mathrm{DF}(\delta), \mathrm{DF}(\xi))$ to induce a prior on the unknown variance ratios. Such a prior has the advantage of being specified on the interpretable complexity measure instead of the usually more obscure variances or variance ratios.

This can be stated more precisely. In the standard model (2), suppose each smoothed effect $\theta_{2j}$ and the error term $\varepsilon$ have covariances characterized by one parameter, e.g., $\Gamma_{2j} = \sigma_j^2\Gamma_{2j}^0$ and $\Gamma_0 = \sigma_0^2\Gamma_0^0$, where the $\sigma_j^2$ are unknown scalars and the $\Gamma_{2j}^0$, $\Gamma_0^0$ are known and positive definite. Assume further that $R(X_{2j}) \not\subseteq R(X_1)$. Then $\mathrm{DF}(\cdot)$ is a 1-1 mapping between $q = (\mathrm{DF}(X_{21}), \dots, \mathrm{DF}(X_{2l}))'$ on $q$’s range and $s = (\log(\sigma_1^2/\sigma_0^2), \dots, \log(\sigma_l^2/\sigma_0^2))' \in \mathbb{R}^l$, so a prior on $q$ induces a unique prior on $s$. Appendix (i) gives a proof and the Jacobian. By Property (DF.d) in Section 2.4, under these assumptions the posterior distribution of $X_{2j}\theta_{2j}$ using a prior on $q$ is invariant under certain transformations of $X_{2j}$. Sometimes it is desirable to put a prior on functions of $q$, e.g., $\mathrm{DF}(\delta)/(\mathrm{DF}(\delta) + \mathrm{DF}(\xi))$ in Example 1 (discussed below). If an $l$-variate function of $q$, $u = (u_1(q), \dots, u_l(q))$, is a 1-1 mapping, then a prior on $u$ induces a unique prior on $s$.
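
For a single smoothed effect, the 1-1 mapping can be inverted numerically, which is how a prior on DF translates draws of DF into draws of the variance ratio. A sketch for the ridge setting (hypothetical $X$, with bisection standing in for the paper's Jacobian-based treatment):

```python
import numpy as np

# For y = X theta + eps, theta ~ N(0, tau2 I), the fit's DF is
# sum_i d_i/(d_i + lam) with lam = sigma2/tau2 and d_i the eigenvalues
# of X'X. DF is strictly decreasing in lam, so a target DF pins down lam.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))
d = np.linalg.eigvalsh(X.T @ X)

def df(lam):
    return float(np.sum(d / (d + lam)))

def lam_for_df(target, lo=1e-8, hi=1e8, iters=200):
    """Geometric bisection: find lam with df(lam) = target, 0 < target < rank(X)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if df(mid) > target:
            lo = mid          # lam too small: fit too complex
        else:
            hi = mid
    return np.sqrt(lo * hi)

lam = lam_for_df(3.0)
assert abs(df(lam) - 3.0) < 1e-6
```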

When $\Gamma_{2j}$’s form is more complex than assumed above, a prior on DF only partially specifies a prior on $\Gamma_{2j}$, and a complete prior specification requires a prior on other functions of $\Gamma_{2j}$. For example, if $\Gamma_{2j} = \mathrm{diag}(\sigma_{j1}^2, \sigma_{j2}^2)$, then a uniform prior on $\mathrm{DF}(X_{2j})$ induces a prior on a scalar function of ($\sigma_{j1}^2/\sigma_0^2$, $\sigma_{j2}^2/\sigma_0^2$). A complete specification requires a prior on another function of ($\sigma_{j1}^2/\sigma_0^2$, $\sigma_{j2}^2/\sigma_0^2$), e.g., on $\sigma_{j1}^2/\sigma_{j2}^2$.

3 Examples: Priors on DF, partitioning DF, competing effects

3.1 Periodontal measurements: clustering and heterogeneity model

We analyze AL data as described in Section 1 for twelve subjects without missing measurements, taken from a larger dataset of 410 subjects. The dataset is available online with the supplemental materials. Each site in each subject was measured once by each of two examiners, so each subject provided 84 AL measurements on 42 sites. Zhao (1999) investigated differences between examiners using all 410 subjects. She found tiny systematic differences between examiners. For each pair of examiners, the differences between the examiners’ measurements showed small spatial correlation, and the standard deviation of the differences was similar for the different examiner pairs. Thus for the present purpose we simply treat the two exams on each patient as providing two iid measurements at each site.

Let $y_{ijk}$ be the $k$th measurement of site $j$ for subject $i$, $i = 1, \dots, 12$, $j = 1, \dots, 42$, $k = 1, 2$. Model $y_{ijk}$ as $y_{ijk} = \mu + \alpha_i + \delta_{ij} + \xi_{ij} + \varepsilon_{ijk}$, where $\mu$ is the grand mean and $\alpha_i$ is subject $i$’s random effect, modelled as an independent $N(0, \sigma_\alpha^2)$ draw. The vector $\delta_i = (\delta_{i1}, \dots, \delta_{i42})'$ captures spatial clustering for subject $i$. We assume these are independent across $i$ and model each $\delta_i$ as CAR$(Q, \sigma_{c,i}^2)$, with neighbor pairs as in Figure 1. Heterogeneity effects $\xi_{ij}$ are modelled as $N(0, \sigma_{h,i}^2)$, assumed iid within subject. The $\varepsilon_{ijk}$ capture unsystematic measurement error and are modeled as independent $N(0, \sigma_0^2)$, with $\sigma_0^2$ common to all subjects.

Subject $i$ has smoothing variances $\sigma_{c,i}^2$ and $\sigma_{h,i}^2$ and thus $\mathrm{DF}(\delta_i)$ and $\mathrm{DF}(\xi_i)$, allowing subject-specific inferences about the relative contributions of clustering and heterogeneity to the fit. An obvious prior for this purpose is a prior on $f_i = \mathrm{DF}(\delta_i)/(\mathrm{DF}(\delta_i) + \mathrm{DF}(\xi_i))$, the fraction, for subject $i$, of the fitted complexity attributed to clustering. A flat prior on $f_i$ shows no preference between clustering and heterogeneity without constraining the total DF in subject $i$’s fit, $\mathrm{DF}(\delta_i) + \mathrm{DF}(\xi_i)$, so it could be considered natural. We used independent flat priors on $f_i$, $\mathrm{DF}(\delta_i) + \mathrm{DF}(\xi_i)$, and $\mathrm{DF}(\alpha)$. Table 2 shows the posterior means of the $f_i$ (the posterior medians are similar).

Table 2
Periodontal data: posterior mean DF for spatial clustering and site heterogeneity and posterior mean of fi, the fraction of DF for spatial clustering. The maximum possible DF for δi and ξi are, respectively, 41 and 42 for individual subjects ...

In Table 2, overall DF for spatial clustering has posterior mean 82.00 out of 492; subject-specific values range from 1.66 to 16.75 out of 41. Overall DF for heterogeneity has posterior mean 158.26 out of 503; subject-specific values range from 1.60 to 28.12 out of 42. The posterior mean of $f_i$ ranges from 0.18 to 0.52 with median 0.36, compared to the prior mean 0.50, so for most subjects the data indicate that heterogeneity should receive more fitted complexity than clustering. Compared to $\sigma_{c,i}/(\sigma_{c,i} + \sigma_{h,i})$ (Eberly & Carlin 2000), $f_i$ has a direct interpretation and accounts for all other sources of variation, i.e., because the measurement error variance $\sigma_0^2$ is common to all subjects, $f_i$ depends indirectly on $\sigma_{c,j}^2/\sigma_0^2$ and $\sigma_{h,j}^2/\sigma_0^2$ for $j \ne i$.

Are these results reasonable? Figure 3 shows the data and the posterior of f_i for subjects 11, 7, and 10, who have the smallest E(f_i|y) (subject 11) and the largest (subjects 7 and 10). Based on Section 1's interpretation, when DF are low for both clustering and heterogeneity, the data should have little large-scale trend or unpatterned noise, respectively. When DF are large for both, we should see big differences in local level between different parts of the mouth, and sizeable unpatterned variation around that large-scale trend. When DF is large for heterogeneity but small for clustering, we should see little large-scale trend but much unpatterned variation. Subjects 11 and 7 have similar posterior DF for clustering, about 6 DF, while subject 11 has much larger DF for heterogeneity, about 28 DF compared to about 6 DF for subject 7. Figure 3's data show similarly modest large-scale trend for these two subjects but much greater unpatterned variation for subject 11. Now compare subjects 7 and 10, who have the same E(f_i|y) while subject 10's posterior expected DF for both components are smaller by a factor of about 4. As expected, subject 10's data show much less trend and variation than subject 7's.

Figure 3
Example 1, AL data and histograms of draws from the posterior of f_i, the fraction of model complexity attributed to spatial clustering. The top two panels are for subject 11, the middle two panels are for subject 7, and the bottom two panels are for subject ...

The random effects α, δ, and ξ compete with each other: the column spaces for α and δ are orthogonal to each other but both are contained in the column space for ξ. To give a taste of how this affects the partitioning of DF, Table 3 shows results from simulated data like the periodontal dataset just analyzed. For simplicity, all 12 simulated subjects have the same σ_c² and σ_h² in both the data and the model. We fixed σ_0² at 1 and sampled 100 artificial datasets from each of 6 sets of true variance ratios (σ_α²/σ_0², σ_c²/σ_0², σ_h²/σ_0²), drawing new α_i, δ_i, and ξ_i for each artificial dataset. The analyses used a flat prior on (DF(α), DF(δ), DF(ξ)); flat priors on DF(α), DF(δ) + DF(ξ), and DF(δ)/(DF(δ) + DF(ξ)) give similar results. The DF allocated to subjects, DF(α), are similar in all cases considered and are not discussed further.

Table 3
True DF and average posterior median DF for clustering and heterogeneity, for various true (σ_α²/σ_0², σ_c²/σ_0², σ_h²/σ_0²), with 100 simulated datasets per scenario. “Est DF” is the average ...

In Table 3, consider the columns labelled “True DF”; these are DF as a function of the true variances. The model with both clustering and heterogeneity (the rows (1,1,1) and (0.25,0.25,0.25)) allocates less complexity to clustering than the model with only clustering (rows (1,1,0) and (0.25,0.25,0)). The presence of heterogeneity reduces the True DF for clustering proportionally more when the true variances are 1 (42% reduction) than when the true variances are 0.25 (21%). The presence of clustering has a similar effect on the True DF for heterogeneity (29% and 19% reductions, respectively). In other words, the two effects compete but to a lesser degree when each is constrained more by a smaller true variance. Now consider the columns labelled “Est[imated] DF”, which for each row in the table is the posterior median DF for each artificial dataset, averaged over the 100 artificial datasets. When heterogeneity is truly absent (rows (1,1,0) and (0.25,0.25,0)), the analysis nonetheless allocates some DF to heterogeneity, 35 and 32 when the variances are 1 and 0.25 respectively, and the estimated DF for the competing clustering effect are below the true values by 23 and 10 DF respectively. That is, some of heterogeneity’s DF are “stolen” from clustering, about 2/3 when the variances are 1 (and competition is more fierce) and about 1/3 when the variances are 0.25. An analogous but less pronounced effect occurs when clustering is truly absent. Section 5 discusses the implications of this finding.
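The competition in the "True DF" columns can be sketched directly from the variance-ratio conception of DF: an effect's DF is tr(V_total⁻¹ V_effect), so adding a competing heterogeneity term to the total variance shrinks the DF attributed to clustering. The snippet below is illustrative only: it uses a hypothetical path-graph ICAR structure in place of Figure 1's neighbor pairs, unit variances, and one subject with no replicates.

```python
import numpy as np

# True DF for clustering, with and without a competing heterogeneity term.
# Path-graph ICAR is a hypothetical stand-in for the real neighbor pairs.
m = 42
A = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
Q = np.diag(A.sum(axis=1)) - A
V_clus = np.linalg.pinv(Q)     # clustering covariance (sigma_c^2 = 1)
V_het = np.eye(m)              # heterogeneity covariance (sigma_h^2 = 1)
V_err = np.eye(m)              # error covariance (sigma_0^2 = 1)

def df(V_effect, V_total):
    # An effect's DF: trace of the "ratio" of its variance to total variance.
    return np.trace(np.linalg.solve(V_total, V_effect))

df_clus_alone = df(V_clus, V_clus + V_err)          # clustering only
df_clus_both = df(V_clus, V_clus + V_het + V_err)   # heterogeneity also present
print(round(df_clus_alone, 2), round(df_clus_both, 2))
```

As in Table 3, the clustering DF is smaller when heterogeneity is also in the model, because the two effects' column spaces overlap.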

3.2 Global mean surface temperature: linear growth model

Using the linear growth model to smooth the global mean surface temperature series, we consider two types of priors on (σ_n², σ_1², σ_2²): independent Gamma(0.001, 0.001) priors on each 1/σ_j², and DF-induced priors. Figure 2 plots the data and three fits (posterior means) arising from gamma priors. For one fit, the analysis used the design matrix's original scaling; for the other two, the columns of the design matrix were multiplied by 50 and 100. All smooths capture the increase from 1881 to 1940, the decrease from 1940 to 1970, and the increase after 1970. The fit with the original scaling smooths the most, with DF(w_1) + DF(w_2) having posterior mean 5.6. The gamma prior's effect is not invariant to the scaling of X's columns, so the posteriors differ noticeably when the same gamma prior is used with re-scaled design matrices. When X's columns have been multiplied by 100, the posterior means of DF(w_1) and DF(w_2) sum to 21.5. It was simply luck that in the “natural” scaling, the gamma prior gave a reasonable result.

Priors specified on DF, by contrast, avoid the problem of scaling. Instead, they control the fit's smoothness directly by constraining DF in the fit. Figure 4 plots fits from flat priors on (DF(w_1), DF(w_2)) with five different constraints on total DF in the smoothed effects. The fit becomes smoother as the constraint on total DF is reduced. Figure 5 shows histograms of 10,000 draws from the posterior of DF(w_1) and DF(w_2) arising from flat priors on them constrained so total smoothed DF < 6. Both posteriors are skewed, but in different directions. For the local mean, E(DF(w_1)|y) = 0.86; for the local slope, E(DF(w_2)|y) = 4.58.

Figure 4
Example 2, global mean surface data, DF prior with different sum constraints on total smoothed DF.
Figure 5
Example 2, global mean surface data, histograms of posterior DF for local mean (top) and local trend (bottom), for the flat prior on DF with sum of total smoothed DF constrained to < 6.

This example also gives an instance in which the approximate DF (Section 2.3; Ruppert et al 2003, Section 8.3) performs poorly. For the flat priors on (DF(w_1), DF(w_2)) with total DF constrained to be less than 6, the posterior means of exact and approximate DF for w_1 are 0.86 and 1.87 respectively, differing by more than 1 DF, while the posterior means of exact and approximate DF for w_2 are closer, 4.58 and 4.76 respectively. When total DF is constrained to be less than 10, the posterior means of exact and approximate DF are 2.97 and 5.13 for w_1 – differing by more than 2 DF – and 5.14 and 5.75 for w_2.

Other choices are possible for DF-based priors. For example, West & Harrison (1997) usually fix signal-to-noise ratio for smoothing variances, which amounts to point priors on the corresponding DF. This could be relaxed by specifying priors on DF with the same centers as these point priors, but with positive variances.

The Gamma(0.001,0.001) prior has been criticized on various grounds. Alternative priors on variances or standard deviations have been proposed; Gelman (2006) gives an influential critique and some alternatives. These alternatives are all scale-dependent, like the gamma prior. We do not, however, propose a flat prior on DF as a default: for this model and dataset, a flat prior on (DF(w1), DF(w2)) without a constraint on total DF gives a grossly undersmoothed fit. It is still true that overparameterized models often need strong prior information.

4 Degrees of freedom for residuals

The standard model (2), with Γ_0 set to σ_ε²I, can be re-expressed as

y = f + ε,   ε ~ N(0, σ_ε²I),   f ~ N(X_1θ_1, X_2Γ_2X_2′),

where f is a true but unknown function we wish to estimate, evaluated at the design values corresponding to the rows of X_1. This notation is most familiar for penalized splines represented as mixed linear models, as in Ruppert et al (2003), where f is the true smooth function relating a predictor x and dependent variable y. Ruppert et al (2003) appear to require X_1 to have full rank; this section assumes it does.

Treating σ_ε² and Γ_2 as known, the fitted values from this model – the BLUPs or conditional posterior means – are ŷ = Sy for

S = [X_1 X_2] ([X_1 X_2]′[X_1 X_2] + σ_ε² blockdiag(0, Γ_2⁻¹))⁻¹ [X_1 X_2]′;

S is a linear smoother. The DF in the fitted values ŷ is tr(S) under all published definitions.
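As a quick numeric sketch of this smoother, the toy example below (hypothetical design: X_1 = [1 x] fixed, a truncated-line basis X_2 random, Γ_2 = σ_u²I, with illustrative variance values) builds S and computes DF in the fit as tr(S).

```python
import numpy as np

# Toy BLUP smoother S for a penalized-spline-type mixed model
# (hypothetical design and variances, for illustration only).
rng = np.random.default_rng(1)
n = 50
x = np.linspace(0.0, 1.0, n)
X1 = np.column_stack([np.ones(n), x])              # full-rank fixed part
knots = np.linspace(0.1, 0.9, 8)
X2 = np.maximum(x[:, None] - knots[None, :], 0.0)  # truncated-line basis (x - kappa)_+

sig2_eps, sig2_u = 1.0, 10.0
C = np.hstack([X1, X2])                            # [X1 X2]
P = np.zeros((C.shape[1], C.shape[1]))
P[2:, 2:] = np.eye(len(knots)) / sig2_u            # blockdiag(0, Gamma_2^{-1})
S = C @ np.linalg.solve(C.T @ C + sig2_eps * P, C.T)

df_fit = np.trace(S)                               # DF in the fitted values
print(round(df_fit, 2))
```

The unpenalized X_1 block always contributes its full 2 DF; the penalized basis contributes between 0 and 8 DF depending on σ_ε²/σ_u².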

The most common definition for residual DF in linear mixed models is derived as follows (Ruppert et al 2003, Section 3.14; Hastie & Tibshirani 1990, Section 3.5). Assuming f is fixed, the mean squared error – the expectation of the residual sum of squares given Γ2, Γ0 – is

MSE(y | Γ_2, Γ_0, f) = E_ε[(ŷ − y)′(ŷ − y) | Γ_2, Γ_0, f]
                     = E_ε[y′(S − I)′(S − I)y | Γ_2, Γ_0, f]
                     = σ_ε² tr((S − I)′(S − I)) + f′(S − I)′(S − I)f
                     = σ_ε²(n + tr(SS) − 2tr(S)) + f′(S − I)′(S − I)f.
(6)

The term f′(S − I)′(S − I)f arises from bias. “Assuming the bias term … is negligible” (Ruppert et al 2003, p. 83), then by analogy with ordinary linear models, the DF in the residuals is n + tr(SS) − 2tr(S), and {residual sum of squares}/(n + tr(SS) − 2tr(S)) is an unbiased estimate of σ_ε². Since DF in the fit is tr(S),

(DF in fit) + (DF in residuals) = tr(S) + n + tr(SS) − 2tr(S) = n + tr(SS) − tr(S) < n.

The inequality holds because S is symmetric with eigenvalues in [0,1], at least one of which lies strictly between 0 and 1, so tr(SS) < tr(S). This raises the question: where are the missing degrees of freedom?

Taking the mixed-effect model (2) literally, f is random, so we can remove the conditioning on f in (6). To do so, we need a claim, proved in Appendix (j), that if f is distributed as normal with mean X_1θ_1 and covariance X_2Γ_2X_2′, then

E_f[f′(S − I)′(S − I)f | Γ_2, Γ_0] = tr((S − I)′(X_2Γ_2X_2′ + X_1θ_1(X_1θ_1)′)(S − I)) = tr((I − S)(σ_ε²S)).

We can now remove the conditioning on f. By the claim,

MSE(y | Γ_2, Γ_0) = E_f[MSE(y | Γ_2, Γ_0, f)]
                  = σ_ε²(n + tr(SS) − 2tr(S)) + E_f[f′(S − I)′(S − I)f | Γ_2, Γ_0]
                  = σ_ε²(n + tr(SS) − 2tr(S)) + tr((I − S)(σ_ε²S))
                  = σ_ε²(n − tr(S)).

By the same rationale used to define the usual residual DF, we can now define residual DF as n − tr(S), the same as the new definition of DF in Section 2.3. Therefore

(DF in fit) + (DF in residuals) = tr(S) + n − tr(S) = n,

as defined in Section 2; the missing degrees of freedom are in the bias term of mean squared error. The difference between the two definitions can be small or large. For the global mean surface temperature dataset, using the prior with total DF constrained to be under 6 and fixing the variance ratios at their posterior medians, residual DF is 115.7 under the widely-used definition and 117.6 under the new definition. However, for the AL data, fixing the variance ratios at their posterior medians, residual DF is 699.6 under the widely-used definition and 764.4 under the new definition.
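The gap between the two definitions can also be checked numerically. The sketch below uses a hypothetical toy design (not the paper's datasets): it builds the BLUP smoother S, computes residual DF under both definitions, and verifies by Monte Carlo that, once f is drawn from its mixed-model distribution, the mean residual sum of squares is close to σ_ε²(n − tr(S)).

```python
import numpy as np

# Residual DF under both definitions, plus a Monte Carlo check of
# MSE(y | Gamma_2, Gamma_0) = sigma_eps^2 (n - tr(S)).
# Hypothetical design: X1 = [1 x], truncated-line X2, Gamma_2 = sig2_u * I.
rng = np.random.default_rng(3)
n = 40
x = np.linspace(0.0, 1.0, n)
X1 = np.column_stack([np.ones(n), x])
knots = np.linspace(0.1, 0.9, 6)
X2 = np.maximum(x[:, None] - knots[None, :], 0.0)
theta1 = np.array([0.5, 2.0])
sig2_eps, sig2_u = 1.0, 4.0

C = np.hstack([X1, X2])
P = np.zeros((C.shape[1], C.shape[1]))
P[2:, 2:] = np.eye(len(knots)) / sig2_u
S = C @ np.linalg.solve(C.T @ C + sig2_eps * P, C.T)   # BLUP smoother

df_res_usual = n + np.trace(S @ S) - 2.0 * np.trace(S)  # widely-used definition
df_res_new = n - np.trace(S)                            # this paper's definition

# Monte Carlo: f ~ N(X1 theta1, sig2_u X2 X2'), then y = f + eps.
nsim = 20000
f = X1 @ theta1 + np.sqrt(sig2_u) * rng.standard_normal((nsim, len(knots))) @ X2.T
y = f + np.sqrt(sig2_eps) * rng.standard_normal((nsim, n))
rss = ((y - y @ S.T) ** 2).sum(axis=1)                  # residual sum of squares

print(round(df_res_usual, 2), round(df_res_new, 2), round(rss.mean(), 2))
```

The usual residual DF falls short of n − tr(S) by exactly tr(S) − tr(SS), the "missing" degrees of freedom in the bias term.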

The foregoing highlights an awkward element of contemporary non-Bayesian theory for mixed effect models, specifically those with random effects that do not meet the definition in, say, Scheffé (1959, p. 238: a random effect’s levels can “be regarded as a random sample from some population about which we wish to make our statistical inferences, rather than making them about the particular [levels] in the experiment”). Penalized splines, as in Ruppert et al (2003, Section 4.9), present the clearest example of this difficulty. The theory of penalized splines presumes the smooth function f is fixed but unknown (e.g., Ruppert et al 2003, p. 58); the spline is fit using a random-effect model because this gives a minimization problem formally identical to the minimization problem in the original definition of penalized splines, but also allows us to use all the usual mixed linear model tools. For penalized splines, it seems meaningless outside a Bayesian framework to take an expectation over the distribution of f, which is presumed fixed if unknown and is formally represented as a random effect for convenience only. Thus, the mean squared error given Γ0, Γ2, and f is (6), a quantity that is fixed but unknown because f is unknown. Although in expectation over f, total DF in the model and residuals add to the sample size n, for a given f this equality fails in general, so “the books don’t balance” as they do in conventional linear models.

The standard derivation of residual DF is intended to produce denominator DF for F-tests and is rationalized by the assumption that the squared length of bias is small. For uses of this machinery described by Ruppert et al (2003) and Hastie & Tibshirani (1990), in which the functions f are quite smooth, bias likely is small and this presumption seems reasonable as, indeed, it was for the global mean surface temperature example. However, methods tend to outgrow their inventors’ intentions, inevitably so for a scheme as inclusive as the one in Ruppert et al (2003), so this standard derivation will be extended to cases where bias is not necessarily small, e.g., for the AL data. Thus, a theoretically more tidy treatment would seem desirable; the present paper’s new definition of DF may provide such a treatment.

From a Bayesian viewpoint, Γ2 can describe either variation in θ2, for old-fashioned random effects, or uncertainty about θ2, for new-fangled random effects, and no problem arises because a probability distribution, once specified, obeys the same rules regardless of its rationale. However, except when designing experiments, a Bayesian viewpoint does not consider expectations over varying y, so the usual derivation of residual DF has no interest. Thus, a Bayesian interpretation cannot justify removing the conditioning on f in the usual derivation of residual DF.

Section 2’s new definition, while not explicitly Bayesian, does treat the covariance matrices Γ2j as Bayesians do, irrespective of their rationale. In the new definition, an effect’s DF describes that effect’s proportion of total variance, partitioning the sample size n naturally (property DF.b) while treating all sources of variation in the same manner, i.e., without singling out error variation for distinctive treatment as in the usual derivation of residual DF.
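As a minimal numeric illustration of this partition, under the variance-ratio conception an effect's DF is tr(V_total⁻¹ V_effect), and the DF for the effects and for error sum to n exactly. The setup below (one random effect, hypothetical Z and variances) is a sketch, not the paper's models.

```python
import numpy as np

# Each effect's DF = tr(V_total^{-1} V_effect); the pieces partition n.
# Hypothetical one-random-effect setup: V_u = sig2_u Z Z', V_err = sig2_eps I.
rng = np.random.default_rng(4)
n, q = 30, 5
Z = rng.standard_normal((n, q))
sig2_u, sig2_eps = 2.0, 1.0

V_u = sig2_u * Z @ Z.T                   # effect's modeled variance matrix
V_total = V_u + sig2_eps * np.eye(n)     # total variance matrix
df_u = np.trace(np.linalg.solve(V_total, V_u))                     # DF for the effect
df_eps = np.trace(np.linalg.solve(V_total, sig2_eps * np.eye(n)))  # DF for error

print(round(df_u + df_eps, 6))  # → 30.0
```

Note that df_u cannot exceed rank(Z) = q, and error variation is treated the same way as every other source of variation.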

5 Discussion

We have presented a new conception of degrees of freedom for models using random effects and normal probability distributions, anticipated partly by Green (2002). Although the resulting definition of DF arises from linear model theory, it defines a scale for complexity in the space of covariance structures and can be loosely interpreted as the fraction of variation attributed to individual effects. The new definition rationalizes partitioning the total DF in the dataset into pieces for effects in the model and for error. The latter avoids difficulties in the most common definition of residual DF (Section 4). Conceiving an effect’s DF as the fraction of variance attributed to the effect suggests a way to define DF for non-normal random-effect models, possibly avoiding the linear and normal approximations explicit in Lu et al (2007) and implicit in Ruppert et al (2003, p. 212).

The examples illustrated two uses of the new definition. The first is inducing prior distributions on smoothing variances. In some cases, such variances are meaningful quantities but priors on them are often nonintuitive and depend on the scale of the design matrix, as Section 3.2 illustrated. The examples showed how to put scale-invariant priors on interpretable quantities like DF or DF(δ)/(DF(δ) + DF(ξ)), inducing priors on the less intuitive smoothing parameters. The global mean surface temperature example showed how a prior on DF can control smoothness directly.

The new definition’s second use is as an interpretable measure of each effect’s contribution to the fit’s complexity. This sheds light on competition between effects, which is especially acute in cases where the competing effects have design matrices with column spaces that overlap substantially, as in both examples. Competition of this sort is a large topic that is only now being noticed (e.g., Reich et al 2006); tools like DF can help us describe and understand this phenomenon.

As a byproduct, the AL example gave an instance in which the approximate DF of an effect (Ruppert et al 2003, Section 8.3; Cui 2008, p. 79) performs poorly. It is almost certainly not an accident that this arose in a model with highly collinear effects, in contrast with the examples in Ruppert et al (2003). If the approximation fails only with severe collinearity, it may be possible to develop a diagnostic that indicates when the approximation is accurate, which would be useful given that it computes much faster than exact DF.

Section 3.1’s simulation found a tendency to allocate DF to an effect that was, in fact, absent. In practice, this may not be important because our methods would be used in the context of a larger effort of model building etc., and before fitting this model one would probably drop the heterogeneity term if the data indicated it was absent. Also, if the purpose of the analysis is better estimation of AL at individual sites, this is almost certainly not important. From a theoretical point of view, however, this problem needs to be explained. The proximate cause is that smoothers tend not to shrink effects to zero even when they should. (This is the sole virtue of the zero variance estimates that are frequently produced by maximizing the restricted likelihood.) The extent of undershrinkage depends on the prior and would be reduced by, say, a scaled Beta(0.1, 0.1) prior on the relevant DF, which pushes the posterior more toward the extremes of complete or no shrinkage. Perhaps the larger implication is a need to re-think these extravagantly overparameterized models, which are the problem’s ultimate cause. The difficulty almost certainly arises because of the extreme collinearity of the clustering and heterogeneity effects; partitioning degrees of freedom merely describes the consequence of this collinearity.

Finally, how does DF compare to the popular effective number of parameters pD? Recall that in view of Plummer (2008), we distinguish the problems of model complexity and model comparison. Thus DF’s purpose is to describe a model’s complexity in terms drawn from simple, well-understood models, making it possible to control complexity in fitting the model to a dataset by means of a prior distribution on complexity. Regarding pD’s interpretability, its definition (Spiegelhalter et al 2002, Section 2.3) is opaque and the only concrete interpretation Spiegelhalter et al (2002) offer is that in simple cases – where complexity measures already exist – pD takes values that people generally like. Second, our approach implicitly presumes that a model exists independently of any realized outcome y, so its complexity can be defined without a realized y. This is true of DF and of most complexity measures (e.g., Hastie & Tibshirani 1990, Ye 1998), but not pD, the definition of which requires a realized y through the point estimate θ̃ and the posterior mean of the deviance D(θ). In special cases, such as normal hierarchical models with all covariances known, pD does not depend on y and in these cases pD agrees with DF (Spiegelhalter et al 2002, Sections 2, 4; Green 2002). But with unknown variances – i.e., in practical situations – pD and DF diverge.

A complexity measure defined independently of the outcome y will, in general, be a function of unknowns. This is acceptable: statisticians routinely specify models as functions of unknowns – in Bayesian terms, conditional on unknowns – and it creates no difficulty to describe a model’s complexity as a function of its unknowns. The DF of a fitted model is obviously a function of the data, either through plug-in estimates or posterior distributions of the unknown parameters in the covariance matrices. But a complexity measure defined independently of y, like DF, allows a prior distribution on complexity to control softly the complexity of the fit. This is not possible with pD because in general it is defined in terms of a specific realized y.

It is not clear pD can be partitioned corresponding to components of the model, as DF can. Spiegelhalter et al (2002, Sections 6.3, 8.1) do partition pD corresponding to subdivisions of the deviance to create an outlier diagnostic from DIC, and in problems where the likelihood factors, DIC and pD partition analogously, e.g., for errors-in-covariates models. Green (2002) gives a partition of pD identical to ours in the special case where all covariance matrices are known, but it is not clear this can be extended to the case where covariances are unknown.

Thus, while pD’s relatively easy computation is certainly desirable, for our purposes this seems outweighed by its conceptual and practical disadvantages.

Acknowledgments

This work was supported in part by a Merck Company Foundation Quantitative Sciences Fellowship. The authors thank Dr. Bryan Michalowicz of the University of Minnesota School of Dentistry for permission to use the periodontal data and post it on Technometrics’s web site. The fourth author’s work was supported in part by National Cancer Institute grant 2-R01-CA095955-05A2.

References

  • Besag J, York JC, Mollie A. Bayesian image restoration, with two applications in spatial statistics (with discussion) Ann Inst Stat Math. 1991;43:1–59.
  • Cui Y. PhD dissertation. University of Minnesota Division of Biostatistics; 2008. Smoothing analysis of variance for general designs and partitioning degrees of freedom in hierarchical and other richly parameterized models. [PMC free article] [PubMed]
  • Daniels MJ. A prior for the variance in hierarchical models. Can J Stat. 1999;27:569–580.
  • Eberly LE, Carlin BP. Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models. Stat Med. 2000;19:2279–2294. [PubMed]
  • Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1:515–534.
  • Gelman A, Pardoe I. Bayesian measures of explained variance and pooling in multilevel (hierarchical) models. Technometrics. 2006;48:241–251.
  • Gilthorpe MS, Zamzuri AT, Griffiths GS, Maddick IH, Eaton KA, Johnson NW. Unification of the “burst” and “linear” theories of periodontal disease progression: A multilevel manifestation of the same phenomenon. J Dent Res. 2003;82:200–205. [PubMed]
  • Green P. Discussion of Spiegelhalter et al (2002) J Roy Stat Soc, Ser B. 2002;64:627–628.
  • Groetsch CW. Inverse Problems in the Mathematical Sciences. Vieweg: Wiesbaden; 1993.
  • Hastie TJ, Tibshirani RJ. Generalized Additive Models. Boca Raton FL: CRC/Chapman & Hall; 1990.
  • Hodges JS, Carlin BP, Fan Q. On the precision of the conditionally autoregressive prior in spatial models. Biometrics. 2003;59:317–322. [PubMed]
  • Hodges JS, Cui Y, Sargent DJ, Carlin BP. Smoothing balanced single-error-term analysis of variance. Technometrics. 2007;49:12–25.
  • Hodges JS, Sargent DJ. Counting degrees of freedom in hierarchical and other richly-parameterised models. Biometrika. 2001;88:367–379.
  • Lu H, Hodges JS, Carlin BP. Measuring the complexity of generalized linear hierarchical models. Can J Stat. 2007;35:69–87.
  • Natarajan R, Kass RE. Reference Bayesian methods for generalized linear mixed models. J Amer Stat Assn. 2000;95:227–237.
  • Ortega JM, Rheinboldt WC. Iterative Solution of Nonlinear Equations in Several Variables. New York: Academic Press; 1970.
  • Paciorek CJ, Schervish MJ. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics. 2006;17:483–506. [PMC free article] [PubMed]
  • Plummer M. Penalized loss functions for Bayesian model comparison. Biostatistics. 2008;9:523–539. [PubMed]
  • Prado R, West M, Krystal AD. Multichannel electroencephalographic analyses via dynamic regression models with time-varying lag-lead structure. J Roy Stat Soc, Ser C. 2001;50:95–109.
  • Reich BJ, Hodges JS. Modeling longitudinal spatial periodontal data: A spatially adaptive model with tools for specifying priors and checking fit. Biometrics. 2008a;64:790–799. [PubMed]
  • Reich BJ, Hodges JS. Identification of the variance components in the general two-variance linear model. J Stat Plan Inf. 2008b;138:1592–1604.
  • Reich BJ, Hodges JS, Carlin BP. Spatial analyses of periodontal data using conditionally autoregressive priors having two types of neighbor relations. J Amer Stat Assn. 2007;102:44–55.
  • Reich BJ, Hodges JS, Zadnik V. Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics. 2006;62:1197–1206. [PubMed]
  • Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. New York: Cambridge; 2003.
  • Schott JR. Matrix Analysis for Statistics. New York: Wiley; 1997.
  • Scheffé H. The Analysis of Variance. New York: Wiley; 1959.
  • Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion) J Roy Stat Soc, Ser B. 2002;64:583–639.
  • Vaida F, Blanchard S. Conditional Akaike information criterion for mixed effects models. Biometrika. 2005;92:351–370.
  • Wall MW. A close look at the spatial structure implied by the CAR and SAR models. J Stat Plan Inf. 2004;121:311–324.
  • Weisberg S. Applied Linear Regression. Wiley; New York: 1985.
  • West M, Harrison J. Bayesian Forecasting and Dynamic Models. 2. New York: Springer; 1997.
  • Ye J. On measuring and correcting the effects of data mining and model selection. J Amer Stat Assn. 1998;93:120–131.
  • Zhao YH. MS thesis. Division of Biostatistics, University of Minnesota; 1999. Examining mean structure, correlation structure, and differences between examiners in a large periodontal data set.