
Technometrics. Author manuscript; available in PMC 2010 June 16.

Published in final edited form as: Technometrics. 2010 February 1; 52(1): 124–136. doi: 10.1198/TECH.2009.08161. PMCID: PMC2886314. NIHMSID: NIHMS208461.


**SUMMARY**

Hodges & Sargent (2001) developed a measure of a hierarchical model’s complexity, degrees of freedom (DF), that is consistent with definitions for scatterplot smoothers, interpretable in terms of simple models, and that enables control of a fit’s complexity by means of a prior distribution on complexity. DF describes complexity of the whole fitted model but in general it is unclear how to allocate DF to individual effects. We give a new definition of DF for arbitrary normal-error linear hierarchical models, consistent with Hodges & Sargent’s, that naturally partitions the *n* observations into DF for individual effects and for error. The new conception of an effect’s DF is the ratio of the effect’s modeled variance matrix to the total variance matrix. This gives a way to describe the sizes of different parts of a model (e.g., spatial clustering vs. heterogeneity), to place DF-based priors on smoothing parameters, and to describe how a smoothed effect competes with other effects. It also avoids difficulties with the most common definition of DF for residuals. We conclude by comparing DF to the effective number of parameters *p _{D}* of Spiegelhalter et al (2002). Technical appendices and a dataset are available online as supplemental materials.

**1 Introduction and motivation**

Hodges & Sargent (2001) developed a measure of a hierarchical model’s complexity, degrees of freedom (DF), that is consistent with the definition for scatterplot smoothers and interpretable in terms of simple models. DF describes complexity of the whole fitted model but in general it is unclear how to allocate DF to individual parts of a model. We use an example to introduce the problem of measuring the complexity of a model’s components.

Periodontal attachment loss (AL) – the extent of a tooth’s root (in millimeters) that is no longer attached to surrounding bone by periodontal ligament – is used to diagnose and monitor periodontal disease. Below, we present analyses of AL measurements for 12 research subjects, for each of whom AL was measured on a quadrant of 7 teeth. On each tooth, three sites (distal, direct, and mesial) on both the buccal (cheek) side and lingual (tongue) side were measured, giving 7 × 3 × 2 = 42 sites per subject. Figure 1 is a schematic of one subject’s measurements.

Figure 1. Example 1, neighbor relations in periodontal data. “X” indicates a measurement site, lines indicate neighbor relations, teeth are numbered from the central incisor (1) to the second molar (7); MES: mesial site, DIR: direct site, DIS: distal **...**

A site’s AL measurement is the sum of true loss and measurement error, which is substantial. Spatial correlation in true AL arises because if a person has poor hygiene in an area of the mouth, sites in that area may be more prone to loss and because bacterial infection, the ultimate cause of periodontal damage, is transmitted among the sites on a given tooth. Non-spatial heterogeneity arises from local features of the dentition, e.g., tooth malalignments that predispose a site to gum recession or features that make measurement difficult and affect all examiners similarly.

Gilthorpe et al (2003) used non-spatial random effects to merge the “linear” and “burst” theories of progressive attachment loss, while Reich & Hodges (2008a) used spatial random effects to improve detection of disease progression. The preceding paragraph, however, suggests a model with both spatial clustering and non-spatial heterogeneity in true AL. We consider a model commonly used by epidemiologists (Besag et al 1991). For simplicity, this section models one subject and one measurement at each site; Section 3.1 analyzes all 12 subjects, and the dataset is available online with the supplemental materials. Let **y** ∈ *R ^{N}* be observations on the *N* sites; the spatial clustering effect ** δ** has the intrinsic conditionally autoregressive (CAR) prior with density

$$f(\mathit{\delta}\mid {\tau}_{c}^{2})\propto {({\tau}_{c}^{2})}^{-{\scriptstyle \frac{N-G}{2}}}exp\left(-\frac{1}{2{\tau}_{c}^{2}}{\mathit{\delta}}^{\prime}\mathbf{Q}\mathit{\delta}\right).$$

Here **Q** is *N* × *N*, with diagonal entry *q _{ii}* equal to site *i*’s number of neighbors and off-diagonal entry *q _{ij}* equal to −1 if sites *i* and *j* are neighbors and 0 otherwise, and *G* is the number of islands (connected components) in the spatial map. This density is always improper because **1*** _{N}* is an eigenvector of **Q** with eigenvalue 0.
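The impropriety is easy to verify numerically. The sketch below (illustrative, not from the paper; it uses a chain of *N* = 5 sites) builds **Q** from a neighbor map and confirms that **1**_{N} lies in its null space:

```python
import numpy as np

# Build the ICAR precision matrix Q = D - A for a small connected
# neighbor map (here a chain of N = 5 sites), where A is the
# adjacency matrix and D holds each site's number of neighbors.
N = 5
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
Q = np.diag(A.sum(axis=1)) - A

# Rows of Q sum to zero, so 1_N is an eigenvector with eigenvalue 0
# and the density f(delta | tau_c^2) is improper.
assert np.allclose(Q @ np.ones(N), 0.0)

# For a connected map, 0 has multiplicity G = 1: the other N - 1
# eigenvalues d_i are strictly positive.
eigvals = np.sort(np.linalg.eigvalsh(Q))
assert np.isclose(eigvals[0], 0.0) and eigvals[1] > 1e-10
```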

The model’s components for clustering (** δ**) and heterogeneity (** ξ**) have smoothing variances ${\tau}_{c}^{2}$ and ${\tau}_{h}^{2}$, respectively.

After fitting this model, any of several measures could be used to describe the whole model’s complexity, but the distinct contributions of the heterogeneity and clustering components are not well understood. This paper proposes a partition of a readily interpretable measure of complexity (Hodges & Sargent 2001, henceforth H&S). H&S used a model in which observed data **y** ∈ *R ^{d}* are assumed to be multinormal with mean **X**_{1}** θ**, and the hierarchical model is re-expressed as the linear model

$$\mathbf{Y}=\left(\begin{array}{c}\mathbf{y}\\ \mathbf{0}\end{array}\right)=\left(\begin{array}{c}{\mathbf{X}}_{1}\\ \mathbf{Z}\end{array}\right)\mathit{\theta}+\left(\begin{array}{c}\mathit{\epsilon}\\ \mathit{\delta}\end{array}\right)=\mathbf{X}\mathit{\theta}+\mathbf{e},$$

(1)

where **e** = (** ε**′, ** δ**′)′ ~ *N*(**0**, **Γ**), with **Γ** block diagonal and cov(** ε**) = **Γ**_{0}. H&S defined the fit’s degrees of freedom as

$$\rho =tr[{\mathbf{X}}_{1}{({\mathbf{X}}^{\prime}{\mathbf{\Gamma}}^{-1}\mathbf{X})}^{-1}{\mathbf{X}}_{1}^{\prime}{\mathbf{\Gamma}}_{0}^{-1}],$$

and showed *ρ* is the same as other measures of complexity when both are defined. Lu et al (2007) extended this definition to generalized linear hierarchical models.
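As an illustrative check (ours, not from the paper), the following numpy sketch verifies for a toy model with one smoothed effect and arbitrary variances that *ρ* equals the trace of the familiar ridge-type smoother matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 6
X2 = rng.standard_normal((n, p))        # design of a smoothed effect
sigma2, tau2 = 1.5, 0.7                 # error and smoothing variances

# H&S's rho = tr[X2 (X' Gamma^{-1} X)^{-1} X2' Gamma_0^{-1}] for the
# constraint-case design X = [X2; I], Gamma = diag(sigma2*I, tau2*I).
middle = np.linalg.inv(X2.T @ X2 / sigma2 + np.eye(p) / tau2)
rho = np.trace(X2 @ middle @ X2.T / sigma2)

# The classical smoother DF is the trace of the hat matrix of the
# ridge-type fit y_hat = S y with penalty sigma2/tau2.
S = X2 @ np.linalg.inv(X2.T @ X2 + (sigma2 / tau2) * np.eye(p)) @ X2.T
assert np.isclose(rho, np.trace(S))
```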

H&S’s DF can be used for several purposes. Model complexity is used by model-comparison criteria, though for reasons discussed below, we do not consider such criteria. The original purposes of H&S were to describe the complexity of a fitted hierarchical model and to control that complexity by putting a prior distribution on it, inducing a prior on unknowns in **Γ.** (Section 2.6 below gives an accessible introduction to this idea.) Several authors have specified priors this way, e.g., Lu et al (2007), extending uniform shrinkage priors (Daniels 1999, Natarajan & Kass 2000), Paciorek & Schervish (2006), and Hodges et al (2007).

H&S’s approach has limitations that restrict its use. Except for special cases, e.g., balanced single-error-term ANOVA (Hodges et al 2007), it is unclear how to allocate DF to individual effects. We use examples to illustrate this limitation and later show the advantage of removing it.

The model for ** δ** can be re-expressed as
${\mathbf{V}}_{1}^{\prime}\mathit{\delta}\sim N(\mathbf{0},{\tau}_{c}^{2}{\mathbf{D}}_{1}^{-1})$, where **Q** = **VDV**′ is **Q**’s spectral decomposition, **V**_{1} holds the eigenvectors of **Q** with nonzero eigenvalues, and **D**_{1} is diagonal with those eigenvalues *d _{i}* on its diagonal. In the constraint-case formulation, the model becomes

$$\left(\begin{array}{c}\mathbf{y}\\ \mathbf{0}\\ \mathbf{0}\end{array}\right)=\left(\begin{array}{cc}\mathbf{I}& \mathbf{I}\\ -{\mathbf{V}}_{1}^{\prime}& \mathbf{0}\\ \mathbf{0}& -\mathbf{I}\end{array}\right)\left(\begin{array}{c}\mathit{\delta}\\ \mathit{\xi}\end{array}\right)+\left(\begin{array}{c}\mathit{\epsilon}\\ {\mathit{\phi}}_{c}\\ {\mathit{\phi}}_{h}\end{array}\right)$$

for ** ε** ~ $N(\mathbf{0},{\sigma}^{2}{\mathbf{I}}_{N})$, ** φ**_{c} ~ $N(\mathbf{0},{\tau}_{c}^{2}{\mathbf{D}}_{1}^{-1})$, and ** φ**_{h} ~ $N(\mathbf{0},{\tau}_{h}^{2}{\mathbf{I}}_{N})$. H&S’s DF for this fit is

$$\rho =tr\left(\left(\begin{array}{ll}{\mathbf{I}}_{N}\hfill & {\mathbf{I}}_{N}\hfill \\ {\mathbf{I}}_{N}\hfill & {\mathbf{I}}_{N}\hfill \end{array}\right){\left(\begin{array}{cc}{\mathbf{I}}_{N}+{\scriptstyle \frac{{\sigma}^{2}}{{\tau}_{c}^{2}}}\mathbf{Q}& {\mathbf{I}}_{N}\\ {\mathbf{I}}_{N}& (1+{\scriptstyle \frac{{\sigma}^{2}}{{\tau}_{h}^{2}}}){\mathbf{I}}_{N}\end{array}\right)}^{-1}\right).$$

We have assumed **Q** describes a connected map, so *G* = 1. After some labor, *ρ* becomes

$$\rho =1+\sum _{i=1}^{N-1}\frac{{\scriptstyle \frac{{\tau}_{c}^{2}}{{\sigma}^{2}}}/{d}_{i}+{\scriptstyle \frac{{\tau}_{h}^{2}}{{\sigma}^{2}}}}{{\scriptstyle \frac{{\tau}_{c}^{2}}{{\sigma}^{2}}}/{d}_{i}+{\scriptstyle \frac{{\tau}_{h}^{2}}{{\sigma}^{2}}}+1}.$$

While this suggests how total DF in the fit might be attributed to ** δ** and ** ξ**, the closed form by itself does not show how to make that attribution (but see Section 2.5).
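The equality of the trace expression and the closed-form sum can be checked numerically. This sketch (ours, with illustrative variance values and a chain graph of *N* = 6 sites) computes *ρ* both ways:

```python
import numpy as np

# Chain-graph ICAR precision Q (connected map, so G = 1), N = 6 sites.
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
Q = np.diag(A.sum(axis=1)) - A

sigma2, tau_c2, tau_h2 = 1.0, 2.0, 0.5   # illustrative variances

# rho as the 2N x 2N trace expression.
I = np.eye(N)
M = np.block([[I + (sigma2 / tau_c2) * Q, I],
              [I, (1.0 + sigma2 / tau_h2) * I]])
rho_trace = np.trace(np.block([[I, I], [I, I]]) @ np.linalg.inv(M))

# rho from the closed-form sum over the N - 1 nonzero eigenvalues d_i of Q.
d = np.sort(np.linalg.eigvalsh(Q))[1:]   # drop the zero eigenvalue
r = (tau_c2 / sigma2) / d
s = tau_h2 / sigma2
rho_sum = 1.0 + np.sum((r + s) / (r + s + 1.0))
assert np.isclose(rho_trace, rho_sum)
```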

Summary measures of the earth’s surface temperature are used to describe global climate. Consider the series *y _{t}*, *t* = 0, …, *T*, of such annual summaries, modeled with a linear growth model:

$$\begin{array}{l}{y}_{t}={\mu}_{t}+{n}_{t},t=0,\cdots ,T,\\ {\mu}_{t}={\mu}_{t-1}+{\beta}_{t-1}+{w}_{1,t},t=1,\cdots ,T,\\ {\beta}_{t}={\beta}_{t-1}+{w}_{2,t},t=1,\cdots ,T-1.\end{array}$$

Let ** μ** = (μ_{0}, …, μ_{T})′ and ** β** = (β_{0}, …, β_{T−1})′; the *n _{t}*, *w*_{1,t}, and *w*_{2,t} are mutually independent normal errors with variances ${\sigma}_{n}^{2}$, ${\sigma}_{1}^{2}$, and ${\sigma}_{2}^{2}$ respectively.

As in H&S, combine the three equations and reformulate them as a linear model:

$$\left(\begin{array}{c}{\mathbf{y}}_{(T+1)\times 1}\\ {\mathbf{0}}_{T\times 1}\\ {\mathbf{0}}_{(T-1)\times 1}\end{array}\right)=\left(\begin{array}{cc}{\mathbf{I}}_{T+1}& {\mathbf{0}}_{(T+1)\times T}\\ {\mathbf{Z}}_{1}& -{\mathbf{I}}_{T}\\ {\mathbf{0}}_{(T-1)\times (T+1)}& {\mathbf{Z}}_{2}\end{array}\right)\left(\begin{array}{c}\mathit{\mu}\\ \mathit{\beta}\end{array}\right)+\left(\begin{array}{c}\mathbf{n}\\ {\mathbf{w}}_{1}\\ {\mathbf{w}}_{2}\end{array}\right),$$

where **Z**_{1} = [**0**_{T×1} | **I*** _{T}*] − [**I*** _{T}* | **0**_{T×1}] is the *T* × (*T* + 1) first-difference matrix and **Z**_{2} = [**0**_{(T−1)×1} | **I**_{T−1}] − [**I**_{T−1} | **0**_{(T−1)×1}] is its (*T* − 1) × *T* analogue. H&S’s DF for this fit is

$$\rho =tr\left[\left(\begin{array}{cc}{\mathbf{I}}_{T+1}& \mathbf{0}\\ \mathbf{0}& \mathbf{0}\end{array}\right){\left(\begin{array}{cc}{\mathbf{I}}_{T+1}+{\scriptstyle \frac{{\sigma}_{n}^{2}}{{\sigma}_{1}^{2}}}{\mathbf{Z}}_{1}^{\prime}{\mathbf{Z}}_{1}& -{\scriptstyle \frac{{\sigma}_{n}^{2}}{{\sigma}_{1}^{2}}}{\mathbf{Z}}_{1}^{\prime}\\ -{\scriptstyle \frac{{\sigma}_{n}^{2}}{{\sigma}_{1}^{2}}}{\mathbf{Z}}_{1}& {\scriptstyle \frac{{\sigma}_{n}^{2}}{{\sigma}_{1}^{2}}}{\mathbf{I}}_{T}+{\scriptstyle \frac{{\sigma}_{n}^{2}}{{\sigma}_{2}^{2}}}{\mathbf{Z}}_{2}^{\prime}{\mathbf{Z}}_{2}\end{array}\right)}^{-1}\right].$$

This is difficult to simplify and thus offers little intuition about how to attribute DF separately to the local mean and the local linear trend (but again, see Section 2.5).

Usually, as in these examples, hierarchical models specify multiple sources of variation in the data. The first example has three: spatial clustering, local heterogeneity, and error. In the second example, variation comes from perturbations in the trend and local mean and from observation error. In these examples, it would be desirable to attribute DF to these meaningful effects.

Such a partition of DF has at least two uses. First, it is an interpretable measure of each effect’s share of the fit’s complexity. Effects compete to explain variation when their design matrices are not orthogonal. In the absence of smoothing this produces well-known collinearity effects; smoothing complicates the situation. Smoothed effects compete with unsmoothed effects (e.g., Reich et al 2006) and with each other, which influences the severity of collinearity effects and the degree to which smoothed effects are made smooth. Allocating a fit’s DF to effects provides a view of this competition. For Example 1’s model, Eberly and Carlin (2000) used *τ*_{c}/(*τ*_{c} + *τ _{h}*) as a measure of clustering’s contribution to model fit. However, *τ _{c}* describes the conditional variability of a site’s clustering effect given its neighbors while *τ _{h}* describes marginal variability, so the two are not on comparable scales and the ratio is hard to interpret.

Second, a partition of DF can be useful for specifying a prior distribution on variance parameters. For the linear growth model, West & Harrison (1997, Chapter 2) and Prado et al (2001) assumed fixed ${\sigma}_{1}^{2}/{\sigma}_{n}^{2}$ and ${\sigma}_{2}^{2}/{\sigma}_{n}^{2}$ to control complexity in the fit. It would be handy to avoid fixing these ratios but because they are hard to think about, it is hard to specify priors for them. A prior on interpretable DF avoids this difficulty.

This paper proposes a new definition of DF, consistent with H&S, which attributes a fit’s total DF to individual effects. This definition involves a novel conception of an effect’s DF as, loosely speaking, the effect’s fraction of the model’s total variance. The new definition applies to any model with a mixed-effect form and normal distributions for random effects and errors, for example penalized splines in the mixed-model representation (Ruppert et al 2003) and at least some Gaussian process models. Section 2 defines the class of models, then gives and rationalizes the new definition. Section 3 applies it to the examples and explores its uses. The new definition is consistent with the one in Ruppert et al (2003, Section 8.3) except for DF for residuals (Ruppert et al 2003, Section 3.14). Ruppert et al (2003) also gave an approximation to an effect’s DF, saying “for all examples we have observed … there is practically no difference between the approximate and exact degrees-of-freedom values” (p. 176). Section 3 gives an example in which the approximation differs substantially from exact DF. Section 4 considers DF for residuals, in particular why the most common formula (Ruppert et al 2003, Section 3.14), when added to the DF in the fitted model, gives a total less than the sample size, an oddity avoided by the new definition. Section 5 concludes; technical results are in the Appendix, which is available online with the supplemental materials.

Model comparison criteria generally penalize a measure of model fit with a measure of model complexity, e.g., Vaida & Blanchard (2005) derived a penalty based on *ρ* for a conditional Akaike information criterion for linear mixed models. In defining the deviance information criterion (DIC), Spiegelhalter et al (2002) defined their penalty, the effective number of parameters *p _{D},* in terms of a measure of model fit. However, in view of Plummer (2008), it is not clear that the “right” penalty for using the same data to fit and evaluate a model is a simple function of any measure of model complexity, so that describing model complexity and comparing models are distinct problems. We therefore defer further consideration of model comparison, DIC, and *p _{D}* to Section 5.

Consider a linear model, which we call “the standard model”, written as

$$\mathbf{y}={\mathbf{X}}_{1}{\mathit{\theta}}_{1}+{\mathbf{X}}_{2}{\mathit{\theta}}_{2}+\mathit{\epsilon},$$

(2)

where **y** ∈ *R ^{n}* contains the observations, ** θ**_{1} holds nonsmoothed effects with design matrix **X**_{1}, and ** θ**_{2} holds smoothed effects with design matrix **X**_{2}, partitioned into *J* effects as **X**_{2} = [**X**_{21} | ⋯ | **X**_{2J}] with cov(** θ**_{2j}) = **Γ**_{2j} and cov(** ε**) = **Γ**_{0}.

Further notation includes *R*(**X**) and *N*(**X**) for the column and null spaces of the matrix **X**, respectively, while *P*** _{X}** is the orthogonal projection onto *R*(**X**).

First, consider a linear model simpler than the standard model (2):

$$\mathbf{y}={\mathbf{X}}_{2}\mathit{\theta}+\mathit{\epsilon},$$

(3)

where ** θ** is smoothed using the model or prior ** θ** ~ *N*(**0**, **Γ**_{2}). In H&S’s constraint-case formulation this model is

$$\mathbf{Y}=\left(\begin{array}{c}\mathbf{y}\\ \mathbf{0}\end{array}\right)=\mathbf{X}\mathit{\theta}+{\mathit{\epsilon}}^{\ast}=\left(\begin{array}{c}{\mathbf{X}}_{2}\\ \mathbf{I}\end{array}\right)\mathit{\theta}+\left(\begin{array}{c}\mathit{\epsilon}\\ \mathit{\xi}\end{array}\right),$$

where **ε*** is multinormal with mean **0** and block diagonal covariance **Γ** = *diag*(**Γ**_{0}, **Γ**_{2}). H&S’s DF is

$$\begin{array}{l}\rho =tr[{\mathbf{X}}_{2}{({\mathbf{X}}^{\prime}{\mathbf{\Gamma}}^{-1}\mathbf{X})}^{-1}{\mathbf{X}}_{2}^{\prime}{\mathbf{\Gamma}}_{0}^{-1}]\\ =tr[{\mathbf{X}}_{2}{({\mathbf{X}}_{2}^{\prime}{\mathbf{\Gamma}}_{0}^{-1}{\mathbf{X}}_{2}+{\mathbf{\Gamma}}_{2}^{-1})}^{-1}{\mathbf{X}}_{2}^{\prime}{\mathbf{\Gamma}}_{0}^{-1}].\end{array}$$

(4)

Rewrite the matrix inverse in (4) as (Schott 1997, Theorem 1.7)

$${({\mathbf{X}}_{2}^{\prime}{\mathbf{\Gamma}}_{0}^{-1}{\mathbf{X}}_{2}+{\mathbf{\Gamma}}_{2}^{-1})}^{-1}={\mathbf{\Gamma}}_{2}-{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}{({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})}^{-1}{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}.$$

(5)

The left-hand side of (5) is familiar from Bayesian linear models as the conditional posterior covariance of ** θ**|**y**, **Γ**, while the right-hand side is the same covariance derived from the joint multinormal distribution of (** θ**, **y**) given **Γ**. Substituting (5) into (4) and simplifying gives

$$\begin{array}{l}\rho =tr[{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}{\mathbf{\Gamma}}_{0}^{-1}-{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}{({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})}^{-1}{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}{\mathbf{\Gamma}}_{0}^{-1}]\\ =tr[{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}{({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})}^{-1}({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0}-{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}){\mathbf{\Gamma}}_{0}^{-1}]\\ =tr[{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}{({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})}^{-1}],\end{array}$$

which has the form: trace [ratio of {modelled variance matrix} to {total variance matrix}]. This suggests defining an effect-specific DF as that effect’s contribution of variance to total variance. In the standard model (2), the mean-structure parameters *θ*_{2}* _{j}* are independent of each other conditional on their covariances **Γ**_{2j}, so the modeled variance of **y** decomposes as ${\sum}_{j}{\mathbf{X}}_{2j}{\mathbf{\Gamma}}_{2j}{\mathbf{X}}_{2j}^{\prime}+{\mathbf{\Gamma}}_{0}$, and this decomposition yields the effect-specific DF defined in Section 2.3.
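Both steps can be verified numerically. The sketch below (ours, with arbitrary toy matrices) checks identity (5) and the resulting equality of form (4) with the modelled-to-total-variance form:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 5
X2 = rng.standard_normal((n, p))
G0 = 1.2 * np.eye(n)                    # Gamma_0, error covariance
G2 = 0.8 * np.eye(p)                    # Gamma_2, smoothing covariance

# The Woodbury-type identity (5):
lhs = np.linalg.inv(X2.T @ np.linalg.inv(G0) @ X2 + np.linalg.inv(G2))
big = X2 @ G2 @ X2.T + G0               # total variance of y
rhs = G2 - G2 @ X2.T @ np.linalg.inv(big) @ X2 @ G2
assert np.allclose(lhs, rhs)

# Hence rho in form (4) equals the "modelled over total variance" form.
rho_4 = np.trace(X2 @ lhs @ X2.T @ np.linalg.inv(G0))
rho_ratio = np.trace(X2 @ G2 @ X2.T @ np.linalg.inv(big))
assert np.isclose(rho_4, rho_ratio)
```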

Green (2002) used a different approach to derive an equivalent decomposition of the effective number of parameters *p _{D}* when the variance components are known.

Section 2.2’s reformulation of *ρ* for model (3), with all effects smoothed, suggests a way to rewrite H&S’s DF as a sum of meaningful quantities, but for the standard model (2) the new formulation must accommodate nonsmoothed effects **X**_{1}. A nonsmoothed effect can be viewed as the limit of a smoothed effect, for which the prior covariance goes to infinity and imposes no constraint. The new definition therefore gives ** θ**_{1} covariance λ**Γ**_{1} for an arbitrary symmetric positive definite **Γ**_{1} and lets λ → +∞.

For the standard model (2), then, define DF for nonsmoothed effects, denoted by *DF*(**X**_{1}), as

$$\begin{array}{l}DF({\mathbf{X}}_{1})={\mathit{lim}}_{\lambda \to +\infty}tr[{\mathbf{X}}_{1}\lambda {\mathbf{\Gamma}}_{1}{\mathbf{X}}_{1}^{\prime}{({\mathbf{X}}_{1}\lambda {\mathbf{\Gamma}}_{1}{\mathbf{X}}_{1}^{\prime}+{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})}^{+}]\\ =tr[{P}_{{\mathbf{X}}_{1}}]=\mathit{rank}({\mathbf{X}}_{1}).\end{array}$$

Appendix (b) gives a proof. For the smoothed effect *θ*_{2}* _{j}*, DF is defined analogously:

$$\begin{array}{l}DF({\mathbf{X}}_{2j})={\mathit{lim}}_{\lambda \to +\infty}tr[{\mathbf{X}}_{2j}{\mathbf{\Gamma}}_{2j}{\mathbf{X}}_{2j}^{\prime}{({\mathbf{X}}_{1}\lambda {\mathbf{\Gamma}}_{1}{\mathbf{X}}_{1}^{\prime}+{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})}^{+}]\\ =tr\{{\mathbf{X}}_{2j}{\mathbf{\Gamma}}_{2j}{\mathbf{X}}_{2j}^{\prime}{[(\mathbf{I}-{P}_{{\mathbf{X}}_{1}})({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})(\mathbf{I}-{P}_{{\mathbf{X}}_{1}})]}^{+}\}.\end{array}$$

Appendix (c) gives a proof. Similarly, for the error term **ε,**

$$\begin{array}{l}DF(\mathit{\epsilon})={\mathit{lim}}_{\lambda \to +\infty}tr[\mathbf{I}{\mathbf{\Gamma}}_{0}{\mathbf{I}}^{\prime}{({\mathbf{X}}_{1}\lambda {\mathbf{\Gamma}}_{1}{\mathbf{X}}_{1}^{\prime}+{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})}^{+}]\\ =tr\{{\mathbf{\Gamma}}_{0}{[(\mathbf{I}-{P}_{{\mathbf{X}}_{1}})({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}+{\mathbf{\Gamma}}_{0})(\mathbf{I}-{P}_{{\mathbf{X}}_{1}})]}^{+}\}.\end{array}$$

This follows by the argument in Appendix (c). In general this differs from residual degrees of freedom as defined in Ruppert et al (2003, Section 3.14) or Hastie & Tibshirani (1990, Section 3.5).

By this definition, the DF of a smoothed effect **X**_{2}* _{j}* or error ** ε** is a function of the unknown variances in **Γ**; in a Bayesian analysis it is therefore a random variable with a posterior distribution.

We argue for the new definition’s validity by presenting four of its properties.

The new definition is consistent with H&S’s DF: the sum of DF over all effects equals H&S’s *ρ* for the whole model. Appendix (d) gives a proof.

The sum of DF in the model fit and error is fixed. When **Γ**_{0} is positive definite,
$DF({\mathbf{X}}_{1})+{\sum}_{1}^{J}DF({\mathbf{X}}_{2j})+DF(\mathit{\epsilon})=\mathit{dim}(\mathbf{y})$, the dimension of **y.** Appendix (e) gives a proof. For the definition of DF in residuals given by Ruppert et al (2003) and Hastie & Tibshirani (1990),
$DF({\mathbf{X}}_{1})+{\sum}_{1}^{J}DF({\mathbf{X}}_{2j})+DF(\text{residuals})<\mathit{dim}(\mathbf{y})$, with the deficiency arising from *DF*(residuals).
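This accounting can be illustrated numerically. The following sketch (ours, with arbitrary toy design matrices and variances) computes each DF by the pseudoinverse formulas above and checks that they sum to *n*:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 25
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])   # nonsmoothed
X21 = rng.standard_normal((n, 4))                            # smoothed, j = 1
X22 = rng.standard_normal((n, 6))                            # smoothed, j = 2
G21, G22 = 0.9 * np.eye(4), 1.7 * np.eye(6)
G0 = 0.5 * np.eye(n)                    # positive definite error covariance

P = X1 @ np.linalg.pinv(X1)             # orthogonal projection onto R(X1)
total = X21 @ G21 @ X21.T + X22 @ G22 @ X22.T + G0
Mplus = np.linalg.pinv((np.eye(n) - P) @ total @ (np.eye(n) - P))

df_x1 = np.linalg.matrix_rank(X1)
df_21 = np.trace(X21 @ G21 @ X21.T @ Mplus)
df_22 = np.trace(X22 @ G22 @ X22.T @ Mplus)
df_eps = np.trace(G0 @ Mplus)

# With Gamma_0 positive definite, the DF partition the sample size n.
assert np.isclose(df_x1 + df_21 + df_22 + df_eps, n)
```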

The DF of an effect has a reasonable upper bound: *DF*(**X**_{2}* _{j}*) ≤ *rank*((**I** − *P*_{**X**_{1}})**X**_{2j}), the DF the effect would contribute if it were not smoothed at all.

Scale-invariance property. DF, and particularly priors on DF, avoid problems arising from scaling of columns of the design matrix. (Section 2.6 discusses prior distributions on DF.) Suppose in the standard model (2), the covariance of each smoothed effect *θ*_{2}* _{j}* and the covariance of the error term are known up to scalar multiples. Then rescaling the columns of **X**_{2j} merely rescales the corresponding smoothing variance, leaving DF, and any prior placed on DF, unchanged.

Assume the spatial map represented by **Q** is connected, so *G* = 1. (This is not necessary.) The CAR model for the spatial clustering effects ** δ** models
${\mathbf{V}}_{1}^{\prime}\mathit{\delta}$ as
$N(\mathbf{0},{\tau}_{c}^{2}{\mathbf{D}}_{1}^{-1})$, while
${\mathbf{1}}_{N}^{\prime}\mathit{\delta}$ is not smoothed, so re-express this model in the standard form (2) by taking **X**_{1} = **1**_{N}, **X**_{21} = **V**_{1} with **Γ**_{21} = ${\tau}_{c}^{2}{\mathbf{D}}_{1}^{-1}$, **X**_{22} = **I**_{N} with **Γ**_{22} = ${\tau}_{h}^{2}{\mathbf{I}}_{N}$, and **Γ**_{0} = ${\sigma}^{2}{\mathbf{I}}_{N}$. The new definition then gives

$$\begin{array}{l}DF(\mathit{\delta})=\mathit{rank}({\mathbf{X}}_{1})+tr\{{\mathbf{X}}_{21}{\mathbf{\Gamma}}_{21}{\mathbf{X}}_{21}^{\prime}{[({\mathbf{I}}_{N}-{P}_{{\mathbf{X}}_{1}})({\mathbf{X}}_{21}{\mathbf{\Gamma}}_{21}{\mathbf{X}}_{21}^{\prime}+{\mathbf{X}}_{22}{\mathbf{\Gamma}}_{22}{\mathbf{X}}_{22}^{\prime}+{\mathbf{\Gamma}}_{0})({\mathbf{I}}_{N}-{P}_{{\mathbf{X}}_{1}})]}^{+}\}\\ =1+\sum _{i=1}^{N-1}\frac{{\scriptstyle \frac{{\tau}_{c}^{2}}{{\sigma}^{2}}}/{d}_{i}}{{\scriptstyle \frac{{\tau}_{c}^{2}}{{\sigma}^{2}}}/{d}_{i}+{\scriptstyle \frac{{\tau}_{h}^{2}}{{\sigma}^{2}}}+1};\\ DF(\mathit{\xi})=tr\{{\mathbf{X}}_{22}{\mathbf{\Gamma}}_{22}{\mathbf{X}}_{22}^{\prime}{[({\mathbf{I}}_{N}-{P}_{{\mathbf{X}}_{1}})({\mathbf{X}}_{21}{\mathbf{\Gamma}}_{21}{\mathbf{X}}_{21}^{\prime}+{\mathbf{X}}_{22}{\mathbf{\Gamma}}_{22}{\mathbf{X}}_{22}^{\prime}+{\mathbf{\Gamma}}_{0})({\mathbf{I}}_{N}-{P}_{{\mathbf{X}}_{1}})]}^{+}\}\\ =\sum _{i=1}^{N-1}\frac{{\scriptstyle \frac{{\tau}_{h}^{2}}{{\sigma}^{2}}}}{{\scriptstyle \frac{{\tau}_{c}^{2}}{{\sigma}^{2}}}/{d}_{i}+{\scriptstyle \frac{{\tau}_{h}^{2}}{{\sigma}^{2}}}+1}.\end{array}$$

Thus *DF*(** δ**) + *DF*(** ξ**) equals Section 1’s total *ρ*, now partitioned into interpretable pieces for clustering and heterogeneity, and *DF*(** δ**) + *DF*(** ξ**) + *DF*(** ε**) = *N*.
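The matrix formulas and the eigenvalue sums can be reconciled numerically. This sketch (ours, with illustrative variances and a chain-graph **Q**) reconstructs the standard-form ingredients from the spectral decomposition of **Q** and checks both expressions:

```python
import numpy as np

# ICAR clustering + heterogeneity model in standard form: X1 = 1_N,
# X21 = V1 (eigenvectors of Q with nonzero eigenvalues d_i),
# Gamma_21 = tau_c^2 D1^{-1}; X22 = I_N, Gamma_22 = tau_h^2 I_N.
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
Q = np.diag(A.sum(axis=1)) - A

sigma2, tau_c2, tau_h2 = 1.0, 2.0, 0.5  # illustrative variances
evals, evecs = np.linalg.eigh(Q)
d, V1 = evals[1:], evecs[:, 1:]          # drop the zero eigenvalue

C_clus = tau_c2 * V1 @ np.diag(1.0 / d) @ V1.T   # X21 Gamma_21 X21'
C_het = tau_h2 * np.eye(N)                        # X22 Gamma_22 X22'
G0 = sigma2 * np.eye(N)

one = np.ones((N, 1))
P = one @ one.T / N                      # projection onto span(1_N)
Mplus = np.linalg.pinv((np.eye(N) - P) @ (C_clus + C_het + G0) @ (np.eye(N) - P))

df_delta = 1 + np.trace(C_clus @ Mplus)  # rank(X1) = 1 plus smoothed part
df_xi = np.trace(C_het @ Mplus)

# Closed forms in terms of the eigenvalues d_i:
r = (tau_c2 / sigma2) / d
s = tau_h2 / sigma2
assert np.isclose(df_delta, 1 + np.sum(r / (r + s + 1)))
assert np.isclose(df_xi, np.sum(s / (r + s + 1)))
```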

Write the model as a linear model in *μ*_{0}, the initial local mean, *β*_{0}, the initial local trend, **w**_{1}, noise in the local mean, and **w**_{2}, noise in the trend:

$$\left(\begin{array}{c}{y}_{0}\\ {y}_{1}\\ {y}_{2}\\ {y}_{3}\\ \vdots \\ {y}_{T}\end{array}\right)=\left(\begin{array}{cc}1& 0\\ 1& 1\\ 1& 2\\ 1& 3\\ \vdots & \vdots \\ 1& T\end{array}\right)\left(\begin{array}{c}{\mu}_{0}\\ {\beta}_{0}\end{array}\right)+\left(\begin{array}{cccccc}0& 0& \cdots & \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& 0\\ 1& 0& \cdots & \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& 0\\ 1& 1& 0& \cdots & \phantom{\rule{0.16667em}{0ex}}& 0\\ 1& 1& 1& 0& \cdots & 0\\ \vdots & \vdots & \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& \ddots & \vdots \\ 1& 1& 1& 1& \cdots & 1\end{array}\right)\left(\begin{array}{c}{w}_{11}\\ {w}_{12}\\ \vdots \\ {w}_{1T}\end{array}\right)+\left(\begin{array}{cccccc}0& 0& \cdots & \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& 0\\ 0& 0& \cdots & \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& 0\\ 1& 0& 0& \cdots & \phantom{\rule{0.16667em}{0ex}}& 0\\ 2& 1& 0& 0& \cdots & 0\\ \vdots & \vdots & \phantom{\rule{0.16667em}{0ex}}& \phantom{\rule{0.16667em}{0ex}}& \ddots & \vdots \\ T-1& T-2& T-3& T-4& \cdots & 1\end{array}\right)\left(\begin{array}{c}{w}_{21}\\ {w}_{22}\\ \vdots \\ {w}_{2T-1}\end{array}\right)+\left(\begin{array}{c}{n}_{0}\\ {n}_{1}\\ \vdots \\ {n}_{T}\end{array}\right).$$

In terms of the standard model, the unsmoothed effects are *θ*_{1} = (*μ*_{0}, *β*_{0})′, the smoothed effects
${\mathit{\theta}}_{2}={({\mathbf{w}}_{1}^{\prime},{\mathbf{w}}_{2}^{\prime})}^{\prime}$, the smoothing covariances
${\mathbf{\Gamma}}_{21}={\sigma}_{1}^{2}{\mathbf{I}}_{T}$ and
${\mathbf{\Gamma}}_{22}={\sigma}_{2}^{2}{\mathbf{I}}_{T-1}$, and the error covariance
${\mathbf{\Gamma}}_{0}={\sigma}_{n}^{2}{\mathbf{I}}_{T+1}$. *DF*(**w**_{1}), *DF*(**w**_{2}), and *DF*(*n*) follow straightforwardly, but the expressions do not simplify, so we omit them and illustrate their properties with some special cases.
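Although the closed forms are unwieldy, the DF are easy to compute numerically. This sketch (ours; *T* = 10 and illustrative variance values, not the paper's analysis) builds the design matrices written out above and checks that the effect-specific DF and error DF account for the *n* = *T* + 1 observations:

```python
import numpy as np

# Linear growth model design matrices for a short series (T = 10).
T = 10
n = T + 1
t = np.arange(n)
X1 = np.column_stack([np.ones(n), t])   # columns for (mu_0, beta_0)

X21 = np.zeros((n, T))                  # coefficient of w_{1,j} in y_i
for i in range(n):
    for j in range(1, T + 1):
        X21[i, j - 1] = 1.0 if j <= i else 0.0

X22 = np.zeros((n, T - 1))              # coefficient of w_{2,j} in y_i
for i in range(n):
    for j in range(1, T):
        X22[i, j - 1] = max(i - j, 0)

sig_n2, sig_12, sig_22 = 1.0, 0.25, 0.04   # illustrative variances
total = sig_12 * X21 @ X21.T + sig_22 * X22 @ X22.T + sig_n2 * np.eye(n)
P = X1 @ np.linalg.pinv(X1)
Mplus = np.linalg.pinv((np.eye(n) - P) @ total @ (np.eye(n) - P))

df_w1 = sig_12 * np.trace(X21 @ X21.T @ Mplus)
df_w2 = sig_22 * np.trace(X22 @ X22.T @ Mplus)
df_err = sig_n2 * np.trace(Mplus)

# DF(X1) = 2 (intercept and slope); the DF partition the n observations.
assert np.isclose(2 + df_w1 + df_w2 + df_err, n)
```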

When the local mean and trend do not vary (
${\sigma}_{1}^{2}={\sigma}_{2}^{2}=0$), the model reduces to a linear regression with intercept *μ*_{0} and slope *β*_{0}, with 2 DF; thus when
${\sigma}_{1}^{2}>0$ or
${\sigma}_{2}^{2}>0$, *DF*(**w**_{1}) and *DF*(**w**_{2}) describe complexity in the fit, beyond a linear trend, attributable to these two sources. Note that the matrix (**X**_{1}|**X**_{21}) is saturated and **X**_{22} = **X**_{21}**B** for a specific **B.** Thus if
${\sigma}_{1}^{2}>0$ and
${\sigma}_{2}^{2}>0$, **w**_{1} and **w**_{2} compete with each other. Table 1 shows how they compete for various (
${\sigma}_{1}^{2},{\sigma}_{2}^{2}$); noise variance
${\sigma}_{n}^{2}$ is fixed at 1 and *T* at 124. When *σ*_{2} = 0 (Table 1a), *DF*(**w**_{1}) increases as *σ*_{1} grows, with analogous results when *σ*_{1} = 0 and *σ*_{2} grows. As for the clustering and heterogeneity model, for fixed *σ*_{1}> 0 (Table 1b), *DF*(**w**_{1}) is reduced as *σ*_{2} grows because larger *σ*_{2} allows **w**_{2} to compete more aggressively, so *DF*(**w**_{2}) grows. The analogous result occurs when *σ*_{1} increases for fixed *σ*_{2}.

The most familiar notion of DF is for linear models with one error term (e.g., Weisberg 1985), in which a model’s DF is the fixed, known rank of its design matrix. As extended to scatterplot smoothers (e.g., Hastie & Tibshirani 1990), a fit’s DF is not fixed and known before the model is fit, but rather is a function of tuning constants chosen to control the fit’s smoothness. For DF as redefined here, e.g., in Example 1’s model, the vector of effect-specific DF, (*DF*(** δ**), *DF*(** ξ**), *DF*(** ε**)), is a function of the unknown variances $({\tau}_{c}^{2},{\tau}_{h}^{2},{\sigma}^{2})$ and thus has a posterior distribution.

This can be stated more precisely. In the standard model (2), suppose each smoothed effect *θ*_{2}* _{j}* and the error term ** ε** have covariances known up to scalar multiples, **Γ**_{2j} = ${\sigma}_{j}^{2}$**C**_{j} and **Γ**_{0} = ${\sigma}_{\epsilon}^{2}$**C**_{0} for known **C**_{j} and **C**_{0}. Then the vector of effect-specific DF is a function of the ratios ${\sigma}_{j}^{2}/{\sigma}_{\epsilon}^{2}$ alone, so a prior on the DF vector induces a prior on these ratios.

When **Γ**_{2}* _{j}*’s form is more complex than assumed above, a prior on DF partially specifies a prior on the unknowns in **Γ**_{2}* _{j}*.
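To see how a prior on DF can induce a prior on a variance ratio, note that for a single smoothed effect with **Γ**_{2} = τ²**I** and **Γ**_{0} = ${\sigma}_{\epsilon}^{2}$**I**, DF is a strictly increasing function of *r* = τ²/${\sigma}_{\epsilon}^{2}$ and so can be inverted numerically. A sketch (ours, with a toy design and a hypothetical target DF):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 8
X2 = rng.standard_normal((n, p))
lam = np.linalg.eigvalsh(X2.T @ X2)     # eigenvalues of X2'X2

def df_of_ratio(r):
    # DF of the smoothed effect when Gamma_2 = r * sigma_eps^2 * I:
    # tr[X2 (X2'X2 + I/r)^{-1} X2'] = sum_i lam_i / (lam_i + 1/r).
    return np.sum(lam / (lam + 1.0 / r))

def ratio_of_df(target, lo=1e-8, hi=1e8, iters=200):
    # DF is strictly increasing in the ratio, so invert by bisection
    # on the log scale.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if df_of_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return mid

target = 4.0                            # e.g., a prior's median DF
r = ratio_of_df(target)
assert np.isclose(df_of_ratio(r), target, atol=1e-6)
```

A draw from a prior on DF can thus be mapped to a draw of the variance ratio, which is the mechanism behind DF-induced priors on smoothing parameters.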

We analyze AL data as described in Section 1 for twelve subjects without missing measurements, taken from a larger dataset of 410 subjects. The dataset is available online with the supplemental materials. Each site in each subject was measured once by each of two examiners, so each subject provided 84 AL measurements on 42 sites. Zhao (1999) investigated differences between examiners using all 410 subjects. She found tiny systematic differences between examiners. For each pair of examiners, the differences between the examiners’ measurements showed small spatial correlation and the standard deviation of the differences was similar for the different examiner pairs. Thus for the present purpose we simply treat the two exams on each patient as providing two iid measurements at each site.

Let *y _{ijk}* be the AL measurement at site *j* of subject *i* by examiner *k*.

Subject *i* has smoothing variances
${\sigma}_{c,i}^{2}$ and
${\sigma}_{h,i}^{2}$ and thus *DF*(*δ** _{i}*) and *DF*(*ξ** _{i}*) specific to that subject.

Table 2. Periodontal data: posterior mean DF for spatial clustering and site heterogeneity and posterior mean of *f*_{i}, the fraction of DF for spatial clustering. The maximum possible DF for *δ*_{i} and *ξ*_{i} are, respectively, 41 and 42 for individual subjects **...**

In Table 2, overall DF for spatial clustering has posterior mean 82.00 out of 492; subject-specific values range from 1.66 to 16.75 out of 41. Overall DF for heterogeneity has posterior mean 158.26 out of 503; subject-specific values range from 1.60 to 28.12 out of 42. The posterior mean of *f _{i}* ranges from 0.18 to 0.52 with median 0.36, compared to the prior mean 0.50, so for most subjects the data indicate that heterogeneity should receive more fitted complexity than clustering. Compared to

Are these results reasonable? Figure 3 shows the data and the posterior of *f _{i}* for subjects 11, 7, and 10, who have the smallest, median, and largest posterior means of *f _{i}*, respectively.

Figure 3. Example 1, AL data and histograms of draws from the posterior of *f*_{i}, the fraction of model complexity attributed to spatial clustering. The top two panels are for subject 11, the middle two panels are for subject 7, and the bottom two panels are for subject **...**

The random effects ** α, Δ,** and

Table 3. True DF and average posterior median DF for clustering and heterogeneity, for various true (
${\sigma}_{\alpha}^{2}/{\sigma}_{0}^{2},{\sigma}_{c}^{2}/{\sigma}_{0}^{2},{\sigma}_{h}^{2}/{\sigma}_{0}^{2}$), with 100 simulated datasets per scenario. “Est DF” is the average **...**

In Table 3, consider the columns labelled “True DF”; these are DF as a function of the true variances. The model with both clustering and heterogeneity (the rows (1,1,1) and (0.25,0.25,0.25)) allocates less complexity to clustering than the model with only clustering (rows (1,1,0) and (0.25,0.25,0)). The presence of heterogeneity reduces the True DF for clustering proportionally more when the true variances are 1 (42% reduction) than when the true variances are 0.25 (21%). The presence of clustering has a similar effect on the True DF for heterogeneity (29% and 19% reductions, respectively). In other words, the two effects compete but to a lesser degree when each is constrained more by a smaller true variance. Now consider the columns labelled “Est[imated] DF”, which for each row in the table is the posterior median DF for each artificial dataset, averaged over the 100 artificial datasets. When heterogeneity is truly absent (rows (1,1,0) and (0.25,0.25,0)), the analysis nonetheless allocates some DF to heterogeneity, 35 and 32 when the variances are 1 and 0.25 respectively, and the estimated DF for the competing clustering effect are below the true values by 23 and 10 DF respectively. That is, some of heterogeneity’s DF are “stolen” from clustering, about 2/3 when the variances are 1 (and competition is more fierce) and about 1/3 when the variances are 0.25. An analogous but less pronounced effect occurs when clustering is truly absent. Section 5 discusses the implications of this finding.

Using the linear growth model to smooth the global mean surface temperature series, we consider two types of priors on (
${\sigma}_{n}^{2},{\sigma}_{1}^{2},{\sigma}_{2}^{2}$): independent Gamma(0.001,0.001) on each
$1/{\sigma}_{j}^{2}$, and DF-induced priors. Figure 2 plots the data and three fits (posterior means) arising from gamma priors. For one fit, the analysis used the design matrix’s original scaling; for the other two, the columns of the design matrix were multiplied by 50 and 100. All smooths capture the increase from 1881 to 1940, the decrease from 1940 to 1970, and the increase after 1970. The fit with the original scaling smooths the most, with DF(**w**_{1}) + DF(**w**_{2}) having posterior mean 5.6. The gamma prior’s effect is not invariant to the scaling of **X**’s columns, so the posteriors differ noticeably when the same gamma prior is used with re-scaled design matrices. When **X**’s columns have been multiplied by 100, the posterior means of DF(**w**_{1}) and DF(**w**_{2}) sum to 21.5. It was simply luck that in the “natural” scaling, the gamma prior gave a reasonable result.

Priors specified on DF, by contrast, avoid the problem of scaling. Instead, they control the fit’s smoothness directly by constraining DF in the fit. Figure 4 plots fits from flat priors on (DF(**w**_{1}), DF(**w**_{2})) with five different constraints on total DF in the smoothed effects. The fit becomes smoother as the constraint on total DF is reduced. Figure 5 shows histograms of 10,000 draws from the posterior of DF(**w**_{1}) and DF(**w**_{2}) arising from flat priors on them constrained so total smoothed DF < 6. Both posteriors are skewed, but in different directions. For the local mean *E*(*DF*(**w**_{1})|**y**) = 0.86, while for the local slope *E*(*DF*(**w**_{2})|**y**) = 4.58.

Figure 5. Example 2, global mean surface data, histograms of posterior DF for local mean (top) and local trend (bottom), for the flat prior on DF with sum of total smoothed DF constrained to < 6.

This example also gives an instance in which the approximate DF (Section 2.3; Ruppert et al 2003, Section 8.3) performs poorly. For the flat priors on (DF(**w**_{1}), DF(**w**_{2})) with total DF constrained to be less than 6, the posterior mean of exact and approximate DF for **w**_{1} are 0.86 and 1.87 respectively, differing by more than 1 DF, while the posterior mean of exact and approximate DF for **w**_{2} are closer, 4.58 and 4.76 respectively. When total DF is constrained to be less than 10, the posterior mean of exact and approximate DF are 2.97 and 5.13 for **w**_{1} — differing by more than 2 DF — and 5.14 and 5.75 for **w**_{2}.

Other choices are possible for DF-based priors. For example, West & Harrison (1997) usually fix the signal-to-noise ratio for smoothing variances, which amounts to point priors on the corresponding DF. This could be relaxed by specifying priors on DF with the same centers as these point priors, but with positive variances.

The Gamma(0.001,0.001) prior has been criticized on various grounds. Alternative priors on variances or standard deviations have been proposed; Gelman (2006) gives an influential critique and some alternatives. These alternatives are all scale-dependent, like the gamma prior. We do not, however, propose a flat prior on DF as a default: for this model and dataset, a flat prior on (DF(**w**_{1}), DF(**w**_{2})) without a constraint on total DF gives a grossly undersmoothed fit. It is still true that overparameterized models often need strong prior information.

The standard model (2), with **Γ**_{0} set to ${\sigma}_{\epsilon}^{2}\mathbf{I}$, can be re-expressed as

$$\begin{array}{l}\mathbf{y}=\mathbf{f}+\mathit{\epsilon}\\ \mathit{\epsilon}\sim N(\mathbf{0},{\sigma}_{\epsilon}^{2}\mathbf{I})\\ \mathbf{f}\sim N({\mathbf{X}}_{1}{\mathit{\theta}}_{1},{\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{\mathbf{X}}_{2}^{\prime}),\end{array}$$

where **f** is a true but unknown function we wish to estimate, evaluated at the design values corresponding to the rows of **X**_{1}. This notation is most familiar for penalized splines represented as mixed linear models, as in Ruppert et al (2003), where **f** is the true smooth function relating a predictor *x* and dependent variable **y**. Ruppert et al (2003) appear to require **X**_{1} to have full rank, and this section assumes it does.
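As a quick numerical illustration (toy dimensions and parameter values of our own choosing, not the paper's data), one can draw from this model either hierarchically, via **θ**_{2}, or marginally, via the covariance **X**_{2}**Γ**_{2}**X**_{2}′ + σ_ε²**I**:

```python
# Simulate from y = f + eps with f ~ N(X1 theta1, X2 Gamma2 X2').
# All matrices and parameters here are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 30, 2, 8
X1 = np.column_stack([np.ones(n), np.linspace(0, 1, n)])  # full-rank fixed-effect design
X2 = rng.standard_normal((n, p2))                          # random-effect design
theta1 = np.array([1.0, 2.0])
Gamma2 = 0.5 * np.eye(p2)
sigma2_eps = 1.0

# Hierarchical draw: theta2 ~ N(0, Gamma2), f = X1 theta1 + X2 theta2, y = f + eps
theta2 = rng.multivariate_normal(np.zeros(p2), Gamma2)
f = X1 @ theta1 + X2 @ theta2
y = f + rng.normal(0.0, np.sqrt(sigma2_eps), n)

# Marginally, y ~ N(X1 theta1, V) with V = X2 Gamma2 X2' + sigma2_eps I
V = X2 @ Gamma2 @ X2.T + sigma2_eps * np.eye(n)
assert np.allclose(V, V.T) and y.shape == (n,)
```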

Treating ${\sigma}_{\epsilon}^{2}$ and **Γ**_{2} as known, the fitted values from this model – the BLUPs or conditional posterior means – are **ŷ** = **Sy** for

$$\mathbf{S}=\left[\begin{array}{ll}{\mathbf{X}}_{1}\hfill & {\mathbf{X}}_{2}\hfill \end{array}\right]{\left[\left[\begin{array}{l}{{\mathbf{X}}_{1}}^{\prime}\hfill \\ {{\mathbf{X}}_{2}}^{\prime}\hfill \end{array}\right]\left[\begin{array}{ll}{\mathbf{X}}_{1}\hfill & {\mathbf{X}}_{2}\hfill \end{array}\right]+{\sigma}_{\epsilon}^{2}\left[\begin{array}{cc}\mathbf{0}& \mathbf{0}\\ \mathbf{0}& {{\mathbf{\Gamma}}_{2}}^{-1}\end{array}\right]\right]}^{-1}\left[\begin{array}{l}{{\mathbf{X}}_{1}}^{\prime}\hfill \\ {{\mathbf{X}}_{2}}^{\prime}\hfill \end{array}\right];$$

**S** is a linear smoother. The DF in the fitted values **ŷ** is *tr* (**S**) under all published definitions.
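The smoother **S** is easy to compute directly. The following sketch (hypothetical **X**_{1}, **X**_{2}, σ_ε², and **Γ**_{2}, not the paper's data) forms **S** as displayed above and takes DF in the fit as *tr*(**S**):

```python
# Build the linear smoother S = X (X'X + sigma2 * blockdiag(0, Gamma2^{-1}))^{-1} X'
# for a toy design, and compute DF in the fit as tr(S). Inputs are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 30, 2, 8
X1 = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
X2 = rng.standard_normal((n, p2))
sigma2_eps = 1.0
Gamma2 = 0.5 * np.eye(p2)

X = np.hstack([X1, X2])
D = np.zeros((p1 + p2, p1 + p2))
D[p1:, p1:] = np.linalg.inv(Gamma2)          # penalty applies only to theta_2
S = X @ np.linalg.solve(X.T @ X + sigma2_eps * D, X.T)

df_fit = np.trace(S)                          # DF in the fitted values
eigs = np.linalg.eigvalsh(S)                  # S is symmetric here
print(round(df_fit, 2))
assert np.all(eigs > -1e-8) and np.all(eigs < 1 + 1e-8)  # eigenvalues in [0, 1]
assert p1 - 1e-8 <= df_fit <= p1 + p2        # between fixed-effect DF and total columns
```

Because the fixed effects are unpenalized, **S** reproduces the column space of **X**_{1} exactly, so *tr*(**S**) is at least the fixed-effect DF.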

The most common definition for residual DF in linear mixed models is derived as follows (Ruppert et al 2003, Section 3.14; Hastie & Tibshirani 1990, Section 3.5). Assuming **f** is fixed, the mean squared error – the expectation of the residual sum of squares given **Γ**_{2}, **Γ**_{0} – is

$$\begin{array}{l}\mathit{MSE}(\mathbf{y}\mid {\mathbf{\Gamma}}_{2},{\mathbf{\Gamma}}_{0},\mathbf{f})={E}_{\epsilon}[{(\widehat{\mathbf{y}}-\mathbf{y})}^{\prime}(\widehat{\mathbf{y}}-\mathbf{y})\mid {\mathbf{\Gamma}}_{2},{\mathbf{\Gamma}}_{0},\mathbf{f}]\\ ={E}_{\epsilon}[{\mathbf{y}}^{\prime}{(\mathbf{S}-\mathbf{I})}^{\prime}(\mathbf{S}-\mathbf{I})\mathbf{y}\mid {\mathbf{\Gamma}}_{2},{\mathbf{\Gamma}}_{0},\mathbf{f}]\\ ={\sigma}_{\epsilon}^{2}tr({(\mathbf{S}-\mathbf{I})}^{\prime}(\mathbf{S}-\mathbf{I}))+{\mathbf{f}}^{\prime}{(\mathbf{S}-\mathbf{I})}^{\prime}(\mathbf{S}-\mathbf{I})\mathbf{f}\\ ={\sigma}_{\epsilon}^{2}(n+tr({\mathbf{S}}^{\prime}\mathbf{S})-2tr(\mathbf{S}))+{\mathbf{f}}^{\prime}{(\mathbf{S}-\mathbf{I})}^{\prime}(\mathbf{S}-\mathbf{I})\mathbf{f}.\end{array}$$

(6)

The term **f**′**(S** − **I)**′**(S** − **I)f** arises from bias. “Assuming the bias term … is negligible” (Ruppert et al 2003, p. 83), then by analogy with ordinary linear models, the DF in the residuals is *n* + *tr*(**S**′**S**) − 2*tr*(**S**), and {residual sum of squares}/(*n* + *tr*(**S**′**S**) − 2*tr*(**S**)) is an unbiased estimate of ${\sigma}_{\epsilon}^{2}$. Since DF in the fit is *tr*(**S**),

$$\begin{array}{l}(\text{DF}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{fit})+(\text{DF}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{residuals})=tr(\mathbf{S})+n+tr({\mathbf{S}}^{\prime}\mathbf{S})-2tr(\mathbf{S})\\ =n+tr({\mathbf{S}}^{\prime}\mathbf{S})-tr(\mathbf{S})<n.\end{array}$$

The inequality holds because **S** is symmetric with eigenvalues in [0,1] and at least one eigenvalue less than 1, so *tr*(**S**′**S**) < *tr*(**S**). This raises the question: where are the missing degrees of freedom?

Taking the mixed-effect model (2) literally, **f** is random, so we can remove the conditioning on **f** in (6). To do so, we need a claim, proved in Appendix (j), that if **f** is distributed as normal with mean **X**_{1}**θ**_{1} and covariance **X**_{2}**Γ**_{2}**X**_{2}′, then

$$\begin{array}{l}{E}_{\mathbf{f}}[{\mathbf{f}}^{\prime}{(\mathbf{S}-\mathbf{I})}^{\prime}(\mathbf{S}-\mathbf{I})\mathbf{f}\mid {\mathbf{\Gamma}}_{2},{\mathbf{\Gamma}}_{0}]=tr({(\mathbf{S}-\mathbf{I})}^{\prime}({\mathbf{X}}_{2}{\mathbf{\Gamma}}_{2}{{\mathbf{X}}_{2}}^{\prime}+{\mathbf{X}}_{1}{\mathit{\theta}}_{1}{({\mathbf{X}}_{1}{\mathit{\theta}}_{1})}^{\prime})(\mathbf{S}-\mathbf{I}))\\ =tr({(\mathbf{S}-\mathbf{I})}^{\prime}(-{\sigma}_{\epsilon}^{2}\mathbf{S})).\end{array}$$

We can now remove the conditioning on **f**. By the claim

$$\begin{array}{l}\mathit{MSE}(\mathbf{y}\mid {\mathbf{\Gamma}}_{2},{\mathbf{\Gamma}}_{0})={E}_{\mathbf{f}}[\mathit{MSE}(\mathbf{y}\mid {\mathbf{\Gamma}}_{2},{\mathbf{\Gamma}}_{0},\mathbf{f})]\\ ={\sigma}_{\epsilon}^{2}(n+tr({\mathbf{S}}^{\prime}\mathbf{S})-2tr(\mathbf{S}))+{E}_{\mathbf{f}}[{\mathbf{f}}^{\prime}{(\mathbf{S}-\mathbf{I})}^{\prime}(\mathbf{S}-\mathbf{I})\mathbf{f}\mid {\mathbf{\Gamma}}_{2},{\mathbf{\Gamma}}_{0}]\\ ={\sigma}_{\epsilon}^{2}(n+tr({\mathbf{S}}^{\prime}\mathbf{S})-2tr(\mathbf{S}))+tr({(\mathbf{S}-\mathbf{I})}^{\prime}(-{\sigma}_{\epsilon}^{2}\mathbf{S}))\\ ={\sigma}_{\epsilon}^{2}(n-tr(\mathbf{S})).\end{array}$$

By the same rationale used to define the usual residual DF, we can now define residual DF as *n* − *tr*(**S**), the same as the new definition of DF in Section 2.3. Therefore

$$(\text{DF}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{fit})+(\text{DF}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{residuals})=tr(\mathbf{S})+n-tr(\mathbf{S})=n,$$

as defined in Section 2; the missing degrees of freedom are in the bias term of mean squared error. The difference between the two definitions can be small or large. For the global mean surface temperature dataset, using the prior with total DF constrained to be under 6 and fixing the variance ratios at their posterior medians, residual DF is 115.7 under the widely-used definition and 117.6 under the new definition. However, for the AL data, fixing the variance ratios at their posterior medians, residual DF is 699.6 under the widely-used definition and 764.4 under the new definition.
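A short numerical check makes the gap concrete. Under a toy smoother of the same ridge form (hypothetical design matrices and variances, not the paper's data), the widely-used definition loses *tr*(**S**) − *tr*(**S**′**S**) > 0 degrees of freedom, while the new definition partitions *n* exactly:

```python
# Compare residual-DF definitions: usual n + tr(S'S) - 2 tr(S) vs new n - tr(S).
# Toy inputs, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 30, 2, 8
X1 = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
X2 = rng.standard_normal((n, p2))
sigma2_eps, Gamma2 = 1.0, 0.5 * np.eye(p2)

X = np.hstack([X1, X2])
D = np.zeros((p1 + p2, p1 + p2))
D[p1:, p1:] = np.linalg.inv(Gamma2)
S = X @ np.linalg.solve(X.T @ X + sigma2_eps * D, X.T)

trS, trStS = np.trace(S), np.trace(S.T @ S)
df_resid_usual = n + trStS - 2 * trS   # conditional-on-f derivation
df_resid_new = n - trS                 # unconditional derivation

assert trStS < trS                             # strict, since some eigenvalues lie in (0, 1)
assert trS + df_resid_usual < n                # DF in fit + usual residual DF < n
assert abs(trS + df_resid_new - n) < 1e-10     # new definition partitions n exactly
print(round(trS - trStS, 3), "DF unaccounted for under the usual definition")
```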

The foregoing highlights an awkward element of contemporary non-Bayesian theory for mixed effect models, specifically those with random effects that do not meet the definition in, say, Scheffé (1959, p. 238: a random effect’s levels can “be regarded as a random sample from some population about which we wish to make our statistical inferences, rather than making them about the particular [levels] in the experiment”). Penalized splines, as in Ruppert et al (2003, Section 4.9), present the clearest example of this difficulty. The theory of penalized splines presumes the smooth function **f** is fixed but unknown (e.g., Ruppert et al 2003, p. 58); the spline is fit using a random-effect model because this gives a minimization problem formally identical to the minimization problem in the original definition of penalized splines, but also allows us to use all the usual mixed linear model tools. For penalized splines, it seems meaningless outside a Bayesian framework to take an expectation over the distribution of **f**, which is presumed fixed if unknown and is formally represented as a random effect for convenience only. Thus, the mean squared error given **Γ**_{0}, **Γ**_{2}, and **f**, as in (6), would seem to be the relevant quantity, and there is no non-Bayesian warrant for removing the conditioning on **f**.

The standard derivation of residual DF is intended to produce denominator DF for F-tests and is rationalized by the assumption that the squared length of the bias is small. For the uses of this machinery described by Ruppert et al (2003) and Hastie & Tibshirani (1990), in which the functions **f** are quite smooth, bias likely is small and this presumption seems reasonable, as indeed it was for the global mean surface temperature example. However, methods tend to outgrow their inventors’ intentions, inevitably so for a scheme as inclusive as the one in Ruppert et al (2003), so this standard derivation will inevitably be applied to cases where the bias is not small, e.g., the AL data. A theoretically tidier treatment would therefore seem desirable; the present paper’s new definition of DF may provide such a treatment.

From a Bayesian viewpoint, **Γ**_{2} can describe either variation in *θ*_{2}, for old-fashioned random effects, or uncertainty about *θ*_{2}, for new-fangled random effects, and no problem arises because a probability distribution, once specified, obeys the same rules regardless of its rationale. However, except when designing experiments, a Bayesian viewpoint does not consider expectations over varying **y**, so the usual derivation of residual DF is of no interest. Thus, a Bayesian interpretation cannot justify removing the conditioning on **f** in the usual derivation of residual DF.

Section 2’s new definition, while not explicitly Bayesian, does treat the covariance matrices **Γ**_{2}*j* as Bayesians do, irrespective of their rationale. In the new definition, an effect’s DF describes that effect’s proportion of total variance, partitioning the sample size *n* naturally (property DF.b) while treating all sources of variation in the same manner, i.e., without singling out error variation for distinctive treatment as in the usual derivation of residual DF.
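Concretely, following the paper's description of the new definition (an effect's DF is the ratio of its modeled variance matrix to the total variance matrix), here is a sketch for a toy model with a single random effect plus error; all inputs are hypothetical:

```python
# New-definition sketch: DF(effect) = tr(X2 Gamma2 X2' V^{-1}),
# DF(error) = tr(sigma2_eps V^{-1}), where V is the total variance of y.
# Toy single-random-effect model; inputs are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
n, p2 = 30, 8
X2 = rng.standard_normal((n, p2))
Gamma2 = 0.5 * np.eye(p2)
sigma2_eps = 1.0

M = X2 @ Gamma2 @ X2.T                      # modeled variance of the effect
V = M + sigma2_eps * np.eye(n)              # total variance of y
Vinv = np.linalg.inv(V)

df_effect = np.trace(M @ Vinv)
df_error = np.trace(sigma2_eps * Vinv)

# The two DF components partition the sample size n exactly (property DF.b):
assert abs(df_effect + df_error - n) < 1e-8
assert 0 <= df_effect <= p2 + 1e-8          # effect DF bounded by rank of X2
```

The partition is exact by construction, since the component variance matrices sum to **V** and tr(**VV**^{−1}) = *n*.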

We have presented a new conception of degrees of freedom for models using random effects and normal probability distributions, anticipated partly by Green (2002). Although the resulting definition of DF arises from linear model theory, it defines a scale for complexity in the space of covariance structures and can be loosely interpreted as the fraction of variation attributed to individual effects. The new definition rationalizes partitioning the total DF in the dataset into pieces for effects in the model and for error. The latter avoids difficulties in the most common definition of residual DF (Section 4). Conceiving an effect’s DF as the fraction of variance attributed to the effect suggests a way to define DF for non-normal random-effect models, possibly avoiding the linear and normal approximations explicit in Lu et al (2007) and implicit in Ruppert et al (2003, p. 212).

The examples illustrated two uses of the new definition. The first is inducing prior distributions on smoothing variances. In some cases, such variances are meaningful quantities but priors on them are often nonintuitive and depend on the scale of the design matrix, as Section 3.2 illustrated. The examples showed how to put scale-invariant priors on interpretable quantities like DF or the fraction of DF allocated to a given effect, e.g., *DF*(**δ**) relative to the total smoothed DF.

The new definition’s second use is as an interpretable measure of each effect’s contribution to the fit’s complexity. This sheds light on competition between effects, which is especially acute in cases where the competing effects have design matrices with column spaces that overlap substantially, as in both examples. Competition of this sort is a large topic that is only now being noticed (e.g., Reich et al 2006); tools like DF can help us describe and understand this phenomenon.

As a byproduct, the AL example gave an instance in which the approximate DF of an effect (Ruppert et al 2003, Section 8.3; Cui 2008, p. 79) performs poorly. It is almost certainly not an accident that this arose in a model with highly collinear effects, in contrast with the examples in Ruppert et al (2003). If the approximation fails only with severe collinearity, it may be possible to develop a diagnostic that indicates when the approximation is accurate, which would be useful given that it computes much faster than exact DF.

Section 3.1’s simulation found a tendency to allocate DF to an effect that was, in fact, absent. In practice, this may not be important because our methods would be used in the context of a larger effort of model building etc., and before fitting this model one would probably drop the heterogeneity term if the data indicated it was absent. Also, if the purpose of the analysis is better estimation of AL at individual sites, this is almost certainly not important. From a theoretical point of view, however, this problem needs to be explained. The proximate cause is that smoothers tend not to shrink effects to zero even when they should. (This is the sole virtue of the zero variance estimates that are frequently produced by maximizing the restricted likelihood.) The extent of undershrinkage depends on the prior and would be reduced by, say, a scaled Beta(0.1, 0.1) prior on the relevant DF, which pushes the posterior more toward the extremes of complete or no shrinkage. Perhaps the larger implication is a need to re-think these extravagantly overparameterized models, which are the problem’s ultimate cause. The difficulty almost certainly arises because of the extreme collinearity of the clustering and heterogeneity effects; partitioning degrees of freedom merely describes the consequence of this collinearity.

Finally, how does DF compare to the popular effective number of parameters *p _{D}*? Recall that in view of Plummer (2008), we distinguish the problems of model complexity and model comparison. Thus DF’s purpose is to describe a model’s complexity in terms drawn from simple, well-understood models, making it possible to control complexity in fitting the model to a dataset by means of a prior distribution on complexity. Regarding

A complexity measure defined independently of the outcome **y** will, in general, be a function of unknowns. This is acceptable: statisticians routinely specify models as functions of unknowns – in Bayesian terms, conditional on unknowns – and it creates no difficulty to describe a model’s complexity as a function of its unknowns. The DF of a *fitted* model is obviously a function of the data, either through plug-in estimates or posterior distributions of the unknown parameters in the covariance matrices. But a complexity measure defined independently of **y**, like DF, allows a prior distribution on complexity to control softly the complexity of the fit. This is not possible with *p_{D}* because in general it is defined in terms of a specific realized **y**.

It is not clear *p_{D}* can be partitioned corresponding to components of the model, as DF can. Spiegelhalter et al (2002, Sections 6.3, 8.1) do partition *p_{D}* in some of their examples.

Thus, while *p_{D}*’s relatively easy computation is certainly desirable, for our purposes this seems outweighed by its conceptual and practical disadvantages.

This work was supported in part by a Merck Company Foundation Quantitative Sciences Fellowship. The authors thank Dr. Bryan Michalowicz of the University of Minnesota School of Dentistry for permission to use the periodontal data and post it on *Technometrics’s* web site. The fourth author’s work was supported in part by National Cancer Institute grant 2-R01-CA095955-05A2.

- Besag J, York JC, Mollie A. Bayesian image restoration, with two applications in spatial statistics (with discussion) Ann Inst Stat Math. 1991;43:1–59.
- Cui Y. PhD dissertation. University of Minnesota Division of Biostatistics; 2008. Smoothing analysis of variance for general designs and partitioning degrees of freedom in hierarchical and other richly parameterized models.
- Daniels MJ. A prior for the variance in hierarchical models. Can J Stat. 1999;27:569–580.
- Eberly LE, Carlin BP. Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models. Stat Med. 2000;19:2279–2294.
- Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1:515–534.
- Gelman A, Pardoe I. Bayesian measures of explained variance and pooling in multilevel (hierarchical) models. Technometrics. 2006;48:241–251.
- Gilthorpe MS, Zamzuri AT, Griffiths GS, Maddick IH, Eaton KA, Johnson NW. Unification of the “burst” and “linear” theories of periodontal disease progression: A multilevel manifestation of the same phenomenon. J Dent Res. 2003;82:200–205.
- Green P. Discussion of Spiegelhalter et al (2002) J Roy Stat Soc, Ser B. 2002;64:627–628.
- Groetsch CW. Inverse Problems in the Mathematical Sciences. Wiesbaden: Vieweg; 1993.
- Hastie TJ, Tibshirani RJ. Generalized Additive Models. Boca Raton FL: CRC/Chapman & Hall; 1990.
- Hodges JS, Carlin BP, Fan Q. On the precision of the conditionally autoregressive prior in spatial models. Biometrics. 2003;59:317–322.
- Hodges JS, Cui Y, Sargent DJ, Carlin BP. Smoothing balanced single-error-term analysis of variance. Technometrics. 2007;49:12–25.
- Hodges JS, Sargent DJ. Counting degrees of freedom in hierarchical and other richly-parameterised models. Biometrika. 2001;88:367–379.
- Lu H, Hodges JS, Carlin BP. Measuring the complexity of generalized linear hierarchical models. Can J Stat. 2007;35:69–87.
- Natarajan R, Kass RE. Reference Bayesian methods for generalized linear mixed models. J Amer Stat Assn. 2000;95:227–237.
- Ortega JM, Rheinboldt WC. Iterative Solution of Nonlinear Equations in Several Variables. New York: Academic Press; 1970.
- Paciorek CJ, Schervish MJ. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics. 2006;17:483–506.
- Plummer M. Penalized loss functions for Bayesian model comparison. Biostatistics. 2008;9:523–539.
- Prado R, West M, Krystal AD. Multichannel electroencephalographic analyses via dynamic regression models with time-varying lag-lead structure. J Roy Stat Soc, Ser C. 2001;50:95–109.
- Reich BJ, Hodges JS. Modeling longitudinal spatial periodontal data: A spatially adaptive model with tools for specifying priors and checking fit. Biometrics. 2008a;64:790–799.
- Reich BJ, Hodges JS. Identification of the variance components in the general two-variance linear model. J Stat Plan Inf. 2008b;138:1592–1604.
- Reich BJ, Hodges JS, Carlin BP. Spatial analyses of periodontal data using conditionally autoregressive priors having two types of neighbor relations. J Amer Stat Assn. 2007;102:44–55.
- Reich BJ, Hodges JS, Zadnik V. Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics. 2006;62:1197–1206.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. New York: Cambridge; 2003.
- Schott JR. Matrix Analysis for Statistics. New York: Wiley; 1997.
- Scheffé H. The Analysis of Variance. New York: Wiley; 1959.
- Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit (with discussion) J Roy Stat Soc, Ser B. 2002;64:583–639.
- Vaida F, Blanchard S. Conditional Akaike information criterion for mixed effects models. Biometrika. 2005;92:351–370.
- Wall MW. A close look at the spatial structure implied by the CAR and SAR models. J Stat Plan Inf. 2004;121:311–324.
- Weisberg S. Applied Linear Regression. New York: Wiley; 1985.
- West M, Harrison J. Bayesian Forecasting and Dynamic Models. 2. New York: Springer; 1997.
- Ye J. On measuring and correcting the effects of data mining and model selection. J Amer Stat Assn. 1998;93:120–131.
- Zhao YH. MS thesis. Division of Biostatistics, University of Minnesota; 1999. Examining mean structure, correlation structure, and differences between examiners in a large periodontal data set.
