The problem is most easily illustrated for designs in which some factor varies between subjects but other factors (or covariates) vary within subjects (e.g., split-plot and repeated-measures designs; Quinn and Keough 2002). For example, in a study of differential allocation, females are paired experimentally to either attractive or unattractive males (Bolund et al. forthcoming). They are allowed to produce a clutch, and egg sizes are measured for all eggs. When the interest is to estimate the effect of the treatment (attractive vs. unattractive male) on mean egg size, it is sufficient to include individual-specific random intercept effects, that is, to allow females to differ in their mean egg sizes and hence intercepts. This will effectively control for the nonindependence of eggs coming from the same female when the factor of interest, the treatment, is applied to some of the females. However, many studies also focus on how the treatment affects the patterns of female investment over the laying sequence within a clutch. In this case, a model that controls for individual-specific intercepts only, but not for individual-specific slopes of investment (of egg size over the laying sequence), will greatly underestimate the P value of 1) the slope main effect and 2) the treatment by laying order interaction, leading to many false-positive findings.
In the above example, the aim is to estimate within-subject slopes (factorial treatments can be considered as slopes as well; Gelman and Hill 2007) and to generalize these slopes to a larger population of individuals from which the subjects were sampled. It is common, however, that individuals differ not only in their absolute trait value (like mean egg size) but also in their slopes of response to some factor or covariate (like change of egg volume over the laying sequence). By estimating fixed effects, we are usually interested in the average slope in a population of individuals. If there is high between-individual variation in slopes, then taking more measurements from the same individual will make the estimate of this particular individual's slope more precise. However, these additional measurements contribute little to making the estimate of the population slope more accurate. Only by measuring more individuals and, hence, more slopes can one be more confident about the average slope in the population. Problematically, random intercept models wrongly treat repeated measurements within individuals as independent data points with respect to the population slope. Hence, estimating slopes from within-individual replicates will give confidence intervals that are too narrow for the population. In the framework of null hypothesis testing, this will lead to too many rejections of the null hypothesis when testing the population-wide mean slope against some specific value (slope main effect) or the slopes of 2 populations against each other (slope-by-treatment interactions).
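The reasoning above can be made explicit with a short variance calculation. This is a sketch under simplifying assumptions that are ours rather than the source's: a balanced design with n individuals, each measured at the same k covariate values x_1, ..., x_k; individual slopes that vary around the population slope with SD σb; and independent within-individual residuals with SD σr. Let b̂_i be the ordinary least-squares slope of individual i.

```latex
\begin{align}
\operatorname{Var}(\hat{b}_i) &= \sigma_b^2 + \frac{\sigma_r^2}{S_{xx}},
  \qquad S_{xx} = \sum_{j=1}^{k} (x_j - \bar{x})^2, \\
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} \hat{b}_i\right)
  &= \frac{\sigma_b^2}{n} + \frac{\sigma_r^2}{n\,S_{xx}}.
\end{align}
```

Taking more measurements per individual increases S_xx and therefore shrinks only the second term. The first term, σb²/n, can be reduced only by sampling more individuals, which is why within-individual replicates add little information about the population slope when σb is large.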
To illustrate the phenomenon of inflated rates of type I error, we generated data sets that mimic data collected from a split-plot design. We randomly assigned 30 virtual individuals to 2 treatments (15 individuals in each treatment). Within individuals, we sampled 5 trait values and the order of these values as a covariate (analogous to egg sizes within a laying sequence of 5 eggs from 1 clutch). We allowed the 30 individuals to vary in their trait value increase over the sequence by drawing slopes from a normal distribution (with a mean of zero and a standard deviation [SD] of σb). Furthermore, we allowed within-individual error by assigning single measurements a deviation from the regression slope drawn from a normal distribution (with a mean of zero and an SD of σr). There was no population difference in means between treatments, population slopes were zero in both treatments, and there was no between-individual variation in mean trait values. These are not essential assumptions of the simulation because introducing differences between treatments (in slopes and/or intercepts) as well as allowing individuals to vary in their mean trait values gave the same results.
We fitted a random intercept model [lmer(trait ~ Treatment*LaySeq + (1|IndID))] to the 150 data points using lmer from the lme4 package in R 2.6.2 (Bates 2007; R Development Core Team 2008). For each randomly created data set, we evaluated whether the confidence intervals for the fixed-effect estimates included the true value. In our simulation, the true values were zero so that the proportion of simulations for which the confidence interval did not include zero is the type I error rate. We let the between-individual variation in slopes (σb) and within-individual scatter around the regression line (σr) vary between 0 and 0.5 and ran 1000 simulations for each parameter combination.
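The simulation logic can be sketched in a few lines of standard-library Python. This is not the lmer analysis described above but a simplified stand-in, and several choices are our assumptions: only the slope main effect is tested (treatment is ignored), the sequence covariate is centered so that individuals do not differ in mean trait value, and the two t quantiles are hard-coded approximations. The "pooled" test fits one intercept per individual and a common slope by ordinary least squares, which, like a random intercept model, treats all 150 points as independent with respect to the slope. The "by individual" test is a one-sample t test on the 30 per-individual slopes, so the individual is the unit of replication. All function and constant names are hypothetical.

```python
import math
import random

N_IND = 30                 # individuals, as in the design described above
SEQ = [1, 2, 3, 4, 5]      # position in the sequence (5 measurements each)
T_CRIT_POOLED = 1.980      # approx. two-sided 5% t quantile, df = 150 - 30 - 1 = 119
T_CRIT_IND = 2.045         # approx. two-sided 5% t quantile, df = 30 - 1 = 29

X_CENTERED = [x - sum(SEQ) / len(SEQ) for x in SEQ]
SXX = sum(c * c for c in X_CENTERED)


def simulate(sigma_b, sigma_r, rng):
    """One data set: a list of 30 individuals, each a list of 5 trait values."""
    data = []
    for _ in range(N_IND):
        slope = rng.gauss(0.0, sigma_b)   # individual slope; population slope is 0
        data.append([slope * c + rng.gauss(0.0, sigma_r) for c in X_CENTERED])
    return data


def slope_tests(data):
    """Test H0: population slope = 0. Returns (reject_pooled, reject_by_individual)."""
    n = len(data)
    slopes = [sum(c * v for c, v in zip(X_CENTERED, y)) / SXX for y in data]
    mean_b = sum(slopes) / n   # equals the pooled OLS slope in a balanced design

    # Pooled test: residuals around individual means and the common slope.
    rss = 0.0
    for y in data:
        ybar = sum(y) / len(y)
        rss += sum((v - ybar - mean_b * c) ** 2 for c, v in zip(X_CENTERED, y))
    df_pooled = n * len(SEQ) - n - 1
    se_pooled = math.sqrt(rss / df_pooled / (n * SXX))

    # By-individual test: standard error from the spread of individual slopes.
    var_b = sum((b - mean_b) ** 2 for b in slopes) / (n - 1)
    se_ind = math.sqrt(var_b / n)

    reject_pooled = se_pooled > 0 and abs(mean_b) / se_pooled > T_CRIT_POOLED
    reject_ind = se_ind > 0 and abs(mean_b) / se_ind > T_CRIT_IND
    return reject_pooled, reject_ind


def type1_rates(sigma_b, sigma_r, n_sim=1000, seed=1):
    """Proportion of simulated data sets in which each test rejects a true null."""
    rng = random.Random(seed)
    pooled = by_ind = 0
    for _ in range(n_sim):
        rp, ri = slope_tests(simulate(sigma_b, sigma_r, rng))
        pooled += rp
        by_ind += ri
    return pooled / n_sim, by_ind / n_sim


if __name__ == "__main__":
    for sigma_b in (0.0, 0.25, 0.5):
        print(sigma_b, type1_rates(sigma_b, sigma_r=0.25))
```

Under these assumptions, the pooled rejection rate should rise well above 0.05 as σb increases, whereas the by-individual rate should stay near the nominal level. The exact values will differ from those obtained with lmer.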
The type I error rate for finding a significant treatment main effect was close to the expected 5% when the between-individual variation in slopes (σb) was low, but for high σb values, the type I error rate was considerably lower, that is, the test was too conservative (figure, left panels; panel-wide means: α = 0.036 [top] and α = 0.017 [bottom]). This reflects a loss of power for testing the between-individual treatment effect when the between-individual variation in slopes was not accounted for. In contrast, the false-positive rate for significant main effects of slopes and for significant slope-by-treatment interactions increased with the between-individual variation in slopes (σb), but this effect became less pronounced as the within-individual scatter around the regression line (σr) increased (figure, center and right panels; panel-wide means: α = 0.23 [top center], α = 0.23 [top right], α = 0.095 [bottom center], and α = 0.098 [bottom right]).
To demonstrate how severe this issue can be in a real data set, we used egg size and egg yolk color measurements from 30 zebra finch pairs (Bolund et al. forthcoming). Each female laid 4 clutches, and all 2–6 eggs from each clutch were measured. Zebra finch eggs increased in size over the laying sequence and egg yolks changed from orange toward yellow. We simulated a random assignment of the 30 pairs to 2 fictional treatments separately for the 4 clutches. Because assignment was random, there was, by definition, no true treatment effect and no slope-by-treatment interaction. However, the proportion of significant slope-by-treatment interactions after 10 000 runs ranged from 0.085 to 0.35 (median of 4 clutches: 0.10) for egg size and from 0.11 to 0.41 (median of 4 clutches: 0.15) for egg color. This is clearly more than the desirable rate of false positives (α = 0.05).