The majority of neuroscience experiments include some type of inferential statistical analysis, where conclusions are reached based on the distance of the observed results from some hypothetical expected value. Discovering how the brain and nervous system work requires the proper application of statistical methods, and inappropriate analyses can lead to incorrect inferences, which in turn lead to wasted resources, biases in the literature, fruitless explorations of non-existent phenomena, distraction from more important questions, and perhaps worst of all, ineffectual therapies that are advanced to clinical trials [1]. Pseudoreplication is a particularly serious error of analysis that has not received much attention in the neuroscience literature, and which Hurlbert defined over twenty years ago as the "... use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent" [3]. Put simply, it is a confusion of the number of data points with the number of independent samples, and it can be illustrated with the following example. Suppose the following information was provided in the Methods section of a manuscript: "Ten rats were randomly assigned to either the treatment or the control group, and performance on the rotarod (a test of motor coordination) was tested on all the rats on three consecutive days. Differences between groups were assessed with a two-tailed independent samples t-test, with p < 0.05 considered statistically significant." Then in the Results section the authors report that "the treatment group did significantly better than the control group (t28 = 2.1; p = 0.045)." Have the authors analysed the data correctly? No. With a sample size of ten rats, there should only be eight degrees of freedom (df = n1 + n2 - 2, where n1 and n2 are the number of independent samples in each group) associated with this statistical test. The concept of degrees of freedom is perhaps not the most intuitive statistical idea, but it can be thought of as the number of independent data points that can be used to estimate population parameters (e.g. means, differences between means, variances, slopes, intercepts, etc.), and whenever something is estimated from the data, a degree of freedom is lost. Therefore the total df reflects the sample size (minus the number of estimated parameters) only if all the samples are independent: measuring the height of ten unrelated individuals provides ten independent pieces of information about the average height of the population from which they were drawn; measuring the height of one person ten times provides only one independent piece of information about the population. In the above rat example, the three observations from each rat (from the three days of testing) were treated as independent samples, and hence the 28 degrees of freedom arose from treating the fifteen data points in each group as if they contained independent information about the effect of the treatment (n1 = n2 = 15, and so 15 + 15 - 2 = 28). Incidentally, the correct analysis with a t-statistic of 2.1 on 8 df has a corresponding p-value of 0.069. In addition to the incorrect degrees of freedom, there is also the problem of false precision, which is discussed at greater length below (see Figure 1). However, it should be noted here that the df problem has greater relevance when sample sizes are small, but false precision is arguably of greater concern in general.
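The dependence of the p-value on the degrees of freedom can be checked directly; a minimal sketch using SciPy, taking the t-statistic of 2.1 from the rotarod example above:

```python
from scipy import stats

t_stat = 2.1  # t-statistic reported in the rotarod example

# Incorrect analysis: 3 observations x 10 rats treated as 30
# independent values, giving df = 15 + 15 - 2 = 28.
p_incorrect = 2 * stats.t.sf(t_stat, df=28)

# Correct analysis: 10 rats, so df = 5 + 5 - 2 = 8.
p_correct = 2 * stats.t.sf(t_stat, df=8)

print(round(p_incorrect, 3))  # 0.045 (apparently significant)
print(round(p_correct, 3))    # 0.069 (not significant at 0.05)
```

The same t-statistic moves from "significant" to "not significant" purely because the degrees of freedom are counted correctly.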
Figure 1 An example of pseudoreplication. Two rats are sampled from a population with a mean (μ) of 50 and a standard deviation (σ) of 10, and ten measurements of an arbitrary outcome variable are made on each rat. The first (incorrect) 90% CI ...
The assumption of independence means that observations within each group or treatment combination are independent of each other. An alternative way of expressing this concept is to say that the errors (residual values) are independent, once the effects of all the other explanatory variables have been taken into account. In addition, other variables that are not included in the analysis (e.g. the order in which the samples were obtained) must not influence the outcome or be correlated with the residuals. The remainder of the introduction will define some commonly used terms, illustrate why pseudoreplication is problematic, and finally, discuss the four situations in which it can arise.
The terms sample, experimental unit, and experiment have overlapping meanings, are often used interchangeably, and can have different meanings based on the context. An experimental unit is defined as the smallest entity that can be randomly assigned to a different treatment condition [4]. A person or a rat is a typical experimental unit, because they can be allocated to different treatments. The sample size is usually reported as the "n" and is defined as the number of experimental units, but the term is slightly ambiguous because one could take two blood samples from a rat (in the morning and afternoon, for example), and therefore there are twice as many samples as rats, but the "sample size" still refers to the number of rats. An observation occurs whenever a value of an outcome variable is recorded, and it is equivalent to the number of data points; if there are twenty rats and only one observation is taken on each rat, then the number of observations equals the sample size (n). If multiple observations are taken from each rat, then observations within each rat are not independent, and therefore all of the observations cannot be summed to give a total sample size. In cell culture experiments, the whole procedure is often repeated three or more times and reported as three "independent replicate experiments". In this case n is the number of experiments. The term experiment is ambiguous in this context because all of the independent trials or runs taken together can also be thought of as "the experiment". Replicates also typically refers to independent observations, and hence the term pseudoreplication when this is not the case. However, Cumming et al. use replicates to refer to "repetition of measurements on one individual in a single condition, or multiple measurements of the same or identical samples", and thus they use the term replicate to refer to observations that are not independent [5]. The difference here is between biological replicates, which are independent (e.g. two unrelated rats are biological replicates), and technical replicates, which are not independent (e.g. dividing a blood sample from a single rat into two sub-samples and measuring the concentration of a substance in each sub-sample). In this paper, replicates refers to biological replicates unless otherwise indicated; references to samples or sample size (n) refer to the number of independent values; and observations refers to individual data points, which most likely are not independent (although observations can be independent if, for example, there is only one observation per animal). The term pseudoreplication is used synonymously with lack of independence of observations, correlated observations, and correlated errors.
Pseudoreplication leads to the wrong hypothesis being tested and false precision
Ignoring lack of independence leads to two major problems. The first is that the statistical analysis is not testing the research hypothesis that the scientist intends; in other words, the incorrect hypothesis is being tested. This is illustrated in Figure 1, where two rats are sampled from a population, and the interest is in determining whether the rats come from a population with a mean of 50 on some arbitrary outcome variable (shown horizontally on the right), or whether their values are far enough away from 50 to conclude that they come from a different population. This can be stated as H0: μ = 50 (the null hypothesis) and H1: μ ≠ 50 (the alternative hypothesis). Ten measurements are made on each rat, and a one-sample t-test can be used to compare the mean of this single sample of rats to a hypothesised population value. The incorrect analysis would give t19 = -7.75, a p-value of 2.7 × 10^-7, and a 95% confidence interval (CI) from 32.9 to 40.2. The correct analysis would give t1 = -2.07, p = 0.287, and 95% CI = (-46.3, 119.4). The change in p-value between the two analyses is six orders of magnitude, which demonstrates the importance of dealing with pseudoreplication appropriately. When calculating standard errors and confidence intervals, and when making inferences between different groups with statistical tests, the assumption is that all the values are independently drawn from the parent population, but clearly the rat that an observation came from partly determines its value. Statistical analyses performed on such data without regard for this structure are often meaningless (in this case the researcher would falsely conclude that the mean of the sample is less than 50). The incorrect 95% confidence interval does not include the true population mean, while the correct 95% CI spans the whole distribution (as one would expect: with only two independent pieces of information there is little certainty about the true population value).
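The contrast between the two analyses can be reproduced with a short sketch. The numbers below are hypothetical and illustrative only, not the actual data behind Figure 1, but they follow the same structure: ten measurements on each of two rats, tested against a population mean of 50.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (illustrative numbers only): ten
# observations on each of two rats.
rat1 = np.array([34, 36, 35, 37, 33, 35, 36, 34, 35, 35], dtype=float)
rat2 = np.array([37, 39, 38, 40, 36, 38, 39, 37, 38, 38], dtype=float)

# Incorrect: all 20 observations treated as independent (df = 19).
all_obs = np.concatenate([rat1, rat2])
t_bad, p_bad = stats.ttest_1samp(all_obs, popmean=50)
ci_bad = stats.t.interval(0.95, df=19, loc=all_obs.mean(),
                          scale=stats.sem(all_obs))

# Correct: only the two rat means are independent (df = 1).
rat_means = np.array([rat1.mean(), rat2.mean()])
t_ok, p_ok = stats.ttest_1samp(rat_means, popmean=50)
ci_ok = stats.t.interval(0.95, df=1, loc=rat_means.mean(),
                         scale=stats.sem(rat_means))

# The incorrect CI excludes 50; the correct CI is far wider and
# includes it.
print(p_bad, ci_bad)
print(p_ok, ci_ok)
```

With these numbers the incorrect test is wildly "significant" while the correct test, built on the two independent rat means, is not.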
Multiple observations on each rat provide increased precision for estimating the true mean for that rat, but do not directly provide increased precision for estimating the population mean in the way that increasing the number of rats does. As the number of samples within each rat increases, the incorrect error bar in Figure 1 will become increasingly narrow, while the correct error bar will remain the same, as it should, because no new information about the population of rats is obtained by further sampling of these two rats. This idea also extends to cases where there is more than one experimental group or condition; it is necessary to distinguish between those measurements that are independent samples from the population, which increase precision and decrease uncertainty about the population parameters (which is what the hypothesis tests are testing), and those measurements that only increase the precision of the value for a particular subject.
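This behaviour of the two error bars follows directly from the standard-error formulas; a small sketch with assumed variance components (the values are illustrative, not taken from the paper):

```python
import math

# Assumed variance components (illustrative values only):
var_between = 100.0  # variability between rats
var_within = 25.0    # variability within a rat
n_rats = 2

def naive_sem(m):
    """SEM if all n_rats * m observations are (wrongly) treated as
    independent; shrinks towards zero as m grows."""
    return math.sqrt((var_between + var_within) / (n_rats * m))

def true_sem(m):
    """True standard error of the grand mean: averaging within a rat
    only reduces the within-rat component, leaving a floor of
    sqrt(var_between / n_rats)."""
    return math.sqrt((var_between + var_within / m) / n_rats)

for m in (1, 10, 100, 1000):
    print(m, round(naive_sem(m), 3), round(true_sem(m), 3))
```

As the number of within-rat measurements m grows, the naive SEM shrinks without limit, while the true SEM levels off at a floor set by the between-rat variability and the number of rats.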
The second problem that arises is that correlations between observations can lead to calculated p-values that are either higher or lower than the true p-value. For the above example, "correlation between observations" refers to the degree of similarity of the observations within each rat, relative to the observations between rats. This is called the intraclass correlation (IC) and is expressed as a ratio of variances. We can model the data in Figure 1 as

y_ij = μ + rat_i + ε_ij

where y_ij are the values of the response, i is an index indicating the rat that the observation comes from (i = 1 or 2 in this example), and j is an index for the observation within each rat (j = 1, ..., 10). The grand mean (the average of the 20 y values) is denoted by μ, rat_i is the amount by which the mean of each rat is above or below the grand mean, and ε_ij are the residuals, which are the distances of each of the 20 values from the mean of their respective rat. The intraclass correlation can then be calculated as

IC = σ²_B / (σ²_B + σ²_W)

where σ²_B is the variance of the means of the rats about the grand mean, and σ²_W is the variance of the residuals (i.e. the unexplained variance). The variability in the data is therefore partitioned into the variability between rats (σ²_B) and the variability within rats (σ²_W). As can be seen from the above equation, as σ²_W gets large, IC approaches zero, and when all the observations within each rat are identical (σ²_W = 0), IC equals one. The IC can thus be interpreted in a similar manner to the Pearson correlation, but restricted to positive values. For the above rat example, σ²_B = 83.3 and σ²_W = 16.6, giving IC = 0.83, which indicates that the observations within each rat are highly correlated.
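The IC calculation for the quoted variance components is a one-liner:

```python
# Intraclass correlation from the variance components quoted above.
var_between = 83.3  # variance of the rat means about the grand mean
var_within = 16.6   # residual (within-rat) variance

ic = var_between / (var_between + var_within)
print(round(ic, 2))  # 0.83
```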
A detailed analysis by Scariano and Davenport showed that both the Type I (false positive) and Type II (false negative) error probabilities can be affected by within-group correlations [6]. When there is a positive within-group correlation (the more common situation), the Type I error probability (α) will be greater than 0.05, and the greater the correlation, the greater the number of false positives. For example, a two independent group comparison with n = 10 in each group and with a modest within-group correlation of IC = 0.30 would give an α probability of 0.37; in other words, 37% of the time (and not 5%) the null hypothesis would be (erroneously) rejected. Thus when there is a positive correlation, null hypotheses will be rejected too often, and this is the reason that violating the independence assumption can be more serious than violating the normality or equal variances assumptions [8]. The four situations in which pseudoreplication can arise are discussed next and summarised in Table 1.
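The inflation of the Type I error rate reported by Scariano and Davenport can be reproduced with a short Monte Carlo sketch (the data-generating model below, a shared group-level random effect, is one simple way to induce a within-group correlation of 0.30):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, ic, n_sims, alpha = 10, 0.30, 5000, 0.05

rejections = 0
for _ in range(n_sims):
    # The null hypothesis is true: both groups come from the same
    # population, but observations within a group share a common
    # random effect, giving an intraclass correlation of 0.30.
    g1 = rng.normal(0, np.sqrt(ic)) + rng.normal(0, np.sqrt(1 - ic), size=n)
    g2 = rng.normal(0, np.sqrt(ic)) + rng.normal(0, np.sqrt(1 - ic), size=n)
    if stats.ttest_ind(g1, g2).pvalue < alpha:
        rejections += 1

false_positive_rate = rejections / n_sims
print(false_positive_rate)  # well above the nominal 0.05
```

The simulated false positive rate lands close to the 0.37 quoted above, far from the nominal 5%.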
Table 1 Four situations in which pseudoreplication can arise.
Repeated measurements on the same experimental unit
A common situation is when observations are taken at different times or under different experimental conditions on the same subjects, and this is usually a planned part of the experimental design. Data of this type are typically analysed with a paired-samples t-test if there are only two conditions or time points, or a repeated measures (RM) analysis of variance (ANOVA) if there are more than two time points. There are a number of advantages of such designs, including a reduction in the number of animals or participants used, and increased statistical power because subjects act as their own control. The important distinction is that observations from different subjects are independent of each other, but not the observations within each subject. These data are often analysed correctly (in the sense that paired samples t-tests are used instead of independent samples t-tests), possibly because undergraduate statistics courses for biologists usually cover the difference between "within subjects" and "between subjects" designs.
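The distinction between the paired and unpaired analyses can be seen in a small sketch (the scores below are hypothetical, illustrative numbers only):

```python
import numpy as np
from scipy import stats

# Hypothetical scores for five subjects, each measured under two
# conditions (illustrative numbers only).
condition_a = np.array([10.0, 12.0, 9.0, 11.0, 13.0])
condition_b = np.array([12.0, 13.0, 11.0, 12.0, 15.0])

# Correct: the two observations from each subject are paired, so test
# the within-subject differences (df = 5 - 1 = 4).
t_paired, p_paired = stats.ttest_rel(condition_b, condition_a)

# Incorrect for this design: treating the ten values as two
# independent groups ignores the pairing.
t_indep, p_indep = stats.ttest_ind(condition_b, condition_a)

print(p_paired, p_indep)
```

Because subjects act as their own control, the paired test removes the between-subject variability and is far more powerful here.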
Data with a hierarchical structure
A second common design where pseudoreplication can occur is when data are hierarchically organised. Biological data are often sampled at different spatial scales or levels of biological organisation. For example, several brains may be sliced into sections, a number of regions on a section may be examined histologically (or perhaps just the left and right sides of the brain), and only a certain number of cells within each region would be examined. Thus there is a hierarchy, with the whole brain (animal) at the top, sections within a brain, regions within a section, and cells within a region (see reference [9] for a graphical example of hierarchical histological data). If cells are the unit of interest, then typically many cells are examined per brain. Consider an experiment with two experimental conditions (treatment vs. control), with one rat in each condition. The outcome variable is the number of synapses on cells in the CA3 region of the hippocampus, and 100 cells are examined in each rat. This would give 2 rats × 100 cells per rat = 200 data points. The incorrect way to analyse these data is with a t-test with an n of 200 (similar to the example in Figure 1). This is incorrect because differences due to the treatment are completely confounded with natural animal-to-animal differences between the two rats. For two groups of equal size, the t-statistic can be written as

t = (x̄1 - x̄2) / (SD × √(2/n))   (Equation 1)

where x̄1 and x̄2 are the group means, SD is the pooled standard deviation, and n is the number of independent samples in each group. The standard deviation (SD) in the denominator of the t-statistic is meant to represent the variability between rats, not within rats. Furthermore, the standard error (SD/√n) is a measure of the uncertainty associated with the means of the population of rats, not the populations of cells within rats. The n in Equation 1 must therefore represent the number of independent observations, which in this case is the number of rats, not the number of cells. Cells within rats will tend to be more similar than cells between rats and are therefore not independent of each other. Including all 200 data points in the analysis as if they were independent gives a false estimate of the precision (i.e. the error term is too small), because t gets big as n gets big. Two rats will never be exactly the same, and therefore it is simply a matter of taking enough measurements on two rats to show that they are statistically different. This point generalises to experiments with more than two groups and more than one factor. If the experiment had used two rats in each experimental condition and 50 cells were observed in each rat, there would still be 200 data points (observations) in total, and the same problem would remain, although the treatment effect would no longer be completely confounded with the inter-rat variability.
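One standard remedy is a summary-measures analysis: average the cells within each rat and analyse the per-rat means. A sketch with simulated data (all effect sizes and variances below are assumed, for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical synapse counts: 2 rats per condition, 50 cells per rat.
# Each rat's mean varies around its condition mean (all effect sizes
# and variances are assumed, for illustration only).
def simulate_condition(cond_mean, n_rats=2, n_cells=50):
    rat_means = rng.normal(cond_mean, 4.0, size=n_rats)
    return np.array([rng.normal(m, 2.0, size=n_cells) for m in rat_means])

control = simulate_condition(100.0)  # shape (2, 50)
treated = simulate_condition(105.0)

# Correct (summary-measures) analysis: average the cells within each
# rat first, then compare the per-rat means (n = 2 rats per group).
t_ok, p_ok = stats.ttest_ind(treated.mean(axis=1), control.mean(axis=1))

# Incorrect analysis: all 100 cells per group treated as independent.
t_bad, p_bad = stats.ttest_ind(treated.ravel(), control.ravel())

print(p_ok, p_bad)
```

The correct analysis has only the number of rats as its sample size; the incorrect analysis borrows a spuriously large n from the cells.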
Another common case of hierarchically structured data is when multiple animals are born in a litter. Animals within a litter are not independent because they share the same parents and the same prenatal and early postnatal environment, and animals are therefore nested within litters [10]. Laboratory animals are often highly inbred and genetically identical (or very similar), but epigenetic and developmental factors may play a role, and two rats from the same litter are likely to be more similar than two rats from two different litters. Litter effects have been found on a variety of outcome variables, including life span [2], body weight [12], total brain volume (after controlling for body weight) [13], behavioural tests (rotarod [14], possibly prepulse inhibition [15]), and plasma concentrations of various substances (leptin [16]; glucose, insulin, triglycerides [17]). It is likely that litter effects are present in many response variables, but few papers mention how these were dealt with at the experimental design stage, or whether the data were examined for the presence of litter effects. If all animals in the control condition are from one litter while all the animals in the treatment condition are from another litter, then the treatment effects will be completely confounded with litter effects, making it difficult to attribute differences between conditions to the effect of the treatment.
Other examples include applying treatments to cages of rats rather than to individual rats (e.g. administering a substance in the drinking water), or applying treatments to pregnant females but examining the effect in the offspring. Here, the cages and pregnant females are the experimental units, not the individual animals, since the treatments can only be applied to whole cages or pregnant females and not to the individual animals. This type of experimental design is often referred to as a split-plot design and is characterised by the restrictions on randomisation; it needs to be distinguished from a design where individual rats can be randomised to different conditions. In addition, cells in the same flask or well of a cell-culture experiment are not independent; they will tend to be more similar than cells in different flasks or wells and will be subject to the same uncontrolled effects.
Observations correlated in space
Observations may be correlated in space because multiple measurements taken at one location will all be affected by the idiosyncratic aspects of that location. For example, 96-well plates often contain small amounts of fluid, and wells near the edges of the plate may evaporate faster than wells in the centre, and thus alter the concentration of substances such as metabolites, secreted hormones, etc. Placing the control samples in the first column of the plate and the treated samples in the second column would therefore not be a good idea. This is also the reason why microarrays have replicate probes for the same gene scattered throughout the array and not placed beside each other, as this accounts for any spatial effects in the quality of the array that may have arisen during manufacturing or handling. Spatial dependence may also arise in incubators for culturing cells. A large cell culture experiment may use two incubators, but differences particular to each incubator may affect the outcome variable. For example, the temperature and humidity levels may be different, or these variables may fluctuate more in one incubator than another, perhaps because one may be used more and thus the door is opened more often as people access their samples. Good experimental design would dictate that the treated samples are not placed in one incubator while the control samples are in the other, as it would be impossible to separate the effect of the treatment from the effect of the incubator.
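Randomising sample positions is the simplest guard against such spatial effects. A minimal sketch of a randomised 96-well layout (the well naming and group sizes are assumed for illustration):

```python
import random

# Sketch: randomise treated and control samples across a 96-well
# plate instead of putting each group in its own column (hypothetical
# layout; 48 samples per group).
random.seed(0)

wells = [f"{row}{col}" for row in "ABCDEFGH" for col in range(1, 13)]
samples = ["treated"] * 48 + ["control"] * 48
random.shuffle(samples)

layout = dict(zip(wells, samples))
# Edge wells are now equally likely to hold either group, so plate
# position effects (e.g. faster evaporation) are not confounded with
# treatment.
```

A blocked randomisation (e.g. balancing groups within each row) would be a further refinement, but even simple shuffling breaks the confounding between position and treatment.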
Observations correlated in time
Unlike repeated measurements on the same samples, observations that are correlated in time are often not a planned feature of the experimental design, but arise from the sampling protocol, the phenomenon under investigation, or the way in which the experiment is conducted. In addition, the observations need not be on the same subject. For example, rats have a circadian rhythm in the stress hormone corticosterone, which peaks at the beginning of the dark (active) phase and gradually decreases throughout the night [18]. Suppose that plasma corticosterone concentration is the main outcome variable and blood samples from twenty rats need to be taken. If the sampling starts at the beginning of the dark phase (i.e. at the peak concentration) and takes 2 hours to complete, there might be an overall decrease in corticosterone concentration in rats that were sampled at later time points compared to earlier ones. This could confound the results if the first ten rats were the control rats and the next ten were in the treatment group, as it would be difficult to distinguish treatment effects from circadian effects. It would therefore be better to alternate rats from each group when sampling the blood. A circadian effect would not be eliminated, but it could now be taken into account by including time or sample number in the model, which would not be possible if treatment were confounded with time. One example of such a time-dependence between sample number and the main outcome variable is discussed in reference [19].
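The alternating design described above can be sketched with a small simulation, in which sample number is included as a covariate in a linear model (the corticosterone values, drift, and effect sizes below are all assumed, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)

# Sketch: alternate treatment and control rats across the sampling
# sequence, then adjust for the circadian drift by including sample
# number in a linear model (hypothetical corticosterone values,
# arbitrary units; the drift and effect sizes are assumed).
order = np.arange(20.0)          # sample number, a proxy for time
group = np.tile([0.0, 1.0], 10)  # alternating control (0) / treated (1)
cort = 200.0 - 3.0 * order + 10.0 * group + rng.normal(0.0, 5.0, size=20)

# Least-squares fit of cort ~ intercept + order + group; because the
# groups alternate, time and treatment are nearly orthogonal and both
# can be estimated.
X = np.column_stack([np.ones(20), order, group])
coef, *_ = np.linalg.lstsq(X, cort, rcond=None)
print(coef)  # roughly [200, -3, 10]
```

Had the first ten samples all been controls, the time and treatment columns of X would have been nearly collinear and the two effects could not have been separated.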