The mouse is the premier model organism for understanding gene function in development and disease. To further the functional annotation of the mammalian genome, the International Mouse Phenotyping Consortium (www.mousephenotype.org/)
aims to phenotype knockouts for all mouse genes, building on the large collection of targeted alleles in C57BL/6N embryonic stem cells available from the International Knockout Mouse Consortium 
. Many centres are screening mutant mouse strains to identify genes with phenotypes of interest and are making this data publicly available 
as primary screen data with pipelines constructed to give a shallow but broad review of an animal’s phenotype. The complementary role of secondary phenotyping is to confirm and extend the primary observations into specialised fields of research. Concern over reproducibility of phenotyping experiments has been raised 
. Some of these reproducibility issues have been traced to environment*genetic interactions 
but others may arise from poor design and analysis 
. Poor experimental design, analysis and reporting were noted to be a significant problem in a systematic review of published papers involving in vivo research. This has led to the publication of the Animal Research: Reporting In Vivo Experiments (ARRIVE) guidelines, a checklist to lead the field towards good practice. This includes ensuring the analysis is appropriate for the design and data characteristics, such that the results are robust and have isolated cause and effect (high internal validity). An experiment is described as having high internal validity when the effect (e.g. phenotypic difference) can be confidently assigned to the treatment (e.g. genotype difference). To achieve high internal validity, careful experimental design is needed to account for potential confounders, and the statistical test used needs to consider the structure of the data appropriately. The threats from poor control of confounding factors have been identified in many biological fields, from biomarker discovery to genome wide association studies.
Through high throughput phenotyping programs, where data are systematically collected on one genetic background, the significant sources of variation can be identified, and it has become clear that batch (defined here as those readings collected on a particular day) can lead to large variation in phenotyping variables. This observation has significant implications for the data analysis of both high throughput and secondary phenotyping experiments, where the use of small batches of animals is common. It is challenging and costly to produce sufficient animals of the right age within a narrow time window for an experiment. Consider the Sanger Mouse Genetics Project, which requires 7 male and 7 female homozygote mice, generated by a heterozygote cross; a best case scenario would require 14 mating pairs to be assembled at the same point in time 
. To generate these mating pairs, there would be a staged breeding process involving several rounds of expansion, depending on breeding success. This best case scenario is commonly hampered by fecundity, viability or other phenotypic problems within a line, and hence achieving a one-batch pipeline requires a significantly larger number of pairings. In contrast, by accepting smaller numbers of mice in multiple batches, fewer breeding pairs need to be established. The smaller scale allows the generation of mice firstly to address developmental and breeding issues and secondly to feed the pipeline over time and across subsequent litters. As soon as mice of the right age are available, they are entered into the pipeline; the average batch size is three mice per gender for an allele, whereas we aim to phenotype 7 mice per gender for each allele. This batch approach has allowed us to utilise animals that would otherwise be discarded because the process had not generated the required experimental sample size. Multiple small batches allow us to meet the needs of the high throughput pipeline and also help reduce the breeding cost per line. However, this approach has implications for the data analysis: in the presence of temporal variation, the phenotypes of mice in the same batch are likely to be more similar than those from different batches. Furthermore, the operational constraints arising in a high throughput environment make optimal experimental design impractical; typically, mutant and control mice are not assayed on the same day, so any phenotypic differences could be due to genotype or to subtle changes in the environment (e.g. temperature fluctuations or pipetting errors). Data analysis, with the aim of controlling for variability over time 
, is a major challenge for high throughput phenotyping and often a problem in secondary phenotyping.
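The consequence of assaying mutants and controls on different days can be illustrated with a small simulation. The sketch below is not the authors' analysis; it assumes Normal errors with unit variance, one batch (day) per group, 7 animals per group and no true genotype effect, and the function names and the batch standard deviation are hypothetical choices made for illustration.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 12 degrees of freedom
# (two groups of 7 animals). Valid only for n_per_group = 7.
T_CRIT_DF12 = 2.179


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic (equal group sizes assumed)."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(2 * pooled_var / n)


def false_positive_rate(batch_sd, n_per_group=7, trials=2000, seed=1):
    """Fraction of simulated null experiments that the t-test calls significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Mutants and controls are each measured on their own day (batch),
        # so each group shares one random day-level shift.
        mutant_day = rng.gauss(0, batch_sd)
        control_day = rng.gauss(0, batch_sd)
        mutants = [mutant_day + rng.gauss(0, 1) for _ in range(n_per_group)]
        controls = [control_day + rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(t_statistic(mutants, controls)) > T_CRIT_DF12:
            hits += 1
    return hits / trials


rate_without_batch = false_positive_rate(batch_sd=0.0)
rate_with_batch = false_positive_rate(batch_sd=1.0)
```

Under these assumptions the rate without batch variation should sit near the nominal 5%, while a day-to-day standard deviation equal to the within-day standard deviation inflates it several-fold, because the day shift is indistinguishable from a genotype effect.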
Body weight is known to correlate with many other biologically interesting variables (e.g. bone density, blood calcium level and high-density lipoproteins) 
. Furthermore, body weight is a highly heritable trait, and consequently commonly altered in knockout lines of mice 
. It is therefore unsurprising when the knockout also results in a difference in these other variables. This raises the question of whether the change in these variables is as expected given the observed change in body weight. Statistically, body weight in these examples is described as a confounding variable, that is, one that is associated with both the probable cause (genotype) and the outcome (phenotypic trait of interest). To understand the observed phenotype, the analysis pipeline should assess whether the change observed was due to the genotype or associated with the body weight change accompanying the genotype change.
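The confounding role of body weight can be made concrete with a toy calculation. The sketch below is a deliberately simplified stand-in for a full model: it fits the weight-trait line in controls only and asks how far mutants deviate from it. The data are hypothetical and noise-free, and the function names are illustrative assumptions.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def weight_adjusted_difference(ctrl_weight, ctrl_trait, mut_weight, mut_trait):
    """Mean deviation of mutants from the control weight-trait line."""
    intercept, slope = fit_line(ctrl_weight, ctrl_trait)
    residuals = [t - (intercept + slope * w) for w, t in zip(mut_weight, mut_trait)]
    return statistics.mean(residuals)


# Hypothetical data: the trait depends on weight only (trait = 3 + 0.5 * weight),
# and the mutants are lighter than the controls.
ctrl_weight = [24, 25, 26, 27, 28]
mut_weight = [20, 21, 22, 23, 24]
ctrl_trait = [3 + 0.5 * w for w in ctrl_weight]
mut_trait = [3 + 0.5 * w for w in mut_weight]

raw_difference = statistics.mean(mut_trait) - statistics.mean(ctrl_trait)
adjusted_difference = weight_adjusted_difference(
    ctrl_weight, ctrl_trait, mut_weight, mut_trait
)
```

Here the raw difference in means is non-zero even though genotype has no direct effect on the trait, whereas the weight-adjusted difference vanishes; the whole apparent phenotype is explained by the weight change.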
Current analysis methods can be divided into two types: a reference range methodology (RR) (as implemented at http://www.sanger.ac.uk/mouseportal/), or the application of traditional statistical tests 
. In RR, control mice of the same genetic background and sex are used to estimate the natural variation in a trait. In the Sanger Mouse Genetics Project, a knockout has an “abnormal phenotype” if over 60% of the mutant mice lie outside the range of 95% of the natural variation in the controls. This percentage was empirically selected to ensure the majority of mice for a line were affected. With this method, there is no p-value, and the false positive and false negative rates are undefined and not controlled. Traditional statistical tests, such as a Student’s t-test or ANOVA, do control the false positive rate, provided that factors such as body weight and batch do not affect the phenotype. Moreover, as the Student’s t-test is the most powerful statistical test for a difference in the means of two groups with Normal errors, it should be preferred to the RR in principle. However, a more important consideration is that the traditional tests can produce false positive phenotype calls if weight and batch do affect the phenotype.
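The RR rule described above can be sketched in a few lines. The text does not say how the 95% range is computed, so this sketch assumes it is the interval between the 2.5th and 97.5th percentiles of the control values; the function names are hypothetical and this is not the Sanger implementation.

```python
import statistics


def reference_range(controls):
    """Central 95% range of the controls (assumed: 2.5th to 97.5th percentile)."""
    cuts = statistics.quantiles(controls, n=40)  # 39 cut points in steps of 2.5%
    return cuts[0], cuts[-1]


def rr_call(controls, mutants, threshold=0.60):
    """Return (fraction of mutants outside the range, abnormal-phenotype call)."""
    low, high = reference_range(controls)
    outside = sum(1 for m in mutants if m < low or m > high)
    fraction = outside / len(mutants)
    # The call is made only when MORE than the threshold fraction lies outside.
    return fraction, fraction > threshold


# Hypothetical control values and two hypothetical mutant lines of 7 mice each.
controls = list(range(1, 101))
fraction_a, abnormal_a = rr_call(controls, [0, 1, 2, 50, 99, 100, 101])
fraction_b, abnormal_b = rr_call(controls, [0, 40, 50, 60, 70, 80, 101])
```

The output is a yes/no call with no associated p-value, which is why the error rates of the procedure are not defined.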
An alternative method is the linear mixed model (MM). MM are a class of statistical models suited to modelling multiple sources of variability in a phenotype, where some explanatory factors (such as sex, body weight and mutant genotype) are assumed to take fixed values that affect the population mean, whilst others, such as batch, are treated as affecting the covariance structure, so that animals from the same batch have correlated phenotypes. MM are an established technique in the analysis of complex traits (for example in maize and mice), but to our knowledge they have rarely been used within the mutant mouse phenotyping community, and no comparison or discussion of this method versus others has been published. The few examples we have identified include Kafkafi et al., who used a MM to compare open field data for various mouse lines across institutes to assess the prevalence of genotype*environment interactions and treated the variation between laboratories as a random effect; Wainwright et al., who, in a study of the impact of pre-natal ethanol exposure in mice on behaviour and brain size, demonstrated the value of a MM approach in which litter was treated as a random effect over an ANOVA on litter mean data; and Goncalves et al., who used a MM to query cardiovascular data where a repeated measures design had been used in mice and the subject was treated as the random effect.
With high throughput data, we are treating batch as the random effect adding variation to the data. The variation in batch arises from multiple factors including technician, reagent lot, day, cage, mother and litter size 
. All these effects are modelled and tested within the MM framework. For each mutant strain, we test the contributions of sex, weight, genotype and genotype-by-sex interaction by fitting two nested mixed models (Equation , and Equation ), where the phenotype of mouse i is assessed within the j-th batch. (See for parameter and associated definitions.) A comparison between the fits of the models tests whether the phenotype is mediated by a body weight change. The MM can be interpreted as a generalisation of the t-test that takes into account the explanatory variables, in the sense that it is almost identical to the t-test if they are not significant.
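The two equations are not reproduced in this section. As an illustration only, assuming a random intercept for batch and Normal errors, a pair of nested models of the kind described could be written as below. The coefficient names, the variance symbols and the exact set of terms are notational assumptions and may differ from the authors' definitions.

```latex
\begin{align}
y_{ij} &= \beta_0 + \beta_1\,\mathrm{Genotype}_i + \beta_2\,\mathrm{Sex}_i
        + \beta_3\,(\mathrm{Genotype}\times\mathrm{Sex})_i
        + b_j + \varepsilon_{ij} \\
y_{ij} &= \beta_0 + \beta_1\,\mathrm{Genotype}_i + \beta_2\,\mathrm{Sex}_i
        + \beta_3\,(\mathrm{Genotype}\times\mathrm{Sex})_i
        + \beta_4\,\mathrm{Weight}_i + b_j + \varepsilon_{ij} \\
b_j &\sim N(0, \sigma_b^2), \qquad \varepsilon_{ij} \sim N(0, \sigma_e^2)
\end{align}
```

In this form the first model is nested in the second (it is the special case $\beta_4 = 0$). One common way to compare such fits, when both are estimated by maximum likelihood, is the likelihood ratio statistic $\Lambda = 2(\ell_2 - \ell_1)$, which is referred approximately to a $\chi^2$ distribution with one degree of freedom. Comparing the genotype terms between the two fits then indicates whether an apparent genotype effect persists once weight is accounted for.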
For smaller scale projects, such as in secondary phenotyping, where the number of batches is limited, there is a potential alternative to the MM: treating batch as a fixed effect rather than a random effect and then using a generalised linear model. As the measured batches are a random subset of all possible batches, treating batch as a random variable in a MM reflects the random selection of batch and is thus theoretically more appropriate. This can be seen in that the user is typically not interested in batch itself, i.e. what the result was on Wednesday and how exactly it differed from Monday. The primary need is to account for batch in the analysis. Furthermore, the MM will be more sensitive, as it economises on the number of degrees of freedom used by the factor levels; instead of estimating a mean for every single factor level, the random effect model estimates the distribution of the mean.
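The idea of estimating the distribution of batch means, rather than one mean per batch, can be sketched with the classical ANOVA (method-of-moments) estimator of variance components. This is a simplified stand-in, not the likelihood-based fitting normally used for a MM; it assumes equal batch sizes, and the function name and data are hypothetical.

```python
import statistics


def variance_components(batches):
    """ANOVA estimates for a balanced one-way random-effects layout.

    Returns (between-batch variance, within-batch variance).
    """
    k = len(batches)
    n = len(batches[0])
    if any(len(b) != n for b in batches):
        raise ValueError("this sketch assumes equal batch sizes")
    grand_mean = statistics.mean([x for b in batches for x in b])
    # Mean square between batches, on k - 1 degrees of freedom.
    ms_between = n * sum((statistics.mean(b) - grand_mean) ** 2 for b in batches) / (k - 1)
    # Mean square within batches; with equal sizes this is the average batch variance.
    ms_within = sum(statistics.variance(b) for b in batches) / k
    # Truncate at zero, since a variance cannot be negative.
    between = max(0.0, (ms_between - ms_within) / n)
    return between, ms_within


# Hypothetical readings from three days with strong day-to-day shifts.
between_var, within_var = variance_components(
    [[9, 10, 11], [19, 20, 21], [29, 30, 31]]
)
intraclass_correlation = between_var / (between_var + within_var)
```

With k batches, a fixed-effect treatment spends k - 1 parameters on batch means, whereas the random-effect view summarises batch with a single between-batch variance, which is the saving in degrees of freedom referred to above.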
To assess these approaches, we first examined control data to characterise the temporal variation between batches. We then considered the applicability of the various analysis methods in the presence of multiple batches. To demonstrate the issues, we analysed four randomly-selected mutant colonies (Ppp3ca<tm2e(EUCOMM)Wtsi>, and Slc25a21<tm1a(KOMP)Wtsi>) from the Sanger Mouse Genetics Project, with a focus on seven traits from the Dual-Energy X-Ray Absorptiometry (DEXA) screen, which focuses on bone and tissue composition. Here we show that a linear mixed model is an appropriate method to query data which has a batch issue. We also show how this approach detects subtle but important quantitative differences in phenotype that are currently overlooked. This manuscript intends to demonstrate that this method is a significant improvement over methods currently applied in identifying phenotypes and has ethical benefits.