Accurate measurement of the variable of interest is fundamentally important in any health research or practice setting. However, it is widely recognised that measurements simultaneously made on the same subject or specimen by different instruments, methods or observers invariably yield different empirical values. As such, evaluation of measurement quality is a central issue in deciding the utility of any instrument, method or observer [1]. Measurement validity and reproducibility are essential elements in determining this quality. Validity is the degree to which a measurement measures what it purports to measure and reproducibility is the degree to which a measurement provides the same result each time it is performed on a given subject or specimen [2]. Reproducibility is invariably assessed using agreement analysis of within (intra) and between (inter) instrument, method or observer measurement comparison studies. For ease of exposition, we shall refer to instrument, method or observer comparisons simply as method comparisons hereafter.
In measurement method comparison studies, the main interest is to determine whether the measurements made on the same subject or specimen by different methods can be used interchangeably [3]. Typically, measurement method comparison studies are motivated when newer, less invasive, safer or cheaper measurement techniques become available and we wish to assess the agreement between them and some "gold standard" or existing technique. Lack of agreement between different methods is inevitable, as all instruments measure with some error, but the questions of interest are by how much the methods disagree and whether this difference is important. Multiple statistical strategies exist that can be used to assess this form of agreement [3], including the Bland-Altman limits of agreement approach [4], regression techniques [7], nonparametric methods [6], and survival-agreement plots [9]. As the Bland-Altman limits of agreement approach is simple to employ, practical, and detects bias, it has become the preferred method within health research in recent years [3].
In its simplest form, the Bland-Altman limits of agreement approach compares unreplicated paired measurements between two methods over a number of subjects or specimens [5]. A graphical depiction of differences between paired observations versus their average is typically presented in a scatter-plot. Generally, superimposed on the scatter-plot is a horizontal line indicating bias (calculated as the mean difference between measurement pairs, $\bar{d}$) and horizontal lines giving the 95% limits of agreement (calculated, assuming the differences are approximately normally distributed, using the standard deviation of the differences, $s_d$, as $\bar{d} \pm 1.96 \times s_d$). The limits of agreement define the range within which 95% of the differences between measurements by the two methods are predicted to lie. The scatter-plot is used to check whether any patterns exist in the data that would violate the method's assumptions or indicate that a data transformation is necessary. A histogram of the paired differences ordinarily accompanies the scatter-plot and should appear approximately normally distributed. Only once these checks are completed and the assumptions satisfied can an assessment be made of the acceptability of the quantified level of agreement for clinical or epidemiological purposes.
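The calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming unreplicated paired measurements and approximately normally distributed differences; the function names and the example numbers are hypothetical and do not come from the data analysed in this paper.

```python
import statistics


def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (bias, lower limit, upper limit) for unreplicated pairs.

    bias is the mean of the paired differences (a - b); the limits are
    bias -/+ z times the sample standard deviation of the differences.
    """
    if len(method_a) != len(method_b):
        raise ValueError("paired measurements must have equal length")
    if len(method_a) < 2:
        raise ValueError("at least two pairs are needed to estimate the SD")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD, n - 1 denominator
    return bias, bias - z * sd, bias + z * sd


def bland_altman_points(method_a, method_b):
    """Return (average, difference) pairs for the scatter-plot."""
    return [((a + b) / 2, a - b) for a, b in zip(method_a, method_b)]


# Hypothetical illustrative values, not real study data.
a = [10, 12, 14, 16]
b = [9, 12, 13, 14]
bias, lower, upper = limits_of_agreement(a, b)
points = bland_altman_points(a, b)
```

Plotting the pairs in `points` with horizontal lines at `bias`, `lower` and `upper` gives the usual display; the checks on patterns and normality still have to be made by eye or with separate diagnostics.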
At times, however, more than two measurement observers or instruments are of interest and are assessed simultaneously. For example, the research question that motivated this paper was: where should pedometers (devices for counting steps) be positioned on children (left hip, right hip, or the back) to give the best agreement with observed step counts (the 'gold-standard')? Most statistical approaches use separate pair-wise comparisons of methods in these situations [6]. However, this situation lends itself to a multivariate form of analysis.
Measurement repeatability is important in measurement method comparison studies because it limits the amount of agreement which is possible [5]. If methods have poor repeatability, then there is likely to be considerable variation in repeated measurements on the same subject or specimen, resulting in poor agreement. Given the importance of repeatability, Bland and Altman advocated in their 1986 paper a design that allowed estimation of both the limits of agreement between two methods and the coefficients of repeatability for each method [5]. However, in 2003, these authors noted, to their chagrin, that this approach had not been widely adopted by researchers [4].
It might be opined that one of the primary reasons why so few repeated measurement studies have been undertaken is the lack of readily available and easily implemented statistical machinery for the analysis of such data, especially if the number of replicates is unbalanced or some data are missing. In an effort to circumvent this problem, Bland and Altman in 1999 presented analytical techniques similar to their limits of agreement approach to quantify the repeatability of a method where the underlying values for subjects remain static over replications (where values can be considered exchangeable), using one-way analysis of variance methods and variance component techniques [6]. They also described a method for analysing replicated data in pairs, where several pairs of measurements are made by two methods on each subject or specimen and the underlying true value changes from pair to pair (here the measurement pairs are considered non-exchangeable). While most of these methods are straightforward and relatively easily implemented, some of the assumptions are restrictive and potentially unrealistic [3]. Moreover, should there be more than two methods under consideration, the proposed techniques are not easily generalised to assess these methods simultaneously.
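For the exchangeable case, the one-way analysis of variance idea mentioned above can be illustrated as follows. This is a hedged sketch rather than the procedure of the cited papers: it assumes replicates on a subject share a single true value, estimates the within-subject standard deviation from the pooled residual mean square (with subject as the factor), and uses the commonly quoted repeatability coefficient of 1.96 × √2 × within-subject SD. The function names and data are hypothetical.

```python
import math


def within_subject_sd(replicates):
    """Pooled within-subject SD from a one-way ANOVA on subject.

    replicates is a list with one inner list of repeated measurements
    per subject; the number of replicates may differ between subjects.
    """
    ss_within = 0.0
    df_within = 0
    for values in replicates:
        if not values:
            continue  # a subject with no data contributes nothing
        mean = sum(values) / len(values)
        ss_within += sum((v - mean) ** 2 for v in values)
        df_within += len(values) - 1
    if df_within <= 0:
        raise ValueError("need a subject with at least two replicates")
    # Residual mean square = SS within / (N - number of subjects).
    return math.sqrt(ss_within / df_within)


def repeatability_coefficient(replicates, z=1.96):
    """Assumed form: z * sqrt(2) * within-subject SD."""
    return z * math.sqrt(2) * within_subject_sd(replicates)


# Hypothetical unbalanced data: three subjects, 2, 3 and 2 replicates.
data = [[10, 12], [20, 20, 23], [5, 7]]
s_w = within_subject_sd(data)
rc = repeatability_coefficient(data)
```

Because the sums of squares are pooled over subjects, unbalanced numbers of replicates are handled directly, but the sketch is limited to one method at a time and does not extend to the non-exchangeable case.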
In 2004, Carstensen described more general regression and variance component methods for the analysis of such data [8]. While conceptually appealing, these methods can be difficult to implement, thereby limiting their utility. Recent efforts by Carstensen and colleagues have focused on reporting simplified versions of these methods and developing new techniques with greater practical utility [11].
Until now, there have been no published Bayesian methods focusing on measurement method comparison studies. This is perhaps surprising given the increased utilisation of Bayesian techniques and their apparent suitability to this type of problem. In a complementary analysis of repeated measurements of paired outcomes data, a multivariate hierarchical Bayesian method has already been successfully employed and many salient advantages described [12]. Bayesian methods have the advantage of embodying and yielding parameter distributions rather than point estimates; the balance of the data is unimportant; multiple methods can be compared simultaneously in a single analysis; they are readily implemented and interpreted; and they are easily generalised to more complex study designs and hierarchies [12]. As bounded prior distributions can be incorporated into Bayesian analyses, sensible posterior distributions and credible regions can be derived for all parameters of interest, and many convergence or computational problems associated with non-Bayesian methods can be eliminated. Moreover, the methods are easily extended to include informative prior distributions, allow covariates and subject subgroup structures to be incorporated, and provide probabilistic subject-specific and overall group results [12].
Building on the limits of agreement framework, this paper advocates assessing agreement in repeated measurement method comparison studies using a fully parametric multivariate hierarchical Bayesian approach. Two models are proposed; the choice between them depends on whether the underlying values of the variable of interest are exchangeable. Like the approach propounded by Bland and Altman in 1999, one model assumes exchangeable values for each subject while the other accommodates non-exchangeable values [6]. Section 2 describes the two related statistical models we employ. Using data previously presented and analysed by Bland and Altman [6] and new data from Oliver and colleagues [15], we illustrate the use of the proposed models with numerical results in Section 3. Concluding remarks are then presented in Section 4.