Classical precision metrics for single studies
Let P represent one particular measurement process, X the results generated by that process, and x a particular single result. For the data considered here, P is the log2-transformed MAS5 evaluation of a particular probeset, X the set of log2-transformed MAS5 results for that probeset in all arrays studied, and x the log2-transformed MAS5 result for a particular array.
A variety of linear models have been employed to characterize measurement processes from various types of repeated measurements [26]. While complicated models are appropriate when stationary effects (biases) are expected and of interest, given the among-round variation in participants and their measurement protocols, we begin with the general ISO 5725 model. Rather than relying on strong assumptions about the structure of the variance, this simple model is designed to have the general applicability needed for a standard approach. It asserts that each x can be expressed as the sum of three components [12]:
$$x = \bar{X} + B + \varepsilon$$
where $\bar{X}$ is either the true value or, more typically, a consensus estimate of the quantity being measured, B is the systematic difference (bias) between $\bar{X}$ and the expected result for the given participant's implementation of P (as estimated from replicate measurements), and ε is the random difference between the expectation for the implementation (i.e., $\bar{X}$ + B) and the given value. For a single material evaluated in a single study (that is, at some given point in time), the ε are assumed to follow a random distribution centered on zero with a standard deviation associated with the (metrological) repeatability precision characteristic of P. Likewise, the variability of the B among all participants is assumed to follow a random distribution centered on zero with a standard deviation associated with the (metrological) reproducibility precision of P. These terms are described more fully in the Results Section; we here detail a standard approach to their estimation.
Let xijk represent the kth of Nm replicate measurements of P reported by the jth participant in the ith study. The standard deviation of the replicates, s(xij), estimates the random variability of P as implemented by the jth participant in the ith study:
$$s(x_{ij}) = \sqrt{\frac{\sum_{k=1}^{N_m} \left(x_{ijk} - \bar{x}_{ij}\right)^2}{N_m - 1}}$$
where $\bar{x}_{ij}$ is the mean of the replicates. Assuming that the variance magnitude is roughly similar for all participants (as is the case, see Figure ), the expected random variability of P common to all participants (i.e., its repeatability precision) in the ith study, sri, can be estimated by combining the individual s(xij) over all Npi participants. While the general formulae detailed in [12] describe variable numbers of replicate measurements for different participants, for these data the same number of replicates was reported by every participant in every round. The appropriate formula for pooling variance in this circumstance is:
$$s_{ri} = \sqrt{\frac{\sum_{j=1}^{N_{pi}} s^2(x_{ij})}{N_{pi}}}$$
The sri can be interpreted as the "average" standard deviation expected for technical replicate measurements made by a typical laboratory, where "typical" is defined by the population of actual participants.
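The pooling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the results for one probeset in one study are held as one list of replicates per participant, the function name `repeatability_sd` is hypothetical, and the example values are invented rather than taken from the study data.

```python
import math
from statistics import mean, variance


def repeatability_sd(replicates):
    """Pooled repeatability standard deviation (s_ri) for one study.

    `replicates` holds one list of replicate results per participant.
    Assumes every participant reported the same number of replicates,
    so the pooled variance is the plain average of the variances.
    """
    n_m = len(replicates[0])
    if n_m < 2 or any(len(r) != n_m for r in replicates):
        raise ValueError("need the same number (>= 2) of replicates per participant")
    # statistics.variance uses the N_m - 1 denominator, giving s^2(x_ij).
    within_variances = [variance(r) for r in replicates]
    return math.sqrt(mean(within_variances))


# Invented log2 signals: three participants, three replicates each.
study = [[8.0, 8.2, 8.4], [9.0, 9.1, 9.2], [7.5, 8.0, 8.5]]
s_r = repeatability_sd(study)
print(round(s_r, 4))  # 0.3162
```

With unequal replicate counts the simple average would no longer be appropriate, and the degrees-of-freedom-weighted form described in [12] would be needed instead.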
The between-participant precision for the ith study, sLi, is estimated from the standard deviation of the estimated participant-specific biases:
$$s_{Li} = \sqrt{\max\left(0,\ \frac{\sum_{j=1}^{N_{pi}} \left(\bar{x}_{ij} - \bar{x}_{i\cdot}\right)^2}{N_{pi} - 1} - \frac{s_{ri}^2}{N_m}\right)}$$
where max() is the "take the maximum value of the arguments" function, $\bar{x}_{ij}$ is the mean of the jth participant's replicates, and $\bar{x}_{i\cdot}$ is the mean of the participant mean values. The $s_{ri}^2/N_m$ term estimates the repeatability contribution to the variance of the biases. With atypically noisy replicate measurements, the corrected bias variance is defined as zero, allocating the observed variance to the least-complex source. The sLi estimates the extent of agreement among the various implementations of P used by the participants. Ideally, all participants will observe the same mean value for X and the value for sLi will be near zero; in practice, studies involving more than one measurement protocol often (by conscious study or from hard experience) discover sLi to be several times larger than sri.
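The between-participant calculation, including the floor at zero, might be sketched as follows. The function name `between_participant_sd` and the data are hypothetical, and the sketch assumes equal numbers of replicates per participant, as in these studies.

```python
import math
from statistics import mean, variance


def between_participant_sd(replicates):
    """Between-participant standard deviation (s_Li) for one study.

    The variance of the participant means is corrected for the
    repeatability contribution (s_ri^2 / N_m) and floored at zero.
    Assumes the same number of replicates for every participant.
    """
    n_m = len(replicates[0])
    s_r2 = mean([variance(r) for r in replicates])    # pooled s_ri^2
    participant_means = [mean(r) for r in replicates]
    var_of_means = variance(participant_means)        # N_pi - 1 denominator
    return math.sqrt(max(0.0, var_of_means - s_r2 / n_m))


# Invented data: clear systematic differences among three participants.
study = [[8.0, 8.2, 8.4], [9.0, 9.1, 9.2], [7.5, 8.0, 8.5]]
s_L = between_participant_sd(study)
print(round(s_L, 4))  # 0.5568

# Invented data: noisy replicates with similar means, so the corrected
# variance would be negative and is set to zero.
noisy = [[8.0, 9.0, 10.0], [8.5, 9.5, 10.5]]
s_L_noisy = between_participant_sd(noisy)
print(s_L_noisy)  # 0.0
```

The second example shows why the max() floor matters: without it, the square root of a negative corrected variance would be undefined.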
The reproducibility of P during the ith study, sRi, is estimated by combining the sri and sLi variance components:
$$s_{Ri} = \sqrt{s_{ri}^2 + s_{Li}^2}$$
The sRi combines all of the factors influencing P at the time the study was performed; the implementation of P in a typical laboratory is expected to yield results that agree with results of other such users within confidence limits appropriate to a normal distribution having mean $\bar{X}$, the consensus value, and standard deviation sRi.
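A brief sketch of how the two components combine, and of what "confidence limits appropriate to a normal distribution" could look like in practice. The function names are hypothetical, the input values are invented, and the 1.96 multiplier (approximately 95 % coverage) is an assumption made for illustration; the text does not specify a coverage level.

```python
import math


def reproducibility_sd(s_r, s_l):
    """Combine repeatability and between-participant SDs in quadrature."""
    return math.hypot(s_r, s_l)  # sqrt(s_r**2 + s_l**2)


def approx_normal_limits(center, s_rep, z=1.96):
    """Symmetric limits center +/- z * s_rep.

    z = 1.96 (about 95 % coverage) is an assumed, illustrative choice.
    """
    half_width = z * s_rep
    return center - half_width, center + half_width


# Invented components on the log2 scale.
s_R = reproducibility_sd(math.sqrt(0.10), math.sqrt(0.31))
low, high = approx_normal_limits(8.4, s_R)
print(round(s_R, 4))  # 0.6403
```

Because the components add as variances, the larger of the two dominates sRi: halving a small sri changes sRi very little when sLi is several times larger.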
The notation used in the above calculations is summarized in Additional file 1. Additional file 2 lists the data and results for the above calculations for one exemplar probeset. Figure displays the relationships among the above estimates for all 31054 probesets in Round 1. While repeatability magnitude is related to signal level, the magnitude of the between-participant precision varies considerably regardless of level. A considerable number of the sL1 are plotted along the left-hand margin (sL1 = 2^0 = 1), a consequence of estimating variance with a relatively small number of replicate measurements.
Additional file 3 summarizes the results of the above calculations for four exemplar probesets in all three rounds. Figure displays these results in a "dot-and-bar" format commonly used with interlaboratory studies. These four probesets were selected as typical of the observed range of the {sL1, sr1} pairs displayed in Figure , where the {sL1, sr1} locations are marked with open circles labeled a to d. Exemplar 1, probeset 1379568_at of Mix 2, has very small sL1 and sr1; this represents results that are about the same for all participants, for all replicate samples. Exemplar 2, 1395685_at of Mix 1, has small sL1 but large sr1; this represents results with considerable technical variability but with averages that are about the same for all participants. Exemplar 3, 1371165_a_at of Mix 1, has moderate sL1 and sr1; this represents modest variability with some systematic differences among the participants. Exemplar 4, AFFX_Rat_Hexokinase_5_at of Mix 1, has large sL1 but relatively small sr1; this represents results with considerable and quite consistent systematic differences among the participants.
Classical precision metrics for multiple studies
When results for two or more qualitatively similar interlaboratory studies are available, the individual short-term, study-specific estimates can be used to define the long-term performance characteristics of the measurement process, P, with great confidence. It may also be possible to explore the temporal stability of participant-specific systematic bias, B, and random variability, ε. Indeed, an explicit goal of many interlaboratory studies is to help participants identify and minimize sources of systematic difference in their individual implementations of P and to establish tighter statistical control over its random influences [27]. Changes in individual performance may manifest as changes in the study-specific repeatability and reproducibility estimates [28].
Given Ns studies that evaluate identical samples, laboratory-specific repeatabilities can be estimated for all participants that use nominally identical implementations of P in at least two of the studies. For the jth such laboratory, srj is estimated by combining the simple standard deviations, s(xij), over the Nsj studies in which they participated. Since here the same number of replicates was evaluated in each round, the general pooling formula again simplifies to:
$$s_{rj} = \sqrt{\frac{\sum_{i} s^2(x_{ij})}{N_{sj}}}$$
where the sum runs over the Nsj studies in which the laboratory participated.
In a manner analogous to the between-participant precision described above, laboratory-specific estimates can be obtained for the extent of agreement among results they obtain over relatively long periods of time. The among-round precision for the jth participant, sWj, is calculated:
$$s_{Wj} = \sqrt{\max\left(0,\ \frac{\sum_{i} \left(\bar{x}_{ij} - \bar{x}_{\cdot j}\right)^2}{N_{sj} - 1} - \frac{s_{rj}^2}{N_m}\right)}$$
where $\bar{x}_{ij}$ is the mean of the laboratory's replicates in the ith study and $\bar{x}_{\cdot j}$ is the mean of the mean values for the particular laboratory over all of the studies in which they participated. The intermediate precision over time for the participant, sI(T)j [13], is the combination of srj and sWj:
$$s_{I(T)j} = \sqrt{s_{rj}^2 + s_{Wj}^2}$$
Additional file 4 summarizes these long-term within-participant precision calculations for the four exemplar probesets.
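The within-laboratory calculations mirror the single-study ones, with rounds taking the place of participants. The sketch below makes that parallel explicit. It assumes that the among-round variance is corrected for repeatability and floored at zero in the same way as the between-participant variance, which the text implies by analogy but does not spell out; the function name and data are hypothetical.

```python
import math
from statistics import mean, variance


def lab_long_term_precision(rounds):
    """Return (s_rj, s_Wj, s_I(T)j) for one laboratory.

    `rounds` holds one list of replicate results per study in which
    the laboratory participated (at least two studies are required).
    Assumption: the among-round variance is corrected by s_rj^2 / N_m
    and floored at zero, by analogy with the between-participant case.
    """
    if len(rounds) < 2:
        raise ValueError("need results from at least two studies")
    n_m = len(rounds[0])
    s_r2 = mean([variance(r) for r in rounds])        # pooled s_rj^2
    round_means = [mean(r) for r in rounds]
    s_w2 = max(0.0, variance(round_means) - s_r2 / n_m)
    return math.sqrt(s_r2), math.sqrt(s_w2), math.sqrt(s_r2 + s_w2)


# Invented log2 signals for one laboratory across three rounds.
lab_rounds = [[8.0, 8.2, 8.4], [9.0, 9.1, 9.2], [7.5, 8.0, 8.5]]
s_rj, s_wj, s_itj = lab_long_term_precision(lab_rounds)
print(round(s_rj, 4), round(s_wj, 4), round(s_itj, 4))  # 0.3162 0.5568 0.6403
```

A laboratory whose sWj is large relative to its srj is one whose results drift between rounds more than its within-round scatter would explain.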
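The long-term expected value for X, $\bar{X}$, and the total number of sets of X values reported by all participants in all of the studies, Nt, can be calculated across the Np participants or across the Ns studies:
$$N_t = \sum_{j=1}^{N_p} N_{sj} = \sum_{i=1}^{N_s} N_{pi}$$
$$\bar{X} = \frac{1}{N_t}\sum_{j=1}^{N_p} N_{sj}\,\bar{x}_{\cdot j} = \frac{1}{N_t}\sum_{i=1}^{N_s} N_{pi}\,\bar{x}_{i\cdot}$$
where $\bar{x}_{\cdot j}$ is the mean of the jth participant's mean values over the studies in which they participated and $\bar{x}_{i\cdot}$ is the mean of the participant mean values in the ith study.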
The expected long-term repeatability, sr, is directly calculated from the participation-weighted average of the laboratory-specific repeatability variances; however, the same value is obtained by pooling the study-specific repeatability estimates:
$$s_r = \sqrt{\frac{\sum_{j=1}^{N_p} N_{sj}\, s_{rj}^2}{N_t}} = \sqrt{\frac{\sum_{i=1}^{N_s} N_{pi}\, s_{ri}^2}{N_t}}$$
Similarly, the expected long-term between-laboratory precision, sL, is calculated from the study size-weighted average of the between-laboratory precision variances:
$$s_L = \sqrt{\frac{\sum_{i=1}^{N_s} N_{pi}\, s_{Li}^2}{N_t}}$$
The expected long-term reproducibility of the measurement process, sR, can be calculated from the study size-weighted average of the study-specific reproducibility variances or by combining sr and sL:
$$s_R = \sqrt{\frac{\sum_{i=1}^{N_s} N_{pi}\, s_{Ri}^2}{N_t}} = \sqrt{s_r^2 + s_L^2}$$
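The study size-weighted pooling across rounds can be sketched as below. The function names, dictionary keys, and data are hypothetical, and the sketch assumes equal numbers of replicates per participant within each study. It also computes the reproducibility both ways described in the text, to show that the two routes agree.

```python
import math
from statistics import mean, variance


def study_components(replicates):
    """Return (N_pi, s_ri^2, s_Li^2) for one study."""
    n_m = len(replicates[0])
    s_r2 = mean([variance(r) for r in replicates])
    var_of_means = variance([mean(r) for r in replicates])
    s_l2 = max(0.0, var_of_means - s_r2 / n_m)
    return len(replicates), s_r2, s_l2


def long_term_precision(studies):
    """Study size-weighted long-term s_r, s_L and s_R over all studies."""
    comps = [study_components(s) for s in studies]
    n_t = sum(n_p for n_p, _, _ in comps)              # total sets, N_t
    s_r2 = sum(n_p * r2 for n_p, r2, _ in comps) / n_t
    s_l2 = sum(n_p * l2 for n_p, _, l2 in comps) / n_t
    # Alternative route: weight the study-specific s_Ri^2 directly.
    s_R2_alt = sum(n_p * (r2 + l2) for n_p, r2, l2 in comps) / n_t
    return {
        "s_r": math.sqrt(s_r2),
        "s_L": math.sqrt(s_l2),
        "s_R": math.sqrt(s_r2 + s_l2),
        "s_R_alt": math.sqrt(s_R2_alt),
    }


# Invented data: a three-participant round and a two-participant round.
round_1 = [[8.0, 8.2, 8.4], [9.0, 9.1, 9.2], [7.5, 8.0, 8.5]]
round_2 = [[8.0, 9.0, 10.0], [8.5, 9.5, 10.5]]
result = long_term_precision([round_1, round_2])
print(abs(result["s_R"] - result["s_R_alt"]) < 1e-12)  # True
```

The two routes to sR agree because each study-specific reproducibility variance is itself the sum of its repeatability and between-participant variances, so the weighted averages add term by term.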
Figure displays the long-term repeatabilities and between-laboratory precisions characteristic of the microarray platform for all of the 31054 probesets. The relationship of the precision estimates to the signal level is not changed from Figure , although the increased number of data used in the estimates can be seen in the much reduced number of probesets with sL = 2^0 = 1. Note that the locations of the exemplar probesets relative to {sL, sr} are unchanged after three rounds. Figure displays all of the above precision estimates for the four exemplar probesets.