In order to evaluate the performance of the methods described in the previous section, the two ERs (>5 spots/100,000 PBMCs & >2-fold background; >2-fold background alone) and the three STs (one-sided *t* test, DFR(eq), and DFR(2x)) were applied to results from large data sets that were generated in three consecutive interlaboratory testing projects organized by the CIP [22, 23]; data are available upon request. In the referred studies, groups of 11, 13 and 16 laboratories (phases I, II and III, respectively) quantified the number of CD8 T cells specific for two model antigens within PBMC samples that were centrally prepared and then distributed to the participating laboratories. All participants were allowed to use their preferred ELISPOT protocol. Therefore, the data sets generated in these studies can be considered representative of results generated by a wide range of different protocols commonly applied within Europe. Each participating center was asked to test in triplicate 18 preselected donors (5 in the first phase, 8 in the second phase and 5 in the third phase) with two synthetic peptides (HLA-A*0201-restricted epitopes of CMV and Influenza) as well as PBMCs in medium alone for background determination. The donors were selected so that 21 donor/antigen combinations (6 in the first phase, 8 in the second phase and 7 in the third phase) were expected to demonstrate a positive response, while the remaining 15 donor/antigen combinations were not. Pretesting of potential donor samples for the proficiency panels was routinely done at two time points in two independent laboratories. Only samples from donors with consistent results in all four performed experiments were finally selected for distribution to the participating centers.

Statistical tests versus empirical criteria

Table 1 outlines the response detection rate for each center based on the empirical and statistical response criteria. The overall response detection rate from all 19 centers across all three phases of testing was 59% based on the first ER (>5 spots/100,000 PBMCs & >2-fold background), 74% based on the second ER (>2-fold background), 76% based on the *t* test, 75% based on the DFR(eq) method (equal means), and 61% based on the DFR(2x) method (>2-fold difference). Table 2 details the false positive rate for each center based on the empirical and statistical response criteria. The overall false positive rate from all 19 centers across all three phases of testing was 3% based on the first ER (>5 spots/100,000 PBMCs & >2-fold background), 17% based on the second ER (>2-fold background), 10% based on the *t* test, 11% based on the DFR(eq) method (equal means), and 2% based on the DFR(2x) method (>2-fold difference).

**Table 1** Detection rates per laboratory based on two empirical rules and three statistical tests (CIP proficiency panel phases I–III)

**Table 2** False positive rates per laboratory based on two empirical rules and three statistical tests (CIP proficiency panel phases I–III)

The first ER yielded response detection rates that were lower than those derived from the *t* test, the DFR(eq) method and the second ER (>2-fold background). However, the false positive rates with the first ER were similar to the false positive rate found for DFR(2x), lower than the false positive rates with the *t* test or DFR(eq) method, and much lower than the false positive rate of the second ER. The DFR(eq) method yielded response detection rates similar to those of the *t* test; the conclusions of the two STs differed in only 17 of 478 comparisons. The DFR(eq) method, with a null hypothesis of equal means, had higher detection rates than the DFR(2x) method, whose null hypothesis is that the experimental counts are at most twofold above the background. However, the DFR(eq) method also resulted in a higher false positive rate than the DFR(2x) method.

There were 478 comparisons made: 282 donor/antigen combinations versus control expected to demonstrate a positive response and 196 donor/antigen combinations versus control not expected to demonstrate a positive response. In 20 instances a response designation was not possible with either DFR method, because some laboratories had performed only duplicates for a control or experimental condition. Comparing the DFR(eq) response determination rule to the first ER, there was disagreement for 76 of the 478 comparisons: for 74 comparisons the DFR(eq) test declared the triplicate a positive response while the ER did not, and for two comparisons the reverse was true. Comparing the DFR(eq) response determination to the second ER (>2-fold background), there were 50 disagreements: 25 times the DFR(eq) test declared the triplicate a positive response while the ER did not, and 25 times the reverse was true. Comparing the DFR(2x) response determination rule to the first ER (>5 spots/100,000 PBMCs & >2-fold background), there was disagreement for 43 of the 478 comparisons: for 29 comparisons the DFR(2x) test declared the triplicate a positive response while the ER did not, and for 14 comparisons the reverse was true. Comparing the DFR(2x) response determination to the second ER (>2-fold background), there were 58 disagreements: in all 58 cases, the ER declared the triplicate a positive response while the DFR(2x) test did not.

This led us to investigate under what conditions the ST differs in response determination from the ER and under what conditions the two statistical DFR tests differ.

Simulation study to compare response determination with STs and ERs

A simulation study was conducted to assess under what conditions a ST would differ in response determination from an ER (Supplementary Figures 1a and 1b). One thousand hypothetical donors with triplicate wells for background and experimental conditions were generated. Spot count data were randomly generated by assuming that the counts follow a Poisson distribution. The mean spot count for the background wells was set at 10 per 100,000 PBMCs, reflective of the mean in our example data set. The mean spot count for the experimental wells was varied over 40 values from a mean of 10 to 50 per 100,000 PBMCs. The signal-to-noise ratio for each experimental condition was calculated as the mean of the triplicate in the experimental well divided by the mean of the triplicate in the background well for a given donor. A signal-to-noise ratio greater than two would be considered a positive response based on the first ER. A one-sided *t* test was also performed comparing each experimental condition to its corresponding background. The intra-replicate variation was calculated as the sample variance of the triplicate/(median of the triplicate + 1). The reason for expressing the variability in this way was to normalize the variation so as to make it comparable across replicates with large differences in their spot counts. In the setting where there is a large outlier in one of the experimental wells compared to the other two wells, e.g., 50, 2, 6 spots, the median reflects the central tendency of the data but, unlike the mean, is not influenced by the outlier (i.e. 50 spots). Hence, we consider the ratio of the variance to median to identify cases that have large variability in the experimental well replicates but have a small median. Since the median response may in some cases be 0 spots, a 1 is added to the denominator to avoid division by 0. 
The response determination based on the empirical rule (>2 signal-to-noise ratio) and the statistical rule (one-sided *t* test *p* value ≤ 0.05) is the same for most of the experimental triplicates (Supplementary Figures 1a and 1b). However, when the intra-replicate variation is large, the ER would sometimes consider the triplicate a response while the ST would not. Conversely, when the intra-replicate variation was small, the ST would sometimes consider the triplicate a response while the ER would not. This simulation clearly showed that ERs should only be applied in settings where the variation within replicates is known and reliably consistent across experiments. It also demonstrated that STs account for the variation within reported triplicates. Conversely, the ST may not declare a large signal-to-noise ratio a positive response if there is very high variability between replicates. This may indicate that the declaration of a positive response requires more compelling evidence for that sample.
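A minimal sketch of a simulation of this kind, assuming a pooled-variance one-sided *t* test judged against the 5% critical value for 4 degrees of freedom (the exact *t*-test variant and all function names here are illustrative assumptions, not the authors' implementation):

```python
import random
import statistics

T_CRIT_4DF = 2.132  # one-sided 5% critical value, Student t, 4 df

def poisson(lam, rng):
    # Knuth's multiplicative Poisson sampler (adequate for means up to ~50)
    threshold, k, p = pow(2.718281828459045, -lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1

def one_sided_t_significant(exp, bg):
    # pooled-variance t test of mean(exp) > mean(bg) at alpha = 0.05
    n1, n2 = len(exp), len(bg)
    s2 = ((n1 - 1) * statistics.variance(exp)
          + (n2 - 1) * statistics.variance(bg)) / (n1 + n2 - 2)
    if s2 == 0:
        return statistics.mean(exp) > statistics.mean(bg)
    t = (statistics.mean(exp) - statistics.mean(bg)) / (s2 * (1 / n1 + 1 / n2)) ** 0.5
    return t > T_CRIT_4DF

def er_positive(exp, bg):
    # empirical rule: signal-to-noise ratio greater than 2
    return statistics.mean(exp) > 2 * statistics.mean(bg)

rng = random.Random(0)
bg_mean = 10  # background mean used in the simulation
for exp_mean in (10, 30, 50):
    er = st = 0
    for _ in range(1000):
        bg = [poisson(bg_mean, rng) for _ in range(3)]
        exp = [poisson(exp_mean, rng) for _ in range(3)]
        er += er_positive(exp, bg)
        st += one_sided_t_significant(exp, bg)
    print(f"exp mean {exp_mean}: ER rate {er/1000:.2f}, t-test rate {st/1000:.2f}")
```

Running the loop over a finer grid of experimental means (10 to 50, as in the text) and recording the triplicates where the two calls disagree reproduces the kind of comparison shown in Supplementary Figures 1a and 1b.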

Simulation study to compare response determination with DFR(eq) and DFR(2x) statistical methods

A simulation study was conducted to evaluate the overall false positive rate and the overall true positive rate (sensitivity) of each DFR method under a variety of conditions. An overall positive response is declared if at least one antigen is declared positive. To calculate the overall false positive rate, background and experimental spot counts for each donor were generated under the same model. Hence, for these donors, no response should be detected. Five thousand donors with triplicate wells for background and experimental conditions were generated. Spot count data were randomly generated by assuming that the counts follow a Poisson distribution. The mean spot count for the background and experimental (i.e., antigen-containing) wells was 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 (per 100,000 PBMCs). This was examined in the setting when testing with two or ten antigen preparations (*k* = 2, 10). To assess the overall true positive rate, background spot counts for each donor were again generated from a Poisson distribution with background mean spot counts of 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; however, the experimental means were shifted by 6 (small difference), 20 (moderate difference), or 50 (large difference) relative to the background means. All other conditions were the same as in the simulations for assessing the overall false positive rate.
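The data-generation scheme just described can be sketched in a few lines. Since the DFR permutation procedure itself is too involved for a short example, a simple signal-to-noise call (experimental mean more than twofold above the background mean) stands in as a placeholder positivity rule; the function names and the placeholder rule are illustrative assumptions, not the DFR methods themselves:

```python
import random
import statistics

def poisson(lam, rng):
    # Knuth's multiplicative Poisson sampler
    threshold, k, p = pow(2.718281828459045, -lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1

def sn_positive(exp, bg):
    # placeholder positivity call (signal-to-noise > 2), NOT the DFR test
    return statistics.mean(exp) > 2 * statistics.mean(bg)

def simulate_overall_rate(bg_mean, shift, k, n_donors, is_positive, seed=0):
    """Fraction of donors declared positive overall, where 'overall positive'
    means at least one of the k antigens is declared positive.
    shift = 0 gives the overall false positive rate; shift = d > 0 gives the
    overall true positive rate for that mean difference d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_donors):
        bg = [poisson(bg_mean, rng) for _ in range(3)]
        wells = [[poisson(bg_mean + shift, rng) for _ in range(3)]
                 for _ in range(k)]
        hits += any(is_positive(exp, bg) for exp in wells)
    return hits / n_donors

# e.g. null setting (d = 0) versus a large difference (d = 50), k = 2 antigens
print(simulate_overall_rate(10, 0, 2, 1000, sn_positive))
print(simulate_overall_rate(10, 50, 2, 1000, sn_positive))
```

Passing `shift=6`, `20`, or `50` corresponds to the small, moderate, and large differences examined in the text, and `k=2` or `k=10` to the two panel sizes.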

The figure illustrates the response detection rates versus the mean background spot counts for both the DFR(eq) method (closed circles) and the DFR(2x) method (open circles). The two graphs in the upper row display the response detection rates in the setting where the mean number of spots was the same for both the experimental and background wells (*d* = 0). In this setting, the response detection rates for the two methods are expected to be low. In fact, in 5,000 simulated data sets, the average response detection rates for at least one of the antigens (*k* = 2 or 10) were <5% with the DFR(eq) method and <1% with the DFR(2x) method across a variety of mean background and experimental spot counts (2 to 50).

The graphs in rows 2, 3 and 4 of the figure display the overall response detection rates for small (*d* = 6), moderate (*d* = 20), or large (*d* = 50) mean differences. The response detection rates with the DFR(eq) method are high (>80%) for moderate and large differences between background and experimental wells (*d* = 20 or 50) across a wide range of background levels (2–50). However, the response detection rates for the DFR(2x) method are much lower at higher background levels. This is not surprising given that the null hypothesis for the DFR(2x) method is at most a twofold difference over the background; at background levels exceeding *d*/2 the null hypothesis would therefore generally fail to be rejected and no positive response would be declared. For small differences between background and experimental wells (*d* = 6), response detection rates were high only at low background levels for both the DFR(eq) and DFR(2x) methods, although the DFR(eq) method had higher sensitivity.

These simulations suggest that the DFR(eq) method can be used in situations where a 5% false positive rate is acceptable and an experimental mean larger than background implies a positive response regardless of the level of that background. The DFR(2x) method is appropriate in settings where one wants to control the false positive rate at a lower level, e.g., 1%, or when a fold difference in the means of experimental versus control well is more of interest than inequality of means in determining positivity, e.g., when high background is present.

Intra-replicate variation

Even if a ST declares a positive response, it does not automatically imply that this result is biologically meaningful. When the spot counts found in the replicates of an experimental condition are highly variable, the experimental results are suspect, and response detection results for these replicates would therefore not be believable even when declared statistically significant. However, ‘highly variable’ is a subjective term that may differ from laboratory to laboratory. We sought to quantify the typical range of intra-replicate variation found across a broad variety of different ELISPOT protocols in order to determine a variability cutoff for recommending that those replicates be re-run. Data from the three CIP proficiency panel phases were used to analyze the intra-replicate variability of experimental results in ELISPOT assays. Nineteen different laboratories participated in at least one of the three phases, reporting a total of 717 triplicate experiments (including control and experimental wells). The intra-replicate variation was calculated as the sample variance of the replicates/(median of the replicates + 1), as explained in “Simulation study to compare response determination with STs and ERs”.

The figure displays the intra-replicate variation of all 717 reported experiments on the vertical axis, with the corresponding rank (percentile) plotted on the horizontal axis. The minimum intra-replicate variation was zero and the maximum was 95.4, with the 25th and 75th percentiles (the middle 50% of reported results) at 0.31 and 2.47 (see the table inserted in the figure). To determine a filter for results with ‘very high’ variability, we looked at the variation value at the 95th percentile, 10.13. Based on this finding, we recommend that triplicates with variability greater than 10 be considered unreliable. Supplementary Table 1a shows the number of replicates with extremely high variation for each of the 19 participating laboratories. In-depth analysis of the 36 replicates above the 95th percentile revealed that 7 of the 19 laboratories reported 3 or more triplicates with very large variation, for a total of 28 replicates. The remaining eight highly variable replicates were reported by six laboratories, implying that replicates with extremely high variation do not occur randomly across all participating laboratories but rather accumulate in a few centers.
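The variability filter can be computed directly from a triplicate. A small sketch (function names are illustrative; the cutoff of 10 is the 95th-percentile recommendation derived above, and the example triplicate is the outlier case discussed earlier in the text):

```python
import statistics

def intra_replicate_variation(replicates):
    # sample variance / (median + 1); the +1 avoids division by zero
    # when the median spot count is 0
    return statistics.variance(replicates) / (statistics.median(replicates) + 1)

def is_unreliable(replicates, cutoff=10):
    # flag triplicates whose variation exceeds the recommended cutoff
    return intra_replicate_variation(replicates) > cutoff

# the outlier example from the text: one aberrant well of 50 spots
print(round(intra_replicate_variation([50, 2, 6]), 1))  # prints 101.3
print(is_unreliable([50, 2, 6]))                        # prints True
```

The median (6) ignores the 50-spot outlier, so the large sample variance is divided by a small denominator and the triplicate is flagged, which is exactly the behavior the metric was designed for.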

Revisiting the data from the three phases (summarized in Tables 1 and 2), only 7 experimental replicates among the 282 positive donor/antigen combinations had large variability (>10). Removing these replicates, the response detection rate was 59% (*n* = 161/275) for the first ER, 75% (*n* = 206/275) for the second ER, 77% (*n* = 211/275) for the *t* test, 76% (*n* = 208/275) for the DFR(eq) ST, and 61% (*n* = 169/275) for the DFR(2x) ST. There were 14 experimental replicates among the 196 donor/antigen combinations not expected to demonstrate a positive response that had large variability. Removing these replicates, the false positive rate was 2% (*n* = 3/182) for the first ER, 17% (*n* = 31/182) for the second ER, 10% (*n* = 20/182) for the *t* test, 10% (*n* = 19/182) for the DFR(eq) ST, and 2% (*n* = 3/182) for the DFR(2x) ST. Hence, the response detection and false positive rates were essentially unchanged after removing the replicates with large variability. This is not surprising given the small number of replicates with large variability removed from the total data set.

Estimation of the limit of detection in ELISPOT assays

A second factor to consider when deciding on the relevance of a positive response is the limit of detection of the ELISPOT assay. The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) produced a guideline on the validation of analytical procedures (http://www.ich.org/LOB/media/MEDIA417.pdf). In this guideline (named Q2(R1)), the limit of detection is defined as the lowest amount of analyte in a sample which can be detected but not necessarily quantified as an exact value. The guideline describes three approaches to estimate the limit of detection for an analytical test: visual evaluation, signal-to-noise, and response based on standard deviation and slope. Visual evaluation and response based on standard deviation and slope are not applicable to the ELISPOT setting. The signal-to-noise approach compares spot counts in the experimental wells (signal) to spot counts from the medium control wells (noise). A signal-to-noise ratio between 2:1 and 3:1 is generally considered acceptable for estimating the detection limit. We applied this guideline to estimate the limit of detection of the ELISPOT assay for a broad range of protocols.

There were 239 triplicate medium control experiments reported from all three CIP proficiency panel phases. The means of these triplicates ranged from 0 to 218 spots per 100,000 PBMCs. The median of the triplicate background means was 2.1 spots/100,000 PBMCs, with 25th and 75th percentiles of 0.6 and 6.5 spots/100,000 PBMCs, respectively. This is illustrated in the figure, where the mean medium spot count for each reported triplicate is plotted on the vertical axis with its corresponding rank displayed on the horizontal axis. Using an acceptable signal-to-noise ratio of 2:1 or 3:1 and taking as the noise the median of the average background spot counts (the 50th percentile in the figure), we estimate a typical detection limit for the ELISPOT assay of 4 or 6 spots/100,000 PBMCs, respectively. For a heterogeneous group of laboratories participating in a proficiency panel program, we recommend using a threshold of 6 spots per 100,000 PBMCs (a signal-to-noise ratio of 3:1) as the typical limit of detection for an ELISPOT assay. Hence, even if the results of the ST lead to rejection of the null hypothesis, if the mean of the experimental wells is less than 6 spots/100,000 PBMCs the finding should be regarded with caution, since it is likely at the limit of detection of the ELISPOT assay, at least for laboratories with average performance similar to those included in our proficiency panel program.
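The arithmetic behind these estimates is simply the chosen signal-to-noise ratio times the median background. A sketch with hypothetical per-triplicate background means (illustrative values, not the panel data, though the median matches the 2.1 spots/100,000 PBMCs reported above):

```python
import statistics

def limit_of_detection(background_means, snr=3):
    # LOD = signal-to-noise ratio * median of the per-triplicate mean
    # background spot counts (spots/100,000 PBMCs)
    return snr * statistics.median(background_means)

# hypothetical per-triplicate background means (spots/100,000 PBMCs)
backgrounds = [0.0, 0.6, 1.2, 2.1, 3.0, 6.5, 12.0]
print(limit_of_detection(backgrounds, snr=2))  # 2:1 ratio -> ~4 spots
print(limit_of_detection(backgrounds, snr=3))  # 3:1 ratio -> ~6 spots
```

A laboratory could apply the same calculation to its own historical medium controls to obtain a lab-specific limit of detection rather than the panel-wide value.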

This limit of detection is close to the threshold selected for the first ER (5 spots/100,000 PBMCs) and provides further justification for applying a threshold in the ER. However, it is important to note that the limit of detection is based on the average background from all the laboratories. This means that laboratories with lower background spot counts than the panel average will likely have a limit of detection lower than 6 spots/100,000 PBMCs. Similarly, laboratories with higher background spot counts than the panel average will likely have a limit of detection higher than 6 spots/100,000 PBMCs. Therefore, the threshold or limit of detection might be too strict for some laboratories in declaring a response positive and not strict enough for others. This is clearly illustrated in Supplementary Table 1b, which shows the mean number of spots per 100,000 PBMCs in the medium control as reported by each of the 19 participating laboratories. The mean background spot production observed across all tested donors differed significantly between the participating laboratories, ranging from very low (0.2 spots per 100,000 PBMCs) to very high (58.1 spots per 100,000 PBMCs).