This study examined the reproducibility of ICS assays across sites using different assay formats. It was not designed to compare ICS with other immune monitoring assays, comparisons of which have been published [15-21]. The current study used 96-well plate-based protocols exclusively, as these were considered more convenient, and have recently been validated against tube-based protocols for both PBMC and whole blood [1].
Lyophilised reagents in plates were used for Experiment 4. These have been extensively compared to liquid reagents ([22] and Figure ) and shown to be largely equivalent. In addition to convenience of assay set-up, the lyophilised reagent plates offer long reagent stability, even at room temperature (>1 year, data not shown), and a potential reduction in errors caused by incorrect reagent addition. Intra-plate variability using lyophilised reagents was determined to be <10% in ICS assays (data not shown).
There are some potential drawbacks to the use of 96-well plates. One of these is the possibility of well-to-well contamination during the assay. This was observed in an initial subset of Experiment 4 (data not shown), in which some sites received lyophilised plates with SEB as a positive control. Some of these sites experienced high backgrounds in the negative control wells adjacent to the SEB-containing wells. It was later determined that cross-contamination probably occurred during the initial distribution of the antigens on the plates, and this was compounded by the fact that the donors used were unusually sensitive to SEB stimulation (responses >30% of CD4+ and CD8+ T cells). When SEB was replaced with CEF as a positive control, no such problems were noted. This experience suggests that the choice and placement of positive control wells on a plate deserve consideration.
The current study was designed to determine inter-lab variability in ICS assays. As such, there were no data "filters" applied to exclude potentially erroneous data or outliers. However, improved precision of ICS results might be obtained if certain acceptance criteria were applied before data were taken as valid. For example, a minimum number of collected events could be specified (sites in this study were asked to collect 10,000–40,000 CD4+ or CD8+ T cells per sample, or 60,000 CD3+ cells). This number of events was designed to yield precision levels that would minimize event number as a factor in inter-lab reproducibility. There could also be acceptance criteria based upon the absolute level of background, or the degree of reproducibility between duplicate samples, if run (the current study did not use duplicate samples).
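As a minimal sketch of how such acceptance criteria might be applied before data are taken as valid, the Python function below checks a sample against a minimum event count and a maximum background. The names `Sample` and `passes_acceptance` and the background threshold are hypothetical and were not used in this study; only the 10,000-event lower bound echoes the collection target quoted above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    gated_events: int      # CD4+ or CD8+ T cells collected
    background_pct: float  # % cytokine-positive cells in the unstimulated control


def passes_acceptance(sample, min_events=10_000, max_background_pct=0.1):
    """Return (accepted, reasons); both thresholds are illustrative assumptions."""
    reasons = []
    if sample.gated_events < min_events:
        reasons.append("too few events")
    if sample.background_pct > max_background_pct:
        reasons.append("background too high")
    return (not reasons, reasons)
```

A criterion on agreement between duplicate samples could be added in the same way, should duplicates be run.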
It is also possible to apply statistics to derive further meaning from numerical results. For example, statistical tests could be used to determine whether a given response can be discriminated from a given background, for a particular number of events collected [23, 24]. This can be given by a power calculation as follows:
$$N = \frac{2\,P_{av}(1 - P_{av})(Z_{\alpha} + Z_{\beta})^{2}}{\Delta^{2}}$$
where $N$ is the number of events in each sample needed for significance, $P_{av}$ is the average proportion (between the background and test samples), and $\Delta$ is the difference between these two proportions. The term $(Z_{\alpha} + Z_{\beta})^{2}$ is referred to as a power index, and varies depending upon the desired power and p value. For example, $(Z_{\alpha} + Z_{\beta})^{2} = 23.9$ for 99% power and p < 0.005 [23].
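The power calculation above can be sketched in a few lines of Python. The function name `events_needed` and the example proportions are hypothetical; the default power index of 23.9 is the value quoted in the text for 99% power and p < 0.005, and proportions are expressed as fractions (0.001 means 0.1%).

```python
import math


def events_needed(p_background, p_test, power_index=23.9):
    """Events per sample needed to discriminate p_test from p_background.

    Implements N = 2 * Pav * (1 - Pav) * (Za + Zb)^2 / delta^2, where
    power_index stands for (Za + Zb)^2.
    """
    if p_background == p_test:
        raise ValueError("the two proportions must differ")
    p_av = (p_background + p_test) / 2.0  # average proportion
    delta = p_test - p_background         # difference between proportions
    n = 2.0 * p_av * (1.0 - p_av) * power_index / delta ** 2
    return math.ceil(n)


# Hypothetical example: 0.01% background versus a 0.1% response.
print(events_needed(0.0001, 0.001))
```

For this illustrative case the formula calls for roughly 32,400 events per sample, and the required number rises steeply as the response approaches the background.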
In addition, a confidence interval could be derived around the difference of the test result and the negative control [24], in order to allow discrimination of significant differences between various samples. Other statistical methods have also been employed in order to determine cut-off values for positive responses in ICS [25, 26]. No attempt was made in the current study to define which results were positive, as all data were reported objectively, and all donors were known to be CMV seropositive.
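One common way to form such an interval, assumed here for illustration and not necessarily the method of reference [24], is the normal-approximation (Wald) interval for a difference of two proportions. The function name `difference_ci` and the event counts in the example are hypothetical.

```python
import math
from statistics import NormalDist


def difference_ci(pos_test, n_test, pos_bg, n_bg, confidence=0.95):
    """Wald confidence interval for (test proportion - background proportion)."""
    p_test = pos_test / n_test
    p_bg = pos_bg / n_bg
    diff = p_test - p_bg
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_test * (1 - p_test) / n_test + p_bg * (1 - p_bg) / n_bg)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se


# Hypothetical counts: 40 of 40,000 events positive in the test sample,
# 4 of 40,000 positive in the negative control.
low, high = difference_ci(40, 40_000, 4, 40_000)
print(low > 0)  # the response is distinguishable if the interval excludes zero
```

The normal approximation becomes unreliable when the number of positive events is very small, so an exact or score-based interval may be preferable for responses near background.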
Examination of the data from Figures , , , and suggests that samples with a low number of cytokine-positive cells had higher variability than samples with a high number of cytokine-positive cells. The relationship of response level and C.V. is summarized in Table for all assays (CD4 and CD8, whole blood and PBMC) considered together. These data emphasize the difficulty of achieving precise results at response levels of less than 0.1% of CD4 or CD8 T cells. For these samples, collecting more events than the suggested number would be expected to improve precision, as discussed above.
Table 2. Percent C.V. by mean percent cytokine-positive T cells.
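The dependence of precision on response level can be illustrated with simple counting statistics. The sketch below assumes that the number of cytokine-positive events follows a binomial distribution, an idealization that ignores gating, staining, and biological variability. The function name `counting_cv_percent` is hypothetical, and the result is a floor on the C.V. from event counting alone rather than an estimate of the inter-lab values reported in this study.

```python
import math


def counting_cv_percent(fraction_positive, events):
    """Percent C.V. expected from binomial sampling error alone."""
    p = fraction_positive
    expected_positive = events * p
    sd_positive = math.sqrt(events * p * (1 - p))
    return 100 * sd_positive / expected_positive


# Illustrative response levels (as % of gated T cells) at 40,000 events.
for pct in (0.01, 0.1, 1.0):
    cv = counting_cv_percent(pct / 100, 40_000)
    print(f"{pct}% response: counting C.V. of about {cv:.1f}%")
```

Under this assumption, a 0.1% response measured on 40,000 events carries a counting C.V. of roughly 16%, and the value falls only with the square root of the number of events collected.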
The average C.V. across the four experiments is summarized in Table . These data are confounded by the fact that different donors and different laboratories participated in the four experiments. However, variability due to individual analysis can be excluded by comparing only centrally analysed data (bottom row of Table ). Assuming no effect from the other confounding variables, we found that Experiment 4 (cryopreserved PBMC with lyoplates) yielded a significantly lower average C.V. than Experiment 2 (shipped whole blood) (p < 0.05). Also, the average C.V. of centrally analysed data from all experiments was significantly lower than that of individually analysed data (p < 0.0001). This highlights the amount of variability in each experiment that is due to gating differences between sites.
Table 3. Percent C.V. by assay format.
Mitigation of gating variability was achieved in these experiments by centralized analysis with a dynamic gating template (see Figure ). The dynamic gating template allowed for more automated, batch analysis of the data. Once such a template was created and optimized (see Materials and Methods section for description), it could also have been provided to individual sites in order to yield the same results. It is further possible that similar results could be achieved by manual analysis, provided it was done by a single operator. Standardization of gating techniques, in the absence of centralized analysis or dynamic gating templates, could also improve precision. The improvement in C.V. made by centralized analysis was most marked in the first experiment, and progressively less in experiments 2 and 3, perhaps because of standardization of gating among sites over time. Experiment 4 included many new sites, and the improvement in C.V. from centralized analysis was again more marked.
Because the C.V. varies as a function of the response level (Table ), it is possible that differences in mean C.V. between assay formats were due to the number of low versus high responders in each experiment (since different donors were used in the four experiments). Also, the C.V. is highly sensitive to small changes in the mean when the mean is a very low number. Therefore, an analysis of S.D. versus mean was also performed for the four experiments (Figure ). These data confirm the data of Table , indicating that the three assay formats showed grossly similar reproducibility. However, when analysis variability was removed, cryopreserved PBMC assays appeared to be slightly more reproducible than shipped whole blood assays. This seemed especially apparent in experiment 4, where lyophilised reagents were used.
In addition to differences in reproducibility, the various assay formats have other benefits and drawbacks. Cryopreserved PBMC are much more amenable to peptide (and superantigen) stimulation than to whole protein stimulation [9], while whole blood assays are equally amenable to stimulation with either type of antigen. Also, consistently good cryopreservation of PBMC at multiple clinical sites is difficult to achieve, but highly important for achieving reproducible results with PBMC [27, 28] (DeLaRosa et al., manuscript in preparation). This could become less of a factor if a stabilizing matrix for preserving whole blood or PBMC function during shipping were discovered. All in all, the choice of assay format for a clinical trial will depend not only upon considerations of assay precision, but also upon the type of antigen(s) used and the capabilities of the participating clinical sites.
The use of lyophilised reagent plates appeared to reduce inter-lab variability. This conclusion cannot be drawn with certainty, because different participating laboratories and different donors were used between experiments 3 and 4. However, it is intriguing to note that, when centrally analysed data were compared (to remove gating as a source of variability), the mean C.V.'s of experiment 4 were the lowest of all four experiments (18%, Table ). This is despite the fact that the donors and stimuli used in experiment 4 resulted in lower mean response levels, which should tend to increase the C.V. This is also borne out by the analysis of Figure 11B, where the results for experiment 4 appeared to be generally closer to the theoretical minimum SD than did the results for the other experiments.
With the possibility of achieving inter-laboratory C.V.'s of less than 20%, even with relatively low responses, ICS compares favourably to ELISPOT, for which interassay C.V.'s of 17–18% for PHA and 55–65% for Candida have been reported [29, 30]. ICS is also comparable to cytokine ELISA, the latter having reported interassay C.V.'s of <25% [31, 32]. Phenotypic staining, such as used for CD4 counting, can achieve higher precision levels than functional assays, and averaged around 10% C.V. in one multisite study [33]. For comparison, the inter-lab C.V. of the CD4+ or CD8+ cell percentages was around 5% in experiment 4 of the present study (data not shown). CD4 counting precision has also been shown to be dependent upon the number of events collected, gating, and use of automated analysis [33, 34]. Since functional assays are subject to more variables than phenotypic staining, the ability to achieve precision levels such as those reported here should be considered favourable. ICS could thus be a viable tool for comparing immune responses even across clinical trials, provided the methodology was standardized.