3.1. Inter-laboratory variation detected using a 4-color ICS assay
Cytokine responses from the different donors to each stimulus tested were compared across laboratories in each round. The variability across laboratories was calculated as %CV or SD. As noted previously (Maecker et al., 2008
), %CV was not informative for low cytokine responses. Based on all the data generated in the current study, we established that %CV stabilized, and could therefore be used, only when assessing responses of 0.2% or higher; hence, SD was used for cytokine responses of 0.19% or lower.
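The metric-selection rule above can be sketched as follows. This is an illustrative implementation, not part of the study protocol; the function and constant names are ours.

```python
import statistics

# Illustrative sketch of the rule described above: %CV is reported only for
# responses of 0.2% or higher, and SD is used for lower responses.
CV_THRESHOLD = 0.2  # % cytokine-positive cells

def variability_metric(responses):
    """Summarize inter-laboratory variability for one response.

    responses -- % cytokine-positive values, one per laboratory.
    Returns ("%CV", value) for high responses or ("SD", value) for low ones.
    """
    mean = statistics.mean(responses)
    sd = statistics.stdev(responses)
    if mean >= CV_THRESHOLD:
        return ("%CV", 100.0 * sd / mean)
    return ("SD", sd)
```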
The 4-color cocktail CD4 FITC/IFN-γ + IL-2 PE/CD8 PerCP-Cy5.5/CD3 APC was used in all rounds of testing. An example of the staining obtained using this combination is presented in .
The initial rounds of testing included donors with high, intermediate and low responses to CMVpp65 and/or CEF. Beginning in Round 6, only donors with intermediate and low responses were included, since detection of these levels of antigen-specific responses was the most frequent and challenging scenario in the analysis of unknown clinical specimens.
Examples of the different levels of responses obtained across donors and stimuli are presented in . In these plots, the responses detected in each replicate for each laboratory are represented, and the variation detected across laboratories is indicated at the top of each graph. A summary of all %CVs or SDs calculated for each response in each round is presented in . For cytokine responses higher than 0.2%, the %CVs detected across rounds were overall very similar (mean of 32.8%, 26.4%, 27.6%, 34.2%, and 36.3% in Rounds 2, 3, 4, 5, and 6, respectively). However, in Rounds 1 and 7, the %CVs were higher (mean of 57.4% and 64.4%, respectively) than in the other rounds, although this difference was not statistically significant (p>0.05). For lower responses, the SD followed a similar trend, with the smallest variations detected in Rounds 2, 3, 5, and 6 (mean SDs ranging from 0.01 to 0.08) and higher variations in Rounds 1, 4 and 7 (0.59, 0.25 and 0.11, respectively; p<0.0001 for Round 1 versus Rounds 2, 3, 5, and 6).
Inter-laboratory variability of cytokine responses across seven rounds of the ICS QAP using a 4-color cocktail
Finally, it is also important to note that background subtraction was not performed when analyzing the responses to the different stimuli. The rationale behind this decision was that if an error had occurred in the negative control wells, the subtraction would carry that error over to the stimulated wells. However, we did compare the variation across laboratories with and without background subtraction (data not shown), and no significant differences were observed.
3.2. Inter-laboratory variation detected using a 7-color ICS assay
In order to evaluate an ICS assay that would mimic daily practice in most of the participating laboratories, a 7-color cocktail was introduced beginning in Round 6. This cocktail was designed so that poly-functional T-cells (i.e. T-cells producing more than one cytokine) could be identified. A survey conducted across the participating institutions revealed that most laboratories were interested in detecting IFN-γ, IL-2 and TNF-α in separate channels. Since all laboratories utilized frozen specimens, the inclusion of a viability dye was suitable (Horton et al., 2007
). The resulting 7-color cocktail, unlike the 4-color cocktail, was designed in the central laboratory and has not been formally validated. The detection of cytokine responses using this combination is presented in . Using this 7-color combination, the cytokine responses that were consistently detected across donors after CMVpp65 stimulation were CD4+ and CD8+ T-cells expressing all three cytokines (IFN-γ+, IL-2+, TNF-α+) or IFN-γ and TNF-α. These specific subsets were easily identified by all participants, across all donors in Rounds 6 and 7. The variation observed in the quantification of poly-functional T-cells was similar to the one obtained when only looking at a single cytokine or at the combination of cytokines in one channel, as done in the 4-color cocktail (data not shown).
Inter-laboratory variability of cytokine responses across seven rounds of the ICS QAP using a 7-color cocktail
In order to directly compare responses obtained with the 4-color versus 7-color cocktails, the frequencies of all subsets identified as producing IFN-γ and/or IL-2 with the 7-color cocktail were summed for both CD4+ and CD8+ T-cells. Interestingly, increasing the number of colors in the assay, and hence its complexity, did not result in an increase in the variability observed across laboratories. The responses detected by the different participants, using both combinations, are shown as an example in . Moreover, the means of the %CVs calculated across responses detected with both cocktails were very similar (50.0% versus 53.4% in Round 6 and 62.1% versus 59.8% in Round 7), as were the means of the SDs for low responses (0.09 for both cocktails in Round 6 and 0.17 versus 0.19 in Round 7; ).
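The subset summation used for this comparison can be sketched as follows. The subset labels and frequencies below are hypothetical, not study data; the point is that every mutually exclusive 7-color subset producing IFN-γ and/or IL-2 contributes to the 4-color-equivalent response.

```python
# Hypothetical Boolean-subset frequencies (% of CD4+ T-cells) from a
# 7-color analysis; labels and values are illustrative only.
seven_color_cd4 = {
    "IFNg+ IL2+ TNFa+": 0.10,
    "IFNg+ IL2+ TNFa-": 0.02,
    "IFNg+ IL2- TNFa+": 0.15,
    "IFNg+ IL2- TNFa-": 0.03,
    "IFNg- IL2+ TNFa+": 0.01,
    "IFNg- IL2+ TNFa-": 0.02,
}

# Every listed subset expresses IFN-γ and/or IL-2, so the 4-color-equivalent
# response (IFN-γ and/or IL-2 in one channel) is simply their sum.
four_color_equivalent = round(sum(seven_color_cd4.values()), 2)
```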
3.3. Definition of optimal range and outlier measurements
Because of the variability across laboratories, we needed to create a robust method for defining outlier measurements. Therefore, at the start of Round 3, the central laboratory performed five to six assays, with at least two different operators on different days, using the same batch of reagents distributed to the laboratories and acquiring the cells on different instruments (FACSCalibur, Canto II and LSRII). We considered that the cumulative data from those experiments reflected the variability inherent to this assay and provided a range within which the responses should fall. We refer to the results obtained from this series of experiments in each round as the Gold Standard (GS). For most of the measurements, the range defined by the mean ±2SD of the GS was considerably narrower than the one defined by the mean ±2SD of the measurements from all laboratories. In addition, the mean ±2SD of the GS also included most of the measurements that fell between the 25th and 75th percentiles of all measurements. Moreover, for a given response, the mean and median of the GS were almost identical (correlation coefficient ≥ 0.9, data not shown) and highly correlated with the median of the responses reported by all participants (). Taking this into consideration, we determined that if a response reported by a participant was not within the mean ±2SD of the GS, it would be considered an outlier. Using this strategy, we were then able to quantify the number of outlier measurements for each laboratory in each round. As shown in for Rounds 3, 4 and 5, in which we tested the same donors and used the same stimuli, there was considerable variation in this number from one laboratory to another. In general, if a laboratory had a high number of outlier measurements in a given round, this number decreased in subsequent rounds (clear examples are Laboratories 3, 11 and 13).
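The outlier rule described above amounts to a simple range check against the GS replicates. The sketch below is illustrative: the function name and the GS values are ours, not study data.

```python
import statistics

# Minimal sketch of the outlier rule: a laboratory's reported response is
# flagged if it falls outside the mean ± 2 SD of the Gold Standard (GS)
# replicate measurements.
def is_outlier(reported, gs_measurements):
    mean = statistics.mean(gs_measurements)
    sd = statistics.stdev(gs_measurements)
    return not (mean - 2 * sd <= reported <= mean + 2 * sd)

# Hypothetical GS replicates (five to six assays) for one donor/stimulus
gs = [0.42, 0.45, 0.40, 0.44, 0.43, 0.46]
```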
After completion of every round, the central laboratory generated a detailed report summarizing the performance of each laboratory and provided feedback regarding any issues encountered during that particular round. This feedback very likely contributed to improved performance in successive rounds, provided the operator was consistent between rounds. Variation in the GS measurements from one round to the next would reflect variation across different days, different operators and different lots of reagents. We found (data not shown) that the GS was consistent in detecting almost the same range of responses between rounds. In addition, its %CVs or SDs were similar across rounds and significantly lower than those calculated with the data from all laboratories.
After careful analysis of the data generated by all the laboratories in the seven rounds of the ICS QAP, we have identified critical factors to consider when evaluating laboratory performance of ICS assays. We describe these factors below and present data on those that contribute most to assay variation and to the generation of outlier measurements.
3.4. Viability and recovery
When performing the assays, participants were instructed to determine the percentage of viability and recovery using their preferred technique (trypan blue and hemocytometer, Guava, etc.).
Importantly, the viability of the cells across rounds was very good and, in general, higher than 85% (). The protocol distributed to participants contained a recommended thawing procedure. Nevertheless, isolated cases of low viability were reported, and these were always related to inappropriate thawing technique, such as using non-optimal media or thawing too many sample vials at a time. As expected, low viability was consistently associated with suboptimal responses (data not shown).
Viability and Recovery of PBMC tested across rounds
Notably, the viability of the cells obtained in laboratories within the US compared to laboratories overseas (Asia and Africa, for example) was almost identical (Supplemental Figure 2S
), suggesting that shipping frozen PBMC using liquid nitrogen containers is suitable for these types of studies. It is important to point out that the requirement for this shipping methodology had been established in previous studies (Bull et al., 2007
), and it was determined that despite the high cost associated with this shipping strategy, it was critical to adequately preserve the specimens.
In initial rounds of this study, we did not include a viability marker in the staining combination; beginning in Round 6, however, a viability marker was included in the 7-color cocktail. Using this staining, we were able to determine that the percentage of viable cells recovered after the assay was comparable across institutions (data not shown).
In contrast to the consistent viability values, we detected considerable variation in the number of cells recovered in each laboratory (). Recovery ranged from 40% to 140% for a given donor. We could not find a correlation between % recovery and the outcome of the assay (as determined by the number of outlier measurements for that donor; data not shown). We hypothesize that this could indicate discrepancies in the cell counting methodologies used across laboratories. Although ICS assays, in contrast to other cell-based assays such as ELISpot or proliferation, are not sensitive to variation in cell input (similar results are obtained whether one or two million cells are stimulated; data not shown), a minimum number of cells must be stimulated to generate enough events for acquisition and analysis, as discussed below.
3.5. Number of collected events
When analyzing the individual FCS files generated in initial rounds, it appeared that the participating laboratories had acquired a wide range of total cell numbers per well, and that within the same laboratory, different numbers were acquired across different wells, as shown in . Although the protocol for Rounds 1 through 5 stated that a minimum of 80,000 CD3+ lymphocytes needed to be acquired per well, some laboratories were not in compliance, acquiring too few events (20,000 or fewer CD3+ lymphocytes). Starting in Round 3, the central analysis of the data included evaluation of the total number of CD3+ lymphocytes. The protocol was modified to emphasize that a minimum of 80,000 and a maximum of 100,000 cells of interest needed to be acquired per well (the upper limit was introduced to narrow the range of acquired events across laboratories). Moreover, feedback was given to the laboratories that were not able to acquire sufficient events. Possible causes for this problem were inaccurate cell counting methodology, inadequate centrifugation speed after fixation and permeabilization, or cell loss during aspiration of the wells after washes (either by incorrect use of the manifold or poor plate-flicking technique). There was no correlation between the different cell counting techniques (Guava, hemocytometer, etc.) or aspiration methodologies (flicking versus vacuum manifold) and the cell numbers acquired. The feedback given to the laboratories resulted in improvements for certain laboratories, but others still struggled to reach the minimum number of required events. In addition, beginning in Round 6, the number of events to be acquired was further increased to 120,000 to 150,000 CD3+ lymphocytes to allow analysis of the smaller cell subsets identified with the 7-color cocktail; this made the cell recovery requirement at the end of the assay more challenging for some laboratories.
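The per-well acquisition requirement lends itself to a simple automated compliance check. The sketch below uses the Round 6–7 target range; the function name and event counts are hypothetical.

```python
# Illustrative compliance check for the acquisition requirement described
# above (Rounds 6-7 target: 120,000-150,000 CD3+ lymphocyte events per well).
MIN_EVENTS, MAX_EVENTS = 120_000, 150_000

def noncompliant_wells(cd3_events_per_well):
    """Return indices of wells whose CD3+ event count is outside the target range."""
    return [i for i, n in enumerate(cd3_events_per_well)
            if not (MIN_EVENTS <= n <= MAX_EVENTS)]
```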
As expected, acquisition of a low number of events had a clear impact on the accuracy and precision of the data and could contribute to decreased inter-laboratory reproducibility. In the example shown in , acquisition of a low number of events led to imprecision by under-estimating cytokine responses, especially for low-frequency responses such as the CD4 cytokine response in this case.
3.6. Intra-laboratory variation
Beginning with Round 3, we were able to provide enough cells to the participants so that each response was tested in triplicate. This allowed us to evaluate intra-laboratory variability. For each laboratory, the %CV or SD, depending on the level of response, was calculated for each set of triplicates (). Moreover, this variability was compared to the intra-laboratory variability of the GS, since that set of data was considered optimal. The variability within triplicates for the GS was low, with a mean %CV lower than 7% and a mean SD lower than 0.02 in Rounds 3 through 7.
In general, the majority of the laboratories generated reproducibly tight data, with %CVs and SDs equal to or lower than those of the GS (represented by small boxes in ; for example, Laboratories 8 and 9 for both %CV and SD). Variable data, represented as large boxes (e.g. the %CVs of Laboratories 5 and 13) or as %CVs and SDs above the GS mean, was usually attributed to technical problems during the assay (cell loss in a given well due to an inadequate aspiration method, cross-contamination, etc.). Interestingly, although there was no correlation between data from different laboratories and/or rounds of testing, high variability within triplicates usually indicated a high number of outlier measurements for a given laboratory.
3.7. Instrument performance and setup
Since central analysis of individual FCS files was performed in each round, we were able to detect issues related to instrument performance and setup that had an impact on the quality of the data, which could also account for inter-laboratory variability.
For instance, compensation was examined by looking at all the possible color combinations in un-gated plots. Over- or under-compensation was seen in some cases, especially during initial rounds. As expected, inadequate compensation yielded outlier measurements (data not shown).
In addition, we observed that the different populations (FSC, SSC, CD3, CD4, CD8, cytokine+) fell in different locations across instruments. This variation could contribute to inaccurate measurement of the responses if the populations of interest were off-scale or difficult to discriminate. Furthermore, this variability made it extremely difficult to generate an accurate generalized analysis template that would fit the majority of the data for centralized analysis.
Beginning in Round 3, single stained cells or pre-stained compensation beads were lyophilized and included in the staining plates so all participants could use exactly the same reagents for instrument setup. In addition, the protocol distributed included target values for each channel (generated using the pre-stained lyophilized cells or beads in the central laboratory) specific for the different instruments, along with instructions for manual or automatic compensation. This strategy led to overall better instrument setup and more homogenous data across sites (data not shown) that significantly facilitated central analysis.
In addition to providing detailed guidelines for instrument setup, we evaluated and compared the performance of flow cytometers across institutions, starting in Round 6. This assessment was done both quantitatively and qualitatively in each participating laboratory by providing calibration beads (CS&T and fluorescence calibration 8-peak beads), pre-stained compensation beads and a detailed protocol. For each detector, the following parameters were calculated: Qr (a measure of fluorescence detection efficiency); Br (optical background); SDen (standard deviation of electronic noise); and the Stain Index (SI) of the compensation beads. SI provides an estimate of the sensitivity/resolution of each channel and is calculated using the following formula: SI = (median [+ peak] − median [− peak]) / (2 × robust SD). A high SI, high Qr and low Br are desirable, as they reflect good resolution and low background for a given channel. All the calculated parameters varied widely across institutions for each detector. For example, the SI of the compensation beads ranged from 28.5 to 151, 46.1 to 347.3, and 6.8 to 36.1 for the FITC, APC and V450 channels, respectively. Of note, SI calculated from beads was highly correlated with SI calculated from stained cells (Supplemental Figure 1
). More importantly, combining the instrument performance information obtained from the SI and the 8-peak beads with the Qr and Br measurements for each detector allowed us to dissect the specific causes of low sensitivity in a given detector. For example, if a high Br was calculated, poor resolution was very likely due to high optical background, suggesting that the filter and/or the flow cell needed to be checked on that particular instrument. If a low Qr was calculated, then low sensitivity was linked to an issue with the optics of that detector. In the example shown in Supplemental Figure 1S
, the difference in resolution in the FITC channel between the instruments in Laboratories 1 and 2 was due to a combination of high background and low detection efficiency of the FITC detector in Laboratory 2.
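The Stain Index formula quoted above can be made concrete with a worked example. The input values below are illustrative, not measured instrument data; by convention the robust SD is taken from the negative peak.

```python
# Worked example of the Stain Index formula:
#   SI = (median[+ peak] - median[- peak]) / (2 * robust SD)
def stain_index(median_pos, median_neg, robust_sd_neg):
    """Sensitivity/resolution estimate for one detector channel."""
    return (median_pos - median_neg) / (2.0 * robust_sd_neg)
```

For instance, a positive-peak median of 5,200, a negative-peak median of 200, and a robust SD of 50 yield an SI of 50; a wider negative peak (larger robust SD) lowers the SI, reflecting poorer resolution.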
The results of the instrument performance assessment were summarized in the individual reports distributed at the end of each round, and suggestions were made in order to correct any documented issues.
3.8. Gating strategy
All participating laboratories were requested to provide raw data (FCS files) at the end of each round. This enabled the central laboratory to analyze all of the data generated by each participating laboratory. The responses obtained from this centralized analysis were compared to those from the individual analysis performed in each laboratory.
In the majority of cases, good correlation was observed between the two analyses, with correlation coefficients higher than 0.9 (data not shown). When a poor correlation was obtained, the central laboratory carefully analyzed the gating strategy used by that particular laboratory. One of the most common problems encountered is represented in . Non-inclusion of populations with down-modulated cell surface antigens (CD3-dim, CD4-dim or CD8-dim) led to underestimation of the gated cytokine responses, since some of the cytokine+ cells fall into these dim subsets, as has been previously shown (Maecker et al., 2001
; Bitmansour et al., 2002).
When these problems or any other non-optimal gating issues were observed, feedback was given to the relevant participants. In addition, a suggested gating strategy was provided, and it was highly recommended to back-gate on the cytokine+ cells in order to visualize the location of those cells within the CD3, CD4 and CD8 parameters. When it was possible to follow up on these recommendations with the same operator, high correlation between the centralized and laboratory analyses was usually achieved after one or two rounds of testing (data not shown).
Finally, a common problem accounting for low correlation between the centralized and individual analyses was how conservatively the gates used to discriminate cytokine-positive from cytokine-negative cells were set.
3.9. Pass/fail criteria for ICS assays
After analysis of the data generated during seven rounds of this ICS QAP, we have identified key factors responsible for inter-laboratory variability that reflect optimal or suboptimal performance of an ICS assay. These factors cover all aspects of the assay, from cell processing (cell viability, cell recovery and intra-laboratory variability), through instrument setup and acquisition (compensation and number of collected events), to data analysis.
During all rounds of testing, the importance of these factors was extensively discussed with all participants, and an analysis of each element was provided within the individual report generated for every laboratory. As shown in , when these different factors were taken into consideration and only optimal data were included, along with a centralized analysis, the inter-laboratory variability decreased significantly.
Effect of acceptance criteria and gating strategy on inter-laboratory variability
The participants of this program agreed to adopt these factors, along with the optimal range determined by the GS, as pass/fail criteria (). For each parameter, we established a target value that we consider optimal. These key indicators of performance were based either on the data provided by all laboratories (e.g., for cell viability and recovery in each round) or on data from the GS (e.g., for intra-laboratory variability), and were set so that excellent, good and fair performances could be discriminated. Additionally, a scoring system was designed that allows one to accurately grade the performance of a given laboratory in each specific aspect of the assay. For example, with regard to the number of collected events, 10 points are allotted if the average number of relevant events collected across wells equals the target number provided in the protocol; 8 points are allotted if the average is between 80–99% of the target; 6 points if the average is only 60–79%, and so on. No points are allotted if the average number of collected events is less than 19% of the target. When this scoring system was applied to the data generated in Round 7, the scores assigned to each laboratory correlated highly with the central laboratory's overall impression of the quality of the data, based on its experience (data not shown).
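The event-count scoring rule above can be sketched as a banded lookup. Only the 100%, 80–99%, 60–79% and below-~20% tiers are stated in the text ("and so on"); the intermediate 40–59% and 20–39% bands below are our assumption, included solely to make the sketch complete.

```python
# Illustrative implementation of the event-count scoring rule described above.
def event_score(avg_collected, target):
    """Score one laboratory's average collected events against the protocol target."""
    pct = 100.0 * avg_collected / target
    if pct >= 100:
        return 10
    if pct >= 80:
        return 8
    if pct >= 60:
        return 6
    if pct >= 40:
        return 4  # assumed band, not stated in the text
    if pct >= 20:
        return 2  # assumed band, not stated in the text
    return 0
```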
ICS Assay Acceptability Criteria