We observed substantial within-subject interlaboratory variability in QFT-GIT interpretations and IFN-γ measurements when blood samples collected from the same person at the same time were tested in three different labs. Of the 97 subjects tested in three labs, 11% had discordant QFT-GIT interpretations based on the original reported data. Electronic transfer of data was not possible for one of the three labs testing specimens for this study, and a portion of the variability in test interpretation was associated with manual data entry errors. These errors included data misalignments and a misplaced decimal point, which were encountered with manual data entry but not with electronic data transfer. All three labs used an automated ELISA workstation to assist in performing QFT-GIT, and this may have avoided additional data entry errors. Unlike manually performed ELISAs, automated ELISA workstations can read specimen barcodes that discriminate subjects and QFT-GIT tube type (i.e., Nil tube, TB Antigen tube, Mitogen tube) and assign OD values to specific specimens. This avoids some inaccuracies that have been attributed in prior studies to data entry errors and transposition of IFN-γ measurements.
A third type of error was recognized for six subjects who had exaggerated TB values in one lab due to errors in interpreting OD values when they were over the working range of the ELISA workstation. Certain lots of ELISA kits with higher activity as evidenced by higher OD values for standards tended to have higher ODs for plasma samples and have more TB ODs above the working range for the ELISA readers (data not shown). Data from the six subjects with OD values over the working range were excluded from the reconciled dataset. Removal of these subjects with methodological errors did not appreciably alter interpretation agreement because all were concordantly positive.
Correcting data entry errors made a substantial difference in interpretive agreement between each pair of labs and among all three labs. When reconciled data from Lab1 vs. Lab2, Lab1 vs. Lab3, or Lab2 vs. Lab3 were compared, 94.5%, 93.4%, and 96.7% of interpretations agreed, respectively. Agreement among all three labs was lower than any pairwise agreement: 92.3% of subjects had concordant results in all three labs after the data were reconciled.
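The relationship between pairwise and three-way agreement can be sketched in Python. This is a minimal illustration, not the analysis code used in the study; the function names and the five made-up subjects are assumptions introduced here.

```python
from itertools import combinations


def pairwise_agreement(results):
    """Fraction of subjects with matching interpretations for each pair of labs.

    `results` maps a lab name to a list of interpretations, with every
    list in the same subject order.
    """
    agreement = {}
    for lab_a, lab_b in combinations(sorted(results), 2):
        pairs = list(zip(results[lab_a], results[lab_b]))
        agreement[(lab_a, lab_b)] = sum(x == y for x, y in pairs) / len(pairs)
    return agreement


def overall_agreement(results):
    """Fraction of subjects whose interpretation is identical in every lab."""
    rows = list(zip(*results.values()))
    return sum(len(set(row)) == 1 for row in rows) / len(rows)


# Hypothetical interpretations for five subjects (NOT study data).
example = {
    "Lab1": ["pos", "neg", "neg", "pos", "neg"],
    "Lab2": ["pos", "neg", "pos", "pos", "neg"],
    "Lab3": ["pos", "neg", "neg", "neg", "neg"],
}

pairwise = pairwise_agreement(example)
overall = overall_agreement(example)
```

Because a subject counts as concordant overall only when every pair of labs agrees, the three-way figure can never exceed the lowest pairwise figure.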
Several pieces of evidence suggest that the majority of discordance in QFT-GIT interpretation remaining after data reconciliation was due to variability in measuring TB Response. While none of the subjects with discordance attributed to data entry errors had all TB Response values within 0.25 IU/mL of the cutoff separating positive and negative interpretations, 86% of those with discordance after data were reconciled had all TB Response values within this range. Additionally, 37% of the subjects who had one or more TB Response values within this range after data were reconciled had discordance, but none of the subjects without a TB Response within this range had discordance. These statistics do not, however, describe the actual magnitude of variability in TB Response.
We examined the magnitude of variability in TB Response and the two IFN-γ measurements used to calculate TB Response. Of the many indices of variability, LOA may be the most informative. LOA is expressed in units of test measurement and includes bias. W-S CV% masks the impact of IFN-γ concentration magnitude on variability, while ICC and W-S SD do not take into account the bias between measurements. Variability, as measured by LOA, was greater for higher IFN-γ measurements. This was observed for Nil, TB, and TB Response, but because TB and TB Response values tended to be larger than Nil values, greater variability was observed in TB and TB Response, especially for subjects with concordant positive interpretations. Because TB Response is calculated from two measurements, its variability could be greater than the variability in measurements used in the calculation (i.e., TB and Nil). Additionally, because Nil and TB are measured in the same ELISA, subtraction of Nil from TB could reduce variability in TB Response by compensating for interassay bias if the bias was constant regardless of the level of IFN-γ measured. However, we observed that (1) the bias in measuring IFN-γ concentration was not constant, (2) the variability in TB Response tracked the variability in TB, and (3) subtracting Nil did not fully compensate for variability in TB when calculating TB Response. Another reason for lower quantitative variability for people with negative results is that the TB Response is constrained to a relatively small range (typically <0.35 IU/mL) compared to the TB Response for those with positive results.
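The compensation argument can be illustrated with a short Python sketch. All numbers are hypothetical: the TB and Nil values, the 0.05 IU/mL additive offset, and the 1.2 scale factor are assumptions chosen for illustration, not values from the study.

```python
def tb_response(tb, nil):
    """TB Response is the TB value minus the Nil value (IU/mL)."""
    return tb - nil


# Hypothetical "true" IFN-γ concentrations for one subject (IU/mL).
tb_true, nil_true = 0.50, 0.10
reference = tb_response(tb_true, nil_true)

# Constant (additive) interassay bias: the same offset is added to both
# tubes, so it cancels when Nil is subtracted from TB.
offset = 0.05
additive = tb_response(tb_true + offset, nil_true + offset)

# Concentration-dependent (here proportional) bias: the larger TB value
# is shifted more than the Nil value, so the bias does not cancel.
factor = 1.2
proportional = tb_response(tb_true * factor, nil_true * factor)

additive_error = additive - reference          # approximately 0
proportional_error = proportional - reference  # approximately 0.08
```

Under these assumptions, subtraction removes a constant bias entirely but leaves most of a bias that grows with concentration, which is consistent with the observation that variability in TB Response tracked variability in TB.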
While subjects with concordant positive interpretations had more variability in TB Response than those with concordant negative interpretations, the variability near the cutoff is of greater importance because of its effect on interpretive agreement. Bland-Altman analysis allows assessment of variability in paired measurements and identifies the range of measurements encompassing 95% of TB Response variability associated with repeat testing. Because variability is not uniform across the range of TB Response values, applying a global measure of variability derived from the entire range may not be suitable near the cutoff. Among the 14 subjects with a mean TB Response of 0.10 through 0.60 IU/mL (i.e., 0.35±0.25 IU/mL), which included 6 of the 7 subjects with discordant QFT-GIT interpretations, the upper LOA was as high as 0.43 IU/mL and the lower LOA was as low as −0.46 IU/mL. The 95% CIs for LOAs may be relatively large because of the small number of subjects with mean TB Response values near the cutoff. Clinicians, naive to the direction of comparison, can expect results from a second lab to be within 0.46 IU/mL of the first with 95% certainty. Because this estimate of variability is determined for a range (i.e., 0.10 through 0.60 IU/mL), it overestimates variability for TB Response values near 0.10 IU/mL and underestimates variability for TB Response values near 0.60 IU/mL. Another consideration is that for a particular TB Response, changes in only one direction can alter test interpretation.
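A generic Bland-Altman calculation, restricted to pairs whose mean lies near the cutoff, might be sketched in Python as follows. This uses the standard formula (bias ± 1.96 × SD of the paired differences) and is not necessarily the exact procedure used in this study; the function names and the sample values are hypothetical.

```python
from statistics import mean, stdev


def pairs_near_cutoff(first, second, low=0.10, high=0.60):
    """Keep paired measurements whose mean lies within [low, high] IU/mL."""
    return [(a, b) for a, b in zip(first, second) if low <= (a + b) / 2 <= high]


def limits_of_agreement(pairs, z=1.96):
    """Return (lower LOA, bias, upper LOA) for paired measurements.

    Bias is the mean of the paired differences; the limits are
    bias -/+ z times the sample SD of those differences.
    """
    diffs = [a - b for a, b in pairs]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias - z * sd, bias, bias + z * sd


# Hypothetical TB Response values (IU/mL) from two labs; NOT study data.
lab_a = [0.30, 0.40, 0.20, 0.50, 4.10]
lab_b = [0.20, 0.45, 0.25, 0.35, 5.30]

near = pairs_near_cutoff(lab_a, lab_b)  # drops the pair with mean 4.70
lower, bias, upper = limits_of_agreement(near)
```

Stratifying before computing the limits keeps the large differences seen at high TB Response values from inflating the estimate of variability near the cutoff.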
The amount of uncertainty in interpreting QFT-GIT that is acceptable has not been established. Whereas LOA encompasses a range for 95% of the test-retest differences, bias ± W-S SD encompasses 52% of the variability expected with retesting. W-S SD also reflects the variability relative to the true value such that 68% of measurements will be within one W-S SD of the theoretical true value (typically estimated as the subject's mean value). W-S SD for TB Response was as high as 0.16 IU/mL for subjects with mean TB Response near the cutoff (i.e., 0.10 through 0.60 IU/mL). W-S SD, which is also referred to as “wobble”, is intended to describe random variation. What we measured as interlaboratory bias could be misinterpreted as random variation if testing were performed in a random selection of laboratories.
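The 95%, 52%, and 68% figures follow from the normal distribution, as the following Python sketch shows. It assumes that measurement errors are independent and normally distributed with a standard deviation equal to the W-S SD; that assumption, and the function name, are introduced here for illustration.

```python
from math import erf, sqrt


def normal_coverage(z):
    """Probability that a standard normal variable lies within +/- z."""
    return erf(z / sqrt(2))


# A single measurement falls within one W-S SD of the true value.
single_measurement = normal_coverage(1.0)

# The difference between two measurements has SD sqrt(2) * W-S SD, so
# one W-S SD corresponds to 1/sqrt(2) SDs of the test-retest difference.
test_retest = normal_coverage(1.0 / sqrt(2))

# Limits of agreement span +/- 1.96 SDs of the test-retest difference.
loa = normal_coverage(1.96)

print(f"within 1 W-S SD of true value: {single_measurement:.1%}")
print(f"retest within bias +/- 1 W-S SD: {test_retest:.1%}")
print(f"retest within LOA: {loa:.1%}")
```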
We harmonized testing methods as much as possible, so that there were no differences in delays to incubation, incubation time, or incubation temperature, and only minimal differences in duration of storage. However, there were areas where consistency could not be maintained. For example, labs used QFT-GIT kits with different lot numbers, different automated ELISA workstations, different calibration curves, and different reporting methods. Greater variability may have occurred with less harmonization of test methods.
Various borderline zones around the cutoff have been proposed to address variability. However, prior investigations have not considered interlaboratory variability or the impact of non-uniform variability in measuring TB Response. Most prior investigations of variability have been limited by relatively small sample sizes. The small number of subjects near the cutoff also limited our stratified analysis. Despite the lack of available data from interlaboratory reproducibility studies, our estimates of discordance (11.3% before and 7.7% after data reconciliation) seem to be in keeping with intralaboratory between-run estimates of discordance.
Interlaboratory variability is a symptom of a larger problem of IGRA imprecision. IGRA imprecision may also explain a portion of the variability encountered with serially performed IGRAs among healthcare workers. We measured test variation that is not attributable to subject variation (e.g., due to new infection, treatment, or fluctuations in immune status). Blood samples were collected at the same time to exclude the effect of subject variation due to time. Additional studies are needed to assess IGRA imprecision and understand the components of variation seen in serial testing. The imprecision demonstrated with serial testing and by interlaboratory variability is also relevant when interpreting individual or initial IGRA results.
In conclusion, greater interlaboratory variability was associated with manual data entry and higher IFN-γ measurements. Manual data entry should be avoided. Our data suggest that variability in measuring TB Response may affect QFT-GIT interpretation, especially when the TB Response is near the cutoff. Therefore, consideration should be given to interpreting such responses as “borderline” rather than negative or positive, and clinical decisions regarding treatment or the need to repeat these tests should be based on individualized clinical judgment considering the risk of infection, the risk of disease, and the proximity of the TB Response to the cutoff. In the population we studied, interpreting TB Response values of 0.10 through 0.60 IU/mL as “borderline” would have avoided most changes in test interpretation due to measurement variability. However, this may not be the appropriate range for the entire population for whom QFT-GIT is recommended. Additional studies are needed to determine the optimal range of values for borderline results and to explore the impact of using a borderline interpretation.
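A three-category interpretation of the kind suggested above could be sketched in Python as follows. This is a deliberate simplification: it classifies on TB Response alone and omits the Nil and Mitogen criteria that are also part of the QFT-GIT interpretation. The function name and default bounds are assumptions taken from the 0.10 through 0.60 IU/mL range examined in this study, not an established standard.

```python
def interpret_tb_response(tb_response, low=0.10, high=0.60):
    """Classify a TB Response (IU/mL) using a hypothetical borderline zone.

    Values from `low` through `high` (inclusive) are "borderline"; values
    below `low` are "negative" and values above `high` are "positive".
    Nil and Mitogen criteria are intentionally not modeled here.
    """
    if tb_response < low:
        return "negative"
    if tb_response > high:
        return "positive"
    return "borderline"


# Hypothetical TB Response values (IU/mL), NOT study data.
for value in (0.05, 0.10, 0.35, 0.60, 0.61):
    print(f"{value:.2f} IU/mL -> {interpret_tb_response(value)}")
```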