We processed 710 images acquired at 56 different sites and cross-validated and tested classifiers for their detection accuracy to evaluate possible effects of manufacturer, magnetic field strength, and coil configuration. Apart from a trend in two sets, the subjects in all sets had equal age distributions. The two largest classifiers were trained on 417 images, and their cross-validation accuracies reached 87% on average, which corresponds to previously reported performances (Cuingnet et al., 2011). The mean LOO-CV accuracy consistently increased, as expected, with higher numbers of training samples. When applied to new data sets, the proposed method achieved reasonably high accuracy and performed comparably to several other approaches, both on ADNI data and on single-site data (Cuingnet et al., 2011; Plant et al., 2010). Unlike most studies, which either use single-site data or pool the entire ADNI data set as training samples to validate the algorithm, our goal of testing hardware effects on classification accuracy required us to separate the data into smaller subgroups.
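For illustration, the following is a minimal sketch of the LOO-CV procedure underlying the reported accuracies, using a linear SVM from scikit-learn on placeholder data; the feature matrix X and labels y stand in for vectorized GM maps and diagnostic labels, and the actual pipeline, kernel, and parameters of the study may differ.

```python
# Minimal LOO-CV sketch for a linear SVM classifier (illustrative only).
# X (n_samples x n_features) stands in for vectorized GM maps;
# y stands in for diagnostic labels (e.g., AD vs. control).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))   # placeholder feature vectors
y = np.repeat([0, 1], 20)        # placeholder labels, 20 per group

clf = SVC(kernel="linear", C=1.0)
acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"LOO-CV accuracy: {acc:.2%}")
```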
We assumed that pure hardware sets would show a better LOO-CV accuracy. The performance of individual pure sets of images varied strongly, as shown in Supplementary Figs. 1–3. Such single performance values are an uncertain estimator, and the results from the permutation tests indicate that classifiers using images acquired with mixed hardware performed equally well. Since each pure set of images consisted of different subjects, the effect of individual anatomy on the accuracy was a covariate of the hardware effect. It is important to note that the sample sizes of the subgroups differed but were all greater than 20, which was found to be the minimum for the proposed classification problem (Klöppel et al., 2009). Even though larger sample sizes may mean more stable classifiers and better performance, the performances of the different hardware settings were found to be statistically similar.
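A permutation test of this kind can be sketched as follows; `correct_a` and `correct_b` are hypothetical 0/1 vectors recording whether each test image was classified correctly by a pure-hardware and a mixed-hardware classifier, respectively, and the exact test statistic used in the study may differ.

```python
# Hedged sketch of a permutation test on the accuracy difference between
# two classifiers (e.g., pure- vs. mixed-hardware training sets).
import numpy as np

def perm_test_accuracy(correct_a, correct_b, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([correct_a, correct_b])
    n_a = len(correct_a)
    observed = correct_a.mean() - correct_b.mean()
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # break the set assignment
        diff = pooled[:n_a].mean() - pooled[n_a:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)             # two-sided p-value
```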
In the comparison of PAIR_1.5 T and PAIR_3.0 T, the effect of individual anatomy was reduced to changes from aging and, possibly, progressive disease-related atrophy over a period of 2 to 102 days. Because all images at 3 T were acquired after the 1.5 T scans, we expected the set of images taken at the later time point to be equally or more discriminative due to the progression of the disease in some individuals. Experimentally, the opposite was observed. Classifiers trained on images acquired at 1.5 T predicted the image sets acquired at the same field strength slightly (1.6 percentage points) but significantly better. The classifier trained on the SOLO_1.5 T set performed 6 percentage points better on the 1.5 T than on the 3 T test data. Given that the test sets were composed of the same subjects, these differences are remarkable. However, they were probably due to chance, since the variation of the decision value was centered on zero for both diagnostic groups. In the ADNI study, the higher SNR of 3 T systems compared to 1.5 T was, by design, used to increase spatial resolution (Jack et al., 2008). The higher image resolution did not improve performance in this case, potentially because the processing pipeline resampled the GM maps to a 1.5 mm isotropic voxel size during spatial normalization. The reproducibility of the decision value was similarly high for both sample sizes tested: the standard deviation of the introduced error was more than 10 times smaller than the difference between the means of the diagnostic groups. Changing the field strength of the scanner led to a variance that was 3 times higher than the back-to-back variance; nevertheless, no systematic effect of the change in field strength on the decision value could be observed. The small training set was not more vulnerable to changes in hardware; conversely, the larger training set did not decrease these kinds of errors. The difference in the decision value between groups increased with the size of the training set: the large data set accentuated differences related to the disease, but also differences related to the acquisition process. When the number of training samples was small, adding samples from heterogeneous hardware to the training set increased the accuracy of the classifier, presumably because the benefits of a larger sample size outweigh the drawbacks of hardware inhomogeneity.
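To make the decision-value analysis concrete, the following hedged sketch computes the per-subject shift of the SVM decision value between paired acquisitions; `clf` is a fitted linear SVM as in the earlier sketch, and `X_15`/`X_30` are hypothetical feature matrices of the same subjects at 1.5 T and 3 T with matching row order. A mean shift near zero corresponds to the absence of a systematic effect reported above.

```python
# Hedged sketch: per-subject shift of the SVM decision value between paired
# 1.5 T and 3 T acquisitions of the same subjects.
def field_change_shift(clf, X_15, X_30):
    d_15 = clf.decision_function(X_15)   # signed distance to the hyperplane
    d_30 = clf.decision_function(X_30)
    shift = d_30 - d_15                  # error introduced by the field change
    # A mean near zero indicates no systematic effect; the standard deviation
    # quantifies the added variability relative to back-to-back scans.
    return shift.mean(), shift.std(ddof=1)
```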
From these results we conclude that the reproducibility of the post-acquisition pipeline is similarly high at both field strengths. The sources of variation, indistinguishable with the performed analysis, are (a) scanner noise, (b) varying image quality, and (c) variations in any step of the pre-processing pipeline, such as segmentation, resampling, or spatial normalization. Furthermore, a change in hardware setting introduces variation that can shift the decision value substantially. Two possible explanations come to mind: (a) random effects due to the physiological condition of the patient, the positioning of the head, or motion, or (b) systematic effects related to a specific change in the system. It should, however, be kept in mind that the results of the current study cannot readily be extended to multi-center settings with a less stringent system of quality control. In addition, the attempts to increase the comparability between 1.5 T and 3 T data are specific to the ADNI study and enabled successful classification across field strengths.
The results of this study have substantial implications for the clinical setting. Changing field strength introduces additional variance in the computed decision value and thus decreases accuracy compared to repeated measures on the same scanner. It should be noted that two scanners with identical hardware settings will not produce exactly identical results, and this may also influence classification accuracy. From a practical point of view, the choice of hardware would normally influence the decision in about 5% of cases. The obtained accuracy of about 84% is an encouraging result for an automated SVM-based disease classifier using images acquired at different centers, in comparison to conventional clinical ante-mortem AD diagnosis, which is not 100% reliable. Specifically, approximately 30% of cognitively normal subjects will meet pathological criteria for AD at post-mortem (Morris and Price, 2001). Especially when the number of available samples from one center was small, combining training images from two sets often resulted in a clear improvement of performance. The results did not indicate that mixing data from different centers would lead to a substantial loss of classification accuracy.
Since the 95% CIs of the performance varied as a function of training sample size and were wide for small sample sizes (e.g., 62.5–90% with 20 subjects per diagnostic group), quantifying the performance by a single point estimate of the accuracy is questionable. Reporting confidence intervals strengthens the interpretability of the estimated classification performance and provides a measure of diagnostic confidence for clinical applications.
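As an illustration of how wide such intervals are at clinical sample sizes, the following sketch computes an exact (Clopper-Pearson) binomial confidence interval for a test accuracy; the CI method used in the study is not specified here, so this is one common choice rather than necessarily the authors' procedure, and the example counts are hypothetical.

```python
# Hedged sketch: exact (Clopper-Pearson) 95% CI for classification accuracy.
from scipy.stats import beta

def accuracy_ci(n_correct, n_total, alpha=0.05):
    lo = beta.ppf(alpha / 2, n_correct, n_total - n_correct + 1) if n_correct > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, n_correct + 1, n_total - n_correct) if n_correct < n_total else 1.0
    return lo, hi

# Hypothetical example: 31 of 40 test subjects correct (20 per diagnostic
# group) yields a wide interval at this sample size.
print(accuracy_ci(31, 40))
```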