In this paper, we calculated the magnitude of phantom-derived voxel scaling changes in structural MRI images collected in the ADNI trial. We assessed whether a scaling correction method based on post-processing of the brain scans themselves, using a widely available registration algorithm (Woods et al., 1993), can correct for scaling changes as effectively as a scaling phantom. In all cases, scan pairs were within-scanner.
The phantom scans gave a mean absolute percentage volume scale change of 0.90% for baseline scans and 0.87% for the repeat scans, with no significant difference (p = 0.97) in the magnitude of the correction between baseline and repeat scans. The implication of this finding is, as would be expected, that on average there was neither a systematic scaling difference (bias) in the scanners over time, nor a systematic change in the phantom scaling correction. The phantom corrections at baseline and follow-up for individual subjects were correlated, with a mean absolute percentage volume change between repeat and baseline scans of only 0.33%. This suggests that scanner-related change in voxel sizes resulted in artifactual errors in the measurement of brain volume change that were on average of a similar magnitude to those seen in normal ageing over 1 year, typically 0.3–0.5% for healthy individuals aged 50–75 years (Scahill et al., 2003). This is considerably less than the annual losses in MCI or AD (1–2% of baseline brain volume per year) (Fox and Schott, 2004). However, this finding must come with a number of caveats. First, because this study was designed to compare methods of correction for scaling change, the scanners chosen were not representative of scanners generally: they had to have been “qualified” to be included in ADNI, they were part of an ongoing QC programme and, importantly, scans from scanners with an obvious problem had been excluded by the central QC site (Mayo Clinic) that selected the scans for this comparison. For these reasons, the temporal stability of the scanners analyzed here is most likely better than what might be expected in a typical clinical trial. Thus these data likely underestimate the deleterious impact of scanner scaling instability in most clinical trials. Second, although some scanners showed no change at all in voxel sizes, there was quite a range ( and ) in the individual scanner-related change, with a number of scan pairs showing more than 1% volume change as measured by the phantom.
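To illustrate why a correction of nearly 1% at each time point can still leave only a small change between time points, the following minimal Python sketch converts per-axis scale factors into percentage volume changes. It assumes that the volume scale is the product of three per-axis scalings. The function names and the numerical values are hypothetical and are not taken from the ADNI data.

```python
def volume_scale(scales):
    """Volume scale factor implied by three per-axis scale factors (assumed model)."""
    sx, sy, sz = scales
    return sx * sy * sz


def percent_volume_change(scales):
    """Percentage volume change relative to the nominal voxel size."""
    return (volume_scale(scales) - 1.0) * 100.0


def percent_change_between(baseline, repeat):
    """Percentage volume scale change of the repeat scan relative to baseline."""
    return (volume_scale(repeat) / volume_scale(baseline) - 1.0) * 100.0


# Hypothetical phantom-derived scalings for one scan pair (not ADNI values).
baseline = (1.003, 1.003, 1.003)
repeat = (1.002, 1.003, 1.003)

print(f"baseline correction: {percent_volume_change(baseline):.2f}%")
print(f"repeat correction:   {percent_volume_change(repeat):.2f}%")
print(f"repeat vs baseline:  {percent_change_between(baseline, repeat):.2f}%")
```

In this made-up example each scan needs a correction of roughly 0.8–0.9%, yet the change between the two scans is only about 0.1%, because most of the scaling error is common to both time points.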
Nine degrees of freedom (9DOF) registration of the phantom-corrected images was used to test whether any residual scaling error remained after phantom correction. In an ideal world, if both methods of scaling correction were perfect, the phantom correction would correct the images perfectly, and the registration algorithm would then recover a scaling change of exactly 1.0 for all scan pairs, giving a mean volume change and mean absolute volume change of 0.0%. Any deviation from this could be caused by the phantom correction algorithm (data that we assume to be phantom corrected are in fact not perfectly corrected), by the registration algorithm being inaccurate (e.g. adjusting for scaling unnecessarily), or by both. The mean percentage volume change was only 0.04%, which is negligible in practical terms, and importantly, when we reversed the images and re-ran the experiment, the result was −0.04%. This indicates that the registration algorithm is performing symmetrically, and hence that there is a small bias in the data. There was no significant difference (p = 0.97) between the control and AD groups, indicating no disease-related bias and implying that progressive atrophy in the AD group did not influence the registration-based correction. The majority of scan pairs in our data (98 out of 129 pairs, 76%) had a phantom scaling correction of less than 0.5% (an arbitrarily chosen threshold). For small scale changes (e.g. <0.05%) the phantom and the registration-based scalings are not tightly correlated, perhaps implying that we are at the practical limits of correction with this method. Future work should seek ways of improving the precision of scaling correction. Importantly, however, for large scale changes we found a small number of cases (7) where there was a marked and material difference between the phantom-derived scalings and the registration-derived scalings.
Visual inspection (blind to method of correction) suggested that the 9DOF registration produced a more correct solution in these cases. We feel that this implies that in a small number of cases the phantom produced an incorrect scale change, which could be corrected by the 9DOF registration. Taken together, these results suggest that the additional expense and logistic effort of scanning a phantom with every patient scan can be avoided by registration-based scaling correction.
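The symmetry check on the registration (running it forwards and then with the images reversed) can be sketched in Python as follows. The sketch assumes that each registration returns three per-axis scale factors and that a perfectly symmetric algorithm returns the reciprocal scalings when the image order is reversed. The function names and values are hypothetical and do not reproduce the registration software used in the study.

```python
def percent_volume_change(scales):
    """Percentage volume change implied by three per-axis scale factors."""
    sx, sy, sz = scales
    return (sx * sy * sz - 1.0) * 100.0


def symmetry_residual(forward, reverse):
    """Percentage volume change left after composing forward and reverse scalings.

    A value of zero means the two registrations are exact inverses of each other.
    """
    composed = tuple(f * r for f, r in zip(forward, reverse))
    return percent_volume_change(composed)


# Hypothetical scalings recovered from one phantom-corrected scan pair.
forward = (1.0002, 1.0001, 1.0001)
# Idealised reverse registration: exact reciprocal of the forward scalings.
reverse = tuple(1.0 / s for s in forward)

print(f"forward: {percent_volume_change(forward):+.2f}%")
print(f"reverse: {percent_volume_change(reverse):+.2f}%")
print(f"residual magnitude: {abs(symmetry_residual(forward, reverse)):.6f}%")
```

With these illustrative numbers the forward and reverse volume changes are about +0.04% and −0.04%, and the composed residual is essentially zero, which is the pattern expected from a symmetric algorithm applied to data with a small residual scale difference.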
In terms of the effect on the measurement of brain atrophy (Materials and methods), the mean BBSI values were similar whether measured from the uncorrected, phantom-corrected, or 9DOF registration-corrected scans (). Although the difference was not significant, it is worth noting that the BBSI values were on average about 3% lower with 9DOF correction. Importantly, however, there was a trend towards a reduction in the variability (standard deviation) of the BBSI values when scans were corrected for scaling errors with either method. The reduction in variability was greatest with 9DOF correction for the control group and was statistically significant. Both forms of correction reduced the SD of the mean atrophy rates: the 9DOF correction produced about a 10% reduction in the SD for both the control (13%) and AD (9%) groups. This is equivalent to an approximately 20% reduction in variance which, if there were no changes in mean rates of atrophy, would equate to an approximately 20% reduction in sample sizes. Sample size estimates for disease-modifying trials in AD are driven by the variance in the outcome measure and the expected difference in the mean rate of atrophy in the treated group versus the placebo group. The maximum effect one could reasonably expect for an atrophy-slowing therapy would be to reduce the AD rate to the control rate; as such, sample sizes are proportional to (SD/(difference in means))² (Fox et al., 2000). Sample size estimates based upon the atrophy rates in the AD subjects in this study would therefore be 10–12% lower with (either) correction for voxel scaling than if no correction was used. This could improve group separation of atrophy rates in AD and controls and make a material difference in therapeutic trials, especially if less well controlled scanners are included.
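As a worked illustration of the proportionality between sample size and the square of SD divided by the difference in means, the following Python sketch computes the relative sample size after correction. The function names are hypothetical, and the 3% smaller difference in means used in the second call is an assumed value chosen for illustration, not a figure derived from the study data.

```python
def relative_sample_size(sd_ratio, effect_ratio=1.0):
    """Sample size relative to the uncorrected case.

    sd_ratio:     corrected SD divided by uncorrected SD
    effect_ratio: corrected difference in means divided by the uncorrected one
    Uses the proportionality n ~ (SD / difference in means) ** 2.
    """
    if sd_ratio <= 0 or effect_ratio <= 0:
        raise ValueError("ratios must be positive")
    return (sd_ratio / effect_ratio) ** 2


def percent_reduction(relative_size):
    """Percentage reduction in sample size implied by a relative sample size."""
    return (1.0 - relative_size) * 100.0


# A 10% reduction in SD with unchanged mean rates of atrophy.
print(f"{percent_reduction(relative_sample_size(0.90)):.0f}% fewer subjects")

# Assumed example: 9% lower SD combined with a 3% smaller difference in means.
print(f"{percent_reduction(relative_sample_size(0.91, 0.97)):.0f}% fewer subjects")
```

The first case gives a reduction of about 19%, which is the "approximately 20%" figure for unchanged means. The second shows how a slightly smaller difference in means offsets part of the gain from the lower SD.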
An important aspect of the method is the pre-segmentation of the brain prior to the use of the registration step. The original Woods method (Woods et al., 1993) required a segmented image of the brain. Subsequent validation studies showed that this significantly improved the accuracy of the overall registration compared with unsegmented images (Freeborough et al., 1996; Woods et al., 1998).
Gunter et al. (2003) later showed better group separation (AD and control groups) using a dilated brain mask. For this paper we used 8 dilations, which include the skull/scalp boundary. In this paper we did not assess different registration algorithms for correction of voxel scaling changes in longitudinal MR studies; we focused on a single widely used algorithm. Future work could investigate different interpolation methods to smooth the cost function near the registration point, with the aim of improving the precision. Additionally, it would be useful to understand further which parts of the image are most important for this type of registration: a highly complex structure such as the brain provides good 6DOF registration, but the skull or the high-contrast scalp/skull boundary may be more important to constrain scaling, either as part of a 9DOF algorithm or a 3DOF algorithm (just scalings). Another alternative is to use an intensity-based method that is robust to large percentages of statistical outliers. Approaches like this have been proposed (Smith et al., 2002; Freeborough et al., 1996; Ourselin et al., 2000), and the ADNI dataset may be a way of assessing their performance at correcting for these scaling issues. Furthermore, having run these experiments on a subset of 129 well controlled pairs of scans, it would be interesting to examine the whole ADNI dataset. This should have greater power to assess 9DOF registration correction of scaling errors and to assess whether the trend towards a reduction in variance is significant with larger datasets.
The ADNI study went to great lengths to image a phantom with every subject scan, and has provided us with realistic, quantitative data such as might be obtained in future clinical trials. In this dataset, the mean correction to baseline and repeat scans was small, and the ratio of the measurements (i.e. change over time) was smaller. In addition, the effects of phantom correction on the BBSI were not significant, and there was a correlation between the size and direction of the correction applied to baseline and repeat scans. This suggests that it is more important to ensure that a subject is scanned at the same centre and on the same quality-controlled scanner than it is to scan the phantom with every subject. In this way, as long as the scanner was regularly and carefully serviced, the relative change would be small enough not to have a significant effect on measurements of atrophy, even if larger absolute scaling errors are present and unchanging over time. Phantoms will clearly play an important role in calibrating the scanner as part of routine maintenance, owing to the high level of accuracy and precision thus obtained, and the use of high-quality phantoms to accredit imaging sites for clinical trials could have great value in ensuring that all sites in multi-site trials have similar stability to the carefully monitored sites used in the ADNI study. The results from the visual assessment also suggest alternative strategies. In general, the 9DOF registration was the preferred solution where the scaling factors found by the phantom and the 9DOF registration method were most different. However, we can imagine cases where the 9DOF registration will fail. The 9DOF registration is most likely to be inaccurate when there is significant motion artifact, excessive atrophy or large intensity differences. If any of these factors are known to be likely, then a phantom scan may be prudent.
For example, phantom scaling correction may be preferable for patients who are more likely to move during the scan, for longer-running trials, or if a known scanner upgrade is unavoidable. The results from the visual assessment also showed an example with a warping distortion, presumably due to uncorrectable gradient non-linearity. This suggests that it is also important to place subjects consistently as close as possible to the iso-centre of the magnet and to position subjects in the same location for each visit. In addition, the organizers of a clinical trial may wish to invest in a pre-qualifying phase, where an imaging centre uses a phantom to benchmark its quality control processes and prove to a hub site that it can routinely scan subjects to a known quality standard (Jack et al., 2003). These recommendations may provide an alternative, more cost-effective method of control than a phantom scan with every subject. The comparison of the value of the two scaling change correction methods was done using a single structural MRI endpoint, namely the BBSI for quantification of global brain atrophy. For other endpoints, especially those involving local measurements of atrophy or of cortical thickness, the relative merits of the two approaches may differ; however, it is likely that scaling changes would affect any measure of volume change over time. Also, this paper has focused on longitudinal measurements of brain atrophy. For cross-sectional studies, although absolute voxel scaling errors (which the registration method does not correct) may have an impact, any effect will be small compared to inter-individual variation in brain volumes and morphology.