This study provides the first analysis of false positive rates in an imaging genetics study of VBM data using cluster size inference. Images from a group of 181 subjects with mild cognitive impairment were tested against a set of 700 ‘null’ SNPs. The analysis presented here suggests that rejection rates under both stationary and non-stationary assumptions are poorly controlled at low cluster-forming thresholds or for images with low smoothness.
The use of real genotype data is considered important, since the accurate modelling of linkage disequilibrium and population stratification is a challenge, and their effect on neural phenotypes is unknown (Meyer-Lindenberg et al., 2008
Null SNPs were selected from chromosome 3, with the simple rationale that none of the genes reported to have the strongest link with AD are present on this chromosome. While this is clearly a crude measure for selecting SNPs with no effect on grey matter distribution, the use of multiple permutations ensures that any possible effects are removed by breaking the association between genotype and phenotype. In fact, rejection rates obtained using permuted SNPs are not significantly different from those obtained without permutation (considering a 95% confidence interval at ± 2 SD), indicating that, for the purposes of this study, SNP effects on brain structure are indeed negligible.
We begin by considering the results obtained with the ADNI image dataset.
Effect of cluster-forming threshold, αc
The choice of cluster-forming threshold, αc was found to have a significant effect on cluster size inference rejection rates. For images smoothed with a 12 mm Gaussian kernel, both stationary and non-stationary tests were found to be well-controlled or conservative at the most stringent threshold (αc = 0.001). However, tests became increasingly anticonservative at lower thresholds uc (higher αc) for both 12 mm and 6 mm smoothed images.
A possible explanation for the poor performance at low uc
is bias in RFT's estimate of the expected number of clusters,
is over-estimated, expected cluster size is under-estimated (see Eq. (3)
), meaning that more clusters of a given size are labelled as significant. This over-estimation of
may reflect the inability of the Euler Characteristic, ρ
), to accurately estimate the number of clusters at low thresholds, where clusters are more numerous and tend to coalesce to form topologically complex patterns (Taylor and Worsley, 2008
Fig. 2 Accuracy of estimation of c, the theoretical number of clusters under RFT. Histograms show the empirical distribution of c across all 700 SNPs at three different cluster-forming thresholds (left to right), and with two different smoothing kernels (top (more ...)
12mm vs. 6 mm smoothing kernels
The application of a wide range of Gaussian smoothing kernels in VBM is evident in the literature — e.g. 4 mm (Schwartz et al., 2010
), 8 mm (Folley et al., 2010
) and 10 mm (Shen et al., 2010
), as well as the ‘standard’ 12 mm (Rosen et al., 2010; Ueda et al., 2010
). However guidelines on the particular choice of smoothing kernel have been described as ‘vague’ (Hayasaka and Nichols, 2003
), and there is a suggestion that kernel widths should be determined empirically (Worsley et al., 1996b
). Notably, with the use of high-dimensional warping methods like DARTEL (Ashburner, 2007
), there appears to be a trend towards lower smoothing kernels. Improved intersubject alignment means there is a reduced need for smoothing to ‘blur out’ warping errors. For example, Bergouignan et al. (2009)
use 12 mm smoothing with SPM's standard normalisation and 8 mm with DARTEL. While reduced smoothing should increase sensitivity to effects of smaller size by “Matched Filter” arguments, cluster size tests are most sensitive to effects that are larger than the noise smoothness. Hence, to the extent that large scale anatomical effects are present after either low- or high-resolution warping, high-resolution results may be more sensitive as effects will be larger in units of resels.
In the present study, differing amounts of smoothing were found to have a pronounced effect on rejection rates. Tests on images smoothed with a 6 mm Gaussian kernel were highly anticonservative at all thresholds including the highest (αc = 0.001), and were consistently more anticonservative when compared with 12 mm smoothing results.
Poor performance for low smoothness images is in fact to be expected under the lattice assumption of random field theory (Hayasaka and Nichols, 2003
). As image smoothness decreases, this lattice approximation breaks down, since the underlying variation is poorly-captured by discrete, voxel-wise sampling. This means that continuous RFT results are modelling unobserved, large intensity changes between sampled voxels. While previous reports have suggested 3 voxel FWHM smoothing (i.e. 6 mm FWHM smoothing for the 2 mmvoxels considered here) is sufficient (Nichols and Hayasaka, 2003
), for the ADNI data this is insufficient. Specifically, we find an over-estimation of the expected number of clusters, with the gap between expected and observed values,
, generally greater at 6 mm than at 12 mm (see ).
Stationary vs. non-stationary tests
Non-stationary cluster-wise rejection rates were generally similar, or slightly better-controlled than those assuming stationarity, suggesting that there is at least some non-stationarity present in the images. For non-stationary images, stationary tests would also be expected to perform worse at lower thresholds where clusters are larger and more likely to encompass extra-smooth regions, and this is indeed the case. A heuristic measure of image non-stationarity was obtained by plotting the distribution of voxel-wise FWHM, obtained from the RPV image produced by SPM (FWHM = RPV − 1/3
). A completely stationary image would be expected to have constant FWHM across the entire image volume. Any pronounced departure from this suggests non-stationarity. An analysis of 6 mm and 12 mm FWHM images (see ) finds a spread of around 4 mm to 8 mm and SD of 1.0 mm for 6 mm smoothed images, and 7 mm to 17 mm and SD of 2.6 mm for 12 mm images. While this spread of FWHM could be attributed to sampling variation, the theoretical SD of the FWHM estimator can be computed by simulation (see Appendix B of Hayasaka et al. (2004)
). We find theoretical SDs of 0.696 mm for 12 mm smoothed stationary images, and 0.348 mm for 6 mm images, which are much smaller than our observed values. While these theoretical SDs under stationarity again depend on the accuracy of the RFT results (and, note in particular the bias in the smoothness estimation for 6 mm smoothing), they provide further evidence of substantial image non-stationarity.
Fig. 3 Distribution of voxel-wise FWHM for ADNI images smoothed with 6 mm (left) and 12 mm (right) Gaussian smoothing kernels. Voxel-wise FWHM gives an indication of local smoothness and corresponds to the ‘full-width at half-maximum’ (more ...)
t vs. F image results
The t and F image cluster size results cannot be directly compared. While the single degree-of-freedom F-test we used is exactly equal to the square of the t-test used, the set of clusters generated will be different for two reasons. First, the one-sided α level used to define a t statistic threshold will not equal the square root of the F statistic threshold of the same α level (an F's level corresponds to the t's two-sided α level). Further, the F image has the clusters arising from negative t values. Thus there will be both more and different clusters in the F images for the same data and αc.
These caveats aside, the rejection rates on the real data were largely similar for the same αc's, with valid performance found only for 12 mm smoothed data with αc = 0.001.
In marked contrast to tests performed on the ADNI image dataset, non-stationary cluster size tests on simulated stationary and non-stationary random images were found to be valid (conservative) at both high and moderate cluster-forming thresholds (αc = 0.001, 0.01), irrespective of image smoothness.
Other studies using simulated images produced from stationary and non-stationary, Gaussian random fields have also considered the effect of varying both the cluster-forming threshold and image smoothing kernel. With stationary simulated images, Hayasaka and Nichols (2003
, ) found that cluster size tests were conservative over the same range of image smoothness with αc
= 0.001, 0.01, in agreement with our results. Using similar non-stationary simulated data, Hayasaka et al. (2004)
also found that non-stationary cluster size tests were conservative with images of low smoothness (comparable to our 6 mm non-stationary images), and with 20 subjects, but only considered αc
The large discrepancy in cluster size inference rejection rates between real and simulated image data over a range of thresholds and smoothing kernels suggests that there are features of the real VBM data that may be incompatible with the RFT method. This may for example be due to the inherent non-normality of VBM data, or to patterns of non-stationarity in real images that are more complex than those simulated here. Non-normality of VBM data has been reported before, but only when considering the accuracy of voxel-wise significance (Viviani et al., 2007; Salmond et al., 2002
). This other work found that imbalanced group comparisons required 12 mm FWHM smoothing to accurately control voxel-wise false positives, though balanced group comparisons were accurate with smaller kernel sizes. As genotypes are rarely equally frequent, the imbalanced results are most relevant to this setting.
We performed a number of additional simulations in order to investigate the role of non-normality in cluster size inference. VBM data is hard bounded between 0 and 1, and modulated VBM nearly so. A Shapiro–Wilks test for normality at each voxel, using the spmd5beta diagnostic toolbox (http://www.sph.umich.edu/~nichols/SPMd/
) reveals that both 6 mm and 12 mm smoothed images are indeed highly non-normal. This deviation is particularly marked for 6 mm images, with around 45% of voxels exceeding a nominal 5% Shapiro–Wilks threshold. In contrast, the stationary Gaussian noise-derived simulated images describe earlier show no significant deviation from normality. To test the effect of introducing non-normality to our simulations, we generated a set of images by first thresholding Gaussian noise images smoothed with 6 mm and 8 mm kernels, to produce ‘patchy’, binary images. These were then smoothed with 6 mm and 12 mm kernels to produce images with a range of deviations from non-normality that mimicked or exceeded the deviations from normality exhibited by the real VBM data, as measured with a Shapiro–Wilks test. Regression of these images against all 700 SNPs produced similar results to those described earlier, with conservative results at high and moderate cluster-forming thresholds with both 6 mm and 12 mm smoothing.
To test the effect of more complex patterns of non-stationarity, we segmented FWHM images derived from 6 mm and 12 mm smoothed ADNI images to produce a set of topologically complex masks corresponding to regions of high, medium and low ‘smoothness’. Non-stationary simulated images were then generated by filling each masked region with differently smoothed Gaussian noise, as described in the Section Imaging data
. Once again, a full analysis produced conservative results, with rejection rates below a nominal 5% for αc
= 0.001, 0.01 for both 6 mm and 12 mm smoothed images.
One final set of simulated images was produced by again generating complex, non-stationary FWHM-segmented masks, this time filled with non-normal, Gaussian noise-derived data, as described earlier. Rejection rates were again well-controlled, in marked contrast to results obtained using real ADNI data.
A reviewer raised the concern that poor performance might be attributable to the low (2.5%) threshold applied to the mean GM image to create an analysis mask. To address this we ran an additional set of tests using a mask based on a 20% GM threshold. This higher threshold will exclude voxels with the least amount of GM and those likely to have non-Gaussian errors, but also will change the topology of the search region, making it more convoluted. While we did find a slight improvement in test performance with the new mask on real ADNI data, our findings were left unchanged, in that only tests performed on 12 mm smoothed images with αc = 0.001 were well-controlled.