In this study, we proposed a bootstrap resampling technique to assess the reliability of the hypometabolic brain regions detected in patients with pAD or aMCI in comparison to NC. We believe that this approach has the potential to reliably detect either a statistical ROI or a search region in cross-sectional studies, helping in the early detection of AD (e.g., the differential diagnosis of AD, the prediction of subsequent rates of cognitive decline, and the enrichment of clinical trials for those individuals most likely to have AD or show subsequent cognitive decline). It also has the potential to detect either a statistical ROI or a search region in longitudinal studies, helping to track AD and evaluate AD-modifying treatments with optimal statistical power and freedom from the type-I error associated with multiple comparisons (Chen et al., 2010).
Researchers have proposed using brain imaging techniques, such as FDG-PET and volumetric MRI, to evaluate AD-modifying treatments with better statistical power than clinical endpoints. Unlike the region of interest (ROI)-based method, the widely used voxel-based approach fully utilizes the richness of image-wise information. However, the voxel-based approach suffers from an inflated type-I error when many tests are performed simultaneously (i.e., multiple comparisons). We have recently shown how empirically derived statistical ROIs of CMRgl decline can be used to help optimize statistical power to evaluate AD-modifying treatments free of multiple comparisons (Chen et al., 2010). The method (using familywise error in the previous publication or bootstrap resampling [unpublished data]) permits us to capitalize on the entire data set without relying on preconceived ROIs. The bootstrap resampling technique allows us to assess the reliability of the observed hypometabolism and thus define a collection of reliable voxels as the functional ROI. Although we used cross-sectional data in the current study to introduce this new approach and compared it with the FWE approach, all the proposed procedures can be applied to longitudinal data to define a functional ROI. Future work will focus on using the defined and validated functional ROI in power estimation. Moreover, we plan to define the functional ROI with 100% reliability via the bootstrap approach in one dataset and then perform the power analysis using an independent dataset. We believe our planned analyses fully utilize the richness of the neuroimaging data while remaining free of multiple comparison concerns.
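The idea of a bootstrap-defined functional ROI can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the Welch t statistic, the data layout (one list of voxel values per subject), and the caller-supplied threshold `t_crit` are all assumptions made for illustration. Each iteration resamples subjects with replacement within each group, and a voxel's reliability is the fraction of iterations in which it appears hypometabolic.

```python
import random
from statistics import mean, variance

def t_stat(a, b):
    """Two-sample (Welch) t statistic for mean(a) - mean(b)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def bootstrap_reliability(patients, controls, t_crit, n_boot=200, seed=0):
    """Per-voxel fraction of bootstrap iterations showing hypometabolism.

    patients, controls: lists of subjects, each a list of voxel values.
    t_crit: positive t threshold matching the chosen uncorrected p-value
            (supplied by the caller; an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_vox = len(patients[0])
    hits = [0] * n_vox
    for _ in range(n_boot):
        # Resample subjects with replacement within each group.
        p_s = rng.choices(patients, k=len(patients))
        c_s = rng.choices(controls, k=len(controls))
        for v in range(n_vox):
            t = t_stat([s[v] for s in p_s], [s[v] for s in c_s])
            if t < -t_crit:  # patients lower than controls
                hits[v] += 1
    return [h / n_boot for h in hits]

def functional_roi(reliability, min_reliability=1.0):
    """Indices of voxels whose reliability reaches the cutoff."""
    return [v for v, r in enumerate(reliability) if r >= min_reliability]
```

With `min_reliability=1.0`, `functional_roi` keeps only voxels detected in every iteration, mirroring the 100% reliability criterion mentioned above. A real analysis would operate on smoothed, spatially normalized image volumes rather than plain lists.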
In applying this method to examine longitudinal CMRgl decline, one should also be aware of the effects of using global counts as the reference for proportional scaling. Our own previous study showed a global CMRgl decline in AD patients (Alexander et al., 2002). The regional CMRgl decline relative to the global measurement could therefore be artificially reduced, because the global measurement itself decreases in the follow-up scans. In this study, we followed the conventional use of global counts, as in a number of previous studies, aiming primarily to introduce and cross-validate the bootstrap resampling-based reliability index of AD-related hypometabolism for cross-sectional studies. In a recent separate but related study by our group (Chen et al., 2010), we examined the effects of using different reference regions on the sensitivity and statistical power to characterize the 12-month CMRgl decline. The reference regions examined in that study included the global counts, the sensorimotor, thalamic, pontine, and cerebellar regions, and the brain regions relatively spared over the 12-month period. Results from that study showed that the use of the spared region gave the highest sensitivity in detecting the longitudinal declines. We note that additional studies are needed to examine the combined effects of the choice of reference region, the degree of additional smoothing applied to the images, the thresholds used to define the decline and spared regions, and other settings, in conjunction with the generalization of the proposed bootstrap-based reliability index to longitudinal data.
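The attenuation caused by a declining reference can be made concrete with a small arithmetic sketch. The helper `apparent_decline` and the percentages below are hypothetical, chosen only to illustrate the direction of the effect. They are not values from this study or from the cited work.

```python
def apparent_decline(regional_decline, reference_decline):
    """Fractional decline of the region/reference ratio between two scans.

    Both arguments are fractional declines in absolute units (0.10 = 10%).
    """
    return 1.0 - (1.0 - regional_decline) / (1.0 - reference_decline)

# Hypothetical values: a region declines by 10% while the global
# measurement used for proportional scaling declines by 4%.
with_declining_reference = apparent_decline(0.10, 0.04)   # about 0.0625
with_stable_reference = apparent_decline(0.10, 0.00)      # about 0.10

print(f"{with_declining_reference:.4f} vs {with_stable_reference:.4f}")
```

Under these assumed numbers, the scaled decline is about 6% rather than 10%, which is why a reference region that is itself relatively spared can be expected to give higher sensitivity.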
In the statistical inference stage of neuroimaging data analysis, multiple comparison correction is conducted over a specified brain volume (by default, the search volume is the whole brain). A volume is chosen for multiple comparison correction on the basis of previous findings or expert knowledge about its involvement in the biological process of interest (such as hypometabolism in patients with AD or activation in response to a stimulus). Each location or voxel in the volume is treated equally in this multiple comparison correction protocol. We are interested in incorporating the differential involvement of each location into the correction. For the AD study, for example, the differential involvement can take the form of the assessed degree of reliability of the hypometabolic voxel. Voxels with lower reliability should be given less weight in the correction than those with higher reliability. The present study was conducted with the goal of establishing a technique to estimate the variation in reliability. In future studies, we will address the issue of how to integrate the estimated reliability into the multiple comparison correction.
It is important not to confuse the chance that a voxel is repeatedly observed as hypometabolic with the FWE. For instance, when bootstrap resampling is used with a given uncorrected p-value such as 0.005, the corresponding FWE is high. In our analysis (given the brain mask and the smoothness applied), P_FWE = 0.896 for uncorrected p = 0.005. One might argue that, given such a high false positive rate (0.896), the chance of seeing the same group difference in three independent runs is as high as 0.7193 (= 0.896^3). This reasoning, however, is inadequate: the random-field-based FWE is the probability that there exists at least one brain location/voxel at which the magnitude exceeds the threshold corresponding to the given probability, and it does not address repeated observations at the same voxel locations over multiple analyses. When a location is observed as significant in the first analysis, the probability of observing significance at the same location again in an independent dataset, under the null hypothesis, is the uncorrected p-value, and there is no need to correct for multiple comparisons in this approach.
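The distinction can be checked with a few lines of arithmetic. The sketch below uses only the two probabilities quoted in the text. The variable names are illustrative, and independence of the runs under the null hypothesis is assumed.

```python
p_fwe = 0.896   # familywise error quoted in the text for uncorrected p = 0.005
p_unc = 0.005   # uncorrected voxel-wise p-value
n_runs = 3      # independent analyses

# Chance that every run shows at least one suprathreshold voxel
# somewhere in the search volume (not necessarily the same voxel).
p_somewhere_in_all_runs = p_fwe ** n_runs

# Chance that a voxel already seen in the first run is suprathreshold
# again in both remaining runs.
p_same_voxel_repeats = p_unc ** (n_runs - 1)

print(f"{p_somewhere_in_all_runs:.4f}")   # 0.7193
print(f"{p_same_voxel_repeats:.1e}")      # 2.5e-05
```

The first quantity is large because it concerns any voxel anywhere in the volume. The second is several orders of magnitude smaller because it concerns one fixed location, which is the event the reliability index tracks.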
Please note that the reliability in the present study was estimated using repeated resampling with replacement from a single dataset. Because its exact theoretical statistical nature is yet to be fully understood, we caution against equating it with the conventional notion of reliability. However, our bootstrap results were quite similar to those from each of the three non-overlapping data subsets. Moreover, we related the reliability index to the well-understood type-I error (FWE or uncorrected) and found a positive and significant association between the two.
The computational expense of the bootstrap resampling procedure is clearly higher than that of generating the statistical parametric maps. However, with an efficient batching procedure in place, it is not impractical even on a modern personal computer. Both approaches can be used to define statistical regions of interest for examining cross-sectional group differences and for investigating the longitudinal declines in each subject group and the differences in decline among groups. Although the results of the two approaches could be similar, results from the bootstrap approach could be easier to interpret (see more details below) and could potentially have higher power, as our ongoing internal investigations suggest and as might be expected when the normality assumption of the parametric approach is violated. In this regard, we note that the p-value threshold used in the bootstrap procedure was not intended for statistical inference.
In place of the type-I error concept, we proposed the notion of reliability for potential power estimation and for a more precise multiple comparison correction that takes the degree of confidence in prior knowledge into consideration. We believe this tactic is straightforward and intuitive, and that it can potentially be used for power estimation and for correction of multiple comparisons. One might consider using 1-ρ (where ρ is the FWE) as a reliability index, given that the current study relates the FWE to the bootstrap-defined reliability. However, the FWE is defined under the null hypothesis of no group difference. Thus, the direct interpretation of 1-ρ is the probability that no voxel/location in the examined volume exceeds the threshold determined by ρ. Under this strict, technical definition, relating 1-ρ to the reliability of an observation requires abstract and difficult reasoning, which motivated our current effort.
*Research Highlights
- Measure hypometabolism reliability for FDG-PET using bootstrap resampling
- Discuss this reliability in relation to the consistency of hypometabolism across multiple datasets
- Characterize the relation of this reliability to the parametric type-I error
- Propose its use for longitudinal studies and for multiple comparison correction