Voxel-Based Morphometry (VBM) has been used for several years to study differences in brain structure between populations. Recently, a longitudinal version of VBM has been used to show changes in gray matter associated with relatively short periods of training. In the present study we use fMRI and three different standard implementations of longitudinal VBM: SPM2, FSL, and SPM5 to assess functional and structural changes associated with a simple learning task. Behavioral and fMRI data clearly showed a significant learning effect. However, initially positive VBM results were found to be inconsistent across minor perturbations of the analysis technique and ultimately proved to be artifactual. When alignment biases were controlled for and recommended statistical procedures were used, no significant changes in grey matter density were found. This work, initially intended to show structural and functional changes with learning, rather demonstrates some of the potential pitfalls of existing longitudinal VBM methods and prescribes that these tools be applied and interpreted with extreme caution.
As early as 1960, researchers demonstrated that learning and experience could produce profound changes in gross measures of brain morphology in rats such as brain weight and cortical thickness (Krech et al., 1960; Rosenzweig et al., 1972; Klintsova and Greenough, 1999). Subsequently, it has been demonstrated that many aspects of brain structure and function can be modified by learning, including synaptic density, neural and glial cell size and ratio, vascularization, dendritic branching, fMRI activation, and neurotransmitter concentration (Black et al., 1990; van Praag et al., 2000; Poldrack, 2000; Floyer-Lea et al., 2006). In some cases, these gross changes in brain structure can be detected after as little as 10 days of training (Kleim et al., 2007).
Recently, an adaptation of voxel-based morphometry (VBM) has been introduced that attempts to non-invasively measure longitudinal changes in gray matter density using MRI (Ashburner and Friston, 2000; Draganski et al., 2004). This within-group approach is statistically more powerful than a between-group comparison and does not require the large number of subjects traditionally used in VBM studies. Using this approach, Draganski et al. (2004) reported a localized 3% increase in gray matter density in MT after 3 months of juggling practice. Draganski et al. (2006) reported both increases and decreases in gray matter density associated with 3 months of studying for a medical exam. Most recently, Ilg et al. (2008) reported gray matter and functional increases in right occipital cortex associated with 2 weeks of mirror-reading practice.
Many criticisms of VBM have been published since its introduction. Bookstein (2001), Davatzikos (2004) and Crum et al. (2003) are primarily concerned with the automated nonlinear registration technique used by VBM and the potential problems in establishing functional homology between groups. (See Ashburner and Friston (2001) for a rebuttal to some of these issues.) Although these critiques are focused on between group studies, they illustrate that relatively minor differences in brain anatomy or other initial conditions can have significant effects on final results. While a difference in brain anatomy is not relevant in a longitudinal design, sensitivity to initial conditions such as field inhomogeneities may prove problematic for both cross-sectional and longitudinal studies.
VBM analysis has recently been applied to diffusion tensor MRI (DT-MRI), where authors have also found reason to be skeptical of the reproducibility of the technique. Jones et al. (2005) note the large impact of different smoothing kernel sizes in the analysis of DT-MRI data. They also note that the lack of normality in the residual images hampers the ability to make statistical inferences using parametric methods. Most recently, Jones et al. (2007) demonstrated that ten different groups analyzing a common DT-MRI dataset using voxel-based methods drew widely disparate conclusions. This state of controversy limits the conclusions one can draw from VBM-based results in isolation, so it is critical to demonstrate reproducibility and convergence with other methods in order to establish the validity of VBM-based results.
In this work we use longitudinal VBM in combination with fMRI to explore functional and structural changes associated with learning a simple visual–motor task. In contrast to other longitudinal VBM studies, subjects participated in both the control and the learning phase of the experiment thus controlling for false positives due to group differences. Since each subject serves as his or her own control, this within-group approach provides an opportunity to explore the robustness and sensitivity of longitudinal VBM that is independent from the problems inherent in establishing homology between groups. We also explore the consistency of longitudinal VBM results by comparing its implementation in three different software packages: SPM2, FSL, and SPM5. We hypothesized that the regions which show learning related changes in functional activation should also demonstrate changes (either increases or decreases) in gray matter density. We also hypothesized that these results should be consistent across implementations of VBM.
Twelve healthy, right-hand dominant subjects were included in this study (mean age: 32.5 years; range: 23–40; 6 men, 6 women). All gave informed consent according to a protocol approved by the NIH IRB. Five other subjects were excluded due to poor quality scans or missing scan sessions. All of the subjects analyzed participated in both the control and learning phases of the experiment.
The experimental paradigm is illustrated in Fig. 1. Subjects initially underwent a baseline structural MRI scan (scan 1). After a two-week control period, the subjects underwent a second structural MRI (scan 2) as well as four fMRI scans during which they alternately performed the mirror task and the control task (Fig. 1). The control task required the subject to follow a randomly moving white dot on the screen using a joystick held in the right hand. The mirror task was identical to the control task except that the left–right axis of the joystick was reversed. After scan 2, subjects were trained on the mirror task for a total of 2.5 h over 2 weeks (six 25-minute training sessions). At the end of the training, subjects received both structural and functional MRI scans identical to those in scan 2. Eight subjects performed the experiment in a continuous four-week block. Four subjects had an interval of 2 to 12 weeks (mean 44.5 days) between the control and learning phases of the experiment. An extra scanning session was conducted on these subjects at the beginning of the two-week learning phase to serve as a baseline for the learning comparison. Thus, all comparisons on all subjects were over a two-week period.
All scans were collected on a 3 T General Electric (GE, HDx 14M3) scanner using a GE eight-channel head coil. Structural scans consisted of two FSPGR scans (256 × 256, 124 slices, 0.85 × 0.85 × 1.2 mm voxels, TI = 400 ms, TE ≈ 5 ms). Functional scans consisted of four axial EPI time series (64 × 64, 38 slices, 3.2 × 3.2 × 3.2 mm voxels, TR = 2.5 s, TE = 30 ms). Each of the four 6-minute scans employed a block design of alternating 30-second periods of performing one of the two tasks with 30 seconds of fixation. Order of task presentation was counterbalanced across subjects.
The first level of fMRI analysis combined like sessions and produced parameter estimates for each subject's activation compared to baseline in each task. This was carried out using FEAT (FMRI Expert Analysis Tool) Version 5.63, part of FSL (FMRIB's Software Library, www.fmrib.ox.ac.uk/fsl). A second, higher-level analysis used a paired T-test to compare each subject's activation before and after training. This was carried out using FLAME (FMRIB's Local Analysis of Mixed Effects) stage 1 only (Beckmann et al., 2003; Woolrich et al., 2004). Z-statistic images were thresholded using clusters determined by Z>3.5 and a (corrected) cluster significance threshold of p = 0.05 (Worsley et al., 1992).
For each scanning session the two FSPGR images were rigidly aligned using a motion-correction algorithm from either SPM2, FSL, or SPM5 and averaged together. In order to correct for image inhomogeneities, intensity bias correction was performed on the average images using four iterations of the N3 algorithm (MINC Tools, Sled et al., 1998).
The SPM2 analysis was performed in Matlab 7.4 using the VBM2 toolbox (v1.07, http://dbm.neuro.uni-jena.de/vbm/) which implements an optimized VBM pipeline (Ashburner and Friston 2000; Good et al., 2001). The first scan of each subject was used as the “baseline” to create a custom gray matter template. All structural scans for a given subject were fed through the longitudinal analysis pipeline of the VBM2 toolbox such that the first scan was used as the source for spatial normalization (piecewise linear). The analysis was also repeated using the scan immediately before training as the baseline for the template and the source for spatial normalization. A third iteration was performed using the subject's mean image aligned to the halfway point between scans 1 and 2 as the baseline. The halfway point was determined by taking the square root of the alignment transformation (rigid, 6 DOF) between scans 1 and 2. This manipulation has been previously utilized in other structural analysis packages such as SIENA (Smith et al., 2002).
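The halfway-point construction described above can be illustrated with a short sketch. It assumes the scan-1-to-scan-2 alignment is available as a 4 × 4 homogeneous rigid transform T and that the head rotation is well below 180 degrees; the function names are hypothetical and this is not the code used in the study. The square root H (with H·H = T) is obtained by halving the rotation angle and solving for the matching translation.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula: 3x3 rotation about a unit axis by `angle` radians."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    v = 1.0 - c
    return [
        [c + x * x * v,     x * y * v - z * s, x * z * v + y * s],
        [y * x * v + z * s, c + y * y * v,     y * z * v - x * s],
        [z * x * v - y * s, z * y * v + x * s, c + z * z * v],
    ]

def det3(a):
    """Determinant of a 3x3 matrix."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def solve3(m, b):
    """Solve the 3x3 linear system m x = b by Cramer's rule."""
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [[b[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution

def halfway_rigid(T):
    """Return H with H @ H == T for a 4x4 rigid transform T.

    Assumes a pure rotation plus translation (6 DOF) and a rotation
    angle well below 180 degrees, as expected for head repositioning.
    """
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    cos_theta = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        # No measurable rotation: the half rotation is the identity.
        R_half = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    else:
        k = 2.0 * math.sin(theta)
        axis = [(R[2][1] - R[1][2]) / k,
                (R[0][2] - R[2][0]) / k,
                (R[1][0] - R[0][1]) / k]
        R_half = rotation_from_axis_angle(axis, theta / 2.0)
    # H @ H has translation R_half t_half + t_half, so solve (R_half + I) t_half = t.
    A = [[R_half[i][j] + (1.0 if i == j else 0.0) for j in range(3)]
         for i in range(3)]
    t_half = solve3(A, t)
    return [R_half[i] + [t_half[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
```

Under this construction, scan 1 would be resampled with H and scan 2 with the inverse of H, so that both images are interpolated by a comparable amount rather than only one of them.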
Note that for the longitudinal stream the VBM2 toolbox does not “modulate” images (i.e. multiply by the Jacobian determinant, Good et al., 2001). Default options were used for hidden Markov random field (HMRF) weighting, bias correction, cutoff spatial normalization, nonlinear regularization, and number of nonlinear iterations. Normalized gray matter images were smoothed using default settings: an 8 mm full width at half maximum (FWHM) Gaussian kernel. An absolute intensity threshold mask of 0.2 was used to remove regions with minimal grey matter intensity. Pseudo paired T-tests were used to compare scan sessions one and two (the control period) and two and three (the training period). Cluster-based inference in VBM is complicated by the lack of stationarity or uniform smoothness (Worsley et al., 1999). Hayasaka et al. (2004) prescribe either non-stationarity correction or permutation-based methods (depending on degrees of freedom) in order to draw statistically valid inferences on VBM data. Both methods are employed here using either the nS toolbox from Hayasaka et al. (2004) or the SnPM3 toolbox from Nichols and Holmes (2002). For all analyses an uncorrected height threshold of p<0.001 was applied, together with an extent threshold of p = 0.05, corrected across space. A 1 mm FWHM variance smoothing kernel was used for non-parametric analyses. See Supplemental Fig. 1 for a flowchart of steps.
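To make the permutation-based option concrete, the following minimal sketch shows the core idea for a paired design: under the null hypothesis the sign of each subject's difference is exchangeable, so the observed statistic is compared against all sign-flipped versions. It handles a single measurement per subject and does not implement the cluster-extent inference or variance smoothing performed by SnPM or Randomise; the function names are hypothetical.

```python
import itertools
import math

def paired_t(diffs):
    """One-sample t statistic of the paired differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        # Degenerate case: all differences identical.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

def sign_flip_p_value(before, after):
    """Exact two-sided p-value from all 2**n sign flips of the differences.

    With 12 subjects this enumerates 4096 relabelings, which is cheap.
    """
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(paired_t(diffs))
    as_extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance so that ties with the observed value are counted.
        if abs(paired_t(flipped)) >= observed - 1e-12:
            as_extreme += 1
        total += 1
    return as_extreme / total
```

Because the null distribution is built from the data themselves, this approach makes no assumption of stationarity or normality, which is the reason it is preferred when degrees of freedom are small.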
FSL-VBM (http://www.fmrib.ox.ac.uk/fsl/fslvbm/, version 1.0) has not been specifically designed for longitudinal analysis but can be employed for this purpose with a simple modification to the processing stream. The pre-processed structural scans for all scanning sessions for a given subject were first rigidly aligned to the first scan using FLIRT and a mean image was created. The analysis was run a second time using the halfway point between scans 1 and 2 as the baseline (see above). In both cases, the default settings for FLIRT's 3D rigid body alignment were used (6 DOF) which includes a trilinear interpolation algorithm. The average images were brain-extracted using BET (Smith 2002). Next, tissue-type segmentation was carried out on the subject mean image using FAST (Zhang et al., 2001). The resulting gray-matter partial volume images were then aligned to MNI152 standard space using the affine registration of the IRTK (Rueckert et al., 1999, www.doc.ic.ac.uk/~dr/software). The resulting images were averaged to create a study-specific template. The brain-extraction and segmentation steps were then repeated on the rigidly aligned structural scans from each scanning session. The segmented native gray matter images were then non-linearly registered to the template using the transformations calculated from the averaged images. The segmented images were then smoothed with an isotropic Gaussian kernel with a sigma of 3.5 mm (analogous to an 8 mm FWHM). Paired T-tests were used to compare scan sessions one and two (the control period) and two and three (the training period). Permutation-based, non-parametric testing was used (Randomise, http://www.fmrib.ox.ac.uk/fsl/randomise/, version 2.0 and 2.1, see Results), with a height threshold of p<0.001 and testing clusters for significance at p<0.05, corrected for multiple comparisons across space. 
A 0.35 intensity threshold mask was chosen to create a mask similar in size and shape to the SPM2 mask and a 1 mm FWHM variance smoothing kernel was used. See Supplemental Fig. 2 for a flowchart of steps.
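The correspondence between the Gaussian sigma used by FSL and the FWHM used by SPM follows from the standard relation FWHM = 2·sqrt(2·ln 2)·sigma ≈ 2.355·sigma. A small sketch of the conversion (helper names are hypothetical) shows that a 3.5 mm sigma corresponds to roughly 8.2 mm FWHM, close to but not exactly the 8 mm kernel used in the SPM analyses.

```python
import math

# Ratio between the FWHM and the standard deviation of a Gaussian kernel.
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))

def sigma_to_fwhm(sigma_mm):
    """Convert a Gaussian sigma (mm) to full width at half maximum (mm)."""
    return sigma_mm * FWHM_PER_SIGMA

def fwhm_to_sigma(fwhm_mm):
    """Convert a full width at half maximum (mm) to a Gaussian sigma (mm)."""
    return fwhm_mm / FWHM_PER_SIGMA

print(round(sigma_to_fwhm(3.5), 2))  # FWHM for the FSL kernel, in mm
print(round(fwhm_to_sigma(8.0), 2))  # sigma matching an 8 mm FWHM, in mm
```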
The SPM5 analysis was performed using the VBM5 toolbox (v1.15, http://dbm.neuro.uni-jena.de/vbm/). The unified segmentation algorithm in SPM5 warps template images into the space of the image to be segmented so no custom template is needed. As in the SPM2 and FSL analyses, spatial normalization was estimated using either the baseline scan or a subject mean image aligned to the halfway point between scans 1 and 2. This nonlinear spatial normalization was then applied to the segmented grey matter images. Defaults were used for all toolbox options. SnPM5 was used for permutation-based non-parametric testing forming clusters at p<0.001 and testing clusters for significance at p<0.05, corrected for multiple comparisons across space. An absolute intensity threshold mask of 0.2 and a 1 mm FWHM variance smoothing kernel were used. All results reported are corrected on the cluster level. See Supplemental Fig. 3 for a flowchart of steps.
Performance on the tracking task was measured by the average distance of the joystick cursor from the randomly moving dot during the six-minute scans. For the mirror-tracking task, a paired T-test comparing average distance before and after training showed a significant decrease (p<0.0001). For the normal tracking task there was no significant difference in cursor distance before and after training (p = 0.52, Fig. 2).
Comparing functional scans associated with performance of the mirror-tracking task before and after training showed several clusters of increased and decreased activation (Fig. 3 and Table 1). The areas showing decreased activation included the middle frontal gyrus and a large expanse of parietal cortex extending from precuneus to lateral parietal cortex. These regions have been implicated in visual tracking tasks and mental rotation, respectively (Luna et al., 1998; Cohen et al., 1996). Increased activity was seen in the medial frontal cortex and cingulate cortex. These areas fall within the “default” or “resting state” network. Regions in this network commonly show decreased activation during difficult tasks and a relative increase in activation or no change during rest or simple tasks (Supplementary Fig. 4; Raichle et al., 2001). This is consistent with our results as these regions showed greater decreases in activation relative to baseline before training than after.
In the first run of the SPM2 analysis in which we used the baseline scan as the source for spatial normalization, we found no significant increases or decreases in gray matter density during the control period. During the learning period, one cluster of increased gray matter density was found on the ventro-medial edge of primary visual cortex in the left hemisphere (Fig. 4A). Another small cluster of decreased grey matter was found in the right pre-central gyrus. In the second run of the SPM2 analysis in which we used the pre-training scan as the source for spatial normalization, a small cluster of increased grey matter density was found in the right medial frontal gyrus during the control period (Fig. 4B). During the learning period, one cluster of decreased grey matter density was found in the left cerebellum.
The variability of our results depending on which scan was used as the source of spatial normalization and the presence of a significant cluster in the control period led us to re-examine the analysis pipeline used here and in other published longitudinal VBM studies (see Discussion). We hypothesized that using any one of the scans as the target of the rigid alignment and the source of the spatial normalization could bias the segmentation results of that scan. We attempted to address this potential confound by aligning to the halfway point between scans 1 and 2 by taking the square root of the alignment transformation matrix. Using this method no significant clusters of grey matter change were found during the control or training period (Fig. 4C).
We also noted that the non-stationarity correction method of statistical inference has been demonstrated to be anticonservative for degrees of freedom less than 30 as is the case here (Hayasaka et al., 2004). Hayasaka et al. recommend the use of permutation-based methods for analyses with relatively small degrees of freedom. Therefore, we re-ran the analysis using SnPM3 for statistical inference. Using this method resulted in no significant clusters of grey matter change regardless of which alignment technique was used.
A very similar pattern of results was found in the FSL-VBM analysis. We began by using the initial scan as the target of the rigid alignment and the source of the spatial normalization. In this analysis we found three clusters of decreased gray matter density and two clusters of increased gray matter density during the control period (Fig. 5A). During the training period, five clusters of decreased gray matter were found. Re-running the analysis using the halfway alignment technique resulted in no significant clusters of grey matter change during the control period and, during the training period, a single small cluster of decreased grey matter density on the inferior edge of right temporal cortex (Fig. 5B).
Although all FSL-VBM analyses are conducted using permutation-based methods for statistical inference, FSL's “Randomise” program was recently updated to change the method of dealing with confounds in permutation. The method previously used was demonstrated to be very anticonservative (Nichols et al., 2008). Using this updated version of Randomise (2.1) no significant clusters were found in either condition (Fig. 5C).
In the SPM5 analysis no significant clusters of grey matter change were found in either condition regardless of whether scans were aligned to the initial scan or to a halfway point.
In this study we have explored longitudinal structural changes as measured by VBM and some of the potential pitfalls in the analyses of these data. Our approach to this question is unique in that we have used the same pool of subjects for both the control and learning phase of the experiment. Contrary to previously published studies, we found no statistically significant grey matter changes associated with 2 weeks of training on a visuo-motor task even though significant changes were found in fMRI activation and behavioral performance. We have suggested modifications to the structural analysis stream used in previously published longitudinal VBM studies. We have also carried out the longitudinal analyses using three different software packages to evaluate the consistency of the structural results.
Initial analyses in both SPM2 and FSL revealed clusters in both learning and control conditions that were ultimately determined to be artifactual. These clusters were determined to be due to two factors. First, aligning all scan sessions to the initial baseline scan introduces a difference in interpolation that biases the comparison: scans that were interpolated are slightly smoothed before segmentation, while the baseline scan is not. This leads to artificial differences in apparent grey matter density. When scans were aligned to a halfway point between the first two scans, no significant clusters were found. Second, the non-stationarity correction method of statistical inference used in SPM2 is anticonservative for analyses in which the degrees of freedom are less than thirty (Hayasaka et al., 2004). Performing the analysis using permutation-based methods in SPM2 resulted in no significant clusters of grey matter change. Similarly, version 2.0 of FSL's permutation tool, Randomise, used an anticonservative method of handling confounds (Nichols et al., 2008). The corrected method used in Randomise 2.1 resulted in no significant clusters of grey matter change.
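The smoothing effect of interpolation can be illustrated in one dimension. The sketch below is a simplified analogue of the trilinear interpolation applied during rigid alignment, not the study's pipeline: resampling a sharp tissue boundary at a half-voxel offset turns the edge into an intermediate value, and, assuming independent noise in neighboring voxels, the noise variance is scaled by (1 − f)² + f² for a fractional shift f. Function names are hypothetical.

```python
def resample_linear(signal, shift):
    """Sample `signal` at positions i + shift (0 <= shift < 1) by linear interpolation."""
    return [(1.0 - shift) * signal[i] + shift * signal[i + 1]
            for i in range(len(signal) - 1)]

def noise_variance_factor(shift):
    """Variance scaling of independent voxel noise after linear interpolation."""
    return (1.0 - shift) ** 2 + shift ** 2

# A sharp boundary between two tissue classes (0 and 1).
edge = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

print(resample_linear(edge, 0.0))   # no shift: the edge is unchanged
print(resample_linear(edge, 0.5))   # half-voxel shift: the edge is blurred
print(noise_variance_factor(0.5))   # noise variance is reduced the most at f = 0.5
```

A scan that is left on its original grid (shift of zero) keeps its sharp edges and full noise level, whereas a resampled scan does not, which is one way to see why comparing the two can produce spurious local differences after segmentation.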
Analysis of the change in BOLD signal with training revealed several regions of decreased activity with learning including inferior parietal, precuneus, middle cingulate cortex, and bilateral middle frontal gyrus as well as several regions of increased activity including the medial frontal cortex. The large cluster spanning parietal and precuneus has been implicated in tasks involving spatial transformations while the middle cingulate cortex has been associated with response inhibition (Cohen et al., 1996; Ridderinkhof et al., 2004). Training on the mirror-tracking task enhances both of these skills. The changes in the activation of the middle frontal cortical areas are possibly related to the motor transformations required for inverting the movement of the joystick. Although our results on fMRI changes related to learning are statistically strong, they are challenging to interpret due to potential changes in the subjects' strategy or changes in the relative difficulty of the task. Most fMRI studies of learning suffer from similar difficulties (Poldrack, 2000).
The central goal of this study was to determine if functional and structural measures of plasticity overlapped. We also sought to determine whether structural measures of plasticity were consistent across VBM implementations. Ultimately no significant structural changes were found; however, it is still important to note that there are significant differences among currently available longitudinal VBM implementations, which are illustrated in the processing flow charts in Supplementary Figs. 1–3. Note that SPM2 performs the segmentation step after the brain has been spatially normalized, whereas SPM5 and FSL perform segmentation in the subject's native space. However, we believe the most important source of variance is the segmentation itself. Fig. 6 shows the segmented gray matter images and cumulative image histograms from a single subject using each of the three packages. The SPM2 segmented image differs from the others in that a much higher proportion of voxels are classified as 100% gray matter density. The FSL segmentation classifies voxels more continuously between 0 and 100% grey matter. In the SPM5 segmentation, the distribution of grey matter density appears to lie somewhere between the sharp distinction of SPM2 and the more gradual curve of FSL. Note that SPM5 uses the “grand unified segmentation” algorithm that does not rely on study-specific templates (Ashburner and Friston, 2005).
Our results also demonstrated that interpolation related to the rigid alignment step had significant effects on the final results. Interestingly, the histograms in Fig. 6 change very little when they are generated using volumes that were rigidly aligned and interpolated. This illustrates the point that the interpolation does not globally bias the grey matter density in one direction or the other. Rather, at different locations within the volume, artifactual focal changes in grey matter may be introduced in an unpredictable way. The changes balance each other out when averaged over the whole volume, but when an interpolated volume is compared against one that is not interpolated, false positives may be detected.
Currently, all published longitudinal VBM studies of which we are aware have used the SPM2 pipeline (Draganski et al., 2004, 2006; May et al., 2007; Boyke et al., 2008; Ilg et al., 2008; Driemeyer et al., 2008). In the analyses we performed, we made several small but significant changes to the processing stream of these studies. First, rather than aligning each scan to the initial scan, we align to a halfway point between the two scans being compared. Note that this point is significant regardless of whether subjects are used as their own controls or a separate group of controls is used. Second, we used non-parametric methods for all statistical testing and multiple comparison correction, as the non-stationarity correction method has been demonstrated to be anticonservative for analyses with relatively small degrees of freedom (Hayasaka et al., 2004). We have demonstrated that these changes have a significant effect on the ultimate results across different software packages. This provides an important demonstration of the point made by Ridgway et al. (2008) that very detailed explanations of VBM methods are required for experiments to be replicable.
It should be emphasized that the animal literature leaves little doubt that it is possible for the mammalian brain to undergo large-scale changes in its structure (cortical thickness, synaptic and capillary density, etc.) on a time scale of days to weeks (Klintsova and Greenough, 1999). Most of these studies have used histological techniques to measure these changes, though some imaging work has been conducted measuring changes in cortical vasculature in rats in response to exercise (Pereira et al., 2007; Swain et al., 2003). Thus, despite the methodological concerns we have raised here, previously published studies may have provided more favorable conditions for detecting grey matter changes. Several of these studies were conducted on larger groups of subjects and employed longer periods of training that may have resulted in more dramatic grey matter changes. Nonetheless, the contrary findings reported here and the demonstrated susceptibility of longitudinal VBM to false positives warrant a more careful examination of these methods. The development of a standard and robust method of investigating within-subject structural brain change remains an important challenge.
In an effort to help produce such a standard, all of the raw data used in this experiment will be made available on the SUMS database at Washington University (Dickson et al., 2001; http://sumsdb.wustl.edu/sums/directory.do?id=6694686). We encourage other researchers to download these data and reanalyze them with novel methods. It is possible that multivariate analysis techniques may prove more sensitive than traditional univariate analysis. Some of these methods have already been employed on cross-sectional VBM data (Kloppel et al., 2008; Kawasaki et al., 2007). Future studies might employ focused, higher resolution scanning or more targeted pulse sequences to obtain a more detailed picture of underlying changes (Swain et al., 2003; van der Kouwe et al., 2008). We are confident that these techniques in combination with rigorous statistical controls will one day make in vivo measurement of human brain structure possible, opening up an entirely new technique in the study of learning and memory.
This work was supported by the NIMH Intramural Research Program. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). The authors would like to acknowledge comments and assistance from Rasmus Birn, Gang Chen, Dan Handwerker, David McMahon, Kevin Murphy, Allison Nugent, and Regina Nuzzo.