Many of our visual experiences are dynamic: perception, visual imagery, dreaming and hallucinations all change continuously over time, and these changes are often the most compelling and important aspects of these experiences. Obtaining a quantitative understanding of the brain activity underlying these dynamic processes would advance our understanding of visual function. Quantitative models of dynamic mental events could also have important applications as tools for psychiatric diagnosis and as the foundation of brain-machine interface devices [3].
Modeling dynamic brain activity is a difficult technical problem. The best tool currently available for non-invasive measurement of brain activity is functional MRI, which has relatively high spatial resolution [12]. However, the blood oxygen level-dependent (BOLD) signals measured using fMRI are relatively slow [9], especially when compared to the speed of natural vision and of many other mental processes. It has therefore been assumed that fMRI data would not be useful for modeling brain activity evoked during natural vision or by other dynamic mental processes.
Here we present a new motion-energy [10] encoding model that largely overcomes this limitation. The model separately describes the neural mechanisms that mediate visual motion information and their coupling to the much slower hemodynamic mechanisms. In this report we first validate the encoding model by showing that it describes how spatial and temporal information are represented in voxels throughout visual cortex. We then use a Bayesian approach [8] to combine the estimated encoding models with a sampled natural movie prior, in order to produce reconstructions of natural movies from BOLD signals.
We recorded BOLD signals from three human subjects while they viewed a series of color natural movies (20 × 20 degrees at 15 Hz). A fixation task was used to control eye position. Two separate data sets were obtained from each subject. The training data set consisted of BOLD signals evoked by 7,200 seconds of color natural movies, in which each movie was presented just once. These data were used to fit a separate encoding model for each voxel located in posterior and ventral occipito-temporal visual cortex. The test data set consisted of BOLD signals evoked by 540 seconds of color natural movies, in which each movie was repeated ten times. These data were used to assess the accuracy of the encoding model and as the targets for movie reconstruction. Because the movies used to train and test the models were different, this approach provides a fair and objective evaluation of the accuracy of the encoding and decoding models [2].
BOLD signals recorded from each voxel were fit separately using a two-stage process. Natural movie stimuli were first filtered by a bank of neurally inspired nonlinear units sensitive to local motion energy [10]. Then L1-regularized linear regression [15] was used to fit a separate hemodynamic coupling term to each nonlinear filter (see Supplemental Information). The regularized regression approach used here was optimized to obtain good estimates even for computational models containing thousands of regressors. In this respect our approach differs from the regression procedures used in many other fMRI studies [17].
Schematic diagram of the motion-energy encoding model
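To make this two-stage structure concrete, here is a minimal sketch in Python (NumPy, SciPy, scikit-learn). Everything in it, including the filter-bank parameters, array sizes, hemodynamic delays and Lasso penalty, is an illustrative assumption rather than the study's actual implementation:

```python
import numpy as np
from scipy.signal import fftconvolve
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Illustrative sizes only; the actual stimuli were 20 x 20 degree movies at 15 Hz.
n_frames, h, w = 300, 16, 16
movie = rng.random((n_frames, h, w))            # stand-in for a natural movie

def motion_energy(movie, fx, fy, ft, size=9, dur=7):
    """Energy output of one quadrature pair of space-time Gabor filters."""
    t = (np.arange(dur) - dur // 2)[:, None, None]
    y, x = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    env = np.exp(-(x**2 + y**2) / (2 * (size / 4) ** 2))
    phase = 2 * np.pi * (fx * x + fy * y + ft * t)
    even = fftconvolve(movie, env * np.cos(phase), mode="same")
    odd = fftconvolve(movie, env * np.sin(phase), mode="same")
    return even**2 + odd**2                     # phase-invariant motion energy

# Stage 1: nonlinear motion-energy features. The real bank tiles many positions,
# orientations and spatiotemporal frequencies; three filters suffice here.
bank = [(0.1, 0.0, 0.0), (0.1, 0.0, 0.1), (0.1, 0.0, -0.1)]
features = np.stack(
    [motion_energy(movie, *p).mean(axis=(1, 2)) for p in bank], axis=1)
features = np.log1p(features)                   # compressive nonlinearity

# Stage 2: L1-regularized regression couples each delayed nonlinear feature to
# one voxel's BOLD signal; np.roll's wrap-around at the edges is ignored here.
delays = [2, 3, 4, 5]                           # assumed delays, in samples
X = np.hstack([np.roll(features, d, axis=0) for d in delays])
bold = rng.standard_normal(n_frames)            # stand-in for one voxel's BOLD
fit = Lasso(alpha=0.01).fit(X, bold)
print("nonzero hemodynamic coupling weights:", np.count_nonzero(fit.coef_))
```

The design point the sketch preserves is the separation of concerns: the nonlinear motion-energy stage is fixed, so only the linear hemodynamic coupling weights are estimated, and the L1 penalty keeps the fit stable even when the real filter bank contributes thousands of regressors.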
To determine how much motion information is available in BOLD signals, we compared prediction accuracy for three different encoding models: a conventional static model that includes no motion information [8]; a non-directional motion model that represents local motion energy but not direction; and a directional model that represents both local motion energy and direction. Each of these models was fit separately to every voxel recorded in each subject, and the test data were used to assess prediction accuracy for each model. Prediction accuracy was defined as the correlation between predicted and observed BOLD signals. The average accuracy across subjects and voxels in early visual areas (V1, V2, V3, V3A and V3B) is 0.24, 0.39 and 0.40 for the static, non-directional and directional encoding models, respectively (see Fig. S1A for subject- and area-wise comparisons). These differences in prediction accuracy are significant (P < 0.0001, Wilcoxon signed-rank test). An earlier study showed that the static model tested here recovered much more information from BOLD signals than had been obtained with any previous model [8]. Nevertheless, both motion models developed here provide far more accurate predictions than the static model. Note that the difference in prediction accuracy between the directional and non-directional motion models, though significant, is small (Fig. S1A). This suggests that BOLD signals convey spatially localized but predominantly non-directional motion information. These results show that the motion-energy encoding model predicts BOLD signals evoked by novel natural movies.
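As a concrete reading of the accuracy metric and the paired test, here is a minimal sketch with synthetic numbers (the voxel counts and accuracy distributions below are stand-ins, not the study's data):

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

rng = np.random.default_rng(1)

def prediction_accuracy(predicted, observed):
    """Prediction accuracy: correlation between predicted and observed BOLD."""
    return pearsonr(predicted, observed)[0]

# One voxel's predicted and (noisy) observed test responses, as stand-ins.
pred = rng.standard_normal(540)
obs = pred + rng.standard_normal(540)
print("example voxel accuracy:", round(prediction_accuracy(pred, obs), 2))

# Paired comparison of two models' per-voxel accuracies across many voxels,
# mirroring the Wilcoxon signed-rank test reported in the text.
acc_static = rng.normal(0.24, 0.10, size=2000)
acc_motion = acc_static + np.abs(rng.normal(0.15, 0.05, size=2000))
stat, p = wilcoxon(acc_motion, acc_static)
print(f"static={acc_static.mean():.2f}  motion={acc_motion.mean():.2f}  P={p:.1e}")
```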
The directional motion-energy model captures motion information
To further explore what information can be recovered from these data, we estimated the spatial, spatial frequency and temporal frequency tuning of the directional motion-energy encoding model fit to each voxel. The spatial receptive fields of individual voxels are spatially localized and organized retinotopically, as reported in previous fMRI studies [12]. Voxel-based receptive fields also show spatial and temporal frequency tuning, as reported in previous fMRI studies [24].
To determine how motion information is represented in human visual cortex, we calculated the optimal speed for each voxel by dividing its peak temporal frequency by its peak spatial frequency. Projecting the optimal speed of the voxels onto a flattened map of the cortical surface reveals a significant positive correlation between eccentricity and optimal speed: relatively more peripheral voxels are tuned for relatively higher speeds. This pattern is observed in areas V1, V2 and V3 and for all three subjects (P < 0.0001, t-test for correlation coefficient; see Fig. S1B for subject- and area-wise comparisons). To our knowledge this is the first evidence that speed selectivity in human early visual areas depends on eccentricity, though a consistent trend has been reported in human behavioral studies [26] and in neurophysiological studies of non-human primates [29]. These results show that the motion-energy encoding model describes tuning for both spatial and temporal information at the level of single voxels.
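The speed calculation itself is simple; the sketch below shows it on synthetic tuning peaks, with pearsonr supplying the t-test on the correlation coefficient (all values illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_vox = 500

# Synthetic per-voxel tuning peaks and eccentricities (units as in the text).
peak_sf = rng.uniform(0.5, 4.0, n_vox)          # peak spatial frequency (cyc/deg)
peak_tf = rng.uniform(1.0, 8.0, n_vox)          # peak temporal frequency (Hz)
eccentricity = rng.uniform(0.0, 10.0, n_vox)    # distance from fovea (deg)

# Optimal speed (deg/s) = peak temporal frequency / peak spatial frequency.
speed = peak_tf / peak_sf

# pearsonr's P-value is exactly the t-test on the correlation coefficient.
r, p = pearsonr(eccentricity, speed)
print(f"eccentricity-speed correlation r={r:.2f}, P={p:.2g}")
```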
To further characterize the temporal specificity of the estimated motion-energy encoding models, we used the test data to estimate movie identification accuracy. Identification accuracy [7] measures how well a model can correctly associate an observed BOLD signal pattern with the specific stimulus that evoked it. Our motion-energy encoding model identifies the specific movie stimulus that evoked an observed BOLD signal 95% (464/486) of the time to within ± one volume (one second; Subject S1). This is far above what would be expected by chance (<1%). Identification accuracy (within ± one volume) is greater than 75% for all three subjects even when the set of possible natural movie clips includes one million separate clips chosen at random from the internet. This result demonstrates that the motion-energy encoding model is both valid and temporally specific. Furthermore, it suggests that the model might provide good reconstructions of natural movies from brain activity measurements [5].
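A minimal sketch of identification follows. Matching by correlation between the observed pattern and each model-predicted pattern is a plausible stand-in for the procedure, and all sizes and noise levels are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n_vols, n_vox = 486, 200                        # illustrative sizes

# Model-predicted BOLD patterns for every volume of the test movie, plus
# noisy "observed" patterns to identify (both synthetic stand-ins).
predicted = rng.standard_normal((n_vols, n_vox))
observed = predicted + 0.8 * rng.standard_normal((n_vols, n_vox))

def identify(obs_pattern, predicted):
    """Index of the predicted pattern most correlated with obs_pattern."""
    zo = (obs_pattern - obs_pattern.mean()) / obs_pattern.std()
    zp = (predicted - predicted.mean(1, keepdims=True)) / predicted.std(1, keepdims=True)
    return int(np.argmax(zp @ zo))              # dot of z-scores ~ correlation

hits = sum(abs(identify(observed[t], predicted) - t) <= 1 for t in range(n_vols))
print(f"identified within +/- one volume: {hits}/{n_vols}")
```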
We used a Bayesian approach [8] to reconstruct movies from the evoked BOLD signals (see also Fig. S2). We estimated the posterior probability by combining a likelihood function (given by the estimated motion-energy model; see Supplemental Information) and a sampled natural movie prior. The sampled natural movie prior consists of ~18 million seconds of natural movies sampled at random from the internet. These clips were assigned uniform prior probability (and consequently, all other clips were assigned zero prior probability; note also that none of the clips in the prior were used in the experiment). Furthermore, to make decoding tractable, reconstructions were based on one-second clips (15 frames), using BOLD signals with a delay of four seconds. In effect, this procedure enforces the assumption that the spatio-temporal stimulus that elicited each measured BOLD signal must be one of the movie clips in the sampled prior.
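Under these assumptions the decoder reduces to a large nearest-neighbor search: with a uniform prior over the sampled clips and an (assumed, for this sketch) Gaussian noise model, the log posterior for each clip is just the negative squared error between its predicted BOLD pattern and the observed one:

```python
import numpy as np

rng = np.random.default_rng(4)
n_clips, n_vox = 100_000, 200                   # the real prior: ~18M seconds

# Encoding-model predictions of the BOLD pattern for every one-second clip in
# the sampled prior, and one observed pattern (synthetic stand-ins).
prior_pred = rng.standard_normal((n_clips, n_vox)).astype(np.float32)
observed = prior_pred[1234] + 0.5 * rng.standard_normal(n_vox).astype(np.float32)

# Uniform prior over sampled clips + Gaussian residual noise => the log
# posterior is, up to a constant, the negative squared prediction error.
log_post = -((prior_pred - observed) ** 2).sum(axis=1)
map_clip = int(np.argmax(log_post))
print("MAP clip index:", map_clip)              # should recover clip 1234
```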
Representative reconstructions of natural movies obtained using the motion-energy encoding model and the Bayesian decoding approach are shown in Movie S1. The posterior probability was estimated across the entire sampled natural movie prior separately for each BOLD signal in the test data. The peak of this posterior distribution is the conventional maximum a posteriori (MAP) reconstruction [8] for each BOLD signal. When the sampled natural movie prior contains clips that are similar to the viewed clip, the MAP reconstructions are good (e.g., a close-up of a human speaker). However, when the prior contains no clips similar to the viewed clip, the reconstructions are poor. This likely reflects both the limited size of the sampled natural movie prior and noise in the fMRI measurements. One way to achieve more robust reconstructions without enlarging the prior is to interpolate over the sparse samples in the prior. We therefore created an averaged high posterior (AHP) reconstruction by averaging the 100 clips in the sampled natural movie prior that had the highest posterior probability (see also Fig. S2; note that the AHP reconstruction can be viewed as a Bayesian version of bagging [31]). The AHP reconstruction captures the spatio-temporal structure within a viewed clip even when that clip is completely unique (e.g., the spreading of an inkblot from the center of the visual field).
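Continuing the sketch above, the AHP step is then a plain average over the highest-posterior clips (sizes synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)
n_clips, frames, h, w = 1000, 15, 8, 8          # tiny stand-in prior

clips = rng.random((n_clips, frames, h, w))     # pixels of every prior clip
log_post = rng.standard_normal(n_clips)         # stand-in log posteriors

# AHP: average, frame by frame, the 100 clips with the highest posterior.
top100 = np.argsort(log_post)[-100:]
ahp = clips[top100].mean(axis=0)
print("AHP reconstruction shape:", ahp.shape)   # (15, 8, 8): one second at 15 Hz
```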
Reconstructions of natural movies from BOLD signals
To quantify reconstruction quality, we calculated the correlation between the motion-energy content of the original movies and that of their reconstructions (see Supplemental Information). A correlation of 1.0 indicates perfect reconstruction of the spatio-temporal energy in the original movie, and a correlation of 0.0 indicates that the movie and its reconstruction are spatio-temporally uncorrelated. For both MAP and AHP reconstructions, accuracy is significantly higher than chance (P < 0.0001, Wilcoxon rank-sum test; see Supplemental Information). Furthermore, AHP reconstructions are significantly better than MAP reconstructions (P < 0.0001, Wilcoxon signed-rank test). Although still crude (motion-energy correlation ~0.3), these results validate our general approach to reconstruction and demonstrate that the AHP estimate improves reconstruction over the MAP estimate.
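A sketch of this evaluation follows, substituting a crude temporal-difference energy for the paper's full motion-energy features purely to keep the example short (that substitution is our assumption, not the paper's metric):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(6)

def motion_energy_features(movie):
    """Stand-in motion-energy descriptor: per-frame energy of temporal
    differences, pooled over a coarse 4 x 4 spatial grid."""
    diff = np.diff(movie, axis=0) ** 2
    pooled = diff.reshape(diff.shape[0], 4, diff.shape[1] // 4,
                          4, diff.shape[2] // 4).mean(axis=(2, 4))
    return pooled.ravel()

original = rng.random((15, 16, 16))                  # one-second clip
reconstruction = 0.5 * original + 0.5 * rng.random((15, 16, 16))

r, _ = pearsonr(motion_energy_features(original),
                motion_energy_features(reconstruction))
print(f"motion-energy correlation: {r:.2f}")         # 1.0 perfect, 0.0 unrelated
```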