Quantitative modeling of human brain activity can provide crucial insights about cortical representations [1, 2], and can form the basis for brain decoding devices [3–5]. Recent functional magnetic resonance imaging (fMRI) studies have modeled brain activity elicited by static visual patterns, and have shown that it is possible to reconstruct these images from brain activity measurements [6–8]. However, blood oxygen level dependent (BOLD) signals measured using fMRI are very slow, so it has been difficult to model brain activity elicited by dynamic stimuli such as natural movies. Here we present a new motion-energy [10, 11] encoding model that largely overcomes this limitation. Our motion-energy model describes fast visual information and slow hemodynamics by separate components. We recorded BOLD signals in occipito-temporal visual cortex of human subjects who passively watched natural movies, and fit the encoding model separately to individual voxels. Visualization of the fit models reveals how early visual areas represent moving stimuli. To demonstrate the power of our approach we also constructed a Bayesian decoder, by combining estimated encoding models with a sampled natural movie prior. The decoder provides remarkable reconstructions of natural movies, capturing the spatio-temporal structure of the viewed movie. These results demonstrate that dynamic brain activity measured under naturalistic conditions can be decoded using current fMRI technology.
Many of our visual experiences are dynamic: perception, visual imagery, dreaming and hallucinations all change continuously over time, and these changes are often the most compelling and important aspects of these experiences. Obtaining a quantitative understanding of brain activity underlying these dynamic processes would advance our understanding of visual function. Quantitative models of dynamic mental events could also have important applications as tools for psychiatric diagnosis, and as the foundation of brain machine interface devices [3–5].
Modeling dynamic brain activity is a difficult technical problem. The best tool available currently for non-invasive measurement of brain activity is functional MRI, which has relatively high spatial resolution [12, 13]. However, blood oxygen level dependent (BOLD) signals measured using fMRI are relatively slow, especially when compared to the speed of natural vision and many other mental processes. It has therefore been assumed that fMRI data would not be useful for modeling brain activity evoked during natural vision or by other dynamic mental processes.
Here we present a new motion-energy [10, 11] encoding model that largely overcomes this limitation. The model separately describes the neural mechanisms mediating visual motion information and their coupling to much slower hemodynamic mechanisms. In this report we first validate this encoding model by showing that it describes how spatial and temporal information are represented in voxels throughout visual cortex. We then use a Bayesian approach to combine estimated encoding models with a sampled natural movie prior, in order to produce reconstructions of natural movies from BOLD signals.
We recorded BOLD signals from three human subjects while they viewed a series of color natural movies (20 × 20 degrees at 15 Hz). A fixation task was used to control eye position. Two separate data sets were obtained from each subject. The training data set consisted of BOLD signals evoked by 7,200 seconds of color natural movies, where each movie was presented just once. These data were used to fit a separate encoding model for each voxel located in posterior and ventral occipito-temporal visual cortex. The test data set consisted of BOLD signals evoked by 540 seconds of color natural movies, where each movie was repeated ten times. These data were used to assess the accuracy of the encoding model, and as the targets for movie reconstruction. Because the movies used to train and test models were different, this approach provides a fair and objective evaluation of the accuracy of the encoding and decoding models [2, 14].
BOLD signals recorded from each voxel were fit separately using a two-stage process. Natural movie stimuli were first filtered by a bank of neurally-inspired nonlinear units sensitive to local motion-energy [10, 11]. Then L1-regularized linear regression [15, 16] was used to fit a separate hemodynamic coupling term to each nonlinear filter (Fig. 1; also see Supplemental Information). The regularized regression approach used here was optimized to obtain good estimates even for computational models containing thousands of regressors. In this respect our approach differs from the regression procedures used in many other fMRI studies [17, 18].
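The second stage of this two-stage fit can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the motion-energy filter outputs are already available as a time-by-channel array, that hemodynamic coupling is modeled with a few hypothetical integer delays (in volumes), and that the L1 problem is solved with plain iterative soft-thresholding (ISTA). The function names `add_delays` and `fit_l1`, the delay values and the penalty `lam` are all illustrative choices; the actual procedure is described in the Supplemental Information.

```python
import numpy as np

def add_delays(features, delays):
    """Stack time-shifted copies of the filter outputs (time x channels).

    Each delay gets its own block of columns, so the regression can assign
    a separate hemodynamic weight to every (delay, channel) pair.
    """
    n_t, n_c = features.shape
    out = np.zeros((n_t, n_c * len(delays)))
    for i, d in enumerate(delays):
        out[d:, i * n_c:(i + 1) * n_c] = features[:n_t - d]
    return out

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fit_l1(X, y, lam, n_iter=500):
    """Minimize (0.5 / n) * ||y - X w||^2 + lam * ||w||_1 with ISTA."""
    n = X.shape[0]
    # Lipschitz constant of the gradient of the squared-error term.
    lipschitz = np.linalg.norm(X, 2) ** 2 / n
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / lipschitz, lam / lipschitz)
    return w
```

In use, one voxel's BOLD time course would be passed as `y` and the delayed filter outputs as `X`; the fit is repeated independently for every voxel, and `lam` would normally be chosen by cross-validation on the training data.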
To determine how much motion information is available in BOLD signals we compared prediction accuracy for three different encoding models (Fig. 2A-C): a conventional static model that includes no motion information [8, 19]; a non-directional motion model that represents local motion energy but not direction; and a directional model that represents both local motion energy and direction. Each of these models was fit separately to every voxel recorded in each subject, and the test data were used to assess prediction accuracy for each model. Prediction accuracy was defined as the correlation between predicted and observed BOLD signals. The average accuracies across subjects and voxels in early visual areas (V1, V2, V3, V3A and V3B) are 0.24, 0.39 and 0.40 for the static, non-directional and directional encoding models, respectively (Fig. 2D and 2E; see Fig. S1A for subject- and area-wise comparisons). This difference in prediction accuracy is significant (P<0.0001, Wilcoxon signed-rank test). An earlier study showed that the static model tested here recovered much more information from BOLD signals than had been obtained with any previous model [8, 19]. Nevertheless, both motion models developed here provide far more accurate predictions than are obtained with the static model. Note that the difference in prediction accuracy between the directional and non-directional motion models, though significant, is small (Fig. 2E and S1A). This suggests that BOLD signals convey spatially localized but predominantly non-directional motion information. These results show that the motion-energy encoding model predicts BOLD signals evoked by novel natural movies.
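The prediction accuracy measure defined above (correlation between predicted and observed BOLD signals) can be computed per voxel as in the sketch below. It assumes both signals are stored as time-by-voxel arrays; the function name is illustrative.

```python
import numpy as np

def prediction_accuracy(predicted, observed):
    """Pearson correlation between predicted and observed signals, per voxel.

    Both inputs are (time x voxels); the result has one value per voxel.
    """
    p = predicted - predicted.mean(axis=0)
    o = observed - observed.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (o ** 2).sum(axis=0))
    return (p * o).sum(axis=0) / denom
```

Averaging the returned values over the voxels of an area gives a summary of the kind reported for the three models.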
To further explore what information can be recovered from these data we estimated the spatial, spatial frequency and temporal frequency tuning of the directional motion-energy encoding model fit to each voxel. The spatial receptive fields of individual voxels are spatially localized (Fig. 2F and 2G, left) and are organized retinotopically (Fig. 2H and 2I), as reported in previous fMRI studies [12, 19–23]. Voxel-based receptive fields also show spatial and temporal frequency tuning (Fig. 2F and 2G, right), as reported in previous fMRI studies [24, 25].
To determine how motion information is represented in human visual cortex we calculated the optimal speed for each voxel by dividing the peak temporal frequency by the peak spatial frequency. Projecting the optimal speed of the voxels onto a flattened map of the cortical surface (Fig. 2J) reveals a significant positive correlation between eccentricity and optimal speed: relatively more peripheral voxels are tuned for relatively higher speeds. This pattern is observed in areas V1, V2 and V3 and for all three subjects (P<0.0001, t-test for correlation coefficient; see Fig. S1B for subject- and area-wise comparisons). To our knowledge this is the first evidence that speed selectivity in human early visual areas depends on eccentricity, though a consistent trend has been reported in human behavioral studies [26–28] and in neurophysiological studies of non-human primates [29, 30]. These results show that the motion-energy encoding model describes tuning for both spatial and temporal information at the level of single voxels.
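The optimal-speed calculation is a simple ratio: a temporal frequency in Hz (cycles per second) divided by a spatial frequency in cycles per degree gives a speed in degrees per second. The sketch below assumes a voxel's tuning is summarized on a hypothetical spatial-by-temporal frequency grid and that both peaks are read from the maximum of that joint grid; the text does not say whether joint or marginal peaks were used, so this is one possible reading.

```python
import numpy as np

def optimal_speed(tuning, spatial_freqs, temporal_freqs):
    """Optimal speed (deg/s) from a voxel's frequency tuning.

    tuning: (n_spatial, n_temporal) array of tuning strength.
    spatial_freqs: cycles/degree for each row (assumed > 0).
    temporal_freqs: Hz for each column.
    """
    i_sf, i_tf = np.unravel_index(np.argmax(tuning), tuning.shape)
    return temporal_freqs[i_tf] / spatial_freqs[i_sf]
```

For example, a voxel peaking at 4 Hz and 2 cycles/degree has an optimal speed of 2 degrees per second.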
To further characterize the temporal specificity of the estimated motion-energy encoding models we used the test data to estimate movie identification accuracy. Identification accuracy [7, 19] measures how well a model can correctly associate an observed BOLD signal pattern with the specific stimulus that evoked it. Our motion-energy encoding model can identify the specific movie stimulus that evoked an observed BOLD signal 95% (464/486) of the time within ± one volume (one second; Subject S1, Fig. 3A and 3B). This is far above what would be expected by chance (<1%). Identification accuracy (within ± one volume) is greater than 75% for all three subjects even when the set of possible natural movie clips includes one million separate clips chosen at random from the internet (Fig. 3C). This result demonstrates that the motion-energy encoding model is both valid and temporally specific. Furthermore, it suggests that the model might provide good reconstructions of natural movies from brain activity measurements.
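The identification test described above can be sketched as a matching problem. The example assumes that each observed volume is compared with the model's predicted pattern for every candidate time point using correlation across voxels, and that an identification counts as correct when the best match lies within the stated tolerance of ± one volume. The matching criterion and the function name are assumptions made for illustration.

```python
import numpy as np

def identification_accuracy(predicted, observed, tolerance=1):
    """Fraction of observed volumes matched to the right stimulus time.

    predicted, observed: (time x voxels) arrays aligned in time.
    A match is correct if the best-correlated predicted volume lies
    within `tolerance` volumes of the true time point.
    """
    def normalize_rows(a):
        a = a - a.mean(axis=1, keepdims=True)
        return a / np.linalg.norm(a, axis=1, keepdims=True)

    # corr[i, j]: correlation across voxels between observed volume i
    # and predicted volume j.
    corr = normalize_rows(observed) @ normalize_rows(predicted).T
    best = corr.argmax(axis=1)
    true_index = np.arange(observed.shape[0])
    return float(np.mean(np.abs(best - true_index) <= tolerance))
```

Enlarging the candidate set, as in the one-million-clip test, amounts to appending rows of predicted patterns for clips that were never shown.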
We used a Bayesian approach to reconstruct movies from the evoked BOLD signals (see also Fig. S2). We estimated the posterior probability by combining a likelihood function (given by the estimated motion-energy model; see Supplemental Information) and a sampled natural movie prior. The sampled natural movie prior consists of ~18 million seconds of natural movies sampled at random from the internet. These clips were assigned uniform prior probability (and consequently, all other clips were assigned zero prior probability; note also that none of the clips in the prior were used in the experiment). Furthermore, to make decoding tractable, reconstructions were based on one second clips (15 frames), using BOLD signals with a delay of four seconds. In effect, this procedure enforces an assumption that the spatio-temporal stimulus that elicited each measured BOLD signal must be one of the movie clips in the sampled prior.
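The effect of the uniform sampled prior can be written out with Bayes' rule. The symbols are introduced here only for illustration and are not taken from the paper: s is a candidate one-second clip, r is the measured BOLD pattern, and S is the set of N clips in the sampled prior. The form of the likelihood p(r | s) is given in the Supplemental Information and is left unspecified.

```latex
\[
p(s \mid r) \;\propto\; p(r \mid s)\, p(s),
\qquad
p(s) =
\begin{cases}
1/N & \text{if } s \in S,\\
0   & \text{otherwise,}
\end{cases}
\]
\[
p(s_i \mid r) \;=\; \frac{p(r \mid s_i)}{\sum_{j=1}^{N} p(r \mid s_j)},
\qquad s_i \in S .
\]
```

Because the prior is constant over S, ranking clips by posterior probability is equivalent to ranking them by likelihood under the fitted encoding model.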
Fig. 4 shows typical reconstructions of natural movies obtained using the motion-energy encoding model and the Bayesian decoding approach (see Movie S1 for the corresponding movies). The posterior probability was estimated across the entire sampled natural movie prior separately for each BOLD signal in the test data. The peak of this posterior distribution is the conventional maximum a posteriori (MAP) reconstruction for each BOLD signal (see second row in Fig. 4). When the sampled natural movie prior contains clips that are similar to the viewed clip then the MAP reconstructions are good (e.g., the close-up of a human speaker shown in Fig. 4A). However, when the prior contains no clips similar to the viewed clip then the reconstructions are poor (e.g., Fig. 4B). This likely reflects both the limited size of the sampled natural movie prior and noise in the fMRI measurements. One way to achieve more robust reconstructions without enlarging the prior is to interpolate over the sparse samples in the prior. We therefore created an averaged high posterior (AHP) reconstruction, by averaging the 100 clips in the sampled natural movie prior that had the highest posterior probability (see also Fig. S2; note that the AHP reconstruction can be viewed as a Bayesian version of bagging). The AHP reconstruction captures the spatio-temporal structure within a viewed clip even when it is completely unique (e.g., the spreading of an inkblot from the center of the visual field shown in Fig. 4B).
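The two reconstructions differ only in how many high-posterior clips are used. The sketch below assumes the (log) posterior has already been evaluated for every clip in the sampled prior and that the clips are held in one array; the function name and array layout are illustrative. The AHP estimate is an unweighted frame-by-frame mean of the top-ranked clips, as described above.

```python
import numpy as np

def map_and_ahp(log_posterior, prior_clips, n_top=100):
    """Return the MAP clip and the averaged high posterior (AHP) clip.

    log_posterior: (n_clips,) score of each prior clip for one BOLD signal.
    prior_clips: (n_clips, frames, height, width) array of candidate clips.
    n_top: number of highest-posterior clips to average (100 in the text).
    """
    map_clip = prior_clips[int(np.argmax(log_posterior))]
    top = np.argsort(log_posterior)[-n_top:]
    ahp_clip = prior_clips[top].mean(axis=0)
    return map_clip, ahp_clip
```

Setting `n_top=1` makes the AHP estimate coincide with the MAP estimate, which shows that AHP is a smoothed generalization of MAP rather than a different decoder.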
To quantify reconstruction quality we calculated the correlation between the motion-energy content of the original movies and their reconstructions (see Supplemental Information). A correlation of 1.0 indicates perfect reconstruction of the spatio-temporal energy in the original movies and a correlation of 0.0 indicates that the movies and their reconstruction are spatio-temporally uncorrelated. The results for both MAP and AHP reconstructions are shown in Fig. 4D. In both cases reconstruction accuracy is significantly higher than chance (P<0.0001, Wilcoxon rank-sum test; see Supplemental Information). Furthermore, AHP reconstructions are significantly better than MAP reconstructions (P<0.0001, Wilcoxon signed-rank test). Although still crude (motion-energy correlation ~ 0.3), these results validate our general approach to reconstruction and demonstrate that the AHP estimate improves reconstruction over the MAP estimate.
In this study we developed a new encoding model that predicts BOLD signals in early visual areas with unprecedented accuracy. By using this model in a Bayesian framework we provide the first reconstructions of natural movies from human brain activity. This is a critical step toward the creation of brain reading devices that can reconstruct dynamic perceptual experiences. Our solution to this problem rests on two key innovations. The first is a new motion-energy encoding model that is optimized for use with fMRI, and that aims to reflect the separate contributions of the underlying neuronal population and hemodynamic coupling (Fig. 1). This encoding model recovers fine temporal information from relatively slow BOLD signals. The second is a sampled natural movie prior that is embedded within a Bayesian decoding framework. This approach provides a simple method for reconstructing spatio-temporal stimuli from the sparsely sampled and slow BOLD signals.
Our results provided the first evidence that there is a positive correlation between eccentricity and speed tuning in human early visual areas. This provides a functional explanation for previous behavioral studies indicating that speed sensitivity depends on eccentricity [26–28]. This systematic variation in speed tuning across the visual field may be an adaptation to the non-uniform distribution of speed signals induced by selective foveation in natural scenes. From the perspective of decoding, this result suggests that we might further optimize reconstruction by including eccentricity-dependent speed tuning in the prior.
We found that a motion-energy model that incorporates directional motion signals was only slightly better than a model that does not include direction. We believe that this likely reflects limitations in the spatial resolution of fMRI recordings. Indeed, a recent study reported that hemodynamic signals were sufficient to visualize a columnar organization of motion direction in macaque area V2. Future fMRI experiments at higher spatial or temporal resolution [34, 35] might therefore be able to recover clearer directional signals in human visual cortex.
In preliminary work for this study we explored several encoding models that incorporated color information explicitly. However, we found that color information did not improve the accuracy of predictions or identification beyond what could be achieved with models that include only luminance information. We believe that this reflects the fact that luminance and color borders are often correlated in natural scenes [36, 37]; but see . (Note that when iso-luminant, mono-chromatic stimuli are used, color can be reconstructed from evoked BOLD signals .) The correlation between luminance and color information in natural scenes has an interesting side effect: our reconstructions tend to recover color borders (e.g., borders between hair vs. face or face vs. body), even though the encoding model makes no use of color information. This is a positive aspect of the sampled natural movie prior and provides additional cues to aid in recognition of reconstructed scenes (see also ).
We found that the quality of reconstruction could be improved by simply averaging around the maximum of the posterior movies. This suggests that reconstructions might be further improved if the number of samples in the prior is much larger than the one used here. Likelihood estimation (and thus reconstruction) would also improve if additional knowledge about the neural representation of movies were used to construct better encoding models (e.g., ).
In a landmark study Thirion et al. first reconstructed static imaginary patterns from BOLD signals in early visual areas. Other studies have decoded subjective mental states, such as the contents of visual working memory, or whether subjects are attending to one or another orientation or direction [3, 43]. The modeling framework presented here provides the first reconstructions of dynamic perceptual experiences from BOLD signals. Therefore, this modeling framework might also permit reconstruction of dynamic mental content such as continuous natural visual imagery. In contrast to earlier studies that reconstruct visual patterns defined by checkerboard contrast [6, 7], our framework could potentially be used to decode involuntary subjective mental states (e.g., dreaming or hallucination), though it would be difficult to determine whether the decoded content was accurate. One recent study showed that BOLD signals elicited by visual imagery are more prominent in ventral-temporal visual areas than in early visual areas. This finding suggests that a hybrid encoding model that combines the structural motion-energy model developed here with a semantic model of the form developed in previous studies [8, 45, 46] could provide even better reconstructions of subjective mental experiences.
Visual stimuli consisted of color natural movies drawn from the Apple QuickTime HD gallery (http://www.apple.com/quicktime/guide/hd/) and YouTube (http://www.youtube.com/; see the list of movies in Supplemental Information). The original high-definition movies were cropped to a square and then spatially down-sampled to 512 by 512 pixels. Movies were then clipped to 10–20 seconds in length, and the stimulus sequence was created by randomly drawing movies from the entire set. Movies were displayed using a VisuaStim LCD goggles system (20x20 degrees, 15 Hz). A colored fixation spot (4 pixels or 0.16 degree square) was presented on top of the movie. The color of the fixation spot changed three times per second to ensure that it was visible regardless of the color of the movie.
The experimental protocol was approved by the Committee for the Protection of Human Subjects at the University of California at Berkeley. Functional scans were conducted using a 4 Tesla Varian INOVA scanner (Varian, Inc., Palo Alto, CA) with a quadrature transmit/receive surface coil (Midwest RF, LLC, Hartland, WI). Scans were obtained using T2*-weighted gradient-echo EPI: TR = 1 second, TE = 28 ms, flip angle = 56 degrees, voxel size = 2.0 × 2.0 × 2.5 mm³, and FOV = 128 × 128 mm². The slice prescription consisted of 18 coronal slices beginning at the posterior pole and covering the posterior portion of occipital cortex.
Functional MRI scans were made from three human subjects, S1 (author S.N., age 30), S2 (author T.N., age 34) and S3 (author A.V., age 23). All subjects were healthy and had normal or corrected-to-normal vision. The training data were collected in 12 separate 10 minute blocks (7200 seconds total). The training movies were shown only once each. The test data were collected in 9 separate 10 minute blocks (5400 seconds total) that consisted of 9 minutes of movies repeated 10 times each. To minimize effects from potential adaptation and long-term drift in the test data, the 9 minutes of movies were divided into 1 minute chunks and these were randomly permuted across blocks. Each test block was thus constructed by concatenating 10 separate one minute movies. All data were collected across multiple sessions for each subject, and each session contained multiple training and test blocks. The training and test data sets used different movies.
We thank B. Inglis for assistance with MRI, and K. Kay and K. Hansen for assistance with retinotopic mapping. We also thank M. Oliver, R. Prenger, D. Stansbury, A. Huth and J. Gao for their help in various aspects of this research. This work was supported by NIH and NEI.
The authors declare no conflict of interest.
Additional methods can be found in Supplemental Information online.