A challenging goal in neuroscience is to be able to read out, or decode, mental content from brain activity. Recent functional magnetic resonance imaging (fMRI) studies have decoded orientation1,2, position3, and object category4,5 from activity in visual cortex. However, these studies typically used relatively simple stimuli (e.g. gratings) or images drawn from fixed categories (e.g. faces, houses), and decoding was based on prior measurements of brain activity evoked by those same stimuli or categories. To overcome these limitations, we develop a decoding method based on quantitative receptive field models that characterize the relationship between visual stimuli and fMRI activity in early visual areas. These models describe the tuning of individual voxels for space, orientation, and spatial frequency, and are estimated directly from responses evoked by natural images. We show that these receptive field models make it possible to identify, from a large set of completely novel natural images, which specific image was seen by an observer. Identification is not a mere consequence of the retinotopic organization of visual areas; simpler receptive field models that describe only spatial tuning yield much poorer identification performance. Our results suggest that it may soon be possible to reconstruct a picture of a person’s visual experience from brain activity measurements alone.
Imagine a general brain-reading device that could reconstruct a picture of a person’s visual experience at any moment in time6. This general visual decoder would have great scientific and practical utility. For example, we could use the decoder to investigate differences in perception across people, to study covert mental processes such as attention, and perhaps even to access the visual content of purely mental phenomena such as dreams and imagery. The decoder would also serve as a useful benchmark of our understanding of how the brain represents sensory information.
How do we build a general visual decoder? We consider as a first step the problem of image identification3,7,8. This problem is analogous to the classic “pick a card, any card” magic trick: We begin with a large, arbitrary set of images. The observer picks an image from the set and views it while brain activity is measured. Is it possible to use the measured brain activity to identify which specific image was seen?
To ensure that a solution to the image identification problem will be applicable to general visual decoding, we introduce two challenging requirements6. First, it must be possible to identify novel images. Conventional classification-based decoding methods can be used to identify images if brain activity evoked by those images has been measured previously, but they cannot be used to identify novel images (see Supplementary Discussion). Second, it must be possible to identify natural images. Natural images have complex statistical structure9 and are much more difficult to parameterize than simple artificial stimuli such as gratings or pre-segmented objects. Because neural processing of visual stimuli is nonlinear, a decoder that can identify simple stimuli may fail when confronted with complex natural images.
Our experiment consisted of two stages (Fig. 1). In the first stage, model estimation, fMRI data were recorded from visual areas V1, V2, and V3 while each subject viewed 1,750 natural images. These data were used to estimate a quantitative receptive field model10 for each voxel (Fig. 2). The model was based on a Gabor wavelet pyramid11,13 and described tuning along the dimensions of space3,14,19, orientation1,2,20, and spatial frequency21,22. (See Supplementary Discussion for a comparison of our receptive field analysis to those of previous studies.)
In the second stage, image identification, fMRI data were recorded while each subject viewed 120 novel natural images. This yielded 120 distinct voxel activity patterns for each subject. For each voxel activity pattern we attempted to identify which image had been seen. To do this, the receptive field models estimated in the first stage of the experiment were used to predict the voxel activity pattern that would be evoked by each of the 120 images. The image whose predicted voxel activity pattern was most correlated (Pearson’s r) with the measured voxel activity pattern was selected.
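The selection rule described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' code: the function name `identify`, the array shapes, and the noise level in the demo are assumptions made here for the example.

```python
import numpy as np

def identify(measured, predicted):
    """Pick the candidate image whose predicted voxel activity pattern
    has the highest Pearson correlation with the measured pattern.

    measured:  array of shape (n_voxels,)
    predicted: array of shape (n_images, n_voxels), one row per candidate
    Returns (index of selected image, correlation with every candidate).
    """
    m = measured - measured.mean()
    p = predicted - predicted.mean(axis=1, keepdims=True)
    r = (p @ m) / (np.linalg.norm(p, axis=1) * np.linalg.norm(m))
    return int(np.argmax(r)), r

# Synthetic demo (hypothetical sizes): 120 candidate images, 500 voxels.
rng = np.random.default_rng(0)
n_images, n_voxels = 120, 500
predicted = rng.standard_normal((n_images, n_voxels))
true_index = 37
# "Measured" pattern = model prediction for the viewed image plus noise.
measured = predicted[true_index] + rng.standard_normal(n_voxels)

best, r = identify(measured, predicted)
print(best == true_index)
```

In the experiment the rows of `predicted` would come from the receptive field models estimated in the first stage; here they are random numbers so that the example is self-contained.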
Identification performance for one subject is illustrated in Fig. 3. For this subject 92% (110/120) of the images were identified correctly (subject S1), while chance performance is just 0.8% (1/120). For a second subject 72% (86/120) of the images were identified correctly (subject S2). These high performance levels demonstrate the validity of our decoding approach, and indicate that our receptive field models accurately characterize the selectivity of individual voxels to natural images.
A general visual decoder would be especially useful if it could operate on brain activity evoked by a single perceptual event. However, because fMRI data are noisy, the results reported above were obtained using voxel activity patterns averaged across 13 repeated trials. We therefore attempted identification using voxel activity patterns from single trials. Single-trial performance was 51% (834/1620) and 32% (516/1620) for subjects S1 and S2, respectively (Fig. 4a); once again, chance performance is just 0.8% (13.5/1620). These results suggest that it may be feasible to decode the content of perceptual experiences in real time7,23.
We have so far demonstrated identification of a single image drawn from a set of 120 images, but a general visual decoder should be able to handle much larger sets of images. To investigate this issue we measured identification performance for various set sizes up to 1,000 images (Fig. 4b). As set size increased 10-fold from 100 to 1,000, performance only declined slightly, from 92% to 82% (subject S1, repeated-trial). Extrapolation of these measurements (see Supplementary Methods) suggests that performance for this subject would remain above 10% even up to a set size of ~10^11.3 images. This is more than 100 times larger than the number of images currently indexed by Google (~10^8.9 images; source: http://www.google.com/whatsnew/, June 4, 2007).
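The "more than 100 times larger" comparison can be checked directly, reading the two set sizes above as powers of ten (10^11.3 and 10^8.9 images). The snippet below only reproduces that arithmetic and the chance level for a given set size; it does not implement the extrapolation itself, which is described in the Supplementary Methods. The variable names are chosen here for illustration.

```python
# Exponents quoted in the text (base-10 logarithms of set sizes).
log10_extrapolated_set = 11.3   # set size at which S1 is extrapolated to reach 10% correct
log10_google_index = 8.9        # images indexed by Google, June 2007

ratio = 10 ** (log10_extrapolated_set - log10_google_index)
print(round(ratio))             # 251, i.e. more than 100 times larger

# Chance performance for identification from a set of n images is 1/n.
chance_120 = 1 / 120
print(f"{chance_120:.1%}")      # 0.8%
```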
Early visual areas are organized retinotopically, and voxels are known to reflect this organization14,16,18. Could our results be a mere consequence of retinotopy? To answer this question we attempted identification using an alternative model that captures the location and size of each voxel’s receptive field but discards orientation and spatial frequency information (Fig. 4c). Performance for this retinotopy-only model declined to 10% correct at a set size of just ~10^5.1 images, whereas performance for the Gabor wavelet pyramid model did not decline to 10% correct until ~10^9.5 images were included in the set (repeated-trial, performance extrapolated and averaged across subjects). This result indicates that spatial tuning alone does not yield optimal identification performance; identification improves substantially when orientation and spatial frequency tuning are included in the model.
To further investigate the impact of orientation and spatial frequency tuning, we measured identification performance after imposing constraints on the orientation and spatial frequency tuning of the Gabor wavelet pyramid model (Supplementary Fig. 8). The results indicate that both orientation and spatial frequency tuning contribute to identification performance, but that the latter makes the larger contribution. This is consistent with recent studies demonstrating that voxels have only slight orientation bias1,2. We also find that voxel-to-voxel variation in orientation and spatial frequency tuning contributes to identification performance. This reinforces the growing realization in the fMRI community that information may be present in fine-grained patterns of voxel activity6.
To be practical our identification algorithm must perform well even when brain activity is measured long after estimation of the receptive field models. To assess performance over time2,4,6,23 we attempted identification for a set of 120 novel natural images that were seen approximately two months after the initial experiment. In this case 82% (99/120) of the images were identified correctly (chance performance 0.8%; subject S1, repeated-trial). We also evaluated identification performance for a set of 12 novel natural images that were seen more than a year after the initial experiment. In this case 100% (12/12) of the images were identified correctly (chance performance 8%; subject S1, repeated-trial). These results demonstrate that the stimulus-related information that can be decoded from voxel activity remains largely stable over time.
Why does identification sometimes fail? Inspection revealed that identification errors tended to occur when the selected image was visually similar to the correct image. This suggests that noise in measured voxel activity patterns causes the identification algorithm to confuse images that have similar features.
Functional MRI signals have modest spatial resolution and reflect hemodynamic activity that is only indirectly coupled to neural activity24,25. Despite these limitations we have shown that fMRI signals can be used to achieve remarkable levels of identification performance. This indicates that fMRI signals contain a considerable amount of stimulus-related information4 and that this information can be successfully decoded in practice.
Identification of novel natural images brings us close to achieving a general visual decoder. The final step will require devising a way to reconstruct the image seen by the observer, instead of selecting the image from a known set. Stanley and co-workers26 reconstructed natural movies by modeling the luminance of individual image pixels as a linear function of single-unit activity in cat LGN. This approach assumes a linear relationship between luminance and the activity of the recorded units, but this condition does not hold in fMRI27,28.
An alternative approach to reconstruction is to incorporate receptive field models into a statistical inference framework. In such a framework, receptive field models are used to infer the most likely image given a measured activity pattern. This model-based approach has a long history in both theoretical and experimental neuroscience29,30. Recently, Thirion and co-workers3 used it to reconstruct spatial maps of contrast from fMRI activity in human visual cortex. The success of the approach depends critically on how well the receptive field models predict brain activity. The present study demonstrates that our receptive field models have sufficient predictive power to enable identification of novel natural images, even for the case of extremely large sets of images. We are therefore optimistic that the model-based approach will make possible the reconstruction of natural images from human brain activity.
The stimuli consisted of sequences of natural photos. Photos were obtained from a commercial digital library (Corel Stock Photo Libraries from Corel Corporation, Ontario, Canada), the Berkeley Segmentation Dataset (http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/), and the authors’ personal collections. The content of the photos included animals, buildings, food, humans, indoor scenes, manmade objects, outdoor scenes, and textures. Photos were converted to grayscale, downsampled so that the smaller of the two image dimensions was 500 px, linearly transformed so that the 0.1th and 99.9th percentiles of the original pixel values were mapped to the minimum (0) and maximum (255) pixel values, cropped to the central 500 px × 500 px, masked with a circle, and placed on a gray background (Supplementary Fig. 1a). The luminance of the background was set to the mean luminance across photos, and the outer edge of each photo (10% of the radius of the circular mask) was linearly blended into the background.
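The preprocessing steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' MATLAB code: the input is assumed to be already grayscale and downsampled, values outside the two percentiles are assumed to saturate at 0 and 255, and the background value is passed in as a parameter (`background`, a hypothetical name) rather than computed as the mean luminance across the photo set.

```python
import numpy as np

def preprocess(gray, size=500, background=128.0, blend_frac=0.10):
    """gray: 2-D float array, grayscale, smaller dimension equal to `size`.
    Assumes the image is not constant (the two percentiles differ)."""
    # Linear contrast stretch: 0.1th and 99.9th percentiles -> 0 and 255.
    # Clipping of values beyond the percentiles is an assumption.
    lo, hi = np.percentile(gray, [0.1, 99.9])
    img = np.clip((gray - lo) / (hi - lo) * 255.0, 0.0, 255.0)

    # Crop to the central size x size region.
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    img = img[top:top + size, left:left + size]

    # Circular mask; the outer 10% of the radius is blended linearly
    # into the gray background (alpha goes from 1 inside to 0 at the edge).
    radius = size / 2.0
    yy, xx = np.mgrid[0:size, 0:size]
    center = (size - 1) / 2.0
    dist = np.hypot(yy - center, xx - center)
    alpha = np.clip((radius - dist) / (blend_frac * radius), 0.0, 1.0)
    return alpha * img + (1.0 - alpha) * background

# Demo on a random 500 x 750 "photo" (synthetic stand-in for a real image).
rng = np.random.default_rng(1)
photo = rng.uniform(0.0, 255.0, size=(500, 750))
stimulus = preprocess(photo)
print(stimulus.shape)
```

Pixels outside the circle take the background value exactly, so the corners of `stimulus` equal `background`.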
The size of the photos was 20° × 20° (500 px × 500 px). A central white square served as the fixation point, and its size was 0.2° × 0.2° (4 px × 4 px). Photos were presented in successive 4-s trials; in each trial, a photo was presented for 1 s and the gray background was presented for 3 s. Each 1-s presentation consisted of a photo being flashed ON-OFF-ON-OFF-ON where ON corresponds to presentation of the photo for 200 ms and OFF corresponds to presentation of the gray background for 200 ms (Supplementary Fig. 1b). The flashing technique increased the signal-to-noise ratio of voxel responses relative to that achieved by presenting each photo continuously for 1 s (data not shown).
Visual stimuli were delivered using the VisuaStim goggles system (Resonance Technology, Northridge, CA). The display resolution was 800 × 600 at 60 Hz. A PowerBook G4 computer (Apple Computer, Cupertino, CA) controlled stimulus presentation using software written in MATLAB 5.2.1 (The Mathworks, Natick, MA) and Psychophysics Toolbox 2.53 (http://psychtoolbox.org).
The experimental protocol was approved by the UC-Berkeley Committee for the Protection of Human Subjects. MRI data were collected at the Brain Imaging Center at UC-Berkeley using a 4 T INOVA MR scanner (Varian, Inc., Palo Alto, CA) and a quadrature transmit/receive surface coil (Midwest RF, LLC, Hartland, WI). Data were acquired using coronal slices that covered occipital cortex: 18 slices, slice thickness 2.25 mm, slice gap 0.25 mm, field-of-view 128 × 128 mm². (In one scan session, a slice gap of 0.5 mm was used.) For functional data, a T2*-weighted, single-shot, slice-interleaved, gradient-echo EPI pulse sequence was used: matrix size 64 × 64, TR 1 s, TE 28 ms, flip angle 20°. The nominal spatial resolution of the functional data was 2 × 2 × 2.5 mm³. For anatomical data, a T1-weighted gradient-echo multislice sequence was used: matrix size 256 × 256, TR 0.2 s, TE 5 ms, flip angle 40°.
Data for the model estimation and image identification stages of the experiment were collected in the same scan sessions. Two subjects were used: S1 (author T.N., age 33) and S2 (author K.N.K., age 25). Subjects were healthy and had normal or corrected-to-normal vision.
Five scan sessions of data were collected from each subject. Each scan session consisted of five model estimation runs and two image identification runs. Model estimation runs (11 min each) were used for the model estimation stage of the experiment. Each model estimation run consisted of 70 distinct images presented 2 times each. Image identification runs (12 min each) were used for the image identification stage of the experiment. Each image identification run consisted of 12 distinct images presented 13 times each. Images were randomly selected for each run and were mutually exclusive across runs. The total number of distinct images used in the model estimation and image identification runs was 1,750 and 120, respectively. (For additional details on experimental design, see Supplementary Methods.)
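The image totals follow directly from the design just described, as the short check below shows. The per-run stimulus time uses the 4-s trial length given in the stimulus description; how the rest of each 11-min or 12-min run was spent is not broken down in this section, so the difference is left unexplained here. Variable names are illustrative.

```python
sessions = 5
est_runs_per_session, est_images_per_run, est_repeats = 5, 70, 2
id_runs_per_session, id_images_per_run, id_repeats = 2, 12, 13
trial_seconds = 4  # 1 s photo + 3 s gray background

n_estimation_images = sessions * est_runs_per_session * est_images_per_run
n_identification_images = sessions * id_runs_per_session * id_images_per_run
print(n_estimation_images, n_identification_images)   # 1750 120

# Time occupied by image trials within a single run, in minutes.
est_trial_minutes = est_images_per_run * est_repeats * trial_seconds / 60
id_trial_minutes = id_images_per_run * id_repeats * trial_seconds / 60
print(round(est_trial_minutes, 1), round(id_trial_minutes, 1))   # 9.3 10.4
```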
Three additional scan sessions of data were collected from subject S1. Two of these were held ~2 months after the main experiment, and consisted of five image identification runs each. The third was held ~14 months after the main experiment, and consisted of one image identification run. The images used in these additional scan sessions were randomly selected and were distinct from the images used in the main experiment.
Functional brain volumes were reconstructed and then coregistered to correct differences in head positioning within and across scan sessions. Next, voxel-specific response timecourses were estimated and deconvolved from the time-series data. This produced, for each voxel, an estimate of the amplitude of the response (a single value) to each image used in the model estimation and image identification runs. Finally, voxels were assigned to visual areas based on retinotopic mapping data17 collected in separate scan sessions. (Details on these procedures are given in Supplementary Methods.)
A receptive field model was estimated for each voxel based on its responses to the images used in the model estimation runs. The model was based on a Gabor wavelet pyramid11,13. In the model each image is represented by a set of Gabor wavelets differing in size, position, orientation, spatial frequency, and phase (Supplementary Fig. 2). The predicted response is a linear function of the contrast energy contained in quadrature wavelet pairs (Supplementary Fig. 3). Because contrast energy is a nonlinear quantity, this is a linearized model10. The model was able to characterize responses of voxels in visual areas V1, V2, and V3 (Supplementary Table 1) but it did a poor job of characterizing responses in higher visual areas such as V4.
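To make the structure of such a model concrete, the sketch below computes the contrast energy of one quadrature Gabor pair and forms a predicted response as a weighted sum of energies. It is a simplified illustration, not the model specified in the Supplementary Methods: the isotropic Gaussian envelope, the parameterization, the sum-of-squares energy, and the function names (`gabor_pair`, `contrast_energy`, `predict_response`) are all assumptions made for this example.

```python
import numpy as np

def gabor_pair(size, cx, cy, sigma, freq, theta):
    """Quadrature pair (even, odd) of Gabor wavelets on a size x size grid.
    cx, cy, sigma are in pixels; freq is in cycles/pixel; theta in radians."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    xr = (x - cx) * np.cos(theta) + (y - cy) * np.sin(theta)
    envelope = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
    phase = 2 * np.pi * freq * xr
    return envelope * np.cos(phase), envelope * np.sin(phase)

def contrast_energy(image, pair):
    """Sum of squared outputs of the two wavelets: nonlinear in the image,
    and approximately invariant to the phase of a matching grating."""
    even, odd = pair
    return np.sum(image * even) ** 2 + np.sum(image * odd) ** 2

def predict_response(image, pairs, weights, offset=0.0):
    """Linearized model: response is linear in the contrast energies."""
    features = np.array([contrast_energy(image, p) for p in pairs])
    return float(features @ weights + offset)

# Demo: a grating matched to the wavelet's orientation and frequency
# yields far more energy than an orthogonal grating.
size = 64
y, x = np.mgrid[0:size, 0:size].astype(float)
matched = np.cos(2 * np.pi * 0.1 * x)
shifted = np.sin(2 * np.pi * 0.1 * x)      # same grating, different phase
orthogonal = np.cos(2 * np.pi * 0.1 * y)
pair = gabor_pair(size, cx=32, cy=32, sigma=8, freq=0.1, theta=0.0)

e_matched = contrast_energy(matched, pair)
e_shifted = contrast_energy(shifted, pair)
e_orthogonal = contrast_energy(orthogonal, pair)
print(e_matched > e_orthogonal)
```

Estimating a voxel's model then amounts to fitting `weights` (one per wavelet pair) to the responses measured in the model estimation runs; the fitting procedure itself is given in the Supplementary Methods and is not reproduced here.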
Alternative receptive field models were also used, including the retinotopy-only model and several constrained versions of the Gabor wavelet pyramid model. Details on these models and model estimation procedures are given in Supplementary Methods.
Voxel activity patterns were constructed from voxel responses evoked by the images used in the image identification runs. For each voxel activity pattern, the estimated receptive field models were used to identify which specific image had been seen. The identification algorithm is described in the main text. See Supplementary Fig. 4 and Supplementary Methods for details concerning voxel selection, performance for different set sizes, and noise ceiling estimation. See Supplementary Discussion for a comparison of identification to the decoding problems of classification and reconstruction.
This work was supported by an NDSEG fellowship (K.N.K.), NIH, and UC-Berkeley intramural funds. We thank B. Inglis for MRI assistance, K. Hansen for retinotopic mapping assistance, D. Woods and X. Kang for acquisition of whole-brain anatomical data, and A. Rokem for assistance with scanner operation. We also thank C. Baker, M. D’Esposito, R. Ivry, A. Landau, M. Merolle, F. Theunissen, and the anonymous referees for comments on the manuscript. Finally, we thank S. Nishimoto, R. Redfern, K. Schreiber, B. Willmore, and B. Yu for their help in various aspects of this research.
Author Contributions: K.N.K. designed and conducted the experiment and was first author on the paper. K.N.K. and T.N. analyzed the data. R.J.P. provided mathematical ideas and assistance. J.L.G. provided guidance on all aspects of the project. All authors discussed the results and commented on the manuscript.
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
The authors declare no competing financial interests.