Imagine a general brain-reading device that could reconstruct a picture of a person’s visual experience at any moment in time6. This general visual decoder would have great scientific and practical utility. For example, we could use the decoder to investigate differences in perception across people, to study covert mental processes such as attention, and perhaps even to access the visual content of purely mental phenomena such as dreams and imagery. The decoder would also serve as a useful benchmark of our understanding of how the brain represents sensory information.
How do we build a general visual decoder? We consider as a first step the problem of image identification3,7,8. This problem is analogous to the classic “pick a card, any card” magic trick: We begin with a large, arbitrary set of images. The observer picks an image from the set and views it while brain activity is measured. Is it possible to use the measured brain activity to identify which specific image was seen?
To ensure that a solution to the image identification problem will be applicable to general visual decoding, we introduce two challenging requirements6. First, it must be possible to identify novel images. Conventional classification-based decoding methods can be used to identify images if brain activity evoked by those images has been measured previously, but they cannot be used to identify novel images (see Supplementary Discussion). Second, it must be possible to identify natural images. Natural images have complex statistical structure9 and are much more difficult to parameterize than simple artificial stimuli such as gratings or pre-segmented objects. Because neural processing of visual stimuli is nonlinear, a decoder that can identify simple stimuli may fail when confronted with complex natural images.
Our experiment consisted of two stages. In the first stage, model estimation, fMRI data were recorded from visual areas V1, V2, and V3 while each subject viewed 1,750 natural images. These data were used to estimate a quantitative receptive field model10 for each voxel. The model was based on a Gabor wavelet pyramid11,13 and described tuning along the dimensions of space3,14,19, orientation, and spatial frequency21,22. (See Supplementary Discussion for a comparison of our receptive field analysis to those of previous studies.)
(Figure: receptive field model for a representative voxel.)
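As a rough illustration of this kind of encoding model, the following Python sketch builds a small bank of Gabor wavelets and fits linear weights for one voxel by least squares. All parameters here (filter size, the set of spatial frequencies, the number of orientations) are illustrative assumptions, not the study's actual values, and the spatial tiling of wavelet positions used in the full pyramid is omitted for brevity.

```python
import numpy as np

def gabor(size, freq, theta, phase):
    """One Gabor wavelet: an oriented sinusoid under a Gaussian envelope."""
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    xr = x * np.cos(theta) + y * np.sin(theta)              # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2 * (size / 4) ** 2))
    return envelope * np.cos(2 * np.pi * freq * xr + phase)

def gabor_features(image, freqs=(1/16, 1/8, 1/4), n_orient=4):
    """Contrast energy of the image at each (frequency, orientation) pair,
    computed from a quadrature pair of Gabor filters."""
    feats = []
    for f in freqs:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            even = np.sum(image * gabor(image.shape[0], f, theta, 0.0))
            odd = np.sum(image * gabor(image.shape[0], f, theta, np.pi / 2))
            feats.append(np.hypot(even, odd))               # phase-invariant
    return np.array(feats)

def fit_voxel_weights(train_images, voxel_responses):
    """Fit one voxel's receptive field model: linear weights on the Gabor
    features, estimated by least squares on the training images."""
    X = np.stack([gabor_features(im) for im in train_images])
    weights, *_ = np.linalg.lstsq(X, voxel_responses, rcond=None)
    return weights
```

In the study itself the models were estimated from responses to the 1,750 training images; under a model of this form, the predicted response of a voxel to a new image is simply the dot product of its weights with that image's feature vector.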
In the second stage, image identification, fMRI data were recorded while each subject viewed 120 novel natural images. This yielded 120 distinct voxel activity patterns for each subject. For each voxel activity pattern we attempted to identify which image had been seen. To do this, the receptive field models estimated in the first stage of the experiment were used to predict the voxel activity pattern that would be evoked by each of the 120 images. The image whose predicted voxel activity pattern was most correlated (Pearson’s r) with the measured voxel activity pattern was selected.
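The selection rule in this second stage can be sketched in a few lines of Python (a minimal illustration; the variable names are ours, not the study's):

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two voxel activity patterns."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

def identify(measured_pattern, predicted_patterns):
    """Return the index of the candidate image whose predicted voxel
    activity pattern correlates best with the measured pattern."""
    rs = [pearson_r(measured_pattern, p) for p in predicted_patterns]
    return int(np.argmax(rs))
```

With 120 candidate images, a decoder with no stimulus information would pick the correct index with probability 1/120 ≈ 0.8%.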
Identification performance was high for both subjects. For one subject, 92% (110/120) of the images were identified correctly (subject S1), while chance performance is just 0.8% (1/120). For a second subject, 72% (86/120) of the images were identified correctly (subject S2). These high performance levels demonstrate the validity of our decoding approach and indicate that our receptive field models accurately characterize the selectivity of individual voxels to natural images.
A general visual decoder would be especially useful if it could operate on brain activity evoked by a single perceptual event. However, because fMRI data are noisy, the results reported above were obtained using voxel activity patterns averaged across 13 repeated trials. We therefore attempted identification using voxel activity patterns from single trials. Single-trial performance was 51% (834/1620) and 32% (516/1620) for subjects S1 and S2, respectively; once again, chance performance is just 0.8% (13.5/1620). These results suggest that it may be feasible to decode the content of perceptual experiences in real time7,23.
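The benefit of trial averaging can be illustrated with a toy simulation (all numbers below are illustrative assumptions, not the study's measurements): averaging n independent trials shrinks the noise standard deviation by a factor of √n, which sharpens the correlation-based match.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_voxels, noise_sd = 120, 200, 6.0

# Hypothetical noiseless voxel patterns; the model predictions are
# assumed perfect here, so only measurement noise limits accuracy.
true_patterns = rng.standard_normal((n_images, n_voxels))

def identify(measured, predicted):
    # Pick the image whose predicted pattern correlates best (Pearson r).
    z = lambda v: (v - v.mean()) / v.std()
    return int(np.argmax([np.mean(z(measured) * z(p)) for p in predicted]))

def accuracy(n_avg):
    """Identification accuracy when each measurement averages n_avg trials."""
    correct = 0
    for i in range(n_images):
        trials = true_patterns[i] + noise_sd * rng.standard_normal((n_avg, n_voxels))
        correct += identify(trials.mean(axis=0), true_patterns) == i
    return correct / n_images

single, averaged = accuracy(1), accuracy(13)   # averaging 13 trials helps
```

The qualitative pattern, though not the exact percentages, mirrors the single-trial versus repeated-trial gap reported above.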
Factors that impact identification performance
We have so far demonstrated identification of a single image drawn from a set of 120 images, but a general visual decoder should be able to handle much larger sets of images. To investigate this issue we measured identification performance for set sizes up to 1,000 images. As set size increased 10-fold from 100 to 1,000, performance declined only slightly, from 92% to 82% (subject S1, repeated-trial). Extrapolation of these measurements (see Supplementary Methods) suggests that performance for this subject would remain above 10% even for a set size of ~10^11.3 images. This is more than 100 times larger than the number of images currently indexed by Google (~10^8.9 images; source: http://www.google.com/whatsnew/, June 4, 2007).
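As a quick check of the "more than 100 times larger" comparison between the two extrapolated quantities quoted above:

```python
# Ratio of the extrapolated identifiable set size (~10^11.3 images)
# to the quoted Google index size (~10^8.9 images).
ratio = 10 ** 11.3 / 10 ** 8.9
assert ratio > 100   # 10^2.4, roughly a 250-fold difference
```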
Early visual areas are organized retinotopically, and voxels are known to reflect this organization14,16,18. Could our results be a mere consequence of retinotopy? To answer this question we attempted identification using an alternative model that captures the location and size of each voxel’s receptive field but discards orientation and spatial frequency information. Performance for this retinotopy-only model declined to 10% correct at a set size of just ~10^5.1 images, whereas performance for the Gabor wavelet pyramid model did not decline to 10% correct until ~10^9.5 images were included in the set (repeated-trial, performance extrapolated and averaged across subjects). This result indicates that spatial tuning alone does not yield optimal identification performance; identification improves substantially when orientation and spatial frequency tuning are included in the model.
To further investigate the impact of orientation and spatial frequency tuning, we measured identification performance after imposing constraints on the orientation and spatial frequency tuning of the Gabor wavelet pyramid model (Supplementary Fig. 8). The results indicate that both orientation and spatial frequency tuning contribute to identification performance, but that the latter makes the larger contribution. This is consistent with recent studies demonstrating that voxels have only slight orientation bias1,2. We also find that voxel-to-voxel variation in orientation and spatial frequency tuning contributes to identification performance. This reinforces the growing realization in the fMRI community that information may be present in fine-grained patterns of voxel activity6.
To be practical, our identification algorithm must perform well even when brain activity is measured long after estimation of the receptive field models. To assess performance over time2,4,6,23, we attempted identification for a set of 120 novel natural images that were seen approximately two months after the initial experiment. In this case 82% (99/120) of the images were identified correctly (chance performance 0.8%; subject S1, repeated-trial). We also evaluated identification performance for a set of 12 novel natural images that were seen more than a year after the initial experiment. In this case 100% (12/12) of the images were identified correctly (chance performance 8%; subject S1, repeated-trial). These results demonstrate that the stimulus-related information that can be decoded from voxel activity remains largely stable over time.
Why does identification sometimes fail? Inspection revealed that identification errors tended to occur when the selected image was visually similar to the correct image. This suggests that noise in measured voxel activity patterns causes the identification algorithm to confuse images that have similar features.
Functional MRI signals have modest spatial resolution and reflect hemodynamic activity that is only indirectly coupled to neural activity24,25. Despite these limitations, we have shown that fMRI signals can be used to achieve remarkable levels of identification performance. This indicates that fMRI signals contain a considerable amount of stimulus-related information4 and that this information can be successfully decoded in practice.
Identification of novel natural images brings us close to achieving a general visual decoder. The final step will require devising a way to reconstruct the image seen by the observer, instead of selecting the image from a known set. Stanley and co-workers26 reconstructed natural movies by modeling the luminance of individual image pixels as a linear function of single-unit activity in cat LGN. This approach assumes a linear relationship between luminance and the activity of the recorded units, but this condition does not hold in fMRI27,28.
An alternative approach to reconstruction is to incorporate receptive field models into a statistical inference framework. In such a framework, receptive field models are used to infer the most likely image given a measured activity pattern. This model-based approach has a long history in both theoretical and experimental neuroscience29,30. Recently, Thirion and co-workers3 used it to reconstruct spatial maps of contrast from fMRI activity in human visual cortex. The success of the approach depends critically on how well the receptive field models predict brain activity. The present study demonstrates that our receptive field models have sufficient predictive power to enable identification of novel natural images, even for the case of extremely large sets of images. We are therefore optimistic that the model-based approach will make possible the reconstruction of natural images from human brain activity.