Previous work using functional magnetic resonance imaging (fMRI) has shown that the identities of isolated objects can be extracted from distributed patterns of activity in the human brain. Outside the laboratory, however, objects almost never appear in isolation; thus it is important to understand how multiple simultaneously-occurring objects are encoded in the visual system. Here we use multi-voxel pattern analysis to examine this issue, specifically testing whether patterns evoked by pairs of objects in the lateral occipital complex (LOC) showed an ordered relationship to patterns evoked by their constituent objects presented alone. Subjects viewed four categories of objects, presented either alone or in different-category pairs, while performing a one-back task that required attention to each item on the screen. Applying a “searchlight” pattern classification approach to identify voxels with the highest signal-to-noise ratios, we found that the responses to object pairs among these informative voxels were well-predicted by the averages of their responses to the corresponding component objects. We validated this relationship by classifying patterns evoked by object pairs based on synthetic patterns created by averaging patterns evoked by single objects. These results indicate that the representation of multiple objects in LOC is governed by response normalization mechanisms similar to those reported in the visual systems of several species, including macaques [3–6]. They also suggest a coding scheme that allows patterns of population activity to preserve information about multiple objects under conditions of distributed attention, facilitating fast object and scene recognition during natural vision.
In a block design, subjects viewed single objects in four categories (shoes, chairs, cars, or brushes), as well as object pairs containing objects from two categories. Several previous studies have shown that information about the category of viewed objects is present in distributed patterns of activity measured with fMRI [1, 7] and we first wished to replicate this finding as a means of validating the quality of our data. Figure 1A shows classification performance for single objects within standard functionally-defined regions of interest (ROIs) (see Supplemental Procedures). Consistent with previous work, classification accuracy was significantly above chance in LOC (two-tailed t-test, t(11) = 4.95, p = 0.0004). Classification accuracy was also above chance in the parahippocampal place area (PPA; t(11) = 2.77, p = 0.018) but not in the fusiform face area (FFA; t(11) = 1.56, p = 0.15) or a non-brain ROI (t(11) = 0.13, p = 0.89). (See Supplemental Results for additional classification analyses, including the impact of changes in stimulus position upon accuracy.)
We next assessed the accuracy of the classifier in distinguishing among object pairs (Figure 1B). For classification purposes, each unique object pair was treated as a distinct stimulus (e.g. chair+brush and car+brush were treated as different stimulus categories), producing six total pairs from the pool of four object categories. Classification accuracy for pairs was significantly above chance in LOC (t(11) = 4.68, p = 0.0007) but not in the PPA (t(11) = 1.04, p = 0.31), FFA (t(11) = 1.68, p = 0.12) or the non-brain ROI (t(11) = 0.86, p = 0.40). These results, along with whole-brain maps of local classification accuracy (Supplemental Figure 1), indicate that activity patterns in LOC reliably discriminate between object pairs as well as between single objects.
Do LOC patterns evoked by pairs bear any relationship to patterns evoked by their constituent objects? We first assessed the ability of a linear model to explain responses evoked by object pairs [3, 5]. For each voxel, we performed a linear regression of the responses to pairs against the sum of responses to their constituent objects. The procedure is illustrated in Figure 2A for a voxel with a strong linear relationship between responses evoked by pairs and single objects (R2 = 0.96).
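The per-voxel regression described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `regress_pairs_on_sums` is hypothetical, and it assumes one response value per single-object category and one per category pair for the voxel (six pairs from four categories).

```python
import numpy as np
from itertools import combinations


def regress_pairs_on_sums(single_betas, pair_betas):
    """Regress one voxel's pair responses on the summed single-object responses.

    single_betas: one response per object category (assumed length 4).
    pair_betas:   one response per category pair, ordered like
                  itertools.combinations(range(n_categories), 2).
    Returns (slope, intercept, r_squared).
    """
    single = np.asarray(single_betas, dtype=float)
    y = np.asarray(pair_betas, dtype=float)
    index_pairs = list(combinations(range(len(single)), 2))
    # Predictor: sum of the responses to the two constituent objects.
    x = np.array([single[i] + single[j] for i, j in index_pairs])

    slope, intercept = np.polyfit(x, y, 1)  # ordinary least-squares line
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return float(slope), float(intercept), float(r_squared)
```

Under this parameterization, a voxel whose pair response is exactly the average of its two single-object responses yields a slope of 0.5 and an intercept of zero.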
Many voxels had much lower R2 values, which could have reflected either a nonlinear relationship, or the impact of noise on a truly linear relationship. To differentiate between these possibilities, we used a searchlight classification technique to identify local voxel clusters that carried information about stimulus identity (see Experimental Procedures). We reasoned that searchlight clusters that most accurately differentiated among object pairs would contain voxels that were the most instructive of the “true” relationship between responses to pairs and constituent single objects. Therefore, if a linear model provides a good description of this relationship, we would expect to see R2 increase as a function of searchlight classification accuracy. (See Supplemental Results and Discussion for a detailed treatment of this approach.)
Figure 2B plots median R2 within each LOC searchlight cluster as a function of cluster classification rank for one subject. (We used classification rank, rather than raw classification accuracy, as the independent variable in order to facilitate averaging data across subjects, between whom overall classification accuracy varied.) For this subject, there was a clear trend toward higher R2 values as classification accuracy improved. This relationship was also apparent in R2 averaged across subjects (Figure 2C). To quantify this trend, we computed correlation coefficients between R2 and classification rank within LOC for each subject. All subjects had positive correlation coefficients and all but two were significantly greater than zero at a p < 0.05 threshold. Across subjects, mean correlation coefficients were significantly above zero (mean = 0.33, t(11) = 6.32, p = 0.00006). From the positive relationship between R2 and classification rank, we infer that responses to object pairs are well-approximated by a linear combination of responses to single objects. (Similar analyses for the PPA, FFA, and retinotopic cortex can be found in the Supplemental Results.) A permutation-based control analysis demonstrated that this relationship was not a trivial outcome of voxel selection (i.e. “peeking”; see Supplemental Results).
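A minimal sketch of the rank-based trend analysis for one subject is given below. The function name `rank_r2_correlation` is hypothetical, and two details are assumptions rather than facts from the text: ranks ascend with classification accuracy (so a positive coefficient means R2 rises with accuracy), and ties in accuracy are broken by cluster order.

```python
import numpy as np


def rank_r2_correlation(cluster_accuracy, cluster_median_r2):
    """Pearson correlation between classification rank and median R2.

    cluster_accuracy:  pair-classification accuracy per searchlight cluster.
    cluster_median_r2: median voxelwise R2 within each cluster.
    """
    accuracy = np.asarray(cluster_accuracy, dtype=float)
    median_r2 = np.asarray(cluster_median_r2, dtype=float)

    # Convert accuracies to ranks 1..n (1 = least accurate cluster).
    order = np.argsort(accuracy, kind="stable")
    rank = np.empty(len(accuracy))
    rank[order] = np.arange(1, len(accuracy) + 1)

    return float(np.corrcoef(rank, median_r2)[0, 1])
```

The per-subject coefficients produced this way could then be tested against zero across subjects, as in the t-test reported above.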
While R2 captures the quality of a linear relationship, it does not specify its parameters. To understand whether voxels in LOC obeyed any specific linear relationship between pair and single object responses, we examined the slope terms of the linear regressions described above. Figure 2D illustrates the relationship between classification rank and median slope for each searchlight position for one subject, while Figure 2E plots the same relationship averaged over all subjects. As with R2, median searchlight slopes increased as classification accuracy improved. More importantly, slope values among high performing clusters fell close to 0.5, indicating that pair responses were approximately the average of responses to their constituent single objects. This result echoes a previous finding by Zoccolan et al. that neuronal responses in macaque inferotemporal (IT) cortex to pairs of objects are well-predicted by the average of the single-object responses. Although the terminal slope value in Figure 2E was 0.62, this value was not significantly different from 0.5. Terminal slope values for LOC were fairly consistent across subjects, with 8 of 12 subjects’ values falling between 0.35 and 0.65. Furthermore, analysis of the distribution of residual error between actual pair responses and regression lines indicates that these results are more consistent with pair responses that are simple, rather than weighted, averages of responses to constituent single objects (see Supplemental Results).
Linear regression returns an intercept term in addition to slope, which was not significantly different from zero in the top 30 LOC searchlight clusters (t(11) = 0.76, p = 0.45). Thus, the responses to object pairs were truly the averages of responses to single objects, without any additional offset reflecting systematic differences in overall activity evoked by pairs and single objects.
The preceding analyses suggest that we may approximate the responses of LOC voxels to object pairs as the averages of responses evoked by their constituent objects. To test this assertion, we repeated the pair pattern classification procedure but replaced pair patterns in one half of the data with “synthetic” patterns that were the average of patterns evoked by the corresponding single objects. Patterns were limited to voxels that fell within the 30 highest performing searchlights in terms of pair classification, which typically afforded the highest average classification performance across subjects (Supplemental Figure 4). It is critical to note that although these voxels were selected on the basis of high pair classification in their searchlights, this criterion was completely independent of the responses to single objects that were used to construct synthetic patterns.
Classification accuracies using synthetic patterns are shown in Figure 3A. At a rate well above chance (t(11) = 8.54, p < 0.00001), the classifier was able to correctly identify patterns evoked by object pairs based on comparison to synthetic response patterns derived by averaging the single-object responses within each voxel. Although classification based on these synthetic patterns was not as accurate as classification based on actual pair patterns, it was significantly more accurate (t(11) = 5.45, p = 0.0002) than classification based on a set of “MAX” function synthetic patterns generated by taking the higher of each voxel’s responses to the two single objects comprising each pair. This is consistent with the idea that pair responses reflect linear rather than non-linear combinations of single-object responses.
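The two kinds of synthetic pattern compared here can be written compactly. The sketch below is illustrative only; the function name `synthetic_patterns` is hypothetical, and the inputs are assumed to be voxelwise response vectors for the two single objects, restricted to the same set of voxels.

```python
import numpy as np


def synthetic_patterns(pattern_a, pattern_b):
    """Build synthetic pair patterns from two single-object patterns.

    Returns a dict with:
      "mean": voxelwise average of the two single-object responses
      "max":  voxelwise maximum of the two single-object responses
    """
    a = np.asarray(pattern_a, dtype=float)
    b = np.asarray(pattern_b, dtype=float)
    return {
        "mean": (a + b) / 2.0,      # averaging model
        "max": np.maximum(a, b),    # "MAX" model: higher response per voxel
    }
```

Either synthetic pattern can then stand in for the measured pair pattern in one half of the data during classification.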
Our ability to classify pairs from single-object patterns suggests that inverting the operation should allow us to decode the identities of single objects from the pattern evoked by a pair. Reddy and Kanwisher found that classification accuracy for single objects was markedly degraded in LOC when a second object was present. The origin and nature of this “clutter cost” was unclear, however. Was information about the identity of objects actually lost? Or did the “cost” simply reflect the joint representation of both objects? Under the second scenario, we should be able to recoup clutter costs through appropriate decoding of patterns evoked by pairs.
We first assessed the impact of clutter in our own data by measuring classification accuracy for single objects within pairs. A correct classification decision was recorded when the Euclidean distance between the pattern evoked by a pair and the pattern evoked by one of its component objects (the “target” object for the purposes of classification only) was less than the distance between the pair pattern and the pattern for a comparison object not in the pair. Consistent with Reddy and Kanwisher, accuracy for single objects in pairs was significantly lower than accuracy for single objects by themselves (Figure 3B) in LOC (t(11) = 4.27, p = 0.0013), reflecting a substantial clutter cost.
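The decision rule for classifying a single object within a pair can be sketched as below. The function name `target_beats_comparison` is hypothetical, and the sketch does not show which half of the data each pattern comes from; that bookkeeping is left out for brevity.

```python
import numpy as np


def target_beats_comparison(pair_pattern, target_pattern, comparison_pattern):
    """Return True when the pair pattern lies closer (in Euclidean distance)
    to the target object's pattern than to the comparison object's pattern."""
    pair = np.asarray(pair_pattern, dtype=float)
    d_target = np.linalg.norm(pair - np.asarray(target_pattern, dtype=float))
    d_comparison = np.linalg.norm(pair - np.asarray(comparison_pattern, dtype=float))
    return bool(d_target < d_comparison)
```

Accuracy is then the proportion of such comparisons that come out correct.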
To recover this clutter cost, we assumed that patterns evoked by object pairs were the means of patterns evoked by their constituent objects. Accordingly, to extract the pattern evoked by a target object from a pair response, we subtracted a half-scaled version of the pattern evoked by the non-target object, and multiplied the resulting pattern by two. Applying this treatment produced a significant improvement in classification (Figure 3B; t(11) = 4.02, p = 0.002). Linear decoding using this approach recovered an average of 48% of clutter costs associated with the presence of a second object. This result confirms the claim that information about each individual object is embedded in the patterns evoked by object pairs.
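The decoding step follows directly from the averaging assumption: if pair = (target + non-target) / 2, then target = 2 × (pair − 0.5 × non-target). A minimal sketch, with the hypothetical function name `decode_target`, is:

```python
import numpy as np


def decode_target(pair_pattern, nontarget_pattern):
    """Estimate the target object's pattern from a pair pattern.

    Assumes the pair pattern is the voxelwise mean of the target and
    non-target patterns, so the estimate is 2 * (pair - 0.5 * nontarget).
    """
    pair = np.asarray(pair_pattern, dtype=float)
    nontarget = np.asarray(nontarget_pattern, dtype=float)
    return 2.0 * (pair - 0.5 * nontarget)
```

With noiseless data that obey the averaging rule exactly, this recovers the target pattern exactly; with real data the estimate also inherits noise from both the pair and the non-target patterns.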
The principal finding of this study is that under conditions of distributed attention, voxelwise patterns of activity in object-selective cortex evoked by pairs of objects are the average of the patterns evoked by the individual component objects. Consistent with this result, pair patterns could be decoded with high accuracy by reference to synthetic patterns generated by averaging the single-object responses. Conversely, subtraction of an appropriately-scaled version of the voxel pattern evoked by one object of a pair recovered the pattern evoked by the second object.
This work builds on and extends two previous findings. First, Zoccolan et al.  demonstrated that responses of object-selective neurons of macaque area IT to pairs of objects were precisely predicted by the average of responses to their constituent objects. Our results demonstrate that a similar averaging rule applies to human LOC. Second, Reddy and Kanwisher  demonstrated a clutter cost for classification of single, focally-attended objects when a second, unattended stimulus was present. Here we demonstrate that when the two objects are equally attended, a substantial portion of this cost for one object can be recouped if the response pattern to the second object is known.
These results potentially provide important insights into how visual recognition might proceed in the real world. The fact that objects in natural scenes almost always appear amidst other objects presents both a challenge and an opportunity for the visual system. The challenge is to identify single objects even when they are surrounded by the clutter of other stimuli. Attentional mechanisms might help solve this problem by boosting the neural response to attended objects while suppressing the neural response to unattended objects [9–11]. However, this suppression of unattended object response can potentially negate an important informational opportunity. Specifically, the multiple objects within the scene might, if considered together, convey information about the “gist” or “context” of the scene [12–15]. Behavioral studies indicate that humans can indeed extract this “gist” information very rapidly [12, 16]; furthermore, observers can report the identities of objects within a scene even after very brief presentation times that are unlikely to permit attention to be moved serially from object to object. Our results suggest a way in which the visual system might accomplish this feat. In particular, if the pattern evoked by a multiple-object scene is linearly related to the patterns evoked by its constituent objects, then “gist” might correspond simply to an initial hypothesis about the set of objects contributing to this overall pattern and a judgment about the category of scene that is most likely to contain such objects. Indeed, if there is a lawful relationship between the representations of a whole scene and of its component objects, then the same neural system can be used to represent both.
This reasoning explains why it would be advantageous for the visual system to maintain a linear relationship between single- and multiple-object responses, but it does not explain why the voxel patterns evoked by pairs resemble the average of single-object patterns. In its adherence to the mean, LOC appears to obey rules similar to those that have been described previously in a variety of visual areas in non-human primates [4, 6, 18–20], and which have traditionally been explained as an outcome of competition between stimuli for limited neural bandwidth [10, 11, 19, 21]. Our results suggest an alternative framing of this phenomenon in which response averaging reflects a normalization process that actively supports the coding of multiple simultaneous objects by avoiding the problems presented by saturation of neural responses. Because individual neurons have finite firing rates, pure summation of responses to multiple objects runs the risk of driving some neurons to saturation, particularly those which respond well to both objects. Once this happens, the population response to a pair of objects will no longer be a linear combination of the patterns evoked by each object by itself, and information about the identity of each object is lost. By scaling population responses by the number of stimuli present, normalization helps avoid this problem by ensuring that response saturation cannot be reached.
The presence of multistimulus normalization in LOC might also provide a window into its functional organization. Whenever response normalization has been found in macaque visual areas – such as with oriented bars in V2 and V4 [4, 18, 19], direction of motion in medial temporal (MT) and medial superior temporal (MST) cortex [6, 20], or shape in IT – the simultaneously presented stimuli have differed along some dimension that is “mapped” across the surface of the area under study (i.e., the individual stimuli presented by themselves activate spatially distinct clusters of neurons [23–26]). We speculate that this sort of mosaic-like organization might be a prerequisite to multistimulus normalization; if so then our results provide additional evidence that LOC neurons are clustered according to shape or category. (Indeed, such functional clustering might be necessary for multivoxel pattern analyses to work in the first place [27, 28].)
Finally, our data revealed two additional novel and somewhat surprising phenomena. First, the LOC territory that best encoded object pairs was largely identical to the LOC territory that best encoded single objects (see Supplemental Figure 1). In contrast, the PPA did not encode object pairs as reliably as LOC even though it did encode information about single objects. These findings are consistent with previous claims that LOC rather than the PPA is the primary region involved in encoding object identity information obtained from a visual display. Second, LOC response patterns did not distinguish between different spatial configurations of a pair (i.e. shoe over brush was indistinguishable from brush over shoe). This suggests that when attention is distributed evenly across a scene, object identity is encoded independently of object location in the ventral stream.
Stimuli were 60 photographic images (1.7° square) of common objects from four categories (brushes, cars, chairs, and shoes) with all background elements removed, which were presented in 15 s blocks (see Supplemental Procedures). In single-object blocks, 15 exemplars from the same object category were presented one at a time at a single screen position which was centered either 1.7° above or below the fixation point. In paired-object blocks, 15 exemplars from two categories (30 total) were presented two at a time, with exemplars from one category appearing in the top screen position and exemplars from the other category appearing in the bottom screen position. Within each scan run, each object category was presented twice in the single-object condition (once in the upper screen position and once in the lower screen position) and each category pairing was shown twice (corresponding to the two possible spatial configurations; e.g. top-brush/bottom-chair and top-chair/bottom-brush).
To ensure that attention was paid equally to all objects, subjects (N=12) performed a one-back repetition detection task while maintaining central fixation. In paired-object blocks the repetition could occur at either stimulus location, forcing subjects to attend to both.
Following standard preprocessing, fMRI data were passed to a general linear model implemented in VoxBo, from which voxelwise beta values associated with each stimulus condition were extracted (see Supplemental Procedures). Multi-voxel pattern classification was implemented with custom code written in Matlab, using an algorithm similar to Haxby et al. Briefly, response patterns were extracted for each ROI from each of the six experimental scans. Data were then divided into halves (e.g. even runs versus odd runs) and the patterns within each half averaged. A “cocktail” mean pattern (consisting of the average pattern across all stimuli) was calculated separately for each half of the data and then subtracted from each of the individual patterns before classification. Separate cocktails were computed for single-objects and paired-objects. No pattern normalization was applied at any point.
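The split-half averaging and cocktail subtraction can be sketched as follows. This is an illustrative Python version, not the original Matlab code; the function name `split_half_patterns` and the array layout (runs × conditions × voxels) are assumptions. It would be called separately for the single-object and paired-object conditions so that each set gets its own cocktail.

```python
import numpy as np


def split_half_patterns(run_patterns, half_a_runs, half_b_runs):
    """Average patterns within each half of the data and remove the cocktail mean.

    run_patterns: array of shape (n_runs, n_conditions, n_voxels) holding
                  voxelwise beta values for one ROI.
    half_a_runs, half_b_runs: run indices assigned to each half.
    Returns two arrays of shape (n_conditions, n_voxels).
    """
    data = np.asarray(run_patterns, dtype=float)
    halves = []
    for runs in (half_a_runs, half_b_runs):
        # Mean pattern per condition within this half.
        mean_patterns = data[list(runs)].mean(axis=0)
        # Cocktail: average pattern across all conditions in this half.
        cocktail = mean_patterns.mean(axis=0, keepdims=True)
        halves.append(mean_patterns - cocktail)
    return halves[0], halves[1]
```

For an even/odd split of six runs, the call would be `split_half_patterns(betas, [0, 2, 4], [1, 3, 5])`, where `betas` is a placeholder name for the extracted ROI data.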
Pattern classification proceeded through a series of pairwise comparisons between stimulus conditions. Correct classification decisions were recorded when Euclidean distance between the patterns evoked by Condition “A” in opposite halves of the data was shorter than between Condition “A” and Condition “B” in opposite halves of the data. This procedure was repeated for every possible stimulus pairing, and correct decisions were accumulated across every possible binary split of the six scan runs. Preliminary analysis showed that the Euclidean distance metric produced classification accuracies similar to a correlation-based classifier.
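A minimal sketch of the pairwise classification procedure is shown below. The function names `pairwise_accuracy` and `equal_splits` are hypothetical. The sketch assumes that each "binary split" divides the runs into two equal halves (ten unique splits for six runs), which the text does not state explicitly, and that the input patterns have already been averaged within each half and had the cocktail mean removed.

```python
import numpy as np
from itertools import combinations


def pairwise_accuracy(half_a, half_b):
    """Proportion of correct pairwise decisions for one split of the data.

    half_a, half_b: arrays of shape (n_conditions, n_voxels).
    A decision for conditions (A, B) is correct when A in one half is closer
    to A in the other half than to B in the other half.
    """
    n_conditions = half_a.shape[0]
    correct = 0
    total = 0
    for a in range(n_conditions):
        for b in range(n_conditions):
            if a == b:
                continue
            d_same = np.linalg.norm(half_a[a] - half_b[a])
            d_diff = np.linalg.norm(half_a[a] - half_b[b])
            correct += int(d_same < d_diff)
            total += 1
    return correct / total


def equal_splits(n_runs):
    """Yield each unordered split of the runs into two equal halves once."""
    runs = range(n_runs)
    for half in combinations(runs, n_runs // 2):
        if 0 in half:  # fixing run 0 in the first half avoids mirror duplicates
            yield half, tuple(r for r in runs if r not in half)
```

Overall accuracy would be obtained by accumulating correct decisions over all splits produced by `equal_splits(6)`.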
“Searchlight” voxel selection was implemented with custom Matlab code. For each voxel, we defined a spherical mask that included all other voxels within a 5 mm radius. Searchlight clusters near cortical boundaries were truncated to ensure that only voxels within the cortex were included. Similarly, when searchlights were used on predefined ROIs, searchlight masks were truncated where necessary so that only voxels within the ROI were included.
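The construction of a truncated spherical searchlight can be sketched as follows. This is an illustrative Python version rather than the original Matlab code. The function name `searchlight_voxels` is hypothetical, and the 3 mm isotropic voxel size used as a default is an assumption, since the voxel dimensions are not given here.

```python
import numpy as np


def searchlight_voxels(center, mask, radius_mm=5.0, voxel_size_mm=(3.0, 3.0, 3.0)):
    """Return the voxel coordinates that make up one searchlight.

    center:        (i, j, k) index of the central voxel.
    mask:          3-D boolean array marking permitted voxels (cortex or ROI);
                   voxels outside the mask are excluded, which truncates the
                   sphere at boundaries.
    radius_mm:     searchlight radius in millimetres.
    voxel_size_mm: voxel dimensions in millimetres (assumed value).
    """
    coords = np.argwhere(mask)  # (n_voxels, 3) indices inside the mask
    offsets_mm = (coords - np.asarray(center)) * np.asarray(voxel_size_mm)
    distances = np.linalg.norm(offsets_mm, axis=1)
    return coords[distances <= radius_mm]
```

With these assumed voxel dimensions, an untruncated searchlight contains the central voxel plus its face and edge neighbours.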
This work was supported by NIH grant EY-016464 to R.E.