We can rapidly recognize objects within 100–200 ms of seeing a complex scene [1, 2, 25]. Object recognition in multi-object images poses a challenging problem due to difficulties in segmentation, increased processing time and response attenuation [4–15]. Given these challenges, what are the neural mechanisms that underlie rapid recognition in multi-object images? Attention may help filter out “irrelevant” information, enhancing certain locations/features/objects. While attention plays an important role in crowded images [19–21], it remains difficult to explain the high performance during brief presentation of a novel image by serial attentional shifts [1–3, 14, 25]. Alternatively, the first sweep of information through the ventral stream may contain sufficient information to account for recognition in multi-object images. We evaluated this possibility by quantifying how well we can decode information from IFP recordings in human visual cortex in response to two-object images. We report that the representation in inferior occipital gyrus, fusiform gyrus and inferior temporal cortex can support object recognition even in the presence of a second object in the image. The rapid onset of the selective responses suggests that recognition in two-object images may not require additional computational steps.
The degree of response suppression reported here is lower than in previous studies [4–15] (see however [16, 17, 35]). Several non-exclusive factors may account for these differences, including the species (macaques versus humans), brain areas (it remains difficult to establish one-to-one homologies between macaques and humans), recording techniques (field potentials versus action potentials), tasks and stimulus characteristics (particularly the distance between objects and whether the two objects appear in the same hemifield [10, 13, 16]). The biophysical basis of the IFPs remains poorly understood. IFPs may reflect synaptic potentials averaged over many neurons [36]. We speculate that the IFPs may provide a “population view” that shows enhanced robustness to two-object images compared to individual neurons.
A possible mechanism to account for robustness to two-object images would be rapid attentional shifts and/or saccades to the electrode’s preferred category. While this possibility cannot be entirely ruled out here, it seems to be an unlikely account of our observations. (i) Subjects could not predict where to saccade before image onset (positions were randomized). Additionally, the response distributions were unimodal (Figure S1) and behavioral performance was indistinguishable across positions. These observations render it unlikely that the results could be accounted for by pre-onset fixation or spatial attention to one location. (ii) Given saccade times of 200–300 ms [37, 38] and latencies of 100–150 ms [24] relative to the 100 ms stimulus flash, physiological responses elicited by saccades would take place after ~300 ms. (iii) In one subject where we monitored eye position, we did not observe any differences in the responses or suppression index (SI) that could be explained by eye movements. (iv) We observed similar SIs when the analysis interval was restricted to [50;200] ms (Figure S2C). (v) Similar SIs were observed for electrodes that preferred faces or other categories (, S2B–C). Furthermore, in some cases, different electrodes in the same subject preferred different categories; a category-specific attentional account would necessarily fail to explain the responses in some electrodes (e.g. Figure S1B–C). (vi) The SIs during the initial 300 ms were unaffected by target presence (Figure S2C). (vii) The latencies to one-object images and two-object images were similar (). Taken together, observations (i)–(vii) do not rule out an attentional account of our findings but delimit the possible roles of attention. The physiological characterization of the spatial summation properties (, , S2), category preferences (, S4, S5), task demands (Figure S2C) and timing () places strong constraints on how attentional shifts should be incorporated into biophysically plausible computational circuits for visual recognition (e.g. [25]).
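The timing argument in point (ii) can be made concrete with a short calculation. The Python sketch below is illustrative only: the function and constant names are hypothetical, it assumes all times are measured from image onset, and it assumes that the 100–150 ms latencies [24] add directly to the 200–300 ms saccade times [37, 38]. Comparing the result with the [50;200] ms analysis interval of point (iv) is one reading of the text, not a reported result.

```python
# Illustrative timing arithmetic for points (ii) and (iv).
# Assumption: all times are in milliseconds relative to image onset.
SACCADE_TIME_MS = (200, 300)     # saccade times quoted in the text [37, 38]
LATENCY_MS = (100, 150)          # latencies quoted in the text [24]
ANALYSIS_WINDOW_MS = (50, 200)   # restricted analysis interval from point (iv)


def saccade_driven_response_window(saccade=SACCADE_TIME_MS, latency=LATENCY_MS):
    """Earliest and latest time at which a saccade-evoked response could begin.

    Assumes the two delays simply add (hypothetical simplification).
    """
    return (saccade[0] + latency[0], saccade[1] + latency[1])


def windows_overlap(a, b):
    """Return True if the closed intervals a and b share any time point."""
    return a[0] <= b[1] and b[0] <= a[1]


window = saccade_driven_response_window()
print(window)                                       # (300, 450)
print(windows_overlap(window, ANALYSIS_WINDOW_MS))  # False
```

Under these assumptions, saccade-evoked responses would begin no earlier than 300 ms after image onset, which falls outside the [50;200] ms interval.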
We used two relatively large objects surrounded by a gray background. As the amount of “clutter” in the image increases, visual recognition becomes more challenging and reveals serial attentional shifts. Therefore, we do not claim that the initial physiological signals in temporal cortex can account for visual recognition under all possible visual conditions. Our work does suggest, however, that the presence of two objects and modest response suppression do not completely disrupt visual recognition by the initial sweep of visually selective signals.