Finding a target in a visual scene can be easy or difficult depending on the nature of the distractors. Research in humans has suggested that search is more difficult the more similar the target and distractors are to each other. However, it has not yielded an objective definition of similarity. We hypothesized that visual search performance depends on similarity as determined by the degree to which two images elicit overlapping patterns of neuronal activity in visual cortex. To test this idea, we recorded from neurons in monkey inferotemporal cortex (IT) and assessed visual search performance in humans using pairs of images formed from the same local features in different global arrangements. The ability of IT neurons to discriminate between two images was strongly predictive of the ability of humans to discriminate between them during visual search, accounting overall for 90% of the variance in human performance. A simple physical measure of global similarity – the degree of overlap between the coarse footprints of a pair of images – largely explains both the neuronal and the behavioral results. To explain the relation between population activity and search behavior, we propose a model in which the efficiency of global oddball search depends on contrast-enhancing lateral interactions in high-order visual cortex.
Finding a target in a visual scene can be easy (like finding a fruit in a tree) or difficult (like finding a face in a crowd). Classic accounts of visual search assumed that search is efficient (occurs speedily through parallel processing of all items in the display) only if the target and distractors differ from each other with regard to local features such as are represented in primary visual cortex (Treisman and Gelade 1980, Treisman and Gormican 1988, Treisman and Sato 1990; Treisman 2006). Subsequent accounts incorporated the idea that search efficiency may depend in a graded fashion on the degree of difference between target and distractors as determined not only by local features represented in primary visual cortex but also by global attributes represented in high-order areas (Duncan and Humphreys, 1989, 1992; Wolfe and Horowitz, 2004; Hochstein and Ahissar, 2002). This idea has been difficult to assess because little is known about the representation of global image attributes in high-order visual cortex. To address this issue, we have carried out a study based on the use of images that contain the same local features and differ only at the level of global organization. Such images may be difficult (Fig. 1A–B) or easy (Fig. 1C) for humans to tell apart during oddball search. We have asked, for such image pairs, whether the ability of humans to tell them apart during visual search is correlated with the ability of neurons in macaque inferotemporal cortex (IT) to discriminate between them.
Neurons in IT, unlike those in low-order visual areas, possess receptive fields large enough to encompass an entire image (Op de Beeck and Vogels 2000) and are sensitive to the global arrangement of elements within the image (Kobatake and Tanaka 1994, Tanaka et al. 1991, Messinger et al. 2005, Yamane et al. 2006). Population activity in IT discriminates better between some images than others (Allred et al. 2005, De Baene et al. 2007, Kayaert et al. 2005a,b, Kiani et al. 2007, Lehky and Sereno 2007, Op de Beeck et al. 2001). Moreover, if a pair of images is well discriminated by population activity in IT then humans tend to characterize them as dissimilar (Allred et al., 2005, Kayaert et al., 2005b) and monkeys are able to tell them apart when comparing them across a delay (Op de Beeck et al., 2001). It might be supposed, in light of these observations, that population activity in IT should necessarily predict human search efficiency. However, this outcome is uncertain for two reasons. First, search for an item in an array may depend on brain mechanisms fundamentally different from those underlying inspection of a single item (Treisman and Gelade 1980, Treisman and Gormican 1988, Treisman and Sato 1990; Treisman 2006). Previous studies comparing monkey physiology and human behavior were based on behavioral tests involving the inspection of single items. Second, the representation of global image attributes in high-order visual cortex may differ between monkeys and humans. Previous studies comparing monkey physiology and human behavior employed images differing with regard to local features and so did not touch on this issue (Allred et al., 2005, Kayaert et al., 2005b).
Two rhesus macaque monkeys, one male and one female (laboratory designations Je and Ec) were used. All experimental procedures were approved by the Carnegie Mellon University Institutional Animal Care and Use Committee and were in compliance with the guidelines set forth in the United States Public Health Service Guide for the Care and Use of Laboratory Animals. Prior to the recording period, each monkey was surgically fitted with a cranial implant and scleral search coils. After initial training, a 2-cm-diameter vertically oriented cylindrical recording chamber was implanted over the left hemisphere in both monkeys.
At the beginning of each day's session, a varnish-coated tungsten microelectrode with an initial impedance of ~1.0 megohm at 1 kHz (FHC, Bowdoinham, ME) was introduced into the temporal lobe through a transdural guide tube advanced to a depth such that its tip was about 10 mm above IT. The electrode could be advanced reproducibly along tracks forming a square grid with 1-mm spacing. The action potentials of a single neuron were isolated from the multi-neuronal trace by means of a commercially available spike-sorting system (Plexon Inc., Dallas, TX). All waveforms were recorded during the experiments and spike sorting was performed offline using cluster-based methods. Eye position was monitored by means of a scleral search coil system (Riverbend Instruments, Birmingham, AL) and the x and y coordinates of eye position were stored with 1-ms resolution. All aspects of the behavioral experiment, including stimulus presentation, eye position monitoring, and reward delivery, were under the control of a computer running Cortex software (NIMH Cortex). Monkeys were trained to fixate for approximately 2 s on a red fixation cross while a series of stimuli appeared in rapid succession (stimulus duration = 200 ms; inter-stimulus interval = 200 ms) at an eccentricity of 2° (Fig. 3A). At the end of each trial, they were rewarded with a drop of juice for correctly maintaining fixation. Although the fixation window was large (4.2°), we found on post-hoc analysis that the gaze remained closely centered on the fixation cross throughout the duration of the trial. The average across sessions of the SD of horizontal and vertical gaze angle was 0.2°. All recording sites were in the left hemisphere. At the end of the data-collection period, the sites were localized by magnetic resonance imaging to the ventral bank of the superior temporal sulcus and the inferior temporal gyrus lateral to the rhinal sulcus at levels 14–22 mm anterior to the interaural plane in Je and 9–18 mm anterior in Ec.
All neuronal data analysis was based on the firing rate within a window extending from 50 ms to 300 ms after stimulus onset. A neuron was included in the database if and only if it was visually responsive as indicated by a significant difference (t-test: p < 0.05) between its firing rate in this window and its firing rate in a 50 ms window centered on stimulus onset with data collapsed across all stimuli in an experiment.
The key question, for each pair of images in every experiment, was how well neuronal activity discriminated between the two members of the pair. The index of discriminability was the mean across neurons of |A–B| where A and B were the mean firing rates elicited by the two stimuli. This is the Neuron Index given in Table 1. To compare the discriminability of two image pairs, we applied a paired t-test to distributions of index values obtained across the population of tested neurons. The p-values reported in Table 1 reflect the outcome of this test.
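As a sketch, the Neuron Index and the paired comparison between two image pairs could be computed as follows (the firing rates here are made up for illustration; `rates_pair_AB`, `neuron_index`, and the other names are ours, not from the study):

```python
import numpy as np
from scipy import stats

# Hypothetical firing rates (spikes/s): rows = neurons, columns = the two
# images of a pair. The values are simulated, not data from the study.
rng = np.random.default_rng(0)
rates_pair_AB = rng.normal(loc=[10.0, 14.0], scale=2.0, size=(50, 2))
rates_pair_CD = rng.normal(loc=[12.0, 12.5], scale=2.0, size=(50, 2))

def neuron_index(rates):
    """Mean across neurons of |A - B|: the paper's Neuron Index."""
    return np.mean(np.abs(rates[:, 0] - rates[:, 1]))

# To compare the discriminability of two image pairs, apply a paired t-test
# to the per-neuron |A - B| values (each neuron contributes one value per pair).
diff_AB = np.abs(rates_pair_AB[:, 0] - rates_pair_AB[:, 1])
diff_CD = np.abs(rates_pair_CD[:, 0] - rates_pair_CD[:, 1])
t, p = stats.ttest_rel(diff_AB, diff_CD)
```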
We collected and analyzed data from six adults, four male and two female, each of whom completed the entire battery of visual search experiments described below under a protocol approved by the Institutional Review Board of Carnegie Mellon University. In each experiment, the observer sat facing the screen with the right and left index fingers on two keys. The observer was instructed to maintain fixation on a central cross throughout each trial. The experiment consisted of responding with a key press to each of a succession of displays. On each trial, six stimuli appeared simultaneously at an eccentricity of 5° in a hexagonal array centered on fixation and arranged so that three items were to the right of fixation and three to the left at symmetric locations (Fig. 3B). The observer was instructed to press the key on the same side as the oddball as quickly as possible without guessing. The display was presented continuously until a response had occurred or until 5 seconds had elapsed. At 5.3 seconds, if no response had occurred, the trial was aborted and the next trial began after 0.5 seconds.
The key question was whether the search reaction time differed significantly between two pairs of images. Given a pair of images, A and B, we collapsed the data across cases in which the target was A and the distractor was B and cases in which the reverse was true. We did likewise for the second pair of images, C and D. The decision to collapse was justified by preliminary analysis indicating that there were no significant RT asymmetries. To assess the significance of the difference in RT between the two cases, we then performed a two-way ANOVA with RT as the dependent variable and with subject (six subjects) and image pair (AB or CD) as factors. The p-values reported in Table 1 reflect the outcome of this test.
The raw measure of discriminability between two images (the mean visual search RT) was smaller when the two images were more discriminable. For a more direct comparison to the neuronal data, we converted this to an index that increased with discriminability: Ibehav = (RToddball – RTbaseline)−1, where RToddball was the mean across participants of the reaction time to report the side of the oddball and RTbaseline was the mean across participants of the reaction time to report the location of a single 2° diameter white disk presented 5° to the left or right of fixation in a separate block of trials. This is the Search Index given in Table 1. Computing this index is tantamount to computing the strength of the input fed to an integrator that triggers a behavioral response when its output crosses a fixed threshold (Methods, Neural Net Model of Visual Search).
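A minimal sketch of this conversion (the function name and the example reaction times are illustrative, not values from Table 1):

```python
def search_index(rt_oddball_s, rt_baseline_s):
    """Ibehav = 1 / (RT_oddball - RT_baseline), in 1/s.

    Larger values mean the two images were more discriminable in search.
    rt_oddball_s: mean oddball-search reaction time (seconds).
    rt_baseline_s: mean baseline reaction time to a single salient disk.
    """
    return 1.0 / (rt_oddball_s - rt_baseline_s)

# Illustrative values (not from the paper):
# search_index(0.75, 0.35) -> 2.5
```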
Each individual part (red or green rectangle, asterisk or slotted rectangle, or upward or downward pointing chevron) was 1° wide and 0.5° tall. In a compound display, the distance from the base of the top part to the apex of the bottom part (the measure of inter-part distance used throughout this paper) was 0.25°. The compound stimuli are shown in the top row of Fig. 2.
The stimuli were presented at locations such that the center of a compound stimulus was 2° contralateral to fixation. For each of three cases (color, pattern and chevron), there were six stimuli: the two compound stimuli with transposed arrangement, the two individual parts at the upper location and the two individual parts at the lower location. We presented individual parts as well as compound stimuli so as to determine whether there was a systematic relation between the responses to the two as discussed in Supplementary Materials, Section 4. On a given trial, all six compound stimuli, all six parts at the upper location or all six parts at the lower location were presented. Each was presented for 200 ms followed by a 200 ms inter-stimulus interval. The order of the stimuli within the trial was random. During recording from a given neuron, each compound stimulus was presented twelve times, and each part was presented six times at the upper location and six times at the lower location.
Observers completed a separate block of trials for each pair of stimuli. For each of three stimulus sets (color, pattern and chevron), there was a block of trials. In each block, on randomly interleaved trials, each member of the pair appeared as an oddball at each location four times.
The stimuli are shown in the second row of Fig. 2. The individual chevrons were 1° wide. The distance between the chevrons (from the base of the top part to the apex of the bottom part) was 1.0° (far), 0.5° (middle) or 0.25° (near).
The stimuli were centered 2° contralateral to fixation. For each of three stimulus sets (far, middle and near), there were two stimuli with transposed arrangement. On a given trial, all six stimuli were presented. Each was presented for 200 ms followed by a 200 ms inter-stimulus interval. The order of the stimuli within the trial was random. During recording from a given neuron, each stimulus was presented twelve times.
Observers completed a separate block of trials for each pair of compound stimuli (far, middle and near). In each block, on randomly interleaved trials, each member of the pair appeared as an oddball at each location four times.
The stimuli are shown in the third row of Fig. 2. The chevrons were 0.5° wide (small), 1° wide (medium) and 2° wide (large). The distance from the base of the top part to the apex of the bottom part was 0.5° in all cases. Selection of these dimensions ensured that the small, medium and large stimuli in this set were scaled versions of the far, middle and near stimuli in Set 2.
With the exception that the stimuli differed in size rather than in distance, all neuronal recording procedures were the same as in experiment 2.
With the exception that the stimuli differed in size rather than in distance, all visual search procedures were the same as in experiment 2.
Representative stimuli are shown in the fourth row of Fig. 2. Each part was constructed by stacking vertically two semicircles and a quarter-circle. The height of each part was 2.25° and the width 0.5°. Compound stimuli were constructed by juxtaposing the concave faces of two parts with a distance between the inner edges of 1° (far), 0.5° (middle) or 0.17° (near). At each distance, there were four stimuli: two symmetric bug configurations (quarter-circles at top or at bottom) and two asymmetric worm configurations (quarter-circles at top left and bottom right or bottom left and top right).
The stimuli were centered 2° contralateral to fixation. On a given trial, the four stimuli with the same inter-part distance (far, middle or near) appeared two times each in random order, subject to the constraint that no stimulus appeared twice in succession. Each stimulus was presented for 200 ms followed by a 200 ms inter-stimulus interval. Recording continued until each stimulus had been presented twelve times.
At each distance (far, middle or near), there were two symmetric bug-like variants (quarter-circles at top or at bottom) and two asymmetric worm-like variants (quarter-circles at top left and bottom right or at bottom left and top right). This made for eight target-distractor combinations (four possible targets crossed with two possible variants of the distractor). Participants completed a separate block of trials for each condition (far, middle or near). The 48 randomly sequenced trials in each block conformed to the 48 cases obtained by crossing two vertical mirror-image variants of bug, two lateral mirror-image variants of worm, six target locations and two target identities (bug or worm). Thus each form (bug or worm) was presented as target four times at each location.
The stimuli were based on a design introduced by Julesz (1981). As noted by that author, the forms with S and IO topology represent different global arrangements of the same local elements. For example, each form contains two line terminators and four corners. As noted by subsequent authors, the discriminability of the two forms is affected by their aspect ratio (Enns 1986; Tomonaga 1999). The stimuli in the present experiment fell into five size groups across which the aspect ratio varied systematically:
|Size group|Aspect ratio (H:V)|Height|Width|
Each size group contained four stimuli: two mirror images with S topology and two mirror images with IO topology. For data analysis, the five size groups were sorted into two sets (fifth and sixth rows of Fig. 2):
|Set|Size groups|Selection principle|
|5|A, C, E|tall, square, wide with width held constant|
|6|B, C, D|tall, square, wide with perimeter held constant|
Perimeter here refers to the perimeter of the rectangular envelope (= 2*height + 2*width).
The stimuli were centered 2° contralateral to fixation. On a given trial, a set of four stimuli from the same size group appeared two times each in random order, subject to the constraint that no stimulus appeared twice in succession. Each stimulus was presented for 200 ms followed by a 200 ms inter-stimulus interval. Recording continued until each stimulus had been presented twelve times.
For each size group (A–E), there were two mirror-image variants with S topology and two mirror-image variants with IO topology. This made for eight target-distractor combinations (four possible targets crossed with two possible variants of the distractor for the given target). Participants completed a separate block of trials for each size group (A–E). The 48 randomly sequenced trials in each block conformed to the 48 cases obtained by crossing two mirror-image variants of the form with S topology, two mirror-image variants of the form with IO topology, six target locations and two target identities (S or IO). Thus each form (S or IO) was presented as target four times at each location.
To explore the possible nature of the causal linkage between visually selective neuronal activity and visual search reaction time, we developed a six-unit neural network (Fig. 8). The six units had receptive fields centered at the locations of the six hexagonally arrayed visual-search stimuli. The level of activation of each unit was the linear sum of visually driven bottom-up excitatory input and lateral inhibitory input from the five other units. All units were selective for the same stimulus. On any given simulated trial, one unit had a target in its receptive field and five had distractors. Due to symmetry, the five units with distractors exhibited the same level of activation. Thus the network could be described by two equations, one for the activation of the unit with the target in its receptive field (A1), and one for the activation of each of the five units with distractors in their receptive fields (A2):

A1 = M + d/2 − k*(5*A2)

A2 = M − d/2 − k*(A1 + 4*A2)
The first term in each equation represents the average of the activations elicited by the target and the distractor. We set M to 13.7 spikes/s (the average strength of the visual response across all IT neurons and all stimuli). The second term represents bottom-up stimulus-selective visual excitation. If simulating the response of a network of X-selective neurons to a display in which the target was an X and the distractors were diamonds, we set d to 4.8 spikes/s (the average across all neurons of the measured difference in the firing rates elicited by an X and a diamond). If simulating the response of a network of diamond-selective neurons to the same display, we set d to −4.8 spikes/s. For other simulated conditions, we did likewise, always basing d on the average discriminative signal as measured in IT. The third term represents lateral inhibition. We set k to 0.1. The particular choice of the value of k was not critical to the qualitative pattern of results.
We fed the output of each unit to an integrator. The reaction time (RT) was taken as the time following stimulus onset at which the first integrator – the one driven by the most active unit in the network – crossed threshold. In the case with which we were concerned – the case in which the target was the stimulus preferred by the units in the network – the reaction time was given by the equation:

RT = B + q/(A1 + c) (1)

where B was the baseline response time, q was the threshold and c was a heuristic constant. By transposing terms and utilizing the definition of the behavioral discrimination index used in the visual search experiments [Ibehav = 1/(RT − B)], we obtain the relation:

Ibehav = (A1 + c)/q (2)
Equations (1) and (2) taken together define the behavioral discrimination index, Ibehav, as a function of the neuronal discrimination index, d, with two free parameters, q and c. We adjusted the free parameters (lsqcurvefit function, MATLAB, MathWorks, Natick, MA) to obtain the best fit between 17 measured values of Ibehav and the values of Ibehav generated by the model when given as input the 17 corresponding neuronal discrimination indices (Fig. 5). The best fit was obtained with q = 0.54 and c = −10.5 spikes/s.
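The model can be sketched as follows. The equation forms for A1, A2, and the mapping from A1 to Ibehav follow the term-by-term description in the text and should be read as a reconstruction rather than as the published equations; the parameter values are those reported (M = 13.7 spikes/s, k = 0.1, q = 0.54, c = −10.5 spikes/s):

```python
import numpy as np

# Parameters reported in the text
M = 13.7   # mean visual response across neurons and stimuli (spikes/s)
k = 0.1    # lateral inhibition weight
q = 0.54   # fitted integrator threshold
c = -10.5  # fitted heuristic constant (spikes/s)

def target_unit_activation(d, M=M, k=k):
    """Solve the two-equation network for A1 (the unit with the target).

    Reconstructed from the description (a modeling assumption, not a quote):
        A1 = M + d/2 - k*(5*A2)
        A2 = M - d/2 - k*(A1 + 4*A2)
    """
    # Rearranged as a linear system in (A1, A2)
    A = np.array([[1.0, 5.0 * k],
                  [k,   1.0 + 4.0 * k]])
    b = np.array([M + d / 2.0, M - d / 2.0])
    A1, A2 = np.linalg.solve(A, b)
    return A1

def predicted_search_index(d):
    """Ibehav = 1/(RT - B) = (A1 + c)/q under the integrator assumption."""
    return (target_unit_activation(d) + c) / q
```

A larger neuronal discriminative signal d yields a more active target unit and hence a larger predicted search index, reproducing the qualitative pattern of Fig. 8.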
For each of the two members of an image pair, we first computed an orientation and spatial frequency power spectrum. This yielded a value for power at each point in a rectangular grid spanning 2D Fourier space. To these points we applied a Cartesian-to-polar transformation. Then we computed by interpolation the power at each point on a rectangular grid in the transformed space. This grid spanned 64 spatial frequencies from one cycle per frame to 64 cycles per frame in equal steps of 0.09 octaves and 61 orientations from −90° to 90° in equal steps of 3°. Next, we blurred the power values to simulate the bandpass characteristics of V4 neurons (David et al. 2006). The blurring function was a Gaussian with a bandwidth (full width at half height) of 1.2 octaves in the spatial frequency domain and 7.7° in the orientation domain. This step eliminated fine-grained patterns, present in the spectra of simple geometric figures, that vary in a non-monotonic fashion under continuous variation of properties such as inter-part distance. We then normalized the spectrum by scaling power at each point to the average across all points. This step, by ensuring that the volume under each surface had a value of one, eliminated accidental effects due to factors such as contrast and brightness. We proceeded to subtract one spectrum from the other. Rectifying and integrating over the resulting surface provided a scalar measure of the difference between the Fourier spectra. The value could range from zero (for identical images) to two (for images - such as orthogonal gratings - with no overlap in Fourier power space).
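A simplified sketch of this computation follows. It blurs and normalizes the 2-D power spectrum directly, omitting the Cartesian-to-polar resampling step, so the blur is isotropic in Fourier space rather than specified in octaves and degrees as in the text; all names are ours:

```python
import numpy as np
from scipy import ndimage

def fourier_difference_index(img1, img2, blur_sigma=(2.0, 2.0)):
    """Scalar difference between two images' Fourier power spectra, 0..2.

    Simplification of the procedure in the text: blur is applied on the
    Cartesian frequency grid rather than on a polar (frequency, orientation)
    grid with octave/degree bandwidths.
    """
    def spectrum(img):
        power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
        # Blur to mimic band-pass tuning (stand-in for the 1.2-octave /
        # 7.7-degree Gaussian of the text)
        power = ndimage.gaussian_filter(power, blur_sigma)
        return power / power.sum()   # normalize volume under spectrum to 1
    s1, s2 = spectrum(np.asarray(img1, float)), spectrum(np.asarray(img2, float))
    return np.abs(s1 - s2).sum()     # rectify and integrate: 0 .. 2
```

Identical images give 0; images with non-overlapping power, such as orthogonal gratings, approach the maximum of 2.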
To calculate the coarse footprint difference index, we first low-pass filtered each image of a pair, using a Gaussian blur function with a standard deviation 0.08 times the extent of the longer dimension. This reduced spectral power by more than 80% beyond a low-pass cutoff of 3 cycles per object. Next we normalized the volume under each image to a value of one. Finally, superimposing the images with their centers of mass in alignment, we created a difference image by pixel-wise subtraction. Rectification and integration of the pixel values in the difference image yielded a scalar index of the difference between the images that could in principle range from 0 to 2.
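The coarse footprint index as described could be sketched as follows (the integer-pixel center-of-mass alignment, with wrap-around at the image borders, is a simplification of the superposition step; the names are ours):

```python
import numpy as np
from scipy import ndimage

def footprint_difference_index(img1, img2, sigma_frac=0.08):
    """Coarse footprint difference: blur, normalize, align centers of mass,
    subtract pixel-wise, rectify, integrate. Range 0 (identical) to 2."""
    def footprint(img):
        img = np.asarray(img, float)
        sigma = sigma_frac * max(img.shape)       # SD = 0.08 x longer dimension
        blurred = ndimage.gaussian_filter(img, sigma)
        return blurred / blurred.sum()            # volume normalized to 1
    f1, f2 = footprint(img1), footprint(img2)
    # Align centers of mass by an integer-pixel circular shift
    # (a simplification: pixels wrap around at the borders)
    c1 = np.array(ndimage.center_of_mass(f1))
    c2 = np.array(ndimage.center_of_mass(f2))
    shift = tuple(np.round(c1 - c2).astype(int))
    f2 = np.roll(f2, shift, axis=(0, 1))
    return np.abs(f1 - f2).sum()
```

Mirror-image pairs, which have identical Fourier power spectra, nonetheless yield a nonzero footprint difference, which is the property the text uses to distinguish the two metrics.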
We carried out parallel single-neuron recording experiments in monkeys and visual search experiments in humans using six stimulus sets each of which consisted of three pairs of images (Fig. 2). The images in each pair contained an identical collection of local elements but differed in the arrangement of those elements. Thus discrimination between the members of the pair depended on registering their global organization. Within each set, the pairs differed with regard to some variable likely to affect the discriminability of the two members of the pair, for example, in set 1, the identity of the parts and, in set 2, the inter-part distance (Fig. 2).
In microelectrode recording experiments, we carried out testing with images from each set in a different block of trials (Fig. 3A). Over the course of a given block, each image in the set was presented twelve times. The number of neurons tested with each set is indicated in Table 1 (Neuron Count). The number varied from set to set because some neurons could not be held long enough for testing with all sets. Neurons in IT responded differentially to stimuli consisting of the same local elements in different global arrangements (Fig. 4). The selective responses were genuinely based on global arrangement as indicated by the fact that responses to parts could not predict responses to wholes (Supplementary Materials, Section 1). Furthermore, they were determined by relatively abstract global properties as indicated by the fact that a given arrangement was preferred consistently over substantial changes in the size of the parts and the distance between them (Supplementary Materials, Section 2). The strength of neuronal selectivity varied across image pairs. For example, in set 1, neurons differentiated poorly between compound images formed from colored parts (Fig. 4A and D) and patterned parts (Fig. 4B and E) but discriminated well between compound images formed from chevrons (Fig. 4C and F). As a measure of the ability of population activity in IT to discriminate between the images in each pair, we computed the average, across all visually responsive neurons in both monkeys, of the absolute difference between the mean firing rates elicited by the two images. The resulting neuronal discrimination index is provided in Table 1 (Neuron Index). All major trends demonstrated by this approach were confirmed by comparing counts of neurons showing individual significant effects (Supplementary Materials, Section 3) and were found to be present in data from each monkey considered individually (Supplementary Materials, Section 4). 
Population histograms demonstrated qualitatively the same trends as seen in the quantitative analysis (Supplementary Materials, Section 5).
In visual search experiments involving six human participants, we measured the mean reaction time to indicate the location (left or right visual field) of an oddball presented among distractors, where the oddball and the distractors were the two members of a pair (Fig. 3B). We collected data in a separate block of trials for each pair of images. In the course of a block, each member of the pair appeared as an oddball four times at each location. Oddball search was easy for some pairs of images and difficult for others. For example, in set 1, the images formed from chevrons (Fig. 1C) popped out from each other whereas the images formed from colored and patterned parts (Fig. 1AB) did not (for a demonstration that X and diamond stimuli popped out from each other according to the classical criterion for popout, see Supplementary Materials, Section 6). We also collected data in a baseline block of trials that required reporting the location (right or left visual field) of a single salient stimulus. The mean oddball reaction time for each pair and the mean baseline reaction time are presented in Table 1 (RT). For each image pair, we computed a behavioral discrimination index that became larger the more discriminable two images were and that was consequently directly comparable to the neuronal discrimination index. This had the form 1/(RT-B) where RT was the mean oddball reaction time for a given image pair and B was the mean baseline reaction time. This index is equal to the strength of the difference signal that would have to be fed to an integrator starting at time zero in order for its output to cross a fixed threshold and trigger a behavioral response at time RT (see Methods). The behavioral discrimination index for each pair is reported in Table 1 (Search Index).
It is evident from comparison of the neuronal and behavioral discrimination indices (Table 1) that they tended to vary in parallel across image pairs. To quantify this effect, we computed the correlation between the two measures across the entire set of image pairs (Fig. 5). The correlation was strongly positive (r = 0.95) and highly significant (p < 0.00000001). The precision of the correspondence is especially striking because (a) the physiological results for different image sets were obtained from populations of neurons that did not fully overlap, (b) data for each set, even when collected from the same neurons, were collected in different blocks of trials, which might conceivably have produced contextual effects, and (c) there was no prior reason to suppose that the relation between the indices would be linear. This outcome was robust across changes in the epoch during which the firing rate was computed, changes in the properties of the neurons selected for study and changes in the metric used to characterize neuronal selectivity (Supplementary Materials, Section 7). We conclude that the ability of humans to discriminate between image pairs with differing global organization closely parallels the ability of IT neurons to discriminate between them.
Why are some pairs of images differing in global arrangement well discriminated while others are not? Is there some identifiable metric of the difference in global arrangement between two images that can explain this outcome? We assessed the ability of two metrics of image difference to explain the data. One is based on the Fourier power spectrum and the other on the distribution of brightness across pixels in image space.
The discriminability of images differing in global organization might be proportional to the difference in their Fourier power spectra. This would be consistent with the idea proposed for area V4 that pattern selectivity arises from neurons' possessing restricted receptive fields in orientation and spatial frequency space (David et al. 2006, 2008). To explore this idea, we first computed for each pair of images used in this study a map of power in Fourier space (Fig. 6A–B). Next, we generated a difference map by subtraction (Fig. 6C). Finally, by rectifying and integrating across the difference map, we derived a scalar index of the difference between the images that could in principle range from 0 to 2. Further details are given in Methods. Upon comparing the Fourier difference index to the neuronal and behavioral discrimination indices across image pairs (Fig. 6D–E), we discovered that the correlation, although positive, was weak and did not attain significance (neuronal index: r = 0.40, p = 0.11; behavioral index: r = 0.39, p = 0.12).
The discriminability of images differing in global organization might be inversely proportional to the degree to which their footprints overlap when superimposed. Differences measured in image space, unlike those measured in the Fourier power domain, contain information about spatial phase. A pixel-based measure could thus explain the observation that IT neurons are able to discriminate between images that differ only with regard to spatial phase – for example mirror images (Rollenhagen and Olson, 2000). To explore this idea, we first low-pass filtered each image of a pair, using a Gaussian blur function with a standard deviation equal to 0.08 times the longer dimension (Fig. 7 A–B). This reduced the spectral power by more than 80% beyond spatial frequencies of 3 cycles per object. Next we created a difference image by pixel-wise subtraction (Fig. 7C), and rectified and integrated it to obtain a difference index that could range from 0 to 2 (see Methods). The full results are provided in Supplementary Materials (Section 10). Plotting the neuronal and behavioral discrimination indices against this index across image pairs (Fig. 7D–E) revealed highly significant positive correlations (neuronal index: r = 0.89, p < 0.000005; behavioral index: r = 0.89, p < 0.000005). The strength of the correlations was reduced by blurring the images with Gaussian blur functions having standard deviations greater or less than 0.08 (Fig. 7F). We conclude that a simple measure based on the difference between the coarse (3 cycle per object) footprints of two images predicts the ability of IT neurons and humans engaged in visual search to detect a difference in global organization.
There is a simple mechanism by which better discrimination between a pair of images at the level of IT could give rise to a shorter reaction time for detecting one shape among distractors having the other shape. It is based on stimulus-specific spatial-contrast-enhancing lateral interactions between neurons responding to images at different locations. Neurons throughout the visual system respond more strongly to a preferred stimulus presented in the classic receptive field if it is different from items in the surround than if it is the same (Constantinidis and Steinmetz 2005, Lee et al. 2002). Surround modulation in low-order visual areas is conditional on the nature of low-order features such as orientation (Nothdurft et al. 1999, Li 1999, Bair et al. 2003). It is possible, by analogy, that surround modulation in high-order visual areas depends on global attributes for which the neurons in those areas are selective. The effect would be to enhance the response to an oddball distinguished from distractors by its global attributes, with the degree of enhancement dependent on the degree to which neurons differentiate between the two.
As proof of principle, we studied a network consisting of six units selective for the same image, with receptive fields at the six locations of the hexagonal search array (Fig. 8). The level of activation of each unit is the sum of its bottom-up excitatory input and lateral inhibitory input originating from the other five neurons (Methods, Neural Net Model of Visual Search). If the oddball is the preferred stimulus and the distractors are non-preferred stimuli, then the level of activation of the unit with the oddball in its receptive field will depend on (a) the strength of bottom-up excitatory input driven by the presence of the preferred stimulus in its receptive field and (b) the strength of lateral inhibition driven indirectly by the presence of the non-preferred stimulus in the receptive fields of the other units. The greater the discriminative capacity of the units (the greater the difference between bottom-up excitation elicited by the preferred stimulus and bottom-up excitation elicited by the non-preferred stimulus) the greater the level of activation of the unit with the oddball in its receptive field. The strength of activation can be transformed into a reaction time by accumulating it through an integrator until a decision threshold is reached. The reaction time is shorter for stimuli well discriminated by the units (Fig. 8A) than for poorly discriminated stimuli (Fig. 8B). The model thus provides a mechanistic transformation of neuronal discrimination ability into reaction time. Fitting the two free parameters of the model to data obtained with 17 image pairs yielded an extremely good fit (dotted white curve superimposed on best-fit line in Fig. 5). The assumptions on which this demonstration rests – that there is stimulus-specific lateral inhibition and that a population of neurons with variable discriminative capacity can be modeled by a few neurons with discriminative activity equal to the average across the population – remain to be tested. 
Nevertheless the demonstration makes clear that a simple mechanism based on contrast-enhancing lateral inhibition could account causally for the relation between neuronal discriminability in IT and human visual search efficiency as observed in these experiments.
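The model's behavior can be illustrated with a minimal simulation. The inhibition weight and decision threshold stand in for the model's two free parameters; the numerical values here are illustrative, not the fitted values.

```python
import numpy as np

def oddball_reaction_time(r_pref, r_nonpref, w_inh=0.15, threshold=50.0):
    """Toy version of the six-unit lateral-inhibition search model.

    Six units tuned to the same image view the six locations of the
    hexagonal array. The unit with the oddball in its receptive field
    receives bottom-up drive r_pref; the other five receive r_nonpref.
    Each unit is suppressed in proportion to the summed activity of the
    other five (weight w_inh). The oddball unit's steady-state activity
    is then accumulated to a decision threshold; the number of steps to
    threshold is the modeled reaction time.
    """
    drive = np.array([r_pref] + [r_nonpref] * 5, dtype=float)
    act = drive.copy()
    for _ in range(200):  # relax to steady state under lateral inhibition
        act = np.maximum(drive - w_inh * (act.sum() - act), 0.0)
    steps = 0
    evidence = 0.0
    while evidence < threshold and steps < 10_000:
        evidence += act[0]
        steps += 1
    return steps
```

A well-discriminated pair (large gap between r_pref and r_nonpref) leaves the oddball unit less inhibited and so yields fewer steps to threshold than a poorly discriminated pair, reproducing the qualitative relation shown in Fig. 8.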
We have assessed the ability of neurons in monkey IT, and of humans engaged in visual search, to discriminate between images that differ exclusively at the level of global organization, which we define in the following manner. Imagine systematically scanning two images with a window of fixed diameter and characterizing each image as the collection of details seen through the window, without regard to the specific location of any detail. A pair of images differs solely at the global level if, to detect a difference between them, it is necessary to scan them with a relatively large window. For the image pairs in Fig. 1, discrimination would be just possible with a window having a diameter one quarter of the image's height and would become progressively more robust as the diameter increased beyond that limit. The first essential finding of this study is that IT neurons are more sensitive to some global differences than to others. The second essential finding is that the ability of IT neurons to discriminate between a pair of such images is correlated with the ability of humans to discriminate between them during visual search.
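The window definition can be made concrete with a toy test (ours, not a measure used in the study): collect every square patch of a given size from each image, discard patch positions, and compare the two collections. A pair of images differs purely globally at a given scale when windows of that size yield identical collections and only larger windows reveal a difference.

```python
import numpy as np

def same_through_window(img_a, img_b, window):
    """True if the two images are indistinguishable when each is
    characterized as the unordered collection of window-by-window
    patches it contains, i.e. if any difference between them lies
    beyond the reach of a window of this size."""
    def patch_bag(img):
        h, w = img.shape
        return sorted(img[i:i + window, j:j + window].tobytes()
                      for i in range(h - window + 1)
                      for j in range(w - window + 1))
    return patch_bag(img_a) == patch_bag(img_b)
```

For example, an image and its mirror image contain exactly the same pixels, so they pass this test at window size 1 yet fail it when the window spans the whole image: their difference is purely global at the one-pixel scale.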
The mere demonstration of a correlation between behavioral discrimination and neural discrimination is generally considered a sufficient endpoint in such studies (Allred et al. 2005; Edelman et al. 1998; Haushofer et al. 2008; Kayaert et al. 2005b; Op de Beeck et al. 2001). However, it is worthwhile to consider the underpinnings of the correlation by asking why, in physical terms, some image pairs are easier to discriminate than others. We explored this issue by applying to the images in our experimental set two measures that are “global” in the sense that they depend on the distribution of content across the image as a whole but “low level” in the sense that they depend only on convolving the image with simple filters. One of these, based on the Fourier power spectrum, provided a poor fit to the data. The other, based on the coarse footprint, provided a good fit. Although the coarse footprint measure worked well for images in this particular set, it is only a first step toward a general account of discrimination based on global structure. Explaining results obtained with high-pass-filtered images will require refining the measure (Supplementary Materials, Section 8). Furthermore, any complete approach will have to take perceptual organization into account (Palmer 1999, Kimchi et al. 2003). Neurons in IT are sensitive to figure-ground organization (Baylis and Driver 2001). So are humans engaged in visual search (Davis and Driver 1994; Enns and Rensink 1990; Rensink and Enns 1998; He and Nakayama 1992). Figure-ground organization may have played a role even in the present experiment. For example, the line segments in the X may have been construed as figures in their own right, whereas the line segments in the diamond may have been construed as the boundaries of an enclosed figure.
The simplest possible interpretation of the neuronal-behavioral correlation is (a) that humans possess an area homologous to IT, (b) that neurons in this area represent global image structure in a code equivalent to the code in IT, and (c) that humans base global search on activity in this area. These points seem plausible, but each must be qualified. (a) The human lateral occipital and fusiform regions are generally regarded as homologous to macaque IT on the grounds of their location and object-selective BOLD responses (Grill-Spector et al. 2001; Tootell et al. 2003; Orban et al. 2004; Pinsk et al. 2009), but no homology is certain. (b) The idea that representations in lateral occipital and fusiform cortex are similar to those in IT remains to be established. One approach to characterizing the object representation in humans is to assess the similarity of the BOLD activation patterns elicited by individual objects (Edelman et al. 1998; Haushofer et al. 2008; Williams et al. 2007). These similarity patterns in humans reveal a categorical structure that is correlated with the similarity patterns observed in monkey IT (Kriegeskorte et al. 2008). However, it is not clear whether this categorical structure reflects the geometry of objects or simply their category. (c) Even findings based on this approach would not clinch the argument that humans engaged in global search rely on representations in lateral occipital and fusiform cortex. Clinching it would require demonstrating selective impairment of visual search based on global image attributes after occipitotemporal injury. Visual search is certainly impaired in patients with visual agnosia arising from occipitotemporal damage (Ballaz et al. 2005; Foulsham et al. 2009; Humphreys et al. 1992; Kentridge et al. 2004; Saumier et al. 2002). However, no tests have involved targets and distractors similar to the ones used in this study.
We note finally that even if humans are guided by representations in IT-homologous cortex during visual search based on global attributes, it may still be the case that search based on local features depends on areas of lower order (Supplementary Materials, Section 11).
Feature integration theory (FIT) as propounded by Treisman and colleagues (Treisman and Gelade 1980, Treisman and Gormican 1988, Treisman and Sato 1990 and Treisman 2006) has been the dominant theoretical framework for understanding visual search over recent decades. The data that we have presented and the model with which we have been able to fit the data are, however, contrary in spirit to FIT. Our results are best understood as adding to an accumulating body of evidence that calls into question four intertwined assumptions embodied in FIT.
FIT posits that search is efficient if the target is distinguished from the distractors by a unique low-level feature but not if it is distinguished by a conjunction of features. In fact, search for some conjunctions is efficient and search for some features is not. Xs and diamonds (different conjunctions of identical parts and locations: Von der Malsburg 1999) pop out from each other (Supplementary Materials, Section 6). So do different conjunctions of the same spatial frequencies and orientations (Sagi 1988). Conversely, a target distinguished by a unique feature will not pop out if the featural difference from the distractors is too small (Bauer et al. 1996; Moraglia 1989; Nagy and Cone 1996; Nagy and Sanchez 1990).
FIT posits that search is efficient if the target is distinguished from the distractors by an attribute represented explicitly by neurons in primary visual cortex (Nothdurft et al. 1999, Li 1999) and not otherwise. In fact, some complex image attributes to which striate neurons are insensitive support popout (Wolfe and Horowitz 2004). These include the arrangement of parts in the image plane (Conci et al. 2006, Davis and Driver 1994, Heathcote and Mewhort 1993, Pomerantz et al. 1977, Rensink and Enns 1998). The occurrence of popout in these cases implies that visual areas outside striate cortex can guide efficient search (Hochstein and Ahissar, 2002).
FIT posits that two measured behavioral phenomena (search time independent of set size vs. increasing with set size) indicate two modes of search (parallel vs. serial). In fact, a purely parallel model can shift gradually from seemingly parallel to seemingly serial behavior as the difference between the target and the distractors decreases (Deco and Zihl 2006). During conjunction search, which FIT supposes to be serial, neuronal activity in monkey extrastriate cortex shifts steadily as attention converges on the target and not discontinuously as would be expected from serially allocating attention to different items (Bichot et al. 2005).
FIT posits that there is a qualitative difference between the representation of an image formed during preattentive vision (when attention is distributed across the array) and the representation formed during attentive vision (when the image is the sole object of attention). In fact, when multiple items appear in the visual field, all other things being equal, a neuron fires at a rate corresponding to the average of its responses to the individual items (Zoccolan et al. 2005). Attention to one of the items induces a quantitative increase in its degree of influence (Reynolds et al. 2000) but not a qualitative change such as would be expected if neurons registered only the collection of basic features of an object embedded in an array yet were sensitive to the conjunction of those features when the object appeared in isolation.
The results of our study and other observations as noted above agree in supporting an alternative to FIT put forward by Duncan and Humphreys (1989, 1992). In their scheme, all search is parallel and operates on sophisticated representations of objects. Search can occupy any point along a continuum of efficiency, with efficiency increasing as the target and distractors become more dissimilar. A gap in this theory has been the lack of an operational measure of dissimilarity. Our results suggest a plausible measure based on differences in neuronal activity in visual areas including high-order cortex homologous to IT.
Supported by RO1 EY018620, P50 MH084053 and the Pennsylvania Department of Health through the Commonwealth Universal Research Enhancement Program. MRI supported by P41 EB001977. We thank Karen McCracken for technical assistance.