It is commonly thought that neurons in monkey inferotemporal cortex (IT) are conjunction selective – that a neuron will respond to an image if and only if it contains a required combination of parts. However, this view is based on the results of experiments manipulating closely adjacent or confluent parts. Neurons may have been sensitive not to the conjunction of parts as such but to the presence of a unique feature created where they abut. Here, we compare responses to two sets of images, one composed of spatially separate and the other of abutting parts. We show that the influences of spatially separate parts combine, to a very close approximation, according to a linear rule. Nonlinearities are more prominent – although still weak – in responses to images composed of abutting parts.
Inferotemporal cortex (IT) is the terminus of the ventral stream of visual areas in primates and as such plays a critical role in visual object recognition. The contribution of IT to object recognition is thought to hinge critically on its containing neurons that respond selectively to natural images. The mechanism of image selectivity is widely thought to be nonlinear on the basis of the classic observation that removing any part from a neuron’s preferred image can cause a dramatic reduction in the response (Perrett et al., 1982; Desimone et al., 1984; Tanaka et al., 1991; Kobatake & Tanaka 1994; Ito et al., 1995; Yamane et al., 2006). This phenomenon is commonly assumed to arise from a process whereby the conjunction of parts exerts a much stronger influence on neuronal firing than predicted from linear summation of the influences of the individual parts. However, it could also arise if the neuron were selective for some simple juxtapositional feature created where the parts abut. Nonlinear interactions between parts on closed surfaces may likewise indicate sensitivity to the smooth merger between the parts (Brincat and Connor, 2004, 2006; Yamane et al., 2008).
A few recent studies have circumvented this problem by measuring neuronal responses to objects composed of spatially separate parts (Baker et al., 2002; Baene et al., 2007). These have revealed statistically significant nonlinear part-part interactions in around 10% of IT neurons (Baker et al., 2002). However, this number is difficult to interpret. On the one hand, it may overestimate the contribution of nonlinear part-part interactions because the effects, although statistically significant, could be very weak. On the other hand, it may underestimate the rate of incidence of nonlinear interactions because it was based on testing each neuron with only a small set of parts. It is possible that the observed nonlinear interactions occurred when parts were especially ineffective or especially effective at eliciting neuronal activity. In fact, superadditive nonlinear interactions between ineffective and subadditive nonlinear interactions between effective parts occur in a common model of neuronal activation based on passing linearly summed inputs through a sigmoidal activation function (Koch, 1999).
To resolve this issue, we have assessed neuronal responses to objects composed of parts that cover a broad range in their individual ability to modulate neuronal activity and have analyzed the results using a measure of signal strength that allows for unbiased comparison between linear effects (dependent simply on which parts are present) and nonlinear effects (dependent on the particular combination of parts that is present). For objects containing spatially separate parts, we find that nonlinear effects occur and that when present they nudge neurons slightly toward conjunction selectivity. However, they are extremely weak, accounting for only around a twentieth of the modulation in firing rate induced by the stimulus, and develop only late in the response. For objects composed of confluent parts, nonlinear interactions are stronger but still not prominent, accounting for only around a quarter of the stimulus-induced modulation in neuronal firing rate.
Two rhesus macaque monkeys, one male and one female (laboratory designations Je and Ec), were used. All experimental procedures were approved by the Carnegie Mellon University Institutional Animal Care and Use Committee and were in compliance with the guidelines set forth in the United States Public Health Service Guide for the Care and Use of Laboratory Animals. All aspects of the behavioral experiment (stimulus presentation, eye position monitoring, and reward delivery) were under control of a computer running Cortex software (NIMH Cortex). Eye position was monitored by means of a scleral search coil system (Riverbend Instruments, Birmingham, AL). At the beginning of each day's session, a varnish-coated tungsten microelectrode with an initial impedance of ~1.0 megohms at 1 kHz (FHC, Bowdoinham, ME) was introduced into the temporal lobe through a transdural guide tube advanced such that its tip was roughly 10 mm above IT. The electrode was then advanced by use of a micromanipulator until phasic visual responses were observed. Action potentials of single neurons were isolated from the multi-neuronal trace using a commercially available spike-sorting system (Plexon Inc, Dallas TX). All waveforms were recorded during the experiments and spike sorting was performed offline using commercially available spike-sorting software (Plexon Inc, Dallas TX). At the end of the data collection period, recording sites were established by structural MRI to occupy the ventral bank of the superior temporal sulcus and the inferior temporal gyrus lateral to the rhinal sulcus (at levels ranging from A14–22 mm in Je and A9–18 mm in Ec relative to the interaural plane). The recording locations fell well within the range investigated in previous studies of shape selectivity in IT (Tanaka et al., 1991; Kobatake and Tanaka, 1994; Ito et al., 1995; Brincat and Connor, 2004; Yamane et al., 2008).
Upon identifying and isolating a neuron, we carried out preliminary testing by presenting (for 500 ms each at fixation) a series of items drawn from a 21-image library of parts (Fig. S1, supplementary material). We selected for use in systematic tests seven parts that seemed to elicit responses from the neuron covering a considerable range in strength and an eighth “blank” part: the unadorned end of the stem. The full stimulus set for each neuron consisted of 64 baton stimuli obtained by placing these eight parts in all possible combinations at the top and bottom of a vertical stem (Fig. 1). Shapes placed at the top and bottom were identical short of a 180° rotation. Each baton, consisting of a stem and two appended shapes, was approximately 2° wide and 6° tall. Batons were presented with the center of the rectangular stem at fixation. The responses of a total of 104 neurons were recorded in this manner.
We selected a neuron for recording if it appeared to respond to at least one of the hourglass stimuli. This was the only step of selection. Each stimulus in the hourglass set consisted of two parts, each of which was a subset of line segments of an equilateral triangle. Each segment of the triangle was 2.71° long and 0.22° thick. For testing in every neuron, we selected the exact same set of eight parts, namely the eight possible subsets of segments which together make up the triangle. Parts placed at the top and bottom were identical short of a 180° rotation. This resulted in a total of 64 stimuli obtained by combining the eight top and eight bottom parts in a combinatorial fashion. Stimuli were presented with the junction of the two triangles at fixation. The responses of 142 neurons were recorded in this manner. Among these, a subset of 58 neurons had responses to both the baton and hourglass stimulus sets.
We monitored neuronal activity during passive fixation. A trial began with onset of a small red fixation cross (measuring 0.4°) and attainment of fixation. After a delay of 200 ms, eight stimuli were presented in succession, each with a duration of 200 ms followed by an interval of 200 ms during which the fixation cross was again visible. After completion of the sequence, the display vanished and the monkey was rewarded with approximately 0.1 cc of water. Although the fixation window was large (4.2°), we found on post-hoc analysis that the gaze remained closely centered throughout the duration of each trial. The average across sessions of the standard deviation of horizontal and vertical gaze angle was 0.2°.
The trials conformed to 16 conditions. Each of the 16 conditions was identified with a set of eight stimuli. The 64 stimuli were divided into non-overlapping arbitrary groups of eight across conditions 1–8. On each trial conforming to a given condition, the stimuli associated with it were presented once each in random order. The fact that no stimulus was repeated within a trial minimized the potential for repetition suppression. The 64 stimuli were parceled out again across conditions 9–16 subject to the constraint that the eight stimuli in each condition had to have been drawn one each from conditions 1–8. Thus any two images presented in the same trial under conditions 1–8 were presented in separate trials under conditions 9–16. This minimized the potential for image-image interactions to influence the results. The 16 conditions were imposed in random order subject to the constraint that one trial conforming to each condition had to be completed successfully before a new block began. The fact that each stimulus was presented only twice within each block of 16 trials minimized the potential for repetition suppression. A complete session consisted of four blocks of 16 trials over the course of which each of the 64 stimuli was presented eight times.
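The two-pass assignment of stimuli to conditions can be sketched as follows. This is a minimal illustration and not necessarily the procedure actually used: it assumes the 64 stimuli are scattered at random over an 8 × 8 grid, with the rows serving as conditions 1–8 and the columns as conditions 9–16. The function name `build_conditions` and the seed are hypothetical.

```python
import numpy as np

def build_conditions(n_stimuli=64, group_size=8, seed=0):
    """Return 16 lists of 8 stimulus indices (conditions 1-8, then 9-16)."""
    rng = np.random.default_rng(seed)
    # Arbitrary grouping: shuffle the stimuli and lay them out on a grid.
    grid = rng.permutation(n_stimuli).reshape(-1, group_size)
    first_pass = [row.tolist() for row in grid]      # conditions 1-8
    # Each column takes exactly one stimulus from every row, so stimuli
    # that shared a trial in the first pass never share one in the second.
    second_pass = [col.tolist() for col in grid.T]   # conditions 9-16
    return first_pass + second_pass

conditions = build_conditions()
```

Under this layout every stimulus belongs to exactly two conditions, which matches the two presentations per block of 16 trials described above.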
The aim of this step of analysis was to compare main and interaction effects with respect to their ability to explain the variance of a neuron’s firing rate across stimuli. To do this accurately required bypassing a distortion inherent in measures of explained variance produced by an ANOVA. In an ANOVA assessing the dependence of firing rate on the identity of eight top and eight bottom parts, the underlying model possesses 64 degrees of freedom (one for baseline firing rate, seven for main effects of top, seven for main effects of bottom and 49 for interaction effects). Since the number of degrees of freedom in the data (responses to 64 stimuli) equals the number of degrees of freedom in the model, the model can always account for 100% of the explainable variance (i.e. the cross-stimulus variance) in firing rate. However, the variance explained contains terms that reflect both signal and noise: M=Ms+Mn, where M is variance explained by main effects, Ms is the moiety due to signal and Mn is the moiety due to noise; and likewise, for interaction effects, I=Is+In. The presence of the noise terms acts to move the observed ratio between main and interaction effects (M/I) closer to one than the true ratio (Ms/Is).
To bypass this problem, we adopted an alternative approach guaranteeing that the contribution of noise to each estimate of explained variance will fluctuate around zero. For each neuron’s odd-numbered trials, we performed an ANOVA with top and bottom part identity as factors and with firing rate in a window 50–350 ms after stimulus onset as the dependent variable. The underlying model contained eight coefficients reflecting the contributions of top parts, eight coefficients reflecting the contributions of bottom parts, and 64 coefficients reflecting the contributions of specific top-bottom combinations. From these coefficients, we generated three eight by eight matrices: a “top” matrix in which each row contained a coefficient (identical across columns) representing the contribution of the corresponding top, a “bottom” matrix in which each column contained one coefficient (identical across rows) representing the contribution of the corresponding bottom, and an interaction matrix in which the 64 interaction coefficients represented the contributions of the corresponding top-bottom pair. We analyzed the correlation of each of these matrices with the matrix of mean firing rates elicited by the 64 stimuli 50–350 ms after stimulus onset on even-numbered trials. We computed the fraction of variance across stimuli explained by each class of effect as sign(r)·r², where r was the corresponding correlation coefficient.
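The split-half procedure can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the original Matlab analysis: it assumes a balanced design in which each ANOVA coefficient is the deviation of a marginal (or cell) mean from the grand mean, and it takes as input two hypothetical 8 × 8 arrays, `odd_rates` and `even_rates`, holding the mean firing rate for each top-bottom combination on odd- and even-numbered trials.

```python
import numpy as np

def decompose(rates):
    """Split an (n_top, n_bottom) matrix of mean rates into three matrices."""
    grand = rates.mean()
    top = rates.mean(axis=1, keepdims=True) - grand      # one value per row
    bottom = rates.mean(axis=0, keepdims=True) - grand   # one value per column
    interaction = rates - grand - top - bottom
    return (np.broadcast_to(top, rates.shape),
            np.broadcast_to(bottom, rates.shape),
            interaction)

def signed_r2(predictor, target):
    """sign(r) * r^2 for the correlation between two matrices."""
    if np.std(predictor) == 0:        # no variation, nothing to explain
        return 0.0
    r = np.corrcoef(predictor.ravel(), target.ravel())[0, 1]
    return float(np.sign(r) * r ** 2)

def cross_validated_variance(odd_rates, even_rates):
    """Variance in even-trial rates explained by odd-trial coefficients."""
    top, bottom, interaction = decompose(odd_rates)
    return {"top": signed_r2(top, even_rates),
            "bottom": signed_r2(bottom, even_rates),
            "interaction": signed_r2(interaction, even_rates)}
```

Because the coefficients and the rates they are tested against come from different trials, noise in one half is as likely to correlate negatively as positively with the other half, which is why the signed measure averages to zero for pure noise.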
The impact of this correction was nontrivial. In the case of pure noise, with the firing rate on odd- and even-numbered trials unrelated to stimulus identity, the estimate of variance explained by main and interaction effects would be 22% and 78% respectively without correction whereas, with correction, it would be zero on average. For the neuron of Fig. 1, the uncorrected estimates of variance explained by main and interaction effects were respectively 86% and 14% whereas the corrected estimates were 77% and 2%. On the basis of the uncorrected measures, we would say that the main effects explained six times as much variance as the interaction effects. On the basis of the corrected measures, we would say 40 times as much.
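The uncorrected pure-noise figures are what one would expect if noise variance is spread evenly over the 63 degrees of freedom that distinguish the 64 stimulus means (14 for main effects, 49 for interactions). That even spread is an assumption of this check rather than something stated in the analysis. The ratios quoted for the neuron of Fig. 1 follow directly from the percentages given above.

```latex
\begin{align*}
\text{main: } \frac{7 + 7}{63} &\approx 0.22, &
\text{interaction: } \frac{49}{63} &\approx 0.78, \\
\text{uncorrected: } \frac{86\%}{14\%} &\approx 6.1, &
\text{corrected: } \frac{77\%}{2\%} &\approx 38.5 \approx 40.
\end{align*}
```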
In the linear model, the response $R_L$ to a baton containing parts $P_i$ at the top and $P_j$ at the bottom is given by:

$$R_L = c + a_i + b_j \tag{1}$$

where $a_i$ and $b_j$ represent the contributions of $P_i$ and $P_j$ respectively and $c$ is a baseline constant. Because the presence of the eighth top (or bottom) part implies the absence of the other seven, only seven degrees of freedom are associated with each part. This is reflected in the constraint that $a_1, a_2, \ldots, a_8$ must sum to zero, as must $b_1, b_2, \ldots, b_8$. Thus, the model possesses 15 degrees of freedom (seven for the top part, seven for the bottom part and one for the baseline firing rate). These parameters, derivable by linear regression, were obtained by means of a built-in Matlab function (anovan, Matlab 7.0, Natick MA).
The sigmoidal model transforms the output of the linear model, $R_L$, into the response $R_S$ using a sigmoidal function given by:

$$R_S = A \int_{-\infty}^{R_L} N(x;\mu,\sigma)\,dx$$

where $A$ is the amplitude of the sigmoid, and $N(x;\mu,\sigma)$ is the Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. The sigmoidal function is monotonic with a maximum of $A$ and minimum of zero. The point of half-maximum of the sigmoid and its rate of rise are controlled by the mean and standard deviation of the Gaussian respectively. Thus, there are three additional parameters ($A$, $\mu$ and $\sigma$) in the sigmoidal model compared to the linear model. We estimated the parameters of the sigmoidal model by minimizing the sum-of-squared error between the observed data and the model’s response using built-in functions in Matlab (lsqcurvefit, Matlab 7.0, Natick, MA).
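A fit of this kind might be set up as in the sketch below. Two assumptions are made here: the sigmoid is read as an amplitude-scaled cumulative Gaussian, and SciPy's `least_squares` stands in for Matlab's `lsqcurvefit`. The parameter packing and the function names (`unpack_linear`, `sigmoid_response`, `fit_sigmoid`) are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import norm

N_PARTS = 8  # eight top parts and eight bottom parts

def unpack_linear(theta):
    """Expand 15 free parameters into c, a (8 values) and b (8 values)."""
    c = theta[0]
    a_free = theta[1:N_PARTS]                 # seven free top coefficients
    b_free = theta[N_PARTS:2 * N_PARTS - 1]   # seven free bottom coefficients
    a = np.append(a_free, -a_free.sum())      # enforce the sum-to-zero constraint
    b = np.append(b_free, -b_free.sum())
    return c, a, b

def linear_response(theta):
    """8 x 8 matrix of linear-model responses: c + a_i + b_j."""
    c, a, b = unpack_linear(theta)
    return c + a[:, None] + b[None, :]

def sigmoid_response(theta):
    """Linear output passed through a scaled cumulative Gaussian (assumed form)."""
    amp, mu, sigma = theta[15:18]
    drive = linear_response(theta[:15])
    return amp * norm.cdf(drive, loc=mu, scale=abs(sigma) + 1e-12)

def fit_sigmoid(rates, theta0):
    """Minimize the sum of squared errors starting from the guess theta0."""
    residuals = lambda theta: (sigmoid_response(theta) - rates).ravel()
    return least_squares(residuals, theta0)
```

The returned object exposes the fitted parameters as `.x` and half the sum of squared residuals as `.cost`. As with any nonlinear fit, the result can depend on the starting guess `theta0`.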
For any data set, fits obtained using the sigmoidal model must be as good as or better than the fits from the linear model simply because the sigmoidal model has three additional parameters. We performed a partial-F test in order to determine whether the improvement obtained with the sigmoidal model was significantly in excess of the improvement expected from the extra degrees of freedom in themselves.
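The partial-F statistic for a pair of nested models can be computed as in the sketch below. The function name `partial_f_test` and its arguments are hypothetical, and the number of observations entering the test (`n_obs`) is left to the caller because it is not specified here.

```python
from scipy.stats import f as f_dist

def partial_f_test(sse_reduced, sse_full, n_params_reduced, n_params_full, n_obs):
    """Compare nested least-squares fits; return (F statistic, p-value)."""
    df_extra = n_params_full - n_params_reduced   # parameters added by the full model
    df_resid = n_obs - n_params_full              # residual degrees of freedom
    f_stat = ((sse_reduced - sse_full) / df_extra) / (sse_full / df_resid)
    p_value = f_dist.sf(f_stat, df_extra, df_resid)  # upper-tail probability
    return f_stat, p_value
```

For the comparison described above, the reduced model would be the linear model (15 parameters) and the full model the sigmoidal model (18 parameters), for example `partial_f_test(sse_linear, sse_sigmoid, 15, 18, n_obs)`.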
To quantify neuronal shape tuning, we computed rank at half-height of the curve obtained by plotting firing rate (normalized to the response to the best baton) against rank (from best to worst baton). To characterize the effect of removing the interaction effects, we recalculated the tuning curve using firing rates predicted by main effects alone. To characterize the effect of randomizing the relation between interaction effects and main effects, we recalculated the tuning curve after shuffling the interaction effects. Before shuffling, the neuron’s interaction effects formed an 8 × 8 matrix with rows and columns corresponding to tops and bottoms and with entries corresponding to the interaction coefficients from the ANOVA. The sum of the eight coefficients in any given row or column was zero by the definition of interaction effects as orthogonal to main effects. To maintain this constraint while disrupting the relationship between interaction and main effects, we randomly shuffled the values within each row and then within each column.
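The tuning-width measure can be sketched as follows. Several details are assumptions of this sketch: firing rates are normalized to the range from the worst to the best stimulus, the half-height crossing is located by linear interpolation between ranks, and main effects are taken as marginal means of a balanced 8 × 8 design. The function names are hypothetical.

```python
import numpy as np

def half_height_rank(rates):
    """Rank (1 = best stimulus) at which the normalized tuning curve falls to 0.5."""
    curve = np.sort(np.ravel(rates))[::-1].astype(float)    # best to worst
    curve = (curve - curve[-1]) / (curve[0] - curve[-1])    # scale to [0, 1]
    ranks = np.arange(1, curve.size + 1)
    # np.interp needs ascending x values, so both arrays are reversed.
    return float(np.interp(0.5, curve[::-1], ranks[::-1]))

def main_effects_only(rates):
    """Rates predicted from top and bottom main effects, without interactions."""
    grand = rates.mean()
    return (rates.mean(axis=1, keepdims=True)
            + rates.mean(axis=0, keepdims=True) - grand)
```

Comparing `half_height_rank(rates)` with `half_height_rank(main_effects_only(rates))` then indicates whether interaction effects sharpen tuning (smaller rank) or broaden it (larger rank).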
The model implicit in the ANOVA carried out on data from individual neurons is a “part×location” model in the sense that completely independent sets of coefficients embody shape selectivity at the top and bottom locations. This model, given by equation (1), has 15 free parameters. We compared this to a simpler “part” model based on the assumption that a neuron exhibits identical part selectivity with only a difference in gain at the top and bottom locations. The response $R_P$ of the part model to a baton with parts $P_i$ at the top and $P_j$ at the bottom is given by:

$$R_P = c + a_i + k\,a_j$$

where $c$ is a baseline constant, $a_i$ is the contribution of $P_i$, $a_j$ is the contribution of $P_j$ and $k$ is a constant scaling factor capturing the difference in gain between the top and bottom locations. This model has nine degrees of freedom (seven for part identity and two for the constants $k$ and $c$). We adjusted the parameters of the part model to obtain an optimal fit to each neuron’s data by minimizing the sum-of-squared errors between the predicted and observed responses (lsqcurvefit, Matlab 7.0, Natick MA). We then used a partial F-test to determine whether the improvement in fit attained by use of the “part×location” model as compared to the “part” model was significantly greater than expected simply from the increase in the number of free parameters.
The aim of this step of analysis was to calculate the strength of main and interaction effects as a function of time elapsed since stimulus onset. To do this accurately required correcting for a systematic inflation of signal strength occurring when an ANOVA is applied to the full body of raw data. To achieve this correction, we first performed for each neuron an ANOVA on data from odd-numbered trials, with firing rate in a window 50–200 ms after stimulus onset as the dependent variable and with the identity of the top part and the identity of the bottom part as factors. We then passed the firing rates in successive small bins on even-numbered trials through a filter based on the coefficients of the linear model generated by the ANOVA, using an approach such that the estimate of signal strength should hover around zero in the case of pure noise. The procedure was based on the reasonable assumption that stimulus selectivity, if present instantaneously, conformed to the pattern present on average across the 50–200 ms window.
The underlying model contained eight coefficients reflecting the contributions of top parts, eight coefficients reflecting the contributions of bottom parts, and 64 coefficients reflecting the contributions of specific top-bottom combinations. If any top part, bottom part or combination tended to increase the firing rate above (or decrease it below) the mean, the corresponding coefficient was positive (or negative). We converted each coefficient by a sign operation to +1 or −1. Coefficients of +1 (and −1) thus marked “preferred” (and “non-preferred”) parts or combinations of parts. We used the resulting arrays of eight, eight and 64 signed values as templates representing, respectively, the neuron’s pattern of top-part, bottom-part and combination selectivity during the 50–200 ms sampling interval.
For each 10 ms sampling period on even-numbered trials, we then performed an ANOVA with firing rate as the dependent variable and with top and bottom part identity as factors. This yielded eight coefficients reflecting the contributions of top parts, eight coefficients reflecting the contributions of bottom parts, and 64 coefficients reflecting the contributions of specific top-bottom combinations. We multiplied each coefficient by the corresponding signed value in the template. Then we computed the average of the resulting top-part terms, the average of the resulting bottom-part terms and the average of the resulting combination terms. Each average represented the degree by which the firing rate elicited by “preferred” stimuli exceeded the firing rate elicited by non-preferred stimuli. These three values can be thought of as reflecting how strongly on average across stimuli the influence of the top part in isolation, the influence of the bottom part in isolation and the influence of the particular combination pushed neuronal activity toward the pattern expected from the analysis carried out on odd-numbered trials. The main effect strength was calculated as the sum of the top-part and bottom-part averages.
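The template procedure can be sketched as follows. The sketch assumes a balanced design in which each coefficient is the deviation of a marginal (or cell) mean from the grand mean, and it operates on hypothetical 8 × 8 arrays of mean firing rate: `template_rates` from odd-numbered trials in the 50–200 ms window and `bin_rates` from even-numbered trials in a single 10 ms bin.

```python
import numpy as np

def anova_coefficients(rates):
    """Top (8), bottom (8) and interaction (8 x 8) deviations from the grand mean."""
    grand = rates.mean()
    top = rates.mean(axis=1) - grand
    bottom = rates.mean(axis=0) - grand
    interaction = rates - grand - top[:, None] - bottom[None, :]
    return top, bottom, interaction

def effect_strengths(template_rates, bin_rates):
    """Return (main effect strength, interaction strength) for one time bin."""
    # Templates: +1 for preferred, -1 for non-preferred parts or combinations.
    templates = [np.sign(c) for c in anova_coefficients(template_rates)]
    coefficients = anova_coefficients(bin_rates)
    # Average of the sign-weighted coefficients within each class of effect.
    top, bottom, interaction = (np.mean(t * c)
                                for t, c in zip(templates, coefficients))
    return top + bottom, interaction
```

Applying `effect_strengths` to each successive bin yields the time course of the two strengths. When the bin contains only noise, its coefficients are unrelated to the template signs and both values scatter around zero.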
To calculate the latency of main and interaction effects for the entire population of neurons, we took the distribution of the respective effect strengths across cells for each time bin, and identified the time bin in which it first deviated significantly from a mean of zero using a t-test (α = 0.01). A similar procedure was used to calculate the time of onset of the response itself as distinct from stimulus-selective activity.
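The population latency estimate can be sketched as below, assuming a two-sided one-sample t-test applied independently to each time bin. The function name `onset_bin` and the array layout (cells in rows, time bins in columns) are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_1samp

def onset_bin(strengths, alpha=0.01):
    """Index of the first bin whose across-cell mean differs from zero.

    strengths: array of shape (n_cells, n_bins) holding one effect strength
    per neuron and time bin. Returns None if no bin reaches significance.
    """
    p_values = ttest_1samp(strengths, popmean=0.0, axis=0).pvalue
    significant = np.flatnonzero(p_values < alpha)
    return int(significant[0]) if significant.size else None
```

Multiplying the returned index by the bin width (10 ms in the analysis above) and adding the start time of the first bin converts it to a latency.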
We monitored the activity of neurons in the left inferotemporal cortex of two monkeys during passive viewing of images from two stimulus sets. Each stimulus set consisted of 64 stimuli created by combining eight top parts and eight bottom parts. Shapes used at the bottom were identical to those used at the top short of a 180° rotation. In the baton stimulus set, each image consisted of two parts joined by a vertical stem (Fig. 1). One of the eight parts was the unadorned end of the stem. The remaining seven were selected from a library of 21 available parts (Fig. S1, supplemental material) on the basis that, when presented in isolation during preliminary testing, they appeared to elicit a response from the neuron under study. In the hourglass stimulus set, the parts were the eight shapes (one a blank) obtained by hiding or displaying the three line segments of an equilateral triangle (Fig. 2). These sets differ with regard to the way in which an interaction effect (reflecting neuronal sensitivity to a particular combination of top and bottom parts) could arise. In the case of the baton set, because the parts are small relative to the object and are physically separate, any interaction effect would almost certainly arise from neuronal sensitivity to the combination of parts as such. In the case of the hourglass set, because the parts are large relative to the object and are contiguous, an interaction effect might also arise from neuronal sensitivity to a local feature (for instance a vertex) or a coarse shape (for instance a Z) created by the juxtaposition of two parts.
The response of an IT neuron to a full set of baton stimuli is shown in Fig. 1. This neuron clearly preferred some top parts over others (stimuli are arranged in order from those containing the least effective top part – row 1 – to those containing the most effective top part – row 8). Likewise, it preferred some bottom parts over others (stimuli are arranged in order from those containing the most effective bottom part – column 1 – to those containing the least effective bottom part – column 8). The most effective of the 64 stimuli was the baton containing the preferred top and the preferred bottom – as expected if the influences of the parts summed linearly. However, this does not rule out the occurrence of subtle nonlinear effects. To characterize linear and nonlinear effects, we performed an ANOVA (α = 0.05) with firing rate (in an interval from 50 to 350 ms) as the dependent variable and with the shape at the top (eight levels) and the shape at the bottom (eight levels) as factors. In this analysis, main and interaction effects represent respectively linear and nonlinear interactions between parts. The neuron of Fig. 1 exhibited significant main effects of top part and bottom part (p = 0) as well as a significant interaction effect (p = 0.004). However, the main effects were preponderant, accounting for 77% of the explainable variance in the firing rate as compared to 2% for the interaction effect.
The distribution across 104 neurons of significant effects elicited by baton stimuli is summarized in Fig. 3A. A majority of neurons (91%) exhibited a main effect of top and/or bottom. A subset of these neurons (33%) also exhibited an interaction effect. To compare the strengths of main and interaction effects, we split the data into halves, and calculated how much of the variance in one half was explained by main or interaction effects measured in the other half. This procedure guarantees that variance explained in the case of pure noise will fluctuate around zero and thus provides a level playing field on which to compare main and interaction effects. Across neurons that exhibited at least one significant effect, either main or interaction (n = 95), main effects accounted for a markedly larger fraction of variance (36% on average) than did interaction effects (2% on average). This pattern was highly consistent across neurons (Fig. 3B) and was statistically significant across data from both monkeys (paired t-test, p = 2.0e−26). It persisted when consideration was restricted to 32 neurons demonstrably sensitive to both top and bottom as indicated by the presence of two main effects and an interaction effect (main and interaction effects accounted for 51% and 6% of variance respectively; this difference was significant: paired t-test, p = 2.4e−14). Finally, it persisted and was statistically significant upon analysis of data from individual monkeys (variance accounted for by main and interaction effects: 36% and 2%, significantly different at p = 1.6e−13 for 48 neurons in monkey Je; 36% and 3%, significantly different at p = 3.5e−14 across 47 neurons in monkey Ec).
The response of an individual IT neuron to the full set of hourglass stimuli is shown in Figure 2. This neuron clearly preferred some top parts over others and some bottom parts over others. However, the most effective stimulus was not the one obtained by combining the best top part and best bottom part, contrary to the expectation based on simple linear summation. An ANOVA carried out on data from this neuron revealed significant main effects of top part and bottom part and a significant interaction effect (p = 0). The main effects accounted for 79% of the explainable variance in firing rate as compared to 11% for the interaction effect.
The distribution across 142 neurons of significant effects elicited by hourglass stimuli is summarized in Fig. 3C. A majority of neurons (92%) exhibited a main effect of top and/or bottom. A smaller but substantial fraction (54%) exhibited an interaction effect. To compare the strengths of main and interaction effects, we analyzed data from neurons that exhibited at least one significant effect (n = 132). Main effects accounted for a larger fraction of variance (22% on average) than interaction effects (7% on average). This pattern was reasonably consistent across neurons (Fig. 3D) and was statistically significant (paired t-test, p = 1.0e−18). It persisted when consideration was restricted to 64 neurons demonstrably sensitive to both top and bottom as indicated by the presence of two main effects and an interaction effect (main and interaction effects accounted for 32% and 11% of variance respectively and this difference was significant: paired t-test, p = 4.7e−12). Finally, it persisted and was statistically significant upon analysis of data from individual monkeys (variance accounted for by main and interaction effects: 22% and 8%, significantly different at p = 4.5e−8 for 65 neurons in monkey Je; 22% and 6%, significantly different at p = 1.5e−12 across 67 neurons in monkey Ec).
The results described above indicate that neuronal responses to stimuli in both sets were dominated by linear effects, reflecting independent sensitivity to the identity of the top part and the identity of the bottom part, in contrast to nonlinear effects reflecting sensitivity to specific combinations of parts. The hourglass set differed from the baton set in that interaction effects were more frequently significant (chi-squared test, p = 0.0012) and were stronger on average (t-test on fraction of firing-rate variance explained, p = 0.00007). This pattern persisted and was statistically significant upon analysis of data from individual monkeys (percent variance accounted for by interactions in baton and hourglass: 2% and 8%, significantly different at p = 0.001 for neurons from monkey Je; 3% and 6%, significantly different at p = 0.02 for neurons from monkey Ec). The greater incidence of interaction effects in hourglass stimuli may be attributable to cases in which neurons were sensitive to local features and coarse forms created by the juxtaposition of specific pairs of hourglass parts. That interaction effects were relatively weak even though they could have arisen by such a mechanism adds weight to the overall conclusion that linear effects predominate.
Interaction effects, although relatively weak, could still exert a systematic impact on neuronal stimulus selectivity. In particular, they could either sharpen or broaden the shape tuning curve. To explore this issue, we constructed a tuning curve for each neuron, plotting firing rate against stimulus rank – which ranged from 1 for the most effective baton to 64 for the least effective, with firing rate normalized to the range from the worst to the best baton. We carried out this procedure both on the observed firing rates and on firing rates obtained by taking into account main effects alone. We found (Fig. 4A) that the average of the tuning curves actually observed (thick line) was narrower than the average of the tuning curves obtained after interaction effects had been removed (thin line) or shuffled (broken line). To assess the consistency of this outcome across neurons, we computed, for each neuron, the stimulus rank at half height in the observed data and in the data with interaction effects removed (Fig. 4B). The tendency for the observed tuning curve to be narrower than the tuning curve without interaction effects was highly significant (observed mean = 18.2; interaction-absent mean = 23.0; p = 4e−6, paired t-test).
We performed an identical analysis for the hourglass set (Fig. 4C–D). The observed tuning curves (Fig. 4C, thick line) were narrower than the tuning curves obtained after interaction effects had been removed (thin line) or shuffled (broken line). Tuning, as measured by rank at half height, was sharper for the observed response than for the response predicted from main effects alone (Fig. 4D; observed mean = 17.6, interaction-absent mean = 27.0; p = 1e−20, paired t-test).
We conclude that nonlinear interactions, although comparatively rare and weak, are systematic in that they consistently sharpen the pattern of selectivity arising from main effects. The sharpening was greater for the hourglass set compared to the baton set (average difference in rank at half-heights: 9.4 for hourglass, 4.7 for baton, p = 0.0004, t-test) as might be expected in light of the greater frequency and strength of hourglass interaction effects.
Nonlinear interactions might arise from a process in which the influences of the top and bottom parts are first summed linearly and then processed through a nonlinear sigmoidal activation function. This would give rise to a systematic relation between the efficacy of the parts and the nature of the nonlinear effect. For weakly effective parts, a threshold effect would occur: the response to the combination would be greater than predicted by linear summation. This would result in narrower tuning than for a linear activation function. For strongly effective parts, a saturation effect would occur: the response to the combination would be less than predicted by linear summation. This would result in broader tuning than for a linear activation function.
To search for such a pattern, we plotted the observed responses to the 64 batons against the responses predicted from main effects alone. We did this both using the raw firing rate (Fig. 5A) and using the firing rate normalized, for each neuron, to the response elicited by its best baton (not shown). In the resulting plots, there was no hint of the pattern expected from passing the summed input through a sigmoidal activation function. The large white symbols, representing the running average of the surrounding cloud of data points, cleave to the identity line throughout each plot, whereas, according to the sigmoidal hypothesis, they should have been above it to the left and below it to the right of each graph.
To assess whether the responses of at least a few neurons could be explained by such a mechanism, we fit the responses of each neuron to two models: (1) a linear model in which the response is that predicted by the main effects (15 parameters) and (2) a sigmoidal model in which the output of the linear model is passed through a sigmoidal activation function (15 linear parameters + 3 sigmoidal parameters). Then, for each neuron, we performed a partial-F test to determine whether the improvement in fit afforded by the sigmoidal model was significantly greater than the improvement expected simply from adding three extra degrees of freedom. The cumulative distribution of the probability of the null hypothesis (sigmoidal model no better than linear model) is shown in Fig. 5B. The fraction of neurons (4%) in which the sigmoidal model yielded a significantly better fit (p < 0.05) was no greater than that expected from type I errors (5%).
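The model comparison rests on the standard partial-F statistic for nested models. The sketch below shows only the arithmetic, under stated assumptions: the function name `partial_f` and the residual sums of squares in the example are hypothetical, and the number of observations (here 64, one mean response per baton) is an assumption rather than a detail taken from the Methods.

```python
def partial_f(rss_reduced, rss_full, p_reduced, p_full, n_obs):
    """Partial-F statistic comparing a reduced model nested in a full model.

    rss_*  : residual sum of squares of each fitted model
    p_*    : number of free parameters of each model
    n_obs  : number of observations used in both fits
    Returns (F, df_numerator, df_denominator).
    """
    df_num = p_full - p_reduced
    df_den = n_obs - p_full
    if df_num <= 0 or df_den <= 0:
        raise ValueError("need p_reduced < p_full < n_obs")
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Linear model (15 parameters) versus sigmoidal model (15 + 3 parameters).
# The residual sums of squares are made-up numbers for illustration only.
f_stat, df_num, df_den = partial_f(100.0, 92.0, 15, 18, 64)
print(f"F({df_num}, {df_den}) = {f_stat:.3f}")
```

The p-value is the upper-tail probability of the F distribution with these degrees of freedom; it can be obtained, for example, from `scipy.stats.f.sf(f_stat, df_num, df_den)`.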
We performed an identical analysis on the hourglass data (Fig. 5C–D). A plot of the observed firing rate against the firing rate predicted by main effects did not reveal any hint of the pattern expected from passing inputs through a sigmoidal activation function (Fig. 5C). The fraction of neurons significantly better fit by a sigmoidal model than by a linear model (9%; see Fig. 5D) was not significantly in excess of the fraction expected from type I errors (chi-squared test, p = 0.25).
We conclude that the relatively weak nonlinear interactions between top and bottom parts observed in neuronal responses to both stimulus sets arose from some mechanism other than passing the summed influences of the two parts through a sigmoidal activation function.
In the previous two sections, we showed that nonlinearities (a) sharpen neuronal tuning curves and (b) do not commonly conform to the pattern expected from passing inputs through a sigmoidal activation function. Here we consider whether these observations are related. Would sigmoidal nonlinearities, if present, have had a consistent impact on the sharpness of tuning? To explore this issue, we examined the responses of model neurons to pairs of simulated inputs (representing the influences of the top and bottom parts of an image) with strengths selected at random from normal distributions. We found that the transformation from a linear to a sigmoidal activation function can have no effect on the sharpness of tuning (Fig. S2A–C, supplementary material), can enhance it (Fig. S2D–F, supplementary material) or can reduce it (Fig. S2G–I, supplementary material). The outcome depends on the distribution of input strengths relative to the inflection point of the sigmoid. When the inputs are biased to be especially weak, thereby largely impinging upon the expansive portion of the sigmoid (Fig. S2D–F, supplementary material), sigmoidal activation tends to sharpen the tuning curve. On the other hand, when inputs are biased to be especially strong, thereby impinging upon the compressive portion of the sigmoid (Fig. S2G–I, supplementary material), sigmoidal activation tends to broaden the tuning curve. Thus the question of whether nonlinearities systematically affect the sharpness of tuning is orthogonal to the question of whether the nonlinearities are of the form expected from a sigmoidal activation function.
To be sure that we had selected parts that varied broadly in efficacy, as required for the preceding analysis, we ranked the top and bottom parts used in testing each neuron from best to worst. The ranking was based on the average of the firing rates elicited by the eight images that contained a given part minus the pre-stimulus baseline firing rate. For baton stimuli, the response to the best part was on average around three and a half times as strong as the response to the worst part (Fig. 6A–B). There was no significant difference between neuronal responses to the best top part and the best bottom part (p = 0.82, t-test). Likewise neuronal selectivity for parts, as judged by the difference between the response to the best and worst parts, did not differ significantly between tops and bottoms (p = 0.83, t-test).
Top and bottom parts in the hourglass set elicited weaker responses overall. This effect may have arisen because the top and bottom parts were fixed in the hourglass set whereas in the baton set, parts were selected to drive neuronal responses. Nonetheless, neurons were highly selective for part identity, as evidenced by the fact that the response to the best part was on average five times as strong as the response to the worst part (Fig. 6C–D). Neurons tested with the hourglass set responded slightly more strongly to the preferred bottom part than to the preferred top part (p = 0.004, t-test) and were slightly more selective for bottom identity than for top identity (best-worst response, p = 0.0008, t-test).
Each stimulus set contained images in which the top or bottom was blank. In some batons, the top or bottom of the vertical bar was unadorned. In some hourglass stimuli, the top or bottom was altogether absent. To determine whether the response to the blank part was less than to other parts, we ranked the top and bottom parts for each neuron from 1 (strongest response) to 8 (weakest response). If the blank part tended to elicit weaker responses than the other parts, then the average value of its rank would be greater than 4.5. For the baton set, the average rank of the blank part (4.6) was slightly but not significantly (t-test, p = 0.42) greater than 4.5. In the hourglass set, the average rank of the blank part (4.8) significantly exceeded 4.5 (t-test, p = 0.01). Thus the blank part tended to elicit a relatively weak response, although the tendency was significant only for the hourglass set.
Because we used the same set of parts (short of a 180° rotation) to form the tops and bottoms of each stimulus set, we were able to compare neuronal shape preferences at the two locations. Even if neurons exhibited similar patterns of selectivity for tops and bottoms, they still might produce more deeply modulated responses at one of the locations. To factor out this effect prior to carrying out a correlation analysis, we calculated the normalized response to each top part as N_i = (R_i – R_mean)/S, where R_i is the average response to shapes with the i-th top, R_mean is the average of all top responses, and S is the standard deviation of the average responses to the eight tops. We calculated the normalized response to each bottom part in a similar fashion. To determine whether neurons had matching shape preferences at top and bottom, we plotted, for all shapes in all neurons, the normalized response when the shape was at the top against the normalized response when the same shape was at the bottom (Fig. S3A, supplementary material). The values were positively (r = 0.41) and highly significantly (p = 4.2e−35) correlated. Thus neurons exhibited reasonably consistent shape selectivity at the top and bottom locations.
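The normalization and correlation steps can be sketched as follows. This is an illustration under assumptions rather than the code used in the study: the function names are hypothetical, the sample standard deviation (n − 1 denominator) is assumed because the text does not say which form was used, and the example responses are invented.

```python
import math


def normalize(responses):
    """Return (R_i - R_mean) / S for a list of mean responses to the parts."""
    n = len(responses)
    mean = sum(responses) / n
    # Sample standard deviation (n - 1); this choice is an assumption.
    sd = math.sqrt(sum((r - mean) ** 2 for r in responses) / (n - 1))
    if sd == 0.0:
        raise ValueError("flat tuning cannot be normalized")
    return [(r - mean) / sd for r in responses]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented mean firing rates (spikes/s) for eight shapes at top and bottom.
# The bottom responses are a scaled, shifted copy of the top responses, so
# the depth of modulation differs but the shape preference is identical.
tops = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
bottoms = [3.0 * t + 1.0 for t in tops]
r = pearson_r(normalize(tops), normalize(bottoms))
print(f"r = {r:.2f}")
```

Normalizing within each location removes differences in mean rate and depth of modulation, so the pooled points from many neurons can be compared on a common scale.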
The above result raises the possibility that neurons were sensitive simply to the presence of a shape and not to its location. To test this idea, we compared, for each neuron, two models relating the firing rate to the display: a simple “part” model (with 9 parameters, see Methods) in which firing depended on which parts were present and a complex “part×location” model in which firing depended on which part was at which location (with 15 parameters, see Methods). We then performed a partial-F test to determine whether the improvement in fit afforded by the “part×location” model was significantly greater than expected simply from its possessing six additional degrees of freedom. A cumulative distribution of the probability that the two models are equivalent is shown in Fig. S3B (supplementary material). For baton stimuli, the “part×location” model was significantly better (p < 0.05) in 45% of neurons (shaded region in Fig. S3B, supplementary material). Thus around half of the neurons were sensitive to which part was at which location or, alternatively and (in this experiment) indissociably, to the orientations of the parts.
A similar analysis performed on the hourglass data revealed a positive (r = 0.35) and significant (p = 2.8e−31) correlation between selectivity for tops and bottoms (Fig. S3C, supplementary material). About 28% of neurons were sensitive to which part was at which location in the hourglass set (Fig. S3D, supplementary material).
To explore the timing of main and interaction effects, we measured their strength as a function of time following stimulus onset in each neuron and then averaged the results (see Methods). For the baton stimulus set, curves representing the strength of the population signal (Fig. 7A) make clear that the onset of the visual response was followed at a short delay by the onset of the main-effect signal and at a longer delay by the onset of the interaction-effect signal. We calculated the population latency of each effect by finding the first time bin in which the mean of the distribution of the effect strength across neurons was significantly different from zero (t-test; α = 0.01). The time following stimulus onset at which the average firing rate began to increase (75 ms) was indeed earlier than the main-effect latency (95 ms) and the interaction-effect latency (135 ms). This pattern persisted upon analysis of data from individual monkeys (latency of main and interaction effects: 85 and 115 ms for 54 neurons from monkey Je; 105 and 125 ms for 50 neurons from monkey Ec). These observations were qualitatively unaltered by restricting consideration to 31 neurons demonstrably sensitive to both top and bottom as indicated by the presence of two main effects and an interaction effect.
To determine whether the temporal offset between the main-effect signal and the interaction-effect signal was statistically significant, we carried out the following analysis. First, we normalized the time-varying signal of each neuron to the maximal average value of that signal 50–150 ms after stimulus onset as computed across all tested neurons. This was necessary to avoid confounding differences in timing with differences in net strength of the two signals. Next, we assessed whether there was any 10 ms bin in which the distribution across neurons of the strength of the main-effect signal was significantly greater than the strength of the interaction-effect signal. There were two bins (centered at 95 and 105 ms) for which this was true (p = 0.03, t-test). This analysis, when repeated on data from each monkey, yielded no significant time bins in the data from one monkey (Je) and one significant time bin (centered at 105 ms) in the data from the second monkey (Ec) – a pattern that is likely due to the lower numbers of neurons used for statistical comparisons. We conclude that the main-effect signal developed significantly earlier than the interaction-effect signal.
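The population-latency criterion described above (the first time bin in which the mean effect strength across neurons differs significantly from zero) can be sketched as follows. This is a simplified illustration, not the code used in the study: the function name is hypothetical, the data are invented, and the caller supplies the critical t value for the chosen α and degrees of freedom instead of the function computing a p-value.

```python
import math


def first_significant_bin(bin_centers_ms, strengths_by_bin, t_crit):
    """Return the center of the first bin whose mean differs from zero.

    strengths_by_bin[k] holds one effect-strength value per neuron for bin k.
    A bin counts as significant when the absolute one-sample t statistic
    (tested against zero) exceeds t_crit. Returns None if no bin qualifies.
    """
    for center, values in zip(bin_centers_ms, strengths_by_bin):
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)
        se = math.sqrt(var / n)
        if se == 0.0:
            continue  # no variability across neurons: t is undefined, skip
        if abs(mean / se) > t_crit:
            return center
    return None


# Invented effect strengths for five neurons in three 10 ms bins.
centers = [75, 85, 95]
strengths = [
    [0.1, -0.1, 0.0, 0.2, -0.2],  # noise around zero
    [1.0, 1.2, 0.8, 1.1, 0.9],    # signal has emerged
    [1.5, 1.4, 1.6, 1.5, 1.7],
]
# 4.604 is the two-sided critical t value for alpha = 0.01 with 4 df.
latency = first_significant_bin(centers, strengths, t_crit=4.604)
print(latency)
```

With about a hundred neurons, as in the present data sets, the corresponding two-sided critical value for α = 0.01 is close to 2.63.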
During testing with hourglass stimuli (Fig. 7B), the onset of the visual response and of the main-effect signal occurred simultaneously (85 ms), slightly earlier than the onset of the interaction-effect signal (95 ms). These observations were qualitatively unaltered by restricting consideration to 45 neurons sensitive to both top and bottom as indicated by the presence of two main effects and an interaction effect. The onset times of the main-effect and interaction-effect signals were not significantly different (p = 0.3, t-test). These results were consistent across the two monkeys (latency of main and interaction effects: 85 and 95 ms for 69 neurons from monkey Je; 95 and 95 ms for 73 neurons from monkey Ec; these differences failed to reach statistical significance).
The key conclusion of this analysis is that interaction-effect signals developed later than main-effect signals for both stimulus sets although the phenomenon was large and statistically significant only for batons. To assess the significance of the difference in timing between the two interaction-effect signals, we carried out an analysis analogous to the one employed above for comparing main and interaction effects. This revealed that the hourglass interaction-effect signal was significantly greater than the baton interaction-effect signal in a 10 ms bin centered at 95 ms after stimulus onset (p = 0.04, t-test). We conclude that nonlinear effects developed earlier for hourglass than for baton stimuli.
Because the hourglass set (unlike the baton set) consisted of individual top and bottom parts presented in isolation as well as whole objects constructed from pairs of parts, we were able to ask how the response to the whole object was related to the responses evoked by the individual parts in isolation. To answer this question we plotted the response to each whole object against the average of the responses evoked by the two parts (Fig. S3, supplementary material). According to the principle of divisive normalization, the two values should be equal (Reynolds et al., 1999). There was a strong and significant positive correlation. The response to the whole object was slightly but significantly greater than the average of the individual part responses (mean response to whole = 11.2 spikes/s, mean of the average of the responses to the two parts = 10.7 spikes/s, p = 1e−13, paired t-test). This enhancement – a slight net deviation from divisive normalization – was additive (the intercept of the best-fit line was 1.11) rather than multiplicative (the slope of the best-fit line was 0.94). We conclude that neuronal responses to hourglass shapes are roughly consistent with the pattern expected from divisive normalization.
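The comparison between whole-object responses and the average of the part responses amounts to fitting a straight line, whose slope and intercept separate multiplicative from additive deviations. The sketch below uses ordinary least squares, which is an assumption because the fitting method is not stated here. The function name is hypothetical and the data are invented: the whole-object responses are constructed to lie exactly on a line with the reported slope (0.94) and intercept (1.11), purely to show how the two parameters are read.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Invented responses (spikes/s) to isolated top and bottom parts.
part_pairs = [(4.0, 8.0), (10.0, 12.0), (6.0, 20.0), (15.0, 17.0)]
part_average = [(top + bottom) / 2.0 for top, bottom in part_pairs]
# Synthetic whole-object responses placed exactly on the reported line.
whole = [0.94 * avg + 1.11 for avg in part_average]

slope, intercept = fit_line(part_average, whole)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Under strict averaging the slope would be 1 and the intercept 0. A slope near 1 with a positive intercept indicates an additive enhancement, whereas a slope above 1 with a zero intercept would indicate a multiplicative one.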
The aim of this study was to characterize how the influences of parts within an object combine in driving the visual responses of IT neurons. Three features of the approach were critical. First, in common with a few previous studies (Baker et al., 2002; Baene et al., 2007), we made use of baton stimuli containing widely separated discrete parts that could be independently manipulated without risk of creating juxtapositional features. Second, in departure from previous studies, we employed numerous parts covering a wide range with regard to their individual efficacy in eliciting neuronal activity. This was required for testing a model in which nonlinear interactions arise from passing the summed inputs through a sigmoidal activation function. Third, we compared results obtained with the baton stimuli to results obtained with a stimulus set in which parts were confluent, with the consequence that apparent nonlinearities could arise from neuronal sensitivity to attributes created where two parts abut. We discuss the main conclusions below, first with reference to baton stimuli and then to the hourglass stimuli.
The key finding of the study is that the influences of spatially separate parts combine according to a nearly linear rule. This finding stands in sharp contrast to the idea that IT neurons are nonlinearly selective for particular conjunctions of features (Perrett et al., 1982; Desimone et al., 1984; Tanaka et al., 1991; Kobatake and Tanaka, 1994; Ito et al., 1995; Yamane et al., 2006). However, it is concordant with the finding that neuronal responses to face parts sum linearly with few nonlinear interactions (Freiwald et al., 2009). It also resonates with the observation that responses to multiple objects are governed by divisive normalization. In visual areas including IT (Zoccolan et al., 2005), MT (Majaj et al., 2007; Britten and Heuer, 1999), V4 (Reynolds et al., 1999; Ghose and Maunsell, 2008), V2 (Reynolds et al., 1999) and V1 (Heeger, 1993; Carandini, Heeger and Movshon, 1997), the response to a multi-element display is roughly equal to the average of the responses elicited by the individual elements (this is also true for responses to hourglass stimuli, see Fig. S3, supplementary material). This raises the question: could the monkeys have been processing the baton stimuli as multiple objects and could this account for our observations? We think not for two reasons. First – although this hardly constitutes objective evidence – the batons seem compellingly unitary to a human observer. Second, it has been shown that requiring monkeys to process batons as wholes – which might be thought to enhance perceiving them as unitary objects – does not affect neuronal responses to them in IT, which remain the same as during passive viewing (Baker et al., 2002). This is consonant with observations indicating that discrimination training affects neuronal responses to stimuli (Sigala and Logothetis, 2002) but active performance of the discrimination does not (Suzuki et al., 2006). We conclude that an essentially identical linear rule determines responses to displays consisting of multiple objects and objects consisting of multiple parts.
We found that statistically significant main effects of top and bottom identity were 2.8 times more common than statistically significant interaction effects. This is in good agreement with the results of a few other studies utilizing stimuli with discrete features (Baker et al., 2002, report a ratio of 3.2; Baene et al., 2007, describe linear summation as the dominant pattern; McMahon and Olson, 2009, report a ratio of 6.0). The minor nature of nonlinear effects was even more evident in a measure based on signal strength, which revealed a difference of an order of magnitude between main effects based on individual part identity and interaction effects based on the combination of parts (Figs. 3B, 7A). The measure of signal strength that we used is worth particular note because it eliminates contributions from noise that have tended in previous studies (McMahon and Olson, 2009) to inflate the relative strength of interaction effects.
The finding that interactions between spatially separate parts of an object develop around 40 ms later than main effects (Fig. 7A) is concordant with an earlier finding obtained using shapes constructed from contour fragments (Brincat and Connor, 2006) but is at odds with a recent report indicating that nonlinear interactions between color and shape develop simultaneously with main effects (McMahon and Olson, 2009). Perhaps nonlinear interactions between spatially separate elements (discrete parts of an object) involve a time-consuming process whereas interactions between spatially coextensive elements (a shape and its color) do not.
Nonlinear interactions, when present, did induce a slight increase in the tendency toward conjunction selectivity. This finding supports earlier observations that nonlinear effects, although rare and weak, do tend systematically to enhance neuronal selectivity, rendering the representation of images in IT sparser than it otherwise would be (Baker et al., 2002; McMahon and Olson, 2009).
Despite our best attempts to drive neurons across a wide dynamic range, we did not observe any threshold effects or response saturation. If nonlinearities are not the consequence of passing summed inputs through a sigmoidal activation function, then how do they arise? Synaptic inputs are thought to sum linearly when they impinge on separate dendritic branches but to sum nonlinearly when they impinge on the same branch (London and Hausser, 2005). Perhaps nonlinear interactions occur under the relatively rare condition when synapses conveying the influences of discrete parts of an object converge on the same dendritic branch. This alone cannot, however, account for the fact that interaction effects are delayed by several tens of ms relative to main effects.
Results obtained with hourglass stimuli differed from those obtained with baton stimuli in one key respect: nonlinear effects reflecting neuronal sensitivity to the particular combination of top part and bottom part were more prominent. This was evident in counts of statistically significant interaction effects (Fig. 3A vs. C), in measures of interaction-effect strength in individual neurons (Fig. 3B vs. D) and in the strength of the population interaction-effect signal (Fig. 7A vs. B). Hourglass stimuli also differed from baton stimuli with regard to two aspects of response timing. 1) The onset of main effects reflecting sensitivity to part identity was simultaneous with the onset of the visual response for hourglass stimuli whereas it was delayed for baton stimuli. This may have been due to the fact that features differentiating one part from another were defined on a relatively coarse scale for hourglass stimuli and on a relatively fine scale for baton stimuli. Information about image identity is represented in a coarse-to-fine sequence in IT (Sripati and Olson, 2009). 2) The interaction-effect signal developed earlier for hourglass than for baton stimuli (Fig. 7A vs. B). This may have been due to the fact that some interaction effects reflected sensitivity to a local feature created by the juxtaposition of two parts rather than to the combination of parts as such. Signals representing a particular combination of distant parts appear to develop more slowly than signals reflecting the identity of an individual part (Brincat and Connor, 2006).
The classic view that IT neurons are selective for combinations of features as distinct from individual features (Perrett et al., 1982; Desimone et al., 1984; Tanaka et al., 1991) is founded on the fact that removing any part from a preferred image can cause a drastic reduction in the response (Kobatake and Tanaka, 1994; Ito et al., 1995). However, removing either part from a preferred baton does not commonly cause a dramatic reduction. There must be a difference between this manipulation and the manipulations employed in the classic studies. What might the difference be? We can offer two suggestions. 1) Juxtapositional features. If the parts of an object are not physically separate then combining them might give rise to a new local feature where they adjoin (such as the X junction at the center of an hourglass image formed from a top triangle and a bottom triangle). A neuron selective for the X junction would give the appearance of being selective for the combination of two triangles. This would not have happened with baton stimuli because of the physical separation between the top and bottom parts. 2) Coarse footprint. The parts of an object might when combined create a coarse shape not implicit in either part alone (such as the Z of an hourglass image formed from two chevrons). We have recently found that IT neurons are selective for global shape up to a level of detail of around three cycles per object (Sripati and Olson, 2010). A neuron selective for a coarse footprint in the form of a Z would give the appearance of being selective for the conjunction of the two chevrons. This would not have happened with baton stimuli because the details differentiating one part from another are on such a fine scale as not appreciably to affect the coarse footprint. We surmise that IT neurons exhibit robust conjunction selectivity only when the conjunction of parts creates a preferred local juxtapositional feature or a preferred global coarse footprint. This view is consonant with the finding that nonlinear effects occur at a significantly higher rate for hourglass than for baton stimuli.
We thank Karen McCracken for technical assistance. Research support: NIH RO1 EY018620, P50 MH084053 and the Pennsylvania Department of Health through the Commonwealth Universal Research Enhancement Program. Technical support: NIH P30 EY08098. Collection of MR images: NIH P41 RR03631.