Critical to perceiving an object is the ability to bind its constituent features into a cohesive representation, yet the manner by which the visual system integrates object features to yield a unified percept remains unknown. Here, we present a novel application of multivoxel pattern analysis of neuroimaging data that allows a direct investigation of whether neural representations integrate object features into a whole that is different from the sum of its parts. We found that patterns of activity throughout the ventral visual stream (VVS), extending anteriorly into the perirhinal cortex (PRC), discriminated between the same features combined into different objects. Despite this sensitivity to the unique conjunctions of features comprising objects, activity in regions of the VVS, again extending into the PRC, was invariant to the viewpoints from which the conjunctions were presented. These results suggest that the manner in which our visual system processes complex objects depends on the explicit coding of the conjunctions of features comprising them.
How objects are represented in the brain is a core issue in neuroscience. In order to coherently perceive even a single object, the visual system must integrate its features (e.g., shape, color) into a unified percept (sometimes called the “binding problem”) and recognize this object across different viewing angles, despite the drastic variability in appearance caused by shifting viewpoints (the “invariance problem”). These are the most computationally demanding challenges faced by the visual system, yet humans can perceive complex objects across different viewpoints with exceptional ease and speed (Thorpe et al. 1996). The mechanism underlying this feat is one of the central unsolved puzzles in cognitive neuroscience.
Two main classes of models have been proposed. The first are hierarchical models, in which representations of low-level features are transformed into more complex and invariant representations as information flows through successive stages of the ventral visual stream (VVS), a series of anatomically linked cortical fields originating in V1 and extending into the temporal lobe (Hubel and Wiesel 1965; Desimone and Ungerleider 1989; Gross 1992; Tanaka 1996; Riesenhuber and Poggio 1999). These models assume explicit conjunctive coding of bound features: posterior VVS regions represent low-level features and anterior regions represent increasingly complex and invariant conjunctions of these simpler features. In contrast, an alternative possibility is a non-local binding mechanism, in which the perception of a unitized object does not necessitate explicit conjunctive coding of object features per se, but rather, the features are represented independently and bound by co-activation. Such a mechanism could include the synchronized activity of spatially distributed neurons that represent the individual object features (Singer and Gray 1995; Uhlhaas et al. 2009), or a separate brain region that temporarily reactivates and dynamically links otherwise disparate feature representations (Eckhorn 1999; Devlin and Price 2007). Thus, an explicit conjunctive coding mechanism predicts that the neural representation for a whole object should be different from the sum of its parts, whereas a nonlocal binding mechanism predicts that the whole should not be different from the sum of the parts, because the unique conjunctive representations are never directly coded.
The neuroimaging method of multivoxel pattern analysis (MVPA) offers promise for making more subtle distinctions between representational content than previously possible (Haxby et al. 2001; Kamitani and Tong 2005). Here, we used a novel variant of MVPA to adjudicate between these 2 mechanisms. Specifically, we measured whether the representation of a whole object differed from the combined representations of its constituent features (i.e., explicit conjunctive coding), and whether any such conjunctive representation was view-invariant. We examined the patterns of neural activity evoked by 3 features distributed across 2 individually presented objects during a 1-back task (Fig. 1). Our critical contrast measured the additivity of patterns evoked by different conjunctions of features across object pairs: A + BC versus B + AC versus C + AB, where A, B, and C each represent an object comprising a single feature, and AB, BC, and AC each represent an object comprising conjunctions of those features (Fig. 2A,B). Importantly, in this “conjunction contrast,” the object pairs were identical at the feature level (all contained A, B, and C), but differed in their conjunction (AB vs. BC vs. AC), allowing a clean assessment of the representation pertaining to the conjunction, over and above any information regarding the component features. This balanced design also ensured that mnemonic demands were matched across comparisons. A finding of equivalent additivity (i.e., if A + BC = B + AC = C + AB) would indicate that information pertaining to the specific conjunctions is not represented in the patterns of activity, consistent with a nonlocal binding mechanism in which the features comprising an object are bound by their co-activation.
In contrast, if the pattern sums are not equivalent (i.e., if A + BC ≠ B + AC ≠ C + AB), then the neural code must be conjunctive, representing information about the specific conjunctions of features over and above information pertaining to the individual features themselves—consistent with an explicit conjunctive coding mechanism.
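The logic of the conjunction contrast can be illustrated with a toy example. The sketch below is not the authors' analysis; the voxel values, the conjunction-specific terms (`U_AB`, `U_BC`, `U_AC`), and all function names are hypothetical. It only shows that if two-featured objects are coded as the plain sum of their features, the three pattern sums are identical, whereas any conjunction-specific component makes them differ.

```python
import math

# Hypothetical 4-voxel patterns evoked by the single-featured objects.
A = [1.0, 0.0, 2.0, 0.5]
B = [0.0, 1.5, 0.5, 1.0]
C = [2.0, 0.5, 0.0, 1.0]

# Hypothetical conjunction-specific components (zero under purely additive coding).
U_AB = [0.3, -0.2, 0.0, 0.1]
U_BC = [-0.1, 0.4, 0.2, 0.0]
U_AC = [0.0, 0.1, -0.3, 0.2]


def add(*patterns):
    """Element-wise sum of equal-length voxel patterns."""
    return [sum(values) for values in zip(*patterns)]


def same(p, q):
    """True if two patterns are equal up to floating-point tolerance."""
    return all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(p, q))


# Additive (nonlocal binding) prediction: pattern(XY) = pattern(X) + pattern(Y).
additive_sums = {
    "A+BC": add(A, add(B, C)),
    "B+AC": add(B, add(A, C)),
    "C+AB": add(C, add(A, B)),
}

# Conjunctive prediction: pattern(XY) = pattern(X) + pattern(Y) + unique term.
conjunctive_sums = {
    "A+BC": add(A, add(B, C, U_BC)),
    "B+AC": add(B, add(A, C, U_AC)),
    "C+AB": add(C, add(A, B, U_AB)),
}

print(same(additive_sums["A+BC"], additive_sums["B+AC"]))        # sums match
print(same(conjunctive_sums["A+BC"], conjunctive_sums["B+AC"]))  # sums differ
```

In real data the sums are never exactly equal because of noise, which is why the contrast compares pattern correlations within and between conjunction types rather than testing for identity.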
An important potential benefit of explicit conjunctive coding of whole objects is to provide stability of representation across changes in viewpoints, and invariance to the manifestation of individual object features (Biederman 1987). To investigate whether this was the case, in a second “viewpoint contrast,” we measured whether the representation for the conjunctions changed when they were presented from a different viewpoint (i.e., were the conjunctive representations view-invariant?) (Fig. 2C,D). Importantly, in both contrasts, our novel MVPA linearity design avoided making unbalanced comparisons (e.g., A + B + C vs. ABC) where the number of object features was confounded with the number of objects.
Indeed, an aspect of our design that should be emphasized is that during the task, participants viewed objects displayed in isolation. This is important, because presenting 2 objects simultaneously (e.g., Macevoy and Epstein 2009) could potentially introduce a bias, particularly when attention is divided between them (Reddy et al. 2009; Agam et al. 2010). Whereas objects were presented in isolation during the task, responses to single objects were then combined during analysis. On each side of every comparison, we combined across an equal number of objects (2), as there will be activity evoked by an object that does not scale with its number of features. So, for example, we rejected a simpler design in which A + B = AB was tested, as there are an unequal number of objects combined on the 2 sides of the comparison (2 vs. 1).
We hypothesized that any observation of explicit conjunctive coding would be found in anterior VVS, extending into anterior temporal regions. In particular, one candidate structure that has received intensified interest is the perirhinal cortex (PRC), a medial temporal lobe (MTL) structure whose function is traditionally considered exclusive to long-term memory (Squire and Wixted 2011) but that has recently been proposed to sit at the apex of the VVS (Murray et al. 2007; Barense, Groen, et al. 2012). Yet to our knowledge, there have been no direct investigations of explicit conjunctive coding in the PRC. Instead, most empirical attention has focused on area TE in monkeys and the object-selective lateral occipital complex (LOC) in humans, structures posterior to PRC and traditionally thought to be the anterior pinnacle of the VVS (Ungerleider and Haxby 1994; Grill-Spector et al. 2001; Denys et al. 2004; Sawamura et al. 2005; Kriegeskorte et al. 2008). For example, single-cell recording in monkey area TE showed evidence for conjunctive processing whereby responses to the whole object could not be predicted from the sum of the parts (Desimone et al. 1984; Baker et al. 2002; Gross 2008), although these conjunctive responses might have arisen from sensitivity to new features created by abutting features (Sripati and Olson 2010). Here, with our novel experimental design, we were able to directly measure explicit conjunctive coding of complex objects for the first time in humans. We used both a whole-brain approach and a region of interest (ROI)-based analysis that focused specifically on the PRC and functionally defined anterior structures in the VVS continuum. Our results revealed that regions of the VVS, extending into the PRC, contained unique representations of bound object features, consistent with an explicit conjunctive coding mechanism predicted by hierarchical models of object recognition.
Twenty neurologically normal right-handed participants gave written informed consent approved by the Baycrest Hospital Research Ethics Board and were paid $50 for their participation. Data from one participant were excluded due to excessive head movement (>10° rotation), leaving 19 participants (18–26 years old, mean = 23.6 years, 12 females).
Participants viewed novel 3D objects created using Strata Design 3D CX 6. Each object was assembled from one of two feature sets and was composed of a main body with 1, 2, or 3 attached features (depicted as “A,” “B,” and “C” in Fig. 1A). There were 7 possible combinations of features within a feature set (A, B, C, AB, AC, BC, ABC). Features were not mixed between sets. Each object was presented from one of two possible angles separated by a 70° rotation along a single axis: 25° to the right, and 45° to the left from central fixation. We ensured that all features were always visible between angle changes. There were 28 images in total, created from every unique combination of the experimental factors: 2 (feature sets) × 2 (viewpoints) × 7 (possible combinations of features within a set). Figure 1A depicts the complete stimulus set.
To measure the basic visual similarity between the objects in our stimulus set, we calculated the root-mean-square difference (RMSD) between each of the 24 object images (all one-featured and two-featured objects) and every other image using the following function:
\[ \mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{Image1}_i - \mathrm{Image2}_i\right)^2} \]
where i is a pixel position in the image and n is the total number of pixels in the image. Thus, this function compares all of the pixels in 1 image with the corresponding pixels in a second image and yields a value that indicates the similarity between 2 images, ranging from 0 (if the 2 images are identical) to 1 (if the 2 images are completely different) (Fig. 3A). The purpose of this analysis was to determine how similar 2 images are on the most basic of levels, that is, how different the images would appear to the retina. Specifically, we conducted an analysis to ensure that our viewpoint manipulation in fact caused a substantial change in the visual appearance of the objects. For example, if our viewpoint shifts were negligible (e.g., 1° difference), any observation of view-invariance for this small visual change would not be particularly meaningful. However, if we could show that our shift in viewpoint caused a visual change that was as large as a change in the identity of the object itself, a demonstration of view-invariance would be much more compelling. To this end, we calculated a contrast matrix (Fig. 3B) that compared the RMSD values of the same objects from different viewpoints (dark orange) versus RMSD values of different object features from the same viewpoint (light orange). Different objects from the same viewpoint were compared only if they were matched for the number of features. A t-test revealed that there was a significant difference between RMSD values of the same features shown from different viewpoints (M = 0.16, SD = 0.03) compared with different features shown from the same viewpoint (M = 0.11, SD = 0.02; t(34) = 6.11, P < 0.001), revealing that our viewpoint manipulation caused a change in visual appearance that was more drastic than maintaining the same viewpoint but changing the identity of the features altogether.
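The RMSD measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `rmsd` is hypothetical, images are represented as flat lists of pixel intensities, and the intensities are assumed to be scaled to the range 0 to 1 so that the result also falls between 0 (identical images) and 1 (completely different images).

```python
import math


def rmsd(image1, image2):
    """Root-mean-square difference between two equally sized images.

    Each image is a flat sequence of pixel intensities, assumed here to be
    scaled to [0, 1]; i indexes pixel position and n is the pixel count.
    """
    if len(image1) != len(image2):
        raise ValueError("images must contain the same number of pixels")
    n = len(image1)
    squared_diffs = ((p - q) ** 2 for p, q in zip(image1, image2))
    return math.sqrt(sum(squared_diffs) / n)


# Toy 2 x 2 "images", flattened to four pixels each.
black = [0.0, 0.0, 0.0, 0.0]
white = [1.0, 1.0, 1.0, 1.0]
half = [1.0, 1.0, 0.0, 0.0]

print(rmsd(black, black))  # 0.0: identical images
print(rmsd(black, white))  # 1.0: maximally different images
print(rmsd(black, half))   # sqrt(0.5): half of the pixels differ fully
```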
Next, we conducted an RMSD analysis that was very similar to the viewpoint contrast (shown in Fig. 2D) used in our MVPA. Here, we compared the RMSD values of a change in viewpoint (dark orange) with a change in both viewpoint and feature type (light orange) (Fig. 3C). A t-test revealed that there was no significant difference between RMSD values of the same features shown from different viewpoints (M = 0.16, SD = 0.03) compared with different features shown from different viewpoints (M = 0.17, SD = 0.02; t(70) = 1.28, P = 0.20). This indicates that the shift in viewpoint within a feature set caused a change in visual appearance that was as drastic as changing the features altogether. That is, our viewpoint manipulation was not trivial and caused a substantial visual change of the appearance of the objects.
It is worth noting that because RMSD values constitute a difference score that reflects the visual differences between one object with respect to another object, we could not calculate RMSD difference scores to the pairs of objects as we did in the MVPA (e.g., A + BC vs. B + AC). Put differently, in the MVPA we could measure the patterns of activity evoked by the presentation of a single object (e.g., A), add this pattern of activity to that evoked by a different single object (e.g., BC), and then compare across different pattern sums (e.g., A + BC vs. B + AC). In contrast, because an RMSD value reflects the visual difference between 2 objects (rather than to an object on its own), we could not measure an RMSD value to object “A” and add that value to the RMSD value of object “BC.” As such, it was not possible to calculate RMSD values for the pattern sums.
We administered 4 experimental scanning runs during which participants completed a 1-back task to encourage attention to each image. Participants were instructed to press a button with their right index finger whenever the same object appeared twice in succession, regardless of its viewpoint (behavioral results in Supplementary Table 1). Feedback was presented following each button press (correct or incorrect) and at the end of each run (proportion of correct responses during that run). Trials on which a response was made were not included in the analysis.
Objects were presented centrally on the screen and had a visual angle of 5.1° × 5.3°, which would likely encompass the receptive fields of PRC (~12°) (Nakamura et al. 1994), V4 (4–6° at an eccentricity of 5.5°) (Kastner et al. 2001), LOC (4–8°) (Dumoulin and Wandell 2008), and fusiform face area (FFA), and parahippocampal place area (PPA) (likely >6°) (Desimone and Duncan 1995; Kastner et al. 2001; Kornblith et al. 2013). The visual angle of the individual object features was approximately 2.1° × 2.2°, which would likely encompass the receptive fields of more posterior regions in the VVS (2–4° in V2) (Kastner et al. 2001). Each image was displayed for 1 s with a 2 s interstimulus interval. Each run lasted 11 min 30 s, and for every 42 s of task time, there was an 8 s break (to allow blood oxygen level dependent (BOLD) signal to reach baseline) during which a fixation cross appeared on the screen. Each run comprised 6 blocks of 28 trials, which were presented in a different order to each participant. The 14 images composing each feature set were randomly presented twice within each block. Across consecutive blocks, the feature sets alternated (3 blocks per feature set per run). Each block contained between 1 and 4 target objects (i.e., sequential repeats), such that the overall chance that an object was a target was 10%. In total, each image was presented 24 times (6 times within each run). Prior to scanning, each participant performed a 5-min practice of 60 trials.
After the 4 experimental runs, an independent functional localizer was administered to define participant-specific ROIs (LOC, FFA, and PPA, described next). Participants viewed scenes, faces, objects, and scrambled objects in separate 15-s blocks (there was no overlap between the images in the experimental task above and the localizer task). Within each block, 20 images were presented for 300 ms each with a 450-ms ISI. There were 4 groups of 12 blocks, with each group separated by a 15-s fixation-only block. Within each group, 3 scene, face, object, and scrambled object blocks were presented (order of block type was counterbalanced across groups). To encourage attention to each image, participants were instructed to press a button with their right index finger whenever the same image appeared twice in succession. Presentation of images within blocks was pseudo-random: immediate repeats occurred between 0 and 2 times per block.
Following scanning, participants were administered a memory task in which they determined whether a series of objects were shown during scanning (Supplementary Fig. 1 for description of the task and results). Half of the objects were seen previously in the scanner and half were novel recombinations of features from across the 2 feature sets. In brief, the results indicated that participants could discriminate easily between previously viewed objects and objects comprising novel reconfigurations of features, suggesting that the binding of features extended beyond the immediate task demands in the scanner, but also transferred into longer-term memory.
Scanning was performed using a 3.0-T Siemens MAGNETOM Trio MRI scanner at the Rotman Research Institute at Baycrest Hospital using a 32-channel receiver head coil. Each scanning session began with the acquisition of a whole-brain high-resolution magnetization-prepared rapid gradient-echo T1-weighted structural image (repetition time = 2 s, echo time = 2.63 ms, flip angle = 9°, field of view = 25.6 cm², 160 oblique axial slices, 192 × 256 matrix, slice thickness = 1 mm). During each of four functional scanning runs, a total of 389 T2*-weighted echo-planar images were acquired using a two-shot gradient echo sequence (200 × 200 mm field of view with a 64 × 64 matrix size), resulting in an in-plane resolution of 3.1 × 3.1 mm for each of 40 2-mm axial slices that were acquired along the axis of the hippocampus (interslice gap = 0.5 mm; repetition time = 2 s; echo time = 30 ms; flip angle = 78°).
Functional images were preprocessed and analyzed using SPM8 (www.fil.ion.ucl.ac.uk/spm) and a custom-made, modular toolbox implemented in an automatic analysis pipeline system (https://github.com/rhodricusack/automaticanalysis/wiki). Prior to MVPA, the data were preprocessed, which included realignment of the data to the first functional scan of each run (after 5 dummy scans were discarded to allow for signal equilibrium), slice-timing correction, coregistration of functional and structural images, nonlinear normalization to the Montreal Neurological Institute (MNI) template brain, and segmentation of gray and white matter. Data were high-pass filtered with a 128-s cutoff. The data were then “denoised” by deriving regressors from voxels unrelated to the experimental paradigm and entering these regressors in a general linear model (GLM) analysis of the data, using the GLM denoise toolbox for Matlab (Kay et al. 2013). Briefly, this procedure takes as input a design matrix (specified by the onsets for each stimulus regardless of its condition) and an fMRI time-series, and returns as output an estimate of the hemodynamic response function (HRF) and BOLD response amplitudes (β weights). It is important to emphasize that the design matrix did not include the experimental conditions upon which our contrasts relied; these conditions were specified only after denoising the data. Next, a fitting procedure selected voxels that are unrelated to the experiment (cross-validated R² < 0%), and a principal components analysis was performed on the time-series of these voxels to derive noise regressors. A cross-validation procedure then determined the number of regressors that were entered into the model (Kay et al. 2013).
We specified the onsets for each individual object (i.e., A, B, C, AB, BC, AC) for each of the 2 feature sets and 2 viewpoints. Our model then created a single regressor for each of the three different pairs of objects (i.e., A + BC, B + AC, and C + AB). This was done separately for each of the 2 feature sets and 2 viewpoints. For example, events corresponding to the singly presented “A” object from feature set 1 and viewpoint 2 and events corresponding to the singly presented “BC” object from feature set 1 and viewpoint 2 were combined to create the single regressor for “A + BC” from feature set 1 and viewpoint 2. More specifically, within each run, the voxel-wise data of each object pair were split into 3 subdivisions that were each composed of every third trial of a given image (following Zeithamova et al. 2012) (Fig. 2B,D, zoomed-in cells). The pattern similarity of each condition in each subdivision was compared with that of each condition in every other subdivision. We designed the subdivisions so that our comparisons were relatively equidistant in time. For example, the first subdivision for the A + BC regressor included A (1st presentation) + BC (1st presentation) + A (4th presentation) + BC (4th presentation); the second subdivision included A (2nd presentation) + BC (2nd presentation) + A (5th presentation) + BC (5th presentation), etc. This resulted in 36 regressors of interest per run [2 (feature sets) × 2 (viewpoints) × 3 (conjunctions) × 3 (subdivisions)]. We also modeled 8 regressors of no interest for each run: trials of three-featured objects (ABC), trials in which participants responded with a button press on the 1-back task, and 6 realignment parameters to correct for motion. Events were modeled with a delta (stick) function corresponding to the stimulus presentation onset convolved with the canonical HRF as defined by SPM8. This resulted in parameter estimates (β) indexing the magnitude of response for each regressor.
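The regressor bookkeeping described above can be checked with a short sketch. It is illustrative only: the names (`FEATURE_SETS`, `subdivision_of`, and so on) are hypothetical, and it assumes the nominal 6 presentations of each image per run, ignoring trials dropped because a response was made.

```python
from itertools import product

FEATURE_SETS = (1, 2)
VIEWPOINTS = (1, 2)
OBJECT_PAIRS = ("A+BC", "B+AC", "C+AB")
N_SUBDIVISIONS = 3
PRESENTATIONS_PER_RUN = 6  # nominal count per image per run (assumption)


def subdivision_of(presentation):
    """Map a 1-based presentation index to a 1-based subdivision.

    Every third presentation of an image falls in the same subdivision,
    so presentations 1 and 4 share subdivision 1, 2 and 5 share 2, etc.
    """
    return (presentation - 1) % N_SUBDIVISIONS + 1


# One regressor of interest per (feature set, viewpoint, pair, subdivision).
regressors = list(
    product(FEATURE_SETS, VIEWPOINTS, OBJECT_PAIRS, range(1, N_SUBDIVISIONS + 1))
)

# Which presentations of each image contribute to each subdivision.
members = {
    s: [p for p in range(1, PRESENTATIONS_PER_RUN + 1) if subdivision_of(p) == s]
    for s in range(1, N_SUBDIVISIONS + 1)
}

print(len(regressors))  # 36 regressors of interest per run
print(members)          # {1: [1, 4], 2: [2, 5], 3: [3, 6]}
```

Interleaving the presentations in this way keeps the subdivisions roughly matched in time, which is the stated reason for taking every third trial rather than consecutive trials.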
Multivoxel patterns associated with each regressor were then Pearson-correlated. Thus, each cell in our planned contrast matrices was composed of a 12 × 12 correlation matrix that computed correlations within and across all runs and data subdivisions (Fig. 2B,D, zoomed-in cells; see also Supplementary Fig. 2 for the full data matrix). This process was repeated for each cell in the contrast matrix, and these correlation values were then averaged and condensed to yield the 12 × 12 contrast matrix (similar to Linke et al. 2011). We then subjected these condensed correlation matrices to our planned contrasts (Fig. 2B,D).
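The correlate-then-condense step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: `pearson` and `condense_cell` are hypothetical names, each pattern is a flat list of β values across voxels, and a cell of the contrast matrix is taken to be the mean of all pairwise correlations between the patterns of two conditions. How self-correlations were handled is not stated in the text, so skipping them is an assumption exposed as a flag.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length voxel patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def condense_cell(patterns_i, patterns_j, skip_identity=True):
    """Average the correlations between all patterns of two conditions.

    patterns_i / patterns_j: lists of voxel patterns (e.g., one per
    run-by-subdivision combination). With skip_identity=True, a pattern
    is never correlated with itself (an assumption, see lead-in).
    """
    values = [
        pearson(p, q)
        for p in patterns_i
        for q in patterns_j
        if not (skip_identity and p is q)
    ]
    return sum(values) / len(values)


# Toy data: two patterns for condition X, two for condition Y.
x1, x2 = [1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]
y1, y2 = [4.0, 3.0, 2.0, 1.0], [3.9, 3.2, 1.8, 1.1]

within = condense_cell([x1, x2], [x1, x2])    # same condition
between = condense_cell([x1, x2], [y1, y2])   # different conditions
print(within > between)
```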
In addition to an analysis that computed correlations both across and within runs, we also conducted an analysis in which we ignored within-run correlations and computed correlations across runs only. Results from this across-run only analysis are shown in Supplementary Figure 3. In brief, this analysis revealed the same pattern of results as the analysis that computed correlations both across and within runs.
A spherical ROI (10 mm radius, restricted to gray matter voxels and including at least 30 voxels) was moved across the entire acquisition volume, and for each ROI, voxel-wise, unsmoothed β-values were extracted separately for each regressor (Kriegeskorte et al. 2006). The voxel-wise data (i.e., regressors of interest) were then Pearson-correlated within and across runs, and condensed into a 12 × 12 correlation matrix (see Fig. 2B,D). Predefined similarity contrasts containing our predictions regarding the relative magnitude of pattern correlations within and between conjunction types specified which matrix elements were then subjected to a two-sample t-test. This analysis was performed on a single-subject level, and a group statistic was then calculated from the average results, indicating whether the ROI under investigation coded information according to the similarity matrix. Information maps were created for each subject by mapping the t-statistic back to the central voxel of each corresponding ROI in that participant's native space. These single-subject t-maps were then normalized, and smoothed with a 12-mm full width at half maximum (FWHM) Gaussian kernel to compensate for anatomical variability across participants. The resulting contrast images were then subjected to a group analysis that compared the mean parameter-estimate difference across participants to zero (i.e., a one-sample t-test relative to zero). Results shown in Figure 4A,C are superimposed on the single-subject MNI brain template.
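The per-searchlight statistic can be sketched as a two-sample t-test between two sets of correlation-matrix elements. The text does not say which variant of the two-sample test was used, so the pooled-variance (Student) form below is an assumption, and `two_sample_t` and the example values are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_t(group1, group2):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    group1: correlations predicted to be higher (e.g., same conjunction).
    group2: correlations predicted to be lower (e.g., different conjunction).
    """
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df


# Hypothetical matrix elements selected by a similarity contrast.
same_conjunction = [0.42, 0.38, 0.45, 0.40]
different_conjunction = [0.31, 0.35, 0.29, 0.33]

t_value, df = two_sample_t(same_conjunction, different_conjunction)
print(t_value > 0, df)  # a positive t indicates conjunction-specific similarity
```

In a searchlight analysis this statistic would be computed once per sphere and written back to the sphere's central voxel to build the participant's information map.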
We investigated 4 ROIs defined a priori. Three were functionally defined regions well established as part of the VVS: LOC, FFA, and the PPA. The fourth ROI was the PRC, which was defined by an anatomical probability map created by Devlin and Price (2007). We included areas that had at least a 30% probability of being the PRC, as done previously (Barense et al. 2011). For our functional localizer, we used identical stimuli to those employed in Watson et al. (2012). We defined the LOC as the region that was located along the lateral extent of the occipital lobe and responded more strongly to objects compared with scrambled objects (P < 0.001, uncorrected) (Malach et al. 1995). We defined the FFA as the set of contiguous voxels in the mid-fusiform gyrus that showed significantly higher responses to faces compared with objects (P < 0.001, uncorrected) (Liu et al. 2010), and the PPA as the set of contiguous voxels in the parahippocampal gyrus that responded significantly more to scenes than to objects (P < 0.001, uncorrected) (Reddy and Kanwisher 2007). These regions were defined separately for each participant by a 10-mm radius sphere centered around the peak voxel in each hemisphere from each contrast, using the MarsBar toolbox for SPM8 (http://marsbar.sourceforge.net/). All ROIs were bilateral, except for 3 participants in whom the left FFA could not be localized, and another participant in whom the right LOC could not be localized (Supplementary Table 2 displays the peak ROI coordinates for each ROI for each participant). The ROI MVPA was conducted in an identical manner to the searchlight analysis; voxel-wise data were Pearson-correlated and condensed into a 12 × 12 correlation matrix, except that here each ROI was treated as a single region (i.e., no searchlights were moved within an ROI). Before applying our contrasts of interest, we ensured that these correlation values were normally distributed (Jarque–Bera test; P > 0.05).
We then applied our conjunction and viewpoint contrasts within each of the four ROIs and obtained, for each participant and each contrast, a t-value reflecting the strength of the difference between our correlations of interest (Fig. 2B,D). From these t-values, we calculated standard r-effect sizes that allowed us to compare the magnitude of effects across the ROIs (Rosenthal 1994) (Fig. 4B,D). Specifically, we transformed the r-effect sizes to Fisher's z-scores (as they have better distribution characteristics than correlations, e.g., Mullen 1989). We then conducted t-tests on the z-scored effect sizes obtained for each region, which provided a measure of the statistical significance of the difference between our cells of interest in each of our two contrast matrices (i.e., dark- and light-colored cells). We then compared the z-scores in each ROI to zero using Bonferroni-corrected one-sample t-tests, and conducted, for each of the two contrasts, paired-samples t-tests to compare the effect sizes observed in the PRC to the 3 more posterior ROIs (Fig. 4B,D).
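The conversion from a t-value to an r effect size and then to a Fisher z-score can be sketched as below. The text cites Rosenthal (1994) without giving the formula; the sketch assumes the commonly used conversion r = t / sqrt(t² + df), keeps the sign of t, and uses the standard Fisher transform z = atanh(r). The function names and the example values are hypothetical and are not taken from the reported results.

```python
import math


def t_to_r(t, df):
    """Signed r effect size from a t statistic (assumed conversion)."""
    return t / math.sqrt(t * t + df)


def fisher_z(r):
    """Fisher r-to-z transform, z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


# Hypothetical per-participant contrast statistic.
t_value, df = 2.0, 12
r = t_to_r(t_value, df)
z = fisher_z(r)

print(round(r, 4))  # 0.5
print(round(z, 4))  # 0.5493
```

The z-scores obtained this way for each participant and ROI would then be the inputs to the group-level one-sample and paired-samples t-tests.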
To test the possibility that our results (Fig. 4) were driven by signal saturation due to nonlinearities in neurovascular or vascular MR coupling, we conducted a univariate ANOVA to test whether there were differences in the overall signal evoked by different single features (Fig. 5). A standard univariate processing pipeline was then followed, comprising the preprocessing steps described for the MVPA, but also smoothing of the imaging data with a 12-mm FWHM Gaussian kernel. We then conducted first-level statistical analyses. Within each run, there were 12 regressors of interest [2 (feature sets) × 2 (viewpoints) × 3 objects (A, B, and C)] and 8 regressors of no interest corresponding to trials of three-featured objects (ABC), trials in which participants responded with a button press on the 1-back task, and 6 realignment parameters to correct for motion. Within each regressor, events were modeled by convolving a delta (stick) function corresponding to the stimulus presentation onset with the canonical HRF as defined by SPM8. Second-level group analyses were conducted separately for each of the two feature sets by entering the parameter estimates for the 6 single-featured objects (i.e., A, B, and C from each viewpoint) of each subject into a single GLM, which treated participants as a random effect. This analysis was conducted using a factorial repeated-measures ANOVA, in which a model was constructed for the main effect of condition (i.e., the 6 single features). Within this model, an F-contrast was computed to test for areas that showed a significant effect of feature type for each feature set. Statistical parametric maps (SPMs) of the resulting F-statistic were thresholded at P < 0.001, uncorrected.
Our primary analyses involved 2 planned comparisons. The first comparison, the conjunction contrast, determined whether the neural patterns of activity demonstrated explicit conjunctive coding (i.e., whether activity patterns represented information specific to the conjunction of features comprising an object, over and above information regarding the features themselves) (Fig. 2A,B). The second comparison, the viewpoint contrast, investigated whether the conjunctive representations were view-invariant (Fig. 2C,D). For each of these two contrasts, we performed 2 independent planned analyses: a “searchlight analysis” to investigate the activity across the whole brain and an “ROI analysis” to investigate activation in specific VVS ROIs. Both contrasts were applied to a correlation matrix that included all possible correlations within and across the different conjunctions (Fig. 2B,D). The novelty of this design ensured that our comparisons were matched in terms of the number of features that needed to be bound. That is, our comparison terms (e.g., A + BC vs. B + AC) included both a combination of a single-featured object and a two-featured object, and thus, binding and memory requirements were matched; what differed was the underlying representation for the conjunctions themselves.
For the conjunction contrast, a whole-brain searchlight MVPA (Kriegeskorte et al. 2006) revealed conjunctive coding in the VVS, with a global maximum in V4 (Rottschy et al. 2007) and activity that extended laterally into the LOC and posteriorly into V3 and V1 (peak x,y,z = 32,−80,−4, Z-value = 5.93), as well as conjunctive coding that extended anteriorly to the right PRC (Fig. 4A; peak x,y,z = 30,−6,−30, Z-value = 5.47) (all results reported are whole-brain FWE-corrected at P < 0.05; Supplementary Table 3 summarizes all regions). We next performed an ROI-based MVPA that applied the same contrast matrices and methods used for the searchlight analysis, except that it focused only on the PRC and 3 functionally defined regions (LOC, FFA, and PPA) that are posterior to the PRC and are well established as part of the VVS. This analysis allowed direct comparison of conjunctive coding strength across regions (Fig. 4B). The conjunction contrast in this ROI MVPA revealed conjunctive coding in PRC (t(18) = 3.89, P < 0.01, r effect size = 0.24) and the LOC (t(18) = 3.85, P < 0.01, r effect size = 0.10), but not in FFA (t(18) = 1.80, P = 0.35, r effect size = 0.06), or PPA (t(18) = 2.66, P = 0.06, r effect size = 0.10) (all one-sample t-tests Bonferroni-corrected). Comparisons across ROIs demonstrated stronger conjunctive coding in the PRC relative to each of the three more posterior VVS ROIs (P's < 0.05). Thus, consistent with recent hierarchical models proposing explicit conjunctive coding in regions not traditionally associated with the VVS (Murray et al. 2007; Barense, Groen, et al. 2012), we found that PRC representations explicitly coded information regarding the object's conjunction, over and above its individual features.
In addition to the PRC, we also observed conjunctive coding in V4 as well as in LOC, indicating that the conjunctive coding mechanism is not selective to PRC. Indeed, there is evidence to suggest that V4 and LOC are important for representing feature conjunctions. For example, recent studies showed that learning-induced performance changes on a conjunctive visual search were correlated with increasing activity in V4 and LOC (Frank et al. 2014), and that these regions are important for conjoined processing of color and spatial frequency (Pollmann et al. 2014). In support of a causal role for both LOC and PRC in representing feature conjunctions, patients with selective LOC damage (Behrmann and Williams 2007; Konen et al. 2011) and those with PRC damage (Barense et al. 2005; Barense, Groen, et al. 2012) were impaired on tasks that tax integrating object features into a cohesive unit. Nonetheless, we did observe clear differences in conjunctive versus single-feature coding across our ROIs, with PRC demonstrating stronger conjunctive coding than LOC, FFA, and PPA (Fig. 4B), and LOC demonstrating stronger single-feature coding than PRC, FFA, or PPA (Supplementary Fig. 4). Although the current study lacks the temporal resolution to address this directly, one possibility is that this more posterior activity may also reflect higher level feedback, such as from PRC. Indeed, bidirectional interactions exist throughout the VVS (Hochstein and Ahissar 2002; Coutanche and Thompson-Schill 2015), and previous work has suggested that feedback from the PRC modulates familiarity responses to object parts in V2 (Barense, Ngo, et al. 2012; Peterson et al. 2012). An alternative possibility is that the PRC activity reflects a feedforward cascade from structures such as the LOC (Lamme and Roelfsema 2000; Bullier 2001).
Next, we investigated whether the conjunctive representations of the object features were view-invariant. The extent to which the human visual system supports view-invariant versus view-dependent representations of objects is unresolved (Biederman and Bar 1999; Peissig and Tarr 2007). Much research in this area has focused on VVS regions posterior to the MTL (Vuilleumier et al. 2002; Andresen et al. 2009) but some work has indicated that, for very complex stimuli, structures in the MTL may be central to view-invariance (Quiroga et al. 2005; Barense et al. 2010). Despite this, to our knowledge, no study has directly probed how the specific conjunctions comprising complex objects are neurally represented across viewpoints to support object recognition in different viewing conditions. That is, as the representations for objects become increasingly precise and dependent on the specific combinations of features comprising them, can they also become more invariant to the large changes in visual appearance caused by shifting viewpoints?
To investigate this question, we presented the objects at 70° rotations (Fig. 1), a manipulation that caused a more drastic visual change than changing the identity of the object itself (Fig. 3). To assess view-invariance, our viewpoint contrast compared the additivity of patterns (i.e., A + BC vs. B + AC vs. C + AB) across the 2 viewpoints (Fig. 2C,D). At a stringent whole-brain FWE-corrected threshold of P < 0.05, our searchlight MVPA revealed limited activity throughout the brain (2 voxels, likely in the orbitofrontal cortex, Supplementary Table 4A). However, at a more liberal threshold (P < 0.001 uncorrected), we observed view-invariant conjunctive coding in the VVS, with maxima in V4 (Rottschy et al. 2007; peak x,y,z = 32,−70,−4, Z-value = 4.07) and lateral IT cortex (peak x,y,z = −52,−28,−20, Z-value = 4.15), as well as activity that extended anteriorly to the left PRC (peak x,y,z = −36,−4,−26, Z-value = 3.25) (Fig. 4C and Supplementary Table 4B). The ROI MVPA of the viewpoint contrast revealed view-invariance in the PRC (t(18) = 3.04, P < 0.05, effect size r = 0.21), but not the LOC, FFA, or PPA (t's < 0.75, P's > 0.99, all effect sizes r < 0.03; Fig. 4D) (all tests Bonferroni-corrected). A direct comparison across regions confirmed that this view-invariant conjunctive coding was stronger in the PRC compared with the 3 more posterior VVS ROIs (P's < 0.05).
Taken together, results from the conjunction and viewpoint contrasts provide the first direct evidence that information coding in PRC simultaneously discriminated between the precise conjunctions of features comprising an object, yet was also invariant to the large changes in the visual appearance of those conjunctions caused by shifting viewpoints. These coding principles distinguished the PRC from other functionally defined regions in the VVS that are more classically associated with perception (e.g., LOC, FFA, and PPA). It is important to note that our results were obtained in the context of a 1-back memory task, and thus, one might be cautious to interpret our results in the context of a perceptual role for the PRC. Indeed, we prefer to consider our results in terms of the representational role for any given brain region, rather than in terms of the cognitive process it supports—be it either memory or perception. To this end, our experimental design ensured that memory demands were matched in our comparisons of interest—all that differed was how the same 3 features were arranged to create an object. With this design, explicit memory demands were equated, allowing a clean assessment of the underlying object representation. That said, one might still argue that these object representations were called upon in the service of a 1-back memory task. However, a wealth of evidence suggests that PRC damage impairs performance on both perceptual (e.g., Bussey et al. 2002; Bartko et al. 2007; Lee and Rudebeck 2010; Barense, Ngo, et al. 2012; Barense et al. 2012) and explicit memory tasks (Bussey et al. 2002; Winters et al. 2004; Barense et al. 2005; Bartko et al. 2010; McTighe et al. 2010). The critical factor in eliciting these impairments was whether the task required the objects to be processed in terms of their conjunctions of features, rather than on the basis of single features alone. 
We argue that these seemingly disparate mnemonic and perceptual deficits can be accounted for by the fact that both the mnemonic and perceptual tasks recruited the conjunctive-level representations we have measured in the current study.
Finally, it is important to rule out the possibility that the results we obtained were driven by signal saturation due to nonlinearities in neurovascular or vascular MR coupling (Fig. 5). If some features evoked stronger activation than others, such signal saturation could produce nonlinearities that might be misinterpreted as a conjunction when in fact the coding was linear (Fig. 5A). To evaluate whether our data fell within this regime, we conducted a univariate ANOVA to test whether there were differences in the overall signal evoked by different features when these features were presented alone. These analyses revealed activity predominantly in early visual areas that were largely nonoverlapping with the results from our critical conjunction and viewpoint contrasts (Fig. 5C,D and Supplementary Table 5). Importantly, there were no significant differences between the basic features in terms of overall activity in the PRC, or in the LOC typically observed in our participants (even at a liberal uncorrected threshold of P < 0.001). Thus, there was no evidence to suggest that our observation of conjunctive coding was driven by spurious BOLD signal saturation.
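The saturation concern can be made concrete with a small numerical example. The sketch below is a toy model under stated assumptions, not a model of the BOLD response: the compressive function (`tanh`), the function names, and the feature strengths are all hypothetical. It assumes strictly linear underlying coding, in which a two-featured object evokes the sum of its features' responses, and then passes that sum through a saturating measurement.

```python
import math

def saturate(x, ceiling=1.0):
    """Compressive measurement: near-linear for small x, flat near `ceiling`."""
    return ceiling * math.tanh(x / ceiling)

def summed_responses(a, b, c):
    """Measured signal for A + BC, B + AC, and C + AB.

    The underlying code is linear (e.g., BC evokes b + c); only the
    measurement saturates.
    """
    return (
        saturate(a) + saturate(b + c),
        saturate(b) + saturate(a + c),
        saturate(c) + saturate(a + b),
    )

# Features of equal strength: the 3 sums are identical despite saturation.
print(summed_responses(0.5, 0.5, 0.5))

# One feature much stronger: the sums diverge even though coding is linear.
print(summed_responses(1.5, 0.3, 0.3))
```

When the 3 features evoke equal signal, saturation affects every sum identically and cannot mimic a conjunction effect; the sums diverge only when feature strengths differ. This is why a test for differences in the signal evoked by the single features speaks directly to the saturation account.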
In conclusion, this study provides new evidence regarding the functional architecture of object perception, demonstrating neural representations that both integrated object features into a whole that was different from the sum of its parts and were also invariant to large changes in viewpoint. To our knowledge, this constitutes the first direct functional evidence for explicit coding of complex feature conjunctions in the human PRC, suggesting that visual-object processing does not culminate in IT cortex as long believed (e.g., Lafer-Sousa and Conway 2013), but instead continues into MTL regions traditionally associated with memory. This is consistent with recent proposals that memory (MTL) and perceptual (VVS) systems are more integrated than previously appreciated (Murray et al. 2007; Barense, Groen, et al. 2012). Rather than postulating anatomically separate systems for memory and perception, these brain regions may be best understood in terms of the representation they support—and any given representation may be useful for many different aspects of cognition. In the case of the PRC, damage to these complex conjunctive representations would not only impair object perception, but would also cause disturbances in the recognition of objects and people that are critical components of amnesia.
This work was supported by the Canadian Institutes of Health Research (MOP-115148 to M.D.B.) and a Scholar Award from the James S. McDonnell Foundation to M.D.B.
We thank Christopher Honey and Andy Lee for their helpful comments on an earlier version of this manuscript, and Nick Rule and Alejandro Vicente-Grabovetsky for assistance with data analysis. Conflict of Interest: None declared.